**A Study of the Dynamic Response of Carbon Fiber Reinforced Epoxy (CFRE) Prepregs for Musical Instrument Manufacturing**

#### **Manuel Ibáñez-Arnal \*, Luis Doménech-Ballester and Fernando Sánchez-López**

Department of Mathematics, Physics and Technology Sciences, University CEU Cardenal Herrera, CEU Universities, Avda. Seminario s/n, 46113 Moncada, Spain; luis.domenech@uchceu.es (L.D.-B.); fernando.sanchez@uchceu.es (F.S.-L.)

**\*** Correspondence: manuel.ibanez@uchceu.es

Received: 23 September 2019; Accepted: 28 October 2019; Published: 30 October 2019

**Abstract:** Composite materials are used in a wide variety of industrial sectors as an alternative to traditionally used materials. In recent years, a new sector has increasingly adopted these kinds of materials: the manufacture of musical instruments. Resonances of the different elements that make up the geometries of musical instruments are commonly used with the aim of enhancing aspects of the timbre. These are sensitive to the mechanical characteristics of the material, so it is important to guarantee the properties of the composite. To do this, it is not uncommon to use pre-impregnated fibers (prepregs), which allow fine control of the final volumetric fractions of the composite. Autoclaving is a high-quality process used to guarantee the desired mechanical properties in a composite, reducing porosity and avoiding delamination, but significantly raising production costs. In contrast, manufacture without autoclaving increases competitiveness by eliminating the costs associated with autoclave production. In this paper, differences in dynamic behavior under free conditions are evaluated for different Carbon Fiber Reinforced Epoxy (CFRE) prepreg boards, processed by autoclave and out-of-autoclave. The results for the complex modulus are presented as a function of frequency, quantifying the variations in the vibratory behavior of the material due to the change of processing.

**Keywords:** autoclave; out-of-autoclave; vacuum-bag-only; processing; CFRE; plates; modal; dynamic; musical instruments

#### **1. Introduction**

Interest in characterizing the elastic and damping properties of composites is increasing as more and more industrial sectors use these materials in very different applications. Among these, the most established sectors include the automotive, aerospace, and renewable energy industries [1–5]. However, in recent years, this has expanded to include other types of products, such as musical instruments.

In this expanding new sector, the modal and transient behavior of different structures has been studied. In musical instruments, resonances are commonly sought to enhance the sound pressure levels obtained at different frequencies, modifying the timbre and the qualitative aspects of the resulting sound. Research is particularly extensive in the chordophone family, especially in the case of classical guitars [6–12], electric guitars [13,14], or the string family of instruments [15–19]. We also find studies linked to percussive chordophones such as the piano [20–22] and, though scarcer, those dedicated to membranophones [23–25]. Figure 1 shows some examples.

**Figure 1.** Some musical instruments made of carbon fiber reinforced with epoxy (CFRE): (**a**) Luis and Clark violoncello [26], (**b**) Rasch snare drum [27], (**c**) Boganyi piano [28], and (**d**) Klos carbon fiber guitar [29].

Composite materials use polymer matrices in their manufacture, so they exhibit viscoelastic behavior. Viscoelasticity is the property of materials that show both viscous and elastic characteristics when undergoing deformation. For the study of this type of material, we express the modulus in its complex form, as shown by Equation (1) [30]

$$E^* = E' + jE''\tag{1}$$

where the real part (*E'*) is the storage modulus, and the imaginary part (*E''*) is the loss modulus.

The study of the storage modulus (*E'*) is a determining factor when characterizing the resonance frequencies of a solid. *E'* represents the elastic behavior of the material, which is directly linked to the resonance frequencies of any type of solid. For typical resonant elements of musical instruments, such as membranes, plates, or shells, we can express the resonances as [31]

$$f = a\sqrt{\frac{E'}{\rho}}\tag{2}$$

where the square root of the ratio between *E'* and ρ is equivalent to the longitudinal speed of sound in the solid, and *a* is a parameter defined by the modal and geometric characteristics of the solid. Variations of *E'* modify the value of each of the natural frequencies of the system.

In turn, the loss modulus, *E''*, is related to the total damping of the system by means of Equation (3) and defines the amount of energy dissipated in each oscillation. This non-elastic, or viscous, behavior is related to the amount of sound pressure that a resonance is able to accumulate and to the modal decay time. The decay time due to internal damping processes is defined by [32]:

$$\tau = \frac{1}{\pi f}\frac{E'}{E''}\tag{3}$$
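For orientation, Equations (1)–(3) translate directly into a short calculation. The following Python sketch uses arbitrary placeholder values for the moduli, not measurements from this study:

```python
import math

# Illustrative values only (not data from this study):
E_storage = 60e9   # E', storage modulus in Pa
E_loss = 0.6e9     # E'', loss modulus in Pa
f = 440.0          # resonance frequency in Hz

# Equation (3): decay time due to internal damping
tau = (1.0 / (math.pi * f)) * (E_storage / E_loss)

print(f"Loss factor E''/E' = {E_loss / E_storage:.4f}")
print(f"Decay time tau = {tau * 1e3:.1f} ms at {f:.0f} Hz")
```

A stiffer, less lossy laminate (larger *E'*/*E''*) therefore sustains each resonance longer, which is the property exploited later when comparing the two processing routes.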

Because of their configurability, many manufacturers are investigating the use of composite materials for the manufacture of these instruments. The most common are those using Carbon Fiber Reinforced Epoxy (CFRE) [11,33,34], but others involve bio-composites [35,36]. These studies relate the mechanical characteristics of the composite to the sound behavior of different musical instruments, but do not consider the direct influence of the processing on the elastic and non-elastic behavior of the vibration.

Since the vibratory behavior and mechanical properties of the material are closely linked [32,37–39], in many cases a processing route that offers greater guarantees is chosen, such as the lamination of prepregs in an autoclave [40]. However, due to the high associated cost, numerous studies have proposed vacuum-bag-only (VBO) processing as an alternative. Previous work has focused on investigating the effects that this can have on porosity and bubbles during curing, and on different strategies for mitigating the defects associated with VBO [41–44], while trying to retain the mechanical properties that the autoclave offers.

It is known that defectology in processing affects the properties of the composite [45]. There are studies linking the appearance of voids during processing to the mechanical characteristics of the composite [45,46], and others relating them to material failure [47]. Figure 2 shows the effects of defectology associated with VBO processing [48].

**Figure 2.** Micrographs of laminates: (**a**) autoclave processing, and (**b**) vacuum bag-only (VBO) processing. Blue circles show voids in the VBO laminate.

Studies have also investigated dynamic behavior in composites as a function of various variables, such as the addition of viscoelastic layers [49] or the optimization of damping by characterizing the laminate [50]. Since the vibrations of the material have reduced amplitudes (<0.1 mm) in the musical instrument manufacturing industry, we can focus the study of the material only on the resonances emitted, without addressing aspects such as material failure, whether from fatigue or another mechanism.

Accordingly, it is essential to study the direct influence of processing on the modal behavior of the CFRE so that different manufacturers can establish criteria for choosing the type of processing. Therefore, this paper aims to evaluate the magnitude of the direct influence of prepreg CFRE processing, autoclave or vacuum bag-only, on the vibratory characteristics of the composite, considering its sound speed, obtained from its modal behavior, as well as the decay time resulting from the damping of the laminate.

#### **2. Materials and Methods**

For this study, two laminates were generated, one by autoclave (A) and the other by vacuum bag-only (VBO), as shown in Figure 3.

**Figure 3.** (**a**) Autoclave and (**b**) VBO CFRE plates.

Since each solid presents resonances that are directly proportional to the speed of sound of the material that constitutes it, we can reduce the study of the speed of sound to a simple geometry that allows the material to be analyzed with precision. The geometry used is the rectangular plate, since it allows a large number of bending vibration modes to be obtained.

One way to express the resonances of a rectangular plate is [32]

$$f_n = \frac{0.113\,h}{L^2}\sqrt{\frac{E}{\rho}}\,[2n+1]^2\tag{4}$$

where *h* is the thickness, *L* is the length of the plate, *E* is the Young modulus, ρ is the density, and *n* is the number of nodes.

To cover a range equivalent to three octaves above and below the A3 reference (440 Hz), the frequency range of the study was set to 0 < *f* < 2000 Hz. For this, the plate dimensions must be adjusted using Equation (4), which defines the resonances for plates.
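As an illustration of this sizing step, the sketch below evaluates Equation (4) for a candidate plate; the material properties are assumed round numbers for a CFRE laminate, not the measured data of Section 2:

```python
import math

# Assumed illustrative properties (placeholders, not the laminate data):
E = 60e9       # Young's modulus in Pa
rho = 1500.0   # density in kg/m^3
h = 1.03e-3    # plate thickness in m
L = 0.200      # plate length in m

# Equation (4): bending resonances of the plate
for n in range(1, 6):
    f_n = 0.113 * h / L**2 * math.sqrt(E / rho) * (2 * n + 1) ** 2
    print(f"mode n={n}: f = {f_n:.0f} Hz")
```

Plate dimensions can then be iterated until the first few bending modes fall inside the 0–2000 Hz window.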

The multipurpose CFRE system considered for this investigation was GG280T (Tenax HTA-3k)-DT806R-42 fabric laminate manufactured by an external company (Magma Composites S.L., Alcañiz, Aragón, Spain) using prepreg plies supplied by Delta Preg Composites. DT806 is an epoxy resin with a glass transition temperature (Tg) of approximately 135 °C. GG280T is a 4/4 twill carbon fabric, 3K high strength (HS) carbon fiber reinforcement with an areal weight of 198 g/m². Plates with dimensions of 220 × 220 mm were manufactured by the vacuum-bag-only and the autoclave method, layering three collinear plies to an approximate total thickness of 1.03 mm. Plate A was cured in an autoclave at 120 °C for 1 hour, without post-cure. The pressure was 4 atm for the autoclave plate (A), and atmospheric pressure was used for the VBO plate. The manufactured plates were afterwards cut down to 30 × 200 mm plates, as shown in Figure 4.

A total of five specimens (dimensions: 200 mm × 30 mm × 1.035 mm) were manufactured using three collinear layers, with the fiber directions parallel to the sides of the rectangles that make up the specimen, as shown in Figure 5.

Properties provided by the manufacturer are shown in Table 1.

In order to characterize the elastic and non-elastic properties of the specimens, a dynamic test was carried out using the vibration procedure under free conditions stipulated in ISO 6721-3 [45], as illustrated in Figure 6.

**Figure 4.** Manufacturing process of CFRE specimens: (**a**) cutting of 30 × 200 mm specimens; (**b**) finished specimens; and (**c**) detailed visual comparison of the surface of autoclave and VBO top surfaces.

**Figure 5.** Specimens (**a**) VBO and (**b**) autoclave.

**Table 1.** Static mechanical properties of carbon fiber reinforced epoxy prepreg.


**Figure 6.** Experimental procedure for identifying resonance frequencies by external excitation [51].

In order to obtain experimental resonances for each plate, a wave generator was used to generate a frequency sweep for 0 < *f* < 2000 Hz. This sine wave was transferred to an amplifier, which sent the signal to a coil exciting one end of the plate. On the opposite side of the plate, a flat-spectrum microphone was used to capture the sound pressure levels that the plate emits at different frequencies, identifying the resonances of each of the plates.
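The excitation signal itself is straightforward to reproduce in software. The sketch below builds an equivalent linear sine sweep; the sample rate and sweep duration are assumptions, since the paper does not specify the generator settings:

```python
import numpy as np
from scipy.signal import chirp

fs = 44100        # sample rate in Hz (assumed)
duration = 60.0   # sweep length in s (assumed)
t = np.linspace(0.0, duration, int(fs * duration), endpoint=False)

# Linear frequency sweep covering the study range 0 < f < 2000 Hz;
# this signal would be sent to the amplifier and coil driving the plate.
sweep = chirp(t, f0=0.0, t1=duration, f1=2000.0, method="linear")
```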

Once all the resonance frequencies were obtained, the signal decrement was captured. To do this, the plate was driven at resonance, and the external excitation was cut at the moment of greatest amplitude to let the plate oscillate freely until the oscillation stopped. With all captured data, the following calculations were performed.

Vibration modes were obtained as Figure 7 shows, thus obtaining the value of the storage modulus *E'<sub>f</sub>* for each of the resonances with Equation (5). Resonance frequencies for a total of five vibration modes (at pure bending) are shown in Figure 8.

**Figure 7.** Experimental test for the modal analysis of prepreg CFRE plates.

**Figure 8.** Modes obtained through experimentation, following the nomenclature (m,n) [32].

The storage modulus is defined as

$$E'_f = \left[\frac{4\pi(3\rho)^{1/2}l^2}{h}\right]^2\left(\frac{f_{ri}}{k_i^2}\right)^2\tag{5}$$

where ρ is the density, *l* is the length of the specimen, *h* is the thickness of the plate, *f<sub>ri</sub>* corresponds to the resonance frequency of the *i*-th mode, and *k<sub>i</sub>*<sup>2</sup> is a numerical factor dependent on the *i*-th vibrational mode and the boundary conditions of the plate [45].

*E'<sub>f</sub>* allows the calculation of the speed of sound in the composite, through its relationship to density, using the well-known Equation (6):

$$c = \sqrt{\frac{E'}{\rho}}\tag{6}$$
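Equations (5) and (6) can be chained in a few lines. In the sketch below, the specimen values are placeholders, and k₁ ≈ 4.73 (the classic free–free beam eigenvalue) is an assumption standing in for the tabulated k<sub>i</sub> of ISO 6721-3:

```python
import math

def storage_modulus(f_ri, k_i, rho, l, h):
    """Equation (5): storage modulus from the i-th resonance frequency."""
    return (4 * math.pi * math.sqrt(3 * rho) * l**2 / h) ** 2 * (f_ri / k_i**2) ** 2

def sound_speed(E_prime, rho):
    """Equation (6): longitudinal speed of sound in the composite."""
    return math.sqrt(E_prime / rho)

# Placeholder specimen values (not the paper's measurements):
rho, l, h = 1500.0, 0.200, 1.035e-3   # kg/m^3, m, m
f_r1, k_1 = 115.0, 4.73               # first bending mode; k_1 assumed free-free value

E1 = storage_modulus(f_r1, k_1, rho, l, h)
print(f"E' = {E1 / 1e9:.1f} GPa, c = {sound_speed(E1, rho):.0f} m/s")
```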

In the second part of the study, the composite loss modulus *E''<sub>f</sub>* was obtained for each of the vibrational modes.

To do this, the signal decrement of the oscillations in free conditions was analyzed for each of the resonances obtained.

The logarithmic decrement is defined as

$$\delta = \frac{1}{n}\ln\left(\frac{A_1}{A_n}\right)\tag{7}$$

where *A*<sub>1</sub> is the amplitude of the initial oscillation, and *A<sub>n</sub>* is the amplitude of the *n*-th oscillation after the start of the decay.

The loss modulus can be expressed as

$$E''_f = E'_f\tan\phi_f\tag{8}$$

where φ*<sup>f</sup>* is the phase or loss angle.

There is a direct relationship between the phase and the loss factor [52]; for φ ≪ 1, it can be obtained using the following equation:

$$\tan\phi_f = \frac{\delta_f}{\pi}\tag{9}$$
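Equations (7)–(9) can then be chained to estimate the loss modulus from a measured decay. The sketch below uses a synthetic peak series and an illustrative *E'* value, not the study's data:

```python
import numpy as np

def loss_modulus_from_decay(peaks, E_prime):
    """Estimate E'' from successive oscillation peak amplitudes, Equations (7)-(9)."""
    peaks = np.asarray(peaks, dtype=float)
    n = len(peaks) - 1                         # oscillations between first and last peak
    delta = np.log(peaks[0] / peaks[-1]) / n   # Equation (7): logarithmic decrement
    tan_phi = delta / np.pi                    # Equation (9): valid for small phase angles
    return E_prime * tan_phi                   # Equation (8): loss modulus

# Synthetic decay: the amplitude halves every 10 oscillations
peaks = [0.5 ** (i / 10) for i in range(31)]
print(f"E'' ~ {loss_modulus_from_decay(peaks, E_prime=28e9) / 1e9:.2f} GPa")
```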

#### **3. Results and Discussion**

#### *3.1. Dynamic Characterization*

The results obtained for each of the processed specimens were then analyzed using the procedure described above, as shown in Tables 2 and 3.

**Table 2.** Average values obtained for CFRE prepreg plates processed by autoclave.


**Table 3.** Average values obtained for CFRE prepreg plates processed by vacuum bag-only.


We can see some notable differences between the two types of processing. As Figure 9 shows, the change of processing caused a 27% decrease in the storage modulus *E'<sub>f</sub>*, which was practically constant across all frequency values. We also observe in the results how the mean equivalent density of the plate decreases markedly due to the effects of porosity; the specimen processed using VBO had a density 16.2% lower than that processed by autoclave (A).

**Figure 9.** Loss and storage modulus results for CFRE prepregs: A (autoclave) and VBO (vacuum bag-only).

Although both values decrease markedly, the mean frequency values decreased much less, with variations of between 2% and 6% between the two processes.

This can be explained by the resulting sound speed of the composite, *c*, defined by Equation (6) and represented in Figure 10. It is defined as the square root of the ratio between the stiffness and the density of the medium. Because both variables are diminished by porosity, the ratio between them is largely maintained, mitigating the effect of the processing change on the wave speed and therefore on the resonance frequencies, which are proportional to the sound speed [30].

**Figure 10.** Average sound speed (*c*) at different frequencies, for CFRE prepregs, processed by A (autoclave) and VBO (vacuum bag-only).

The biggest changes are seen in the non-elastic behavior of the composite. As Figure 9 shows, VBO processing generated a large increase in the loss modulus *E''<sub>f</sub>*, especially in the range *f* > 400 Hz; for example, it was 4.2 times higher for vibration mode (4,0), located around 700 Hz.

The increase of the loss modulus relative to the storage modulus generated an increase in the damping, or loss factor η, of the composite processed by VBO. As Figure 11 shows, this increase peaked around 1100 Hz, completely changing the behavior of the material, which, in the case of the plate processed by autoclave, was almost proportional to frequency.

#### *3.2. Effects on the Vibrational Behavior*

The effect of processing was clearly identifiable in the transient analysis (Figure 12). For vibration mode (2,0), with a frequency of *f* ≈ 115 Hz, the decay times are similar. However, for vibration mode (5,0), with a frequency of *f* ≈ 1150 Hz, damping values were much higher for the plate processed out-of-autoclave, so the duration of the vibration is much shorter.

**Figure 11.** Comparison of the loss factor for CFRE prepregs processed by autoclave and vacuum bag-only.

**Figure 12.** Transient analysis of CFRE plates: (**a**) autoclave (2,0) vibrational mode; (**b**) vacuum bag-only (2,0) vibrational mode; (**c**) autoclave (5,0) vibrational mode; and (**d**) vacuum bag-only (5,0) vibrational mode.

The musical instrument manufacturing industry is commonly based on the use of resonances of secondary elements to boost the sound pressure levels of elements such as strings and membranes, but also to emit sound when impacted, as is the case with idiophones.

In both cases, the frequency values for each of the resonances are important, in addition to the damping values, since they define the amount of amplitude that each resonance can accumulate. Higher damping values result in a dissipation of the vibration energy.

As shown in Figure 13, the effect of the different processes of the CFRE is noticeable in the vibratory response of the composite.

**Figure 13.** Maximum amplitudes recorded on the different CFRE plates by frequency sweep excitation for 0 < *f* < 1400 Hz.

The values of the resonance frequencies were stable due to the similarity of the sound velocities in both processes. The biggest differences were due to the non-elastic behavior of the plates. For low frequency values, where the mechanical differences between the plates are minimal, the resonances were practically equal in amplitude and frequency.

As the frequency increases, the effects of the damping on the VBO plate become more important. This effect is critical near 1000 Hz, where some resonances disappear and their capacity to accumulate amplitude decreases.

#### **4. Conclusions**

This paper performed a dynamic test to analyze the speed of sound of CFRE prepreg plates processed by two different processes, autoclave and vacuum bag-only. In line with previous studies, we find that processing is an important factor to consider when dynamically characterizing a CFRE prepreg and is undoubtedly a new variable to be taken into account in material design applied to the musical instrument manufacturing industry.

However, although mechanical and physical differences were quantified in the resulting material, due to the stability of the sound speed in the composite, the vacuum-bag-only method for CFRE prepregs is shown to be a feasible process for all those applications which require guaranteed elastic behavior and resonance frequencies. If only the values of the resonance frequencies are considered, an out-of-autoclave process is sufficient for the manufacture of musical instruments.

The biggest differences were generated at the non-elastic level; there were very significant increases in composite damping, which is something to consider in the design and production process. High damping values mean oscillation energy dissipation, so resonances accumulate less sound pressure. This factor may represent an added difficulty in the musical instrument manufacturing industry, where the primary interest is the maximum accumulation of vibration energy to boost sound pressure levels, and where materials with low attenuation are generally suited to this purpose.

In both cases, knowledge of the influence of processing on the CFRE prepreg allows a more precise characterization of the material. While both processes can ensure stable resonances, due to their similar elastic behavior, autoclave processing offers composites with lower damping values, improving maximum sound pressure levels. Thus, depending on the frequency range, the use of autoclave processing remains interesting.

**Author Contributions:** Conceptualization and writing—original draft preparation, M.I.-A.; methodology, M.I.-A.; validation, M.I.-A., F.S.-L. and L.D.-B.; investigation, M.I.-A.; writing—review and editing, F.S.-L. and L.D.-B.; supervision, F.S.-L.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Dynamic Range Compression and the Semantic Descriptor Aggressive**

#### **Austin Moore**

Centre for Audio and Psychoacoustic Engineering, School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK; a.p.moore@hud.ac.uk

Received: 29 January 2020; Accepted: 12 March 2020; Published: 30 March 2020

**Featured Application: The current study will be of interest to designers of professional audio software and hardware devices as it will allow them to design their tools to increase or diminish the sonic character discussed in the paper. In addition, it will benefit professional audio engineers due to its potential to speed up their workflow.**

**Abstract:** In popular music productions, the lead vocal is often the main focus of the mix, and engineers will work to impart creative colouration onto this source. This paper conducts listening experiments to test if there is a correlation between perceived distortion and the descriptor "aggressive", which is often used to describe the sonic signature of the Universal Audio 1176, a much-used dynamic range compressor in professional music production. The results from this study show that compression settings that impart audible distortion are perceived as aggressive by the listener, and that there is a strong correlation between the subjective listener scores for distorted and aggressive. Additionally, it was shown there is a strong correlation between compression settings rated with high aggressive scores and the audio feature roughness.

**Keywords:** dynamic range compression; music production; semantic audio; audio mixing; 1176 compressor; FET compression; listening experiment

#### **1. Introduction**

#### *1.1. Background*

In addition to general dynamic range control, it is common for music producers to use dynamic range compression (DRC) for colouration and non-linear signal processing techniques, specifically to impart distortion onto program material. Scholarly work has researched DRC use and has shown the industry has developed standard practices which mix engineers implement in their work [1,2]. One such standard is the use of the Universal Audio 1176 (and, under its original name, Urei 1176) as a vocal compressor, particularly when processing vocals in popular music mixes where a specific character or distortion is desirable [3]. Users describe the sound quality of vocals processed in this manner with a number of subjective descriptors. This article investigates one common descriptor, "aggressive", to determine what it means at an objective level and to answer empirically how an aggressive sound quality can be achieved when using the Universal Audio 1176 (abbreviated to simply 1176) and, more broadly, DRC in vocal productions. The findings will be of use to developers of software and hardware compressors, intelligent mixing algorithms [4,5] and industry professionals, as the novel results have the potential to change and speed up their workflow. More broadly, the work carried out in this article fills a gap in the research, as little work has been done to get a better understanding of the perceptual effects of dynamic range compression during the mixing stage of music production. Some findings reported in this article were originally presented at an Audio Engineering Society conference [6].

The literature relating to compression in music production has focused mainly on the effects of hyper-compression, often concentrating on whether its artefacts are detrimental to the perceived quality of audio material [7–9]. Taylor and Martens [10] claim that achieving loudness is a significant motivation for using compression, particularly in mastering, so one can argue that this is why hyper-compression is well researched.

Ronan et al. [11] investigated the audibility of compression artefacts among professional mastering engineers. For the study, twenty mastering engineers undertook an ABX listening experiment to determine whether they could detect artefacts created by limiting. Two songs were processed using the Massey L2007 digital limiter to achieve −4 dB, −8 dB and −12 dB of gain reduction. The masters (including the uncompressed versions) were then presented to the listeners using the ABX method. The results showed that the mastering engineers found it challenging to discern differences between a number of the audio tracks, particularly those with −8 dB of gain reduction and the unprocessed reference. The same experiment carried out by Ronan et al. [12] used untrained listeners and showed that they were unable to detect up to 12 dB of limiting.

Campbell et al. [13] had participants rate mixes with compression, which had been introduced at various points in the signal chain, namely on discrete tracks, subgroups and the master stereo buss. Their results showed that listeners preferred mixes where compression had been applied to individual tracks rather than to groups or on the stereo buss. However, their test used identical attack and release settings on all stimuli and made use of the same compressor (Pro Tools Compressor/Limiter), which may have played a role in the results. Adjusting the compressor settings so they were more appropriate for mix buss processing, and using a compressor with a suitable character for buss processing could have yielded a different outcome.

Wendl and Lee [14] looked into the effect of compression on the perceived loudness of pink noise split into octave bands. For this study, the authors wanted to observe if playback level and crest factor (a measurement of peak to RMS) affected perceived loudness after varying amounts of limiting had been applied. The results showed that there was a non-linear relationship between the octave bands and changes to the crest factor. Of interest to professionals is the result showing that modifications to the crest factor in a band centered around 125 Hz do not correlate with perceived loudness at moderate playback levels; moreover, the material could be perceived as louder than one might expect. The authors recommended that music engineers should be cautious of compression activities that affect the low end.

Some noteworthy work was conducted by Ronan et al. [15], which sits outside of the hyper-compression studies reviewed so far. The authors of this paper investigated the lexicon of words used to describe analogue compression. Ronan et al. conducted a discourse analysis on 51 reviews of analogue mastering compressors to look for common terms used in the texts and created inductive categories to group the words. The categories they created included signal distortion, transient shaping, spatial dimensions and glue. A qualitative investigation of a similar nature was carried out for the current study, and the results are presented in Section 2. Interestingly, a number of the descriptors gathered by both studies are similar, but aggressive was not included in the work by Ronan et al., suggesting that this descriptor is not commonly used to describe the sound of mastering compression.

Other connected work has been carried out by scholars involved with the Semantic Audio Labs and the Semantic Audio Feature Extraction (SAFE) project. SAFE aims to understand the audio features associated with semantic descriptors, which can then be used to create intelligent mixing tools. Stables et al. [16] investigated terms used to describe signal processing on the mix and conducted hierarchical clustering to look for similarities in the terms. The authors presented dendrograms relating to the signal processing techniques compression, distortion, equalization and reverb. The word aggressive was not included in any dendrogram and, surprisingly, not in the distortion group. The authors do not stipulate which sources the signal processing techniques were applied to, so the influence of source on the descriptor is not apparent; this is important to consider and will have played a role in the results. Hence, the current study focuses solely on vocal compression.

Bromham et al. [17] looked into compression ballistics (attack and release settings) and how they affected the perception of music mixes in four styles: Rock, Jazz, HipHop and Electronic Dance Music. They asked participants to rate which ballistic setting was the most appropriate for a genre and to select from a list of given words to describe the sound quality of their preferred setting. They discovered that attack played a more significant role on appropriateness than release, with the result applying most strongly to Jazz and Rock. It should be noted that this study made use of a Solid State Logic (SSL) bus compression emulation, which has a much slower attack time than the 1176 compressor used in the current study.

Not directly related to compression, but in the domain of semantic music production, is the work by Fenton and Lee [18], which aims to develop a perceptual model to measure "punch" in music productions. As noted by Fenton and Lee, punch is an attribute that is used by music listeners to describe a sense of power and weight in audio material. Their work uncovered that punch is related to "a short period of significant change in power in a piece of music or performance" as well as dynamic changes to particular frequency bands in the program material. The authors of the work went on to develop a perceptual model of punch for use in a real-time punch meter. The author of the current study advocates the creation of similar meters to measure other perceptual attributes, such as the one in this study, which can be integrated into modern digital audio workstations (DAWS) for use in music production activities.

As can be seen, apart from a small body of work, there is a gap in the literature relating to the positive effects of compression, particularly during the mixing process. Work by Dewey and Wakefield [19] has shown that compression is one of the most used mixing tools, and the present author's experience as a music production academic and professional suggests that audio engineers and scholars are interested in the character of compression. Therefore, the lack of work in this area is surprising.

#### *1.2. Research Aims*

Thus, the work in the current study aims to address several pertinent research questions relating to the use of compression during mixing and the semantic nature of its sonic signature. The following research questions are addressed in this article. Firstly, how do professionals describe the sonic signature of the 1176 compressor when processing a range of sources, but most specifically vocals? Secondly, and derived from the results of the first question, what does the subjective descriptor aggressive mean at an objective level?

To answer these questions, three studies were carried out. Initially, a qualitative study was conducted that asked several experienced users to describe the sound quality of the 1176 when compressing a range of audio sources. Based on their responses, they were then asked to rate the appropriateness of commonly used descriptors in a similarity matrix. The results suggested the descriptor aggressive was a synonym for distortion. Thus, the second stage of testing conducted a subjective listening test using the Audio Perceptual Evaluation (APE) method from the Web Audio Evaluation Tool (WAET) [20]. This tested whether listeners rated mixes with vocals compressed by 1176 hardware, using settings measured to have larger amounts of Total Harmonic Distortion (THD), as the most aggressive. Finally, a subsequent listening test was carried out to ascertain whether distortion, timing behaviour or a mixture of both were the most important factors in creating compressed audio perceived to be aggressive. The reason for this test was the 1176's reputation as a fast-acting compressor (particularly when working with fast time constant settings, which will be addressed in Section 4). Therefore, it could be argued that its fast timing creates the aggressive sonic signature, rather than distortion. This was examined in the second listening experiment, which had vocals processed with a clean software compressor (measured with 0% THD) and set to mimic the timing behaviour of the 1176, as well as material compressed with 1176 hardware and measured to have 1.58% THD. The Klanghelm DC8C software compressor [21] was used for this test as it allows user control over a range of design traits that can be used to match the behaviour of several compressors. Most importantly, when used in its clean mode, it does not generate any distortion, even at the fastest time constants.

#### **2. Qualitative Studies**

#### *2.1. Professional User Questionnaire*

An online questionnaire was created using the survey tool Qualtrics [22], which asked experts to describe the sound quality of the 1176 when compressing vocals, drum shells (bass and snare drums), room mics (ambient recordings of the drums in a room) and a bass guitar. Judgement sampling (as opposed to random sampling) was used to select experienced engineers and academics to complete the questionnaire. For an expert to be included they had to be knowledgeable in music production and familiar with the 1176. Judgement sampling does, however, have its limitations and is prone to bias [23]. Thirty-five respondents completed the questionnaire.

Table 1 presents the ten most frequently used descriptors for the sound quality of the 1176 compressor across all four sound sources. Aggressive is the most popular word and is the one investigated in the current study. The other descriptors are likely to refer to amplitude modulation effects (pumping), transient reshaping (punch), colouration (forward, full, midrange and presence) and distortion (dirty and gritty). One would expect fast to be a description of the time constant speed, and not necessarily a description of sound quality.


**Table 1.** The top ten most frequently used descriptors to describe the sound of the 1176 compression.

Table 2 shows the most common words used to describe the sound quality of the 1176 when compressing a vocal, which is the main focus of this article. To reduce the number of words in the table, only those recorded more than once have been included. As can be seen, the descriptor aggressive is, again, the most popular, followed by the word gritty, which, as stated previously, is arguably a synonym for distorted. Other descriptors appear to refer to colouration (forward, midrange, presence, full, upfront and sparkly), amplitude modulation (pumping) and potentially the perceptual effect of the attack and release curve (smooth).


**Table 2.** Descriptors used for 1176 vocal compression.

#### *2.2. Similarity Matrix*

To help clarify the meaning of the descriptors, a second stage asked respondents to rate the appropriateness of the most popular descriptors in describing the sound quality of a given compression technique. The compression techniques were: linear processing, colouration general, colouration frequency related, distortion, modulation/altering rhythmic feel, general dynamic range control, attenuating transient and accentuating transient. The author selected these techniques based on prior research [24] which indicated these practices were commonly used by industry professionals when applying dynamic range compression. Respondents completed the task online and recorded their scores on a similarity matrix. This was conducted by creating an online spreadsheet that had compression activities on the X-axis and descriptors on the Y-axis. Respondents then allocated each descriptor a score between zero and four to rate its appropriateness (zero being totally inappropriate and four being totally appropriate). As a descriptor could relate to more than one compression technique, the respondents were instructed to rate the descriptor for as many techniques as they felt appropriate. The use of similarity matrices to look for associations between audio descriptors has been used in several previous studies [25–28].

The similarity matrix was completed by twelve experts, all of whom had participated in the previous stage. The results were averaged to take the mean score for each descriptor, and clustering was conducted using the Euclidean distance and Ward methods. The statistical software R was used to generate the dendrogram shown in Figure 1. As shown in the dendrogram, there are several subsets, illustrated in brown, green, blue and grey. Inspection of the grey subset highlights descriptors which appear to fall into two main categories, one relating to dynamics and the other pertaining to colouration and distortion (linear and non-linear distortion). Looking at a lower branch in the dendrogram highlights that aggressive is grouped with, among other words, attitude, energy and smashed. All of these words are subjective and have not been defined in any of the previous literature. Moreover, one could argue these descriptors relate to the character of the compressor and are potentially multi-dimensional attributes. Referring back to the descriptors in Table 1, the clustering shows that the contentions made previously regarding the meaning of these words are generally correct. As an example, the terms dirt and gritty are connected to distortion in the brown subset. Pumping is connected to ambience (which is understandable, as compression-related amplitude modulation often manifests as quick changes in the amplitude of the ambience present in the program material). Finally, punch is connected to definition and also attack, which would support the notion that this descriptor is related to the manipulation of transient shapes.

**Figure 1.** Results of clustering using the Ward method. The descriptors are split into five subsets (brown), four subsets (green), three subsets (blue) and two subsets (grey).

To get a better understanding of the descriptor aggressive, the focus of this study, statistical analysis was conducted on the mean scores allocated by the respondents to the compression techniques in relation to the term aggressive. The results show that there was a statistically significant difference between groups (compression techniques), as determined by one-way ANOVA (F (7,88) = 3.854, *p* = 0.001). A Tukey post-hoc test revealed that the experts considered the descriptor aggressive was statistically significantly lower for the compression techniques "general dynamic range control" (*p* = 0.027), "modulation" (*p* = 0.002) and "linear processing" (*p* = 0.001) compared to the compression technique "distortion". There was no statistically significant difference between the descriptor aggressive score for the compression techniques "colouration general" (*p* = 0.229), "colouration frequency related" (*p* = 0.171), "attenuating transient" (*p* = 0.088) and "accentuating transient" (*p* = 0.124) compared to the compression technique "distortion". The reason for the lack of significance between these techniques is thought to be as a result of distortion reshaping the transient portion of the audio material (particularly true for attenuating the transient) and the introduction of harmonic components, which leads to colouration. Therefore, it appears that, from this study, engineers consider the descriptor aggressive to relate to compression techniques that distort and colour the audio and, to a lesser extent, reshape the transient portion of the program material.

#### **3. Preliminary Objective Tests**

#### *3.1. Choice of Compressor Time Constant Settings*

In preparation for perceptual listening experiments, work was conducted to ascertain appropriate time constant settings for use in the experiments. The 1176 has continuously variable attack and release controls. Thus, a large number of possible combinations is available. However, it would not be practical to use all of these in listening experiments, as a large number of stimuli is known to cause listener fatigue [29]. Thus, content analysis [30] was conducted on 1176 vocal compression settings, created by professional engineers for the 1176 UAD plugin. This work was conducted to discover the most popular settings, which could then be used in the creation of stimuli for the listening experiments. The results revealed that specific combinations of attack and release settings were regularly used, with release positions between five and seven and attack positions between one and three being most common. Additionally, it was noted that the 4:1 ratio was often implemented for general vocal settings and the all-buttons mode (a popular "special mode" achieved by depressing all ratio buttons simultaneously) was employed for highly coloured processing. Table 3 shows how frequently particular settings are used in the vocal presets. The bottom two rows of the table show that positions between one and four are most common for attack and positions between five and seven most common for release. Anecdotal evidence by the author supports this result, as they have observed many professional audio engineers setting the 1176 compressor with these time constant settings.

**Table 3.** Popularity of 1176 time constant settings. The table illustrates how often a setting was used by professionals.


Based on these findings, the following attack and release combinations were used in the following listening experiment (attack is abbreviated to A and release is abbreviated to R): A3R7, A1R7, A3R5, A1R5. The combinations were used in both the 4:1 and all-buttons ratio modes. More general research of content on the 1176 [31] showed the A3R7 combination to be a popular setting for a range of instrument sources. Therefore, the settings used in the experiment are considered by the author to be representative of real working scenarios. It is also worth bearing in mind the attack control on the 1176 is quoted as ranging between 20–800 microseconds and critical listening by the author revealed very little difference in sound quality between any attack time between positions one and four. Additionally, the reader should consider that the attack and release controls on the 1176 work counterclockwise, meaning attack and release position seven is the fastest and one the slowest.

#### *3.2. Distortion Characteristics*

A series of total harmonic distortion (THD) measurements were made on the 1176 at various attack and release configurations to observe how time constant settings affected the distortion characteristics of the compressor. The measurements for release were made by keeping the attack time fixed at its fastest setting, seven, and making a measurement at each release position. The measurements for attack were made by restricting the release time to its quickest setting, seven, and making measurements at each attack position. During testing, the compressor was adjusted to achieve −10 dB of gain reduction and a 1 kHz test tone was used as the input signal. The output of the compressor was recorded on a Digital Audio Workstation (DAW) at 24 bit/44.1 kHz. The results showed distortion artefacts reduced significantly when using release times slower than position five and that the attack control had a smaller effect on the reduction of distortion. Furthermore, higher ratios had the effect of increasing non-linearity, with the all-buttons mode increasing non-linearity significantly more than any other ratio. Plot (a) in Figure 2 illustrates the effect of lengthening the attack and release time on THD. As can be seen, there is a sharp drop off in THD up to release position five and a small reduction in THD with attack times slower than position seven. Plot (b) in Figure 2 shows the same measurements made in all-buttons, which is a so-called "special" ratio mode afforded by the 1176 compressor. Although not originally intended for use, it was found by music producers that depressing all the ratio buttons simultaneously produced intriguing sonic behaviour by the compressor. In actual terms, the FET's bias is set outside of calibration, resulting in a significant increase in distortion. Note the much larger amount of THD in this setting, but similar drop off in amount as the release and attack speeds are reduced.

**Figure 2.** (**a**) Total Harmonic Distortion (THD) as a function of attack and release using a 4:1 compression ratio and a 1 kHz test tone; (**b**) THD as a function of attack and release using the "all-buttons" compression ratio and a 1 kHz test tone.

Figure 3a,b are plots created using Matlab's THD function. They represent an FFT display which illustrates the nature of the harmonic distortion. To create the plots, a 1 kHz tone was input to the compressor to achieve −10 dB of gain reduction, with the time constants set for attack at three and release at seven. The output of the compressor using a 4:1 ratio and the all-buttons mode was then recorded into a DAW at 24 bit/44.1 kHz. The figures make clear the differences in distortion characteristics between the 4:1 ratio and the all-buttons mode. As one might expect, the distortion components are integers of the 1 kHz test tone and consist of a mixture of odd and even order harmonics. In Figure 3b, it is worth noting the significant increase in the amplitude of all components and also an increase in higher-level harmonics. Critical listening to an audio example of the compressed test tone reveals the all-buttons mode is highly distorted, with increased brightness as a result of the additional high-level and high-order distortion components.

**Figure 3.** (**a**) Distortion components created using a 4:1 ratio and attack at three and release at seven; (**b**) distortion components created using the all-buttons ratio and attack at three and release at seven.
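For readers working outside Matlab, a rough equivalent of this measurement can be scripted in Python. The sketch below estimates THD from an FFT of a captured tone; it is an assumption about method, not the exact routine used in the study, and is demonstrated on a synthetic tone with a known 1% third harmonic:

```python
import numpy as np

def thd_percent(signal, fs, f0=1000.0, n_harmonics=10):
    """Estimate THD: harmonic magnitude relative to the fundamental, from an FFT."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)

    def peak(f):
        # Largest bin magnitude in a small neighbourhood of frequency f
        idx = int(np.argmin(np.abs(freqs - f)))
        return spectrum[max(idx - 3, 0):idx + 4].max()

    fundamental = peak(f0)
    harmonics = [peak(k * f0) for k in range(2, n_harmonics + 1)]
    return 100.0 * np.sqrt(sum(h**2 for h in harmonics)) / fundamental

# Synthetic test: 1 kHz tone plus a third harmonic 40 dB down (1% THD expected)
fs = 44100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)
print(f"THD ~ {thd_percent(tone, fs):.2f}%")
```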

The attack and release settings shown to be commonly used in Table 3 are the settings that generate the most distortion. Thus, it appears that professional mix engineers are, perhaps unbeknown to themselves, actively seeking out distortion from the 1176 in their workflow. Therefore, the aggressive sound quality once again appears to pertain to distortion. However, additional perceptual testing was required to substantiate this hypothesis. THD measurements made on the settings used in Listening Experiment 1 can be seen in Table 4, where the effect that both the attack and release settings and the all-buttons mode have on non-linearity can be observed.


**Table 4.** THD measurements made using a 1 kHz tone and the time constants and ratios used in Listening Experiment 1.

#### **4. Perceptual Listening Experiments**

#### *4.1. Listening Experiment 1 Method*

To test the hypothesis, a subjective listening test was devised using the Web Audio Evaluation Tool (WAET), which made use of the Audio Perceptual Evaluation (APE) method. Stimuli were created by processing the vocal from two separate rock songs with 1176 hardware, using the attack/release combinations mentioned previously. To restrict the number of stimuli, the amount of compression was limited to one setting, which was −10 dB of gain reduction. Ciletti et al. note that to best assess the sonic signature of a compressor, it is advisable to use the device in a heavy state of compression [32]. The amount of gain reduction was measured to show an average of −10 VU on the gain reduction meter. The compressed vocals were then mixed back into the audio tracks, which were level matched to −23 LUFS. In addition, a mix making use of the uncompressed vocal was used to create a total of nine stimuli per song. All audio was recorded and processed at 24 bit/44.1 kHz.
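For reference, level matching of stimuli to −23 LUFS can be reproduced with the pyloudnorm library, as sketched below; the file name is hypothetical, and this is one possible tooling choice rather than the one used in the study:

```python
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("mix.wav")              # hypothetical stimulus file
meter = pyln.Meter(rate)                     # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(data)   # current integrated loudness in LUFS

# Gain the mix so its integrated loudness lands at -23 LUFS
matched = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("mix_23lufs.wav", matched, rate)
```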

Listeners were presented with the stimuli on four separate screens of the listening test (two per song), where they were asked to rate the amount of perceived distortion on two screens and the amount of perceived aggression on the remainder. Scales on the interfaces were labelled from least distorted to most distorted and least aggressive to most aggressive and were measured on a scale from zero to one. Participants were not instructed explicitly what aggressive meant, as the author wished to avoid training the listeners with their interpretation of the descriptor. The order of the audio and screens was randomized to prevent bias, and the test was carried out by 17 expert listeners in a university music laboratory environment using Sennheiser HD650 headphones on iMac computers. The sample size was considered to be appropriate for a test of this nature and is commensurate with ITU recommendations [33].

#### 4.1.1. Listening Experiment 1 Results and Discussion

The results from the listening test can be seen in Figure 4, where (a) shows the mean result for the descriptor aggressive with a 95% confidence interval for both songs and all time constant settings tested. As can be seen, there is little difference between the time constant settings for both the ratios tested, but there is a difference between the uncompressed material, the 4:1 ratio and the all-buttons mode. It is worth noting that the two all-buttons modes that measured highest for THD (see Table 4 for THD results) are not rated any higher than the other two all-buttons settings. An inspection of the FFT plots suggests this is a result of the even-order harmonics remaining at fairly consistent levels across the four settings, while the odd-order harmonics are attenuated as attack and release are slowed. This results in a lower THD measurement, which evidently does not result in a perceptually less aggressive sonic signature. It should be noted that many of the participants reported that the difference between some of the stimuli was small, and they found the test to be challenging. Therefore, the effect of listener fatigue should be kept in mind. The ratings for distortion are illustrated in Figure 4b, where a similar trend is visible. Once again, there is a difference between the ratios and the uncompressed material, but no difference between the different time constant settings for the two ratio settings. Thus, it appears that there is little perceptual difference between the time constant settings used in the current study, but the additional harmonic distortion created in the all-buttons mode is noticeable to the participants of the experiment.

**Figure 4.** (**a**) Results for the descriptor aggressive from listening experiment 1; (**b**) results for the distortion from listening experiment 1. Note, no significance was found between time constant settings, but significance was found between ratio settings.

Audio features pertaining to the noise-like properties of sound (Roughness and Zero Crossing Rate) were extracted from the vocal tracks using MIRtoolbox for Matlab [34] and are presented in Table 5. The results for roughness show that the feature increases in value between the uncompressed audio and both ratio settings, and also between 4:1 and all-buttons mode. Within the time constant settings for each ratio, the results with the release time set to seven are the highest and this is largely comparable with the THD results shown in Table 4. However, the A1R7 combination for both ratio settings has slightly larger roughness values than the A3R7 combination, while the THD results highlight the A3R7 combination as having the largest amount of THD. The similarity in results within the ratio settings for roughness may be another reason why listeners rated the time constant settings comparably, despite the variation in THD. The values for zero crossing rate (ZCR) are less revealing, with no clear pattern in the results emerging, apart from an increase in ZCR when using compression.


**Table 5.** Roughness and zero crossing rate (ZCR) features extracted from the vocal material used in test one.
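Of the two features, zero crossing rate is simple enough to compute directly, as sketched below; roughness requires a psychoacoustic model such as the one in MIRtoolbox and is not reproduced here:

```python
import numpy as np

def zero_crossing_rate(signal, fs):
    """Zero crossings per second, a rough proxy for the noisiness of a signal."""
    signs = np.signbit(signal)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings * fs / len(signal)

# Sanity check: a 1 kHz sine crosses zero about 2000 times per second
fs = 44100
t = np.arange(fs) / fs
print(zero_crossing_rate(np.sin(2 * np.pi * 1000 * t), fs))  # ~2000.0
```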

#### 4.1.2. Statistical Analysis of Experiment 1 Results

A two-way repeated measures ANOVA was run to determine the effect of the compression settings, and the interaction effect of the two songs and the compression settings, on perceived distortion. Mauchly's test of sphericity indicated that the assumption of sphericity had been violated for the two-way interaction between the song and settings, χ²(2) = 73.13, *p* = 0.001. Therefore, a Greenhouse–Geisser correction was applied (ε = 0.580). Mauchly's test of sphericity indicated that the assumption of sphericity had not been violated for the effect of the settings, χ²(2) = 48.45, *p* = 0.081.

Simple main effects were run and showed that there was no statistically significant two-way interaction between the songs and settings on perceived distortion, *F* (8,128) = 0.648, *p* = 0.653. There was, however, a statistically significant effect of the settings on perceived distortion, *F* (8,128) = 50.97, *p* < 0.001. Post-hoc analysis with a Bonferroni adjustment showed the mean distortion scores for the 4:1 and all-buttons settings were statistically significantly higher than the scores for no compression (*p* < 0.001). In addition, the mean distortion scores for the all-buttons settings were statistically significantly higher than the scores for the 4:1 ratio settings (*p* < 0.001). Within the four different time constant settings used for both 4:1 and all-buttons, there was no statistical significance.
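An analysis of this kind can be reproduced in Python with statsmodels' AnovaRM, as sketched below under the assumption of a long-format table of per-listener ratings (the study does not state which statistics package it used, and the file name is hypothetical):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format ratings: one row per listener x song x setting
df = pd.read_csv("distortion_ratings.csv")  # columns: listener, song, setting, rating

# Two-way repeated-measures ANOVA with song and setting as within-subject factors
result = AnovaRM(df, depvar="rating", subject="listener",
                 within=["song", "setting"]).fit()
print(result)
```

Note that AnovaRM does not apply sphericity corrections automatically; a Greenhouse–Geisser correction, as used here, would need to be computed separately.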

A second two-way repeated measures ANOVA was run to determine the effect of the compression settings, and the interaction effect of the two songs and the compression settings, on perceived aggression. Again, Mauchly's test of sphericity indicated that the assumption of sphericity had been violated for the two-way interaction between the song and settings, χ²(2) = 53.99, *p* = 0.028; thus, a Greenhouse–Geisser correction was applied (ε = 0.531). Mauchly's test of sphericity indicated that the assumption of sphericity had not been violated for the effect of settings, χ²(2) = 27.98, *p* = 0.081.

Simple main effects were run and showed that there was no statistically significant two-way interaction between the songs and settings on aggressive sound quality, *F* (8,128) = 0.301, *p* = 0.886. There was, however, a statistically significant effect of settings on aggressive sound quality, *F* (8,128) = 69.26, *p* < 0.001, suggesting that settings have a statistically significant effect on aggressive sound quality. Post-hoc analysis with a Bonferroni adjustment showed the same statistical significance between no compression and the two ratio settings as reported previously for distortion. Again, there was no statistical significance within the four different time constant settings used for either 4:1 or all-buttons, meaning that, in the current study, different time constant settings have no significant effect on the perception of distortion or an aggressive sound quality.

The mean scores for aggressive and distortion were analyzed to assess if there was a statistically significant correlation between the scores. Both variables (aggressive and distortion) for both songs were normally distributed, as assessed by a Shapiro–Wilk test (*p* > 0.05); thus, the variables were investigated for correlation. Pearson's product–moment correlation was run to determine the relationship between perceived aggressive and distortion scores and the results show that there is a strong correlation between the mean scores for aggressive and distortion, which is statistically significant for song one (*r* = 0.960, *n* = 9, *p* = 0.001) and song two (*r* = 0.983, *n* = 9, *p* = 0.001). A scatter plot of the mean scores for aggressive and distortion for both songs is illustrated in Figure 5, where the correlation between the two can be clearly observed.

Correlation between the aggressive scores and the roughness features extracted from the vocal files was investigated by running Pearson's product–moment correlation. The results show a strong correlation between roughness and aggressive, which is statistically significant for song one (*r* = 0.968, *n* = 9, *p* = 0.001) and song two (*r* = 0.962, *n* = 9, *p* = 0.001).

**Figure 5.** Scatter plot for aggression and distortion mean scores.
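The correlation computation itself is a one-liner with SciPy; in the sketch below, the arrays are placeholders standing in for the nine per-stimulus mean scores, not the study's data:

```python
from scipy.stats import pearsonr

# Placeholder mean scores for the nine stimuli of one song (illustrative only)
aggressive = [0.10, 0.35, 0.38, 0.36, 0.37, 0.72, 0.74, 0.73, 0.75]
distortion = [0.08, 0.30, 0.33, 0.32, 0.34, 0.70, 0.71, 0.69, 0.74]

r, p = pearsonr(aggressive, distortion)
print(f"r = {r:.3f}, p = {p:.4f}")
```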

#### *4.2. Listening Experiment 2 Method*

The previous test demonstrated that vocals compressed with settings measured to have greater than or equal to 0.5% THD were rated as being the most aggressive. However, it could be argued the timing behaviour of the 1176, particularly when working in all-buttons mode, plays a role in the result. Therefore, a second test was devised, which aimed to decouple distortion and timing behaviour and answer whether distortion, timing behaviour or a mixture of both were the key components in the creation of an aggressive sound quality. The experiment made use of the APE listening test interface and had participants rate the vocal tracks of three separate songs on the aggressive quality of the vocals. The two songs used in the previous experiment were utilized again, as well as a third new rock song, which was added to give the results more validity over a wider range of test scenarios. Participants were also asked to comment on the audio they were hearing using up to three descriptors.

During the previous experiment, it was found the time constant settings had no significant effect on an aggressive sound quality; therefore, the vocal tracks were compressed with the hardware 1176 using only the A3R7 time constant (which measured highest for THD) in both the 4:1 and all-buttons ratio modes. In addition, the vocals were compressed with the Klanghelm DC8C software compressor, using settings that emulated the timing behaviour of the 1176 in both ratios and measured 0% THD. The timing behaviour was emulated by feeding the hardware 1176 and the software compressor a tone burst and adjusting the parameters of the software compressor until it closely matched the timing curve of the 1176 in both settings (see previous work by the author, where the tone burst method is used and discussed in detail [31]). While this method did not allow for exact matching of the 1176's timing curve, it did produce very similar results. A more robust method could make use of a specifically designed software compressor algorithm that allows the experimenter to simply turn distortion on and off, but this would require close modelling of the 1176, which was beyond the scope of the current study. Finally, all audio used in Experiment 2 was recorded and processed at 24 bit/44.1 kHz.
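
A rough sketch of the tone-burst matching idea follows: generate a burst, record each compressor's output, and compare sliding RMS envelopes until they align. The file names, burst levels, and window lengths are assumptions for illustration, not the published procedure.

```python
import numpy as np

fs = 44100

# 1 kHz tone burst: a louder 0.5 s segment framed by quieter 'idle' tone,
# so the attack and release of each compressor are both exercised.
t = np.arange(int(0.5 * fs)) / fs
tone = np.sin(2 * np.pi * 1000 * t)
stimulus = np.concatenate([0.1 * tone, tone, 0.1 * tone])

def rms_envelope(x, fs, win_ms=5.0):
    """Sliding RMS envelope used to visualize gain reduction over time."""
    n = int(fs * win_ms / 1000)
    padded = np.pad(x ** 2, (n // 2, n // 2), mode='edge')
    return np.sqrt(np.convolve(padded, np.ones(n) / n, mode='valid')[:len(x)])

# In practice the stimulus is played through the hardware 1176 and the
# software compressor; their recorded outputs (hypothetical files) are then
# compared, adjusting the software parameters until the envelopes match:
# y_1176, _ = soundfile.read('1176_burst.wav')
# y_sw, _ = soundfile.read('dc8c_burst.wav')
# mismatch = np.abs(rms_envelope(y_1176, fs) - rms_envelope(y_sw, fs)).max()
```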

#### 4.2.1. Listening Experiment 2 Results and Discussion

The results from the second listening experiment are depicted in Figure 6, which represents the mean result for the descriptor aggressive with a 95% confidence interval for all three songs and all time constant settings tested.

**Figure 6.** Aggressive results from the second listening experiment.

Looking at the plot, there is an overlap between the scores for SW 4:1 and SW All for songs one and three, and an overlap between SW All and 1176 4:1 for song two. However, it is apparent that the 1176 all-buttons setting was rated as the most aggressive for all three songs and that the clean software emulation measured to have 0% THD does not score as high as the 1176 all-buttons mode. Thus, the results suggest that compression which generates audible distortion is needed for the most aggressive vocal sonic signatures.

#### 4.2.2. Statistical Analysis of Experiment 2 Results

A two-way repeated measures ANOVA was run to determine the effect of compression settings measured to have or not have distortion, and the interaction effect of the three songs and compression settings, on an aggressive sound quality. Mauchly's test of sphericity indicated that the assumption of sphericity had been violated for the two-way interaction between the song and settings, χ2(2) = 71.82, *p* = 0.001. Therefore, a Greenhouse–Geisser correction was applied (ε = 0.578). Mauchly's test of sphericity indicated that the assumption of sphericity had not been violated for the effect of the settings, χ2(2) = 7.45, *p* = 0.593.

Simple main effects were run and showed, again, that there was no statistically significant two-way interaction between the songs and settings on aggressive sound quality, *F* (8,136) = 0.208, *p* = 0.081. There was, however, a statistically significant main effect of settings on an aggressive sound quality, *F* (4,68) = 181.722, *p* < 0.001. Post-hoc analysis with a Bonferroni adjustment showed the mean aggressive scores for all compressed settings were statistically significantly higher than the scores for no compression (*p* < 0.001). The scores for the two 1176 settings were statistically different from one another (*p* < 0.001), but the scores for the two software settings were not (*p* = 0.57). This indicates that the faster timing behaviour of the SW All setting, which emulated the timing curve of the 1176 in all-buttons mode, has little additional effect over the SW 4:1 setting in the creation of an aggressive vocal sonic signature. The scores for both 1176 settings were statistically higher than the scores for both software settings (*p* < 0.001). This indicates that while a clean, fast-acting compressor can give a vocal a more aggressive sound quality than the uncompressed audio, compression settings that impart audible distortion are required for the most significant effect.

#### 4.2.3. Textural Analysis of Descriptors Used by the Participants

Participants of the second listening experiment were encouraged to write descriptors to describe the sound of the vocal in the stimuli they had heard. WAET allows the test designer to include text boxes in the listening test's interface. Thus, participants recorded descriptors into these boxes during listening. A total of 88% of the participants noted descriptors and Figure 7 shows word frequency plots of the twenty most frequently used descriptors for each compressor.
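
Tallying such free-text responses is straightforward; the snippet below uses Python's collections.Counter on a hypothetical descriptor list standing in for the WAET text-box entries.

```python
from collections import Counter

# Placeholder descriptors; in the study these came from WAET text boxes.
descriptors = ['Distorted', 'gritty', 'harsh', 'distorted', 'bright',
               'crunchy', 'Harsh', 'present', 'dirty', 'distorted']

# Normalize case and whitespace, then count term frequencies.
counts = Counter(d.strip().lower() for d in descriptors)
top20 = counts.most_common(20)  # the terms shown in a word-frequency plot
print(top20)
```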

**Figure 7.** (**a**) Descriptors used for the Universal Audio 1176 (1176) compressor; (**b**) descriptors used for the clean software compressor measured to have 0% THD. Descriptors for both ratios and all three songs have been combined for both compressors.

The word distorted is the most frequently used descriptor for the 1176 compressor. Moreover, descriptors shown to relate to distortion in Figure 1, namely gritty, crunchy and dirty, also appear often for the 1176. Harsh is also a popular term for this compressor and may be related to distortion, although one could argue it is a hedonic judgement of preference. Present and bright are two prevalent terms for the 1176, and this is commensurate with the long-term average spectrum (LTAS) plot shown in Figure 8. The LTAS measurements were plotted with a Matlab function [35] using 1/16th octave smoothing. Only one of the songs used in the listening experiments is presented in Figure 8; however, all songs show a similar result, namely that the 1176 has more energy in the high end of the frequency spectrum compared with the uncompressed material and the clean software compressor output. In Figure 8, the increased energy occurs from 4 kHz onwards. Furthermore, the brightness, presence and harshness noted by the participants when listening to 1176 audio may be related to the descriptor sibilance. Further work should investigate the association between these descriptors by conducting perceptual listening experiments in which the researcher controls these attributes.
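
For readers wishing to reproduce a comparable measurement outside Matlab, the sketch below computes an LTAS and applies simple 1/16th-octave smoothing in Python. The FFT size, hop, and smoothing rule are illustrative assumptions rather than the exact parameters of the function used in [35].

```python
import numpy as np

def ltas_db(x, fs, nfft=8192, hop=4096):
    """Long-term average spectrum: mean magnitude of windowed FFT frames."""
    win = np.hanning(nfft)
    frames = np.array([x[i:i + nfft] * win
                       for i in range(0, len(x) - nfft, hop)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return 20 * np.log10(mags.mean(axis=0) + 1e-12)

def smooth_fractional_octave(spec_db, fs, nfft, fraction=16):
    """1/N-octave smoothing: average bins within +/- 1/(2N) octave."""
    freqs = np.fft.rfftfreq(nfft, 1 / fs)
    out = spec_db.copy()
    for k in range(1, len(freqs)):
        lo = freqs[k] * 2 ** (-1 / (2 * fraction))
        hi = freqs[k] * 2 ** (1 / (2 * fraction))
        band = (freqs >= lo) & (freqs <= hi)
        out[k] = spec_db[band].mean()
    return out

# Usage: smooth_fractional_octave(ltas_db(signal, 44100), 44100, 8192)
```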

Figure 7b illustrates the descriptors used to describe the sound quality of the clean software compressor. The most common term is soft, and one can argue that this word is being used as an antonym for aggressive. Natural, smooth, compressed, round, weak and bright are also used by the participants to describe this compressor. Except for bright, these also appear to be terms describing the antithesis of an aggressive sound quality. A study carried out by Bernays and Traube [36], which investigated descriptors used to describe piano timbre, found the terms soft, velvety, round and full-bodied were connected. It is worth noting that rounded and smooth are also connected terms in Figure 1. Further work by the author aims to obtain a better understanding of the most popular descriptors shown in Figure 7 and how they relate to the timbre of DRC.

**Figure 8.** Long-term average spectrum (LTAS) measurement from the uncompressed audio (green), the clean software compressor (blue) and the 1176 compressor in all-buttons mode (red).

#### **5. Conclusions**

This paper has shown that professional engineers use the descriptor "aggressive" when describing the sound quality of compression techniques that distort the signal. The first listening experiment demonstrated that there is a strong positive correlation between the listeners' scores for distorted and aggressive when rating the same audio stimuli in a controlled listening experiment. It was also shown that compression settings measured to have 0.5% THD and above were rated as both the most distorted and the most aggressive, but there was no significant difference between settings measured to have more than 0.5% THD. This means that, in the current study, listeners could not discern a noticeable difference in perceived distortion or aggression amongst audio measured between 0.5% and 1.58% THD. The various time constant settings used in the experiment, which were gleaned from common settings used in the industry, had no significant effect on the perception of distortion or aggressive sonic signatures. Finally, the experiment revealed a strong correlation between settings rated as aggressive and the audio feature roughness, suggesting that roughness plays a role in the perception of aggressive-sounding audio.

The second listening experiment revealed that compression which imparts distortion onto the program material is needed to achieve the most aggressive sound qualities. Fast compression with no distortion (as emulated with the clean software compressor) can increase an aggressive sound quality, but the effect is not nearly as significant as fast-acting compression combined with distortion artefacts. Both experiments indicated there was no interaction effect between the songs used and the compression settings. Thus, it appears that the songs had little bearing on the results, and the findings from these two experiments should translate to other songs in similar genres.

Finally, a textual analysis conducted on descriptors gathered during the second experiment highlighted the use of descriptors which relate to distortion. The author plans to carry out a new study investigating the lexicon of distortion and the similarities between these terms. The results of that study will afford the academic and professional community a better understanding of how music producers describe and implement distortion in music production.

**Funding:** This research received no external funding.

**Acknowledgments:** The author would like to thank Jonathan Wakefield for his help and advice on experimental design.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Design and Application of the BiVib Audio-Tactile Piano Sample Library**

#### **Stefano Papetti 1,†, Federico Avanzini 2,† and Federico Fontana 3,\*,†**


Received: 11 January 2019; Accepted: 26 February 2019; Published: 4 March 2019

**Abstract:** A library of piano samples composed of binaural recordings and keyboard vibrations has been built, with the aim of sharing accurate data that in recent years have successfully advanced knowledge on several aspects of the musical keyboard and its multimodal feedback to the performer. All samples were recorded using calibrated measurement equipment on two Yamaha Disklavier pianos, one grand and one upright model. This paper documents the sample acquisition procedure, with related calibration data. Then, for sound and vibration analysis, it is shown how physical quantities such as sound intensity and vibration acceleration can be inferred from the recorded samples. Finally, the paper describes how the samples can be used to correctly reproduce binaural sound and keyboard vibrations. The library has the potential to support experimental research on the psycho-physical, cognitive and experiential effects caused by the keyboard's multimodal feedback in musicians and other users or, outside the laboratory, to enable an immersive personal piano performance.

**Keywords:** musical haptics; piano; auditory feedback; tactile feedback; binaural audio; keyboard vibrations; measurement; recording

#### **1. Introduction**

During instrumental performance, musicians are exposed to auditory, visual and also somatosensory cues. This multisensory experience has long been studied [1–5]; however, the specific interaction between sound and vibrations has been the object of systematic research since the 1980s [6–12], when tactile and force feedback cues started to be recognized as having a prominent role in the complex perception–action mechanisms occurring during musical instrument playing [13]. More recently, research on the somatosensory perception of musical instruments has been consolidated, as testified by the emerging "musical haptics" topic [14].

This increased interest is partly due to the increased availability of accurate yet affordable sensors and actuators, capable of recording touch gestures and rendering vibrations and force, respectively. Using these devices, complex experimental settings can be realized to measure and deliver multisensory information in a musical instrument, often as a result of a real-time analysis and synthesis process taking place during the performance [15–17]. Once integrated with traditional audio microphone and loudspeaker systems, touch sensors and actuators can be employed first to investigate the joint role of the auditory and somatosensory modality in the perception of a musical instrument, and then to realize novel musical interfaces and instruments building on the lessons of the previous investigation. Through this process, richer or even unconventional feedback cues can be conveyed to the performer, with the aim of increasing engagement, and hence the initial acceptability and subsequent playability of the new instrument [18–21].

In this scenario, the availability of multimodal databases combining and synchronizing different streams of information (audio, video, kinematic data of the instrument and performer in action, physiological signals, interactions among musicians etc.) is increasingly recognized as an essential asset for studying music performance. Recent examples include the "multimodal string quartet performance dataset" (QUARTET) [22], the "University of Rochester multimodal music performance dataset" (URMP) [23], the "database for emotion analysis using physiological signals" (DEAP) [24], the "TU-Note violin sample library" [25]. Furthermore, initiatives aiming at systematizing the creation of these databases have recently appeared, such as RepoVizz [26], a framework for storing, browsing, and visualizing synchronous multimodal data. Overall, this scenario suggests an increasing attention to the design of musical instrument sound databases, motivated by the concrete possibility for their content to be reproduced in instrumental settings including not only loudspeakers and headphones, but also haptic and robotic devices.

The piano represents an especially relevant case study not only for its importance in the history of Western musical tradition, but also for its potential in the musical instruments market due to the universality of the keyboard interface, a feature that has traditionally induced novel musical instrument makers to propose conservative instances of the standard piano keyboard whenever this interface made it possible to control even revolutionary sound synthesis methods [27]. It is no accident that, in recent years, sales of digital pianos and keyboard synthesizers have shown a growing trend as opposed to other instrument sales (https://www.namm.org/membership/global-report). For the same reason, researchers on new musical instruments have steadily elected the piano keyboard as the platform of choice for designing expansions of the traditional paradigm, affording a performer the ability to accurately play two local selections of the 88 available tones with the desired amplitude and temporal development [28,29].

When playing an acoustic piano, the performer is exposed to a variety of auditory, visual, somatosensory, and vibrotactile cues that combine and integrate to shape the pianist's perception–action loop. The present authors are involved in a long-term research collaboration around this topic, with a focus on two main aspects. The first one is the tactile feedback produced by keyboard vibrations, which reach the pianist's fingers after keystrokes and remain active until key release. The second one is the auditory spatial information in the sound field radiated by the instrument at the performer's head position. Binaural piano tones are offered by a few audio plugin developers (e.g., Modartt Pianoteq (https://www.pianoteq.com/)) and digital piano manufacturers (e.g., Yamaha Clavinova (https://europe.yamaha.com/en/products/musical\_instruments/pianos/clavinova/)), but they have limited flexibility of use. Free binaural piano samples can be found, too, such as the "binaural upright piano" library (https://www.michaelpichermusic.com/binaural-upright-piano), which however offers only three dynamic levels (as opposed to the ten levels provided by the present dataset). More generally, concerning the presentation of audio-tactile piano tones, the existing literature is scarce and provides mixed if not contradictory results about the actual perceptibility and possible relevance of this multisensory information [8]. Specific discussions of these aspects have been provided in previously published studies, regarding both sound localization [30] and vibration perception [12] on the acoustic piano. As a notable result of these studies, a digital piano prototype was developed that reproduces various types of vibrations [20], including those generated by acoustic pianos as an unavoidable by-product of their sound production.

Across the years, this research has resulted in the production of an extensive amount of experimental data, most of it resulting from highly accurate measurements with calibrated devices. In an effort to provide public access to such data, the authors of this paper present a dataset of audio-tactile piano samples organized as libraries of synchronized binaural sound and vibration signals.

The dataset contains samples relative to all 88 tones, played at different dynamics on two instruments: a grand and an upright piano. A preliminary version was presented at a recent conference [31], and has now been updated by a new release containing additional upright piano sounds along with the binaural impulse responses of the room in which the same piano was recorded. In order to use this dataset, it is necessary to take into account the recording conditions and, on the user side, to take control of the rendering system in an effort to match the reproduction to the characteristics of the original recorded signals. For this reason, Section 2 describes the hardware/software recording setup and the organization of samples into libraries for the free version of a popular music software sampler. An explanation about how to reproduce the database is provided in Section 3. Section 4 suggests some applications of this library, based on the authors' past design experiences with multimodal piano samples, and on more experiences that are foreseen for future research.

Among past experiences, the most prominent have consisted of studies on the role of haptic feedback during the performance, both vibratory and somatosensory: the former concerns the perception of keyboard vibrations, their accurate reproduction and their effects on the performance [12,20]; the latter concerns the influence of actively playing on the keyboard on the auditory localization of piano tones [30,32]. Such experiences led to the decision to share the dataset with the scientific community, with the goal of fostering research on the role of vibrations and tone localization in the pianist's perceived instrument quality (both not completely understood yet), as well as adding knowledge about the cognitive importance of multisensory feedback for use in the design of novel keyboard interfaces.

#### **2. Creation of the BiVib library**

The BiVib (binaural and vibratory) sample library is a collection of high-resolution audio files (.wav format, 24-bit @ 96 kHz) containing binaural piano sounds and keyboard vibrations, coming along with documentation and project files for its reproduction through a free music software sampler. The dataset, whose core structure is illustrated in Table 1, is made available through an open-access data repository (https://zenodo.org/record/2573232) and released under a Creative Commons (CC BY-NC-SA 4.0) license.


**Table 1.** Binaural and vibratory (BiVib) core structure. Piano lid configurations are given in square brackets.

#### *2.1. Recording Hardware*

The samples were recorded from two Yamaha Disklavier pianos—a grand model DC3 M4 located in Padova (PD), Italy, and an upright model DU1A with control unit DKC-850 located in Zurich (ZH), Switzerland. Disklaviers are Musical Instrument Digital Interface (MIDI from here on)-compliant acoustic pianos equipped with sensors for recording keystrokes and pedaling, and operating electromechanical motors for playback. The grand piano is located in a large laboratory space (approximately 6 × 4 m), while the upright piano is in an acoustically treated small room (approximately 4 × 2 m).

Binaural audio recordings made use of dummy heads for acoustic measurements, with slightly different setups in PD and ZH: the grand piano was recorded with a KEMAR 45BM (GRAS Sound & Vibration A/S, Holte, Denmark), whereas the upright piano was recorded with a Neumann KU 100 (Georg Neumann GmbH, Berlin, Germany). Both mannequins were placed in front of the pianos, approximately where the pianist's head is located on average (see Figure 1).

**Figure 1.** Binaural audio recording setup in Zurich (ZH). Piano lid in 'semi-open' position.

The dummy heads were connected to the microphone inputs of two professional audio interfaces, a RME Fireface 800 (PD, gain set to +40 dB) (Audio AG, Haimhausen, Germany) and a RME UCX (ZH, gain set to +20 dB) (Audio AG, Haimhausen, Germany). The pair of condenser microphones inside the dummy heads were respectively driven by two 26CB preamplifiers supplied by a 12AL power module (PD), and by 48V phantom power provided by the audio interface (ZH).

Three configurations of the keyboard lid were selected for each piano. The grand piano (PD) was measured with the lid closed, fully open, and removed (i.e., detached from the instrument). The upright piano was recorded with the lid closed, in semi-open position (see Figure 1), and fully open. Different lid configurations in fact add insight into the role of the mechanical noise coming from the moving keys in the creation of transient cues of lateralization in the sound field reaching the performer's ears [30]. As a result, three sets of binaural samples were recorded for both pianos—one set for each lid position.

Vibration recordings were acquired with a Wilcoxon Research 736 (Wilcoxon Sensing Technologies Inc., Amphenol, MD, USA) piezoelectric accelerometer connected to a Wilcoxon Research iT100M (Wilcoxon Sensing Technologies Inc., Amphenol, MD, USA) intelligent transmitter, whose AC-coupled output fed one line input of the audio interface. The accelerometer was manually attached with double-sided adhesive tape to each key in sequence, as shown in Figure 2. Its center was positioned 2 cm from the key edge, where most vibration modes are radiated efficiently and where pianists typically put their fingertips.

**Figure 2.** Vibration recording setup. A Wilcoxon Research 736 accelerometer is attached with adhesive tape to a key that is being played remotely via MIDI control.

#### *2.2. Recording Software*

Ten values of dynamics were chosen between MIDI key velocity 12 and 111 by evenly splitting this range into eleven intervals. This choice was motivated by a previous study by the present authors, which reported that both Disklaviers produced inconsistent dynamics outside this velocity range [12]. In general, the servo-mechanics of computer-controlled pianos fall short (to an extent that depends on the model) of providing a reliable response at extreme dynamic values [33]. With this constraint in mind, two different software setups were used for sampling sounds and vibrations, respectively.

Binaural samples were recorded via an automatic procedure programmed in SuperCollider (a programming environment for sound processing and algorithmic composition: http://supercollider.github.io/). The recording sessions took place at night, thus minimizing unwanted noise coming from human activity in the building. On the grand piano, note lengths were determined algorithmically depending on their dynamics and pitch, ranging from 30 s when A0 was played at key velocity 111, to 10 s when C8 was played at key velocity 12. In fact, notes of increasing pitch and/or decreasing dynamics have shorter decay times. These durations allow each note to fade out completely, while minimizing silent recordings and the overall duration of the recording sessions—still, each session lasted approximately six hours.
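
The exact duration rule is not documented beyond the two endpoints quoted above; as a plausible illustration, the following Python function interpolates bilinearly between them (the interpolation scheme itself is an assumption).

```python
def note_duration_s(midi_note: int, velocity: int) -> float:
    """Recording length interpolated between the two stated anchors:
    30 s for A0 (MIDI 21) at velocity 111, 10 s for C8 (MIDI 108) at
    velocity 12. The bilinear rule is a guess, not the documented one."""
    pitch_w = (108 - midi_note) / (108 - 21)   # 1 at A0, 0 at C8
    vel_w = (velocity - 12) / (111 - 12)       # 0 at velocity 12, 1 at 111
    return 10.0 + 20.0 * (pitch_w + vel_w) / 2.0

assert note_duration_s(21, 111) == 30.0   # A0, fortissimo
assert note_duration_s(108, 12) == 10.0   # C8, pianissimo
```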

In contrast, an undocumented protection mechanism on the upright piano prevents its electromechanical system from holding down the keys for more than about 17 s, thus making a complete decay impossible for some notes, especially at low pitches and high dynamics. In this case, for the sake of simplicity, all tones were recorded for as long as possible.

Vibration samples were recorded through a less sophisticated procedure. A digital audio workstation (DAW) was used to play back all notes in sequence at the same ten MIDI velocity values as those used for the binaural audio recordings, using a constant duration of 16 s. This duration is in fact greater than the time taken by any key vibration to decay below sensitivity thresholds [12,34].

#### *2.3. Sample Pre-Processing*

Because of the mechanics of piano keyboards and the intrinsic limitations of computer-controlled electro-mechanical actuation, a systematic delay is introduced while reproducing MIDI note ON messages, which mainly varies with key dynamics. For this reason, all recorded samples started with silence of varying durations, which had to be removed in view of their use in a sampler (see Section 2.4). Given the number of files that had to be pre-processed (880 for each set), an automated procedure was implemented in SuperCollider to cut the initial silence of each audio sample. The procedure analyzed the amplitude envelope, detected the position of the largest peak, and finally applied a short fade-in starting a few milliseconds before the peak.
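
A minimal Python sketch of such a trimming pass is shown below, assuming the soundfile library; the pre-peak margin and fade length are illustrative, and the original SuperCollider procedure is not reproduced here.

```python
import numpy as np
import soundfile as sf  # assumed available for file I/O

def trim_initial_silence(in_path, out_path, pre_ms=5.0, fade_ms=3.0):
    """Locate the largest amplitude peak, keep a few milliseconds before
    it, and apply a short linear fade-in, as described in the text."""
    x, fs = sf.read(in_path)
    mono = x if x.ndim == 1 else x.mean(axis=1)
    peak = int(np.argmax(np.abs(mono)))
    start = max(0, peak - int(fs * pre_ms / 1000))
    y = np.array(x[start:], copy=True)
    ramp = np.linspace(0.0, 1.0, int(fs * fade_ms / 1000))
    if y.ndim == 1:
        y[:len(ramp)] *= ramp[:len(y)]
    else:
        y[:len(ramp)] *= ramp[:len(y), None]
    sf.write(out_path, y, fs)
```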

Additionally, vibration signals presented abrupt onsets in the first 200–250 ms right after the starting silence, as a consequence of the initial fly of the key and of the following impact against the piano keybed (see Figure 3). These onsets are not related to keyboard vibrations, and therefore they had to be removed.

**Figure 3.** Vibration signal recorded on the grand Disklavier by playing the note A2 at MIDI velocity 12. Picture from [12].

As such onset profiles showed large variability, no reliable automated procedure could be realized for editing the vibration samples, in spite of several tests made in MATLAB. A manual approach was employed instead: files were imported in a sound editor, their waveform was zoomed in and played back, and the onset was cut off.

#### *2.4. Sample Library Organization*

The sample library was organized for playback with the 'Kontakt Player' software—a free version of Native Instruments' Kontakt sampler (https://www.native-instruments.com/en/products/komplete/samplers/kontakt-5-player/), available for Windows and Mac OS systems. The full version of Kontakt 5 was instead used to develop the Kontakt project files described below. The resulting library is organized into several folders, named 'Instruments', 'Multis', 'Resources', and 'Samples'.

The 'Samples' folder—whose total size amounts to about 65 GB—holds separate subfolders for the binaural and vibration samples, respectively, which contain further subfolders for each sample set (see Table 1), for example 'grand-open' under the 'binaural' folder.

Independent of their type, sample files were named according to the following mask:

`[note][octave #]_[lower MIDI velocity]_[upper MIDI velocity].wav`

where [note] follows the English note-naming convention, [octave #] ranges from 0 to 8, [lower MIDI velocity] equals the MIDI velocity (range 12–111) used during recording and is the smallest velocity value mapped to that sample in Kontakt (see below), and [upper MIDI velocity] is the largest velocity value mapped to that sample in Kontakt. As an example, the file `A4_100_110.wav` corresponds to the note A from the 4th octave (fundamental frequency 440 Hz), recorded at MIDI velocity 100 and mapped to the velocity range 100–110 in Kontakt. Since the lowest recorded velocity value was 12, for consistency no samples were mapped to the velocity range 1–11 in Kontakt.
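
A small Python helper illustrating how this mask can be parsed is sketched below; the sharp-only accidental spelling in the regular expression is an assumption, as the library's exact accidental convention is not restated here.

```python
import re

NAME_RE = re.compile(r'^([A-G]#?)(\d)_(\d+)_(\d+)\.wav$')

def parse_sample_name(filename: str):
    """Split e.g. 'A4_100_110.wav' into note, octave and velocity range."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f'unexpected file name: {filename}')
    note, octave, lo, hi = m.groups()
    return note, int(octave), int(lo), int(hi)

assert parse_sample_name('A4_100_110.wav') == ('A', 4, 100, 110)
```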

Following the terminology used in Kontakt, each instrument reproduces a sample set (e.g., binaural recording of the grand piano with lid open), while each multi combines two instruments respectively reproducing one binaural and one vibration sample set belonging to the same piano. The two instruments in each multi are configured so as to receive MIDI input data on channel 1, thus playing back at once, while their respective outputs are routed to different virtual channels in Kontakt: binaural samples are routed to a pair of stereo channels (numbered 1–2), while vibration samples are played through a mono channel (numbered 3). In this way, when using audio interfaces offering more than two physical outputs, it is possible to render synchronized binaural and vibrotactile cues at the same time by routing the audio signal to headphones and, in parallel, the vibration signal to an amplifier driving one or more actuators.

In each instrument, sample mapping was implemented relying on the 'auto-map' feature found in the full version of Kontakt: this parses file names and uses the recognized tokens for assigning samples to e.g., a pitch and velocity range. The chosen file naming template made it straightforward to batch-import the samples.

The amplitude of the recorded signals was not altered, that is, no dynamic processing or amplitude normalization was applied. The volume of all Kontakt instruments was set to 0 dB. Because of this setting and of the adopted velocity mapping strategy, sample playback was kept as transparent as possible for simplifying the setup of acoustic and vibratory analysis procedures, experiments and interactive applications (see Sections 3 and 4).

#### **3. Application of the BiVib Library**

Binaural piano tones such as those offered by Yamaha digital pianos or the Modartt Pianoteq software synthesizer are not fully suitable for research purposes due to their undocumented, hence non-reproducible, acquisition procedures and/or post-processing of the sound signals. Moreover, the present authors have no evidence of public samples including piano keyboard vibrations. Thus, BiVib fills two gaps found in the datasets currently available for the reproduction of piano feedback.

#### *3.1. Sample Reproduction*

Experiments and applications requiring the use of calibrated data need exact reconstruction of the measured signals at the reproduction side: acoustic pressure for binaural sounds, and acceleration for keyboard vibrations. Obviously, the reproduction must take place on a set-up in which neither autonomous sounds nor vibrations are present. For instance, the reproduction of vibrations could take place on a weighted MIDI keyboard (such as those found in digital pianos), while binaural sounds may be rendered through headphones. Figure 4 (left) shows one such setting, in which a commercial digital piano is augmented through the setup schematized in Figure 4 (right). Note that the bottom of the digital piano has been reinforced by substituting the keybed with a thicker wooden panel, to form what we will call a *haptic* digital piano from here on.

**Figure 4. Left**: haptic digital piano customization for use with binaural and vibratory (BiVib). **Right**: schematic of a possible reproduction setting. Pictures from [20].

Knowledge of the recording equipment's nominal specifications enables this reconstruction. Such specifications are summarized in a companion document in the 'Documentation' folder. Two examples are listed below; a compact code restatement follows them:

• Vibration accelerations in m/s<sup>2</sup> are calculated from the vibration samples by making use of the nominal sensitivity values of the audio interface and accelerometer: voltage values are reconstructed from the recorded digital signals (represented by values between −1 and 1) by means of the line input sensitivity of the audio interface, whose full scale level (0 dBFS) was set to +19 dBu (reference voltage 0.775 V) for recording. Finally, voltage values are transformed into proportional acceleration values through the sensitivity constant of the accelerometer, equal to 10.2 mV/m/s<sup>2</sup> at 25 °C, with a flat (±5%) frequency response in the range 10–15,000 Hz.

As an example, the digital value 0.001 is equivalent to:

$$20 \times \log_{10}(|0.001|) = -60 \text{ dBFS} \tag{1}$$

and the corresponding voltage value is calculated based on the 0 dBFS level (+19 dBu, with 0.775 V reference) as follows:

$$0.775 \times 10^{\left(\left(19-60\right)/20\right)} = 6.9 \,\text{mV}.\tag{2}$$

Finally, the matching acceleration value is obtained as:

$$\frac{6.9 \,\mathrm{mV}}{10.2 \,\mathrm{mV/m/s^2}} = 0.677 \,\mathrm{m/s^2}.\tag{3}$$

• Analogously, acoustic pressure values in Pa are obtained from the binaural samples by making use of the relevant nominal specification values of the microphone input chain (microphone, pre-amplifier if present, audio interface input). For the upright piano: microphone sensitivity of 20 mV/Pa, audio interface input gain set to +26 dB for an equivalent 0 dBFS level of −16 dBu (reference 0.775 V). For the grand piano: microphone sensitivity 50 mV/Pa, preamplifier gain −0.35 dB, and audio interface input gain set to +40 dB for an equivalent 0 dBFS level of −19 dBu (reference 0.775 V).

As an example, the digital value 0.001, equivalent to −60 dBFS as seen in Equation (1), translates to the following voltages:

$$0.775 \times 10^{\left((-16-60)/20\right)} = 0.12 \text{ mV} \tag{4}$$

for the upright piano, or

$$0.775 \times 10^{\left((-19-60+0.35)/20\right)} = 0.09 \text{ mV} \tag{5}$$

for the grand piano. The equivalent acoustic pressures are then computed as:

$$\frac{0.12\,\text{mV}}{20\,\text{mV/Pa}} = 0.006\,\text{Pa}\ (\text{about } 49.5\ \text{dB SPL})\tag{6}$$

for the upright piano, or

$$\frac{0.09\,\mathrm{mV}}{50\,\mathrm{mV/Pa}} = 0.0018\,\mathrm{Pa}\ (\text{about } 39\ \text{dB SPL})\tag{7}$$

for the grand piano.
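
As a compact restatement of Equations (1)–(7), the Python sketch below converts digital sample values into acceleration or pressure using the nominal specifications quoted above; the function and parameter names are hypothetical.

```python
import numpy as np

DBU_REF_V = 0.775  # reference voltage for 0 dBu

def sample_to_volts(d, zero_dbfs_dbu, preamp_gain_db=0.0):
    """Digital value in [-1, 1] -> volts at the transducer, given the
    interface's 0 dBFS level in dBu and any preamplifier gain (dB)."""
    dbfs = 20 * np.log10(np.abs(d))                       # Equation (1)
    return DBU_REF_V * 10 ** ((zero_dbfs_dbu + dbfs - preamp_gain_db) / 20)

def to_acceleration(d, zero_dbfs_dbu=19.0, sens_mv_per_ms2=10.2):
    """Vibration sample -> acceleration in m/s^2 (grand piano chain)."""
    return sample_to_volts(d, zero_dbfs_dbu) * 1000 / sens_mv_per_ms2

def to_pressure(d, zero_dbfs_dbu, mic_sens_mv_pa, preamp_gain_db=0.0):
    """Binaural sample -> acoustic pressure in Pa."""
    v_mv = sample_to_volts(d, zero_dbfs_dbu, preamp_gain_db) * 1000
    return v_mv / mic_sens_mv_pa

print(to_acceleration(0.001))            # ~0.677 m/s^2, Equations (2)-(3)
print(to_pressure(0.001, -16.0, 20.0))   # ~0.006 Pa, Equations (4) and (6)
print(to_pressure(0.001, -19.0, 50.0, preamp_gain_db=-0.35))  # ~0.0018 Pa
```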

Once the original measurements have been reconstructed, physical quantities are, in principle, ready for the presentation of sounds through headphones and vibrations through tactile actuators, subject to the issues and limitations listed in the next section.

#### *3.2. Sample Equalization and Limitations*

BiVib users should hear the same sounds as the binaural pressure signals measured by the dummy head microphones, and should receive vibrations at the fingers which are identical to the acceleration signals measured by the accelerometer.

The audio reproduction condition can be satisfied by using headphones or earphones that bypass the outer ear and provide a frequency response that is as flat as possible. In this way, the performer listens to stereo sounds that contain the binaural information created by the dummy head. Note that these sounds also contain the contribution of room reverberation, as the pianos could not be located inside anechoic rooms during the recording sessions: in particular, the laboratory hosting the grand piano was moderately reverberant, and no reverberation data could be collected for this room at the time of the acquisition of samples. Conversely, the upright piano (ZH) benefited from a silent studio room whose impulse responses have been measured and recently included in BiVib (folder 'IR'). More precisely, responses from two source points, generated by a pair of Genelec 8040A loudspeakers, were measured at the two ear canal entrances of the Neumann KU 100 dummy head with both outer ears removed: as shown in Figure 5, the loudspeakers were placed symmetrically at both sides of the upright piano, 80 cm above the floor, pointing toward the dummy head (angle with the vertical plane parallel to the piano equal to 47°), which was positioned 55 cm from the upright panel of the piano. Logarithmic sweeps were synthesized in the Audacity audio editor using the Aurora modules [35], then reproduced through the loudspeakers, and finally deconvolved again with Aurora to obtain four room transfer functions forming the source–ear transfer matrix. Such transfer functions were included in the dataset and can be used to remove the echoes of the recording room through inversion of the transfer matrix, using standard deconvolution techniques [36]. Also, because of the low energy of these echoes, the binaural recordings were intentionally left unprocessed, thus allowing musicians and performers to use them as they are, while leaving any decision on their possible manipulation to advanced users.

**Figure 5.** Upright piano recording setup.

The vibration reproduction condition may be ideally satisfied by implementing a weighted MIDI keyboard (such as those found on digital pianos) where each key is provided with an actuator, similar to the prototype technology described in [37]. Even in the unlikely case that such actuators offered a flat frequency response over the range of interest for tactile sensation [38], vibration reproduction would be altered by the shape, material and construction of the keyboard. That requires the implementation of a compensation procedure—similar to that described in the previous paragraph to remove room acoustic components—that would enable the accurate reproduction of the vibrations recorded by the accelerometer on each key. Unfortunately, on such a keyboard there would be 88 touch points requiring deconvolution. Alternatively, two arrays of 88 transfer characteristics can each be formed by measuring vibrations on the same touch points after exciting the digital piano body with a corresponding pair of tactile transducers.

Figure 6 (left) [12] shows a detail of the haptic digital piano that the authors realized according to the latter solution, where two transducers were mounted below the keybed, one midway along the keyboard and the other approximately one quarter of the way along its length—see also Figure 4 (right).

**Figure 6. Left**: tactile transducer mounted below the haptic digital piano keybed. **Right**: average spectral flattening curve. Pictures from [20].

By playing an impulse simultaneously on both of them, advanced BiVib users should accurately measure the corresponding impulse response on each key, and then (with Aurora or similar tools) design an equalizer which on average flattens the "coloured" response affecting the vibrations along their path from the audio interface to the keyboard. Since such measures depend on the user's musical keyboard, they could not be included in BiVib as the binaural room transfer functions were. However, in the aforementioned digital piano customization, it was experimentally observed by the authors that the tactile transducers shown in Figure 6 (left) were mostly responsible for the inclusion of distinct spectral peaks affecting all the key responses. Hence, the average equalization curve shown in Figure 6 (right) was estimated and proved to be effective for flattening the peaks measured on an evenly-spaced subset of the keys. A similar curve can be approximated by BiVib users by designing an equalizer that flattens the frequency response of the tactile transducers equipping the setup. This response is normally reported in a good-quality transducer's data sheet.
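
One conventional way to realize such a flattening equalizer is to invert a measured (or datasheet) magnitude response with a linear-phase FIR filter. The sketch below uses SciPy's firwin2 with a hypothetical transducer response; it illustrates the general idea, not the authors' procedure.

```python
import numpy as np
from scipy.signal import firwin2

def inverse_eq_fir(freqs_hz, mag_db, fs, numtaps=2047, max_boost_db=12.0):
    """Linear-phase FIR approximating the inverse of a magnitude response;
    the boost is clipped to avoid excessive gain outside the usable band."""
    inv_db = np.clip(-np.asarray(mag_db, dtype=float),
                     -max_boost_db, max_boost_db)
    f = np.concatenate(([0.0], freqs_hz, [fs / 2]))
    g = 10 ** (np.concatenate(([inv_db[0]], inv_db, [inv_db[-1]])) / 20)
    return firwin2(numtaps, f, g, fs=fs)

# Hypothetical transducer response with peaks near 60 Hz and 200 Hz:
fs = 96000
freqs = np.array([20.0, 60.0, 120.0, 200.0, 400.0, 800.0])
resp_db = np.array([-6.0, 4.0, 0.0, 3.0, 0.0, -2.0])
taps = inverse_eq_fir(freqs, resp_db, fs)  # convolve with the vibration feed
```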

#### **4. Experiments, Applications and Future Work**

BiVib was originally created to support multisensory experiments in which precise control had to be maintained over the simultaneous auditory and vibrotactile stimuli reaching a performing pianist, particularly when judging the perceived quality of an instrument. In a recent paper, the authors were able to confirm subjective vibrotactile frequency thresholds of active touch [34] by conducting tests with pianists who were asked to detect vibrations at the piano keyboard [12]. A comparison between such thresholds and the spectrum of a fortissimo A0 tone is shown in Figure 7. This demonstrates a previously unreported result: during active playing, pianists are able to integrate tactile sensations that would be imperceptible in passive touch conditions [38].

**Figure 7.** Magnitude spectrum of the vibration signal at the A0 key, recorded on the upright Disklavier playing at MIDI velocity 111. The dash-dotted curve depicts the reference vibrotactile threshold for passive touch [38], while the two horizontal dashed lines represent the minimum and maximum thresholds recently measured under active touch conditions [34]. Picture adapted from [12].

A precise manipulation of the intensity relations between piano sound and vibrations may be used to investigate the existence of cross-modal effects occurring during piano playing. Such effects have been discovered as part of a more general multisensory integration mechanism [39] that under certain conditions can increase the perceived intensity of auditory signals [40] or, conversely, enhance touch perception [41]. Aiming to understand whether piano keyboard vibrations impact the perceived quality of the instrument and, as a secondary effect, the quality of a performance, the authors first observed significant differences in the perceived quality of vibrating vs. silently playing Disklavier pianos. This observation marks a point in favor of the former, especially since the pianists involved in the experiment reported being unaware of the existence of vibratory feedback [12]. In other words, while most pianists preferred the vibrating instrument, they did not consciously realize that their decision was caused by the vibrations produced (or not produced) by the instrument during the performance. Based on this result, a further experiment was designed for the haptic digital piano using diverse types of tactile feedback, synthesized by manipulating the BiVib samples. The test aimed at investigating possible consequences of the vibrotactile feedback on the pianist's playing experience (qualitative effect) and on the performance in terms of timing and dynamics accuracy (quantitative effect). Cross-modal effects resulting from varying the tactile feedback of the keyboard were observed; still, these preliminary results are far from giving a systematic view of the impact of the different sensory channels on the pianist's playing experience, and especially on the accuracy of execution [20].

Another potential use of BiVib is in the investigation of binaural spatial cues for the acoustic piano. Using the recommendations given in the previous section, single tones of the upright piano can in fact be accurately cleared of the room echoes and then reproduced at the position of the ear entrance. The existence of localization cues in piano sounds has not been completely understood yet. Even in pianos where these cues are reported to be audible by listeners, their exact acoustic origin is still an open question [42]. Moreover, visual cues of self-moving keys producing the corresponding tones (a condition possible on Disklavier pianos), as well as somatosensory cues occurring during active piano playing, may have an influence on localization judgments [30,32].

One further direction which may take advantage of BiVib deals with research in cognitive neuroscience: recently, pianists and their instrument served as key subjects for understanding diverse aspects of brain and motor development [43,44]. In this context, the contribution of the auditory, visual and tactile sensory modalities to this development has not been ascertained yet. Such knowledge could help not only to capture more general aspects of the development of human senses, but also to guide a perceptually and cognitively informed design of novel keyboard interfaces.

In summary, future research that can be conducted in the laboratory using BiVib includes tests aiming at conclusively understanding whether (i) vibrations affect the performance on the keyboard, and whether (ii) auditory lateralization is able to guide piano tone localization. If so, such multisensory cues may substantially contribute to the sense of engagement and, hence, improve the quality of the performance and make the learning curve of a keyboard interface more acceptable.

Outside the laboratory, the library can reward musicians who simply wish to use its sounds. On the one hand, the grand piano recordings contain echoes that bring distinct cues of the room where they were recorded. In this sense, they are ready for use, although marked by a precise acoustic footprint. On the other hand, the upright piano recordings are much closer to anechoic and, hence, far less difficult to spatialize than piano recordings taken outside an acoustically controlled room [45]. Consequently, they can be easily imported and conveniently personalized by users through artificial reverberation.

#### **5. Conclusions**

BiVib provides a unique set of multimodal piano data recorded using high-quality equipment in controlled conditions through reproducible computer-controlled procedures. Since its original release, the library has been enriched with binaural responses of the room where the upright piano was recorded. We hope that a successful use of the BiVib dataset, in conjunction with this documentation and through publicly available projects for the Kontakt software sampler, will facilitate further research in piano acoustics, performance, and new musical interface design also for educational purposes.

**Author Contributions:** Conceptualization, F.A., F.F. and S.P.; methodology, F.A., F.F. and S.P.; software, S.P.; resources, F.A., F.F. and S.P.; data curation, S.P.; writing—original draft preparation, S.P.; writing—review and editing, F.A., F.F. and S.P.

**Funding:** This research was partially funded by Swiss National Science Foundation grants number 150107 and 178972.

**Acknowledgments:** This research received support from the project AHMI (Audio-haptic modalities in musical interfaces, 2014–2016) and the project HAPTEEV (Haptic technology and evaluation for digital musical interfaces, 2018–2022), both funded by the Swiss National Science Foundation. Support also came from the PRID project SMARTLAND, funded by the University of Udine. The Disklavier grand model DC3 M4 located in Padova was made available by courtesy of the Sound and Music Processing Lab (SaMPL), a project of the Conservatory of Padova funded by the Cariparo Foundation (thanks in particular to Nicola Bernardini and Giorgio Klauer). Gianni Chiaradia donated the wooden panel used to build the haptic digital piano. The authors would like to thank the students and collaborators who contributed to the development of this work over the years, in chronological order: Francesco Zanini, Valerio Zanini, Andrea Ghirotto, Devid Bianco, Lorenzo Malavolta, Debora Scappin, Mattia Bernardi, Francesca Minchio, Martin Fröhlich.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Analysis and Modeling of Timbre Perception Features in Musical Sounds**

#### **Wei Jiang 1,2,3, Jingyu Liu 1,2,3, Xiaoyi Zhang 1,2,3, Shuang Wang 1,2,3 and Yujian Jiang 1,2,3,\***


Received: 25 December 2019; Accepted: 20 January 2020; Published: 22 January 2020

**Abstract:** A novel technique is proposed for the analysis and modeling of timbre perception features, including a new terminology system for evaluating timbre in musical instruments. This system consists of 16 evaluation terms for expert and novice listeners, including five pairs with opposite polarity. In addition, a material library containing 72 samples (including 37 Chinese orchestral instruments, 11 Chinese minority instruments, and 24 Western orchestral instruments) and a 54-sample objective acoustic parameter set were developed as part of the study. The method of successive categories was applied to each term for subjective assessment. A mathematical model of timbre perception features (i.e., bright or dark, raspy or mellow, sharp or vigorous, coarse or pure, and hoarse or consonant) was then developed for the first time using linear regression, support vector regression, a neural network, and random forest algorithms. Experimental results showed the proposed model accurately predicted these attributes. Finally, an improved technique for 3D timbre space construction is proposed. Auditory perception attributes for this 3D timbre space were determined by analyzing the correlation between each spatial dimension and the 16 timbre evaluation terms.

**Keywords:** feature extraction; timbre modeling; auditory perception; timbre space

#### **1. Introduction**

The subjective perception of sound originates from three auditory attributes: loudness, pitch, and timbre [1]. In recent years, researchers have established relatively mature evaluation models for loudness and pitch [2,3], but a quantitative calculation and assessment of timbre is far more complicated. Studies have shown that timbre is a critical acoustic cue for conveying musical emotion. It also provides an important basis for human recognition and classification of music, voice, and ambient sounds [4]. Therefore, the quantitative analysis of timbre and the establishment of a parameterized model are of significant interest in the fields of audio-visual information processing, music retrieval, and emotion recognition. The subjective nature of timbre complicates the evaluation process, which typically relies on subjective evaluations, signal processing, and statistical analysis. The American National Standards Institute (ANSI) defines timbre as an attribute of auditory sensation in terms of which a listener can judge that two sounds similarly presented and having the same loudness and pitch are dissimilar [5], making it an important factor for distinguishing musical tones [6].

Timbre evaluation terms (i.e., timbre adjectives) are an important metric for describing timbre perception features. As such, a comprehensive and representative terminology system is critical for ensuring the reliability of experimental auditory perception data. Conventionally, timbre evaluation research has focused on the fields of music and language sound quality, traffic road noise control, automobile or aircraft engine noise evaluation, audio equipment sound quality design, and soundscape evaluation. Among these, research in English-speaking countries is relatively mature, as shown in Table 1. However, differences in nationality, cultural background, customs, language, and environment inevitably affect the cognition of timbre evaluation terms [7–11]. In addition, Chinese instruments differ significantly from Western instruments in terms of their structure, production material, and sound production mechanisms. The timbre of Chinese instruments is also more diverse than that of Western instruments and existing English timbre evaluation terms may not be sufficient for describing these nuances. As such, the construction of musical timbre evaluation terms is of great significance to the study of Chinese instruments.


**Table 1.** Previous studies on timbre evaluation terms.

Timbre contains complex information concerning the source of a sound, and humans can recognize objects by listening to the sounds they produce [38]. As such, the quantitative analysis and description of timbre perception characteristics has broad implications in military and civil fields, such as instrument recognition [39], music emotion recognition [40], singing quality evaluation [41], active sonar echo detection [42], and underwater target recognition [43]. Developing a mathematical model of timbre perception features is vital to achieving a quantitative description of timbre. Two primary methods have conventionally been used to quantify timbre perception features. The first is the concept of psychoacoustic parameters [6]. That is, by analyzing the auditory characteristics of the human ear, a mathematical model can be established to represent subjective feelings such as sharpness, roughness, and fluctuation strength [44]. Since most of the experimental stimulus signals in these experiments were noise, the calculated values for musical signals differed from the subjective impressions, making this approach limited and one-sided. Another technique combines subjective evaluation experiments with statistical analysis. In other words, the experiment is designed according to differences in perceived features from sound signals, from which objective parameters can be extracted. The correlation between objective parameters and perceived features is established through statistical analysis or machine learning, which is then used to develop a mathematical model of the perceived features. This approach has been widely used in the fields of timbre modeling [45,46], music information retrieval [47], instrument classification [48], instrument consonance evaluation [49], interior car sound evaluation [50], and underwater target recognition [42]. However, the experimental materials in these studies were Western instruments or noise. Chinese instruments are unique in their mechanisms of sound production and playing techniques, producing a rich timbre variety. As such, it is necessary to use Chinese instruments as a stimulus to establish a more complete timbre perception model.

Timbre is an auditory attribute with multiple dimensions, which can be represented by a continuous timbre space. This structure is of great importance to the quantitative analysis and classification of sound properties. The semantic differential method was used in early timbre space research [12,13]. Recently, multidimensional scaling (MDS) based on dissimilarity has been used to construct these spaces. For example, Grey used 16 Western instrument sound samples to create a three-dimensional (3D) timbre space [51]. McAdams et al. studied the common dimensions of timbre spaces with synthetic sounds used as experimental materials, establishing a relationship between the dimensions of a space and the corresponding acoustic parameters [52]. Martens et al. used guitar timbre to study the differences in timbre spaces constructed under different language backgrounds [53,54]. Zacharakis and Pastiadis conducted a subjective evaluation and analysis using 16 Western musical instruments, proposing a luminance–texture–mass (LTM) model for semantic evaluation. In this process, six semantic scales were analyzed using principal component analysis (PCA) and multidimensional scaling (MDS) to produce two different timbre spaces [55]. Simurra and Queiroz used a set of 33 orchestral music excerpts that were subjectively rated using quantitative scales based on 13 pairs of opposing verbal attributes. Factor analysis was included to identify major perceptual categories associated with tactile and visual properties, such as mass, brightness, color, and scattering [56]. Multidimensional scaling requires the acquisition of a dissimilarity matrix covering every pair of samples. However, existing methods use a paired comparison technique for the subjective evaluation experiment. This approach not only involves a large experimental workload but also demands greater expertise from participants, making the evaluation scale difficult to control. This paper proposes a new indirect model for constructing timbre spaces based on the method of successive categories. In this system, the dissimilarity matrix is calculated based on experimental data from the method of successive categories. This reduces the workload and increases the stability and reliability of the data.
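
In outline, the proposed indirect construction takes mean ratings from the method of successive categories as per-sample profiles, derives a dissimilarity matrix from their pairwise distances, and feeds that matrix to MDS. The Python sketch below illustrates this pipeline with random stand-in ratings; the matrix sizes and the Euclidean metric are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Stand-in ratings: 72 instrument samples x 5 bipolar scales,
# as if averaged over listeners (values here are random placeholders).
rng = np.random.default_rng(0)
ratings = rng.uniform(1.0, 7.0, size=(72, 5))

# Indirect dissimilarity: distance between rating profiles replaces
# directly judged pairwise dissimilarities.
dissimilarity = squareform(pdist(ratings, metric='euclidean'))

# 3-D embedding from the precomputed dissimilarity matrix.
mds = MDS(n_components=3, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(dissimilarity)  # 3-D timbre space coordinates
print(coords.shape)  # (72, 3)
```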

The remainder of this paper is organized as follows. Section 2 introduces the timbre library construction process and Section 3 develops the timbre evaluation terminology. Section 4 introduces the perception feature model, and the timbre space is constructed in Section 5. Section 6 concludes the paper. The research methodology for the study is presented in Figure 1.

**Figure 1.** The proposed methodology.

#### **2. Timbre Database Construction**

#### *2.1. Timbre Material Collection*

A high-quality database of timbre materials was constructed by recording all materials required for the experiment in a full anechoic chamber with a background noise level of −2 dBA. The equipment included a BK 4190 free-field microphone and a BK LAN-XI 3560 AD converter. The performers were teachers and graduate students from the College of Music. Recordings consisted of musical scales and individual pieces of music. Avid Pro Tools HD software was used to edit the audio material. The length of each clip was between 6 and 10 s, the sampling rate was 44,100 Hz, the quantization accuracy was 16 bits, and all audio was saved in WAV format. Previous studies on timbre used Western instruments as stimulus materials. However, the variety of timbre samples needs to be as rich as possible to increase the accuracy of timbre perception features. The timbre variety was therefore enriched by using a collection of 72 different musical instruments, including 36 Chinese orchestral instruments, 12 Chinese minority instruments, and 24 Western orchestral instruments. The names and categories of the 72 instruments are listed in Appendix A. A timbre library containing 72 audio files was constructed from the data.

#### *2.2. Loudness Normalization*

In accordance with the definition of timbre, the influences of pitch and loudness are often excluded from timbre studies. However, previous research has shown that timbre and pitch are not independent in certain cases [57]. As such, the timbre perception features presented in this paper include pitch as a factor. To eliminate the influence of loudness, a balance experiment was used to normalize the loudness of the timbre materials based on experimental results [58].
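Although the paper's final levels were set by a subjective balance experiment, a rough objective pre-pass can illustrate the idea. The following is a minimal sketch, assuming 16-bit WAV clips as described in Section 2.1; the target level, file names, and the RMS criterion are illustrative choices, not the procedure used in the study.

```python
# Minimal sketch of an objective loudness pre-normalization pass (assumption:
# the real levels were fine-tuned subjectively afterwards).
import numpy as np
from scipy.io import wavfile

def rms_normalize(path_in, path_out, target_dbfs=-20.0):
    sr, x = wavfile.read(path_in)                 # 44,100 Hz, int16 samples
    x = x.astype(np.float64) / 32768.0            # scale to [-1, 1)
    rms = np.sqrt(np.mean(x ** 2))
    gain = 10.0 ** (target_dbfs / 20.0) / rms     # linear gain to reach target
    y = np.clip(x * gain, -1.0, 1.0)
    wavfile.write(path_out, sr, (y * 32767.0).astype(np.int16))

# rms_normalize("guan.wav", "guan_norm.wav")      # hypothetical file names
```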

#### **3. Construction of the Timbre Subjective Evaluation Term System**

A timbre evaluation glossary including 32 evaluation terms was constructed and a subjective timbre evaluation experiment was conducted, based on a forced selection methodology (experiment A). Sixteen representative timbre evaluation terms were selected by combining the results of a clustering analysis. Finally, correlation analysis was used to calculate the correlation of these 16 evaluation terms. Six terms with a coefficient larger than 0.85 were removed. The remaining 10 terms were paired into five groups with opposite polarity (the absolute value of the correlation coefficient was greater than 0.81). These five pairs were used for timbre evaluation experiments based on the method of successive categories (experiment B), as well as the parametric modeling of timbre perception features.

#### *3.1. Construction of the Thesaurus for Timbre Evaluation Terms*

A thorough investigation of timbre evaluation terms was conducted under conditions of equivalent sound. A total of 329 terms were collected from the literature and a survey. Five people with a professional music background then deleted 155 of these terms (e.g., polysemy, ambiguous meaning, compound terms, etc.) that were, in their opinion, not suitable for a subjective experiment. A group of 21 music professionals listened to the audio clips and judged whether each of the remaining 174 terms was suitable for describing the sound. The 32 most frequently chosen evaluation terms were selected and a lexicon containing 32 timbre metrics was produced (Table 2). These terms describe all aspects of timbre, but they include some redundant information, which needed to be assessed further using statistical analysis.


**Table 2.** A lexicon of 32 timbre evaluation terms in their original language (Chinese), with an accompanying English translation.

#### *3.2. Experiment A: A Subjective Evaluation Experiment Based on a Forced Selection Methodology*

A subjective evaluation experiment was conducted in a standard listening room with a reverberation time of 0.3 s, which conforms to listening standards [59]. A total of 41 music professionals (21 males) participated in the experiment. Their ages ranged between 18 and 35 and they had no history of hearing loss. A forced selection methodology was employed in which audio clips from the material library were played in turn and subjects determined whether a given evaluation term was suitable for describing the audio clip. Clustering analysis and correlation analysis were then used to assess the experimental data (as discussed below), producing a music expert timbre evaluation term system (including 16 evaluation terms) and an ordinary timbre evaluation term system (including 5 pairs of evaluation terms with opposite polarity).

#### *3.3. Data Analysis and Conclusion of Experiment A*

Multidimensional scaling was used to analyze the distance relationships among the 32 evaluation terms in a two-dimensional space. The distance relationship between the 32 terms is shown in Figure 2. It is evident from Figure 2 that the distance between terms was small in some regions, indicating a high degree of correlation. In order to reduce the workload of subsequent timbre perception feature modeling, cluster analysis was used to further reduce the dimensionality of the evaluation terms. Figure 3 shows a cluster dendrogram calculated using a hierarchical clustering method. Using this diagram and the selection frequency obtained previously, the 32 terms were combined to produce 16 timbre evaluation terms (see Table 3). These 16 terms constituted the music expert timbre evaluation system used in the modeling of timbre spaces (experiment C).
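As an illustration of this reduction step, the sketch below applies hierarchical (Ward) clustering to term profiles and cuts the tree at 16 clusters. The `votes` matrix is a hypothetical stand-in for the forced-selection results; the paper's exact data layout is not specified here.

```python
# Sketch: cluster 32 term profiles into 16 groups, assuming `votes[c, t]` is
# the fraction of subjects who judged term t suitable for clip c (placeholder).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
votes = rng.random((72, 32))                       # placeholder for real data

Z = linkage(votes.T, method="ward")                # cluster the 32 term profiles
labels = fcluster(Z, t=16, criterion="maxclust")   # merge into 16 groups
for g in range(1, 17):
    members = np.where(labels == g)[0]
    print(f"cluster {g}: terms {members}")         # keep the most frequent term per group
```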


**Table 3.** A musical expert timbre evaluation term system, including 16 timbre evaluation terms in their original language (Chinese) and the corresponding English translations.

A common timbre evaluation terminology system was then developed by calculating the Pearson correlation coefficient (PCC) for these 16 terms. The 6 terms with the highest correlation (PCC > 0.85) were excluded, resulting in a correlation matrix for the remaining 10 terms (Table 4). Terms with negative PCCs of large absolute value were selected from this matrix to form evaluation pairs with opposite meanings. These 10 terms were then combined to form five pairs (Table 5), constituting an ordinary timbre evaluation system. These pairs were used for the timbre evaluation experiment based on the method of successive categories (experiment B) and the parametric modeling of timbre perception features.
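A sketch of the selection logic, assuming each term is represented by a vector of mean ratings over the 72 clips (`profiles` is placeholder data); the thresholds 0.85 and 0.81 are those stated above.

```python
# Sketch: drop redundant terms (PCC > 0.85), then pair the remainder by strong
# negative correlation (|PCC| > 0.81). Variable names are illustrative.
import numpy as np

profiles = np.random.default_rng(1).random((16, 72))  # placeholder data
pcc = np.corrcoef(profiles)                           # 16 x 16 correlation matrix

redundant = {j for i in range(16) for j in range(i + 1, 16) if pcc[i, j] > 0.85}
kept = [t for t in range(16) if t not in redundant]

pairs = [(a, b) for k, a in enumerate(kept) for b in kept[k + 1:]
         if pcc[a, b] < -0.81]                        # opposite-polarity candidates
print(pairs)
```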

**Figure 2.** The distance relationship between the 32 evaluation terms.

**Figure 3.** A cluster diagram of 32 timbre evaluation terms.


**Table 4.** A correlation matrix for 10 timbre evaluation terms.

**Table 5.** An ordinary timbre evaluation term system including five pairs of evaluation terms in their original language (Chinese) and the associated English translations.


#### **4. Construction of a Timbre Perception Feature Model**

Objective acoustic parameters were extracted from the audio samples, yielding a 166-dimensional parameter set. The method of successive categories was then used to conduct a timbre perception evaluation experiment (experiment B), followed by reliability and validity analysis of the resulting data. Linear regression, support vector regression, a neural network, and a random forest algorithm were used to construct a timbre perception feature model. The accuracy of this model was then evaluated, and it was used to predict timbre perception features for new audio materials.

#### *4.1. Construction of the Objective Acoustic Parameter Set*

Timbre is a multidimensional perception attribute that is closely related to the time-domain waveform and spectral structure of sound [60]. In order to establish a timbre perception feature model, an objective acoustic parameter set was constructed using 54 parameters extracted from the timbre database. Objective acoustic parameters refer to any values acquired using a mathematical model representing a normal sound signal in the time and frequency domains. These 54 parameters can be divided into 6 categories [61]:


#### *4.2. Calculation Method*

The acoustic parameters were calculated as follows. The spectral centroid for the magnitude spectrum of the STFT [60] is given by:

$$C\_t = \frac{\sum\_{n=1}^{N} M\_t[n] \times n}{\sum\_{n=1}^{N} M\_t[n]},\tag{1}$$

where *Mt*[*n*] is the magnitude of the Fourier transform at frame *t* and frequency *n*. This centroid is a measure of the spectral shape, where higher centroid values indicate "brighter" sounds. Spectral slope was calculated using a linear regression over spectral amplitude values. It should be noted that spectral slope is linearly dependent on the spectral centroid as follows [62]:

$$\text{slope}(t\_m) = \frac{1}{\sum\_{k=1}^{K} a\_k(t\_m)} \times \frac{K \sum\_{k=1}^{K} f\_k \cdot a\_k(t\_m) - \sum\_{k=1}^{K} f\_k \cdot \sum\_{k=1}^{K} a\_k(t\_m)}{K \sum\_{k=1}^{K} f\_k^2 - \left(\sum\_{k=1}^{K} f\_k\right)^2} \tag{2}$$

where slope(*tm*) is the spectral slope at time *tm*, *ak* is the spectral amplitude at *k*, and *fk* is the frequency at *k*. Tristimulus values were introduced by Pollard and Jansson as a timbral equivalent to color attributes in vision. The tristimulus comprises three different energy ratios, providing a description of the first harmonics in a spectrum [63]:

$$\begin{aligned} T\_1(t\_m) &= \frac{a\_1(t\_m)}{\sum\limits\_{h=1}^{H} a\_h(t\_m)}, \\ T\_2(t\_m) &= \frac{a\_2(t\_m) + a\_3(t\_m) + a\_4(t\_m)}{\sum\limits\_{h=1}^{H} a\_h(t\_m)}, \\ T\_3(t\_m) &= \frac{\sum\limits\_{h=5}^{H} a\_h(t\_m)}{\sum\limits\_{h=1}^{H} a\_h(t\_m)}, \end{aligned} \tag{3}$$

where *H* is the total number of partials and *ah* is the amplitude of partial *h*.

Spectral flux is a time-varying descriptor calculated using STFT magnitudes. It represents the degree of variation in a spectrum over time, defined as unity minus the normalized correlation between successive *ak* terms [64]:

$$\text{spectral flux} = 1 - \frac{\sum\_{k=1}^{K} a\_k(t\_{m-1}) a\_k(t\_m)}{\sqrt{\sum\_{k=1}^{K} a\_k(t\_{m-1})^2} \sqrt{\sum\_{k=1}^{K} a\_k(t\_m)^2}} \tag{4}$$

Inharmonicity measures the departure of partial frequencies *fh* from purely harmonic frequencies *h f0*. It is calculated as a weighted sum of deviations from harmonicity for each individual partial [62]:

$$\text{inharm}(t\_m) = \frac{2}{f\_0(t\_m)} \frac{\sum\_{h=1}^{H} \left(f\_h(t\_m) - h f\_0(t\_m)\right) a\_h^2(t\_m)}{\sum\_{h=1}^{H} a\_h^2(t\_m)},\tag{5}$$

where *f0* is the fundamental frequency and *fh* is the frequency of partial *h*.

Spectral roll-off was proposed by Scheirer and Slaney [65]. It is defined as the frequency *fc*(*tm*) below which 95% of the signal energy is contained:

$$\sum\_{f=0}^{f\_c(t\_m)} a\_f^2(t\_m) = 0.95 \sum\_{f=0}^{sr/2} a\_f^2(t\_m),\tag{6}$$

where *sr*/2 is the Nyquist frequency and *af* is the spectral amplitude at frequency *f*. In the case of harmonic sounds, it can be shown experimentally that spectral roll-off is related to the harmonic or noise cutoff frequency. The spectral roll-off also reveals an aspect of spectral shape as it is related to the brightness of a sound.

The odd-to-even harmonic energy ratio distinguishes sounds with a predominant energy at odd harmonics (such as the Guan) from other sounds with smoother spectral envelopes (such as the Suona). It is defined as:

$$\text{OER}(t\_m) = \frac{\sum\_{h=1}^{H/2} a\_{2h-1}^2(t\_m)}{\sum\_{h=1}^{H/2} a\_{2h}^2(t\_m)}.\tag{7}$$

Twelve time-varying statistics were calculated for the 54 parameters, including the maximum, minimum, mean, variance, standard deviation, interquartile range, skewness coefficient, and kurtosis coefficient, producing an objective acoustic parameter set containing 166 dimensions (see Table 6). In this paper, Timbre Toolbox [62] and MIRtoolbox [66] were used for feature extraction. The corresponding acoustic parameters were extracted from materials in the timbre database and the acquired data were used to construct a timbre perception feature model.
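The paper extracted these descriptors with the Timbre Toolbox and MIRtoolbox; the NumPy sketch below re-implements three of the formulas above (Equations (1), (4), and (6)) to make them concrete. The frame length, hop size, and test tone are illustrative choices.

```python
# Sketch: spectral centroid, spectral flux, and spectral roll-off from an STFT
# magnitude matrix, following Eqs. (1), (4), and (6). Not the toolbox code.
import numpy as np

def stft_mag(x, n_fft=2048, hop=512):
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))

def centroid(M):                                  # Eq. (1), per frame
    n = np.arange(1, M.shape[1] + 1)
    return (M * n).sum(axis=1) / (M.sum(axis=1) + 1e-12)

def flux(M):                                      # Eq. (4), per frame pair
    a, b = M[:-1], M[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return 1.0 - num / den

def rolloff(M, sr, frac=0.95):                    # Eq. (6), per frame
    e = np.cumsum(M ** 2, axis=1)
    k = np.argmax(e >= frac * e[:, -1:], axis=1)  # first bin holding 95% energy
    return k * sr / 2 / (M.shape[1] - 1)          # bin index -> Hz

sr = 44100
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
M = stft_mag(x)
print(centroid(M).mean(), flux(M).mean(), rolloff(M, sr).mean())
```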

#### *4.3. Experiment B: A Timbre Evaluation Experiment Based on the Method of Successive Categories*

A subjective evaluation experiment was conducted in a standard listening room with a reverberation time of 0.3 s, which conforms to listening standards [59]. A total of 34 subjects (16 males) with a professional music background participated in the experiment. Their ages ranged from 18 to 35 and they had no history of hearing loss. The experimental subjective evaluation process was conducted as follows. Material fragments were played, and the subjects judged the psychological scale of the piece for each timbre perception feature (evaluation term) in sequence, scoring it on a nine-level scale. All experimental materials were played prior to the formal experiment to familiarize subjects with the samples in advance. This was done to assist each subject in mastering the evaluation criteria and scoring scale, reducing the discretization of evaluation data for the same sample. Each piece was played twice with an interval of 5 s and a sample length of 6–10 s. Each evaluation term was tested for 10 min, with a 15-min break every half hour.

The validity and reliability of the data from these 34 subjects were analyzed to calculate a correlation coefficient between the scores for each subject. The Euclidean distance between the evaluation terms was calculated using cluster analysis to identify the two subjects with the largest difference in each group. Some subjects may not have had a sufficient understanding of the purpose of the experiment; data from these subjects were excluded and not used for subsequent timbre perception feature modeling. The method of successive categories was used to conduct a statistical analysis of the experimental data [67]. The theoretical basis for this approach assumes the psychological scale to be a random variable, subject to a normal distribution. The boundary of each category was not a predetermined value, but a random variable identified from the experimental data. The Thurstone scale was then used to process the data and produce a psychological scale for all timbre materials and each perception feature for modeling purposes. Figure 4 shows the resulting scale for 72 musical instruments in 5 timbre evaluation dimensions. In each image, the dotted line represents the average value of each instrument group in the corresponding dimension.
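The scaling step can be sketched in its simplest (Torgerson) form: category tallies are converted to cumulative proportions, z-transformed, and scale values are read off relative to the mean category boundaries. The `counts` array is placeholder data, and the paper's reliability screening is omitted.

```python
# Sketch of successive-categories scaling, assuming `counts[s, c]` tallies how
# often stimulus s received category c on the nine-level scale (placeholder).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
counts = rng.integers(1, 20, size=(72, 9))        # placeholder tallies

p = counts / counts.sum(axis=1, keepdims=True)
cum = np.clip(np.cumsum(p, axis=1)[:, :-1], 1e-3, 1 - 1e-3)  # 8 inner boundaries
z = norm.ppf(cum)                                 # z-score per stimulus/boundary

boundaries = z.mean(axis=0)                       # mean boundary locations
scale = (boundaries - z).mean(axis=1)             # psychological scale values
print(scale[:5])
```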


**Table 6.** Acoustic parameters.

It is evident from Figure 4 that the distribution of timbre values for Chinese instruments differed significantly from Western instruments. For example, raspy/mellow and hoarse/consonant exhibited drastically different scales. This suggested the timbre database containing Chinese instruments had a richer variety of timbre types than a conventional Western instrument database. In addition, the distribution of timbre samples in the five timbre evaluation scale pairs was relatively balanced. This suggested the proposed evaluation terminology was representative of multiple timbre types and could better distinguish the attributes of different instruments. These factors could help to improve the accuracy of timbre perception feature models.


**Figure 4.** A psychological scale of 72 musical instruments, including (**a**) bright/dark, (**b**) raspy/mellow, (**c**) sharp/vigorous, (**d**) coarse/pure, and (**e**) hoarse/consonant. The blue squares represent Western orchestral instruments, the yellow triangles represent Chinese minority instruments, and the red circles represent Chinese orchestral instruments. The dotted blue line represents the mean value of the Western orchestral instruments, the dotted yellow line represents the mean value of the Chinese minority instruments, and the dotted red line represents the mean value of the Chinese orchestral instruments.

#### *4.4. Construction of a Prediction Model*

In this study, multiple linear regression, support vector regression, a neural network, and a random forest algorithm were used to correlate objective parameters with subjective evaluation data to construct a mathematical model of timbre perception features. Stepwise techniques were used for variable entry and removal in the multiple linear regression algorithm [68], and radial basis functions were selected as kernels for support vector regression [69]. A multi-layer perceptron with one hidden layer was adopted for the neural network [70]. Random forest is a common ensemble model consisting of multiple CART-like trees, each of which grows on a bootstrap sample acquired by sampling the original data cases with replacement [71].

Before modeling, feature selection was conducted for the target attribute to be predicted. This process consisted of three steps:


During the modeling phase, 80% of the data were used for training and the remaining 20% were used for validation. The input to the model was a 166-dimensional objective parameter set and the output was the value of the five perception dimensions (bright/dark, raspy/mellow, sharp/vigorous, coarse/pure, and hoarse/consonant). Correlation coefficients were used to evaluate the accuracy of the model and represented the results of the correlation analysis between the model prediction data and subjective evaluation data, with higher coefficients representing a more accurate model.
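A minimal sketch of this training and evaluation loop for two of the four algorithms, using scikit-learn; the paper's exact hyperparameters are not reported here, and the random arrays only stand in for the real 166-dimensional parameter set and subjective scores.

```python
# Sketch: 80/20 split, fit two of the four model families, and score each by
# the Pearson correlation between predictions and ratings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((72, 166))                         # 166-D objective parameters
y = rng.random(72)                                # one perception dimension

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for model in (MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000),
              RandomForestRegressor(n_estimators=200)):
    model.fit(X_tr, y_tr)
    r = np.corrcoef(model.predict(X_te), y_te)[0, 1]
    print(type(model).__name__, round(r, 3))
```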

The accuracies of the prediction results for the four algorithms across the five perception dimensions are shown in Table 7. Figure 5 provides a histogram of the prediction accuracy in different dimensions. These experimental results suggest that the proposed technique provided valid predictions in each of the five dimensions. The best-performing algorithm exceeded a correlation of 0.9 for the bright/dark, sharp/vigorous, coarse/pure, and hoarse/consonant dimensions. The averaged results indicated that the neural network (0.915) and random forest (0.864) outperformed multiple linear regression (0.665) and support vector regression (0.670). The neural network was particularly accurate in its predictions across the five perception dimensions.


**Table 7.** A comparison of the accuracies achieved by four algorithms.

**Figure 5.** A prediction accuracy histogram for the five perception attributes.

#### **5. The Construction of Timbre Space**

Multidimensional scaling (MDS) was used to construct a 3D timbre perception space to represent the distribution of 37 Chinese instruments more intuitively. Unlike many common analysis methods, MDS is heuristic and does not require assumptions about spatial dimensionality [72]. It also offers the advantages of visualization and helps to identify potential factors affecting the similarity between terms. The construction of a timbre space includes three steps:

(1) *Subjective evaluation experiment based on sample dissimilarity:* where a dissimilarity matrix between samples was obtained using a subjective evaluation experiment. Existing research has conventionally paired up samples in the material database to score the dissimilarity. The process was simplified in this study, which reduced the workload.


The performance of multidimensional scaling algorithms depends on the sample dissimilarity matrix. In previous studies [51,52], this matrix was acquired using a subjective evaluation experiment that compared and scored the dissimilarity of every pair of samples. A total of *n*²/2 comparisons must be conducted for *n* samples. This quadratic growth significantly increases the experimental workload, which makes quantifying the dissimilarity more difficult. This paper presents an improved methodology in which a set of evaluation indicators was selected (as complete as possible) and all samples were successively scored with each indicator. These scores constituted the feature vector for each sample, and the distance between vectors was calculated to obtain the dissimilarity of all samples. The 16 timbre evaluation terms shown in Table 3 were used to assess the attributes of each dimension during the analysis phase.

The method of successive categories was then used to conduct a subjective evaluation experiment on timbre materials for 37 Chinese instruments (experiment C). Each sample was scored on a nine-level scale across the 16 perception dimensions in Table 3, and the reliability and validity of the experimental data were analyzed. The Euclidean distance between the feature vectors was calculated, producing a dissimilarity matrix for the 37 Chinese instruments, as sketched below. The MDS algorithm was used to process the timbre dissimilarity matrix and construct a 3D timbre perception space.
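A sketch of the indirect dissimilarity computation: each instrument's 16-term mean-score vector is treated as a feature vector, and pairwise Euclidean distances form the 37 × 37 matrix. The score matrix here is placeholder data.

```python
# Sketch: replace n^2/2 pairwise listening trials by distances between the
# 16-dimensional score vectors obtained from experiment C.
import numpy as np
from scipy.spatial.distance import pdist, squareform

scores = np.random.default_rng(4).random((37, 16))  # placeholder score vectors
D = squareform(pdist(scores, metric="euclidean"))   # 37 x 37 dissimilarity matrix
print(D.shape, D[0, :5])
```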

#### *5.1. Experiment C: Subjective Evaluation Experiment Based on Sample Dissimilarity*

Three factors were considered during sample selection to prepare the sound data needed in the subjective evaluation experiment [73]:


The experimental environment and subject requirements were the same as in experiment B. The process was as follows: while each experimental sample was played, the subjects judged the psychological scale of the sample on the 16 timbre perception features (timbre evaluation terms) in turn, scoring each on a nine-point scale.

#### *5.2. The Construction of the 3D Timbre Space Using MDS*

The reliability and validity processing method applied to the experimental data was the same as in experiment B. The processed data were averaged, and the mean score across all subjects on each evaluation term was calculated for each sample. These data were then used to calculate the timbre dissimilarity, expressed in the form of a distance matrix. A weighted MDS algorithm was adopted in this paper [77], which considers individual differences between subjects and assigns a corresponding weight to each score. This approach considers terms in every dimension and more fully utilizes the experimental data. Multidimensional scaling is based on dissimilarity analysis for two samples in a timbre attribute space, which can be expressed using a distance matrix as follows:

$$d\_{jk}^i = \sqrt{\sum\_{r=1}^R w\_{ir} \cdot (\mathbf{x}\_{jr} - \mathbf{x}\_{kr})^2},\tag{8}$$

where *dijk* represents the dissimilarity evaluation score for subject *i* assessing sounds *j* and *k*, *wir* represents the weight of subject *i* in the *r*th dimension, and *xkr* represents the coordinate of sample *k* in the *r*th dimension.

Equation (8) was used to calculate the distances between the 37 timbre feature vectors, producing the dissimilarity distance matrix for the 37 samples. This matrix was used as input to the MDS algorithm. Following previous research [51,52], the dimensionality of the timbre space was set to three using Kruskal's stress function [78]. The coordinates of each sound sample in the 3D timbre space were acquired by using MDS to reduce the dimensionality of the dissimilarity distance matrix (Figure 6).
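A sketch of this embedding step with plain metric MDS on a precomputed dissimilarity matrix. Note that Equation (8) describes a subject-weighted (INDSCAL-style) formulation, which scikit-learn's `MDS` does not implement, so this is a simplified stand-in; the stress value can guide the choice of dimensionality.

```python
# Sketch: embed a 37 x 37 dissimilarity matrix into a 3D timbre space.
import numpy as np
from sklearn.manifold import MDS

D = np.abs(np.random.default_rng(5).random((37, 37)))
D = (D + D.T) / 2
np.fill_diagonal(D, 0)                            # make it a valid distance matrix

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                     # 37 points in 3D timbre space
print(coords.shape, mds.stress_)
```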

**Figure 6.** A 3D timbre space for 37 Chinese instruments.

#### *5.3. Perception Attribute Analysis of the Timbre Space Dimension*

The correlation between 16 timbre perception attributes was calculated to analyze the auditory attributes of each dimension in the timbre space. The coordinates of the samples were projected into three dimensions to obtain the spatial distribution of the data. Pearson correlation coefficients were calculated between each dimension and the 16 timbre perception attributes (Table 8). Further analysis suggested dimension 1 was positively correlated with the "bright" perception attribute and negatively correlated with "vigorous." As such, dimension 1 could be defined as "bright/vigorous." Dimension 2 was positively correlated with "hoarse" and negatively correlated with "consonant." However, the correlation of dimension 3 was not as obvious, as it was only slightly correlated with "full/mellow." Figure 6 suggests that different types of instruments were distributed at different positions in the timbre space, which could be used to categorize individual timbres.


**Table 8.** The results of correlation analysis in 3D timbre space.

#### **6. Conclusions**

This study presented a novel methodology for the analysis and modeling of timbre perception features in musical sounds. The primary contributions can be summarized as follows:


In future research, we will focus on the following aspects of this study. First, supplemental sample materials will be acquired based on the existing timbre database; we will attempt to expand the variety and quantity of the data to improve the consistency and robustness of the model. Second, subjective evaluation experiments, statistical analysis, and other techniques will be used to select timbre evaluation terms that accurately reflect the essential attributes of timbre, providing support for the construction of simple and effective timbre spaces. Third, the machine learning algorithm will be improved by including more subjective evaluation data, and additional correlation algorithms will be tested to improve the accuracy of the model predictions. Finally, mathematical modeling will be implemented for each dimension in the timbre space, and the distribution of other (i.e., Western) instruments will be compared to that of Chinese instruments to identify common patterns.

**Author Contributions:** Investigation, conceptualization, methodology, data curation, and writing (original draft, review, and editing): J.L.; project administration and supervision: W.J., Y.J., and S.W.; software, experimental process, and data processing: J.L. and X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by the Key Laboratory Research Funds of Ministry of Culture and Tourism (WHB1801).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

The timbre materials mentioned in Section 2.1 contain 72 instruments, including 37 Chinese orchestral instruments, 11 Chinese minority instruments, and 24 Western orchestral instruments (Table A1). The names of the Chinese orchestral instruments and Chinese minority instruments are given in their original language (Chinese), with an accompanying English translation.


**Table A1.** Instrument list.



#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Microphone and Loudspeaker Array Signal Processing Steps towards a "Radiation Keyboard" for Authentic Samplers**

#### **Tim Ziemer 1,\*,† and Niko Plath 2,†**


Received: 30 December 2019; Accepted: 23 March 2020; Published: 29 March 2020

**Abstract:** To date electric pianos and samplers tend to concentrate on authenticity in terms of temporal and spectral aspects of sound. However, they barely recreate the original sound radiation characteristics, which contribute to the perception of width and depth, vividness and voice separation, especially for instrumentalists, who are located near the instrument. To achieve this, a number of sound field measurement and synthesis techniques need to be applied and adequately combined. In this paper we present the theoretic foundation to combine so far isolated and fragmented sound field analysis and synthesis methods to realize a *radiation keyboard*, an electric harpsichord that approximates the sound of a real harpsichord precisely in time, frequency, and space domain. Potential applications for such a radiation keyboard are conservation of historic musical instruments, music performance, and psychoacoustic measurements for instrument and synthesizer building and for studies of music perception, cognition, and embodiment.

**Keywords:** microphone array; wave field synthesis; acoustic holography; sampler; synthesizer

**PACS:** 43.10.Ln; 43.20.+g; 43.20.Bi; 43.20.Fn; 43.20.Px; 43.20.Tb; 43.20.Ye; 43.25.Lj; 43.25.Qp; 43.28.We; 43.30.Zk; 43.38.Lc; 43.38.Md; 43.40.+s; 43.40.At; 43.40.Cw; 43.40.Dx; 43.40.Fz; 43.40.Le; 43.40.Rj; 43.40.Sk; 43.58.Jq; 43.60.+d; 43.60.Ac; 43.60.Gk; 43.60.Hj; 43.60.Lq; 43.60.Pt; 43.60.Sx; 43.60.Tj; 43.60.Uv; 43.60.Wy; 43.66.+y; 43.66.Lj; 43.75.+a; 43.75.Cd; 43.75.Gh; 43.75.Mn; 43.75.St; 43.75.Tv; 43.75.Yy; 43.75.Zz; 01.50.fd

**MSC:** 76Q05; 97U80

#### **1. Introduction**

Synthesizers tend to focus on timbral aspects of sound, which contains temporal and spectral features [1,2]. This is even true for modern synthesizers that imitate musical instruments by means of physical modeling [3,4]. Many samplers and electric pianos on the market use stereo recordings, or pseudostereo techniques [5,6] to create some perceived spaciousness in terms of *apparent source width* or *perceived source extent*, so that the sound appears more natural and vivid. However, such techniques do not capture the sound radiation characteristics of musical instruments, which may be essential for an authentic experience in music listening and musician-instrument-interaction.

Most sound field synthesis approaches synthesize virtual monopole sources or plane waves by means of loudspeaker arrays [7,8]. Methods to incorporate the sound radiation characteristics of musical instruments are based on sparse recordings of the sound radiation characteristics [5], like far field recordings from circular [9,10] or spherical [11] microphone arrays with 24 to 128 microphones. In these studies, a nearfield mono recording is extrapolated from a virtual source point. However, instead of a monopole point source, the measured radiation characteristic is included in the extrapolation function, yielding a so-called *complex point source* [9,12,13]. Complex point sources are a drastic simplification of the actual physics of musical instruments. However, complex point sources were demonstrated to create plausible physical and perceptual fields [5,14]. These sound natural in terms of source localization, perceived source extent and timbre, especially when listeners and/or sources move during the performance [5,15–20].

To date, sound field synthesis methods to reconstruct the sound radiation characteristics of musical instruments do not incorporate exact nearfield microphone array measurements of musical instruments, as described in [21–25]. This is most likely because the measurement setup and the digital signal processing for high-precision microphone array measurements are very complex on their own. The methods include optimization algorithms and solutions to inverse problems. The same is true for sound field synthesis approaches that incorporate complex source radiation patterns.

In this paper we introduce the theoretic concept of a *radiation keyboard*. We describe on a theoretical basis, and with some practical considerations, which sound field measurement and synthesis methods should be combined, and how to combine them utilizing their individual strengths. All presented results are preparatory for the realization. In contrast to conventional samplers, electric pianos, etc., a radiation keyboard recreates not only the temporal and spectral aspects of the original instrument, but also its spatial attributes. The final radiation keyboard is basically a MIDI keyboard whose keys trigger different driving signals of a loudspeaker array in real-time. When playing the radiation keyboard, the superposition of the propagated loudspeaker driving signals should create the same sound field as the original harpsichord would. Thus, the radiation keyboard will create a more realistic sound impression than conventional, stereophonic samplers. This is especially true for musical performance, where the instrumentalist moves their head. The radiation keyboard can serve, for example,


The remainder of the paper is organized as follows. Section 2 describes all the steps that are carried out to measure and synthesize the sound radiation characteristics of a harpsichord. In Section 2.1, we describe the setup to measure impulse responses of the harpsichord and the radiation keyboard loudspeakers. These are needed to calculate impulse responses that serve as raw loudspeaker driving signals. For three different frequency regions, f1 to f3, different calculation methods are ideal. In Sections 2.2–2.4, we describe how to derive the loudspeaker impulse responses for frequency regions f1, f2, and f3. In Section 3, we describe how to combine the three frequency regions, and how to create loudspeaker driving signals that synthesize the original harpsichord sound field during music performance. After a summary and conclusion in Section 4, we discuss potential applications of the radiation keyboard in the outlook, Section 5.

#### **2. Method**

The concept and design of the proposed radiation keyboard are illustrated in Figure 1. The sound field radiated by a harpsichord is analyzed by technical means. Then, this sound field is synthesized by the radiation keyboard. The radiation keyboard consists of a loudspeaker array whose driving signals are triggered by a MIDI keyboard. The superposition of the propagated loudspeaker driving signals creates the same sound field as a real harpsichord. To date, no sound field synthesis method is able to radiate all frequencies in the exact same way the harpsichord does. Therefore, we combine different sound field analysis and synthesis methods. This combination offers an optimal compromise: low frequencies f1 < 1500 Hz are synthesized with high precision in the complete half-space above the sound board, mid frequencies 1500 ≤ f2 ≤ 4000 Hz are synthesized with high precision within an extended listening region, and higher frequencies f3 > 4000 Hz are synthesized with high precision at discrete listening points within the listening region.

**Figure 1.** Design and concept of the radiation keyboard (**right**). A MIDI keyboard triggers individual signals for 128 loudspeakers, which are arranged like a harpsichord. Replacing the real harpsichord (**left**) by the radiation keyboard creates only subtle audible differences. Unfortunately, the radiation keyboard cannot synthesize the harpsichord sound in the complete space. Low frequencies f1 are synthesized with high precision in the complete half-space above the loudspeaker array (light blue zone). Mid frequencies are synthesized with high precision in the listening region (green zone) in which the instrumentalist is located. The sound field of very high frequencies is synthesized with high precision at discrete listening points within the listening region (red dots).

To implement a radiation keyboard, four main steps are carried out. Figure 2 shows a flow diagram of the main steps: firstly, the sound radiation characteristics of the harpsichord are measured by means of microphone arrays. Secondly, an optimal constellation of loudspeaker placement and sound field sampling is derived from these impulse response measurements. As the third step, the impulse responses for the loudspeaker array are calculated. These serve as raw loudspeaker driving signals. Finally, loudspeaker driving signals are calculated by a convolution of harpsichord source signals with the array impulse responses. These driving signals are triggered by a MIDI keyboard and played back in real time. The superposition of the propagated driving signals synthesizes the complex harpsichord sound field.

**Figure 2.** Flow diagram of the proposed method. First, harpsichord Impulse Responses (IR) are measured, then the distribution of loudspeakers and listening points is optimized, and finally, loudspeaker driving signals are calculated.

To synthesize the sound field, it is meaningful to divide the harpsichord signal into three frequency regions: frequency region f1 lies below 1.5 kHz, the Nyquist frequency of the proposed loudspeaker array. Frequency region f2 ranges from 1.5 kHz to 4 kHz, the Nyquist frequency of the microphone array. Frequency region f3 lies above these Nyquist frequencies. Different sound field measurement and synthesis methods are optimal for each region. They are treated separately in the following sections.

#### *2.1. Setup*

The setup for the impulse response measurements is illustrated in Figure 3 for a piano under construction. In the presented approach the piano is replaced by a harpsichord. An acoustic vibrator excites the instrument at the termination point between the bridge and a string for each key. Successive microphone array recordings are carried out in the near field to sample the sound field at *M* = 1500 points parallel to the sound board.

**Figure 3.** Measurement setup including a movable microphone array (in the front) above a soundboard, and an acoustic vibrator (in the rear) installed in an anechoic chamber. After the nearfield recordings parallel to the sound board, the microphone array samples the listening region of the instrumentalist in playing position.

In addition to the near field recordings, the microphone array samples the *listening region*. The head of the instrumentalist will be located in this region during the performance (ear channel distance to keyboard y ≈ 0.37 m, ear channel distance to ground z ≈ 1.31 m for a grown person). The location is indicated by black dots in Figure 4.

**Figure 4.** Depiction of the radiation keyboard. A regular grid of loudspeakers is installed on a board. The board has the shape of the harpsichord sound board. Microphones (black dots) sample the listening region in front of the keyboard.

A lightweight piezoelectric accelerometer measures the vertical polarization of the transverse string acceleration *h*(*κa*, *t*) at the intersection point between string and bridge for each of the *A* = 62 keys. This is not illustrated in Figure 3 but indicated as brown dots in Figure 5. The acceleration measured by the sensor is proportional to the force acting on the bridge. Details on the setup can be found in [32,33]. Alternatively, *h*(*κa*, *t*) can be recorded optically, using a high speed camera and the setup described in [34,35], or it can be synthesized from a physical plectrum-string model [36]. The string recording represents the *source signal* that excites the harpsichord.

**Figure 5.** Procedure to derive the impulse response *Rf* <sup>1</sup>(**Θ***l*, *κa*, *t*) for each loudspeaker and pressed key in the radiation keyboard. The brown curve represents the bridge, the brown dots depict exemplary excitation points. The black dots represent microphones near the soundboard, the light gray dots represent equivalent sources on the soundboard. The gray dots represent a regular subset of equivalent sources, which are replaced by loudspeakers (circles) in the radiation keyboard.

Measuring string vibrations isolated from the impulse response measurements of the sound board adds a lot of flexibility to the radiation keyboard. We derive impulse responses for the loudspeaker array of the radiation keyboard. This radiates all frequencies the same way the harpsichord would do. Consequently, any arbitrary source signal can serve as an input for the radiation keyboard. In addition to the measured harpsichord string acceleration *h*(*κa*, *t*), the radiation keyboard can load any sound sample, such as alternative harpsichord tunings, alternative instrument recordings, or arbitrary test signals.

Figure 4 illustrates the radiation keyboard. A rigid board in the shape of the harpsichord sound board serves as a loudspeaker chassis. A regular grid of loudspeakers is arranged on this chassis. The radiated sound field created by each single loudspeaker is recorded in the listening region.

#### *2.2. Low Frequency Region f1*

The procedure to calculate the loudspeaker impulse responses for frequency region f1 is illustrated in Figure 5. Firstly, impulse responses of the harpsichord are recorded in the near field. Next, the recorded sound field is propagated back to *M* = 1500 points on the harpsichord sound board. Then, an optimal subset of these points is identified. This subset determines the loudspeaker distribution of the radiation keyboard.

#### 2.2.1. Nearfield Recording

The setup for the near field recordings is illustrated in Figure 3. A microphone array **X**near,*m* with equidistant microphone spacing of 40 mm is installed at a distance of 5 cm parallel to the harpsichord soundboard surface. The index *m* = 1, ... , 1500 describes a microphone position above the harpsichord.

An acoustic vibrator excites the instrument at the intersection point of string and bridge; the *string termination point* [32] *κa*. Here, the index *a* = 1, ... , *A* describes the pressed key. For a harpsichord with 5 octaves, *A* = 62 keys exist.

To obtain impulse responses from the recorded data the so-called exponential sine sweep (ESS) technique is utilized [37]. The method has originally been proposed for measurements of weakly non-linear systems in room acoustics (e.g., loudspeaker excitation in a concert hall) but can also be adapted to structure-borne sound [38]. For the excitation an exponential sine sweep

$$s(t) = \sin(\omega(t))\tag{1}$$

is used, where

$$\omega(t) = \omega\_1 e^{\frac{t \ln\left(\frac{\omega\_2}{\omega\_1}\right)}{T}}.\tag{2}$$

Here, *ω*1 = 2*π* rad s−1 is the starting frequency, *ω*2 = 2*π* × 24,000 rad s−1 is the maximum frequency, and *T* = 25 s is the signal duration. The vibrator excites the sound board with this signal. Figure 6 shows the spectrogram of an exemplary microphone recording *p*(**X**1, *t*). Since the frequency axis has logarithmic scaling, the sweep appears as a straight line. Due to non-linearity in the shaker excitation, the recording shows harmonic distortions parallel to the sinusoidal sweep.
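For illustration, the sweep can be generated in closed form; the phase below is Farina's formulation, whose instantaneous angular frequency equals Equation (2). The sampling rate is an arbitrary assumption.

```python
# Sketch: exponential sine sweep with the stated parameters (1 Hz to 24 kHz
# over T = 25 s). d/dt of the phase reproduces omega(t) from Eq. (2).
import numpy as np

fs, T = 48000, 25.0                               # fs is an assumption
w1, w2 = 2 * np.pi * 1.0, 2 * np.pi * 24000.0
t = np.arange(int(fs * T)) / fs
L = T / np.log(w2 / w1)
sweep = np.sin(w1 * L * (np.exp(t / L) - 1.0))    # s(t), Eq. (1)
```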

**Figure 6.** Spectrogram of an exemplary output of one of the array microphones. Harmonic distortions of several orders are observable.

A deconvolution process eliminates these distortions. The deconvolution is realized by a linear convolution of the measured output *p*(**X***m*, *t*) with the function

$$u(t) = s^{-1}(t)b(t)\,. \tag{3}$$

Here, *s*−1(*t*) is the temporal reverse of the excitation sweep signal, Equation (1), and *b*(*t*) is an amplitude modulation that compensates for the energy generated per frequency, reducing the level by 6 dB/octave, starting with 0 dB at *t* = 0 s and ending with −6 log2(*ω*2/*ω*1) dB at *t* = *T*, expressed as

$$b(t) = 1/e^{\frac{t \ln\left(\frac{\omega\_2}{\omega\_1}\right)}{T}}.\tag{4}$$

This linear deconvolution delays *s*(*t*) by an amount of time that varies with frequency; the delay is proportional to the logarithm of frequency. Therefore, *s*−1(*t*) stretches the signal with a constant slope and compresses the linear part to a time delay corresponding to the filter length. The harmonic distortions have the same slope as the linear part and are, therefore, also packed to very precise times. If *T* is large enough, the linear part of the impulse response is temporally clearly separated from the non-linear pseudo-IRs.

This deconvolution process yields one signal

$$q'(\mathbf{X}\_m, t) = p(\mathbf{X}\_m, t) \* u(t). \tag{5}$$

This signal *q*′(**X***m*, *t*) is the linear impulse response *q*(**X***m*, *t*) preceded by the nonlinear distortion products, i.e., the pseudo-IRs. An example of *q*′(**X**1, *t*) and *q*(**X**1, *t*) is illustrated in Figure 7. The linear impulse response part can be obtained by a peak detection searching for the last peak in the time series. In the figure, the final, linear impulse response *q*(**X**1, *t*) is highlighted in red.

The driving signal and the convolution are reproducible, and repeated measurements are carried out to sample the radiated sound field. Covering each key and sample point yields *N* × *A* = 93,000 recordings.
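A sketch of Equations (3)–(5) in code: the inverse filter is the time-reversed sweep weighted by *b*(*t*), and convolving an (here: ideal, distortion-free) sweep with it produces an impulse near *t* = *T*, mirroring Figure 7. Durations and frequencies are toy values.

```python
# Sketch: ESS deconvolution following Eqs. (3)-(5).
import numpy as np
from scipy.signal import fftconvolve

fs, T = 48000, 2.0                                # short sweep for the demo
w1, w2 = 2 * np.pi * 20.0, 2 * np.pi * 20000.0
t = np.arange(int(fs * T)) / fs
L = T / np.log(w2 / w1)
sweep = np.sin(w1 * L * (np.exp(t / L) - 1.0))    # s(t)

b = np.exp(-t * np.log(w2 / w1) / T)              # b(t), Eq. (4): -6 dB/octave
u = sweep[::-1] * b                               # u(t) = s^-1(t) b(t), Eq. (3)

q = fftconvolve(sweep, u)                         # ideal system: q'(t), Eq. (5)
print(np.argmax(np.abs(q)) / fs)                  # linear IR peak near t = T
```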

**Figure 7.** Obtained impulse response *q*′(**X***m*, *t*) of the signal in Figure 6 after deconvolution. The harmonic distortions are separated in time and precede the linear part *q*(**X***m*, *t*) (red), which starts at *t* ≈ 3.5 s.

#### 2.2.2. Back Propagation

The harpsichord soundboard is a continuous radiator of sound, but it can be simplified as a discrete distribution of *N* = 1500 radiating points **Y***n*, referred to as *equivalent sources* [23]. These equivalent sources sample the vibrating sound board. The validity of this simplification is restricted by the Nyquist-Shannon theorem, i.e., two equivalent sources per wavelength are necessary. The following steps are frequency-dependent. Therefore, we transform functions of time into the frequency domain using the discrete Fourier transform. Terms in the frequency domain are indicated by capital letters and the *ω* in the argument. For example, *Q*(*ω*) represents the frequency spectrum of *q*(*t*).

The relationship between the radiating soundboard *Q*(**Y***n*, *ω*) and the spectra of the aligned, linearized microphone recordings *Q*(**X***m*, *ω*) is described by a linear equation system

$$Q(\mathbf{X}\_m, \omega) = \sum\_{n=1}^{N} G(r, \omega) \, Q(\mathbf{Y}\_n, \omega),\tag{6}$$

where

$$G(r, \omega) = \frac{e^{ikr}}{r} \tag{7}$$

is the free-field Green's function. It is a *complex transfer function* that describes the *sound propagation* of the equivalent sources as monopole radiators. Here, the term *r* = ||**X***m* − **Y***n*||2 is the Euclidean distance between equivalent sources and microphones, *k* is the wave number, and *i* is the imaginary unit. Equation (6) is closely related to the Rayleigh integral, which is applied in acoustic holography and sound field synthesis approaches like wave field synthesis and ambisonics [39]. One problem with Equation (6) is that the linear equation system is ill-posed. The radiated sound *Q*(**X***m*, *ω*) is recorded, but the source sound *Q*(**Y***n*, *ω*), which created the recorded sound pressure distribution, is sought. When solving the linear equation system, e.g., by means of Gaussian elimination or an inverse matrix of *G*(*r*, *ω*) [40], the resulting sound pressure levels tend to be huge due to small numerical errors as well as measurement and equipment noise. This can be explained by the propagation matrix being ill-conditioned when microphone positions are close to one another compared to the considered wavelength. In this case, the propagation matrix condition number is high. A regularization method relaxes the matrix and yields lower amplitudes. An overview of regularization methods can be found in [21,23,40]. For musical instruments, the Minimum Energy Method (MEM) [23,41] is very powerful. The MEM is an iterative approach, gradually reshaping the radiation characteristic of *G*(*r*, *ω*) from a monopole at Ω = 0 to a ray at Ω = ∞ using the formulation

$$\Psi\left(\alpha, \omega, \Omega\right) = 1 + \Omega \times \left(1 - \alpha\right),\tag{8}$$

where Ψ(*α*, *ω*, Ω) is multiplied by *G*(*r*, *ω*) in Equation (6) to reshape the complex transfer function, as in

$$Q(\mathbf{X}\_m, \omega) = G(r, \omega) \times \Psi\left(\alpha, \omega, \Omega\right) \times Q(\mathbf{Y}\_n, \omega). \tag{9}$$

In Equations (8) and (9), *α* describes the angle between equivalent sources **Y***n* and receiver positions **X***m*, expressed as the normalized inner product of both position vectors

$$\alpha\_{m,n} = \left| \frac{\mathbf{Y}\_n \cdot \mathbf{X}\_m}{\left|\mathbf{X}\_m\right| \left|\mathbf{Y}\_n\right|} \right|. \tag{10}$$

The angle *α* is given by the constellation of source- and receiver positions and is 1 in normal direction **n** of the considered equivalent source position and 0 in the orthogonal direction. The ideal value for Ω minimizes the reconstruction energy

$$\Omega\_{\rm opt} = \underset{\Omega}{\arg\min} \left\{ E\_{\Omega} \right\},\tag{11}$$

where

$$E\_{\Omega} \propto \sum\_{n=1}^{N} \left| P(\mathbf{Y}\_n, \omega) \right|^2. \tag{12}$$

The energy *E* is proportional to the sum of the squared pressure amplitudes on the considered structure. In a first step, the linear equation system is solved for integers from Ω = 0 to Ω = 10 and the reconstruction energy is plotted over Ω. Around the local minimum, the linear equation system is again solved, this time in steps of 0.1. Typically, the iteration is truncated after the first decimal place. An example of reconstruction energy over Ω is illustrated in Figure 8 together with the condition number of *G* × Ψ in Equation (9). Near Ωopt, both the signal energy and the matrix condition number tend to be low.

**Figure 8.** Exemplary plot of reconstruction energy (black) and condition number (gray) over Ω. The ordinate is logarithmic and aligned at the minimum at Ωopt = 3.6.

Alternatively, the parameter Ω can be tuned manually to find the best reconstruction visually; the correct solution tends to create the sharpest edges at the instrument boundaries, with pressure amplitudes near 0. This is a typical result of the truncation effect: the finite extent of the source causes an acoustic short-circuit. At the boundary, even strong elongations of the sound board create hardly any pressure fluctuations, since air flows around the sound board. The effect can be observed in Figure 9.

The result of the MEM is one source term *Q*(**Y***n*, *ω*) for each equivalent source on the harpsichord sound board. Below the Nyquist frequency, these 1500 equivalent sources approximate the sound field of a real harpsichord. This is true for the complete half-space above the sound board.
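A compact sketch of the MEM loop of Equations (6)–(12) at a single frequency bin, following the text literally: the monopole propagator is reshaped by Ψ, the system is solved by least squares for integer Ω, and the Ω with minimum reconstruction energy is kept (the 0.1-step refinement is omitted). Array sizes, positions, and the reading of Equation (10) as the cosine against the source normal are assumptions.

```python
# Sketch of the MEM back-propagation at one frequency bin; all names and
# geometry are illustrative, not the measured 1500-point setup.
import numpy as np

def mem_solve(QX, X, Y, normals, k, omegas=np.arange(0, 11)):
    r = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # M x N distances
    G = np.exp(1j * k * r) / r                                 # Eq. (7)
    d = X[:, None, :] - Y[None, :, :]
    cos_a = np.abs(np.einsum('mnc,nc->mn', d, normals)) / (r + 1e-12)  # alpha
    best = None
    for Om in omegas:                                          # integer sweep only
        Psi = 1.0 + Om * (1.0 - cos_a)                         # Eq. (8)
        QY, *_ = np.linalg.lstsq(G * Psi, QX, rcond=None)      # solve Eq. (9)
        E = np.sum(np.abs(QY) ** 2)                            # energy, Eq. (12)
        if best is None or E < best[0]:
            best = (E, Om, QY)
    return best  # (energy, Omega_opt, equivalent source terms)

rng = np.random.default_rng(6)
X = rng.random((40, 3)) + [0, 0, 0.05]            # microphones 5 cm above board
Y = np.column_stack([rng.random((30, 2)), np.zeros(30)])  # sources on the board
n = np.tile([0.0, 0.0, 1.0], (30, 1))             # upward normals
QX = rng.random(40) + 1j * rng.random(40)         # placeholder recorded spectra
E, Om, QY = mem_solve(QX, X, Y, n, k=2 * np.pi * 500 / 343)
print(Om, E)
```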

The radiation of numerous musical instruments has been measured using the described microphone array setup and the MEM, like grand piano [35,42], vihuela [23,43], guitars [43,44], drums [23,41,45], flutes [41,45], and the New Ireland kulepa-ganeg [46]. The method is so robust that the geometry of the instruments becomes visible, as depicted in Figure 9.

**Figure 9.** Sound field recorded 50 mm above a grand piano sound board (**left**) and back-propagated sound board vibration according to the MEM (**right**). The black dot marks the input location of the acoustic vibrator.

#### 2.2.3. Optimal Loudspeaker Placement

The back-propagation method described in Section 2.2.2 yields one source term for each of the *N* = 1500 equivalent sources and *A* = 62 keys. Together, the equivalent sources sample the harpsichord sound field in the region of the sound board. Forward propagation of the source terms approximates the harpsichord sound field in the whole half-space above the sound board. This has been demonstrated, e.g., in [47]. Replacing each equivalent source by one loudspeaker is referred to as an *acoustic curtain*, which is the origin of wave field synthesis [39,48]. In physical terms, this situation is a spatially truncated discrete Rayleigh integral, which is the mathematical core of wave field synthesis [7,8,39,49]. A prerequisite is that all equivalent sources are homogeneous radiators in the half-space above the soundboard. This is the case for the proposed radiation keyboard. For low frequencies, loudspeakers without a cabinet approximate dipoles fairly well. Naturally, single loudspeakers with a diameter in the order of 10 cm are inefficient radiators of low frequencies [50]. However, this situation improves when a dense array of loudspeakers moves in phase. This typically happens in the given scenario: when excited with low frequencies, the sound board vibrates as a whole [51], so the loudspeaker signals for the wave field synthesis will be in phase. While truncation creates artifacts in most wave field synthesis setups, referred to as the *truncation error* [8,39,48], no artifacts are expected in the described setup due to natural tapering: at the boundaries of the loudspeaker array, an acoustic short-circuit will occur. However, the acoustic short-circuit also occurs in real musical instruments, as demonstrated in Figure 9. This is because compressed air in the front flows around the sound board towards the rear, instead of propagating as a wave. The acoustic short-circuit of the outermost loudspeakers acts like a natural tapering window. In wave field synthesis installations, artificial tapering is applied to compensate for the truncation error.

The MEM describes the sound board vibration by *N* = 1500 equivalent source terms. Replacing every equivalent source by an individual loudspeaker is not ideal, because the spacing is too dense for broadband loudspeakers, and it is challenging to synchronize 1500 channels for real-time audio processing. Audio interfaces including D/A converters for *L* = 128 synchronized channels in audio CD quality are commercially available, using, for example, the MADI or Dante protocol. In wave field synthesis systems, regular loudspeaker distributions have been reported to deliver the best synthesis results [52]. Covering the complete soundboard of a harpsichord with a regular grid consisting of 128 grid points yields about one loudspeaker every 12 cm. This is a typical loudspeaker density in wave field synthesis systems and yields a Nyquist frequency of about 1.5 kHz for waves in air [7].

Three exemplary loudspeaker arrays are illustrated in Figure 10.

Every third equivalent source can be replaced by a loudspeaker with little effect on the sound field synthesis precision below 1.5 kHz. This yields *W* = 11 possible loudspeaker grid positions **Θ***w*,*l*.

**Figure 10.** Three regular loudspeaker grids **Θ** *w*,*l* . A subset of the 1500 equivalent sources is replaced by 128 loudspeakers. In the given setup *W* = 11 loudspeaker grids are possible.

At the ideal location of the loudspeaker grid, all loudspeakers lie near antinodes of all frequencies of all keys. In contrast to regions near the nodes, sound field calculations near the antinodes do not suffer from equipment noise, numerical noise, and small microphone misplacements. Instead, all loudspeakers contribute efficiently to the wave field synthesis. Therefore, the optimal grid location **Θ***<sup>l</sup>* has the largest signal energy

$$\boldsymbol{\Theta}\_l = \underset{w}{\arg\max}\, \mathbb{E} \left\{ \sum\_{a=1}^{A} \int\_{t=0}^{\infty} \left( R\_{\rm f1}(\boldsymbol{\Theta}\_{w,l}, \kappa\_a, t) \* h(\kappa\_a, t) \right)^2 \mathrm{d}t \right\}. \tag{13}$$

Equation (13) is solved for each of the *w* = 1, ... , 11 possible loudspeaker grid positions. The grid with the maximum signal energy is replaced by loudspeakers as indicated in Figure 5. This ideal grid is the optimal loudspeaker distribution **Θ***l*.
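A sketch of the selection rule of Equation (13), assuming per-grid raw impulse responses `R[w]` of shape (loudspeakers, keys, samples) and string accelerations `h`; toy sizes replace the real 128 × 62 setup.

```python
# Sketch: pick the loudspeaker grid whose driving signals carry maximum energy.
import numpy as np
from scipy.signal import fftconvolve

def best_grid(R, h):
    # R: list of W arrays, each (L speakers, A keys, T_ir); h: array (A, T_h)
    energies = []
    for Rw in R:
        e = 0.0
        for a in range(h.shape[0]):
            for ir in Rw[:, a, :]:                        # one speaker at a time
                e += np.sum(fftconvolve(ir, h[a]) ** 2)   # Eq. (13) integrand
        energies.append(e)
    return int(np.argmax(energies))

rng = np.random.default_rng(7)
R = [rng.random((4, 3, 64)) for _ in range(11)]   # toy sizes, not 128 x 62
h = rng.random((3, 32))
print(best_grid(R, h))
```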

The Nyquist frequency of the loudspeaker array lies around 1.5 kHz. For reproduction of higher frequencies, other methods are necessary, as described in the following sections.

#### *2.3. Higher Frequency Region f2*

The procedure to calculate the loudspeaker impulse responses for frequency region f2 is illustrated in Figure 11. First, impulse responses of the harpsichord are recorded in the listening region. Next, impulse responses of the final loudspeaker grid **Θ***<sup>l</sup>* are recorded in the listening region. These are transformed such that the loudspeaker array creates the harpsichord sound field in the listening region. Then, the optimal position of listening points is determined. These listening points are a subset of the microphone locations that sample the listening region.

**Figure 11.** Procedure to derive the impulse response *R*f2(**Θ***l*, *κa*, *v*, *t*) for each loudspeaker **Θ***l* and pressed key *κa* in the radiation keyboard. The wooden plate represents the harpsichord sound board. The white plate represents the loudspeaker grid. The black dots represent microphones in the listening region. At first, the harpsichord and the loudspeaker array create different sound fields in the listening region. Then, the loudspeaker signals are modified so as to synthesize the harpsichord sound field at a regular subset *v* of listening points that sample the listening region.

#### 2.3.1. Far Field Recording

In addition to the near field recordings, the radiated sound is also recorded with a microphone array **X**far,*v*,*<sup>j</sup>* that samples the region in which the instrumentalist's head may be located during playing. We refer to this region as the *listening region* and to the discrete sample points as *listening points*. The distance between equivalent sources on the sound board and the listening points lies in the order of decimeters to meters. For frequencies above 1.5 kHz, this means that the listening region lies in the far field.

In the near field measurement, Section 2.2.1, one microphone array samples a planar region parallel to the sound board. In the far field measurement the microphone array samples a rectangular cuboid. The setup for the far field recordings is illustrated in Figures 4 and 11. As described in Sections 2.1 and 2.2.1, the sound board is excited with an exponential sweep. An array of *J* = 128 microphones samples the listening region. The microphones are arranged as a regular grid with a spacing of 4 cm. The array samples the complete sound field in the listening region for all wave lengths above 0.08 m, i.e., frequencies below 4 kHz. About *V* = 11 repeated measurements are carried out with a slightly shifted microphone array. Equations (1)–(5) describe how to excite the sound board and derive impulse responses for the *A* × *J* × *V* = 87,296 different source-receiver constellations.

These far field impulse responses provide a sample of the *desired sound field Q*des(**X**, *κa*, *ω*) in the region in which the instrumentalist is moving her head. In the frequency domain, it can be described as the relationship between the source signal *S*(*ω*), the complex transfer function *G*(*r*, *ω*) and the microphone array recordings *P*des(**X**, *κa*, *ω*)

$$S(\omega) \times G(r, \omega) = P_{\mathrm{des}}(\mathbf{X}_{v,j}, \kappa_{a}, \omega) \tag{14}$$

where the recorded sweep is deconvolved to obtain the impulse response

$$Q_{\mathrm{des}}(\mathbf{X}_{v,j}, \kappa_{a}, \omega) = P_{\mathrm{des}}(\mathbf{X}_{v,j}, \kappa_{a}, \omega) \times U(\omega). \tag{15}$$

The terms *S*(*ω*) and *G*(*r*, *ω*) in Equation (14) are known. Thus, instead of performing microphone array measurements, *P*des(**X**, *κa*, *ω*) can be calculated with this forward-propagation formula. In [47] it was demonstrated that the forward propagation agrees with the measurements.
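A minimal sketch of such a forward propagation, assuming the idealized monopole radiation *G*(*r*, *ω*) = e^(−i*kr*)/(4π*r*) mentioned later in this section, could look as follows; the function name and the fixed speed of sound are illustrative assumptions.

```python
# Illustrative sketch of Equation (14) under a monopole assumption:
# P_des(omega) = S(omega) * G(r, omega).
import numpy as np

def forward_propagate(S: np.ndarray, freqs: np.ndarray, r: float,
                      c: float = 343.0) -> np.ndarray:
    """Propagate the source spectrum S over the distance r in meters."""
    k = 2 * np.pi * freqs / c                  # wave number
    G = np.exp(-1j * k * r) / (4 * np.pi * r)  # monopole transfer function
    return S * G
```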

#### 2.3.2. Radiation Method

To synthesize the desired sound field *Q*des(**X**, *κa*, *ω*) with the given loudspeaker array **Θ***l*, *L* = 128 loudspeaker signals *R*f2(**Θ***l*, *ω*) need to be calculated. This is done in two steps. First, the swept sine, Equation (1), is played through each individual loudspeaker **Θ***l* and recorded in the listening region. Here,

$$\begin{split} S(\omega) \times Ϙ(\alpha, \omega) &= P_{\Theta}(\mathbf{X}_{v,j}, \omega) \\ \Longleftrightarrow Ϙ(\alpha, \omega) &= \frac{P_{\Theta}(\mathbf{X}_{v,j}, \omega)}{S(\omega)} \end{split} \tag{16}$$

and

$$Q_{\Theta}(\mathbf{X}_{v,j}, \omega) = P_{\Theta}(\mathbf{X}_{v,j}, \omega) \times U(\omega) \tag{17}$$

describe the relationship between the source signal, the raw microphone recordings and the final impulse response. Here, the unknown propagation term Ϙ(*α*, *ω*) is the ratio of the recordings and the source signal. In contrast to a real harpsichord source signal *H*(*κa*, *ω*), we know that *S*(*ω*) ≠ 0 for all audible frequencies. Thus, the complex transfer function Ϙ(*α*, *ω*) between each loudspeaker of the array **Θ***l* and each listening point **X**far,*v*,*j* is determined by simply recording the propagated swept sine, Equation (16), of each loudspeaker at each listening point, followed by the deconvolution, Equation (17).
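In code, this measurement-based deconvolution reduces to a regularized spectral division; the following sketch is illustrative, with the regularization constant and function name as assumptions.

```python
# Illustrative sketch of Equations (16) and (17): the transfer function
# between a loudspeaker and a listening point as a regularized ratio of
# the recorded spectrum and the known sweep spectrum.
import numpy as np

def transfer_function(recording: np.ndarray, sweep: np.ndarray) -> np.ndarray:
    n = len(recording) + len(sweep) - 1
    P = np.fft.rfft(recording, n)
    S = np.fft.rfft(sweep, n)
    return P * np.conj(S) / (np.abs(S) ** 2 + 1e-10)  # estimate of Ϙ(α, ω)
```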

The measured transfer function Ϙ(*α*, *ω*) is neither the idealized monopole source radiation *G*(*r*, *ω*), nor the energy-optimized radiation function Ψ(*α*, *ω*)*G*(*r*, *ω*). Instead, Ϙ(*α*, *ω*) is the actual transfer function as measured physically. It includes the frequency and phase response of the loudspeakers, the amplitude decay and the phase shift from each loudspeaker to each receiver, and the sound radiation characteristics of the loudspeakers, which tend to deviate from *G* and Ψ. It can thus be considered the true transfer function. Solving the linear equation system

$$Q_{\mathrm{des}}(\mathbf{X}_{v,j}, \kappa_{a}, \omega) = R'_{\mathrm{f2}}(\Theta_{l}, \mathbf{X}_{v,j}, \kappa_{a}, \omega) \times Ϙ(\alpha, \omega) \tag{18}$$

for all *V* = 11 microphone array positions yields the impulse response for the loudspeakers *R*f2(**Θ***l*, **X***v*,*j*, *κa*, *ω*). This procedure is referred to as *radiation method*, as it synthesizes a desired sound field by including the measured sound radiation characteristics of the loudspeakers. Accounting for the actual transfer function from each loudspeaker to each listening point has the advantage that the rows in the linear equation system described by Equation (18) deviate more strongly from one another in reality than they would for idealized monopole radiators. This has been demonstrated in [20,49]. The radiation method is a robust regularization method that has been demonstrated to relax the linear equation system [9,20,39,49]. It leads to (a) low amplitudes and (b) solutions that vary only slightly when the source-receiver constellation or the source signal is varied slightly. The method synthesizes a desired sound field as long as (a) the sound field lies in the far field and (b) at least two listening points per wavelength exist. The method is only ideal for frequency region f2, as it does not account for near field effects and spatial aliasing [9,20,49].
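The per-frequency solution of Equation (18) can be sketched as a least squares problem. The minimum-norm solver shown here is an assumption standing in for the regularized solutions of the cited works, and all array names and shapes are illustrative.

```python
# Illustrative sketch of Equation (18): solve for the L loudspeaker
# spectra at every frequency bin.
import numpy as np

def radiation_method(Q_ls: np.ndarray, q_des: np.ndarray) -> np.ndarray:
    """Q_ls: (F, J, L) measured transfer functions from L loudspeakers
    to J listening points; q_des: (F, J) desired sound field.
    Returns (F, L) loudspeaker spectra."""
    F, J, L = Q_ls.shape
    R = np.zeros((F, L), dtype=complex)
    for f in range(F):
        # The minimum-norm least squares solution keeps the
        # loudspeaker amplitudes low, as required by the method.
        R[f], *_ = np.linalg.lstsq(Q_ls[f], q_des[f], rcond=None)
    return R
```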

#### 2.3.3. Optimal Listening Points

So far, Equation (18) delivers *V* = 11 sets of *L* = 128 impulse responses for each key. The solutions only vary slightly, due to microphone misplacements, equipment and background noise, numerical errors and the spatial variations of the loudspeaker sound radiation characteristics. The solutions are valid inside the listening region. Outside the listening region synthesis errors occur, because loudspeaker signals interfere in an arbitrary manner. Outside an anechoic chamber, this will lead to unnatural reflections. Consequently, the ideal impulse response minimizes synthesis errors outside the listening region. This is achieved by selecting the impulse responses with the lowest signal energy

$$R_{\mathrm{f2}}(\Theta_{l}, \kappa_{a}, \omega) = \min_{v} \left\{ R'_{\mathrm{f2}}(\Theta_{l}, \mathbf{X}_{v,j}, \kappa_{a}, \omega) \right\}, \tag{19}$$

where

$$R'_{\mathrm{f2}}(\Theta, \kappa_{a}, \omega) \propto \sum_{l=1}^{L} \int_{0}^{\infty} R'_{\mathrm{f2}}(\Theta_{l}, \mathbf{X}_{v,j}, \kappa_{a}, t)^{2} \, \mathrm{d}t. \tag{20}$$

The signal energy is the sum of the energies of all *L* = 128 loudspeaker impulse responses. In Equation (20), the signal energy of each key *κa* is calculated for each of the *v* = 1, ... , 11 microphone array positions. As defined in Equation (19), the final impulse response *R*f2(**Θ***l*, *κa*, *ω*) for each specific key is the one whose *v* creates minimal signal energy. This solution exhibits the most constructive interference inside the listening region. The solution is valid for the considered frequency region f2, i.e., between 1.5 and 4 kHz. For higher frequencies, a third method is ideal, as discussed in the following section.
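A compact sketch of this selection, assuming the candidate impulse responses for one key are stacked in an array, could look as follows (names and shapes are illustrative):

```python
# Illustrative sketch of Equations (19) and (20): for one key, keep the
# microphone array position v whose impulse response set has the lowest
# signal energy.
import numpy as np

def minimum_energy_solution(R_all: np.ndarray) -> np.ndarray:
    """R_all: (V, L, T) candidate impulse response sets."""
    energies = np.sum(R_all ** 2, axis=(1, 2))  # Equation (20), per v
    return R_all[int(np.argmin(energies))]      # Equation (19)
```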

#### *2.4. Highest Frequency Region f3*

In principle, the method to calculate the impulse response for frequency region f3 equals the method described in Section 2.3. For frequencies over 4 kHz, the radiation method only synthesizes the desired sound field at the discrete listening points, but not in between [49]. For human listeners, neither the exact frequency nor the phase is represented in the auditory pathway ([39] Chapter 3). The amplitude of such high frequencies mainly contributes to the perception of brightness, while the phase contributes to the impulsiveness of attack transients. Both are important aspects of timbre perception ([53] Chapter 11), ([39] Chapter 2).

Therefore, it is adequate to approximate the desired amplitudes and phases at discrete listening points by solving Equation (18). The difference between frequency regions f2 and f3 is the selection of optimal listening points. Instead of choosing the impulse response with minimum signal energy, Equation (20), the ideal impulse response for f3 is the shortest one, because it exhibits the highest impulse fidelity. When convolved with a short impulse response, the frequencies of source signals stay in phase. In contrast, long impulse responses indicate out-of-phase relationships. Phase is mostly audible during transients. Consequently, the shortest impulse response is ideal, because it maintains the characteristic, steep attack transient of harpsichord notes.

Impulse responses for one loudspeaker and three different microphone array locations *v* are illustrated in Figure 12. The shortest impulse response can be identified visually.

**Figure 12.** Three exemplary impulse responses *R*f3,*v*. Here, *t* is the time in seconds and *u* is the voltage to drive the loudspeaker. The shortest out of *V* = 11 impulse responses is ideal, as it exhibits the best impulse fidelity. It can be identified visually.
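The selection could also be automated as in the following sketch. Measuring the effective length via a −60 dB decay threshold is an assumption, as the text identifies the shortest response visually.

```python
# Illustrative sketch of the f3 selection: keep the shortest candidate
# impulse response set.
import numpy as np

def effective_length(ir: np.ndarray) -> int:
    env = np.abs(ir)
    above = np.nonzero(env > env.max() * 1e-3)[0]  # -60 dB threshold
    return int(above[-1]) if above.size else 0

def shortest_solution(R_all: np.ndarray) -> np.ndarray:
    """R_all: (V, L, T). A set is as long as its longest response."""
    lengths = [max(effective_length(ir) for ir in irs) for irs in R_all]
    return R_all[int(np.argmin(lengths))]
```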

#### **3. Calculation of Loudspeaker Driving Signals**

Each of the methods described above results in one impulse response per loudspeaker and pressed key, *R*f(**Θ***l*, *κa*, *t*), truncated to its frequency region. To combine them, the three truncated impulse responses per loudspeaker and pressed key are simply added, i.e.,

$$R(\Theta_{l}, \kappa_{a}, t) = R_{\mathrm{f1}}(\Theta_{l}, \kappa_{a}, t) + R_{\mathrm{f2}}(\Theta_{l}, \kappa_{a}, t) + R_{\mathrm{f3}}(\Theta_{l}, \kappa_{a}, t). \tag{21}$$

The result of Equation (21) is a broadband impulse response *R*(**Θ***l*, *κa*, *t*) that covers the complete audible frequency range. An example is illustrated in Figure 13.

**Figure 13.** Impulse responses *R*f truncated to frequency regions f1 (blue), f2 (green) and f3 (red). Adding up the time series yields a broadband impulse response *R*(**Θ***l*, *κa*, *t*) (black). Here, *t* is the time in seconds and *u* is a normalized voltage to drive the loudspeaker.

Summing the truncated impulse responses yields one broadband impulse response for each loudspeaker and pressed key, i.e., *B* = *L* × *A* = 7936 broadband impulse responses *R*(**Θ***l*, *κa*, *t*). These impulse responses describe how any frequency is radiated by the loudspeaker array to approximate the sound field that the harpsichord would create if excited by a broadband impulse.
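A sketch of the band-limited summation might use crossover filters at the region boundaries named in the text (1.5 and 4 kHz); the Butterworth design, filter order, and sampling rate are assumptions.

```python
# Illustrative sketch of Equation (21): band-limit the three impulse
# responses to their frequency regions and add them.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 48000  # sampling rate in Hz (assumption)
sos_f1 = butter(4, 1500, 'lowpass', fs=fs, output='sos')
sos_f2 = butter(4, [1500, 4000], 'bandpass', fs=fs, output='sos')
sos_f3 = butter(4, 4000, 'highpass', fs=fs, output='sos')

def broadband_ir(r_f1: np.ndarray, r_f2: np.ndarray,
                 r_f3: np.ndarray) -> np.ndarray:
    """Sum the band-limited impulse responses (Equation (21))."""
    return (sosfiltfilt(sos_f1, r_f1) + sosfiltfilt(sos_f2, r_f2)
            + sosfiltfilt(sos_f3, r_f3))
```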

Naturally, a played harpsichord is not excited by an impulse. Instead, pressing a key creates a driving signal that travels through the string and transfers to the soundboard via the bridge. In Section 2.1 we provide literature that suggests three different ways to record or model this driving signal *h*(*κa*, *t*). In order to finally use the radiation keyboard as a harpsichord sampler, the loudspeaker driving signals *d*(**Θ***l*, *κa*, *t*) are calculated by a convolution of the impulse response with the source signal, i.e.,

$$d(\Theta_{l}, \kappa_{a}, t) = h(\kappa_{a}, t) * R(\Theta_{l}, \kappa_{a}, t). \tag{22}$$

This yields *B* = 7936 sound files. These are imported to multiple instances of a software sampler in a digital audio workstation (DAW). Typically, one instance of a software sampler can address between 16 and 64 output channels. Consequently, between 2 and 8 sampler instances need to be initialized. Technologies like VST and DirectX are able to handle this parallelism, and several multi-channel DAWs (like Steinberg Cubase, Ableton Live and Magix Samplitude) can handle the high number of output channels. Finally, the original keyboard of the harpsichord is replaced by a MIDI keyboard, whose note-on command triggers the 128 samples for the corresponding note.
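The generation and export of the *B* = 7936 samples can be sketched as follows; the file naming scheme, normalization, and sampling rate are assumptions.

```python
# Illustrative sketch of Equation (22) and the sample export.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs = 48000  # sampling rate in Hz (assumption)

def export_samples(R: np.ndarray, h: list) -> None:
    """R: (L, A, T) broadband impulse responses; h: list of A source
    signals. Writes one wav file per loudspeaker and key."""
    L, A, _ = R.shape
    for a in range(A):
        for l in range(L):
            d = fftconvolve(h[a], R[l, a])  # Equation (22)
            d /= np.max(np.abs(d)) + 1e-12  # normalize to avoid clipping
            wavfile.write(f"key{a:02d}_ls{l:03d}.wav", fs,
                          d.astype(np.float32))
```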

As the effect of key velocity on the created level and timbre is negligible, the harpsichord is the ideal instrument to start with; only one sample per note and loudspeaker is necessary. For more expressive instruments, such as the piano, the attack velocity affects the produced level and timbre. Here, several samples per note, or one attack-velocity-controlled filter, would have to be applied. This implies the need for much higher data rates and specific signal processing, which is beyond the scope of this paper.

#### **4. Conclusions**

In this paper, the theoretical foundation of a radiation keyboard has been presented. It includes the complete chain from recording the source sound and the radiated sound of a harpsichord to synthesizing its temporal, spectral and spatial sound within an extended listening region, controlled in real-time. To achieve this, we choose the optimal method for each frequency region and inverse problem, and describe a way to combine previously isolated and fragmented sound field analysis and synthesis approaches.

For the low frequency region f1, a combination of near field recordings, the minimum energy method, an energy-efficient loudspeaker grid selection, and wave field synthesis is ideal. It synthesizes the desired sound field in the complete half-space above the sound board.

For the higher frequency region f2, a combination of far field recordings, the radiation method, and energy-efficient listening point selection is ideal. This combination synthesizes the desired sound field in the listening region with high precision.

For the highest frequency region f3, far field recordings and the selection of the shortest, in-phase impulse responses are ideal. This approximates the correct signal amplitudes in the listening region, while preserving the transient behavior of the source sound. The initial outcome of such a radiation keyboard is a sampler that mimics not only the temporal and spectral aspects of the original musical instrument, but also its spatial aspects.

#### **5. Outlook**

This paper presented the theoretical framework of our current research project. The effort to implement a radiation keyboard is very high, and a number of sound field measurement and synthesis methods need to be combined, leveraging their individual strengths. We have not implemented the radiation keyboard yet; this paper rather describes the necessary means to realize it.

The implemented radiation keyboard is supposed to serve as a research tool to carry out interactive listening experiments that are more ecological than passive listening tests with artificial sounds in a laboratory environment. Note that the radiation keyboard is not restricted to harpsichord sounds. In principle, any arbitrary sound file can act as source signal and be radiated like a harpsichord. This enables us to manipulate the temporal and spectral aspects of the sound, while keeping the sound radiation constant. Loading different source sounds while keeping the sound radiation fixed could reveal which temporal and spectral parameters affect the perception of source extent and naturalness in the direct sound of musical instruments. The radiation keyboard could answer the question whether a saxophone sound with the radiation characteristics of a harpsichord sounds larger than a real saxophone. Using the radiation keyboard, we can investigate apparent source width and immersion of direct sound both in the presence and absence of room acoustics. To date, physical predictors of apparent source width originate in room acoustical investigations [15,16,39]. Findings disagree on which frequency region is of major importance for this listening impression. The different predictors and the discourse are examined in [5].

The strength of a real-time capable radiation keyboard is the interactivity: musicians can actively play the instrument instead of carrying out passive listening tests. Interactivity creates a dynamic sound and allows for a natural interaction in an authentic musical performance scenario. This is a necessity in the field of performance, gesture, and human-machine-interaction studies and a prerequisite for ecological psychoacoustics [30,31].

**Author Contributions:** Conceptualization, T.Z.; methodology, T.Z. and N.P.; software, T.Z. and N.P.; validation, T.Z. and N.P.; formal analysis, T.Z. and N.P.; investigation, T.Z. and N.P.; resources, T.Z. and N.P.; data curation, T.Z. and N.P.; writing—original draft preparation, T.Z.; writing—review and editing, T.Z. and N.P.; visualization, T.Z. and N.P.; project administration, T.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We acknowledge the detailed and critical, but also amazingly constructive feedback from the two anonymous reviewers. We thank the Claussen Simon Foundation for their financial support during this project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
