1. Introduction
Fully immersive reproduction of spatial audio, such that both artificial and real audio objects are perceived as plausible audible events in a virtual and/or augmented environment, has been a research goal for many years. Recently, it has been demonstrated under certain conditions [
1]. One method which has been proposed to achieve an auditory illusion of a spatial acoustic environment is via the help of an existing position-dynamic binaural synthesis system [
Even then, the occurrence of a plausible auditory illusion depends on many parameters. Beyond an adequate technical realization, there are several context-dependent quality parameters, such as the congruence between the synthesized scene and the listening environment, or the individualization of the technical system.
The goal of our research and this paper is to find efficient solutions to merge real and virtual acoustics in a plausible manner. Our approach is to start with measurements in real rooms and to stepwise simplify these measurements to identify the relevant cues and information which are mandatory to retain plausibility. The advantage of this approach is that we can always compare against the upper reference: the actual measurement in the real scenario. After identifying these cues, measurement efforts can be minimized and the computational efficiency of algorithms to create an auditory augmented reality (AAR) system can be improved. The primary question is to what extent simplifications in the technical realization of an audio-AR system are permissible without leading to an intolerable impairment of spatial auditory perception.
In the following, we describe a basic scheme of an AAR system built from several different functional blocks which realize a position-dynamic binaural synthesis. This includes:
provision of binaural room filters from measurements or simulations,
representation of the scene,
creation and/or adaptation of binaural room filters related to the pose and position of the listener, and
real-time rendering to make a position-dynamic auralization possible.
Figure 1 shows these blocks in a basic scheme. In each block, different techniques and approaches are listed. Bold entries are techniques which will be described in greater detail.
The reproduction of an audio object in a reverberant room can be realized using binaural room impulse responses (BRIRs): the audio signal of the source is convolved with the BRIRs corresponding to the current source–receiver position and head orientation of the listener. For the acquisition of binaural room filters, several popular approaches exist. BRIRs can be simulated [
3,
4] or directly measured, for example, with a dummy head. To create a position and head-pose-dependent binaural synthesis, measurements for each dummy head orientation and position must be conducted. The separation of the head-related transfer functions (HRTFs) and room measurements reduces the effort significantly. One way to do this is to just measure the room impulse responses (RIRs) and to estimate or measure the direction of arrival (DOA) in a separate calculation or measurement. An estimation can be done by assuming a certain room geometry, such as a shoe-box model, and by assigning a direction to the measured reflections [
5]. Information about the room can either be predefined or measured by other sensors (visual, radar, etc.). The DOA can also be measured with an appropriate microphone array. Once the DOA is available, a direction and an HRTF can be assigned to each reflection and a BRIR can be calculated. The Spatial Decomposition Method (SDM) is one approach in this domain [
6,
7]. An alternative way is to employ B-format or higher order ambisonic arrays to record transfer functions in the ambisonics signal representation. The spatial resolution directly depends on the number of microphones as well as on the array design. To create binaural signals the combination with a spherical HRTF data set is necessary as well [
8].
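To illustrate the DOA-based principle, a BRIR can be assembled from a list of detected reflections (each with a delay, an amplitude, and a direction) by placing an amplitude-scaled head-related impulse response at the corresponding delay. The following sketch is our own illustration of this idea (function names and data layout are assumptions, not part of any cited toolchain):

```python
import numpy as np

def compose_brir(reflections, hrirs, fs=48000, length=4800):
    """Assemble a two-channel BRIR from a list of detected reflections.

    reflections: list of (delay_s, amplitude, direction) tuples, where
    `direction` keys into the `hrirs` dict (e.g., a quantized DOA label).
    hrirs: dict mapping direction -> (2, N) head-related impulse response.
    """
    brir = np.zeros((2, length))
    for delay_s, amp, direction in reflections:
        start = int(round(delay_s * fs))
        if start >= length:
            continue  # reflection falls outside the filter length
        h = hrirs[direction]
        stop = min(start + h.shape[1], length)
        brir[:, start:stop] += amp * h[:, :stop - start]
    return brir
```

In practice, the reflection list would come from an SDM-style analysis of a measured RIR, and the HRIR lookup would interpolate a measured HRTF data set rather than use discrete direction labels.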
The next step is the creation of a scene. Depending on the synthesis approach, the room geometry, source position, and the walkable area have to be defined. Most of the existing systems and evaluations thereof are limited to simple room geometries like shoe-box rooms (for example, in [
5,
9]). Furthermore, the walkable areas are often placed centrally in the room, presumably to avoid special acoustic effects close to walls or corners. However, these restrictions are not necessarily system-related; as this research field is still emerging, more complex room acoustics simply have not been evaluated yet. Depending on the real-time capability of the subsequent filter-creation step, a fine or coarse sampling of the walking area has to be considered. Sub-sampling the area based on psychoacoustic assumptions, for example, by considering just noticeable differences (JNDs) for direction and distance changes [
10,
11], can avoid the effort for a continuous adaptation of filter coefficients. In this paper, we will discuss the Maximum Allowed Error Method (MAEM) as one possibility to sample the walkable area [
12].
When the listener changes his/her pose or position, new filters need to be computed (or loaded). The most flexible solution would be a parametric filter-creation approach. These approaches are usually based on a measured or assumed model of a room or scene [
5,
13,
14]. The room impulse response is decomposed into modifiable parameters which can be changed depending on the listener's movement. This allows an efficient adaptation of filter coefficients, but the success relies on the quality of the model. Filter-shaping approaches rely on a BRIR measurement at one position, and only certain properties of the filters are adapted when the listener moves, such as the energy decay curve (EDC), the level of the direct sound, or the initial time delay gap (ITDG) [
15,
16,
17]. Often these changes are empirically determined, but they also can be estimated by simple models (such as inverse square law). These algorithms will be discussed in
Section 2.2. The idea behind these approaches is quite similar to that of ambisonics-based approaches: auralizations for different listener positions are realized by transformations based on simple models, e.g., distance-dependent attenuation or angular attenuation to mimic the directivity of sound sources [
9,
18].
The last step is the real-time rendering. A common solution for ambisonics-based approaches is the rendering of a virtual spherical loudspeaker setup [
8,
18]. The ambisonics audio signals are then converted to binaural signals by assigning an HRTF to each virtual loudspeaker. During the real-time rendering, only the gains of the virtual loudspeakers are adjusted before the HRTFs are applied. Ambisonics-based approaches are especially suited for realizing a 6-DoF experience on the basis of recorded sound fields. Other approaches (especially the ones mentioned here) usually deliver binaural room impulse responses, which need to be convolved with the desired audio signals; the filters have to be exchanged each time the position or the pose of the listener changes.
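The virtual-loudspeaker rendering can be illustrated with a minimal first-order sketch. A plain sampling decoder is assumed here (no max-rE or in-phase weighting and no particular normalization convention), and all names are illustrative rather than taken from any cited implementation:

```python
import numpy as np

def foa_to_binaural(bformat, speaker_dirs, hrirs):
    """Render first-order ambisonics (B-format rows W, X, Y, Z) to binaural
    via a set of virtual loudspeakers (basic sampling decoder).

    bformat: (4, T) array; speaker_dirs: unit vectors (ux, uy, uz);
    hrirs: (n_speakers, 2, N) impulse responses, one pair per speaker.
    """
    w, x, y, z = bformat
    out_len = bformat.shape[1] + hrirs.shape[2] - 1
    binaural = np.zeros((2, out_len))
    for i, (ux, uy, uz) in enumerate(speaker_dirs):
        # virtual-speaker feed: plain sampling decode of the B-format signal
        sig = w + ux * x + uy * y + uz * z
        for ch in range(2):
            binaural[ch] += np.convolve(sig, hrirs[i, ch])
    return binaural
```

During head rotation, only the decoder (or a sound-field rotation applied to the B-format signals) changes, while the HRIR convolutions per virtual speaker remain fixed.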
As has become clear, the plausible creation of virtual audio objects fused into a real room is the main challenge for AAR. To recognize audio objects, the human brain acts as a powerful pattern recognizer, comparing learned and thus expected audio cues with those from the real surroundings. Presented audio objects therefore have to fit these expectations [
19,
20]. Too large a deviation between the acoustic properties of the real space and the virtually reproduced environment leads to a cognitive mismatch and thus to a collapse of the plausibility of the overall auditory scene. In the context of binaural synthesis, we call this the room divergence effect. The most prominent listening impression in this case is a collapse of the externalization of auditory events [
21,
22]. This is true for all auralization efforts using binaural technologies, but the effect is most prominent in an AAR environment, where a comparison of the perceived sound events in the real space with the virtual sound events is always possible. The cognitive model of the environment created by experience is continuously updated and compared in our brain with the virtual auditory events [
23]. As a corollary of this model of spatial hearing, such systems are conventionally judged via listening tests. As no reference is available in most cases, listening-test paradigms for audio-quality evaluation as known from audio coding [
24,
25] cannot be used.
This paper gives a summary on a specific AAR system which is used at the Technische Universität Ilmenau as a research demonstrator. The system is one possible technical realization to address some of the challenges in spatial listening. The main components of the system are a psychoacoustically motivated spatial provision of the needed BRIRs in the walkable area, a synthesis of BRIRs using spatially sparse measurements of the room, and a real-time processing of the tracking data and filter convolution. The structure of the paper is as follows.
Section 2 describes the system components used in our system in detail.
Section 3 presents the perceptual evaluation of these components; its subsection names mirror those of Section 2.
Section 4 gives a summary and discussion of open questions.
2. Proposal of a Position-Dynamic Binaural Synthesis System
A basic feature of the presented system is the provision of synthesized BRIR data sets for discrete positions in the room. These discrete areas can either have a uniform distribution (e.g., grids with rectangular or triangular grid areas/cells) [
26] or a nonuniform distribution of the single grid cells [
12]. The sizes and shapes of the grids are motivated by psychoacoustic features like localization and distance blur.
A block diagram of the proposed audio system is shown in
Figure 2. For each block, a number is given which is referenced in the text. On the right part of the figure, several blocks, which include input data to the blocks on the left side, are shown. The numbers in curly brackets show the connections.
2.1. Spatial Sub-Sampling
In the exemplary audio system, the different BRIR data sets, which are required for the listener’s movement in the room, are created for predefined discrete areas in the room. These areas are called cells; the totality of all cells is called the grid. Depending on the chosen spatial resolution, a BRIR data set is valid within one of these cells. A BRIR data set consists of BRIRs for the left and right ear, for all available head orientations, and for one audio object. Only in the center of a cell does the BRIR data set correspond to a correct mapping of distance and direction to the sound source. Towards the edges, the deviation of the direction and distance cues, and thus the reproduction error, increases. If the spatial resolution is chosen too low, perceptible errors will increase. The shapes of the cells can be uniform in the sense of a position-independent shape, for example, square. However, they can also be nonuniform in the sense of a position-dependent shape.
The use of a uniform grid does not adequately take perceptual inaccuracies in localization [
27] and distance perception [
28,
29] into account. Furthermore, it was shown that for a range of signal types a uniform positional BRIR grid requires a 5 cm resolution or higher to provide a smooth transition without noticeable discontinuities [
26]. If the listener is very close to the virtual audio object, a high spatial resolution must be selected to minimize perceptual errors. However, this high resolution is unnecessary at greater distances from the source, which allows a reduction of the required number of BRIRs. The approach discussed here is called the Maximum Allowed Error Method (MAEM) and was developed by Georg Götz and Samaneh Kamandi [
30].
The MAEM (see block 100 in
Figure 2) describes a method to parameterize the size and shape of an area in the listening room which is represented by one BRIR data set. The parameterization is motivated by perceptual thresholds in spatial hearing.
Figure 3 shows the principles of localization blur and distance blur in the horizontal plane. If a sound source is perceived from a certain direction and distance, the acoustic position of this sound source can change within certain limits without changing the perception of direction and distance; the displacement is inaudible. In a similar way, other sound sources located in this range are perceived from the same direction and distance. The size of this range is determined by the just noticeable differences (JNDs) for direction and distance perception of the direct sound of the source. If a fixed localization blur or minimum audible angle is assumed, the density of the cells (in the sense of the width of the cells) should increase at small distances to the source and decrease at larger distances. The situation is similar regarding distance blur: for small distances, the density (in the sense of the length of the cells) should be higher than for large distances. This approach leads to a grid of nonuniform cells which can effectively reduce the number of BRIRs required without causing increased errors in direction and distance perception. An extension of this approach could also consider the reflections, and not only the direct sound, to determine the JND.
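Under these assumptions, approximate cell dimensions follow directly from the two thresholds: the cell width from the minimum audible angle and the cell length from the relative distance blur. A minimal sketch (the function name and parameterization are our own illustration; the actual MAEM implementation [12,30] may differ in detail):

```python
import math

def cell_dimensions(distance, max_angle_deg, max_rel_distance):
    """Approximate MAEM cell size at a given source distance.

    Width follows from the maximum allowed angular change (localization
    blur); length from the maximum allowed relative distance change
    (distance blur). Both grow with distance, yielding a nonuniform grid.
    """
    width = 2.0 * distance * math.tan(math.radians(max_angle_deg) / 2.0)
    length = max_rel_distance * distance
    return width, length
```

Doubling the source distance doubles both cell dimensions, which is exactly why a distance-dependent grid needs far fewer cells than a uniform one of the resolution required near the source.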
The maps generated by the MAEM approach are based on a Voronoi diagram (
Figure 4). First, a floor plan of the room can be loaded as 1 bit graphic (blocks 101 and 161 in
Figure 2). In the immediate proximity of the walls, the calculation of the cells can be influenced by specifying a minimum wall distance (162). Acoustic influences of a very close wall cannot be represented, as in the process of filter creation (120) all BRIRs are extrapolated from a single measurement recorded in the middle of the room. The minimum wall distance therefore represents a distance to the walls below which no new cells are generated and thus no new BRIRs are made available. A minimum grid size (165) can be specified to avoid unnecessarily small cells, especially near the sources. The MAEM approach calculates the possible distances of the individual cells starting from the source position (163), using a minimum source distance (164), the maximum allowed distance change (166), and the maximum allowed angular change (167). Specifying a minimum source distance prevents the creation of excessively small and numerous cells in the immediate proximity of the source. The maximum allowed distance change is the main parameter for the calculation of the map; it represents the distance blur described above and shown in
Figure 3. The second main parameter is the maximum allowed angular change (167) in degrees, which corresponds to the minimum audible angle (see
Figure 3).
Figure 4 shows examples of nonuniform grids based on a Voronoi diagram. The main parameters for generating the grids are the maximum allowed angle error and the maximum allowed relative distance change; Grids 1–4 were generated with different combinations of these two parameters. The distance parameter for Grid 1 is based on the distance-blur results reported by Spagnol et al. [10]. The angle parameters are estimates of the localization blur which have been collected from the literature [11,27].
The output of the MAEM (100) is a list of all possible listening positions for each audio object (111): a listening-position map (grid) for each object and the spatial positions of the individual cells. The map is used to assign the listener's position in the room to the correct BRIR filter selection in the real-time processing block (130). The cell positions are needed for the filter creation (120).
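Given the cell centers of such a grid, the real-time assignment of a listener position to a BRIR data set reduces to a nearest-center lookup, which is exact for Voronoi cells (each point belongs to the cell of its closest center). A minimal sketch with illustrative names:

```python
import numpy as np

def select_brir_cell(listener_pos, cell_centers):
    """Return the index of the grid cell whose center is closest to the
    listener position; for a Voronoi grid this lookup is exact."""
    d = np.linalg.norm(np.asarray(cell_centers, dtype=float)
                       - np.asarray(listener_pos, dtype=float), axis=1)
    return int(np.argmin(d))
```

For large grids, a spatial index (e.g., a k-d tree) would replace the brute-force distance computation, but the mapping itself stays the same.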
2.2. BRIR Synthesis
A set of BRIRs must be provided for each possible listening position. If a data set of BRIRs is measured, e.g., with a head-and-torso simulator at one or a few selected positions in a room, then BRIRs for further positions can be generated by interpolation and extrapolation. In our group, we pursued two different approaches, which are presented below. The first approach is based on a quite strong simplification and thus allows for an efficient implementation. The second approach manipulates more details with the goal of providing better quality.
Section 2.2.3 adds further functionality to both approaches in order to add sound source directivity.
2.2.1. Constant Reverberation
This first approach is based on the simple idea of keeping the reverberation constant across the different positions. Several earlier studies, such as that in [31], indicate that, beyond a certain time after the direct sound, the reverberation can be kept constant for direction-dependent reproduction, at least in small rooms. It was shown that, at least for an approaching motion towards a virtual sound source, a simple adjustment of the energy of the direct sound according to the distance to the sound source was sufficient to achieve a plausible reproduction. Moreover, it was perceived as being as plausible as the original set of BRIRs measured along the walking line [
17,
32].
The basic idea of keeping the reverberant part constant is not new but, due to its simplicity, a very interesting one. However, it likely has limitations and will deliver perceptually satisfying results only under certain conditions. For this reason, we started to investigate this approach for the given application scenario of synthesizing BRIRs for an AAR system with 6-DoF. So far, two studies have been conducted for the case of walking towards and away from a virtual loudspeaker in two rooms of similar size but with quite different reverberation times (RT60). In both cases, using the reverberation measured at one position of a considered walking line, keeping the reverberant part of the BRIRs constant over the different positions did not significantly reduce the plausibility. Two different cases of direct sound were taken into account. In one version, the direct sound originally measured at all the different positions was used. The other version was built on the measurements from one specific position in the room; BRIRs for other positions were created by simply adjusting the level of the direct sound. Both realizations were perceived as being as plausible as the original fully measured set of BRIRs. More details about the experiments are provided in
Section 3.2.1.
In both experiments, the translation line was located in front of the loudspeaker. Therefore, it remains an open question whether this very simple approach still creates convincing results for the cases of walking past and behind a virtual sound source. For these cases, the source directivity has to be considered when modeling the direct sound. Moreover, there is still a lack of knowledge about the perception of room-acoustical details and their relative changes in cases of low direct-sound energy.
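The core of the constant-reverberation approach can be sketched in a few lines: only the direct-sound segment is rescaled by the inverse distance law, while everything after the split point (early reflections and late reverberation) is reused unchanged. The split index and the distance handling shown here are simplified assumptions for illustration:

```python
import numpy as np

def synthesize_brir_constant_reverb(measured_brir, split_idx, r_meas, r_new):
    """Constant-reverberation sketch: keep everything after `split_idx`
    unchanged and rescale only the direct-sound segment by 1/r.

    measured_brir: (2, N) BRIR measured at distance r_meas;
    r_new: distance of the position to be synthesized.
    """
    brir = measured_brir.copy()
    brir[:, :split_idx] *= r_meas / r_new  # closer position -> louder direct sound
    return brir
```

A full implementation would additionally apply the propagation-delay shift for the new distance and select the direct sound for the correct head orientation.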
2.2.2. Acoustical Shaping
Acoustic shaping describes an approach to adapt a BRIR by changing individual distance-dependent acoustic parameters. The aim is to reach perceptual suitability of the synthesized BRIRs regarding spatial auditory perception. The basic idea is to change the acoustic structure of the early reflections of measured BRIRs based on a shift of the initial time delay gap (ITDG), i.e., the time between the incoming direct sound and the first reflection. The ITDG is a distance-dependent parameter: it is small for distant sources and larger for close ones.
Figure 5 shows this distance dependency of the ITDG if the first reflection is a ground reflection.
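This distance dependency follows from simple image-source geometry: the floor-reflected path runs via a mirror source below the ground, so the path difference to the direct sound (and hence the ITDG) grows as the listener approaches. A small sketch of this relation (2D geometry and a fixed speed of sound are simplifying assumptions):

```python
import math

def itdg_ground_reflection(distance, h_src, h_rec, c=343.0):
    """Initial time delay gap for a ground reflection (image-source model):
    time difference between the floor-reflected and the direct path.

    distance: horizontal source-receiver distance in m;
    h_src, h_rec: source and receiver heights above the floor in m.
    """
    direct = math.hypot(distance, h_src - h_rec)
    reflected = math.hypot(distance, h_src + h_rec)  # via mirror source
    return (reflected - direct) / c
```

For ear and loudspeaker heights around 1.5 m, the ITDG shrinks from several milliseconds at 1 m distance to well under half of that at 5 m, which is the behavior the shaping algorithm reproduces.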
The principle of filter shaping (122) based on a manipulation of the ITDG is shown in
Figure 6 as a block diagram. The presented approach is based on the work from Füg et al. [
15], and it uses one measured data set of BRIRs at one position in the room. Depending on the new listening position and head orientation to be calculated, the BRIRs corresponding to the new yaw orientation are selected from the recorded data set. The BRIRs are split into a direct part, the early reflections, and the late reverberation. The transition point between early reflections and late reverberation is defined by the perceptual mixing time [
31] of the auralized room. The mixing time defines a point in time of a BRIR after which its content is perceptually independent of the head pose or the position in the room. The beginning of the early reflections is defined by the choice of a time index shortly before the first reflection. The samples within the defined range of early reflections are then rearranged in time according to the distance dependency of the ITDG. Therefore, the time of the first reflection is changed depending on the distance between the source and the listening position. In the presented case, the ITDG change originates from a measurement in a TV studio at our university (TU Ilmenau), but it can be adapted for a specific room or geometric arrangement. If a BRIR for a closer distance than the recording position is needed, the samples before the first reflection are stretched in time while the samples up to the mixing time are compressed in a linear manner. For a greater distance, the opposite applies. The last adaptation is a change of the energy of the shaped BRIR depending on the new distance, following the inverse-square law.
Figure 7 shows the ITDG approach exemplarily for a BRIR shaping, with a BRIR measured at one distance and a new BRIR synthesized for a closer distance. For clarity of the figure and the approach, the traveling time between the source and the listening position is shown for the new BRIR, and the measured BRIR is shifted to this time in the figure. In a real scenario, BRIRs would also be shifted along the time axis; in the case shown, the measured BRIR would have a longer traveling time than the closer, new BRIR. Furthermore, no distance-dependent energy adaptation is shown. The area where the BRIR is manipulated is indicated as a box in the figure. It covers the first reflections up to the mixing time (endpoint). Within this range, the first reflection (usually the ground reflection) is detected and the ITDG is determined. According to the new distance, all samples are mapped to their new times using a linear compression and expansion characteristic within the range of the early reflections. The transition point of the two curves is the time of the first reflection: the first curve shifts all samples before the first reflection, the second shifts all samples after the first reflection up to the mixing time. In terms of
Figure 7, the samples are compressed before the first reflection and expanded afterwards.
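The two-segment remapping can be sketched as a piecewise-linear warp of the sample indices in the early-reflection window, with the pivot at the first reflection and the window end fixed at the mixing time. The resampling via linear interpolation and the index conventions below are our simplifications of the published algorithm [15]:

```python
import numpy as np

def shift_itdg(brir, t_start, t_first, t_mix, t_first_new):
    """Piecewise-linear remapping of the early-reflection window (sketch).

    Samples in [t_start, t_first) are stretched/compressed so that the first
    reflection moves from index t_first to t_first_new, while samples in
    [t_first, t_mix) are remapped so the window still ends at t_mix.
    """
    out = brir.copy()
    old = np.arange(t_start, t_mix)
    new = np.empty(len(old), dtype=float)
    seg1 = old < t_first
    # segment 1: warp [t_start, t_first) onto [t_start, t_first_new)
    new[seg1] = t_start + (old[seg1] - t_start) * (t_first_new - t_start) / (t_first - t_start)
    # segment 2: warp [t_first, t_mix) onto [t_first_new, t_mix)
    new[~seg1] = t_first_new + (old[~seg1] - t_first) * (t_mix - t_first_new) / (t_mix - t_first)
    for ch in range(brir.shape[0]):
        # resample the warped signal back onto the integer time grid
        out[ch, t_start:t_mix] = np.interp(old, new, brir[ch, t_start:t_mix])
    return out
```

Direct sound (before `t_start`) and late reverberation (after `t_mix`) pass through untouched, matching the split described above.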
The yaw orientation is realized by selecting the corresponding direct sound from the measured or otherwise calculated BRIRs. An improvement can be achieved by including an interpolation of the direct sound, for example, to synthesize a finer yaw resolution. The intensity of the direct sound and of the reverberation is adapted according to the new synthesized grid position.
The assumption of this ITDG-based approach is that the detected first reflection is suitable for a distance adjustment. This is the case if it corresponds to a ground reflection, for example. It is not the case if the detected reflection is a wall reflection directly behind the measured sound source on a line from source to measurement position; then the determined ITDG is not distance-dependent. The challenge lies in correctly determining the first valid reflection. Furthermore, the adjustment of the BRIR causes a more or less pronounced change of the total time range of the early reflections and thus a change of the acoustic mapping of the room. In detail this means that, e.g., when synthesizing a closer position, the ITDG is correctly enlarged, but the later early reflections are also shifted in time. In general, this is justified by the change of location in space; however, it does not correspond to the actual change of the reflection pattern. It is conjectured that the validity of this approach is due to its ability to preserve the overall relative structure of the occurrence of the reflections. Timbre characteristics of the individual early reflections are conserved, which keeps the synthesis plausible (see
Section 3.2.2).
2.2.3. Sound Source Directivity
For both presented approaches of creating BRIRs for additional listening positions, an adequate modeling of the direct sound with its directivity is essential, as it is also relevant for the progression of the DRR within a given listening area. The shaping of the ITDG described in
Section 2.2.2 does not include a correct representation of the sound source directivity (SSD) pattern. The BRIRs of the measured position contain the directional characteristics of the sound source at that position. If these BRIRs are used to synthesize another position in the room, the directional characteristics remain unchanged. A correction is therefore desirable to minimize the physical and the perceived differences between the measured and synthesized BRIRs. This is expected to result in a more plausible listening experience.
The shaping algorithm is therefore enhanced in a further implementation and study. An additional processing step to consider the SSD is shown in
Figure 8. A predefined source-directivity pattern of the sound source used for the measurements is taken into account to vary the direct sound part of the BRIRs. In this setup, a Geithain MO-2 loudspeaker is used for the measurements. The frequency and angular-dependent changes in the amplitude for different orientations are adopted for the calculations of the extrapolated BRIRs. For that, the algorithm determines the angular relation between the position of the loudspeaker to the measurement position and to the synthesis position. The amplitude of the direct sound part of the BRIR at the position to be synthesized is then boosted or attenuated according to the change in amplitude in the pattern between these angles. Additionally, the inverse distance law is applied to take care of the different distances to the position of the loudspeaker. Further details are described in [
33].
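The additional processing step can be sketched as a gain correction of the direct-sound segment, combining a directivity gain (looked up from the loudspeaker's pattern for the angular change between measurement and synthesis position, here passed in as a precomputed value in dB) with the inverse distance law. The function name and the fixed-length direct part are simplifications:

```python
import numpy as np

def apply_ssd_correction(brir, direct_len, gain_db, r_meas, r_new):
    """Correct the direct-sound part of a BRIR for source directivity.

    gain_db: level change read from the directivity pattern between the
    angle of the measured and of the synthesized position (assumed given);
    r_meas, r_new: source distances of measured and synthesized position.
    """
    out = brir.copy()
    g = 10.0 ** (gain_db / 20.0) * (r_meas / r_new)  # directivity + 1/r law
    out[:, :direct_len] *= g
    return out
```

In the full algorithm the gain would additionally be frequency-dependent, i.e., a short filter derived from the directivity pattern rather than a broadband factor.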
The evaluation of physical parameters shows a reduced deviation of the direct-to-reverberant energy ratio (DRR) between a measured reference BRIR and the synthesized BRIRs when the SSD is considered, in comparison with BRIRs not considering the SSD.
Figure 9 shows a heat map displaying the deviations. In an investigation using position “D1” as the measurement position, deviations smaller than 2 dB are reached for most synthesized positions. Especially for positions off-axis to the sound source, an improvement is seen. The reduction of the DRR deviation reaches up to around 4 dB, depending on the difference between the synthesized and the measured position. The deviations are mostly within the JND for the examined DRR range and are therefore not perceptible.
Even though good results for the DRR deviations are reached, there is still room for improvement. The perceived differences, shown in
Section 3.2.3, are probably based on other physical parameters. Further research should, for example, investigate whether changes of the first reflections and the reverberant parts of the BRIRs due to SSD need to be considered for the calculations as well to address spatial perception and coloration effects in more detail.
2.3. Real-Time Processing
Real-time convolution for binaural synthesis requires low delay in order to keep the overall system latency (SL) below perceptual thresholds. When the latency is above a certain threshold, static sound sources will no longer appear stable when the listener moves his or her head. Therefore, a high SL can be a cue for listeners to detect whether a sound source is real or virtual. For dynamic synthesis considering head rotation, Lindau [34] reports a threshold for the SL; the measured thresholds depend on the signal and on the test paradigm used.
For pose- and position-dynamic systems in AR, additional parameters become relevant: typically, visual cues are available as a reference for the virtual sound sources, which could lower the SL threshold depending on the positional and temporal precision of the visual cue, in terms of a temporal and spatial ventriloquism effect [35]. However, the presence of a visual object may increase sensitivity to latency effects, as a comparison can always take place, especially in an AR scenario. When the listener is able to change position in addition to head movements, higher movement velocities and accelerations are expected in comparison to head movements alone. No studies could be found on this subject.
In the case of BRIR rendering, long filters have to be convolved with the source signal in real time, for several sound sources at the same time. The state-of-the-art solution for this use case is a blocked convolution (overlap-add or overlap-save) with uniformly partitioned filters. This solution is significantly faster than using non-partitioned filters, but the computational complexity increases with the filter length and with decreasing block size. In the case of limited computational power, a compromise between filter length and block size (which directly relates to the delay induced by the convolution) has to be found. To overcome this limitation, filters can be partitioned nonuniformly (e.g., short segments for the direct sound and early reflections and long segments for the late reverb). The drawback of this solution is the implementation effort, because each sub-convolution needs to be scheduled correctly. This may require fine-tuning for a specific hardware configuration [
36].
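A uniformly partitioned overlap-save convolver can be sketched with a frequency-domain delay line: the filter is cut into equal partitions, each input block is transformed once, and the partition spectra are accumulated along the delay line. This is an illustrative re-implementation of the general technique, not code from pybinsim or any cited system:

```python
import numpy as np

def make_partitioned_convolver(h, block):
    """Uniformly partitioned overlap-save convolution (frequency-domain
    delay line). Returns a function filtering one `block`-sized chunk per call."""
    n_parts = -(-len(h) // block)                     # ceil division
    h = np.pad(h, (0, n_parts * block - len(h)))
    # spectrum of each zero-padded partition (FFT size 2*block)
    H = np.stack([np.fft.rfft(h[i * block:(i + 1) * block], 2 * block)
                  for i in range(n_parts)])
    fdl = np.zeros((n_parts, block + 1), dtype=complex)
    prev = np.zeros(block)

    def process(x):
        nonlocal prev, fdl
        X = np.fft.rfft(np.concatenate([prev, x]))    # overlap-save input
        prev = x.copy()
        fdl = np.roll(fdl, 1, axis=0)                 # advance the delay line
        fdl[0] = X
        y = np.fft.irfft((fdl * H).sum(axis=0))
        return y[block:]                              # discard aliased half
    return process
```

Only one forward FFT per input block is needed regardless of the filter length, which is what makes long BRIRs affordable at small block sizes; exchanging a filter amounts to swapping the partition spectra `H`.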
Another way to reduce the computational load is to make use of the perceptual mixing time [31]. Depending on the room volume, this value starts at roughly 30 ms [
31]. It has to be noted that these values were only evaluated for a change of the head orientation, not of the position. Exploiting the mixing time means that only parts of the filter have to be exchanged when the listener moves around or changes the head pose. Even though this reduces the amount of data which needs to be exchanged, the load for the convolution itself remains the same. Meesawat and Hammershøi [37] conducted a small study considering different source and receiver positions in the room. For this specific room and these positions, they mention a time frame starting at about 40 ms after which the BRIRs could be exchanged without perceptual consequences. However, the listening positions were always located at a close distance in front of the (virtual) loudspeaker. In 6-DoF scenarios, listeners can also walk to positions with very low direct-sound energy. In such cases, listeners may be more sensitive to small changes in the room acoustics.
When BRIRs are not precomputed, additional processing power is needed to synthesize or simulate them on the fly. While real-time capability depends in the first place on the specific algorithm, some techniques to save processing power can be applied to any algorithm. One of these is the aforementioned Maximum Allowed Error Method: filters for a new position only need to be generated when a perceptual difference is expected. Another technique is the prediction of listener movement. As listeners cannot change their position arbitrarily fast, some movements can be predicted and the corresponding filters can be computed in advance. This might increase the overall number of calculations, but it helps to balance the processing load: when the listener moves and the prediction was successful, a new filter only needs to be loaded instead of being computed under time pressure. As a result, the BRIR computation can run at a constant pace without strong load peaks.
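Such a movement prediction can be as simple as a constant-velocity extrapolation of the tracked position; the predicted position is then mapped to a grid cell whose filter is prepared ahead of time. A minimal sketch (parameter names are illustrative):

```python
import numpy as np

def predict_position(p_prev, p_curr, lookahead_s, dt):
    """Constant-velocity extrapolation of the listener position (sketch).

    p_prev, p_curr: positions from the two most recent tracker frames;
    dt: time between those frames; lookahead_s: prediction horizon.
    """
    v = (np.asarray(p_curr, dtype=float) - np.asarray(p_prev, dtype=float)) / dt
    return np.asarray(p_curr, dtype=float) + v * lookahead_s
```

More elaborate predictors (e.g., Kalman filtering of the tracking data) would reduce mispredictions, at the cost of additional tuning.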
For the systems and experiments described in this article, the partitioned convolution and filter management were realized with the open-source Python tool pybinsim [38]. It is based on uniformly partitioned convolution with the overlap-save approach. In the different setups, the block size was either 256 or 512 samples.
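The uniformly partitioned overlap-save scheme can be illustrated with a simplified single-channel sketch. This is not pybinsim's actual implementation; the class and all names are our own, and only the core frequency-domain delay-line structure of the technique is shown:

```python
import numpy as np

class PartitionedConvolver:
    """Uniformly partitioned overlap-save convolution: low-latency
    filtering with a long impulse response (single-channel sketch)."""

    def __init__(self, impulse_response, block_size):
        self.B = block_size
        h = np.asarray(impulse_response, dtype=float)
        # Zero-pad the filter to a whole number of partitions of length B.
        n_part = -(-len(h) // block_size)
        h = np.pad(h, (0, n_part * block_size - len(h)))
        # Precompute the spectrum of every partition (zero-padded to 2B).
        self.H = np.array([np.fft.rfft(h[i * block_size:(i + 1) * block_size],
                                       2 * block_size)
                           for i in range(n_part)])
        # Frequency-domain delay line holding spectra of past input buffers.
        self.fdl = np.zeros_like(self.H)
        self.in_buf = np.zeros(2 * block_size)

    def process(self, block):
        """Filter one input block of B samples; returns B output samples."""
        assert len(block) == self.B
        # Overlap-save: keep the previous B samples, append the new block.
        self.in_buf = np.concatenate([self.in_buf[self.B:], block])
        # Shift the delay line and insert the current buffer's spectrum.
        self.fdl = np.roll(self.fdl, 1, axis=0)
        self.fdl[0] = np.fft.rfft(self.in_buf)
        # Multiply-accumulate all partitions in the frequency domain.
        y = np.fft.irfft((self.fdl * self.H).sum(axis=0))
        return y[self.B:]  # discard the time-aliased first half
```

Only one forward and one inverse FFT are needed per block regardless of the filter length, which is what makes long BRIRs feasible at block sizes such as 256 or 512 samples.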
4. Discussion
This paper has given an insight into an audio system for the creation of an auditory augmented reality (AAR) environment. Individual system components were described, which are motivated by human spatial hearing. The objective is to minimize the technical effort while maintaining sufficiently good spatial auditory perception in an AR scenario. Specifically, methods for BRIR synthesis were presented that create new listening positions in a room based on very few measurements. The evaluations of these methods show that the spatial audio quality remains comparable to reference measurements and that the generated listening environment remains highly plausible for a moving listener. Furthermore, a method was presented that optimizes the spatial deployment of BRIRs based on spatial auditory perception. This makes it possible to significantly reduce the number of BRIRs required without causing disruptive effects on auditory perception.
While the techniques described in this paper show significant progress towards the goal of a plausible synthesis of auditory events in an augmented reality, another important goal of such developments is not yet reached: enabling AAR systems, e.g., on consumer devices, in a way that the illusion is perfect for every user. The authors are not aware of any technical system which fulfills this goal, and there are known limitations in our work, too.
The methods presented here intervene strongly in the structure of the BRIRs. Nevertheless, the evaluation shows that the perceived spatial audio quality is only moderately affected, or in some aspects not affected at all. It seems that essential characteristics and features are preserved even in the modified BRIRs. Further research to determine the relevant acoustic features and feature combinations which are used to build up a cognitive auditory model of our environment seems essential.
Our proposed algorithms can never extrapolate the first reflections correctly, but the negative impact appeared to be weak in our listening tests. This observation questions the importance of first reflections for the perception of the room. First reflections can influence the perceived direction, timbre and coloration, apparent source width, and other attributes [47]. A study by Brinkmann et al. claims that rendering the first six reflections is sufficient to minimize the overall difference compared to an auralization containing all reflections [48]. However, these data are based on image source models and shoebox-shaped rooms, whose acoustics may differ from those of real rooms. The perceptual metrics in our listening tests suggest that the effect of an acoustically authentic reproduction of the first reflections on externalization is small, which could explain why our approaches perform well regarding this quality feature. It remains open, however, to what extent these results can be generalized. Still, it is a strong conjecture that essential patterns of the auditory space are preserved in the adapted BRIRs, ensuring a high match with the internal representation of the room in the auditory system (the cognitive model of the room). The endeavor to answer these questions is of interest for our further research, which should also cover larger synthesized areas, rooms with complex geometries, and all types of audio signals. Observations may also differ with the tasks given to the listener and the complexity of the listening scenarios.
The presented work still leaves other open questions:
What is the influence of room modes regarding the measurement positions (avoided in the current measurements)?
How do the two proposed methods compare to each other and to other approaches? We have substantial data on our approach, but a comparison to other approaches lacks standardized test methods. A measurement-based approach like the one described in this paper always has the advantage of a basic match between the filter and the actual room characteristics, but is this really necessary?
Can we propose listening tests which measure perceptual thresholds for this type of auralization? Could this be done using MAEM?
There is definitely much more research to be done in this field. The current results are promising enough to hope that, at some point, a simple and highly plausible reproduction of sound in a room via headphones will be achieved.