Article
Peer-Review Record

Creation of Auditory Augmented Reality Using a Position-Dynamic Binaural Synthesis System—Technical Components, Psychoacoustic Needs, and Perceptual Evaluation

Appl. Sci. 2021, 11(3), 1150; https://doi.org/10.3390/app11031150
by Stephan Werner 1,*,†, Florian Klein 1,†, Annika Neidhardt 1,†, Ulrike Sloma 1,†, Christian Schneiderwind 1,† and Karlheinz Brandenburg 1,2,*,†
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 21 December 2020 / Revised: 22 January 2021 / Accepted: 25 January 2021 / Published: 27 January 2021
(This article belongs to the Special Issue Psychoacoustics for Extended Reality (XR))

Round 1

Reviewer 1 Report

This paper presents a broad overview of works undertaken to investigate various aspects of an auditory augmented reality system including estimating and processing BRIRs, and sampling schemes for quantisation of rooms. The key trade-off is between available processing power and the perceptual plausibility of the simulation.

--------------
The introduction and background section is generally good and covers all key areas, though there are many places where additional explanation and especially good citations are needed. These include:

Lines 57-61: This paragraph discusses dependencies of a realistic simulation including room geometry and psychoacoustic assumptions. Both of these could be expanded on, e.g. what is a typical room geometry, and what kinds of room geometries might break the general assumptions that are made? What are the psychoacoustic assumptions and relevant psychoacoustic effects? These should all be backed up by proper citations (probably psychoacoustic textbooks here).

Lines 65-59: This passage details parametric filter creation. Citations are needed here to back up the assertion that these are 'usually based on a measured or assumed mode...'.

Lines 81-82: Citations needed for the 'common solution' of 'rendering of a virtual spherical loudspeaker setup'.

Lines 102-103: Mention is made here of listening tests being the only 'sensible way' to judge. This implies there are other methods but these are for some reason not 'sensible'. Consider revising this assertion and the word 'sensible', which is slightly casual language. Citations are also needed here for the 'listening test paradigms as known from audio coding'.

--------------
Section 2, which outlines the system and problems is also generally clear, but there are a few areas for improvement:

Line 167: Figure 3 is good, but the explanation of the concept contained in its caption should appear in the main text. There is critical information for the reader's understanding here.

Lines 196-197: Citation needed for the 'estimates which have been collected from the literature'. Which literature?

Line 165: I feel it would be clearer to state something along the lines of 'for clarity of visualising the time-stretching manipulations, the direct sound components have been shifted to coincide, whereas in reality this would not be the case' - 'runtime' to me is ambiguous and suggests the CPU runtime of an algorithm rather than the travel time of a sound wave.

Lines 327-328: Citation needed for 'lower values are mentioned by other authors'.

----------------

Section 3 is the most confusing. Three separate tests are presented in succession, and although these are separate tests, this is structurally confusing for the reader and breaks the sense of narrative in the paper. This section should be rearranged to a more standard structure whereby all test methodologies are described, followed by all results, rather than the three-part structure used here. Generally, it would make sense narratively to present the results for the BRIR synthesis before the results for the non-uniform grids, which are contingent on the plausibility of the BRIR synthesis. There are also a few points that need to be addressed specifically:

Lines 440-447: This paragraph is background information, and should have appeared earlier in the paper. Going from presenting results back to background is especially confusing.

Line 482: 'RDE' is not defined.

Lines 507-508: Interpolation approach is introduced here. I appreciate this approach is not the focus of the paper, but there should be at least some brief explanation of the specifics of how it operates in order to make a comparison with your new technique more meaningful.

Lines 541-542: 'The rating for the ability to perceive a direction...are not shown' - I presume this refers to localisation ability as defined on line 483. If these results are not to be published in this paper then don't mention them at all. It is confusing to the reader.

Section 3.3: There appear to be no figures showing the results described here. I understand that there might not be a lot to see, but it would be beneficial for the reader to be able to visually inspect what is being described, as with the other results. I see later (line 577) that we are referred to a conference paper for these results. I would suggest either including these results in full or omitting them completely. What is included at the moment is a confusing halfway house.

----------------

Section 4 is fine; however, the statement of the 'overall goal' (lines 582-583) should be moved to much earlier in the paper (and restated here if it is felt necessary).

---------------

Typos:

Line 114-115: Repetition of 'the current source-receiver position'.
Line 156: Broken reference.
Line 169-170: Repetition of 'increase'.
Line 591: 'source with' --> width?


I believe this is a solid paper, but a general restructure along the lines I have specified should make it much more coherent and readable. I will then feel able to recommend the paper for publication.

Author Response

Dear Reviewer,

Thank you very much for your very detailed and technically helpful comments. You can find our reply as an attachment.

best regards,

the authors

Author Response File: Author Response.docx

Reviewer 2 Report

The paper describes a system for generating augmented reality audio that allows movement with 6 degrees of freedom, with the aim of matching the augmented reality elements to the real space as perceptually closely as possible. The technical solutions are partially described, together with perceptual experimentation to validate some of the choices made. This is a particularly timely topic, with great interest among researchers and industry.

The paper contains a good amount of interesting and well-researched information, though this could be better presented. The introduction section lacks clear aims; having the research questions clearly specified in the introduction will help the reader better understand what to expect in the paper. Having clear research questions will also help to improve the discussion/conclusions section, which at the moment doesn't sufficiently summarise or tie together all the elements of the work presented.

The information in the introduction section forms a good background to the work, and covers a wide range of potential problems to be solved to create a successful system, as well as suggesting a range of potential solutions. The discussion of the direction of arrival calculation on line 46 could be clarified: when this is first mentioned it is not clear what elements of a scene this is needed for.

The start of section 2 feels a bit repetitive; it seems that there is too much overlap between the first 2 paragraphs of this section and the information contained in the introduction; this could be made more concise. In the start of the third paragraph there is some confusion in terminology between "positions" and "areas" – this should be clarified. Also note the missing reference on line 156.

The maximum allowed error method (MAEM) that is employed appears to be a useful way of quantifying the problem of localisation error, but it could do with a clearer description: how is the localisation error calculated? Is it calculated for only the sources or for every single reflection (and if the latter, for what order of reflections)? If the error is calculated for both sources and reflections, is the relative perceivability of errors of each of these taken into account? Is there any advantage of plotting the error as a colour surface plot rather than just boundary lines? Once the 4 grids have been calculated, which resolution of grid do the authors propose to use?

The relationship between sections 2.2.1 and 2.2.2 should be clarified. It appears that the former of these argues that a constant reverberation could be used, but the latter discusses manipulating the reverberation. A clearer link between these sections would help the reader to follow the argument.

The discussion at the start of section 2.2.3 should be clarified to more clearly explain the relationship between the source directivity and its effect on the BRIR for different positions. In the second section it seems to be shown that the changes in the physical parameters are within the JND range for the direct-to-reverberant energy ratio, which raises the question of why this is considered further - this should be discussed.

On line 328 it is stated that "lower values are mentioned by other authors" - who? and what values? And on line 329 "values below 50ms are considered as suitably low" – this needs more evidence to convince the reader.

In section 3 a number of perceptual experiments are described. The first of these considers the perceptual effect of the BRIR grids. Generally the ratings have large within-condition variation and only slightly more between-condition variation, which makes it difficult to draw strong conclusions from the data.

In section 3.1 the trends discussed are reasonable, but it is odd that there is no mention of the results for the speech signal apart from that this was less critical. An extra sentence or two to describe the pattern of these results for the speech signal would be useful.

In figure 11 it looks as if a single scale is used for 2 attributes - was there in fact a separate scale for each attribute, or did subjects somehow make multiple use of a single scale?

In sections 3.2.2 and 3.2.3 it appears that there are no statistically significant differences based on the commonly used threshold of p<0.05, which indicates that if there is a real perceived difference it is a small effect. This should be discussed further. Could it be reasonably argued that there are no meaningful differences between the methods used, or that the differences are small enough to be ignored (meaning any of the proposed methods are suitable)? This should be explored further in the discussion.

Overall, the paper contains interesting and timely information, though it could be written to be clearer and more conclusive.

Author Response

Dear Reviewer,

Thank you very much for your very detailed and technically helpful comments. You can find our reply as an attachment.

best regards,

the authors

Author Response File: Author Response.docx
