Article
Peer-Review Record

Neural Coincidence Detection Strategies during Perception of Multi-Pitch Musical Tones

Appl. Sci. 2024, 14(17), 7446; https://doi.org/10.3390/app14177446
by Rolf Bader
Reviewer 2: Anonymous
Submission received: 3 June 2024 / Revised: 8 July 2024 / Accepted: 26 July 2024 / Published: 23 August 2024
(This article belongs to the Section Applied Biosciences and Bioengineering)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper reports on the experimental methodology and findings of an investigation of the neural correlates of separateness and roughness of musical sounds, based on computational auditory/neural modeling and perceptual testing.

This is a quite compelling topic, and I am glad to comment on the paper.

A significant advantage of the paper is the inclusion of the original sounds that were used in the investigation, which facilitates further comparative/extending work.

A few comments follow below:

1.    L104: plated->played

2.    Section 3.3: A schematic of the processing chain would greatly facilitate comprehension.  Also, I think that the reader will really benefit from an explanation of how coincidence detection is approached with the use of Gaussian spectral smoothing, especially in view of the related physiological processes such as phase locking, etc.

3.     L220-222: Please, explain how the accumulation (integration) interval is chosen in relation to the pitch's JND (as I understand) of ca.10c

4.     L223 (and elsewhere): cochlear->cochlea

5.     Section 4.2: In this section and Figs 4, 5 I would also like to see sd plots or (even better) a box plot of the subjects' responses. Possibly the data of Figs. 10,11 should be also presented here.

6.    Section 4.3: In the absence of any formal tests of difference between responses to familiar vs unfamiliar sounds, please see my previous comment of the necessity of presentation of sd/boxplots of responses

7.    L382: Pearson’s or Spearman’s correlation?

8.    L408-417 may be unnecessary

9.    L418-429 and below: are the correlations and their differences (between various areas of the plots) statistically significant, given the range of correlation values, as shown in the plots?

10.   L430: these are correlations of sd of the perceptual data with entropy, please rephrase

11.  L430-437: It seems I am missing the results on correlation of sd of the perceptual ratings with the other two measures (amount of peaks and weighted sum of peaks). If they do not offer any valuable information, it should be at least clearly mentioned.

12.   L456: Possibly, a tabulated form of the main findings and conclusions on the various statistics and measures vs experimental conditions/model parameters would greatly improve the exposition of the paper's points.

13.  L462: I would like to see some elaboration on this. Does this imply an independence of pitch and timbre judged merely on a difference between amplitudes of ISI? Or rather, (as I understand it) timbre is also relying on additional information from smaller ISI amplitudes (as they are distributed to ISI values around the 1-5msec expected maxima) which implies an ISI's contrasting feature for different timbres of the same pitch?

14.  Section 5: I think that a connection and elaboration on physiological processes of coincidence detection and their possible anatomical localization (as they are initially referred in the Introduction) would greatly strengthen the paper and its implications. Given the differences found between familiarity conditions, the huge amount of possibilities of top-down processes must also be pointed out. Additionally, I believe that the discussion must also deal with how such findings might be viewed under consideration of the structure and limitations of the BM computational spike models followed by specific phenomenological coincidence modeling approaches (as realized in the current paper) in terms of their abilities to account for and represent the possible effects of such higher perceptual/cognitive processes. Such a perspective highlights the possibilities for further work.

15.   L501: Does this imply that coincidence detection does not eventually affect the timbral perception? I suppose the term "timbre" in the text rather refers (and thus must be declared as such) to the "unprocessed" detailed spectral envelope's characteristics, rather than the perceptual entity referring to various qualitative aspects of sound.

16.   L513: Please, check phrasing

 

Comments on the Quality of English Language

At some points, the phrasing requires corrections.

Author Response

Reviewer 1

The paper reports on the experimental methodology and findings of an investigation of the neural correlates of separateness and roughness of musical sounds, based on computational auditory/neural modeling and perceptual testing.

This is a quite compelling topic, and I am glad to comment on the paper.

A significant advantage of the paper is the inclusion of the original sounds that were used in the investigation, which facilitates further comparative/extending work.

A few comments follow below:

  1. L104: plated->played

corrected

2. Section 3.3: A schematic of the processing chain would greatly facilitate comprehension. Also, I think that the reader will really benefit from an explanation of how coincidence detection is approached with the use of Gaussian spectral smoothing, especially in view of the related physiological processes such as phase locking, etc.

 

Additional text has been added to the Introduction to clarify the nature of coincidence detection:

 

Coincidence detection is also performed through complex neural networks on many levels. In the auditory pathway, several neural loops are present, like the loop from the cochlea via the nucleus cochlearis and the trapezoid body back to the basilar membrane efferents for frequency sharpening\cite{Schonfield2011}. Loops also exist between the inferior colliculus or the medial geniculate body to and from the auditory cortex, or the cochlear nucleus to and from the superior olive complex or the inferior colliculus. In all these loops, complex coincidence mechanisms might occur.

 

Rhythms in the brain are manifold\cite{Buzsaki2006} and of complex nature. Often interlocking patterns appear, like that of the mechanical transition into spikes in the cochlea, as discussed above, or neighboring neurons interlocking with each other. Such mechanisms often use complex neural loops, like that from the cochlea to the nucleus cochlearis to the trapezoid body and back to the cochlea for increasing pitch perception. Another example is the loop from the cochlea up to A1 of the auditory cortex and back down, where the way from A1 to the cochlea passes through only one intermediate nucleus.

          

In all cases, interlocking leads to a sharpening of the neural spike burst, where blurred bursts are temporally aligned to arrive at a more concise, single burst. This is the very outcome of coincidence detection and is used in the methodology discussed below.

 

In Section 3.3, which now refers back to this Introduction text, additional text is included:

 

As discussed in the Introduction, the very nature of coincidence detection is to make a blurred neural spike burst temporally more concise. Therefore, using the Gaussian blurring allows detecting how concise the spike bursts are. In the case of a large $\sigma$, coincidence is weak, as a large time interval is needed to align the peaks, and vice versa.
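The role of the Gaussian width can be sketched numerically. The following is a minimal illustration only, not the paper's actual implementation: a hypothetical spike train at 96 kHz is smoothed with `scipy.ndimage.gaussian_filter1d`, and counting the local maxima of the smoothed train shows how a growing width merges each blurred burst into one concise peak.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

fs = 96000                               # sample rate assumed from the model
train = np.zeros(960)                    # 10 ms window
train[[100, 104, 109]] = 1.0             # hypothetical blurred burst
train[[500, 502, 505]] = 1.0             # second hypothetical burst

def count_peaks(train, sigma):
    """Count local maxima of the Gaussian-smoothed spike train."""
    smooth = gaussian_filter1d(train, sigma)
    inner = smooth[1:-1]
    return int(np.sum((inner > smooth[:-2]) & (inner > smooth[2:])))

# A small sigma leaves every spike as its own peak; a larger sigma merges
# each burst into a single concise peak, i.e. the spikes coincide
print(count_peaks(train, 0.5), count_peaks(train, 5.0))
```

In this sketch, the smallest width at which the peak count drops to the number of bursts plays the role of $\sigma$ above: the smaller it is, the more concise (more coincident) the bursts.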

 

3. L220-222: Please, explain how the accumulation (integration) interval is chosen in relation to the pitch's JND (as I understand) of ca.10c

Text added:

Just-noticeable differences (JND) in pitch perception are frequency-dependent and differ strongly between adjacent and simultaneous sound presentation. The value of 10 cents was found reasonable for the sounds used, where smaller values led to too few coincidences to accumulate and larger values would blur the results.

           

4. L223 (and elsewhere): cochlear->cochlea

Done.

5. Section 4.2: In this section and Figs 4, 5 I would also like to see sd plots or (even better) a box plot of the subjects' responses. Possibly the data of Figs. 10,11 should be also presented here.

 

I fully understand this point and initially intended to present such figures. However, they pushed the main point too far into the background. Therefore, I decided to leave the standard deviations to Figs. 10/11. I additionally performed an ANOVA for the perception test, which I hope makes the independence of the distributions clearer.

6. Section 4.3: In the absence of any formal tests of difference between responses to familiar vs unfamiliar sounds, please see my previous comment of the necessity of presentation of sd/boxplots of responses

 

See above.

7. L382: Pearson’s or Spearman’s correlation?

Pearson’s; text added.

8. L408-417 may be unnecessary

Text omitted.

9. L418-429 and below: are the correlations and their differences (between various areas of the plots) statistically significant, given the range of correlation values, as shown in the plots?

Statistics added in 4.2:

Using a one-way ANOVA comparing the subjects' judgment distributions of the 30 sound examples, the p-value was found to be $3.7 \times 10^{-46}$ for the separateness perception and $2.91 \times 10^{-25}$ for the roughness perception.
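For readers unfamiliar with this test, the comparison can be sketched as follows; the ratings here are synthetic stand-ins (the real listening-test data are not reproduced), and `scipy.stats.f_oneway` computes the one-way ANOVA.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n_subjects, n_sounds = 28, 30            # as in the listening test
# Synthetic stand-in ratings: each sound gets its own underlying mean
true_means = rng.uniform(0.2, 0.8, n_sounds)
ratings = [rng.normal(m, 0.1, n_subjects) for m in true_means]

# One-way ANOVA: are the 30 per-sound judgment distributions drawn
# from populations with the same mean?
f_stat, p_value = f_oneway(*ratings)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

A very small p-value, as reported above for both scales, indicates that the per-sound judgment distributions differ far more than chance would allow.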

 

10. L430: these are correlations of sd of the perceptual data with entropy, please rephrase

The phrase ‘correlations’ has been added.

11. L430-437: It seems I am missing the results on correlation of sd of the perceptual ratings with the other two measures (amount of peaks and weighted sum of peaks). If they do not offer any valuable information, it should be at least clearly mentioned.

I thank the reviewer for pointing me to this. Text added:

The correlations of the standard deviations with the number of peaks N and the amplitude-weighted peak sum W are not considerable. This points in the same direction as the correlations with the perceptual means, namely that the entropy correlates better than the number of spikes N or the weighted sum W.

           

12. L456: Possibly, a tabulated form of the main findings and conclusions on the various statistics and measures vs experimental conditions/model parameters would greatly improve the exposition of the paper's points.

Tables added.

13. L462: I would like to see some elaboration on this. Does this imply an independence of pitch and timbre judged merely on a difference between amplitudes of ISI? Or rather, (as I understand it) timbre is also relying on additional information from smaller ISI amplitudes (as they are distributed to ISI values around the 1-5msec expected maxima) which implies an ISI's contrasting feature for different timbres of the same pitch?

Thank you for pointing to this. Text added:

For pitch detection, only a single spike is necessary in each period of f0. Timbre, on the contrary, needs more partials to differentiate the sound. Therefore, a small number of spikes in the ISI histogram points to a pitch-based perception, while many peaks allow for an elaborated timbre space. Also, the main peaks of the ISI histogram are associated with pitch in a harmonic sound, as these main peaks correspond to a common periodicity, and thus to f0. Smaller amplitudes represent timbre, the elaboration of the sound, more strongly. So we can conclude that the perception strategy for familiar sounds uses pitch information and that for unfamiliar sounds uses timbre.

14. Section 5: I think that a connection and elaboration on physiological processes of coincidence detection and their possible anatomical localization (as they are initially referred in the Introduction) would greatly strengthen the paper and its implications. Given the differences found between familiarity conditions, the huge amount of possibilities of top-down processes must also be pointed out. Additionally, I believe that the discussion must also deal with how such findings might be viewed under consideration of the structure and limitations of the BM computational spike models followed by specific phenomenological coincidence modeling approaches (as realized in the current paper) in terms of their abilities to account for and represent the possible effects of such higher perceptual/cognitive processes. Such a perspective highlights the possibilities for further work.

Thank you for pointing to this. I added a text to clarify the reason for the study:

Coincidence detection is highly interesting in terms of the musical sound production and perception cycle\cite{Bader2021}. Musical instruments internally work with impulses that travel within the instrument, are reflected at geometrical positions, and return to an initial point where they synchronize to a periodicity and therefore create a harmonic sound\cite{Bader2013}. These impulse patterns are then radiated into the air and become reverberated or distorted. In terms of the spectral sound content, such blurring means that the single frequencies in the sound spectrum get out of phase. When they enter the ear and the auditory system, the desynchronized phases become aligned again through several mechanisms, of which coincidence detection is an important one. So the brain tries to reconstruct the original impulses traveling in the musical instrument. In the brain, even stronger synchronization has been found towards perceptual expectation time points\cite{Sawicki2023}. Within the framework of a Physical Culture Theory\cite{Bader2021}, an algorithm, the Impulse Pattern Formulation (IPF), was proposed to model musical expectation using synchronization of coincidence-aligned spike bursts\cite{Bader2024a}. This algorithm also works for human-human interaction, modeling tempo alignment between two musicians\cite{Linke2021}. It was even suggested to model social or political interactions\cite{Bader2024b}.

15. L501: Does this imply that coincidence detection does not eventually affect the timbral perception? I suppose the term "timbre" in the text rather refers (and thus must be declared as such) to the "unprocessed" detailed spectral envelope's characteristics, rather than the perceptual entity referring to various qualitative aspects of sound.

Clarified by modified text:

With familiar sounds, a strong coincidence detection seems to take place, where listeners concentrate on only a small number of spikes representing merely pitch. With unfamiliar sounds, they use the raw input much more strongly, without the coincidence-detection reduction, and thus rely on a more elaborate representation: on timbre.

16. L513: Please, check phrasing

Typo. Now:

It might be that, with familiar sounds, listeners find it easier to separate pitches when the timbre is simpler.

Reviewer 2 Report

Comments and Suggestions for Authors

The topic on multi-pitch musical tones is somewhat outside my expertise, but the approach of using a cochlear model raised my interest.

The use of non-western instruments makes it an interesting paper, although it would have been nice if pictures were included.

Reading the paper was not an easy task, as the sometimes complex jargon related to musical perception is not my expertise.

The link between 1) the perception scores of subjects and 2) the cochlear model including signal processing (ISI and blurring) remains a bit vague. It is my impression that the results with the responses of the subjects are more clearly presented (figures 4 & 5), compared to the correlation results (figures 6 to 9).

Some other remarks:

-        Is the number of 28 subjects sufficient to draw conclusions? And is this group biased, as they are all musical students?

-        The music/tones were played real-time. Are there variations between the tests?

-        The played list is in Table 1. Is this also the playing order or are the instruments mixed (and what arguments can be provided for the order)

-        Line 176, are there 1 or 2 piano sounds? (1 in the Table, 2 in the text)

-        Section 3.2. In the cochlear model the sound pressure acts “instantaneously” on the BM. But later on in the text there is a “delay” between the high and low frequency spikes. Is this the same mechanism?

-        The description of the model is very short, using 5 references of the author (not checked). An FDTD model and a sample frequency of 96 kHz is mentioned. Are the pressure-time recordings the input for this model? Or are previously calculated transfer functions used to get the output at 48 points at the BM?

-        I am a bit surprised that frequencies above 4 kHz are not taken into account, considering perception of roughness. (not an expert though)

-        The description of the post-processing of the cochlear model output is not clearly written.
This makes the interpretation of the figures 6 to 9 difficult.

-        Figures 1 and 2 are too small (on paper). The same holds for the labels/axis in figures 6 to 9.

-        The (very short) note on standard deviations may be extended / more clearly written? Using figures 10 & 11.

-        I guess that the general reader of Applied Sciences needs some additional explanations, e.g. when it comes to ISI histograms and the interpretation (ISI: interspike intervals).

As a result I choose 'major' revision as recommendation.

Author Response

Reviewer 2

The topic on multi-pitch musical tones is somewhat outside my expertise, but the approach of using a cochlear model raised my interest.

The use of non-western instruments makes it an interesting paper, although it would have been nice if pictures were included.

Reading the paper was not an easy task, as the sometimes complex jargon related to musical perception is not my expertise.

The link between 1) the perception scores of subjects and 2) the cochlear model including signal processing (ISI and blurring) remains a bit vague. It is my impression that the results with the responses of the subjects are more clearly presented (figures 4 & 5), compared to the correlation results (figures 6 to 9).

Additional text and an additional figure are included. I hope the text is more readable (see below). Please let me know if any sections still need clarification.

Some other remarks:

-        Is the number of 28 subjects sufficient to draw conclusions? And is this group biased, as they are all musical students?

The subjects were indeed intentionally biased. Text added in the Method section:

The subjects therefore had a bias towards Western musical instruments which is needed in the study comparing familiar (Western) and unfamiliar (non-Western) musical instrument sounds.

-        The music/tones were played real-time. Are there variations between the tests?

Text added:

In terms of standardization, all instruments were recorded and played back to subjects using loudspeakers. Headphones were rejected, as headphone setups are very individual and introduce a bias. The sounds were played in random order, where the same random order was used for all subjects.

-        The played list is in Table 1. Is this also the playing order or are the instruments mixed (and what arguments can be provided for the order)

See text above.

-        Line 176, are there 1 or 2 piano sounds? (1 in the Table, 2 in the text)

There is only one piano sound, text changed.

-        Section 3.2. In the cochlear model the sound pressure acts “instantaneously” on the BM. But later on in the text there is a “delay” between the high and low frequency spikes. Is this the same mechanism?

Yes, this is astonishing! I added a text in the ‘Cochlea Model’ section:

In terms of the 3-4 ms delay, it is interesting that, although the lymph acts instantaneously on the whole basilar membrane, the traveling wave along the membrane still builds up. This is the result of the inhomogeneous stiffness and damping of the basilar membrane, leading to this very special kind of wave.

-        The description of the model is very short, using 5 references of the author (not checked). An FDTD model and a sample frequency of 96 kHz is mentioned. Are the pressure-time recordings the input for this model? Or are previously calculated transfer functions used to get the output at 48 points at the BM?

Clarifying sentences added in the text section:

The model assumes the basilar membrane (BM) to consist of 48 nodal points and uses a Finite-Difference Time-Domain (FDTD) solution with a sample frequency of 96 kHz to model the BM motion. This sample frequency is needed to ensure model stability. The recorded sounds played to the subjects were used as input to the model; the sounds had a sample rate of 96 kHz, corresponding to that of the model. The BM is driven by the ear-channel pressure, which is assumed to act instantaneously on the whole BM. This is justified as the speed of sound in the peri- and endolymph is about 1500 m/s, whereas the wave speed on the BM is only 100 m/s down to 10 m/s. The pressure acting on the BM is the input sound. The output of the BM is spike activity, where a spike is released at a maximum of the positive slope of the BM motion in time at one point on the BM. This results in a spike train similar to measured ones.
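A toy version of the spike-release rule may help; this is a sketch only (the actual FDTD model is in the cited references), assuming a pure 400 Hz sinusoidal motion at one hypothetical nodal point and releasing a spike at each local maximum of the positive temporal slope.

```python
import numpy as np

fs = 96000                                  # model sample rate
t = np.arange(0, 0.01, 1 / fs)              # 10 ms of motion
# Hypothetical BM displacement at one nodal point: a 400 Hz oscillation
y = np.sin(2 * np.pi * 400 * t)

# Spike rule sketched from the text: fire where the temporal slope is
# positive and at a local maximum (steepest upward movement)
v = np.gradient(y, 1 / fs)
local_max = (v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])
spikes = np.where(local_max & (v[1:-1] > 0))[0] + 1

# For a 400 Hz tone this fires once per period: phase-locked spikes
isi = np.diff(spikes) / fs
print(isi)  # ~2.5 ms intervals, i.e. 1/400 s
```

The resulting interspike intervals directly reflect the periodicity of the driving motion, which is what the later ISI-histogram processing builds on.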

-        I am a bit surprised that frequencies above 4 kHz are not taken into account, considering perception of roughness. (not an expert though)

Text added:

This is not considered a serious bias for roughness estimation, as roughness, unlike sharpness perception, is mainly based on low and mid frequencies close to each other, leading to musical beating and therefore to roughness. Strong energy at high frequencies is mainly perceived as brightness or sharpness and does not contribute to roughness perception in the first place.
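The beating mechanism invoked here can be illustrated with a short sketch (the frequencies are arbitrary examples, not taken from the paper): two close partials at 400 and 430 Hz produce an amplitude envelope fluctuating at the 30 Hz difference frequency, which lies in the roughness range.

```python
import numpy as np
from scipy.signal import hilbert

fs = 48000
t = np.arange(0, 1.0, 1 / fs)
# Two close mid-range partials; their superposition beats at |430 - 400| Hz
x = np.sin(2 * np.pi * 400 * t) + np.sin(2 * np.pi * 430 * t)

# Envelope via the analytic signal; its dominant fluctuation is the beat rate
env = np.abs(hilbert(x))
spec = np.abs(np.fft.rfft(env - env.mean()))
beat_hz = np.argmax(spec) * fs / len(env)
print(beat_hz)  # ≈ 30 Hz
```

Two high partials far apart in frequency produce no such slow envelope fluctuation, which is why high-frequency energy contributes to brightness rather than roughness.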

-        The description of the post-processing of the cochlear model output is not clearly written.
This makes the interpretation of the figures 6 to 9 difficult.

Additional figure and text added:

      \begin{figure}

                  \centering

      \includegraphics[width=0.7\linewidth]{fig/Cochleogram_400Hz_10Partials}

      \caption{Example of sound processing. Top: Sound wave. Middle: Cochlea model output of spikes in respective Bark bands (vertical axis) at respective time points (horizontal axis). Bottom: ISI histogram of occurrences of spikes with f = 1/ISI (vertical axis) over time (horizontal axis).}

                  \label{fig:postprocessingexample}

      \end{figure}

     

      Fig. \ref{fig:postprocessingexample} gives an example of the method. At the top, an example sound, a sawtooth wave, is used as input to the cochlea model. The model output is shown in the middle plot as spikes occurring at Bark bands, displayed vertically, over time, shown horizontally. The model output is further processed using the ISI intervals between two adjacent spikes, ordered into a histogram for each Bark band, again vertically, over time. The histogram is calculated for the time windows and a 10-cent frequency discretization.
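As a rough sketch of this post-processing step, simplified to a single band (the actual implementation operates per Bark band and per time window), interspike intervals can be converted to f = 1/ISI and accumulated in 10-cent-wide logarithmic frequency bins; the frequency range and the 440 Hz example are assumptions for illustration.

```python
import numpy as np

def isi_histogram(spike_times, f_lo=50.0, f_hi=4000.0, cents=10):
    """Accumulate f = 1/ISI into logarithmic bins 10 cents wide
    (single-band sketch of the described post-processing)."""
    isi = np.diff(np.sort(spike_times))
    freqs = 1.0 / isi[isi > 0]                     # f = 1/ISI
    n_bins = int(round(1200 * np.log2(f_hi / f_lo) / cents))
    edges = f_lo * 2.0 ** (np.arange(n_bins + 1) * cents / 1200)
    counts, _ = np.histogram(freqs, bins=edges)
    return counts, edges

# Spikes locked to a hypothetical 440 Hz periodicity fall into a single bin
spikes = np.arange(44) / 440.0
counts, edges = isi_histogram(spikes)
k = int(np.argmax(counts))
print(edges[k], edges[k + 1])  # bin bracketing 440 Hz
```

A sharply peaked histogram thus indicates strongly coincident, periodicity-locked spiking, while a spread of counts across many bins indicates blurred bursts.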

 

-        Figures 1 and 2 are too small (on paper). The same holds for the labels/axis in figures 6 to 9.

I am sorry; I needed to display the figures on a single page to still have the captions. When zooming into the figures on a computer screen, the details appear.

-        The (very short) note on standard deviations may be extended / more clearly written? Using figures 10 & 11.

Text extended:

      It is also interesting to have a look at the standard deviations of the perception of roughness and separateness, shown in Fig. \ref{fig:perceptionroughnessstandarddeviation} and Fig. \ref{fig:perceptionseparatenessstandarddeviation}, respectively. This perceptual standard deviation between subjects is expected to be low with familiar sounds, where subjects agree in their perception. It is expected to be higher with unfamiliar sounds, for which subjects are not trained and therefore judge more individually.

     

       Indeed, for roughness, the guitar sounds are among the lowest values. Especially the unfamiliar hulusi sounds are perceived considerably differently between subjects. Nevertheless, especially for the smaller tone intervals up to a minor third, the differences between familiar and unfamiliar sounds are less prominent.

       

       For separateness, the standard deviations are highest for the guitar sounds, and for nearly all unfamiliar sounds subjects judged more similarly. The two gongs are among the lowest values; indeed, these sounds are special in a way that one can decide to hear either only one sound or many inharmonic partials.

       

       Clearly, in terms of the judgments' standard deviations, a difference between roughness and separateness is present. This is also seen when examining the correlation between the mean and the standard deviation of roughness perception, which is 0.62, pointing to an increased uncertainty of judgments at higher values. The correlation between mean and standard deviation for separateness, in contrast, is only 0.27, so these seem to be more independent.

-        I guess that the general reader of Applied Sciences needs some additional explanations, e.g., regarding ISI histograms and the interpretation (ISI: interspike intervals).

As mentioned above, I added an example figure and additional text. I hope the figure becomes more clear now.
