Article
Peer-Review Record

Analysis and Modeling of Timbre Perception Features in Musical Sounds

Appl. Sci. 2020, 10(3), 789; https://doi.org/10.3390/app10030789
by Wei Jiang 1,2,3, Jingyu Liu 1,2,3, Xiaoyi Zhang 1,2,3, Shuang Wang 1,2,3 and Yujian Jiang 1,2,3,*
Submission received: 25 December 2019 / Revised: 19 January 2020 / Accepted: 20 January 2020 / Published: 22 January 2020
(This article belongs to the Special Issue Musical Instruments: Acoustics and Vibration)

Round 1

Reviewer 1 Report

Thank you very much for the chance to review this manuscript. Here are a couple of small things that I noticed, by line number:

123: Instead of saying the ratio of males to females was "nearly 1:1", I think it's worth reporting the exact number of each (e.g. in lines 121-122 have "41 music professionals (x males)" or something like that). This is also the case on line 238.

171-172: For Tables 3 and 4, there seem to be a couple of errors. In Table 3, there are two "Consonants" in the left column. As far as I can guess from Table 4, I think the first Consonant is supposed to read "Coarse", but even so its correlation with Pure is listed as -.92 in Table 3 and -.93 in Table 4.

251-252: Could you elaborate on the method used to exclude the data from subjects that "may not have had a sufficient understanding of..." the experiment?

271-272: In the caption for figure 4, I think it might be helpful to have a sentence or two describing the uses of the different shapes and colors as well as the meaning of the colored dotted line if only to make it overly clear outside of the main body of the text.

310: I might be missing something, but if you require n^2/2 experiments for n samples, isn't that a polynomial or quadratic relationship instead of an exponential one (which I would define as 2^n experiments for n samples). Again, if I'm missing something and this is right in a context I'm unaware of that is fine, but I wanted to check.

Author Response

Dear Reviewer:

Thank you for your letter and comments concerning our manuscript entitled “Analysis and Modeling of Timbre Perception Features in Musical Sounds” (ID: applsci-691587). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet with approval. Revised portions are marked using the "Track Changes" function in Microsoft Word, and the necessary notes have been added.

In addition, in light of the review comments, we think it is necessary to supplement the relevant literature and explanations, so parts of the paper have been rewritten (mainly “Abstract”, “1. Introduction” and “6. Conclusion”). Because the revised paper adds this content and the structure of the article was adjusted slightly (“4. Construction of the Objective Acoustic Parameter Set” in the original manuscript became “4.1. Construction of the Objective Acoustic Parameter Set” in the new manuscript), the line numbers have changed considerably. To facilitate your review, the line numbers of the new manuscript are given in brackets. We apologize for any inconvenience these changes may cause. Thanks again!

The revised version is attached.

The main corrections to the paper and the responses to the comments are as follows:

 

Response to Reviewer 1 Comments

 

Point 1: 123: Instead of saying the ratio of males to females was "nearly 1:1", I think it's worth reporting the exact number of each (e.g. in lines 121-122 have "41 music professionals (x males)" or something like that). This is also the case on line 238.

 

Response 1: Thank you for your helpful comment. The exact numbers of subjects have been added.

(Line 158, Line 257)

 

Point 2: 171-172: For Tables 3 and 4, there seem to be a couple of errors. In Table 3, there are two "Consonants" in the left column. As far as I can guess from Table 4, I think the first Consonant is supposed to read "Coarse", but even so its correlation with Pure is listed as -.92 in Table 3 and -.93 in Table 4.

 

Response 2: Thank you for your valuable comment. We apologize for this serious mistake. Your guess is correct: the first "Consonant" in Table 3 should be "Coarse". (Line 190)

We checked the original data: the Coarse/Pure value of -0.93 in Table 5 should be -0.92. The table below shows the original experimental data. (Line 192)

 

A correlation matrix for 10 timbre evaluation terms.

| | bright | dark | sharp | vigorous | raspy | coarse | hoarse | consonant | mellow | pure |
|---|---|---|---|---|---|---|---|---|---|---|
| bright | 1.000 | -.985 | .903 | -.925 | .238 | -.477 | -.309 | .131 | -.265 | .472 |
| dark | -.985 | 1.000 | -.888 | .928 | -.201 | .492 | .330 | -.168 | .260 | -.477 |
| sharp | .903 | -.888 | 1.000 | -.927 | .583 | -.143 | .064 | -.240 | -.569 | .174 |
| vigorous | -.925 | .928 | -.927 | 1.000 | -.429 | .305 | .087 | .062 | .371 | -.280 |
| raspy | .238 | -.201 | .583 | -.429 | 1.000 | .610 | .740 | -.832 | -.819 | -.508 |
| coarse | -.477 | .492 | -.143 | .305 | .610 | 1.000 | .890 | -.820 | -.548 | -.918 |
| hoarse | -.309 | .330 | .064 | .087 | .740 | .890 | 1.000 | -.861 | -.618 | -.828 |
| consonant | .131 | -.168 | -.240 | .062 | -.832 | -.820 | -.861 | 1.000 | .785 | .751 |
| mellow | -.265 | .260 | -.569 | .371 | -.819 | -.548 | -.618 | .785 | 1.000 | .511 |
| pure | .472 | -.477 | .174 | -.280 | -.508 | -.918 | -.828 | .751 | .511 | 1.000 |

 

Point 3: 251-252: Could you elaborate on the method used to exclude the data from subjects that "may not have had a sufficient understanding of..." the experiment?

 

Response 3: Taking “Coarse/Pure” as an example, we first calculate the correlation matrix of the 34 subjects and then cluster the subjects' experimental data. The clustering diagram shows that the distance between subject 3 and the other subjects is the largest (25 in this case). We therefore infer that the data of subject 3 are unreliable and should be excluded. (Line 270-271)

The correlation matrix of 34 subjects
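
A screening of this kind can be illustrated with a minimal sketch. It is not the authors' procedure: instead of reading a clustering diagram, it simply flags the subject whose mean correlation distance to all other subjects is largest. The distance definition (1 - r), the function names and the toy ratings are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def least_consistent_subject(ratings):
    """Return (index, mean distance) of the subject farthest from the others.

    ratings[i] holds subject i's scores for the same set of samples.
    The distance between two subjects is taken as 1 - r (an assumption).
    """
    n = len(ratings)
    mean_dist = []
    for i in range(n):
        d = [1 - pearson(ratings[i], ratings[j]) for j in range(n) if j != i]
        mean_dist.append(sum(d) / len(d))
    worst = max(range(n), key=lambda i: mean_dist[i])
    return worst, mean_dist[worst]

# Hypothetical toy ratings on a 9-point scale (not data from the study).
toy = [
    [1, 3, 5, 7, 9],
    [2, 3, 5, 8, 9],
    [1, 4, 5, 7, 8],
    [9, 6, 5, 2, 1],  # rates in the opposite direction to the others
]
index, distance = least_consistent_subject(toy)
```

With real data, a hierarchical clustering of the same distance matrix would give the dendrogram view described in the response; the mean-distance shortcut only identifies the most isolated subject.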

 

Point 4: 271-272: In the caption for figure 4, I think it might be helpful to have a sentence or two describing the uses of the different shapes and colors as well as the meaning of the colored dotted line if only to make it overly clear outside of the main body of the text.

 

Response 4: Thank you for your valuable comment. Relevant descriptions have been added to the caption. (Line 290)

 

Point 5: 310: I might be missing something, but if you require n^2/2 experiments for n samples, isn't that a polynomial or quadratic relationship instead of an exponential one (which I would define as 2^n experiments for n samples). Again, if I'm missing something and this is right in a context I'm unaware of that is fine, but I wanted to check.

 

Response 5: Thank you for your valuable comment. We apologize for this serious mistake. n^2/2 is a quadratic relationship, not an exponential one. (Line 356)
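
The difference in growth can be made concrete with a short sketch. It assumes that each experiment compares one unordered pair of samples, so the exact count is n(n-1)/2, which grows like n^2/2; the function name is hypothetical.

```python
from math import comb

def paired_comparisons(n):
    """Number of unordered pairs among n samples: n(n-1)/2."""
    return comb(n, 2)

# Quadratic growth (pairs, n^2/2) next to exponential growth (2^n).
for n in (10, 20, 40):
    print(n, paired_comparisons(n), n ** 2 // 2, 2 ** n)
```

The pair count stays in the hundreds for these sample sizes, while 2^n quickly becomes astronomically large, which is the distinction the reviewer raises.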

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Review of « Analysis and modeling of timbre perception features in musical sounds » by W. Jiang, J. Liu, X. Zhang, S. Wang, submitted to Applied Sciences.

The manuscript is quite well written and the English is good throughout. The study itself is also well conducted, in my opinion. I have no major objection to publishing it, but I have a few remarks and questions; by answering them, the authors could improve the manuscript. Here are my remarks:

Line 91: Among the 72 instruments studied here, are there percussive instruments, for which the pitch cannot be defined?

Line 112: I can guess that most of the listeners were Chinese. Are they familiar with both Chinese and Western instruments? A naïve question maybe, but could it have a slight influence on the classification?

In Eq. 1: I believe \omega should be w_{ir} as in the text. Also in the following text, “the Rth dimension” should be the rth dimension. Finally, is the value of R 32 in this equation?

In Eq. 2: What is \hat{d_ij}? It is not defined.

Line 148: “These 2D points represent inflection points”. Do the authors mean that there is an inflection point at Dimensionality=2? In this case, the sentence can be better formulated.

Line 149: “that can be represented by a 2D plane”: because of the low stress for Dim=2, we can represent the data on a 2D plane?

Figure 2: How was this figure generated? Based on the distances only? Is it simply a result of the INDSCAL algorithm?

Figure 3 and Table 2: I don’t understand how the diagram is used to produce Table 2. Some more explanations are welcome here!

Line 220/221: Are the Guan and the Suona traditional Chinese instruments?

Table 5: In quantity, why 3, 4, 5, 12, 24? I understand that features with quantity>1 are time-dependent features, but why these specific values?

Line 260: “the dotted line represents the average value of each instrument in the corresponding dimension”: I don’t understand this well. Do the authors refer to the colored dotted lines: the blue line representing the average of the blue dots, the red line the average of the red dots, the yellow line the average of the yellow dots?

Figure 4: What do the colors represent? Three families of instruments?

Line 319: “Grade 9 was performed”: do the authors mean that the samples are rated on a 9-point scale as previously?

Line 335: “A 3D perception space was the produced using dimensionality reduction processing”. How exactly?

Line 335: In your experience, is the reduction processing well applicable to Western instruments as well?

Reference 6: Unlike the other references, in which the authors' first names are given as initials only, this one uses the full first name.

Author Response

Dear Reviewer:

Thank you for your letter and comments concerning our manuscript entitled “Analysis and Modeling of Timbre Perception Features in Musical Sounds” (ID: applsci-691587). The comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet with approval. Revised portions are marked using the "Track Changes" function in Microsoft Word, and the necessary notes have been added.

In addition, in light of the review comments, we think it is necessary to supplement the relevant literature and explanations, so parts of the paper have been rewritten (mainly “Abstract”, “1. Introduction” and “6. Conclusion”). Because the revised paper adds this content and the structure of the article was adjusted slightly (“4. Construction of the Objective Acoustic Parameter Set” in the original manuscript became “4.1. Construction of the Objective Acoustic Parameter Set” in the new manuscript), the line numbers have changed considerably. To facilitate your review, the line numbers of the new manuscript are given in brackets. We apologize for any inconvenience these changes may cause. Thanks again!

The new manuscript is attached.

The main corrections to the paper and the responses to the comments are as follows:

 

Response to Reviewer 2 Comments

 

Point 1: Line 91: Among the 72 instruments studied here, are there percussive instruments, for which the pitch cannot be defined?

 

Response 1: Percussion instruments can be divided into two types: those that produce a stable pitch and those that do not (pitch cannot be defined). The percussion instruments used in this paper all produce a stable pitch, that is, their pitch can be defined. They include 3 Chinese orchestra percussion instruments (Bell chimes, Bianqing, Yunluo) and 5 Western orchestra percussion instruments (Celesta, Vibraphone, Chimes, Xylophone, Marimba). A list of instruments has been added at the end of the paper (Appendix A). Three Chinese percussion instruments are shown below. (Line 128)

     

[Images: Bell chimes, Bianqing, Yunluo]

 

Point 2: Line 112: I can guess that most of the listeners were Chinese. Are they familiar with both Chinese and Western instruments? A naïve question maybe, but could it have a slight influence on the classification?

 

Response 2: Thank you for your valuable comment. All the listeners were Chinese. They are all students or teachers of the conservatory of music, have more than 10 years of experience in playing instruments, and are familiar with both Chinese and Western instruments. Studies have shown that nationality, cultural background, customs, language, and environment inevitably affect the cognition of timbre [7-11] (Line 51). Few Chinese subjects have participated in timbre perception experiments in existing studies; addressing this gap is part of the purpose and significance of this paper. In our next study, we plan to examine the differences in timbre perception between Chinese and Western subjects. (new manuscript Line 148)

 

Point 3: In Eq. 1 : I believe \omega should be w_{ir} as in the text. Also in the following text, “the Rth dimension” should be the rth dimension. Finally, is the value of R 32 in this equation ?

 

Response 3: Thank you for your valuable comment. We apologize for this mistake; your comment is correct. In addition, MDS is used in both Section 3.3 (Line 166) and Section 5.2 (Line 397) of this paper. Since MDS is an auxiliary method in Section 3.3 and the main method for constructing the timbre space in Section 5.2, the formula has been moved to Section 5.2 (Line 406). In Section 3.3, R is 32 (32 timbre evaluation terms), and in Section 5.2, R is 16 (16 timbre evaluation terms).

 

Point 4: In Eq. 2 : what is \hat{d_ij} ? It is not defined.

 

Response 4: Thank you for your helpful comment. The stress function can also be understood as a loss function. MDS is a dimensionality reduction method: d_{ij} represents the distance between samples in the original space, and \hat{d}_{ij} represents the distance between samples in the reduced space. We want the distances in the reduced space to be as close as possible to those in the original space, that is, we want the stress function to be as small as possible.

The MDS in Section 3.3 is only an auxiliary method used to demonstrate redundancy among the 32 timbre evaluation terms; the 16 timbre evaluation terms are obtained mainly from the clustering results. Therefore, this formula has been removed from Section 3.3.
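
Since the formula itself is not reproduced in this record, the following is only a sketch of one common form of such a loss, in the style of Kruskal's stress-1. The normalization by the original distances and the function name are assumptions, not the manuscript's definition; the roles of d and d_hat follow the response above.

```python
from math import sqrt

def stress(d, d_hat):
    """Normalized stress between two symmetric distance matrices.

    d[i][j] is the distance between samples i and j in the original
    space; d_hat[i][j] is their distance in the reduced space.
    Returns 0.0 when the reduced space preserves every distance.
    """
    num = 0.0
    den = 0.0
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            num += (d[i][j] - d_hat[i][j]) ** 2
            den += d[i][j] ** 2
    return sqrt(num / den)
```

A smaller value means that less of the original distance structure is lost in the reduced space, which is the sense in which the response calls the stress a loss function.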

 

 

Point 5: Line 148 “These 2D points represent inflection points”. Do the authors mean that : there is an inflection point at Dimensionality=2 ? In this case, the sentence can be better formulated.

 

Response 5: Thank you for your helpful comment. That is exactly what we meant. When the dimensionality is 2, the 32 timbre evaluation terms can be compressed into a 2-dimensional space. Because MDS is only an auxiliary method here, this figure has been removed from the new manuscript.

 

 

Point 6: Line 149 : “that can be represented by a 2D plane” : because of low stress for Dim=2 we can represent the data on a 2D plane ?

 

Response 6: When determining the number of dimensions, we want the stress to be as small as possible (the smaller the stress, the less information is lost) and the number of dimensions to be as small as possible, so we generally choose the inflection point as the dimensionality. In this example, the stress is minimized when the dimensionality is 10. However, our goal is to reduce the dimensionality as far as possible, so we choose the inflection point (Dimensionality=2). (new manuscript Line 176)

We apologize for the inconvenience that the imperfections in our work have caused during your review.
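
The trade-off described in Response 6 can be sketched as a simple elbow search over a stress curve. The heuristic used here (largest discrete second difference) is one of several possible choices and is not claimed to be what the authors or SPSS use; the stress values are made up for illustration and are not taken from the manuscript.

```python
def elbow_dimension(stress_by_dim):
    """Pick the dimensionality where the stress curve bends most sharply.

    stress_by_dim[k] is the stress obtained with k + 1 dimensions.
    The bend is scored by the discrete second difference, so the first
    and last dimensionalities are never selected.
    """
    best_dim, best_bend = None, float("-inf")
    for k in range(1, len(stress_by_dim) - 1):
        bend = stress_by_dim[k - 1] - 2 * stress_by_dim[k] + stress_by_dim[k + 1]
        if bend > best_bend:
            best_dim, best_bend = k + 1, bend
    return best_dim

# Hypothetical stress values for 1 to 10 dimensions (illustrative only).
toy_stress = [0.40, 0.12, 0.09, 0.07, 0.06, 0.05, 0.04, 0.035, 0.03, 0.028]
chosen = elbow_dimension(toy_stress)
lowest = toy_stress.index(min(toy_stress)) + 1
```

In this toy curve the stress is lowest at 10 dimensions, but the sharpest bend is at 2, mirroring the reasoning in the response: the elbow is preferred over the global minimum because the aim is a low-dimensional representation.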

 

Point 7: Figure 2: How was this figure generated? Based on the distances only? Is it simply a result of the INDSCAL algorithm?

 

Response 7: Thank you for your helpful comment. In general, the algorithm consists of three steps: (Line 176)

1. Obtain the dissimilarity matrix of the evaluation objects (here, the correlation matrix for the 32 evaluation terms, as shown in the table below).
2. Calculate the stress function for different dimensionalities (usually 1 to 10) and determine the number of dimensions from the stress function graph (usually by choosing the inflection point).
3. Use the MDS algorithm to calculate the coordinates of each evaluation object. If Dim=2, each evaluation object has a pair of x and y coordinates (as shown in the table below). Draw the graph from these coordinates.

This process can be completed using the SPSS software (dimensionality reduction function).

 

 

 

Table. The correlation matrix for the 32 evaluation terms.

 

Table. The x, y coordinates for the 32 evaluation terms.

 

Point 8: Figure 3 and table 2 : I don’t understand how the diagram is used to produce table 2. Some more explanations are welcome here !

 

Response 8: Thank you for your helpful comment. In the figure below, the terms marked with a red arrow are retained. The basic rule is: if the distance between two terms is less than 5, one of them is kept and the other is left out. The selection was carried out by five music experts, who chose the term that is used more often in descriptions of timbre and is familiar to everyone. For example, the distance between shrill and sharp is one. After discussion, the music experts judged that sharp is used more often, so they chose sharp. There are exceptions, however, such as silvery and slim, which are typical terms for timbre description; music experts have long argued that both should be retained.

This strategy for choosing terms may seem somewhat subjective, but we believe that the clustering results can serve only as a reference, and that professional musical knowledge, language habits, cultural background and familiarity should also be considered. These timbre evaluation terms are used in the later subjective evaluation experiments. If the subjects were unfamiliar with them or could not understand them well, the experimental results would be greatly affected. (Line 179)
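
The selection rule in Response 8 can be written down as a small sketch. It is only an approximation of a process that the authors describe as expert judgment: the preference order stands in for the experts' view of which term is used more, and the exception list for pairs they chose to keep. Only the shrill/sharp distance of 1 comes from the response; the other distances, the ordering and all names are hypothetical.

```python
def select_terms(distances, preference, keep_both=(), threshold=5):
    """Thin out near-synonymous terms.

    distances: {(term_a, term_b): cluster distance}
    preference: terms ordered from most to least commonly used
    keep_both: pairs retained in full despite being close together
    """
    rank = {term: i for i, term in enumerate(preference)}
    exempt = {frozenset(pair) for pair in keep_both}
    dropped = set()
    # Handle the closest pairs first.
    for (a, b), dist in sorted(distances.items(), key=lambda item: item[1]):
        if dist >= threshold or frozenset((a, b)) in exempt:
            continue
        if a in dropped or b in dropped:
            continue
        # Drop whichever term is lower in the preference order.
        dropped.add(a if rank[a] > rank[b] else b)
    return [t for t in preference if t not in dropped]

distances = {
    ("shrill", "sharp"): 1,     # from the response
    ("silvery", "slim"): 3,     # hypothetical value
    ("sharp", "silvery"): 12,   # hypothetical value
}
kept = select_terms(
    distances,
    preference=["sharp", "silvery", "slim", "shrill"],
    keep_both=[("silvery", "slim")],
)
```

Here shrill is dropped in favor of sharp, while silvery and slim both survive because they are listed as an exception.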

 

 

Point 9: Line 220/221 : Are the Guan and the Suona traditional Chinese instruments ?

 

Response 9: (new manuscript Line 245) The Guan and the Suona are traditional Chinese instruments. The Guan is a double-reed instrument that includes the Soprano Guan, Alto Guan, Bass Guan and Doublebass Guan. The Suona is also a double-reed instrument and includes the Soprano Suona, Alto Suona, Tenor Suona and Bass Suona. A list of instruments has been added at the end of the paper (Appendix A).

 

[Images: Soprano Guan, Soprano Suona]

 

Point 10: Table 5 : in quantity, why 3,4,5,12,24 ? I understand that feature with quantity>1 are time dependent feature, but why these specific values ?

 

Response 10: Thank you for your helpful comment. Some features are time-dependent; in addition, other features (such as Centroid and Spread) can be calculated from different frequency-domain representations (such as the spectrum and the harmonic spectrum). In this paper, Timbre Toolbox [64] and MIRtoolbox [68] were used for feature extraction. (Line 250) The following figure details the extraction process for temporal, energy, spectral, harmonic and perceptual descriptors in Timbre Toolbox. (Line 253)

 

 

Point 11: Line 260 : “the dotted line represents the average value of each instrument in the corresponding dimension” : I don’t understand well. Do the authors refer to the colored dotted lines: the blue line representing the average of the blue dots, the red line the average of the red dots, yellow line the average of the yellow dots ?

Figure 4 : what do the colors represent ? Three families of instruments ?

 

Response 11: Thank you for your helpful comment. Your understanding is correct. In Figure 4, the blue square represents Western orchestra instruments, the yellow triangle represents Chinese minority instruments, and the red circle represents Chinese orchestra instruments. We have corrected it. (Line 290)

 

Point 12: Line 319 “Grade 9 was performed” : do the authors mean that the samples are rated on a 9 points scale as previously ?

 

Response 12: Thank you for your helpful comment. We mean that the samples are rated on a 9-point scale, as previously. We have corrected it. (Line 396)

 

Point 13: Line 335: “A 3D perception space was the produced using dimensionality reduction processing”. How exactly?

 

Response 13: (new manuscript Line 414) The specific steps are similar to those in Response 7. The difference is that the dimensionality is determined with reference to the results of principal component analysis (PCA). We applied PCA to the experimental data for the 16 timbre evaluation terms and obtained the following results. The values in the table are the loadings of the timbre terms on three factors (only values greater than 0.5 or less than -0.5 are retained). The higher the loading, the more strongly the factor is associated with the timbre term. The PCA results show that three factors can be extracted from the 16 terms, and these three factors are orthogonal. Thus, it can be concluded that the dimensionality of the timbre space is 3.

 

 

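| Factor component | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|
| 柔和 (Mellow) | .971 | | |
| 协和 (Consonant) | .896 | | |
| 丰满 (Full) | .833 | | |
| 干瘪 (Raspy) | -.828 | | |
| 嘶哑 (Hoarse) | -.768 | -.541 | |
| 尖锐 (Sharp) | -.739 | | |
| 单薄 (Thin) | -.727 | | |
| 清脆 (Silvery) | | .913 | |
| 明亮 (Bright) | | .897 | |
| 纤细 (Slim) | | .819 | |
| 纯净 (Pure) | .537 | .761 | |
| 暗淡 (Dark) | | -.703 | .588 |
| 粗糙 (Coarse) | -.632 | -.674 | |
| 混浊 (Muddy) | | -.614 | .538 |
| 浑厚 (Vigorous) | | | .910 |
| 厚实 (Thick) | | | .803 |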
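
For the terms that load on more than one factor, the reading of the table can be illustrated with a short sketch that assigns each term to the factor on which its absolute loading is largest. The values are copied from the table above; suppressed entries are represented as None because their actual values are not reported. The assignment rule and the function name are assumptions for illustration, not a step described by the authors.

```python
# Retained loadings for the cross-loading terms, copied from the table.
# None marks a suppressed entry (absolute value not above 0.5).
loadings = {
    "Hoarse": (-0.768, -0.541, None),
    "Pure": (0.537, 0.761, None),
    "Dark": (None, -0.703, 0.588),
    "Coarse": (-0.632, -0.674, None),
    "Muddy": (None, -0.614, 0.538),
}

def dominant_factor(row):
    """Return the 1-based index of the factor with the largest |loading|."""
    shown = [(abs(v), k + 1) for k, v in enumerate(row) if v is not None]
    return max(shown)[1]

assignment = {term: dominant_factor(row) for term, row in loadings.items()}
```

Under this rule Hoarse is grouped with factor 1, while Pure, Dark, Coarse and Muddy are grouped with factor 2, even though each also has a secondary loading above the 0.5 cutoff.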

 

Point 14: Line 335 : In your experience, is the reduction processing well applicable to western instruments as well ?

 

Response 14: (new manuscript Line 414) Our research drew on a large number of studies of Western instruments that use this method [12, 13, 51-56]. The reduction processing is well applicable to Western instruments. In the introduction, we summarize the research that uses this method (Line 81-104).

 

Point 15: Reference 6 : contrary to the others references in which the first name of the authors are written with first initial only, this one uses the full first name.

 

Response 15: Thank you for your helpful comment. We have corrected it. (Line 481)

Author Response File: Author Response.pdf
