Peer-Review Record

Crossband Filtering for Weighted Prediction Error-Based Speech Dereverberation

Appl. Sci. 2023, 13(17), 9537; https://doi.org/10.3390/app13179537
by Tomer Rosenbaum 1,*, Israel Cohen 1 and Emil Winebrand 2
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 25 June 2023 / Revised: 19 August 2023 / Accepted: 22 August 2023 / Published: 23 August 2023
(This article belongs to the Special Issue Automatic Speech Signal Processing)

Round 1

Reviewer 1 Report

This paper introduces an extended version of WPE, in which the crossband filters between adjacent subbands are considered to improve the accuracy of the model approximation in the STFT domain. Experimental results verify the validity of the proposed method. This paper is technically sound, and I have only some minor comments:

1. More explanation is needed to clarify why the crossband filters are necessary. A linear time-invariant (LTI) system changes only the phase and magnitude of the input signal, i.e., Y(\omega) = X(\omega)H(\omega). If this holds strictly, crossband filters are unnecessary. What happens here? Is it because the filterbank is not ideal, or because of the intercorrelation of speech between adjacent subbands? In my experience, the analysis and synthesis windows have a significant influence on the performance of WPE.

2. Before Eq. (3), it remains unclear why the contribution of the crossband filters cannot be neglected if the system is linear time-invariant.

3. The impact of noise should be evaluated in Section 4, and some discussion of it should also be included in Section 5.

4. Some editorial revisions can be found in the attached file.

Comments for author File: Comments.pdf

English is good enough, and only some very minor revisions are necessary before publication.

Author Response

Thank you for your feedback.

  1. Regarding the crossband filtering: in LTI systems, the relation Y(\omega) = X(\omega)H(\omega) holds in the frequency domain. In the STFT domain, however, the exact relation between the input and output signals requires crossband filtering, as shown in Eq. (2). It is related to the intercorrelation of speech between adjacent subbands. A comprehensive analysis of this relation is available here: https://israelcohen.com/wp-content/uploads/2018/05/TASL_May2007.pdf. I also added more explanations to the introduction to clarify this issue.
  2. See the previous answer. Eq. (3) is not the exact relation in the STFT domain; it is an approximation.
  3. Thanks for pointing it out; I revised the paper according to the notes.
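
The distinction drawn in this response can be checked numerically. The sketch below is an illustration only: it is not the authors' method and it is not WPE. It convolves white noise with a synthetic, exponentially decaying impulse response, takes an STFT with a Hann window, and fits one output band by least squares, first from the same input band only and then also from its two adjacent bands. The names `stft` and `fit_error` and all parameter values (window length 64, hop 16, 12 filter taps, band 10) are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def stft(signal, win_len=64, hop=16):
    """STFT with a Hann analysis window; returns shape (frames, bins)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def fit_error(X, Y, k, n_cross, n_taps=12):
    """Least-squares fit of output band k from input bands k-n_cross..k+n_cross.

    Each input band contributes n_taps causal frame delays.
    Returns the residual energy normalized by the energy of Y[:, k].
    """
    n_frames, n_bins = X.shape
    bands = [b for b in range(k - n_cross, k + n_cross + 1) if 0 <= b < n_bins]
    columns = []
    for b in bands:
        for p in range(n_taps):
            col = np.zeros(n_frames, dtype=complex)
            col[p:] = X[:n_frames - p, b]  # band b delayed by p frames
            columns.append(col)
    A = np.stack(columns, axis=1)
    coeffs, *_ = np.linalg.lstsq(A, Y[:, k], rcond=None)
    residual = Y[:, k] - A @ coeffs
    return float(np.sum(np.abs(residual) ** 2) / np.sum(np.abs(Y[:, k]) ** 2))

# Synthetic data (assumed for illustration, not taken from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)                                  # white-noise input
h = rng.standard_normal(128) * np.exp(-np.arange(128) / 30.0)   # decaying impulse response
y = np.convolve(x, h)[:len(x)]                                  # LTI output

X, Y = stft(x), stft(y)
err_band = fit_error(X, Y, k=10, n_cross=0)    # band-to-band filter only
err_cross = fit_error(X, Y, k=10, n_cross=1)   # plus the two adjacent crossband filters

print(f"band-to-band only:        {err_band:.4f}")
print(f"with adjacent crossbands: {err_cross:.4f}")
```

Because the two models are nested, the in-sample residual cannot increase when bands are added. What is informative is the size of the drop, which indicates how much of the output band the band-to-band model alone leaves unexplained even though the underlying system is exactly LTI.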

Reviewer 2 Report

The paper presents an extension of the traditional weighted prediction error (WPE) method that improves the accuracy of the model fit by including crossband filters. The approach aims to capture interdependencies between neighboring subbands and to refine the estimate of late reflections in the observed speech signal. Using samples from adjacent subbands, the authors redefine the WPE observation vector and modify the inverse prediction filter to include both the crossband components and the traditional band-to-band component. Two versions of the proposed method are studied. The first version favors a more accurate model approximation at the cost of increased computational complexity. The second version retains the same computational complexity as regular WPE but improves accuracy. The second version demonstrates competitive performance compared to the first.

The article may be published.

Author Response

Thank you for your feedback.

Reviewer 3 Report

The paper proposes two versions of the Weighted Prediction Error-Based Speech Dereverberation method with crossband filtering, one with greater computational complexity than the original method and one with comparable complexity. Simulation results show that both improve on the performance of the original method. The paper is a case study of an applied nature: simulation studies are conducted to shed light on the parameterisation of the methods in one particular case. However, no attempt was made to determine general parameterisation rules, so the research results are difficult to generalise.

The choice of experimental setup and quality measures is questionable. It seems that the method is robust to a fairly wide range of parameters, and this is the thesis that needs to be tested and justified.

Research on the parameterisation of a given algorithm should aim at obtaining the most general possible set of rules for the widest possible range of applications. Here we have an overdetailed discussion of the problem for questionably selected measures, for one location in one room, for 10 speakers, with no disturbances. This is why the research should be rethought and the experiments conducted again, to gain real insight into the parameterisation of these methods and to formulate more general conclusions.

 

1. The choice of the experimental setup.

- While the experiments are simulated, what prevents you from testing more than one configuration?

- What effect would placing the microphone array in additional positions have on the results?

- How does the shape of the room affect the parameterization and operation of the methods?

- How do the selected measures vary across speakers (line 267)?

- Why is x uniformly distributed from 2.936 to 2.999? How is the distance of 6.3 cm related to the size of the enclosure (8 m length), to the distance between the speaker and the microphone, or perhaps to the kind of sound?

- Why N = 3? (l. 229)

Answers to the above doubts would justify whether one particular setup was chosen, and whether the results and conclusions can be generalized to a wider range of applications.

This study is contrasted with the study in [30], which "had limited scope and did not provide a comprehensive analysis of the effectiveness of the WPE-based approach incorporating crossband filtering in various real-world scenarios", yet it suffers from the same problems.

 

2. The performance measures used are questionable.

 These are three different measures, chosen from among others.

- Why were these chosen?

- Their formulas and calculation conditions should be given directly. The cited literature does not define these formulas properly and instead refers to further sources. The parameters adopted for these specific calculations should also be given.

- What are the absolute value ranges for these measures? Do the presented differences really correspond to the listener's impression and to an audible improvement? How does an improvement of 1.1 in ΔFWSegSNR compare to an improvement of 1.1 in ΔCD or ΔPESQ? Perhaps percentage measures would be more adequate. Are the improvements presented later noticeable? How significant is a performance gain of 0.03 in ΔCD between the cases Lbb=20 and Lbb=14?

Further, the use of the proposed measures is questionable to me, especially concerning the choice of the optimal filter lengths based on the information in Fig. 1. The curves showing the dependence of the measures on Lbb are flat over a wide range around their highest values, so the performance measures for the selected optimal Lbb values barely stand out. E.g., both ΔFWSegSNR and ΔCD have almost the same value for Lbb from 10 to 17. The subsequent figures present very similar curves for the three measures and for three Lbb values. The differences between the values in Table 1 suggest that it would be sufficient to choose Lbb=15, and similar conclusions could be drawn. The measure curves for the two investigated methods could then be shown on one graph to visualise the differences between them.

Further, the curves of the measures as functions of the examined factors are very similar. Would it not be worth choosing other measures and checking whether the results are the same?

The research is overdetailed and, unfortunately, there is no comprehensible summary that could generalise the conclusions.

 

The article contains phrases stating that the observations are intriguing or surprising. Attempts should be made to explain these findings. Has an attempt been made to explain why a larger filter length deteriorates the performance of the method (p. 4.4)?

Author Response

Thanks for the feedback.

  1. Thanks for pointing out this issue. According to this note, we conducted simulations for different room configurations, including different microphone spacings, different array positions, and different speaker positions. The results were consistent and the performance was very similar. We mentioned this in the revised paper. In addition, we explained why we chose N=3 (we empirically observed that it is sufficient for convergence).
  2. Thanks for the comment. The measures were chosen because they are widely used to evaluate the performance of dereverberation methods. An explanation of this issue (including a reference) was added to the revised paper. Furthermore, references to the papers that provide the formulas of the proposed measures were added to the revised paper as well. Following your advice, we replaced the absolute measures with percentage measures and presented the standard deviation with respect to the different speakers.
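
The switch from absolute to percentage measures can be illustrated with a short calculation. The sketch below is an assumption about how such a percentage measure might be defined (change relative to the unprocessed baseline); the paper's exact definition is not stated in this record. The per-speaker scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

def percent_change(before, after):
    """Relative change in percent with respect to the baseline value.

    Assumed definition for illustration. For measures where lower is
    better (e.g. a distance measure), an improvement shows up as a
    negative value.
    """
    return 100.0 * (after - before) / abs(before)

# The same absolute gain of 1.1 is a very different relative gain
# depending on the baseline of the measure.
gain_low_baseline = percent_change(2.0, 3.1)     # baseline 2.0
gain_high_baseline = percent_change(10.0, 11.1)  # baseline 10.0

# Hypothetical per-speaker scores before and after processing.
scores_before = [1.8, 2.0, 2.2]
scores_after = [2.1, 2.4, 2.5]
gains = [percent_change(b, a) for b, a in zip(scores_before, scores_after)]

print(f"mean gain: {mean(gains):.1f} %, std across speakers: {stdev(gains):.1f} %")
```

Reporting the mean together with the standard deviation across speakers indicates whether a small average difference between two settings is meaningful relative to the speaker-to-speaker spread.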

Round 2

Reviewer 3 Report

ad. 1. The statement starting in line 221 is far too general and not convincing. In my opinion, the authors should describe the laboratory setup for the additional experiments.

ad. 2. In my opinion, the measures' formulas and calculation conditions should be given directly, because too many parameters have to be specified to refer the reader to literature that in turn refers further. Currently, however, it is common practice to refer to definitions known from the literature, which should not be a problem for readers interested in the topic.

Unfortunately, these shortcomings and the narrow scope of the research make the paper interesting to only a very narrow group of readers.

Author Response

Thank you for your comments.

  1. We described and added more info regarding the different configurations for the experiments.
  2. The references were replaced with more relevant ones that present the formulas for the measures directly, without further referring. In addition, we added the direct formulations.