Article
Peer-Review Record

Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform†

Appl. Sci. 2023, 13(10), 5992; https://doi.org/10.3390/app13105992
by Yan-Tong Chen, Zong-Tai Wu and Jeih-Weih Hung *
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 20 March 2023 / Revised: 4 May 2023 / Accepted: 10 May 2023 / Published: 12 May 2023

Round 1

Reviewer 1 Report

This article presents a method that exploits sub-signals residing in multiple acoustic frequency bands of the time domain and integrates them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the ultimate features. The article introduces a novel SE framework that exploits short-time DWT data as one of the sources for creating the encoding features of encoder-decoder speech enhancement architectures such as Conv-TasNet and DPTNet.

The article proposes a speech enhancement model with two-level bi-projection fusion of the discrete wavelet transform within a new SE framework. It also proposes using wavelet-domain features as a substitute for frequency-domain features, in the hope of bringing about further SE improvement.

The authors report experimental results for Conv-TasNet on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and similar experiments for DPTNet on the two tasks. However, the further results for the proposed methods (spectrograms and mean-square-error comparison) are presented only for Conv-TasNet. Therefore, I believe that the DPTNet architecture is only partially evaluated, which creates the impression of an unfinished article.

There is an error in the caption of Table A6, "The MSE loss of onv-TasNet using different domains of features for the test set in noise."; it should read "The MSE loss of Conv-TasNet…".

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

1. It is recommended that Figure A2 follow Figure A1, to prevent the impression that 1-D convolutions do not have a block structure.

2. I suggest explaining the reason for using PReLU instead of ReLU, i.e., briefly describing the advantages of PReLU.

3. Please briefly explain why the 'db2' wavelet function is selected (line 415).

4. When describing the tests, the paper could briefly introduce the hyperparameter settings used for model training, such as the learning rate and optimizer type.

5. In Section 3.1, the SNRs of the QUT dataset are not stated as they are for DEMAND. Moreover, the stated SNRs of QUT do not correspond to the SNRs in DEMAND, which invalidates the comparison between the DEMAND and QUT tasks in the analysis. If you want to investigate the influence of different noise types, you should verify the SNRs of QUT and control the variables.

6. In the tables of Sections 3.2 and 3.3, why not add the addition and concatenation integration manners to the time plus two-level DWT feature-domain experiment? According to the time plus one-level DWT feature-domain experiment, the addition and concatenation integration manners may achieve better results. Does the same hold for the two-level DWT feature-domain experiment?

7. In the results of Sections 3.2 and 3.3, line 447 states that DWT exhibits significantly better PESQ results than STFT, although the difference is at most 0.042, and line 507 makes a similar claim. Is this exaggerated, or is PESQ accurate enough that such a difference plays a vital role in signal quality? Please explain the importance of PESQ.

8. The experiment in Section 3.5 does not state the SNRs and noise types. Judging from Figure A16(b), the SNR may not be low, and the effect of the SE network is not very clear.

9. According to Table A5 in Section 3.6, the MSE is clearly influenced by the different integration manners. Why, then, do the metrics in the experiments of Sections 3.2 and 3.3 show mixed advantages and disadvantages relative to each other? Which metric mainly reflects the quality of the signal?

10. Line 551 concludes that, in particular, DWT features perform better than STFT features as the fusion component of the encoder in most cases. However, this does not seem to hold exactly according to the tables in Sections 3.2 and 3.3; this conclusion may not be appropriate.

11. Please check the typesetting problems: for example, the blank space on page 10 is too large; Table A2 appears within the analysis of Table A3, which is inconvenient; and Figures 16 and 17 would be better merged into a single figure.

12. In line 419, three objective metrics are used, but in the following text only the formula for SI-SNR is provided; the formulas for PESQ and STOI are not. It would be best to list them all, so that the differences among the three can be seen intuitively.
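For context on this comment: among the three metrics, only SI-SNR has a compact closed form (PESQ and STOI are defined procedurally in their standards). A minimal numpy sketch of the standard SI-SNR definition, not taken from the paper under review:

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB) between an enhanced signal and the clean reference."""
    estimate = estimate - estimate.mean()      # remove DC so the measure ignores offsets
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the scale-invariant target component.
    s_target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
print(si_snr(2.0 * clean, clean))                         # very high: rescaling is ignored
print(si_snr(clean + rng.standard_normal(16000), clean))  # roughly 0 dB for equal-power noise
```

Because the target is a projection onto the reference, multiplying the estimate by any constant leaves the score unchanged, which is exactly the "scale-invariant" property.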

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

This study primarily investigates how to extract information from time-domain utterances to create more effective features in speech enhancement. The discrete wavelet transform (DWT) is applied to decompose each input frame signal to obtain sub-band signals, and a projection fusion process is performed on these signals to create the ultimate features.

 

The corresponding fusion strategy is either bi-projection fusion (BPF) or multiple projection fusion (MPF). In short, MPF exploits the softmax function in place of the sigmoid function to create ratio masks for multiple feature sources. The concatenation of fused DWT features and time features serves as the encoder output of two celebrated SE frameworks, the fully-convolutional time-domain audio separation network (Conv-TasNet) and the dual-path Transformer network (DPTNet), to estimate the mask and then produce the enhanced time-domain utterances.
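The sigmoid-versus-softmax distinction summarized above can be illustrated with a toy numpy sketch. Note that in the paper the projections are learned layers; the direct gate below is purely illustrative: BPF's sigmoid yields two complementary ratio masks, while MPF's softmax generalizes to K sources.

```python
import numpy as np

def bpf_masks(e1, e2):
    """Bi-projection fusion gate: a sigmoid yields two ratio masks that sum to 1."""
    m1 = 1.0 / (1.0 + np.exp(-(e1 - e2)))  # toy gate; the paper uses learned projections
    return m1, 1.0 - m1

def mpf_masks(*embeddings):
    """Multiple-projection fusion: a softmax over K sources yields K ratio masks."""
    e = np.stack(embeddings)
    e = e - e.max(axis=0, keepdims=True)   # subtract the max for numerical stability
    w = np.exp(e)
    return w / w.sum(axis=0, keepdims=True)

# Three toy feature maps of shape (channels, frames).
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 4, 5))
m1, m2 = bpf_masks(a, b)
masks = mpf_masks(a, b, c)
print(np.allclose(m1 + m2, 1.0), np.allclose(masks.sum(axis=0), 1.0))  # True True
```

With two sources the softmax reduces to the sigmoid of the difference, so BPF is the two-source special case of MPF.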

 

This is a high-quality paper, and the experimental results are comprehensive. The only shortcoming is the lack of a large improvement over the baseline of paper [3].

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

This manuscript proposes an approach to speech enhancement that fuses time-domain features with discrete wavelet transform-based features in the fully-convolutional time-domain audio separation network and the dual-path Transformer network. The topic is very interesting and trending. Unfortunately, the presentation has some weak points, explained in the reviewer's comments, that should be addressed in the revision phase. This is why this reviewer's recommendation is a major revision.

 

General comments:

- On page 3, in lines 92 to 108, certain benefits of using redundant information (multiple-domain features plus the original form) are emphasized. However, such an approach also has some drawbacks, e.g., enlargement of the input or the introduction of a potential bias, and it would be beneficial to state/discuss these drawbacks as well.

- The introduction (almost two pages long) has only a single short paragraph describing what is done in this study. It would be useful to add here some more details about the present research, the content and organization of the manuscript. Besides, it seems that the motivation for carrying out this research is missing. Moreover, it would be beneficial to clearly and explicitly state the contributions of this study in a single paragraph, and not distributed throughout the manuscript as presently is the case.

- In the classical application of the wavelet transform, the frequency separating the low-pass (approximation coefficients) and high-pass (detail coefficients) parts of the decomposed signal is a quarter of the sampling frequency (half of the Nyquist frequency). In this way, the whole procedure depends on the choice of the sampling frequency. If the sampling frequency is too high, the high-pass part will not contain many important components of speech. It would be beneficial to include a short discussion of this matter.
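The split the reviewer describes can be demonstrated with a one-level Haar DWT (a simpler wavelet than the paper's db2, used here purely for illustration): a tone below a quarter of the sampling rate lands mainly in the approximation coefficients, and a tone above it mainly in the detail coefficients.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: pairwise averages -> approximation (low-pass),
    pairwise differences -> detail (high-pass)."""
    x = x[: len(x) // 2 * 2]              # truncate to even length
    cA = (x[0::2] + x[1::2]) / np.sqrt(2)
    cD = (x[0::2] - x[1::2]) / np.sqrt(2)
    return cA, cD

fs = 16000
t = np.arange(fs) / fs
for f in (1000, 6000):                    # below / above fs/4 = 4 kHz
    cA, cD = haar_dwt(np.sin(2 * np.pi * f * t))
    band = "approximation (low-pass)" if cA @ cA > cD @ cD else "detail (high-pass)"
    print(f"{f} Hz -> mostly {band}")
```

Because the Haar filters are far from brick-wall, tones near the 4 kHz edge leak into both halves; longer wavelets such as db2 sharpen, but never eliminate, this overlap.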

- On page 11, in lines 308 and 309, it is stated: "Furthermore, in common noisy scenarios, WA has a higher signal-to-noise ratio (SNR) than WD." Since WA is the low-pass part of the signal, it would be useful to justify this statement as noise in real environments typically has a pink spectrum with higher levels at low frequencies.

- Since the DWT can be performed with different levels of decomposition, it could be interesting to see what results can be obtained using decomposition levels higher than the 1 and 2 applied here. At the very least, it seems worthwhile to mention this option in the manuscript.

- The training set contains 11572 utterances, and 200 of them are used for the validation. That means that 200/11572 = 1.7 % of the training set is applied for validation. What is the justification for such a ratio of training and validation samples?

- On page 16, in lines 434 and 435, it is stated: "These features are created by integrating time-domain features with any of STFT-domain (features)". Unfortunately, it seems that it is not stated which STFT-domain features are used here. Since an important part of discussion and validation of the proposed method is related to comparison of Conv-TasNet when the STFT features are replaced with DWT features, it is very important to explicitly state which STFT features are used (pure spectrogram, or its modification, or …). In this way, the validation of the proposed method depends on the choice of STFT feature(s).

- The differences in the results given in Table A1 for the chosen metrics are rather small (those obtained using features from multiple domains: the time domain plus STFT or DWT). The situation is even worse for the results summarized in Table A2. Differences of the same order are present in the other results presented here, and in some cases they are even smaller. So the key question is whether these differences are statistically significant. The discussion spread over several pages would have real meaning only after proving that the obtained differences are significant.
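One common way to address such a significance question (a sketch, not the authors' procedure) is a paired permutation test over per-utterance scores. The data below are synthetic and purely illustrative; the variable names are hypothetical.

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on the mean per-utterance difference."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a) - np.asarray(b)
    observed = abs(d.mean())
    # Randomly flip the sign of each paired difference to simulate the null hypothesis.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    return float((perm_means >= observed).mean())  # estimated p-value

# Hypothetical per-utterance scores for two systems (synthetic, illustrative only).
rng = np.random.default_rng(1)
base = rng.normal(2.8, 0.3, size=200)
p_same = paired_permutation_test(base + rng.normal(0.0, 0.05, size=200), base)
p_diff = paired_permutation_test(base + 0.1, base)
print(p_same, p_diff)  # the constant +0.1 shift yields a near-zero p-value
```

A paired test is appropriate here because the competing systems are evaluated on the same test utterances, so per-utterance differences remove much of the between-utterance variance.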

- Discussion based on the spectrograms of original speech, noisy speech, and de-noised speech is very problematic; it can serve for illustration purposes only. The focus is on a small portion of the signals (a small part of the spectrograms) and on comparison of the different cases, yet it is difficult to make a meaningful comparison. On page 19, in lines 526 and 527, the authors admit: "... it is somehow difficult to compare these methods in SE performance simply with these spectrogram demonstrations …"

- The analysis based on the mean-square error is also problematic. Hypothetically speaking, the de-noised speech can have time-domain amplitudes similar to those of the noise-free speech while still containing significant noise and/or distorted speech. Moreover, it is not clear why the results obtained using the fused time-domain and STFT-domain features are excluded from this analysis.

- In the Concluding Remarks, the authors stated: "In particular, DWT features perform better than STFT features as the fusion component of the encoder in most cases." However, the differences of the applied metrics when STFT features are compared to DWT features are very small, and the conclusion "in most cases" seems not to be completely supported by the results. If the authors want to continue in this direction, it would be necessary to justify such a conclusion, or they can re-phrase the conclusion explaining that in some cases STFT features provide better results, while in some other cases DWT features provide better results. In any case, the conclusion is rather "loose". Besides, it would be beneficial to state some other conclusions emphasizing the importance of the findings.

 

Detailed comments:

- The order of citing the references is not correct. The first cited reference is [1] (see line 28), but the next one is [5] (see line 40) instead of [2], etc.

- Although the abbreviation for speech enhancement (SE) is already introduced on page 1, in line 36, both the full phrase and the abbreviations are used throughout the manuscript, which is not the best practice. This is also valid for some other abbreviations.

- On page 2, in lines 65 to 67, it is stated: "The representation might be time-domain signal waveform, time-frequency (T-F) diagram (spectrogram), or cochleagram." Since a "cochleagram" can also be represented as a time-frequency (T-F) diagram, this statement should be clarified.

- The wavelet transform is a well-known transform. In an already lengthy manuscript, it is not necessary to include a detailed description of this transform.

- On page 9, in the caption of Fig. A9: it seems that the abbreviation CD-TCN is not defined.

- On page 8, in line 281: it seems that due to a double description of the same processing (initial processing within the Conv-TasNet system (1-D convolution) on page 4 and the same processing on page 8 – Eq. (6)), the size of the generated matrices CA and CD is not correct – L/2 x M should be replaced with L/2 x K. Alternatively, the number of frames (page 8, line 270) should be corrected (K frames should be replaced with M frames). The same is valid for the matrices obtained after a 1-D convolution, so, it seems that their size should be N x K, instead of N x M.

- On page 11, in line 325, "the L-level DWT splits the lowest sub-band" is probably better to be "the L-level DWT splits the lower sub-band" since there are only two sub-bands.

- On page 13, in line 363 and on page 14, in line 383: check if the variable B(i,j) is correct or it should be replaced with M(i,j).

- On page 13 and 14, the phrase "coordinate" is a bit confusing. Is it related to both frames (K) and vectors (N) in the basis function? It could be beneficial to clarify this issue in the manuscript. In that regard, the text on page 14, in lines from 377 to 381, could be presented in a clearer way.

- Check if the hyper-parameters whose values are given on page 15, in line 417 are defined in the manuscript.

- On page 16, it would be useful to describe the symbols used in Eq. (25).

- Captions of Fig. A16 should be checked – is it correct to say "noise-corrupted by time feature-wise Conv-TasNet"?

 

English and typos:

English is at an acceptable level. There are still some mistakes and typos that should be corrected (some examples not making a full list are given below):

- On page 5, in line 194, "which functions" should probably be "whose functions".

- On page 7, in line 242, "the up-sampling sub-band" should probably be "the up-sampled sub-band".

- On page 7, in lines 246 and 247, "these two synthesis filters are with impulse responses" should probably be "these two synthesis filters with impulse responses".

- On page 12, in line 337, "Next, As done" should be "Next, as done".

- On page 12, in line 344, "As an extension of the BPF-wise use for" should probably be "As an extension of the BPF-wise used for".

- On page 19, in line 833, "From the two tables, We have" should be "From the two tables, we have".

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 4 Report

The authors have put a lot of effort into revising the manuscript and addressing the reviewers' comments in a short period of time. Consequently, the manuscript has been significantly improved, although a few open questions remain worth answering. This reviewer believes that after correcting these minor issues, the manuscript could be accepted for publication.

General comments:

- On page 3, in lines 113 to 115 (in the added paragraph), it is stated: "DWT is a distortionless transformation like Fourier transform (FT), and it decomposes an input signal into sub-band signals with bandwidths that decrease with increasing frequencies, similar to how the human ear perceives sound." This statement does not seem to be correct. In the discrete wavelet transform (DWT), the signal is passed through a series of (analysis) filters, with two filters at each level of decomposition: a low-pass and a high-pass filter. Since only the low-frequency (approximation) part of the signal is further decomposed, the sub-band bandwidth decreases as the frequency decreases. This property is similar to the behavior of the human ear in sound perception.
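The band structure the reviewer describes follows directly from splitting only the approximation branch at each level: the band edges of an L-level DWT sit at fs/2, fs/4, …, fs/2^(L+1). A short sketch (the 16 kHz sampling rate is assumed for illustration):

```python
def dwt_band_edges(fs, levels):
    """Upper band edges (Hz) of an L-level DWT: only the lowest band is split
    at each level, so bands get narrower toward low frequencies."""
    edges = [fs / 2]
    top = fs / 2
    for _ in range(levels):
        top /= 2
        edges.append(top)
    return list(reversed(edges))

print(dwt_band_edges(16000, 3))  # [1000.0, 2000.0, 4000.0, 8000.0]
```

For 16 kHz speech and three levels, the sub-bands are 0–1 kHz (approximation), 1–2 kHz, 2–4 kHz, and 4–8 kHz: coarse at high frequencies, fine at low frequencies, qualitatively like auditory frequency resolution.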

- While contributions and organization of the manuscript are clarified in the revised manuscript, the motivation for carrying out this research can still be expressed in a clearer way.

- In the review of the previous version of the manuscript, the comment of this reviewer was: "The training set contains 11572 utterances, and 200 of them are used for the validation. That means that 200/11572 = 1.7 % of the training set is applied for validation. What is the justification for such a ratio of training and validation samples?" The authors addressed this comment in the "Response to Reviewer Comments". Unfortunately, the authors’ clarification (that might be interesting for potential readers) is not included in the revised manuscript.

- In the review of the previous version of the manuscript, one of the reviewer's comments concerned the statistical significance of the compared results, since the given results are rather similar (close to each other). In the meantime, the authors have performed the statistical analysis and provided the results in the "Response to Reviewer Comments". Unfortunately, these results are again not included in the revised manuscript. Instead of incorporating all the results of the statistical analysis, at least some conclusions from the analysis could be stated in the revised manuscript, which is currently not the case. An exception is on page 17, in lines 514 to 516: "Furthermore, even though these metric scores is varied when using different fusion features (STFT, one-level DWT, and two-level DWT), their difference is quite small and statistically insignificant." Besides, the statistical significance of the results is shown only by comparing the DWT-wise cross-domain feature method with the time-domain-wise method, not by comparing the other methods with each other. It would be beneficial to state this fact in the revised manuscript, too.

- Regarding the comment given in the review of the previous manuscript version related to the concluding remarks (starting with "In the Concluding Remarks ... "), the authors have revised the relevant statements. Unfortunately, it seems that some other conclusions are still missing that might be added to emphasize the importance of the findings.

 Typos:

- On page 8, in lines 283 and 284, "4k Hz" should be "4 kHz".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
