Peer-Review Record

SuperFormer: Enhanced Multi-Speaker Speech Separation Network Combining Channel and Spatial Adaptability

Appl. Sci. 2022, 12(15), 7650; https://doi.org/10.3390/app12157650
by Yanji Jiang 1,2, Youli Qiu 1, Xueli Shen 1, Chuan Sun 2,3,* and Haitao Liu 2
Reviewer 2: Anonymous
Submission received: 21 April 2022 / Revised: 26 July 2022 / Accepted: 26 July 2022 / Published: 29 July 2022
(This article belongs to the Special Issue Novel Methods and Technologies for Intelligent Vehicles)

Round 1

Reviewer 1 Report

The abstract mentions that "At the end of the separation model, we add the speaker enhancement module to further enhance or suppress the speech of different speakers by using the mutual suppression features of each source signal. Experiments show that the SI-SNRi of the proposed separation network on the public corpus WSJ0-2mix achieves a separation performance of 20.8dB". The statement emphasizes that a change to the original method yields a performance of 20.8 dB; the authors should state roughly how much the performance differs with and without the enhancement module.
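For readers unfamiliar with the metric behind the 20.8 dB figure, the following is a minimal sketch of how SI-SNR and its improvement (SI-SNRi) are commonly computed for a single source. It is not taken from the manuscript; the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the synthetic signals are assumptions made for illustration only.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (common definition)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; the remainder counts as noise.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Hypothetical two-speaker example built from random signals.
rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(8000)
speaker_b = rng.standard_normal(8000)
mixture = speaker_a + speaker_b
estimate = speaker_a + 0.1 * speaker_b  # pretend the model removed most of B
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, speaker_a):.1f} dB")
```

Because SI-SNRi is already measured relative to the unprocessed mixture, the comparison the reviewer asks for would be between the SI-SNRi obtained with the enhancement module and the SI-SNRi obtained without it.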

In lines 83 to 84, it is stated that the proposal uses Formula 2, and lines 85 to 87 state that the more accurate W is, the closer the prediction results are to real speech. The authors should therefore explain how accurate the expected W is and which values can be produced or were used in this study.

Table 1 shows the comparison results, but the authors did not conduct an in-depth analysis. The authors are expected to explain in depth the implications of the parameters and step settings used.

Table 2 shows a comparison, but the explanations in lines 263 to 267 are not exhaustive. The authors should provide an in-depth analysis of the advantages and disadvantages of the given results.

The authors should emphasize measurable results and provide a more in-depth analysis in the conclusion section.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The paper describes an approach for enhanced multi-speaker speech separation. Although some model components seem sound, the paper is barely understandable due to the non-English grammar. An explanation of key concepts is also missing, e.g., the central concept of "correlation in mixed speech signal": do you mean long-term autocorrelation? Some features related to speech rate? Some spectral characterisation of the speakers?

Other examples of issues and faults are reported in the following:

Abstract:
    In the problem of multi-speaker speech separation, the correlation of speech signal sequence is an important basis for speech separation. -> The intra-correlation of speech signal sequence is an important basis for speech separation.
    The number 20.8dB is meaningless by itself unless it is compared with a baseline reference system.

Introduction:
    [..]there is no way to model the global features. -> Which global features are you referring to? There are plenty of speaker-dependent features you can use for speaker separation, including supra-segmental features.
    [..]solve the disadvantage that DPRNN can not be parallelled -> This sentence is left hanging and, moreover, it confuses complexity (i.e., parallel processing) with performance.

Methods:
There is low accuracy in the explanation of the terms, e.g.:
    The "]T" in line 80 is meaningless; T is not explained
    W is not defined.
    The concept of prediction signal is not explained
    Figure 1 is too generic and common to most speech processors.

General:
  Too many non-English sentences and typos are present, e.g. "The We use", "feature e output", and misplaced commas like in "Although many separation methods have been proposed, the accuracy of speech separation, remains inadequate."

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

An enhanced multispeaker speech separation network is proposed combining channel and spatial adaptability.

It seems that "correlation of the speech signal" would be more precisely designated as "autocorrelation of the speech signal".
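To make the terminology concrete, here is a minimal sketch of the normalized autocorrelation of a single signal, i.e., the correlation of the signal with delayed copies of itself. It illustrates the term only and is not the authors' model; the function name `normalized_autocorrelation`, the 8 kHz sampling rate, and the 100 Hz test tone are assumptions chosen for illustration.

```python
import numpy as np


def normalized_autocorrelation(signal, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag, scaled so lag 0 is 1."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    energy = np.dot(x, x)
    if energy == 0.0:
        # A constant signal has no variation to correlate.
        return np.zeros(max_lag + 1)
    return np.array(
        [np.dot(x[: len(x) - lag], x[lag:]) / energy for lag in range(max_lag + 1)]
    )


# Hypothetical periodic signal: a 100 Hz tone sampled at 8 kHz (period of 80 samples).
sample_rate = 8000
n = np.arange(800)
tone = np.sin(2.0 * np.pi * 100.0 * n / sample_rate)
acf = normalized_autocorrelation(tone, max_lag=120)
# Search for the strongest repetition away from lag 0.
period_estimate = 40 + int(np.argmax(acf[40:]))
print("estimated period in samples:", period_estimate)
```

A peak at a non-zero lag indicates that the signal repeats at that delay, which is the kind of within-signal dependency the term "autocorrelation" denotes.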

Mathematical notation is nonstandard and not properly formatted throughout.

In Eq. (1), W^(-1) suggests that you will invert the matrix W. If that is not the case, using a separate matrix, for instance V, that is supposed to converge to W^(-1) would make your description clearer and more to the point.

The dimensions in the inequalities below are neither defined nor properly derived:
O(1600²) > O(250²)
O(1600²) > O(34²).

 

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors responded to the review comments accordingly and explained their changes in depth.

 

Author Response

We thank Reviewer 1 very much for the comments. We carefully checked the English grammar and style of the manuscript and made sure the references are cited correctly.

Reviewer 2 Report

The authors have greatly improved the paper's readability and have clarified key concepts. The methodology is now more understandable. I have indicated some comments in the attached PDF.

An extensive English revision is still required, and some passages contain the reviewer's text instead of the article text.

Comments for author File: Comments.pdf

Author Response

We thank Reviewer 2 very much for the comments. We have carefully revised the manuscript according to the annotated attachment you sent. We also thoroughly checked the manuscript's English grammar and style, and the results are now presented and discussed further. All modified parts have been marked in the manuscript. Finally, special thanks for your helpful comments.
