Peer-Review Record

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

Appl. Sci. 2022, 12(15), 7463; https://doi.org/10.3390/app12157463
by Mengqi Niu 1, Liang He 1,2,*, Zhihua Fang 1, Baowei Zhao 1 and Kai Wang 3
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 9 July 2022 / Revised: 22 July 2022 / Accepted: 22 July 2022 / Published: 25 July 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

The manuscript proposes a pseudo-phoneme label (PPL) loss for the TI-SR task. It integrates a content cluster loss at the frame level and a speaker recognition loss at the segment level in a unified network through multi-task learning, without requiring additional data or excessive computation. The pseudo-phoneme labels used to adjust the frame-level feature distribution are generated with HuBERT and DeepCluster. The authors compare the proposed loss with the softmax loss, center loss, triplet loss, log-likelihood-ratio cost loss, additive margin softmax loss, and additive angular margin loss on the VoxCeleb database. The experimental results on this database demonstrate the effectiveness of the proposed method.

The contents of the paper are organized into six sections.

The article title is appropriate and accurately reflects the article's content. The abstract is clear and well-defined, and it states the main goal of the paper. The keywords used are appropriate. The introduction is clear and correctly written, and it also presents related work and background. Section 2 introduces some of the loss functions, pooling methods, and clustering methods used in the experiments. The sources on which the methodology is based are duly cited. Section 3 presents the pseudo-phoneme label loss for the TI-SR task. Section 4 describes the settings of the applied methodology. Section 5 gives the experimental results and analysis.

Finally, Section 6 draws the conclusions and outlines future work.

The manuscript content is structured correctly and contains all the relevant sections marked with subheadings. The manuscript consists of 18 pages, 54 references, 5 well-formatted figures, and 6 tables. 61% of all cited publications were published in the last 5 years. The cited literature is from authoritative sources and does not need correction.

In general, the paper is well-formatted and follows the journal’s template. I think that this article is suitable for the journal.

Specific comments and suggestions:

·  There are some typing errors. The authors should reread the whole text and correct them.

   * VoxCeleb should be written correctly. Please see lines 13 and 99 and correct the name of the database.

   * Kaldi should begin with a capital letter. Please see line 388.

   * The URL of the VoxCeleb website on page 10 is not written correctly. The correct link is https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

   * The sentence "In this paper, we explore the performance of three commonly used pooling methods combined with our proposed loss function, as shown in Table ??" refers to a table without a number. This should be corrected.

Author Response

Point 1: There are some typing errors. The authors should reread the whole text and correct them.

   * VoxCeleb should be written correctly. Please see lines 13 and 99 and correct the name of the database.
   * Kaldi should begin with a capital letter. Please see line 388.
   * The URL of the VoxCeleb website on page 10 is not written correctly. The correct link is https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

Response 1: We apologize for these typing errors. We have corrected them and rechecked the entire article. Thank you for the corrections.

Point 2: The sentence "In this paper, we explore the performance of three commonly used pooling methods combined with our proposed loss function, as shown in Table ??" refers to a table without a number. This should be corrected.

Response 2: In an earlier version of the article, we explored the impact of different pooling methods on speaker recognition performance. Because the article focuses on the loss function, we later removed the pooling content.

Through an oversight, the reference to the pooling experiment remained after the experiment itself was removed, which is why a table without a number was displayed. We have checked the whole article and removed the remaining content about pooling.

We are very grateful to the reviewer for reviewing the paper so carefully. We have revised the manuscript, and the changes are highlighted in blue text in the revised version.

If any other modifications are needed, we would be glad to make them. We really appreciate your help.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper is well-written. The authors propose a pseudo-phoneme label (PPL) loss for text-independent speaker recognition. Various comparisons were conducted to assess (1) the performance using different losses; (2) the performance using different PPL loss implementation methods; (3) the performance after adding the PPL loss to the classification loss; and (4) the effect of the hyperparameters. The experimental findings show that the proposed method is effective. Below are my comments:

(1) As mentioned in the abstract, the aim of the proposed method is to avoid the need for a large amount of annotated data and the high consumption of computational resources. Please add a subsection that highlights how the proposed method addresses these challenges.

(2) There is a duplicate sentence in lines 51-56. Please proofread.

Author Response

Point 1: As mentioned in the abstract, the aim of the proposed method is to avoid the need for a large amount of annotated data and the high consumption of computational resources. Please add a subsection that highlights how the proposed method addresses these challenges.

Response 1: Thank you very much for this good suggestion, which we have followed. We have added a separate subsection, Section 5.6, describing the advantages of our proposed approach, including how it addresses the data annotation problem and the computational problem.

Point 2: There is a duplicate sentence in lines 51-56. Please proofread.

Response 2: Thank you for your careful check. We apologize for our carelessness and have removed the duplicate content from the text.

We greatly appreciate your complimentary comments and suggestions. We have revised the manuscript, and the changes are highlighted in blue text in the revised version.

If any other modifications are needed, we would be glad to make them. We really appreciate your help.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

 
