Article
Peer-Review Record

Regularized Within-Class Precision Matrix Based PLDA in Text-Dependent Speaker Verification

Appl. Sci. 2020, 10(18), 6571; https://doi.org/10.3390/app10186571
by Sung-Hyun Yoon 1, Jong-June Jeon 2 and Ha-Jin Yu 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 10 August 2020 / Revised: 17 September 2020 / Accepted: 17 September 2020 / Published: 20 September 2020
(This article belongs to the Special Issue Intelligent Speech and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

The paper is well presented. It mainly applies a new regularisation method to regularise the within-class precision matrix.

The GLASSO is described as something that can be understood as a de-noising operator to reflect the conditional independence structure in the underlying model (line 75). However, I was not able to identify the motivation behind this method: specifically, why this method should work for this specific problem.

line 557: you said 'the GC embeddings are still unsuitable for GLASSO-PLDA.' Please explain why.

The paper is technically sound; adding more reasoning behind the method will make the paper better.
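For readers of this record unfamiliar with the method under discussion, the de-noising behaviour the reviewer quotes can be illustrated with a minimal graphical-lasso sketch. This uses scikit-learn's GraphicalLasso on synthetic data; the dimensionality, penalty value, and tridiagonal structure are assumptions for illustration, not the authors' setup:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Illustrative toy model: a Gaussian whose true precision matrix is sparse
# (tridiagonal), so most variable pairs are conditionally independent.
rng = np.random.default_rng(0)
dim = 5
precision = np.eye(dim) + 0.4 * (np.eye(dim, k=1) + np.eye(dim, k=-1))
covariance = np.linalg.inv(precision)
samples = rng.multivariate_normal(np.zeros(dim), covariance, size=5000)

# alpha is the L1 penalty weight; a larger alpha gives a sparser estimate.
model = GraphicalLasso(alpha=0.05).fit(samples)

# Entries outside the tridiagonal band are shrunk toward zero: the noisy
# off-band values of the empirical precision matrix are "de-noised" away,
# recovering the conditional-independence structure.
off_band = np.abs(model.precision_[np.triu_indices(dim, k=2)])
print(off_band.max())
```

The sample precision matrix would have small but nonzero values everywhere; the L1 penalty suppresses exactly those spurious entries, which is one way to read the "de-noising operator" claim the reviewer questions.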

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

In this paper the authors use GLASSO regularisation to improve the performance of probabilistic linear discriminant analysis applied to the speaker verification problem. The introduction is good: the topics are well introduced, so that the reader is able to understand the subject, and the outline of the paper is also well treated. I would add references to allow the reader to deepen the subject.

The Preliminaries section is quite clear; references need to be added, and I think it is appropriate to briefly introduce some topics, which I have reported in detail below. The GLASSO section is quite good; as always, references need to be added. Equation 8 also needs to be revised: the parentheses of the argument of the argmax function are missing. Section 4 needs improvement; it is necessary to better explain how GLASSO improves PLDA. Furthermore, the formulas, and above all the figures, must be better introduced to justify their presence. The Experiments section needs to be improved; in particular, the graphics need to be revised and the captions enriched.

Finally, the Conclusions section is missing paragraphs reporting the possible practical applications of the results of this study. It is necessary to state how these results can serve the industrial sector and to insert possible uses of this study that justify publication. The possible future goals of this work are also missing. Do the authors plan to continue their research on this topic? If so, a sentence on future work could be added to the Conclusions section.

 


 

32) Do not use abbreviations (i.e.)

26-33) Good introduction, short but concise. Insert references to allow the reader to deepen the topic.

39) Do not use abbreviations (e.g.). I have seen that you often use abbreviations of this type in the following. I advise you not to use abbreviations. I will not repeat this advice for the rest.

34-42) Good TI-SV introduction, insert references to allow the reader to deepen the topic.

53-54) Add a link to the previous topic so there is continuity. Maybe you could introduce what speaker embedding is for.

54-61) Add references.

25-87) Good introduction, the topics are well introduced so that the reader is able to understand well what it is. The outline of the paper is also well treated. I would add references to allow the reader to deepen the subject.

90)” Traditional speaker embeddings were mostly based on linear generative models in statistics”. Add references.

91) Introduce what speaker embedding is for.

100-105) Add references to confirm these claims.

111-112)” The deep speaker embeddings have become the state-of-the-art in the field of ASV.” Add references.

113-114)” fully connected neural network (FNN)”. The acronym FNN could create ambiguity as it is also used for Fuzzy Neural Networks, you could use FCNN.

118-119)” In this paper, we use the d-vector based on long short-term memory (LSTM)”. You could briefly introduce LSTMs.

120) You could briefly introduce ResNet, in order to highlight the differences with the other DNNs.

126)” In this study, we extract the r-vectors from SE-ResNet34”. Add references.

127-128) You could briefly introduce TDNN.

131)” The x-vector/PLDA has shown state-of-the-art performance in TI-SV”. Add references to confirm these claims.

149) Period is missing.

150)” expectation-maximization (EM)”. You could briefly introduce this topic.

156)” log-likelihood ratio” You could briefly introduce this topic.

156-159) Make the definition of formula 7 consistent with the variables you used: first you talk about x1 and x2, but then those variables are not present in the formula. Also, explain the term ?(∙ | ?, ?) better.

170-172) Add references.

173-181) Add references to deepen the topic.

183-184) Add references to cite the authors that treated this algorithm. For example:

  • Yuan, M. and Lin, Y., 2007. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), pp. 19-35.
  • Friedman, J., Hastie, T. and Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), pp. 432-441.
  • Witten, D.M., Friedman, J.H. and Simon, N., 2011. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4), pp. 892-900.

187) In equation 8, the parentheses of the argument of the argmax function are missing.

211) Explain equation (11) better. Try to make it clear what it is for in achieving the goal set by equation (8).

223) In equation 12 the parentheses of the argument of the function argmax are missing.

229-231) Explain these concepts better.

239-244) Add references to confirm these claims.

276-282) Explain the meaning of these figures and what they show. What do they contribute to understanding the topic? You could add these explanations in the captions of the figures.

311-317) Explain the meaning of these figures and what they show. What do they contribute to understanding the topic? You could add these explanations in the captions of the figures.

386-387) Since EER will be your evaluation metric, you should explain this metric better. Explain how it is evaluated, with an equation as well.
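The metric the reviewer refers to, the equal error rate (EER), is the operating point where the false-rejection rate equals the false-acceptance rate. A minimal sketch of how it can be estimated from trial scores; the synthetic Gaussian scores and the threshold sweep are illustrative assumptions, not necessarily the authors' procedure:

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER: threshold where false rejection equals false acceptance.
    FRR(t) = fraction of genuine scores below t; FAR(t) = fraction of
    impostor scores at or above t. Sweep candidate thresholds and return
    the rate at the point where the two curves are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])
    far = np.array([(impostor >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2

# Toy scores: same-speaker trials score higher on average.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # same-speaker trial scores
impostor = rng.normal(0.0, 1.0, 1000)  # different-speaker trial scores

# Analytically about 0.16 for unit-variance Gaussians whose means differ
# by 2; the sample estimate varies slightly with the random draw.
print(equal_error_rate(genuine, impostor))
```

Lower EER means better separation of genuine and impostor scores; it is threshold-independent, which is why it is the standard summary metric in speaker verification.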

393-394) Remind the reader what the ? parameter represents.

403-405) Since there are 4 subfigures, it is appropriate to denote them with letters from a to d. This will make it easier to refer to them.

428-430) Since there are 6 subfigures, it is appropriate to denote them with letters from a to f. This will make it easier to refer to them.

437-440) Since there are 6 subfigures, it is appropriate to denote them with letters from a to f. This will make it easier to refer to them.

501-510) Add references to deep the matrix banding methods.

529-541) Paragraphs are missing that report the possible practical applications of the results of this study. It is necessary to state how these results can serve the industrial sector and to insert possible uses of this study that justify publication. The possible future goals of this work are also missing. Do the authors plan to continue their research on this topic? If so, a sentence on future work could be added to the Conclusions section.

 

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The authors approached some of the reviewer's comments with sufficient attention and edited the document consistently with the suggestions provided. Other comments were not accepted by the authors; I refer, for example, to those relating to the labelling of the panels in multi-panel figures (Figures 6, 8, and 10). This is a pity; I think the paper would have acquired greater readability in this way.

The new version of the paper has significantly improved, both in the presentation, which is now much more accessible even to a reader who is not an expert in the field, and in the contents, which now appear much more incisive. In the new version of the document, the introduction appears more complete; the added references will allow the reader to deepen the arguments. The Preliminaries section now contains all the information necessary for the reader to understand the methods and tools used by the authors in this study. The formulas presented appear clearer. As indicated below, I recommend incorporating the explanations provided to the reviewer into the paper. This will make it easier for the non-expert reader to follow the flow of information.

Finally, the conclusions are now complete, the results are presented and ideas are offered for possible extensions of the work.

 

 

125) I advised you to explain what LSTMs are; you simply said they are RNNs. At this point you need to explain what RNNs are.

156) I advised you to explain what log-likelihood ratios are. Why didn't you? The reader would be helped by a simple explanation.
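The log-likelihood ratio the reviewer asks the authors to explain compares how well two competing hypotheses account for an observation. A toy one-dimensional Gaussian sketch, with assumed parameters rather than the paper's PLDA score:

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Toy hypotheses: under H_same the score comes from the target model,
# under H_diff it comes from a background model (assumed parameters).
def log_likelihood_ratio(x, target=(1.0, 0.5), background=(0.0, 1.0)):
    return gaussian_logpdf(x, *target) - gaussian_logpdf(x, *background)

# Positive LLR favours the target (same-speaker) hypothesis,
# negative LLR favours the background (different-speaker) hypothesis.
print(log_likelihood_ratio(1.2))   # near the target mean: positive
print(log_likelihood_ratio(-1.5))  # near the background mean: negative
```

In PLDA scoring the same idea applies with the two hypotheses being "the embeddings share a latent speaker variable" versus "they do not"; the sign and magnitude of the ratio drive the accept/reject decision.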

172-174) You could add the explanation given in your response 21: "An arbitrary embedding in the original space  is projected into the projected space through the transformation . Note that  can be both  and . In other words, the same transformation  is applied to , regardless of the kind of subscript . Therefore,  can correspond to both  and ."

248-256) You say this explanation (lines 248-251) is about the GLASSO, which is already explained in Section 3.2 (lines 208-211). Then summarize what you have already said, or refer to those topics with a link. This will make it easier for readers to follow the flow of information.

295-301) In your response 30 you say: “As we already mentioned in Section 4 (lines 257-263), the prerequisite of the GLASSO-PLDA is that the empirical within-class covariance and accompanying precision matrix should be close to diagonal. Therefore, it is important to check whether the covariance/precision matrix is close to diagonal. Figures 1 and 2 show the within-class covariance and precision matrix of each kinds of embeddings, respectively. As you can see, only the covariance and precision matrices of the i-vector (i.e., Figures 1a and 2a) are close to diagonal, and the others are far from diagonal. It means that only i-vector satisfies the prerequisite of the GLASSO-PLDA. Actually, as you can see from Figures 5~10 in Section 5.3., only the i-vector, which satisfies the prerequisite, showed the performance improvements when using the GLASSO-PLDA. Deep speaker embeddings, which have the covariance/precision matrix that is far from diagonal, showed the performance degradations when using the GLASSO-PLDA unless applying PCA transform that makes the covariance/precision matrix close to diagonal. All contents that I mentioned here are also explained in the manuscript”.

Good explanation. Why didn't you add this explanation to the paper? You should do it.

 

330-336) In your response 31 you say: “Figures 3 and 4 show the empirical within-class covariance/precision matrix of each of deep speaker embeddings, respectively. As shown in Figures 1 and 2, deep speaker embeddings generally have the covariance/precision matrix that is far from diagonal, which do not satisfy the prerequisite of the GLASSO-PLDA. We proposed the method of applying PCA transform to deep speaker embeddings for making the covariance/precision matrix close to diagonal. After applying the PCA transform, the covariance/precision matrix become closer to diagonal, as shown in Figures 3 and 4. In addition, PCA-transformed deep speaker embeddings shown the performance improvement with the GLASSO-PLDA. All contents that I mentioned here are also explained in the manuscript.”

Good explanation. Why didn't you add this explanation to the paper? You should do it.

423-426) I understand that you did this to group the results, but naming each subfigure makes it easier to refer to it later. This is good typographic practice. Please try to accept the reviewers' advice.

449-451) The same advice as above.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
