Article
Peer-Review Record

Efficient Speech Signal Dimensionality Reduction Using Complex-Valued Techniques

Electronics 2024, 13(15), 3046; https://doi.org/10.3390/electronics13153046
by Sungkyun Ko 1 and Minho Park 2,*
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 3 June 2024 / Revised: 31 July 2024 / Accepted: 31 July 2024 / Published: 1 August 2024
(This article belongs to the Special Issue Advances in Artificial Intelligence Engineering)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Comments on “Efficient Dimensionality Reduction of Speech Signals Using Complex-valued Techniques”

The manuscript primarily discusses the use of complex-MFCC in complex-DNN for speech recognition.

According to the title and abstract, the main goal of this paper is to reduce the feature dimension of speech signals using complex-valued techniques. However, I did not find any relevant information in the paper about actual dimensionality reduction of speech signals. MFCC is a well-known speech feature extraction method that uses auditory filter banks reflecting the vocal tract response of the human auditory system. This network simply uses complex MFCC applied to a complex-network to detect keywords, which does not imply any dimensionality reduction in the feature space. Dimensionality reduction typically involves techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), etc., to reduce feature dimensions and limit test time. Here are a few specific comments:

  1. Irrelevant Related Works: The author seems unclear about the paper's goal. If the goal is to design the feature space of a complex-network, they must explain state-of-the-art algorithms in the “Related Works” section. However, they simply described basic concepts like CNN, RNN, and BiLSTM.
  2. Unnecessary Information: MFCCs are widely known and used audio features. Detailed explanations of these features are unnecessary for a research paper and should be replaced with appropriate citations to the original papers.
  3. Complex-valued MFCC: MFCCs do not provide complex-valued features. They are real-valued features representing the short-term power spectrum of a sound, typically used in speech and audio processing. While the intermediate steps of MFCC computation involve the FFT, which produces complex-valued outputs, these complex values are only used to calculate the real-valued magnitude spectrum. All subsequent steps (power spectrum, Mel filter bank, logarithm, and DCT) result in real-valued MFCC features.
  4. Missing Important Results: There are no relevant results presented in this paper that support the claimed dimensionality reduction.
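The point in comment 3 can be checked with a small sketch. The code below is a simplified, NumPy-only MFCC computation for a single frame, written for illustration; the helper names (`mel_filterbank`, `dct2`, `mfcc_frame`), the 40-filter bank, and the 440 Hz test tone are assumptions and are not taken from the manuscript. It shows that only the FFT output is complex, while every later stage, including the final coefficients, is real.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (illustrative)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)  # FFT bin frequencies in Hz
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct2(x, n_out):
    """Unnormalised DCT-II of a 1-D vector, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc_frame(frame, sr, n_mels=40, n_mfcc=20):
    """Return (complex spectrum, real MFCCs) for one frame of audio."""
    windowed = frame * np.hanning(frame.size)   # window the frame
    spectrum = np.fft.rfft(windowed)            # complex-valued
    power = np.abs(spectrum) ** 2               # real from here on
    mel_energy = mel_filterbank(n_mels, frame.size, sr) @ power
    log_mel = np.log(mel_energy + 1e-10)        # small offset avoids log(0)
    return spectrum, dct2(log_mel, n_mfcc)

sr = 16000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 440.0 * t)           # hypothetical test tone
spectrum, coeffs = mfcc_frame(frame, sr)
print(spectrum.dtype, coeffs.dtype, coeffs.shape)
```

The phase carried by `spectrum` is discarded at the `np.abs` step, which is why a standard MFCC vector has no imaginary part.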
Comments on the Quality of English Language

The quality of English should be significantly improved.

Author Response

[1] Irrelevant Related Works: The author seems unclear about the paper's goal. If the goal is to design the feature space of a complex-network, they must explain state-of-the-art algorithms in the “Related Works” section. However, they simply described basic concepts like CNN, RNN, and BiLSTM.

  • We have noted that feature extraction methods like MFCC are still widely used. Additionally, we have included explanations about efforts to enhance the performance of complex-valued neural networks. Therefore, in this main text, our goal is to explore the potential synergy between the complexification of MFCC and complex-valued neural networks by leveraging the dimension reduction properties of complex numbers.

[2] Unnecessary Information: MFCCs are widely known and used audio features. Detailed explanations of these features are unnecessary for a research paper and should be replaced with appropriate citations to the original papers.

  • After adding references to the materials cited in this paper, we have retained only the explanation of the MFCC process and removed the equations.

[3] Complex-valued MFCC: MFCCs do not provide complex-valued features. They are real-valued features representing the short-term power spectrum of a sound, typically used in speech and audio processing. While the intermediate steps of MFCC computation involve the FFT, which produces complex-valued outputs, these complex values are only used to calculate the real-valued magnitude spectrum. All subsequent steps (power spectrum, Mel filter bank, logarithm, and DCT) result in real-valued MFCC features.

  • This paper proposes a method to generate complex-valued MFCCs using real-valued MFCCs obtained from existing algorithms, not through mathematical methods but via a simple conceptual approach. Figure 3 demonstrates combining two real-valued MFCCs into one complex value consisting of a real part and an imaginary part. Consequently, using n real-valued MFCCs produces n/2 complex-valued MFCCs. Therefore, all parameters of the neural network in this study utilize complex numbers.

[4] Missing Important Results: There are no relevant results presented in this paper that support the claimed dimensionality reduction.

  • When there are n real-valued MFCCs represented as [x1, x2, x3, x4, ..., x_(n-1), x_n], we reduce them to n/2 complex-valued MFCCs using the process illustrated in Figure 3, giving [x1 + jx2, x3 + jx4, ..., x_(n-1) + jx_n]. This process is referred to as dimensionality reduction in this paper. For example, if there are 10 real-valued MFCCs, the complexification process shown in Figure 3 transforms them into 5 complex-valued MFCCs.
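The pairing described in this response can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `complexify` is hypothetical, and it assumes the coefficient axis is the last axis and that the number of coefficients is even.

```python
import numpy as np

def complexify(mfcc):
    """Pair consecutive real coefficients: [x1, x2, x3, x4, ...] -> [x1 + j*x2, x3 + j*x4, ...]."""
    mfcc = np.asarray(mfcc, dtype=float)
    if mfcc.shape[-1] % 2 != 0:
        raise ValueError("the number of coefficients must be even")
    # Odd-numbered coefficients become real parts, even-numbered ones imaginary parts.
    return mfcc[..., 0::2] + 1j * mfcc[..., 1::2]

real_mfcc = np.arange(1, 11)        # 10 real values: 1, 2, ..., 10
complex_mfcc = complexify(real_mfcc)
print(complex_mfcc.shape)           # (5,)
print(complex_mfcc)
```

The vector length is halved from 10 to 5, while each complex entry still carries two real numbers, one in its real part and one in its imaginary part.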

 

 

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The article should be rewritten with the following points in mind:

1. The steps of the MFCC computation are not correctly described. First, the mathematical expression of the Mel filter is not given, although it is very important here because the authors' proposal builds on it. Second, MFCCs are computed by applying the discrete cosine transform to the Mel-filter output, not according to Equations 3 and 4 of the draft.

2. How the complex-valued MFCC is computed is not described in the "Proposed Method" section.

3. The input to a complex-valued neural network should be complex. Here, however, the complex-valued input is fed to the NN as two separate inputs (real and imaginary parts).

 

Comments on the Quality of English Language

The flow of writing should be unified. 

Author Response

[1] The steps of the MFCC computation are not correctly described. First, the mathematical expression of the Mel filter is not given, although it is very important here because the authors' proposal builds on it. Second, MFCCs are computed by applying the discrete cosine transform to the Mel-filter output, not according to Equations 3 and 4 of the draft.

  • In this paper, a simple conceptual approach as shown in Figure 3, rather than a mathematical method, is proposed to generate complex-valued MFCCs from the real-valued MFCCs obtained by existing algorithms. To calculate the real-valued MFCCs, the librosa.stft() and librosa.feature.mfcc() functions from the Python librosa library were used. After adding references to the materials cited in this paper, we have retained only the explanation of the MFCC process and removed the equations.

[2] How the complex-valued MFCC is computed is not described in the "Proposed Method" section.

  • The MFCC extracted from speech data is combined into a real and imaginary form through the process in Figure 3 to create Complex-valued MFCC. For example, if there are 10 real numbers such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, they can be reduced to 5 complex numbers like 1+j2, 3+j4, 5+j6, 7+j8, 9+j10.

[3] The input to a complex-valued neural network should be complex. Here, however, the complex-valued input is fed to the NN as two separate inputs (real and imaginary parts).

  • Through the process shown in Figure 3, the real and imaginary parts are combined into a single complex-valued MFCC. Additionally, complex numbers were used for all parameters in the neural network as well. 
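The distinction raised in comment 3 can be illustrated with a generic complex-valued linear layer. The sketch below is not the authors' network; the function names and the layer sizes (5 inputs, 3 outputs) are assumptions chosen for illustration. It uses the identity (A + jB)(x + jy) = (Ax - By) + j(Bx + Ay), which means a complex weight matrix acts on the stacked real and imaginary parts as the structured real block matrix [[A, -B], [B, A]] rather than as an unconstrained real matrix.

```python
import numpy as np

def complex_linear(W, z):
    """Apply a complex weight matrix to a complex input vector."""
    return W @ z

def real_block_linear(W, z):
    """Same map written on stacked [real; imag] parts with a real block matrix."""
    A, B = W.real, W.imag
    block = np.block([[A, -B], [B, A]])          # shape (2*n_out, 2*n_in)
    stacked = np.concatenate([z.real, z.imag])   # shape (2*n_in,)
    return block @ stacked

rng = np.random.default_rng(0)
n_in, n_out = 5, 3                               # hypothetical layer sizes
W = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))
z = rng.normal(size=n_in) + 1j * rng.normal(size=n_in)

out_complex = complex_linear(W, z)
out_real = real_block_linear(W, z)
print(np.allclose(out_real[:n_out], out_complex.real))   # True
print(np.allclose(out_real[n_out:], out_complex.imag))   # True
```

Feeding the real and imaginary parts as two independent real inputs with unrelated weights would not impose this shared-weight structure, which is the difference the reviewer's comment points at.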

 

 

 

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The goal of this research is to propose an efficient dimensionality reduction technique for speech signals using complex-valued Mel Frequency Cepstral Coefficients (CVMFCC) and complex-valued neural networks (CVNN) to improve speech recognition performance.
The main achievements of the article:
 - Proposal of the CVMFCC-DR (Complex-valued MFCC Dimensionality Reduction) algorithm
 - Demonstration of improved speech recognition performance using CVMFCC and CVNN compared to traditional MFCC and real-valued neural networks
 - Introduction of a new interpretation method for the complex-valued Softmax function with complex inputs

The paper presents an original approach by combining complex-valued MFCC with complex-valued neural networks for speech recognition. This contributes to the field by offering a novel method for dimensionality reduction while maintaining or improving performance.

The paper demonstrates a good understanding of the underlying concepts and techniques. The authors provide detailed explanations of the MFCC process, complex-valued neural networks, and their proposed methods.

The methods and analyses appear appropriate for the research goals. The authors compare their proposed approach with traditional methods using a standard dataset and provide clear explanations of their experimental setup.

The conclusions generally follow logically from the presented data and analysis. The authors demonstrate improved performance using their proposed method compared to traditional approaches.

The paper is generally well-structured and follows a logical flow. However, there are some areas where clarity could be improved, particularly in the explanation of complex mathematical concepts.

The writing quality is generally good, but there are some grammatical and punctuation issues throughout the paper. I'll provide a list of language issues later in this review.

The authors provide details on their experimental setup and the dataset used (Google's speech command dataset). However, they could improve reproducibility by providing more specific implementation details or code.

Some sections could benefit from more clarity, especially for readers less familiar with complex-valued neural networks.

These issues should be addressed to improve the overall quality and readability of the paper.

Comments on the Quality of English Language

Language, readability, and grammar issues:

 

There are some grammatical and punctuation issues affecting readability.

Some examples:
Abstract: " CVMFCC-DR (Complex-valued MFCC Dimensionality Reduction)" and "Complex-valued Mel Frequency Cepstral Coefficients (Complex-valued MFCC)" - inconsistent mode
Section 2: "CNN has the advantage of being able to process spatial information well" - awkward phrasing
Section 4: inconsistent capitalization of section heading
Section 5: "The dataset comprised 7,000 samples for training asednd 1,000 samples for testing." - probably typo ?
Figure 6. The font size for Voice Input plot is unreadable and is bitmap instead of vector graphic format.
Figure 7. Consider using vector graphic format.
Conclusion: "Moving forward, we aim to contribute to enhancing the accuracy" - consider rephrasing for more formal academic tone
"Furthermore, this paper demonstrated" - "demonstrates" would be more appropriate

Author Response

[1] "Abstract: " CVMFCC-DR (Complex-valued MFCC Dimensionality Reduction)" and "Complex-valued Mel Frequency Cepstral Coefficients (Complex-valued MFCC)" - inconsistent mode"

  • We have consistently revised the notation.

[2] "Section 2: "CNN has the advantage of being able to process spatial information well" - awkward phrasing"

  • The content of Section 2 was edited and sentences that were awkward were removed.

[3] "Section 4: inconsistent capitalization of section heading"

  • We have revised the use of capitalization in the title of section 4 to ensure consistency.

[4] "Section 5: "The dataset comprised 7,000 samples for training asednd 1,000 samples for testing." - probably typo ?"

  • We have corrected the typos.

[5] "Figure 6. The font size for Voice Input plot is unreadable and is bitmap instead of vector graphic format."

  • We have converted it to PDF format and uploaded the image files separately.

[6] "Figure 7. Consider using vector graphic format."

  • We have converted Figures 7 to 10 to PDF format. Additionally, we uploaded the image files separately.

[7] "Conclusion: "Moving forward, we aim to contribute to enhancing the accuracy" - consider rephrasing for more formal academic tone"

  • We have revised the conclusion to include the limitations of the approach used in this paper and to elaborate on the extended directions for future research.

 

 

 

 

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript is interesting and provides nice details on how the complex MFCC optimization proceeds.  I do have a few comments in Section 5 Performance Analysis.

1.  There are typos in the paragraph above Eq. (29):  Line 4 "u" should probably be "use" and in line 5 "asend" should be "and."

2.  Just above Eq. (29) it is mentioned that MFCCs are generated for each frame.  My question is did you use a window function?  If so, which one?  If a rectangular window was used, please state that.

3.  In the discussions of the performance of the complex MFCCs vs. the real valued MFCCs, the specific performance of the complex valued MFCCs is stated (for example, 99.01% training and 85.00% testing for complex MFCC20) but the reader is left to find the specific values for the real-valued MFCCs from the figures.  Please explicitly state those for real valued MFCC20, MFCC10, and MFCC5.

4.  In Figs. 7 and 8, the performance of the complex and real valued MFCCs converge with increasing Epoch, but not for MFCC5 in Fig. 9.  Can you elaborate on why that is true?

Comments on the Quality of English Language

Just a little rough.  It would help if a native English speaker read through the manuscript and did some editing.

Author Response

[1] There are typos in the paragraph above Eq. (29):  Line 4 "u" should probably be "use" and in line 5 "asend" should be "and."

  • We have corrected the typo.

[2] Just above Eq. (29) it is mentioned that MFCCs are generated for each frame.  My question is did you use a window function?  If so, which one?  If a rectangular window was used, please state that.

  • The default Hanning window was used to calculate the real-valued MFCCs.
  • In this paper, a simple conceptual approach as shown in Figure 3, rather than a mathematical method, is proposed to generate complex-valued MFCCs from the real-valued MFCCs obtained by existing algorithms.

[3] In the discussions of the performance of the complex MFCCs vs. the real valued MFCCs, the specific performance of the complex valued MFCCs is stated (for example, 99.01% training and 85.00% testing for complex MFCC20) but the reader is left to find the specific values for the real-valued MFCCs from the figures.  Please explicitly state those for real valued MFCC20, MFCC10, and MFCC5.

  • We have specified the results of the real-valued MFCC as follows in the text.

[4] In Figs. 7 and 8, the performance of the complex and real valued MFCCs converge with increasing Epoch, but not for MFCC5 in Fig. 9.  Can you elaborate on why that is true?

  • If there are a sufficient number of MFCCs, with increasing epochs, it appears that the performance using real numbers begins to converge towards that using complex numbers. However, Figure 9 shows a different outcome. Complex valued MFCC 5 has 5 dimensionally reduced complex MFCCs obtained from 10 MFCCs through the complexification process shown in Figure 3. A real-valued MFCC 5 with only 5 MFCCs cannot catch up with Complex-valued MFCC 5 even as epochs increase. In other words, when the number of MFCCs is low, even as epochs increase, the results of real-valued MFCCs do not match the performance of complex-valued MFCCs. As shown in Figure 9, despite having the same number of parameters, the results of Complex-valued MFCC 5 are superior to those of Real-valued MFCC 5.

 

 

Author Response File: Author Response.docx

Reviewer 5 Report

Comments and Suggestions for Authors

Although the content of the manuscript is interesting and the numerical part is very well explained and detailed, the Authors must take into account the following important comments before the manuscript can be considered for publication in its present form.

-The first is the abstract. Although the concepts presented are interesting, it should be expanded. In addition, the tense agreement between subjects and verbs in some sentences should be revised.

-The introduction is short and poor. Three references are given and all three have been included together. Although the explanation of concepts is correct, it should be supported by other previous work. This is crucial for a positive review of the manuscript. In addition to the inclusion of new references, the three currently included should be disaggregated.

-The problem of references (from the point of view of the Reviewer the main issue to be improved regarding the manuscript) also applies to other sections. In particular, the second section of the manuscript 2. Related works. Although references to "other studies", "previous publications" are made throughout the text, the Authors do not provide references to such works. Authors are strongly encouraged to improve this aspect. If these references are related to previous works by the same authors, this should also be mentioned.

-The figures included in the manuscript are interesting. However, the first figures should be explained in more detail in the text. This is especially important for figures 1, 2, 3 and 4.

-The Authors are advised to complete the final section 6. Conclusions. Some limitations of the proposed model or further details of future work could be included at the end.

Comments on the Quality of English Language

In general, English is fine. Some minor changes should be included (some typos and review of tenses).

Author Response

[1] The first is the abstract. Although the concepts presented are interesting, it should be expanded. In addition, the tense agreement between subjects and verbs in some sentences should be revised.

  • The content regarding the introduction of the complex Softmax has been incorporated into the abstract, and the tense inconsistencies have been rectified.

[2] The introduction is short and poor. Three references are given and all three have been included together. Although the explanation of concepts is correct, it should be supported by other previous work. This is crucial for a positive review of the manuscript. In addition to the inclusion of new references, the three currently included should be disaggregated.

  • We have reinforced the content, and separated the references to treat each one individually. Additionally, we have included new references to support the proposed concepts in the introduction.

[3] The problem of references (from the point of view of the Reviewer the main issue to be improved regarding the manuscript) also applies to other sections.

In particular, the second section of the manuscript 2. Related works. Although references to "other studies", "previous publications" are made throughout the text, the Authors do not provide references to such works.

Authors are strongly encouraged to improve this aspect. If these references are related to previous works by the same authors, this should also be mentioned.

  • We have provided references for the studies mentioned. In particular, we have further strengthened the content of Section 2, "Related Work."

[4] The figures included in the manuscript are interesting. However, the first figures should be explained in more detail in the text. This is especially important for figures 1, 2, 3 and 4.

  • We have included an explanation for Figure 1. This figure illustrates the MFCC feature extraction process and is explained throughout Section 3, Background Knowledge.
  • We have added an explanation for Figure 2. This figure offers an overview of the proposed method's overall structure and is detailed block by block in Section 4, Proposed Method.
  • We have included an explanation for Figure 3. This figure depicts the complexification process of MFCC in the proposed method and is detailed in the Complex-valued MFCC subsection of Section 4, Proposed Method.
  • We have included an explanation for Figure 4. This figure explains the learning algorithm of the complex-valued neural network and is thoroughly described in the Activation Function and Learning Algorithm subsection of Section 4, Proposed Method.

[5] The Authors are advised to complete the final section 6. Conclusions. Some limitations of the proposed model or further details of future work could be included at the end.

  • In the conclusion, we added the limitations of the current approach and expanded on future research directions.

 

 

 

 

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

1. The authors are requested to justify forming the complex-valued MFCC by combining sequential MFCCs (i.e., the first and second coefficients form the first complex MFCC). Usually, complex-valued variables are used to represent wave-like behavior and phase information. Please add more references regarding this.

2. Please add more NN classifier metrics, such as F1 and ROC-AUC.

Comments on the Quality of English Language

N/A

Author Response

[1] The authors are requested to justify forming the complex-valued MFCC by combining sequential MFCCs (i.e., the first and second coefficients form the first complex MFCC). Usually, complex-valued variables are used to represent wave-like behavior and phase information. Please add more references regarding this.

  • We have added a sentence justifying the complex-valued MFCC format we proposed, and we have cited relevant literature on the topic.
  • Please refer to the attached file for the revised phrase.

[2] Please add more NN classifier metrics, such as F1 and ROC-AUC.

  • In addition to accuracy, we also computed the F1 score and obtained the same result. Therefore, we did not add a graph to the paper and added only a brief mention.
  • Please refer to the attached file for the result of using sklearn's f1_score() function.
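For readers without access to the attached file, the metric itself can be sketched as follows. The response does not say which averaging mode was used, so macro averaging is an assumption here, and the labels are hypothetical toy data, not results from the paper. The helper `macro_f1` is a NumPy-only illustration of the quantity that sklearn's `f1_score(y_true, y_pred, average="macro")` reports.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Hypothetical keyword labels for six test utterances.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(accuracy, macro_f1(y_true, y_pred))
```

Accuracy and F1 need not coincide in general, so reporting both numbers, as requested by the reviewer, makes the comparison explicit.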

 

Author Response File: Author Response.docx

Reviewer 5 Report

Comments and Suggestions for Authors

The authors have addressed and considered all my previous remarks. 

Author Response

[1] The authors have addressed and considered all my previous remarks. 

  • Thank you for your feedback. We're glad to hear all your previous remarks have been addressed.
 