Peer-Review Record

VAT-SNet: A Convolutional Music-Separation Network Based on Vocal and Accompaniment Time-Domain Features

Electronics 2022, 11(24), 4078; https://doi.org/10.3390/electronics11244078
by Xiaoman Qiao 1, Min Luo 2, Fengjing Shao 1, Yi Sui 1, Xiaowei Yin 1 and Rencheng Sun 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 28 October 2022 / Revised: 2 December 2022 / Accepted: 6 December 2022 / Published: 8 December 2022
(This article belongs to the Special Issue Machine Learning in Music/Audio Signal Processing)

Round 1

Reviewer 1 Report

The paper presents an interesting music separation model devised as a time-domain convolutional network, based on vocal and accompaniment time-domain features. The mathematical model is simple but efficient. A few corrections are recommended: for example, the incorrect use of the adverb "where" in place of the relative "in which", and the use of the incorrect abbreviation "lg" in place of the logarithm symbol "log".

Author Response

Thank you for your thoughtful suggestion. In the revised manuscript, we have used the paid editing services provided by MDPI to revise the English of the manuscript, changed the abbreviation "lg" to the logarithm symbol "log", and corrected the incorrect use of the adverb "where".

Reviewer 2 Report

This work presents an interesting approach to music separation (vocal and accompaniment in a single channel) using time-domain characteristics.

Line 224: "foucs" -> "focus"

Author Response

We thank the reviewer for this comment. In the revised manuscript, we have used the paid editing services provided by MDPI to revise the English of the manuscript and changed "foucs" to "focus".

Reviewer 3 Report

The authors propose a music separation system based on masking a latent space computed directly from the time-domain signal, instead of masking in the commonly used time-frequency domain. The system achieved good separation results, outperforming ConvTasNet trained with time-domain signals, as well as some time-frequency-masking methods.
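
For readers unfamiliar with this scheme, the following minimal sketch illustrates the general encoder / latent-mask / decoder structure of time-domain separators in the spirit of ConvTasNet. It is an illustration under stated assumptions, not the authors' VAT-SNet: the class name (TimeDomainMasker) and all layer types and sizes (n_filters, kernel, stride) are placeholders.

    import torch
    import torch.nn as nn

    class TimeDomainMasker(nn.Module):
        """Illustrative encoder -> latent mask -> decoder separator.
        Not VAT-SNet; all hyperparameters are placeholder assumptions."""
        def __init__(self, n_filters=256, kernel=16, stride=8, n_sources=2):
            super().__init__()
            # A 1-D conv encoder replaces the STFT: waveform -> latent space.
            self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
            # A small network estimates one mask per source in latent space.
            self.mask_net = nn.Conv1d(n_filters, n_filters * n_sources, 1)
            # A transposed-conv decoder maps masked latents back to waveforms.
            self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)
            self.n_sources = n_sources

        def forward(self, wav):                      # wav: (batch, 1, time)
            latent = torch.relu(self.encoder(wav))   # (batch, F, frames)
            masks = torch.sigmoid(self.mask_net(latent))
            masks = masks.view(wav.size(0), self.n_sources, -1, latent.size(-1))
            # Each source (e.g., vocal, accompaniment) is obtained by masking
            # the shared latent representation and decoding it separately.
            return [self.decoder(latent * masks[:, s])
                    for s in range(self.n_sources)]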

There are some major issues with the manuscript:

- Although the authors explain the issues of time-frequency methods in Section 3, the reader could benefit from having some notion of these issues at the start of the paper, close to lines 76-77: "Converting waveform data into spectrograms will inevitably lose some unique features of music", since it is not clear at that point of the manuscript which features will be lost.

- Furthermore, lines 147-151: "Although the frequency-domain music separation method can achieve the purpose of separating vocal and accompaniment, the waveform reconstruction process needs to combine the generated amplitude spectrum of target music with the phase spectrum of mixed music due to the limitation of the Short Time Fourier Transform causing phase loss." could use a diagram showing the effect of this phase loss on the resulting estimated signal.
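
To make the phase-loss point concrete, the sketch below shows the reconstruction step that magnitude-only frequency-domain methods typically perform: an estimated target magnitude is recombined with the phase of the mixture. The signals and the target_mag estimate here are synthetic placeholders, not outputs of any model in the paper.

    import numpy as np
    from scipy.signal import stft, istft

    fs = 44100
    mixture = np.random.randn(4 * fs)      # placeholder mixture waveform

    # Forward STFT of the mixture (complex spectrogram).
    _, _, Zxx = stft(mixture, fs=fs, nperseg=1024)

    # Stand-in for a magnitude spectrogram predicted by a separation model.
    target_mag = 0.5 * np.abs(Zxx)

    # A magnitude-only model has no phase estimate of its own, so the
    # waveform is reconstructed with the *mixture's* phase -- this reuse
    # is the "phase loss" the reviewer refers to.
    mixture_phase = np.angle(Zxx)
    _, estimate = istft(target_mag * np.exp(1j * mixture_phase),
                        fs=fs, nperseg=1024)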

- In lines 155-157: "there is usually a problem of the long delay, and the music separation method in the frequency-domain will be limited in low-latency scenarios (such as real-time music separation)". This seems inconclusive, considering that the authors did not provide any response times of the proposed system compared to the other evaluated techniques. The authors should either remove this claim or provide a response-time comparison.

- It appears that the main contribution of the manuscript is the use of the time domain for music separation. However, it was not compared to current time-domain-based separation techniques, such as:

Lam, Max WY, et al. "Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.

Nakamura, Tomohiko, Shihori Kozuka, and Hiroshi Saruwatari. "Time-domain audio source separation with neural networks based on multiresolution analysis." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 1687-1701.

The authors should carry out such a comparison.

- It seems that the mask applied in the latent space is binary. This is one of the reasons why there is phase loss in the time-frequency domain. Is there any loss of unique features when applying the mask in the latent space? The authors should expand on what effect the masking has on the resulting signal when it is applied in the latent space.
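
As a concrete illustration of the reviewer's question, the sketch below contrasts a binary (hard) mask with a soft, ratio-style mask applied to a latent representation. All arrays are synthetic placeholders; neither mask is claimed to be the one VAT-SNet actually uses.

    import numpy as np

    latent = np.random.rand(256, 100)        # hypothetical latent space (F x frames)
    vocal_score = np.random.rand(256, 100)   # stand-in for per-bin network scores

    # Hard decision: every latent bin is assigned entirely to one source.
    binary_mask = (vocal_score > 0.5).astype(float)

    # Soft (ratio-style) alternative: shared bins are split proportionally.
    soft_mask = vocal_score

    # With a binary mask, latent bins shared by vocal and accompaniment are
    # zeroed out for one of the sources, which can discard target content;
    # a soft mask distributes the shared energy instead.
    vocal_hard = latent * binary_mask
    vocal_soft = latent * soft_mask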

- Lines 427-428: "To verify the validity of VAT-SNet, we randomly selected one piece of music in the dataset". The authors should clarify whether the selected piece of music was from the 200 music clips in the "evaluation" subset of the dataset.

- Lines 431-432: "Limited by the paper presentation, the pure music and the separated resulting music are visualized and printed as waveforms." The authors should build a webpage of the model so that the reader can listen to the results.

- It is recommended that the authors submit the manuscript to a professional academic proofreading service to fix its considerable number of grammatical and stylistic errors.

Author Response

Please see the attachment

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

The authors have acknowledged all of the issues raised by this reviewer.

A minor observation on the webpage: the labels of two columns are misspelled. They should be:

"Miture" -> "Mixture"

"accomponiment" -> "Accompaniment"
