Source Separation Using Dilated Time-Frequency DenseNet for Music Identification in Broadcast Contents
Round 1
Reviewer 1 Report
In this paper, the authors proposed a Time-Frequency DenseNet with novel submodules for sound source separation.
The performance of the proposed method was extensively compared with that of other state-of-the-art neural networks used for sound source separation, such as U-Net and DenseNet.
Among the proposed techniques, the idea of the multi-band block is excellent, and the performance improvement is demonstrated objectively through experiments.
I believe that this paper may be published in its current form.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper presents a source separation architecture using a dilated time-frequency DenseNet for background music identification in broadcast contents.
The paper is well organized and readable. I have some suggestions, which are described below, to be considered in order to improve the paper.
There are quite a few comparisons in the paper. However, one paper is missing from the references (Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel), in which the authors of this paper also participated. Please compare the proposed algorithm with the method in this missing reference. The additional experiments should confirm that the technique presented in this paper is really better.
I recommend that the paper be accepted after major revision.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
A source separation method based on a dilated Time-Frequency DenseNet is proposed and used for music identification. The main contribution is the source separation architecture itself, an improved version of MDenseNet. The paper is well written, and the topic is cutting-edge and interesting; nevertheless, from my point of view, some enhancements are needed to meet the requirements for publication. Please consider the following:
- Have you tried blind source separation (BSS) methods [1-4], such as ICA for determined BSS and SCA for the underdetermined case? I guess they are also suitable for the source separation task. Please address this concern in the revised paper.
- How is the number of sources determined in real applications? For instance, if there were many background speakers, would the proposed method still work?
- Why are the datasets for source separation and identification different?
- How are the speech signal and the music signal mixed? In a linear way, or otherwise?
- I suggest shortening the abstract and highlighting the key findings and how the manuscript improves the state of the art, or at least briefly mentioning the advantages of the proposed techniques.
- Please double-check the paper for typos, such as: Lines 45-46, "background music is mostly mixed XXX"; Line 111, remove 'this'; Line 151, "the number of".
Reference:
[1] Oja, Erkki, and Zhijian Yuan. "The FastICA algorithm revisited: Convergence analysis." IEEE Transactions on Neural Networks 17.6 (2006): 1370-1381.
[2] Georgiev, Pando, Fabian Theis, and Andrzej Cichocki. "Sparse component analysis and blind source separation of underdetermined mixtures." IEEE Transactions on Neural Networks 16.4 (2005): 992-996.
[3] Zou, Liang, et al. "Underdetermined joint blind source separation of multiple datasets." IEEE Access 5 (2017): 7474-7487.
[4] De Lathauwer, Lieven, and Joséphine Castaing. "Blind identification of underdetermined mixtures by simultaneous matrix diagonalization." IEEE Transactions on Signal Processing 56.3 (2008): 1096-1105.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
This paper presents a source separation architecture using a dilated time-frequency DenseNet for background music identification in broadcast contents.
The paper is well organized and readable. I have one suggestion, which is described below, to be considered in order to improve the paper.
The authors explained in the cover letter why they did not perform additional experiments. However, I suggest that the authors mention their previous paper (Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel) and briefly outline the differences between the two studies.
I recommend that the paper should be accepted.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
The manuscript has been significantly improved. It can be accepted for publication now.
Author Response
Please see the attachment.
Author Response File: Author Response.docx