Improving Deep Mutual Learning via Knowledge Distillation
Round 1
Reviewer 1 Report
The authors have explained "Improving Deep Mutual Learning via Knowledge Distillation" very well. The figures are well explained, but the quality of the text inside the figures could be improved.
The contributions of the paper are not well presented. Please give a precise and clear point-by-point statement of the contributions.
What is the existing work to date? Compare it with the help of a table at the end of the related work section.
I am not able to find Section 4 in the paper. Also, it seems that a section before the results section is still missing.
The introduction and related work sections should include more related work, such as: a) A hybrid convolutional neural network model for diagnosis of COVID-19 using chest X-ray images; b) Enhanced convolutional neural network model for cassava leaf disease identification and classification; c) Visualization of Customized Convolutional Neural Network for Natural Language Recognition.
Explain the concept with an application area such as automatic speech recognition.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
The manuscript titled "Improving Deep Mutual Learning via Knowledge Distillation" deals with knowledge transfer, and the authors propose two new approaches for this purpose, named Full Deep Distillation Mutual Learning and Half Deep Distillation Mutual Learning. The authors are correct that the new approaches have significant effects on improving the performance of convolutional neural networks. These approaches work with three losses, using variations of existing network architectures, and the experiments have been conducted on three public benchmark datasets. Although the manuscript contains novel ideas and has good prospects, it has some serious flaws that should be addressed before any decision can be reached. Hence, I recommend the following major revisions:
1. The manuscript needs a comprehensive language revision; furthermore, many sentences are too long to understand easily. This part needs complete attention.
2. The abstract and introduction should be rewritten, keeping in view point 1 and the historical background.
3. The manuscript relies on too many preprints, so there are serious concerns about the validity of the proposed results.
4. As an example, consider Eq. 1. It is not clear where it comes from: whether it is taken from prior work or introduced by the authors. If it is taken from prior work, proper references should be cited; if it is introduced by the authors, this should be stated. Handle this issue throughout the manuscript.
5. It is mentioned that Figure 1 is from [5], but it is not given there.
6. Regarding the comparative analysis, the entries attributed to [5] in Table 1 and the entries attributed to [17] in Table 3 are not found in the respective references, so either delete these entries or cite the correct source.
7. The presentation and methodology need improvement.
8. The conclusion should be supported by the presented results.
9. All references should be complete and written in a consistent format.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 3 Report
From a methodological point of view, the article meets the expectations of a scientific article. The authors establish clear categories of analysis and comprehensively explain the procedure used for the machine learning and deep learning models. The methodology is also very clear and well worked out from a practical point of view.
Minor.
Lines 182-186 and lines 193-194 state the contribution of this paper.
Lines 182-186: "Inspired by the concepts of DML [22] and KD [5], we developed a new approach that combines the two methods into a formula to improve the performance of DML. If the concept used by DML is to pair two or more networks in the form of a cohort that aims to conduct training simultaneously by utilizing KL divergence loss to guide another network to increase the posterior entropy of each student ..."

Lines 193-194: "Our proposed method adopts two KD divergence to improve the network performance."
Minor.
In many existing sources, KL(P||Q) = CE(P,Q) - H(P), where KL is the KL-divergence, CE is the cross-entropy, and H is the entropy. In general, minimizing the cross-entropy or minimizing the KL-divergence yields the same result: what we want is to optimize Q to be close to P, and H(P) is not a function of Q. That is, since H(P) has no effect on the optimization, it does not matter whether we use the cross-entropy or the KL-divergence, so for optimization purposes minimizing KL(P||Q) is equivalent to minimizing CE(P,Q). In other words, using KL as a deep learning loss function is common practice.
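To make this concrete, the following minimal PyTorch sketch (not taken from the manuscript; tensor shapes and names are illustrative) checks numerically that KL(P||Q) and CE(P,Q) differ only by the constant H(P) and therefore produce identical gradients with respect to the student distribution Q:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Fixed target distribution P (e.g., a peer network's softened outputs).
p = F.softmax(torch.randn(1, 5), dim=1)

# Student logits to be optimized; Q = softmax(logits).
logits = torch.randn(1, 5, requires_grad=True)
log_q = F.log_softmax(logits, dim=1)

# KL(P || Q) = sum_i p_i * (log p_i - log q_i)
kl = torch.sum(p * (torch.log(p) - log_q))

# CE(P, Q) = -sum_i p_i * log q_i
ce = -torch.sum(p * log_q)

# H(P) = -sum_i p_i * log p_i does not depend on the logits, so the two
# losses differ by a constant and their gradients w.r.t. Q are identical.
grad_kl = torch.autograd.grad(kl, logits, retain_graph=True)[0]
grad_ce = torch.autograd.grad(ce, logits)[0]
print(torch.allclose(grad_kl, grad_ce))  # True
```

The printed True confirms that the H(P) term only shifts the loss value and has no effect on the optimization of Q.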
Therefore, there is a serious question about whether pairing two networks and using two KL-divergences constitutes a contribution. In other words, if you connect two deep learning networks, you will naturally use two KL-divergence losses.
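For reference, a generic two-network mutual-learning objective (a minimal sketch of the usual DML formulation, not the authors' actual code) already contains one KL term per student, i.e., two KL-divergence losses in a two-network cohort:

```python
import torch.nn.functional as F

def dml_losses(logits_1, logits_2, labels):
    """Illustrative two-student DML-style losses: supervised cross-entropy
    plus a KL term in which each student mimics its peer's distribution."""
    ce_1 = F.cross_entropy(logits_1, labels)
    ce_2 = F.cross_entropy(logits_2, labels)

    log_q1 = F.log_softmax(logits_1, dim=1)
    log_q2 = F.log_softmax(logits_2, dim=1)
    p1 = F.softmax(logits_1, dim=1)
    p2 = F.softmax(logits_2, dim=1)

    # KL(p2 || q1): network 1 mimics network 2, and vice versa.
    kl_1 = F.kl_div(log_q1, p2.detach(), reduction="batchmean")
    kl_2 = F.kl_div(log_q2, p1.detach(), reduction="batchmean")

    return ce_1 + kl_1, ce_2 + kl_2
```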
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The manuscript has been revised well and carefully, and I am now fully satisfied with the revised version. In my opinion, the manuscript is ready for publication, and I recommend that it be accepted in its current form.