Article
Peer-Review Record

Dual-Branch Multi-Scale Relation Networks with Tutorial Learning for Few-Shot Learning

Appl. Sci. 2024, 14(4), 1599; https://doi.org/10.3390/app14041599
by Chuanyun Xu 1,2,†, Hang Wang 2,†, Yang Zhang 1,*, Zheng Zhou 2 and Gang Li 2,*
Reviewer 2: Anonymous
Submission received: 11 January 2024 / Revised: 2 February 2024 / Accepted: 13 February 2024 / Published: 17 February 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is generally well written, interesting, and presents a reasonable set of experiments. However, there are some weaknesses that need to be addressed. In particular, the contributions seem less substantial when compared to existing works, and the improvement in results over other works is modest. Nevertheless, I believe the paper has merit and is worth publishing after addressing the following issues:

Major Issues:

  1. The paper lacks an inference framework; only the training part is described. I recommend including the inference framework and providing a block diagram for clarity.
  2. The tutoring procedure between branches 1 and 2 needs a better explanation.
  3. The total loss function in Eq. 7 is unclear. The repetition of the term L_D and the absence of weighting for the term L_z need clarification.
  4. All experiments utilize tiny images (84x84) and simple neural networks. It is advisable to include an independent experiment with larger images, employing well-known neural backbones, to verify the proposed method's applicability in real-world scenarios.

Minor Issues:

  1. The text in Figures 1, 2, and 3 is too small and needs adjustment for better readability.
  2. References for the works discussed in Tables 2, 3, and 4 should be included to enhance the overall completeness of the paper.

Author Response

Reviewer#1, Concern # 1: The paper lacks an inference framework; only the training part is described. I recommend including the inference framework and providing a block diagram for clarity.

Author response: Thank you for your suggestion. We have added a block diagram of the training and inference framework in the revised version, located in Section 3.1. Building on this, and following your suggestion, we have also supplemented the description of inference in Section 4.3.

Reviewer#1, Concern # 2: The tutoring procedure between branches 1 and 2 needs a better explanation.

Author response: Thank you for your suggestion. The tutorial learning module in this paper is inspired by knowledge distillation, and we had indeed omitted a discussion of this. On the one hand, we adopted the relation network as the metric because of its learnability; on the other hand, we believed that combining it with knowledge distillation could further improve the relation network's ability to learn to measure similarity, and the results confirmed this idea. Specifically, we regard the relation network in branch 1 as the teacher network and the relation networks in branch 2 as the student networks, because the weight-sharing relation network in branch 1 learns rich deep feature similarities at multiple scales, whereas the relation networks in branch 2 learn less, or only basic, similarity knowledge. We therefore combine knowledge distillation with branch 1 tutoring branch 2. Following your suggestion, we have supplemented the descriptions in Section 3.5 and added a citation for knowledge distillation [1]. We hope this resolves your doubts.

[1] Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
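For illustration only, the following is a minimal sketch of the tutoring (knowledge-distillation) idea described above, not the authors' actual implementation: the branch-1 relation scores act as the teacher signal and the branch-2 relation scores as the student, in the spirit of Hinton et al. [1]. The tensor names, the temperature, and the exact loss composition are assumptions made for this sketch.

```python
# Hypothetical sketch of relation-score distillation (PyTorch); not the authors' code.
import torch
import torch.nn.functional as F

def tutoring_loss(teacher_scores: torch.Tensor,   # [num_query, N_way], from branch 1 (teacher)
                  student_scores: torch.Tensor,   # [num_query, N_way], from branch 2 (student)
                  labels: torch.Tensor,           # [num_query], episode class indices
                  temperature: float = 4.0,
                  alpha: float = 0.3) -> torch.Tensor:
    # Soft targets from the teacher, softened by the temperature (teacher is not updated here).
    soft_targets = F.softmax(teacher_scores.detach() / temperature, dim=1)
    log_student = F.log_softmax(student_scores / temperature, dim=1)
    # KL divergence between the softened distributions, scaled by T^2 as in Hinton et al.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard-label loss of the student on the episode labels.
    hard_loss = F.cross_entropy(student_scores, labels)
    # alpha balances the hard and soft parts of the tutoring loss.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```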

Reviewer#1, Concern # 3: The total loss function in Eq. 7 is unclear. The repetition of the term L_D and the absence of weighting for the term L_z need clarification.

Author response: Thank you for your suggestion. We have revised Eq. 7 and its related description. We now introduce a separate loss L_KD = αL_D + (1 - α)L_T to represent the loss of the tutorial learning module, and the total loss becomes L = L_D + L_Z + L_KD, where the three terms, representing the loss of branch 1, the loss of branch 2, and the distillation loss of the tutorial learning module, are assigned equal weights. The coefficient α = 0.3 in L_KD only controls the relative contribution of the two parts of the distillation loss. We hope this makes the equation unambiguous.
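As a purely illustrative arithmetic check of the revised formulation (variable names are placeholders, not the authors' code), the total loss could be assembled from already-computed scalar losses as follows:

```python
# Hypothetical assembly of the revised total loss; L_D, L_Z, L_T are assumed to be
# already-computed scalar loss tensors (branch-1 loss, branch-2 loss, soft distillation term).
import torch

def total_loss(L_D: torch.Tensor, L_Z: torch.Tensor, L_T: torch.Tensor,
               alpha: float = 0.3) -> torch.Tensor:
    # alpha only balances the two parts inside the tutorial-learning loss L_KD.
    L_KD = alpha * L_D + (1.0 - alpha) * L_T
    # The three top-level terms carry equal weight, as stated in the response.
    return L_D + L_Z + L_KD
```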

Reviewer#1, Concern # 4:  All experiments utilize tiny images (84x84) and simple neural networks. It is advisable to include an independent experiment with larger images, employing well-known neural backbones, to verify the proposed method's applicability in real-world scenarios.

Author response: Thank you for the suggestion. Due to the limitations of our experimental equipment, we are currently unable to use larger images or models. Following the conventions of this research field, we replicated the experimental setting employed by other researchers, mainly to demonstrate the effectiveness of the method and to ensure a fair comparison with them. Nevertheless, we will work towards your suggestion in future research and have added it to the future work section. Thank you again for your suggestion.

Reviewer#1, Concern # 5: The text in Figures 1, 2, and 3 is too small and needs adjustment for better readability.

Author response: We have made the text larger to ensure it is easier to read. Thanks for your reminder.

Reviewer#1, Concern # 6: References for the works discussed in Tables 2, 3, and 4 should be included to enhance the overall completeness of the paper.

Author response: This was indeed our mistake; we have added the corresponding reference citations for the works discussed in the tables. Thank you for your suggestion.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article is devoted to the application of few-shot learning. The topic of the article is relevant. The structure of the article corresponds to that accepted in MDPI for research articles (Introduction, including an analysis of analogues; Models and Methods; Results; Discussion; Conclusions). The level of English is acceptable. The article is easy to read. The figures in the article are of acceptable quality. The article cites 59 sources, many of which are not relevant. The References section is carelessly formatted.

The following comments and recommendations can be formulated regarding the material of the article:

1. Few-Shot Learning is a machine learning task in which the model must be pre-tuned on a training dataset so that it can learn well on a limited number of new labeled examples. That is, even on new classes or data domains we will be able to obtain high-quality predictions. Few-Shot tasks are characterized by two parameters: N and K. The N-way K-shot task means that we need to learn how to solve the classification problem into N classes using K examples for each of them. Let's look at the problem in more detail using an example from the article. How can the authors prove that they have selected a sufficiently representative training dataset? How do the authors take into account the relationship between N and K in their work?

2. To solve the Few-Shot problem, there are several paradigms, one of which is meta-learning. There are 3 main approaches to meta-learning: - Model-based (modify the model architecture); - Metric-based (similar to KNN; solve the problem using some distance function, based on the similarity of objects); - Optimization-based (modify the very structure of training and optimization for more effective “pre-training”). How do these approaches compare with what the authors propose?

3. The Reptile algorithm is described in the article On First-Order Meta-Learning Algorithms (2018) by OpenAI. To put it briefly, the essence of the algorithm is as follows: we train the model many times in k steps on a randomly selected dataset from the prepared ones. The authors of Reptile recommend using a value between 10 and 20, and with k=1 we get the so-called joint training. With this pre-training, the model finds new weights, which, according to the authors, will help the model generalize better and quickly adapt to new tasks. I recommend that the authors consider using this algorithm in their model.

4. Finally, I'll note a few disadvantages of few-shot learning. Using few-shot learning usually means that there is very little manually labeled data available to solve the problem. There is also no off-the-shelf solution to compare against, and quality-assessment metrics are often not available; even when they exist, there is so little data that the signal is very noisy. There is another problem: in practice, I have come across many artifacts in the results that a model produces. The model may begin to invent pieces of information that have nothing to do with the original text (but are contained in a large part of the provided examples). Such artifacts can be removed by manually changing the prompt. Moreover, a seemingly small change can alter the output texts quite dramatically, for better or for worse. Even an extra line break or incorrect formatting in the prompt affects the answer and can greatly degrade its quality. Therefore, engineers have to carefully compose the prompt text, try, restart, and evaluate the results by eye; this is what I call the manual selection process. Prototyping with few-shot learning therefore turns out to be quite labor-intensive and time-consuming. Most importantly, this process does not guarantee that we will find a prompt that pushes the model to the highest-quality solution, and even if we find one, the text we compose will almost certainly not be optimal. I want to fight these problems. How do the authors account for these shortcomings in their model?

Author Response

Reviewer#2, Concern # 1: Few-Shot Learning is a machine learning task in which the model must be pre-tuned on a training dataset so that it can learn well on a limited number of new labeled examples. That is, even on new classes or data domains we will be able to obtain high-quality predictions. Few-Shot tasks are characterized by two parameters: N and K. The N-way K-shot task means that we need to learn how to solve the classification problem into N classes using K examples for each of them. Let's look at the problem in more detail using an example from the article. How can the authors prove that they have selected a sufficiently representative training dataset? How do the authors take into account the relationship between N and K in their work?

Author response: Thank you very much for your question; the confusion arises from our unclear presentation. We have added a new block diagram (Figure 1) describing the construction of few-shot tasks, together with the necessary descriptions. Our experiments are conducted on several typical few-shot learning datasets, in particular miniImageNet, which virtually all methods use to validate their effectiveness. The dataset is divided into training, validation, and test sets whose categories are mutually disjoint. As for the relationship between N and K, our work follows the mainstream few-shot episode mechanism, using the most common 5-way 1-shot and 5-way 5-shot settings: each task contains 5 categories, and each category has only 1 or 5 samples in the support set. In the test stage, tasks are constructed in the same way, except that the categories do not intersect with those used in training. The 1-shot case is more challenging, whereas in the 5-shot case a class prototype built from 5 samples is generally stronger than one built from a single sample. When N is larger the task also becomes more challenging; only a few methods report results in the 20-way setting.
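For readers unfamiliar with the episode mechanism, a generic sketch of N-way K-shot episode sampling is given below. It is an illustration under stated assumptions (the dataset layout and all function and variable names are invented here), not the authors' data pipeline; it only shows how a 5-way 1-shot or 5-way 5-shot task with disjoint classes can be drawn.

```python
# Generic N-way K-shot episode sampling; dataset layout and names are illustrative only.
import random
from typing import Dict, List, Tuple

def sample_episode(class_to_images: Dict[str, List[str]],
                   n_way: int = 5, k_shot: int = 1, n_query: int = 15
                   ) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """class_to_images maps class name -> image paths for ONE split (train/val/test
    splits must use disjoint class sets). Returns (support, query) as lists of
    (image_path, episode_label) pairs."""
    classes = random.sample(sorted(class_to_images), n_way)          # pick N classes
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picked = random.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, episode_label) for img in picked[:k_shot]] # K support samples
        query += [(img, episode_label) for img in picked[k_shot:]]   # query samples
    return support, query
```

In the 5-way 5-shot case, the K support features per class are typically averaged into a prototype before comparison, which is one reason the 5-shot setting is generally stronger than 1-shot.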

Reviewer#2, Concern # 2: To solve the Few-Shot problem, there are several paradigms, one of which is meta-learning. There are 3 main approaches to meta-learning: - Model-based (modify the model architecture); - Metric-based (similar to KNN; solve the problem using some distance function, based on the similarity of objects); - Optimization-based (modify the very structure of training and optimization for more effective “pre-training”). How do these approaches compare with what the authors propose?

Author response: Thank you for raising this question. Our method indeed belongs to the metric-based category, which contains the largest number of methods. In the related work section of our paper, we group the model-based and optimization-based methods under meta-learning-based methods without further subdivision, as many other articles do. In the result tables we compare against different types of methods, but mostly against other metric-based methods, partly because there are more of them and partly because this allows a fair comparison that demonstrates the effectiveness of our method.

Reviewer#2, Concern # 3: The Reptile algorithm is described in the article On First-Order Meta-Learning Algorithms (2018) by OpenAI. To put it briefly, the essence of the algorithm is as follows: we train the model many times in k steps on a randomly selected dataset from the prepared ones. The authors of Reptile recommend using a value between 10 and 20, and with k=1 we get the so-called joint training. With this pre-training, the model finds new weights, which, according to the authors, will help the model generalize better and quickly adapt to new tasks. I recommend that the authors consider using this algorithm in their model.

Author response: Thank you very much for your recommendation. This is a classic article published after MAML. However, due to time constraints, we cannot conduct in-depth research on it for now. In future work, we will seriously consider how to apply the ideas of this article to our model. Thank you again for your suggestion.
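For reference, here is a minimal sketch of the first-order Reptile update the reviewer describes (module and loader names are placeholders; this is not part of the authors' model): k inner SGD steps on one sampled task, after which the shared initialization is moved a fraction epsilon toward the adapted weights.

```python
# Hypothetical sketch of one Reptile meta-update (PyTorch); names are placeholders.
import copy
import torch

def reptile_step(model, make_task_loader, loss_fn, k=10, inner_lr=1e-3, epsilon=0.1):
    """make_task_loader() should return an iterable of (inputs, targets) batches
    drawn from one randomly sampled task."""
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for step, (inputs, targets) in enumerate(make_task_loader()):
        if step >= k:                              # k inner SGD steps on this task
            break
        inner_opt.zero_grad()
        loss_fn(adapted(inputs), targets).backward()
        inner_opt.step()
    with torch.no_grad():                          # move initialization toward adapted weights
        for p, p_task in zip(model.parameters(), adapted.parameters()):
            p.add_(epsilon * (p_task - p))
```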

Reviewer#2, Concern # 4: Finally, I'll note a few disadvantages of few-shot learning. Using few-shot learning usually means that there is very little manually labeled data available to solve the problem. There is also no off-the-shelf solution to compare against, and quality-assessment metrics are often not available; even when they exist, there is so little data that the signal is very noisy. There is another problem: in practice, I have come across many artifacts in the results that a model produces. The model may begin to invent pieces of information that have nothing to do with the original text (but are contained in a large part of the provided examples). Such artifacts can be removed by manually changing the prompt. Moreover, a seemingly small change can alter the output texts quite dramatically, for better or for worse. Even an extra line break or incorrect formatting in the prompt affects the answer and can greatly degrade its quality. Therefore, engineers have to carefully compose the prompt text, try, restart, and evaluate the results by eye; this is what I call the manual selection process. Prototyping with few-shot learning therefore turns out to be quite labor-intensive and time-consuming. Most importantly, this process does not guarantee that we will find a prompt that pushes the model to the highest-quality solution, and even if we find one, the text we compose will almost certainly not be optimal. I want to fight these problems. How do the authors account for these shortcomings in their model?

Author response: Thank you very much for your question. Your practical experience shows that there is still much work to be done on few-shot problems. When samples are scarce, it is indeed difficult to learn a stable representation; the goal of few-shot learning is precisely to obtain a stable model with limited samples. Traditional classification lets the model learn a representation of each class and then map inputs to the corresponding class, whereas few-shot learning classifies by comparing samples. Taking our method, or a typical metric-based method, as an example, the purpose is not to memorize the representation of each class but to learn to compare or distinguish: the support samples and the query samples are compared in pairs. This approach does not require a stable representation of each class, only the ability to judge similarity or dissimilarity, which is easier than memorizing a stable representation. At present, this type of method is the more effective one for few-shot problems. We hope this clears up your doubts.
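To make the pairwise comparison described above concrete, here is a generic, simplified relation-scoring sketch. The architecture, dimensions, and names are illustrative assumptions, not the authors' dual-branch multi-scale network: each query embedding is paired with each support-class embedding, and a small learnable module scores their similarity.

```python
# Generic pairwise relation scoring (PyTorch); a simplified illustration, not the authors' network.
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        # Maps a concatenated (support, query) feature pair to a similarity score in [0, 1].
        self.score = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, support_protos: torch.Tensor, query_feats: torch.Tensor) -> torch.Tensor:
        # support_protos: [N_way, D] class prototypes (K-shot features averaged per class)
        # query_feats:    [num_query, D] query embeddings
        n_way, d = support_protos.shape
        num_query = query_feats.shape[0]
        pairs = torch.cat([
            support_protos.unsqueeze(0).expand(num_query, n_way, d),
            query_feats.unsqueeze(1).expand(num_query, n_way, d)], dim=-1)
        return self.score(pairs).squeeze(-1)       # [num_query, N_way] relation scores
```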

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have considered most of my concerns and suggestions. I would still like to see experiments in real settings with larger and more standard images; nonetheless, I understand the authors' justification for not performing them.

On the other hand, I still believe that the text in the figures is too small. The usual norm is that text in figures should not be smaller than the main text. However, the authors will probably face this issue when preparing the paper for final publication with the journal staff.

Reviewer 2 Report

Comments and Suggestions for Authors

I formulated the following comments on the previous version of the article:

1. Few-Shot Learning is a machine learning task in which the model must be pre-tuned on a training dataset so that it can learn well on a limited number of new labeled examples. That is, even on new classes or data domains we will be able to obtain high-quality predictions. Few-Shot tasks are characterized by two parameters: N and K. The N-way K-shot task means that we need to learn how to solve the classification problem into N classes using K examples for each of them. Let's look at the problem in more detail using an example from the article. How can the authors prove that they have selected a sufficiently representative training dataset? How do the authors take into account the relationship between N and K in their work?

2. To solve the Few-Shot problem, there are several paradigms, one of which is meta-learning. There are 3 main approaches to meta-learning: - Model-based (modify the model architecture); - Metric-based (similar to KNN; solve the problem using some distance function, based on the similarity of objects); - Optimization-based (modify the very structure of training and optimization for more effective “pre-training”). How do these approaches compare with what the authors propose?

3. The Reptile algorithm is described in the article On First-Order Meta-Learning Algorithms (2018) by OpenAI. To put it briefly, the essence of the algorithm is as follows: we train the model many times in k steps on a randomly selected dataset from the prepared ones. The authors of Reptile recommend using a value between 10 and 20, and with k=1 we get the so-called joint training. With this pre-training, the model finds new weights, which, according to the authors, will help the model generalize better and quickly adapt to new tasks. I recommend that the authors consider using this algorithm in their model.

4. Finally, I'll note a few disadvantages of few-shot learning. Using few-shot learning usually means that there is very little manually labeled data available to solve the problem. There is also no off-the-shelf solution to compare against, and quality-assessment metrics are often not available; even when they exist, there is so little data that the signal is very noisy. There is another problem: in practice, I have come across many artifacts in the results that a model produces. The model may begin to invent pieces of information that have nothing to do with the original text (but are contained in a large part of the provided examples). Such artifacts can be removed by manually changing the prompt. Moreover, a seemingly small change can alter the output texts quite dramatically, for better or for worse. Even an extra line break or incorrect formatting in the prompt affects the answer and can greatly degrade its quality. Therefore, engineers have to carefully compose the prompt text, try, restart, and evaluate the results by eye; this is what I call the manual selection process. Prototyping with few-shot learning therefore turns out to be quite labor-intensive and time-consuming. Most importantly, this process does not guarantee that we will find a prompt that pushes the model to the highest-quality solution, and even if we find one, the text we compose will almost certainly not be optimal. I want to fight these problems. How do the authors account for these shortcomings in their model?

The authors responded to all my comments. I found their answers quite convincing. I support the publication of the current version of the article. I wish the authors creative success.
