Peer-Review Record

Abnormality Detection of Blast Furnace Tuyere Based on Knowledge Distillation and a Vision Transformer

Appl. Sci. 2023, 13(18), 10398; https://doi.org/10.3390/app131810398
by Chuanwang Song *, Hao Zhang, Yuanjun Wang, Yuhui Wang and Keyong Hu
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 10 July 2023 / Revised: 7 September 2023 / Accepted: 15 September 2023 / Published: 17 September 2023
(This article belongs to the Special Issue Applications of Deep Learning and Artificial Intelligence Methods)

Round 1

Reviewer 1 Report

 

The authors submitted a manuscript dealing with the classification of blast furnace tuyere images using Vision Transformers, CNNs, and knowledge distillation. The chosen topic and approaches are currently hot, and competitive performance is reported. However, I have a number of issues with the submission. Most importantly, I feel that not much attention has been devoted to explaining what the original contribution of this submission actually is, and how exactly it relates to the prior state-of-the-art. For example, the authors appear to build on the proposal by Zhang and Yu-Bin (2021) to improve the self-attention mechanism, but it is not clear what their own contribution was. When proposals of other groups are used, this should be made explicit right away: the authors refer to an improved ViT and an improved Transformer Encoder in the overview Figures 1 and 2, which implies that the improvements were entirely original contributions of the current submission. Apparently, as far as I understand, that is not the case, and this must be rectified. Also, a paper similar to the one submitted was published some months ago but is not mentioned anywhere in the submission.

More detailed comments below:

The title appears to be somewhat awkward: Literally all titles of articles could start with “Research” – so there is no information in that word. Also “Abnormal Detection”: Shouldn’t that rather be “Abnormality Detection”?

Knowledge distillation (KD) is normally used to let a larger teacher model train a smaller student model. However, ResNet50, which the authors use as the teacher model, is rather smaller than a ViT, which the authors use as the student model. ResNet50 has ~25M parameters, but ViTs are typically larger. So I am wondering how the KD paradigm is justified here.

The authors argue that they aim at improving the ViT's ability to model local features and texture, but appear to neglect the prior state-of-the-art, see e.g., https://arxiv.org/pdf/2012.12556.pdf, and LocalViT https://arxiv.org/pdf/2104.05707.pdf. In particular, the authors claim to propose an improved ViT, but since the ViT is a very hot topic, a specific comparison of their own attempts with previous attempts of other groups to improve the ViT would be needed. I am not yet convinced that the insertion of a spatial attention module prior to the linear projection is a reasonable and significant step forward. (Maybe it is, but I am not sure… The spatial attention module shown in Fig. 3 appears to be a quite trivial CNN and does not seem to employ any attention at all.)

Further comments on Fig. 2: Is only the cls token used for classification? That is not shown in Fig. 2. Block "Improved Transformer Encoder": improved how? ×6?

Fig. 4: I guess a citation is needed here?

Fig. 5: Please put the Zhang citation into the caption. Also, the prior state-of-the-art should rather be explained in a separate section denoted "preliminaries" or "(prior) state-of-the-art". Alternatively, it can be put into the introduction: "Zhang proposed …, which we use in our contribution for …" If you say "Improved Transformer Encoder", that implies that you improved it. But apparently it was Zhang…?

The same applies to the rest: Please be more specific about what your original contribution is and how you relate to the prior state-of-the-art. Write a paragraph that gives an overview of which parts of the prior state-of-the-art you combine or alter to arrive at your solution.

Why is Rong et al.’s work on voiceprints relevant for the submitted work?

The authors do not mention the previous publication (similar technique and title, overlap in authors): https://doi.org/10.3390/app13020802. “Research on Blast Furnace Tuyere Image Anomaly Detection, Based on the Local Channel Attention Residual Mechanism”

Results:

Please cite in the tables the models that you evaluated. ViT: the original Vision Transformer model: Which exactly of these did you use? Something from here → https://github.com/google-research/vision_transformer? Is ViT(EA) the proposal by Zhang? Who proposed the other models?

Tables 2–4 appear to contain duplications. A better style would be to combine the three tables into one.

3.4. says “comparative experiment”, but 3.3 (ablation study) is also a comparative experiment. So, I am missing the contrast in the headings.

What is the “distillation parameter”?

Author Response

Dear Reviewer,

We greatly appreciate your professional review of our manuscript. As you have pointed out, there are several issues that require attention. Following your valuable suggestions, we have made extensive revisions to our previous manuscript, and the detailed corrections are outlined below.

# Responses to the more detailed comments

Comments 1:

The title appears to be somewhat awkward: Literally all titles of articles could start with “Research” – so there is no information in that word. Also “Abnormal Detection”: Shouldn’t that rather be “Abnormality Detection”?

Response 1:

We sincerely thank the reviewer for the careful reading. In accordance with the reviewer's feedback, we have made the following changes to the original title: (1) removed the words "Research on" and (2) replaced "Abnormal Detection" with "Abnormality Detection". The revised title now reads: "Abnormality Detection of Blast Furnace Tuyere Based on Knowledge Distillation and a Vision Transformer". (lines 2-3)

-----------------------------------------------------------------------------------------

Comments 2:

Knowledge distillation (KD) is normally used to let a larger teacher model train a smaller student model. However, ResNet50, which the authors use as the teacher model, is rather smaller than a ViT, which the authors use as the student model. ResNet50 has ~25M parameters, but ViTs are typically larger. So I am wondering how the KD paradigm is justified here.

Response 2:

As you rightly point out, we did consider this issue when first implementing knowledge distillation (KD). We observed that, among existing CNN-based classification methods, ResNet50 displayed outstanding performance in the classification of blast furnace tuyere images, particularly excelling at capturing local features and texture information. The primary objective of knowledge distillation is to transfer the strengths of the teacher model in specific respects to the student model, aiming for a more comprehensive performance enhancement. It extends beyond model size alone and serves as a means of conveying the 'knowledge' held by the teacher model, which may encompass facets such as generalization capability.

Furthermore, taking into account constraints related to dataset scale, computational resources, memory, and storage, this study opted for a smaller model (ResNet50) as the teacher. By reducing the number of layers in the Transformer model, we aimed to adapt it to resource-constrained environments. This strategy helped ensure that the model operates within feasible computational and memory resources and contributed to enhancing the performance of the student model (ViT).
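For concreteness, the following minimal sketch illustrates the teacher/student pairing described above. The 6-layer student depth follows our revision; everything else (the library builders, hidden size, head count, class count) is an illustrative assumption rather than our exact implementation:

```python
import torchvision.models as models

# Frozen ResNet50 teacher: it only provides soft targets during distillation.
teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Reduced-depth ViT student (6 encoder layers instead of 12); other
# dimensions are ViT-Base defaults and may differ from the paper's settings.
student = models.vision_transformer.VisionTransformer(
    image_size=224, patch_size=16,
    num_layers=6,                 # reduced depth, per the response above
    num_heads=12, hidden_dim=768, mlp_dim=3072,
    num_classes=1000,             # illustrative; the tuyere task uses fewer classes
)
```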

We hope that this response addresses your concerns satisfactorily.

-----------------------------------------------------------------------------------------

Comments 3:

The authors argue that they aim at improving the ViT's ability to model local features and texture, but appear to neglect the prior state-of-the-art, see e.g., https://arxiv.org/pdf/2012.12556.pdf, and LocalViT https://arxiv.org/pdf/2104.05707.pdf. In particular, the authors claim to propose an improved ViT, but since the ViT is a very hot topic, a specific comparison of their own attempts with previous attempts of other groups to improve the ViT would be needed. I am not yet convinced that the insertion of a spatial attention module prior to the linear projection is a reasonable and significant step forward. (Maybe it is, but I am not sure… The spatial attention module shown in Fig. 3 appears to be a quite trivial CNN and does not seem to employ any attention at all.)

Response 3:

Thank you very much for the issues and comments you have raised; we understand your point of view. The goal of this paper is to combine ResNet50's ability to capture local features and texture information with the Transformer's ability to capture global information and long-distance dependencies. With the help of knowledge distillation, the improved ViT model in this paper retains the Transformer's capacity for global information and long-distance dependencies while also acquiring ResNet50's ability to capture local features and texture information; improving ViT's extraction of local features and textures is therefore not our only aim. The survey at https://arxiv.org/pdf/2012.12556.pdf reviews Transformer-based models in computer vision, including applications in image classification, high-level vision, low-level vision, and video processing. The image classification applications covered there mainly concern training on large-scale datasets (14 million to 300 million images) or medium-sized datasets such as ImageNet, whereas the Tuyere-Data, Stanford-Dogs, and CUB-200-2011 datasets are small; the knowledge distillation method we use addresses this problem to a certain extent. We apologize for not having seen LocalViT (https://arxiv.org/pdf/2104.05707.pdf) in time. Based on your comments, we have added a reference to this article in the INTRODUCTION section. (lines 90-92)
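For readers comparing our approach with LocalViT, the following is a minimal sketch of the mechanism that paper uses to inject locality, reconstructed from the cited arXiv manuscript: tokens are reshaped to a 2D map and a depthwise convolution is inserted between the two pointwise layers of the feed-forward block. The class name, activation choice, and layout are simplifications, not the LocalViT authors' exact code:

```python
import torch
import torch.nn as nn

class LocalFeedForward(nn.Module):
    """LocalViT-style feed-forward block (sketch of Li et al., 2021).

    A depthwise 3x3 convolution between the pointwise expand/project
    layers mixes neighboring tokens, adding locality to the ViT FFN.
    """
    def __init__(self, dim: int, hidden: int, hw: int):
        super().__init__()
        self.hw = hw                                   # side length of the token grid
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()                           # LocalViT uses h-swish; simplified here

    def forward(self, tokens):                         # tokens: (B, N, dim), N == hw * hw
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, self.hw, self.hw)
        x = self.project(self.act(self.dw(self.act(self.expand(x)))))
        return x.reshape(b, d, n).transpose(1, 2)      # back to a token sequence
```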

In our study, we integrated the spatial attention mechanism into our improved Vision Transformer (ViT) model. This integration aimed to make the model more attentive to fine-grained features, mitigate the impact of small inter-class differences, and enhance the model's perceptual and feature extraction capabilities across different regions of the image. When incorporating the spatial attention mechanism, we analyzed parameters such as the convolution kernel size and spatial weight coefficients, tailored to the requirements of this experiment. Ultimately, we chose convolution kernel sizes and spatial weight coefficients consistent with those used in the work by Park, J. Based on the reviewer's comments, we have added annotations to the spatial attention part of Figure 3. (lines 166-167)
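To make the reference to Park et al. concrete, here is a minimal sketch of a CBAM-style spatial attention module with the 7×7 kernel that work defaults to. This illustrates the cited design and why the sigmoid mask constitutes attention; it is not necessarily our exact configuration:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (sketch of Woo, Park, et al., 2018).

    Pools the channel dimension with mean and max, convolves the two
    resulting maps, and rescales the input by a sigmoid mask, so salient
    spatial positions are emphasized. The 7x7 kernel is the CBAM default.
    """
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        avg_map = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)          # (B, 1, H, W)
        mask = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * mask                                # reweight spatial positions
```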

-----------------------------------------------------------------------------------------

Comments 4:

Further comments on Fig. 2: Is only the cls token used for classification? That is not shown in Fig. 2. Block "Improved Transformer Encoder": improved how? ×6?

Response 4:

We apologize for our oversight. As you inquired, we exclusively utilize the Cls Token in this manuscript, denoted as "0*" in Figure 2. (lines 137-140) In response to your valuable question, we have added the abbreviation for the Classification Token and clarified its corresponding position in Figure 2. (lines 157-159)

Furthermore, we have annotated the Cls Token in Figure 2, labeled as “0*”.

The "Improved Transformer Encoder" refers not only to the use of a 6-layer Transformer encoder (the "×6" in the figure) but also to the improvements made to the self-attention mechanism within the Transformer encoder, as detailed in Section 2.3.

-----------------------------------------------------------------------------------------

Comments 5:

Fig. 4: I guess a citation is needed here?

Response 5:

As suggested by the reviewer, we have now added citations for the two parts of the content shown in Figure 4. (lines 187-188)

-----------------------------------------------------------------------------------------

Comments 6:

Fig. 5: Please put the Zhang citation into the caption. Also, the prior state-of-the-art should rather be explained in a separate section denoted "preliminaries" or "(prior) state-of-the-art". Alternatively, it can be put into the introduction: "Zhang proposed …, which we use in our contribution for …" If you say "Improved Transformer Encoder", that implies that you improved it. But apparently it was Zhang…?

(When proposals of other groups are used, this should be made explicit right away: the authors refer to an improved ViT and an improved Transformer Encoder in the overview Figures 1 and 2, which implies that the improvements were entirely original contributions of the current submission. Apparently, as far as I understand, that is not the case, and this must be rectified.)

Response 6:

As suggested by the reviewer, we have incorporated the Zhang citation into the caption (lines 208-209) and, in accordance with the reviewer's guidance, have moved the discussion of the prior state-of-the-art proposed by Zhang into the introduction. (lines 92-96)

Additionally, we have added clarifications in Figure 1 and Figure 2 to note the inspiration of the improved Vision Transformer from the referenced article.

-----------------------------------------------------------------------------------------

Comments 7:

The same applies to the rest: Please be more specific about what your original contribution is and how you relate to the prior state-of-the-art. Write a paragraph that gives an overview of which parts of the prior state-of-the-art you combine or alter to arrive at your solution.

(Most importantly, I feel that not much attention has been devoted to explaining what the original contribution of this submission actually is, and how exactly it relates to the prior state-of-the-art. For example, the authors appear to build on the proposal by Zhang and Yu-Bin (2021) to improve the self-attention mechanism, but it is not clear what their own contribution was.)

Response 7:

Thank you for the valuable feedback from the reviewer. We are more than willing to provide additional details to further clarify the original contributions of our paper. The primary original contributions of our study can be summarized as follows:

Application of a Novel Method to Blast Furnace Tuyere Data: We combined Convolutional Neural Networks (CNN) and the Vision Transformer (ViT) and employed knowledge distillation to create the Blast Furnace Tuyere image classification model (BDiT). The novelty lies in the introduction of ViT into the task of blast furnace tuyere image classification, offering a new solution for this field.

Performance Enhancement: Our focus was primarily on improving the VIT component within the BDiT model. Through extensive experiments on the Blast Furnace Tuyere image dataset, we not only achieved outstanding classification accuracy but also demonstrated the generalization capability of the BDiT model on different datasets. Compared to the methods used in our previous research, BDiT shows greater potential for practical applications.

Experimental Validation and Verification: We conducted comprehensive experiments to thoroughly validate our approach. These experiments encompassed performance evaluations on the private Blast Furnace Tuyere image dataset (Tuyere-Data) and public fine-grained image datasets (Stanford-Dogs, CUB-200-2011), confirming the reliability and versatility of our method.

New Insights and Understanding: Our research not only provides new insights into the field of Blast Furnace Tuyere anomaly detection but also offers significant directions for future research.

We hope that this information clarifies the unique contributions of our research. Furthermore, we have made corresponding additions to the relevant sections in the manuscript. (lines 107-119 and lines 219-226)

-----------------------------------------------------------------------------------------

Comments 8&9:

8-Why is Rong et al.’s work on voiceprints relevant for the submitted work?

9-The authors do not mention the previous publication (similar technique and title, overlap in authors): https://doi.org/10.3390/app13020802. “Research on Blast Furnace Tuyere Image Anomaly Detection, Based on the Local Channel Attention Residual Mechanism”

(Also, a paper similar to the one submitted was published some months ago, but was not mentioned anywhere in the submission.)

Response 8&9:

Thank you for your feedback and suggestions. We appreciate your concern regarding the relevance of Rong et al.'s work to our study. In writing this paper, we aimed to provide comprehensive background by discussing relevant methods from various application scenarios, to help readers better understand our research. However, we acknowledge that this approach may have resulted in excessive length and a lack of focus, detracting from the paper's central theme. Following your advice, we have removed the references to Rong's work from the introduction and incorporated relevant content from our previously published paper, ensuring better alignment with the manuscript's content. Once again, we sincerely appreciate your valuable input, which has greatly contributed to our work. (lines 107-119)

-----------------------------------------------------------------------------------------

# Feedback on the Results

Results 1:

Please cite in the tables the models that you evaluated. ViT: the original Vision Transformer model: Which exactly of these did you use? Something from here → https://github.com/google-research/vision_transformer? Is ViT(EA) the proposal by Zhang? Who proposed the other models?

Feedback 1:

Thank you for your detailed comments. Based on the reviewer's suggestion, we have included the reference to the Vision Transformer model in Table 2. In the article, "ViT" refers to the original Vision Transformer model proposed by Dosovitskiy et al. (2020) [https://github.com/google-research/vision_transformer]. "ViT(EA)" was not proposed by Zhang, and its architecture differs from the ResT model proposed by Zhang. The other models in Table 2, such as "ViT(IEA)" and "ViT(IEA&SPA)", are modifications of the original Vision Transformer model that gradually incorporate the improved self-attention module and the spatial attention module; they are not derived from any other sources. We hope this explanation clarifies your doubts. Thank you again for your valuable questions.

-----------------------------------------------------------------------------------------

Results 2:

Tables 2–4 appear to contain duplications. A better style would be to combine the three tables into one.

Feedback 2:

Thank you for your comments. Based on the reviewer's suggestions, we have merged Tables 2-4 in the original manuscript into a single table, which is now Table 2 in the revised manuscript. We have also made corresponding changes to the table numbers in the text, and adjusted the paragraphs for better readability and aesthetics. We hope that these changes address your concerns and improve the manuscript.

-----------------------------------------------------------------------------------------

Results 3:

3.4. says “comparative experiment”, but 3.3 (ablation study) is also a comparative experiment. So, I am missing the contrast in the headings.

Feedback 3:

Thank you for your valuable comments on our paper. We greatly appreciate your attention and feedback.

Regarding the choice of titles for Sections 3.3 and 3.4, we understand your concern. We selected these titles to provide a clearer distinction between the different experimental designs and to aid readers in understanding our research. We recognize that Section 3.3, titled "Ablation Experiment", essentially constitutes a comparative experiment aimed at evaluating the impact of the various modules of our model on the experimental results, while Section 3.4 seeks to compare the performance differences between various methods and also includes a comparison of our proposed method against other classical methods on fine-grained image datasets. To eliminate any potential confusion, we have revised the title of Section 3.4 to "Comparative Experiment of Different Methods". (line 368)

Once again, we appreciate your review and feedback, which have significantly contributed to the enhancement of the quality of our paper.

-----------------------------------------------------------------------------------------

Results 4:

What is the “distillation parameter”?

Feedback 4:

Thank you for your question. The distillation parameters are the hyperparameters that govern how knowledge distillation trains the student model, i.e., how the teacher's outputs supervise the student as the distance between the two models' predictions is minimized. In this paper, we mainly use the temperature parameter, which controls how strongly the teacher's output distribution is softened before the student is matched to it.
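For illustration, here is a minimal sketch of how the temperature typically enters the distillation objective, following Hinton et al.'s formulation; the weighting coefficient alpha and the default values are illustrative assumptions, not the exact settings of our experiments:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Hinton-style soft-target distillation (illustrative default values).

    The temperature T softens both output distributions; the T**2 factor
    rescales the gradient magnitude of the soft term, as in Hinton et al. (2015).
    """
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)  # ground-truth supervision
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```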

-----------------------------------------------------------------------------------------

Once again, we sincerely appreciate your valuable input, and we will carefully consider and incorporate it into our future research endeavors.

 

Sincerely,

Hao Zhang

Author Response File: Author Response.docx

Reviewer 2 Report

Abstract, line 7: Instead of the words "steel smelting" in the text part "...tuyere is a key position in the steel smelting production...", use "hot metal". In a blast furnace, hot metal is produced by smelting and reducing iron ore, and there is no steel smelting in the BF (no use of scrap in the BF).

Section 1, line 26: "In the smelting production of steel enterprises" does not sound metallurgically correct. Instead, it should be written "In the integrated route for iron and steelmaking, ...".

General comment for the references used: they primarily focus on the work of Chinese researchers. There should be also included at least 2-3 research groups from other parts of the world (e.g., Dr. Puttinger from the Austrian Johannes Kepler University, who also investigated the tuyere of the BF, published in Steel Research International or ISIJ International).

Conclusions: you write at the end that the model has great potential for industrial applications. A comment would be valuable if this is planned for the near future. Apart from that, challenges for future research (if this is planned as well) might also be interesting for the reader to see the main points for improvements.

 

Author Response

Dear Reviewer,

Thank you very much for your comments and professional advice, which help to improve the academic rigor of our manuscript. Based on your suggestions and requests, we have made the corresponding corrections in the revised manuscript, and we hope our work is thereby improved. The details are as follows:

Comments 1:

Abstract, line 7: Instead of the words "steel smelting" in the text part "...tuyere is a key position in the steel smelting production...", use "hot metal". In a blast furnace, hot metal is produced by smelting and reducing iron ore, and there is no steel smelting in the BF (no use of scrap in the BF).

Response 1:

We sincerely thank the reviewer for careful reading. As suggested by the reviewer, we have corrected the “steel smelting” into “hot metal”. (line 7)

-----------------------------------------------------------------------------------------

Comments 2:

Section 1, line 26: "In the smelting production of steel enterprises" does not sound metallurgically correct. Instead, it should be written "In the integrated route for iron and steelmaking, ...".

Response 2:

We think this is an excellent suggestion. As suggested by the reviewer, we have corrected the “In the smelting production of steel enterprises” into “In the integrated route for iron and steelmaking”. (line 27)

-----------------------------------------------------------------------------------------

Comments 3:

General comment for the references used: they primarily focus on the work of Chinese researchers. There should be also included at least 2-3 research groups from other parts of the world (e.g., Dr. Puttinger from the Austrian Johannes Kepler University, who also investigated the tuyere of the BF, published in Steel Research International or ISIJ International).

Response 3:

We sincerely appreciate the valuable comments. We have checked the literature carefully and added more references ([8] [9]) into the INTRODUCTION part in the revised manuscript. (line 40-41)

-----------------------------------------------------------------------------------------

Comments 4:

Conclusions: you write at the end that the model has great potential for industrial applications. A comment would be valuable if this is planned for the near future. Apart from that, challenges for future research (if this is planned as well) might also be interesting for the reader to see the main points for improvements.

Response 4:

Thank you very much for your valuable suggestions. Following your advice, we have added a discussion on future research prospects in the CONCLUSION section. We believe that through further research, we can delve deeper into the potential of this field and achieve improved performance and practical applications. (line 450-453)

-----------------------------------------------------------------------------------------

Once again, we sincerely appreciate your valuable input, and we will carefully consider and incorporate it into our future research endeavors.

 

Sincerely,

Hao Zhang

Author Response File: Author Response.docx

Reviewer 3 Report

Comments attached 

Comments for author File: Comments.pdf

Author Response

Dear Reviewer,

We greatly appreciate your professional review of our manuscript. As you note, there are several problems that need to be addressed. Following your helpful suggestions, we have made extensive corrections to our previous manuscript; the detailed corrections are listed below.

Comments 1:

It would be better if the authors could demonstrate the evolution of the proposed spatial attention. This could be based on previous work or on different configurations that the authors investigated before presenting the final attention module.

Response 1:

We greatly appreciate the valuable input from the reviewer, and we understand the reviewer's perspective. In our study, we integrated the spatial attention mechanism into our improved Vision Transformer (ViT) model. This integration aimed to make the model more attentive to fine-grained features, mitigate the impact of small inter-class differences, and enhance the model's perceptual and feature extraction capabilities across different regions of the image. When incorporating the spatial attention mechanism, we analyzed parameters such as the convolution kernel size and spatial weight coefficients, tailored to the requirements of this experiment. Ultimately, we chose convolution kernel sizes and spatial weight coefficients consistent with those used in the work by Park, J.

In the experiments, we focused on the process of improving the ViT model, the improved ViT itself, and the experimental results of our proposed BDiT. We apologize for not showcasing the evolution and configuration work of the spatial attention mechanism in the article and hope for your understanding regarding our intentions. We greatly appreciate your patience and advice. Thank you once again for your efforts.

-----------------------------------------------------------------------------------------

Comments 2:

The authors should indicate the exact position where this spatial attention is introduced. This could be demonstrated with a figure of the current ViT.

Response 2:

We greatly appreciate the valuable feedback from the reviewer. In our improved Vision Transformer model framework, the precise introduction position of the Spatial Attention Module (SAM) is as follows:

SAM is positioned before the serialization of image token embeddings and before the Cls Token. The Cls Token is typically used to incorporate global information, while SAM contributes to the handling of relationships among local image tokens. The combination of these two elements allows for a better capture of both global and local image information. Additionally, we have provided annotations for Cls Token, denoted as 0*, in Figure 2 to facilitate a clearer understanding of the introduction of SAM.
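To make this placement concrete, the following is a minimal sketch of the stem as described above. The module and parameter names are illustrative; the SpatialAttention block follows the CBAM-style sketch in our response to Reviewer 1, and the dimensions are ViT-Base defaults rather than our exact configuration:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial mask (same sketch as in the response to Reviewer 1)."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        m = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(m))

class ImprovedViTStem(nn.Module):
    """Illustrative stem: SAM applied before patch serialization and the Cls token."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.sam = SpatialAttention()                          # local reweighting of the raw image
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # the "0*" token in Fig. 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                                      # x: (B, 3, H, W)
        x = self.sam(x)                                        # SAM before serialization
        x = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, N, dim) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed     # then into the (improved) encoder
```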

We sincerely thank the reviewer for their careful consideration and assistance.

-----------------------------------------------------------------------------------------

Comments 3:

Have the authors investigated alternative backbones for the current ViT? A justification for using ResNet50 should be given.

Response 3:

We greatly appreciate the reviewer's question. We did consider the choice of ResNet50 as the backbone (the teacher model) in our current framework carefully and deliberated on it thoroughly. The following are the reasons for selecting ResNet50 and our considerations regarding alternative backbones:

Firstly, ResNet50 is a well-established deep convolutional neural network with extensive pretraining. This means that we can leverage pretraining weights from existing large-scale image datasets to initialize our model, which aids in accelerating convergence and improving performance.

Secondly, ResNet50 has become a significant benchmark model in computer vision research, serving as a basis for comparisons with many new methods and techniques. This facilitates ease of comparison and benchmarking of our work against existing research.

Furthermore, while we chose ResNet50 as the backbone, we acknowledge the existence of other alternative backbone architectures. In our experiments, we also considered classic CNN backbone networks such as VGG-19 and ResNet101, conducting comparative experiments. However, based on our experimental results and performance metrics, we found that ResNet50 performed best for our specific task and dataset, leading to its selection as the backbone for our improved ViT model.

Lastly, we have optimized the description of our experimental results in the manuscript to demonstrate the validity of our choice of ResNet50. We appreciate the reviewer's suggestions and feedback. Thank you.

-----------------------------------------------------------------------------------------

Comments 4:

How does the replacement of instance norm by batch norm in the self-attention head improve performance? Any insights based on the experiments?

Response 4:

Thank you very much for your valuable question. Below is our response to the issue you raised:

In order to enhance the model's generalization on test samples and mitigate overfitting issues, we replaced Instance Normalization with Batch Normalization in the self-attention mechanism. Table 2 in this manuscript showcases the classification accuracy before and after the replacement. From the experimental data, it is evident that the model achieved higher accuracy after the substitution, confirming that Batch Normalization aids in capturing the global characteristics and distribution of data more effectively.

This improvement further elevates the model's performance on unseen test samples, such as those from the publicly available fine-grained image datasets used in this experiment, namely, Stanford-Dogs and CUB-200-2011. The change enables the model to be less reliant on single-sample statistics, thus enhancing its adaptability to various types of input data.
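As a purely illustrative sketch of the kind of swap described above: a single-head self-attention whose score matrix passes through either BatchNorm1d or InstanceNorm1d. Where exactly the normalization sits in our improved attention (including its position relative to the softmax) is simplified here:

```python
import torch
import torch.nn as nn

class NormalizedSelfAttention(nn.Module):
    """Single-head self-attention with a swappable norm on the (B, N, N) scores.

    BatchNorm pools statistics across the whole batch, InstanceNorm across each
    sample alone; the former exposes the model to dataset-level statistics.
    Illustrative only: the manuscript's improved attention may differ in detail.
    """
    def __init__(self, dim: int, num_tokens: int, use_batchnorm: bool = True):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5
        norm_cls = nn.BatchNorm1d if use_batchnorm else nn.InstanceNorm1d
        self.norm = norm_cls(num_tokens)  # one "channel" per query row of the score matrix

    def forward(self, x):                                  # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale    # (B, N, N)
        attn = self.norm(scores).softmax(dim=-1)           # normalize scores, then attend
        return attn @ v
```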

The above are our responses to the four comments you provided. Once again, we sincerely appreciate your patience and assistance.

Sincerely,

Hao Zhang

Author Response File: Author Response.docx
