Article

Abnormality Detection of Blast Furnace Tuyere Based on Knowledge Distillation and a Vision Transformer

School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10398; https://doi.org/10.3390/app131810398
Submission received: 10 July 2023 / Revised: 7 September 2023 / Accepted: 15 September 2023 / Published: 17 September 2023
(This article belongs to the Special Issue Applications of Deep Learning and Artificial Intelligence Methods)

Abstract

The blast furnace tuyere is a key position in hot metal production and the main point from which operators observe the internal state of the furnace. However, the detection of abnormal tuyere conditions has relied heavily on manual judgment, which has clear limitations. To address this issue, we propose a tuyere abnormality detection model based on knowledge distillation and a Vision Transformer (ViT), in which ResNet-50 is employed as the Teacher model to distill knowledge into the Student model, a ViT. First, we introduce spatial attention modules to enhance the model’s perception and feature-extraction capabilities for different image regions. Second, we reduce the depth of the ViT and improve its self-attention mechanism to alleviate training losses. In addition, we employ a knowledge distillation strategy to lighten the model and enhance its generalization capability. Finally, we evaluate the model’s performance in tuyere abnormality detection and compare it with other classification methods such as VGG-19, ResNet-101, and ResNet-50. Experimental results show that our model achieves a classification accuracy of 97.86% on a tuyere image dataset provided by a steel company, surpassing the original ViT model by 1.12% and the improved ViT model without knowledge distillation by 0.34%. The model also achieves classification accuracies of 90.31% and 77.65% on the classical fine-grained image datasets Stanford Dogs and CUB-200-2011, respectively, which are comparable to those of other classification models.

1. Introduction

In the integrated route for iron and steelmaking, the tuyere is a critical component of the blast system of a blast furnace and the primary device through which operators directly observe the smelting conditions inside the furnace. A properly functioning tuyere is essential for the smooth operation of the various stages of the steelmaking process. Because most judgments of the working condition of the blast furnace currently rely on the subjective assessment of the operator, detection can be inaccurate and responses delayed, exposing the high-temperature, high-pressure blast furnace environment to explosions and other safety hazards. As most of China’s steel industry undergoes industrial reorganization and upgrading, prompt monitoring of the blast furnace tuyere condition is critical to ensuring safe production and lowering smelting costs [1,2,3]. For many years, steelmaking researchers have focused on the adequacy of flame combustion in the blast furnace tuyere [4], the characteristics of the tuyere raceway [5,6], the detection of blast furnace coal injection status [7], blast furnace raceway blockages [8], the pyrolysis of coal and its interaction with coke and the iron burden in a blast furnace [9], and so on, but there has been little research on detecting abnormal tuyere conditions. The blast furnace operator primarily observes the tuyere with the naked eye or through an imaging device, neither of which permits a dynamic and exact judgment of the tuyere state [3]. Because the cameras installed at blast furnace tuyeres differ in position and angle, the acquired images vary in the apparent shape of the tuyere and the coal spray gun, and the coal spraying pattern is significantly deformed; as a result, images of different states can differ only minimally, while images of the same state can differ substantially.
Consequently, blast furnace tuyere images can be classified as fine-grained images [2]. Therefore, this paper investigates the detection of abnormal conditions in blast furnace tuyeres within steel enterprises to explore methods that surpass existing traditional approaches in terms of intelligence and efficiency. The aim is to further minimize production hazards and enhance production efficiency in the steel industry.
To address the inter-class similarity and intra-class variation of fine-grained images, many scholars have studied CNN-based fine-grained image classification algorithms. Fu [10] presented the Recurrent Attention Convolutional Neural Network (RA-CNN), which learns discriminative region attention and region-based feature representation recursively over multiple scales; however, this approach is unable to combine the attention of different regions. Zheng [11,12] proposed the Multi-Attention Convolutional Neural Network (MA-CNN) and the Progressive-Attention Convolutional Neural Network (PA-CNN). The MA-CNN generates multiple region-based attention maps and uses a channel grouping loss function to enhance inter-class discrimination; by leveraging multiple attentions, it achieves better image classification and recognition performance than the RA-CNN. The PA-CNN consists of two modules: the Partial Proposal Network (PPN) and the Partial Refinement Network (PRN). The PPN generates multiple local attention maps, while the PRN learns features for each part and provides corrective positions to the PPN; the two modules reinforce each other during optimization to ensure precise localization for fine-grained analysis [13]. Wang [14] extended the VGG16 network with branches that extract local information, resulting in a dual-stream asymmetric network that combines global and local features for fine-grained image classification.
Because it is built on self-attention, the Transformer can serialize images, parallelize training, and capture global information, and more and more scholars have explored its use for fine-grained image classification tasks. Dosovitskiy [15] proposed the Vision Transformer (ViT) model, which improves classification accuracy by extracting global features from images through self-attention modules. However, this model can only capture correlations between pixels within a single sample, which limits its feature-extraction capability and significantly increases the number of model parameters [16]. He et al. [17] were the first to propose a Transformer-based fine-grained image recognition framework, which uses a self-attention mechanism to capture the most discriminative regions in an image, processes the internal relationships between regions using image blocks, and uses a loss function to enlarge the distance between feature representations of similar subclasses. Because this method’s input image blocks are fixed, it has weak adaptability, a high computational cost, and low accuracy on certain datasets. Conde [18] proposed a multi-stage ViT framework for fine-grained image recognition that uses multi-head self-attention to capture distinctive features from different local regions; different attention-guided augmentations are employed to help the model learn diverse discriminative features, thereby improving its generalization capability. Li [19] investigated how to incorporate a locality mechanism into the Vision Transformer, verified its significance, and successfully applied the proposed mechanism in four different Vision Transformers. Zhang [20] introduced an efficient multi-head self-attention mechanism (EMSA) that uses a simple depth-wise convolution to reduce memory usage. We incorporate EMSA into our model and make appropriate enhancements to it within the improved Transformer Encoder.
Hinton [21] proposed the concept of knowledge distillation, in which soft targets produced by a Teacher model are introduced as part of the total loss to guide the training of a low-complexity Student model, so that knowledge learned by a large model can be transferred to a smaller model that is better suited for inference. Touvron [22] presented the DeiT model, which does not modify the ViT architecture itself but instead uses a knowledge distillation approach to augment the predictive information of negative samples during training while inheriting the Teacher model’s inductive biases. Zheng [23] introduced a knowledge distillation approach for training deep neural networks that dramatically improves their generalization ability and yields excellent performance in image classification and facial expression recognition.
In the research of fine-grained image classification tasks, both the CNN and the Transformer have demonstrated their effectiveness to varying degrees. The CNN excels at capturing local features and texture information, while the Transformer proves superior in capturing global information and long-distance dependencies. Through knowledge distillation, the Deit model [22] successfully combines the strengths of both the CNN and the Transformer, achieving impressive results in coarse-grained image classification tasks. In contrast, previous studies on the blast furnace tuyere [24,25] were confined to CNN-based methods. Despite their high classification accuracy, these methods exhibited limited generalization capabilities.
Motivated by these insights, this paper proposes a Blast Furnace Tuyere Image Classification Model (BDiT) based on knowledge distillation and a Vision Transformer. In this model, the CNN serves as the Teacher model, and an enhanced Transformer acts as the Student model, facilitating transfer learning through knowledge-distillation strategies to enhance the overall model performance.

2. Methods

2.1. BDiT

The overall framework of the BDiT model is illustrated in Figure 1. Firstly, we trained the ResNet50 model on the dataset used in this study, demonstrating excellent performance in capturing local features and texture information. Subsequently, we utilized ResNet50 as the Teacher model to guide the training and testing of an improved Vision Transformer model on the same dataset. This approach allowed the Student model to inherit the Teacher model’s ability to capture local features and texture information while also transferring inductive biases to enhance the feature extraction capabilities of the entire model.
The framework of the improved Vision Transformer model is depicted in Figure 2. Initially, the input image was segmented into n image tokens (patch sequence) of size p × p . In our approach, this sequence was first fed into the spatial attention module (SAM) for processing, as explained in Section 2.2. Subsequently, each image token was linearly mapped to a serialized embedding vector through an Embedding Layer. Furthermore, a Classification Token (Cls Token, denoted as “0*” in Figure 2) was added at the beginning of the patch sequence. This inclusion of the Cls Token introduced global information about the image into the sequence. Finally, a positional embedding vector was assigned to each position to incorporate the positional information of the image tokens.
Each embedding vector of the image tokens was combined with its corresponding positional embedding vector to form a sequence $Z_p \in \mathbb{R}^{n \times d_m}$ representing the entire image, where $p$ denotes the size of the image tokens, $n$ the number of image tokens, and $d_m$ the channel dimension. The sequence $Z_p$ was input into a multi-layer improved Transformer encoder for feature extraction. Considering that the accuracy of the Transformer model no longer improves beyond 12 layers while the parameter count and computational complexity continue to grow [15], we adopted a six-layer enhanced Transformer encoder (Improved Transformer Encoder) in this study. This choice was motivated by the dataset size and computational efficiency, aiming to enhance the generalization ability of the BDiT model on small datasets and reduce the computational overhead.
After passing through multiple layers of the improved Transformer encoder, the output of the last encoder was considered the feature representation of the input image. This feature representation was then fed into a classifier for classification prediction and loss computation, ultimately completing the model training.
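The input pipeline described above can be summarized with a short PyTorch sketch. This is a minimal illustration, not the authors’ implementation: the module name PatchEmbedding and the hyperparameter values (patch size 16, embedding dimension 768) are assumptions, and the spatial attention step of Section 2.2 is shown separately below.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into p x p patches, linearly embed them, prepend the Cls Token,
    and add positional embeddings, producing the sequence Z_p described in Section 2.1."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual way to implement non-overlapping patch embedding.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # the "0*" token in Figure 2
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (B, n, dim) patch sequence
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the classification token
        return x + self.pos_embed              # add positional embeddings -> Z_p
```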

2.2. Spatial Attention Module

The attention mechanism has been widely employed in computer vision tasks. For images with complex fine-grained features, we introduced a spatial attention module [26] to enable the model to focus more on fine-grained features, alleviating the impact of small inter-class differences and large intra-class differences. The spatial attention module is illustrated in Figure 3.
First, the input feature map $F \in \mathbb{R}^{C \times H \times W}$ underwent a $1 \times 1$ convolution to reduce its channel dimension, giving a map of size $\frac{C}{r} \times H \times W$ with $r = 16$. Then, two $3 \times 3$ convolutions were applied to extract useful contextual information, followed by a $1 \times 1$ convolution that reduced the spatial attention map to size $1 \times H \times W$. Finally, the spatial attention map $M_s(F)$ was obtained after Batch Normalization (BN), as shown in Equation (1).
$M_s(F) = \mathrm{BN}\big(f_3^{1 \times 1}(f_2^{3 \times 3}(f_1^{3 \times 3}(f_0^{1 \times 1}(F))))\big)$ (1)
In the equation, $f_i^{j \times j}$ denotes the $i$-th convolution of the module with a $j \times j$ kernel ($j = 1, 3$). Through the spatial attention module, the ViT model can encode each image block without losing global information.
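A minimal PyTorch sketch of the spatial attention branch of Equation (1) is given below. The layer ordering and the reduction ratio r = 16 follow the text; the class name SpatialAttention and the way the attention map is fused with the input feature (sigmoid gating, as in BAM [26]) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch of Equation (1): 1x1 reduce -> two 3x3 convs -> 1x1 -> BN."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = channels // r
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),          # f0: C -> C/r
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),    # f1: 3x3 context
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),    # f2: 3x3 context
            nn.Conv2d(mid, 1, kernel_size=1),                 # f3: C/r -> 1
            nn.BatchNorm2d(1),
        )

    def forward(self, f):                       # f: (B, C, H, W)
        ms = self.body(f)                       # M_s(F): (B, 1, H, W)
        return f * torch.sigmoid(ms)            # assumed fusion: gate the input features
```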

2.3. Improved Self-Attention Mechanism

The Vision Transformer encoder consisted of $L$ stacked encoding modules ($L = 6$ in this study), each composed of multi-head self-attention (MSA) and a multi-layer perceptron (MLP). The structure of the encoding module is illustrated in Figure 4a. The MSA module consisted of $N_h$ individual self-attention heads ($N_h = 8$ in this paper). The self-attention mechanism is a fundamental component of the Vision Transformer architecture, playing a critical role in establishing pixel-level relationships and capturing global image information. Its architecture is illustrated in Figure 4b.
For the self-attention mechanism, firstly, the input Zp underwent linear transformations to obtain the query matrix Q, key matrix K, and value matrix V. Secondly, the query matrix Q was used to compute the attention weight matrix A in combination with the key matrix K. The computation process is as follows:
$A = \alpha_{i,j} = \mathrm{Softmax}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)$ (2)
In the equation, $A$ represents the attention weight matrix, where $\alpha_{i,j}$ denotes the similarity between the $i$-th and $j$-th pixels, and $d_k = d_m / N_h$ is the scaling factor. Finally, the output feature was obtained via matrix multiplication between the attention weight matrix $A$ and the value matrix $V$, followed by a residual connection with the feature map $Z_p$:
$Z_{out} = Z_p + A V$ (3)
The computational cost of MSA is $O(2 d_m n^2 + 4 d_m^2 n)$, which grows quadratically with the number of image tokens and with their dimension, resulting in significant training and inference overhead. Furthermore, in the multi-head attention mechanism, each head operates on only a fraction of the embedding dimension, which may compromise the network’s performance, especially when the token embedding dimension is small.
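For reference, the standard mechanism of Equations (2) and (3) can be sketched as follows. A single head is shown for clarity; in MSA, the embedding dimension $d_m$ is split across $N_h$ heads, so this is an illustrative simplification rather than the paper’s implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention with the residual connection of
    Equations (2)-(3): A = Softmax(Q K^T / sqrt(d_k)), Z_out = Z_p + A V."""
    def __init__(self, dim, d_k):
        super().__init__()
        self.d_k = d_k
        self.q = nn.Linear(dim, d_k)
        self.k = nn.Linear(dim, d_k)
        self.v = nn.Linear(dim, dim)   # value dimension kept equal to dim for the residual add

    def forward(self, z_p):                                    # z_p: (B, n+1, dim)
        q, k, v = self.q(z_p), self.k(z_p), self.v(z_p)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # A
        return z_p + attn @ v                                  # Z_out = Z_p + A V
```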
The architecture of EMSA is depicted in Figure 5a. As in the standard self-attention mechanism, efficient self-attention first obtained the query matrix Q through a linear mapping. To compress memory, the token sequence was reshaped into a three-dimensional feature map, whose spatial size was then reduced via depthwise separable convolution (DWConv); the reduced map was flattened back into tokens and mapped to obtain the key matrix K and value matrix V. The query matrix Q was multiplied with the key matrix K, followed by scaling, a standard 1 × 1 convolution, a softmax layer, and Instance Normalization, yielding the attention weight matrix of the attention mechanism.
Due to EMSA’s ability to effectively address issues such as large computational requirements and suboptimal network performance, this study, in response to research requirements, replaced the MSA with the EMSA and introduced improvements tailored for fine-grained image processing tasks with small-sample datasets. The specific enhancement involved substituting Instance Normalization with Batch Normalization within the EMSA framework. This modification aimed to enhance the model’s generalization capability on test samples, mitigate overfitting issues, and improve model robustness. The improved EMSA architecture is illustrated in Figure 5b. The computation process for the attention weight matrix A of the improved efficient attention mechanism was as follows:
$A = \alpha_{i,j} = \mathrm{BN}\!\left(\mathrm{Softmax}\!\left(\mathrm{Conv}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)\right)\right)$ (4)
Finally, similar to the standard self-attention mechanism, the feature output was obtained by multiplying the attention weight matrix A with the value matrix V, followed by a residual connection with the image tokens. The resulting feature was then given by:
$Z_{out} = Z_p + A V$ (5)
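A hedged sketch of the improved EMSA of Figure 5b and Equations (4) and (5) is given below. The depthwise-convolution reduction ratio, the handling of the classification token (omitted here), and the exact projection layout are assumptions; only the Conv–Softmax–BN ordering of Equation (4) and the residual connection of Equation (5) are taken from the text.

```python
import torch
import torch.nn as nn

class ImprovedEMSA(nn.Module):
    """Efficient multi-head self-attention with K/V computed from a depthwise-conv-reduced
    token map, a 1x1 conv across heads on the attention logits, and BN replacing IN."""
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.h = num_heads
        self.d_k = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # Depthwise conv (groups=dim) shrinks the token map before K/V projection to cut memory.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio, groups=dim)
        self.attn_conv = nn.Conv2d(num_heads, num_heads, kernel_size=1)  # 1x1 conv across heads
        self.attn_bn = nn.BatchNorm2d(num_heads)                         # BN replaces IN here

    def forward(self, z_p, hw):                         # z_p: (B, n, dim), n = h * w
        b, n, dim = z_p.shape
        h, w = hw
        q = self.q(z_p).reshape(b, n, self.h, self.d_k).transpose(1, 2)      # (B, heads, n, d_k)
        x = z_p.transpose(1, 2).reshape(b, dim, h, w)                        # tokens -> 2D map
        x = self.sr(x).reshape(b, dim, -1).transpose(1, 2)                   # reduced tokens
        k, v = self.kv(x).reshape(b, -1, 2, self.h, self.d_k).permute(2, 0, 3, 1, 4)
        attn = q @ k.transpose(-2, -1) / self.d_k ** 0.5                     # scaled Q K^T
        attn = self.attn_bn(torch.softmax(self.attn_conv(attn), dim=-1))     # Conv -> Softmax -> BN
        out = (attn @ v).transpose(1, 2).reshape(b, n, dim)                  # A V, heads merged
        return z_p + out                                                     # Z_out = Z_p + A V
```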

2.4. Loss Function

In a classification task, a softmax layer is usually used to compute the output class probabilities, and the manually annotated labels serve as hard targets, which are commonly used to train the model and ensure the reliability of the training data. In knowledge distillation, the Student model is additionally trained against the softmax output of the Teacher model, which is referred to as a soft target. Hinton [21] introduced a temperature parameter T into the softmax computation, as shown in Equation (6):
$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$ (6)
Under the knowledge distillation strategy, the total loss consists of the cross-entropy between the outputs of the Teacher and Student models and the cross-entropy between the Student model’s predicted probabilities and the true labels, as shown in Figure 6.
The total loss was calculated via linear addition, and the final loss function is shown in Equation (7).
$Loss = \lambda \, \mathrm{CE}(y, p_i) + (1 - \lambda) \, \mathrm{CE}(p_j, p_i)$ (7)
In the formula, $p_j$ is the output of the Teacher model, $p_i$ is the output of the Student model, $y$ is the true label, $\mathrm{CE}$ is the cross-entropy function, and $\lambda$ is the hyperparameter used to balance the two loss terms.
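The combination of Equations (6) and (7) can be written as a small loss function, sketched below under the weighting reported later in Section 3.2 (cross-entropy weight 1, distillation weight 0.1); the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, w_ce=1.0, w_dis=0.1, T=1.0):
    """Weighted sum of the hard cross-entropy CE(y, p_i) and the soft cross-entropy
    CE(p_j, p_i) between Teacher and Student outputs, as in Equation (7).
    The soft targets use the temperature-scaled softmax of Equation (6)."""
    hard = F.cross_entropy(student_logits, labels)                 # CE(y, p_i)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)              # soft targets p_j
    log_p_student = F.log_softmax(student_logits / T, dim=-1)      # log p_i at temperature T
    soft = -(p_teacher * log_p_student).sum(dim=-1).mean()         # CE(p_j, p_i)
    return w_ce * hard + w_dis * soft
```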

3. Experimental Results and Analysis

In this section, we conducted thorough tests on the proposed BDiT model and compared its performance with other existing methods on a dataset of blast furnace tuyere images. Additionally, we further validated the effectiveness of our model on two fine-grained image datasets. The experimental system environment is Ubuntu 18.04, and the deep learning framework PyTorch (https://pytorch.org/, accessed on 6 September 2022) was chosen to build the network model.

3.1. Dataset Selection and Preprocessing

This study primarily focused on six operational states of blast furnace tuyeres, namely normal, falling chunks, material block, hanging slag, coal break, and blowing down. We recorded multiple video segments using a camera installed at the tuyere, with a total duration of 27 h. The videos were captured in MP4 format at a frame rate of 25 frames per second (FPS), and each operational state accounted for 270 min of footage. To construct the dataset, we extracted one image from every 100 frames of each video segment, resulting in a collection of 24,300 images, each assigned a class label according to its operational state. Prior to further analysis and model training, the dataset underwent the following preprocessing steps (a code sketch of these transforms is given after the list).
(1)
To reduce computational complexity and prevent size distortion, the resolution of the blast furnace tuyere images was standardized from the original 960 × 576 to 256 × 256;
(2)
To address the issue of potential image mirroring caused by factors such as shooting angles or sensor installation methods in blast furnace tuyere images, this study applied horizontal or vertical flipping to the images. By correcting the image mirroring problem and aligning it with the actual scene, the diversity of the data was increased, which helped improve the model’s generalization ability and enabled it to better adapt to images from different tuyere orientations or mirror scenarios;
(3)
Considering the presence of particulate matter or pollutants to a certain extent in a blast furnace tuyere, as well as the potential degradation of information transmission in the tuyere camera over time, this study randomly selected 4860 images (810 images per category) from the pool of 24,300 images and added salt-and-pepper noise to them. This random introduction of black and white pixels in the images simulated the loss or degradation of details in blast furnace tuyere images, making them more consistent with the actual conditions observed in tuyere environments.
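A sketch of these preprocessing steps using torchvision transforms is shown below. The resolutions follow step (1); the noise amount and the use of random (rather than deterministic) flips are assumptions made for illustration.

```python
import torch
from torchvision import transforms

class SaltAndPepper:
    """Randomly set a fraction of pixels to black or white, simulating dust or camera degradation
    at the tuyere (step (3)). The noise amount is an assumption."""
    def __init__(self, amount=0.02):
        self.amount = amount

    def __call__(self, img):                      # img: tensor in [0, 1], shape (C, H, W)
        mask = torch.rand(img.shape[1:])
        img = img.clone()
        img[:, mask < self.amount / 2] = 0.0      # pepper
        img[:, mask > 1 - self.amount / 2] = 1.0  # salt
        return img

# Resize from 960 x 576 to 256 x 256 and apply horizontal/vertical flips (steps (1)-(2)).
base_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])

# Variant used for the subset of images that receives salt-and-pepper noise (step (3)).
noisy_transform = transforms.Compose([base_transform, SaltAndPepper(amount=0.02)])
```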
The blast furnace tuyere image dataset, named Tuyere-Data, consists of the 24,300 preprocessed images plus the 4860 images with added salt-and-pepper noise. Tuyere-Data covers six categories of working states: normal, falling chunks, material block, hanging slag, coal break, and blowing down, with 4860 images per category. In total, the dataset comprises 29,160 images; an example of each state is shown in Figure 7.
To further validate the effectiveness of the BDiT model, we conducted experimental comparisons using two well-known public datasets for fine-grained image analysis: Stanford Dogs [28] (http://vision.stanford.edu/aditya86/ImageNetDogs/, accessed on 15 January 2023) and CUB-200-2011 [29] (http://www.vision.caltech.edu/datasets/cub_200_2011/, accessed on 15 January 2023). Table 1 provides details on the number of subclasses, the total number of images, and the train-test split for each dataset.

3.2. Evaluation Metrics and Experimental Settings

In this paper, the average accuracy is selected as the evaluation metric; its formula is given in Equation (8).
$Accuracy = \frac{1}{I_c}\sum_{j=1}^{I_c}\frac{I_{jj}}{I_j}$ (8)
In the formula, $I_c$ is the total number of sample classes (six in this paper), $j$ is the class label (1–6 in this paper), $I_j$ is the total number of samples of class $j$, and $I_{jj}$ is the number of samples of class $j$ correctly predicted as class $j$.
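Equation (8) is the mean of the per-class recalls of a confusion matrix. A small sketch is given below; the confusion-matrix values are illustrative only, not the paper’s results.

```python
import numpy as np

def average_accuracy(conf_matrix):
    """Average per-class accuracy of Equation (8): the mean over classes of I_jj / I_j,
    where rows of conf_matrix are true classes and columns are predicted classes."""
    conf_matrix = np.asarray(conf_matrix, dtype=float)
    per_class = np.diag(conf_matrix) / conf_matrix.sum(axis=1)   # I_jj / I_j for each class j
    return per_class.mean()

# Example with six tuyere states (values are illustrative).
cm = np.array([[95, 2, 3, 0, 0, 0],
               [ 1, 96, 2, 1, 0, 0],
               [ 2, 3, 94, 1, 0, 0],
               [ 0, 0, 0, 99, 1, 0],
               [ 0, 0, 1, 1, 98, 0],
               [ 0, 0, 0, 0, 1, 99]])
print(f"Average accuracy: {average_accuracy(cm):.4f}")
```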
The whole experiment was conducted in PyCharm (https://www.jetbrains.com/pycharm/, accessed on 7 March 2022) with the number of training epochs set to 100, the learning rate initialized to 0.1, the hyperparameter of the cross-entropy loss set to 1, and the hyperparameter of the distillation loss set to 0.1. Stochastic Gradient Descent (SGD) was used as the optimizer, and the batch size was 16.
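These settings translate into a straightforward training loop, sketched below. The function and variable names (train, student, teacher, train_loader) are hypothetical, distillation_loss refers to the sketch above, and the learning-rate schedule is not specified in the paper and is therefore omitted.

```python
import torch

def train(student, teacher, train_loader, device="cuda"):
    # Settings from Section 3.2: SGD, initial learning rate 0.1, batch size 16, 100 epochs.
    student.to(device).train()
    teacher.to(device).eval()                      # the ResNet-50 Teacher is frozen
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
    for epoch in range(100):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(images)   # soft targets from the Teacher
            student_logits = student(images)
            loss = distillation_loss(student_logits, teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```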

3.3. Ablation Experiment

To assess the effectiveness of each module in the BDiT model, a series of ablation experiments were conducted on the Tuyere-Data dataset. These experiments involved progressively incorporating various improvement modules into the ViT architecture. Specifically, the impact of the improved self-attention mechanism, the introduction of the spatial attention module, and the distillation loss were evaluated.

3.3.1. Self-Attention Mechanism and Spatial Attention Module

The experimental scheme in this subsection uses only the cross-entropy loss function to calculate the loss and compares the following four Vision Transformer network structures: (a) ViT denotes the original Vision Transformer model; (b) ViT(EA) denotes the model after improving the self-attention module in (a) to an efficient self-attention module; (c) ViT(IEA) denotes the model after replacing Instance Normalization in the efficient multi-headed self-attention module with Batch Normalization; and (d) ViT(IEA&SPA) denotes the model after introducing the spatial attention module based on the improvements made in (c). The effects of the improved self-attention mechanism and the spatial attention module’s introduction on the model’s recognition accuracy were explored, and the experimental results are shown in Table 2. According to the data in the table, ViT(EA), ViT(IEA), and ViT(IEA&SPA) each increase classification accuracy over ViT by 0.06%, 0.59%, and 0.78%, respectively. Among the four Vision Transformer network architectures, the classification accuracy of ViT(IEA&SPA) reaches the highest value of 97.52%.

3.3.2. Distillation Loss

To verify the effectiveness of the model after introducing the distillation loss in addition to the cross-entropy loss, the distillation loss was integrated into (a) ViT and (d) ViT(IEA&SPA) to obtain (e) ViT(SA&L_DIS) and (f) ViT(IEA&SPA&L_DIS), respectively. The experimental results of jointly using the cross-entropy loss and the distillation loss as the total loss function are shown in Table 2. The results show that adding the distillation loss improved the accuracy of ViT by 0.84% and that of ViT(IEA&SPA) by 0.34%, and ViT(IEA&SPA&L_DIS) outperformed ViT(SA&L_DIS) by 0.28%. This shows that ViT(IEA&SPA&L_DIS) achieves the best classification accuracy after the introduction of the distillation loss.
Table 2. Ablation experiment results.
| No. | Method | Model Composition | Teacher | Accuracy (%) |
|-----|--------|-------------------|---------|--------------|
| (1) | ViT [15] | SA | - | 96.74 |
| (2) | ViT(EA) | EA | - | 96.80 |
| (3) | ViT(IEA) | IEA | - | 97.33 |
| (4) | ViT(IEA&SPA) | IEA+SPA | - | 97.52 |
| (5) | ViT(SA&L_DIS) | SA+L_DIS | - | 97.58 |
| (6) | ViT(IEA&SPA&L_DIS) | IEA+SPA+L_DIS | - | 97.86 |
| (7) | BDiT(IEA&SPA&L_DIS1) | IEA+SPA+L_DIS | VGG-19 | 97.31 |
| (8) | BDiT(IEA&SPA&L_DIS2) | IEA+SPA+L_DIS | ResNet-101 | 96.88 |
| (9) | BDiT(IEA&SPA&L_DIS3) | IEA+SPA+L_DIS | ResNet-50 | 97.86 |
To further verify the effectiveness of the distillation loss, this paper used three classical CNN backbone networks, VGG-19 [30], ResNet-101 [31], and ResNet-50 [32,33], as Teacher models in the knowledge distillation strategy to guide the training of ViT(IEA&SPA&L_DIS); the training results are shown in Table 2. The results show that, after introducing the distillation loss on the basis of ViT(IEA&SPA), different Teacher models yield different BDiT accuracies. When ResNet-50 is selected as the Teacher model, the classification accuracy reaches 97.86%, which is 0.34% higher than before adding the distillation loss and 0.55% and 0.98% higher than BDiT(IEA&SPA&L_DIS1) and BDiT(IEA&SPA&L_DIS2), respectively. In summary, BDiT(IEA&SPA&L_DIS3), which uses ResNet-50 as the Teacher model, is selected as the final model of this article, that is, the BDiT model.

3.4. Comparative Experiment of Different Methods

To evaluate the effectiveness of the BDiT model, several popular CNN models, including SERNet [34], ResNeXt [35], repVGG [36], ESERNet [24], and LSERNet [25], were trained on the Tuyere-Data dataset. A comparative analysis was conducted between these models and the BDiT model, and the experimental results are presented in Figure 8. The results of the experiments demonstrate that, with a minimal difference in the number of model parameters, the BDiT model achieved an accuracy of 97.86%. Compared to the SERNet model, the BDiT model showed an accuracy improvement of 0.85%. Compared to the ResNeXt model, the BDiT model exhibited an accuracy improvement of 1.17%. Moreover, the BDiT model outperformed the ESERNet model with an accuracy improvement of 1.38%. Among the models with a larger number of parameters, the BDiT model achieved an accuracy improvement of 1.12% compared to the repVGG model. Furthermore, compared to the LSERNet model, the BDiT model exhibited a marginal difference of less than 0.1% in accuracy. Taking both accuracy and the number of model parameters into consideration, the BDiT model demonstrated favorable performance compared to the other models under comparison.
The confusion matrices of the BDiT model and LSERNet model on the Tuyere-Data test set, obtained from their respective training results, are presented in Figure 9.
In the confusion matrix shown in Figure 9, the numbers 1 to 6 represent the classes normal, falling chunks, material block, coal break, hanging slag, and blowing down, respectively. The results indicate that coal break, hanging slag, and blowing down are relatively easy to detect among the five abnormal states of the tuyere. However, the detection efficiency for normal, falling chunks, and material block is comparatively lower. The LSERNet model achieves the best performance in detecting the normal class, but it is outperformed by the BDiT model in detecting falling chunks and material block. The BDiT model achieves the highest accuracy among the five abnormal states of the tuyere.
We investigated the accuracy of the BDiT model for distillation parameters of 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, and 0.5, and the experimental results are shown in Figure 10.
The line plot shows that the model’s accuracy gradually increases with the distillation parameter up to 0.1, where it reaches its highest value of 97.86%, and then decreases as the distillation parameter increases further. This further confirms that a distillation parameter of 0.1 is the best choice for the experiments in this paper.
To evaluate the generalization ability of our proposed model, the BDiT model was trained and tested on the CUB-200-2011 dataset and the Stanford Dogs dataset. A comparative analysis was conducted with various other models, and the experimental results are summarized in Table 3.
From the results in Table 3, it can be observed that on the CUB-200-2011 dataset, our proposed method achieved significantly higher accuracy compared to LSERNet, ResNet50, and other CNN-based methods. Compared to the Transformer-based methods included in the comparison, the accuracy of the BDiT model is only slightly lower than TransFG. This can be attributed to the fact that TransFG uses overlapping patches as input, resulting in a 3.0% higher accuracy than ViT, making it more suitable for the classification task on the CUB-200-2011 dataset. However, it should be noted that the BDiT model is primarily designed for analyzing blast furnace tuyere images, while the CUB-200-2011 dataset consists of bird images that may exhibit variations in pose, viewpoint, and background. Additionally, the dataset may have imbalanced sample distribution among different bird species, with some categories having more or fewer samples than others. These factors contribute to the challenge of classification due to limited training samples, making it difficult for the model to fully learn the distinguishing features of these specific categories.
The Stanford Dogs dataset has a relatively larger number of images and a smaller number of categories compared to the CUB-200-2011 dataset, which provides favorable conditions for model training. As shown in Table 3, our proposed BDiT model achieved comparable accuracy to other methods on the Stanford Dogs dataset, indicating that the BDiT model proposed in this paper exhibits a certain level of generalization ability.

4. Conclusions

In this paper, we proposed a Blast Furnace Tuyere Image Classification Model Based on Knowledge Distillation and the Vision Transformer (BDiT). We introduced spatial attention modules to enhance the model’s perception and feature-extraction capabilities across different regions of the image, mitigating the impact of small inter-class variances and large intra-class variances. Additionally, we incorporated and improved an efficient attention module based on the self-attention mechanism in Vision Transformers, which utilizes deep convolutions to reduce memory overhead and alleviate training losses. Moreover, the adoption of a knowledge distillation strategy addressed the issue of limited dataset size, resulting in model lightweighting and improved generalization capabilities.
The proposed approach achieved an identification accuracy of 97.86% on the Tuyere-Data dataset. Ablation experiments and comparative analysis demonstrated the effectiveness and robustness of the proposed method from various aspects. Moreover, the model exhibits a certain degree of generalization ability on classical fine-grained image datasets. Compared to classification methods such as VGG-19, ResNet-101, and ResNet-50, the BDiT model achieved higher accuracy, making it a promising tool for detecting abnormal conditions in blast furnaces and providing a reference for furnace operators. Therefore, the proposed model has great potential for industrial applications. Further work is currently underway, with potential challenges primarily focused on two aspects:
(1)
Reducing the time required for abnormal state detection in blast furnace tuyeres;
(2)
Enhancing the classification accuracy of individual state images.

Author Contributions

Conceptualization, C.S. and H.Z.; methodology, H.Z.; software, H.Z.; investigation, Y.W. (Yuanjun Wang) and Y.W. (Yuhui Wang); resources, C.S.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z.; visualization, H.Z.; supervision, K.H.; funding acquisition, C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC), grant number 61902205, project manager Keyong Hu.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from Qingdao Special Steel Co., Ltd. and are available from the authors with the permission of Qingdao Special Steel Co., Ltd.

Acknowledgments

We thank Qingdao Special Steel Co., Ltd. for providing data support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, L.; Bai, C.; Ouyang, Q.; Chen, F.; Dong, L.; Qiu, G. Recent Research on Supervision Running State of the Tuyere and Raceway in Blast Furnace. J. Chongqing Univ. (Nat. Sci. Ed.) 2005, 8, 39–41. [Google Scholar]
  2. Zhang, T.; Zhang, X.; Han, T.; Shi, Z.; Guo, Y.; Wang, H. Application of artificial intelligence image recognition technology in blast furnace tuyere monitoring. Metall. Autom. 2021, 45, 58–66. [Google Scholar]
  3. Zhang, T.; Zhang, J.; Peng, G.; Wang, H. Automated Machine Learning for Steel Production: A Case Study of TPOT for Material Mechanical Property Prediction. In Proceedings of the 2022 IEEE International Conference on e-Business Engineering (ICEBE), Bournemouth, UK, 14–16 October 2022; IEEE: New York, NY, USA, 2022. [Google Scholar]
  4. Choi, Y.; Kwun, H.; Kim, D.; Lee, E.; Bae, H. Method of predictive maintenance for induction furnace based on neural network. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Republic of Korea, 19–22 February 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  5. Zhao, X.; Amp, H.E. Current Research on Characteristics of Tuyere Raceway of Blast Furnace. Gansu Metall. 2015, 37, 35–39. [Google Scholar]
  6. Liu, J.; Gao, Y.; Yang, H.; Li, F. Innovation Research of Large Blast Furnace Tuyere Equipment Maintenance Technology. Constr. Technol. 2018, 47, 498–500. [Google Scholar]
  7. Qin, T.; Ma, L.; Chen, C.; Wang, Y. Research on blast furnace air outlet coal injection flow detection method. Ind. Metrol. 2015, 25, 36–39+66. [Google Scholar] [CrossRef]
  8. Puttinger, S.; Stocker, H. Toward a Better Understanding of Blast Furnace Raceway Blockages. Steel Res. Int. 2020, 91, 202000227. [Google Scholar] [CrossRef]
  9. Born, S.; Babich, A.; van der Stel, J.; Ho, H.T.; Sert, D.; Ansseau, O.; Plancq, C.; Pierret, J.-C.; Geyer, R.; Senk, D.; et al. Char formation by coal injection and its behavior in the blast furnace. Steel Res. Int. 2020, 91, 2000038. [Google Scholar] [CrossRef]
  10. Fu, J.; Zheng, H.; Mei, T. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4476–4484. [Google Scholar]
  11. Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017. [Google Scholar]
  12. Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J.; Mei, T. Learning Rich Part Hierarchies with Progressive Attention Networks for Fine-Grained Image Recognition. IEEE Trans Image Process. 2019, 29, 476–488. [Google Scholar] [CrossRef]
  13. Wei, X.-S.; Song, Y.-Z.; Mac Aodha, O.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-Grained Image Analysis with Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8927–8948. [Google Scholar] [CrossRef]
  14. Wang, Y.; Morariu, V.I.; Davis, L.S. Learning a Discriminative Filter Bank within a CNN for Fine-Grained Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018. [Google Scholar]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. Int. Conf. Learn. Represent. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  16. Guo, M.-H.; Liu, Z.-N.; Mu, T.-J.; Hu, S.-M. Beyond Self-Attention: External Attention Using Two Linear Layers for Visual Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447. [Google Scholar] [CrossRef] [PubMed]
  17. He, J.; Chen, J.-N.; Liu, S.; Kortylewski, A.; Yang, C.; Bai, Y.; Wang, C. TransFG: A Transformer Architecture for Fine-Grained Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2022; Volume 36, pp. 852–860. [Google Scholar] [CrossRef]
  18. Conde, M.V.; Turgutlu, K. Exploring Vision Transformers for Fine-grained Classification. arXiv 2021, arXiv:2106.10587. [Google Scholar]
  19. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.0570. [Google Scholar]
  20. Zhang, Q.; Yang, Y.-B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  21. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Comput. Sci. 2015, 14, 38–39. [Google Scholar]
  22. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar]
  23. Zheng, Z.; Xi, P. Self-guidance: Improve deep neural network generalization via knowledge distillation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
  24. Wang, R.; Li, Z.; Yang, L.; Li, Y.; Zhang, H.; Song, C.; Jiang, M.; Ye, X.; Hu, K. Application of Efficient Channel Attention Residual Mechanism in Blast Furnace Tuyere Image Anomaly Detection. Appl. Sci. 2022, 12, 7823. [Google Scholar] [CrossRef]
  25. Song, C.; Li, Z.; Li, Y.; Zhang, H.; Jiang, M.; Hu, K.; Wang, R. Research on Blast Furnace Tuyere Image Anomaly Detection, Based on the Local Channel Attention Residual Mechanism. Appl. Sci. 2023, 13, 802. [Google Scholar] [CrossRef]
  26. Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  28. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Li, F.F. Novel dataset for fine-grained image categorization: Stanford dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Providence, RI, USA, 21–23 June 2011; pp. 1–2. [Google Scholar]
  29. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-Ucsd Birds-200-2011 Dataset; CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
  30. Wen, L.; Li, X.; Li, X.; Gao, L. A new transfer learning based on VGG-19 network for fault diagnosis. In Proceedings of the 2019 IEEE 23rd International Conference on Computer Supported Cooperative Work in Design (CSCWD), Porto, Portugal, 6–8 May 2019; IEEE: New York, NY, USA, 2019. [Google Scholar]
  31. Zhang, Q. A novel ResNet101 model based on dense dilated convolution for image classification. SN Appl. Sci. 2021, 4, 9. [Google Scholar] [CrossRef]
  32. Gan, G.; Xiao, X.; Jiang, C.; Ye, Y.; He, Y.; Xu, Y.; Luo, C. Strawberry Disease and Pest Identification and Control Based on SE-ResNeXt50 Model. In Proceedings of the 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; IEEE: New York, NY, USA, 2022. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  34. Aftab, A.; Morsali, A.; Ghaemmaghami, S.; Champagne, B. LIGHT-SERNET: A lightweight fully convolutional neural network for speech emotion recognition. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022. [Google Scholar]
  35. Zhou, T.; Zhao, Y.; Wu, J. Resnext and res2net structures for speaker verification. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; IEEE: New York, NY, USA, 2021. [Google Scholar]
  36. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
Figure 1. The overall framework of the BDiT model.
Figure 2. Improved Vision Transformer model. The illustration of the Improved ViT was inspired by Dosovitskiy A. et al. (2020) [15].
Figure 3. Spatial attention module.
Figure 4. ViT encoding module and self-attention mechanism. (a,b) were inspired by Dosovitskiy et al. (2020) [15] and Vaswani et al. (2017) [27], respectively.
Figure 5. Multi-head self-attention: (a) efficient multi-head self-attention; (b) improved efficient multi-head self-attention. Panel (a) was inspired by Zhang et al. [20].
Figure 6. Knowledge distillation process.
Figure 7. Example of blast furnace tuyere operating conditions: (a) Normal: The color inside the tuyere is bright, the edge is smooth, and the dark area of coal spray is clear; (b) Falling chunks: The tuyere appears to darken extensively and returns to brightness when the large pieces are melted; (c) Material block: There is a clear gray coke gyratory movement in the tuyere; (d) Coal break: The edge of the tuyere is smooth, the color inside the tuyere is bright, and there is no dark area of coal spray; (e) Hanging slag: The edge of the tuyere is not smooth, and there are dark areas with large ashiness at the hanging slag; (f) Blowing down: The coke gyration slows down to a pile-up while the wind gap gradually darkens.
Figure 8. Average accuracy and parameters of different classification models.
Figure 9. The confusion matrix of the model.
Figure 10. Different distillation parameters and accuracy.
Table 1. Overview of the three datasets.
| Datasets | Categories | Total | Train | Test |
|----------|------------|-------|-------|------|
| Tuyere-Data | 6 | 29,160 | 23,328 | 5832 |
| Stanford Dogs | 120 | 20,580 | 16,464 | 4116 |
| CUB-200-2011 | 200 | 11,788 | 9431 | 2357 |
Table 3. The accuracy comparison of different models on the CUB-200-2011 dataset and the Stanford Dogs dataset.
| Model | CUB-200-2011 Accuracy (%) | Stanford Dogs Accuracy (%) |
|-------|---------------------------|----------------------------|
| ESERNet [24] | 60.72 | 73.20 |
| LSERNet [25] | 63.57 | 74.39 |
| ResNet101 [31] | 72.93 | 84.35 |
| ResNet50 [32,33] | 74.58 | 87.16 |
| repVGG [36] | 69.17 | 85.81 |
| ViT [15] | 77.42 | 88.27 |
| DeiT [22] | 76.06 | 91.32 |
| TransFG [17] | 80.43 | 90.24 |
| BDiT | 77.65 | 90.31 |