 
 
Article
Peer-Review Record

Design of Multi-Receptive Field Fusion-Based Network for Surface Defect Inspection on Hot-Rolled Steel Strip Using Lightweight Dataset

Appl. Sci. 2021, 11(20), 9473; https://doi.org/10.3390/app11209473
by Wei-Peng Tang 1, Sze-Teng Liong 2, Chih-Cheng Chen 3, Ming-Han Tsai 4, Ping-Cheng Hsieh 1, Yu-Ting Tsai 5, Shih-Hsin Chen 6 and Kun-Ching Wang 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 9 September 2021 / Revised: 5 October 2021 / Accepted: 5 October 2021 / Published: 12 October 2021

Round 1

Reviewer 1 Report

The authors propose a method to detect surface defects using a deep learning approach. The method decomposes raw data into different resolutions to find good representative features. Then, a network is built to learn from these representations. The topic is interesting and the paper is easy to follow. However, some issues have to be addressed.

1- Please give the full definition of all acronyms at the first place they appear and then use them accordingly.

2- The work needs extensive English revision and proofreading.

3- The authors mentioned the fine-tuning of pre-trained GoogleNet models without specifying what type of layers were frozen and what was adapted. Please, give more details on the fine-tuning process.

4- What is the exact input size for each of the two fine-tuned models? Please include this information in the experimental results section.

5- Give the proper reference for the GoogleNet model at the first place it is mentioned.

6- Figures need some more explanations.

7- Remove the dot (.) after introducing the word Figure. For example, Figure.2, at line 241; Figure.4 at line 324; and so on.

8- The second sentence in lines 540-543, is hard to read, please simplify.

9- Revise the References section and use a unified format.

Author Response

Reviewer 1

The authors propose a method to detect surface defects using a deep learning approach. The method decomposes raw data into different resolutions to find good representative features. Then, a network is built to learn from these representations. The topic is interesting and the paper is easy to follow. However, some issues have to be addressed.

 

Response:

Thank you for summarizing our work concisely and for the encouraging words. We appreciate your comments and suggestions, which allow us to greatly improve the quality of the manuscript. In addition, we have addressed your comments and highlighted the changes in red in the revised manuscript.

 

Comment 1:

Please give the full definition of all acronyms at the first place they appear and then use them accordingly.

 

Response:

Thank you for the comment, and we apologize for this careless mistake. The full definitions of all acronyms have been added in the revised manuscript (kindly refer to Page 2). The table of abbreviations is reproduced below for quick reference:

Abbreviations

MRFFN | Multi-receptive Field Fusion-based Network
ASI | Automated Surface Inspection
GLCM | Gray-level Co-occurrence Matrix
ANN | Artificial Neural Network
OMFF-RF | Improved Random Forest Algorithm with Optimal Multi-feature-set Fusion
HOG | Histogram of Oriented Gradient
GOCM | Gradient-only Co-occurrence Matrices
SVM | Support Vector Machine
LBP | Local Binary Pattern
CLTP | Completed Local Ternary Pattern
AECLBPs | Adjacent Evaluation Completed Local Binary Pattern
CNN | Convolutional Neural Network
VGG | Very Deep Convolutional Networks designed by Visual Geometry Group
CPN | Classification Priority Network
MG-CNN | Multi-group Convolutional Neural Network
GPU | Graphics Processing Unit
DL | Deep Learning
cDCGAN | Categorized Deep Convolutional Generative Adversarial Network
PLCNN | Pseudo-Label Convolutional Neural Network
VAE | Variational Autoencoder
CCVAE | Conditional Convolutional Variational Autoencoder
WGANs | Wasserstein Generative Adversarial Networks
GAN | Generative Adversarial Network
PCA | Principal Component Analysis
DNN | Deep Neural Network
GAP | Global Average Pooling
RAM | Random-access Memory
IA | Image Augmentation
Adam | Adaptive Moment Estimation
t-SNE | t-distributed Stochastic Neighbor Embedding
NEU | Northeastern University

 

Comment 2:

The work needs extensive English revision and proofreading.

 

Response:

Thank you for the suggestion. We have carefully revised the paper.

 

Comment 3:

The authors mentioned the fine-tuning of pre-trained GoogleNet models without specifying what type of layers were frozen and what was adapted. Please, give more details on the fine-tuning process.

 

Response:

Thank you for pointing this out. We have explained the fine-tuning process in the revised manuscript (please see line 259 to line 287 on Page 8), and it is reproduced below for quick reference:

 


Since it is difficult to collect sufficient defect samples for a deep network, the main challenge in this study is to extract more significant features from limited data. Concretely, two pre-trained GoogLeNet models are adopted as the baseline models of the MRFFN. Here, the input size of the level 0 model is modified to 200 x 200, which matches the exact resolution of the input images. However, the Gaussian pyramid downsamples the original images to a spatial resolution of 100 x 100 pixels, providing less information for the training model. As a result, the pre-trained GoogLeNet, which contains 22 layers, is too deep for the small-scale dataset. Thus, the input size of the level 1 model is fine-tuned to 100 x 100 pixels to reduce the risk of overfitting. In addition, the last two Inception modules (i.e., Inception module 5a and Inception module 5b) of the pre-trained GoogLeNet have been discarded to scale down the computational load of the level 1 model. The overall framework of the proposed method is shown in Figure 3.

Figure 4. The overall framework of the proposed method. First, the original images are decomposed by the Gaussian pyramid. Then, the level 0 and level 1 models are trained individually on the low-level and high-level images. Lastly, the confidence scores of both networks are fused together for the final result.

In addition, GoogLeNet, which was pre-trained with 1.2 million samples, contains optimal weights for feature extraction. In brief, GoogLeNet learned features from the 1000 categories (e.g., animal, flower, tool, building, and fruit) of the ImageNet dataset, which look different from steel surface defects. Hence, to better characterize the patterns of the steel surface defects, the shallower layers of both the level 0 and level 1 models are adopted with higher learning rate factors. Here, the learning rate factors of the weights and biases of Conv 1, Conv 2-reduce, Conv 2, Inception module 3a, Inception module 3b, and Inception module 4a are set to 9, while the other layers remain the same. By increasing the learning rate factor of the shallower layers, one may improve the convergence speed of the training models while simultaneously reducing the gradient vanishing problem, especially in a deep CNN model. Furthermore, the average pooling of both models is replaced with global average pooling (GAP) to extract the global information of each feature map. Meanwhile, the fully connected layer of the original network is replaced with a new fully connected layer whose number of outputs equals the number of NEU dataset classes.
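As a rough illustration of the fine-tuning recipe described above, the sketch below is our own assumption in PyTorch (not the authors' code): torchvision's GoogLeNet layer names (conv1, conv2, conv3, inception3a, ...) stand in for "Conv 1", "Conv 2-reduce", "Conv 2", etc., the base learning rate is hypothetical, and six output classes match the NEU dataset.

```python
# Hedged sketch of the described fine-tuning setup using torchvision's GoogLeNet.
import torch.nn as nn
import torch.optim as optim
from torchvision.models import googlenet

model = googlenet(weights="DEFAULT")         # ImageNet pre-trained baseline
model.avgpool = nn.AdaptiveAvgPool2d(1)      # global average pooling (GAP)
model.fc = nn.Linear(1024, 6)                # new head sized to the six NEU classes

base_lr = 1e-4                               # assumed base learning rate
shallow = {"conv1", "conv2", "conv3", "inception3a", "inception3b", "inception4a"}
shallow_params, other_params = [], []
for name, param in model.named_parameters():
    (shallow_params if name.split(".")[0] in shallow else other_params).append(param)

# Shallower layers receive a 9x learning rate factor, the rest keep the base rate.
optimizer = optim.Adam([
    {"params": shallow_params, "lr": 9 * base_lr},
    {"params": other_params, "lr": base_lr},
])
```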

 

Comment 4:

What is the exact input size for each of the two fine-tuned models? Please include this information in the experimental results section.

 

Response:

Thank you for pointing this out. We apologize for the confusion. The exact input sizes have been clearly explained in the revised manuscript (please see line 261 to line 267 on Page 8). The modified sentences are reproduced below for quick reference:

 

Here, the input size of the level 0 model is modified to 200 x 200, which matches the exact resolution of the input images. However, the Gaussian pyramid downsamples the original images to a spatial resolution of 100 x 100 pixels, providing less information for the training model. As a result, the pre-trained GoogLeNet, which contains 22 layers, is too deep for the small-scale dataset. Thus, the input size of the level 1 model is fine-tuned to 100 x 100 pixels to reduce the risk of overfitting.

 

Comment 5:

Give the proper reference for the GoogleNet model at the first place it is mentioned.

 

Response:

Thank you for pointing this out, and we apologize for this mistake. The mistake has been corrected in the revised manuscript (please see line 57 to line 60 on Page 2), and the modified sentence is reproduced below for quick reference:

 

Then, two pre-trained GoogLeNet models [5] are fine-tuned respectively, in which the shallower layers are given a higher learning rate factor to improve the convergence speed of the model and, at the same time, to avoid the training model falling into a local optimum.

 

Comment 6:

Figures need some more explanations.

 

Response:

Thank you for the suggestion. We have explained all figures in the revised manuscript and they are reproduced below for quick reference purposes:


Figure 5. The result of the Gaussian pyramid. Level 0 represents the raw image and level 1 represents the downsampled image, which contains only half of the level 0 resolution.

 

Figure 6. The architecture of the pre-trained GoogLeNet. Conv represents the convolutional layer, M. Pool represents the max pooling layer, Norm represents the local response normalization, and Avg Pool represents the average pooling.

 

 

 

 

 

Figure 7. The overall framework of the proposed method. First, the original images are decomposed by the Gaussian pyramid. Then, the level 0 and level 1 models are trained individually on the low-level and high-level images. Lastly, the confidence scores of both networks are fused together for the final result.

 

Figure 4. The examples of the NEU dataset, where Cr, In, Pa, PS, RS, and Sc denote the crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches defects respectively.

 

Comment 7:

Remove the dot (.) after introducing the word Figure. For example, Figure.2, at line 241; Figure.4 at line 324; and so on.

 

Response:

Thank you for pointing this out. We have removed the dots in all the captions of the figures in the revised manuscript.

 

Comment 8:

The second sentence in lines 540-543, is hard to read, please simplify.

 

Response:

Thank you for the suggestion. We have simplified the sentence in the revised manuscript (please see line 577 to line 579 on Page 21), and it is reproduced below for quick reference.

 

This paper proposes a novel method to improve automated defect inspection on hot-rolled steel strip by introducing the multi-receptive field fusion-based network (MRFFN).

 

Comment 9:

Revise the References section and use a unified format.

 

Response:

We apologize for this mistake. The reference section has been updated in the revised manuscript (Please see Page 22-23).

Author Response File: Author Response.pdf

Reviewer 2 Report

"Design of Multi-receptive Field Fusion-based Network for Surface Defect Inspection on Hot-rolled Steel Strip Using Lightweight Dataset"

by

Wei-Peng Tang, Sze-Teng Liong, Chih-Cheng Chen, Ming-Han Tsai, Ping-Cheng Hsieh, Yu-Ting Tsai, Shih-Hsin Chen, and Kun-Ching Wang

===============================

General Comments

---------------------

The submitted paper describes interesting research on CNN classifiers of defects arising on steel surfaces, performed on the basis of their images. The authors adopted the GoogLeNet architecture, working jointly on the original surface image and its reduced-resolution version. The image decimation is performed using Gaussian filtering. The results of the two independent GoogLeNet network branches are weighted to produce the final decision of the proposed classifier. The challenge of this research is the very small dataset (200 = 50 training + 150 testing images only). It seems that having so many parameters trained with just 50 image samples, moreover without standard image augmentation, will lead to a nearly perfect fit of the learning model to the data.

The implicitly applied bootstrapping procedure, performed by random sampling of 50 images, exhibits a relatively high standard deviation that overlaps with the accuracy levels of the SOTA models being compared. Therefore, it is hard to say whether the conclusion on model superiority over SOTA is valid.

Fortunately, there is another aspect presented in the paper: the model's robustness to noise and motion blur. This could be important for some real-life applications, and in the future, when sufficiently large datasets become available, the MRFFN solution could be retrained to give a sufficiently accurate and distortion-robust tool, to be used for instance in steelworks.

===========================

Detailed comments

---------------------

1. In image classification based on six classes, it may happen that an image is outside those classes - in your case, it represents a steel defect which cannot be assigned to any of the six classes. My question: why are there not seven classes? Six classes + an OTHER class including samples without defects, or including the cases of mixed defects, or even including defects from an unknown class?

 

2. Image augmentation is the standard preprocessing stage offered by many neural platforms, like PyTorch or TensorFlow. Please do it, obtaining for instance 5000 images for training instead of 50.

 

3. In the original GoogLeNet we have 7x7 average pooling with stride 1. However, you use the name GAP for this unit without any explanation of this acronym. I expect ??? Average Pooling. In the placeholder ??? should not be Global since it is not the same operation.

 

4. You make weighting of scores y_0, y_1 as the fusion operation. Please, explain in the paper the following items: (a) which scores you combine - before (called logits) or after (called probs) the SoftMax units in the branches (your levels) 0 and 1, (b) to make clear put also the annotations y_0, y_1 directly in the Figure 3.

 

5. Equation 3 for the scores weighting is a conditional one. However, the case y_0 = y_1 can be ignored since the general case gives the same result for y_final, provided the usual condition w_0 + w_1 = 1. Please make Equation 3 simpler: y_final = w_0*y_0 + w_1*y_1.

BTW, the best performance for the case w_0 = w_1 = 0.5 is explained in Figure 10. However, there is no analysis for the case y_0 = y_1 and then for the choice of weights w_0 = 0.6, w_1 = 0.4. I know why: the probability that you get the same score is zero!

 

6. Usually, in CNN architectures, the separation of feature extraction component from the classification component is not clear, i.e. not well defined. The designers decide what unit's output he/she calls as the network deep feature, embedding, etc. So to fully understand the t_SNE based visualization of deep features in Figures 6-8, the readers will expect a statement explaining "what feature vector you mean."

 

7. It seems that recent papers on CNN classifiers, besides the accuracy metrics, include F_1 metrics and confusion matrices as well. Please follow these good practices.

 

8. It seems that for many years all papers on new DNN networks have included, in their experimental sections, a graph with the curves of training and validation losses versus epoch id, also in the case when, for some reason, the testing dataset is the same as the validation dataset. Please add this graph.

 

9. Gaussian filtering used for the image decimation is based on a kernel mask of size r x r. Please specify what value of r is actually used in the final model and why.

 

Author Response

Reviewer 2

The submitted paper describes interesting research on CNN classifiers of defects arising on steel surfaces, performed on the basis of their images. The authors adopted the GoogLeNet architecture, working jointly on the original surface image and its reduced-resolution version. The image decimation is performed using Gaussian filtering. The results of the two independent GoogLeNet network branches are weighted to produce the final decision of the proposed classifier. The challenge of this research is the very small dataset (200 = 50 training + 150 testing images only). It seems that having so many parameters trained with just 50 image samples, moreover without standard image augmentation, will lead to a nearly perfect fit of the learning model to the data. The implicitly applied bootstrapping procedure, performed by random sampling of 50 images, exhibits a relatively high standard deviation that overlaps with the accuracy levels of the SOTA models being compared. Therefore, it is hard to say whether the conclusion on model superiority over SOTA is valid. Fortunately, there is another aspect presented in the paper: the model's robustness to noise and motion blur. This could be important for some real-life applications, and in the future, when sufficiently large datasets become available, the MRFFN solution could be retrained to give a sufficiently accurate and distortion-robust tool, to be used for instance in steelworks.

 

Response:

Thank you for the invaluable suggestions given to improve the quality of our paper. We truly appreciate your great efforts and constructive suggestions on this manuscript. The manuscript has been modified accordingly. We have addressed your comments and highlighted the changes in blue in the revised manuscript.

 

Comment 1:

In image classification based on six classes, it may happen that an image is outside those classes - in your case, it represents a steel defect which cannot be assigned to any of the six classes. My question: why are there not seven classes? Six classes + an OTHER class including samples without defects, or including the cases of mixed defects, or even including defects from an unknown class?

 

Response:

In real-world cases, the OTHER class should be considered in the classification task, since an image may fall outside those classes. However, most of the state-of-the-art methods [1,2,3] did not consider the OTHER class in their classification tasks. This is because the NEU-CLS dataset does not provide defect-free samples, mixed-defect samples, or unknown-class samples. In addition, other popular works [4,5,6,7] that carried out the steel defect classification task did not consider the OTHER class either. Nevertheless, [6,8] included the defect-free category in their experiments because defect-free samples were provided in the original textured dataset and the decorative sheet defect dataset. Therefore, we follow the common practice of the experiments conducted on the NEU-CLS dataset (i.e., 6-class classification) so that a fair comparison can be reported to demonstrate the effectiveness and robustness of our proposed method. The table below summarizes some related datasets with the corresponding classification results and defect types.

 

 

Dataset | Reference | Accuracy (%) | Classification Categories

Without non-defective samples:
NEU-CLS | He et al. [1] | 99.56 | Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale, Scratches.
NEU-CLS | Lee et al. [2] | 99.44 | Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale, Scratches.
NEU-CLS | Song et al. [3] | 98.93 | Crazing, Inclusion, Patches, Pitted Surface, Rolled-in Scale, Scratches.
Hot Rolled Plate | He et al. [4] | 97.20 | Oxide Scales, Cracks, Water Mark, Seams, and Rolling Marks.
Hot Rolled Strip | He et al. [4] | 97.00 | Longitudinal Cracks, Transverse Cracks, Wrinkles, Scars, Watermarks, Oxide Scales, Seams, Edge Cracks, and Rolling Marks.
X-SDD Dataset | Feng et al. [5] | 95.10 | Oxide Scale of Plate System, Red Iron Sheet, Scratches, Inclusion, Finishing Roll Printing, Iron Sheet Ash, and Oxide Scale of Temperature System.
NEU-CLS-64 | Wang et al. [6] | 96.67 | Crazing, Grooves and Gouges, Inclusion, Patches, Pitted Surface, Rolling Dust, Rolled-in Scale, Scratches, and Spots.
GC10 Dataset | Wang et al. [6] | 82.73 | Crease, Crescent Gap, Inclusion, Oil Spot, Punching Hole, Rolled Pit, Silk Spot, Waist Folding, Water Spot, and Welding Line.
Solar Cell Surface Defect | Chen et al. [7] | 93.23 | Chipping, Broken Gates, Leaky Paste, Dirty Sheets, Scratches, Thick Lines, and Chromatic Aberrations.

With non-defective samples:
Textured Data Set | Wang et al. [6] | 68.43 | Color, Cut, Hole, Metal Contamination, Thread, and Good.
Decorative Sheet Defect | Le et al. [8] | 92.44 | Dot, Spot, Abrasion, Fragment, Cut, and Normal.

 

Comment 2:

Image augmentation is the standard preprocessing stage offered by many neural platforms, like PyTorch or TensorFlow. Please do it, obtaining for instance 5000 images for training instead of 50.

 

Response:

Thank you for your suggestion. The image augmentation technique has been adopted in the revised manuscript. We have adopted the most suitable image augmentation technique according to the experimental results below:

where Ro represents a random rotation angle between -45 and 45 degrees, Re represents a random horizontal or vertical reflection with 50% probability, and Tr represents random translation distances at an interval of 3 pixels. Hence, we adopted the image reflection technique to improve the training progress, and the experimental results have been modified in the revised manuscript. The experimental results are reproduced below for quick reference:
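For illustration only, the following is a minimal sketch (our assumption, using torchvision transforms rather than the authors' original pipeline) of the three augmentation options denoted Ro, Re, and Tr, followed by the adopted reflection-only setting.

```python
# Hedged sketch of the compared augmentation options (Ro, Re, Tr) for 200x200 images.
import torchvision.transforms as T

augmentations = {
    "Ro": T.RandomRotation(degrees=45),                        # rotation in [-45, 45] degrees
    "Re": T.Compose([T.RandomHorizontalFlip(p=0.5),
                     T.RandomVerticalFlip(p=0.5)]),            # 50% horizontal/vertical reflection
    "Tr": T.RandomAffine(degrees=0, translate=(3/200, 3/200)), # up to ~3-pixel random shifts (assumed)
}

# The adopted setting: random reflection only, applied before tensor conversion.
train_transform = T.Compose([augmentations["Re"], T.ToTensor()])
```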

 

 

 

Table 2. Comparison results with the state-of-the-art based on the NEU dataset (%).

Method | Training Sample | Testing Sample | Accuracy | Recall | Precision | F1-score
Lee et al. [20] | 210 | 30 | 99.44 | - | - | 99.00
Xiao et al. [48] | 150 | 150 | 97.42 | - | - | -
Song et al. [15] | 150 | 150 | 98.93±0.63 | 97.89 | 97.91 | 97.90
Ren et al. [25] | 150 | 150 | 99.27 | - | - | -
Gao et al. [38] | 50 | 250 | 99.26 | 99.26 | 100 | 99.63
Level 0 | 50 | 250 | 99.30±0.37 | 99.30 | 99.30 | 99.30
Level 1 | 50 | 250 | 99.03±0.39 | 99.03 | 99.04 | 99.03
MRFFN | 50 | 250 | 99.61±0.23 | 99.61 | 99.61 | 99.61
MRFFN + IA | 50 | 250 | 99.75±0.24 | 99.75 | 99.75 | 99.75

Table 4. The performance of the proposed method on Gaussian white noise (accuracy, %).

Method | Original | Var 0.01 | Var 0.05 | Var 0.1 | Var 0.3
AlexNet | - | 96.45 | 91.48 | 84.62 | 67.24
VGG16 | - | 97.73 | 92.95 | 87.25 | 53.35
ResNet-18 | - | 97.60 | 91.81 | 86.31 | 73.02
Level 0* | 99.30 | 59.55 | 37.35 | 30.29 | 21.19
Level 1* | 99.03 | 76.85 | 47.23 | 37.70 | 20.87
MRFFN* | 99.61 | 74.41 | 45.99 | 37.99 | 21.59
MRFFN + IA* | - | 66.08 | 45.04 | 37.29 | 21.83
Level 0 | - | 97.86 | 94.55 | 92.17 | 81.31
Level 1 | - | 97.27 | 94.43 | 91.73 | 83.69
MRFFN | - | 98.04 | 95.63 | 93.79 | 85.65
MRFFN + IA | - | 98.71 | 96.63 | 94.87 | 88.25

* denotes that the model was trained on the original dataset.

Table 5. The performance of the proposed method on salt and pepper noise (accuracy, %).

Method | Original | Density 0.01 | Density 0.05 | Density 0.1 | Density 0.3
AlexNet | - | 96.80 | 93.99 | 90.69 | 75.48
VGG16 | - | 97.15 | 95.49 | 91.36 | 69.94
ResNet-18 | - | 96.74 | 94.31 | 91.08 | 72.16
Level 0* | 99.30 | 77.43 | 51.58 | 37.92 | 21.09
Level 1* | 99.03 | 86.92 | 51.96 | 40.81 | 20.92
MRFFN* | 99.61 | 84.63 | 53.29 | 39.69 | 20.18
MRFFN + IA* | - | 83.25 | 50.65 | 39.53 | 18.73
Level 0 | - | 98.83 | 95.76 | 92.52 | 79.57
Level 1 | - | 97.85 | 96.26 | 94.04 | 86.99
MRFFN | - | 98.87 | 97.01 | 94.66 | 87.09
MRFFN + IA | - | 99.31 | 97.87 | 96.41 | 90.54

* denotes that the model was trained on the original dataset.

 

 

 

 

Table 6. The performance of the proposed method on motion blur (accuracy, %).

Method | Original | Length 5 | Length 10 | Length 15 | Length 20
AlexNet | - | 97.61 | 96.17 | 95.12 | 93.57
VGG16 | - | 98.55 | 95.93 | 95.54 | 94.33
ResNet-18 | - | 98.55 | 96.57 | 95.15 | 94.31
Level 0* | 99.30 | 68.02 | 53.81 | 45.05 | 40.65
Level 1* | 99.03 | 94.03 | 74.85 | 63.31 | 56.19
MRFFN* | 99.61 | 82.59 | 67.76 | 58.01 | 51.87
MRFFN + IA* | - | 79.35 | 67.39 | 58.94 | 53.23
Level 0 | - | 98.83 | 97.41 | 96.59 | 95.43
Level 1 | - | 98.64 | 97.51 | 96.78 | 95.50
MRFFN | - | 99.11 | 98.17 | 97.51 | 96.40
MRFFN + IA | - | 99.28 | 98.91 | 98.11 | 97.43

* denotes that the model was trained on the original dataset.

Table 7. The confusion matrices on the different interference defect datasets, which contain six types of defects (%).

(a) Variance 0.3

 | Cr | In | Pa | Ps | Rs | Sc | Precision
Cr | 90.44 | 0.76 | 0.00 | 2.96 | 2.64 | 0.24 | 93.20
In | 0.40 | 79.60 | 0.00 | 5.92 | 13.64 | 7.12 | 74.62
Pa | 0.04 | 0.00 | 99.20 | 0.48 | 0.00 | 0.00 | 99.48
Ps | 2.44 | 10.20 | 0.56 | 88.00 | 0.84 | 1.00 | 85.40
Rs | 5.96 | 5.64 | 0.00 | 1.60 | 82.48 | 1.88 | 84.54
Sc | 0.72 | 3.80 | 0.24 | 1.04 | 0.40 | 89.76 | 93.54
Recall | 90.44 | 79.60 | 99.20 | 88.00 | 82.48 | 89.76 | 88.25

(b) Density 0.3

 | Cr | In | Pa | Ps | Rs | Sc | Precision
Cr | 94.12 | 0.96 | 0.00 | 3.32 | 2.04 | 0.24 | 93.48
In | 0.16 | 82.00 | 0.00 | 6.04 | 7.48 | 7.16 | 79.74
Pa | 0.32 | 0.08 | 99.20 | 0.88 | 0.04 | 0.08 | 98.61
Ps | 1.16 | 9.88 | 0.24 | 88.12 | 0.56 | 0.72 | 87.52
Rs | 4.04 | 4.52 | 0.00 | 0.72 | 89.52 | 1.52 | 89.23
Sc | 0.20 | 2.56 | 0.56 | 0.92 | 0.36 | 90.28 | 95.15
Recall | 94.12 | 82.00 | 99.20 | 88.12 | 89.52 | 90.28 | 90.54

(c) Length 20

 | Cr | In | Pa | Ps | Rs | Sc | Precision
Cr | 99.12 | 0.00 | 0.00 | 1.40 | 0.60 | 0.00 | 98.02
In | 0.00 | 97.20 | 0.00 | 3.00 | 0.00 | 4.00 | 93.28
Pa | 0.04 | 0.00 | 100.00 | 0.00 | 0.00 | 0.00 | 99.96
Ps | 0.04 | 1.88 | 0.00 | 94.12 | 0.08 | 0.68 | 97.23
Rs | 0.80 | 0.12 | 0.00 | 1.12 | 99.28 | 0.44 | 97.56
Sc | 0.00 | 0.80 | 0.00 | 0.36 | 0.04 | 94.88 | 98.75
Recall | 99.12 | 97.20 | 100.00 | 94.12 | 99.28 | 94.88 | 97.43

Comment 3:

In the original GoogLeNet we have 7x7 average pooling with stride 1. However, you use the name GAP for this unit without any explanation of this acronym. I expect ??? Average Pooling. In the placeholder ??? should not be Global since it is not the same operation.

 

Response:

Thank you for pointing this out. We apologize for the confusion. In the original GoogLeNet, the input size of the 7x7 average pooling is 7x7x1024; hence the 7x7 average pooling may be considered a global average pooling (GAP), which also averages each whole feature map into one value. In this article, the input size of the GAP is 6x6. Hence, we use the name GAP instead of 6x6 average pooling to better describe the process. The explanation of the acronym and the reason for using GAP are reproduced below for quick reference (please see line 284 to line 285 on Page 8):

 

Furthermore, the average pooling of both models is replaced with global average pooling (GAP) to extract the global information of each feature map.
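As a small illustrative check (our own PyTorch sketch, not taken from the manuscript), the snippet below shows that a 7x7 average pooling over a 7x7x1024 feature map coincides with GAP, while GAP also covers the 6x6 feature maps used in this work.

```python
# Hedged sketch: 7x7 average pooling on a 7x7 map equals global average pooling (GAP).
import torch
import torch.nn as nn

x7 = torch.randn(1, 1024, 7, 7)   # feature map size in the original GoogLeNet
x6 = torch.randn(1, 1024, 6, 6)   # feature map size assumed in this work

avg7 = nn.AvgPool2d(kernel_size=7, stride=1)
gap = nn.AdaptiveAvgPool2d(1)

assert torch.allclose(avg7(x7), gap(x7), atol=1e-6)  # identical on 7x7 input
print(gap(x6).shape)                                 # torch.Size([1, 1024, 1, 1]) on 6x6 input
```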

 

 

Comment 4:

You make weighting of scores y_0, y_1 as the fusion operation. Please, explain in the paper the following items: (a) which scores you combine - before (called logits) or after (called probs) the SoftMax units in the branches (your levels) 0 and 1, (b) to make clear put also the annotations y_0, y_1 directly in Figure 3.

 

Response:

Thank you for your suggestion, and we apologize for the confusion. We have explained which scores we combine for predicting the final result (please see Page 9). Besides, we have also added the annotations y_0 and y_1 in Figure 3 (please see Pages 8-9). The explanation of the fusion operation is reproduced below for quick reference:

 

y_final^i = 0.5*y_0^i + 0.5*y_1^i,  if y_1st ≠ y_2nd
y_final^i = 0.6*y_0^i + 0.4*y_1^i,  if y_1st = y_2nd

where i indicates the defect class, y_1st and y_2nd indicate the highest and the second-highest prediction scores on an arbitrary testing image, and y_0^i and y_1^i indicate the probabilities of the class i defect according to the level 0 and level 1 models, respectively. To prevent y_final from containing two equal highest prediction scores, the weights of the level 0 and level 1 models are set to 0.6 and 0.4 when y_1st and y_2nd are equal, while otherwise the weights are equally 0.5; the explanation will be shown in Section 5.2.
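To make the fusion rule concrete, here is a small sketch of the score-level fusion described above; the function name and example probabilities are hypothetical, while the 0.5/0.5 and 0.6/0.4 weights follow the description.

```python
# Hedged sketch of Equation 3: average the two branches' class probabilities,
# and fall back to 0.6/0.4 weights only when the fused top-1 and top-2 scores tie.
import numpy as np

def fuse_scores(y0, y1, w_equal=(0.5, 0.5), w_tiebreak=(0.6, 0.4)):
    """y0, y1: per-class probabilities from the level 0 and level 1 models."""
    y_final = w_equal[0] * y0 + w_equal[1] * y1
    top2 = np.sort(y_final)[-2:]
    if np.isclose(top2[0], top2[1]):                 # two equal highest fused scores
        y_final = w_tiebreak[0] * y0 + w_tiebreak[1] * y1
    return y_final

y0 = np.array([0.10, 0.60, 0.05, 0.10, 0.05, 0.10])  # level 0 probabilities (example)
y1 = np.array([0.60, 0.10, 0.05, 0.10, 0.05, 0.10])  # level 1 probabilities (example)
print(np.argmax(fuse_scores(y0, y1)))                # predicted defect class: 1
```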

 

Comment 5:

Equation 3 for the scores weighting is a conditional one. However, the case y_0 = y_1 can be ignored since the general case gives the same result for y_final, provided the usual condition w_0 + w_1 = 1. Please make Equation 3 simpler: y_final = w_0*y_0 + w_1*y_1. BTW, the best performance for the case w_0 = w_1 = 0.5 is explained in Figure 10. However, there is no analysis for the case y_0 = y_1 and then for the choice of weights w_0 = 0.6, w_1 = 0.4. I know why: the probability that you get the same score is zero!

 

Response:

Thank you for pointing this out. We apologize for the careless mistake. We have corrected the definition of equation 3 in the revised manuscript. The updated equation has been shown in the response to the previous comment (i.e., Comment 4).

 

Comment 6:

Usually, in CNN architectures, the separation of feature extraction component from the classification component is not clear, i.e. not well defined. The designers decide what unit's output he/she calls as the network deep feature, embedding, etc. So to fully understand the t_SNE based visualization of deep features in Figures 6-8, the readers will expect a statement explaining "what feature vector you mean."

 

Response:

Thank you for pointing this out. We extract the most discriminative features from the fully connected layer (i.e., the penultimate layer of the pre-trained GoogLeNet structure) in both networks for feature visualization, in order to understand the contribution of each neuron to the decision of interest. The newly added sentence has been rephrased in the revised manuscript (kindly refer to line 445 to line 448 on Page 15) and is reproduced below for quick reference:

 

The most discriminative features were extracted from the fully connected layer (i.e., the penultimate layer of the pre-trained GoogLeNet structure) in both networks for feature visualization, as it can provide a better interpretation of each neuron for the decision of interest.
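A minimal sketch of this visualization step is given below; it is our own assumption (a stock torchvision GoogLeNet stands in for the fine-tuned branches, and random tensors replace the real test images), capturing the fully connected layer's activations with a forward hook and projecting them with scikit-learn's t-SNE.

```python
# Hedged sketch: collect fully connected (fc) activations and embed them with t-SNE.
import torch
from torchvision.models import googlenet
from sklearn.manifold import TSNE

model = googlenet(weights="DEFAULT").eval()
features = []
# The hook stores the fc layer's output on every forward pass.
model.fc.register_forward_hook(lambda module, inputs, output: features.append(output.detach()))

images = torch.randn(32, 3, 224, 224)        # placeholder for the test images
with torch.no_grad():
    model(images)

vectors = torch.cat(features).numpy()        # one feature vector per image
points = TSNE(n_components=2, perplexity=10).fit_transform(vectors)
print(points.shape)                          # (32, 2) coordinates for the scatter plot
```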

 

 

 

Comment 7:

It seems that recent papers on CNN classifiers, besides the accuracy metrics, include F_1 metrics and confusion matrices as well. Please follow these good practices.

 

 

Response:

Thank you for pointing this out. We have added the F1 metrics and confusion matrix in the revised manuscript (please see Tables 2-3 on Pages 13-14), and they are reproduced below for quick reference:

 

Table 2. Comparison results with the state-of-the-art based on the NEU dataset (%).

Method | Training Sample | Testing Sample | Accuracy | Recall | Precision | F1-score
Lee et al. [20] | 210 | 30 | 99.44 | - | - | 99.00
Xiao et al. [48] | 150 | 150 | 97.42 | - | - | -
Song et al. [15] | 150 | 150 | 98.93±0.63 | 97.89 | 97.91 | 97.90
Ren et al. [25] | 150 | 150 | 99.27 | - | - | -
Gao et al. [38] | 50 | 250 | 99.26 | 99.26 | 100 | 99.63
Level 0 | 50 | 250 | 99.30±0.37 | 99.30 | 99.30 | 99.30
Level 1 | 50 | 250 | 99.03±0.39 | 99.03 | 99.04 | 99.03
MRFFN | 50 | 250 | 99.61±0.23 | 99.61 | 99.61 | 99.61
MRFFN + IA | 50 | 250 | 99.75±0.24 | 99.75 | 99.75 | 99.75

Table 3. The confusion matrix of the NEU dataset, which contains six types of defects (%).

 | Cr | In | Pa | Ps | Rs | Sc | Precision
Cr | 100.00 | 0.00 | 0.00 | 0.24 | 0.00 | 0.00 | 99.76
In | 0.00 | 99.20 | 0.00 | 0.20 | 0.00 | 0.28 | 99.52
Pa | 0.00 | 0.00 | 100.00 | 0.00 | 0.00 | 0.00 | 100.00
Ps | 0.00 | 0.72 | 0.00 | 99.56 | 0.00 | 0.00 | 99.28
Rs | 0.00 | 0.08 | 0.00 | 0.00 | 100.00 | 0.00 | 99.92
Sc | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 99.72 | 100.00
Recall | 100.0 | 99.2 | 100.0 | 99.6 | 100.0 | 99.7 | 99.75

 

Comment 8:

It seems that for many years all papers on new DNN networks have included, in their experimental sections, a graph with the curves of training and validation losses versus epoch id, also in the case when, for some reason, the testing dataset is the same as the validation dataset. Please add this graph.

 

Response:

Thank you for your suggestion. We have added the graph of training and testing losses in the revised manuscript (kindly refer to line 371 to line 379 and Figure 6 on Pages 12-13), and it is reproduced below for quick reference:

 


Figure 6. The training and testing losses of models. (a) Training progress of level 0 model (b) Training progress of level 1 model.

Figure 6 shows the training and testing losses of the level 0 and level 1 models. It can be seen that both networks have a high convergence speed in the first 100 epochs. Comparing the training progress between the level 0 and level 1 models, the testing loss of the level 0 model is much closer to its training loss. This result indicates that the level 1 model carries a higher risk of overfitting, owing to the smaller amount of information provided to the training model. Lastly, the testing loss gradually stabilized after training for 100 epochs, and the experimental results indicate that the testing loss remains almost the same from 200 to 300 epochs. Hence, this result suggests that 300 epochs are suitable for training the models.

 

Comment 9:

Gaussian filtering used for the image decimation is based on a kernel mask of size r x r. Please specify what value of r is actually used in the final model and why.

 

Response:

Thank you for pointing this out. We apologize for the careless mistake. A 5 x 5 kernel mask is applied for the Gaussian filtering to provide global information while retaining important details for the training model. The size of the kernel mask has been added in the revised manuscript (please see Equation 1 on Page 5), and the modified description is reproduced below for quick reference:

 

Here, the Gaussian kernel is set to 5 × 5 to provide global information while retaining important details for the training model.
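For reference, a minimal sketch of this decomposition is shown below, assuming OpenCV's pyrDown (which blurs with a 5x5 Gaussian kernel before downsampling) and a placeholder 200x200 grayscale image.

```python
# Hedged sketch of the Gaussian pyramid step: 5x5 Gaussian blur + 2x downsampling.
import cv2
import numpy as np

level0 = np.random.randint(0, 256, (200, 200), dtype=np.uint8)  # placeholder defect image
level1 = cv2.pyrDown(level0)                                     # 5x5 Gaussian kernel, half resolution
print(level0.shape, level1.shape)                                # (200, 200) (100, 100)
```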

 

— END OF RESPONSE TO REVIEWER 2 —

 

 

Reference

  1. He, Y.; Song, K.C.; Dong, H.W.; Yan, Y.H. Semi-supervised defect classification of steel surface based on multi-training and generative adversarial network. Optics and Lasers in Engineering 2019, 122, 294-302.
  2. Lee, S.Y.; Tama, B.A.; Moon, S.J.; Lee, S.C. Steel Surface Defect Diagnostics Using Deep Convolutional Neural Network and Class Activation Map. Applied Sciences 2019, 9(24), 5449.
  3. Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science 2013, 285, 858-864.
  4. He, D.; Xu, K.; Wang, D.D. Design of multi-scale receptive field convolutional neural network for surface inspection of hot rolled steels. Image and Vision Computing 2019, 89, 12-20.
  5. Feng, X.L.; Gao, X.W.; Luo, L. X-SDD: A New Benchmark for Hot Rolled Steel Strip Surface Defects Detection. Symmetry 2021, 13(4), 706.
  6. Wang, Y.C.; Gao, L.; Li, X.Y.; Gao, Y.P.; Xie, X.T. A New Graph-Based Method for Class Imbalance in Surface Defect Recognition. IEEE Transactions on Instrumentation and Measurement 2021, 70, Article no. 5007816.
  7. Chen, H.Y.; Hu, Q.D.; Zhai, B.S.; Chen, H.; Liu, K. A robust weakly supervised learning of deep Conv-Nets for surface defect inspection. Neural Computing and Applications 2020, 32, 11229-11244.
  8. Le, X.Y.; Mei, J.H.; Zhang, H.D.; Zhou, B.Y.; Xi, J.T. A learning-based approach for surface defect detection using small image datasets. Neurocomputing 2020, 408, 112-120.

 

 

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors addressed my comments. However, I suggest removing the Abbreviations from the second page to the end of paper, or they can be removed as every acronym is already defined in the text.

Author Response

Reviewer 1

The authors addressed my comments. However, I suggest removing the Abbreviations from the second page to the end of paper, or they can be removed as every acronym is already defined in the text.

 

Response:

Thank you for the invaluable comments. We have removed the Abbreviations in the revised manuscript, since all the acronyms are already defined in the text.

 

Reviewer 2 Report

The majority of my comments have been considered by the authors. Some of the improvements could be presented in a clearer manner.

For instance, I still do not see in the paper a direct link between the deep features used for the t-SNE graphs and the architecture. It is in the authors' response, but I miss it in the paper.

Another point is formula (3). In the new version I would replace the crisp equality with a threshold sufficient to consider the first and the second score as equal.

A small benefit of augmentation means only that your test data is poor.

Author Response

Reviewer 2

The majority of my comments have been considered by the authors. Some of the improvements could be presented in a clearer manner.

 

Response:

Thank you for the constructive comments given to improve the quality of our paper. We have addressed your comments and highlighted the changes in red in the revised manuscript.

 

Comment 1:

I still do not see in the paper a direct link between the deep features used for the t-SNE graphs and the architecture. It is in the authors' response, but I miss it in the paper.

 

Response:

Thank you for pointing this out, and we apologize for the confusion. We applied the activations of the fully connected layer from both the level 0 and level 1 networks as the feature vectors for feature visualization. We have provided a statement explaining "what feature vector we mean" in the revised manuscript (kindly refer to line 447 to line 453 on Page 14), and it is reproduced below for quick reference:

 

Besides, the proposed MRFFN has adopted several inception modules to extract multiscale features and applied the max pooling layer to downsample the aggregated features. Lastly, a fully connected layer is applied to extract all the discriminative information from the above layers. Hence, the activations of the fully connected layer from both the level 0 and level 1 networks are applied as the feature vectors for feature visualization to provide a better interpretation of the activations for the decision of interest.

 


Figure 7. Feature visualization via t-SNE on Gaussian white noise (Variance 0.3). The feature vectors are extracted from the activations of the fully connected layer. (a) The original level 0 model (b) The retrained level 0 model (c) The original level 1 model (d) The retrained level 1 model.


Figure 8. Feature visualization via t-SNE on salt and pepper noise (Density 0.3). The feature vectors are extracted from the activations of the fully connected layer. (a) The original level 0 model (b) The retrained level 0 model (c) The original level 1 model (d) The retrained level 1 model.


Figure 9. Feature visualization via t-SNE on motion blur (Motion length 0.3). The feature vectors are extracted from the activations of the fully connected layer. (a) The original level 0 model (b) The retrained level 0 model (c) The original level 1 model (d) The retrained level 1 model.

 

Comment 2:

Another point is formula (3). In the new version I would replace the crisp equality with a threshold sufficient to consider the first and the second score as equal.

 

Response:

Thank you for the invaluable suggestion, and we have adopted it for analysis. A summary of the analysis with the redefined Equation 3 is shown below:

 

                     

y_final^i = 0.5*y_0^i + 0.5*y_1^i,  if |y_1st - y_2nd| > threshold
y_final^i = 0.6*y_0^i + 0.4*y_1^i,  if |y_1st - y_2nd| <= threshold        (3)

where i indicates the defect class, y_1st and y_2nd indicate the highest and the second-highest prediction scores on an arbitrary testing image, and y_0^i and y_1^i indicate the probabilities of the class i defect according to the level 0 and level 1 models, respectively.

Here, we select several threshold values for this experimental validation, namely threshold = [0.5, 0.4, 0.3, 0.2, 0.1, 0.01, ...]. Besides, the proposed method is then compared with these values to demonstrate its effectiveness. The experimental results indicate that the proposed MRFFN is slightly inferior when a larger threshold is applied in Equation 3. However, the performance of the proposed MRFFN with a smaller threshold yields results similar to the MRFFN+IA method. This result shows that the redefined equation does not complement the proposed method. This phenomenon can be explained by the experimental results shown in Section 5.2, in which the ideal combination of the weights w_0 and w_1 is equally 0.5. The weights of 0.6 and 0.4 are applied only to avoid divergent results between both networks. Hence, the original Equation 3 is maintained and some expressions are fine-tuned for explicit understanding. The fine-tuned equation has been modified in the revised manuscript (kindly refer to line 287 to line 293 on Pages 7-8) and is reproduced below for quick reference:

 

y_final^i = 0.5*y_0^i + 0.5*y_1^i,  if y_1st ≠ y_2nd
y_final^i = 0.6*y_0^i + 0.4*y_1^i,  if y_1st = y_2nd        (3)

where y_1st and y_2nd indicate the highest and the second-highest prediction scores on an arbitrary testing image, and y_0^i and y_1^i indicate the probabilities of the class i defect according to the level 0 and level 1 models, respectively. To prevent y_final from containing two equal highest prediction scores, the weights of the level 0 and level 1 models are set to 0.6 and 0.4 when y_1st and y_2nd are equal, and the explanation will be shown in Section 5.2.

Comment 3:

A small benefit of augmentation means only that your test data is poor.

 

Response:

Thank you for pointing this out. In Table 2, the application of the image augmentation technique only slightly improves the performance of the MRFFN because the proposed MRFFN has already reached near-perfect performance on the NEU dataset. However, when comparing the performance between the original MRFFN and the MRFFN+IA on the disturbance defect dataset, it can be seen that the image augmentation technique improves the generalization of the training model, especially in high-noise or high-motion-blur scenarios. For instance, the accuracies of the proposed MRFFN improve by 2.6% on the variance 0.3 task, 3.45% on the density 0.3 task, and 1.03% on the motion length 20 task when applying the image augmentation technique, as shown in Table 4, Table 5, and Table 6 accordingly. Consequently, the image augmentation technique is beneficial in increasing the generalization ability of training models, reducing the overfitting phenomenon especially on small-sample tasks, and minimizing the costs of collecting and labeling training samples.

Table 2. Comparison results with the state-of-the-art based on the NEU dataset (%).

Method | Training Sample | Testing Sample | Accuracy | Recall | Precision | F1-score
Lee et al. [20] | 210 | 30 | 99.44 | - | - | 99.00
Xiao et al. [48] | 150 | 150 | 97.42 | - | - | -
Song et al. [15] | 150 | 150 | 98.93±0.63 | 97.89 | 97.91 | 97.90
Ren et al. [25] | 150 | 150 | 99.27 | - | - | -
Gao et al. [38] | 50 | 250 | 99.26 | 99.26 | 100 | 99.63
Level 0 | 50 | 250 | 99.30±0.37 | 99.30 | 99.30 | 99.30
Level 1 | 50 | 250 | 99.03±0.39 | 99.03 | 99.04 | 99.03
MRFFN | 50 | 250 | 99.61±0.23 | 99.61 | 99.61 | 99.61
MRFFN + IA | 50 | 250 | 99.75±0.24 | 99.75 | 99.75 | 99.75

Table 4. The performance of the proposed method on Gaussian white noise (accuracy, %).

Method | Original | Var 0.01 | Var 0.05 | Var 0.1 | Var 0.3
AlexNet | - | 96.45 | 91.48 | 84.62 | 67.24
VGG16 | - | 97.73 | 92.95 | 87.25 | 53.35
ResNet-18 | - | 97.60 | 91.81 | 86.31 | 73.02
Level 0* | 99.30 | 59.55 | 37.35 | 30.29 | 21.19
Level 1* | 99.03 | 76.85 | 47.23 | 37.70 | 20.87
MRFFN* | 99.61 | 74.41 | 45.99 | 37.99 | 21.59
MRFFN + IA* | - | 66.08 | 45.04 | 37.29 | 21.83
Level 0 | - | 97.86 | 94.55 | 92.17 | 81.31
Level 1 | - | 97.27 | 94.43 | 91.73 | 83.69
MRFFN | - | 98.04 | 95.63 | 93.79 | 85.65
MRFFN + IA | - | 98.71 | 96.63 | 94.87 | 88.25

* denotes that the model was trained on the original dataset.

Table 5. The performance of the proposed method on salt and pepper noise (accuracy, %).

Method | Original | Density 0.01 | Density 0.05 | Density 0.1 | Density 0.3
AlexNet | - | 96.80 | 93.99 | 90.69 | 75.48
VGG16 | - | 97.15 | 95.49 | 91.36 | 69.94
ResNet-18 | - | 96.74 | 94.31 | 91.08 | 72.16
Level 0* | 99.30 | 77.43 | 51.58 | 37.92 | 21.09
Level 1* | 99.03 | 86.92 | 51.96 | 40.81 | 20.92
MRFFN* | 99.61 | 84.63 | 53.29 | 39.69 | 20.18
MRFFN + IA* | - | 83.25 | 50.65 | 39.53 | 18.73
Level 0 | - | 98.83 | 95.76 | 92.52 | 79.57
Level 1 | - | 97.85 | 96.26 | 94.04 | 86.99
MRFFN | - | 98.87 | 97.01 | 94.66 | 87.09
MRFFN + IA | - | 99.31 | 97.87 | 96.41 | 90.54

* denotes that the model was trained on the original dataset.

Table 6. The performance of the proposed method on motion blur (accuracy, %).

Method | Original | Length 5 | Length 10 | Length 15 | Length 20
AlexNet | - | 97.61 | 96.17 | 95.12 | 93.57
VGG16 | - | 98.55 | 95.93 | 95.54 | 94.33
ResNet-18 | - | 98.55 | 96.57 | 95.15 | 94.31
Level 0* | 99.30 | 68.02 | 53.81 | 45.05 | 40.65
Level 1* | 99.03 | 94.03 | 74.85 | 63.31 | 56.19
MRFFN* | 99.61 | 82.59 | 67.76 | 58.01 | 51.87
MRFFN + IA* | - | 79.35 | 67.39 | 58.94 | 53.23
Level 0 | - | 98.83 | 97.41 | 96.59 | 95.43
Level 1 | - | 98.64 | 97.51 | 96.78 | 95.50
MRFFN | - | 99.11 | 98.17 | 97.51 | 96.40
MRFFN + IA | - | 99.28 | 98.91 | 98.11 | 97.43

* denotes that the model was trained on the original dataset.

 

Author Response File: Author Response.pdf
