Article

Explainable AI (XAI) Techniques for Convolutional Neural Network-Based Classification of Drilled Holes in Melamine Faced Chipboard

1 Department of Artificial Intelligence, Institute of Information Technology, Warsaw University of Life Sciences, Nowoursynowska 159, 02-776 Warsaw, Poland
2 Faculty of Medicine, Medical University of Lodz, Kościuszki 4, 90-419 Łódź, Poland
3 Department of Mechanical Processing of Wood, Institute of Wood Sciences and Furniture, Warsaw University of Life Sciences, Nowoursynowska 159, 02-776 Warsaw, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7462; https://doi.org/10.3390/app14177462
Submission received: 20 July 2024 / Revised: 15 August 2024 / Accepted: 22 August 2024 / Published: 23 August 2024
(This article belongs to the Special Issue Engineering Applications of Hybrid Artificial Intelligence Tools)

Abstract: The furniture manufacturing sector faces significant challenges in machining composite materials, where quality issues such as delamination can lead to substandard products. This study aims to improve the classification of drilled holes in melamine-faced chipboard using Explainable AI (XAI) techniques to better understand and interpret Convolutional Neural Network (CNN) models’ decisions. We evaluated three CNN architectures (VGG16, VGG19, and ResNet101) pretrained on the ImageNet dataset and fine-tuned on our dataset of drilled holes. The data consisted of 8526 images, divided into three categories (Green, Yellow, Red) based on the drill’s condition. We used 5-fold cross-validation for model evaluation and applied LIME and Grad-CAM as XAI techniques to interpret the model decisions. The VGG19 model achieved the highest accuracy of 67.03% and the lowest critical error rate among the evaluated models. LIME and Grad-CAM provided complementary insights into the decision-making process of the model, emphasizing the significance of certain features and regions in the images that influenced the classifications. The integration of XAI techniques with CNN models significantly enhances the interpretability and reliability of automated systems for tool condition monitoring in the wood industry. The VGG19 model, combined with LIME and Grad-CAM, offers a robust solution for classifying drilled holes, ensuring better quality control in manufacturing processes.

1. Introduction

The furniture manufacturing sector faces numerous challenges, particularly in the machining of composite materials where quality issues, such as delamination, can lead to substandard products. This is particularly true during the drilling operation, in which multiple factors might lead to undesirable results.
Laminated panels are widely utilized in the manufacture of furniture [1,2,3,4,5,6,7]. Research indicates that delamination in wood-based material cutting is primarily caused by tool wear. Studies show a clear correlation between tool wear and defects on the laminate surface of wood panels [4,5,6,8]. Monitoring tool condition is crucial, as identifying the optimal time for tool replacement is essential to prevent delamination and ensure product quality. Manual inspection of tools without prior condition assessment can lead to unnecessary downtime, while delaying tool replacement can damage materials. Thus, automating this monitoring process is increasingly important and a topic of extensive research.
In recent years, the application of machine learning and data augmentation techniques has emerged as a promising approach to address such challenges, especially when working with limited datasets. Capturing sufficient and balanced data for intelligent fault diagnosis remains a significant issue across various domains. For instance, recent research by Liu et al. (2024) highlights how counterfactual-augmented few-shot contrastive learning can effectively identify faults even with limited and imbalanced samples by focusing on discriminative representations [9]. Similarly, the use of interpretable data-augmented adversarial variational autoencoders (AVAE) has been shown to improve diagnostic accuracy in imbalanced datasets, a common scenario in fault diagnosis tasks like those encountered in the wood processing industry [10]. These advancements underscore the importance of integrating such approaches into automated systems for tool condition monitoring in the wood industry, where consistent quality control is critical.
Explainable Artificial Intelligence (XAI) has gained significant attention in recent years due to its potential to address the black-box problem associated with advanced deep learning models, especially in critical applications such as healthcare, finance, and industrial automation. XAI enables the interpretation of AI model outputs, fostering trust and transparency in AI systems, which are crucial for their broader adoption.
Kim et al. [11] explored the application of Convolutional Neural Networks (CNNs) for partial discharge (PD) classification in cast-resin transformers. The study highlights the challenge posed by the lack of transparency in CNNs, which leads to difficulties in understanding and trusting model predictions. The authors propose the use of Grad-CAM, an XAI technique, to provide visual explanations for model decisions, thereby enhancing the reliability of PD classification systems. The research demonstrated a classification accuracy of approximately 97%, and the incorporation of XAI helped identify the criteria used by the model, ultimately leading to improved trust and robustness in the system.
Apicella et al. [12] focused on leveraging XAI not only for interpretability, but also to enhance the performance of machine learning models. The authors assessed well-known XAI techniques, such as Integrated Gradients, in the context of classification tasks using datasets like Fashion-MNIST, CIFAR10, and STL10. The study revealed that XAI methods could be used strategically to identify critical features that directly contribute to improving model accuracy. This approach suggests that XAI can serve as a dual-purpose tool—both for interpretability and performance enhancement.
In another study, Miller et al. [13] applied XAI to light direction reconstruction models used in augmented reality. The study emphasized the caution needed when interpreting XAI outputs due to potential unreliability. Nevertheless, by carefully leveraging the meta-information provided by XAI methods, the authors were able to enhance the architecture and training dataset of a regression model, leading to improved performance. This work underscores the role of XAI in guiding the optimization of model design and training.
Selvaraju et al. [14] introduced Grad-CAM, a widely used XAI technique that produces visual explanations for CNN-based models by highlighting relevant regions in input images. Grad-CAM is applicable to various CNN architectures, and has demonstrated its utility across multiple tasks, including image classification, captioning, and visual question answering (VQA). By combining Grad-CAM with fine-grained visualizations, the study achieved superior model interpretability and generalization, addressing key challenges such as adversarial robustness and dataset bias.
Barredo Arrieta et al. [15] presented a comprehensive review of the XAI field, providing a taxonomy of existing methods and highlighting challenges in achieving responsible AI. The review emphasizes the significance of explainability in machine learning models, particularly in deep learning, where the complexity of models often obscures their decision-making processes. The authors propose a framework for responsible AI that integrates fairness, transparency, and accountability, paving the way for the large-scale implementation of AI in real-world applications.
These studies collectively underscore the critical role of XAI in enhancing the transparency, reliability, and performance of AI models. The integration of XAI not only aids in the interpretability of complex models, but also provides avenues for improving model performance through more informed architectural and training choices.
The foundational work in tool condition monitoring dates back nearly 40 years to [16,17,18], focusing on identifying reliable signals and features for detecting tool wear during production without downtime. This area of research has evolved, applying various approaches [19,20,21,22,23,24,25,26,27], typically involving multiple sensors to detect changes in physical parameters from the machining zone.
Conventional methods usually require an array of sensors, such as acoustic emission, vibration testing, noise measurement, or evaluating cutting torque and feed force [28,29,30,31,32,33]. While these methods can be accurate, they are often complex and time-consuming, requiring careful sensor placement and calibration. Initial setup errors can lead to unstable and inaccurate results, especially if sensors must be repeatedly repositioned or recalibrated. These drawbacks make sensor-based approaches challenging, particularly in dynamic production environments where conditions frequently change.
Given the multitude of factors influencing drilling quality, machine learning (ML) algorithms emerge as a logical solution. In recent years, ML methods have gained traction in the wood industry for various applications, from monitoring machining processes [34,35,36,37,38] to wood species recognition. Tailored solutions or adaptable ML algorithms are available for specific problems.
In this research, the study relies on photographs of drilled holes as the primary dataset, working from the premise that the condition of the drill edges can reveal the wear on the tool, thereby indicating the appropriate timing for tool replacement. This theory has yielded encouraging outcomes through the application of various techniques and algorithms. Techniques based on deep learning and transfer learning have demonstrated efficiency and a strong alignment with this specific challenge [39,40,41,42,43]. Additional investigations propose that the implementation of classifier ensembles and data augmentation techniques could improve the overall performance [44,45], with various classifiers being evaluated and showing considerable potential for real-world usage.
Explainable AI (XAI) techniques, such as Class Activation Mapping (CAM), have been applied to identify points of concern in image-based Deep Neural Networks (DNNs) used for failure analysis in industrial applications [46,47,48,49,50]. The study highlights the importance of XAI tools in providing insights into the decision process of DNNs and emphasizes the need for further analysis of their correctness. Local Interpretable Model-agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (GRAD-CAM) have shown their utility in various industrial applications. CNN and GRAD-CAM techniques have been utilized in detecting defects and pinpointing their locations on industrial products. However, issues related to accurately identifying defective regions and the time required for processing have been noted [51,52,53,54]. Additionally, the application of LIME in the automated detection of fractures during material testing has been explored, offering insights into the capability for critical feature detection and the automation of crack detection.
Challenges in drill state recognition have been addressed by transfer learning using CNN, offering the advantage of training classification models with a small portion of data [19]. Data augmentation techniques combined with transfer learning have improved the accuracy of drill wear recognition, even with a small original dataset [20].
Despite the absence of direct information on the specific use of LIME and GRAD-CAM for drilled hole classification in melamine-faced chipboard, the insights from XAI techniques and their application in industrial contexts suggest their potential applicability. LIME and GRAD-CAM are valuable XAI techniques that provide insights into the decision-making process of CNN-based classification models for drilled holes in melamine-faced chipboard. These methods provide benefits regarding the accuracy of explanations and the visual representation of specific features. However, there are drawbacks that require further investigation in upcoming studies.
This paper explores the use of LIME and GRAD-CAM for enhancing the interpretability of CNN-based models in classifying drilled holes in melamine-faced chipboard. By applying these XAI techniques, we aim to better understand the model’s decision-making process, ensure its reliability, and improve the overall accuracy of the classification.

2. Materials and Methods

2.1. Data Collection

In the experiment, a Busellato automated Computerized Numerical Control (CNC) workstation (model Jet 100) was utilized. This system guaranteed repeatability in the experiment. The drilling process involved using a 12 mm diameter drill from FABA (model WP-01, as shown in Figure 1). The drill operated at a rotational speed of 4500 RPM and advanced at a feed rate of 1.35 m/min, following the recommendations provided by the manufacturer of the drill. The material subjected to the test was a standard 18 mm thick laminated board, commonly used in the furniture industry, produced by the Polish company KRONOPOL (model U 511).
The density profile for the chipboard was analyzed with a GreCon DAX device by Fagus-GreCon Greten GmbH & Co. KG (Alfeld, Germany), as illustrated in Figure 2. Additional characteristics of the material include a bending modulus of rupture at 15.4 MPa, a bending modulus of elasticity at 2950 MPa, and a Brinell surface hardness (HB) of 2.1. The evaluations of these material properties were conducted in accordance with standards [55,56], utilizing an Instron 3382 testing apparatus from Instron in Norwood, MA, USA, and a Brinell CV 3000LDB hardness testing equipment from the Bowers Group located in Camberley, UK, respectively.
For the deep learning training and testing stages, the condition of the drill bits was carefully observed and evaluated using a conventional workshop microscope equipped with a digital camera (model TM-505; manufactured by Mitutoyo, based in Kawasaki, Japan). In this context, the degree of wear observed on the external corner of the drill bits was considered a direct measure of their condition. This wear (denoted as W) was quantitatively measured for each drill bit’s two cutting edges, utilizing the formula provided below:
W = W0 − Wc
where W0 is the initial width of a brand-new cutting edge near the outer corner (mm), and Wc is the current width of the cutting edge near the outer corner (mm) (Figure 3).
Based on the measured wear W, the state of the drill is divided into three distinct classes:
  • “Green” tool wear class: W < 0.2 mm;
  • “Yellow” tool wear class: 0.2 mm ≤ W < 0.35 mm;
  • “Red” tool wear class: W ≥ 0.35 mm.
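For illustration, the class assignment can be expressed as a small decision rule; the following Python sketch applies the thresholds listed above (the helper function itself is illustrative and not part of the experimental code).

```python
def wear_class(w_mm: float) -> str:
    """Assign a tool wear class from the measured outer-corner wear W (in mm).

    Thresholds follow the class definitions above:
    Green  : W < 0.2 mm
    Yellow : 0.2 mm <= W < 0.35 mm
    Red    : W >= 0.35 mm
    """
    if w_mm < 0.2:
        return "Green"
    if w_mm < 0.35:
        return "Yellow"
    return "Red"


# Example: a drill with 0.27 mm of outer-corner wear falls into the "Yellow" class.
print(wear_class(0.27))  # -> Yellow
```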
Figure 1 illustrates the FABA WP-01 drill used in the experimental setup. The left part of the figure provides a top-down view of the drill bit, clearly showing the sharp cutting edges designed for efficient drilling into materials. This view emphasizes the precision and engineering behind the drill’s design, crucial for achieving clean and accurate holes in melamine-faced chipboard.
The right part of the figure offers a side view of the drill bit, showcasing its overall length and the angled cutting surfaces. This perspective highlights the drill bit’s structure, which is designed to maintain stability and effectiveness during high-speed rotations and feed rates. Understanding both the top and side views of the drill bit helps in comprehensively analyzing its performance and wear characteristics during the drilling process.
Figure 4 illustrates examples of drilled holes categorized into three different classes, based on the condition of the drill and the quality of the holes. The first row shows holes classified as “Green”, indicating that they were made with a new or minimally worn drill bit, resulting in clean and precise holes. The second row shows holes classified as “Yellow”, which were made with a moderately worn drill bit, resulting in slightly less clean holes with some rough edges. The third row shows holes classified as “Red”, indicating that they were made with a significantly worn drill bit, resulting in rough and irregular holes with noticeable defects. These examples are used for research analysis to train and evaluate the performance of the Convolutional Neural Network (CNN) models in classifying the condition of the drilled holes.
The evaluation of the drill’s condition was routinely performed using a Mitutoyo microscope (model TM-500), enabling the classification of each hole into one of three categories:
  • Good—represents a new, unworn drill that is suitable for continued use;
  • Worn—applies to a used drill that either needs manual inspection to determine its state or is still in a condition acceptable for ongoing production activities;
  • Requiring replacement—describes a used drill that is in a condition deemed unserviceable and must be replaced without delay.
Table 1 provides a summary of data collection and drill wear measurements. The table includes three columns: Drill Number, Green/Yellow/Red, and Total Images. The Drill Number column lists the five different drills used in the experiments, numbered from 1 to 5. The Green/Yellow/Red column shows the number of images categorized into three classes (Green, Yellow, Red) for each drill, indicating the condition of the drill bit. Specifically, Green represents a new or minimally worn drill bit, Yellow indicates a moderately worn drill bit, and Red signifies a significantly worn drill bit. The Total Images column provides the total number of images collected for each drill.
Drill 1 has 840 images classified as Green, 420 as Yellow, and 406 as Red, summing up to a total of 1666 images. Drill 2 has 840 Green, 700 Yellow, and 280 Red images, with a total of 1820 images. Drill 3 has 700 Green, 560 Yellow, and 420 Red images, totaling 1680 images. Drill 4 has 840 Green, 560 Yellow, and 280 Red images, also totaling 1680 images. Drill 5 has an equal distribution of 560 images in each category, amounting to 1680 images. The total count for all drills combined is 3780 Green, 2800 Yellow, and 1946 Red images, leading to an overall total of 8526 images.

2.2. Cost Analysis of Conventional Methods for Sensor Replacement and Calibration

In conventional machining processes, sensor-based monitoring systems play a critical role in detecting tool wear, ensuring production quality, and maintaining operational efficiency. However, these systems come with significant costs, both in terms of initial setup and ongoing maintenance. The costs associated with replacing or calibrating sensors can become substantial over time, especially in dynamic production environments.
One of the primary cost factors involves the periodic replacement of sensors. Sensors such as those used for acoustic emission, vibration analysis, and cutting torque measurement are subject to wear and degradation due to exposure to harsh machining conditions. Frequent replacement of these sensors is necessary to maintain accurate readings, which incurs direct material costs and the additional cost of halting production during the replacement process.
Another considerable expense arises from the need for regular calibration. Precision monitoring requires sensors to be regularly recalibrated to ensure accuracy, especially in environments where conditions change frequently. Calibration typically requires skilled technicians and specialized equipment, contributing to both labor and operational costs. Additionally, calibration processes may necessitate machine downtime, further increasing the overall expense.
A comparative study of conventional and AI-based monitoring systems highlights the cost-effectiveness of automated solutions. While AI-based approaches, such as those using Convolutional Neural Networks (CNNs) and Explainable AI (XAI) techniques, require initial investments in software and model training, they significantly reduce the ongoing costs associated with sensor management. AI-driven models can monitor tool wear and quality without the need for extensive sensor networks, thus minimizing both direct and indirect expenses.
In summary, conventional sensor-based methods for tool condition monitoring are associated with recurring costs related to sensor replacement, calibration, and production downtime. By integrating AI-based solutions, manufacturers can mitigate these costs while improving the reliability and efficiency of their monitoring systems.

2.3. CNN Models Architecture

In our research, we assessed the capabilities of three distinct Convolutional Neural Network (CNN) models—VGG16, VGG19, and ResNet101—for identifying and classifying drilled holes in melamine-faced chipboards, as referenced in various works [40,45,58,59,60]. These models were selected because of their proven proficiency in a variety of image analysis tasks. Herein, we detail each model’s architecture and the specific setups utilized in our study.
Introduced by Simonyan and Zisserman [61], the VGG16 architecture is favored for its straightforward design and depth, featuring 16 weight layers. This configuration comprises 13 convolutional layers paired with three fully connected layers at the end. It applies 3 × 3 convolutional filters through its layers, enabling it to discern intricate patterns in the visuals. Max-pooling layers are strategically placed to decrease the spatial dimensionality and computational demands.
Expanding upon VGG16, the VGG19 model [61] incorporates an additional three convolutional layers, totaling 19 weight layers. This enhancement is aimed at augmenting the model’s ability to learn more elaborate features, albeit at the expense of increased computational requirements and extended training periods.
The ResNet101 model, brought forth by He et al. [62], belongs to the Residual Networks lineage. It introduces an approach of using residual learning to facilitate the training of profoundly deep networks. With its 101 layers, ResNet101 surpasses both VGG16 and VGG19 in depth. Its residual blocks are designed to learn identity functions, aiding in overcoming the issue of vanishing gradients and permitting the cultivation of deeper networks.
Our investigation into the efficacy of various CNN architectures for the categorization of drilled holes in melamine-faced chipboards involved an analysis of three renowned pretrained models: VGG16, VGG19, and ResNet101. We took into account each architecture’s depth, scale, parameter count, and optimal input image dimensions to gauge their effectiveness and efficiency for our specific task.
These models were initially trained on the extensive ImageNet dataset, which encompasses a wide assortment of image categories. We then fine-tuned them on our targeted dataset of drilled holes in melamine-faced chipboards. The images were resized to dimensions of 224 × 224 pixels to conform to each model’s input requirements. To enhance the training dataset’s variety and fortify the models’ resilience, data augmentation techniques like rotation, flipping, and scaling were applied.
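As an illustration of this preprocessing step, the following Keras-style sketch resizes images to 224 × 224 pixels and applies rotation, flipping, and scaling augmentation; the directory name and the exact augmentation ranges are illustrative assumptions rather than values reported in this study.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation by rotation, flipping, and scaling (zoom); the parameter ranges
# below are illustrative and are not values reported in this study.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.1,
)

# Images are resized to the 224 x 224 input size expected by VGG16, VGG19, and ResNet101.
train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical folder with Green/Yellow/Red subfolders
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```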
Table 2 presents a synopsis of the salient features of the pretrained CNN models explored in this study [63,64,65,66,67,68,69,70,71,72,73,74,75,76,77].
The comparison of these pretrained CNN models is based on several criteria, including depth, model size, number of parameters, and input image size. Each of these factors plays a crucial role in determining the model’s performance, computational requirements, and ability to generalize to the classification task at hand.
  • Depth—ResNet101, with its 101 layers, is the deepest model among the three. The depth of a model often correlates with its capacity to learn complex features. However, deeper models also require more computational resources and can be more challenging to train effectively.
  • Size—VGG16 and VGG19 are significantly larger in terms of model size compared to ResNet101. Larger models can store more information, but they also require more memory and storage, which can be a limitation in resource-constrained environments.
  • Parameters—VGG16 and VGG19 have a much higher number of parameters (138 million and 144 million, respectively) compared to ResNet101 (44.6 million). While having more parameters can increase a model’s learning capacity, it also makes the model more prone to overfitting and increases the computational load.
  • Image input size—all models were adapted to accept input images of 224 × 224 pixels. This standardization ensures a fair comparison and aligns with the common practice in image classification tasks.
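For reference, the three backbones can be instantiated directly from keras.applications, as sketched below; the snippet simply downloads the ImageNet weights and prints the parameter counts, which correspond (up to rounding) to the values discussed above.

```python
from tensorflow.keras.applications import VGG16, VGG19, ResNet101

# Instantiate the three ImageNet-pretrained backbones at the 224 x 224 input size
# and print their parameter counts (cf. the comparison discussed above).
for builder in (VGG16, VGG19, ResNet101):
    model = builder(weights="imagenet", input_shape=(224, 224, 3))
    print(f"{builder.__name__}: {model.count_params() / 1e6:.1f} M parameters")
```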
The following sections provide detailed evaluation results and analysis of each model’s performance in classifying drilled holes in melamine-faced chipboard, highlighting their strengths and areas for improvement.

2.4. Techniques for Explainable AI (XAI)

In this section, we explore different methods applied to enhance the interpretability and explainability of AI models, especially focusing on Convolutional Neural Networks (CNNs). These methods are crucial for deciphering how these models arrive at their decisions and offer clarity on which aspects of the input data play a pivotal role in the model’s predictive outcomes. Our primary attention is directed towards two key techniques in Explainable AI (XAI): Local Interpretable Model-Agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (Grad-CAM).

2.4.1. Local Interpretable Model-Agnostic Explanations (LIME)

Interpreting the predictions made by deep neural networks can be quite challenging due to their inherent complexity. Local Interpretable Model-Agnostic Explanations (LIME) is a technique developed to shed light on the reasons behind the classification decisions of deep neural networks [53,78,79,80,81,82,83].
Deep neural networks are often likened to “black boxes” due to the complexity and opacity of their decision-making processes. To mitigate this, LIME approximates the behavior of these intricate networks with a model that is simpler and more interpretable, such as a regression tree. This approach makes it easier to understand which features in the input data are significant in influencing the decisions made by the network.
LIME is particularly beneficial in the realm of image classification. It enables the creation of feature importance maps using the following key steps:
  • Segmenting features—the input image is divided into segments based on similar pixel characteristics such as color, texture, or intensity.
  • Creating segment masks—a binary mask is generated for each segment, indicating whether the segment is included (1) or excluded (0) in the synthetic image.
  • Generating synthetic images—synthetic images are produced by modifying the original image according to the segment masks, where excluded segments are replaced with average pixel values to obscure their information.
  • Identifying synthetic images—every synthetic image undergoes analysis through the neural network to ascertain how the presence or lack of specific segments influences the result of the classification.
  • Fitting an interpretable model—to establish a simpler, understandable model like a regression tree, the link between the presence of segments and their classification scores is leveraged.
  • Determining feature importance—the simpler model helps identify the significance of each segment, which is then mapped back to the original image to highlight influential regions.
  • Mapping feature importance—the significant values identified from segments are applied back to the initial image. This process generates a map that highlights the areas of the image that were crucial for the model’s decision-making.
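To make these steps concrete, the following Python sketch implements a simplified version of the procedure using superpixel segmentation and a ridge-regression surrogate; the segment count, sample count, and surrogate model are illustrative choices and do not reproduce the exact LIME configuration used in this study.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def lime_importance(image, predict_fn, class_index, n_segments=50, n_samples=200, rng=None):
    """Rough LIME-style feature importance map following the steps above.

    `predict_fn` maps a batch of images to class probabilities; all parameter
    values here are illustrative defaults, not values used in the study.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # 1. Segment the image into superpixels of similar colour/texture.
    segments = slic(image, n_segments=n_segments, start_label=0)
    n_feats = segments.max() + 1

    # 2-3. Random binary masks decide which segments are kept in each synthetic image;
    #      excluded segments are replaced by the image mean to hide their information.
    masks = rng.integers(0, 2, size=(n_samples, n_feats))
    mean_color = image.mean(axis=(0, 1))
    synthetic = []
    for m in masks:
        img = image.copy()
        img[~m[segments].astype(bool)] = mean_color
        synthetic.append(img)

    # 4. Score every synthetic image with the black-box model.
    scores = predict_fn(np.stack(synthetic))[:, class_index]

    # 5-6. Fit a simple interpretable model linking segment presence to the scores.
    surrogate = Ridge(alpha=1.0).fit(masks, scores)
    importance = surrogate.coef_

    # 7. Map the segment importances back onto the pixel grid.
    return importance[segments]
```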
By applying LIME in our image classification tasks, we gain a clearer understanding of how our Convolutional Neural Network (CNN) models classify drilled holes in melamine-faced chipboard. This insight is crucial for validating the models’ performance and ensuring they focus on the relevant features of the images.
In our study, with the VGG19 model, LIME was utilized to explain both correct and incorrect classifications. Figure 5 and Figure 6 display the LIME-generated explanations for the VGG19 model’s predictions on images of drilled holes. These visualizations are invaluable for identifying the critical regions in the images that the model considered, thus allowing us to assess its reliability and identify areas for potential improvement.

2.4.2. Gradient-Weighted Class Activation Mapping (Grad-CAM)

Gradient-weighted Class Activation Mapping (Grad-CAM) serves as a technique in deep learning that facilitates the visualization and interpretation of the reasoning behind decisions made by Convolutional Neural Networks (CNNs) [84,85,86,87,88,89,90,91,92]. By leveraging the gradients of the target class heading into the last convolutional layer, Grad-CAM produces a localization map. This map accentuates the critical regions within an input image that play a pivotal role in the model’s predictive process.
The Grad-CAM technique involves several key steps:
  • Forward pass—the input image is processed through the CNN to extract feature maps from the last convolutional layer.
  • Gradient calculation—the gradients of the classification score with respect to the feature maps are computed. These gradients indicate the sensitivity of the output value of the target class to changes in the feature map values.
  • Weight averaging—the gradients are averaged to obtain weights, which signify the importance of each feature map channel for the target class.
  • Weighted combination—a weighted combination of the feature maps is performed using these weights, resulting in an initial coarse localization map.
  • ReLU activation—the ReLU activation function is applied to the weighted combination to retain only positive influences, highlighting the regions in the image that positively impact the target class score.
  • Upsampling—the coarse localization map is upsampled to match the dimensions of the input image, producing the final Grad-CAM heatmap.
This heatmap can be overlaid on the original image to visualize which parts of the image the network focused on during its prediction.
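A condensed TensorFlow/Keras sketch of these steps is given below; the default layer name refers to the last convolutional layer of Keras’ VGG16 (“block5_conv3”, or “block5_conv4” for VGG19), which is stated here as an assumption about the fine-tuned networks rather than a detail reported in the study.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, last_conv_layer_name="block5_conv3"):
    """Compute a Grad-CAM heatmap for a single image batch of shape (1, H, W, 3).

    "block5_conv3" is the last convolutional layer in Keras' VGG16
    ("block5_conv4" for VGG19); both names are assumptions about the
    fine-tuned networks used here.
    """
    # Model mapping the input to (last conv feature maps, class scores).
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )

    # Forward pass and gradient of the target class score w.r.t. the feature maps.
    with tf.GradientTape() as tape:
        conv_out, predictions = grad_model(image)
        class_score = predictions[:, class_index]
    grads = tape.gradient(class_score, conv_out)

    # Average gradients over batch and spatial dimensions to obtain channel weights.
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted combination of the feature maps, followed by ReLU.
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))

    # Normalise and upsample the coarse map to the input resolution.
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    cam = tf.image.resize(cam[..., tf.newaxis], (image.shape[1], image.shape[2]))
    return np.squeeze(cam.numpy())
```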
In our research, we utilized Grad-CAM with a customized VGG16 network developed for classifying drilled holes in melamine-faced chipboard. Our objective was to identify the specific areas the CNN model focuses on when making predictions. This approach helped us delve deeper into the model’s rationale and improved the clarity of our automated tool condition monitoring system.

3. Results and Discussion of CNN Models Classification

In this section, we disclose the findings from our study on Convolutional Neural Network (CNN) models applied to classify drilled holes in melamine-faced chipboard. We assessed the efficacy of each model by examining their accuracy, significant error frequencies, and confusion matrices over various folds. Detailed examinations of the outcomes for the VGG16, VGG19, and ResNet101 models are elaborated in the subsequent sections.
The evaluation metrics used in this study include:
  • Accuracy—refers to the ratio of correctly identified instances to the overall number of instances.
  • Critical errors—instances where the model misclassifies a “Green” hole as “Red” or vice versa, which are particularly significant due to their potential impact in practical applications.
  • Confusion matrix—a table that provides a summary of the model’s performance by displaying the count of accurate and inaccurate forecasts for every category.
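Because critical errors can be read directly off the confusion matrix, they are straightforward to compute; the short sketch below assumes the class order Green, Yellow, Red and uses placeholder labels purely for illustration.

```python
from sklearn.metrics import confusion_matrix

CLASSES = ["Green", "Yellow", "Red"]  # assumed class order of the confusion matrix

def critical_errors(y_true, y_pred):
    """Count Green->Red and Red->Green misclassifications (the critical errors)."""
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    g, r = CLASSES.index("Green"), CLASSES.index("Red")
    return int(cm[g, r] + cm[r, g])

# Placeholder labels purely for illustration: one "Green" hole predicted as "Red".
y_true = ["Green", "Green", "Red", "Yellow"]
y_pred = ["Green", "Red", "Red", "Yellow"]
print(critical_errors(y_true, y_pred))  # -> 1
```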
Every model underwent training and assessment through a 5-fold cross-validation method, in which the dataset was segmented into five parts. Each part served as a test set on one occasion, with the rest of the parts merged to create the training set. Such a strategy guarantees a thorough appraisal of the model’s effectiveness.
The subsequent subsections present comprehensive findings for each model, encompassing their precision, significant error frequencies, and confusion matrices. These findings are then examined in detail to pinpoint the advantages and disadvantages of each model and to identify the most appropriate model for the classification task.

3.1. Numerical Experiments

Figure 7 presents a flowchart illustrating the workflow for the numerical experiments conducted in this study. The process began with the acquisition of data, which included collecting images of drilled holes categorized into three classes: Green, Yellow, and Red. The next step was data preparation, where the dataset was organized and augmented to enhance model training.
Three CNN models—VGG16, VGG19, and ResNet101—were then constructed and fine-tuned for the classification task. After model evaluation, the best-performing model, which in this study was VGG19, was selected for further analysis. Explainable AI (XAI) techniques, specifically Grad-CAM and LIME, were applied to this model to gain insights into its decision-making process.
The insights gathered from these XAI techniques were then used to adjust and improve the input images or training process, after which the entire cycle was repeated if necessary. This iterative process ensures continuous model refinement and improved classification performance.
In this section, we detail the numerical tests carried out to assess how well CNN models can classify drilled holes in melamine-faced chipboard. We applied a 5-fold cross-validation method (involving five different drills), dividing the dataset into five parts. Each part was alternately used as a test set, with the other four parts serving for training.
We arranged the image data into five separate folders, each folder corresponding to distinct data collection dates. The folders included images classified into three categories: “Green”, “Yellow”, and ”Red”, indicating the state of the drilled holes.
For cross-validation, we generated five folds. In every fold, a single subset was selected as the test set, and the rest were amalgamated to create the training set. This strategy guaranteed that each subset was utilized as a test set once, offering a thorough assessment of the model’s effectiveness.
For each fold, we performed the following steps:
  • Data preparation—the images in the training set were loaded and labeled accordingly.
  • Model configuration—we used the pretrained CNN network, modifying its final layers to fit our classification task. The fully connected layer was adjusted to have three outputs corresponding to the three classes, and the final classification layer was added.
  • Training—the network was trained using stochastic gradient descent with momentum (SGDM) for 10 epochs. Training parameters such as the mini-batch size (32) and initial learning rate (1 × 10⁻⁴) were optimized to achieve the best performance.
  • Evaluation—the trained model was evaluated on the test set for each fold. We recorded the accuracy and generated confusion matrices to analyze the classification performance.
We have applied a 5-fold cross-validation approach, which involved splitting the dataset into training and test sets in such a way that each fold had data from one drill reserved for testing, while the remaining four drills’ data were used for training. This division was suggested by industry experts to ensure the robustness and generalization of the models across different tools.
Table 3 provides a summary of the data distribution in each of the 5 folds. Each fold includes a test set from one drill and a training set from the remaining four drills. The table presents the number of images in both the training and test sets for each fold.
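A minimal sketch of this leave-one-drill-out split is given below; the directory names (drill_1 … drill_5) are hypothetical and stand in for the five folders of collected images.

```python
from pathlib import Path

DATA_ROOT = Path("data")                      # hypothetical root folder of the image dataset
DRILLS = [f"drill_{i}" for i in range(1, 6)]  # five drills, one folder per drill

def leave_one_drill_out_folds():
    """Yield (fold_index, train_dirs, test_dir) for the 5-fold, drill-wise split."""
    for fold, test_drill in enumerate(DRILLS, start=1):
        test_dir = DATA_ROOT / test_drill
        train_dirs = [DATA_ROOT / d for d in DRILLS if d != test_drill]
        yield fold, train_dirs, test_dir

for fold, train_dirs, test_dir in leave_one_drill_out_folds():
    print(f"Fold {fold}: test on {test_dir.name}, train on {[d.name for d in train_dirs]}")
```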

3.2. Framework and Implementation Details

This section outlines the framework and implementation details for the use of Convolutional Neural Networks (CNNs) including VGG16, VGG19, and ResNet101 models, as well as Explainable AI (XAI) techniques such as Grad-CAM and LIME. The entire workflow was implemented in Python 3.10, utilizing well-established libraries and frameworks commonly used for deep learning and model interpretability.

3.2.1. Framework Overview

The models and XAI techniques were developed and executed using Python 3.10. The primary libraries used include:
  • TensorFlow and Keras. These were used for building, training, and fine-tuning the CNN models (VGG16, VGG19, and ResNet101). TensorFlow and Keras offer high-level abstractions for defining and training deep neural networks.
  • OpenCV and Pillow. These libraries were used for image preprocessing, including resizing, normalization, etc.
  • Matplotlib and Seaborn. Visualization of the training process, evaluation metrics, and the output of the XAI techniques was performed using these libraries.
  • scikit-learn. For data splitting, evaluation metrics, and auxiliary tasks such as generating cross-validation folds.

3.2.2. Implementation of CNN Models

The VGG16, VGG19, and ResNet101 models were implemented using the Keras API, which provides pre-trained models accessible from the keras.applications module. The models were fine-tuned to the task of classifying drilled holes into three classes: Green, Yellow, and Red.
The training process involved applying data augmentation techniques such as random rotations, flips, and scaling to increase dataset variability. Stochastic Gradient Descent (SGD) with momentum was used as the optimizer, while categorical cross-entropy was applied as the loss function.
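The following sketch outlines this setup for VGG19 (the same pattern applies to VGG16 and ResNet101); the learning rate, mini-batch size, and epoch count follow the values reported in Section 3.1, while the momentum value and the size of the added dense layer are assumptions made for illustration.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG19

NUM_CLASSES = 3  # Green, Yellow, Red

# ImageNet-pretrained convolutional base without the original classification head.
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# New head with three outputs replacing the original fully connected layers.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),        # layer size is an illustrative assumption
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# SGD with momentum (momentum value assumed), categorical cross-entropy loss.
model.compile(
    optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Training for 10 epochs with mini-batches of 32 images, e.g.:
# model.fit(train_generator, validation_data=test_generator, epochs=10)
```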

3.2.3. Explainable AI (XAI) Techniques

To interpret the decisions of the CNN models, two XAI techniques were integrated: Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME).
Grad-CAM was implemented using TensorFlow and Keras. This technique generates class-discriminative localization maps by computing gradients of the target class with respect to the feature maps in the final convolutional layer. The resulting heatmaps highlight the regions of the input image that contribute the most to the model’s prediction. The implementation followed standard practices, using tf.GradientTape() to compute gradients and overlaying the heatmaps on the original images for visualization.
LIME was implemented using the lime package in Python. This technique approximates the model’s behavior by generating perturbed versions of the input image and fitting an interpretable model (such as a linear model) to the local predictions. The LIME explanations provide a feature importance map, indicating which parts of the image were critical for the classification decision. The LimeImageExplainer class was used for generating explanations specific to image data.
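A minimal usage sketch with the lime package is shown below; the stub classifier, the random test image, and the sampling parameters are illustrative stand-ins for the fine-tuned model and the settings actually used.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

# In the actual pipeline `classifier_fn` would be the fine-tuned model's predict
# function, e.g. lambda batch: model.predict(batch); a constant stub is used here
# so the sketch runs stand-alone.
def classifier_fn(batch):
    return np.tile([0.2, 0.5, 0.3], (len(batch), 1))  # fake Green/Yellow/Red scores

image = np.random.rand(224, 224, 3)  # stand-in for one preprocessed hole image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,
    classifier_fn,
    top_labels=3,
    hide_color=0,       # excluded segments are replaced by a constant value
    num_samples=200,    # number of perturbed (synthetic) images; illustrative value
)

# Overlay the most influential superpixels for the top predicted class.
top_class = explanation.top_labels[0]
lime_img, mask = explanation.get_image_and_mask(
    top_class, positive_only=True, num_features=5, hide_rest=False
)
overlay = mark_boundaries(lime_img, mask)
```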

3.2.4. Model Evaluation and Interpretation

The models were evaluated using 5-fold cross-validation, with accuracy, precision, recall, and F1-score as the primary metrics. Grad-CAM and LIME were applied both to correctly classified and misclassified samples, providing insights into model behavior and highlighting areas for potential improvement.
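These per-class metrics can be obtained with scikit-learn’s classification_report, as in the brief sketch below; the labels shown are placeholders rather than results from the study.

```python
from sklearn.metrics import classification_report

CLASSES = ["Green", "Yellow", "Red"]

# Placeholder labels purely for illustration; in the study these are the
# ground-truth wear classes and the per-fold CNN predictions.
y_true = ["Green", "Yellow", "Red", "Green", "Red", "Yellow"]
y_pred = ["Green", "Red", "Red", "Green", "Green", "Yellow"]

print(classification_report(y_true, y_pred, labels=CLASSES, digits=4))
```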
By combining the power of CNN architectures with explainability methods like Grad-CAM and LIME, the study not only achieved reasonable classification accuracy, but also offered transparency into the decision-making process, which is crucial for industrial applications.

3.3. Results for VGG16 Model

The evaluation results for the VGG16 model are presented in Table 4. The model’s accuracy varied across the folds, with the highest accuracy of 71.00% observed in the first fold and the lowest accuracy of 56.13% observed in the fourth fold. The overall accuracy across all folds was 66.60%.
The evaluation metrics for the VGG16 model, as presented in Table 5, provide a detailed analysis of the model’s performance across three classes: “Green”, “Yellow”, and “Red”. The table highlights key indicators such as precision, sensitivity, specificity, accuracy, and F1 score.
For the “Green” class, the model demonstrates strong performance with a precision of 79.58%, a sensitivity of 84.05%, and an F1 score of 81.76%. The high values in these metrics suggest that the model is effective in correctly identifying and classifying instances as “Green”, with a balanced trade-off between precision and recall.
In contrast, the “Yellow” class shows a significant drop in performance. The precision and sensitivity are 51.97% and 52.86%, respectively, with a corresponding F1 score of 52.41%. These results indicate that the model struggles to correctly classify instances in the “Yellow” category, leading to a higher number of misclassifications. The specificity for the “Yellow” class is higher at 76.11%, reflecting a better ability to distinguish “Yellow” instances from non-“Yellow” instances, but overall accuracy remains lower at 68.47%.
The “Red” class also faces challenges, with a precision of 60.62% and a sensitivity of 52.52%. Although the specificity is high at 89.91%, the F1 score of 56.28% reflects the model’s difficulty in accurately classifying instances as “Red”, with a tendency to misclassify these instances as either “Green” or “Yellow”.
Overall, the table indicates that the model performs well for the “Green” class, but struggles to maintain the same level of accuracy and precision for the “Yellow” and “Red” classes. These findings highlight areas for improvement, especially in refining the classification of intermediate and highly worn drill holes.
The VGG16 model’s confusion matrix, depicted in Figure 8, offers an in-depth examination of its classification efficacy over various folds. This matrix displays the count of instances that were accurately and inaccurately categorized for each class.
In the confusion matrix, the true positive rates for the classes “Green”, “Yellow”, and “Red” are as follows:
  • The “Green” class had 3177 correctly classified instances, with 531 instances misclassified as “Yellow” and 72 instances as “Red”.
  • The “Yellow” class had 1480 correctly classified instances, with 728 instances misclassified as “Green” and 592 instances as “Red”.
  • The “Red” class had 1022 correctly classified instances, with 837 instances misclassified as “Yellow” and 87 instances as “Green”.
These results indicate that the VGG16 model achieved moderate accuracy, with varying degrees of misclassification among the classes. The model showed a higher tendency to misclassify instances of the “Yellow” class, particularly as “Green” and “Red”.
Critical errors refer to the misclassification of instances where the model incorrectly identifies a “Green” drill hole as “Red” or vice versa. Such errors are particularly concerning due to their potential implications in real-world applications.
  • Fold 1—the model demonstrated a relatively high accuracy of 71.00%. Critical errors included one instance where a “Green” hole was classified as “Red” and three instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was four.
  • Fold 2—the accuracy for this fold was 70.82%. Critical errors increased significantly, with eight instances of “Green” holes misclassified as “Red”, and 11 instances of “Red” holes misclassified as “Green”. The total number of critical errors in this fold was 19.
  • Fold 3—this fold showed a lower accuracy of 63.98%. There was one instance where a “Green” hole was misclassified as “Red”, and a substantial number of 58 instances where “Red” holes were misclassified as “Green”. The total number of critical errors was 59.
  • Fold 4—the accuracy dropped to 56.13% in this fold. A significant number of 61 instances where “Green” holes were classified as “Red” were observed. No instances of “Red” holes misclassified as “Green” were recorded. The total number of critical errors was 61.
  • Fold 5—the accuracy for this fold was 70.77%. Critical errors included one instance where a “Green” hole was classified as “Red”, and 15 instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was 16.
The analysis of critical errors reveals several insights into the performance of the VGG16 model:
  • Consistency in Green to Red misclassification—across all folds, there is a consistent occurrence of “Green” holes being misclassified as “Red”. This misclassification is particularly problematic, as it could lead to unnecessary corrective actions or rework.
  • Variation in Red to Green misclassification—the misclassification of “Red” holes as “Green” varies significantly across folds. For instance, Fold 4 had no such misclassifications, while Fold 3 had a substantial number (58 instances).
  • Impact on overall accuracy—the high number of critical errors, especially in certain folds, significantly impacts the overall accuracy of the model. Folds with higher critical errors generally correspond to lower accuracy rates.
  • Need for model improvement—the analysis suggests that while the VGG16 model shows promise, there is a need for further refinement to reduce the critical error rate. This could involve tuning the model parameters, using more advanced architectures, or incorporating additional data preprocessing techniques.
By addressing these critical errors, the reliability and applicability of the VGG16 model for classifying drilled holes in melamine-faced chipboard can be enhanced, leading to more consistent and accurate results in practical applications.

3.4. Results for VGG19 Model

The evaluation results for the VGG19 model are presented in Table 6. The model’s accuracy varied across the folds, with the highest accuracy of 71.60% observed in the fifth fold and the lowest accuracy of 52.38% observed in the fourth fold. The overall accuracy across all folds was 67.03%.
The evaluation metrics for the VGG19 model, as shown in Table 7, provide an in-depth analysis of the model’s performance across the “Green”, “Yellow”, and “Red” classes. The table includes metrics such as precision, sensitivity, specificity, accuracy, and F1 score, which give a comprehensive understanding of the model’s classification capabilities.
For the “Green” class, the model exhibits strong performance, with a precision of 84.86%, sensitivity of 78.89%, and an F1 score of 81.77%. The high values for specificity (88.79%) and accuracy (84.40%) suggest that the model effectively identifies and classifies instances belonging to the “Green” category, with minimal misclassifications.
The “Yellow” class shows moderate performance, with a precision of 51.51% and sensitivity of 58.32%. The F1 score of 54.71% indicates that the model struggles with distinguishing “Yellow” instances, often confusing them with the other classes. The specificity of 73.16% and accuracy of 68.29% further highlight the challenges the model faces in correctly classifying “Yellow” instances, resulting in a higher rate of misclassification.
For the “Red” class, the model achieves a precision of 59.72% and a sensitivity of 56.53%. Although the specificity is high at 88.72%, indicating a good ability to correctly exclude non-“Red” instances, the F1 score of 58.08% reflects the model’s difficulty in accurately classifying “Red” instances. The accuracy for the “Red” class stands at 81.37%, suggesting some degree of reliability, though the model still struggles with edge cases.
In summary, while the VGG19 model shows robust performance for the “Green” class, it has noticeable difficulties in accurately classifying the “Yellow” and “Red” classes. These results point to the need for further fine-tuning and model improvements, particularly for intermediate and heavily worn drill conditions.
The confusion matrix of the VGG19 model, depicted in Figure 9, offers a comprehensive analysis of its classification accuracy across all folds. It details the count of both accurately and inaccurately classified instances by class.
In the confusion matrix, the true positive rates for the classes “Green”, “Yellow”, and “Red” are as follows:
  • The “Green” class had 2982 correctly classified instances, with 719 instances misclassified as “Yellow” and 79 instances as “Red”.
  • The “Yellow” class had 1633 correctly classified instances, with 504 instances misclassified as “Green” and 663 instances as “Red”.
  • The “Red” class had 1100 correctly classified instances, with 818 instances misclassified as “Yellow” and 28 instances as “Green”.
These results indicate that the VGG19 model achieved moderate accuracy, with varying degrees of misclassification among the classes. The model showed a higher tendency to misclassify instances of the “Yellow” class, particularly as “Green” and “Red”.
Critical errors refer to the misclassification of instances where the model incorrectly identifies a “Green” drill hole as “Red”, or vice versa. Such errors are particularly concerning due to their potential implications in real-world applications.
  • Fold 1—the model demonstrated a relatively high accuracy of 69.50%. Critical errors included two instances where a “Green” hole was classified as “Red”, and two instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was four.
  • Fold 2—the accuracy for this fold was 70.76%. Critical errors increased significantly, with 19 instances of “Green” holes misclassified as “Red”, and five instances of “Red” holes misclassified as “Green”. The total number of critical errors in this fold was 24.
  • Fold 3—this fold showed an accuracy of 70.59%. There was one instance where a “Green” hole was misclassified as “Red”, and 10 instances where “Red” holes were misclassified as “Green”. The total number of critical errors was 11.
  • Fold 4—the accuracy dropped to 52.38% in this fold. A significant number of 57 instances where “Green” holes were classified as “Red” were observed. No instances of “Red” holes misclassified as “Green” were recorded. The total number of critical errors was 57.
  • Fold 5—the accuracy for this fold was 71.60%. Critical errors included no instances where a “Green” hole was classified as “Red”, and 11 instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was 11.
The analysis of critical errors reveals several insights into the performance of the VGG19 model:
  • Consistency in Green to Red misclassification—across nearly all folds, there is a recurring occurrence of “Green” holes being misclassified as “Red”. This misclassification is particularly problematic, as it could lead to unnecessary corrective actions or rework.
  • Variation in Red to Green misclassification—the misclassification of “Red” holes as “Green” varies significantly across folds. For instance, Fold 4 had no such misclassifications, while Fold 5 had a higher number (11 instances).
  • Impact on overall accuracy—the high number of critical errors, especially in certain folds, significantly impacts the overall accuracy of the model. Folds with higher critical errors generally correspond to lower accuracy rates.
  • Need for model improvement—the analysis suggests that while the VGG19 model shows promise, there is a need for further refinement to reduce the critical error rate. This could involve tuning the model parameters, using more advanced architectures, or incorporating additional data preprocessing techniques.
By addressing these critical errors, the reliability and applicability of the VGG19 model for classifying drilled holes in melamine-faced chipboard can be enhanced, leading to more consistent and accurate results in practical applications.

3.5. Results for Resnet101 Model

The evaluation results for the Resnet101 model are presented in Table 8. The model’s accuracy varied across the folds, with the highest accuracy of 66.42% observed in the fifth fold and the lowest accuracy of 56.04% observed in the second fold. The overall accuracy across all folds was 61.19%.
The evaluation metrics for the ResNet101 model, as presented in Table 9, provide a comprehensive analysis of the model’s performance across the “Green”, “Yellow”, and “Red” classes. These metrics include precision, sensitivity (recall), specificity, accuracy, and F1 score, offering a detailed view of how well the model performs for each class.
For the “Green” class, the model achieves a precision of 81.45% and a sensitivity of 80.48%. The F1 score of 80.96% indicates a balanced performance in identifying “Green” instances, with relatively few misclassifications. The model’s specificity of 85.40% and overall accuracy of 83.22% suggest that it is generally reliable in distinguishing the “Green” class from others.
The “Yellow” class poses a significant challenge for the model, as reflected by the lower precision of 47.52% and sensitivity of 36.96%. The F1 score of 41.58% highlights the model’s struggle to accurately classify “Yellow” instances, often leading to confusion with the “Green” and “Red” classes. Although the specificity for the “Yellow” class is reasonably high at 80.04%, the overall accuracy is lower at 65.89%, indicating that a considerable number of “Yellow” instances are misclassified.
For the “Red” class, the model shows moderate performance with a precision of 46.23% and a sensitivity of 62.08%. The F1 score of 52.99% reflects the model’s tendency to misclassify “Red” instances, although it performs slightly better than for the “Yellow” class. The specificity of 78.65% and accuracy of 74.87% suggest that the model can somewhat differentiate “Red” instances, but still faces challenges in correctly classifying all of them.
In summary, while the ResNet101 model demonstrates reasonably strong performance for the “Green” class, its effectiveness decreases significantly for the “Yellow” and “Red” classes. The lower precision and F1 scores for these classes indicate a need for further refinement and optimization of the model, particularly in addressing the classification challenges associated with intermediate and heavily worn drill conditions.
The confusion matrix for the Resnet101 model, as shown in Figure 10, provides a detailed breakdown of the classification performance across all folds. The matrix indicates the number of correctly and incorrectly classified instances for each class.
In the confusion matrix, the true positive rates for the classes “Green”, “Yellow”, and “Red” are as follows:
  • The “Green” class had 2780 correctly classified instances, with 846 instances misclassified as “Yellow” and 256 instances as “Red”.
  • The “Yellow” class had 1394 correctly classified instances, with 623 instances misclassified as “Green” and 938 instances as “Red”.
  • The “Red” class had 962 correctly classified instances, with 906 instances misclassified as “Yellow” and 77 instances as “Green”.
These results indicate that the Resnet101 model achieved moderate accuracy, with varying degrees of misclassification among the classes. The model showed a higher tendency to misclassify instances of the “Yellow” class, particularly as “Green” and “Red”.
Critical errors refer to the misclassification of instances where the model incorrectly identifies a “Green” drill hole as “Red” or vice versa. Such errors are particularly concerning due to their potential implications in real-world applications.
  • Fold 1—the model demonstrated a relatively high accuracy of 66.26%. Critical errors included 26 instances where a “Green” hole was classified as “Red”, and 13 instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was 39.
  • Fold 2—the accuracy for this fold was 56.04%. Critical errors increased significantly, with 158 instances of “Green” holes misclassified as “Red”, and seven instances of “Red” holes misclassified as “Green”. The total number of critical errors in this fold was 165.
  • Fold 3—this fold showed an accuracy of 65.59%. There were five instances where a “Green” hole was misclassified as “Red”, and 39 instances where “Red” holes were misclassified as “Green”. The total number of critical errors was 44.
  • Fold 4—the accuracy dropped to 56.13% in this fold. A significant number of “Green” holes were classified as “Red” (64 instances), while only two instances of “Red” holes misclassified as “Green” were recorded. The total number of critical errors was 66.
  • Fold 5—the accuracy for this fold was 66.42%. Critical errors included three instances where a “Green” hole was classified as “Red”, and 16 instances where “Red” holes were classified as “Green”. The total number of critical errors in this fold was 19.
The analysis of critical errors reveals several insights into the performance of the ResNet101 model:
  • Consistency in Green to Red misclassification—across all folds, there is a consistent occurrence of “Green” holes being misclassified as “Red”. This misclassification is particularly problematic as it could lead to unnecessary corrective actions or rework.
  • Variation in Red to Green misclassification—the misclassification of “Red” holes as “Green” varies significantly across folds. For instance, Fold 4 had only two such misclassifications, while Fold 3 had a higher number (39 instances).
  • Impact on overall accuracy—the high number of critical errors, especially in certain folds, significantly impacts the overall accuracy of the model. Folds with higher critical errors generally correspond to lower accuracy rates.
  • Need for model improvement—the analysis suggests that while the ResNet101 model shows promise, there is a need for further refinement to reduce the critical error rate. This could involve tuning the model parameters, using more advanced architectures, or incorporating additional data preprocessing techniques.
By addressing these critical errors, the reliability and applicability of the ResNet101 model for classifying drilled holes in melamine-faced chipboard can be enhanced, leading to more consistent and accurate results in practical applications.
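As an illustration of how the per-fold critical-error counts discussed above can be computed from true and predicted labels, a minimal sketch is given below; the integer label encoding and variable names are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np

GREEN, YELLOW, RED = 0, 1, 2   # assumed integer encoding of the three wear classes

def critical_errors(y_true, y_pred):
    """Count Green predicted as Red, Red predicted as Green, and their sum."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    green_as_red = int(np.sum((y_true == GREEN) & (y_pred == RED)))
    red_as_green = int(np.sum((y_true == RED) & (y_pred == GREEN)))
    return green_as_red, red_as_green, green_as_red + red_as_green

# Dummy labels for a single fold (values are illustrative only).
y_true = [GREEN, GREEN, RED, YELLOW, RED]
y_pred = [RED,   GREEN, GREEN, YELLOW, RED]
print(critical_errors(y_true, y_pred))   # -> (1, 1, 2)
```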

3.5.1. Comparison of VGG16, VGG19, and ResNet101 Models

The evaluation of the VGG16, VGG19, and ResNet101 models focused on assessing their accuracy and critical error rates. A summary of the evaluation metrics is provided in Table 10.
The VGG16 architecture reached a mean accuracy of 66.60%, accompanied by 159 critical errors. The VGG19 architecture demonstrated a modest improvement, attaining a mean accuracy of 67.03% while registering 107 critical errors, marking it as a more accurate and reliable choice than VGG16. The ResNet101 model recorded the weakest performance, with an average accuracy of 61.19% and the greatest number of critical errors, totaling 333.

3.5.2. Selection of the Best CNN Model

Based on the comparison, the VGG19 model stands out as the best among the three models evaluated. It not only achieved the highest average accuracy, but also had the lowest number of critical errors. The consistency of its performance across different folds further supports its robustness and reliability.
In practical applications, the lower critical error rate of the VGG19 model is particularly advantageous, as it reduces the risk of misclassifications that could lead to costly errors in the manufacturing processes. Therefore, the VGG19 model is selected as the best model for the classification of drilled holes in melamine-faced chipboard.
In conclusion, the VGG19 model demonstrated superior performance in terms of accuracy and critical error rate compared to the VGG16 and ResNet101 models. Its consistent performance across different folds and lower propensity for critical errors make it the most reliable choice for the classification task. Future work will focus on further refining the VGG19 model to enhance its accuracy and reduce misclassification rates.

4. Results and Discussion of Explainable AI (XAI) for VGG19 Model

In this section, we discuss the application of two Explainable AI (XAI) techniques, namely Local Interpretable Model-agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (Grad-CAM), to the VGG19 model. The aim is to interpret the model’s decisions and identify the most influential features contributing to the classification of drilled holes in melamine-faced chipboard.
  • LIME Results
Figure 5 shows LIME-generated explanations for the VGG19 model’s correctly classified predictions on images of drilled holes. The top row (a–c) shows the original images, and the bottom row (d–f) shows the corresponding LIME explanations:
  • (a) Actual class: Green, predicted class: Green. The LIME analysis in (d) underscores the sections of the picture that had the most significant impact on the model’s identification of the “Green” class.
  • (b) Actual class: Yellow, predicted class: Yellow. The LIME explanation in (e) indicates the areas that influenced the model’s decision to classify the image as “Yellow”.
  • (c) Actual class: Red, predicted class: Red. The LIME explanation in (f) shows the key features that led the model to correctly identify the image as belonging to the “Red” class.
Figure 6 shows LIME-generated explanations for the VGG19 model’s misclassified predictions on images of drilled holes. The top row (a–c) shows the original images, and the bottom row (d–f) shows the corresponding LIME explanations:
  • (a) Actual class: Green, predicted class: Red. The LIME explanation in (d) highlights the regions of the image that contributed most to the model’s prediction of the “Red” class. Based on the explanation image, it is evident that the condition of this hole corresponds to Red rather than Green, so the VGG19 model’s prediction (66% for the Red class) was reasonable. The problem is that even with a sharp drill bit, damage (artifacts in the image) can occur due to locally poor material quality or other issues not caused by drill bit wear.
  • (b) Actual class: Green, predicted class: Red. The LIME explanation in (e) indicates the areas that influenced the model’s decision to classify the image as “Red”. Similarly, in this case, the state according to the image should be Red/Yellow. The model decided on the Red class by a narrow margin (50% for Red, 49% for Yellow).
  • (c) Actual class: Red, predicted class: Green. The LIME explanation in (f) shows the key features that led the model to identify the image as belonging to the “Green” class. In this case, the problem lies with the model; it should have assigned the Red or Yellow class. However, it assigned 58% to the Green class, 32% to the Yellow class, and only 10% to the Red class. The issue might be the larger number of artifacts at the bottom left of the drilled hole.
  • Grad-CAM Results
Figure 11 shows Grad-CAM-generated explanations for the VGG19 model’s correctly classified predictions on images of drilled holes. The top row (a–c) shows the original images, and the bottom row (d–f) shows the corresponding Grad-CAM explanations:
  • (a) Actual class: Green, predicted class: Green. The Grad-CAM explanation in (d) highlights the regions of the image that contributed most to the model’s prediction of the “Green” class.
  • (b) Actual class: Yellow, predicted class: Yellow. The Grad-CAM explanation in (e) indicates the areas that influenced the model’s decision to classify the image as “Yellow”.
  • (c) Actual class: Red, predicted class: Red. The Grad-CAM explanation in (f) shows the key features that led the model to correctly identify the image as belonging to the “Red” class.
Figure 12 shows Grad-CAM-generated explanations for the VGG19 model’s misclassified predictions on images of drilled holes. The top row (a–c) shows the original images, and the bottom row (d–f) shows the corresponding Grad-CAM explanations:
  • (a) Actual class: Green, predicted class: Red. The Grad-CAM explanation in (d) highlights the regions of the image that contributed most to the model’s prediction of the “Red” class. Here, it is evident that based on the image alone, the class should be Red, and the model performed very well with its decision (66% for the Red class, 32% for the Yellow class). The problem seems to be related to issues other than a worn drill bit, such as local poor quality of the material or other reasons not necessarily connected to a blunt drill bit.
  • (b) Actual class: Green, predicted class: Red. The Grad-CAM explanation in (e) indicates the areas that influenced the model’s decision to classify the image as “Red”. Similarly, in this case, the model made a very good decision, indicating the class should be Red (50%) or Yellow (49%).
  • (c) Actual class: Red, predicted class: Green. The Grad-CAM explanation in (f) shows the key features that led the model to identify the image as belonging to the “Green” class. In this case, the model did not perform well. The class should be Red, or possibly Yellow (based on the drilled hole image alone), but the model assigned the class as Green (58%) or Yellow (32%), with only 10% for the Red class. The issue might be the greater number of artifacts at the bottom left of the drilled hole.

4.1. Comparison of LIME and Grad-CAM Explanations for VGG19

The explanations generated by Local Interpretable Model-agnostic Explanations (LIME) and Gradient-weighted Class Activation Mapping (Grad-CAM) provide insights into the VGG19 model’s decision-making process. This section compares the two techniques based on their ability to explain the classifications of drilled holes in melamine-faced chipboard.
LIME produces explanations by perturbing the input data and observing the changes in the model’s predictions. It generates local approximations of the model’s behavior, highlighting the features most influential in a particular decision. Grad-CAM, on the other hand, uses the gradients of the target class flowing into the final convolutional layer to create a localization map highlighting the important regions in the input image.
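The paper does not specify the software stack used, so the following is only a minimal sketch of how both techniques can be applied to a fine-tuned VGG19 classifier using TensorFlow/Keras and the lime package; the model file, image path, class ordering, and convolutional layer name are assumptions for illustration, not artifacts of this study.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import preprocess_input
from lime import lime_image
from skimage.segmentation import mark_boundaries

CLASS_NAMES = ["Green", "Yellow", "Red"]                    # assumed label order
model = tf.keras.models.load_model("vgg19_drill_holes.h5")  # hypothetical fine-tuned VGG19

def predict_fn(images):
    """Wrapper required by LIME: raw RGB images -> class probabilities."""
    batch = preprocess_input(np.array(images, dtype=np.float32))
    return model.predict(batch, verbose=0)

image = np.array(tf.keras.utils.load_img("hole.png", target_size=(224, 224)))

# ---- LIME: perturb superpixels, fit a local surrogate, keep the top regions ----
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, predict_fn,
                                         top_labels=3, num_samples=1000)
lime_img, lime_mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
lime_overlay = mark_boundaries(lime_img / 255.0, lime_mask)

# ---- Grad-CAM: weight the last conv-layer feature maps by class-score gradients ----
# "block5_conv4" is the last convolutional layer in the standard VGG19 topology;
# a differently assembled fine-tuned model may use another layer name.
grad_model = tf.keras.Model(model.inputs,
                            [model.get_layer("block5_conv4").output, model.output])
x = preprocess_input(image[np.newaxis].astype(np.float32))
with tf.GradientTape() as tape:
    conv_out, preds = grad_model(x)
    class_idx = int(tf.argmax(preds[0]))
    class_score = preds[:, class_idx]
grads = tape.gradient(class_score, conv_out)       # d(score) / d(feature maps)
weights = tf.reduce_mean(grads, axis=(1, 2))       # global-average-pooled gradients
cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
cam = (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # heatmap in [0, 1]
print(CLASS_NAMES[class_idx], lime_overlay.shape, cam.shape)
```

In practice, the LIME boundary overlay and the up-sampled Grad-CAM heatmap would be displayed alongside the original image, as in Figures 5, 6, 11 and 12.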

4.1.1. Correctly Classified Instances

For correctly classified instances, both LIME and Grad-CAM highlight relevant regions that contribute to the VGG19 model’s predictions. Figure 5 and Figure 11 illustrate the explanations for the same set of correctly classified images. LIME focuses on specific features and areas in the images, showing which parts of the drilled holes most influenced the classification as “Green”, “Yellow”, or “Red”. Grad-CAM highlights broader regions within the images that are important for the model’s decisions. It provides a more holistic view of the areas contributing to the classification.
Both techniques show a high degree of agreement in identifying the critical regions for classification. However, LIME’s explanations are more granular, focusing on smaller image patches, while Grad-CAM provides a more general overview of the important areas.

4.1.2. Misclassified Instances

Figure 6 and Figure 12 show the explanations for misclassified instances by LIME and Grad-CAM, respectively. In cases of misclassification, LIME identifies specific features that misled the model. For example, it shows which regions of a “Green” hole led to its misclassification as “Red”. Grad-CAM indicates larger areas that influenced the wrong classification, providing a broader context for understanding the misclassification.
In nearly all cases, both LIME and Grad-CAM indicate similar regions (features) of explainability in the image. However, there is an exception in Figure 6c and Figure 12c—the same case where the model struggled more with the decision. This discrepancy in the explainability regions highlighted by the two algorithms indicates that the model’s decision-making process was less certain.

4.1.3. Discrepancies in Explanations

The discrepancies between LIME and Grad-CAM in the case shown in Figure 6c and Figure 12c can be attributed to several factors:
  • Local vs. global interpretations—LIME provides local interpretations by focusing on small perturbed regions, while Grad-CAM offers a global view based on gradients. The difference in their approach can lead to highlighting different aspects of the image.
  • Sensitivity to perturbations—LIME’s reliance on perturbations can sometimes make it sensitive to noise and local variations, which might not significantly affect the gradient-based approach of Grad-CAM.
  • Model uncertainty—the model’s lower confidence in its prediction for this particular case likely caused the divergence. When the model is uncertain, the explanations from different methods can vary more significantly, reflecting the underlying ambiguity in the model’s decision process.
  • Feature importance—different algorithms might prioritize different features. LIME might highlight specific small regions that it considers crucial, while Grad-CAM might emphasize broader regions that overall influence the decision.

4.1.4. Strengths and Weaknesses

  • LIME:
    Strengths—provides detailed, feature-level explanations. Useful for understanding specific reasons behind individual predictions.
    Weaknesses—can be computationally intensive due to the need for multiple perturbations. May not capture the global context as effectively.
  • Grad-CAM:
    Strengths—offers a high-level view of important regions, useful for understanding the model’s overall focus. Computationally efficient compared to LIME.
    Weaknesses—less detailed than LIME, may not highlight specific features as precisely.

4.1.5. Discussion of Yellow Class Misclassifications

The misclassification of instances belonging to the “Yellow” class deserves special attention, as it represents an intermediate state in the condition of drilled holes. The “Yellow” category is particularly challenging to classify, due to its transitional nature between the “Green” (good) and “Red” (requiring replacement) classes. This ambiguity introduces uncertainties in both model predictions and human assessments.
One of the primary challenges in accurately identifying the “Yellow” class lies in its inherent variability. Unlike the “Green” and “Red” classes, which have more clearly defined characteristics (new or significantly worn drill bits, respectively), the “Yellow” class represents a range of wear levels that are subject to interpretation. Factors such as minor surface defects, slight roughness around the edges of the drilled hole, or localized material inconsistencies can cause confusion between adjacent classes.
The CNN models tested, including VGG16, VGG19, and ResNet101, frequently struggled with distinguishing “Yellow” from either “Green” or “Red”. As seen in the confusion matrices, a substantial proportion of “Yellow” instances were misclassified as either “Green” or “Red”. This tendency was particularly noticeable in cases where the visual characteristics of “Yellow” overlapped with those of the other two classes. These misclassifications are not unexpected, considering the visual continuity in the degradation of drill bit performance over time.
From an industrial perspective, the misclassification of “Yellow” instances as “Green” or “Red” can have different implications. If a “Yellow” hole is misclassified as “Green”, there is a risk of continuing to use a drill bit that is approaching its wear limit, potentially leading to quality issues in subsequent operations. On the other hand, misclassifying a “Yellow” hole as “Red” could lead to premature replacement of a drill bit, resulting in unnecessary downtime and increased operational costs.
To reduce the frequency of these misclassifications, several strategies could be explored:
  • Improved data labeling. One approach involves refining the labeling process to better capture the nuances within the “Yellow” class. Introducing subcategories or gradations within the “Yellow” class might help the model learn the subtle differences between moderate wear levels.
  • Enhanced feature extraction. Incorporating additional feature extraction techniques, such as edge detection or texture analysis, could improve the model’s ability to distinguish between “Yellow” and other classes. These features could provide more detailed information about the quality of the drilled holes, leading to more accurate classifications.
  • Artifact removal based on LIME/Grad-CAM explanations. Analysis using LIME and Grad-CAM can identify specific artifacts in the images that lead to misclassifications. By focusing on these artifacts and either removing them from the dataset or adjusting the image preprocessing pipeline to mitigate their effects, the model’s performance can be improved, reducing the likelihood of incorrect class assignments.
  • Ensemble methods. Using ensemble models that combine the predictions of multiple CNN architectures may help to mitigate misclassifications. By aggregating the outputs of different models, it is possible to achieve a more balanced and reliable classification, especially for ambiguous cases (a minimal soft-voting sketch is given after this list).
  • Human-in-the-loop systems. Implementing a hybrid system where “Yellow” classifications are verified by human experts could provide a safety net for cases where the model’s confidence is low. This approach allows for continuous improvement through iterative feedback and model retraining.
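To make the ensemble strategy above concrete, the sketch below averages the softmax outputs of several fine-tuned networks (soft voting); the model file names and loading calls are placeholders for illustration, not the configuration used in this study.

```python
import numpy as np
import tensorflow as tf

# Hypothetical fine-tuned classifiers; file names are placeholders.
paths = ["vgg16_drill.h5", "vgg19_drill.h5", "resnet101_drill.h5"]
models = [tf.keras.models.load_model(p) for p in paths]

def ensemble_predict(batch):
    """Soft voting: average per-model class probabilities, then take the argmax."""
    probs = np.mean([m.predict(batch, verbose=0) for m in models], axis=0)
    return probs.argmax(axis=1), probs
```

In a real pipeline, each architecture would receive its own input preprocessing before prediction, and the averaging could be weighted, for example by down-weighting ResNet101 in view of its higher critical-error rate.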
Further research should focus on understanding the visual and contextual factors that contribute to “Yellow” misclassifications. Analyzing the decision boundaries in more detail using Explainable AI (XAI) techniques, such as LIME and Grad-CAM, can provide insights into the features the model struggles with. Additionally, investigating how the model’s uncertainty varies across the three classes could guide the development of more robust classification systems.
In conclusion, while the “Yellow” class poses significant challenges for both machine learning models and human operators, targeted strategies can help to minimize misclassification rates and improve overall system performance. Addressing these issues is critical for developing reliable tool condition monitoring systems that can effectively support decision-making in industrial environments.

4.1.6. Conclusions

LIME and Grad-CAM both contribute significantly to understanding how the VGG19 model makes decisions. The selection of either technique hinges on the analysis’s particular requirements. If in-depth explanations at the feature level are necessary, LIME is the ideal choice. On the other hand, Grad-CAM offers a wider perspective on the areas that impact the model’s predictions. Utilizing both techniques together can provide a well-rounded analysis of the model’s actions, thereby improving the AI system’s interpretability and reliability in identifying drilled holes in melamine-faced chipboard.

5. Conclusions

This study explored the application of Explainable AI (XAI) techniques, specifically LIME and Grad-CAM, to enhance the interpretability of Convolutional Neural Network (CNN) models used for the classification of drilled holes in melamine-faced chipboard. The primary focus was on understanding the decision-making process of these models to ensure reliability and improve classification accuracy.
The research involved evaluating three pre-trained CNN architectures: VGG16, VGG19, and ResNet101. The performance of these models was assessed based on accuracy and critical error rates using a 5-fold cross-validation approach. The VGG19 model emerged as the best performer, achieving the highest average accuracy of 67.03% and the lowest critical error rate. This model was therefore selected for further analysis using XAI techniques.
Both LIME and Grad-CAM were applied to the VGG19 model to generate explanations for its predictions. These techniques provided valuable insights into the features and regions of the images that influenced the model’s classifications. LIME produced detailed, feature-level explanations, highlighting specific areas that contributed to the model’s decisions. Grad-CAM, on the other hand, offered a more general overview, indicating broader regions that were important for the model’s predictions.
The comparison between LIME and Grad-CAM revealed that both methods generally agreed on the influential regions within the images, though LIME provided more granular details. In cases of misclassification, both techniques identified regions that misled the model, with some discrepancies attributed to the different approaches of the two methods. LIME’s perturbation-based approach and Grad-CAM’s gradient-based technique highlighted complementary aspects of the model’s decision-making process.
The study underscores the importance of using XAI techniques to interpret and validate the predictions of deep learning models in industrial applications. By understanding the factors influencing the model’s decisions, it is possible to improve the reliability and robustness of automated systems for tool condition monitoring.
Future research will focus on refining the VGG19 model further to enhance its accuracy and reduce critical errors. Additionally, exploring other XAI techniques and combining them with LIME and Grad-CAM could provide a more comprehensive understanding of model behavior. Implementing these improvements will contribute to more effective and trustworthy AI systems in the wood industry and beyond.
In conclusion, the integration of XAI techniques with CNN models for the classification of drilled holes in melamine-faced chipboard has demonstrated significant potential. The insights gained from this study pave the way for more transparent and reliable AI applications in manufacturing and other fields where understanding model decisions is crucial.

Author Contributions

Conceptualization, J.K. and A.S.; methodology, J.K. and A.S.; software, J.K. and A.S.; validation, J.B. and A.J.; formal analysis, J.K.; investigation, J.K. and A.S.; resources, J.K.; data curation, A.J.; writing—original draft preparation, J.K.; writing—review and editing, J.K., A.S., A.J. and J.B.; visualization, J.B.; supervision, J.K.; project administration, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy and confidentiality.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ozlusoylu, I.; Istek, A. Effects of Surface Lamination Process Parameters on Medium Density Fiberboard (MDF) Properties. BioResources 2023, 18, 767–777. [Google Scholar] [CrossRef]
  2. Kun, W.; Shen, Q.; Wang, C.; Liu, C. Influence of pneumatic pressure on delamination factor of drilling medium density fiberboard. Wood Res. 2015, 60, 429–440. [Google Scholar]
  3. Park, B.D.; Kang, E.C.; Lee, S.M.; Park, J.Y. Formaldehyde emission of wood-based composite panels with different surface lamination materials using desiccator method. J. Korean Wood Sci. Technol. 2016, 44, 600–606. [Google Scholar] [CrossRef]
  4. Szwajka, K.; Trzepieciński, T. Effect of tool material on tool wear and delamination during machining of particleboard. J. Wood Sci. 2016, 62, 305–315. [Google Scholar] [CrossRef]
  5. Szwajka, K.; Trzepieciński, T. An examination of the tool life and surface quality during drilling melamine-faced chipboard. Wood Res. 2017, 62, 307–318. [Google Scholar]
  6. Śmietańska, K.; Podziewski, P.; Bator, M.; Górski, J. Automated monitoring of delamination factor during up (conventional) and down (climb) milling of melamine-faced MDF using image processing methods. Eur. J. Wood Wood Prod. 2020, 78, 613–615. [Google Scholar] [CrossRef]
  7. Huang, X.; Tao, Z.; You, H.; Li, C.; Liu, J. Research on cutting method of wood panels based on improved particle swarm algorithm. J. For. Eng. 2024, 9, 125–131. [Google Scholar] [CrossRef]
  8. Zaida, H.; Bouchelaghem, A.M.; Chehaidia, S.E. Experimental study of tool wear evolution during turning operation based on DWT and RMS. Defect Diffus. Forum 2021, 406, 392–405. [Google Scholar] [CrossRef]
  9. Liu, Y.; Jiang, H.; Yao, R.; Zeng, T. Counterfactual-augmented few-shot contrastive learning for machinery intelligent fault diagnosis with limited samples. Mech. Syst. Signal Process. 2024, 216, 111507. [Google Scholar] [CrossRef]
  10. Liu, Y.; Jiang, H.; Yao, R.; Zhu, H. Interpretable data-augmented adversarial variational autoencoder with sequential attention for imbalanced fault diagnosis. J. Manuf. Syst. 2023, 71, 342–359. [Google Scholar] [CrossRef]
  11. Kim, H.S.; Jung, J.; Hwang, R.; Park, S.C.; Lee, S.J.; Kim, G.T.; Lee, B.W. Classification of PRPD Pattern in Cast-Resin Transformers Using CNN and Implementation of Explainable AI (XAI) with Grad-CAM. IEEE Access 2024, 12, 53623–53632. [Google Scholar] [CrossRef]
  12. Apicella, A.; Di Lorenzo, L.; Isgrò, F.; Pollastro, A.; Prevete, R. Strategies to Exploit XAI to Improve Classification Systems. Commun. Comput. Inf. Sci. 2023, 1901, 147–159. [Google Scholar] [CrossRef]
  13. Miller, M.; Ronczka, S.; Nischwitz, A.; Westermann, R. Light Direction Reconstruction Analysis and Improvement using XAI and CG. J. WSCG 2022, 2022, 189–198. [Google Scholar] [CrossRef]
  14. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  15. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  16. Lemaster, R.L.; Tee, L.B.; Dornfeld, D.A. Monitoring tool wear during wood machining with acoustic emission. Wear 1985, 101, 273–282. [Google Scholar] [CrossRef]
  17. Dutta, S.; Pal, S.; Mukhopadhyay, S.; Sen, R. Application of digital image processing in tool condition monitoring: A review. CIRP J. Manuf. Sci. Technol. 2013, 6, 212–232. [Google Scholar] [CrossRef]
  18. Zhu, K. Tool Condition Monitoring with Sparse Decomposition. In Smart Machining Systems: Modelling, Monitoring and Informatics; Springer Series in Advanced Manufacturing; Springer: Cham, Switzerland, 2022; pp. 235–266. [Google Scholar] [CrossRef]
  19. Lemaster, R.L.; Lu, L.; Jackson, S. The use of process monitoring techniques on a CNC wood router. Part 1. sensor selection. For. Prod. J. 2000, 50, 31. [Google Scholar]
  20. Lemaster, R.L.; Lu, L.; Jackson, S. The use of process monitoring techniques on a CNC wood router. Part 2. Use of a vibration accelerometer to monitor tool wear and workpiece quality. For. Prod. J. 2000, 50, 59. [Google Scholar]
  21. Zhu, N.; Tanaka, C.; Ohtani, T.; Usuki, H. Automatic detection of a damaged cutting tool during machining I: Method to detect damaged bandsaw teeth during sawing. J. Wood Sci. 2000, 46, 437–443. [Google Scholar] [CrossRef]
  22. Zhu, N.; Tanaka, C.; Ohtani, T.; Takimoto, Y. Automatic detection of a damaged router bit during cutting. Holz Als Roh-Und Werkst. 2004, 62, 126–130. [Google Scholar] [CrossRef]
  23. Suetsugu, Y.; Ando, K.; Hattori, N.; Kitayama, S. A tool wear sensor for circular saws using wavelet transform signal processing. For. Prod. J. 2005, 55, 79. [Google Scholar]
  24. Szwajka, K.; Górski, J. Evaluation tool condition of milling wood on the basis of vibration signal. J. Physics: Conf. Ser. 2006, 28, 1205. [Google Scholar] [CrossRef]
  25. Wilkowski, J.; Górski, J. Vibro-acoustic signals as a source of information about tool wear during laminated chipboard milling. Wood Res. 2011, 56, 57–66. [Google Scholar]
  26. Górski, J.; Szymanowski, K.; Podziewski, P.; Śmietańska, K.; Czarniak, P.; Cyrankowski, M. Use of cutting force and vibro-acoustic signals in tool wear monitoring based on multiple regression technique for compreg milling. Bioresources 2019, 14, 3379–3388. [Google Scholar] [CrossRef]
  27. Kurek, J.; Antoniuk, I.; Świderski, B.; Jegorowa, A.; Bukowski, M. Application of siamese networks to the recognition of the drill wear state based on images of drilled holes. Sensors 2020, 20, 6978. [Google Scholar] [CrossRef] [PubMed]
  28. Osowski, S.; Kurek, J.; Kruk, M.; Górski, J.; Hoser, P.; Wieczorek, G.; Jegorowa, A.; Wilkowski, J.; Śmietańska, K.; Kossakowska, J. Developing automatic recognition system of drill wear in standard laminated chipboard drilling process. Bull. Pol. Acad. Sci. Tech. Sci. 2016, 64, 633–640. [Google Scholar]
  29. Xie, Z.; Lu, Y.; Chen, X. A multi-sensor integrated smart tool holder for cutting process monitoring. Int. J. Adv. Manuf. Technol. 2020, 110, 853–864. [Google Scholar] [CrossRef]
  30. Kuo, R. Multi-sensor integration for on-line tool wear estimation through artificial neural networks and fuzzy neural network. Eng. Appl. Artif. Intell. 2000, 13, 249–261. [Google Scholar] [CrossRef]
  31. Bhuiyan, M.; Choudhury, I.; Dahari, M.; Nukman, Y. An Investigation into Turning of ASSAB-705 Steel Using Multiple Sensors. Mater. Manuf. Process. 2016, 31, 896–904. [Google Scholar] [CrossRef]
  32. Jemielniak, K.; Urbański, T.; Kossakowska, J.; Bombiński, S. Tool condition monitoring based on numerous signal features. Int. J. Adv. Manuf. Technol. 2012, 59, 73–81. [Google Scholar] [CrossRef]
  33. Panda, S.; Singh, A.K.; Chakraborty, D.; Pal, S. Drill wear monitoring using back propagation neural network. J. Mater. Process. Technol. 2006, 172, 283–290. [Google Scholar] [CrossRef]
  34. Nasir, V.; Cool, J.; Sassani, F. Intelligent machining monitoring using sound signal processed with the wavelet method and a self-organizing neural network. IEEE Robot. Autom. Lett. 2019, 4, 3449–3456. [Google Scholar] [CrossRef]
  35. Nasir, V.; Sassani, F. A review on deep learning in machining and tool monitoring: Methods, opportunities, and challenges. Int. J. Adv. Manuf. Technol. 2021, 115, 2683–2709. [Google Scholar] [CrossRef]
  36. Nautiyal, A.; Mishra, A. Drill Bit Selection and Drilling Parameter Optimization using Machine Learning. IOP Conf. Ser. Earth Environ. Sci. 2023, 1261, 012027. [Google Scholar] [CrossRef]
  37. Feng, Z.; Gani, H.; Damayanti, A.D.; Gani, H. An explainable ensemble machine learning model to elucidate the influential drilling parameters based on rate of penetration prediction. Geoenergy Sci. Eng. 2023, 231, 212231. [Google Scholar] [CrossRef]
  38. Mendez, M.; Ahmed, R.; Karami, H.; Nasser, M.; Hussein, I.A.; Garcia, S.; Gonzalez, A. Applications of Machine Learning Methods to Predict Hole Cleaning in Horizontal and Highly Deviated Wells. SPE Drill. Complet. 2023, 38, 606–617. [Google Scholar] [CrossRef]
  39. Kurek, J.; Swiderski, B.; Jegorowa, A.; Kruk, M.; Osowski, S. Deep learning in assessment of drill condition on the basis of images of drilled holes. In Proceedings of the Eighth International Conference on Graphic and Image Processing (ICGIP 2016) SPIE, Tokyo, Japan, 29–31 October 2016; Volume 10225, pp. 375–381. [Google Scholar]
  40. Kurek, J.; Wieczorek, G.; Kruk, B.S.M.; Jegorowa, A.; Osowski, S. Transfer learning in recognition of drill wear using convolutional neural network. In Proceedings of the 2017 18th International Conference on Computational Problems of Electrical Engineering (CPEE), Kutna Hora, Czech Republic, 11–13 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4. [Google Scholar]
  41. Zhu, J.; Su, Z.; Han, Z.; Lan, Z.; Wang, Q.; Ho, M.M.P. An ensemble approach for enhancing generalization and extendibility of deep learning facilitated by transfer learning: Principle and application in curing monitoring. Smart Mater. Struct. 2023, 32, 115022. [Google Scholar] [CrossRef]
  42. Bhuiyan, M.R.; Uddin, J. Deep Transfer Learning Models for Industrial Fault Diagnosis Using Vibration and Acoustic Sensors Data: A Review. Vibration 2023, 6, 218–238. [Google Scholar] [CrossRef]
  43. Chai, Z.; Wang, J.; Zhao, C.; Ding, J.; Sun, Y. Deep transfer learning methods for typical supervised tasks in industrial monitoring: State-of-the-art, challenges, and perspectives. Sci. Sin. Informationis 2023, 53, 821–840. [Google Scholar] [CrossRef]
  44. Kurek, J.; Antoniuk, I.; Górski, J.; Jegorowa, A.; Świderski, B.; Kruk, M.; Wieczorek, G.; Pach, J.; Orłowski, A.; Aleksiejuk-Gawron, J. Classifiers ensemble of transfer learning for improved drill wear classification using convolutional neural network. Mach. Graph. Vis. 2019, 28, 13–23. [Google Scholar] [CrossRef]
  45. Kurek, J.; Antoniuk, I.; Górski, J.; Jegorowa, A.; Świderski, B.; Kruk, M.; Wieczorek, G.; Pach, J.; Orłowski, A.; Aleksiejuk-Gawron, J. Data augmentation techniques for transfer learning improvement in drill wear classification using convolutional neural network. Mach. Graph. Vis. 2019, 28, 3–12. [Google Scholar] [CrossRef]
  46. Arrighi, L.; Barbon Junior, S.; Pellegrino, F.A.; Simonato, M.; Zullich, M. Explainable Automated Anomaly Recognition in Failure Analysis: Is Deep Learning Doing it Correctly? In World Conference on Explainable Artificial Intelligence; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2023; Volume 1902, pp. 420–432. [Google Scholar] [CrossRef]
  47. Famiglini, L.; Campagner, A.; Barandas, M.; La Maida, G.A.; Gallazzi, E.; Cabitza, F. Evidence-based XAI: An empirical approach to design more effective and explainable decision support systems. Comput. Biol. Med. 2024, 170, 108042. [Google Scholar] [CrossRef] [PubMed]
  48. Gkartzonika, I.; Gkalelis, N.; Mezaris, V. Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism. In European Conference on Computer Vision; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2023; Volume 13808, pp. 396–411. [Google Scholar] [CrossRef]
  49. Makridis, G.; Theodoropoulos, S.; Dardanis, D.; Makridis, I.; Separdani, M.M.; Fatouros, G.; Kyriazis, D.; Koulouris, P. XAI enhancing cyber defence against adversarial attacks in industrial applications. In Proceedings of the 2022 IEEE 5th International Conference on Image Processing Applications and Systems (IPAS), Genova, Italy, 5–7 December 2022. [Google Scholar] [CrossRef]
  50. Li, S.; Li, T.; Sun, C.; Yan, R.; Chen, X. Multilayer Grad-CAM: An effective tool towards explainable deep neural networks for intelligent fault diagnosis. J. Manuf. Syst. 2023, 69, 20–30. [Google Scholar] [CrossRef]
  51. Shimizu, T.; Nagata, F.; Arima, K.; Miki, K.; Kato, H.; Otsuka, A.; Watanabe, K.; Habib, M.K. Enhancing defective region visualization in industrial products using Grad-CAM and random masking data augmentation. Artif. Life Robot. 2024, 29, 62–69. [Google Scholar] [CrossRef]
  52. Arima, K.; Nagata, F.; Shimizu, T.; Otsuka, A.; Kato, H.; Watanabe, K.; Habib, M.K. Improvements of detection accuracy and its confidence of defective areas by YOLOv2 using a data set augmentation method. Artif. Life Robot. 2023, 28, 625–631. [Google Scholar] [CrossRef]
  53. Zafar, M.R.; Khan, N. Deterministic Local Interpretable Model-Agnostic Explanations for Stable Explainability. Mach. Learn. Knowl. Extr. 2021, 3, 525–541. [Google Scholar] [CrossRef]
  54. Harishyam, B.; Jenarthanan, M.; Rishivanth, R.; Rajesh, R.; Sai Girish, N. Visual inspection of mechanical components using visual imaging and machine learning. Mater. Today Proc. 2023, 72, 2557–2563. [Google Scholar] [CrossRef]
  55. EN 310:1994; Wood-Based Panels: Determination of Modulus of Elasticity in Bending and of Bending Strength. European Standard: Brussels, Belgium, 1994.
  56. EN 1534; Wood Flooring and Parquet: Determination of Resistance to Indentation—Test Method. European Standard: Brussels, Belgium, 2002.
  57. Czarniak, P.; Szymanowski, K.; Panjan, P.; Górski, J. Initial Study of the Effect of Some PVD Coatings (“TiN/AlTiN” and “TiAlN/a-C:N”) on the Wear Resistance of Wood Drilling Tools. Forests 2022, 13, 286. [Google Scholar] [CrossRef]
  58. Tuunainen, T.; Isohanni, O.; Jose, M.R. A comparative study on the application of Convolutional Neural Networks for wooden panel defect detection. In Proceedings of the 2024 IEEE 22nd World Symposium on Applied Machine Intelligence and Informatics (SAMI), Stará Lesná, Slovakia, 25–27 January 2024; pp. 321–326. [Google Scholar] [CrossRef]
  59. Kurek, J.; Szymanowski, K.; Chmielewski, L.J.; Orłowski, A. Advancing Chipboard Milling Process Monitoring through Spectrogram-Based Time Series Analysis with Convolutional Neural Network using Pretrained Networks. Mach. Graph. Vis. 2023, 32, 89–108. [Google Scholar] [CrossRef]
  60. Karathanasopoulos, N.; Hadjidoukas, P. Deep learning based automated fracture identification in material characterization experiments. Adv. Eng. Inform. 2024, 60, 102402. [Google Scholar] [CrossRef]
  61. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  62. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  63. ImageNet. Available online: http://www.image-net.org (accessed on 10 August 2024).
  64. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  65. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  66. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  67. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 3. [Google Scholar]
  68. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520. [Google Scholar]
  69. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv 2017, arXiv:1610.02357. [Google Scholar]
  70. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI, San Francisco, CA, USA, 4–9 February 2017; Volume 4, p. 12. [Google Scholar]
  71. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083v2. [Google Scholar]
  72. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. arXiv 2017, arXiv:1707.07012. [Google Scholar]
  73. Redmon, J. Darknet: Open Source Neural Networks in C. Available online: https://pjreddie.com/darknet (accessed on 10 August 2024).
  74. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  75. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  76. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252. [Google Scholar] [CrossRef]
  77. Zhou, B.; Khosla, A.; Lapedriza, A.; Torralba, A.; Oliva, A. Places: An image database for deep scene understanding. arXiv 2016, arXiv:1610.02055. [Google Scholar] [CrossRef]
  78. Tan, L.; Huang, C.; Yao, X. A Concept-Based Local Interpretable Model-Agnostic Explanation Approach for Deep Neural Networks in Image Classification. In International Conference on Intelligent Information Processing; IFIP Advances in Information and Communication Technology; Springer: Cham, Switzerland, 2024; Volume 704, pp. 119–133. [Google Scholar] [CrossRef]
  79. Meng, H.; Wagner, C.; Triguero, I. SEGAL time series classification—Stable explanations using a generative model and an adaptive weighting method for LIME. Neural Netw. 2024, 176, 106345. [Google Scholar] [CrossRef] [PubMed]
  80. Visani, G.; Bagli, E.; Chesani, F. OptiLIME: Optimized LIME explanations for diagnostic computer algorithms. arXiv 2020, arXiv:2006.05714. [Google Scholar]
  81. Messalas, A.; Aridas, C.; Kanellopoulos, Y. Evaluating MASHAP as a faster alternative to LIME for model-agnostic machine learning interpretability. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5777–5779. [Google Scholar] [CrossRef]
  82. Dhurandhar, A.; Ramamurthy, K.N.; Ahuja, K.; Arya, V. Locally Invariant Explanations: Towards Stable and Unidirectional Explanations through Local Invariant Learning. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
  83. Barr Kumarakulasinghe, N.; Blomberg, T.; Liu, J.; Saraiva Leao, A.; Papapetrou, P. Evaluating local interpretable model-agnostic explanations on clinical machine learning classification models. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 7–12. [Google Scholar] [CrossRef]
  84. Chakraborty, T.; Trehan, U.; Mallat, K.; Dugelay, J.L. Generalizing Adversarial Explanations with Grad-CAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 186–192. [Google Scholar] [CrossRef]
  85. Liu, Y.; Tang, L.; Liao, C.; Zhang, C.; Guo, Y.; Xia, Y.; Zhang, Y.; Yao, S. Optimized Dropkey-Based Grad-CAM: Toward Accurate Image Feature Localization. Sensors 2023, 23, 8351. [Google Scholar] [CrossRef] [PubMed]
  86. Desai, S.; Ramaswamy, H.G. Ablation-CAM: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 972–980. [Google Scholar] [CrossRef]
  87. Li, Z.; Xu, M.; Yang, X.; Han, Y.; Wang, J. A Multi-Label Detection Deep Learning Model with Attention-Guided Image Enhancement for Retinal Images. Micromachines 2023, 14, 705. [Google Scholar] [CrossRef] [PubMed]
  88. Chen, L.; Chen, J.; Hajimirsadeghi, H.; Mori, G. Adapting grad-CAM for embedding networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2783–2792. [Google Scholar] [CrossRef]
  89. Song, W.; Dai, S.; Wang, J.; Huang, D.; Liotta, A.; Di Fatta, G. Bi-gradient verification for grad-CAM towards accurate visual explanation for remote sensing images. In Proceedings of the 2019 International Conference on Data Mining Workshops (ICDMW), Beijing, China, 8–11 November 2019; pp. 473–479. [Google Scholar] [CrossRef]
  90. Brahmaiah, O.V.; Raju, M.S.N.; Jahnavi, V.; Varshini, M. Dense Net-Based Acute Lymphoblastic Leukemia Classification and Interpretation through Gradient-Weighted Class Activation Mapping. In Proceedings of the 2024 Third International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), Krishnankoil, India, 14–16 March 2024. [Google Scholar] [CrossRef]
  91. Xiao, M.; Zhang, L.; Shi, W.; Liu, J.; He, W.; Jiang, Z. A visualization method based on the Grad-CAM for medical image segmentation model. In Proceedings of the 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS), Changchun, China, 23–26 September 2021; pp. 242–247. [Google Scholar] [CrossRef]
  92. Qiu, Z.; Rivaz, H.; Xiao, Y. Is Visual Explanation with Grad-CAM More Reliable for Deeper Neural Networks? A Case Study with Automatic Pneumothorax Diagnosis. In International Workshop on Machine Learning in Medical Imaging; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2024; Volume 14349, pp. 224–233. [Google Scholar] [CrossRef]
Figure 1. The FABA WP-01 drill used in the experiments, shown from two perspectives. On the left, a top view of the drill bit shows its cutting edges and general shape; on the right, a side view emphasizes the length and contour of the drill bit.
Figure 2. Density profile of the chipboard utilized in the conducted experiments.
Figure 3. Approach to measuring tool degradation: (a) appearance of a freshly manufactured tool, (b) appearance of a tool after wear [57].
Figure 4. For illustration purposes, examples of drilled holes are categorized into three classes. The “Green” class examples are listed in the first row, followed by the “Yellow” class examples in the second row, and examples of the “Red” class are displayed in the third row. Refer to for additional details.
Figure 5. LIME-generated explanations for accurate predictions made by the VGG19 model on images featuring drilled holes are presented. The first row (a–c) displays the original images, while the second row (d–f) presents the LIME explanations corresponding to these images: (a) true class: Green, predicted class: Green; (b) true class: Yellow, predicted class: Yellow; (c) true class: Red, predicted class: Red; (d) LIME explanation for image (a); (e) LIME explanation for image (b); (f) LIME explanation for image (c).
Figure 6. Explanations generated by LIME for misclassified predictions by the VGG19 model on images of drilled holes. The first row (a–c) presents the original images, while the second row (d–f) displays the LIME-generated explanations corresponding to each image: (a) real category: Green, predicted category: Red; (b) real category: Green, predicted category: Red; (c) real category: Red, predicted category: Green; (d) explanation by LIME for image (a); (e) explanation by LIME for image (b); (f) explanation by LIME for image (c).
Figure 7. Flowchart illustrating the workflow for the numerical experiments conducted in this study.
Figure 8. Overall confusion matrix for all folds for VGG16 model.
Figure 9. Overall confusion matrix for all folds for VGG19 model.
Figure 10. Overall confusion matrix for all folds for ResNet101 model.
Figure 11. Explanations generated by Grad-CAM for the VGG19 model’s correct classifications of images showcasing drilled holes are presented. The initial row, denoted as (a–c), displays the original images, while the subsequent row, labeled as (d–f), presents the Grad-CAM interpretations corresponding to these images: (a) true class: Green, model’s prediction: Green; (b) true class: Yellow, model’s prediction: Yellow; (c) true class: Red, model’s prediction: Red; (d) presents the Grad-CAM explanation for image (a), (e) for image (b), and (f) for image (c).
Figure 12. Explanations provided by Grad-CAM for images incorrectly categorized by the VGG19 model, focusing on drilled holes. The initial row (a–c) presents the original images, while the subsequent row (d–f) illustrates the Grad-CAM justifications corresponding to each: (a) true class: Green, identified as: Red; (b) true class: Green, identified as: Red; (c) true class: Red, identified as: Green; (d) is the Grad-CAM rationale for (a), (e) is for (b), and (f) for (c).
Table 1. Summary of image data collection and drill wear classification.
Drill Number | Green/Yellow/Red | Total Images
1 | 840/420/406 | 1666
2 | 840/700/280 | 1820
3 | 700/560/420 | 1680
4 | 840/560/280 | 1680
5 | 560/560/560 | 1680
Total | 3780/2800/1946 | 8526
Table 2. Comparison of pretrained CNN models.
Pretrained CNN Model | Depth | Size | Parameters (Millions) | Image Input Size
ResNet101 | 101 | 167 MB | 44.6 | 224-by-224
VGG16 | 16 | 515 MB | 138 | 224-by-224
VGG19 | 19 | 535 MB | 144 | 224-by-224
Table 3. Summary of data distribution in each fold of the 5-fold cross-validation.
Fold | Test Drill Number | Training Set Size | Test Set Size | Training/Test Percentage
1 | 1 | 6860 | 1666 | 80.46%/19.54%
2 | 2 | 6706 | 1820 | 78.65%/21.35%
3 | 3 | 6846 | 1680 | 80.30%/19.70%
4 | 4 | 6846 | 1680 | 80.30%/19.70%
5 | 5 | 6846 | 1680 | 80.30%/19.70%
Table 4. Evaluation results of VGG16 model.
Fold/Drill # | Accuracy | Green as Red | Red as Green | Total Critical Errors
1 | 71.00% | 1 | 3 | 4
2 | 70.82% | 8 | 11 | 19
3 | 63.98% | 1 | 58 | 59
4 | 56.13% | 61 | 0 | 61
5 | 70.77% | 1 | 15 | 16
Total | 66.60% | 72 | 87 | 159
Table 5. Detailed evaluation metrics for VGG16 model.
Class | Precision (%) | Sensitivity (%) | Specificity (%) | Accuracy (%) | F1 Score (%)
Green | 79.58 | 84.05 | 82.83 | 83.37 | 81.76
Yellow | 51.97 | 52.86 | 76.11 | 68.47 | 52.41
Red | 60.62 | 52.52 | 89.91 | 81.37 | 56.28
Table 6. Evaluation results of VGG19 model.
Fold/Drill # | Accuracy | Green as Red | Red as Green | Total Critical Errors
1 | 69.50% | 2 | 2 | 4
2 | 70.76% | 19 | 5 | 24
3 | 70.59% | 1 | 10 | 11
4 | 52.38% | 57 | 0 | 57
5 | 71.60% | 0 | 11 | 11
Total | 67.03% | 79 | 28 | 107
Table 7. Detailed evaluation metrics for VGG19 model.
Class | Precision (%) | Sensitivity (%) | Specificity (%) | Accuracy (%) | F1 Score (%)
Green | 84.86 | 78.89 | 88.79 | 84.40 | 81.77
Yellow | 51.51 | 58.32 | 73.16 | 68.29 | 54.71
Red | 59.72 | 56.53 | 88.72 | 81.37 | 58.08
Table 8. Evaluation results of ResNet101 model.
Fold/Drill # | Accuracy | Green as Red | Red as Green | Total Critical Errors
1 | 66.26% | 26 | 13 | 39
2 | 56.04% | 158 | 7 | 165
3 | 65.59% | 5 | 39 | 44
4 | 56.13% | 64 | 2 | 66
5 | 66.42% | 3 | 16 | 19
Total | 61.19% | 256 | 77 | 333
Table 9. Detailed evaluation metrics for ResNet101 model.
Class | Precision (%) | Sensitivity (%) | Specificity (%) | Accuracy (%) | F1 Score (%)
Green | 81.45 | 80.48 | 85.40 | 83.22 | 80.96
Yellow | 47.52 | 36.96 | 80.04 | 65.89 | 41.58
Red | 46.23 | 62.08 | 78.65 | 74.87 | 52.99
Table 10. Comparison of VGG16, VGG19, and ResNet101 models.
Model | Average Accuracy | Total Critical Errors | Consistency of Performance
VGG16 | 66.60% | 159 | Moderate
VGG19 | 67.03% | 107 | High
ResNet101 | 61.19% | 333 | Low
