1. Introduction
With the evolution from Industry 4.0 to 5.0, deep learning (DL) techniques have become even more integral, especially in the domain of condition monitoring (CM) for machinery and robots [
1]. Given the complexity of modern robotic mechanisms and their diverse roles, using DL to predict and proactively manage potential malfunctions, or to ensure safe movement, offers substantial benefits. However, relying on these sophisticated DL algorithms raises a fundamental challenge: trustworthiness. The inherently opaque nature of DL models, which are often termed ‘black boxes’, hinders transparency and reliability in their predictions, especially in critical domains like CM where the stakes are high [
2]. As industries globally lean heavily on robots and machinery for production, ensuring the safe and consistent operation of these assets is paramount. A malfunction, if not pre-emptively addressed, can lead to substantial economic losses, compromised safety, and operational disruptions. Therefore, this paper endeavors to delve into the nuances of creating a DL-based CM system for robots and machinery that is not only efficient and predictive, but also transparent and trustworthy, bridging the gap between advanced AI capabilities and the indispensable human trust factor.
The history of machine CM can be traced back to simple circuit breakers designed to manage incoming power feeds. These were succeeded by multifunctional electronic relays, which marked a pivotal step in monitoring capabilities. The introduction of Programmable Logic Controllers (PLCs) further transformed the field, enabling engineers to devise intricate protection mechanisms for machines. Today, as technological advances have lowered data acquisition costs and increased computational power, a clear shift is taking place in the industrial sector. Many plants are now moving towards integrating data-driven systems into their machinery and robotic CM processes [
3].
In recent years, the expanding domain of CM, especially within machinery and robotics, has greatly benefited from the advent and refinement of AI frameworks. Tailored to address intricate CM challenges, these AI-driven methodologies cast a wide net, encompassing a plethora of sensory data inputs. They proficiently integrate ultrasonic measurements [
4], electrical currents [
5], torque measurements on robotic arm joints [
6], temperature profiles [
7], and state-of-the-art techniques for wear debris detection [
8]. Vibration and torque signal analyses, paramount aspects of these frameworks, have been harnessed extensively for both machinery and robotic systems, demonstrating their pivotal role and potential [
9,
10]. This confluence of diverse measurements fosters a more holistic understanding of equipment and robotic health, propelling advancements in predictive accuracy and operational longevity. Yang et al. propose the Twin Broad Learning System (TBLS) as a solution to the challenges faced by deep learning models that rely on extensive datasets for fault diagnosis in rotating machinery. The TBLS incorporates two non-parallel hyperplanes, which improves generalization capability and diagnostic accuracy for overlapping fault patterns. Experimental results on benchmark datasets support the effectiveness and efficiency of the TBLS [
11]. In another study, Zhou et al. introduce a semi-supervised fault diagnosis approach for rotating machinery that combines multi-scale permutation entropy with contrastive learning. This method significantly outperforms the benchmarks in gearbox and milling tool diagnosis, achieving high accuracy with limited labeled data [
12].
While DL methodologies have demonstrated significant potential in enhancing CM systems, their ‘black-box’ character (non-intuitive and opaque) poses challenges. It often leaves stakeholders, from on-ground operators to upper-tier management, unable to decipher the rationale behind the model’s decisions. Recent scholarly discourse, as highlighted by Antwarg et al. [
13], underscores this opacity as a notable impediment, potentially stymieing the trust and subsequent adoption of DL-driven CM systems among industry practitioners. Concurrently, the academic landscape has been enriched with a myriad of feature extraction methodologies, specifically tailored to signal-based CM paradigms. Yet, finding an optimal subset of features remains an academically contested area. This challenge is exacerbated by the increasing data dimensionality, which, in turn, amplifies the computational demands and intricacy of the accompanying DL architectures. It is against this backdrop that this paper introduces a novel framework, striving to surmount the aforementioned challenges. The novelty of our approach lies in its multifaceted capabilities:
(1)
Transparency in Decision Making: While a plethora of publications have focused on the efficacy and accuracy of such models, only a handful of studies have attempted to unravel the intricacies of their decision-making processes. There have been concerted efforts to leverage more transparent models, such as decision trees, to enhance interpretability. Notably, ref. [
14] showcased the potential of shallow-learning ensemble models, integrating XGBoost and Random Forests with model explainers to elucidate the underlying decision logic. On a similar tangent, the authors of [
15] employed the Logical Analysis of Data as a pathway to achieve an interpretable machine learning technique that is specifically tailored to fault detection and diagnosis within intricate industrial chemical processes. From a different perspective, to gauge the model’s confidence level in its decisions, ref. [
16] employed Bayesian models, particularly for monitoring machine signals in anomalous or unpredictable domains where misdiagnosis might occur in the absence of discernible symptoms. In their efforts to comprehensively model a robot’s vibrational attributes, ref. [
17] endeavored to quantify the uncertainty linked with eigenfrequency prediction, leveraging the precision of Monte Carlo uncertainty propagation.
Echoing a similar sentiment regarding the importance of confidence in outcomes, ref. [
18] innovatively integrated Bayesian variational learning into Transformer architectures. Their approach instilled uncertainty into attention weights, paving the way for a probabilistic Bayesian Transformer specifically designed for dependable CM in rotating machines.
Our proposed framework therefore strives to surmount existing limitations. It tries to elucidate the specific patterns and signatures instrumental in the decision-making processes, particularly emphasizing their contributions to or against each classification category. Presented in a visually intuitive manner, this elucidation aims to render the decision-making process transparent and cogent for human operators and stakeholders. This explicability acts as a catalyst, engendering trust and confidence in AI-based CM models.
(2)
Optimal Feature Selection and Model Efficiency: Utilizing specialized processing methods for sensory signals unveils informative signatures and patterns that are crucial to CM. In recent decades, a myriad of feature types have been proposed and implemented in the CM discourse. Notably, features like Kurtosis, Root Variance Frequency, Max Power Spectrum, Impulse Factor, and Crest Factor have been extensively employed as fundamental statistical attributes within machine-learning-based fault detection frameworks, providing pivotal insights into machinery states [
19,
20].
Moreover, the advent of sophisticated time–frequency analyses such as permutation entropy, multiscale permutation entropy, and multiscale entropy has been integral in reinforcing the analytical capabilities of both machine learning and DL-based CM frameworks [
8,
21,
22]. Empirical Mode Decomposition (EMD) stands out as a notably effective, self-adaptive processing method that is adept at analyzing non-linear and non-stationary processes, despite its inherent challenges; these include mode mixing, end effects, interpolation problems, and complexity in selecting the optimal Intrinsic Mode Function (IMF) [
23,
24].
In parallel, an assortment of CM frameworks have incorporated grayscale diagrams and diagrams based on Fourier and wavelet transforms, each illuminating different facets of the machine’s state, with their respective merits and demerits. Investigations into the reliability and efficacy of discrete Fourier transform (DFT), short-time Fourier transform (STFT), and continuous wavelet transform (CWT) have also been pivotal in defining rub detection parameters based on vibration signals [
25]. In a concerted effort to amalgamate the benefits of diverse techniques, recent studies have explored the integrative application of these feature extraction techniques [
26,
27,
28]. However, the field still lacks a unified consensus on the delineation of optimal feature engineering for machinery CM.
In light of the above considerations, our framework strives to amalgamate advanced signal processing techniques, focusing on harnessing the complementary strengths of the aforementioned methodologies to refine feature selection; this enhances model accuracy and reduces computational demands, leading to an expeditious and highly efficient CM framework that is suitable for machinery and robots.
(3)
Leveraging State-of-the-Art DL Architecture: At the core of our framework lies the Convolutional Long Short-Term Memory (CLSTM) architecture, which has gained recognition as a leading approach in deep learning. CLSTM handles sequential data particularly well, making it highly suitable for CM tasks [
9,
29].
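As a brief illustration of the statistical attributes named in item (2) above, the following minimal sketch computes kurtosis, crest factor, and impulse factor for a single signal burst using their standard textbook definitions. The function name, the synthetic test signal, and the injected impact are illustrative assumptions and are not taken from the framework itself.

```python
import numpy as np

def statistical_features(x):
    """Return kurtosis, crest factor and impulse factor of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    # Non-excess kurtosis: fourth central moment over squared variance.
    kurtosis = np.mean(centered ** 4) / variance ** 2
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    mean_abs = np.mean(np.abs(x))
    return {
        "kurtosis": kurtosis,
        "crest_factor": peak / rms,          # peak relative to RMS level
        "impulse_factor": peak / mean_abs,   # peak relative to mean absolute level
    }

# Synthetic example (assumption): 10 full cycles of a 150 Hz sine sampled at
# 12 kHz, and the same sine with one sharp impact added.
t = np.arange(800) / 12_000.0
clean = np.sin(2 * np.pi * 150 * t)
impacted = clean.copy()
impacted[400] += 5.0

clean_features = statistical_features(clean)
impacted_features = statistical_features(impacted)
print(clean_features)
print(impacted_features)
```

For a pure sine the kurtosis is 1.5 and the crest factor is the square root of 2; the single impact raises both, which is why such attributes are sensitive to impulsive faults.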
Our research aims to bridge existing gaps in the CM literature by combining advanced signal processing techniques, state-of-the-art DL architectures, and a strong commitment to transparency and efficiency. Through this study, we seek to strengthen the foundations of a new era in machinery and robotic system diagnostics, characterized by both insightful analysis and intuitive understanding. An earlier version of this work was presented at the 32nd International Conference on Flexible Automation and Intelligent Manufacturing in 2023 [
30].
The rest of this article is organized as follows: An overview of Shapley values and XAI is provided in
Section 2. Our framework is discussed in
Section 3. Experiments and discussions are presented in
Section 4. Finally, the conclusions are given in
Section 5.
4. Experimental Results and Discussion
The proposed framework offers two key benefits. Firstly, it reveals the workings behind the DL model, explaining how the model makes its classifications. Secondly, it highlights the important features that significantly contribute to classification tasks and removes the unnecessary features that do not contribute to the output, thus making the model more efficient.
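The second benefit can be sketched as follows: features are ranked by their mean absolute attribution, and those that contribute almost nothing are dropped. This is only one plausible selection rule; the `keep_ratio` threshold, the function name, and the toy attribution matrix are hypothetical and do not reproduce the exact criterion used in this work.

```python
import numpy as np

def select_features(shap_values, keep_ratio=0.05):
    """Keep features whose mean |attribution| is at least keep_ratio of the largest.

    shap_values: array of shape (n_samples, n_features) of per-sample attributions.
    keep_ratio: hypothetical cut-off, not a value taken from the paper.
    """
    importance = np.mean(np.abs(shap_values), axis=0)
    keep = importance >= keep_ratio * importance.max()
    return np.flatnonzero(keep), importance

# Toy attributions: features 0 and 2 matter, feature 1 is near zero everywhere.
toy_shap = np.array([
    [0.40, 0.001, -0.20],
    [-0.35, -0.002, 0.25],
    [0.45, 0.000, -0.15],
])
kept, importance = select_features(toy_shap)
print(kept)        # indices of the retained features
print(importance)  # mean absolute attribution per feature
```

Taking the absolute value before averaging matters: a feature that pushes strongly for a class in some samples and against it in others would otherwise average out to zero and be discarded by mistake.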
To assess the efficacy of the proposed framework, we employ datasets of two kinds: fault detection datasets for rotating machinery and a dataset for determining the interaction status of robotic arms.
4.1. Dataset Descriptions
Because safety is central to robotic systems, understanding and monitoring their operational behavior is a focal point of advanced robotics research. A critical aspect of this is discerning the robot’s interaction with its environment, such as detecting collisions with obstacles, understanding when it is being manually operated, or ascertaining when it is moving freely. To this end, Zhang et al. [
22], from the Technical University of Munich, conducted a comprehensive study in which they recorded the torques for each joint of a seven-DoF KUKA arm. Each signal instance spans a time window of 1024 ms, collected at a 1 kHz sampling rate. Notably, any collision or contact event occurs around the 256 ms mark within these signal instances. For the purpose of this paper, and to focus on the most relevant data segments, we extracted a 300 ms time window from each sample, ranging from 256 ms to 556 ms, when constructing our training and test sets. This ensured the efficient processing of the datasets, concentrating on the time frames that were most indicative of the robot’s interactions.
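The window extraction described above amounts to a simple slice along the time axis. The sketch below assumes each sample is stored as a (joints, time) array; that layout, the function name, and the zero-filled placeholder are assumptions made for illustration only.

```python
import numpy as np

FS_HZ = 1_000                    # sampling rate stated in the text
START_MS, STOP_MS = 256, 556     # window limits stated in the text

def crop_window(sample, fs_hz=FS_HZ, start_ms=START_MS, stop_ms=STOP_MS):
    """Crop the last (time) axis of a sample to [start_ms, stop_ms)."""
    start = start_ms * fs_hz // 1000
    stop = stop_ms * fs_hz // 1000
    return sample[..., start:stop]

# Placeholder sample: 7 joint torques over 1024 ms at 1 kHz (layout is assumed).
sample = np.zeros((7, 1024))
window = crop_window(sample)
print(window.shape)
```

At 1 kHz one sample corresponds to one millisecond, so the 300 ms window contains 300 samples per joint.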
This study also employed the Case Western Reserve University (CWRU) bearing dataset, which was recorded on the test stand shown in
Figure 4, which consists of a motor, a torque transducer/encoder, a dynamometer, and control electronics. The dataset included a healthy condition and five different fault types, i.e., three outer race fault positions (outer@3, outer@6, and outer@12) along with the inner race and the ball faults. The faults were grouped according to their severity and range in diameter from 0.007 inches to 0.040 inches. Additionally, the dataset comprised four different motor speeds. However, for the sake of simplicity, this study only utilized one motor speed, 1797 RPM. The dataset was collected at a sampling rate of 12 kHz, with an accelerometer mounted on the drive end of the machine. In the experiments, we took signal bursts of 800 samples, equal to approximately 66.7 milliseconds, to generate several datasets of approximately 25,500 signal bursts.
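The burst construction can be sketched as below. The text fixes the burst length (800 samples) and the sampling rate (12 kHz); whether consecutive bursts overlap is not stated here, so the `stride` parameter and the function name are assumptions, with non-overlapping bursts as the default.

```python
import numpy as np

FS_HZ = 12_000      # sampling rate stated in the text
BURST_LEN = 800     # samples per burst stated in the text

def split_into_bursts(signal, burst_len=BURST_LEN, stride=None):
    """Cut a 1-D signal into fixed-length bursts (one burst per row)."""
    signal = np.asarray(signal)
    stride = burst_len if stride is None else stride  # default: no overlap
    n_bursts = max((len(signal) - burst_len) // stride + 1, 0)
    starts = stride * np.arange(n_bursts)
    index = starts[:, None] + np.arange(burst_len)[None, :]
    return signal[index]

burst_ms = 1000.0 * BURST_LEN / FS_HZ           # duration of one burst in ms
one_second = np.arange(FS_HZ, dtype=float)      # placeholder for 1 s of data
bursts = split_into_bursts(one_second)
overlapping = split_into_bursts(one_second, stride=400)
print(burst_ms, bursts.shape, overlapping.shape)
```

One second of data yields 15 non-overlapping bursts; a 50% overlap roughly doubles that count, which is one common way to enlarge a training set from a fixed recording.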
In this study, we also used a binary dataset consisting of the torque values of a wind turbine with a 30 kW induction generator, collected under rotor electrical unbalance (REU) conditions and under healthy conditions at varying loads and fault levels. The data were recorded for three different motor speeds of 1530, 1560, and 1590 rpm. Additional phase resistances of 0.099 Ω, 0.1485 Ω, and 0.198 Ω were introduced in one rotor phase to obtain 150%, 225%, and 300% REU, respectively. For the fault class, we only used the signals corresponding to 150% REU.
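As a quick consistency check on the figures above, the sketch below assumes that the resistances are given in ohms and that the REU percentage is the added resistance expressed relative to a nominal rotor phase resistance. Under that assumed definition, which is not stated in this section, all three fault levels imply the same nominal value.

```python
# Added resistances and REU levels as quoted in the text (ohms are assumed).
added_resistance_ohm = [0.099, 0.1485, 0.198]
reu_percent = [150.0, 225.0, 300.0]

# Assumed definition: REU% = 100 * added resistance / nominal phase resistance.
implied_nominal_ohm = [
    r / (p / 100.0) for r, p in zip(added_resistance_ohm, reu_percent)
]
print(implied_nominal_ohm)
```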
4.2. Application I: Model Optimization/Feature Selection
To evaluate how well this framework works in feature selection, this study compared our proposed model (XAI-CLSTM) with several state-of-the-art models: CLSTM [
9], sdAE [
33], a dfCNN [
34], CAE-fft [
35], and a CNN [
36].
Tests using the CWRU dataset show that our framework improves diagnostic results across different burst lengths. As seen in
Figure 5, for CWRU, XAI-CLSTM has an F1-score up to 7% higher than that of the other models. This difference is even clearer with shorter burst lengths, where fewer data points are available to characterize and identify the signal. The ability to diagnose from short bursts is crucial when trying to spot machine problems quickly. The wind turbine dataset in
Figure 6 shows similar results. However, for longer burst lengths, CLSTM performs a bit better. The other four models consistently scored lower than our proposed model on both datasets. In addition, the arm dataset illustrates that the accuracy remains unaffected when the XAI-CLSTM model is made lighter by employing more efficient feature selection using the proposed framework, as shown in
Figure 7.
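For reference, and assuming the score reported above is the F1-score, the sketch below shows how a macro-averaged F1 can be computed from predicted and true labels. The averaging scheme (macro) and the toy labels are assumptions; this section does not state which average was used.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, n_classes=3)
print(score)
```

Here the per-class scores are 0.5, 0.8, and 2/3, so the macro average is about 0.656.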
4.3. Application II: Evaluating Generative Models
The proposed XAI framework can be effectively applied to assess the synthetic samples made by GAN models. This paper includes a detailed evaluation of the Conditional Generative Adversarial Network (CGAN) introduced by [
37]. Their work brings forth a unique data augmentation approach tailored to fault data synthesis.
Ahang et al. presented a CGAN variant designed to train on both regular and faulty data under a single condition. Using this trained network, they generated fault data from standard samples for motor speeds lacking corresponding fault data. For evaluation, we use a CLSTM classifier that is already fine-tuned on the CWRU dataset, which acts as our benchmark.
The evaluation process entails a comparison of the signals produced by the CGAN against ground truth signals. This allows for a detailed examination of common patterns and significant signal inflection points. The comparison helps verify that the CGAN generalizes well, mitigating concerns regarding mode collapse. At the same time, it supports the CGAN’s ability to produce diverse, high-quality samples. Such samples not only align well with the ground truth training data, but also produce discernible class identifications. This is further confirmed by the SHAP values visualized by our framework, which distinctly highlight key signal features.
4.4. Application III: Model Explainer and Training
The XAI framework introduced in this manuscript serves another crucial function: elucidating the decision-making processes of models, particularly in discerning classes. This paper introduces a ‘signal explainer’ that visualizes SHAP values, illuminating the underlying rationale for class assignments.
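To make concrete what the visualized values represent, the following minimal sketch computes exact Shapley values for a tiny function by brute-force enumeration of feature subsets. It is purely illustrative: practical SHAP explainers for deep networks rely on approximations, and the toy model, the all-zero baseline, and the function names are assumptions rather than parts of the signal explainer.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their actual value; the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Weighted marginal contribution of feature i to this subset.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(z):
    # One additive term plus one interaction term (illustrative only).
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The additive term is credited entirely to the first feature, while the interaction term is split equally between the other two. The attributions sum to the difference between the model output at `x` and at the baseline, which is the additivity property that lets per-feature contributions be read as pushing for or against a class.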
To provide a tangible demonstration, we utilize the model and dataset presented by Zhang et al. Their research is centered on devising an online collision detection and identification system for robots that collaborate with humans. The architecture they propose consists of a signal classifier and an online diagnoser. This system monitors the robot’s sensory signals, promptly detecting and classifying physical human–robot interactions [
22].
In
Figure 8, we illustrate the integration of Zhang et al.’s model with our XAI framework. By feeding the model with their dataset, we are able to provide deeper insights into how their online classifier discerns between torque signals. Specifically, the classifier segregates signals into three distinct categories: ‘free’, ‘collision’, and ‘contact’; each of these indicates a different interaction status between the robot and its environment.
This interpretability afforded by the XAI framework empowers practitioners to comprehend essential signal patterns that are indicative of specific classes. As a result, there is enhanced trust in these online CM tools within the robotics industry. Moreover, this understanding paves the way for the more effective training of human operators, ensuring smoother human–robot collaboration.
In the evaluation of the signal eXplainer software using the seven-DoF KUKA dataset, a noteworthy observation emerges from
Figure 8. Specifically, the model incorrectly classifies certain data points, leading to false positives. This misclassification is particularly evident in the lower panel of the figure, where the model confuses the ‘contact’ and ‘collision’ classes due to their similarity in the feature space. More precisely, torques 3, 5, and 6 suggest a ‘collision’ event, whereas torques 2 and 4 steer the model toward identifying it as ‘contact’. This visualization shows how susceptible the model is to false positives in such ambiguous scenarios.
Furthermore, the seven-DoF KUKA dataset poses an inherent limitation on the signal eXplainer’s efficacy. The dataset is imbalanced in terms of class distribution; for example, it lacks sufficient data points representing torque signals for the ‘free’ class. This paucity of samples in the ‘free’ class renders the SHAP (SHapley Additive exPlanations) values less reliable. Consequently, when the model encounters a data point from the ‘free’ class, the SHAP values may misrepresent feature importance, skewing the explanation towards other classes. This is an intrinsic drawback that the current version of the model explainer is not equipped to address.
5. Conclusions
The critical importance of machinery and robot CM in ensuring operational efficiency and safety in the industrial sector cannot be overstated. As Industry 4.0 and 5.0 paradigms continue to shape the future of manufacturing and production, there is a pressing need for advanced CM methodologies. Signal-based CM methods, particularly those leveraging DL for fault diagnosis, have surged in prominence over recent years. However, despite their capabilities, a significant challenge has persisted, namely the ‘black-box’ nature of such models; this often hinders their adoption in critical scenarios due to the opaque decision-making processes.
In order to fill this important gap, this paper presents an innovative method that aims to improve the performance of DL-based CM models and provide insight into their decision-making processes, thus promoting trust and transparency, as shown in
Figure 9. Our proposed framework consists of a systematic four-step process, including (1) a comprehensive feature extraction phase, (2) advanced fault detection using a dual-path ConvLSTM architecture, (3) model optimization and feature refinement based on XAI principles, and (4) a dedicated module for interpreting inferences.
Drawing from a rich set of signal processing techniques, we extracted a comprehensive set of features from the raw signals. These features, sourced from both the time and frequency domains and further enriched with Fourier and wavelet spectra, served as the backbone of our diagnosis system. Through the use of SHAP values, we were able to pinpoint and prioritize the features that played pivotal roles in the ConvLSTM model’s decision making. This not only provided a cleaner and more streamlined input to our DL model, but also mitigated the risk of over-reliance on noise-prone or redundant features. The choice of a dual-path DL architecture, which uses an RNN-LSTM for temporal dependencies and a CNN for shift-invariant patterns, was pivotal in bolstering the robustness of our model, especially in the face of noisy data. A further key contribution was the application of SHAP values to elucidate the often complex reasoning of our network, revealing the key signal patterns and signatures linked to each diagnostic class.
In our study, we evaluated the performance of our XAI-CLSTM model against other algorithms in the field. Benchmarking on two different datasets showed that our framework achieved slightly better accuracy and computational efficiency. This advantage suggests that our approach is suitable for real-time applications, including resource-limited edge-computing environments.
In conclusion, the contributions of this paper have profound implications for the future of CM in the era of smart manufacturing. By ensuring a harmonious blend of performance, transparency, and efficiency, we hope to pave the way for safer and more reliable industrial ecosystems.
Future work will focus on domain adaptation and transfer learning, together with methods that address model interpretability, to improve the applicability of the proposed approach in different industrial scenarios.