1. Introduction
Rotating machinery plays an indispensable role in modern industry, powering numerous applications across the manufacturing, energy, and transportation sectors [
1]. Among their components, rolling element bearings (referred to simply as bearings hereafter) are vital yet vulnerable elements, with a heavy influence on the performance and lifespan of machines. Notably, it has been reported that up to 50% of motor faults are bearing-related [
2], while several issues in rotating machinery today can be traced to the improper design or application of bearings [
3]. Such failures can result in severe consequences, including unexpected downtime and costly repairs [
4,
5], or even catastrophic damage or loss of life in extreme cases [
6]. Moreover, in manufacturing environments, where continuous production is critical, disruptions caused by bearing failures can lead to substantial losses in productivity [
7]. Early and accurate bearing fault detection and classification are therefore essential in modern industrial and manufacturing practices.
Before the widespread use of machine learning (ML) and deep learning (DL) methodologies, bearing fault detection and classification relied on other widely adopted techniques to identify characteristic fault patterns. For instance, vibration analysis is commonly employed to detect frequency peaks associated with specific faults [
2], while signatures of these vibration frequencies have also been identified in the current spectrum through electrical signal processing [
8]. Additionally, features extracted from the Fourier spectrum of vibration signals were utilized to detect faults as peaks in the frequency domain [
9], and acoustic emission (AE) monitoring offered earlier and more fine-grained fault detection compared to vibration monitoring [
10]. However, these methods are often bound to specific fault scenarios or experimental conditions, which limit their applicability to diverse operating environments. For example, directly identifying bearing faults through raw vibration signals is challenging, as vibrations are typically dominated by imbalance and misalignment components [
11]. Moreover, the experimental results of [
8] were based on cases of extensive bearing damage, raising concerns about the applicability of this approach for less severe faults [
11]. Features like spectral kurtosis have been shown to be sensitive to strong harmonic interference when used as the only fault indicator [
12], and have also led to poor model accuracy when extracted from non-stationary signals [
13]. Finally, AE monitoring has also proven to be highly susceptible to background noise [
14].
Despite their limitations, the aforementioned techniques laid the foundation for identifying which types of sensor data are most effective for detecting and classifying bearing faults. Modern data-driven approaches have built upon this groundwork, incorporating features extracted from sensor data to develop more robust and generalizable frameworks. Examples of features extracted directly from the time-domain signal include, but are not limited to, the root mean square (RMS), crest factor (CF), skewness, and kurtosis [
15]. Spectral features, such as fundamental frequencies, spectral kurtosis, and spectral entropy, are obtained by applying a Fast Fourier Transform (FFT) to the time-domain signals and have also been widely used in such applications [
16]. Beyond time- and frequency-domain features, time–frequency representations, such as those derived from the Short-Time Fourier Transform or wavelet transformations, are often used to extract features for capturing transient and non-stationary behaviors [
17]. Additionally, AE monitoring, despite its susceptibility to noise, has shown promising results when combined with modern time-series deep learning methods [
18].
Drawing from these diverse feature sets, a series of ML models such as
k-Nearest Neighbors [
19,
20], Support Vector Machines (SVMs) [
21,
22], and Random Forests [
23,
24,
25] have been explored for the tasks of bearing fault detection and classification. More recently, the widespread adoption of sensors in industrial settings and the advent of the big data era have driven the use of DL approaches, initially based on fundamental architectures such as Autoencoders (AEs) [
26,
27], Recurrent Neural Networks [
27,
28], Convolutional Neural Networks (CNNs) [
29,
30,
31] and Generative Adversarial Networks [
32,
33]. Building further on these core architectures, recent studies have introduced more sophisticated hybrid models that combine CNNs with AEs and attention mechanisms [
13,
34], or integrate recurrent models like Bidirectional Long Short-Term Memory (BiLSTM) networks with smoothing techniques [
18].
Although successful in achieving high performance for bearing fault detection and classification, ML and especially DL models often fall short in areas where traditional approaches excel, with explainability being a notable example. The ability to understand and interpret a model’s decisions is crucial, especially in safety-critical applications or when deeper insights into the underlying physical processes are required [
35]. In addition to explainability, a significant challenge arises in deploying DL models for real-time condition-based monitoring on edge devices, i.e., resource-constrained computing units located close to the machinery they monitor. Many DL architectures are computationally intensive, making them unsuitable for resource-constrained environments [
36]. Another important consideration is the quantity and quality of features used by these models. While leveraging a large number of features can often yield superior results, achieving comparable performance with fewer features is far more desirable [
37]; budget constraints and technical limitations in practical scenarios demand careful sensor selection, as collecting an exhaustive set of measurements is neither feasible nor economical. Furthermore, the effectiveness of features can vary significantly, depending on the dataset or system under study; features that perform well for one problem may be suboptimal for another. This variability highlights the need for models capable of adaptively selecting the most relevant features for a given problem [
38]. To address these challenges, this paper presents a unified framework centered around Kolmogorov–Arnold Networks (KANs), designed to provide explainability, efficiency, and adaptive feature selection.
Inspired by the Kolmogorov–Arnold representation theorem, KANs were recently introduced by [
39] as an alternative to Multi-Layer Perceptrons (MLPs), serving as a new paradigm for the underlying architecture of DL models. Unlike in the case of MLPs, where activation functions are fixed, KANs contain trainable univariate functions as activations, allowing them to represent relationships in symbolic forms. This inherent explainability, along with their demonstrated performance in domains such as differential equations [
40,
41,
42], high-energy physics [
43,
44], and smart systems and devices [
45,
46], makes KANs a promising candidate for addressing both scientific and engineering problems [
47]. In the context of bearing fault detection and classification, there is a notable lack of studies utilizing KANs, with the exception of [
48]. In that study, the Case Western Reserve University (CWRU) bearing dataset [
49] was employed; however, the primary focus of the paper was unrelated to bearing fault diagnosis and instead aimed at the early prediction of natural gas pipeline leaks.
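To make the distinction from MLPs concrete, the following minimal PyTorch sketch shows a single KAN-style layer in which every input–output edge carries its own trainable univariate function. For brevity, the univariate functions are parameterized here with a fixed Gaussian basis rather than the B-splines of the original KAN formulation; the class name, basis size, and center range are illustrative assumptions and do not reproduce the pykan implementation.

```python
# Minimal conceptual sketch of a single KAN-style layer: each edge
# (input j -> output c) has its own trainable univariate function phi_{j,c},
# here a linear combination of fixed Gaussian basis functions.
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, n_basis: int = 8):
        super().__init__()
        # Fixed basis centers; only the combination coefficients are trained.
        self.centers = nn.Parameter(torch.linspace(-2, 2, n_basis), requires_grad=False)
        # One coefficient vector per (input, output) edge.
        self.coeffs = nn.Parameter(0.1 * torch.randn(in_features, out_features, n_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_features)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)    # (batch, in, n_basis)
        edge_out = torch.einsum("bif,iof->bio", basis, self.coeffs)  # phi_{i,o}(x_i)
        return edge_out.sum(dim=1)                                   # sum over inputs -> (batch, out)

# A shallow (single-layer) KAN classifier over d features and C classes:
# model = TinyKANLayer(in_features=d, out_features=C)
```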
Building on the potential of KANs and addressing the identified gaps, the main aspects of the proposed framework—and thus, the main contributions of the current work—can be summarized in the following points:
Automatic and explainable feature selection: By training shallow KANs with sparsity-inducing regularization, the framework automatically identifies the minimal set of highly relevant features based on explicit feature attribution scores and dynamic thresholds.
Interpretable and lightweight models: The trained activation functions are converted into symbolic mathematical representations based on the minimization of a custom cost function, enabling direct analysis and interpretation beyond black-box approaches while maintaining efficiency suitable for deployment on edge devices.
Unified bearing fault diagnosis framework: Fault detection, fault classification, and severity estimation are seamlessly integrated within a single, consistent methodology.
Generalizability beyond bearing faults: The versatility and generalization capabilities of the proposed approach are demonstrated through successful application to datasets containing diverse machinery faults beyond bearings alone.
To the best of the authors’ knowledge, this is among the first attempts to address bearing faults in a unified manner, handling multiple tasks such as fault detection, fault classification, and severity estimation. Unlike methods that only provide model architectures, this approach encompasses the entire pipeline, including data pre-processing, feature extraction, automated feature selection, and inference, within a single, lightweight deep learning framework. A recent related study by [
7] touches upon some of these challenges by employing an SVM model for fault classification and genetic algorithms (GAs) for automated feature selection. Nevertheless, GAs are computationally demanding, and SVMs, along with GAs, lack the explainability provided by the proposed KAN-based approach.
The remainder of the present paper is structured as follows:
Section 2 presents the proposed framework in detail, including its components, methodology, and theoretical foundation. Subsequently, the two datasets utilized in this study are introduced in
Section 3, along with a discussion on the rationale for their selection and a presentation of the feature libraries extracted to implement the framework. In
Section 4, the experimental results obtained on both datasets are reported, focusing on selected features, model performance, and symbolic representations. Finally,
Section 5 provides a summary and discussion of the work’s main findings.
3. Datasets and Feature Extraction
As previously outlined, the proposed framework has been designed for applicability across a wide range of problems beyond bearing faults. To apply it for bearing fault detection and classification, two widely recognized datasets are selected: the CWRU bearing dataset [
49] and the Machinery Fault Database (MaFaulDa) dataset [
57,
58]. The CWRU dataset is chosen due to its characterization as a dataset where feature selection is highly nontrivial, containing data that deviate from the typical characteristics expected for certain fault types [
6]. The MaFaulDa dataset, on the other hand, is selected for its broader scope, as it includes not only bearing faults but also additional types of machinery faults, thereby enabling the demonstration of the framework’s generalizability within a single dataset. Before detailing the process of constructing a feature library from the raw time-series signals of the two datasets, a more detailed introduction to each dataset is provided in
Section 3.1 and
Section 3.2.
3.1. CWRU Dataset
The CWRU dataset was generated using a test rig designed to simulate bearing faults under controlled conditions. The setup consisted of a 2-horsepower motor, a torque transducer, and a dynamometer, with the test bearings supporting the motor shaft. Three types of single-point faults were induced in the bearings using electro-discharge machining: inner raceway (IR), ball (B), and outer raceway (OR) faults, with fault diameters ranging from 7 mils (1 mil is equivalent to 0.001 inches) to 40 mils. Faults were applied to both the drive-end and fan-end bearings. The dataset comprises vibration measurements collected using accelerometers attached to the motor housing at the 12 o’clock position for both the drive-end and fan-end bearings, with an additional accelerometer attached to the base plate in some experiments. The signals were recorded at sampling rates of 12 kHz and, for certain drive-end faults, 48 kHz. For OR faults, experiments were conducted at different positions relative to the load zone (3 o’clock, 6 o’clock, and 12 o’clock) to capture variations in the vibration response. Thus, the dataset contains six classes for classification, labeled as N (normal), B, IR, OR@3, OR@6, and OR@12.
The original dataset’s files can be categorized along several axes. Based on motor speed, the files are divided into four groups: 1730, 1750, 1772, and 1797 rotations per minute (RPM). Based on fault location and sampling rate, the dataset includes normal files measured at 48 kHz, drive-end faults measured at 12 kHz, fan-end faults measured at 12 kHz, and drive-end faults measured at 48 kHz. Additionally, the files differ in terms of the time-series data they contain: some include only drive-end measurements, most include both drive-end and fan-end measurements, and a few include drive-end, fan-end, and base measurements. Due to the inconsistencies mentioned in [
59], the version of the dataset curated for the purposes of the cited work was used. Moreover, all 48 kHz drive-end measurements were excluded for two reasons: to ensure uniformity, as corresponding fan-end measurements at 48 kHz were unavailable, and because these files exhibited significant variability in the number of measurements, making them unsuitable for consistent sampling for the purposes of feature extraction. This process resulted in a total of 101 retained files.
3.2. MaFaulDa Dataset
The MaFaulDa dataset was created using a test rig designed to emulate the dynamics of motors with two shaft-supporting bearings. It comprises multivariate time-series data collected from sensors mounted on a SpectraQuest alignment/balance vibration trainer machinery fault simulator. The sensors included one triaxial accelerometer for the underhang bearing (the bearing located between the motor and rotor) and three industrial accelerometers for the overhang bearing (the bearing located outside the rotor, opposite the motor), oriented along the axial, radial, and tangential directions. Additionally, an analog tachometer measured the system’s rotational frequency, and a microphone captured operational sound. All signals were recorded at a sampling rate of 50 kHz over a duration of 5 s.
The dataset includes scenarios representing both normal operation and various fault conditions. In the normal class (N), the system operated without faults across 49 distinct rotation frequencies, ranging from 737 to 3686 RPM at approximately 60 RPM intervals. Bearing faults, similar to those in the CWRU dataset, involved defects in the inner raceway (IR), ball (B), and outer raceway (OR). These faults were studied in both bearings, underhang and overhang, one at a time. To ensure fault detectability, additional imbalances of 6 g, 10 g, and 20 g were introduced. Bearing fault scenarios were recorded under 49 rotation frequencies for lighter imbalances, while fewer frequencies were studied for heavier ones due to increased vibrations.
Beyond bearing faults, the dataset also includes additional machinery faults, namely imbalance (I) and axis misalignment. Imbalance faults were simulated by attaching varying load weights (6 g to 35 g) to the rotor. For weights up to 25 g, all 49 rotation frequencies were studied, whereas higher weights limited the maximum frequency to 3300 RPM due to increased vibrations. Axis misalignment was divided into horizontal misalignment (HM) and vertical misalignment (VM), induced by shifting the motor shaft by offsets of 0.5 mm to 2.0 mm for the former, and 0.51 mm to 1.9 mm for the latter. For each misalignment severity, the same 49 rotation frequencies as in the normal class were studied. In total, the dataset corresponds to 10 distinct classes and comprises 1951 data files, all of which were retained for feature extraction.
3.3. Feature Library
The extracted features were acquired by first augmenting and then preprocessing data from both datasets. Data augmentation was particularly critical for the CWRU dataset, which contained only four data files per fault type and severity, corresponding to the four rotational frequencies studied. In contrast, the MaFaulDa dataset included nearly 50 examples per fault case, yet augmentation was still applied to further enhance the dataset. The first step involved identifying the rotational frequency of each file. For the CWRU files, the exact RPM values were already known; for the MaFaulDa dataset, however, the rotational frequency had to be estimated per file, which was done using the two-step algorithm proposed in [60], applied to the tachometer signal. This method was selected to avoid misidentifying the rotational frequency, which could otherwise be obscured by spectral peaks introduced by machine faults in the signal’s frequency spectrum.
Once the rotational frequency was determined, it was combined with the sampling rate to split each time-series into smaller segments, each spanning N complete motor rotation cycles. The choice of N balances the trade-off between dataset size and segment quality: a smaller value yields more segments, but of lower quality, while a larger value preserves the quality of the original time-series at the expense of fewer samples. For this study, the value of N was chosen as a compromise, yielding approximately six segments per file for the CWRU dataset. The same number of cycles was used for the MaFaulDa dataset to maintain consistency, resulting in augmented datasets containing 603 and 6268 segments for CWRU and MaFaulDa, respectively.
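A minimal sketch of this segmentation step is given below; the helper name, the floor-division rule for the segment count, and the example value of n_cycles are illustrative assumptions rather than the study’s exact implementation.

```python
# Split a recorded signal into non-overlapping segments covering a fixed
# number of motor rotation cycles, given the sampling rate and rotational
# frequency. Names and the example n_cycles value are placeholders.
import numpy as np

def segment_signal(signal: np.ndarray, f_s: float, f_rot: float, n_cycles: int) -> list[np.ndarray]:
    """Split a 1-D time series into segments of n_cycles complete rotations."""
    samples_per_segment = int(n_cycles * f_s / f_rot)  # samples spanning n_cycles rotations
    n_segments = len(signal) // samples_per_segment
    return [signal[i * samples_per_segment:(i + 1) * samples_per_segment]
            for i in range(n_segments)]

# Example: a 12 kHz CWRU-like signal at 1797 RPM split into 20-cycle segments
# (n_cycles = 20 is a placeholder, not the value used in the paper).
f_s, rpm, n_cycles = 12_000, 1797, 20
x = np.random.randn(10 * f_s)  # stand-in for a recorded vibration signal
segments = segment_signal(x, f_s, rpm / 60.0, n_cycles)
```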
Using the augmented datasets, a series of time-domain, frequency-domain, and time–frequency features were extracted, based on the established literature on machinery fault diagnosis. For the time-domain features, the extracted metrics included the RMS, mean, variance, skewness, kurtosis, entropy, shape factor, crest factor, impulse factor, and margin factor, along with histogram upper and lower bounds, as described in [
16]. From the frequency domain, spectral skewness and kurtosis were calculated after applying an FFT to each signal. Additionally, the signal magnitudes at the fundamental frequency and its first two harmonics were extracted, following [
58].
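The sketch below illustrates how a few of the listed time- and frequency-domain features could be computed for a single segment; the formulas follow standard definitions, and the spectral skewness/kurtosis and harmonic-magnitude computations shown here are plausible realizations rather than the exact ones used in this work.

```python
# Illustrative computation of selected time- and frequency-domain features
# for one segment (standard definitions, not the paper's implementation).
import numpy as np
from scipy.stats import skew, kurtosis

def basic_features(seg: np.ndarray, f_s: float, f_rot: float) -> dict:
    rms = np.sqrt(np.mean(seg ** 2))
    feats = {
        "rms": rms,
        "variance": np.var(seg),
        "skewness": skew(seg),
        "kurtosis": kurtosis(seg),
        "crest_factor": np.max(np.abs(seg)) / rms,
        "shape_factor": rms / np.mean(np.abs(seg)),
    }
    # Frequency domain: magnitude spectrum, plus the magnitude at the
    # fundamental rotational frequency and its first two harmonics.
    spectrum = np.abs(np.fft.rfft(seg))
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / f_s)
    for h in (1, 2, 3):
        idx = np.argmin(np.abs(freqs - h * f_rot))
        feats[f"magnitude_harmonic_{h}"] = spectrum[idx]
    feats["spectral_skewness"] = skew(spectrum)
    feats["spectral_kurtosis"] = kurtosis(spectrum)
    return feats
```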
For time–frequency features, wavelet transformations were employed, as they are highly effective for identifying machinery faults. The
pywavelets library [
61] was used to perform a multilevel decomposition of order 4 on each segment, utilizing a biorthogonal wavelet. Following [
7], the features derived from the fine-grained wavelet coefficients included the mean, median, RMS, standard deviation, variance, skewness, kurtosis, and entropy. The values at the 5th, 25th, 75th, and 95th percentiles were also extracted, along with the number of mean and zero crossings. Using this approach, a feature library of 62 and 243 features was compiled for the augmented CWRU and MaFaulDa datasets, respectively. This corresponds to 31 features per dataset signal, with the exception of the tachometer signal in MaFaulDa, from which no spectral features were extracted [
58].
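A hedged example of the wavelet-based feature computation with PyWavelets is shown below; the specific biorthogonal wavelet name ('bior3.5'), the histogram-based entropy estimate, and the choice of the finest detail level as the “fine-grained” coefficients are assumptions made for illustration.

```python
# Sketch of wavelet-based features via a level-4 multilevel decomposition
# with a biorthogonal wavelet (wavelet name assumed for illustration).
import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def wavelet_features(seg: np.ndarray) -> dict:
    coeffs = pywt.wavedec(seg, wavelet="bior3.5", level=4)  # [cA4, cD4, cD3, cD2, cD1]
    d1 = coeffs[-1]  # finest-scale detail coefficients
    hist, _ = np.histogram(d1, bins=64)
    p = hist / hist.sum()
    p = p[p > 0]
    return {
        "w_mean": np.mean(d1),
        "w_median": np.median(d1),
        "w_rms": np.sqrt(np.mean(d1 ** 2)),
        "w_std": np.std(d1),
        "w_variance": np.var(d1),
        "w_skewness": skew(d1),
        "w_kurtosis": kurtosis(d1),
        "w_entropy": -np.sum(p * np.log(p)),
        "w_p5": np.percentile(d1, 5),
        "w_p25": np.percentile(d1, 25),
        "w_p75": np.percentile(d1, 75),
        "w_p95": np.percentile(d1, 95),
        "w_zero_crossings": int(np.sum(np.diff(np.sign(d1)) != 0)),
        "w_mean_crossings": int(np.sum(np.diff(np.sign(d1 - np.mean(d1))) != 0)),
    }
```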
A detailed list of all extracted features for this work is provided in
Appendix B. It should be noted that certain features overlap in terms of the information they encode; for instance, the impulse factor is the product of the crest and shape factors, while the standard deviation is the square root of the variance. This intentional redundancy allows the framework to identify the most relevant features automatically and discard the rest during the feature selection process. After all, if the most effective features for the studied problem were already known, the feature selection phase would be redundant.
4. Experimental Results
This section presents the experimental findings obtained by applying the proposed framework to the two constructed feature libraries, addressing three distinct tasks: fault detection, fault classification, and severity classification for each fault type. Exclusively shallow KANs, i.e., models with a single layer, were considered throughout the experiments to minimize the number of model parameters, thus ensuring that the models remain lightweight and that their symbolic representations do not become overly complex.
For all tasks, the datasets were split in a stratified manner into training, validation, and evaluation sets in a 70%-15%-15% ratio, respectively, and the features were standardized. Model training was performed with the Adam optimizer, using Cross-Entropy as the non-regularizing loss function, as all tasks were classification problems. The primary performance metric was the F1-Score, chosen for its greater suitability in handling imbalanced datasets compared to accuracy. The KAN implementation and training were performed using the
PyTorch 2.4 [
62] and
pykan [
39,
47] frameworks, running on a single NVIDIA GeForce RTX 4070 GPU.
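For concreteness, the described setup (stratified 70%-15%-15% split, standardization, Adam, cross-entropy, F1-Score) can be sketched as follows; the KAN is treated as a generic torch.nn.Module, the macro-averaged F1-Score, learning rate, and full-batch updates are assumptions, and the sparsity regularization used during feature selection is omitted.

```python
# Minimal training-loop sketch of the described experimental setup.
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

def make_splits(X, y, seed=0):
    # Stratified 70/15/15 split, features standardized with training-set statistics.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_ev, y_val, y_ev = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    return [scaler.transform(s) for s in (X_tr, X_val, X_ev)], (y_tr, y_val, y_ev)

def train_and_score(model: nn.Module, X_tr, y_tr, X_val, y_val, epochs=80, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    X_tr_t, y_tr_t = torch.tensor(X_tr, dtype=torch.float32), torch.tensor(y_tr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(model(X_tr_t), y_tr_t)
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = model(torch.tensor(X_val, dtype=torch.float32)).argmax(dim=1).numpy()
    return f1_score(y_val, preds, average="macro")
```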
During the feature selection phase, shallow KANs with a fixed, small grid size and spline order were trained non-adaptively for 80 epochs. Regarding hyper-parameter tuning, the regularization parameter was swept over a range whose bounds were chosen empirically: values below the lower bound had no discernible regularizing effect, while values above the upper bound imposed overly strict regularization, hindering the network’s training. The feature selection threshold parameter was varied analogously, between values small enough to retain all candidate features and values large enough to discard (almost) all of them. Both ranges were discretized into 20 equidistant points. When multiple Pareto-optimal solutions were identified, the model achieving the highest F1-Score with up to 10 features was selected. This constraint is intentionally strict, as most state-of-the-art models utilize at least 15 features on these datasets; however, developing lightweight models focused on only the most informative features aligns with the study’s primary objective.
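The grid search and Pareto-based selection rule just described can be summarized by the following illustrative sketch, in which train_fn stands in for the actual routine that trains a regularized shallow KAN and applies the attribution threshold; the parameter names and the dominance check are simplifications of the framework, not its verbatim implementation.

```python
# Hedged sketch of the (regularization, threshold) grid search with
# Pareto-front selection over (number of retained features, validation F1).
import itertools

def pareto_front(results):
    """results: list of dicts with keys 'n_features' and 'f1'."""
    front = []
    for r in results:
        dominated = any(
            o["n_features"] <= r["n_features"] and o["f1"] >= r["f1"]
            and (o["n_features"] < r["n_features"] or o["f1"] > r["f1"])
            for o in results
        )
        if not dominated:
            front.append(r)
    return front

def feature_selection_search(lambdas, taus, train_fn):
    """train_fn(lam, tau) -> (n_selected_features, validation_f1)."""
    results = []
    for lam, tau in itertools.product(lambdas, taus):
        n_feat, f1 = train_fn(lam, tau)
        results.append({"lambda": lam, "tau": tau, "n_features": n_feat, "f1": f1})
    front = pareto_front(results)
    # Selection rule from the text: best F1 among Pareto points with <= 10 features.
    eligible = [r for r in front if r["n_features"] <= 10]
    return max(eligible, key=lambda r: r["f1"]) if eligible else None
```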
For the model selection phase, higher-order KANs were employed, and a grid search was conducted over the grid size G and the remaining spline hyper-parameter, whose range naturally spans a bounded interval. The maximum considered value of G was limited to 50, as larger grids would be unnecessarily complex for single-layer architectures; this choice is further supported by the obtained results, where the optimal value of G never exceeded 20. Each model instance was trained adaptively for 200 epochs, with grid updates performed every 10 epochs until epoch 150. For the symbolic fitting cost function given in Equation (13), the weighting parameters were selected to prioritize a high goodness of fit over low complexity, unless model complexity became excessively high, in which case the exponential penalty term dominates. In cases where the resulting Pareto front contained multiple candidate models, the one with the highest average F1-Score between the original and symbolic representations was selected.
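The symbolic fitting step can be illustrated with the generic stand-in below: each candidate analytic primitive is fitted to a learned activation and scored by a cost that rewards goodness of fit while penalizing complexity through an exponential term. The candidate set, weights, and functional form of the cost are assumptions for illustration and do not reproduce Equation (13).

```python
# Generic stand-in for symbolic fitting of a learned univariate activation:
# fit each candidate primitive, then pick the one minimizing a hypothetical
# cost combining goodness of fit and an exponential complexity penalty.
import numpy as np
from scipy.optimize import curve_fit

CANDIDATES = {
    # name: (function, initial guess, complexity score)
    "linear":  (lambda x, a, b: a * x + b, [1.0, 0.0], 1),
    "sigmoid": (lambda x, a, b, c, d: c / (1 + np.exp(-(a * x + b))) + d, [1.0, 0.0, 1.0, 0.0], 3),
    "sine":    (lambda x, a, b, c, d: c * np.sin(a * x + b) + d, [1.0, 0.0, 1.0, 0.0], 3),
}

def fit_symbolic(x, y, alpha=1.0, beta=0.1):
    best_name, best_cost = None, np.inf
    for name, (fn, p0, complexity) in CANDIDATES.items():
        try:
            params, _ = curve_fit(fn, x, y, p0=p0, maxfev=5000)
        except RuntimeError:
            continue  # fit did not converge for this primitive
        r2 = 1 - np.sum((y - fn(x, *params)) ** 2) / np.sum((y - y.mean()) ** 2)
        cost = -alpha * r2 + beta * np.exp(complexity)  # hypothetical cost, not Eq. (13)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```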
4.1. Fault Detection
For the fault detection task, all data samples were categorized into two classes, normal (N) and faulty (F), with the latter encompassing all fault types. Fault detection is generally simpler than fault classification, as the model only needs to distinguish between normal and anomalous data. However, this task suffers from a significant imbalance in class representation, which presents a major challenge.
In both the CWRU and MaFaulDa datasets, the normal class is severely underrepresented. For the CWRU dataset, the normal class constitutes only 3.96% of the dataset, while the other classes range from 11.88% to 23.76%. Similarly, in the MaFaulDa dataset, the normal class accounts for just 2.51%, with the remaining classes spanning from 7.02% to 17.07%. When restructured for fault detection, the imbalance becomes even more pronounced, with the normal class constituting only 3.96% versus 96.04% for the faulty class in the CWRU dataset, and 2.51% versus 97.49% for the faulty class in the MaFaulDa dataset. This extreme imbalance means that even a trivial classifier which predicts all entries as faulty would achieve a high accuracy (e.g., 97.49% for MaFaulDa) while failing to provide any meaningful insights.
To address this imbalance, a balancing strategy is required. Common approaches include undersampling the dominant class and oversampling the minority class with techniques such as the Synthetic Minority Oversampling Technique (SMOTE) [
63]. For this work, undersampling was adopted; specifically, 30 samples from each fault class in the CWRU augmented dataset and 60 samples from each fault class in the MaFaulDa augmented dataset were randomly selected. This adjustment reduced the imbalance to 12.28% normal versus 87.72% faulty for CWRU and 22.86% normal versus 77.14% faulty for MaFaulDa. Although still imbalanced, these distributions are far more manageable.
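A simple per-class random undersampling routine matching this strategy could look as follows; the DataFrame layout and column names are assumed for illustration.

```python
# Randomly undersample each fault class to a fixed number of segments
# (e.g., 30 per class for CWRU, 60 for MaFaulDa) while keeping all normal samples.
import pandas as pd

def undersample_faulty(df: pd.DataFrame, label_col: str, normal_label: str,
                       n_per_fault_class: int, seed: int = 0) -> pd.DataFrame:
    normal = df[df[label_col] == normal_label]
    faulty = (df[df[label_col] != normal_label]
              .groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=min(n_per_fault_class, len(g)), random_state=seed)))
    return pd.concat([normal, faulty]).sample(frac=1.0, random_state=seed)  # shuffle rows
```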
Following this preprocessing step, the framework’s feature selection process, as detailed in
Section 2, was applied to identify the most relevant features for fault detection. Regarding the CWRU dataset, the Pareto front resulting from the feature selection grid search was a singleton, yielding a single optimal pair of regularization and threshold values. These values resulted in the selection of a single feature, the 25th percentile value of the drive-end signal (see Appendix B), as illustrated in Figure 3. Proceeding to the model selection phase, the grid search using only this feature again produced a Pareto front with a single point. Based on Equation (6), the resulting KAN model comprised only 28 trainable parameters. The combination of a single feature and a small yet fully adaptive grid suggests that fault detection in the CWRU dataset is relatively straightforward, so approaching the task with a complex model is neither necessary nor good practice. This is further corroborated by the final evaluation of the chosen model, for both the regular and symbolic versions of the KAN, as shown in the confusion matrices of
Figure 4.
The selection of a single feature offers the opportunity to highlight the importance of extracting symbolic representations for the trained KAN, thus making its decisions and architecture fully interpretable. In this case, the symbolic representations of the KAN’s two output edges are given by Equations (14) and (15), where x denotes the scaled feature and the nonlinearity appearing in both expressions is the sigmoid function; all numbers have been rounded to the second decimal digit. Using these analytical expressions, a sample is classified as normal if the normal-class output exceeds the fault-class output, and as a fault otherwise. Equations (14) and (15) also allow for the study of otherwise inaccessible (or hard-to-compute) properties of the classification problem, such as determining the decision boundary by equating the two expressions and solving for x.
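As a numerical illustration of this use of the symbolic expressions, the sketch below equates two hypothetical sigmoid-based class outputs and solves for the boundary with a root finder; the coefficients and functional forms are placeholders and are not those of Equations (14) and (15).

```python
# Find the decision boundary of two symbolic class outputs by solving
# g_normal(x) = g_fault(x) numerically (coefficients are hypothetical).
import numpy as np
from scipy.optimize import brentq

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_normal(x):  # placeholder for the normal-class output expression
    return sigmoid(2.3 * x - 1.1)

def g_fault(x):   # placeholder for the fault-class output expression
    return sigmoid(-1.7 * x + 0.4)

# Decision boundary: scaled feature value where both outputs coincide.
x_boundary = brentq(lambda x: g_normal(x) - g_fault(x), -5.0, 5.0)
predict = lambda x: "N" if g_normal(x) > g_fault(x) else "F"
```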
Figure 5 illustrates the two curves alongside all CWRU data points, color-coded by class. The decision boundary is also depicted, demonstrating that all of the dataset’s samples are correctly classified using these symbolic expressions.
The same procedure was applied to the MaFaulDa dataset. In this case, the feature selection process resulted in a Pareto front with three candidate points. The corresponding hyper-parameter values, the associated F1-Scores, and the number of features retained for each point are presented in Table 1. Although the configuration with the fewest features also exhibited the lowest performance, it still achieved a remarkably high F1-Score of 97.07%. Notably, the four features selected in the lowest-performing case are a subset of the six features selected in the middle-performing case, which, in turn, are a subset of the nine features selected in the highest-performing case. This hierarchical relationship highlights the consistency of the framework. Following the selection rule of prioritizing the highest-performing configuration with no more than 10 features, the nine-feature configuration was chosen; the retained features are illustrated in
Figure 6.
For the model selection phase, using the nine selected features, the grid search over the grid size and spline hyper-parameters produced a Pareto front with a single point, similar to the results for the CWRU dataset. The final model had 252 trainable parameters and achieved an F1-Score of 100% in its regular form and 98.08% after symbolic fitting. This result is not unexpected, as symbolic fitting cannot always perfectly replicate the trained activation functions using analytical expressions. Consequently, the trade-off between performance and interpretability must always be considered; in this case, however, the performance impact is minimal. The performance of each KAN version is shown in the confusion matrices of Figure 7. As for the KAN’s symbolic output expressions in this case, they are functions of nine variables, meaning that visualizing their decision boundary, now a hypersurface in the nine-dimensional feature space, would require a ten-dimensional equivalent of Figure 5. Although such a depiction is impractical, these expressions remain computationally inexpensive and provide valuable insights into the model’s predictions, for instance, by keeping most features constant and examining decision boundaries as one or two features vary. Notably, the presence of multiple features allows for an analysis of their final importance in the trained model’s predictions. The normalized feature attribution scores, depicted in
Figure 8, illustrate the relative contribution of each selected feature.
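One plausible way to obtain such normalized attribution scores from a single-layer KAN is sketched below, where each feature is scored by the mean absolute contribution of its learned edge functions over the data; this is an illustrative scoring rule and not necessarily the exact attribution used by the framework or by pykan.

```python
# Normalized feature attribution for a single-layer KAN-like model, scored by
# the average absolute edge-function output per input feature.
import numpy as np

def attribution_scores(edge_outputs: np.ndarray) -> np.ndarray:
    """
    edge_outputs: array of shape (n_samples, n_features, n_classes) holding
    phi_{j,c}(x_j) for every sample, input feature j, and output class c.
    Returns one normalized score per feature (scores sum to one).
    """
    scores = np.mean(np.abs(edge_outputs), axis=(0, 2))
    return scores / scores.sum()
```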
4.2. Fault Classification
Moving to fault classification, no oversampling or undersampling techniques were applied, despite the datasets being imbalanced as previously noted. For the CWRU dataset, the feature selection grid search yielded a single optimal pair of regularization and threshold parameters, resulting in the selection of seven features. During model selection, the corresponding grid search identified the hyper-parameter combination achieving the highest average F1-Score between the model’s regular and symbolic versions. The final feature importance scores for this model, which comprises 756 trainable parameters and was evaluated on the evaluation set, are shown in the left plot of
Figure 9. The model in its regular form achieved a perfect F1-Score of 100%, while the symbolic version slightly underperformed at 97.80%. The corresponding confusion matrices for both versions are provided in
Figure 10.
Applying the same approach to the MaFaulDa dataset, the feature selection process yielded a large Pareto front, ranging from models that retained a single feature (with poor performance) to those retaining up to 70 features (achieving a validation set F1-Score of 99.89%). The selected configuration resulted in the retention of ten features. The subsequent grid search for optimal model hyper-parameters produced a smaller Pareto front with only two points. Among these, the chosen configuration corresponds to 1800 trainable parameters and was evaluated on the evaluation set, achieving a final F1-Score of 97.24% in its regular form and 92.03% in its symbolic form. The normalized feature attribution scores for the final model are shown in the right plot of
Figure 9, while the corresponding confusion matrices are depicted in
Figure 11.
It becomes evident that, unlike in the fault detection task, fault classification requires a greater number of features for each dataset to capture the details required for a more fine-grained classification than a binary one. For the CWRU dataset, the trained KAN once again achieved a perfect F1-Score in its regular form. However, for the MaFaulDa dataset, even the regular version of the KAN left some data points misclassified. This is not unexpected, as the MaFaulDa dataset includes not only bearing faults but also additional machinery faults. This, in fact, highlights the generalizability of the proposed framework, demonstrating its ability to perform well in more diverse scenarios beyond the narrower domain of bearing faults. It is worth noting that higher performance could have been achieved by increasing the number of selected features or adding more layers to the KAN. However, the focus of this study is not solely on achieving perfect scores but rather on promoting lightweight models with minimal parameters, favoring interpretability and ensuring suitability for deployment in resource-constrained environments.
4.3. Severity Classification
Apart from fault detection and classification, both datasets allow for an additional type of investigation: the analysis of fault severity. For the CWRU dataset, the faults have three severity levels (7 mils, 14 mils, and 21 mils), the exception being OR@12 faults, which have only two severity categories (7 mils and 21 mils). For the MaFaulDa dataset, severity classification encompasses a broader range of categories across different fault types. Imbalance faults are categorized into seven severity levels, corresponding to imbalance loads ranging from 6 g to 35 g. Horizontal misalignment faults have four severity levels (0.5 mm, 1.0 mm, 1.5 mm, and 2.0 mm), while vertical misalignment faults have six severity levels (0.51 mm, 0.63 mm, 1.27 mm, 1.4 mm, 1.78 mm, and 1.9 mm). Additionally, the bearing faults in the MaFaulDa dataset are all classified into four severity levels, determined by the extra imbalance introduced to amplify their effects, ranging from 0 g to 20 g.
Given these structured severity levels, it is natural to approach severity analysis as a classification problem. To this end, the proposed framework was applied, following the established processes of feature selection, model selection, and model evaluation for each fault type’s severity classification.
Table 2 and
Table 3 present the intermediate (i.e., feature selection and model selection outcomes) and final (i.e., F1-Scores for both the regular and symbolic KANs on the evaluation set) results for the CWRU and MaFaulDa datasets, respectively. For each model, the corresponding number of total trainable parameters is also reported.
Based on these results, previous observations regarding the relative simplicity of the CWRU dataset are reaffirmed: once again, the severity of each fault type can be classified almost perfectly using a very small number of features (no more than three), with only a single misclassification occurring, in the case of IR faults. Notably, in this instance, the symbolic version of the KAN outperforms the regular version, achieving perfect classification accuracy. Another indicator of the CWRU dataset’s simplicity is that the model instances consistently utilize the smallest permitted grid size, with only one exception requiring a larger grid. In contrast, the MaFaulDa dataset demonstrates greater complexity. Larger grid sizes are employed in several cases, and the number of utilized features is generally higher, often reaching the maximum of ten to achieve the reported results. Nonetheless, with the exception of vertical misalignment faults, all severities across all fault categories achieve F1-Scores exceeding 95% for the regular KAN, underscoring the effectiveness of the proposed framework even in more diverse and challenging scenarios.
Interestingly, the patterns observed for feature prevalence in fault detection and classification tasks do not carry over to severity classification. Specifically, while time-domain and frequency-domain features dominated the fault detection and classification tasks for CWRU, and wavelet-based features were dominant for MaFaulDa, the opposite trend is observed in the severity classification task. Approximately two-thirds of the selected features for severity classification in CWRU and MaFaulDa are wavelet-based features and time- or frequency-domain features, respectively. This observation highlights the distinction between fault identification/classification and severity quantification. From a feature selection perspective, it confirms that no single feature set is universally suitable for all tasks, emphasizing the value of the proposed framework for automatic feature selection from a diversified feature library.
5. Discussion and Conclusions
In the present study, a novel framework leveraging KANs was developed and applied to bearing fault detection and classification, as well as fault severity classification tasks. The framework utilizes an attribution scoring mechanism coupled with a grid-search-based multi-objective optimization procedure for automatic feature selection from an extensive feature library and hyper-parameter tuning. By design, it emphasizes lightweight models with interpretable outputs, which is achieved by training shallow KANs and subsequently replacing their activation functions with analytical expressions drawn from a symbolic library. This approach prioritizes not only high performance, but also deployability and explainability, both of which are essential for real-world applications.
The framework was validated using two widely recognized datasets, CWRU and MaFaulDa, which were processed and augmented to enable feature extraction for the construction of a feature library. The experimental results demonstrated the framework’s effectiveness across all tasks. For fault detection, it achieved perfect performance on both datasets, highlighting its capability to distinguish between normal and faulty conditions even under significant class imbalance. In fault classification, the CWRU dataset proved relatively straightforward, with the framework again achieving perfect F1-Scores. In contrast, the MaFaulDa dataset required more features and larger grids to handle its increased complexity, yet the regular version of the trained KAN still achieved a 97.24% F1-Score on the evaluation set. For severity classification, the framework accurately identified fault severity levels with high F1-Scores in the regular KAN model instances: 100% for four out of five fault types of the CWRU dataset, and greater than 95% for all fault types in the MaFaulDa dataset, with the exception of VM faults. These results showcase the framework’s adaptability across diverse fault types and severity categories. Finally, while the symbolic versions of the models occasionally exhibited slightly reduced performance compared to their regular version, they are, in general, invaluable in terms of the interpretability they offer.
When comparing our results to those of other studies, it is important to highlight that the usage and handling of datasets such as CWRU and MaFaulDa vary significantly depending on the proposed framework and the targeted diagnostic tasks. For instance, some studies divide the CWRU dataset into 10 classes rather than the 6 used here, while others reduce it to as few as 4 classes [
6]. Moreover, our work uniquely addresses fault detection, fault classification, and severity classification simultaneously within the same unified framework, an approach seldom seen in the existing literature. Additionally, variations in feature extraction methods and data-splitting strategies further complicate direct comparisons across different approaches, particularly given the absence of universally accepted benchmarks or standardized procedures for utilizing these datasets. Despite these differences, numerous existing deep learning-based methods also report near-perfect or perfect scores similar to our own. However, as clearly illustrated in
Table 4, these methods typically rely on significantly more complex models with a substantially higher number of trainable parameters. Specifically, the complexity of deep learning architectures reported in the literature often exceeds that of our trained models by one to three orders of magnitude, highlighting our framework’s successful integration of high performance with remarkably lightweight architectures.
Along with parameter efficiency, explainability is another critical aspect of this framework, which is why shallow KANs with a limited number of features were prioritized. Beyond solving the task at hand, the framework can provide significant insights, such as identifying the most relevant signal types for each task or dataset. For instance,
Figure 12 offers a breakdown of the frequency of each signal’s contribution to the feature library for the MaFaulDa dataset. Across all seven experiments conducted for the CWRU dataset (fault detection, fault classification, and five severity analyses), the drive-end and fan-end signals were equally represented among the selected features, with a perfect 50–50% ratio. Conversely, for the MaFaulDa dataset, the tachometer and microphone signals were consistently underrepresented in the selected features, underscoring their limited relevance to the studied tasks. In contrast, signals from the underhang bearing (denoted by U) accounted for approximately 50% of the selected features across all tasks.
Apart from a signal-centered analysis, further analysis of the selected feature types per task provides valuable insights into the nature of each dataset and task complexity. In the case of the CWRU dataset, faults were relatively easier to detect, allowing the fault detection task to rely exclusively on a single statistical feature extracted from time-domain signals. Fault classification, being slightly more complex, required a combination of time-domain statistical features and time–frequency domain features derived from wavelet transforms, while purely frequency-domain features were not selected at all. On the other hand, the MaFaulDa dataset demonstrated higher complexity, necessitating a more diverse set of features across all tasks. Fault detection and classification involved a comprehensive selection from time-domain, frequency-domain, and wavelet-based time–frequency domain features. Notably, severity classification tasks heavily favored wavelet-based features, underscoring their ability to effectively capture fault severity variations. Additionally, an interesting observation emerged regarding the magnitude at the fundamental frequency: while seldom selected for fault detection or general fault classification, this frequency-domain feature was consistently among the most influential for severity classification across different fault categories. These observations highlight the flexibility and robustness of the proposed framework in effectively adapting its feature selection strategy to varying task complexities and dataset characteristics.
Combining all of the above, a practical implementation of this framework in a real-world scenario could focus on identifying or classifying bearing faults—or any type of machinery fault, given the framework’s generalizability. Example industrial applications span from condition monitoring and fault diagnosis in wind turbines to industrial pumps, aero-engines, and manufacturing. Using historical data, the framework could first identify the optimal set of sensors to install on the machinery, reducing costs by focusing only on the most informative signals. Once the sensors are installed, signal data could be used to construct feature libraries based on domain knowledge while the framework automates feature selection for the specific tasks at hand. The resulting lightweight models could be deployed online for real-time inference through MLOps platforms like MLflow [
70], with inference being rapid due to the models’ minimal computational requirements. Periodic retraining of the models could also be seamlessly integrated into the same platform, with short training times owing to the models’ small number of parameters. Furthermore, in the event of other types of faults or issues arising within the machinery, the framework’s versatility would allow it to be extended to address and analyze these problems as well.
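As an illustration of such a deployment step, the sketch below logs a trained model, its selected features, and its evaluation F1-Score to MLflow; the experiment name, logged parameters, and artifact path are hypothetical choices rather than part of the proposed framework.

```python
# Log a trained (regular or symbolic) model and its metadata to an MLflow
# tracking server for later online inference and periodic retraining.
import mlflow
import mlflow.pytorch

def log_model_run(model, f1_eval: float, selected_features: list[str]):
    mlflow.set_experiment("bearing-fault-diagnosis")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("n_selected_features", len(selected_features))
        mlflow.log_param("features", ",".join(selected_features))
        mlflow.log_metric("f1_evaluation", f1_eval)
        mlflow.pytorch.log_model(model, artifact_path="kan_model")
```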
Beyond these industrial applications, the framework also holds significant potential for scientific tasks, particularly in the domain of symbolic regression. Building on the groundwork of [
47], this framework offers an alternative to traditional symbolic regression methods, avoiding the computational overhead typically associated with genetic algorithms. For instance, in scientific problems where the underlying equation describing the data is not known, the cost function defined in Equation (
13) could be utilized to generate multiple symbolic expressions per run by varying its weighting parameters, effectively creating a grid-search process similar to those detailed in this work. From the resulting expressions, the optimal one could be selected using a defined metric or based on domain-specific knowledge, such as dimensional constraints.