1. Introduction
Rotating machinery plays an indispensable role in modern industry, powering numerous applications across the manufacturing, energy, and transportation sectors [
1]. Among their components, rolling element bearings (referred to simply as bearings hereafter) are vital yet vulnerable elements, with a heavy influence on the performance and lifespan of machines. Notably, it has been reported that up to 50% of motor faults are bearing-related [
2], while several issues in rotating machinery today can be traced to the improper design or application of bearings [
3]. Such failures can result in severe consequences, including unexpected downtime and costly repairs [
4,
5], or even catastrophic damage or loss of life in extreme cases [
6]. Moreover, in manufacturing environments, where continuous production is critical, disruptions caused by bearing failures can lead to substantial losses in productivity [
7]. Early and accurate bearing fault detection and classification are therefore essential in modern industrial and manufacturing practices.
Before the widespread use of machine learning (ML) and deep learning (DL) methodologies, bearing fault detection and classification relied on other widely adopted techniques to identify characteristic fault patterns. For instance, vibration analysis is commonly employed to detect frequency peaks associated with specific faults [
2], while signatures of these vibration frequencies have also been identified in the current spectrum through electrical signal processing [
8]. Additionally, features extracted from the Fourier spectrum of vibration signals were utilized to detect faults as peaks in the frequency domain [
9], and acoustic emission (AE) monitoring offered earlier and more fine-grained fault detection compared to vibration monitoring [
10]. However, these methods are often bound to specific fault scenarios or experimental conditions, which limit their applicability to diverse operating environments. For example, directly identifying bearing faults through raw vibration signals is challenging, as vibrations are typically dominated by imbalance and misalignment components [
11]. Moreover, the experimental results of [
8] were based on cases of extensive bearing damage, raising concerns about the applicability of this approach for less severe faults [
11]. Features like spectral kurtosis have been shown to be sensitive to strong harmonic interference when used as the only fault indicator [
12], and have also led to poor model accuracy when extracted from non-stationary signals [
13]. Finally, AE monitoring has also proven to be highly susceptible to background noise [
14].
Despite their limitations, the aforementioned techniques laid the foundation for identifying which types of sensor data are most effective for detecting and classifying bearing faults. Modern data-driven approaches have built upon this groundwork, incorporating features extracted from sensor data to develop more robust and generalizable frameworks. Examples of features extracted directly from the time-domain signal include, but are not limited to, the root mean square (RMS), crest factor (CF), skewness, and kurtosis [
15]. Spectral features, such as fundamental frequencies, spectral kurtosis, and spectral entropy, are obtained by applying a Fast Fourier Transform (FFT) to the time-domain signals and have also been widely used in such applications [
16]. Beyond time- and frequency-domain features, time–frequency representations, such as those derived from the Short-Time Fourier Transform or wavelet transformations, are often used to extract features for capturing transient and non-stationary behaviors [
17]. Additionally, AE monitoring, despite its susceptibility to noise, has shown promising results when combined with modern time-series deep learning methods [
18].
Drawing from these diverse feature sets, a series of ML models such as
k-Nearest Neighbors [
19,
20], Support Vector Machines (SVMs) [
21,
22], and Random Forests [
23,
24,
25] have been explored for the tasks of bearing fault detection and classification. More recently, the widespread adoption of sensors in industrial settings and the advent of the big data era have driven the use of DL approaches, initially based on fundamental architectures such as Autoencoders (AEs) [
26,
27], Recurrent Neural Networks [
27,
28], Convolutional Neural Networks (CNNs) [
29,
30,
31] and Generative Adversarial Networks [
32,
33]. Building further on these core architectures, recent studies have introduced more sophisticated hybrid models that combine CNNs with AEs and attention mechanisms [
13,
34], or integrate recurrent models like Bidirectional Long Short-Term Memory (BiLSTM) networks with smoothing techniques [
18].
Although successful in achieving high performance for bearing fault detection and classification, ML and especially DL models often fall short in areas where traditional approaches excel, with explainability being a notable example. The ability to understand and interpret a model’s decisions is crucial, especially in safety-critical applications or when deeper insights into the underlying physical processes are required [
35]. In addition to explainability, a significant challenge arises in deploying DL models for real-time condition-based monitoring on edge devices, i.e., resource-constrained computing units located close to the machinery they monitor. Many DL architectures are computationally intensive, making them unsuitable for resource-constrained environments [
36]. Another important consideration is the quantity and quality of features used by these models. While leveraging a large number of features can often yield superior results, achieving comparable performance with fewer features is far more desirable [
37]; budget constraints and technical limitations in practical scenarios demand careful sensor selection, as collecting an exhaustive set of measurements is neither feasible nor economical. Furthermore, the effectiveness of features can vary significantly, depending on the dataset or system under study; features that perform well for one problem may be suboptimal for another. This variability highlights the need for models capable of adaptively selecting the most relevant features for a given problem [
38]. To address these challenges, this paper presents a unified framework centered around Kolmogorov–Arnold Networks (KANs), designed to provide explainability, efficiency, and adaptive feature selection.
Inspired by the Kolmogorov–Arnold representation theorem, KANs were recently introduced by [
39] as an alternative to Multi-Layer Perceptrons (MLPs), serving as a new paradigm for the underlying architecture of DL models. Unlike in the case of MLPs, where activation functions are fixed, KANs contain trainable univariate functions as activations, allowing them to represent relationships in symbolic forms. This inherent explainability, along with their demonstrated performance in domains such as differential equations [
40,
41,
42], high-energy physics [
43,
44], and smart systems and devices [
45,
46], makes KANs a promising candidate for addressing both scientific and engineering problems [
47]. In the context of bearing fault detection and classification, there is a notable lack of studies utilizing KANs, with the exception of [
48]. In that study, the Case Western Reserve University (CWRU) bearing dataset [
49] was employed; however, the primary focus of the paper was unrelated to bearing fault diagnosis and instead aimed at the early prediction of natural gas pipeline leaks.
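To make the distinction from MLPs concrete, the following minimal PyTorch sketch shows a single KAN-style layer in which every input–output edge carries its own trainable univariate function. For brevity, the univariate functions are parameterized here with a fixed Gaussian basis rather than the B-splines of the original KAN formulation; the class name, basis size, and center range are illustrative assumptions and do not reproduce the pykan implementation.

```python
# Minimal conceptual sketch of a single KAN-style layer: each edge
# (input j -> output c) has its own trainable univariate function phi_{j,c},
# here a linear combination of fixed Gaussian basis functions.
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, n_basis: int = 8):
        super().__init__()
        # Fixed basis centers; only the combination coefficients are trained.
        self.centers = nn.Parameter(torch.linspace(-2, 2, n_basis), requires_grad=False)
        # One coefficient vector per (input, output) edge.
        self.coeffs = nn.Parameter(0.1 * torch.randn(in_features, out_features, n_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_features)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)    # (batch, in, n_basis)
        edge_out = torch.einsum("bif,iof->bio", basis, self.coeffs)  # phi_{i,o}(x_i)
        return edge_out.sum(dim=1)                                   # sum over inputs -> (batch, out)

# A shallow (single-layer) KAN classifier over d features and C classes:
# model = TinyKANLayer(in_features=d, out_features=C)
```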
Building on the potential of KANs and addressing the identified gaps, the main aspects of the proposed framework—and thus, the main contributions of the current work—can be summarized in the following points:
Automatic and explainable feature selection: By training shallow KANs with sparsity-inducing regularization, the framework automatically identifies the minimal set of highly relevant features based on explicit feature attribution scores and dynamic thresholds.
Interpretable and lightweight models: The trained activation functions are converted into symbolic mathematical representations based on the minimization of a custom cost function, enabling direct analysis and interpretation beyond black-box approaches while maintaining efficiency suitable for deployment on edge devices.
Unified bearing fault diagnosis framework: Fault detection, fault classification, and severity estimation are seamlessly integrated within a single, consistent methodology.
Generalizability beyond bearing faults: The versatility and generalization capabilities of the proposed approach are demonstrated through successful application to datasets containing diverse machinery faults beyond bearings alone.
To the best of the authors’ knowledge, this is among the first attempts to address bearing faults in a unified manner, handling multiple tasks such as fault detection, fault classification, and severity estimation. Unlike methods that only provide model architectures, this approach encompasses the entire pipeline, including data pre-processing, feature extraction, automated feature selection, and inference, within a single, lightweight deep learning framework. A recent related study by [
7] touches upon some of these challenges by employing an SVM model for fault classification and genetic algorithms (GAs) for automated feature selection. Nevertheless, GAs are computationally demanding, and SVMs, along with GAs, lack the explainability provided by the proposed KAN-based approach.
The remainder of the present paper is structured as follows:
Section 2 presents the proposed framework in detail, including its components, methodology, and theoretical foundation. Subsequently, the two datasets utilized in this study are introduced in
Section 3, along with a discussion on the rationale for their selection and a presentation of the feature libraries extracted to implement the framework. In
Section 4, the experimental results obtained on both datasets are reported, focusing on selected features, model performance, and symbolic representations. Finally,
Section 5 provides a summary and discussion of the work’s main findings.
3. Datasets and Feature Extraction
As previously outlined, the proposed framework has been designed for applicability across a wide range of problems beyond bearing faults. To apply it for bearing fault detection and classification, two widely recognized datasets are selected: the CWRU bearing dataset [
49] and the Machinery Fault Database (MaFaulDa) dataset [
57,
58]. The CWRU dataset is chosen due to its characterization as a dataset where feature selection is highly nontrivial, containing data that deviate from the typical characteristics expected for certain fault types [
6]. The MaFaulDa dataset, on the other hand, is selected for its broader scope, as it includes not only bearing faults but also additional types of machinery faults, thereby enabling the demonstration of the framework’s generalizability within a single dataset. Before detailing the process of constructing a feature library from the raw time-series signals of the two datasets, a more detailed introduction to each dataset is provided in
Section 3.1 and
Section 3.2.
3.1. CWRU Dataset
The CWRU dataset was generated using a test rig designed to simulate bearing faults under controlled conditions. The setup consisted of a 2-horsepower motor, a torque transducer, and a dynamometer, with the test bearings supporting the motor shaft. Three types of single-point faults were induced in the bearings using electro-discharge machining: inner raceway (IR), ball (B), and outer raceway (OR) faults, with fault diameters ranging from 7 mils (1 mil is equivalent to 0.001 inches) to 40 mils. Faults were applied to both the drive-end and fan-end bearings. The dataset comprises vibration measurements collected using accelerometers attached to the motor housing at the 12 o’clock position for both the drive-end and fan-end bearings, with an additional accelerometer attached to the base plate in some experiments. The signals were recorded at sampling rates of 12 kHz and, for certain drive-end faults, 48 kHz. For OR faults, experiments were conducted at different positions relative to the load zone (3 o’clock, 6 o’clock, and 12 o’clock) to capture variations in the vibration response. Thus, the dataset contains six classes for classification, labeled as N (normal), B, IR, OR@3, OR@6, and OR@12.
The original dataset’s files can be categorized along several axes. Based on motor speed, the files are divided into four groups: 1730, 1750, 1772, and 1797 rotations per minute (RPM). Based on fault location and sampling rate, the dataset includes normal files measured at 48 kHz, drive-end faults measured at 12 kHz, fan-end faults measured at 12 kHz, and drive-end faults measured at 48 kHz. Additionally, the files differ in terms of the time-series data they contain: some include only drive-end measurements, most include both drive-end and fan-end measurements, and a few include drive-end, fan-end, and base measurements. Due to the inconsistencies mentioned in [
59], the version of the dataset curated for the purposes of the cited work was used. Moreover, all 48 kHz drive-end measurements were excluded for two reasons: to ensure uniformity, as corresponding fan-end measurements at 48 kHz were unavailable, and because these files exhibited significant variability in the number of measurements, making them unsuitable for consistent sampling for the purposes of feature extraction. This process resulted in a total of 101 retained files.
3.2. MaFaulDa Dataset
The MaFaulDa dataset was created using a test rig designed to emulate the dynamics of motors with two shaft-supporting bearings. It comprises multivariate time-series data collected from sensors mounted on a SpectraQuest alignment/balance vibration trainer machinery fault simulator. The sensors included one triaxial accelerometer for the underhang bearing (the bearing located between the motor and rotor) and three industrial accelerometers for the overhang bearing (the bearing located outside the rotor, opposite the motor), oriented along the axial, radial, and tangential directions. Additionally, an analog tachometer measured the system’s rotational frequency, and a microphone captured operational sound. All signals were recorded at a sampling rate of 50 kHz over a duration of 5 s.
The dataset includes scenarios representing both normal operation and various fault conditions. In the normal class (N), the system operated without faults across 49 distinct rotation frequencies, ranging from 737 to 3686 RPM at approximately 60 RPM intervals. Bearing faults, similar to those in the CWRU dataset, involved defects in the inner raceway (IR), ball (B), and outer raceway (OR). These faults were studied in both bearings, underhang and overhang, one at a time. To ensure fault detectability, additional imbalances of 6 g, 10 g, and 20 g were introduced. Bearing fault scenarios were recorded under 49 rotation frequencies for lighter imbalances, while fewer frequencies were studied for heavier ones due to increased vibrations.
Beyond bearing faults, the dataset also includes additional machinery faults, namely imbalance (I) and axis misalignment. Imbalance faults were simulated by attaching varying load weights (6 g to 35 g) to the rotor. For weights up to 25 g, all 49 rotation frequencies were studied, whereas higher weights limited the maximum frequency to 3300 RPM due to increased vibrations. Axis misalignment was divided into horizontal misalignment (HM) and vertical misalignment (VM), induced by shifting the motor shaft by offsets of 0.5 mm to 2.0 mm for the former, and 0.51 mm to 1.9 mm for the latter. For each misalignment severity, the same 49 rotation frequencies as in the normal class were studied. In total, the dataset corresponds to 10 distinct classes and comprises 1951 data files, all of which were retained for feature extraction.
3.3. Feature Library
The extracted features were acquired by first augmenting and then preprocessing data from both datasets. Data augmentation was particularly critical for the CWRU dataset, which contained only four data files per fault type and severity, corresponding to the four rotational frequencies studied. In contrast, the MaFaulDa dataset included nearly 50 examples per fault case, yet augmentation was still applied to further enhance the dataset. The first step involved identifying the rotational frequency of each file. For the CWRU files, the exact RPM values were already known; for the MaFaulDa dataset, however, the rotational frequency had to be estimated per file, which was done using the two-step algorithm proposed in [60], applied to the tachometer signal. This method was selected to avoid misidentifying the rotational frequency, which could otherwise be obscured by spectral peaks introduced by machine faults in the signal’s frequency spectrum.
Once the rotational frequency was determined, it was combined with the sampling rate to split each time-series into smaller segments, each spanning N complete motor rotation cycles. The choice of N balances the trade-off between dataset size and segment quality: a smaller value yields more segments, but of lower quality, while a larger value preserves the quality of the original time-series at the expense of fewer samples. For this study, the value of N was chosen as a compromise, yielding approximately six segments per file for the CWRU dataset. The same number of cycles was used for the MaFaulDa dataset to maintain consistency, resulting in augmented datasets containing 603 and 6268 segments for CWRU and MaFaulDa, respectively.
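A minimal sketch of this segmentation step is given below; the helper name, the floor-division rule for the segment count, and the example value of n_cycles are illustrative assumptions rather than the study’s exact implementation.

```python
# Split a recorded signal into non-overlapping segments covering a fixed
# number of motor rotation cycles, given the sampling rate and rotational
# frequency. Names and the example n_cycles value are placeholders.
import numpy as np

def segment_signal(signal: np.ndarray, f_s: float, f_rot: float, n_cycles: int) -> list[np.ndarray]:
    """Split a 1-D time series into segments of n_cycles complete rotations."""
    samples_per_segment = int(n_cycles * f_s / f_rot)  # samples spanning n_cycles rotations
    n_segments = len(signal) // samples_per_segment
    return [signal[i * samples_per_segment:(i + 1) * samples_per_segment]
            for i in range(n_segments)]

# Example: a 12 kHz CWRU-like signal at 1797 RPM split into 20-cycle segments
# (n_cycles = 20 is a placeholder, not the value used in the paper).
f_s, rpm, n_cycles = 12_000, 1797, 20
x = np.random.randn(10 * f_s)  # stand-in for a recorded vibration signal
segments = segment_signal(x, f_s, rpm / 60.0, n_cycles)
```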
Using the augmented datasets, a series of time-domain, frequency-domain, and time–frequency features were extracted, based on the established literature on machinery fault diagnosis. For the time-domain features, the extracted metrics included the RMS, mean, variance, skewness, kurtosis, entropy, shape factor, crest factor, impulse factor, and margin factor, along with histogram upper and lower bounds, as described in [
16]. From the frequency domain, spectral skewness and kurtosis were calculated after applying an FFT to each signal. Additionally, the signal magnitudes at the fundamental frequency and its first two harmonics were extracted, following [
58].
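The sketch below illustrates how a few of the listed time- and frequency-domain features could be computed for a single segment; the formulas follow standard definitions, and the spectral skewness/kurtosis and harmonic-magnitude computations shown here are plausible realizations rather than the exact ones used in this work.

```python
# Illustrative computation of selected time- and frequency-domain features
# for one segment (standard definitions, not the paper's implementation).
import numpy as np
from scipy.stats import skew, kurtosis

def basic_features(seg: np.ndarray, f_s: float, f_rot: float) -> dict:
    rms = np.sqrt(np.mean(seg ** 2))
    feats = {
        "rms": rms,
        "variance": np.var(seg),
        "skewness": skew(seg),
        "kurtosis": kurtosis(seg),
        "crest_factor": np.max(np.abs(seg)) / rms,
        "shape_factor": rms / np.mean(np.abs(seg)),
    }
    # Frequency domain: magnitude spectrum, plus the magnitude at the
    # fundamental rotational frequency and its first two harmonics.
    spectrum = np.abs(np.fft.rfft(seg))
    freqs = np.fft.rfftfreq(len(seg), d=1.0 / f_s)
    for h in (1, 2, 3):
        idx = np.argmin(np.abs(freqs - h * f_rot))
        feats[f"magnitude_harmonic_{h}"] = spectrum[idx]
    feats["spectral_skewness"] = skew(spectrum)
    feats["spectral_kurtosis"] = kurtosis(spectrum)
    return feats
```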
For time–frequency features, wavelet transformations were employed, as they are highly effective for identifying machinery faults. The
pywavelets library [
61] was used to perform a multilevel decomposition of order 4 on each segment, utilizing a biorthogonal wavelet. Following [
7], the features derived from the fine-grained wavelet coefficients included the mean, median, RMS, standard deviation, variance, skewness, kurtosis, and entropy. The values at the 5th, 25th, 75th, and 95th percentiles were also extracted, along with the number of mean and zero crossings. Using this approach, a feature library of 62 and 243 features was compiled for the augmented CWRU and MaFaulDa datasets, respectively. This corresponds to 31 features per dataset signal, with the exception of the tachometer signal in MaFaulDa, from which no spectral features were extracted [
58].
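A hedged example of the wavelet-based feature computation with PyWavelets is shown below; the specific biorthogonal wavelet name ('bior3.5'), the histogram-based entropy estimate, and the choice of the finest detail level as the “fine-grained” coefficients are assumptions made for illustration.

```python
# Sketch of wavelet-based features via a level-4 multilevel decomposition
# with a biorthogonal wavelet (wavelet name assumed for illustration).
import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def wavelet_features(seg: np.ndarray) -> dict:
    coeffs = pywt.wavedec(seg, wavelet="bior3.5", level=4)  # [cA4, cD4, cD3, cD2, cD1]
    d1 = coeffs[-1]  # finest-scale detail coefficients
    hist, _ = np.histogram(d1, bins=64)
    p = hist / hist.sum()
    p = p[p > 0]
    return {
        "w_mean": np.mean(d1),
        "w_median": np.median(d1),
        "w_rms": np.sqrt(np.mean(d1 ** 2)),
        "w_std": np.std(d1),
        "w_variance": np.var(d1),
        "w_skewness": skew(d1),
        "w_kurtosis": kurtosis(d1),
        "w_entropy": -np.sum(p * np.log(p)),
        "w_p5": np.percentile(d1, 5),
        "w_p25": np.percentile(d1, 25),
        "w_p75": np.percentile(d1, 75),
        "w_p95": np.percentile(d1, 95),
        "w_zero_crossings": int(np.sum(np.diff(np.sign(d1)) != 0)),
        "w_mean_crossings": int(np.sum(np.diff(np.sign(d1 - np.mean(d1))) != 0)),
    }
```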
A detailed list of all extracted features for this work is provided in
Appendix B. It should be noted that certain features overlap in terms of the information they encode; for instance, the impulse factor is the product of the crest and shape factors, while the standard deviation is the square root of the variance. This intentional redundancy allows the framework to identify the most relevant features automatically and discard the rest during the feature selection process. After all, if the most effective features for the studied problem were already known, the feature selection phase would be redundant.
4. Experimental Results
This section presents the experimental findings obtained by applying the proposed framework to the two constructed feature libraries, addressing three distinct tasks: fault detection, fault classification, and severity classification for each fault type. Exclusively shallow KANs, i.e., models with a single layer, were considered throughout the experiments to minimize the number of model parameters, thus ensuring that the models remain lightweight and that their symbolic representations do not become overly complex.
For all tasks, the datasets were split in a stratified manner into training, validation, and evaluation sets in a 70%-15%-15% ratio, respectively, and the features were standardized. Model training was performed with the Adam optimizer, using Cross-Entropy as the non-regularizing loss function, as all tasks were classification problems. The primary performance metric was the F1-Score, chosen for its greater suitability in handling imbalanced datasets compared to accuracy. The KAN implementation and training were performed using the
PyTorch 2.4 [
62] and
pykan [
39,
47] frameworks, running on a single NVIDIA GeForce RTX 4070 GPU.
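For concreteness, the described setup (stratified 70%-15%-15% split, standardization, Adam, cross-entropy, F1-Score) can be sketched as follows; the KAN is treated as a generic torch.nn.Module, the macro-averaged F1-Score, learning rate, and full-batch updates are assumptions, and the sparsity regularization used during feature selection is omitted.

```python
# Minimal training-loop sketch of the described experimental setup.
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

def make_splits(X, y, seed=0):
    # Stratified 70/15/15 split, features standardized with training-set statistics.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_ev, y_val, y_ev = train_test_split(X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    return [scaler.transform(s) for s in (X_tr, X_val, X_ev)], (y_tr, y_val, y_ev)

def train_and_score(model: nn.Module, X_tr, y_tr, X_val, y_val, epochs=80, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    X_tr_t, y_tr_t = torch.tensor(X_tr, dtype=torch.float32), torch.tensor(y_tr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(model(X_tr_t), y_tr_t)
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = model(torch.tensor(X_val, dtype=torch.float32)).argmax(dim=1).numpy()
    return f1_score(y_val, preds, average="macro")
```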
During the feature selection phase, shallow KANs with a fixed, small grid size and spline order were trained non-adaptively for 80 epochs. Regarding hyper-parameter tuning, the regularization parameter was swept over a range whose bounds were chosen empirically: values below the lower bound had no discernible regularizing effect, while values above the upper bound imposed overly strict regularization, hindering the network’s training. The feature selection threshold parameter was varied analogously, between values small enough to retain all candidate features and values large enough to discard (almost) all of them. Both ranges were discretized into 20 equidistant points. When multiple Pareto-optimal solutions were identified, the model achieving the highest F1-Score with up to 10 features was selected. This constraint is intentionally strict, as most state-of-the-art models utilize at least 15 features on these datasets; however, developing lightweight models focused on only the most informative features aligns with the study’s primary objective.
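The grid search and Pareto-based selection rule just described can be summarized by the following illustrative sketch, in which train_fn stands in for the actual routine that trains a regularized shallow KAN and applies the attribution threshold; the parameter names and the dominance check are simplifications of the framework, not its verbatim implementation.

```python
# Hedged sketch of the (regularization, threshold) grid search with
# Pareto-front selection over (number of retained features, validation F1).
import itertools

def pareto_front(results):
    """results: list of dicts with keys 'n_features' and 'f1'."""
    front = []
    for r in results:
        dominated = any(
            o["n_features"] <= r["n_features"] and o["f1"] >= r["f1"]
            and (o["n_features"] < r["n_features"] or o["f1"] > r["f1"])
            for o in results
        )
        if not dominated:
            front.append(r)
    return front

def feature_selection_search(lambdas, taus, train_fn):
    """train_fn(lam, tau) -> (n_selected_features, validation_f1)."""
    results = []
    for lam, tau in itertools.product(lambdas, taus):
        n_feat, f1 = train_fn(lam, tau)
        results.append({"lambda": lam, "tau": tau, "n_features": n_feat, "f1": f1})
    front = pareto_front(results)
    # Selection rule from the text: best F1 among Pareto points with <= 10 features.
    eligible = [r for r in front if r["n_features"] <= 10]
    return max(eligible, key=lambda r: r["f1"]) if eligible else None
```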
For the model selection phase, higher-order KANs were employed, and a grid search was conducted over the grid size G and the remaining spline hyper-parameter, whose range naturally spans a bounded interval. The maximum considered value of G was limited to 50, as larger grids would be unnecessarily complex for single-layer architectures; this choice is further supported by the obtained results, where the optimal value of G never exceeded 20. Each model instance was trained adaptively for 200 epochs, with grid updates performed every 10 epochs until epoch 150. For the symbolic fitting cost function given in Equation (13), the weighting parameters were selected to prioritize a high goodness of fit over low complexity, unless model complexity became excessively high, in which case the exponential penalty term dominates. In cases where the resulting Pareto front contained multiple candidate models, the one with the highest average F1-Score between the original and symbolic representations was selected.
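The symbolic fitting step can be illustrated with the generic stand-in below: each candidate analytic primitive is fitted to a learned activation and scored by a cost that rewards goodness of fit while penalizing complexity through an exponential term. The candidate set, weights, and functional form of the cost are assumptions for illustration and do not reproduce Equation (13).

```python
# Generic stand-in for symbolic fitting of a learned univariate activation:
# fit each candidate primitive, then pick the one minimizing a hypothetical
# cost combining goodness of fit and an exponential complexity penalty.
import numpy as np
from scipy.optimize import curve_fit

CANDIDATES = {
    # name: (function, initial guess, complexity score)
    "linear":  (lambda x, a, b: a * x + b, [1.0, 0.0], 1),
    "sigmoid": (lambda x, a, b, c, d: c / (1 + np.exp(-(a * x + b))) + d, [1.0, 0.0, 1.0, 0.0], 3),
    "sine":    (lambda x, a, b, c, d: c * np.sin(a * x + b) + d, [1.0, 0.0, 1.0, 0.0], 3),
}

def fit_symbolic(x, y, alpha=1.0, beta=0.1):
    best_name, best_cost = None, np.inf
    for name, (fn, p0, complexity) in CANDIDATES.items():
        try:
            params, _ = curve_fit(fn, x, y, p0=p0, maxfev=5000)
        except RuntimeError:
            continue  # fit did not converge for this primitive
        r2 = 1 - np.sum((y - fn(x, *params)) ** 2) / np.sum((y - y.mean()) ** 2)
        cost = -alpha * r2 + beta * np.exp(complexity)  # hypothetical cost, not Eq. (13)
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```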
4.1. Fault Detection
For the fault detection task, all data samples were categorized into two classes, normal (N) and faulty (F), with the latter encompassing all fault types. Fault detection is generally simpler than fault classification, as the model only needs to distinguish between normal and anomalous data. However, this task suffers from a significant imbalance in class representation, which presents a major challenge.
In both the CWRU and MaFaulDa datasets, the normal class is severely underrepresented. For the CWRU dataset, the normal class constitutes only 3.96% of the dataset, while the other classes range from 11.88% to 23.76%. Similarly, in the MaFaulDa dataset, the normal class accounts for just 2.51%, with the remaining classes spanning from 7.02% to 17.07%. When restructured for fault detection, the imbalance becomes even more pronounced, with the normal class constituting only 3.96% versus 96.04% for the faulty class in the CWRU dataset, and 2.51% versus 97.49% for the faulty class in the MaFaulDa dataset. This extreme imbalance means that even a trivial classifier which predicts all entries as faulty would achieve a high accuracy (e.g., 97.49% for MaFaulDa) while failing to provide any meaningful insights.
To address this imbalance, a balancing strategy is required. Common approaches include undersampling the dominant class and oversampling the minority class with techniques such as the Synthetic Minority Oversampling Technique (SMOTE) [
63]. For this work, undersampling was adopted; specifically, 30 samples from each fault class in the CWRU augmented dataset and 60 samples from each fault class in the MaFaulDa augmented dataset were randomly selected. This adjustment reduced the imbalance to 12.28% normal versus 87.72% faulty for CWRU and 22.86% normal versus 77.14% faulty for MaFaulDa. Although still imbalanced, these distributions are far more manageable.
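A simple per-class random undersampling routine matching this strategy could look as follows; the DataFrame layout and column names are assumed for illustration.

```python
# Randomly undersample each fault class to a fixed number of segments
# (e.g., 30 per class for CWRU, 60 for MaFaulDa) while keeping all normal samples.
import pandas as pd

def undersample_faulty(df: pd.DataFrame, label_col: str, normal_label: str,
                       n_per_fault_class: int, seed: int = 0) -> pd.DataFrame:
    normal = df[df[label_col] == normal_label]
    faulty = (df[df[label_col] != normal_label]
              .groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=min(n_per_fault_class, len(g)), random_state=seed)))
    return pd.concat([normal, faulty]).sample(frac=1.0, random_state=seed)  # shuffle rows
```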
Following this preprocessing step, the framework’s feature selection process, as detailed in
Section 2, was applied to identify the most relevant features for fault detection. Regarding the CWRU dataset, the Pareto front resulting from the feature selection grid search was a singleton, yielding a single optimal pair of regularization and threshold values. These values resulted in the selection of a single feature, the 25th percentile value of the drive-end signal (see Appendix B), as illustrated in Figure 3. Proceeding to the model selection phase, the grid search using only this feature again produced a Pareto front with a single point. Based on Equation (6), the resulting KAN model comprised only 28 trainable parameters. The combination of a single feature and a small yet fully adaptive grid suggests that fault detection in the CWRU dataset is relatively straightforward, so approaching the task with a complex model is neither necessary nor good practice. This is further corroborated by the final evaluation of the chosen model, for both the regular and symbolic versions of the KAN, as shown in the confusion matrices of
Figure 4.
The selection of a single feature offers the opportunity to highlight the importance of extracting symbolic representations for the trained KAN, thus making its decisions and architecture fully interpretable. In this case, the symbolic representations of the KAN’s two output edges are given by Equations (14) and (15), where x denotes the scaled feature and the nonlinearity appearing in both expressions is the sigmoid function; all numbers have been rounded to the second decimal digit. Using these analytical expressions, a sample is classified as normal if the normal-class output exceeds the fault-class output, and as a fault otherwise. Equations (14) and (15) also allow for the study of otherwise inaccessible (or hard-to-compute) properties of the classification problem, such as determining the decision boundary by equating the two expressions and solving for x.
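As a numerical illustration of this use of the symbolic expressions, the sketch below equates two hypothetical sigmoid-based class outputs and solves for the boundary with a root finder; the coefficients and functional forms are placeholders and are not those of Equations (14) and (15).

```python
# Find the decision boundary of two symbolic class outputs by solving
# g_normal(x) = g_fault(x) numerically (coefficients are hypothetical).
import numpy as np
from scipy.optimize import brentq

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_normal(x):  # placeholder for the normal-class output expression
    return sigmoid(2.3 * x - 1.1)

def g_fault(x):   # placeholder for the fault-class output expression
    return sigmoid(-1.7 * x + 0.4)

# Decision boundary: scaled feature value where both outputs coincide.
x_boundary = brentq(lambda x: g_normal(x) - g_fault(x), -5.0, 5.0)
predict = lambda x: "N" if g_normal(x) > g_fault(x) else "F"
```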
Figure 5 illustrates the two curves alongside all CWRU data points, color-coded by class. The decision boundary is also depicted, demonstrating that all of the dataset’s samples are correctly classified using these symbolic expressions.
The same procedure was applied to the MaFaulDa dataset. In this case, the feature selection process resulted in a Pareto front with three candidate points. The corresponding hyper-parameter values, the associated F1-Scores, and the number of features retained for each point are presented in Table 1. Although the configuration with the fewest features also exhibited the lowest performance, it still achieved a remarkably high F1-Score of 97.07%. Notably, the four features selected in the lowest-performing case are a subset of the six features selected in the middle-performing case, which, in turn, are a subset of the nine features selected in the highest-performing case. This hierarchical relationship highlights the consistency of the framework. Following the selection rule of prioritizing the highest-performing configuration with no more than 10 features, the nine-feature configuration was chosen; the retained features are illustrated in
Figure 6.
For the model selection phase, using the nine selected features, the grid search over the grid size and spline hyper-parameters produced a Pareto front with a single point, similar to the results for the CWRU dataset. The final model had 252 trainable parameters and achieved an F1-Score of 100% in its regular form and 98.08% after symbolic fitting. This result is not unexpected, as symbolic fitting cannot always perfectly replicate the trained activation functions using analytical expressions. Consequently, the trade-off between performance and interpretability must always be considered; in this case, however, the performance impact is minimal. The performance of each KAN version is shown in the confusion matrices of Figure 7. As for the KAN’s symbolic output expressions in this case, they are functions of nine variables, meaning that visualizing their decision boundary, now a hypersurface in the nine-dimensional feature space, would require a ten-dimensional equivalent of Figure 5. Although such a depiction is impractical, these expressions remain computationally inexpensive and provide valuable insights into the model’s predictions, for instance, by keeping most features constant and examining decision boundaries as one or two features vary. Notably, the presence of multiple features allows for an analysis of their final importance in the trained model’s predictions. The normalized feature attribution scores, depicted in
Figure 8, illustrate the relative contribution of each selected feature.
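One plausible way to obtain such normalized attribution scores from a single-layer KAN is sketched below, where each feature is scored by the mean absolute contribution of its learned edge functions over the data; this is an illustrative scoring rule and not necessarily the exact attribution used by the framework or by pykan.

```python
# Normalized feature attribution for a single-layer KAN-like model, scored by
# the average absolute edge-function output per input feature.
import numpy as np

def attribution_scores(edge_outputs: np.ndarray) -> np.ndarray:
    """
    edge_outputs: array of shape (n_samples, n_features, n_classes) holding
    phi_{j,c}(x_j) for every sample, input feature j, and output class c.
    Returns one normalized score per feature (scores sum to one).
    """
    scores = np.mean(np.abs(edge_outputs), axis=(0, 2))
    return scores / scores.sum()
```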
4.2. Fault Classification
Moving to fault classification, no oversampling or undersampling techniques were applied, despite the datasets being imbalanced as previously noted. For the CWRU dataset, the feature selection grid search yielded a single optimal pair of regularization and threshold parameters, resulting in the selection of seven features. During model selection, the corresponding grid search identified the hyper-parameter combination achieving the highest average F1-Score between the model’s regular and symbolic versions. The final feature importance scores for this model, which comprises 756 trainable parameters and was evaluated on the evaluation set, are shown in the left plot of
Figure 9. The model in its regular form achieved a perfect F1-Score of 100%, while the symbolic version slightly underperformed at 97.80%. The corresponding confusion matrices for both versions are provided in
Figure 10.
Applying the same approach to the MaFaulDa dataset, the feature selection process yielded a large Pareto front, ranging from models that retained a single feature (with poor performance) to those retaining up to 70 features (achieving a validation set F1-Score of 99.89%). The selected configuration resulted in the retention of ten features. The subsequent grid search for optimal model hyper-parameters produced a smaller Pareto front with only two points. Among these, the chosen configuration corresponds to 1800 trainable parameters and was evaluated on the evaluation set, achieving a final F1-Score of 97.24% in its regular form and 92.03% in its symbolic form. The normalized feature attribution scores for the final model are shown in the right plot of
Figure 9, while the corresponding confusion matrices are depicted in
Figure 11.
It becomes evident that, unlike in the fault detection task, fault classification requires a greater number of features for each dataset to capture the details required for a more fine-grained classification than a binary one. For the CWRU dataset, the trained KAN once again achieved a perfect F1-Score in its regular form. However, for the MaFaulDa dataset, even the regular version of the KAN left some data points misclassified. This is not unexpected, as the MaFaulDa dataset includes not only bearing faults but also additional machinery faults. This, in fact, highlights the generalizability of the proposed framework, demonstrating its ability to perform well in more diverse scenarios beyond the narrower domain of bearing faults. It is worth noting that higher performance could have been achieved by increasing the number of selected features or adding more layers to the KAN. However, the focus of this study is not solely on achieving perfect scores but rather on promoting lightweight models with minimal parameters, favoring interpretability and ensuring suitability for deployment in resource-constrained environments.
4.3. Severity Classification
Apart from fault detection and classification, both datasets allow for an additional type of investigation: the analysis of fault severity. For the CWRU dataset, the faults have three severity levels (7 mils, 14 mils, and 21 mils), the exception being OR@12 faults, which have only two severity categories (7 mils and 21 mils). For the MaFaulDa dataset, severity classification encompasses a broader range of categories across different fault types. Imbalance faults are categorized into seven severity levels, corresponding to imbalance loads ranging from 6 g to 35 g. Horizontal misalignment faults have four severity levels (0.5 mm, 1.0 mm, 1.5 mm, and 2.0 mm), while vertical misalignment faults have six severity levels (0.51 mm, 0.63 mm, 1.27 mm, 1.4 mm, 1.78 mm, and 1.9 mm). Additionally, the bearing faults in the MaFaulDa dataset are all classified into four severity levels, determined by the extra imbalance introduced to amplify their effects, ranging from 0 g to 20 g.
Given these structured severity levels, it is natural to approach severity analysis as a classification problem. To this end, the proposed framework was applied, following the established processes of feature selection, model selection, and model evaluation for each fault type’s severity classification.
Table 2 and
Table 3 present the intermediate (i.e., feature selection and model selection outcomes) and final (i.e., F1-Scores for both the regular and symbolic KANs on the evaluation set) results for the CWRU and MaFaulDa datasets, respectively. For each model, the corresponding number of total trainable parameters is also reported.
Based on these results, previous observations regarding the relative simplicity of the CWRU dataset are reaffirmed: once again, the severity of each fault type can be classified almost perfectly using a very small number of features (no more than three), with only a single misclassification occurring, in the case of IR faults. Notably, in this instance, the symbolic version of the KAN outperforms the regular version, achieving perfect classification accuracy. Another indicator of the CWRU dataset’s simplicity is that the model instances consistently utilize the smallest permitted grid size, with only one exception requiring a larger grid. In contrast, the MaFaulDa dataset demonstrates greater complexity. Larger grid sizes are employed in several cases, and the number of utilized features is generally higher, often reaching the maximum of ten to achieve the reported results. Nonetheless, with the exception of vertical misalignment faults, all severities across all fault categories achieve F1-Scores exceeding 95% for the regular KAN, underscoring the effectiveness of the proposed framework even in more diverse and challenging scenarios.
Interestingly, the patterns observed for feature prevalence in fault detection and classification tasks do not carry over to severity classification. Specifically, while time-domain and frequency-domain features dominated the fault detection and classification tasks for CWRU, and wavelet-based features were dominant for MaFaulDa, the opposite trend is observed in the severity classification task. Approximately two-thirds of the selected features for severity classification in CWRU and MaFaulDa are wavelet-based features and time- or frequency-domain features, respectively. This observation highlights the distinction between fault identification/classification and severity quantification. From a feature selection perspective, it confirms that no single feature set is universally suitable for all tasks, emphasizing the value of the proposed framework for automatic feature selection from a diversified feature library.
5. Discussion and Conclusions
In the present study, a novel framework leveraging KANs was developed and applied to bearing fault detection and classification, as well as fault severity classification tasks. The framework utilizes an attribution scoring mechanism coupled with a grid-search-based multi-objective optimization procedure for automatic feature selection from an extensive feature library and hyper-parameter tuning. By design, it emphasizes lightweight models with interpretable outputs, which is achieved by training shallow KANs and subsequently replacing their activation functions with analytical expressions drawn from a symbolic library. This approach prioritizes not only high performance, but also deployability and explainability, both of which are essential for real-world applications.
The framework was validated using two widely recognized datasets, CWRU and MaFaulDa, which were processed and augmented to enable feature extraction for the construction of a feature library. The experimental results demonstrated the framework’s effectiveness across all tasks. For fault detection, it achieved perfect performance on both datasets, highlighting its capability to distinguish between normal and faulty conditions even under significant class imbalance. In fault classification, the CWRU dataset proved relatively straightforward, with the framework again achieving perfect F1-Scores. In contrast, the MaFaulDa dataset required more features and larger grids to handle its increased complexity, yet the regular version of the trained KAN still achieved a 97.24% F1-Score on the evaluation set. For severity classification, the framework accurately identified fault severity levels with high F1-Scores in the regular KAN model instances: 100% for four out of five fault types of the CWRU dataset, and greater than 95% for all fault types in the MaFaulDa dataset, with the exception of VM faults. These results showcase the framework’s adaptability across diverse fault types and severity categories. Finally, while the symbolic versions of the models occasionally exhibited slightly reduced performance compared to their regular version, they are, in general, invaluable in terms of the interpretability they offer.
When comparing our results to those of other studies, it is important to highlight that the usage and handling of datasets such as CWRU and MaFaulDa vary significantly depending on the proposed framework and the targeted diagnostic tasks. For instance, some studies divide the CWRU dataset into 10 classes rather than the 6 used here, while others reduce it to as few as 4 classes [
6]. Moreover, our work uniquely addresses fault detection, fault classification, and severity classification simultaneously within the same unified framework, an approach seldom seen in the existing literature. Additionally, variations in feature extraction methods and data-splitting strategies further complicate direct comparisons across different approaches, particularly given the absence of universally accepted benchmarks or standardized procedures for utilizing these datasets. Despite these differences, numerous existing deep learning-based methods also report near-perfect or perfect scores similar to our own. However, as clearly illustrated in
Table 4, these methods typically rely on significantly more complex models with a substantially higher number of trainable parameters. Specifically, the complexity of deep learning architectures reported in the literature often exceeds that of our trained models by one to three orders of magnitude, highlighting our framework’s successful integration of high performance with remarkably lightweight architectures.
Along with parameter efficiency, explainability is another critical aspect of this framework, which is why shallow KANs with a limited number of features were prioritized. Beyond solving the task at hand, the framework can provide significant insights, such as identifying the most relevant signal types for each task or dataset. For instance,
Figure 12 offers a breakdown of the frequency of each signal’s contribution to the feature library for the MaFaulDa dataset. Across all seven experiments conducted for the CWRU dataset (fault detection, fault classification, and five severity analyses), the drive-end and fan-end signals were equally represented among the selected features, with a perfect 50–50% ratio. Conversely, for the MaFaulDa dataset, the tachometer and microphone signals were consistently underrepresented in the selected features, underscoring their limited relevance to the studied tasks. In contrast, signals from the underhang bearing (denoted by U) accounted for approximately 50% of the selected features across all tasks.
Apart from a signal-centered analysis, further analysis of the selected feature types per task provides valuable insights into the nature of each dataset and task complexity. In the case of the CWRU dataset, faults were relatively easier to detect, allowing the fault detection task to rely exclusively on a single statistical feature extracted from time-domain signals. Fault classification, being slightly more complex, required a combination of time-domain statistical features and time–frequency domain features derived from wavelet transforms, while purely frequency-domain features were not selected at all. On the other hand, the MaFaulDa dataset demonstrated higher complexity, necessitating a more diverse set of features across all tasks. Fault detection and classification involved a comprehensive selection from time-domain, frequency-domain, and wavelet-based time–frequency domain features. Notably, severity classification tasks heavily favored wavelet-based features, underscoring their ability to effectively capture fault severity variations. Additionally, an interesting observation emerged regarding the magnitude at the fundamental frequency: while seldom selected for fault detection or general fault classification, this frequency-domain feature was consistently among the most influential for severity classification across different fault categories. These observations highlight the flexibility and robustness of the proposed framework in effectively adapting its feature selection strategy to varying task complexities and dataset characteristics.
Combining all of the above, a practical implementation of this framework in a real-world scenario could focus on identifying or classifying bearing faults—or any type of machinery fault, given the framework’s generalizability. Example industrial applications span from condition monitoring and fault diagnosis in wind turbines to industrial pumps, aero-engines, and manufacturing. Using historical data, the framework could first identify the optimal set of sensors to install on the machinery, reducing costs by focusing only on the most informative signals. Once the sensors are installed, signal data could be used to construct feature libraries based on domain knowledge while the framework automates feature selection for the specific tasks at hand. The resulting lightweight models could be deployed online for real-time inference through MLOps platforms like MLflow [
70], with inference being rapid due to the models’ minimal computational requirements. Periodic retraining of the models could also be seamlessly integrated into the same platform, with short training times owing to the models’ small number of parameters. Furthermore, in the event of other types of faults or issues arising within the machinery, the framework’s versatility would allow it to be extended to address and analyze these problems as well.
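As an illustration of such a deployment step, the sketch below logs a trained model, its selected features, and its evaluation F1-Score to MLflow; the experiment name, logged parameters, and artifact path are hypothetical choices rather than part of the proposed framework.

```python
# Log a trained (regular or symbolic) model and its metadata to an MLflow
# tracking server for later online inference and periodic retraining.
import mlflow
import mlflow.pytorch

def log_model_run(model, f1_eval: float, selected_features: list[str]):
    mlflow.set_experiment("bearing-fault-diagnosis")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("n_selected_features", len(selected_features))
        mlflow.log_param("features", ",".join(selected_features))
        mlflow.log_metric("f1_evaluation", f1_eval)
        mlflow.pytorch.log_model(model, artifact_path="kan_model")
```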
Beyond these industrial applications, the framework also holds significant potential for scientific tasks, particularly in the domain of symbolic regression. Building on the groundwork of [
47], this framework offers an alternative to traditional symbolic regression methods, avoiding the computational overhead typically associated with genetic algorithms. For instance, in scientific problems where the underlying equation describing the data is not known, the cost function defined in Equation (
13) could be utilized to generate multiple symbolic expressions per run by varying its weighting parameters, effectively creating a grid-search process similar to those detailed in this work. From the resulting expressions, the optimal one could be selected using a defined metric or based on domain-specific knowledge, such as dimensional constraints.