1. Introduction
The use of machines to assist human work needs to be supported by choosing the right machine maintenance strategy to prevent machine downtime. According to Senseye [
1], sudden machine downtime can cause a company to lose as much as 323 production hours a year and average losses of up to
$172 million per plant annually.
The most familiar and widely used machine maintenance strategies are reactive and preventive maintenance. Reactive maintenance is executed after a machine fails, so it risks hampering company productivity [
2]. This risk prompted as many as 33% of companies to change their maintenance strategy to more proactive machine maintenance, such as preventive maintenance [
3]. The preventive maintenance concept allows the machine to be in good condition but still replaced to follow the predetermined maintenance schedule [
4]. Thus, it can increase unnecessary maintenance costs.
The next type of maintenance is predictive maintenance, which uses sensors to monitor and store machine condition data and then utilizes this historical data to predict trends, behavior patterns, or correlations using statistical models or machine learning to predict imminent machine damage or failure [
5,
6]. The results of these estimates will later be used as a reference in taking the necessary actions to ensure the condition and smooth operation of the machine. Predictive maintenance can reduce maintenance costs by 25–30%, eliminate machine breakdowns by 70–75% and reduce downtime by 35–45% [
7].
The use of technology in predictive maintenance is inseparable from various studies that continue to try to find innovations and improve the prediction accuracy of learning models. Several studies have used Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Decision Tree algorithms to produce good performance [
8]. However, these studies have not considered that in real situations, failures are rare during the life cycle of a machine when compared to the normal operating cycle of the machine [
9]. As a result, the imbalance between classes in the historical data may allow misclassification.
Imbalances between classes in the dataset are usually resolved using under-sampling or over-sampling techniques. Some algorithms will then use processed data, e.g., Extreme Learning Machine (ELM), which produces good performance [
10]. ELM became popular since it produces comparatively high accuracy and cross-domain adaptation with low time consumption [
11]. Moreover, ELM has superior generalization ability and less training time than deep network models, since it does not require an iterative process for tuning parameters [
12,
13]. Another technique is using a weight element in the dataset, where the class with a more significant number (the majority) will be given a small-valued weight element. In contrast, the minority class will be given a large-valued weighted element to increase the influence of the data on model learning. An example of using the weight element is the Weighted-Extreme Learning Machine (WELM) algorithm created by Zong et al. [
14] to handle the imbalance in the class distribution of the datasets.
In addition to data imbalance, selecting the right features is crucial in building a classifier model that produces satisfactory results [
15,
16]. Particle Swarm Optimization (PSO) can provide promising results in feature selection, where the accuracy results using PSO increase by up to 21.46% compared to traditional methods without PSO [
17].
The main contributions of this research are:
Build a classification model suitable for predicting machine failure with unbalanced data in predictive maintenance based on the ELM method.
Explore the configuration for each ELM-based model to achieve the optimal performance.
The rest of the paper is structured as follows: The
Section 2 discusses previous research on predictive maintenance. The
Section 3 describes the dataset and the proposed models. The
Section 4 discusses the steps and results, while the
Section 5 consists of the analysis of the results of the proposed model. The
Section 6 provides conclusions from the research conducted.
2. Related Works
Many researchers have used ELM and its derivatives in handling unbalanced data. Mao et al. [
18] used online-sequential ELM combined with SMOTE for the under-sampling and over-sampling of unbalanced data, which can effectively reduce the possibility of information loss thanks to the use of granulation division. Apart from under-sampling and over-sampling, various combinations of WELM were also carried out to overcome balanced learning, which gave satisfactory results [
19,
20]. Although the studies carried out have obtained good performance, there are still things that need to be considered in using ELM and its derivative algorithms in classifying: its characteristics that determine the number of hidden nodes randomly, which allows overfitting [
21], so that the use of the kernel in ELM is also used as an alternative option [
19,
22,
23]. The parameters were initially determined randomly, and researchers used PSO to improve performance [
22,
24,
25]. In fields other than fault detection, PSO has proven its excellent capability in selecting features and determining optimal parameters when combined with ELM and WELM [
21,
26,
27].
Machine failure detection or predictive maintenance research usually uses machine historical data collected individually and not shared with the public. There are also some public datasets that are commonly used, e.g., a public dataset prepared by Microsoft under the name Microsoft Azure for Predictive Maintenance [
28]. This dataset consists of machine historical information and can be used both for binary and multiclass classification.
Teoh et al. [
29] used a two-class logistics algorithm, where class 0 represents equipment in good condition and class 1 represents equipment that has failed. The Microsoft Azure for Predictive Maintenance data was processed to obtain lag features within 24 h and processed by the under-sampling technique to overcome the problem of imbalance in the data. The dataset is divided into training and testing sets with a ratio of 70:30. The designed model produces training and testing accuracy of 95.1% and 94.5%, respectively.
Cardoso and Ferreira [
9] used the same dataset to see the effectiveness of random forest and ANN algorithms in predicting and classifying machine faults into five classes. Before training the models, the dataset was pre-processed to obtain lag features within 3 h and 24 h. Both algorithms provide satisfactory predictions based on precision, recall, and F1-score. The summary of related works using the dataset from Microsoft is presented in
Table 1.
3. Methodology
The research methodology includes data collection, feature extraction, model development, and evaluation. The model built in the experiment is the ELM algorithm and its varieties and combination step by step to obtain optimal performance. The first model is ELM, which is trained using under-sampled and over-sampled data. The second model is the ELM with a weight element called WELM. After that, a third model was created using WELM combined with the kernel to eliminate the possibility of overfitting due to the random determination of the number of hidden nodes. The final model combined WELM with the kernel and PSO, which helped determine the required parameter values and feature selection. In short, the algorithms used in this study are ELM, WELM, WELM with the RBF kernel, and WELM with the RBF kernel and PSO.
The data that will be used and collected is Microsoft Azure for Predictive Maintenance [
28], which is a public dataset that has been mostly used for predictive maintenance model training. Then, the raw data will be extracted to obtain features for model training. For ELM, WELM, and WELM with RBF, kernel models will be trained and tested using all the extracted features, but WELM with RBF kernel and PSO will be trained and tested using selected features by PSO. After the training and testing process, the performance of each model will be evaluated using precision, recall, and F1-score.
3.1. Data Collection
Microsoft Azure for Predictive Maintenance contains 876100 real-time telemetry data (or 8761 data per machine), 3919 error log data, 3286 maintenance history data, and 761 fault history data (approximately eight failure data per machine) from a total of 100 machines with four different types of models throughout 2015, except for maintenance history data collected from 2014. Each machine has four components and four sensors measuring tension, pressure, vibration, and rotation. A controller also monitors the system and can detect five types of errors. In detail, this dataset consists of five data files with the .csv extension:
Real-time telemetry data measure the stress, rotation, pressure, and vibration of each engine from four sensors per hour in real time.
Error logs contain errors that occur on the machine, but the machine can still operate.
Maintenance history contains information on replacing engine components for several reasons.
Fault history is a record of component replacements due to machine failures, and
Collection of information regarding the type of model and the number of years of work for each of the 100 machines.
From the five data files above, there are five conditions to be classified into:
Normal means the machine is in good condition and operates normally.
Fail1 means the machine is broken caused by component one of the machine fails.
Fail2 means the machine is broken caused by component two of the machine fails.
Fail3 means the machine is broken caused by component three of the machine fails.
Fail4 means the machine is broken caused by component four of the machine fails.
The dataset can be classified into binary and multiclass classifications. Teoh et al. [
29] previously used the binary classification by predicting machine conditions into 0, which means the machine operates normally, or 1, which means the machine fails. Meanwhile, the multiclass classification was previously performed by Cardoso and Ferreira [
9] by predicting the machine conditions into five categories, as discussed before.
3.2. Feature Extraction
Feature extraction in this research is carried out by creating lag features, using the same technique as Cardoso and Ferreira [
9]. The trick is creating a temporal window to divide the data into short-term and long-term historical data. Short-term historical data are data with an interval of three hours, while long-term historical data are data with an interval of 24 h. In addition, from the error log data, lag features are also made as information on the number of errors that occur every 24 h on the machine. Apart from the lag features, another feature extracted is when the machine is used before being replaced. This period is calculated for each component on each machine. Details of the features that can be extracted from the data can be seen in
Table 2.
All features that have been extracted will be combined and synchronized with the type of failure based on machineID and the date and time of data recording. Data lines that do not have a failure type are considered features for the machine under normal conditions. Next, the data are cleaned by deleting rows with incomplete or empty features; then, the data are normalized using the min–max normalization.
3.3. Model Development
3.3.1. First Model: Extreme Learning Machine with Under-Sampling and Over-Sampling
Extreme Learning Machine (ELM) is a neural network-based learning algorithm that only consists of three layers: one input layer, one hidden layer, and one output layer. Using random numbers of hidden nodes, ELM can provide good generalization results with faster processing time [
30]. The output of ELM is calculated using Equation (1).
H is the output hidden layer matrix, while H+ is the generalized inverse Moore–Penrose matrix of H and HT is the transpose matrix of H. C is the regularization parameter representing the trade-off between minimizing the training error and maximizing the marginal distance.
Under-sampling and over-sampling techniques are applied to the dataset before training the ELM model to handle the data imbalance. The under-sampling works by reducing the majority class sample, either randomly or using some statistical knowledge; then, it can also continue with data cleaning to further filter the majority of class samples if necessary. Meanwhile, the over-sampling adds new samples that have been created or replicated to the minority class to balance the dataset. The new sample can include important information to minority class data to prevent misclassification [
31].
3.3.2. Second Model: Weighted-Extreme Learning Machine
Weighted-Extreme Learning Machine (WELM) is a development algorithm from ELM introduced by Zong et al. [
14] to handle datasets with an unbalanced class distribution by multiplying the weight elements
W into Equation (1) to strengthen the influence of the minority class and weaken the influence of the majority class. There are two types of
W value determination schemes.
W1 can be seen in Equation (2) and
W2 can be seen in Equation (3):
is the weight element on the matrix diagonal
W,
is the number of samples in class
, and
is the average sample class
. Using the schemes, the minority class will have a more significant
W value, and the majority class will have a smaller
W value, which is calculated using Equation (4).
3.3.3. Third Model: Weighted-Extreme Learning Machine with RBF Kernel
Since WELM is a development of ELM, one of the properties passed down to WELM is the number of hidden nodes in the hidden layer, which is determined randomly. This random determination can result in overfitting if the number of hidden nodes selected is too large [
21]. One solution to overcome this random selection of hidden nodes is to use a matrix kernel so that WELM output calculations no longer depend on the number of hidden nodes [
26]. When using the kernel on WELM, the WELM output function equation can be described in Equation (5):
3.3.4. Fourth Model: Weighted-Extreme Learning Machine with RBF kernel and Particle Swarm Optimization
This model uses PSO to perform feature selection and determine the
C and gamma hyperparameters that affect the RBF kernel. The steps of how the model works can be seen in
Figure 1.
PSO works by involving a group of particles with their respective positions and velocities to calculate the fitness value using Equation (6) in each iteration.
is the level of importance of the F1-score result and
is the level of importance of the number of features on the fitness of each particle [
27]. The position and velocity of the particles will change according to the position and velocity with the best fitness value in each iteration for each particle. The particle position is calculated using Equation (7), while the particle velocity is calculated using Equation (8).
In Equation (6),
is the position of the
n particle in the
k iteration. In Equation (7),
is the velocity of the
k particle in the
n iteration. The value of
is the value of the tendency of the particle to follow the best position from itself. At the same time,
is the value of the tendency of the particle to follow the best position of the best particle. The values of
and
are random factors with values between 0 and 1,
pBestk is the best position that the
k particle has produced, and
gBestk is the best position that all particles have produced during all iterations. PSO particle representation for feature selection is completed, utilizing the particle being initialized with as many features. It will be randomized using a binary number, where one means the feature is selected, and 0 means vice versa. For example, a particle with a binary number 11001 means that the features selected are features 1, 2, and 5 out of a total of 5 features.
Table 3 shows the PSO parameter values used in this research.
3.4. Evaluation Metrics
Accuracy is commonly utilized to evaluate the performance of a classifier. However, due to the characteristics of unbalanced data, which favor the majority class more, it is inapplicable to imbalanced learning [
32]. Rather than that, the precision, recall, and F1-score are used as evaluation indicators of this work to evaluate the performance of the models [
33]. The value range of F1-scores is [0, 1], where the classification’s performance improves as the values of the three evaluation indicators become closer to 1. Precision represents the ratio of the true positives to true positives (TP) and false positives (FP), and recall is the ratio of true positives (TP) to the total of the true positives and false negatives (FN). Meanwhile, the F1-score represents the harmonic mean of the precision and recall. The evaluation metrics are defined in Equations (9)–(11).
4. Results
4.1. Feature Extraction
The features extracted from the Microsoft Azure for Predictive Maintenance dataset are 25 features consisting of lag features with 3 h and 24 h intervals for the average and standard deviation of engine voltage, rotation, vibration, and pressure, the number of errors on an engine in the previous 24 h for each type of error, and the length of time the machine was used from the last time it was replaced for each engine component. After the feature extraction results are synchronized with the target, the data are cleaned and normalized. The total data after cleaning and normalization comprise 291,628 rows of data.
It can be seen in
Table 4 that the dataset used has a considerable imbalance between classes, where the normal condition of the machine has a percentage of 97.43% of the total data. With a large amount of data, the data are divided into twenty batches.
To divide the data into training data, validation and testing are carried out following the method carried out by Cardoso and Ferreira [
9] with the following details:
Training data will be taken from a sample of data with a time range from the start of the sample to 31 August 2015, 01:00 (1 January 2015 6:00–31 August 2015 1:00).
Validating data will be taken from sample data with times ranging from 1 September 2015, at 01:00 to 31 October 2015, at 1:00 (1 September 2015 01:00–31 October 2015 01:00).
Testing data will be taken from data samples with times ranging from 1 November 2015, 01:00 to the end of the sample (1 November 2015 1:00–1 January 2016 6:00).
The data division resulted in 9625 training data, 2478 validation data, and 2476 data lines for testing data.
4.2. First Model: Extreme Learning Machine with Under-Sampling and Over-Sampling
The ELM model experiment was carried out five times, with each experiment having ten iterations. The data were sampled randomly using the random under-sampling and over-sampling library from Python. In this model, all the features that have been extracted are used without selecting features, and the number of hidden nodes used is determined randomly. The best experiment result of ELM with under-sampling is shown in
Figure 2, while the best result of ELM with over-sampling is shown in
Figure 3.
The experiment showed that the performance of ELM with the under-sampling technique is better than that of ELM with the over-sampling technique. ELM with under-sampling produces an average F1-score value in the training set of 0.9201 for binary classification and 0.9511 for multiclass classification. On the testing set, the average F1-score reaches 0.9541 for binary classification and 0.9555 for multiclass classification. In contrast, ELM with over-sampling produces an average F1-score value in the training set of 0.9315 for binary classification and 0.9458 for multiclass classification and in the testing set of 0.9442 for binary and 0.9409 for multiclass classification.
4.3. Second Model: Weighted-Extreme Learning Machine
The WELM model experiment was carried out five times, with each experiment having ten iterations. In this model, all the features that have been extracted are used without selecting features, and the number of hidden nodes used is determined randomly.
As previously discussed, WELM overcomes data imbalances by using W-weighted elements. There are two types of W-weighted element schemes, so experiments were also carried out using both weight element schemes to determine which weight elements are more suitable for use in this dataset. The best experiment results of WELM are shown in
Figure 4 and
Figure 5.
In the W1 weighting scheme, the average F1-score values for binary class and multiclass are 0.8211 and 0.85 for the training set, respectively. Then, the average for the testing set is 0.3563 in the binary class and 0.4579 in the multiclass. There is a significant difference between training and testing scores, reaching 25.81% for binary class and 39.21% for multiclass.
The average F1-scores for the binary class and multiclass of the W2 weighted-scheme are 0.8697 and 0.8252 for the training set, while the average for the testing set is 0.3486 for the binary class and 0.4621 for the multiclass. The difference between training and testing scores is more significant than that of the W1 weighting scheme: up to 52.11% for binary class and 36.31% for multiclass.
Looking at the results of developing the WELM model further, the reason the results of the WELM evaluation were unsatisfactory could be due to the random selection of hidden nodes. As previously discussed, the number of hidden nodes that are too large can cause overfitting, so the difference in the F1-score values of the training set and the testing set is huge.
Figure 6 shows that the two WELM weighting schemes produce similar results, but the W
1 scheme is slightly superior when used in binary classification. In contrast, the W
2 scheme is superior when performing multiclass classification.
4.4. Third Model: Weighted-Extreme Learning Machine with RBF Kernel
The WELM model combined with the RBF kernel model experiment was carried out five times, each with ten iterations. The RBF kernel was used to solve hidden node problems. In this model, all the extracted features are used without selecting features. In addition, two hyperparameters are used by the combination of WELM and this kernel, namely the C regularization parameter and the kernel bandwidth γ, which are determined randomly. Binary classification uses the W
1 weight element scheme, and multiclass classification uses the W
2 element weight.
Figure 7 shows the experiment result of the WELM and RBF kernel combination.
The performance of the WELM model with the RBF kernel is similar to WELM, which uses hidden nodes. The average F1-score results from the training set of this method are 0.8549 for the binary class and 0.9327 for the multiclass. At the same time, the results of the average F1-score from the testing set are 0.3849 and 0.473 for each binary and multiclass. There is a significant difference between the results of training and testing the F1-score: 38.13% for the binary class and 54.78% for the multiclass.
4.5. Fourth Model: Weighted-Extreme Learning Machine with RBF Kernel and Particle Swarm Optimization
Some of the possible reasons that can be factors causing WELM’s poor performance with the kernel are the selection of hyperparameter values C and γ in the kernel, which can reduce classification accuracy if not tuned. Another reason could also be that not all the features used significantly influence data classification. Therefore, the development will further use PSO optimization to determine influential features and C and γ hyperparameter values in WELM with the RBF kernel.
The combination method of WELM with the RBF and PSO kernels for feature selection and hyperparameter determination was carried out five times for each data group (where one experiment consisted of 10 PSO iterations). After feature selection and hyperparameter tuning using PSO, the results of the F1-score model slightly increased with the achievement of 0.5072 in binary classification and 0.5116 in multiclass classification. This increase was obtained after 10 PSO iterations. The best experiment result can be seen in
Figure 8.
The best experiment result of binary classification was carried out by using feature numbers 1, 2, 4, 9, 10, 11, 15, 17, 19, 20, 21, 22, and 25 as well as the parameter values of 9.15243 × 10
14 for C and 604,728.939 for gamma. Meanwhile, the best experiment result of multiclass classification was carried out by using feature numbers 2, 5, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 22, and 23 as well as parameter values of 8.99652 × 10
14 for C and 649,796.189 for gamma. The details about the feature numbers can be seen in
Table 2.
6. Conclusions
This study uses ELM and its development to achieve the maximum performance of engine failure classification models with imbalanced datasets. To overcome dataset imbalances, the model used is ELM with under-sampling and over-sampling techniques, WELM, WELM combined with the RBF kernel, and WELM combined with the RBF kernel and PSO.
From the experiments and analyses that have been carried out, it is proved that the use of weight elements in WELM alone is still not able to overcome the problem of unbalanced datasets, as evidenced by the results of metric measurements such as F1-score, precision, and recall, which are sensitive to data which have not yet achieved satisfactory results. Meanwhile, under-sampling can balance the dataset, resulting in a better F1-score for each class. For future work, optimization methods other than the RBF kernel and PSO can be sought to improve WELM performance so that there is no need to use under-sampling or over-sampling techniques on data to achieve better performance.