1. Introduction
Maintenance is regarded as one of the critical functions in any organization [1,2,3]. Facility maintenance and operations comprise two main task areas: (1) the care and maintenance of the building and technical systems (including the repair and replacement of technical systems and equipment) and (2) the care and maintenance of outdoor areas [4]. In practice, an unexpected failure reduces production, decreases efficiency, increases cost, and affects personal safety [5,6]. A random failure of one equipment part can lead to failures in other parts of the system [7,8]. Hence, maintenance prediction (MP) has become a crucial maintenance strategy that aims to avoid the occurrence of failure and to reduce machine downtime and maintenance costs.
MP aims to forecast whether a component in a system will fail at a specific point in time in the future. Predictive maintenance (PdM) is usually performed to minimize preventive and corrective maintenance [9]. A PdM system is described as a set of components that measure one or more input variables and process them to estimate the failure probability of the equipment's future state in the system [10,11]. It usually analyses past and current data that describe the system status, events, and operations [12]. Extracting the relationships among transactions is very useful for building a knowledge base for predicting failures and maintenance requirements, as well as for analysing possible causes of deviations [13]. Therefore, this research proposes a framework for predicting the future sequence of maintenance activities (repair or replacement) of faulty product parts.
The data mining (DM) approach is reported to be effective in extracting the relationships between failures and in predicting failures using event sequences from historical data [14,15]. DM is defined as a field combining computer science and statistics that is used to extract hidden information and useful knowledge and to find correlations between events in large data sets [16]. DM techniques can be classified into two categories: descriptive and predictive. Descriptive techniques describe the input data, whereas predictive techniques are applied to the input to predict the required output [17]. DM techniques can reduce downtime and related costs by determining the occurrence of the next activity, identifying the dependencies between maintenance activities, and then predicting the occurrence of each maintenance sequence based on historical data [18]. DM techniques can be used for association rule mining, classification, prediction, and clustering [19].
Furthermore, the DM process is classified based on the data type into various categories: sequential pattern mining, association rule mining, classification, and prediction. Sequential pattern mining has many applications in customer behaviour analysis and web log mining. It identifies the pattern of activities among all objects during a specific time interval, and frequent sequential patterns are determined using a support (threshold) value. It can be performed using the Apriori algorithm. Attributes such as object number, maintenance type, spare parts used, and maintenance date are included in the transaction to determine a sequential pattern. Association rule mining uses the frequent patterns to identify the relationship between items that occurred in the first period and items that may occur in the next period using IF-THEN rules. The strength of an association rule is then measured using the support and confidence values. Finally, classification and prediction build a model to predict specific attributes considering the variability of the data [20].
Data mining techniques have been widely employed in prediction. For example, Jeong et al. [21] developed a decision support model for determining the target multi-family housing complex for green remodeling using data mining techniques. Jeong et al. [22] proposed a data-driven approach for establishing a CO2 emission benchmark for a multi-family housing complex using data mining techniques. Lv et al. [23] proposed a complex data fusion and efficient learning algorithm (multi-graphics processing unit (GPU)) to process multi-dimensional and complex big data based on the compositive rough set model. Zhang et al. [24] proposed a novel category-induced coarse-to-fine domain adaptation approach (C2FDA) for cross-domain object detection. Zhang et al. [25] analyzed the problems of existing container positioning methods and proposed a vision-based container position measuring system to provide precise parameters for container lifting operations. Mitici et al. [26] employed dynamic predictive maintenance for multiple components using data-driven probabilistic remaining useful life prognostics, illustrated by the case of turbofan engines. Zhou et al. [27] explored the applicability of data mining and knowledge discovery in combination with geographic information system technology to allow management to better decide maintenance strategies, set rehabilitation priorities, and make investment decisions. Mining algorithms including decision trees and association rules were used in the analysis. The selected rules were employed to predict the maintenance and rehabilitation strategy of road segments. A pavement database covering four counties within the state of North Carolina, which was provided by North Carolina DOT, was used to test this method. Dindarloo and Siami-Irdemoosa [28] investigated the application of classification and clustering approaches for pattern recognition and failure forecasting on mining shovels. The failure behaviour of a fleet of ten mining shovels during one year of operation was examined. The shovels were classified into four clusters using k-means clustering algorithms. Future failures were predicted using the support vector machine classification technique. Historical data for failure and time to repair were used to predict the next failure type for all shovels. Moharana et al. [29] suggested a framework for extracting the sequential patterns of maintenance activities and related spare parts information from historical records of maintenance data with pre-defined support or threshold values. Gharoun et al. [30] proposed an algorithm for fault detection in terms of condition-based maintenance with data mining techniques for an aircraft turbofan engine using flight data. The data-driven models were used to model the relationship between engine exhaust gas temperature (EGT) and other operational and environmental parameters of the engine. The faults occurring in each flight were detected based on the identification of abnormal events by a one-class support vector machine trained by the health condition EGT residual data set. Gholami and Hafezalkotob [31] combined data mining techniques and time series models to schedule maintenance activities. The clustering algorithm was adopted to categorize failures based on the similarity in types of maintenance activities. Then, rules were extracted for characterizing the clusters and presenting a range for each factor by applying a proper association rule algorithm. Subsequently, time series models were employed to predict the period in which a factor may meet its rule's range. Kalathas and Papoutsidakis [32] adopted stored-inactive data from a Greek railway company, used the method of data mining, and applied machine learning techniques to create strategic decision support and develop a risk and control plan for trains. Carrasco et al. [33] generalised the concept of positive and negative instances into intervals to evaluate unsupervised anomaly detection algorithms. They proposed the Preceding Window ROC, a generalisation for the calculation of ROC curves for time series scenarios, and adapted the mechanism from an established time series anomaly detection benchmark to the proposed generalisations to reward early detection. The proposed evaluation method was assessed in a case study of big data algorithms with a real-world time series problem provided by the ArcelorMittal company.
However, this research utilizes supervised data mining tools, the generation of sequential patterns and association rule mining, for predicting the next maintenance activities for faulty product parts with and without product attributes. The results of this research are valuable to maintenance engineers in the effective planning and management of maintenance activities. A real case study of a washing machine will be provided to illustrate the developed data mining framework.
The remainder of this paper is structured as follows:
Section 2 develops the data mining framework for maintenance prediction.
Section 3 presents an application of the framework.
Section 4 discusses the research results.
Section 5 summarizes research conclusions and future research.
2. Development of Data Mining Framework
The developed framework is shown in Figure 1.
Stage 1. Maintenance data regarding maintenance activities and attributes are collected. The data may be temporal, such as production date, repair date, and sales date. Categorical data include model and engine type [34]. Numerical data include mileage at repair, maintenance, and spare parts cost. Finally, textual data include the failure description and the corrective action taken. For example, assume there are four objects as shown in Table 1, where different maintenance activities (M) are considered over three months. The three maintenance activities are M1, M2, and M3. The attributes of a product are production year (2020, 2021) and engine type (A, B). The time to failure between two activities is one month, and the M1, M2, and M3 activities can occur in any month. Moreover, the first activity denotes the maintenance activity that occurred in the first month, and so on.
Stage 2. Maintenance activity prediction is conducted without attributes as follows:
Step 1. The types of maintenance activities that are applied to objects to keep the product performing well are defined, taking into account the time at which each activity occurred. From Table 1, three maintenance activities, M1, M2, and M3, will be analyzed.
Step 2. The Generalized Sequential Patterns (GSPs) are identified. A sequence database contains sequences of ordered activities with or without time information. A sequence consists of multiple events, where each event consists of one or more items in the transaction. For example, V = {v1, v2, …, vn} is the set of activities, n is the number of activities, and the sequence is Q = (e1, e2, …, em), in which e represents an event and m is the number of events in the transaction, given that e2 occurs after e1. The possible number of patterns (N) can be computed using Equation (1) [29].
For illustration, each of the three activities can occur at any maintenance activity number, as shown in Table 2. Thus, n = 9, which yields 117 possible patterns according to Equation (1). Examples of possible patterns include the following:
(First activity = M1, Second activity = M2)
(First activity = M3, Second activity = M2)
(First activity = M1, Second activity = M2, Third activity = M3)
Sequences are then classified as frequent or infrequent based on the support value (Sup.), which represents the percentage of all objects that contain the sequence, as given in Equation (2). If the support threshold is too low, too many items are involved in the patterns and rules; if it is too high, too few items are involved. Therefore, the minimum support value is determined by the manufacturer based on the characteristics of the products [35]. When the support value of a sequence is greater than or equal to the specified threshold, the sequence is classified as a frequent sequential pattern.
Figure 2 presents the GSP framework. The support value for events e1 and e2, Sup(e1 → e2), does not depend on the order of the events and is calculated as follows:

$$\mathrm{Sup}(e_1 \to e_2) = \frac{|e_1 \cup e_2|}{|R|} \tag{2}$$

where e1 and e2 denote the first event of the sequence and the remaining events of the sequence, respectively, |e1 ∪ e2| is the number of objects that contain both e1 and e2, and |R| is the number of objects.
The GSP for maintenance activities is used to identify the frequent patterns of maintenance activities. In Table 1, for example, one of the possible patterns is (First activity = M2, Second activity = M3). Sup(e1 → e2) can then be computed using Equation (2) as follows:

$$\mathrm{Sup}(e_1 \to e_2) = \frac{|e_1 \cup e_2|}{|R|} = \frac{2}{4} = 50\%$$

where e1 and e2 refer to the first activity (M2) and the second activity (M3), respectively, |e1 ∪ e2| is the number of objects (=2) whose first activity is M2 and whose second activity is M3, and |R| is the number of objects (=4).
The Sup(e1 → e2) value of 50% indicates that 50% of all objects have the sequence pattern (First activity = M2, Second activity = M3). All possible patterns are then defined and classified as frequent or infrequent based on a predefined support threshold. The best sequential patterns are further used to generate the association rules between items and, finally, to determine the significant sequential patterns.
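The support calculation and the frequent/infrequent classification described above can be sketched in a few lines of Python. This is only an illustration: the four sequences below are a hypothetical stand-in for Table 1, chosen so that the counts match the figures quoted in the text, and the function names `support` and `frequent_patterns` are assumptions rather than part of the framework. A pattern is treated here as position-specific (first activity, second activity, and so on).

```python
from itertools import product

# Hypothetical stand-in for Table 1: ordered maintenance activities
# (first, second, third) for four objects. Not the paper's actual data.
objects = {
    "O1": ("M2", "M3", "M1"),
    "O2": ("M2", "M3", "M1"),
    "O3": ("M2", "M1", "M3"),
    "O4": ("M1", "M2", "M3"),
}

def support(pattern, sequences):
    """Fraction of objects whose leading activities equal `pattern`."""
    hits = sum(1 for seq in sequences.values() if seq[:len(pattern)] == pattern)
    return hits / len(sequences)

def frequent_patterns(sequences, activities, length, min_support):
    """Enumerate candidate patterns of `length` and keep the frequent ones."""
    result = {}
    for pattern in product(activities, repeat=length):
        s = support(pattern, sequences)
        if s >= min_support:
            result[pattern] = s
    return result

print(support(("M2", "M3"), objects))  # 0.5, i.e. 2 of 4 objects
print(frequent_patterns(objects, ("M1", "M2", "M3"), 2, 0.5))
```

With a 50% threshold, only (First activity = M2, Second activity = M3) survives in this toy data set; lowering the threshold to 25% would also admit the patterns that occur in a single object.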
Step 3. The framework for generating the association rules is developed as shown in Figure 3. An association rule is a sequential relationship between the condition and decision activities. Condition activities (C) are the activities implemented for a specific object. Decision activities (D) are the activities on the same object that will be serviced at a later time. The Apriori algorithm is then employed for generating the association rules. Typically, an association rule takes the form IF (condition) THEN (decision). The condition and the decision can each comprise more than one activity. Statistical analysis including the support (Sup.) and confidence (Conf.) is applied to identify the significant association rules with the corresponding lift. The confidence of the rule C → D is the probability that an object has D given that it has C. The rule is significant when the confidence value is greater than or equal to a minimum confidence level determined by experts. Confidence depends on the order of events, so Conf.(C → D) is different from Conf.(D → C). The Conf. value is computed using Equation (4):

$$\mathrm{Conf}(C \to D) = \frac{|C \cup D|}{|C|} \tag{4}$$

where |C ∪ D| is the number of objects that contain both C and D, while |C| is the number of objects that contain C. The lift, or interest factor, is the ratio of the joint probability of the events to the joint probability expected if they were statistically independent. A lift value of one indicates that the events are independent, and a rule with a lift value less than or equal to one is regarded as insignificant, while a lift value greater than one implies that the events are dependent. The lift(C → D) value is computed using Equation (5):

$$\mathrm{lift}(C \to D) = \frac{|C \cup D| / |R|}{\left(|C| / |R|\right)\left(|D| / |R|\right)} \tag{5}$$
The association rules can be generated for the maintenance activities to examine the sequential relationships between maintenance activities and determine the condition and decision activities. For illustration, in Table 1 the best frequent pattern is (First activity = M2, Second activity = M3). This pattern, which has a support value of 50%, is used to generate the association rule. Here, C denotes First activity = M2, |C| is the number of objects whose first activity is M2, D denotes Second activity = M3, and |C ∪ D| is the number of objects whose first and second activities are M2 and M3, respectively. The values of Conf.(C → D) and lift(C → D) are then calculated using Equations (4) and (5), respectively, as follows:

$$\mathrm{Conf}(C \to D) = \frac{|C \cup D|}{|C|} = \frac{2}{3} = 66.7\%$$

$$\mathrm{lift}(C \to D) = \frac{2/4}{(3/4)(2/4)} = 1.33$$
The Conf.(C → D) value indicates that 66.7% of the objects whose first activity is M2 also have M3 as the second activity. Suppose that the confidence threshold is 60%; then this sequence is considered a significant pattern. The calculated lift value (=1.33) is greater than 1, so the condition and decision activities of this rule are dependent. When a sequence consists of three or more events, several rules can be extracted from one sequence by changing the condition and decision activities. All possible association rules should be classified as significant or insignificant. Only the significant rules are further employed to predict the next maintenance activities.
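As a hedged illustration of the confidence and lift calculations, the following Python sketch evaluates the rule IF (First activity = M2) THEN (Second activity = M3). The four records are a hypothetical stand-in for Table 1, chosen only to be consistent with the quoted values (support 50%, confidence 66.7%, lift 1.33); `rule_metrics` is an assumed helper name, and lift is taken as the joint probability divided by the product of the marginal probabilities.

```python
# Hypothetical stand-in for Table 1: (first activity, second activity) per object.
objects = [("M2", "M3"), ("M2", "M3"), ("M2", "M1"), ("M1", "M2")]

def rule_metrics(records, condition, decision):
    """Return (support, confidence, lift) of the rule condition -> decision."""
    n = len(records)
    n_c = sum(1 for r in records if condition(r))                   # |C|
    n_d = sum(1 for r in records if decision(r))                    # |D|
    n_cd = sum(1 for r in records if condition(r) and decision(r))  # |C u D|
    if n_c == 0 or n_d == 0:
        raise ValueError("condition or decision never occurs in the data")
    sup = n_cd / n
    conf = n_cd / n_c
    lift = sup / ((n_c / n) * (n_d / n))
    return sup, conf, lift

sup, conf, lift = rule_metrics(
    objects,
    condition=lambda r: r[0] == "M2",  # First activity = M2
    decision=lambda r: r[1] == "M3",   # Second activity = M3
)
print(f"Sup = {sup:.1%}, Conf = {conf:.1%}, lift = {lift:.2f}")

# Compare against expert-chosen thresholds (60% confidence, as in the text).
is_significant = conf >= 0.60 and lift > 1
```

Swapping the two predicates gives Conf.(D → C), which generally differs from Conf.(C → D), whereas the lift is symmetric in C and D.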
Step 4. The rule-based classification of the maintenance activities is performed using the collection of significant rules, and the prediction is evaluated using coverage and accuracy measures. Let Nv and Nr denote the number of objects that satisfy the condition of the rule and the number of objects that satisfy both the condition and the decision of the rule, respectively, and let |R| be the number of objects. Coverage is the percentage of records that satisfy the condition of the rule, as shown in Equation (6). Accuracy is the ratio of the number of records that satisfy both the condition and the decision to the number of records that satisfy the condition, as shown in Equation (7).

$$\mathrm{Coverage} = \frac{N_v}{|R|} \tag{6}$$

$$\mathrm{Accuracy} = \frac{N_r}{N_v} \tag{7}$$
For illustration, suppose that testing data are collected for four objects as shown in Table 3. Suppose that the rule IF (First activity = M2) THEN (Second activity = M3) is identified as a significant rule of maintenance activities.
Here, Nv is the number of objects with (First activity = M2), which satisfy the condition of the rule, and Nr is the number of objects with (First activity = M2, Second activity = M3), which are correctly classified by the rule. The coverage and accuracy are then estimated as follows:

$$\mathrm{Coverage} = \frac{N_v}{|R|} = \frac{3}{4} = 75\%, \qquad \mathrm{Accuracy} = \frac{N_r}{N_v} = \frac{3}{3} = 100\%$$
The estimated coverage and accuracy values indicate that 75% of all objects have the first activity M2 and 100% of them validate the rule, respectively. All significant rules should be tested and evaluated.
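The rule-based classification and its evaluation can be sketched as follows. The test records are a hypothetical stand-in for Table 3, chosen to reproduce the quoted coverage of 75% and accuracy of 100%. The rule representation (a mapping from the observed first activity to the predicted second activity) and the function names are illustrative assumptions.

```python
# Hypothetical stand-in for Table 3: (first activity, second activity) per test object.
test_objects = [("M2", "M3"), ("M2", "M3"), ("M1", "M2"), ("M2", "M3")]

# Significant rule: IF first activity = M2 THEN second activity = M3.
rules = {"M2": "M3"}

def predict_next(first_activity, rules):
    """Return the predicted second activity, or None if no rule applies."""
    return rules.get(first_activity)

def coverage_and_accuracy(records, condition, decision):
    """Coverage = Nv / |R|; accuracy = Nr / Nv for one rule."""
    covered = [r for r in records if r[0] == condition]  # satisfy the condition
    n_v = len(covered)
    n_r = sum(1 for r in covered if r[1] == decision)    # decision also holds
    coverage = n_v / len(records)
    accuracy = n_r / n_v if n_v else 0.0
    return coverage, accuracy

coverage, accuracy = coverage_and_accuracy(test_objects, "M2", rules["M2"])
print(f"Coverage = {coverage:.0%}, Accuracy = {accuracy:.0%}")
```

Objects whose first activity matches no rule (here the one starting with M1) receive no prediction, which is why coverage is reported alongside accuracy.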
Stage 3. Steps 2–4 of Stage 2 are repeated for maintenance activities with attributes. Each object has specific attributes that describe its properties and define its physical or non-physical characteristics. In Table 1, the production year (2020, 2021) and engine type (A, B) are the attributes of the objects. For illustration:
Step 1. The Generalized Sequential Pattern (GSP) is applied for planning the maintenance activities with attributes: production year only, engine type only, or both attributes together. All sequences should then be analysed to identify the best sequence for each attribute and for both attributes. Table 4 presents the possible attributes of objects with either one attribute or both attributes.
One of the possible attributes that can be analysed with maintenance activities is Engine type = A. Suppose that the best sequence of maintenance activities is (Engine type = A, First activity = M2, Second activity = M3); then, the support value can be computed as follows:

$$\mathrm{Sup}(e_1 \to e_2) = \frac{|e_1 \cup e_2|}{|R|} = \frac{2}{4} = 50\%$$

where e1 denotes the attribute Engine type = A, e2 denotes the sequence (First activity = M2, Second activity = M3), and |e1 ∪ e2| is the number of objects that contain Engine type = A, First activity = M2, and Second activity = M3. The estimated Sup(e1 → e2) value implies that 50% of all objects have Engine type = A and the sequence (First activity = M2, Second activity = M3).
Step 2. The association rules can be generated for the best sequence of maintenance activities with each possible attribute, that is, either one attribute or both. The condition is the attribute (production year, engine type, or both), while the decision is the sequence of maintenance activities. A rule is considered significant if its Conf.(C → D) value is greater than or equal to the confidence threshold. For example, consider the association rule on the sequence (Engine type = A, First activity = M2, Second activity = M3), which is taken as the best pattern for Engine type = A. The Conf.(C → D) and lift values are calculated using Equations (4) and (5), respectively, with

$$\mathrm{Conf}(C \to D) = \frac{|C \cup D|}{|C|} = \frac{2}{3} = 66.7\%$$

where C is the attribute Engine type = A, |C| is the number of objects that contain Engine type = A, D denotes the sequence (First activity = M2, Second activity = M3), and |C ∪ D| is the number of objects that contain Engine type = A, First activity = M2, and Second activity = M3.
The calculated Conf.(C → D) means that 66.7% of the objects with Engine type = A have the sequence (First activity = M2, Second activity = M3). Suppose that the confidence threshold is 60%; then this rule is significant, and because its lift value is greater than 1, the attribute and the sequence are concluded to be dependent.
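The same measures can be computed when the condition is an attribute (or a combination of attributes) and the decision is a sequence of activities. The sketch below is illustrative only: the records, including the production years assigned to each object, are hypothetical values chosen to match the quoted support (50%) and confidence (66.7%), and `attribute_rule` is an assumed helper name.

```python
# Hypothetical stand-in for Table 1 with attributes; the production years and
# sequences are invented so that the counts match the figures in the text.
records = [
    {"year": 2020, "engine": "A", "seq": ("M2", "M3")},
    {"year": 2021, "engine": "A", "seq": ("M2", "M3")},
    {"year": 2020, "engine": "A", "seq": ("M2", "M1")},
    {"year": 2021, "engine": "B", "seq": ("M1", "M2")},
]

def has_attributes(record, attributes):
    """True if the record matches every attribute/value pair in the condition."""
    return all(record[key] == value for key, value in attributes.items())

def has_sequence(record, sequence):
    """True if the record's leading activities equal the decision sequence."""
    return record["seq"][:len(sequence)] == sequence

def attribute_rule(records, attributes, sequence):
    """Return (support, confidence, lift) for attributes -> sequence."""
    n = len(records)
    n_c = sum(has_attributes(r, attributes) for r in records)
    n_d = sum(has_sequence(r, sequence) for r in records)
    n_cd = sum(has_attributes(r, attributes) and has_sequence(r, sequence)
               for r in records)
    if n_c == 0 or n_d == 0:
        raise ValueError("condition or decision never occurs in the data")
    sup = n_cd / n
    return sup, n_cd / n_c, sup / ((n_c / n) * (n_d / n))

# One attribute: Engine type = A.
sup_a, conf_a, lift_a = attribute_rule(records, {"engine": "A"}, ("M2", "M3"))
# Both attributes together: Engine type = A and production year = 2020.
sup_b, conf_b, lift_b = attribute_rule(
    records, {"engine": "A", "year": 2020}, ("M2", "M3"))
print(sup_a, round(conf_a, 3), round(lift_a, 2))
```

Passing a dictionary of attribute values keeps the one-attribute and two-attribute cases in a single code path, mirroring how the same steps are applied to each attribute combination.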
Step 3. The rule-based classification is conducted by testing the significant rules of maintenance activities with each attribute using the testing data in Table 3 and measuring the coverage and accuracy. One of the significant rules states that, if an object has Engine type = A, then the next sequence of maintenance activities is (First activity = M2, Second activity = M3). According to the testing data in Table 3, the coverage and accuracy values are calculated using Equations (6) and (7), respectively:

$$\mathrm{Coverage} = \frac{N_v}{|R|} = \frac{3}{4} = 75\%, \qquad \mathrm{Accuracy} = \frac{N_r}{N_v} = \frac{2}{3} = 66.7\%$$

where Nv is the number of objects with (Engine type = A), Nr is the number of those objects with (First activity = M2, Second activity = M3), which are correctly classified by the rule, and |R| is the number of objects.
The estimated coverage indicates that 75% of objects are produced with Engine type = A and 66.7% of them validate the rule. All the significant rules should be tested and evaluated.
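A minimal sketch of this evaluation on attribute-based rules is given below. The test records are a hypothetical stand-in for Table 3 with an engine-type attribute, chosen to reproduce the quoted coverage (75%) and accuracy (66.7%); `evaluate_rule` is an assumed name.

```python
# Hypothetical stand-in for Table 3 with the engine-type attribute.
test_records = [
    {"engine": "A", "seq": ("M2", "M3")},
    {"engine": "A", "seq": ("M2", "M3")},
    {"engine": "A", "seq": ("M1", "M2")},
    {"engine": "B", "seq": ("M2", "M3")},
]

def evaluate_rule(records, attributes, sequence):
    """Coverage (Nv / |R|) and accuracy (Nr / Nv) of attributes -> sequence."""
    covered = [r for r in records
               if all(r[k] == v for k, v in attributes.items())]
    n_v = len(covered)
    n_r = sum(r["seq"][:len(sequence)] == sequence for r in covered)
    coverage = n_v / len(records)
    accuracy = n_r / n_v if n_v else 0.0
    return coverage, accuracy

coverage, accuracy = evaluate_rule(test_records, {"engine": "A"}, ("M2", "M3"))
print(f"Coverage = {coverage:.1%}, Accuracy = {accuracy:.1%}")
```

Running the same function over every significant rule, with and without attributes, yields the coverage and accuracy pairs needed for the comparison in the final stage.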
Stage 4. A comparison is made between the results of Stages 2 and 3, that is, between the predictions of maintenance activities without and with attributes.