*3.2. Data Modeling*

The AutoML platform analyzes the dataset and constructs the models automatically. This analysis includes text processing, which Google AutoML performs alongside the handling of the other categorical and numeric data.

Multiclass classification is applied to develop four machine learning models, with Google AutoML automatically selecting a suitable algorithm for each. Google AutoML is designed to help researchers handle large datasets and build high-accuracy models with minimal coding experience and resource consumption. The datasets are uploaded, the input features are defined, and the target elements are selected. The prepared datasets are analyzed to obtain three independent models that evaluate every claim and predict three independent values for severity, occurrence, and impact, from which an RPN can be calculated by applying Equation (1). Additionally, a fourth model (for category) is obtained to identify the manufacturing process that caused the failure. The manufacturing process can be cutting, bending, welding, painting, assembly, packaging, or transportation. The scope of this fourth model could be extended in the future to more specific processes, such as welding machine 1, assembly line 2, and so on. Figure 7 illustrates the four models obtained after training. Since the AutoML platform is a cloud system, processing consumption can be measured in node hours. The training process consumed 0.944, 1.105, 0.86, and 1.111 node hours for severity, occurrence, impact, and category, respectively. Every node hour includes the use of 92 n1-standard-4 equivalent machines in parallel, where a single n1-standard-4 machine operates four virtual CPUs and 16 GB of RAM.
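The combination step can be illustrated with a short sketch. Equation (1) is not reproduced in this section, so the sketch assumes the conventional product form of the paper's modified RPN, severity × occurrence × impact; the function name and example class values are illustrative only.

```python
def rpn_from_predictions(severity: int, occurrence: int, impact: int) -> int:
    """Combine the three independent class predictions (from models MS, MO,
    and MI) into a single RPN. Equation (1) is assumed here to be the
    product S x O x I."""
    return severity * occurrence * impact


# Hypothetical predictions for one claim: severity 3, occurrence 4, impact 2.
print(rpn_from_predictions(3, 4, 2))  # -> 24
```

The fourth model (MC) is independent of this calculation: its categorical output routes the claim to a manufacturing process rather than contributing to the RPN.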

**Figure 7.** RPN evaluation and Category Classification models.

*3.3. Model Evaluation*

The models are evaluated using three metrics, namely precision, recall, and F1 score. Further evaluation metrics are also adopted here, namely the area under the curve (AUC) and the confusion matrices.

Precision is the proportion of true positive predictions among all positive predictions (true positives and false positives). Recall is the proportion of true positive predictions among all actual positives (true positives and false negatives). The F1 score is a balanced evaluation of precision and recall, and it is especially useful when the data are not equally distributed across classes.
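These three definitions can be computed directly from the confusion counts. The sketch below uses illustrative counts, not the paper's data:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true positive (tp),
    false positive (fp), and false negative (fn) counts."""
    precision = tp / (tp + fp)          # share of positive predictions that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Illustrative counts for one class of a multiclass model:
p, r, f = precision_recall_f1(tp=80, fp=10, fn=20)
# precision ~ 0.889, recall = 0.800, F1 ~ 0.842
```

For a multiclass model such as those trained here, these values are computed per class and then averaged.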

The area under the precision-recall curve (AUC-PR) and the area under the receiver operating characteristic curve (AUC-ROC) are used to visualize the performance of the models. AUC-PR shows the trade-off between precision and recall, while AUC-ROC shows the trade-off between the true positive rate and the false positive rate.
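AUC-ROC has a useful rank interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it from that interpretation (scores are illustrative; AutoML reports these metrics automatically, so this is only to make the quantity concrete):

```python
def auc_roc(scores_pos, scores_neg):
    """AUC-ROC via its rank interpretation: the probability that a random
    positive example receives a higher score than a random negative one
    (ties count as half)."""
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in scores_pos
        for sn in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Illustrative classifier scores for positive and negative examples:
print(auc_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # -> 8/9, i.e. ~0.889
```

A value close to 1, as reported for the four models here, means positives are almost always ranked above negatives.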

The confusion matrices are used here to elaborate on the prediction accuracy and the accepted tolerance for every class. For example, a higher prediction from any of the RPN element models (MS, MO, and MI) can be accepted, because assigning a higher ranking value to an incident increases its priority. The limitation here is the degree of tolerance accepted by the company. To elaborate, when an incident truly belongs to severity class three, a prediction one step higher than the actual value is accepted; however, it could be inefficient if the model predicts two or more steps higher than the incident deserves.
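This one-step tolerance rule can be checked programmatically over a set of predictions. The sketch below is an assumption-laden illustration: the function name is hypothetical, exact matches are treated as acceptable, and undershooting is counted as unacceptable (since a lower ranking would deprioritize a real failure), which the text implies but does not state explicitly.

```python
def tolerance_accuracy(true_labels, predicted_labels, tolerance=1):
    """Fraction of predictions accepted under the company's tolerance rule:
    a prediction is acceptable if it equals the true class or overshoots it
    by at most `tolerance` steps."""
    accepted = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if t <= p <= t + tolerance
    )
    return accepted / len(true_labels)


# Severity class 3 predicted as 4 is accepted; class 2 predicted as 4 is not.
print(tolerance_accuracy([3, 3, 2, 5], [3, 4, 4, 5]))  # -> 0.75
```

Such a metric complements the confusion matrices by summarizing how often the model stays within the tolerance the company is willing to accept.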

Finally, the accuracy of the models is affected by the type of every element, the number of rows used for training, the accuracy of the details provided per row, and the scale of every element (i.e., the number of classes per element). It is important here to recall the objective of this work, which is to provide a proof of concept that machine learning is an effective technique for enhancing FMEA and the development of the RPN value.

#### **4. Results and Discussion**

In this research, four machine learning models are trained and evaluated successfully. Table 3 summarizes the training evaluation results and accuracy metrics for the four models of severity (MS), occurrence (MO), impact (MI), and category (MC). The evaluation sample was automatically split and tested by the AutoML platform.

The performance metrics in Table 3 show relatively high-quality models, with different levels of precision for each model. The AUC-PR and AUC-ROC values are close to 1, which indicates high-quality classification models. Moreover, the models' precision rates are 93.2%, 87.6%, 89.9%, and 86.6% for MS, MO, MI, and MC, respectively, which indicates that the models correctly predicted the classes of the validation sample for every model.



The highest F1 score is recorded for MI, where the full dataset is used for training and the classification is among only three classes (1, 2, or 3); the training dataset for MI contains 866, 511, and 109 readings for classes 1 to 3, respectively.

The confusion matrices shown in Tables 4–7 below show that true predictions are concentrated in the diagonal cells of all models. However, models MS and MO show more confusion between predicted and true labels, in contrast to MI and MC, where the concentration in the diagonal cells is higher. This is strongly connected to the data volume and will improve when a larger volume of data is used to upgrade the models.

However, predicting a higher value than the true one (the off-diagonal cells above the true label in the confusion matrices) could be accepted for the three models MS, MO, and MI, as a higher predicted value for severity increases the RPN and therefore raises the priority of resolving the failure. This tolerance is not acceptable for MC, however, because MC carries a totally different interpretation: it identifies the manufacturing process from which the root cause of the failure originates. The model should not predict a false manufacturing process instead of the true one; in other words, a prediction that a failure is caused by process (X) is totally rejected if the failure is actually caused by a different process. Such a disadvantage can be mitigated during the transition stage, in which automatic claims evaluation runs in parallel with the traditional manual process, so that the next trained model improves once a larger dataset has accumulated.


**Table 4.** Confusion matrix for the model of severity (MS).


**Table 5.** Confusion matrix for the model of occurrence (MO).


**Table 6.** Confusion matrix for the model of impact (MI).


**Table 7.** Confusion matrix for the model of category prediction (MC).

Another approach to evaluating the developed models is to examine the RPN in the original dataset (the actual RPN) against the RPN obtained by applying Equation (1) to the three predicted elements (the predicted RPN). The frequency histogram in Figure 8 compares the two readings (actual vs. predicted) for the overall dataset and shows a high overlap between the two RPN values. Applying statistical accuracy measurements between the actual and predicted values yields a mean absolute error of 3.86 and a root mean squared error of 12.76, both of which represent acceptable accuracy of the predicted values against the actual ones. This second evaluation approach therefore also shows that the developed models are effective and efficient.
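The two accuracy measurements used above can be computed as follows. The sketch uses a few illustrative actual/predicted RPN pairs, not the paper's data:

```python
import math


def mae_rmse(actual, predicted):
    """Mean absolute error and root mean squared error between two
    sequences of RPN values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Illustrative actual vs. predicted RPN pairs for three claims:
mae, rmse = mae_rmse([24, 60, 140], [24, 48, 100])
# mae ~ 17.33, rmse ~ 24.11
```

Note that RMSE penalizes large errors more heavily than MAE, which is why the reported RMSE (12.76) exceeds the MAE (3.86): a few claims with large RPN deviations dominate the squared term.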

In contrast, the histogram in Figure 8 shows a shortage in predicting higher RPN values when the multiplication result exceeds 80 (the values larger than 140 in the histogram are a clear example). This weakness is due to a lack of training data in the high classes for severity and occurrence, given that the dataset used in this activity contained 1532 claims for a single product over one year only. Enhanced accuracy can be reached by enlarging the training dataset, which can be fulfilled as more data accumulates over time. Further improvement can be achieved by reviewing the models' predictions after a testing period, during which an expert engineer compares the proposed AutoML approach with the traditional approach and derives an enhanced, extended dataset for retraining the models. Another way to improve the models is to reduce the classification scale for severity and occurrence from 1–10 to 1–5; such a change would improve model precision and accuracy. Notably, the accuracy of MI, which classifies over only three classes, is higher than that of the others.

**Figure 8.** Frequency of the originally evaluated RPN values vs. the RPN obtained from the predicted severity, occurrence, and impact classes.

Since the results of the proposed method show acceptable accuracy, given the dataset volume and the method used, the models can be deployed at the partner company. The advantage of the proposed approach over the traditional one is that it replaces human intervention in the process and automates decision making. In the traditional approach, once a claim is received from the mother company in Germany, a quality engineer in the quality management office in Hungary reviews the claim, decides the failure mode type, and then assigns values to the three elements to calculate the RPN. Based on this judgment, further actions are decided: the issue can be transferred to critical-issue resolution using strategies such as the eight disciplines methodology (8D) if the RPN is above 160 points, the quality checklists at the production shop floor can be updated, or both. However, this human intervention may introduce implementation errors because it depends on the evaluator's experience. For example, assume one quality engineer evaluates a claim at 160 points, while another engineer underestimates the same claim at 140 points based on his experience and memory. In the first case, the engineer transfers the claim to a more sophisticated process (the 8D strategy), which entails using more resources by forming a team to follow up and resolve the issue; in the second case, the claim is merely highlighted to the production management. This discrepancy arises because the process depends mainly on the individual judgment and experience of staff members, who may give inaccurate estimations. If, instead, the process is carried out by a machine that makes decisions based on an accumulated learning process, such uncertainty in decision making can be avoided.
Thus, the proposed solution replaces this human intervention with a machine learning algorithm that evaluates claims based on accumulated, non-individualized experience and avoids the uncertainty inherent in the experience of individual quality engineers. Moreover, the proposed approach can automatically analyze new claims and construct correlations between incidents, thereby improving future predictions. Such a process saves time and effort and improves responsiveness to failures, either by instantly alerting the quality management team to serious issues or by automatically updating the quality checklists on the production shop floor, notifying labor and production staff of the issue in real time. From a business perspective, the proposed solution can be operated at any time and provides higher efficiency and effectiveness.
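The escalation rule described in the discussion can be sketched as a simple routing function. The 160-point 8D threshold comes from the text; the function name and routing labels are illustrative, and treating exactly 160 points as triggering 8D follows the worked example above (the text's "above 160 points" leaves the boundary case slightly ambiguous):

```python
def route_claim(rpn: int, threshold: int = 160) -> str:
    """Route a claim based on its (predicted) RPN. Claims at or above the
    threshold are escalated to 8D critical-issue resolution; the rest
    update the shop-floor quality checklists, per the text's description."""
    if rpn >= threshold:
        return "escalate to 8D problem-solving team"
    return "update production quality checklists"


print(route_claim(160))  # -> escalate to 8D problem-solving team
print(route_claim(140))  # -> update production quality checklists
```

With the models deployed, this routing happens on the predicted RPN rather than on an individual engineer's estimate, which is exactly the inconsistency (160 vs. 140 points for the same claim) that the discussion identifies.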
