*3.1. Data Pre-Processing*

A dataset containing 1532 rows of failure incidents was received from the industrial partner of this research project. The data was extracted from the company's ERP system, was recorded over one year, and relates only to the selected product shown in Figure 2. Every row in the dataset contains the details of a single incident and is described by 23 input features (columns) that help the quality engineers recognize the failure mode and then refer to the FMEA documents to assign the RPN value that fits this failure mode. For example, a failure is claimed from the assembly line in Germany, where the engineers reported an incident of "insufficiently tightened screws at one component in the device". Along with the reported failure, further information is provided: the serial number of the device, the code number of the component as given in the design, a textual description written by the worker who resolved the issue (including their opinion on the issue and its criticality), the damage code picked from the list of options in the input screen, a textual explanation of the expected root cause, the time taken to fix the problem, the cost of rework, and the final conclusion. Table 1 summarizes the types of the input features and their roles in the models.

This dataset is used to develop the machine learning models. The first step is to prepare the data for the AutoML platform, which includes ensuring that all features of the dataset are organized and that their data types are well defined. Additionally, the claims are validated manually by the quality management team at the partner company using a specially programmed interface that facilitates the manual validation process. This manual validation was carried out to ensure the quality of the input data and, in turn, the quality of the output models.


**Table 1.** Dataset input features for the machine learning model.

Furthermore, 46 rows are excluded from the training process because they lack critical details such as the textual claim description and the root cause input. Moreover, the severity ratings 8–10 and the occurrence ratings 7–10 each had an insufficient number of claims (fewer than 50 rows), so these records are excluded as well, as shown in Table 2a. The reason is that the AutoML platform cannot start training with fewer than 50 readings per class. Therefore, the dataset is copied three times, once per target, and in each copy the classes with fewer than 50 readings are eliminated. Finally, 1484, 1425, and 1486 claims are used to train the severity, occurrence, and impact models, respectively. Figure 6 plots the data and illustrates its distribution.
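The two exclusion steps described above can be illustrated with a minimal sketch. It is not the partner company's tooling or the AutoML platform's API. The column names (`description`, `root_cause`, `severity`, `occurrence`), the helper names, and the toy records are hypothetical; only the 50-row minimum per class is taken from the text.

```python
from collections import Counter

MIN_ROWS_PER_CLASS = 50  # minimum readings per class stated in the text


def is_filled(value):
    """Return True if a field holds a non-empty value."""
    return value is not None and str(value).strip() != ""


def drop_incomplete(rows, required_fields):
    """Keep only the rows in which every required field is filled."""
    return [
        row for row in rows
        if all(is_filled(row.get(field)) for field in required_fields)
    ]


def drop_rare_classes(rows, target, min_rows=MIN_ROWS_PER_CLASS):
    """Return a new list without the classes of `target` that have fewer than `min_rows` rows."""
    counts = Counter(row[target] for row in rows)
    return [row for row in rows if counts[row[target]] >= min_rows]


def build_training_sets(rows, required_fields, targets):
    """Build one filtered copy of the dataset per target column."""
    complete = drop_incomplete(rows, required_fields)
    return {target: drop_rare_classes(complete, target) for target in targets}


# Toy records (hypothetical): 60 complete claims with severity 3,
# 5 complete claims with severity 9, and 2 claims without a description.
rows = (
    [{"description": "loose screw", "root_cause": "torque",
      "severity": 3, "occurrence": 2} for _ in range(60)]
    + [{"description": "scratch", "root_cause": "handling",
        "severity": 9, "occurrence": 2} for _ in range(5)]
    + [{"description": "", "root_cause": "unknown",
        "severity": 3, "occurrence": 2} for _ in range(2)]
)

training_sets = build_training_sets(
    rows,
    required_fields=["description", "root_cause"],
    targets=["severity", "occurrence"],
)
print({target: len(subset) for target, subset in training_sets.items()})
```

In this toy example the severity copy keeps 60 rows, because the class with 5 rows is dropped, while the occurrence copy keeps all 65 complete rows. Filtering each target separately is why the per-model row counts can differ even though they start from the same cleaned dataset. The same helper could be applied to a root-cause process column.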


**Table 2.** Summary of dataset included in the modeling.

In addition to RPN evaluation, the research work includes the classification of claims according to the manufacturing process identified as the root cause of the defect. The names of the processes are masked in Table 2b; a process could be any of the known manufacturing processes, such as cutting, bending, welding, or assembly. Processes with fewer than 50 records are excluded as well.

**Figure 6.** Dataset plot of all claims based on RPN value.
