1. Introduction
Manufacturing is highly competitive [1], and its operations are very complex [2]. Quality is essential for companies aiming to improve customer satisfaction and competitiveness [3]. Maintaining quality and improving operations are essential to remain competitive [4]. One process required for the maintenance and improvement of operations is Root Cause Analysis (RCA) [5], whose goal is to find the true origin (or root cause) of a problem. It is an essential process, as it enables manufacturers to learn the underlying issue and improve the manufacturing process [6,7]. Finding the root causes and solving the problems at the source stops the problem from persisting, which makes locating and eliminating the source of the problem critical [8,9]. RCA highlights the complexity of manufacturing operations, as it is not trivial to distinguish the root causes from the symptoms, and it requires extensive system and execution analysis [10,11].
The above-mentioned characteristics of the RCA problem have led several studies to propose solutions that improve the efficiency and efficacy of RCA. Recently, some studies have exploited the increase in the volume of data generated in manufacturing environments [2,12,13,14] to develop solutions called Automatic Root Cause Analysis (ARCA), which automate at least part of the RCA process. However, developing these solutions is not trivial, as satisfactory performance requires that they fit the characteristics of the data. ARCA solutions aim at improving decisions through the analysis of data, as conceptualized under the framework of Industry 3.5, an intermediate stage before achieving Industry 4.0 [15,16,17]. Some studies, such as [18,19], propose a combination of traditional RCA methods and automated solutions. More recently, efforts have been made to develop automated solutions for root cause analysis that incorporate notions of causality instead of relying only on correlations. References [20,21,22] are examples of such works.
As described in [
23], root causes can be understood at three different levels, depending on the types of data available: (i) the root cause’s location, (ii) the physical characteristics of the root cause, and (iii) the human/organizational aspects of the root cause. The first level portrays the location of the root cause in the manufacturing process. For example, [
24,
25] are studies that use this type of data. The second level focuses on what physical attributes are the root cause (e.g., a sudden increase in voltages or high temperature). Examples of studies using this type of data are [
26,
27,
28]. The third and final level centers on the human and organizational aspects of the root cause (e.g., equipment maintenance), so as to identify what triggered the physical problem. References [
29,
30] are examples of studies exploring this final level of root causes. In this paper, we use the first level of data and try to find the location of the root cause. We choose this type of data as it is the most available one, and solutions based on it can be advantageous to more factories.
Analyzing location-type data is still relevant, despite being just the first level of data. To detect the physical aspects of root causes, factories require proper infrastructure, which some factories do not have. However, data on how the product flows through the manufacturing process are usually readily available, which enables the determination of the root causes’ locations. In situations where the manufacturing process is particularly complicated (e.g., in semiconductor manufacturing [31]), even production flow data can become difficult to analyze, which justifies the use of Data Mining (DM) and Machine Learning (ML) techniques in the development of solutions that aid in diagnosis.
One issue in locating root causes is overlap, as presented in [32]. A manufacturing process can be seen as a progression of steps that products go through, starting as raw materials or parts and ending as a finished product. Overlap is a phenomenon that happens when, in two separate manufacturing steps, all products that go through a certain machine in one of the steps are processed in another given machine in a later step. With the data generated by this phenomenon, it is very hard to distinguish the influence each machine has on the quality of the final product, especially if the data are analyzed through classification algorithms.
Overlap may happen due to stabilization in the manufacturing process. For example, consider a process where products always flow between the same two machines in contiguous steps: because the processing times of these machines are synchronized, whenever the machine in the earlier step finishes, it is always the same machine in the later step that becomes available. This example does not mean that overlap can only occur between machines that operate in contiguous steps. Attaining this sort of balance in the manufacturing process can be positive in terms of productivity and efficiency, but it becomes an issue during the analysis of the data generated from such a process. This tension between what is positive from a production perspective (which should be prioritised) and what is advantageous for diagnosis based on data analysis is relevant, as it means that overlap will occur regularly when performing diagnosis on data of the location type.
It is important to mathematically measure overlap in order to correctly gauge its effect on the development of ARCA solutions. However, that measure needs to consider two aspects: (i) the direction of the association between factors (manufacturing process variables), and (ii) the effect of overlap between individual machines (further explanations can be found in Section 3). These aspects are not considered in [32].
The goal of this paper is to develop an ARCA solution that is able to locate the root cause of a problem in a manufacturing process and that is robust to overlap. To do so, we propose an overlap measure that considers the aspects mentioned in the above paragraph. The measure is based on information theory, in particular on a variant of mutual information named Positive Mutual Information (PMI), proposed in [33]. Building on this new measure, we introduce a new RCA method that identifies the most probable root causes and is robust to overlap. A visualization tool is also proposed, making the task of identifying overlap and the root causes easier for practitioners. We validate the proposed approach on simulated data and on real data from a case study in semiconductor manufacturing, which is a highly competitive sector [34].
The rest of the paper is structured as follows. Section 2 presents an overview of previous works on ARCA in manufacturing that are relevant to this study. Section 3 defines the problem of overlap. Section 4 describes the proposed measure of overlap and the proposed RCA method based on this measure. Section 5.1 and Section 5.2 respectively explain the experimental procedures used to validate the proposed methods and show their results. Section 6 discusses and summarises these results and outlines future research directions. Finally, Section 7 summarises our findings and presents the main conclusions.
2. Previous Works
A manufacturing process consists of the processing of products and materials through several steps (detailed in
Section 3). This paper’s goal is to locate root causes in a manufacturing process, that is, to determine which machines were the root causes of a problem in the process. In [
35], the authors addressed this issue, calling it the “root cause machineset identification problem”. This study uses a three-phase method to locate the root cause: (i) it processes the dataset of a moment with a problem; (ii) it generates various candidate machinesets (groups of machines); and (iii) it applies association rule mining to the dataset. The rules obtained are analyzed based on a novel measure of interest that considers both confidence and the continuity between the defective products for a candidate machineset. The method extracts the location of the root cause from the antecedent of the association rule, with the consequent side representing whether the product is normal or defective.
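As an illustration of the rule-based view described above, the following minimal sketch computes the confidence of rules of the form "candidate machineset → defective". It covers only the confidence component; the continuity component of the interest measure in [35] is not reproduced, and the data, the function name `rule_confidence`, and the candidate machinesets are hypothetical.

```python
# Hypothetical production log: the (step, machine) tuples visited by each
# product, plus a flag marking the product as defective.
products = [
    ({("A", "M1"), ("B", "M3")}, True),
    ({("A", "M1"), ("B", "M4")}, True),
    ({("A", "M2"), ("B", "M3")}, False),
    ({("A", "M2"), ("B", "M4")}, False),
    ({("A", "M1"), ("B", "M3")}, False),
]


def rule_confidence(products, machineset):
    """Confidence of the rule `machineset -> defective`.

    This is the share of defective products among the products that
    passed through every (step, machine) tuple in the machineset.
    """
    covered = [defective for tuples, defective in products if machineset <= tuples]
    if not covered:
        return 0.0  # the machineset never occurs, so the rule has no support
    return sum(covered) / len(covered)


# Hypothetical candidate machinesets (antecedents of the rules).
candidates = [
    frozenset({("A", "M1")}),
    frozenset({("B", "M3")}),
    frozenset({("A", "M1"), ("B", "M3")}),
]

# Rank the candidates by decreasing confidence.
ranking = sorted(candidates, key=lambda c: rule_confidence(products, c), reverse=True)
for candidate in ranking:
    print(sorted(candidate), round(rule_confidence(products, candidate), 3))
```

In this toy log, the machineset containing only Step A, Machine M1 obtains the highest confidence, since two of the three products it processed are defective.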
Another relevant work on the same topic is [
24]. This study proposes a solution that is able to identify quality drifts and the root causes continuously and automatically. A two-phase algorithm is proposed, which first clusters common defects and then determines the root causes. The Squeezer algorithm is used for clustering. To locate the root causes in the second phase, each group of machines is examined to verify if it can be identified as a root cause or not, based on the sequence of defects. Ref. [
25] proposes a visualization technique based on the Herfindahl–Hirschman Index (HHI) to identify patterns in the concentration of faults in the machines. This technique helps practitioners find root causes in a quick and transparent way.
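The concentration idea behind the HHI can be sketched as follows. The index is the sum of the squared shares, so it approaches 1 when faults concentrate in a single machine and 1/n when they are spread evenly over n machines. How [25] applies the index and builds its visualization is not reproduced here; the fault counts and the function name `hhi` are hypothetical.

```python
def hhi(fault_counts):
    """Herfindahl-Hirschman Index of a fault distribution.

    `fault_counts` maps each machine to its number of faulty products.
    The index is the sum of the squared fault shares of the machines.
    """
    total = sum(fault_counts.values())
    if total == 0:
        return 0.0  # no faults observed, so there is nothing to concentrate
    return sum((count / total) ** 2 for count in fault_counts.values())


# Hypothetical fault counts for the machines of a single step.
concentrated = {"M1": 9, "M2": 1, "M3": 0}  # faults mostly in M1
spread = {"M1": 4, "M2": 4, "M3": 4}        # faults evenly distributed

print(round(hhi(concentrated), 3))  # high concentration
print(round(hhi(spread), 3))        # low concentration (1/3 for three machines)
```

A step whose index is close to 1 is a natural candidate for closer inspection, as its faults are dominated by a single machine.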
There are also some studies in the literature that mention that a correlation between factors when trying to identify root causes can become an issue, and strategies are proposed to tackle it. Ref. [
36] uses clustering to group highly-correlated factors, and selects an archetypal factor from them to use during modeling and analysis. Ref. [
37] presents a two-step technique to identify parameters with faulty values in semiconductor manufacturing. In the first step, Principal Component Analysis is used for feature engineering, making the distinction between normal products and faulty ones clearer by increasing the separation between both classes. In the second step, classifiers are used to identify the factors leading to the appearance of faults. Ref. [
38] mentions the complexity in determining whether a specific factor is the root cause when multiple faults are present and explains that the compound effect of multiple faults on factors can be very distinct from the effect of the individual faults. The authors couple data analysis with cause-and-effect information in order to address this issue. In [
39], the combination of faults and its resulting information are analysed using Bayesian networks. Partial Least Squares with Variable Importance in Projection (PLS-VIP) is used to select the most relevant factors, which ensures that the rules obtained contain only the necessary information.
Most papers that develop solutions based on location data use classifiers, analyzing the knowledge structure (e.g., decision trees, rules) to determine the root causes. We argue that solutions based on classifiers can be impaired in terms of performance by the presence of overlap. Although the method proposed by [
24] does not use classifiers, it does use product queues that can nevertheless be affected by overlap. Ref. [
25], despite also proposing a method that does not use classifiers, relies on visualization and concentration measures, which are not able to identify overlap or its detrimental effect. Ref. [
40] has a broad scope and aims at identifying pitfalls of applying data science to manufacturing problems. It alludes to some pitfalls in determining important factors but does not mention overlap. Ref. [
41] also focuses on feature selection in the context of RCA, but again does not present any mention of overlap. The issue of overlap was first identified in [
32], and the authors proposed measuring the overlap using the strength of association between factors. However, this measure does not consider how the association is directed. This method and measure were further extended in [
20] by using the concept of causality. A literature review about the topic of ARCA in general can be consulted in [
23].
Given the above-mentioned background, this paper contributes to the literature by presenting a novel measure of overlap rooted in information theory, which considers the aspect of the direction of the association. This measure is resilient to overlap, a phenomenon that can be detrimental to the performance of solutions focused on the use of classification algorithms for analyzing location data with the aim of identifying root causes. This paper also proposes an ARCA solution based on the novel measure that is robust to overlap.
3. Problem Definition
Overlap is a phenomenon specific to the problem of locating the root causes of problems in a manufacturing process. As products go through a sequence of manufacturing steps in a manufacturing process, in each of those steps, they are processed in a certain machine. A step–machine combination means that a product was processed in a certain machine in a certain step. This combination is also called a tuple.
Figure 1 depicts such a manufacturing process, where a product goes through several steps and is processed in a machine in each step. The squares illustrate products as they flow through the manufacturing process. In this depiction, Product 1 is in line to be processed before Step B, Product 2 is still being assembled/transformed in Machine M_3, and Product 3 was already processed and is currently being monitored. The problem’s root cause is located in Step A, Machine M_1.
At the end of the process, the product has its quality monitored, and it is defined as normal or problematic. If the number of products with problems increases sharply in a short period, it indicates that the process has a problem needing to be tackled, which requires RCA to determine its origin.
The manufacturing process described above generates data similar to those shown in Table 1. Each row corresponds to a product, each column corresponds to a step, and each cell indicates the machine where that row’s product was processed for that column’s step. The “Problem” column indicates whether the finished product had a problem or not. We have chosen to use four instances in this table, as this provides enough variety of examples without so much complexity that it hinders the comprehension of the conceptual example.
The objective when trying to locate a root cause in this type of data is to determine the step–machine tuple that represents the location of the origin of the problem. However, this can become extremely hard if overlap is present.
Overlap can be understood as a synchronization within the manufacturing process that makes all products that pass through a given machine in a certain step also pass through another specific machine in a step further ahead in the process. In addition, all the products that pass through that machine in the later step have passed through the same machine in the earlier step. This synchronization makes it extremely hard to differentiate how each of these machines influences the quality of the product. Note that the entire trace of step–machine tuples has to be analyzed, as the overlap may also occur between tuples of steps that are not contiguous.
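The two-sided condition in this definition can be checked directly on location data: two tuples overlap when exactly the same set of products passed through both. The sketch below is a minimal illustration under that reading, not the measure proposed in this paper. The rows follow the layout of Table 1, but the values, the step order, and the function names are hypothetical.

```python
# Hypothetical location data: one dict per product, mapping step -> machine.
rows = [
    {"A": "M1", "B": "M3", "C": "M5"},
    {"A": "M1", "B": "M3", "C": "M6"},
    {"A": "M2", "B": "M4", "C": "M5"},
    {"A": "M2", "B": "M4", "C": "M6"},
]
step_order = ["A", "B", "C"]  # assumed order of the steps in the process


def products_per_tuple(rows):
    """Map each (step, machine) tuple to the set of product ids it processed."""
    index = {}
    for product_id, row in enumerate(rows):
        for step, machine in row.items():
            index.setdefault((step, machine), set()).add(product_id)
    return index


def overlapping_pairs(rows, step_order):
    """Return (earlier tuple, later tuple) pairs processing the same products.

    All pairs of steps are compared, not only contiguous ones, because
    overlap may also occur between tuples of non-contiguous steps.
    """
    index = products_per_tuple(rows)
    position = {step: i for i, step in enumerate(step_order)}
    tuples = sorted(index, key=lambda t: (position[t[0]], t[1]))
    pairs = []
    for i, early in enumerate(tuples):
        for late in tuples[i + 1:]:
            in_later_step = position[early[0]] < position[late[0]]
            if in_later_step and index[early] == index[late]:
                pairs.append((early, late))
    return pairs


print(overlapping_pairs(rows, step_order))
```

With these rows, the tuples of Step A overlap with those of Step B, while Step C mixes products from both branches and overlaps with nothing. Exact set equality is a strict criterion; with noisy data, a graded measure is needed instead.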
Figure 1 and
Table 1 depict an example of overlap: the problem’s origin is located in Step A—Machine M_1; however, all products that go through the root cause tuple also go through Step B—Machine M_3. In such a scenario, it is not possible to distinguish which of the tuples was the origin of the problem. Overlap is particularly problematic when trying to use classifiers to automatically extract root causes, as previous solutions in the literature have proposed. This arises due to the knowledge structures (e.g., decision trees, rules) being generated through the selection of the most representative factors, regularly discarding factors highly correlated with the representative ones, due to their supposed redundancy. The criterion used to identify redundant factors can lead to hiding factors that are the true root cause, giving more relevance to the representative factors. For example, when generating a Decision Tree (DT), using the information-gain criterion for splitting promotes the use of factors with more levels, although the root cause may be a factor with a smaller number of levels. In the example above, a DT based on information gain would select Step B–Machine M_3 as a root cause, discarding the true root cause (Step A–Machine M_1) and providing a wrong diagnosis.
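The difficulty that overlap poses for classifiers can be illustrated by computing the information gain of each step by hand. In the minimal sketch below, the columns of Steps A and B are perfectly overlapped, so both obtain exactly the same gain and the data alone give a tree no basis for preferring the true root cause; which one is selected depends on tie-breaking or on other properties of the factors. The data are hypothetical and do not reproduce Table 1, and the helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical data: Steps A and B overlap perfectly, Step C does not.
columns = {
    "A": ["M1", "M1", "M2", "M2"],
    "B": ["M3", "M3", "M4", "M4"],
    "C": ["M5", "M6", "M5", "M6"],
}
problem = [True, True, False, False]  # the assumed root cause is Step A, Machine M1

gains = {step: information_gain(values, problem) for step, values in columns.items()}
print(gains)
```

Steps A and B both reach the maximum gain, whereas Step C carries no information about the label, so a tree built on these data is equally likely to report the overlapped tuple as the root cause.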
Overlap is significant when we only have location data available but the data are high-dimensional (both in number of products and number of steps/machines). In this context, traditional approaches (e.g., Ishikawa diagrams, Failure Mode and Effects Analysis) are incapable of efficiently dealing with the high dimensionality of the data, and as such, ARCA solutions are necessary for efficiently obtaining the root causes’ locations. There are also works that try to expand these traditional solutions in order to improve them, such as [42]. Regarding ARCA solutions, determining the presence of overlap aids analysts by alerting them to an issue that affects the analysis through DM and ML algorithms, which prevents them from reaching wrong conclusions and enables the use of more resilient solutions to locate the true root cause.
Given the description of the problem, it is necessary to consider two aspects when measuring overlap. First, one needs to take into consideration whether an association between factors is positive or negative. This means that we should only consider associations generated by a product that goes through a certain machine in a step and always goes through a certain machine in another step (positive association), and not when a product goes through a certain machine in a step and it never goes through a certain machine in another step (negative association). Overlap is problematic only in the case of positive association, as it is in that scenario that it becomes impossible to distinguish the tuples. In the case of negative association, the tuple that the product goes through can be immediately determined as a root cause (in the situation where the product is problematic and only those tuples are considered as possible root causes). The second aspect is that the measure needs to focus on comparing tuples (step-machine pairs) and not simply steps.
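Both aspects can be illustrated with standard pointwise mutual information computed between individual step–machine tuples. Its sign separates positive associations (the tuples co-occur more often than expected by chance) from negative ones (they co-occur less often, or never). This is only a sketch of the underlying idea and is not the Positive Mutual Information measure of [33] or the measure proposed in this paper; the data and the function name `pointwise_mi` are hypothetical.

```python
import math

# Hypothetical location data: one dict per product, mapping step -> machine.
rows = [
    {"A": "M1", "B": "M3", "C": "M5"},
    {"A": "M1", "B": "M3", "C": "M6"},
    {"A": "M2", "B": "M4", "C": "M5"},
    {"A": "M2", "B": "M4", "C": "M6"},
]


def pointwise_mi(rows, tuple_x, tuple_y):
    """Pointwise mutual information (in bits) between two (step, machine) tuples.

    Positive values indicate a positive association, negative values a
    negative one. Tuples that never co-occur yield minus infinity.
    """
    n = len(rows)
    has_x = [row[tuple_x[0]] == tuple_x[1] for row in rows]
    has_y = [row[tuple_y[0]] == tuple_y[1] for row in rows]
    p_x = sum(has_x) / n
    p_y = sum(has_y) / n
    p_xy = sum(x and y for x, y in zip(has_x, has_y)) / n
    if p_xy == 0:
        return float("-inf")  # the two tuples are never visited together
    return math.log2(p_xy / (p_x * p_y))


print(pointwise_mi(rows, ("A", "M1"), ("B", "M3")))  # positive association
print(pointwise_mi(rows, ("A", "M1"), ("B", "M4")))  # negative association
print(pointwise_mi(rows, ("A", "M1"), ("C", "M5")))  # no association
```

Because the comparison is made between tuples and not between whole steps, the sketch distinguishes the pair (Step A, Machine M1) and (Step B, Machine M3), which always co-occur, from the pair (Step A, Machine M1) and (Step B, Machine M4), which never do, even though both pairs involve the same two steps.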
6. Discussion
In this section, we discuss and summarise the results of the different experiments and how these combine to form coherent conclusions.
In the Mockup experiments, the proposed PMI method was able to achieve a performance equal to the best solutions based on the previous overlap measure (CO) and a much better performance than the DT classifier. Through a comparative analysis of Figure 4 and Figure 5, we can see that an increment in the amount of overlap leads to degradation in the performance of the classifier. This is evidence in favor of the argument presented in Section 2.
In the experiments with the Stochastic simulation data, PMI achieved better performance than the other algorithms, as it put the true root causes in the same or higher rankings than the CO method. When applying the different algorithms to the real case study, the proposed method was found to be able to identify most root causes in common with other methods.
An additional note on the experiments with the Stochastic simulation data is that, to enable the detection of tuples that overlapped with the Label, there was a need to lower the threshold (of Expression 5) to 0.8 for the first two datasets and 0.3 for the last dataset. As these datasets have greater levels of noise, this seems to indicate that the threshold needs to be adapted to the datasets at hand and that the noisier the dataset is, the lower the threshold needs to be to detect tuples overlapping with the label. The default threshold value of 0.9 was chosen so as to include instances with very high overlap but with some leeway for noise. The fact that we needed to lower the threshold in response to an increase in noise indicates that there is a reduction in the signal-to-noise ratio in terms of evidence of the presence of the root cause in the data. While overlap may be less evident due to the presence of noise, so is the root cause signal.
Taken together, the results of the experiments with known root causes indicate that the proposed PMI algorithm outperforms the classifier solution and performs at least as well as the CO algorithm.
A relevant shortcoming of this work is that it was not possible to establish the root causes in the case study data, which hinders the validation on real data. This work would be improved if it were possible to access real data with the root causes identified. Another avenue for future work could be analyzing the effect of noise on the performance of the factor-ranking algorithms, as it is not clear what is causing the differences in the performance of the algorithms (although it is clear that they are resilient to overlap). Finally, it would be interesting to see if it is possible to find a method that is able to overcome the statistical impossibility of distinguishing between overlapped factors.
7. Conclusions
This work presents a new measure of overlap. Overlap is a synchronization in the manufacturing process that makes all products that pass through a given machine in a certain step also pass through another specific machine in a step further ahead in the process. With data generated in the presence of this phenomenon, it becomes impossible to discern the influence of the machines that processed all these products on the products’ quality.
We propose a novel measure of overlap that uses information theory concepts such as Positive Mutual Information (PMI). This measure considers two critical aspects, namely whether the associations are positive or negative, and whether overlap can be detected among step–machine tuples and not simply among steps. This measure is the basis of a factor-ranking algorithm that is used to detect root causes in an ARCA solution.
To validate this new method, three experiments were conducted: (i) using mockup data, (ii) using simulated data that emulate a case study, and (iii) using real data from the case study itself. It was possible to conclude that the proposed algorithm achieved better performance with simulated data, while remaining competitive with the benchmark algorithms in the other two experiments.
This paper contributes to the literature by presenting a robust measure of overlap, which allows for a better understanding and analysis of this characteristic of the problem. In addition, a new factor-ranking algorithm is presented with positive results. A visualization tool was also developed that eases the analysis by practitioners, by depicting the relevant overlaps between tuples in a manipulable graph.
This work allows researchers and practitioners to have an improved comprehension of a new concept, which can lead to the development of improved ARCA solutions, making the management of manufacturing operations faster and reducing the associated workload.