1. Introduction
Two centuries have passed since the industrial revolution transformed manufacturing processes. Now in the era of industry digitalization and increased use of data, causal domain knowledge is, more than ever before, crucial. For example, in automated data analytics processes, causal domain knowledge inclusion helps to avoid biased results [
1] and increases data analysis algorithms’ robustness (e.g., increases machine learning algorithms’ robustness [
2]).
Risk assessment and root cause analysis are two practices performed in the industry that capture and document causal domain knowledge. Almost a century has passed since the Failure Mode Effect Analysis (FMEA) process was first proposed as a tool for risk assessment [
3]. In the FMEA process, a multidisciplinary team often uses the brainstorming method to identify lists of the failure modes and their possible root causes and potential effects. In addition to other information such as corresponding authors’ information and the scope of the FMEA process, these lists are saved in a document (the FMEA document) typically using a standard tabular format. This tabular format’s columns contain the failure mode, potential effects, and possible root causes, in addition to other columns such as risk priority number, detection, etc. The FMEA process successfulness as a risk assessment tool depends on (i) the complete and accurate identification of potential failure modes contained in a system and (ii) the rigorous evaluation of the risk level of these failure modes [
4].
Now, the FMEA process is widely used in many industries. For example, the FMEA process is used in the design and validation of a hydraulic torque converter [
5]. Additionally, the diagnosis of the bearing condition using the FMEA process is presented in [
6]. Additionally, in [
7], Bayesian networks are built from the FMEA process used by Markov decision processes to model uncertainties within unmanned aerial vehicles. Thus, in the context of extracting causal domain knowledge in the industry, FMEA documents represent a valuable data source.
As highlighted by previous research, the FMEA process is error-prone. As a result, FMEA documents suffer from data consistency issues. For example, Bluvband et al., in [
8] argued that due to the insufficient comprehensibility of the brainstorming session, cases of missing information might occur (i.e., by omitting failure modes). Consequently, many approaches are proposed to differently aggregate sources of information to improve the brainstorming sessions’ comprehensibility. Thus, solutions based on ontologies are presented in [
9,
10,
11] to satisfy the requirement to share, reuse, and maintain FMEA knowledge. Such solutions are labor-intensive and time-consuming. As such, text mining algorithms are proposed to extract a list of frequent failure modes and build the standard failure mode vocabulary (e.g., the method proposed in [
12]).
Additionally, FMEA process performance is limited in capturing rich information about implicit causal relations expressed between its columns. In order to enrich these causal relations with more information, a more sophisticated methodology, namely Failure Modes, Mechanisms, and Effects Analysis (FMMEA), is proposed in [
13].
Since our research aims to extract causal knowledge from FMEA documents, which is, by design, limited in capturing rich information about the causal relations, we are faced with consistency impairments in these documents. Moreover, these consistency impairments, to our best knowledge, were not previously addressed. Especially when the intention is to extract causal domain knowledge, these consistency impairments limit the effectiveness of the extracted knowledge for downstream tasks. Therefore, in this paper, we argue that consistency impairments concerning the implicit causal relation between the FMEA documents’ columns (i.e., root cause and failure mode, and failure mode and effect) are also significantly important.
Ideally, in a given FMEA cell, a short, descriptive text represents a single concept. In this case, a concept is a separable (identifiable) phenomenon that acts either as (i) a root cause, (ii) an effect of the failure mode observed in the product characteristics, or (iii) a failure mode as an intermediate state in the causal chain. However, based on the analysis of actual FMEA documents from a semiconductor manufacturing company, the majority of data consistency impairments are attributed to one of two main categories: (a) FMEA documents’ cell consistency; (b) FMEA relations consistency (i.e., consistency impairments concerning the implicit causal relation between the FMEA documents’ columns).
Figure 1 depicts our observations concerning consistencies found in actual FMEA documents. In the category of FMEA documents’ cell consistency impairments, we noticed cases of merged cells wherein the short text of a single cell comprises multiple concepts or even relations between multiple concepts, e.g., a causal chain of multiple causes and effects. In the category of FMEA relations consistency, we noticed the following:
Cases of missing information in the causal chain, where the documented relation actually describes a subsequent effect of the cause but skips its direct effect;
Cases of conflict in the direction of the relations, i.e., the direction of the causal effect relation is reversed.
These consistency impairments are mainly attributed to the manual creation of FMEA documents by human domain experts, the broad definition of FMEAs’ columns, and the complex nature of causal relations (e.g., many-to-many relations with transitive perception [
14]).
Hence, any disagreement among experts on the exact semantics of these columns will increase consistency impairments within these documents. Formally, these documents will still conform to the structure (i.e., tabular); their content no longer aligns with the intended semantics. Consequently, in the era of industry digitalization, these consistency impairments seriously limit the effectiveness of including causal domain knowledge documented by the FMEA process.
In this paper, we first propose methods to improve the consistency of FMEA documents by defining a classification scheme to provide an improved understanding of the consistency impairments with respect to the causal relations expressed in the FMEA documents. The classification scheme is derived from domain experts’ perception of the short text in the FMEA cell. The classification scheme can be applied in the form of metadata annotations to the tabular data to assess the data consistency for a given FMEA. Still, the dataset size limits the feasibility of a manual annotation process, i.e., in practice, it is not possible to manually label all available datasets in the production environment. As such, we leverage advances in artificial intelligence and natural language processing for an automated or at least semiautomated classification method. Consequently, the reversed direction of causal relations, which indicates swapped cells in the FMEA, is effectively addressed via this classification method combined with experts’ logic. Moreover, we are able to distinguish different types of causal relations based on the classes of the concept that it connects. We hypothesize that these classes and these different types of relationships would be extremely beneficial for downstream tasks, such as information retrieval and knowledge discovery, effectively addressing the challenges related to missing information issues. Next, this paper proposes a pattern-based method for merged cell identification in FMEA documents. This method leverages predefined patterns for causal cues to identify merged FMEA cells. Additionally, a patterns extraction method calculates the Mutual Information based on the co-occurrence of the terms on FMEA cells to identify merged cells is also presented.
In summary, our contributions are as follows:
We highlight the importance of including causal domain knowledge to improve the data analytics methods robustness;
We propose to use FMEA documents as a source of causal domain knowledge;
We highlight the main challenges with respect to manually written FMEA documents and their consistency impairments for extracting causal domain knowledge;
We propose a framework to address these challenges first based on explicitly defining these consistency impairments types;
Based on manually labeled examples from real-world FMEA documents, we derive methods for the classification and identification of consistency impairments;
The improved FMEA documents can then be used for many downstream tasks.
2. Materials and Methods
Based on the analysis of actual FMEA documents from a semiconductor manufacturing company, most data consistency impairments are attributed to one of two main categories: (a) FMEA documents’ cell consistency or (b) FMEA relations consistency, where our proposed methods address both types of inconsistencies.
Our primary approach relies on causal data science for checking the consistency of FMEA documents. Causal data science is concerned with the underlying data generation process. Thus, causal data science aims to adjust for spurious correlations present in the data. The spurious correlations could be a result but not limited to confounding bias or collider bias. Biases are translated in the FMEA documents to cases of merged cells. One example for confounder bias could be: “layer thickness A” affect “Functional parameter A and Functional parameter B”. Collider bias may look like: “layer thickness A is out of spec” due to “process A deviation or process B deviation”. In the cases of merged cells due to biases (i.e., confounding and conditioning on colliders), they result from how the FMEA documents are crafted, i.e., first identify the failure modes, then place the potential effects and the root causes. Thus, the failure mode could be a collider for the root causes or a confounder for the effects.
The FMEA documents do not seem to support the cases of interaction between multiple concepts, based on our study of FMEA documents. Thus, cases of merged cells that describe the interaction between concepts are typical to occur. An example could be: “
layer A stress in combination with layer B thickness limit violation” causes “
functional parameter C to deviate”. The “
layer A stress” alone is insufficient to cause “
the functional parameter C to deviate”. Also “the functional parameter C deviation” is not caused by “
layer B thickness limit violation” alone. In such cases, the merged cell might contain additional information about the relationships of the individual concepts. However, in some cases, not all the individual concepts are stated. To handle such cases, we follow Vanderweele and Robins steps as proposed in [
15]. Vanderweele and Robins studied the interaction between concepts introducing sufficient causal directed acyclic graphs. The sufficient cause is added to the original causal directed acyclic graphs as an artificial node to describe the interaction between the causes. In addition, the individual concepts are also represented in the graph and connected to the artificial node. The interaction of such concepts happens only when these concepts fulfill the role of causes in causal relations. Thus, in the FMEA documents, such merged cells of cases are only occurring as failure mode or root cause.
We also observed cases of merged cells that contain an entire causal chain of multiple causes and effects (e.g., “layer thickness A violation causes layer thickness B violation”). We assume that this is mainly attributed to the manual creation of FMEA documents by human domain experts, in combination with a broad definition of FMEA’s columns. For example, domain experts may try to document causal chains, which stretch across multiple cause and effect pairs.
To summarize, in the category of FMEA documents’ cell consistency impairments, we noticed cases of merged cells, for which we further categorized into four categories, based on our observation of real-world FMEA documents. These four categories of merged cell types are depicted in
Figure 2 and can be described as:
- (a).
Merged cells containing a causal chain;
- (b).
Merged cells as a result of confounding bias;
- (c).
Merged cells as a result of bias caused by conditioning on a collider;
- (d).
Merged cells caused by describing the interaction between concepts.
As a result of merged cells, the usability of the FMEA document for further processing is seriously limited. In many cases, these limitations can be attributed to the absence of standard terminology for the possible concepts. This is especially the case in many complex and rapidly developing industries, such as semiconductor manufacturing. Here, domain experts typically describe the concepts using short text, which comprises domain-specific language and many abbreviations.
In comparison to other proposed approaches concerned with FME cells’ consistency, most approaches are only concerned with the consistency of failure modes to achieve a standardized description. Our research is concerned with the effect, root causes, and failure modes. However, before addressing the standardization of all FMEA cells, we noticed cases of merged cells, which we address in part of this paper. Similar to [
12], we opted for a data-driven approach that requires limited labeling effort.
In the case of identifying merged cells, multiple indicators could be devised. For example, in the case of individual cells containing entire causal chains, causal relation extraction from texts could be of value. In [
16], the authors survey causal relation extraction from text and describe the most relevant approaches. They distinguish between explicit and implicit causal relations. For the case of FMEA documents, the causal relations are most likely to occur explicitly, i.e., the causal relation is articulated explicitly in the short text of the FMEA cell. Thus, we propose to leverage causal patterns identified as in [
17] to identify merged cells containing causal cues.
The first indicator (i.e., causal cues) and the proposed methods to detect mainly assumes consistent use of cues through the data (which is typically not the case) and does not cover cases (b), (c), (d) of merged cells. Thus, it yields high precision, but limited recall. As a response, we build upon the work presented in [
18], leveraging the Mutual Information (MI) to detect merged cells. The proposed approach to extract the intra-sentential patterns, which are meaningful combinations of terms (e.g., “voltage threshold”, “leakage current “, etc.), from FMEA cells. To construct patterns, It calculates MI between the terms based on their occurrence in the text of FMEA cells. The proposed approach is based on six steps, explained as follow:
Step 1 FMEA cells’ short text n_grams: In this step, the algorithm splits the FMEA short text into a list of terms based on spaces. The algorithm collects all the possible n_grams (i.e., sequence of terms of a length n) that occurred in the FMEA cells’ short text.
Step 2 n_grams occurrence extraction: This step counts the FMEA cells where an n-gram occurred in its short texts.
Step 3 n_grams Mutual Information calculation: This step calculates the collected n-grams Mutual Information based on the occurrence of n_gram terms in FMEA cells (i.e., based on the occurrences in FMEA cells). Hence, an n_gram with higher Mutual Information is more likely to represent a pattern. Whereas high Mutual Information indicates a large reduction in uncertainty, low Mutual Information indicates a small reduction, and zero Mutual Information between a set of terms means the terms are independent. The Mutual Information is calculated based on the occurrence of the terms in the FMEA cell using the equation below:
Step 4 Patterns extraction (n_gram filtering): The collected n_grams are filtered based on two criteria (their number of occurrences in FMEA cells and their Mutual Information). Thus, this approach is performed using two hyperparameters: the Mutual Information threshold (MIth) and the occurrences threshold (OCth). These two hyperparameters increase the approach precision in detecting merged cells for a range of values where MIth > 0 and OCth > 1. However, it decreases the recall. The filtered n_grams are considered as patterns.
Step 5 Top Patterns and subpatterns extraction (pattern clustering): A Top Pattern is a pattern (n_gram) that is not contained by any other pattern. In this step, all the Top Patterns are extracted from the filtered n_grams resulting from Step 4. Furthermore, all the patterns from the same filtered n_grams contained in a Top Pattern are collected and considered subpatterns. The results of this step are a number of clusters represented by the Top Patterns and the member of this cluster, which are the subpatterns.
Step 6 Merged FMEA cell detection: A rule-based approach leverages the patterns extracted from Step 4 and Step 5 to predict the merged FMEA cell. This step matches all the patterns from Step 4 contained in the FMEA cell then checks if there is a Top Pattern (cluster) that contains all the matched patterns. If this criterion is not met, the algorithm predicts the cell is merged.
The proposed approach to detect merged cells based on the intra-sentential pattern is highly dependent on the dataset where the Mutual Information is calculated. Thus, the MIth and OCth need to be adapted.
As for the causal relation documented during the FMEA process, it depends on the scope of the FMEA and the internal agreement between the multidisciplinary team creating the FMEA. This agreement typically is not documented. Thus, for FMEA documents conducted on the same scope, significant differences in the semantics of the concepts described in the columns are found. This is worsened by the broad definition of the FMEA columns. For example, while Bluvband et al. argue in [
8] that the definition of failure is too narrow and that this might lead to the omission of failure modes, we believe that the definitions of the root causes and effects are too broad. This claim is also supported by observing common practices in datasets for knowledge graphs that include causal relations such as Atomic [
19] and Glucose [
20]. In such datasets, the authors define different types of causal relations and attempt to describe the semantics of causal relations.
Consequently, the descriptions of the effects and the root causes are less consistent. Thus, because the transitive perception of causal relation might lead to the description of a distant cause or effect skipping the direct one. For example, given the causal chain [A]-[causes]->[B]-[causes]->[C] (A, B, C are three concepts in the FMEA documents), due to the transitive perception of causal relation [A]-[causes]->[C] is also valid and might be documented in the FMEA documents. However, the information about the mediation of the causal effect between A and C through B is missing, especially if the original causal chain [A]-[causes]->[B]-[causes]->[C] is not documented in FMEA documents. This missing information is critical, especially in industries with a long manufacturing process, where the manufacturing typically extends for hundreds of processing steps.
Additionally, in some cases, the direction of the causal relation is reversed. For example, [C]-[causes]->[B] is documented instead of [B]-[causes]->[C]. This impairment might be attributed to human error or ambiguity of the temporal aspect while creating the FMEA document.
In general, this might not pose a large problem for domain experts to cope with this missing information and automatically correct for the reversed relation due to their profound knowledge of the domain. However, in the case of automated data analytics, missing information and contradictions in the documented causal relation direction severely affects the usefulness of the data analytics method, especially in the considered causal model.
As a response, we propose to define a classification scheme to provide an improved understanding of the consistency impairments with respect to the causal relations expressed in the FMEA documents. As such, it is critical to firstly identify the concept classes that logically should be represented by the data (i.e., matching content interpretation by domain experts). In many cases, the concept classes are domain-specific and typically cannot be inferred from the data alone; they need to be established with the help of domain knowledge. To this end, we propose a set of rules to govern the identification of the classes:
Concepts classes need to be completely separable;
Concepts classes need to allow for the assessment of the causal relations consistency between concepts;
Concepts classes need to be aligned with domain experts’ perception of the cells’ short text content.
For our case study on FMEA documents, the identification of individual concept classes is rooted in actual production lines and based upon the knowledge of experienced domain experts. They were interviewed in order to establish our concept classification scheme, which forms the base for the FMEA documents consistency improvement methods. While our work is focused on causal relations found in FMEA documents of the semiconductor manufacturing industry, other domains also require domain-specific classification schemes, also taking expert knowledge and its formalization into consideration.
First, on a high level, in the semiconductor manufacturing company under study, FMEA documents can be split into multiple types depending on the respective scope of the FMEA. One can distinguish between Process FMEA documents (P_FMEA) and Unit Process FMEA documents (UP_FMEA). An individual P_FMEA contains multiple causal chains, each of which might be associated with numerous processes used in a product production line. UP_FMEAs include causal chains only belonging to single processes, which in turn could be related to multiple products’ production lines.
The expected concepts described in the P_FMEA documents are different from those expected to be described in UP_FMEA documents. In P_FMEA documents, one can expect concepts physical characteristics’ deviation of the products’ structures, which causes concepts describing functional characteristics’ deviation of the product. Many reasons could cause these physical characteristics’ deviation. However, due to the high degree of manufacturing automation and the highly controlled environment (i.e., the typical manufacturing conditions in the semiconductor manufacturing industry), the most relevant causes stem from the so-called unit process deviation.
In the UP_FMEA documents, it is expected to find concepts describing the so-called unit process key parameters deviation that causes the unit process deviation (which is also described in P_FMEA documents). Additionally, in the UP_FMEA documents, the root causes of the unit process key parameters deviation follow the 5M (Man, Machine, Material, Method, Measurement) risk-management model similar to the Ishikawa causal diagram. To connect the two types of FMEA documents, experts aim to link unit process cause in the P_FMEA document to the respective unit process effect of UP_FMEA. The linking concept (i.e., the unit process deviation) is abbreviated as UP-C/E. Domain experts further explained that the product functional characteristics’ deviations are caused by deviations in physical characteristics of the product structures, while the reversed direction is not valid.
To summarize, leveraging the information acquired from the discussions and multiple rounds of interviews with domain experts, we are finally able to distinguish five distinctive concepts classes: Functional, Physical, UP-C/E, Parametric, and 5M. The concept classes and their causal chains are depicted in
Figure 3. To readers who are familiar with measurement data typically found in a modern semiconductor manufacturing FAB, these measurements types are also in line with the proposed concept classes. Whereas in [
21], Qin et al. summarized the measurement data types available in semiconductor manufacturing as follows:
Real-time trace data at the tool level reflects equipment health conditions;
Integrated metrology or inline metrology provides geometric dimensions;
The sample and final electrical test provide data about the products’ electrical properties.
Thus, domain experts could map defined concept classes to the corresponding measurement data types. Namely, concepts belonging to the Functional, Physical, and Parametric concept classes could be mapped to the final electrical test, integrated metrology, real-time trace data, respectively. This alignment intends to enhance the reuse of FMEA knowledge concerning root cause analysis. However, the benefits of this alignment will be addressed in future research.
Consequently, the defined concept classes can be applied in the form of metadata annotations to the existing FMEA documents. This metadata allows for causal relation consistency assessments of a given FMEA document. Thereby, for an annotated causal relation found in the data, one can check for consistency via compliance with the consistency rules. Examples of such rules are:
- (i)
A causal relation from a concept that belongs to the Functional class to a concept that belongs to the Physical class is considered to have a consistency status of reversed;
- (ii)
A causal relation from a concept that belongs to the UP-CE class to a concept that belongs to the Functional class is considered to have a consistency status Missing Information due to the missing concepts belonging to the physical class.
Here, causal relations are found in the FMEA document that does not adhere to the consistency schema; one can distinguish two cases:
Case 1 Missing information: In this case, one or more intermediate concepts are missing from the causal chain. In an automated setting, this missing information would need to be completed using other FMEA documents or other data sources. These concepts might have other (root) causes, which also need to be considered and completed. In general, this case is not trivial and mostly needs to be dealt with on a case-by-case basis.
Case 2 Reversed direction: In this case, the direction of the causal relation is reversed according to the consistency schema. Such cases are attributed to the manual creation of the FMEA documents. In such cases, the data might be complete while inconsistencies concerning the classification scheme and defined causal chain can be observed. In this case, the problem can be automatically detected and automatically rectified.
In an actual scenario, the number of data is typically too large to allow for a complete manual classification of all the available documents according to the classification scheme, i.e., to conduct a large-scale manual annotation process by (highly paid) experts. (This also could be the case in transferring the FMEA to FMMEA). Yet from the perspective of dataset sizes required to train contemporary machine learning models, it is an open question if the available data fulfills these requirements, primarily since FMEA documents we consider comprise: (i) just short text snippets that (ii) do not follow grammatical rules (i.e., no full sentences), and (iii) represent domain-specific language (i.e., terms not found in generic dictionaries).
Historically, the task of building classifiers for textual input data has been approached following a rule-based approach [
22,
23]. Such methods are highly dependent on the experts’ knowledge, which is consecutively modeled as a set of guiding rules. As a welcome consequence, these systems do not require ground truth data for training. Thus, these systems evolve naturally by extending and optimizing the rules set. While these methods give good results for well-constrained and well-understood environments, the drawbacks of such methods concern the scalability and the difficulty of identifying a complete and consistent rule set (e.g., no loops or contradicting rules). As an alternative, machine learning methods are devised to learn directly from the data. The initial challenge associated with text classification is transforming the text into a representation suitable for machine learning models, a process comprising the tasks of feature extraction and feature engineering. Rule-based feature extraction faces similar challenges as rule-based classification methods. At the same time, general text classification and representation learning benefit highly from deep learning approaches [
24,
25]. Thus, for this research, we opted for an end-to-end deep learning classification model. Our text classification model consists of a preprocessing pipeline, an embedding layer, a recurrent network layer, and a fully connected layer as an output layer. Namely, we test four recurrent neural networks layers: LSTM [
26], GRU [
27], and their bidirectional variation [
28]. The classifier is trained on a multilabel, multiclass classification using weighted binary cross-entropy loss to account for the classes’ imbalance in the training dataset. The loss function is calculated using the equation:
where
and
are positive and negative class weights, respectively, for each concept class.
Our classification model is achieved as follows: First, the FMEA dataset is split into three datasets: training, testing, and development. The training and development datasets are used in the training phase. During the training phase, preprocessing is learned on the training dataset. The preprocessing pipeline is responsible for text data cleaning and transformation into a numerical representation using techniques of abbreviation substitution, text cleaning, and text tokenization. The resulting text representation is sparse. Next, the embedding layer and the recurrent neural network layer try to learn the mapping function of the sparse text representation to the target labels based on the loss function. Here, to control the training of the model, the model performance on the development dataset is used for an early stopping approach. Finally, the testing dataset is used to evaluate and report the model performance. In our case, we noticed that the model performance is affected by the dataset splits. Thus, we opted for k-folds cross-validation to adjust for biases induced by dataset splits. Hence, the average model performance is reported.