1. Introduction
In discrete manufacturing systems, root cause tracing of defective products plays a pivotal role in quality control: it pinpoints the underlying cause of a defect and uncovers the fault propagation path within the production process. Such insights are essential for optimizing processes and enhancing product quality [1,2]. Take the liquid storage tank (LST) assembly line, a representative complex discrete manufacturing system, as an example. It encompasses processes such as welding, sensor assembly, and filter mesh installation, which are closely interlinked. A failure in a single process can cascade through the entire line [3], producing defective products with complex causes and driving other processes away from their normal levels. These effects make root cause tracing of defective products considerably harder [4,5].
In recent years, root cause tracing for defective products has attracted increasing attention in industry, and many effective methods have emerged in engineering practice. These methods fall mainly into model-based [6,7], logic inference-based [8,9], and artificial intelligence-based [10,11] approaches. Logic inference-based methods identify the underlying cause by applying logical rules and reasoning. For example, Yan Feng Li et al. [12] proposed a fault tree-based root cause tracing method and verified its effectiveness experimentally on the hydraulic system platform of a CNC machining center. However, logic inference-based methods often rely heavily on expert experience and struggle to capture the dynamic characteristics of the system in real time. Artificial intelligence methods have also gained attention in recent years due to their powerful data processing and pattern recognition capabilities. Qiuping Ma et al. [13] proposed a KNN clustering and MLP-driven root cause identification method for product quality inspection, aimed at automatically predicting the root cause of various quality issues. However, the black-box problem severely limits the interpretability of artificial intelligence methods and hence their further application in actual industrial production. In contrast, model-based methods grounded in the white-box concept can construct dynamic models of the production system, clearly revealing each process’s operating logic, the interconnections among processes, and the mechanisms linking these processes to product quality. Ruan Sui [14] proposed a dynamic multi-fault diagnosis (DMFD) framework for complex systems. It can infer the most likely set of faults in real time, reveal the fault propagation path, and accurately identify the root cause of defective products. Shakeri et al. [15] developed a modeling method for a two-level coordinated solution framework, in which dynamic programming techniques were used to solve the original DMFD problem. On this basis, Anuradha et al. [16] conducted validation experiments on an automotive power generation and storage system, demonstrating the effectiveness of the two-level coordinated framework. Model-based methods have also been applied successfully in diverse fields, including electronics [17], mechatronic systems [18], mechanical systems [19,20], and chemical engineering [21]. However, for specific scenarios such as the LST assembly line, existing model-based methods exhibit certain limitations. The production processes of LST assembly lines are complex and interconnected, while the test results of LST products are often imperfect, for example because of missing data or inadequate testing precision. Few modeling approaches address these issues effectively, which makes it difficult to meet the practical needs of precise analysis, fault diagnosis, and quality control in the production process.
Hidden Markov models (HMMs) are a natural choice for modeling DMFD problems. In discrete manufacturing systems, an HMM can use observed states to represent quality-related product inspection results and hidden states to describe the actual quality-related states. Ying et al. [22] were the first to use HMMs to formalize dynamic fault diagnosis problems. In addition, Q. Suxiang et al. [23] introduced the theory of HMMs into the field of power transformer fault diagnosis. Qiu et al. [24] integrated multi-feature fusion technology with Gaussian mixture hidden Markov models to conduct fault diagnosis on a multi-axis engraving machine platform. However, the HMM typically assumes that at most one component of the system is faulty at a time, which restricts its capacity to model multiple faults comprehensively [18]. As an extension of the HMM, the factorial hidden Markov model (FHMM) treats the system as a set of multiple independent Markov chains, which allows it to handle multiple related factors simultaneously. For instance, Satnam Singh et al. [19] proposed a fault diagnosis method based on the FHMM, which provides important theoretical support for the modeling and analysis of dynamic multi-faults. Inspired by these methods, this paper proposes a dynamic multi-fault diagnosis modeling method based on the FHMM and applies it to trace the root causes of defective products on the LST assembly line. First, the problem of tracing the root causes of defective products is modeled mathematically, and an FHMM is constructed within the dynamic multi-fault diagnosis framework. Then, the model parameters are iteratively optimized with the expectation-maximization (EM) algorithm. Finally, given the hidden state transition matrix and the diagnostic matrix, the Viterbi algorithm is used to obtain the optimal root cause tracing path for the defective LSTs.
The contributions of this paper are as follows:
A DMFD-based framework is proposed to locate the root cause of defective products in the LST assembly line. An FHMM is established by utilizing key factors such as production, inspection processes, and inspection results to describe the changes in product quality. This transformation turns the problem of root cause analysis into a solvable DMFD problem.
The impact of imperfect testing on the root cause tracing of defective products is taken into account, and a model that is closely aligned with the actual scenario is constructed. Through formula derivation, the missing detection results are incorporated into the model. Moreover, experiments are designed to quantify the influence of incorrect results on the accuracy of root cause tracing. Consequently, the reliability of root cause tracing for defective products in practical production is enhanced.
Experimental verification has been carried out on a real LST assembly production line. The experimental results show that the proposed method can achieve a 100% accuracy rate for root cause tracing of three typical quality issues, namely welding misalignment, missing installation of the valve body, and sensor offset.
The structure of this paper is as follows. Related work is introduced in Section 2. Section 3 provides a system description and mathematical modeling of the LST assembly line. Section 4 presents the dynamic inference algorithm based on the FHMM. Section 5 presents the computational experiments that evaluate the performance of the inference algorithm. Section 6 discusses the application scenarios of the proposed method. Section 7 concludes the paper.
2. Related Work
Multi-fault diagnosis methods can be broadly classified into two categories: data-driven methods and model-driven methods. Data-driven methods use statistical analysis and machine learning to detect fault patterns from data without explicit system modeling [25,26]. For example, the Transformer model has made significant progress in mechanical equipment fault diagnosis [27], benefiting from its self-attention mechanism and parallel computing capabilities. Muhammad Samiullah et al. [28] proposed a Decision Tree algorithm for motor fault classification, which efficiently handles large-scale datasets and offers a certain degree of interpretability. However, data-driven methods depend heavily on labeled samples, which are often scarce in real-world scenarios. Many of them also exhibit “black-box” characteristics, which make their diagnostic results less interpretable.
Model-driven methods rely on mathematical models of the system and on physical laws [29]. For example, Nan C et al. [30] proposed a fault diagnosis model based on prior knowledge to address abnormal operating conditions in complex environments, demonstrating good interpretability through systematic modeling. Christoph Wehner et al. [31] introduced an interactive intelligent RCA tool that significantly reduces the learning time of causal Bayesian networks and decreases the number of false causal relationships, thereby improving the efficiency of fault cause analysis in electric vehicle manufacturing. Yiming Xu et al. [32] proposed a model-based fault diagnosis method for the battery management system (BMS) of lithium-ion batteries (LIBs). However, in practical applications, faults typically arise from the interactions of multiple factors, which makes them difficult to model precisely. The aforementioned methods fail to capture these complex dependencies effectively. The FHMM models multiple components through factored hidden states and can capture the complex dependencies within the system, providing a feasible solution.
3. System Description and Mathematical Modeling
The LST shown in Figure 1 was produced by a company. It is typically made of transparent material and mainly consists of the tank cover, upper and lower tank bodies, liquid level alarm, float, and filter screen. The function of each component is described in Table 1. The LST is produced through the cooperation of three workshops: the injection molding workshop, the small parts workshop, and the assembly workshop. The injection molding workshop manufactures the main components of the LST, including the upper and lower parts of the tank body. The small parts workshop produces the accessories for the liquid storage tank. The semi-finished products from these two workshops are transferred to the assembly workshop, where the upper and lower parts of the LST are welded into a sealed tank body and the accessories are assembled with the tank body to complete the final assembly. This paper focuses only on the production process of the assembly workshop and assumes that the semi-finished products provided by the injection molding workshop and the small parts workshop are of qualified quality.
The schematic diagram of the LST assembly line in the assembly shop is shown in Figure 2. The automotive braking LST undergoes a series of processes from raw materials to finished products, and the key production processes include corresponding detection steps. Because quality issues cause high scrap costs, the production quality indices of both semi-finished and finished products are tested at the early, middle, and late stages of the production process. Figure 3 shows the core production, installation, and testing steps of the assembly plant, including upper and lower body welding, check valve installation, air-tightness test, mechanical performance test, check valve plus shell air-tightness test, sensor installation, and sensor pull-out force test.
First, the upper and lower bodies of the LST with the float are welded using a servo welding machine. Then, a check valve is installed on the welded tank. Next, the LST undergoes an air-tightness test, whose result is recorded in the air-tightness test table. The mechanical performance test is recorded in the mechanical performance test table, and the check valve test is recorded in the check valve shell air-tightness test table. After these three tests, the sensor is installed, followed by the sensor pull-out test, which checks that the installed sensor withstands the required pull-out force. Once the pull-out test is completed, the product is finished, and marking and packaging begin.
In this article, the task of root cause tracing of defective LST products is defined as a DMFD problem.
As shown in Figure 4, the problem can be represented as an FHMM, which is discussed in [19,26]. Here, the FHMM state is factored into multiple state variables and represented in a distributed manner. Specifically, the state transitions between components are stable. Formally, a DMFD problem can be defined as
$$\mathrm{DMFD} = \{S, \kappa, T, O, D, P, A\},$$
where $S = \{s_1, \dots, s_m\}$ is a finite set of $m$ components associated with the system; $\kappa = \{0, 1, \dots, k, \dots, K\}$ is the set of discretized observation epochs; $T = \{t_1, \dots, t_n\}$ is a finite set of $n$ available binary tests, with passed tests $T_p$ and failed tests $T_f$; $O$ is a finite set of test outcomes up to and including epoch $K$; $D = [d_{ij}]$ is the D-matrix; $P = \{Pd_{ij}, Pf_{ij}\}$ is a set of probabilities of detection and false alarm; and $A = \{Pa_i(k), Pv_i(k)\}$ denotes the set of fault appearance probabilities $Pa_i(k)$ and fault disappearance probabilities $Pv_i(k)$.
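The state of the $i$-th production component at the $k$-th epoch is $x_i(k)$, assuming that the initial state $x(0)$ is known (or its distribution is known). Here, for each $i \in \{1, \dots, m\}$, the value of $x_i(k)$ is determined by
$$x_i(k) = \begin{cases} 1, & \text{if the fault of component } i \text{ is present at epoch } k, \\ 0, & \text{otherwise.} \end{cases}$$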
In practical situations, DMFD tasks are divided into perfect and imperfect cases. In the perfect case, every test result is available, i.e., $T_f(k) \cup T_p(k) = T$, where $T_f(k)$ is the set of failed tests at epoch $k$ and $T_p(k)$ is the set of passed tests at epoch $k$. In the imperfect case, due to human- or equipment-related factors, the detection results may not be completely recorded, i.e., $T_f(k) \cup T_p(k) \subset T$. The observation sequence can then be defined by
$$O = \{o(1), \dots, o(K)\}, \qquad o(k) = \big(o_1(k), \dots, o_n(k)\big),$$
where $o_j(k)$ represents the outcome of the $j$-th test at time $k$. When $o_j(k) = 2$, the test result is missing.
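The observation encoding described above can be illustrated with a small sketch. Only the code 2 for a missing result comes from the text; the codes 0 for a passed test and 1 for a failed test, the record layout, and the helper name `encode_epoch` are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Optional

# MISSING = 2 follows the text; PASS = 0 and FAIL = 1 are assumed codes.
PASS, FAIL, MISSING = 0, 1, 2


def encode_epoch(record: dict[str, Optional[bool]], tests: list[str]) -> list[int]:
    """Encode one epoch of raw inspection records (hypothetical layout).

    A test that is absent from the record, or stored as None, is treated
    as not recorded and mapped to MISSING.
    """
    row = []
    for name in tests:
        passed = record.get(name)
        if passed is None:
            row.append(MISSING)
        else:
            row.append(PASS if passed else FAIL)
    return row


tests = ["T1", "T2", "T3", "T4"]
raw_records = [
    {"T1": True, "T2": True, "T3": True, "T4": True},
    {"T1": True, "T2": False, "T3": None},  # T3 unrecorded, T4 absent
]
observations = [encode_epoch(r, tests) for r in raw_records]
print(observations)
```

Keeping the missing code explicit, rather than dropping incomplete epochs, lets a later likelihood computation skip only the unrecorded tests.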
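The likelihood function $P(O \mid X, x(0))$, which describes the probability of the observed test results $O$ given the fault states $X$ and the initial state $x(0)$, is calculated under the assumption of conditional independence as
$$P(O \mid X, x(0)) = \prod_{k=1}^{K} P\big(o(k) \mid x(k)\big),$$
where $x(k) = \big(x_1(k), \dots, x_m(k)\big)$ and $P(o(k) \mid x(k))$ is the probability of the test results $o(k)$ given the fault state $x(k)$ at time $k$.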
Then, we define the matrix $D = [d_{ij}]$ as the diagnostic matrix, which represents the dependencies between the fault-related production processes $s_i$ and the detection processes $t_j$. This matrix captures the causality between the failed component (or root cause failure) of the system and the corresponding test. We introduce the collection $P = \{Pd_{ij}, Pf_{ij}\}$, which includes the fault detection and false alarm probabilities. Specifically, the fault detection probability $Pd_{ij}$ is the probability that the $j$-th test detects the failure of the $i$-th component, and the false alarm probability $Pf_{ij}$ is the probability that the $j$-th test falsely indicates a failure when the $i$-th component is functioning. The state of each fault is modeled as a non-homogeneous Markov chain. For each fault state, we define $A = \{Pa_i(k), Pv_i(k)\}$, where $Pa_i(k)$ is the probability that the fault occurs at time $k$, given that it was not present at time $k-1$, and $Pv_i(k)$ is the probability that the fault disappears at time $k$, given that it was present at time $k-1$.
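To make the roles of the diagnostic matrix and the detection and false alarm probabilities concrete, the following minimal sketch evaluates the log-likelihood of one epoch of test outcomes given a fault state. It rests on several assumptions that are not taken from the paper: the D-matrix entries are read off the textual description of Figure 5, a single detection probability `PD` and false alarm probability `PF` with invented values are shared by all fault-test pairs, a test passes only if it passes for every component it depends on, and outcomes are coded 0 (pass), 1 (fail), 2 (missing).

```python
import math

# Rows: S1, S2, S3. Columns: T1, T2, T3, T4.
# Entries follow the description of Figure 5 (assumed, not copied from Table 4).
D = [
    [0, 1, 1, 1],  # S1: welding, seen by T2, T3, T4
    [0, 1, 1, 1],  # S2: check valve, seen by T2, T3, T4
    [1, 0, 0, 0],  # S3: sensor, seen by T1 only
]
PD = 0.95  # assumed detection probability, shared by all (fault, test) pairs
PF = 0.02  # assumed false alarm probability, shared by all (fault, test) pairs


def pass_probability(j, x):
    """Probability that test j passes given the binary fault vector x."""
    p = 1.0
    for i, present in enumerate(x):
        if D[i][j]:
            # A present fault is missed with probability 1 - PD;
            # an absent fault raises no false alarm with probability 1 - PF.
            p *= (1.0 - PD) if present else (1.0 - PF)
    return p


def log_likelihood(obs, x):
    """Log-probability of one epoch of outcomes; missing tests are skipped."""
    total = 0.0
    for j, outcome in enumerate(obs):
        if outcome == 2:  # missing result contributes no evidence
            continue
        p_pass = pass_probability(j, x)
        total += math.log(p_pass if outcome == 0 else 1.0 - p_pass)
    return total


obs = [1, 0, 0, 0]  # only the pull test T1 fails
print(log_likelihood(obs, [0, 0, 1]))  # sensor fault hypothesis
print(log_likelihood(obs, [0, 0, 0]))  # no-fault hypothesis
```

Under these assumed values the sensor-fault hypothesis scores higher than the no-fault hypothesis when only T1 fails, which is the behavior the D-matrix is meant to encode.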
4. Inference Algorithm for Fault Localization and Diagnosis
The fault diagnosis task in this paper is defined as a maximum a posteriori (MAP) estimation problem: finding the most probable evolution of the fault states over the observation epochs.
The solution of
can be used to explain the sequence of the observed test results:
where
K is the total number of epochs; when
, the problem is simplified to a static fusion problem. Using the Bayes formula, the objective function is equivalent to
In the case of a given fault state and the Markov property of the fault state evolution, the passed and failed test results are conditionally independent, so the objective function is equivalent to
where
and
represent the set of passed and failed tests at time
k, respectively. A new function
is defined as follows:
Given the failure state
, the test results are independent. Therefore,
Assuming that the test results
pass, it should pass all of its associated failure statuses; therefore,
where
Similarly, since the fault is independent of this assumption,
where
Therefore, the objective function of Formula (1) is equivalent to
where
The goal of the EM algorithm is to estimate the model parameters and maximize the log-likelihood function of the observed sequence
.
The log-likelihood function involves
Because a log-likelihood that contains hidden variables is difficult to optimize directly, the E-step computes its expectation under the posterior distribution of the hidden states given the current parameters,
So, the joint probability of sum expands to
where
So,
is changed to
The M-step updates
to maximize
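A thresholding rule is used to stop the iteration: the increment of the log-likelihood between the current iteration and the previous one is monitored, and the algorithm terminates when
$$\left| \mathcal{L}^{(t)} - \mathcal{L}^{(t-1)} \right| < \varepsilon,$$
where $\mathcal{L}^{(t)}$ represents the log-likelihood value in the $t$-th iteration, and $\varepsilon$ is a small positive number.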
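The stopping rule can be sketched on a deliberately simplified problem. The example below is not the FHMM parameter update of this paper: it assumes independent epochs, a single test with known detection probability `PD` and false alarm probability `PF` (invented values), and estimates only the prior fault probability `theta` by EM. It is meant to show how the log-likelihood increment is monitored against a small threshold `eps`.

```python
import math

PD, PF = 0.9, 0.05  # assumed known detection / false alarm probabilities


def em_step(theta, outcomes):
    """One EM iteration for theta = P(fault present).

    Returns the updated theta and the log-likelihood evaluated at the
    input theta (computed as a by-product of the E-step).
    """
    responsibilities = []
    log_lik = 0.0
    for o in outcomes:  # o = 1 means the test failed, 0 means it passed
        like_fault = PD if o == 1 else 1.0 - PD
        like_ok = PF if o == 1 else 1.0 - PF
        evidence = theta * like_fault + (1.0 - theta) * like_ok
        responsibilities.append(theta * like_fault / evidence)  # E-step
        log_lik += math.log(evidence)
    return sum(responsibilities) / len(responsibilities), log_lik  # M-step


def run_em(outcomes, theta=0.5, eps=1e-6, max_iter=500):
    """Iterate until the log-likelihood increment falls below eps."""
    prev_ll = -math.inf
    for t in range(1, max_iter + 1):
        theta, ll = em_step(theta, outcomes)
        if abs(ll - prev_ll) < eps:
            return theta, ll, t
        prev_ll = ll
    return theta, prev_ll, max_iter


outcomes = [1] * 3 + [0] * 17  # 3 failed tests out of 20 epochs
theta_hat, final_ll, n_iter = run_em(outcomes)
print(theta_hat, n_iter)
```

Because EM never decreases the log-likelihood, the increment shrinks toward zero, so a small `eps` trades a few extra iterations for a tighter estimate.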
For fault sequences, the inference formula can be expressed as
where
Next, we use the Viterbi algorithm to find the optimal , where each path corresponds to a state sequence.
Initialization step: Assume that the initial state is known for all fault states. Let the maximum value of the function at time K be denoted as , and the maximum value of at this time be represented by .
When
,
where
and
, for
.
Recursive step: This step involves maximizing the target function at each epoch
K.
where
Termination step: This step computes the objective function for time
.
Optimal state sequence backtracking: The backtracking step computes the optimal state sequence through the backtracking path. The optimal state
of the
i-th fault at time
k is derived from the following formula:
Assumption 1. Within the system, at most one fault occurs at any given instant.
Assumption 2. When a component malfunctions, the entire system is regarded as being faulty.
Assumption 3. The faulty state persists until it is repaired manually.
Remark 1. Assumption 1, which limits the system to a single fault at a time, reduces the complexity of the problem. Building on Assumption 1, Assumption 2 implies that no further faults occur while the system is in the faulty state. By Assumption 3, this state persists until staff clear it manually, after which the LST assembly line resumes normal operation.
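The initialization, recursion, termination, and backtracking steps of this section can be sketched for a single binary fault chain. This is a minimal illustration under stated assumptions rather than the paper's full multi-chain inference: the appearance probability `pa`, disappearance probability `pv`, detection probability `pd`, and false alarm probability `pf` are constant over time with invented values, each epoch lists only the tests tied to this fault, tests are conditionally independent given the state, and outcomes are coded 0 (pass), 1 (fail), 2 (missing).

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf


def viterbi_fault_chain(obs, pa, pv, pd, pf, x0=0):
    """Most probable 0/1 state sequence of one fault over all epochs."""
    # log_trans[r][s]: log-probability of moving from state r to state s.
    log_trans = [[_log(1.0 - pa), _log(pa)], [_log(pv), _log(1.0 - pv)]]

    def log_emit(state, row):
        p_fail = pd if state == 1 else pf
        total = 0.0
        for o in row:
            if o == 2:  # missing outcome carries no evidence
                continue
            total += _log(p_fail if o == 1 else 1.0 - p_fail)
        return total

    # Initialization: known initial state x0, first observation at epoch 1.
    delta = [log_trans[x0][s] + log_emit(s, obs[0]) for s in (0, 1)]
    back = []
    # Recursion: keep the best predecessor for each state at each epoch.
    for row in obs[1:]:
        new_delta, ptr = [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda r: delta[r] + log_trans[r][s])
            ptr.append(best)
            new_delta.append(delta[best] + log_trans[best][s] + log_emit(s, row))
        back.append(ptr)
        delta = new_delta
    # Termination: pick the best final state, then backtrack.
    state = max((0, 1), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


obs = [[0, 0], [0, 2], [1, 1], [1, 2], [1, 1]]  # two tests per epoch
path = viterbi_fault_chain(obs, pa=0.05, pv=0.01, pd=0.9, pf=0.05)
print(path)
```

A small `pv` reflects Assumption 3, under which a fault rarely clears without manual repair. Working in log space avoids numerical underflow on long sequences.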
5. Experiment
Based on data from the LST assembly line of a rubber and plastic enterprise, multi-coupling faults in the production process were analyzed. The LST is an essential liquid storage component in automotive braking systems, and its production is complex and involves many key processes. Due to equipment aging, process errors, or improper operations, various types of faults may occur during production, and strong coupling exists among these faults, posing significant challenges for fault diagnosis.
Figure 5 shows the production steps of the LST assembly line and their inspection results under the DMFD framework. The process begins with three main component-related steps: S1 for upper and lower body welding, S2 for check valve installation, and S3 for sensor installation. S1 is associated with the air-tightness test (T3) and with the mechanical and valve tests (T2/T4). S2 is likewise related to the air-tightness test (T3) and to the mechanical and valve tests (T2/T4). S3 is related only to the pull test (T1). The figure thus summarizes how the production steps map to the test outcomes of the LST assembly line within the DMFD framework.
As shown in Table 2, the LST assembly line primarily involves the following key processes, each listed under its typical fault:
Welding misalignment (S1): This process is used to weld and secure the upper and lower parts of the LST. It is a fundamental step in the production process, but defects during welding may lead to tank leakage or breakage during pressure testing or actual use.
Missing installation of the valve body (S2): The check valve ensures the unidirectional flow of liquid within the tank. Deviations in its installation location, insecure installation, or inherent defects in the valve itself may prevent the liquid from flowing in one direction or cause leakage, resulting in failure during the production process.
Sensor offset (S3): The LST sensor monitors the operational state of the tank. If the sensor is improperly installed or experiences signal transmission issues, the monitoring data may become inaccurate, and it may fail the pull-out test, leading to suboptimal performance of the tank.
Table 2. List of failures.
Fault | Fault Number |
---|---|
Welding misalignment | S1 |
Missing installation of the valve body | S2 |
Sensor offset | S3 |
As shown in Table 3, the assembly line also includes several critical testing procedures to evaluate the quality and reliability of key processes:
Drawing Test (T1): This test is designed to assess the stability of the sensor by applying a drawing force. If the sensor is improperly installed, excessive displacement may occur, affecting the tank’s stability and its performance.
Performance Testing (T2): This test evaluates the mechanical properties of the LST, particularly the strength of the welded structure and the integrity of the check valve installation. Defects in either may cause failure during this test.
Air-Tight Test (T3): This procedure checks the overall sealing performance of the LST by applying pressurization to ensure that the tank does not leak under high or negative pressure conditions. Defects such as holes in welds, cracks, or voids in the check valve may result in test failure.
Check valve Air-Tightness Test with Shell (T4): This test is focused on verifying the air-tightness of the check valve and its shell. It ensures the valve’s unidirectional flow function and sealing performance after installation. If the valve is poorly installed or has manufacturing defects, it may lead to substandard results during this test.
Table 3. Test list.
Test | Test Number |
---|---|
Pull results | T1 |
Mechanical performance test results | T2 |
Air-tightness test results | T3 |
Check valve plus shell air-tightness test | T4 |
These tests and processes provide valuable insights into the production quality of LSTs, enabling identification and diagnosis of faults during production.
Experimental Procedure:
- (1) Data Pre-processing: Data pre-processing is performed based on prior knowledge by collecting data from various processes and tests of the LST assembly line. The result states of the processes are categorized. The related data are further divided into a training set and a test set, which will be used for subsequent model training and testing.
- (2) Model Training: The FHMM is constructed, and the data from the pre-processed training set are fed into the model. The probability distribution learned by the hidden state chain after model training is used to analyze the coupling relationships between the process and the tests on the test set data.
- (3) Result Analysis: The correct isolation rate (CI) and false isolation rate (FI) are calculated. Additionally, the detection probability/false alarm probability matrix is described.
The following formulas are used to compute the rates:
5.1. Analysis of the Results
Table 4 gives the dependency matrix of the model, which records which tests depend on which production processes. The upper and lower body welding, check valve installation, and sensor installation are denoted as S1, S2, and S3, respectively. The drawing test, performance test, air-tightness test, and check valve shell air-tightness test are denoted as T1, T2, T3, and T4, respectively.
As shown in Table 5, the detection probability is the probability that a test (T1, T2, T3, T4) correctly identifies a state (S1, S2, S3) when it actually occurs. The false alarm probability is the probability that a test incorrectly indicates a state (S1, S2, S3) when that state has not occurred.
Table 6 presents performance metrics for the model across different fault states, including the correct isolation rate and the false isolation rate, each with 95% confidence intervals. It compares FHMM with Decision Tree, Fully Convolutional Neural Network (FCNN), and Support Vector Machine (SVM) under the same evaluation metrics. The results show that FHMM achieves a correct isolation rate of 1.0 for all fault states and a false isolation rate of 0, indicating superior diagnostic capability.
The correct isolation rates of Decision Tree for S1, S2, and S3, reported as 95% confidence intervals, are (0.6829, 0.9024), (0.8293, 1), and (0.8165, 0.9756), with higher false isolation rates in S1 and S2, namely (0.0726, 0.3058) and (0.0304, 0.1918). These results are inferior to those of FHMM, likely because Decision Trees do not account for feature interdependencies. Similarly, FCNN shows lower correct isolation rates, such as (0.7805, 0.9756) in S1, possibly due to insufficient training data or inadequate feature extraction. SVM demonstrates similar limitations, especially in S1 and S3, where the correct isolation rates are (0.6829, 0.9152) and (0.8049, 1), with poor false isolation rates in S1 and S2, indicating limited generalization ability in a high-dimensional feature space.
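As an illustration of how interval-valued isolation rates of this kind might be obtained, the sketch below computes a correct isolation rate and a false isolation rate per fault and a percentile bootstrap interval for the former. The definitions used here are assumptions, not the formulas of this paper: CI is taken as the fraction of samples with a given true fault that are assigned to it, FI as the fraction of samples without that fault that are wrongly assigned to it, and the interval method (percentile bootstrap with 2000 resamples) is likewise assumed. The labels are invented toy data.

```python
import random


def isolation_rates(truth, pred, fault):
    """Return (CI, FI) for one fault under the assumed definitions."""
    n_true = sum(1 for t in truth if t == fault)
    n_other = len(truth) - n_true
    correct = sum(1 for t, p in zip(truth, pred) if t == fault and p == fault)
    false = sum(1 for t, p in zip(truth, pred) if t != fault and p == fault)
    ci = correct / n_true if n_true else float("nan")
    fi = false / n_other if n_other else float("nan")
    return ci, fi


def bootstrap_ci_interval(truth, pred, fault, n_boot=2000, seed=0):
    """95% percentile bootstrap interval for the correct isolation rate."""
    rng = random.Random(seed)
    n = len(truth)
    samples = []
    while len(samples) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [truth[i] for i in idx]
        if fault not in t:  # rate undefined for this resample, draw again
            continue
        ci, _ = isolation_rates(t, [pred[i] for i in idx], fault)
        samples.append(ci)
    samples.sort()
    return samples[int(0.025 * n_boot)], samples[int(0.975 * n_boot) - 1]


# Toy labels (invented): two S1 samples are misdiagnosed as S2.
truth = ["S1"] * 10 + ["S2"] * 10 + ["S3"] * 10
pred = ["S2"] * 2 + ["S1"] * 8 + ["S2"] * 10 + ["S3"] * 10

ci_s1, _ = isolation_rates(truth, pred, "S1")
_, fi_s2 = isolation_rates(truth, pred, "S2")
low, high = bootstrap_ci_interval(truth, pred, "S1")
print(ci_s1, fi_s2, (low, high))
```

With small test sets the bootstrap interval is wide, which is worth keeping in mind when comparing methods whose intervals overlap.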
The computational complexity of FHMM is , compared to for Decision Trees, for FCNN, and for SVM. When comparing these complexities based on the highest-order terms, FHMM’s complexity is dominated by , which is higher than the of Decision Trees but significantly lower than the of SVM. Therefore, FHMM strikes a balance between computational cost and model complexity.
These comparisons emphasize FHMM’s robustness and reliability in fault diagnosis, outperforming traditional machine learning models in accuracy and fault isolation precision.
5.2. Sensitivity Experiments
The sensitivity experiments were conducted to evaluate the impact of different initialization strategies on system performance. Specifically, variations in the initialization parameters, including the transition matrix and initial hidden state distribution, were explored to assess their influence on diagnostic results. To ensure a fair comparison, all other experimental settings were kept consistent with previous experiments. The parameters were initialized using two different distributions: the uniform distribution and the Dirichlet distribution.
As shown in Table 7, the results indicate that stable diagnostic accuracy was maintained under both the CI and FI metrics, regardless of the initialization strategy. This demonstrates the model’s robustness to changes in initialization conditions.
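The two initialization strategies can be sketched as follows. This is one possible reading, stated as an assumption: "uniform" is taken to mean entries drawn from a uniform distribution and then normalized row by row, and "Dirichlet" to mean each row drawn from a symmetric Dirichlet distribution, sampled here by normalizing independent gamma draws. The function name and the concentration parameter `alpha` are hypothetical.

```python
import random


def random_stochastic_matrix(n, strategy, rng, alpha=1.0):
    """Draw an n x n row-stochastic matrix for initializing transitions.

    strategy = "uniform":   entries ~ U(0, 1], then each row is normalized.
    strategy = "dirichlet": each row ~ Dirichlet(alpha, ..., alpha), drawn
                            as normalized Gamma(alpha, 1) variates.
    """
    rows = []
    for _ in range(n):
        if strategy == "uniform":
            raw = [1.0 - rng.random() for _ in range(n)]  # in (0, 1]
        elif strategy == "dirichlet":
            raw = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        total = sum(raw)
        rows.append([v / total for v in raw])
    return rows


rng = random.Random(42)
init_uniform = random_stochastic_matrix(2, "uniform", rng)
init_dirichlet = random_stochastic_matrix(2, "dirichlet", rng)
print(init_uniform)
print(init_dirichlet)
```

Running EM from several such starting points and comparing the resulting CI and FI values is a simple way to check sensitivity to initialization.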
On this basis, we note that mislabeling may occur on the actual assembly line, that is, the fault cause recorded for a defective product may be incorrect. To simulate this, we added a small number of mislabeled samples to the training dataset. Evaluating on the same test set, we obtained the model performance metrics shown in Table 8.
6. Discussion
The proposed method, under the DMFD framework, offers unique advantages for fault tracing in the LST assembly line. This method provides an important reference for fault diagnosis in discrete manufacturing industries. The following discusses the potential application scenarios and implementation guidelines:
(1) Potential application scenarios. The LST assembly process involves several key steps, such as upper and lower body welding, valve installation, and sensor installation. Applying the DMFD method helps identify faults at their root, enabling a quick response and reducing the risk of defective products entering the market. The main advantage of the proposed method lies in modeling the LST assembly line with the DMFD framework, which is particularly suitable for discrete industrial assembly lines. By establishing the relationship between production and testing processes and integrating the FHMM, the method is expected to effectively predict and diagnose potential issues during the production process.
(2) Guidelines for implementation. In practice, the effective implementation of a fault diagnosis system requires a comprehensive and robust data acquisition infrastructure. High-quality data should be collected at every stage of the production process, including sensor data for monitoring key parameters such as temperature, pressure, and pull-out displacement. However, the proposed method does not require perfect fault data, which greatly reduces the difficulty of obtaining data in real industrial scenarios. Different assembly lines often have different processes and testing procedures, so general fault diagnosis methods must be adapted to the specific needs of each assembly line during implementation. To achieve cross-line adaptability, the proposed method must be flexible. For instance, by optimizing testing processes and modifying data analysis strategies, the components of the method can be adjusted to different production environments.
Overall, the proposed framework shows strong potential for the LST assembly line: by establishing the relationship between production and testing processes and integrating the FHMM, it is expected to effectively predict and diagnose potential issues in the production process. Although further experimental validation is required, this method could play an important role in improving production efficiency and product quality. This research direction has broad application prospects and warrants further exploration through carefully designed experiments to verify its feasibility and effectiveness in complex industrial environments.