#### **1. Introduction**

Today's manufacturing era focuses on monitoring the process on the shop-floor by utilizing various sensor systems that are based on data collection [1–3]. The automated systems directly collect an enormous amount of performance data from the shop-floor (Figure 1), which are stored in a repository in raw or accumulated form [4,5]. However, automated decision making in the era of Industry 4.0 has yet to be fully realized. The current work is a step towards holistic alarms management (independently of context [6]) and, in particular, towards automated root-cause analysis.

**Figure 1.** Key Performance Indicator (KPI) aggregation from Measured Values (MV).

#### *1.1. KPIs and Digital Twin Platforms in Manufacturing*

Numerous KPIs have been defined in the literature (Manufacturing, Environmental, Design, and Customer) [7], and, despite the fact that methods have been developed for the precise acquisition of measured values [8], the interconnection of these values (Figure 1) can lead to erroneous decision making when one searches for the root cause of an alarm occurring in a manufacturing environment.

In addition, manufacturing-oriented dashboard systems aid in the visualization of complex accumulations, trends, and directions of Key Performance Indicators (KPIs) [9]. However, despite the situation awareness [10] that they offer, they are not able to support the right decision at management level by automatically elaborating the performance metrics and achieving profitable production [11]. In the meantime, multiple-criteria methods are widely used for decision support or the optimization of production, plant, or machine levels [1,12], based mainly on the four basic manufacturing attributes, i.e., Cost, Time, Quality, and Flexibility [1]. This set can, of course, be extended. The "performance" that is collected from the shop-floor is considered here as Measured Values (MV), and their aggregation or accumulation yields higher level Key Performance Indicators, as initially introduced by [13]. The goal of all these is the achievement of digital manufacturing [14].

On the other hand, the term digital twin can be considered as an "umbrella" term, and it can be implemented with various underlying technologies, such as physics-based modelling [15], machine learning [16], and data/control models [17]. A digital twin provides some form of feedback to the system; it ranges from process level [15] to system level [18], and it can even address design aspects [19].

Regarding commercial dashboard solutions, they enable either manual or automated input of measured values for the production of KPIs [20–22]. However, their functionality is limited to reporting the current KPI values, typically visualized with a Graphical User Interface (GUI), usually cluttered with gauges, charts, or tabular percentages. Additionally, there is a need to incorporate various techniques from machine learning, or Artificial Intelligence in general [23], as well as signal processing techniques [24]. The typical functionality found in dashboards is the visual display of the most important information for one or more objectives, consolidated and arranged on a single screen, so that the information can be monitored at a glance [9]. This is extended herein in order to analyze the aggregated KPIs and to explore potential future failures by utilizing the monitored production performances. Finding, however, the root cause of an occurring problem, utilizing the real-time data in an efficient and fast way, is still being pursued.

#### *1.2. Similar Methods and Constraints on Applicability*

In general, categories in root-cause finding are hard to define, but if one borrows the terminology from Intrusion Detection Alarms [25], one can claim that there are three major categories: anomaly detection, correlation, and clustering. Specific examples of all three in decision making are given in the next paragraph. The alternative classifications of Decision Support Systems given below permit this categorization; the first classification [26] comprises: (i) Industry-specific packages, (ii) Statistical or numerical algorithms, (iii) Workflow Applications, (iv) Enterprise systems, (v) Intelligence, (vi) Design, (vii) Choice, and (viii) Review. Using another criterion, the second classification [27] comprises: (a) communications driven, (b) data driven, (c) document driven, (d) knowledge driven, and (e) model driven.

There are several general-purpose methods that are relevant to root-cause finding and are mentioned here. Root Cause Analysis is a useful set of techniques; however, it remains at the level of descriptive, empirical strategies [28,29]. Moreover, scorecards are also descriptive and empirical, while the Analytic Hierarchy Process (AHP) requires criteria definition [30]. Finally, Factor Analysis requires a specific kind of manipulation/modelling due to its stochastic character [31].

Regarding specific applications in manufacturing-related decision making, finding the root cause usually has to be addressed through identifying a defect in production. Defects can refer either to unwanted product characteristics or to unwanted resource behaviour. Methods that have been previously used, regardless of the application, are Case Based Reasoning [32], pattern recognition [33,34], Analysis of Variance (ANOVA) [35], neural networks [36], hypothesis testing [37], time series analysis [38], and many others. However, none of these methods is quick enough to give results from a deterministic point of view and without using previous measurements for training. Additionally, they are quite focused in application terms, which means that they cannot be used on a different set of KPIs without re-calibration. On the other hand, traditional Statistical Process Control (SPC) does not offer solutions without context, meaning the combination of the application and method [31].

#### *1.3. Research Gaps & Novelty*

In this work, an analysis method on KPIs has been implemented in the developed dashboard to identify, automatically and without prior knowledge, the variables that are responsible for an undesired production performance of a given KPI. This mechanism is triggered by predicting a threshold exceedance in management level KPIs. The prediction utilizes (linear) regression, which is applied on the historical trend of each variable to estimate the performance values for the upcoming working period, so that the weaknesses of the production can be elicited beforehand. The dashboard in which the current methodology has been framed acts as an abstractive Digital Twin for production managers and supports automated decision making.

In the following sections, the analysis approach is presented, followed by its implementation details within the dashboard. The results from case studies, the points of importance, and future research directions are then discussed.

#### **2. Materials and Methods**

#### *2.1. The Description of the Flow and the Calculations*

The current method consists of two stages. Initially, the formulas of each KPI are analyzed and their dependent variables are collected. Thus, the relationships between the Measured Values (MV), Intermediate Values (IV), Performance Indicators (PI), and Key Performance Indicators (KPIs) are known. This stage constitutes, in reality, the modelling of the production and it runs only once, while the data are aggregated from IoT technologies. The exact methodology can be found as IoT-Production in previous literature [39]. Regarding the second stage, Figure 2 presents the flow of the analysis for a single KPI. This stage runs continuously. In the beginning, the trend for each measure of the period under examination is acquired from the OLAP. For reasons of flexibility and modularity, the values are not acquired directly from the data acquisition system. The measure trend is supplied as input to the tool, where linear regression is applied to estimate the data points until the end of the next period, namely the horizon (as defined in detail in Figure 3). Once an estimated value exceeds the corresponding goal (A1/A2), the measure is marked as 'out-of-goal' and the user is notified through the user interface. The user can then choose to repeat the process for lower level KPIs (turning each into the investigated metric), after the automated suggestion of the analysis tool as to which metric accounts for this result (B). The analysis method utilizes differentials; this mathematical approach is explained hereafter. The decisions that have to be taken (C) are beyond the scope of the current work. Nevertheless, it is noted that their generation is often straightforward, even though the existence of a knowledge base, to this end, would be extremely useful.

**Figure 2.** Schematic of the way the tool is run in real production.

**Figure 3.** Alarm created by the prediction of a KPI exceeding its threshold.
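
As a minimal sketch of this prediction step (the out-of-goal check against the horizon of Figure 3), the following Java snippet fits a least-squares line to the historical trend of a measure and extrapolates it; it assumes equally spaced samples, and the class and method names are hypothetical, not part of the actual dashboard code.

```java
/**
 * Minimal sketch of the prediction step: fit a least-squares line to the
 * historical trend of a measure and extrapolate it until the end of the next
 * period (the horizon). Class and method names are hypothetical.
 */
public final class TrendPredictor {

    /** Fitted line y = intercept + slope * t over time indices 0..n-1. */
    public record Line(double intercept, double slope) {
        double valueAt(double t) { return intercept + slope * t; }
    }

    /** Ordinary least-squares fit over equally spaced time indices. */
    public static Line fit(double[] y) {
        int n = y.length;
        double sumT = 0, sumY = 0, sumTT = 0, sumTY = 0;
        for (int t = 0; t < n; t++) {
            sumT += t;
            sumY += y[t];
            sumTT += (double) t * t;
            sumTY += t * y[t];
        }
        double slope = (n * sumTY - sumT * sumY) / (n * sumTT - sumT * sumT);
        return new Line((sumY - slope * sumT) / n, slope);
    }

    /** True if any extrapolated point up to the horizon violates the goal (A1/A2 in Figure 2). */
    public static boolean outOfGoal(double[] history, int horizon, double goal, boolean higherIsBetter) {
        Line line = fit(history);
        for (int t = history.length; t < history.length + horizon; t++) {
            double estimate = line.valueAt(t);
            boolean violated = higherIsBetter ? estimate < goal : estimate > goal;
            if (violated) {
                return true; // predicted threshold exceedance: notify the user (alarm of Figure 3)
            }
        }
        return false;
    }
}
```

For a KPI where lower values are worse (e.g., OEE), `outOfGoal(history, horizon, goal, true)` would raise the alarm as soon as the extrapolated line drops below the goal.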

Figure 4 summarizes this procedure using a flowchart. Additionally, for ease of comprehension, the algorithmic description follows, along with some notes for each step.


In the context of a specific example, Figure 5 provides an indicative graphical illustration of such information. The derivative formulas are also pre-installed, as they are used during this step.

The partial differential is the tool that is applied within the analysis function to estimate the cause of the variation. It is a powerful tool that deterministically quantifies the effect of the variation of one KPI on the variation of another KPI. More specifically, if KPI *A(t)* is a function of PIs *A1(t)*, *A2(t)*, and *A3(t)*, the percentage that each *An(t)* contributes to the variation of *A(t)* in time is the quantity

$$\delta_{A_n}^{A} = \varepsilon\left\{ \left| \frac{\partial A}{\partial A_n}\,\Delta A_n \right| \right\} \tag{1}$$

Given the corresponding notation, the operators of absolute value |·|, mean value ε{·}, partial derivative ∂/∂X, and difference Δ are used for the computation of the mean partial differential. This helps in pointing out the direction that the production manager should focus on.
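
For instance, assuming the relationship of Case I below, *B(t)* = *A1(t)\*A2(t)* + *A3(t)*, the partial derivatives entering Equation (1) would be

$$\frac{\partial B}{\partial A_1} = A_2, \qquad \frac{\partial B}{\partial A_2} = A_1, \qquad \frac{\partial B}{\partial A_3} = 1$$

so that, for example, the contribution of *A1* over the examined window is $\delta_{A_1}^{B} = \varepsilon\{|A_2(t)\,\Delta A_1(t)|\}$; normalizing the three contributions (an assumption made here for illustration) yields the percentages shown in the contribution diagrams of Section 3.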


**Figure 4.** Flow chart of the algorithmic procedure.

**Figure 5.** Relationship between Measured Values and KPI. Operations include summation, products and inversion.

#### *2.2. Implementation within a Dashboard: Services and Hardware Framework*

The dashboard is implemented in Java, following an Object-Oriented (OO) paradigm, as a Web Application that runs on a typical Java Servlet Container (Apache Tomcat); thus, it is accessible through a typical Internet browser. The Digital Twin is implemented in a Service-Oriented Architecture (SOA) that follows the N-Tier Architecture with multiple layers per tier. An integration layer handles the asynchronous communication of the browser with the server, enabling the user to begin the analysis at any time and without leaving the current screen. The browser sends HTTP requests to the server, whereas JavaScript Object Notation (JSON) objects containing the results are returned to the client.
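
A minimal sketch of this request/response cycle is given below; it assumes a plain servlet endpoint, and the path, parameter names, and response shape are hypothetical rather than the actual dashboard code.

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Illustrative endpoint: the browser issues an asynchronous HTTP request naming
 * the KPI to analyze and receives the result as a JSON object, so the user
 * never leaves the current screen. All names and the response shape are hypothetical.
 */
public class AnalysisEndpoint extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String kpi = req.getParameter("kpi");        // e.g. "OEE"
        String period = req.getParameter("period");  // the working period under examination

        // Placeholder result; in the dashboard this would come from the analysis services.
        boolean outOfGoal = true;
        String rootCauseCandidate = "Effectiveness";

        resp.setContentType("application/json");
        resp.getWriter().write(
            "{\"kpi\":\"" + kpi + "\"," +
            "\"period\":\"" + period + "\"," +
            "\"outOfGoal\":" + outOfGoal + "," +
            "\"rootCauseCandidate\":\"" + rootCauseCandidate + "\"}");
    }
}
```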

Figure 6 illustrates the system's architecture, which comprises two individual processes. The Monitor Process records every equipment performance activity in the database, while the Actor directly interacts with the Analysis Process through the dashboard pages and requests an analysis for certain KPIs. Numbers 1 and 6 indicate the communication between the human (engineer/operator) and the dashboard, 2 and 5 denote a query request or result, and 3 and 4 indicate the OLAP/RDBMS communication.

**Figure 6.** Application architecture.

In particular, the Actor interacts with the AnalyticAction of the dashboard, as depicted in Figure 7, which, in turn, cooperates with the AnalysisService, employing the Facade Design Pattern [40]. This architecture allows the reuse of a software library and provides a context-specific interface launching a set of services without the user being exposed to the complexity of the algorithms beneath. This is performed through a variety of entities. Firstly, the AnalysisService directly uses the OLAPQueryService to acquire the values of the requested KPIs and holds them in the KPIManager along with their formula expression stored in the database; this is due to an implementation limitation of current OLAP servers in providing the calculation function for a KPI. Next, the PredictionTool applies the regression function to the data, which are subsequently fed to the FactorAnalysisTool for the analysis of the influence factor of each dependent variable of the KPI. The PredictionTool and the FactorAnalysisTool correspond to the implementations of the extrapolation tool and the differential-estimation tool described in the previous section.
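
A hedged sketch of this Facade arrangement follows; the collaborator names are taken from Figure 7, whereas the method signatures and the AnalysisReport type are assumptions made only for illustration.

```java
import java.util.Map;

/** Minimal stand-ins for the collaborators of Figure 7; signatures are assumptions. */
interface OLAPQueryService { double[] trend(String kpi, String period); }
interface KPIManager { String formulaOf(String kpi); }
interface PredictionTool { double[] extrapolate(double[] history); }
interface FactorAnalysisTool { Map<String, Double> contributions(String formula, double[] predicted); }

/** Hypothetical result handed back to the AnalyticAction of the dashboard. */
record AnalysisReport(String kpi, double[] predicted, Map<String, Double> contributions) {}

/**
 * Illustrative Facade [40]: a single entry point that hides the OLAP queries,
 * the regression, and the differential analysis from the caller.
 */
public class AnalysisService {

    private final OLAPQueryService olap;
    private final KPIManager kpiManager;
    private final PredictionTool predictionTool;
    private final FactorAnalysisTool factorAnalysisTool;

    public AnalysisService(OLAPQueryService olap, KPIManager kpiManager,
                           PredictionTool predictionTool, FactorAnalysisTool factorAnalysisTool) {
        this.olap = olap;
        this.kpiManager = kpiManager;
        this.predictionTool = predictionTool;
        this.factorAnalysisTool = factorAnalysisTool;
    }

    public AnalysisReport analyze(String kpiName, String period) {
        double[] trend = olap.trend(kpiName, period);            // KPI history from the OLAP
        String formula = kpiManager.formulaOf(kpiName);          // formula kept alongside the values
        double[] predicted = predictionTool.extrapolate(trend);  // regression up to the horizon
        Map<String, Double> contributions =
                factorAnalysisTool.contributions(formula, predicted); // mean partial differentials
        return new AnalysisReport(kpiName, predicted, contributions);
    }
}
```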

**Figure 7.** Unified Modeling Language (UML) diagram of the objects implemented in the dashboard and their workflow.

The dashboard also uses the RDBMS to store the information required for the visualization of the customized KPI views in various forms. Each KPI is defined in a KPIDefinition instance along with its formula. The Measured Values are retained in the same structure, but without a formula. Formulas contain lexical names that correspond to definition names. For instance, the Availability KPI is defined by the lower-case name 'availability', and it is referred to by the OEE KPI as the #{availability} variable, which the interpreter replaces at runtime with the actual value. The exact UML of the Informational Model used is given in Figure 8. A black rhombus indicates composition, a white rhombus indicates aggregation, and a plain arrow indicates association.
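
A minimal sketch of such a runtime substitution, under the assumption of a simple regular-expression based interpreter (the actual dashboard interpreter is not described in that detail, and the OEE formula used below is only an example), could be:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative formula interpreter: replaces #{name} placeholders in a KPI
 * formula with the current values of the referenced definitions at runtime.
 * A simplification of whatever the actual dashboard interpreter does.
 */
public final class FormulaInterpreter {

    private static final Pattern PLACEHOLDER = Pattern.compile("#\\{([a-z_]+)\\}");

    /** Substitutes placeholders; evaluation of the resulting expression is left to the caller. */
    public static String resolve(String formula, Map<String, Double> currentValues) {
        Matcher m = PLACEHOLDER.matcher(formula);
        StringBuilder resolved = new StringBuilder();
        while (m.find()) {
            Double value = currentValues.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Unknown definition: " + m.group(1));
            }
            m.appendReplacement(resolved, value.toString());
        }
        m.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        // OEE referring to its lower level KPIs via placeholders, e.g. #{availability}.
        String oee = "#{availability} * #{effectiveness} * #{quality}";
        System.out.println(resolve(oee, Map.of(
                "availability", 0.9, "effectiveness", 0.8, "quality", 0.95)));
        // prints: 0.9 * 0.8 * 0.95
    }
}
```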

Historical data of the production's execution are stored in the relational database management system (RDBMS) (Figure 6) as records, reflecting the data acquisition method. A relational table holds the machine's cycle output, whereas each record represents a single machine cycle with attributes that relate to the corresponding entities in the production system. The smallest set of these attributes can be:


**Figure 8.** UML of the Informational Model.

These attributes and records mostly refer to productivity. Additional attributes, even of a different character (e.g., sustainability related, such as energy consumption), can be added. A graph is then formed with the involved KPIs, where each KPI is represented by a vertex *v* and each dependency by an edge *e*, with a notation similar to precedence diagrams. The analysis method exploits the data and follows their dependencies until the root measure has been identified (among the Measured Values). Finally, that measure is presented to the user as an indication of the type of measure that will negatively affect the performance of production in the near future.
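
A hedged sketch of this graph traversal is given below; in the actual method the contributions are recomputed at each level by the analysis of Section 2.1, whereas here a single pre-computed contribution map is used for brevity, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/**
 * Illustrative KPI dependency graph: each vertex is a KPI/PI/Measured Value and
 * each edge a dependency taken from the KPI formulas. The drill-down follows,
 * at every level, the dependent with the highest contribution until a leaf
 * (a Measured Value) is reached. Names and the single contribution map are
 * simplifications made for this sketch.
 */
public final class KpiDependencyGraph {

    /** kpi -> metrics it directly depends on (empty list for Measured Values). */
    private final Map<String, List<String>> dependencies;

    public KpiDependencyGraph(Map<String, List<String>> dependencies) {
        this.dependencies = dependencies;
    }

    /** Follows the highest-contribution dependent at each level down to a Measured Value. */
    public List<String> drillDown(String topKpi, Map<String, Double> contributions) {
        List<String> path = new ArrayList<>();
        String current = topKpi;
        path.add(current);
        while (!dependencies.getOrDefault(current, List.of()).isEmpty()) {
            current = dependencies.get(current).stream()
                    .max(Comparator.comparingDouble(d -> contributions.getOrDefault(d, 0.0)))
                    .orElseThrow();
            path.add(current);
        }
        return path; // e.g. [OEE, Effectiveness, Processing Time], mirroring Case IV below
    }
}
```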

The production performance is loaded from the OLAP cube [41], where the measures are formed with the help of aggregation functions. For instance, the measure 'Good Quantity' is aggregated by the count function over the 'Good Part' attribute.
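
For illustration, the same aggregation expressed directly over the raw cycle records (a sketch only; the actual measure is defined in the OLAP cube, and the record shape is hypothetical) could be:

```java
import java.util.List;

/** Minimal cycle record holding only the attribute needed for this example. */
record MachineCycle(String machineId, boolean goodPart) {}

class GoodQuantityExample {

    /** 'Good Quantity' as the count of cycles whose 'Good Part' attribute is set. */
    static long goodQuantity(List<MachineCycle> cycles) {
        return cycles.stream().filter(MachineCycle::goodPart).count();
    }

    public static void main(String[] args) {
        List<MachineCycle> cycles = List.of(
                new MachineCycle("M01", true),
                new MachineCycle("M01", false),
                new MachineCycle("M02", true));
        System.out.println(goodQuantity(cycles)); // prints 2
    }
}
```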

#### **3. Results & Discussion**

The functionality of the methodology presented in the above sections is demonstrated herein utilizing adequately complex cases. Noise is also present in the first three cases, where dummy KPIs have been used, in order to simulate the short-time variations of the KPIs. These first case studies have been established in order to check the numerical success of the algorithm, whereas the fourth case study originates from a real industrial problem and its scope is to test the applicability of the digital twin and the validity of the main algorithm in problems set in environments of higher Technology Readiness Level.

#### *3.1. Case I*

In this case, the relationship between the KPI *B(t)* and the PIs *A1(t)*, *A2(t)*, and *A3(t)* is *B(t)* = *A1(t)\*A2(t)* + *A3(t)*. The evolution of the PI trends is as follows:


It is evident from the contribution diagram of Figure 9 that the algorithm succeeds in predicting that the contribution of the first factor should be high. The same happens if *A1(t)* decreases in time. Despite the existence of noise, the algorithm worked to a satisfactory extent, correctly characterizing *A1(t)* as the root cause of the variation.

**Figure 9.** Case I. Top: Evolution of the KPI; Middle: Evolution of the PIs; Bottom: Contribution of each PI.

#### *3.2. Case II*

In this case, the relationship between the KPI *B(t)* and the PIs *A1(t)*, *A2(t)*, and *A3(t)* is *B(t)* = *A1(t)\*A2(t)* + *A3(t)*. The evolution of the PI trends is as follows:


It is evident from the contribution diagram of Figure 10 that the algorithm succeeds in predicting that the contribution of the third factor should be high. The same happens if *A3(t)* decreases in time. The algorithm thus runs successfully, regardless of the operations that are performed among the KPIs.

**Figure 10.** Case II. Top: Evolution of the KPI; Middle: Evolution of the PIs; Bottom: Contribution of each PI.

#### *3.3. Case III*

In this case, the relationship between the KPI *B(t)* and the PIs *A1(t)*, *A2(t)*, and *A3(t)* is *B(t)* = *A2(t)* / *A1(t)* + *A3(t)*. The evolution of the PI trends is as follows:


It is evident from the contribution diagram in Figure 11 that the algorithm succeeds in predicting that the contribution of the first signal should be higher than that of the second one. It is apparent that both of the first two PIs are causes of the variation; however, the change in the denominator is the one that affects the result more. Accordingly, the algorithm has pointed towards the correct correction.

**Figure 11.** Case III. Top: Evolution of the KPI; Middle: Evolution of the PIs; Bottom: Contribution of each PI.

#### *3.4. Case IV*

This case regards a realistic production system and is used to test the validity of the algorithm in terms of the physical relationships between the PIs. This is done by forcing the Measured Values to create a KPI-level alarm and checking whether the algorithm will find the root cause and point towards the correct Measured Value. Thus, a production system of thirty (30) identical machines arranged in eight work-centers has been simulated with a commercial software package [42] (Figure 12). The system simulates a real-life machine-shop producing two variants of Cylinder Head parts: (a) Petrol and (b) Diesel. Different processing times and setup times have been configured. Additionally, a negative exponential distribution for the Mean-Time-Between-Failures (MTBF) and a real distribution for the Mean-Time-To-Repair (MTTR) were used to model equipment breakdowns on the machines. Customer demand has been set by a normal distribution. Finally, the performances have been recorded and manually fed into the informational model.

**Figure 12.** Work-center arrangement of the production system. The numbers indicate the amount of machines in each work-center.
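
For illustration only, the stochastic inputs described above could be sampled as in the following sketch; this is not the commercial simulation software [42], and all parameter values are hypothetical.

```java
import java.util.Random;

/**
 * Illustrative sampling of the stochastic inputs of Case IV: negative
 * exponential Mean-Time-Between-Failures and normally distributed customer
 * demand. Parameter values are hypothetical.
 */
class CaseIvSampling {

    private static final Random RNG = new Random(42);

    /** Negative exponential MTBF sample with the given mean (e.g., hours). */
    static double nextTimeBetweenFailures(double meanHours) {
        return -meanHours * Math.log(1.0 - RNG.nextDouble());
    }

    /** Normally distributed customer demand, truncated at zero. */
    static double nextDemand(double mean, double stdDev) {
        return Math.max(0.0, mean + stdDev * RNG.nextGaussian());
    }
}
```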


The KPIs that are utilized in this case study (top to bottom) are mentioned below and their relationship is also given in Figure 5, beginning with Overall Equipment Efficiency (OEE).


A simulation has been conducted with the machine breakdowns deliberately increased with time. Subsequently, the algorithm for finding the root cause is run once. In Figure 13, it is clearly shown (yellow shaded rectangles and curved arrows) that the algorithm indicates the major role of Effectiveness. This definitely helps in pointing out the direction that the production manager should focus on.

In fact, if one repeats the procedure of running the algorithm, but this time utilizing Effectiveness instead of OEE (68% contribution), then the Measured Value that comes up is Processing Time. This means that the engineer should focus on why the machines do not work as much as expected, which is indeed the root cause of the OEE drop (Figure 13).

**Figure 13.** Tracking the cause of the OEE alarm back to Processing Time.

It seems that the algorithm works successfully without setting any rules or training. On the contrary, works in the literature often have to use one or the other, so a direct comparison cannot be made. For instance, in another alarm root-cause detection system [43], the authors, even though they describe the impact on case studies only briefly, explicitly point out the need for rules in an internal expert system. Additionally, the use of KPIs at process level [44] is complementary to the current work. A framework that aims to detect the quality of the process, using classifiers based on depth of cut and spindle rate, could be integrated; the corresponding indicators can be regarded as processed Measured Values. Additionally, the validity of the indicators aggregated from IoT is an issue that has not been raised herein, as the data have been considered valid. In addition to this, architectures based on complexity handling [45] and algorithms such as Blockchain [46] can guarantee this goal, while, as indicated by corresponding results in the literature [47], the introduction of a manipulation strategy, such as KPI-ML, would boost the performance of any algorithm, including the current one. In other pieces of literature [48], the Multiple Alarms Matrix is utilized and hierarchical clustering is applied; the presented results indicate the usability of the method towards the correlation of the alarms and the extraction of the unique ones. This has not been pursued herein, as the goal has been to search for immediate causal relationships. Bayesian Network inference is utilized in a different work [49]; this approach can also be useful towards a causal model [50]. However, Abele et al., in their own discussion, state the use of probabilities and simulations, while expert knowledge is needed, whereas, in the current work, knowledge is used solely in the generation of actions. Furthermore, the enhancement with the Max–Min Hill Climbing algorithm seems to be useful; however, it is characterized as time consuming [51]. Case Based Reasoning seems to be similar to the present work, given the results in the literature [52]; however, as per the authors, it is best applied when records of previously successful solutions exist.

Moreover, the environment plays a significant role in the efficiency of such aggregation. More specifically, regarding manufacturing, in large networks that share decision making, alternative bandwidth reservation [53] can be adopted for reasons of performance. Connection to the control layer, as performed in other works [54], could potentially be addressed in an extension of the current study. Along the same lines, KPIs related to CPPS issues [55] could potentially extend the applicability of the current framework. Finally, regarding innovation and technology readiness, the upscaling of technologies integrated in the shop-floor is also relevant, in terms of decision making for designing systems and process planning that integrate Industry 4.0 Key Enabling Technologies. The Technology Readiness Level of the various technologies could also be described in terms of KPIs, as per the literature [56].
