1. Introduction
The advent of Industry 4.0 [1] was a revolutionary event [2]. The core and innovative element was the use of data to transition from traditional programmed and control-based processes and systems to intelligent processes and systems that can predict the behaviour of various stakeholders in the industry value chain (e.g., customers, operators, machines) and proactively adjust their operations at different levels [3]. It aims to maximise production efficiency by achieving self-awareness, self-prediction and self-maintenance [4,5].
A key point of Industry 4.0 is the interaction between production and maintenance planning [6]. The term maintenance planning “identifies the set of technical, administrative, and managerial activities carried out during the life cycle of an item, with the aim of maintaining or restoring its functionality” [7]. In this context, early warning systems (EWSs) have emerged as valuable tools to anticipate unexpected disruptions and support maintenance strategies [8]. Indeed, EWSs are designed to provide timely alerts and notifications about potential risks, hazards, or critical events [9,10,11].
These systems employ real-time data from a variety of sources, including sensors, surveillance cameras, and environmental monitors, to identify discrepancies from standard conditions [12]. They apply statistical analysis, machine learning, and pattern recognition algorithms to identify unusual patterns or anomalies that might indicate the occurrence of an undesired event [13,14]. EWSs are typically designed with predefined thresholds or rules that trigger alerts when specific conditions are met [15].
The benefits of EWSs are considerable [16,17,18,19]. While they are widely used in disaster management and environmental monitoring [11,20,21], their versatility has enabled their adoption across multiple domains, including manufacturing, healthcare, finance, and business management [11]. In the healthcare sector, for instance, EWSs support clinical assessment to detect acute deterioration at an early stage, in order to prevent or reduce adverse events such as unexpected cardiopulmonary arrests, intensive care unit admissions and deaths [22]. In the manufacturing context, EWSs are used to detect early signs of possible interruptions in the production cycle, caused either by machine downtime or by production errors; through targeted actions on product quality control, they help optimise both quality and productivity [23]. In the financial sector, EWSs are instrumental in anticipating the onset of financial crises in individual countries [24]. Another research line focuses on identifying early warning signals for decision-makers. These signals can be derived from statistical data, artificial intelligence (AI), or cognitive behavioural techniques, and the EWSs in question are referred to as “managerial early warning systems” (MEWSs) [25]. Although MEWSs can automate the detection of early warning signals, managers still determine how to respond to them; the purpose of these studies is not automation, but rather helping managers develop strategies [25,26,27].
Across all contexts, EWSs are today referred to as intelligent because they combine the technologies inherent to warning systems, such as sensors, with artificial intelligence in a synergistic approach; although there is no strict definition of an intelligent early warning system (IEWS) in the literature, there are numerous examples of implementation [28,29,30,31,32]. The IEWS represents a significant advancement in condition-based monitoring systems: it uses big data technology and intelligent algorithms to continuously monitor equipment under various conditions, predict future signal changes, detect faults in the creep process before they occur, diagnose equipment faults, and calculate the equipment’s remaining useful life [33]. The integration of EWSs and AI allows companies to move from reactive to predictive maintenance, anticipating problems before they occur and optimising the planning of maintenance activities.
Despite their growing adoption, EWSs still face critical limitations. In particular, in the industrial context, one of the main issues lies in the event collection phase: systems must handle large volumes of heterogeneous data streams, often affected by noise and false positives, which compromise the effectiveness of subsequent analyses. Moreover, correlating events from disparate sources raises complex challenges related to semantic integration and the timely detection of rare or weak signals, all while ensuring security, scalability, and adaptability across different production environments. Lastly, activities such as anomaly forecasting, signal prioritisation, and visualisation demand advanced proactive solutions, which remain largely underexplored in the current literature [34].
These limitations highlight the need for more robust and flexible approaches capable of improving predictive accuracy, reducing response times, and enhancing the ability to distinguish relevant events from background noise. In this context, the research question that is addressed in this study is as follows: How can an intelligent early warning system, based on deep learning techniques, improve predictive accuracy and real-time event correlation from heterogeneous sources in industrial contexts?
To answer this question, the paper proposes an innovative approach for designing and developing an intelligent early warning system. Specifically, it presents the implementation and validation of a predictive model based on Long Short-Term Memory (LSTM) neural networks, trained and tested on real-world industrial data. The system is designed to accurately predict the most frequent error classes, thereby enhancing equipment reliability and improving the planning of maintenance activities.
The activities described in this study are part of the SCREAM research project, which aims to develop a platform capable of providing companies with strategic insights into their production processes. The goal is to address both product development dynamics and production-related elements, such as infrastructure and machinery, that are critical for optimising every stage of the supply chain. To achieve this, data from industrial machinery were analysed, with a specific focus on performance metrics and malfunction events. The analysis focuses on the prediction of high-frequency error classes to improve overall system reliability and enable more effective preventive maintenance strategies.
The rest of the paper is organised as follows. Section 2 presents the theoretical field on which this study is based. Section 3 explains the research method adopted. Section 4 contextualises the case study and presents the results achieved. Section 5 discusses the findings, including the limitations of the research, its implications, and potential future research. Finally, Section 6 draws the conclusions, summarising the approach.
3. Research Design
The design of this research was informed by both theoretical considerations and practical constraints derived from the industrial context under investigation. The primary objective was to develop an early warning system capable of predicting critical machine faults based on time-series data collected from industrial sensors. The application domain concerned a legacy industrial printing machine, specifically a paper embosser, owned by Sofidel, a leading producer of tissue paper products based in Tuscany (IT). This machine is equipped with a wired industrial sensor array installed by the manufacturer, whose Application Programming Interface (API) provides access to the floating-point sensor data. The machine lacks a system to alert or advise the operator of potential issues before they occur.
Based on the literature and the preliminary data analysis (see Section 2 and Section 4), Long Short-Term Memory (LSTM) networks were identified as the most appropriate deep learning architecture.
LSTM networks, when combined with attention mechanisms, provide significant advantages in modelling time-series data, particularly when using a sliding-window approach. Time-series data often exhibit irregular patterns, non-stationarity, and long-range dependencies, which require models capable of effectively capturing both short- and long-term relationships [55]. LSTMs are specifically designed to address the limitations of standard recurrent neural networks (RNNs), notably the vanishing gradient problem, by incorporating gating mechanisms that allow information to persist across long sequences. This capability is crucial for time-series data, where dependencies may span multiple time steps. Furthermore, LSTMs efficiently handle sequences of variable length, making them well suited to applications involving sliding-window frames. The sliding-window approach segments the time series into overlapping sub-sequences, allowing the model to learn meaningful temporal patterns and generalise better to unseen data [56].
The integration of attention mechanisms with LSTM networks enhances their ability to selectively focus on the most informative parts of the input sequence. Attention layers assign different weights to each time step, ensuring that the model prioritises essential information while downplaying less relevant data points. This is particularly beneficial in scenarios where the significance of past observations varies over time.
Although convolutional neural networks (CNNs) have been successfully applied to time-series analysis, they primarily rely on local receptive fields and weight-sharing mechanisms, which make them more suitable for tasks involving spatial dependencies rather than long-term temporal relationships. CNN-based approaches may struggle to model long-term dependencies effectively without additional mechanisms, such as dilated convolutions or temporal pooling layers.
Transformer models, on the other hand, have gained prominence for their ability to capture long-range dependencies without the need for recurrence. However, transformers typically require large amounts of training data to generalise well, and their computational complexity scales quadratically with sequence length due to self-attention operations. Given the constraints of many real-world time-series datasets (where data availability may be limited), LSTM networks with attention provide a more efficient trade-off between performance and computational feasibility [57].
A case study was conducted to describe the activities and results. This method is appropriate for exploring problems and their solutions in real organisational settings [58]. The development of the case study was a collaborative effort between industrial engineers and university researchers, and it was based on the following research phases:
Company selection: The selected company, Sofidel, was deemed a suitable sample [58] due to its involvement in the SCREAM research project, which is funded by the Italian Ministry of Industrial Development.
Dataset Elaboration: The collected dataset was inspected and prepared for the analysis phase.
Exploratory Data Analysis: The process of examining the sensor data to gain insights into system behaviour.
Deep Learning Implementation: A neural network is applied as the machine learning approach.
Performance Measurement: Performance results are examined in this phase to collect final feedback on the case study implementation.
4. The Case Study Phases and Results
4.1. Company’s Context Description
The objective of this study is to design and test an algorithm that can work for an EWS that predicts the occurrence of a problem or an error as soon as possible. A particular part of a printing machine from “Sofidel” has been selected as the test scenario of the case study. It is focused on a specific facet of the printing machine, colloquially referred to as the “embosser”. This integral component comprises an assembly of rolls and wheels, collectively responsible for exerting pressure and force on one or more paper substrates. The culmination of these actions yields intricate patterns on the paper. The embosser plays a critical role in the mass production of various paper products and is frequently prone to intricate paper jams. Anticipating those or related errors in advance would be valuable, enabling domain experts to proactively address and prevent potential disruptions, thereby saving time and money.
The machine is equipped with integrated sensors at various levels that enable domain experts to understand whether its mechanical parts are failing. The sensor data revolve around measurements of applied forces, pressure distributions, and rotational velocity of the rolls. Their acquisition hinges on the implementation of Internet of Things (IoT) sensors placed on the left, right, and middle parts, in multiple areas, to collect the mentioned physical information with the lowest energy consumption. This cascade of data forms the core of the analytical case study, helping to provide a comprehensive understanding and effective management of the embossing process. Once acquired, data are subsequently conveyed through the Message Queuing Telemetry Transport (MQTT) protocol, known for its efficiency and reliability in data transmission, to a localised gateway acting as an edge server. In the gateway, data are collected for advanced visualisation and monitoring, giving domain experts important insights into the production process. In addition, data are stored in a company-internal data storage system to enable engineers to monitor any malfunctions. In parallel, the collected data are marshalled and sent through the Internet as JavaScript Object Notation (JSON) packages to a cloud server for further analysis and elaboration.
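As an illustration of this transmission step, the following sketch publishes one sensor reading as a JSON package over MQTT using the paho-mqtt client; the broker address, topic, and payload fields are hypothetical placeholders, not the plant’s actual configuration.

```python
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client()  # paho-mqtt 1.x-style constructor
client.connect("gateway.local", 1883)  # hypothetical gateway address

reading = {
    "timestamp": time.time(),
    "sensor_id": "embosser_left_pressure",  # hypothetical sensor name
    "value": 4.82,                          # floating-point measurement
}

# Publish the reading as a JSON package, as done towards the cloud server.
client.publish("plant/embosser/telemetry", json.dumps(reading), qos=1)
client.disconnect()
```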
4.2. The Dataset Elaboration
Data are collected on the machine’s edge devices and sent to the gateway according to Algorithm 1, which can be summarised as follows: ship the data every time a monitored value changes, or every 360 s (6 min) regardless of whether values have changed. This approach creates a dataset of ⟨timestamp, feature values, class⟩ tuples. The sensors on the embosser provide a total of 34 numerical features, representing each monitored part, and a class that defines the type of operation at a given timestamp.
The class is essentially of two types:
Error, along with a “type” that represents the different nature of the error;
Normal functioning, which defines that the machine is operating correctly.
The initial analysis of the dataset is based on the conversion and processing of the cloud-stored data into a set of rows corresponding to time–feature pairs. To ensure better data handling, features and classes have been processed around a single timestamp with a pivoting procedure. There are some caveats when using this approach: as rows are grouped and rotated into a matrix, certain features may become null due to incomplete data at the specific timestamp.
Algorithm 1 Data collection.
Require: F, the set of feature values acquired over time
1: procedure SendData
2:   t_last ← now
3:   while True do
4:     if now − t_last ≥ 360 s then ▹ Every 6 min
5:       bulk send F
6:       t_last ← now
7:     else
8:       for f ∈ F do
9:         if f has changed then
10:          send f
11:        end if
12:      end for
13:      update stored feature values
14:    end if
15:  end while
16: end procedure
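For readability, the following Python sketch mirrors the logic of Algorithm 1; the function names (read_features, send, bulk_send) are hypothetical stand-ins for the edge firmware’s actual primitives.

```python
import time

BULK_PERIOD_S = 360  # ship everything every 6 min regardless of changes

def collect_and_send(read_features, send, bulk_send):
    """Mirror of Algorithm 1: send a feature when it changes,
    and bulk-send all features every BULK_PERIOD_S seconds."""
    last_bulk = time.monotonic()
    previous = read_features()
    while True:
        current = read_features()
        if time.monotonic() - last_bulk >= BULK_PERIOD_S:
            bulk_send(current)            # every 6 min, changed or not
            last_bulk = time.monotonic()
        else:
            for name, value in current.items():
                if value != previous.get(name):
                    send(name, value)     # event-driven update
        previous = current                # update stored feature values
        time.sleep(0.1)                   # polling interval (illustrative)
```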
To address this situation, the semantics of Algorithm 1 are exploited: since a feature is transmitted only when its value changes, null values can be handled through forward and backward filling. This process involves the replication of the closest non-null value for each feature, both forward and backward, until it changes. By doing so, the original structure of the dataset is maintained and, if “not updated”, the value of a feature remains constant over time.
Table 2 presents an example of how null values have been handled through the pivot procedure. For simplicity, only a general approach is presented, considering timestamps t_1, …, t_n, where t_1 < t_2 < … < t_n. The forward-fill procedure replicates, for each feature, its value from the most remote non-null timestamp until it changes (refer to Feature 1 and Feature 2 in Table 3). Similarly, backward fill is used to fill values from the most recent non-null timestamp towards the beginning of the dataset (see “Feature n” in Table 4). Particular attention was paid to the class variable, as it required a different treatment compared to the other features. The forward-fill procedure does not change from what has just been described; however, backward fill cannot be applied safely when it encounters an error code, as the missing values would be replicated back towards the beginning of the dataset. For this reason, the remaining backward-null values were filled with the class corresponding to “normal behaviour”.
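A minimal pandas sketch of the pivot-and-fill procedure described above; column names and sample values are illustrative, not the project’s actual schema.

```python
import pandas as pd

# Raw stream: one row per (timestamp, feature, value) event,
# as produced by the data collection of Algorithm 1.
raw = pd.DataFrame({
    "timestamp": [1, 1, 2, 3, 3],
    "feature":   ["f1", "class", "f2", "f1", "class"],
    "value":     [0.5, 0, 1.2, 0.7, 2509],
})

# Pivot so that each row is a single timestamp and each column a feature.
wide = raw.pivot(index="timestamp", columns="feature", values="value")

# Forward fill: a feature that was "not updated" keeps its previous value.
# Backward fill only covers the leading gaps at the start of the series.
features = wide.drop(columns="class").ffill().bfill()

# The class column is forward-filled like the others, but leading gaps are
# filled with the "normal behaviour" code (0) rather than backward-filled,
# so that an error code is never propagated towards the beginning.
labels = wide["class"].ffill().fillna(0)

dataset = features.join(labels)
```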
4.3. Exploratory Data Analysis
This step of the study involved the analysis of the data from the sensors; the dataset ranged from 1 January 2023 to 1 June 2023. Due to the nature of data collection and conversion, no constant interval Δt [s] was identified that consistently separates the timestamp of a single dataset row from the next; a simple statistical analysis of Δt confirmed that the interval varies considerably from row to row. Since a 1:1 correspondence between the timestamps of the dataset rows and elapsed seconds could not be established, time will be expressed in terms of time units (TU) instead of seconds. The TU represents the unit step between consecutive rows, independent of the actual time elapsed. Furthermore, no further investigation was conducted on the trend of the timestamp, as it will be used only as an index to better represent the data.
Exploratory Data Analysis (EDA) began with a simple plot illustrating the trend of all features over time. This initial step provided a clearer understanding of their behaviour, revealing that 5 of the 34 features maintained a constant trend throughout the time frame. As a result, these features were removed from the dataset and excluded from further analysis. Subsequently, to gain deeper insight into the domain of each feature, a random forest analysis was performed using the Mean Decrease Impurity (MDI) metric, as shown in Figure 1. The data exhibit significant variability, primarily due to the machine’s inherent nature, which frequently transitions between different states, along with occasional outliers observed during maintenance operations.
Further investigations were conducted by analysing the correlation matrix between features and classes, revealing that an additional 14 variables exhibited identical correlation patterns. Discussions with domain experts confirmed that these values came from closely positioned sensors, allowing a further reduction in the number of features by retaining only one representative variable per group. The final set of selected features, along with their correlation values, is presented in Figure 2, which highlights a refined selection of 21 of the original 34 features (excluding the target class).
Additional studies on the domain of each feature did not show evidence of seasonal or residual components. Furthermore, the autocorrelation with the target class, as well as the partial autocorrelation of each feature, did not show strong relationships.
By generating both histograms and box plots to represent the operational value range, the inherent data distribution of each feature was revealed. This analysis facilitated the identification of outliers, highlighting the need for further data preprocessing.
To standardise the columns corresponding to each feature, the “quantile transform” technique was applied. This approach remaps the original probability distribution while mitigating the influence of outliers, ensuring that no rows are excluded and preserving the continuity of the time series.
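A sketch of this standardisation step using scikit-learn’s QuantileTransformer; the data and parameter values are illustrative, and in practice the transformer would be fitted on the training portion only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Toy stand-in for the (rows x 21 features) dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 21))
X[::100] += 50  # inject outliers similar to maintenance spikes

# Rank-based remapping of each column's distribution: outliers are
# compressed into the tails, and no rows are dropped, so the
# continuity of the time series is preserved.
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=500)
X_scaled = qt.fit_transform(X)
```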
The analysis of the target class, namely “First Alarm” (“Primo Allarme” in Italian), required specific attention, since multiple types of “First Alarm” occur during the time frame considered. They were analysed by plotting the general trend of the occurrences over time and studying each error’s frequency. This is illustrated in Figure 3, where the x axis represents time and the y axis represents the “Error” type.
In this way, a total of 54 distinct classes were counted: “ErrorCode 0” identifies normal machine behaviour, while the others represent either general warnings or actual machine malfunctions. The number of occurrences of each class during the time frame was then counted and is reported in Table 5. It is important to note that most of them represent only warnings or general information about the machine’s production process.
The analysis focused specifically on critical errors associated with paper breakage because, when this happens, the recovery procedure takes time and effort, reducing the production rate until the problem is solved. The company’s domain experts suggested that the most important errors to identify in advance were those related to paper tension and “paper breaking protection”: codes 2509, 2556, 2557, 2558, and 2559.
Therefore, it was necessary to isolate these five target classes from the others. An analysis was performed to identify potential time–cause–consequence relationships among errors; however, no hidden patterns were detected. Furthermore, discussions with domain experts did not provide any information on how to merge a certain “unwanted” class into any of the five mentioned above. Further attempts were made to create and analyse a meta-dataset with meta-classes, but the high variability of the time-series errors prevented further exploration of this approach.
Given these complexities, the principle of Occam’s razor was applied, favouring simplicity in dataset management. This approach involved reclassifying all errors, except those previously identified, as “0” to represent normal system behaviour. The result of this choice is an unbalanced time-based dataset, where most of the entries represent “normal functioning” of the machine and a small fraction of it includes multiple errors given by the five distinct classes. This problem will be discussed in
Section 4.5.
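As a minimal illustration of this reclassification in pandas, assuming the column layout from the earlier sketch; the codes shown besides the five targets are toy values.

```python
import pandas as pd

# The five critical classes retained; every other code collapses to "0"
# (normal behaviour). The column name "class" is an assumption.
TARGET_ERRORS = {2509, 2556, 2557, 2558, 2559}

labels = pd.Series([0, 1024, 2509, 3, 2557, 0])  # toy example codes
labels = labels.where(labels.isin(TARGET_ERRORS), other=0)
print(labels.tolist())  # [0, 0, 2509, 0, 2557, 0]
```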
The proposed approach ensures that the results can be reproduced at any time by adopting a pipeline-based research methodology. To achieve this, Kedro was used to manage the various stages of the entire process, from data preparation to neural network training and testing (represented in Figure 4 as a Directed Acyclic Graph, or DAG). The data were processed with Pandas and subsequently visualised with Seaborn 0.12.1. Lastly, the AI models discussed in Section 4.4 were implemented using PyTorch 2.1.0, and the entire process was deployed on a proprietary cloud service named “Alida”.
Kedro is an open-source Python 3 framework hosted by the Linux Foundation that allows data science pipelines to be built; it was created at QuantumBlack to reduce technical debt in data science experiments, easing the transition from experimentation to production [59]. Its flexibility and integration with modern Integrated Development Environments (IDEs) and cloud hosts enable scientists and engineers to easily analyse data and train AI models with a focus on reproducibility, reporting, and deployment on remote hosts. Another important feature is the possibility of visualising and debugging pipelines using integrated visualisation tools.
Figure 4 represents the entire training pipeline that we built using Kedro. From top to bottom, it shows the data flow from the loading of the representative parameters (marked with a purple line) to the creation of the result images and tables proposed in this work. For the sake of simplicity and clarity, we have omitted the portion of the pipeline related to data preprocessing and cleaning, as it involves standard procedures with no novel contributions. Additionally, to avoid redundancy, we omitted the pipeline replication for all five models.
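For illustration, a condensed Kedro pipeline in the spirit of Figure 4 might look as follows; the node functions and dataset names are hypothetical stand-ins for the project’s actual catalog entries.

```python
from kedro.pipeline import Pipeline, node

# Stub node functions standing in for the real preprocessing/training steps.
def make_windows(clean_data, params):
    ...

def train_model(windows, params):
    ...

def evaluate_model(model, windows):
    ...

def create_pipeline() -> Pipeline:
    # Each node maps named catalog datasets to outputs; this explicit
    # wiring is what makes runs reproducible and lets Kedro render the
    # DAG shown in Figure 4.
    return Pipeline([
        node(make_windows, ["clean_data", "params"], "windows"),
        node(train_model, ["windows", "params"], "model"),
        node(evaluate_model, ["model", "windows"], "metrics"),
    ])
```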
4.4. The Deep Learning Approach Implementation
The previous section discussed how the data are distributed over time, which features are important to consider, and how many errors need to be identified to prevent the machine from malfunctioning. This scenario suggests a multi-classification challenge: predicting the occurrence of an error as far ahead of time as possible.
To correctly classify the incoming data, a deep learning approach was chosen, as suggested in several recent studies [60,61]. Among the many neural networks available in the literature, the LSTM was selected as it is particularly well suited to handling the irregular and time-dependent data generated by industrial sensors. Additionally, to mitigate the vanishing gradient problem, an attention mechanism was used to enhance the model’s ability to focus on the most informative parts of the data [62].
Therefore, the generation of the functional EWS is based on the exploration of various strategies, including the following:
Creating a single neural network able to multi-classify the five errors;
Creating multiple neural networks, each specialised in a single error.
To justify the adopted approach, it was necessary to further investigate the distribution of the five error classes. The number of times a signal changed from “0” to any of the error classes was counted; the resulting count is reported in the row “Total occurrences” of Table 6.
It can be observed that the total number of error events changes drastically from one class to another.
The first strategy involved the use of a single neural network to multi-classify the errors in a naturally unbalanced dataset with an unbalanced number of error classes. Additionally, this would result in a large model, both in size and in computational complexity. Although the entire system is not meant to operate under real-time requirements, this approach limits the extensibility of the system, leading to the need for a single node with high computational capabilities.
Adopting the second strategy requires extracting chunks of data related to each specific error class from the dataset and feeding the models individually. The dataset is then passed to each individual neural network model, and a finalisation step highlights the most probable error class (e.g., comparing all network outputs and picking the one with the highest elicitation). Under these hypotheses, the second approach was adopted, as the research team considered it more extensible and robust, while being aware that some neural networks might perform better than others due to the different numbers of examples provided.
To extract a discrete number of TU rows with which each neural network would be trained and tested, a time window around each error occurrence was considered. Specifically, all the features up to 300 [s] (5 min) before the error were taken, along with the first five TU elements following the occurrence. This extraction window was agreed with the domain experts as a result of the intrinsic machine behaviour during the paper blocking process (e.g., the slowing of the embosser rolls). Furthermore, since the whole dataset exhibits an imbalance between error and non-error classes, retaining only a portion around the error class helps to reduce overfitting. Each time frame segment encompassed a variable number of TUs, including the machine’s “normal functioning” and relevant feature values describing the error class. The average TU number was calculated for each error-specific dataset and the results are reported in Table 6. Henceforth, each dataset was treated independently, divided into a portion for training and a portion for testing; an additional fraction of the training dataset was reserved for validation during the training process. Before each training session, the data chunks were shuffled, and training lasted for a maximum of 500 epochs, with an early stop mechanism implemented to halt training if the validation error exhibited an increasing trend across epochs. A fixed Learning Rate (LR) was used with the Adam optimiser to update the weights, while the Mean Square Error was used as the loss function.
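A minimal sketch of this training procedure follows; the learning rate and early-stopping patience shown are placeholders (the exact values are not restated here), and `model`, `train_loader`, and `val_loader` are assumed to exist.

```python
import torch

def train(model, train_loader, val_loader, max_epochs=500, patience=10, lr=1e-3):
    """Adam + MSE training with early stopping on the validation trend."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    best_val, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimiser.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimiser.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1           # validation error keeps increasing
            if stale >= patience:
                break            # early stop
    return model
```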
To find the most suitable network architecture, the Ray Tune library was used with PyTorch to define a test matrix with a grid search over the number of LSTM layers, the number of attention heads, and the dropout rate [63]. The total size of the network and the performance obtained by each prototype were compared. To manage the high computational complexity of the models and data, mini-batch training was applied. The results of the prototypes were contrasting, as increasing the complexity of the network did not bring any relevant improvement on the test set. Therefore, considering that the final networks would be deployed on edge machines with limited computational resources, we opted for the smallest architecture with the highest performance. The selected networks share five stacked LSTM layers with 100 hidden units to extract sequential dependencies. Furthermore, a single layer with 10 attention heads concentrates on the most important parts of the data. A fully connected layer then processes the attention output, followed by a linear layer and a ReLU activation. The architecture is finalised with another linear layer, and dropout is applied at the rate selected by the grid search.
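The following PyTorch sketch reflects this architecture; the input size, dropout value, output dimension, and the exact placement of the dropout layer are assumptions where the text leaves them open.

```python
import torch
from torch import nn

class LSTMAttentionClassifier(nn.Module):
    """Sketch: five stacked LSTM layers with 100 hidden units, one
    10-head self-attention layer, and a small fully connected head."""

    def __init__(self, n_features=21, hidden=100, heads=10, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=5, batch_first=True)
        self.attention = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.fc = nn.Linear(hidden, hidden)   # processes the attention output
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),             # per-class elicitation signal
        )

    def forward(self, x):                     # x: (batch, window, n_features)
        seq, _ = self.lstm(x)
        ctx, _ = self.attention(seq, seq, seq)  # self-attention over time
        return self.head(self.fc(ctx[:, -1]))  # use the last time step
```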
The general approach for LSTM applications suggests converting the dataset into sliding windows of size (n, p), where n represents the total number of TUs in the window and p denotes the TUs to predict. In discussions, the company’s domain experts suggested looking for the error a few minutes before the event occurrence and considering a few instants after its rising. Hence, the window sizes n and p were designed accordingly, to allow a small warning gap.
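As an illustration, the sliding-window segmentation can be sketched as follows; n and p stand for the values agreed with the domain experts.

```python
import numpy as np

def sliding_windows(X, y, n, p):
    """Segment a time series into overlapping (input, target) pairs:
    n past TUs of features are used to predict the class p TUs ahead."""
    inputs, targets = [], []
    for i in range(len(X) - n - p + 1):
        inputs.append(X[i : i + n])        # window of n consecutive TUs
        targets.append(y[i + n + p - 1])   # label p steps past the window
    return np.stack(inputs), np.array(targets)
```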
4.5. The Performance Measurement
Each neural network was independently trained 10 times, according to the parameters specified in Section 4.4; in this section, the best results among those runs are presented. For each model, the confusion matrix (along with accuracy), precision, recall, and the area under the receiver operating characteristic curve (AUROC) were calculated using the test set extracted prior to the training phase. Following the same approach, the retrieved results are reported in Table 7, Table 8, Table 9, Table 10 and Table 11. In particular, each row reports the specified metrics at the given forecast horizon.
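These per-horizon metrics can be computed with scikit-learn as sketched below; the array names are illustrative, standing in for one model’s test labels, thresholded predictions, and raw outputs at a single forecast horizon.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """Return the metric set reported in Tables 7-11 for one horizon."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
    }
```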
Model 2556 consistently demonstrates superior accuracy performance across all forecast horizons, indicative of its robust predictive capabilities. Models 2509 and 2557 share similar accuracy trends, whereas Models 2558 and 2559 exhibit slightly diminished but still commendable accuracy levels.
Precision analysis reveals that Model 2556 achieves notable precision values, particularly at t + 2 and t + 3, underscoring its proficiency in accurately identifying positive instances. In contrast, Model 2557 displays variability in precision across forecast horizons. Models 2509, 2558, and 2559 manifest lower precision values, which implies a higher incidence of false positives.
In terms of recall, Model 2556 performs consistently well, attaining elevated recall values. Model 2559 shows sustained recall values, reflecting an effective equilibrium between false negatives and true positives. Models 2509, 2557, and 2558 display disparate recall outcomes across forecast horizons.
Regarding discriminatory power, Model 2556 maintains consistently high AUROC values, emphasising its ability to distinguish between classes. Model 2559 also demonstrates commendable performance in AUROC. Models 2509, 2557, and 2558 exhibit variability in AUROC values on different forecast horizons.
Additional considerations can be made by relating these results to the training data presented in Table 6. The superior performance of Model 2556 across various metrics aligns with its relatively large training dataset size (14 chunks with an average of 234 entries per chunk). The abundance of data may have contributed to the robust learning and generalisation capabilities of this model.
The similar accuracy patterns observed in Models 2509 and 2557 might be attributed to their comparable dataset sizes (56 and 41 chunks, respectively). However, nuanced differences in precision, recall, and AUROC suggest that other factors, such as the inherent characteristics of the errors or the model architectures, contribute to their distinct performance.
The slightly lower but reasonable accuracy of Models 2558 and 2559 corresponds to their smaller dataset sizes (12 and 37 chunks, respectively). Despite having fewer data entries, these models exhibit notable predictive capabilities, indicating effective learning from the available information.
Generally, the correlation between the size of the dataset and model performance is evident, and larger datasets often lead to better results. However, the influence of other factors, such as error characteristics and the quality of the data, cannot be overlooked.
The final step of the study involved merging the predictions of the neural networks to accurately identify a single error event. To achieve this, the entire original dataset was processed and streamed independently to each neural network. By applying the same normalisation functions used during training and maintaining the same feeding window, the predictions were plotted at different levels. Part of the results of this process is presented in Figure 5, which should be read from bottom to top and from left to right to understand the behaviour of the signals. Each row represents a different prediction time frame: the top one shows the prediction results of the neural network closest to the current event, while the bottom one depicts those related to the longest forecast. The total time frame taken into account spans about 300 TUs, and the prediction of each neural network is drawn in a different colour.
The squared green signal represents the actual error that occurred during that specific time frame, and its value is labelled in the upper right corner. In addition, a threshold, represented as a horizontal dotted line, was set to highlight signals that exceed this value throughout the time frame. The expected ideal behaviour is that the neural network corresponding to the error reported at the top right of the squared signal (in this specific case, 2509) should rise in advance and keep the same high value until the moment of the error occurrence. As the other neural networks are trained on different target classes, they should not show any activation. The bottom row shows that multiple neural networks seem to recognise an error event, including 2509 (coloured red). Moving up the rows, the neural network for error 2558 stops its elicitation, as does the neural network for error 2556. Approaching the top row, the neural network 2509 confirms its prediction and extends its accuracy, intercepting the actual occurrence of the error. After this, it correctly stops the recognition, returning to a “zero” state.
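A minimal sketch of this finalisation step is shown below; the threshold value and the model call interface are assumptions, not the deployed implementation.

```python
# Stream each window to every specialised network, threshold the
# elicitations, and keep the strongest signal as the merged prediction.
ERROR_CODES = [2509, 2556, 2557, 2558, 2559]

def merge_predictions(models, window, threshold):
    """models: dict mapping error codes to trained networks."""
    scores = {code: float(models[code](window)) for code in ERROR_CODES}
    active = {c: s for c, s in scores.items() if s >= threshold}
    if not active:
        return 0                        # "normal functioning"
    return max(active, key=active.get)  # highest elicitation wins
```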
5. Discussion
5.1. Theoretical Implications
This study contributes to EWS research [51,64,65], with a specific focus on the implementation of an Intelligent Early Warning System (IEWS) in the manufacturing sector, distinguishing it from previous works such as [31,32]. Although similar to [54], which employs a Bayesian Hypothesis Testing-based model for anomaly detection, this study shifts the focus to predicting the most frequent error classes using deep learning techniques and the power of the attention mechanism.
This shift introduces a new perspective in the theoretical discourse around EWSs, moving from anomaly detection to event anticipation, thereby aligning more closely with the proactive needs of predictive maintenance strategies.
The study also contributes methodologically by proposing a multi-model classification strategy, where each model is specialised in identifying a specific class of critical errors. This approach departs from the common use of monolithic classifiers and opens new avenues for research on modular and scalable architectures for industrial fault prediction. The integration of attention layers into LSTM networks further supports the theoretical discourse on temporal feature selection, allowing the model to identify which segments of the time series carry the most predictive value, a mechanism particularly relevant in scenarios where the importance of historical data fluctuates over time.
5.2. Practical Implications
The approach described in this work brings numerous advantages in real-world scenarios. The adoption of Kedro and the pipeline-based workflow allows the results to be replicated, and new ones extracted, using the very same code. In fact, each node composing the pipeline can be reused multiple times on different data while outputting the same data structure. This means that the system is scalable: new errors can be mapped and new models trained simply by adding a new entry to the data to extract.
Similarly, new entries can be quickly analysed, updating the initial statistical analysis and leading to more insightful details about the general behaviour of the data. With this additional knowledge, the models can be updated all at once, retrieving on-screen training/testing performance, and then deployed on edge machines.
Having an EWS active and running with these automatic “updates” provides the core value of this work:
Each model warns the operator in advance about possible paper jams, so that immediate recovery procedures can start;
Constantly updated models with improved discrimination capability can enlarge the forecast horizon;
The developed pipeline can be extended to different machine parts to monitor new zones;
System downtime is lowered and production improves as models are fed with new data.
Moreover, the adoption of Kedro, which allows the automatic deployment of the trained neural networks and their usage on real-time clients (such as those close to the machine), brings improvements at many levels.
Finally, the findings of this study demonstrate that legacy machines, despite not being originally designed with advanced monitoring capabilities, can be enhanced through the integration of sensors and AI models without requiring a complete replacement of existing infrastructure. This approach presents a cost-effective opportunity for manufacturing companies seeking to modernise their production systems with advanced predictive technologies, reducing technological upgrade costs while improving operational efficiency.
5.3. Limitations and Future Works
The current work presents some important limitations that will be addressed in future studies. First, the result metrics of all networks were plotted over time to give a general insight into how the entire system works.
From the statistical analysis in Figure 6, it appears evident that the models have an important improvement margin from a precision point of view. As is also clear from Figure 5, false positive events are frequent throughout the test frame. At the same time, the high accuracy levels suggest a tendency to overfit. Despite the shared network architecture, the results vary significantly across the models; it can be assumed that many factors contribute to this limitation.
Despite these challenges, tests on the entire dataset present highly variable results among both the models and the prediction times, generally corresponding to the AUROC values reported in the metric tables for each model. The main source of confusion is the presence of many false positives and false negatives, as reported in Figure 5, but this problem could be mitigated by improving both the quality and quantity of the data through more accurate class handling.
All the cited limitations point to several possible future scenarios to overcome them and improve the reported results. First, a more detailed and complete dataset should be arranged, along with a cleaner error configuration and description. The high number of classes/errors reported in Figure 3 is considered to have strongly polluted the behaviour of the models, as the features drastically change during the overall classification process. This implies not only setting the class values to “0”, but also setting the feature levels to a low excitation, to support the model classification process. Second, the current classification process does not consider some of the “drag” that the machine exhibits during the stopping process: a problem with the same error code might have a different stopping time and speed, which might call for a different type of classification and/or data extraction. Third, the continued collection of data to feed the neural networks will benefit overall performance; however, this is a time-consuming process, as errors rarely occur.
In conclusion, it is also acknowledged that LSTM networks have long been employed in similar scenarios. However, in this particular case, as reported in Figure 6, the system lacks robustness. Therefore, a possible improvement involves the implementation of hybrid models that combine different predictive methods to classify error classes, or the use of advanced techniques such as transfer learning.
6. Conclusions
EWSs represent essential tools for enabling domain experts to anticipate potential system failures. This study investigated the implementation of such a system in a real-world industrial context, using data collected from the embosser of a legacy industrial printing machine. The data required significant preprocessing due to the heterogeneity and asynchronous nature of the acquisition and communication layers.
After a review of the literature on EWS and fault prediction, and following an in-depth analysis of the sensor variables in collaboration with domain experts, a deep learning framework was designed to predict specific classes of paper-related errors. LSTM networks enhanced with attention mechanisms were selected as the most suitable architecture for this task. A multi-model classification approach was adopted, where five distinct neural networks were each trained to identify a specific critical error class.
The development process was managed using a modular pipeline built with Kedro, ensuring both reproducibility and scalability. This framework also facilitated seamless deployment of the trained models on cloud infrastructure, allowing automated retraining and updates as new data become available.
Model performance was assessed using standard metrics, including accuracy, precision, recall, and AUROC. The best-performing model achieved an AUROC of 0.93 and a recall above 90%, with the ability to predict fault events up to 10 time units in advance. While some degree of overfitting was observed, particularly in models trained on smaller subsets, this is attributed to the limited number of training instances for certain error classes.
To assess robustness in real-world conditions, the full dataset was partitioned and streamed sequentially to each model, simulating a real-time deployment scenario. The results confirmed the models’ capacity to anticipate faults within a proactive maintenance horizon. In conclusion, this study offers several practical insights for the design and deployment of intelligent EWSs in industrial environments: (1) the proposed data pipeline can be reused across different machines and operational contexts; (2) the AI models are modular and easily re-trainable on new error types without modifying the overall structure; (3) automatic retraining enables continuous improvement in predictive accuracy; and (4) the architecture is portable and adaptable, making it suitable for deployment on other legacy industrial systems. Researchers and practitioners in the fields of Big Data analytics, cloud-based predictive maintenance, and industrial AI applications can leverage these results to design and address future research and industrial activities.