1. Introduction
The steel industry continually faces challenges related to maintaining product quality amid a complex production process [
1,
2]. In rolling mills, defects may emerge at various stages, from the steel mill through logistics (storage of the billets), rolling, and, ultimately, sales (storage and transportation of the rolls). Traditionally, experts on the production line classify products using a range of quality labels. This project proposes an innovative solution: an intelligent assistant capable of replicating these expert decisions before defects occur. AI tools are widespread across the industry nowadays [
3,
4], so their adoption in this context is timely and justified.
Recent studies have shown that machine learning approaches can significantly enhance predictive maintenance and process optimization in manufacturing environments. For instance, in [
5], a multiple classifier approach was shown to improve failure prediction accuracy while reducing unexpected downtime and associated costs. The integration of cyber-physical systems, as highlighted in [
6], has further driven advancements in Industry 4.0-based manufacturing, enabling a seamless connection between physical processes and digital analytics to support robust, data-driven maintenance strategies. Within this context, machine learning techniques have demonstrated particular potential in industrial settings such as steel production, where process complexity and data heterogeneity are significant. Similarly, ref. [
7] highlights how AI-driven classification can automate quality control decisions in the hot rolling process, underscoring the value of these models in standardizing decision-making and minimizing human error.
Time series modeling methods, such as Long Short-Term Memory (LSTM) neural networks, have demonstrated superior performance in predicting mechanical properties in steel production. Ref. [
8] applied LSTM models to forecast yield strength, tensile strength, and elongation of deep-drawing steel, showing that capturing temporal dependencies from sequential process data significantly improves prediction accuracy compared to other deep learning architectures. Moreover, LSTM-based architectures have demonstrated effectiveness in capturing time-series dependencies in manufacturing data. In [
9], the authors proposed a quality prediction model that integrates AdaBoost with LSTM networks and rough set theory to improve the prediction accuracy. The study also confirms the relevance of ensemble learning and hybrid models in dealing with high-dimensional industrial data.
The integration of multi-source sensor data is also gaining momentum in steel quality management. For instance, ref. [
10] presents a deep learning-based approach to predict the mechanical properties of hot-rolled steel plates using process parameters from multiple stages. Their model captures complex nonlinear relationships between input features and material properties, illustrating the value of multi-source data integration in production monitoring. It underscores the effectiveness of data-driven modeling in steel manufacturing, leveraging rich sensor information to support quality decisions across departments.
From a systems architecture perspective, ref. [
11] presents a decision support framework that integrates random forest models with domain expert knowledge encoded as ontological rules for production supervision in steel plants. Their work exemplifies a hybrid approach that enhances decision reliability by combining data-driven classification with semantic reasoning. Our system similarly incorporates ontological rules and neural network inference, validating the use of this hybrid methodology.
Preventing defects is crucial, not only for improving profitability by reducing waste and rework but also for significantly lowering CO2 emissions by minimizing unnecessary production and energy consumption [
12,
13,
14]. Furthermore, early detection and prevention of defects help extend the operational life of rolling mills by reducing wear and tear caused by processing faulty materials. This project encompasses the entire pipeline, from data acquisition to decision simulation, using real production data. The main objective is to automate expert decisions through modular AI components, including event detection, expert knowledge extraction, and a dual-layered inference engine (departmental and interdepartmental). The latter component aggregates stage-specific decisions and applies explainability mechanisms to identify defect propagation and interdependencies across production stages.
This paper outlines the technical approach and presents the evaluation of the system’s performance, emphasizing both departmental and interdepartmental decision-making. The results demonstrate the potential of this approach to reduce scrap rates, improve consistency in quality evaluations, and empower experts with an anticipatory decision-support tool. In doing so, it contributes to advancing industrial AI solutions for predictive quality assurance.
2. Methodology
2.1. Data Acquisition and Preprocessing
The system began with the acquisition of heterogeneous sensor data and input provided by human experts from each department (steel mill, logistics, rolling mill, and sales). Over a period of three years, these experts contributed their decisions and corresponding quality labels, which served as the foundation for the system’s classification tasks. In parallel, the experts developed ontological rules based on the decisions they made throughout this period, using statistical software. These rules were formulated to codify their expert knowledge and guide the automated decision-making process. This workflow is illustrated in
Figure 1.
The data, provided by the manufacturer in anonymized form due to non-disclosure agreements, consisted of 278,312 samples, each containing readings from multiple sensors positioned along the production process; because of the anonymization, it was not possible to identify the specific sensor corresponding to each variable. Certain production stages are equipped with more sensors and provide more extensive measurement data than others; naturally, there are more sensors in stages where the material is being transformed (the steel mill and rolling mill) than in stages where the product is stored or transported. The sensors in these stages provide information such as the following:
Minimum, maximum, and average water flows.
Minimum, maximum, and average temperatures in the different lamination blocks.
Minimum, maximum, and average temperatures of the RSM (finishing blocks).
Tundish oscillation frequency.
The raw sensor data were collected in a single CSV file over the three-year period and segmented according to production stages, adding the defect information provided by the EDDYEyes system of ISEND S.A., shown in
Figure 6. This is an eddy current system that includes visual information about the defects and is capable of measuring their severity; this type of technology is among the most valuable for defect detection in the steel industry [
15,
16,
17]. In this “defectology dataset”, the Quality Indicator (QI) is a critical metric that quantifies the number and severity of defects for each billet/roll.
Regarding data preprocessing, clearly anomalous outliers were eliminated using an interquartile range (IQR) method. Moreover, the data were processed to make them easily interpretable by any algorithm (through standard normalization of the dataset and encoding of certain variables). Notably, only one value per variable was recorded for each product, as the values hardly vary during the manufacture of each billet or roll. This uniformity greatly facilitated both the analysis and subsequent processing of the data.
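As a reference for this preprocessing stage, the following minimal sketch illustrates IQR-based outlier removal, standard normalization, and encoding of categorical variables; the column names and the 1.5 × IQR fence factor are illustrative assumptions rather than the exact production configuration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def remove_iqr_outliers(df: pd.DataFrame, cols, factor: float = 1.5) -> pd.DataFrame:
    """Drop rows whose values fall outside the interquartile-range fences."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - factor * iqr, q3 + factor * iqr)
    return df[mask].copy()

def preprocess(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """IQR outlier filtering, standard normalization, and label encoding."""
    df = remove_iqr_outliers(df, numeric_cols)
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    for col in categorical_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

# Hypothetical usage (column names are placeholders, not the anonymized variables):
# df = preprocess(pd.read_csv("production_data.csv"),
#                 numeric_cols=["water_flow_avg", "rsm_temp_max"],
#                 categorical_cols=["stage"])
```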
2.2. Event Filtering and Expert Knowledge Extraction Modules: Structure and Operation
To address the twin challenges of detecting sensor anomalies linked to quality degradation and extracting information from the labels chosen by the human experts, this part of the system was built using a two-module architecture. The design of these modules was driven by the need for flexibility, clarity, and scalability, leading to a final structure comprising two distinct but complementary pipelines.
2.2.1. Event Filtering Module
The objective of this module is to process raw sensor data and identify significant deviations that may indicate a potential drop in product quality. This is critical because even slight deviations (when sustained) can signal underlying issues that affect the final quality of the steel product.
The module is architected as a multi-stage pipeline with a clear separation of responsibilities, organized into the following steps:
Reads raw CSV data and a complementary criteria file, which defines the expected data types, valid ranges, and grouping intervals. This ensures that all incoming data meet predefined standards.
Corrects any formatting inconsistencies and applies basic corrections as needed. This step ensures the data’s integrity, preserving the raw relationships for later processing.
The data are segmented into fixed time windows (typically 4-h windows, limited to the most recent 100 points) to capture temporal behavior while smoothing out transient fluctuations.
Within each window, statistical metrics (e.g., minimum value, mean, and standard deviation) are computed. A logarithmic transformation is applied for smoothing, and an anomaly is flagged when the window’s value exceeds the mean plus three standard deviations (this criterion was adopted because it yielded satisfactory results). A threshold of at least 10 anomalous points is enforced to eliminate spurious spikes.
The final output is two-fold: a CSV file that adds Boolean flags (indicating the presence of anomalies) to the original dataset, and graphical representations for quick visual analysis.
This modular design, whose operation is illustrated in
Figure 7, was chosen to isolate each processing step, thereby increasing maintainability and allowing for independent tuning or replacement of submodules. The clear delineation between file handling, transformation, windowing, and statistical detection ensures robustness and scalability. It also facilitates debugging and further enhancements as new types of sensor data or criteria emerge.
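A minimal sketch of the windowing and detection logic described above is shown below, assuming one plausible reading of the criterion (points above the window mean plus three standard deviations of the log-transformed signal, with at least 10 such points per window); the criteria-file handling and plotting steps of the real module are omitted, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

WINDOW = "4H"          # 4-hour windows, as in the module description
MAX_POINTS = 100       # keep only the most recent points per window
MIN_ANOMALOUS = 10     # at least 10 anomalous points to flag a window

def flag_anomalies(df: pd.DataFrame, sensor_col: str, time_col: str = "timestamp") -> pd.DataFrame:
    """Add a Boolean anomaly flag per sample, following the mean + 3*std criterion.

    The time column is assumed to be a datetime column; sensor values are assumed positive.
    """
    df = df.sort_values(time_col).copy()
    # Logarithmic transformation for smoothing.
    df["_log"] = np.log1p(df[sensor_col])
    flags = []
    for _, window in df.groupby(pd.Grouper(key=time_col, freq=WINDOW)):
        window = window.tail(MAX_POINTS)
        mu, sigma = window["_log"].mean(), window["_log"].std()
        point_flags = window["_log"] > mu + 3 * sigma
        # Discard spurious spikes: keep the flags only if enough points exceed the threshold.
        if point_flags.sum() < MIN_ANOMALOUS:
            point_flags[:] = False
        flags.append(point_flags)
    df[f"{sensor_col}_anomaly"] = pd.concat(flags).reindex(df.index, fill_value=False)
    return df.drop(columns="_log")
```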
2.2.2. Automatic Expert Knowledge Extraction Module
In parallel with anomaly detection, it is necessary to extract and codify the decision-making process of the human experts. Over three years, experts in each department provided quality labels and developed ontological rules to guide their decisions. This module’s purpose is to analyze these expert decisions, compare them against the formal rules, and generate a refined knowledge database. The module is organized as a sequential pipeline.
Historical records of expert decisions are collected along with the ontological rules that were formulated concurrently. By combining these two sources, the module ensures that both practical decision outcomes and formalized expert knowledge (the rules) are represented.
The module compares the decisions made by the experts with the outcomes predicted by the ontological rules (note that the ontological rules are statistical representations of past choices, so their outputs do not necessarily coincide with the experts’ decisions). Instances of coincidence, discrepancy, or missing rule output are identified.
In cases where the ontological rules do not generate a clear decision or when discrepancies arise, the module applies techniques based on neural networks to refine the rule set; this approach has been widely used in industry for its ability to learn from complex data [
18]. This step is critical for determining what the “ideal” decision would be given the complete set of sensor inputs. It is important to note that the module should never contradict a bad quality label assigned by the human experts; its aim is to be even more restrictive than the experts while learning from their knowledge.
The refined decisions, now aligned with both sensor data and expert insight, are compiled into a new CSV file that can be easily added to the knowledge extraction module database. This final database is then used to train the neural network inference modules (departmental and interdepartmental decision-making), ensuring that the system learns an accurate representation of expert decision-making.
This structure of the expert extraction module is represented in
Figure 8 and was chosen to address the inherent complexity of human decision-making. The experts’ decisions are influenced mainly by visual inspection of the product and by the information provided by the EDDYEyes system developed by ISEND S.A., not by the sensor data, which makes extracting expert knowledge from the sensor readings a complex task. By separating data collection, discrepancy analysis, and output generation, the module allows for a systematic and iterative improvement of the knowledge base. This design enhances transparency, making it easier to pinpoint where expert judgments diverge from formal rules, and supports continuous refinement. Ultimately, the refined database served as the cornerstone for training our decision-making models, ensuring that they can replicate expert decisions or even improve on them.
To achieve this behavior, we built the module with the following architecture. The model starts with an input layer whose dimension corresponds to the number of parameters in each stage (steel mill → 28, logistics → 8, rolling mill → 40, sales → 4); this is followed by three hidden layers, each with twice the dimension of the input layer, and, finally, an output layer whose dimension corresponds to the number of labels for each stage. This architecture was deliberately kept simple (adding more layers did not improve the results); a dense, fully connected structure proved the most efficient for extracting the information needed to classify the products into the different labels.
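A minimal Keras sketch of this dense classifier is given below, instantiated for the rolling mill configuration (40 inputs); the ReLU/softmax activations, optimizer, and loss are assumptions, since the text only specifies the layer dimensions, and the number of labels per stage is taken from the departmental descriptions in Section 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_expert_extractor(n_inputs: int, n_labels: int) -> keras.Model:
    """Dense classifier: input -> 3 hidden layers (2x input size) -> label probabilities."""
    model = keras.Sequential([
        layers.Input(shape=(n_inputs,)),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(n_labels, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Per-stage input dimensions reported in the text:
# steel mill -> 28, logistics -> 8, rolling mill -> 40, sales -> 4
rolling_mill_model = build_expert_extractor(n_inputs=40, n_labels=5)
```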
Together, these two modules form the backbone of the acquisition and treatment of the collected data. Their carefully engineered modular architectures not only provide clear operational benefits but also ensure that each stage of processing, whether filtering anomalies or extracting expert knowledge, can be independently optimized and updated as needed in the future. This design was essential to handle the diversity of sensor data and the complexity of expert decision patterns in a robust, scalable manner.
2.3. Departmental and Interdepartmental Inference Modules: Architecture, Operation, and Justification
The motivation behind these inference modules is two-fold. At the departmental level, our goal is to replicate the decisions made by human experts. In each production stage, experts assign quality labels based on the apparent final quality of the material. By forecasting sensor data and processing event information, our system can predict these quality labels before the product is processed, avoiding the manufacture of low-quality product that would be scrapped. The interdepartmental module, in turn, is in charge of integrating the outputs from the different stages, allowing us to identify the root causes of quality degradation, facilitating targeted corrective actions, and collecting new information that could be used to retrain the departmental modules in the future.
2.3.1. Departmental Inference Module
Each departmental module begins by processing the pretreated sensor data using Long Short-Term Memory (LSTM) networks. The data, organized into fixed-length windows (for example, 50 time steps), are input to the LSTM network, which forecasts future sensor readings and process characteristics.
LSTM networks were chosen for their proven ability to capture long-term dependencies and temporal patterns in sequential data, a capability widely demonstrated in industry [
19,
20]. This property is critical in our context: sensor values remain nearly constant for each billet or roll, but fluctuations do occur between products, and if these fluctuations follow the same tendency for some time, the deviation becomes evident; even small deviations sustained over time may indicate an impending quality issue. The attention mechanisms help the network focus on the most relevant segments of the time window [
21], ensuring that subtle yet critical deviations are not overlooked.
The predictions from the LSTM component, along with the Boolean flags generated by the event filtering module for these predictions, serve as input to a Multilayer Perceptron (MLP); this structure is illustrated in
Figure 9. The MLP is trained using, as targets, the outputs of the expert knowledge extraction module, which combine historical expert decisions with the ontological rules formulated by those experts. This network processes the multidimensional input through several dense layers with nonlinear activation functions (e.g., ReLU) and ultimately outputs a categorical quality label (such as OK, BLOCK, ALARM, or SCRAP).
The choice of an MLP for decision emulation stems from its capacity to model complex, non-linear relationships between sensor forecasts and expert-derived quality labels. By integrating the outputs from the LSTM network and the filtered anomaly events, the MLP captures the interactions between multiple sensor variables and decision criteria. This layered approach not only mirrors the human expert’s process but also provides the anticipatory power required to trigger early corrective actions.
To achieve the desired results, we chose the following architecture for this departmental inference module. The module starts with an LSTM layer with 64 units, followed by a 10% dropout layer. Next comes a custom attention layer, consisting of a dense layer with a softmax activation that calculates the attention weights, which are then used to build a context vector (a weighted representation of the original input) at the output of the layer. This attention layer is followed by a dense layer with 32 units and ReLU activation and, finally, an output dense layer with linear activation.
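The following Keras sketch reproduces the sequence of layers just described (LSTM with 64 units, 10% dropout, a dense-plus-softmax attention layer producing a context vector, a 32-unit ReLU layer, and a linear output); returning the full LSTM sequence to the attention layer, the optimizer, and the loss are assumptions, and the MLP decision network that consumes these forecasts is not shown.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class SimpleAttention(layers.Layer):
    """Dense + softmax scores over time steps, collapsed into a context vector."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1)

    def call(self, inputs):
        # inputs: (batch, time_steps, features)
        weights = tf.nn.softmax(self.score(inputs), axis=1)  # attention weights per step
        return tf.reduce_sum(weights * inputs, axis=1)       # weighted representation (context vector)

def build_departmental_forecaster(window: int, n_features: int, n_outputs: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.LSTM(64, return_sequences=True),  # 64 units; sequence kept for the attention layer
        layers.Dropout(0.10),                    # 10% dropout
        SimpleAttention(),                       # custom attention layer
        layers.Dense(32, activation="relu"),
        layers.Dense(n_outputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# e.g., 50-step windows as mentioned above; feature and output counts are placeholders.
forecaster = build_departmental_forecaster(window=50, n_features=40, n_outputs=40)
```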
2.3.2. Interdepartmental Inference Module
The interdepartmental module aggregates the outputs from the departmental modules. It employs a random forest algorithm to analyze the combined data, specifically correlating predictions and quality labels from earlier stages (such as the steel mill and logistics) with the final quality outcomes observed in later stages (such as the rolling mill). This module uses ensemble methods to compute the importance of the various features and to determine which preceding factors contribute most significantly to defects at later stages, working on batches of the chosen size, as shown in
Figure 10. With the departmental module, we know the features that are relevant to the final label in each stage, but the aim of the interdepartmental module is to refine the departmental modules by distinguishing the bad labels at each stage that cannot be explained by the variables of the stage itself. Applying techniques such as SHAP (SHapley Additive exPlanations) to render the decision process of the random forest interpretable allows us to determine whether the variables most relevant to the bad labelling come from the current stage or from a previous stage, thereby highlighting the key variables that drive quality outcomes.
Random forests were selected for their robustness in handling heterogeneous and high-dimensional data, but the main reason for the choice was the ability to interpret the main features that drive the decision-making [
22,
23]. The ensemble approach reduces overfitting and enhances generalization, which is essential when linking subtle sensor anomalies from early stages to downstream quality issues. Moreover, the interpretability of random forests through feature importance and SHAP values enables process engineers to understand the root causes of defects. This is crucial for not only verifying model predictions but also for supporting operational decision-making and continuous process improvement.
In this module, the architecture is essentially a random forest model built with XGBoost, which uses a set of predefined parameters by default (estimators → 1000, learning rate → 0.001, early stopping rounds → 100, evaluation metric → “auc”), although the hyperparameters can be optimized for every batch through an optional Optuna-based implementation. Enabling this option makes the process much slower but improves the results. This interdepartmental module is capable of running on a mid-range computer, although, depending on the size of the batches and whether the optimization function is used, the process can become very slow. Processing speed is not a problem, however, since the interdepartmental inference is performed once the rolling process is finished, and an entire batch of product must be produced before it can be analyzed.
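A condensed sketch of this batch-level classifier is shown below with the default parameters quoted above; it uses the standard XGBClassifier interface (the text describes the model as a random forest built with XGBoost) and omits the optional Optuna hyperparameter search. Constructor-level early stopping assumes a reasonably recent XGBoost release.

```python
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

def analyze_batch(X, y):
    """Train the batch-level classifier and return the fitted model and SHAP values."""
    # 70%/30% split, as used for the 1000-roll batches described in Section 3.2.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = xgb.XGBClassifier(
        n_estimators=1000,           # estimators -> 1000
        learning_rate=0.001,         # learning rate -> 0.001
        early_stopping_rounds=100,   # early stopping rounds -> 100
        eval_metric="auc",           # evaluation metric -> "auc"
    )
    model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
    # SHAP values for the test partition, used later to trace problematic variables.
    shap_values = shap.TreeExplainer(model).shap_values(X_te)
    return model, shap_values
```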
Each module’s architecture was deliberately chosen to address specific aspects of the problem.
For departmental inference, the combination of LSTM and MLP networks enables temporal forecasting and complex decision emulation, respectively. The LSTM component isolates and predicts trends in sensor data, while the MLP integrates these predictions with event signals to generate anticipated quality labels.
For interdepartmental inference, the random forest-based module synthesizes data across departments to reveal causal relationships between early-stage anomalies and final product quality. Its ensemble structure and interpretability ensure that the system’s diagnostic conclusions are both robust and actionable.
In essence, these architectures were selected because they provide a modular, scalable, and interpretable framework. They not only replicate human expert decisions but do so in a manner that anticipates potential quality issues, thereby allowing for proactive interventions that improve overall efficiency and product quality.
3. Results
The outcomes of extensive testing for both the departmental and interdepartmental decision-making modules are detailed below. This section focuses on the tests performed and presents the performance results clearly for each production department and for the integrated interdepartmental module.
3.1. Departmental Decision-Making Module
As previously described, the departmental module is designed to replicate and anticipate the decisions made by human experts at each production stage. To evaluate its performance, a controlled set of tests was conducted on a reserved subset of the data (20% of the overall dataset). These tests employed confusion matrices to compare the system’s predicted quality labels against the experts’ actual decisions. Performance was evaluated using three metrics: sensitivity, precision, and F1-score.
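These metrics follow the standard per-class definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}
```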
3.1.1. Steel Mill
The module in the steel mill department was tested on decisions such as OK (the product is fit to move on to the next phase), BLOCK (the product is retained in the current phase for a more thorough inspection before another decision is made), ALARM (an alarm was raised by the inspection systems of this stage during production, so the product is suspected of having a serious fault and must be examined before another decision is made), and SCRAP (the product has been classified as faulty and will be turned into scrap). The results are presented in the confusion matrix in
Table 1 and the resulting metrics in
Table 2.
The results demonstrate that the system achieved a score of about 80%, which is good enough to consider this AI expert a reliable source of information. Similar performance levels were observed for the “bad” labels, indicating strong consistency in replicating expert decisions.
3.1.2. Logistics
In the logistics department, the quality decisions are limited to OK, ALARM, and SCRAP, each with the same meaning as in the previous stage. The results for this stage are presented in the confusion matrix in
Table 3 and the metric scores in
Table 4.
The results show that the system achieved a slightly lower score than the steel mill AI expert, in this case about 76%. For the ALARM and SCRAP labels, the sensitivity and precision were somewhat lower (e.g., ALARM had a sensitivity of 68.15% and a precision of 71.34%), yet the overall performance remained competitive with expert judgments.
3.1.3. Rolling Mill
The module for the rolling mill stage has to handle the following labels: OK, BLOCK, ALARM, DOWNGRADE, and SCRAP. The DOWNGRADE label is introduced at this stage; it is assigned when the final quality of the product is not the intended one, so the product is downgraded to a lower quality grade. The results for this stage are presented in the confusion matrix in
Table 5 and the metric scores in
Table 6.
The system obtained a lower score in this stage than in the previous ones, but it should be highlighted that, for the most important labels (OK and SCRAP), the score was above 80%, while the other decision categories (BLOCK, ALARM, and DOWNGRADE) showed more variable performance.
3.1.4. Sales
For the sales department, where decisions include SOLD (the product can be sold), PENDING (the product has to be inspected further before being sold, similar to the BLOCK label in the previous stages), CLAIM (the product belongs to a batch in which some units have been claimed by customers), ROLLBACK (claimed and returned to the factory), and SCRAP, the module achieved mixed results, as shown in the confusion matrix in
Table 7 and the metrics score in
Table 8.
We achieved a high performance for the SOLD label (sensitivity of 94.36%, precision of 97.72%, and an F1-score of 96.01%). The other decision categories, while exhibiting moderate variations, demonstrated that the system consistently approximates the expert’s decision-making process.
Across the different departments, the results show that the applied model can identify, with sufficient precision, when a product has the required quality and is ready to move on to the next production phase; it therefore avoids slowing down production with false alarms and speeds up decision-making in the vast majority of cases.
3.1.5. Summary of Class-Wise Performance Across Departments
To provide a concise and visual overview of the departmental modules’ performance, we generated a class-wise heatmap of F1-scores. For this visualization, the original classification labels used by each expert were consolidated into three unified categories: OK/SOLD, OTHER, and SCRAP. The OK/SOLD category includes labels such as OK in technical departments and SOLD in the commercial module, representing products deemed acceptable. The SCRAP category corresponds to defective materials identified for rejection. The intermediate group, OTHER, aggregates all other possible labels used by the experts, such as BLOCK or ALARM, which generally reflect borderline or uncertain cases that do not clearly belong to the acceptable or defective classes. This regrouping allows for consistent cross-departmental comparison. The heatmap shown in
Figure 11 summarizes the classification results of the four departmental modules, capturing both predictive accuracy and potential difficulties in replicating expert decisions.
This heatmap highlights several important trends. As expected, the OK/SOLD class consistently achieves the highest F1-scores across all departments, with the sales module reaching 96.01%. Conversely, the OTHER class presents greater classification challenges, with the lowest score observed in the sales module (63.68%). The SCRAP class, although more critical due to its implication in quality loss, also shows strong performance in the rolling mill module (80.05%). These results support the conclusion that the system is particularly effective in identifying clear-cut cases, such as acceptable or scrap-quality material, while additional refinement may be needed for ambiguous or intermediate classifications. This visualization serves as a synthesis of the detailed metrics presented earlier and confirms the consistency and reliability of the departmental decision models.
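For reference, the regrouping used for this heatmap can be expressed as a simple label mapping; the membership of the OTHER group is inferred from the departmental label sets listed above, so the exact mapping is an assumption.

```python
# Consolidation of departmental labels into the three unified categories of the heatmap.
LABEL_GROUPS = {
    "OK": "OK/SOLD", "SOLD": "OK/SOLD",
    "SCRAP": "SCRAP",
    # Intermediate/uncertain cases (assumed membership, based on the labels described above):
    "BLOCK": "OTHER", "ALARM": "OTHER", "DOWNGRADE": "OTHER",
    "PENDING": "OTHER", "CLAIM": "OTHER", "ROLLBACK": "OTHER",
}

def consolidate(labels):
    """Map department-specific labels onto OK/SOLD, OTHER, or SCRAP."""
    return [LABEL_GROUPS[label] for label in labels]
```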
3.2. Interdepartmental Decision-Making Module
The interdepartmental module was designed to integrate outputs from the departmental modules and analyze them collectively. Its primary purpose is to diagnose the root causes of quality issues by correlating data across production stages. The tests for this module used a large portion of the dataset and applied the random forest algorithm to examine how well the module could identify which variables from earlier stages (e.g., the steel mill or logistics) contributed to quality defects observed later.
During testing, the interdepartmental module processed consecutive production sets whose departmental decisions had been predicted in advance. By applying ensemble methods and interpretability tools such as SHAP values, the module highlighted the importance of the input features. The results clearly indicate that, in multiple scenarios, sensor data from the steel mill stage were the primary drivers of defects observed in later stages. We chose the rolling mill stage for this analysis because it provides a clear indicator of product quality that can be used as a target (the QI factor provided by the EDDYEyes system). For the analysis, class “1” contained “OK” products with too many defects (high QI), and class “0” contained the rest of the products. In this analysis, we found several batches of products in which we observed a correlation between the steel mill variables and the quality achieved in the rolling mill stage.
Taking batches of 1000 rolls, we used 70% for training and 30% for testing; these 1000 rolls were fed into the random forest algorithm (which is configured to receive these data), obtaining the following results (
Figure 12,
Figure 13 and
Figure 14).
Once it was determined that the model discriminated reliably between bad and good rolls, as reflected by the good results in the confusion matrix, the SHAP method was applied to the model. From the SHAP values obtained, we can determine the problematic variables that lead to quality deterioration.
In short, the confusion matrix is used to assess whether the model captures the problematic features that are leading to the deterioration in quality; the SHAP method is then applied to obtain first-hand knowledge of these features and the department to which they belong.
Here, we can see that the variables coming from the steel mill have the highest impact on this classification. Using the same method, we can find more cases that exhibit the same behavior.
The module’s confusion matrices and feature importance analyses demonstrated that it could reliably pinpoint problematic variables coming from the steel mill, providing actionable insights for process improvement. The cases presented show that the six most important SHAP values in the model’s decision correspond to steel mill variables rather than to rolling mill variables (the current stage in this case), allowing us to collect more information about the origin of the defects that appear in the analyzed stage. This information can be used in the future to retrain the departmental modules and increase their reliability.
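As an illustration of this procedure, the sketch below ranks features by mean absolute SHAP value and reports the department each one belongs to; the department prefix in the feature names is a hypothetical convention adopted here for clarity, since the anonymized dataset does not expose such identifiers.

```python
import numpy as np
import pandas as pd

def top_features_by_department(shap_values, feature_names, top_k: int = 6):
    """Rank features by mean |SHAP| and report the department each one comes from.

    Assumes binary-classification SHAP values of shape (n_samples, n_features) and
    feature names with a hypothetical department prefix, e.g. 'steel_mill__temp_avg'.
    """
    importance = np.abs(shap_values).mean(axis=0)
    ranking = pd.DataFrame({
        "feature": feature_names,
        "mean_abs_shap": importance,
        "department": [name.split("__")[0] for name in feature_names],
    }).sort_values("mean_abs_shap", ascending=False)
    return ranking.head(top_k)

# If most of the top-k features carry the 'steel_mill' prefix, the defects observed in
# the rolling mill batch are being driven mainly by upstream (steel mill) variables.
```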
4. Discussion
The results demonstrate the potential of the proposed intelligent assistant to support expert decision-making across departments by replicating their logic and anticipating quality issues in advance. The performance varied depending on the department, with the steel mill stage yielding some of the most promising results. Specifically, the model achieved a sensitivity of 75% for the “SCRAP” label, correctly identifying three out of four billets that should be rejected before processing. Additionally, the system demonstrated a precision of over 72%, meaning most predictions for rejection were valid. Out of 59,578 analyzed billets, 403 would have been correctly flagged as scrap before production, potentially avoiding unnecessary processing. Only 97 billets (0.16%) would have been wrongly flagged. These results are in line with the findings in [
7], where the application of ML-based classifiers for hot-rolling quality grading also resulted in high precision and effective early rejection of defective outputs. A similar direction is observed in [
9], where a hybrid model combining AdaBoost and LSTM outperformed other approaches in a real manufacturing setting, highlighting the value of ensemble deep learning architectures for anticipating quality issues based on production data.
In the logistics department, the intelligent system also showed strong performance, particularly in terms of early detection. With a sensitivity above 70%, the model would have correctly flagged 2536 defective billets (4% of total production) before entering the billet park. This level of anticipation, although with 641 false alarms, represents a valuable opportunity for reducing downstream defects. The combination of temporal models and decision emulation, such as LSTM with MLP, has demonstrated high predictive power across complex industrial datasets, as validated in real steel manufacturing environments.
The rolling mill posed a more complex challenge, especially for the “DOWNGRADE” label. The system struggled with this class, likely due to the limited number of labeled samples and the subtle differences distinguishing it from the “OK” and “SCRAP” categories. These types of difficulties are common in industrial settings, particularly when dealing with visually ambiguous or infrequent defects. The integration of explainable AI techniques, like those explored in [
11], enhances transparency by translating random forest models into semantic rules. This interpretability supports informed decision-making in steel production and highlights the value of combining domain knowledge with machine learning. Such findings highlight the importance of expert knowledge, data balance, and interpretable models when dealing with rare or borderline defect classes in steel manufacturing. Nonetheless, for the critical “SCRAP” cases in the rolling stage, the AI achieved 79% sensitivity and 81% precision, correctly identifying 598 defective coils and issuing only 56 false positives. This suggests that the temporal modeling performed by the LSTM-based architecture is well suited to detect anomalies in sequential data where quality evolves along the production line.
The sales department presented additional complexity. The model achieved an F1-score of 72%, lower than the performance in upstream departments. This is likely due to the narrower data diversity and the influence of commercial criteria that are not always encoded in sensor data. Similar conclusions are drawn in [
11], where integrating domain knowledge and contextual semantic reasoning was essential to enhance the interpretability and robustness of machine learning decisions in steel production. The formalization of expert knowledge into SWRL rules, when combined with data-driven classification, demonstrated a significant improvement in decision support capabilities. Our results suggest that enriching the sales decision module with data from earlier stages—as proposed in the interdepartmental model—could improve classification accuracy by providing a broader context. This approach also aligns with the hybrid methodology validated in [
11], where expert knowledge was formalized into SWRL rules and combined with data-driven classifiers.
From a detailed comparative perspective, ref. [
11] presents a hybrid decision support tool for quality classification in steel production, combining semantic rules written in SWRL with random forest classifiers. Their system was evaluated on real cold-rolled steel data to assist human experts in interpreting quality outcomes using structured knowledge and machine learning inference. While effective, the system operates after production, classifying finished products rather than predicting quality before fabrication as we do. In their results, the reported weighted precision and recall reached 78%, and the weighted F1-score was 77%, demonstrating robust classification capabilities in a static setting. In contrast, our approach tackles a significantly more complex scenario, as it focuses on anticipating product quality prior to completion by integrating heterogeneous, sequential data from multiple departments. Despite this increased difficulty, our expert modules achieved strong performance across departments: the steel mill module reached a precision of 78.01%, recall of 80.50%, and F1-score of 79.18%; the logistics module showed 75.67% precision, 74.54% recall, and 75.10% F1-score; the rolling mill achieved 77.63% precision, 74.57% recall, and 76.06% F1-score; and the sales module reported 74.59% precision, 71.71% recall, and 72.71% F1-score. These values confirm the robustness of our anticipatory and hybrid methodology, delivering performance that matches or exceeds post-production models like [
11] under more challenging predictive conditions.
The interdepartmental model offers significant potential to compensate for local uncertainties by aggregating information from multiple departments. While [
8] focuses on a single-stage prediction task, their results show that leveraging temporal sensor data across various positions within a production line enhances the accuracy of downstream property estimation. This suggests that broader integration of production data may further improve predictive performance in multi-stage industrial systems. Moreover, by adopting explainable components, such as feature importance ranking (from random forests), it becomes possible to identify which departments or variables are most responsible for quality degradation—a principle validated by [
9] in real-world rolling fault detection scenarios. The hybrid use of LSTM networks for temporal prediction and random forests for final decision logic proved to be a robust and interpretable combination for industrial applications.
To bridge the gap between research and industrial deployment, a pilot-scale validation plan is proposed. It involves selecting a production line with sufficient sensor coverage and historical data and initially operating the AI system in parallel to existing decision-making processes without influencing real-time operations. The system’s outputs would be systematically compared against expert judgment and current quality control protocols to build trust. Depending on performance, low-risk decisions would be progressively automated, ultimately leading to full integration into Level 2 automation and continuous monitoring for iterative refinement.
In parallel, the adaptability of the proposed system to different industrial environments must be addressed. While many data types and sensor configurations are shared across steel plants, transferring the model to new facilities would likely require some degree of retraining to reflect site-specific conditions. Currently, the absence of a systematic identification of the key variables that influence model transferability represents a limitation. Tackling this challenge will be essential for enhancing the system’s generalization capabilities and enabling scalable adoption across diverse production contexts.
In summary, the proposed AI system not only replicates expert decision patterns but, in many cases, anticipates quality deviations earlier in the process. Its consistent performance across departments combined with high interpretability and modular design position it as a promising solution for enhancing product quality and process stability in steel manufacturing environments. While initial steps toward industrial integration and validation have been outlined, further efforts will be needed to assess adaptability across different production sites and to benchmark the system against alternative approaches. These findings reinforce the growing trend of deploying AI-assisted quality assurance systems in complex production chains.
5. Conclusions
This study demonstrates that an AI-based system, which integrates event filtering, expert knowledge extraction, and advanced inference modules, can effectively predict quality issues in the steel rolling process. With an overall accuracy around 80%, the system shows promise in replicating and anticipating expert decisions, leading to potential reductions in production losses and improved operational efficiency. Although certain categories—such as the “DOWNGRADE” label in the rolling mill and the complex decisions in the sales module—require further data enrichment and model integration, the system provides actionable insights for process improvement. Future research should focus on expanding the dataset, refining the model for challenging categories, and exploring tighter integration across departments to further enhance predictive accuracy.
Related to the industrial applicability, the proposed system is designed with practical integration in mind. Most of the targeted production facilities are already equipped with a certain degree of sensorization, typically connected to automation level 1, and in more advanced setups, integrated into level 2. Our approach leverages this existing infrastructure by continuously collecting data from these levels. The decision-making algorithms would be executed on dedicated industrial hardware, and the resulting outputs would be automatically communicated back to level 2. This enables real-time interventions, such as product rejection, downgrade, or blocking, ensuring that predictive insights directly inform operational decisions.