1. Introduction
The steel industry continually faces challenges related to maintaining product quality amid a complex production process [
1,
2]. In rolling mills, defects may emerge at various stages, from the steel mill through logistics (storage of the billets), rolling, and, ultimately, sales (storage and transportation of the rolls). Traditionally, experts on the production line classify products using a range of quality labels. This project proposes an innovative solution: an intelligent assistant capable of replicating these expert decisions before defects occur. AI tools are widespread across the industry nowadays [
3,
4], so their adoption in this context is timely and justified.
Recent studies have shown that machine learning approaches can significantly enhance predictive maintenance and process optimization in manufacturing environments. For instance, in [
5], a multiple classifier approach was shown to improve failure prediction accuracy while reducing unexpected downtime and associated costs. The integration of cyber-physical systems, as highlighted in [
6], has further driven advancements in Industry 4.0-based manufacturing, enabling a seamless connection between physical processes and digital analytics to support robust, data-driven maintenance strategies. Within this context, machine learning techniques have demonstrated particular potential in industrial settings such as steel production, where process complexity and data heterogeneity are significant. Similarly, ref. [
7] highlights how AI-driven classification can automate quality control decisions in the hot rolling process, underscoring the value of these models in standardizing decision-making and minimizing human error.
Time series modeling methods, such as Long Short-Term Memory (LSTM) neural networks, have demonstrated superior performance in predicting mechanical properties in steel production. Ref. [
8] applied LSTM models to forecast yield strength, tensile strength, and elongation of deep-drawing steel, showing that capturing temporal dependencies from sequential process data significantly improves prediction accuracy compared to other deep learning architectures. Moreover, LSTM-based architectures have demonstrated effectiveness in capturing time-series dependencies in manufacturing data. In [
9], the authors proposed a quality prediction model that integrates AdaBoost with LSTM networks and rough set theory to improve the prediction accuracy. The study also confirms the relevance of ensemble learning and hybrid models in dealing with high-dimensional industrial data.
The integration of multi-source sensor data is also gaining momentum in steel quality management. For instance, ref. [
10] presents a deep learning-based approach to predict the mechanical properties of hot-rolled steel plates using process parameters from multiple stages. Their model captures complex nonlinear relationships between input features and material properties, illustrating the value of multi-source data integration in production monitoring. It underscores the effectiveness of data-driven modeling in steel manufacturing, leveraging rich sensor information to support quality decisions across departments.
From a systems architecture perspective, ref. [
11] presents a decision support framework that integrates random forest models with domain expert knowledge encoded as ontological rules for production supervision in steel plants. Their work exemplifies a hybrid approach that enhances decision reliability by combining data-driven classification with semantic reasoning. Our system similarly incorporates ontological rules and neural network inference, validating the use of this hybrid methodology.
Preventing defects is crucial, not only for improving profitability by reducing waste and rework but also for significantly lowering CO2 emissions by minimizing unnecessary production and energy consumption [
12,
13,
14]. Furthermore, early detection and prevention of defects help extend the operational life of rolling mills by reducing wear and tear caused by processing faulty materials. This project encompasses the entire pipeline, from data acquisition to decision simulation, using real production data. The main objective is to automate expert decisions through modular AI components, including event detection, expert knowledge extraction, and a dual-layered inference engine (departmental and interdepartmental). The latter component aggregates stage-specific decisions and applies explainability mechanisms to identify defect propagation and interdependencies across production stages.
This paper outlines the technical approach and presents the evaluation of the system’s performance, emphasizing both departmental and interdepartmental decision-making. The results demonstrate the potential of this approach to reduce scrap rates, improve consistency in quality evaluations, and empower experts with an anticipatory decision-support tool. In doing so, it contributes to advancing industrial AI solutions for predictive quality assurance.
2. Methodology
2.1. Data Acquisition and Preprocessing
The system began with the acquisition of heterogeneous sensor data and input provided by human experts from each department (steel mill, logistics, rolling mill, and sales). Over a period of three years, these experts contributed their decisions and corresponding quality labels, which served as the foundation for the system’s classification tasks. In parallel, the experts developed ontological rules based on the decisions they made throughout this period, using statistical software. These rules were formulated to codify their expert knowledge and guide the automated decision-making process. This workflow is illustrated in
Figure 1.
The data, provided by the manufacturer in anonymized form due to non-disclosure agreements, consisted of 278,312 samples, each containing readings from multiple sensors positioned along the production process; because of the anonymization, it was not possible to identify the specific sensor corresponding to each variable. Certain production stages are equipped with more sensors and provide more extensive measurement data than others; naturally, there are more sensors in stages where the material is being transformed (the steel mill and rolling mill) than in stages where the product is stored or transported. The sensors in these stages provide information such as the following:
Minimum, maximum, and average water flows.
Minimum, maximum, and average temperatures in the different lamination blocks.
Minimum, maximum, and average temperatures of the RSM (finishing blocks).
Tundish oscillation frequency.
The raw sensor data were collected in a single CSV file over the three-year period and segmented according to production stages, adding the defect information provided by the EDDYEyes system of ISEND S.A., shown in
Figure 6. This is an eddy current system that includes visual information about the defects and is capable of measuring their severity; this type of technology is among the most valuable for defect detection in the steel industry [
15,
16,
17]. In this “defectology dataset”, the Quality Indicator (QI) is a critical metric that quantifies the number and severity of defects for each billet/roll.
Regarding data preprocessing, clearly anomalous outliers were eliminated using an interquartile range (IQR) method. Moreover, the data were processed to make them easily interpretable by any algorithm (through standard normalization of the dataset and encoding of certain variables). Notably, only one value per variable was recorded for each product, as the values hardly vary during the manufacture of each billet or roll. This uniformity greatly facilitated both the analysis and subsequent processing of the data.
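As a reference for this preprocessing stage, the following minimal sketch illustrates IQR-based outlier removal, standard normalization, and encoding of categorical variables; the column names and the 1.5 × IQR fence factor are illustrative assumptions rather than the exact production configuration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def remove_iqr_outliers(df: pd.DataFrame, cols, factor: float = 1.5) -> pd.DataFrame:
    """Drop rows whose values fall outside the interquartile-range fences."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - factor * iqr, q3 + factor * iqr)
    return df[mask].copy()

def preprocess(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """IQR outlier filtering, standard normalization, and label encoding."""
    df = remove_iqr_outliers(df, numeric_cols)
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    for col in categorical_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df

# Hypothetical usage (column names are placeholders, not the anonymized variables):
# df = preprocess(pd.read_csv("production_data.csv"),
#                 numeric_cols=["water_flow_avg", "rsm_temp_max"],
#                 categorical_cols=["stage"])
```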
2.2. Event Filtering and Expert Knowledge Extraction Modules: Structure and Operation
To address the twin challenges of detecting sensor anomalies linked to quality degradation and extracting information from the labels chosen by the human experts, this part of the system was built using a two-module architecture. The design of these modules was driven by the need for flexibility, clarity, and scalability, leading to a final structure comprising two distinct but complementary pipelines.
2.2.1. Event Filtering Module
The objective of this module is to process raw sensor data and identify significant deviations that may indicate a potential drop in product quality. This is critical because even slight deviations (when sustained) can signal underlying issues that affect the final quality of the steel product.
The module is architected as a multi-stage pipeline with a clear separation of responsibilities, organized into the following steps:
Reads raw CSV data and a complementary criteria file, which defines the expected data types, valid ranges, and grouping intervals. This ensures that all incoming data meet predefined standards.
Corrects any formatting inconsistencies and applies basic corrections as needed. This step ensures the data’s integrity, preserving the raw relationships for later processing.
The data are segmented into fixed time windows (typically 4-h windows, limited to the most recent 100 points) to capture temporal behavior while smoothing out transient fluctuations.
Within each window, statistical metrics (e.g., minimum value, mean, and standard deviation) are computed. A logarithmic transformation is applied for smoothing, and an anomaly is flagged when the window’s value exceeds the mean plus three standard deviations (this criterion was adopted because it yielded satisfactory results). A threshold of at least 10 anomalous points is enforced to eliminate spurious spikes.
The final output is two-fold: a CSV file that adds Boolean flags (indicating the presence of anomalies) to the original dataset, and graphical representations for quick visual analysis.
This modular design, whose operation is illustrated in
Figure 7, was chosen to isolate each processing step, thereby increasing maintainability and allowing for independent tuning or replacement of submodules. The clear delineation between file handling, transformation, windowing, and statistical detection ensures robustness and scalability. It also facilitates debugging and further enhancements as new types of sensor data or criteria emerge.
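A minimal sketch of the windowing and detection logic described above is shown below, assuming one plausible reading of the criterion (points above the window mean plus three standard deviations of the log-transformed signal, with at least 10 such points per window); the criteria-file handling and plotting steps of the real module are omitted, and the column names are placeholders.

```python
import numpy as np
import pandas as pd

WINDOW = "4H"          # 4-hour windows, as in the module description
MAX_POINTS = 100       # keep only the most recent points per window
MIN_ANOMALOUS = 10     # at least 10 anomalous points to flag a window

def flag_anomalies(df: pd.DataFrame, sensor_col: str, time_col: str = "timestamp") -> pd.DataFrame:
    """Add a Boolean anomaly flag per sample, following the mean + 3*std criterion.

    The time column is assumed to be a datetime column; sensor values are assumed positive.
    """
    df = df.sort_values(time_col).copy()
    # Logarithmic transformation for smoothing.
    df["_log"] = np.log1p(df[sensor_col])
    flags = []
    for _, window in df.groupby(pd.Grouper(key=time_col, freq=WINDOW)):
        window = window.tail(MAX_POINTS)
        mu, sigma = window["_log"].mean(), window["_log"].std()
        point_flags = window["_log"] > mu + 3 * sigma
        # Discard spurious spikes: keep the flags only if enough points exceed the threshold.
        if point_flags.sum() < MIN_ANOMALOUS:
            point_flags[:] = False
        flags.append(point_flags)
    df[f"{sensor_col}_anomaly"] = pd.concat(flags).reindex(df.index, fill_value=False)
    return df.drop(columns="_log")
```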
2.2.2. Automatic Expert Knowledge Extraction Module
In parallel with anomaly detection, it is necessary to extract and codify the decision-making process of the human experts. Over three years, experts in each department provided quality labels and developed ontological rules to guide their decisions. This module’s purpose is to analyze these expert decisions, compare them against the formal rules, and generate a refined knowledge database. The module is organized as a sequential pipeline.
Historical records of expert decisions are collected along with the ontological rules that were formulated concurrently. By combining these two sources, the module ensures that both practical decision outcomes and formalized expert knowledge (the rules) are represented.
The module compares the decisions made by the experts with the outcomes predicted by the ontological rules (note that the ontological rules are statistical representations of past choices, so their outputs do not necessarily coincide with the experts’ decisions). Instances of coincidence, discrepancy, or missing rule output are identified.
In cases where the ontological rules do not generate a clear decision or when discrepancies arise, the module applies techniques based on neural networks to refine the rule set; this approach has been widely used in industry for its ability to learn from complex data [
18]. This step is critical for determining what the “ideal” decision would be given the complete set of sensor inputs. It is important to note that the module should never contradict a bad quality label assigned by the human experts; its aim is to be even more restrictive than the experts while learning from their knowledge.
The refined decisions, now aligned with both sensor data and expert insight, are compiled into a new CSV file that can be easily added to the knowledge extraction module database. This final database is then used to train the neural network inference modules (departmental and interdepartmental decision-making), ensuring that the system learns an accurate representation of expert decision-making.
This structure of the expert extraction module is represented in
Figure 8 and was chosen to address the inherent complexity of human decision-making. The experts’ decisions are influenced mainly by visual inspection of the product and by the information provided by the EDDYEyes system developed by ISEND S.A., not by the sensor data, which makes extracting expert knowledge from the sensor readings a complex task. By separating data collection, discrepancy analysis, and output generation, the module allows for a systematic and iterative improvement of the knowledge base. This design enhances transparency, making it easier to pinpoint where expert judgments diverge from formal rules, and supports continuous refinement. Ultimately, the refined database served as the cornerstone for training our decision-making models, ensuring that they can replicate expert decisions or even improve on them.
To achieve this behavior, we built the module with the following architecture. The model starts with an input layer whose dimension corresponds to the number of parameters in each stage (steel mill → 28, logistics → 8, rolling mill → 40, sales → 4); this is followed by three hidden layers, each with twice the dimension of the input layer, and, finally, an output layer whose dimension corresponds to the number of labels for each stage. This architecture was deliberately kept simple (adding more layers did not improve the results); a dense, fully connected structure proved the most efficient for extracting the information needed to classify the products into the different labels.
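A minimal Keras sketch of this dense classifier is given below, instantiated for the rolling mill configuration (40 inputs); the ReLU/softmax activations, optimizer, and loss are assumptions, since the text only specifies the layer dimensions, and the number of labels per stage is taken from the departmental descriptions in Section 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_expert_extractor(n_inputs: int, n_labels: int) -> keras.Model:
    """Dense classifier: input -> 3 hidden layers (2x input size) -> label probabilities."""
    model = keras.Sequential([
        layers.Input(shape=(n_inputs,)),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(2 * n_inputs, activation="relu"),
        layers.Dense(n_labels, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Per-stage input dimensions reported in the text:
# steel mill -> 28, logistics -> 8, rolling mill -> 40, sales -> 4
rolling_mill_model = build_expert_extractor(n_inputs=40, n_labels=5)
```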
Together, these two modules form the backbone of the acquisition and treatment of the collected data. Their carefully engineered modular architectures not only provide clear operational benefits but also ensure that each stage of processing, whether filtering anomalies or extracting expert knowledge, can be independently optimized and updated as needed in the future. This design was essential to handle the diversity of sensor data and the complexity of expert decision patterns in a robust, scalable manner.
2.3. Departmental and Interdepartmental Inference Modules: Architecture, Operation, and Justification
The motivation behind these inference modules is two-fold. At the departmental level, our goal is to replicate the decisions made by human experts. In each production stage, experts assign quality labels based on the apparent final quality of the material. By forecasting sensor data and processing event information, our system can predict these quality labels before the product is processed, avoiding the manufacture of low-quality product that would be scrapped. The interdepartmental module, in turn, is in charge of integrating the outputs from the different stages, allowing us to identify the root causes of quality degradation, facilitating targeted corrective actions, and collecting new information that could be used to retrain the departmental modules in the future.
2.3.1. Departmental Inference Module
Each departmental module begins by processing the pretreated sensor data using Long Short-Term Memory (LSTM) networks. The data, organized into fixed-length windows (for example, 50 time steps), are input to the LSTM network, which forecasts future sensor readings and process characteristics.
LSTM networks were chosen for their proven ability to capture long-term dependencies and temporal patterns in sequential data, a capability widely demonstrated in industry [
19,
20]. This property is critical in our context: sensor values remain nearly constant for each billet or roll, but fluctuations do occur between products, and if these fluctuations follow the same tendency for some time, the deviation becomes evident; even small deviations sustained over time may indicate an impending quality issue. The attention mechanisms help the network focus on the most relevant segments of the time window [
21], ensuring that subtle yet critical deviations are not overlooked.
The predictions from the LSTM component, along with the Boolean flags generated by the event filtering module for these predictions, serve as input to a Multilayer Perceptron (MLP); this structure is illustrated in
Figure 9. The MLP is trained using, as targets, the outputs of the expert knowledge extraction module, which combine historical expert decisions with the ontological rules formulated by those experts. This network processes the multidimensional input through several dense layers with nonlinear activation functions (e.g., ReLU) and ultimately outputs a categorical quality label (such as OK, BLOCK, ALARM, or SCRAP).
The choice of an MLP for decision emulation stems from its capacity to model complex, non-linear relationships between sensor forecasts and expert-derived quality labels. By integrating the outputs from the LSTM network and the filtered anomaly events, the MLP captures the interactions between multiple sensor variables and decision criteria. This layered approach not only mirrors the human expert’s process but also provides the anticipatory power required to trigger early corrective actions.
To achieve the desired results, we chose the following architecture for this departmental inference module. The module starts with an LSTM layer with 64 units, followed by a 10% dropout layer. Next comes a custom attention layer, consisting of a dense layer with a softmax activation that calculates the attention weights, which are then used to build a context vector (a weighted representation of the original input) at the output of the layer. This attention layer is followed by a dense layer with 32 units and ReLU activation and, finally, an output dense layer with linear activation.
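The following Keras sketch reproduces the sequence of layers just described (LSTM with 64 units, 10% dropout, a dense-plus-softmax attention layer producing a context vector, a 32-unit ReLU layer, and a linear output); returning the full LSTM sequence to the attention layer, the optimizer, and the loss are assumptions, and the MLP decision network that consumes these forecasts is not shown.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class SimpleAttention(layers.Layer):
    """Dense + softmax scores over time steps, collapsed into a context vector."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(1)

    def call(self, inputs):
        # inputs: (batch, time_steps, features)
        weights = tf.nn.softmax(self.score(inputs), axis=1)  # attention weights per step
        return tf.reduce_sum(weights * inputs, axis=1)       # weighted representation (context vector)

def build_departmental_forecaster(window: int, n_features: int, n_outputs: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.LSTM(64, return_sequences=True),  # 64 units; sequence kept for the attention layer
        layers.Dropout(0.10),                    # 10% dropout
        SimpleAttention(),                       # custom attention layer
        layers.Dense(32, activation="relu"),
        layers.Dense(n_outputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# e.g., 50-step windows as mentioned above; feature and output counts are placeholders.
forecaster = build_departmental_forecaster(window=50, n_features=40, n_outputs=40)
```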
2.3.2. Interdepartmental Inference Module
The interdepartmental module aggregates the outputs from the departmental modules. It employs a random forest algorithm to analyze the combined data, specifically correlating predictions and quality labels from earlier stages (such as the steel mill and logistics) with the final quality outcomes observed in later stages (such as the rolling mill). This module uses ensemble methods to compute the importance of the various features and to determine which preceding factors contribute most significantly to defects at later stages, working on batches of the chosen size, as shown in
Figure 10. With the departmental module, we know the features that are relevant to the final label in each stage, but the aim of the interdepartmental module is to refine the departmental modules by distinguishing the bad labels at each stage that cannot be explained by the variables of the stage itself. Applying techniques such as SHAP (SHapley Additive exPlanations) to render the decision process of the random forest interpretable allows us to determine whether the variables most relevant to the bad labelling come from the current stage or from a previous stage, thereby highlighting the key variables that drive quality outcomes.
Random forests were selected for their robustness in handling heterogeneous and high-dimensional data, but the main reason for the choice was the ability to interpret the main features that drive the decision-making [
22,
23]. The ensemble approach reduces overfitting and enhances generalization, which is essential when linking subtle sensor anomalies from early stages to downstream quality issues. Moreover, the interpretability of random forests through feature importance and SHAP values enables process engineers to understand the root causes of defects. This is crucial for not only verifying model predictions but also for supporting operational decision-making and continuous process improvement.
In this module, the architecture is essentially a random forest model built with XGBoost, which uses a set of predefined parameters by default (estimators → 1000, learning rate → 0.001, early stopping rounds → 100, evaluation metric → “auc”), although the hyperparameters can be optimized for every batch through an optional Optuna-based implementation. Enabling this option makes the process much slower but improves the results. This interdepartmental module is capable of running on a mid-range computer, although, depending on the size of the batches and whether the optimization function is used, the process can become very slow. Processing speed is not a problem, however, since the interdepartmental inference is performed once the rolling process is finished, and an entire batch of product must be produced before it can be analyzed.
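A condensed sketch of this batch-level classifier is shown below with the default parameters quoted above; it uses the standard XGBClassifier interface (the text describes the model as a random forest built with XGBoost) and omits the optional Optuna hyperparameter search. Constructor-level early stopping assumes a reasonably recent XGBoost release.

```python
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

def analyze_batch(X, y):
    """Train the batch-level classifier and return the fitted model and SHAP values."""
    # 70%/30% split, as used for the 1000-roll batches described in Section 3.2.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = xgb.XGBClassifier(
        n_estimators=1000,           # estimators -> 1000
        learning_rate=0.001,         # learning rate -> 0.001
        early_stopping_rounds=100,   # early stopping rounds -> 100
        eval_metric="auc",           # evaluation metric -> "auc"
    )
    model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
    # SHAP values for the test partition, used later to trace problematic variables.
    shap_values = shap.TreeExplainer(model).shap_values(X_te)
    return model, shap_values
```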
Each module’s architecture was deliberately chosen to address specific aspects of the problem.
For departmental inference, the combination of LSTM and MLP networks enables temporal forecasting and complex decision emulation, respectively. The LSTM component isolates and predicts trends in sensor data, while the MLP integrates these predictions with event signals to generate anticipated quality labels.
For interdepartmental inference, the random forest-based module synthesizes data across departments to reveal causal relationships between early-stage anomalies and final product quality. Its ensemble structure and interpretability ensure that the system’s diagnostic conclusions are both robust and actionable.
In essence, these architectures were selected because they provide a modular, scalable, and interpretable framework. They not only replicate human expert decisions but do so in a manner that anticipates potential quality issues, thereby allowing for proactive interventions that improve overall efficiency and product quality.
3. Results
The outcomes of extensive testing for both the departmental and interdepartmental decision-making modules are detailed below. This section focuses on the tests performed and presents the performance results clearly for each production department and for the integrated interdepartmental module.
3.1. Departmental Decision-Making Module
As previously described, the departmental module is designed to replicate and anticipate the decisions made by human experts at each production stage. To evaluate its performance, a controlled set of tests was conducted on a reserved subset of the data (20% of the overall dataset). These tests employed confusion matrices to compare the system’s predicted quality labels against the experts’ actual decisions. Performance was evaluated using three metrics: sensitivity, precision, and F1-score.
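These metrics follow the standard per-class definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

```latex
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}
```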
3.1.1. Steel Mill
The module in the steel mill department was tested on decisions such as OK (the product is fit to move on to the next phase), BLOCK (the product is retained in the current phase for a more thorough inspection before another decision is made), ALARM (an alarm was raised by the inspection systems of this stage during production, so the product is suspected of having a serious fault and must be examined before another decision is made), and SCRAP (the product has been classified as faulty and will be turned into scrap). The results are presented in the confusion matrix in
Table 1 and the resulting metrics in
Table 2.
The results demonstrate that the system achieved a score of about 80%, which is good enough to consider this AI expert a reliable source of information. Similar performance levels were observed for the “bad” labels, indicating strong consistency in replicating expert decisions.
3.1.2. Logistics
In the logistics department, the quality decisions are limited to OK, ALARM, and SCRAP, each with the same meaning as in the previous stage. The results for this stage are presented in the confusion matrix in
Table 3 and the metric scores in
Table 4.
The results show that the system achieved a slightly lower score than the steel mill AI expert, in this case about 76%. For the ALARM and SCRAP labels, the sensitivity and precision were somewhat lower (e.g., ALARM had a sensitivity of 68.15% and a precision of 71.34%), yet the overall performance remained competitive with expert judgments.
3.1.3. Rolling Mill
The module for the rolling mill stage has to handle the following labels: OK, BLOCK, ALARM, DOWNGRADE, and SCRAP. The DOWNGRADE label is introduced at this stage; it is assigned when the final quality of the product is not the intended one, so the product is downgraded to a lower quality grade. The results for this stage are presented in the confusion matrix in
Table 5 and the metric scores in
Table 6.
The system obtained a lower score in this stage than in the previous ones, but it should be highlighted that, for the most important labels (OK and SCRAP), the score was above 80%, while the other decision categories (BLOCK, ALARM, and DOWNGRADE) showed more variable performance.
3.1.4. Sales
For the sales department, where decisions include SOLD (the product can be sold), PENDING (the product has to be inspected further before being sold, similar to the BLOCK label in the previous stages), CLAIM (the product belongs to a batch in which some units have been claimed by customers), ROLLBACK (claimed and returned to the factory), and SCRAP, the module achieved mixed results, as shown in the confusion matrix in
Table 7 and the metrics score in
Table 8.
We achieved a high performance for the SOLD label (sensitivity of 94.36%, precision of 97.72%, and an F1-score of 96.01%). The other decision categories, while exhibiting moderate variations, demonstrated that the system consistently approximates the expert’s decision-making process.
Across the different departments, the results show that the applied model can identify, with sufficient precision, when a product has the required quality and is ready to move on to the next production phase; it therefore avoids slowing down production with false alarms and speeds up decision-making in the vast majority of cases.
3.1.5. Summary of Class-Wise Performance Across Departments
To provide a concise and visual overview of the departmental modules’ performance, we generated a class-wise heatmap of F1-scores. For this visualization, the original classification labels used by each expert were consolidated into three unified categories: OK/SOLD, OTHER, and SCRAP. The OK/SOLD category includes labels such as OK in technical departments and SOLD in the commercial module, representing products deemed acceptable. The SCRAP category corresponds to defective materials identified for rejection. The intermediate group, OTHER, aggregates all other possible labels used by the experts, such as BLOCK or ALARM, which generally reflect borderline or uncertain cases that do not clearly belong to the acceptable or defective classes. This regrouping allows for consistent cross-departmental comparison. The heatmap shown in
Figure 11 summarizes the classification results of the four departmental modules, capturing both predictive accuracy and potential difficulties in replicating expert decisions.
This heatmap highlights several important trends. As expected, the OK/SOLD class consistently achieves the highest F1-scores across all departments, with the sales module reaching 96.01%. Conversely, the OTHER class presents greater classification challenges, with the lowest score observed in the sales module (63.68%). The SCRAP class, although more critical due to its implication in quality loss, also shows strong performance in the rolling mill module (80.05%). These results support the conclusion that the system is particularly effective in identifying clear-cut cases, such as acceptable or scrap-quality material, while additional refinement may be needed for ambiguous or intermediate classifications. This visualization serves as a synthesis of the detailed metrics presented earlier and confirms the consistency and reliability of the departmental decision models.
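For reference, the regrouping used for this heatmap can be expressed as a simple label mapping; the membership of the OTHER group is inferred from the departmental label sets listed above, so the exact mapping is an assumption.

```python
# Consolidation of departmental labels into the three unified categories of the heatmap.
LABEL_GROUPS = {
    "OK": "OK/SOLD", "SOLD": "OK/SOLD",
    "SCRAP": "SCRAP",
    # Intermediate/uncertain cases (assumed membership, based on the labels described above):
    "BLOCK": "OTHER", "ALARM": "OTHER", "DOWNGRADE": "OTHER",
    "PENDING": "OTHER", "CLAIM": "OTHER", "ROLLBACK": "OTHER",
}

def consolidate(labels):
    """Map department-specific labels onto OK/SOLD, OTHER, or SCRAP."""
    return [LABEL_GROUPS[label] for label in labels]
```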
3.2. Interdepartmental Decision-Making Module
The interdepartmental module was designed to integrate outputs from the departmental modules and analyze them collectively. Its primary purpose is to diagnose the root causes of quality issues by correlating data across production stages. The tests for this module used a large portion of the dataset and applied the random forest algorithm to examine how well the module could identify which variables from earlier stages (e.g., the steel mill or logistics) contributed to quality defects observed later.
During testing, the interdepartmental module processed consecutive production sets whose departmental decisions had been predicted in advance. By applying ensemble methods and interpretability tools such as SHAP values, the module highlighted the importance of the input features. The results clearly indicate that, in multiple scenarios, sensor data from the steel mill stage were the primary drivers of defects observed in later stages. We chose the rolling mill stage for this analysis because it provides a clear indicator of product quality that can be used as a target (the QI factor provided by the EDDYEyes system). For the analysis, class “1” contained “OK” products with too many defects (high QI), and class “0” contained the rest of the products. In this analysis, we found several batches of products in which we observed a correlation between the steel mill variables and the quality achieved in the rolling mill stage.
Taking batches of 1000 rolls, we used 70% for training and 30% for testing; these 1000 rolls were fed into the random forest algorithm (which is configured to receive these data), obtaining the following results (
Figure 12,
Figure 13 and
Figure 14).
Once it was determined that the model discriminated reliably between bad and good rolls, as reflected by the good results in the confusion matrix, the SHAP method was applied to the model. From the SHAP values obtained, we can determine the problematic variables that lead to quality deterioration.
In short, the confusion matrix is used to assess whether the model captures the problematic features that are leading to the deterioration in quality; the SHAP method is then applied to obtain first-hand knowledge of these features and the department to which they belong.
Here, we can see that the variables coming from the steel mill have the highest impact on this classification. Using the same method, we can find more cases that exhibit the same behavior.
The module’s confusion matrices and feature importance analyses demonstrated that it could reliably pinpoint problematic variables coming from the steel mill, providing actionable insights for process improvement. The cases presented show that the six most important SHAP values in the model’s decision correspond to steel mill variables rather than to rolling mill variables (the current stage in this case), allowing us to collect more information about the origin of the defects that appear in the analyzed stage. This information can be used in the future to retrain the departmental modules and increase their reliability.
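As an illustration of this procedure, the sketch below ranks features by mean absolute SHAP value and reports the department each one belongs to; the department prefix in the feature names is a hypothetical convention adopted here for clarity, since the anonymized dataset does not expose such identifiers.

```python
import numpy as np
import pandas as pd

def top_features_by_department(shap_values, feature_names, top_k: int = 6):
    """Rank features by mean |SHAP| and report the department each one comes from.

    Assumes binary-classification SHAP values of shape (n_samples, n_features) and
    feature names with a hypothetical department prefix, e.g. 'steel_mill__temp_avg'.
    """
    importance = np.abs(shap_values).mean(axis=0)
    ranking = pd.DataFrame({
        "feature": feature_names,
        "mean_abs_shap": importance,
        "department": [name.split("__")[0] for name in feature_names],
    }).sort_values("mean_abs_shap", ascending=False)
    return ranking.head(top_k)

# If most of the top-k features carry the 'steel_mill' prefix, the defects observed in
# the rolling mill batch are being driven mainly by upstream (steel mill) variables.
```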
4. Discussion
The results demonstrate the potential of the proposed intelligent assistant to support expert decision-making across departments by replicating their logic and anticipating quality issues in advance. The performance varied depending on the department, with the steel mill stage yielding some of the most promising results. Specifically, the model achieved a sensitivity of 75% for the “SCRAP” label, correctly identifying three out of four billets that should be rejected before processing. Additionally, the system demonstrated a precision of over 72%, meaning most predictions for rejection were valid. Out of 59,578 analyzed billets, 403 would have been correctly flagged as scrap before production, potentially avoiding unnecessary processing. Only 97 billets (0.16%) would have been wrongly flagged. These results are in line with the findings in [
7], where the application of ML-based classifiers for hot-rolling quality grading also resulted in high precision and effective early rejection of defective outputs. A similar direction is observed in [
9], where a hybrid model combining AdaBoost and LSTM outperformed other approaches in a real manufacturing setting, highlighting the value of ensemble deep learning architectures for anticipating quality issues based on production data.
In the logistics department, the intelligent system also showed strong performance, particularly in terms of early detection. With a sensitivity above 70%, the model would have correctly flagged 2536 defective billets (4% of total production) before entering the billet park. This level of anticipation, although with 641 false alarms, represents a valuable opportunity for reducing downstream defects. The combination of temporal models and decision emulation, such as LSTM with MLP, has demonstrated high predictive power across complex industrial datasets, as validated in real steel manufacturing environments.
The rolling mill posed a more complex challenge, especially for the “DOWNGRADE” label. The system struggled with this class, likely due to the limited number of labeled samples and the subtle differences distinguishing it from the “OK” and “SCRAP” categories. These types of difficulties are common in industrial settings, particularly when dealing with visually ambiguous or infrequent defects. The integration of explainable AI techniques, like those explored in [
11], enhances transparency by translating random forest models into semantic rules. This interpretability supports informed decision-making in steel production and highlights the value of combining domain knowledge with machine learning. Such findings highlight the importance of expert knowledge, data balance, and interpretable models when dealing with rare or borderline defect classes in steel manufacturing. Nonetheless, for the critical “SCRAP” cases in the rolling stage, the AI achieved 79% sensitivity and 81% precision, correctly identifying 598 defective coils and issuing only 56 false positives. This suggests that the temporal modeling performed by the LSTM-based architecture is well suited to detect anomalies in sequential data where quality evolves along the production line.
The sales department presented additional complexity. The model achieved an F1-score of 72%, lower than the performance in upstream departments. This is likely due to the narrower data diversity and the influence of commercial criteria that are not always encoded in sensor data. Similar conclusions are drawn in [
11], where integrating domain knowledge and contextual semantic reasoning was essential to enhance the interpretability and robustness of machine learning decisions in steel production. The formalization of expert knowledge into SWRL rules, when combined with data-driven classification, demonstrated a significant improvement in decision support capabilities. Our results suggest that enriching the sales decision module with data from earlier stages—as proposed in the interdepartmental model—could improve classification accuracy by providing a broader context. This approach also aligns with the hybrid methodology validated in [
11], where expert knowledge was formalized into SWRL rules and combined with data-driven classifiers.
From a detailed comparative perspective, ref. [
11] presents a hybrid decision support tool for quality classification in steel production, combining semantic rules written in SWRL with random forest classifiers. Their system was evaluated on real cold-rolled steel data to assist human experts in interpreting quality outcomes using structured knowledge and machine learning inference. While effective, the system operates after production, classifying finished products rather than predicting quality before fabrication as we do. In their results, the reported weighted precision and recall reached 78%, and the weighted F1-score was 77%, demonstrating robust classification capabilities in a static setting. In contrast, our approach tackles a significantly more complex scenario, as it focuses on anticipating product quality prior to completion by integrating heterogeneous, sequential data from multiple departments. Despite this increased difficulty, our expert modules achieved strong performance across departments: the steel mill module reached a precision of 78.01%, recall of 80.50%, and F1-score of 79.18%; the logistics module showed 75.67% precision, 74.54% recall, and 75.10% F1-score; the rolling mill achieved 77.63% precision, 74.57% recall, and 76.06% F1-score; and the sales module reported 74.59% precision, 71.71% recall, and 72.71% F1-score. These values confirm the robustness of our anticipatory and hybrid methodology, delivering performance that matches or exceeds post-production models like [
11] under more challenging predictive conditions.
The interdepartmental model offers significant potential to compensate for local uncertainties by aggregating information from multiple departments. While [
8] focuses on a single-stage prediction task, their results show that leveraging temporal sensor data across various positions within a production line enhances the accuracy of downstream property estimation. This suggests that broader integration of production data may further improve predictive performance in multi-stage industrial systems. Moreover, by adopting explainable components, such as feature importance ranking (from random forests), it becomes possible to identify which departments or variables are most responsible for quality degradation—a principle validated by [
9] in real-world rolling fault detection scenarios. The hybrid use of LSTM networks for temporal prediction and random forests for final decision logic proved to be a robust and interpretable combination for industrial applications.
To bridge the gap between research and industrial deployment, a pilot-scale validation plan is proposed. It involves selecting a production line with sufficient sensor coverage and historical data and initially operating the AI system in parallel to existing decision-making processes without influencing real-time operations. The system’s outputs would be systematically compared against expert judgment and current quality control protocols to build trust. Depending on performance, low-risk decisions would be progressively automated, ultimately leading to full integration into Level 2 automation and continuous monitoring for iterative refinement.
In parallel, the adaptability of the proposed system to different industrial environments must be addressed. While many data types and sensor configurations are shared across steel plants, transferring the model to new facilities would likely require some degree of retraining to reflect site-specific conditions. Currently, the absence of a systematic identification of the key variables that influence model transferability represents a limitation. Tackling this challenge will be essential for enhancing the system’s generalization capabilities and enabling scalable adoption across diverse production contexts.
In summary, the proposed AI system not only replicates expert decision patterns but, in many cases, anticipates quality deviations earlier in the process. Its consistent performance across departments combined with high interpretability and modular design position it as a promising solution for enhancing product quality and process stability in steel manufacturing environments. While initial steps toward industrial integration and validation have been outlined, further efforts will be needed to assess adaptability across different production sites and to benchmark the system against alternative approaches. These findings reinforce the growing trend of deploying AI-assisted quality assurance systems in complex production chains.
5. Conclusions
This study demonstrates that an AI-based system, which integrates event filtering, expert knowledge extraction, and advanced inference modules, can effectively predict quality issues in the steel rolling process. With an overall accuracy around 80%, the system shows promise in replicating and anticipating expert decisions, leading to potential reductions in production losses and improved operational efficiency. Although certain categories—such as the “DOWNGRADE” label in the rolling mill and the complex decisions in the sales module—require further data enrichment and model integration, the system provides actionable insights for process improvement. Future research should focus on expanding the dataset, refining the model for challenging categories, and exploring tighter integration across departments to further enhance predictive accuracy.
Related to the industrial applicability, the proposed system is designed with practical integration in mind. Most of the targeted production facilities are already equipped with a certain degree of sensorization, typically connected to automation level 1, and in more advanced setups, integrated into level 2. Our approach leverages this existing infrastructure by continuously collecting data from these levels. The decision-making algorithms would be executed on dedicated industrial hardware, and the resulting outputs would be automatically communicated back to level 2. This enables real-time interventions, such as product rejection, downgrade, or blocking, ensuring that predictive insights directly inform operational decisions.