1. Introduction
One of the most important challenges of our times is to reduce the climate impacts of industrial processes. Reducing harmful gas emissions and preventing soil and water contamination play an important role toward that goal.
In petroleum refining, steam and water streams are used in various units, such as hydroprocessing, delayed coking, and fluidized catalytic cracking. One of the effluents from these units is referred to as sour water, as it is contaminated with various weak electrolytes such as phenol, ammonia (NH3), hydrogen sulfide (H2S), and possible traces of carbon dioxide (CO2) [1,2].
After the refining stage, sour water is directed to the sour water treatment unit (SWTU) to reduce contaminant levels, focusing primarily on the removal of NH3 and H2S. The stream undergoes a stripping process, being subjected to a heating and rectification system where the necessary heat is provided to reduce the partial pressures of gases and separate the contaminants [3].
This process is usually carried out with two columns when ammonia concentration in sour water is high, as illustrated in Figure 1. In this configuration, the first column produces a top stream rich in H2S, known as acid gas, which is directed to the sulfur recovery unit (SRU). The second column is responsible for separating the bottom output from the first one into two streams: a top stream rich in NH3, known as ammoniacal gas, which is sent to the ammonia incinerator, and a bottom stream referred to as treated water.
According to Brazilian environmental legislation, at least 90% of the incoming H2S load in sour water must be removed and sent to the SRU. This is a critical point in the process because if H2S residues are sent to the second column, they will be eliminated along with ammoniacal gas (NH3), producing SOx, which is environmentally undesirable as it contributes to air pollution and is a precursor to acid rain [4,5,6]. Nevertheless, a high efficiency in H2S removal can have negative consequences on the operation of the SRU. An increase in H2S removal can lead to an increase in NH3 concentration in the acid gas, causing operational issues in the unit. This problem may result in efficiency loss due to line blockages and the formation of NOx, a compound highly detrimental to the environment and public health, as it is also a precursor to acid rain and photochemical smog [7,8].
Therefore, maintaining a high H2S recovery in the first tower while simultaneously reducing the NH3 content in the acid gas represents conflicting goals and characterizes an operation with a narrow tolerance range for units with this configuration. Furthermore, small disturbances that cause variations in the composition of the SWTU feed streams can result in faults that ultimately lead to the emission of environmentally harmful compounds above regulatory limits.
Currently, a major operational challenge is to monitor the effluent composition of these units online. Soft sensors play an important role in solving these difficulties. Soft sensors are mathematical models or algorithms that allow for estimating the value of a variable of interest based on available process information, without the need for direct measurements of that variable [9]. They can be employed for real-time monitoring of process variables, enabling trend and anomaly identification, and triggering alerts and notifications when potential problems are detected. Additionally, they offer substantial cost savings compared to traditional sensors, as they eliminate the need for physical installation and maintenance and are not affected by obsolescence due to corrosion in industrial environments [10]. Soft sensors can also assume the role of physical sensors in situations of unexpected failure. This ensures that operators have continuous access to the necessary information to monitor and control processes, reducing the risk of unplanned disruptions due to lack of essential information [2]. Finally, for process control applications, they eliminate the measurement delay issues associated with complex laboratory analyses [11]. Thus, the development of soft sensors for an SWTU is an advantageous solution for investment and precision issues compared to online analyzers and laboratory analyses.
There are three main approaches used to estimate variables based on other measurements. These are (i) model-driven (or white-box), in which a model based on fundamental principles is employed; (ii) data-driven (or black-box), which relies solely on historical process data, allowing for models to be created using machine learning techniques; and (iii) gray-box, which is a combination of the two previous methods [12,13].
The main objective of the present study was the development of soft sensors for predicting the composition of all main effluents from an SWTU with a two-stripping-tower configuration using a gray-box or hybrid approach.
In the initial phase of the investigation, a new database was developed to support this research, utilizing a modified version of the dynamic model of an SWTU with two stripping columns previously developed in Aspen Plus Dynamics® V10 by [1]. Artificial intelligence approaches, including ensemble methods (decision trees) such as gradient boosting [14] and random forest [15,16], along with support vector machines [17,18], were initially compared for soft sensor creation using the simulated data generated by the phenomenological model. Subsequently, the best AI paradigm from the initial comparison was selected to develop six models for predicting H2S and NH3 concentrations in each output stream of the unit. Hyperparameter optimization was applied in the development of each AI model. The importance of the input variables for the prediction of each soft sensor was determined and analyzed.
In the second part of this work, the methodology was validated by applying it to a real industrial-scale unit. The goal was to utilize the available information about the process provided by the phenomenological model to enhance and calibrate the data-driven AI-based model. This approach allowed for the combination of theoretical understanding with the flexibility of data-driven modeling. Compared to other techniques that implicitly integrate machine learning techniques with physical equations [19], the present proposal aims to be more intuitive and natural, as the intention was to develop a tool to assist the operator in a human-centered Industry 5.0 context [20].
Although developments in soft sensors for other wastewater treatments are more common [21], applications specifically addressing SWTUs are scarce. In [22], soft sensor models were developed to estimate the removal efficiency of H2S in an SWTU. The collected data spanned a two-year operational period of a sour water stripper processing non-phenolic acid water. Only steady-state data were employed, so that a total of 144 different operating conditions were considered. A simplified phenomenological approach developed in AspenOne® and a linear statistical model did not show reliability in the results obtained. On the other hand, a multilayered perceptron (MLP) neural network proved suitable for the desired application. In [23], a semi-supervised learning approach based on deep neural networks was employed to develop two soft sensors for estimating the concentrations of H2S and NH3 in the wastewater of an SWTU with only one column. This study addresses the difficulty of labeling data, as, in certain industrial scenarios, input variables are constantly measured, while output variables are quantified only once or twice a day. The results of the deep neural network-based sensors used were compared with MLP network models. The sensors developed with deep neural networks exhibited better performance and proved to be an effective strategy for industrial applications where a shortage of labeled data can significantly impact the performance of data-driven soft sensors.
The environmental importance of treating sour water in refineries, combined with the operational challenges of SWTUs and safety concerns, establishes the need for real-time predictions of H2S and NH3 mass fractions in the effluent. The soft sensors proposed here will aid in the operation, monitoring, optimization, and control of the process.
The major contribution of this work is the development of six soft sensors for a two-column SWTU. To achieve this, intermediary contributions were made, including the generation of dynamic operational data representative of the units, a comprehensive analysis of the produced data, and the inclusion of industrial validation. Methodologically, the synergistic use of a phenomenological model alongside data-based models for the industrial case is an important contribution from a practical standpoint.
2. Materials and Methods
2.1. The Phenomenological Model
In [1], the author conducted the dynamic simulation of an SWTU in Aspen Plus Dynamics® V10 using the gas process association sour water equilibrium (GPSWAT) model, considered the most suitable for sour water applications. In this simulation, there are thirty-four material streams and seven different compositions, three heat streams, and twenty-three pieces of equipment: twelve valves, three heat exchangers, three splitters, two mixers, one pump, and two columns. The block diagram in Figure 2 summarizes all the streams and key equipment in the simulation, where H1, H2, and H3 are the heat exchangers, and C1 and C2 are the stripping columns.
The seven different compositions described in the process are sour water, which is the incoming stream to be treated; acid gas, which is the overhead stream from the first column, rich in H2S; ammoniacal gas, which is the overhead stream from the second column, rich in NH3; pre-treated water, which is the bottom stream from the first column, hence rich in NH3; treated water, which is the bottom stream from the second column, containing small amounts of dissolved H2S and NH3 contaminants; new water, which is an aqueous stream without contaminants; and finally, clean water, which is the mixture of new water with treated water.
Table 1 correlates the abbreviations of these streams with their compositions.
The two input streams of the process are SW1 and NW1, while the five output streams are ACG2, AMG2, CW4, CW6, and CW9. SW1 is the sour water to be treated, and NW1 is the new uncontaminated water source that dilutes the treated water stream after passing through the two columns.
Table 2 displays the physical properties of the input streams.
These streams had their mass flow described as a function of the SW1 mass flow, defined as SW1F. These values were anonymized because the information underlying the simulations developed by [1] is confidential.
Finally, ACG2 and AMG2 are the outputs of acid gas and ammoniacal gas, respectively. CW4, CW6, and CW9 are the outputs of clean water with the same composition.
The simulation begins with the SW1 stream, which is preheated in heat exchangers H1 and H2 through energy integration with streams CW7 and PW5, respectively, becoming SW4. SW4 is divided into SW5 and SW7. SW5 becomes SW6 after passing through valve V2 and enters the top of column C1. SW7 is preheated once again in heat exchanger H3 through energy integration with PW4 and becomes SW9 after passing through valve V3, subsequently entering the lower feed of column C1. Therefore, streams SW6 and SW9 feed C1.
Column C1 is heated by the thermal load Q1 from the reboiler, separating the overhead stream ACG1 (acid gas) from the bottom stream PW1 (pre-treated water). ACG1, after valve V4, becomes ACG2 and is removed from the system, while PW1 is split into two streams: PW2 and PW4. PW4 is cooled in heat exchanger H3 through energy integration with SW7 and becomes PW5. PW5 is further cooled in H2 through energy integration with SW3 and becomes PW6. PW2 becomes PW3 after passing through valve V5 and is mixed with PW6, forming PW7. It is noteworthy that PW7 has the same mass flow as PW1, as the streams that split reunite without losses. After valve V6, PW7 becomes PW8 and enters the feed of column C2.
Column C2 has a condenser and a reboiler, with thermal loads Q2 and Q3, respectively. The overhead stream from Column 2 is AMG1 (ammoniacal gas), and the bottom stream is TW1 (treated water). AMG1, after valve V7, becomes AMG2 and is removed from the system. TW1, after valve V8, becomes TW2. TW2 is mixed with NW2, becoming clean water (CW1). After passing through pump P1, CW1 is then called CW2. CW2 is divided into three streams: CW3, CW5, and CW7. CW3, after valve V10, becomes CW4 and is removed from the system. CW5, after passing through valve V11, becomes CW6 and is removed from the system. Meanwhile, CW7 is cooled in heat exchanger H1 through energy integration with SW2, becoming CW8. CW8, after valve V12, becomes CW9 and is removed from the system. A more complete view of the process is in Figure 3, showing the static simulation developed by [1].
There are two ‘RadFrac’ type columns and three ‘HeatX’ type heat exchangers in the simulation. Column C1 consists of 5 stages, and Column C2 consists of 6 stages. The feed occurs at stages 1 (SW6) and 2 (SW9) in C1, and at stage 2 (PW8) in C2. The two mixers were configured to receive both liquid and vapor phases. Valves V4 and V7 were configured only to receive vapor, while the other valves accept only the liquid phase. All valves have an associated pressure drop.
The dynamic model by [1] includes 14 controllers added to ensure the operability and stability of the process, as well as aid in the convergence of the dynamic simulation. The controllers in the simulation are the following: two for pressure at the tops of the columns; four for level at the bottom and top of the columns; three for temperature, including one before the first column, one in its second stage, and another at the top of the second column; four mass flow controllers, two before the first column and two after the second column; and an integrated control of mass flow and thermal load in the second column. Table 3 describes the controlled and manipulated variables for each of these controllers. The tuning of the controllers was performed based on general rules described in the literature and through industry expert advice [1].
The first step in creating the new database for the development of soft sensors involved the inclusion of six controllers not previously used in the dynamic simulation of the SWTU with two stripping columns by [1]. Among these controllers, two were incorporated into the acid gas stream (ACG1) at the top of Column 1 (ACID_G_H2S and ACID_G_NH3), another two were added to the ammoniacal gas stream (AMG1) at the top of Column 2 (AMON_G_H2S and AMON_G_NH3), and two more (WATER_H2S and WATER_NH3) were introduced into the treated water stream (TW1) withdrawn from the bottom of Column 2. The controllers present in the original simulation can be seen in Table 3.
Figure 4 includes all the controllers.
To enable the collection of dynamic data for the fractions of H2S and NH3 in the effluents described, the PID (proportional, integral, and derivative) controllers included were adjusted with a gain of 0, an integral time of 1 min, and a derivative time of 0. This configured them to operate solely as ‘indicators’, without effectively performing control functions during the simulation, only storing the dynamic data.
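The effect of a zero gain can be illustrated with a minimal sketch. It assumes the ideal (ISA) PID form, in which the gain multiplies the proportional, integral, and derivative terms; the algorithm used inside the simulator may differ, and the class name, bias value, and sample values below are hypothetical.

```python
# Minimal sketch of an ideal-form PID block used only as an indicator.
# Assumption: u = bias + Kc * (e + (1/Ti) * integral(e) + Td * de/dt).
# All names and numbers are illustrative, not taken from the simulation.

class PIDIndicator:
    def __init__(self, gain, integral_time_min, derivative_time_min, bias=0.0):
        self.gain = gain
        self.ti = integral_time_min
        self.td = derivative_time_min
        self.bias = bias
        self.integral = 0.0
        self.prev_error = None
        self.history = []  # stored process-variable samples

    def step(self, setpoint, pv, dt_min):
        error = setpoint - pv
        self.integral += error * dt_min
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt_min
        self.prev_error = error
        self.history.append(pv)  # the block still records the measurement
        action = error + self.integral / self.ti + self.td * derivative
        return self.bias + self.gain * action


indicator = PIDIndicator(gain=0.0, integral_time_min=1.0,
                         derivative_time_min=0.0, bias=50.0)
pv_samples = [0.900, 0.902, 0.897, 0.905]  # hypothetical H2S mass fractions
# dt_min = 0.6 corresponds to a 36 s sampling interval
outputs = [indicator.step(setpoint=0.900, pv=pv, dt_min=0.6) for pv in pv_samples]
```

With a gain of 0, the output never leaves the bias, so the block cannot act on the process, while `history` accumulates the measured values.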
2.2. Development of the Database and Phenomenological Study of the Process
The simulations, calculations, and codes were developed and executed on a computer with an Intel(R) Core™ i5 7200U with a 2.50 GHz–2.70 GHz CPU, 8.00 GB installed memory (RAM), and Windows 10 64-bit operating system. The dynamic simulation of the SWTU was performed using Aspen Plus Dynamics® V10. All data processing and artificial intelligence algorithms were implemented in Python 3.11.4 within the integrated development environment Visual Studio Code 1.79.2. The libraries NumPy (1.25.2), Matplotlib (2.0.2), joblib (1.3.2), scikit-learn (0.19.0), pandas (2.1.0), Seaborn (0.12.2), and Optuna (3.3.0) were extensively used.
The dynamic simulation, after the changes discussed in Section 2 and illustrated in Figure 4, was tested in various scenarios of normal operation with disturbances in five variables of interest, namely, SW1 mass flow rate, SW5 mass flow rate, SW1 temperature, and NH3 and H2S mass fractions in SW1. These variables were chosen because they are input process variables that could undergo variations in a real process. SW1 is the sour water entering the system for treatment, and SW5 represents 10% of SW1, which, after passing through the heat exchangers, enters the top of column C1. Normal operation is characterized when events such as complete valve opening or closing, overflows, or lack of level in the columns do not occur.
2.3. Comparison of Artificial Intelligence Algorithms
To create the six soft sensors, AI tools such as random forests (RF), gradient boosting (GB), and SVM models with linear and RBF kernels were evaluated using the same training and test datasets. The methods of gradient boosting (GradientBoostingRegressor function), random forest (RandomForestRegressor), and SVM (SVR function; linear and RBF kernels) from the scikit-learn library (version 0.19.0) were employed. These kernels were chosen because they demonstrated the best results and the shortest computational times in a previous study [24]. These tools were chosen because they are simple, shallow machine learning methods that are not computationally demanding and can be easily adopted in industrial environments.
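A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic and the hyperparameter values are illustrative assumptions, so the scores say nothing about the SWTU database; the sketch only shows the four regressors being trained and scored on identical training and test sets.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic nonlinear data standing in for the process database (assumption).
rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(600, 4))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=600)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Fit the scaler on training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Illustrative settings, not the values used in the study.
models = {
    "RF": RandomForestRegressor(n_estimators=50, max_depth=10, random_state=0),
    "GB": GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0),
    "SVM-linear": SVR(kernel="linear"),
    "SVM-RBF": SVR(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    scores[name] = {
        "R2": float(r2_score(y_test, pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
    print(f"{name}: R2={scores[name]['R2']:.3f}, RMSE={scores[name]['RMSE']:.3f}")
```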
The Optuna library was used for hyperparameter optimization of the RF algorithm, aiming to obtain the most effective and accurate model for each of the six sensors. The optimized hyperparameters included the number of estimators (rf_n_estimators), representing the number of trees in the forest, the maximum depth of the trees (rf_max_depth), and the minimum number of samples required to form a leaf (terminal node) of the tree (rf_min_samples_leaf). The lower and upper limits were defined for each of the RF hyperparameters and are shown in Table 4. Bounding the search in this way makes the optimization more efficient by eliminating the need to consider extreme or impractical values, helps to prevent overfitting or underfitting, and saves computational time. The number of iterations (or number of trials) for the hyperparameter optimization process was set to 40.
Analogously to the RF, the same hyperparameters mentioned in Table 4 were optimized for the GB algorithm using the same limits and number of iterations, while for the SVM models, there was no optimization of the hyperparameters (C and gamma), as optimizing them did not yield significant gains as shown by [24].
Data processing takes place after storing the dynamic data from all 20 controllers after each simulation (or simply ‘run’). Concatenating this information within normal operation resulted in the development of the database, which was later divided into training and testing data to be used in the development of AI-based soft sensors. To guarantee the representativeness and efficient training of the models, it was ensured that each database had at least one example of each simulation in normal operation. Additionally, all data related to a single run were consolidated in the same database to prevent data leakage, especially considering that the simulations contain temporal data. Furthermore, data shuffling, separation into inputs and outputs, as well as normalization were performed.
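The run-level split described above can be sketched in plain Python. The function name, the scenario labels, the run identifiers, and the 40% test fraction are hypothetical; the point is only that whole runs, not individual samples, are assigned to one set, and that every scenario appears in both sets.

```python
import random


def split_by_run(run_ids_by_scenario, test_fraction=0.4, seed=0):
    """Assign whole runs to train or test, keeping every scenario in both sets."""
    rng = random.Random(seed)
    train_runs, test_runs = set(), set()
    for scenario, run_ids in run_ids_by_scenario.items():
        if len(run_ids) < 2:
            raise ValueError(f"scenario {scenario!r} needs at least two runs")
        runs = sorted(run_ids)
        rng.shuffle(runs)
        # At least one run per scenario goes to each set.
        n_test = min(len(runs) - 1, max(1, round(test_fraction * len(runs))))
        test_runs.update(runs[:n_test])
        train_runs.update(runs[n_test:])
    return train_runs, test_runs


# Hypothetical scenarios and run identifiers.
scenarios = {
    "SW1 mass flow": ["run01", "run02", "run03"],
    "SW5 mass flow": ["run04", "run05"],
    "C1S2 temperature ramp": ["run06", "run07", "run08", "run09"],
}
train_runs, test_runs = split_by_run(scenarios)

# Each sample carries its run identifier; samples follow their run.
samples = [("run01", 0.91), ("run01", 0.92), ("run04", 0.90), ("run05", 0.89)]
train_samples = [s for s in samples if s[0] in train_runs]
test_samples = [s for s in samples if s[0] in test_runs]
```

Because a run is never divided between the two sets, consecutive and highly correlated samples of the same simulation cannot leak from training into testing.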
After dividing the database, the training set consisted of 45,483 samples (59%), while the test set contained 31,525 samples (41%). A sampling time of 36 s was considered. The listing with the description of each simulation that composed each set can be found in Table 5 and Table 6.
To clarify the division of runs between training and testing databases, each database contains one example of each simulation. Although they share the same description, each simulation is unique, featuring various versions in terms of intensity, amplitude, and timing of the disturbances. The normalized database can be found in [25].
2.4. Soft Sensors Using Random Forests
An RF model was developed for each of the six virtual sensors: ACID_G_H2S, ACID_G_NH3, AMON_G_H2S, AMON_G_NH3, WATER_H2S, and WATER_NH3. Each algorithm was trained, validation was conducted with 20% of the training data, and the final evaluation of the best model for each sensor was performed using the test data. Hyperparameter optimization was conducted for all sensors using the lower and upper limits specified in Table 4.
2.5. Global Analysis of the Importance of Variables and Phenomenological Study of the Process
The randomForest package provides valuable information for phenomenologically evaluating a process: the measurement of the importance of predictor variables (variable importance, VI). The significance of a variable arises from its complex interaction with other variables. The algorithm assesses the importance of a variable by observing the extent to which the prediction error increases when the data for that variable is altered, while keeping all other variables unchanged. The necessary calculations are conducted tree by tree as the random forest is constructed [26]. This functionality enables the reduction of the dataset of interest by eliminating less important variables, ensuring the simplification and optimization of the model, along with a reduction in computational processing time [27,28].
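The idea of measuring importance by the increase in prediction error when one variable is altered can be sketched as follows. This is a simplified, model-agnostic illustration computed on a whole dataset rather than tree by tree, and it is not the implementation of any particular package; the surrogate model, the feature names, and the data are hypothetical. The 2% cut-off is shown only as an example of a threshold below which variables might be discarded.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, seed=0):
    """Normalized increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    increases = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # alter only variable j, keep the others unchanged
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        error = mse(y, [predict(row) for row in X_perm])
        increases.append(max(error - baseline, 0.0))
    total = sum(increases)
    return [inc / total if total > 0 else 0.0 for inc in increases]


def surrogate_model(row):
    # Hypothetical stand-in for a fitted model: x0 dominates, x2 is unused.
    return 5.0 * row[0] + 0.3 * row[1]


data_rng = random.Random(1)
X = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [surrogate_model(row) for row in X]

names = ["x0", "x1", "x2"]
importance = permutation_importance(surrogate_model, X, y)
kept = [name for name, imp in zip(names, importance) if imp >= 0.02]
```

A variable the model never uses, such as `x2` here, produces no error increase when shuffled and therefore receives zero importance.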
Therefore, an analysis of the most relevant variables was conducted for each of the sensors. Variables with importance below 2% were excluded, and the predictive capacity of the retrained sensors was evaluated using the same presented metrics (R2, MAE, and RMSE). This reduction of the input dataset, achieved by removing less important variables, enables the simplification and optimization of the model, thus increasing the analysis speed.
2.6. Implementation to a Real SWTU
In this section, the knowledge gained with the approach based on the phenomenological model is employed to develop random forest models using data from a real SWTU of a Brazilian oil refinery. The chosen unit for this development has the same flowsheet configuration of the simulated plant, making it possible to use the same inputs as in the simulated modelling.
The WATER_NH3 sensor was selected for modeling. For this sensor, the plant utilizes an online analyzer that produces results approximately every 20 min. Unfortunately, analyses of the acid and ammoniacal gases were not conducted on site, and only a limited number of laboratory results were available for H2S in treated water. This was expected, given the hazards involved in sampling acid gas and the fact that H2S is not a primary concern in treated water: H2S is easier to remove than NH3 in the second column, so it makes more sense to control only the NH3 content of the treated water.
Even though the online analyzer generates results every 20 min, it is worth mentioning that any control application to be implemented in the plant would need information at least every minute, which justifies modeling this kind of sensor even in a plant with an online analyzer.
Given that operating conditions of the industrial plant did not precisely match the intervals used in the simulation runs, it was not prudent to directly use the models obtained with simulation data. Therefore, the sensor was retrained with real plant data. Variables SW5-FC, SW1-FC, SW8-TC, ACG1-PC, C1S2-TC, C1S-LC, AMG1-TC, AMG1-PC, C2D-LC, and C2S-LC were considered as input variables. Variables CW8_FC, C2HD-IC, and CW5-FC were not included, as these measurements were not available in the real plant.
The first step of data acquisition involved retrieving results from the analyzer, starting from its installation date, covering a period of 22 months and yielding 40,338 results. Subsequently, rather than acquiring the input data at the exact timestamps of the analyzer results, each input was collected as a 20 min time average over the period preceding each timestamp. This approach minimizes the effects not only of process dynamics and dead times but also of the disturbances characteristic of any industrial plant. It is believed that using the input values at exactly the timestamps of the analyzer results would not correctly represent the information that generated the outputs.
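The alignment of inputs to analyzer results can be sketched as a trailing average. The function name, the one-minute input sampling, and the numerical values are assumptions made for illustration; the window is taken as the 20 min immediately preceding (and including) each analyzer timestamp.

```python
from datetime import datetime, timedelta


def trailing_mean(samples, end_time, window=timedelta(minutes=20)):
    """Mean of (timestamp, value) samples in (end_time - window, end_time]."""
    values = [v for t, v in samples if end_time - window < t <= end_time]
    return sum(values) / len(values) if values else None


start = datetime(2023, 1, 1, 0, 0)
# Hypothetical one-minute readings of a single input variable.
flow = [(start + timedelta(minutes=i), 100.0 + i) for i in range(60)]

# Hypothetical analyzer timestamps, 20 min apart.
analyzer_times = [start + timedelta(minutes=20), start + timedelta(minutes=40)]
features = [trailing_mean(flow, t) for t in analyzer_times]
print(features)  # [110.5, 130.5]
```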
Before modelling, an interquartile range (IQR) method was applied to remove outliers. The scale value used was 1.7; for approximately normally distributed data, this corresponds to discarding values more than about 3 standard deviations from the respective mean. The final useful data were composed of 21,433 rows. The data were normalized using the StandardScaler method from the scikit-learn Python module. The normalized data were split into 10% for testing and 90% for training, of which 20% was reserved for validation.
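The link between an IQR scale of 1.7 and roughly 3 standard deviations holds under the assumption of normally distributed data, as the short sketch below checks. The filter function and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an arbitrary choice that may differ from the one used in the study.

```python
from statistics import NormalDist, quantiles


def iqr_filter(values, scale=1.7):
    """Keep values inside [Q1 - scale*IQR, Q3 + scale*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - scale * iqr, q3 + scale * iqr
    return [v for v in values if low <= v <= high]


# For a standard normal distribution, Q3 = -Q1 and IQR = 2 * Q3,
# so the upper fence expressed in standard deviations is Q3 + 1.7 * IQR.
z_q3 = NormalDist().inv_cdf(0.75)
upper_fence_sigma = z_q3 + 1.7 * (2.0 * z_q3)
print(round(upper_fence_sigma, 2))  # 2.97

# Hypothetical readings with one obvious outlier.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 25.0]
cleaned = iqr_filter(readings)
```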
Model training was executed with the same methodology used in training with simulated data, which consisted of using Python’s Optuna library to explore the hyperparameters of random forest regressor, i.e., rf_min_samples_leaf (1 to 10), rf_max_depth (5 to 10), and rf_n_estimators (1 to 50). The best model hyperparameters were, respectively, 1, 10, and 48.
3. Results
3.1. Database and Phenomenological Analysis of the Process
A preliminary study of integration methods [29] revealed that the standard implicit Euler integrator was the most suitable when compared to other available options, such as the fourth-order Runge–Kutta and Gear integration methods. The observed integration errors could be attributed to the inherent complexity of the differential equations in the system.
The initial strategy for building the database was to develop a global run consisting of four random disturbances ranging between +1% and −1% of the set point value for each of the five variables subjected to disturbances. Simultaneously, partial runs were developed that featured only the disturbances of a specific variable at the same time they occurred in the global run. That is, the partial run of the mass flow rate of SW1, for example, refers to the global run in which all disturbances not related to the mass flow rate of SW1 were removed from the programmed task.
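The relationship between a global run and its partial runs can be sketched as a disturbance schedule. The set-point values, the time horizon, and the function names are hypothetical; only the structure follows the text: four random step changes within ±1% of the set point for each of the five variables, with a partial run keeping the events of one variable at the same times as in the global run.

```python
import random


def disturbance_schedule(set_points, n_steps=4, max_dev=0.01,
                         horizon_h=10.0, seed=0):
    """Return time-sorted (time_h, variable, new_value) step disturbances."""
    rng = random.Random(seed)
    schedule = []
    for name, sp in set_points.items():
        for _ in range(n_steps):
            t = rng.uniform(0.0, horizon_h)
            value = sp * (1.0 + rng.uniform(-max_dev, max_dev))
            schedule.append((t, name, value))
    return sorted(schedule)


def partial_run(schedule, variable):
    """Keep only the events of one variable, at their original times."""
    return [event for event in schedule if event[1] == variable]


# Hypothetical set points (the real values are confidential).
set_points = {
    "SW1 mass flow": 100.0,
    "SW5 mass flow": 10.0,
    "SW1 temperature": 40.0,
    "SW1 NH3 mass fraction": 0.005,
    "SW1 H2S mass fraction": 0.004,
}
global_run = disturbance_schedule(set_points)
sw1_flow_run = partial_run(global_run, "SW1 mass flow")
```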
The results from the global run, which includes random disturbances ranging from −1% to +1%, along with their respective partial runs, are shown in Figure 5. Similar studies were also conducted for the other effluent streams, considering H2S and NH3 [29]. After interpreting the simulations, it is concluded that the mass flow rates of SW1 and SW5, and the temperature of SW1, are extremely important variables for the thermodynamic separation process in the two stripping columns. This is because they directly interfere with the fractions of H2S and NH3 in the acid gas (effluent from Column 1), ammoniacal gas, and treated water (effluents from Column 2). This is a consistent result, as these are characteristics linked to the input stream of the process.
However, despite the applied deviations, it was observed that the disturbances caused only a slight variation in the fractions of H2S and NH3 in the effluents, especially in the H2S fraction of the acid gas, which presented only a 0.12% variation. This result was not ideal, as it would imply low variability in the data subsequently used in the soft sensors training.
Therefore, new global runs were developed with substantially more intense disturbances (ranging from −25% to +30% of the set point value) for each of the five variables studied individually, while maintaining disturbances between +1% and −1% for the other input variables. The objective was to achieve, at least, a 1% variation in the H2S fraction of the acid gas and to keep the simulation within normal operation. Furthermore, the aim was also to determine which of the analyzed variables were most relevant to the variation in H2S fraction.
The global runs with increased magnitude disturbances in the SW1 mass flow rate, SW5 mass flow rate, SW1 temperature, and NH3 and H2S mass fractions in SW1 still showed a small variation in acid gas H2S fraction, as displayed in Table 7. Among them, the global run of the mass flow rate of SW5 proved to be the most effective. Therefore, a new global run was created for the mass flow rate of SW5 with disturbances of higher magnitude than the previous ones (ranging from −30% to +40% of the set point value for the variable), resulting in a 1% variation in acid gas H2S fraction, while keeping normal operation. The general behavior observed reflects the effectiveness of the regulatory control implemented, which adhered to industrial standards and minimized large variations in the controlled variables.
Aiming at further expanding the database diversity and contributing to the future development of more robust sensors, an exploratory analysis was conducted within the dynamic simulation to identify the variable with the greatest potential to modify the H2S fraction in the acid gas. The attempt was to ‘trick’ the controller’s set point to understand how the other variables in the system would be affected. For this purpose, a splitting element was added to the simulation, where Input 1 was the mass fraction of H2S in the acid gas stream, and the output was the remote set point (SP) of the ACID_G_H2S controller. While the simulation was running, the value of Input 2 was altered to set a new SP outside the normal operation of the controller so that the responses of the other variables involved in the process could be evaluated.
This process involved trying to emulate multiple scenarios and disturbances in the dynamic simulation and observing the long-term effect they had in the composition of the outlet process streams. This was important in the database development step of the soft sensors’ creation, aiming to ensure a more realistic and diverse database, encompassing several different normal operational conditions, without anomalies.
After the test, it was identified that the variable most affected by the change was the temperature of the second stage of Column 1, a variable controlled by the C1S2-TC controller. Based on this, new runs were generated with a focus solely on evaluating this variable. One of them consisted of a gradual ramp-up of the temperature of the second stage of Column 1, by applying only positive disturbances (between 2% and 15%) to the set point of the C1S2-TC controller. This resulted in a variation in the H2S fraction in the acid gas of 2.3%, which is more than double the value obtained thus far.
3.2. Results of the Comparison of AI Algorithms
Figure 6 shows the comparison of the coefficient of determination (R2) among the three evaluated methods (RF, GB, and SVM) for each of the six sensors, while, in Figure 7, the comparison of the root mean square error (RMSE) is presented.
An initially relevant conclusion was that the SVM method using the linear kernel did not fit well to the data from the developed database, as is evident from the prediction shown in Figure 8. Additionally, the computational time for processing the ACID_G_H2S sensor algorithm was significantly higher, exceeding 20 min. The coefficient of determination (R2) obtained was 0.9388, which is lower than the results achieved by other methods as shown in Figure 7. Therefore, the linear kernel and its results were discarded and, in Figure 7 and Figure 8, only the data related to the SVM method with RBF kernel were presented.
From Figure 6 and Figure 7, it became evident that the R2 and RMSE metrics, for all SVM sensors, showed poorer performance compared to the other two methods. This is attributed here to the fact that GB and RF have several decision trees that improved the performance, and ensemble methods tend to present better results. Additionally, it is important to emphasize that this method does not provide information about variable importance, a significant aspect for understanding the process. On the other hand, the metrics for GB are very similar to those found for RF in all sensors. In addition to the metrics, the prediction and residuals graphs, and the overall results of variable importance are very similar for both methods. These results were not presented in this section to avoid hindering the readability, but they can be found in [29].
3.3. Results of the Soft Sensors Using Random Forests
The hyperparameter optimization results are detailed in Table 8. For all sensors, the most effective model used the maximum allowed tree depth (rf_max_depth), which was 10. A decision was made not to increase this limit to avoid overly complex models, as a satisfactory predictive performance was already obtained with this value. As for the other hyperparameters, i.e., the number of trees (rf_n_estimators) and the minimum number of samples at a leaf node (rf_min_samples_leaf), Optuna typically selected intermediate values, except for two cases: rf_min_samples_leaf in AMON_G_NH3 and WATER_H2S.
For the evaluation of all sensors, predictions can be assessed in Figure 9, where the true values from the training and test databases are compared with the values predicted by the sensor. Most sensors exhibit excellent prediction capability, as indicated by the high R2 values. Furthermore, the low error values for MAE (mean absolute error) and RMSE (root mean squared error) suggest that the models can make accurate predictions, and the individual predictions are generally very close to the actual values.
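The metrics used throughout this section can be written out explicitly. The following is a minimal sketch in plain Python, assuming the standard definitions of R2, MAE, MSE, and RMSE; the function name and the sample values are hypothetical and are not taken from the study's database.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the observed values are constant")
    ss_res = sum(r * r for r in residuals)
    return {
        "r2": 1.0 - ss_res / ss_tot,  # share of observed variance explained
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }

# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

R2 is normalized by the spread of the observed values, whereas MAE and RMSE are expressed in the units of the predicted variable, so the two groups of metrics can rank sensors differently.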
The sensors to predict H2S and NH3 in acid gas showed better performance than the others. The reliable prediction range of the acid gas NH3 fraction is notably broader when compared to the range of the H2S fraction, which varies between −2.28% and 0.53% of normal operation. This behavior is supported by the results observed in Table 5, where the acid gas H2S fraction varied by approximately −2.17%, while the NH3 fraction varied around 143.81% during the temperature increase in the second stage of Column 1 from 120.3 to 141.9 °C.
This highlights that, under the evaluated normal operating conditions, the acid gas NH3 mass fraction is notably more sensitive to disturbances, especially those affecting the temperature of the second stage of the first stripping column. On the other hand, the acid gas H2S mass fraction is more stable, showing smaller variations in the presence of disturbances. This analysis emphasizes the relative importance of these two variables in the contaminant separation process and provides valuable insights for process control and optimization.
When the metrics are analyzed together with the predictions of the AMON_G_H2S sensor, it is observed that, unlike the acid gas sensors, the results for the training dataset were quite different from those for the test dataset. Although the R2 value showed only a slight absolute decrease of 4.89% between the two sets, the MAE, MSE, and RMSE for the test set exhibited a significant absolute increase compared to the training set. A sort of data hysteresis was observed in the test results, which was not as evident in the training results.
This hysteresis indicated that the model predicted different values of the H2S fraction in ammoniacal gas for the same observed value in the test set. The reverse was also observed, where the same value was predicted for different and distant observed values. This hysteresis likely arose because, during transients, the current state depends on the previous state.
In order to better evaluate this behavior, each of the 12 simulations from the test bank listed in Table 7 was tested individually. The best model, already optimized for the AMON_G_H2S sensor, was used to assess the contribution of each simulation to the overall model. Examining each simulation in isolation revealed that the dataset from Run 7 played a role in the hysteresis of the results, leading to less precise predictions within the range of values from 0.86 to 12.17, as depicted in Figure 10. Notably, this specific simulation accounts for only 11% of the test dataset. Consequently, the remaining 89% of the data appears to have contributed to the model attaining a high R2 value of 0.9463. Meanwhile, the dispersion caused by the 11% of data from Run 7 was apparent primarily in error metrics like MAE, MSE, and RMSE.
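The run-by-run check described above can be sketched as follows. This is only an illustration of the idea, assuming each test sample carries a run identifier; the record layout, the function names, and the numbers are hypothetical and do not reproduce the values of Table 7.

```python
import math
from collections import defaultdict

def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return math.sqrt(sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs))

def rmse_by_run(records):
    """Group (run_id, observed, predicted) records and score each run alone."""
    grouped = defaultdict(list)
    for run_id, obs, pred in records:
        grouped[run_id].append((obs, pred))
    return {run_id: rmse(pairs) for run_id, pairs in grouped.items()}

# Hypothetical test records: (run identifier, observed value, predicted value).
records = [
    (1, 1.0, 1.1), (1, 2.0, 1.9),
    (7, 5.0, 8.0), (7, 9.0, 6.0),
    (12, 3.0, 3.1), (12, 4.0, 4.1),
]

per_run = rmse_by_run(records)
worst_run = max(per_run, key=per_run.get)  # run with the largest error
share = sum(1 for rec in records if rec[0] == worst_run) / len(records)
print(per_run, worst_run, share)
```

Scoring each run separately makes it possible to see whether a small share of the test set is responsible for most of the error, which is the pattern reported here for Run 7.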
Similar to the AMON_G_H2S sensor, the results for the AMON_G_NH3 sensor exhibited notable differences between the training and test sets. As before, it became evident that the dataset from Run 7 contributed to the hysteresis of the results, leading to less accurate predictions in the range of values from −11.56 to −1.89. As previously highlighted, this specific simulation represents only 11% of the comprehensive test set. Nevertheless, these data significantly impacted not only the R2 value but also error metrics, including MAE, MSE, and RMSE, obtained for the AMON_G_NH3 sensor, resulting in the lowest R2 among all sensors following the evaluation with the test set. In order to deal with these transient effects, we recommend the use of memory-based approaches, such as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks [30].
The predictive performance of this sensor demonstrated only reasonable quality in comparison to the top-performing acid gas sensors developed. The R2 value for the training set marginally surpassed that of the acid gas sensors, while for the test set, it exhibited a slight decrement. Conversely, the error results were found to be comparable to those observed for the ACID_G_NH3 sensor.
The analysis of the WATER_NH3 sensor revealed a slightly lower coefficient of determination (R2) for the test set compared to its counterpart, the WATER_H2S sensor. However, it is noteworthy that error metrics such as MAE, MSE, and RMSE yielded, in most cases, more favorable results for the WATER_NH3 sensor in both the training and test datasets. These findings suggest that, although the WATER_NH3 sensor may have a relatively lower capacity to explain the variance of the predicted variable, its predictions are more accurate in terms of proximity to actual values.
3.4. Results of the Global Analysis of the Importance of Variables
For four out of the six developed sensors, reducing the set of input variables (excluding those with an importance below 2%) generally improved the metrics; the exceptions were the AMON_G_H2S and WATER_NH3 sensors. In the case of the AMON_G_H2S sensor, the MAE values increased by around 13.22%, which is still considered acceptable for practical use. This allows for the removal of variables without compromising the model’s performance for this sensor. However, for the WATER_NH3 sensor, error metrics increased substantially, with the MAE, for example, rising by approximately 90.72%. Therefore, unlike the other sensors, WATER_NH3 requires the original set of input variables to maintain the reliability of the results.
The complete analysis is summarized in Table 9, which shows the selected variables and the reduction in RMSE for the test data compared to the prediction before variable reduction.
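The variable reduction step can be illustrated with a short sketch. It assumes that importances are available as one non-negative score per input variable and that variables contributing less than 2% of the total are dropped; the importance values and the tag SW8-T are hypothetical, and the helper names are not part of the original work.

```python
def select_inputs(importances, threshold=0.02):
    """Keep the variables whose share of total importance is >= threshold."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    return [name for name, value in importances.items()
            if value / total >= threshold]

def relative_change(before, after):
    """Percentage change of an error metric after variable reduction."""
    return 100.0 * (after - before) / before

# Hypothetical importance scores, for illustration only.
importances = {
    "C1S2-TC": 0.55,
    "AMG1-TC": 0.30,
    "SW8-T": 0.135,   # hypothetical tag for the feed temperature
    "ACG1-PC": 0.01,
    "AMG1-PC": 0.005,
}

kept = select_inputs(importances)
# Hypothetical RMSE values before and after retraining on the kept inputs.
change = relative_change(before=0.20, after=0.25)
print(kept, change)
```

A positive relative change means the error grew after the reduction, so the reduced input set should be accepted only when that increase is tolerable for the sensor in question.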
The results in Table 9 reinforce that the removal of H2S and NH3 contaminants from sour waters is largely dependent on temperature. All temperature measurements in the simulation (highlighted in Table 9), namely those in the first column feed stream (SW8), the second stage of Column 1 (C1S2), and the ammoniacal gas (AMG), emerged as important variables among the six virtual sensors, indicating the fundamental role of temperature in controlling the thermodynamic separation process of H2S and NH3.
3.5. Results of the Implementation to the Real SWTU
Figure 10 presents the model results together with their metrics. Additionally, Figure 11 presents the residual values for both the training and testing datasets. An R2 of 0.69 for training and 0.62 for testing indicates that the model explains a substantial portion of the variance in the data and generalizes reasonably well to new, unseen data. However, the presence of industrial data introduces additional variability. The frequency histogram shows deviations from normality, particularly in the high tail, as confirmed by the Shapiro–Wilk test (test statistic of 0.99 with a p-value near zero). These deviations are likely due to extreme values. Despite these challenges, the model’s performance suggests it is useful for practical applications within this industrial context.
An analysis of the most relevant variables was conducted for the sensor. Figure 12 presents the variables’ importance resulting from the random forest modelling with all ten available measurements as inputs. Figure 12 evidences the reduced importance of ACG1-PC and AMG1-PC (<2%). This was not expected at first, as AMG1-PC has great importance for the WATER_NH3 sensor modeled with simulated data. Nonetheless, it was noted that the acquired pressure data for the real SWTU had the lowest coefficients of variation among the variables (i.e., ACG1-PC, 3.38%, and AMG1-PC, 2.00%, compared with C1S2-TC, 7.50%, and AMG1-TC, 6.26%). As random forests are an ensemble of decision trees, they are expected to exhibit robust prediction behavior. However, the impact of correlations, dynamic effects, and other factors on importance measures is a question of great interest. In this specific case, the reduced importance of ACG1-PC and AMG1-PC may be attributed to their dynamic behavior, specifically the minimal variance they exhibited due to tight automatic control. In general, other methods that account for dynamic behavior [30], multicollinearity [31], and other factors should also be investigated.
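The coefficient of variation used in this comparison is the standard deviation divided by the mean. The sketch below assumes the sample standard deviation and a result expressed in percent; whether the study used the sample or the population form is not stated, and the two series are hypothetical rather than plant data.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * statistics.stdev(values) / abs(mean)

# Hypothetical series: a tightly controlled pressure and a freer temperature.
pressure = [1.00, 1.02, 0.98, 1.01, 0.99]
temperature = [120.0, 130.0, 110.0, 125.0, 115.0]

cv_pressure = coefficient_of_variation(pressure)
cv_temperature = coefficient_of_variation(temperature)
print(cv_pressure, cv_temperature)
```

A variable held almost constant by a controller offers few distinct split points to a tree-based model, which is consistent with the low importance assigned to the pressure measurements.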
Hence, a reduced model was produced by excluding these variables from the input data. Figure 13 shows the predictions of this simplified model, together with the metrics. Slightly better results were achieved by removing the inputs with marginal importance. Finally, Figure 14 shows the final input variables’ importance, where the three most important variables are temperatures. Hence, the quantitative analysis confirmed the relevance of the temperature variables in the industrial SWTU operation.
5. Conclusions
The first contribution of this study involves modifying a phenomenological model of a two-column SWTU based on [1]. New simulated data were then generated for sensor development, expanding the Aspen Plus Dynamics® V10 model with six additional controllers for acid gas, ammoniacal gas, and treated water effluents. This resulted in a new database comprising 30 simulations and over 77 thousand samples, set to be publicly shared as a benchmark for virtual sensor development studies.
The data generation involved applying disturbances to five input variables: mass flow of SW1 and SW5, temperature of SW1, and mass fractions of H2S and NH3 in SW1. A key finding of the phenomenological investigation is that the operational variable that most influences the H2S fraction in the acid gas is the temperature of the second stage of Column 1. It was also observed that increasing this temperature improves the recovery of H2S in the acid gas. However, once the system achieves the 90% recovery required by Brazilian environmental legislation, additional increments in the temperature of Column 1 result in minimal gains in H2S removal efficiency and pronounced volatilization of NH3 into the acid gas, which may lead to operational issues in the SRU. This highlights the importance of precise control of the thermal load in the first column to optimize the process, as well as the possibility of using virtual sensors to analyze the effluents of an SWTU, and may be characterized as the second contribution of this work.
A third contribution of this manuscript is the comparative analysis of the RF, GB, and SVM algorithms, which revealed that the first two are equivalent, exhibiting very similar metrics, while the SVM method with RBF kernel showed the worst performance. Thus, RF was chosen as the standard AI methodology for the development of virtual sensors.
This leads to the core advancement of this research, which is the development of six soft sensors (ACID_G_H2S, ACID_G_NH3, AMON_G_H2S, AMON_G_NH3, WATER_H2S, and WATER_NH3). All sensors exhibited a coefficient of determination (R2) greater than 0.87 and RMSE less than 0.41 before reducing the input variables. The study of variable importance showed that the sensors continued to present good metrics after the exclusion of the input variables with a contribution of less than 2%, proving useful for improving computational processing time without compromising model performance. The only exception was the WATER_NH3 sensor, which revealed the need for the original set of input variables to maintain result reliability. The overall analysis of the variable importance highlighted the significant influence of temperature on the process of removing H2S and NH3 contaminants from sour waters. The results indicated that all temperature variables were relevant for controlling the thermodynamic separation process of these compounds, with the temperature of the second stage of Column 1 being particularly important.
The fifth contribution of this work, which was particularly challenging, is the validation of the approach through its application to a real-world process. In this case, the application referred to an industrial unit for which data on the NH3 fraction in treated water were available. The developed algorithms were applied to the actual unit data, yielding satisfactory results from the practical point of view. This is noteworthy, considering that the data originated from process historians and were thus subject to treatments (such as compression and exception handling) that affect their intrinsic variability, making modeling more challenging.
The soft sensors developed here serve multiple purposes, including monitoring, control, and ‘what-if’ analysis. The approach adopted relied on the synergistic use of phenomenological and AI-based models. The simulation study provided valuable physical insights, guiding the selection of the most appropriate input variables for the simplified empirical models. As a result, the most significant advancement achieved here is the facilitation of soft sensor development for any SWTU with the presented configuration.