1. Introduction
Soft sensors are virtual sensors that estimate hard-to-measure variables (such as concentration, traditionally obtained by low-frequency offline laboratory analysis) in real time from easy-to-measure variables such as pressure, temperature, and flow rate. In the past few decades, soft sensors have been extensively studied and implemented in the process industries. Typically, soft sensors can be divided into two general classes: model-driven (white box) and data-driven (black box). Model-driven soft sensors are commonly based upon first-principle models, while data-driven ones are usually based on regression techniques such as principal component analysis, partial least squares, neuro-fuzzy systems, support vector machines, and artificial neural networks (ANNs).
Recently, with the rapid progress in deep learning, ANN variants have again caught the attention of process engineers because of their power in nonlinear regression. However, ANN variants are black boxes that are usually difficult to interpret with domain knowledge [1]. This drawback has kept scientists and engineers from implementing ANNs more widely on the systems they study, thus slowing their adoption. With these concerns, explainable artificial intelligence (AI), which aims to make AI interpretable and trustworthy, has become a focal field of machine learning [2]. For process engineering and control, it is likewise critical to deploy interpretable models so that their predictions are not merely accurate but also explainable in terms of domain knowledge.
The quality of the data is key to training a good AI model. Udugama et al. [3] reported that the data of chemical plants require four properties: volume, variety, velocity, and veracity. It is difficult to obtain an accurate model when any one of these properties is lacking. For conventional machine learning methods, a large amount of such data is necessary for training models that are both accurate and interpretable. However, in the process industries, most critical quality variables, such as concentration and viscosity, are measured by low-frequency offline analysis in laboratories. Because of the low sampling frequency, the corresponding databases are usually small and require long periods of time to grow large enough for training neural networks. Furthermore, for newly started processes with a short operation history, it is impossible to gather big data at all. Small datasets are thus an inherent problem in soft sensor development [4]. To overcome the lack of data, one common approach is to use linear models such as partial least squares (PLS). However, industrial processes are nonlinear over a wide operating range; for instance, distillation columns in chemical plants exhibit highly nonlinear behavior when producing high-purity products. Hence, linear models must be updated constantly [5,6]. Nonlinear models such as ANNs are commonly used to predict nonlinear systems such as distillation columns, but the generalization ability of ANN models must be checked using validation data and regularization [7]. Our recent study showed that a simple validation test might not be sufficient to ensure generalizability and physical consistency when the datasets are limited.
It should be pointed out that some form of prior knowledge must exist when we try to build a data-driven model. Hybrid models have been used to alleviate the problem of small datasets and improve soft sensor accuracy [8,9]. Prior knowledge may take the form of data from a similar system or of an approximate simulator based on a first-principle model. In machine learning, the technique of building a data-driven model for the current problem (the target domain) from a model of another, similar system (the source domain) is known as transfer learning [10,11,12]. The purpose of this study was to present a new data-driven soft sensor development methodology that combines first-principle simulations and transfer learning to overcome overfitting and ensure the interpretability of the soft sensor when only a limited dataset is available. An industrial C4 separation column was used to demonstrate the performance of this approach. Furthermore, gain consistency analysis was used to ensure the interpretability of the soft sensors.
3. Case Study
3.1. Process Description
In this study, an industrial C4 separation column was used as an example to illustrate the effectiveness of this approach. The column separated the C4 and C5+ components of the reactor effluent. The main product, C4 (over 90% of the feed), left as the liquid distillate, while some noncondensable light impurities left from the vapor distillate and the C5+ components left from the bottom. Quality control of the liquid distillate and the bottom product was the top priority of this distillation operation, especially for the liquid distillate; namely, the concentration of C5+ impurities in the distillate and the C4 losses at the bottom had to be controlled within acceptable ranges. Hence, two soft sensors were built in this case study to monitor the C5+ impurities in the distillate and the C4 losses at the bottom, respectively.
According to the domain knowledge of distillation unit operations, 14 critical process variables, including pressures, temperatures, and flow rates, were selected as the input variables for the soft sensors, as shown in Figure 1. The selected variables can be divided into two types: six manipulated variables (MVs) and eight sensor variables (SVs). The MVs were manipulated manually or automatically, while the SVs were measured values only, as shown in Table 1.
3.2. Data Preprocessing
For soft sensors of distillate quality, there were 929 available samples for modeling, where 838 samples were used for learning (training and validation) and 91 samples for testing. For soft sensors of bottom quality, there were 453 available samples for modeling, where 414 samples were used for learning and 39 samples for testing.
The moving window method was applied to capture the dynamic behavior of the process [15]. The window length (W) backtracked 1 h from each sampling instant t, and each input variable was averaged every 10 min. The input–output relation of the soft sensor can be expressed mathematically as

y_t = f(u_{t−W:t}, s_{t−W:t})

where the subscript t represents time; W represents the window length; u represents the (window-averaged) manipulated variables; and s represents the (window-averaged) sensor variables.
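As an illustration, the windowed feature construction described above can be sketched as follows. The sampling resolution, the array layout, and the helper name `moving_window_features` are assumptions for this example; with 14 variables and six 10-min averages each it yields 84 features, whereas the actual soft sensor used a 76-feature layout.

```python
import numpy as np

def moving_window_features(x, t, block=10, n_blocks=6):
    # x: (time, n_vars) array of raw measurements at an assumed 1-min
    # resolution. Returns one 10-min average per block per variable
    # over the past block * n_blocks minutes (here, 1 h).
    window = x[t - block * n_blocks:t]            # past hour of data
    blocks = window.reshape(n_blocks, block, -1)  # (6, 10, n_vars)
    return blocks.mean(axis=1).ravel()            # 6 averages per variable

rng = np.random.default_rng(0)
x = rng.normal(size=(120, 14))                    # 2 h of 14 process variables
feat = moving_window_features(x, t=120)
print(feat.shape)  # (84,)
```

Each sampling instant thus contributes one flat feature vector of block averages, which becomes one learning sample for the network.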
3.3. Network Structure and Hyperparameters
A feedforward network (FFN) was the simplest neural network containing a multilayer perceptron (MLP). In this study, fully-connected FFNs with five hidden layers based on five different models were tested and compared. The five different models in this study were listed as follows:
For the FFN, the R-FFN, the three FT-FFNs, and the source-domain models, the number of inputs was 76 features, and the number of outputs was one. The number of parameters for all models was 32,161. For training from scratch (FFN and R-FFN), the Glorot uniform initialization [17] (the default option of the Keras library) was applied. The regularization rate λ, the penalty weighting of the L2-norm of the parameters, was fixed at 0.01 for both the R-FFN and the FT-FFNs.
According to the universal approximation theorem, width-bounded deep networks activated by the rectified linear unit (ReLU) [18], a commonly used activation function, with N + 4 neurons per layer can approximate any Lebesgue-integrable function, where N is the number of features [19]; in this case, there were 76 input features, so 80 neurons were used in each layer. The algorithm for gradient descent optimization was Adam [20]. Additionally, to avoid overfitting when modeling with extremely small datasets, an L2-norm penalty was added to the loss function. All the modeling work was done in Python using the Keras library.
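The stated sizes can be cross-checked: six fully connected layers (76→80, four 80→80, and 80→1) give exactly the 32,161 parameters reported above. A minimal bookkeeping sketch (the helper names are ours, not from the paper):

```python
def dense_params(n_in, n_out):
    # weights plus biases of one fully connected layer
    return n_in * n_out + n_out

def ffn_param_count(n_inputs=76, width=80, n_hidden=5, n_outputs=1):
    # chain the layer sizes and sum the parameters of each Dense layer
    sizes = [n_inputs] + [width] * n_hidden + [n_outputs]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))

print(ffn_param_count())  # 32161
```

The same figure is what `model.count_params()` would report for the equivalent Keras `Sequential` model.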
3.4. Metrics of Performance
For the development of robust and interpretable neural models, both predictive accuracy and interpretability (descriptive accuracy) should be carefully considered. In this study, the root-mean-square error (RMSE) was used as the accuracy metric for the soft sensors:

RMSE = sqrt( (1/N) Σ_{t=1}^{N} (y_t − ŷ_t)² )

where y_t is the measured value, ŷ_t is the soft sensor prediction, and N is the number of samples.
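As a quick reference, the metric can be computed in a few lines (a numpy-based illustration, not the paper's code):

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-square error between measurements and predictions.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 1.1547005383792515
```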
Alongside the metrics for predictive accuracy, the models were also interpreted using post hoc analysis. Post hoc interpretability is a concept and approach of interpretable machine learning [2,21]. It aims to interpret black boxes globally or locally using domain-knowledge-based models. Local interpretation aims to identify the contribution of each input feature towards a specific model prediction and usually attributes a model's decision to its input features (Du et al., 2018). Such interpretation is usually done by imposing perturbations on certain features of the input.
For soft sensors regressing the input–output relationships of chemical processes, the responding behaviors of the outputs to disturbances (perturbations) of the inputs, usually called the process gains, should be physically consistent with chemical engineering domain knowledge. The dynamic process gain (K_{ij,t}) of quality variable i with respect to manipulated variable j can be defined as

K_{ij,t} = Δqv_{i,t} / Δu_{j,t}

where Δu_{j,t} is the perturbation of manipulated variable j at sampling instant t and Δqv_{i,t} is the corresponding change in quality variable i. For soft sensors of distillation columns, the inputs include the manipulated variables, such as the reflux rate and the reboiler temperature, and the outputs are the qualities of the distillate and bottom products. Thus, there are four process gains: two main gains (i = j) and two interaction gains (i ≠ j).
Based on common knowledge of distillation unit operations, it is reasonable to expect the signs of the dynamic and steady-state process gains to be consistent. Therefore, the percentage of testing samples whose dynamic gain sign matched the steady-state gain sign was defined as the gain consistency (Con_{ij}):

Con_{ij} = (100% / N) Σ_{t=1}^{N} Hv( K_{ij,t} · K̄_{ij} )

where Hv is the Heaviside function, K̄_{ij} is the steady-state gain, and N is the number of testing samples. An interpretable soft sensor should at least exhibit high gain consistency so that it responds reasonably to changes in the manipulated variables.
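The perturbation-based gain check can be sketched as follows; the function names, the finite-difference step, and the toy linear model are illustrative assumptions, not the paper's code:

```python
import numpy as np

def dynamic_gain(model, x, j, delta=1e-3):
    # Finite-difference estimate of the gain of the model output
    # with respect to input feature j (perturbation-based analysis).
    x_pert = x.copy()
    x_pert[j] += delta
    return (model(x_pert) - model(x)) / delta

def gain_consistency(model, X, j, steady_sign):
    # Percentage of samples whose dynamic gain sign matches the
    # steady-state gain sign (Heaviside of their product).
    gains = np.array([dynamic_gain(model, x, j) for x in X])
    return float(np.mean(np.heaviside(gains * steady_sign, 0.0)) * 100)

# Toy linear "soft sensor" with a known negative gain on feature 0.
model = lambda x: -2.0 * x[0] + 0.5 * x[1]
X = np.random.default_rng(1).normal(size=(50, 2))
print(gain_consistency(model, X, j=0, steady_sign=-1))  # 100.0
```

For a real soft sensor, `model` would be the trained network's prediction function and `X` the testing samples.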
3.5. Degree of Freedom
There is often an issue of the degree of freedom (DoF), whether the networks were over-parameterized, which implies overfitting. Intuitively, the parameter-to-data ratio, which was conventionally calculated in the form of Equation (5), was an appropriate way to estimate the DoF of networks. However, some works of literature [
22,
23] provided that the equivalent DoF of the multilayer FFNs only related to the units in the highest hidden layer, with the other layers performing only geometric transformations of data. Thus, instead, we considered the DoF of networks using Equation (6).
To consider the effect of the number of learning samples, the neural networks were trained with 360, 270, 180, 90, and 45 samples (parameter-to-data ratios of 0.225, 0.3, 0.45, 0.9, and 1.8, respectively); 20% of these samples were used as the validation set during learning.
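The quoted ratios are consistent with taking the equivalent DoF as the 81 parameters fed by the highest hidden layer (the output neuron's 80 weights plus one bias); this interpretation is our assumption, checked arithmetically below:

```python
# Equivalent DoF assumed to be the parameters fed by the highest hidden
# layer: the 80 weights plus 1 bias of the single output neuron.
dof = 80 + 1
ratios = {n: dof / n for n in (360, 270, 180, 90, 45)}
print(ratios)  # {360: 0.225, 270: 0.3, 180: 0.45, 90: 0.9, 45: 1.8}
```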
3.6. Source-Domain Models
Three ASPEN Plus dynamics simulators were constructed to serve as the source domains in this study. The first source domain (D
1) mimiced the actual plant, a debutanizer. The number of trays, type of trays, location of feeds and draws, and feed flowrates were similar to the actual column. The hardware parameters such as actual column diameter, sump size, size of accumulators were obtained using auto-sizing in ASPEN Plus. It should be noted that it is tedious, time-consuming, and somewhat unrealistic to build a rigorous simulation that dynamically matches the real process exactly. Furthermore, we showed that such simulation should be unnecessary with transfer learning techniques in the following discussion. The second (D
2) was also a debutanizer found in the literature [
24,
25]. The third source domain D
3 was a methanol/water splitter, also given in the literature [
24,
25]. Since the processes were the same, D
1 and D
2 used the same thermodynamic models, the Peng–Robinson equation of states. For the methanol/water splitter, an UNIQUAC model was used.
All three source domains were separation towers, and the three source models had the same MVs shown in Table 1. The temperature sensors in D1 were located as in the plant. D2 and D3 had different numbers of trays, so the corresponding SVs were the temperatures of trays selected by their relative positions with respect to the condenser and the reboiler. The quality outputs of D2 and D3 were set as the corresponding light or heavy components.
In general, the qualities of the distillate and the bottom are affected by the reflux flowrate and the reboiler temperature. Thus, in this study, the gain-sign analysis focused on the responses of the distillate quality (qv1) and the bottom quality (qv2) to the reflux flowrate (u1) and the reboiler temperature (u2). For these source domains, the corresponding steady-state process gains shared the same signs: the main gains (K̄11 and K̄22) were all negative, and the interaction gains (K̄12 and K̄21) were all positive, as shown in Table 2. With the datasets generated by these source domains, three source-domain neural network models were pretrained, and their gain consistencies were calculated. Two potential factors affect the result of transfer learning: (1) the domain similarity between the source domains and the target domain, and (2) the gain consistency of the source-domain models. Both effects were considered in this case study. To observe the effect of gain consistency, source-domain models with low gain consistencies were intentionally chosen; namely, the Con12 of the D3 model was 0% consistent, as shown in Table 3.
3.7. Fine-Tuning Recipe
There is no general criterion for neural network fine-tuning [
26]. The most common practices are done by fine-tuning deep layers while freezing shallow ones [
12]. However, Li et al. [
27] stated that shallow layers also had some effects during domain adaptation. Thus, to obtain better fine-tuning results, the trial-and-error method was used to figure out the best one before performing fine-tuning.
The trial results are shown in Figure 2. As the figure shows, the shallowest layer gave the most significant contribution to minimizing the RMSE in the fine-tuning procedures for the soft sensors of the distillate and the bottom. Note that the six digits of a tuning recipe represent the five hidden layers and the output layer; 1 means the layer is trainable, and 0 means it is frozen. Generally, the recipes freezing the shallowest layer performed worse than those updating its weights, while the recipes freezing the intermediate layers performed better than those updating their weights. Compared with the shallowest layer, the deeper layers contributed much less during fine-tuning, but they still helped reduce the RMSE. Thus, in this study, the recipe "100011" was chosen, marked by the red arrow in Figure 2: the output layer, the shallowest hidden layer, and the deepest hidden layer were fine-tuned, while the intermediate layers were frozen. With this recipe, the number of trainable parameters was 12,721 and the number of nontrainable parameters was 19,440.
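The trainable/frozen split implied by recipe "100011" can be verified directly from the layer sizes (76→80, four 80→80 hidden layers, 80→1 output); the bookkeeping below is our illustration:

```python
# Parameter counts per layer: input->hidden1, four hidden->hidden, output.
layer_params = [76 * 80 + 80] + [80 * 80 + 80] * 4 + [80 * 1 + 1]
recipe = "100011"  # 1 = trainable layer, 0 = frozen layer
trainable = sum(p for p, flag in zip(layer_params, recipe) if flag == "1")
frozen = sum(layer_params) - trainable
print(trainable, frozen)  # 12721 19440
```

In Keras, the same split would be achieved by setting `layer.trainable = False` on the three intermediate hidden layers before recompiling the model.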
5. Conclusions
In this paper, a new methodology combining first-principle simulation and transfer learning was proposed to address the problems of overfitting and low interpretability posed by the small datasets that often occur in industrial processes. The method was applied to a real distillation process and showed its advantages in enhancing both predictive accuracy and physical interpretability over conventional deep learning methods, especially when the amount of available real data was small compared to the number of network parameters. Transfer learning was implemented by fine-tuning the network weights, freezing the inner layers and updating the outer ones. Through fine-tuning, the input-output relationships were modified to accomplish the adaptation from the source domains to the target domain. The results showed that the similarity between the source and target domains had nearly no effect on the fine-tuning results, while the gain consistency of the target models was strongly determined by the gain consistency of their corresponding source-domain models. Additionally, the concept and definition of gain consistency were used as a metric to quantify the physical interpretability of the networks.