Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data

Zhou, Jianqiao; Wang, Zhu; Liu, Jiaxuan; Luo, Xionglin; Chen, Maoyin

doi:10.3390/pr13010184

Open AccessArticle

Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data

by

Jianqiao Zhou

¹

,

Zhu Wang

^1,*

,

Jiaxuan Liu

²,

Xionglin Luo

¹ and

Maoyin Chen

¹

Department of Automation, College of Artificial Intelligence, China University of Petroleum Beijing, Beijing 102249, China

²

Research Institute of Petroleum Exploration & Development, PetroChina Company Limited, Beijing 100083, China

^*

Author to whom correspondence should be addressed.

Processes 2025, 13(1), 184; https://doi.org/10.3390/pr13010184

Submission received: 10 October 2024 / Revised: 27 December 2024 / Accepted: 8 January 2025 / Published: 10 January 2025

(This article belongs to the Special Issue Process Modeling, Simulation, and Optimization in Chemical Engineering)

Download

Browse Figures

Versions Notes

Abstract

Chemical process control systems are complex, and modeling the controlled object is the first task in automatic control and optimal design. Most chemical process modeling experiments require test signals to be applied to the process, which may lead to production interruptions or cause safety accidents. Therefore, this paper proposes an improved transformer model based on a self-attention mechanism for modeling industrial processes. Then, an evaluation mechanism based on root mean square error (RMSE) and Kullback–Leibler divergence (KLD) metrics is designed to obtain more appropriate model parameters. The Variational Auto-Encoder (VAE) network is used to compute the associated KLD. Finally, a real nonlinear dynamic process in the petrochemical industry is modeled and evaluated using the proposed methodology to predict the time series data of the process. This study demonstrates the validity of the proposed transformer model and illustrates the versatility of using an integrated modeling, evaluation, and prediction scheme for nonlinear dynamic processes in process industries. The scheme is of great importance for the field of industrial soft measurements as well as for deep learning-based time series prediction. In addition, the issue of a suitable time domain for the prediction is discussed.

Keywords:

chemical process modeling; modified transformer; evaluation mechanism; deep learning; nonlinear dynamic modeling

1. Introduction

Models that describe the dynamic characteristics of processes are an essential component of advanced control systems in the refining industry [1]. The quality of the model is a determining factor in the efficacy of the control process [2]. Due to changes in operating conditions, random disturbances caused by environmental influences, complex and variable chemical reaction processes, and time delays caused by measurement processes, industrial processes exhibit various complex characteristics such as multivariable, nonlinearity, slow time-varying, and strong coupling [3,4]. However, identification experiments typically necessitate the cessation of industrial production and their execution under open-loop conditions [5]. At the same time, the process is manually fed with an excitation signal for a long enough period of time to excite the full dynamic characteristics of the system and obtain the data needed for identification [6]. These problems seriously affect industrial production and are difficult to put into use for a long period of time, with limitations for complex and variable operating conditions. Therefore, it is essential to improve the reliability and accuracy of modeling key process variables in industrial production processes.

The emergence and development of deep learning neural networks and the abundance of process data available in industrial production through distributed control systems (DCS) provide an effective method for solving complex modeling problems [7,8,9]. Some specular designed neural networks are adopted for modeling dynamic characteristics [10,11,12]. The advantages of deep networks in modeling complex industrial processes lie in the following: deep networks can capture complex nonlinear relationships within industrial processes, allowing for more accurate modeling compared to linear methods; by changing the number of neurons in the input and output layers, the deep network can be applied to different multivariate systems; and the deep networks have strong feature extraction capabilities and are less demanding in terms of data identifiability than traditional recognition methods.

Since its proposal in 2017, Transformer [13] has attracted widespread attention in the field of natural language processing [14] due to its unique architecture and excellent performance. Compared with basic deep learning models such as recurrent neural networks (RNN) [15] and convolutional neural networks (CNN) [16], Transformer adopts a new encoder–decoder architecture based on self-attention-based encoder–decoder architecture, which has the advantages of strong parallel computation capability, strong data feature-dependent capture capability, strong global information capture capability, high scalability and flexibility, and greatly improves the ability to extract features from long sequential data. Based on the above advantages, the task scenarios of Transformer are gradually expanding from the field of natural language processing to other fields, including image processing [17], speech recognition [18], and time series analysis [19]. Therefore, Transformer’s powerful sequence processing and feature extraction capabilities can meet the needs of industrial modeling.

However, in the field of industrial modeling, the existing Transformer [20,21,22] does not consider the autocorrelation property of the output data when designing the input and output data, and only considers the cross-correlation property between the temporal input and output data. By introducing the auto-correlation property of output data [23,24], the dimensionality of network input data can be reduced to a certain extent, i.e., less historical data are used for modeling, and the efficiency of network training can be improved to a certain extent. Therefore, in this paper, a modified transformer prediction model is proposed to achieve full dynamic modeling of nonlinear dynamic processes in the petrochemical industry. To be specific, the input–output (I/O) design should contain different time series of different variables, at the same time the design can simplify the decoder structure and thus the Transformer model structure.

Moreover, when neural network structures are determined, the relevant values of structure parameters are important for improving the modeling accuracy. Usually, classical evaluation indexes RMSE [25] and mean square error (MSE) [26] are used to observe the overall situations of predictive errors. However, the distributions of residuals are not analyzed sufficiently. In order to observe residual distributions, KLD [27,28,29] is used for computing the distances between different distributions [30,31]. An autoencoder [32,33], an unsupervised learning algorithm, compresses input data into a low-dimensional representation through an encoder, reconstructs the data through a decoder, and learns an effective data representation method. In terms of the above situations, the VAE [34] network is adopted for both reconstructing the residuals and computing the KLD values [35,36]. Hence, in this paper, a new evaluation mechanism based on both RMSE and KLD is exploited. Through the proposed evaluation mechanism, the optimal structure parameters of the proposed original modified transformer are determined, in order to harvest more accurate predictive performances.

To sum up, the main contributions of this paper are as follows:

(1): Firstly, the traditional Transformer model is simplified and improved by ingenious I/O design, and a modified transformer model is established. The proposed model is applied to full dynamic modeling and prediction of nonlinear dynamic processes in the petrochemical industry for the first time. It can adaptively capture time series dependencies between input and output sequences.
(2): Secondly, the evaluation mechanism is proposed. The mechanism uses RMSE for preliminary judgment at first and then uses KLD to further excavate residual characteristics. Finally, a group of optimal network structure parameters can be selected from the set of structure parameters. The network training and prediction under optimal structure parameters can ensure that the predictive effect of the model is relatively best.
(3): Then, this paper combines the proposed model and evaluation mechanism to model, predict, and evaluate a real nonlinear dynamic process, and the effectiveness of our model is proved. Experimental results show that our model has better predictive performance than baseline methods.
(4): Finally, in order to select the predictive time domain, a hybrid network is designed and consists of multiple subnets. Each subnet has different predictive output sequences. By observing the predictive effects of different subnets on the same output sequence, the suitable predictive time domain can be determined.

The rest of this paper is organized as follows. Section 2 describes the overall architecture of the modified transformer model, explains the I/O design of the model, interprets how to determine a predictive time domain, and describes relevant theory and process details. A training algorithm and evaluation mechanism of the proposed model are introduced, and the operation principle of the evaluation mechanism is analyzed in detail in Section 3. In Section 4, the proposed model and two baseline models are used to conduct experiments on a real nonlinear dynamic process. Finally, this paper gives conclusions of this study in Section 5.

2. Modified Transformer Model Incorporating Industrial Timing I/O Design

This paper proposes a Transformer prediction model that incorporates industrial timing I/O design for use in nonlinear dynamic process prediction scenarios. The model’s structure is illustrated in Figure 1. As illustrated in Figure 1, the Transformer model comprises four principal components: the input layer, position encoder, encoder and decoder, where

x

and

\hat{y}

are used to represent input variables and model output variables, respectively,

n

and

q

are used to represent input order and predictive time domain, respectively. The primary distinctions between this model and the classical Transformer model are evident in the input layer and decoder, whereas the position encoder and encoder remain consistent with the classical Transformer model. The input layer streamlines the traditional Transformer model through the strategic integration of input/output (I/O) design.

2.1. Incorporating Industrial Timing I/O Design

The Transformer prediction model used in this paper, which incorporates industrial timing I/O design, describes the dynamic characteristics using the correlation between different moments of the input and output variables. Figure 2 illustrates the I/O design of the Transformer prediction model. For the input module, the dynamic order of the actual process determines the number of columns in the input matrix, while the number of state variables determines the number of rows in the input matrix. As for the output module, the internal dynamic characteristics of the actual system determine the prediction horizon, i.e., the number of columns in the output matrix, and similarly, the number of state variables determines the number of rows in the output matrix. The forward time steps

n

and prediction horizon

q

in Figure 2 are both related to the dynamic order of the actual system.

More specifically, in this paper, the prediction of data at one time in the future is called ‘single-step prediction’, and the prediction of data at multiple times in the future is called ‘multi-step prediction’. In the following, this paper gives a general definition of the prediction task for nonlinear dynamic processes. Its goal is to predict the value of the output sequence in the next period of time based on historical time series. As shown in Figure 1, for single-input and single-output nonlinear dynamic process (hereinafter called univariate nonlinear dynamic process), the input of the model at time

t

is defined as a data information matrix containing the data at previous

n

times, namely,

X_{t} = [x_{t - n}, x_{t - n + 1}, \dots, x_{t}]

, where

x_{t} = {[u_{t}, y_{t}]}^{T}

. This design means that the input sequence and the output sequence of processes are fused together as the input of the model. Hence, model inputs become more complex, contain more abundant dynamic characteristics, and make their generalization ability stronger. Single-step prediction is usually of little help to real early warning application, because it is difficult to predict what will happen after multiple steps. Hence, this paper defines the data information vector containing the next

q - 1

moments in the future as predictive output of the model at the

t - th

moment, namely,

{\hat{Y}}_{t} = [{\hat{y}}_{t + 1}, \dots, {\hat{y}}_{t + q - 1}]

. Equation (1) simply and clearly presents the prediction task of univariate nonlinear dynamic processes. For multiple-input and multiple-output nonlinear dynamic processes (hereinafter called multivariate nonlinear dynamic process), the input representation of the model at time t is the same as that of univariance. However,

X_{t}

is expanded to

x_{t} = {[u_{1, t}, \dots, u_{r, t}, y_{1, t}, \dots y_{l, t}]}^{T}

, where

r

is the number of process input variables, and

l

is the number of process output variables. The predictive output of the model at the

t - th

moment is extended to

{\hat{Y}}_{t} = [{\hat{y}}_{t + 1}, \dots, {\hat{y}}_{t + q - 1}]

, where

{\hat{y}}_{t} = [{\hat{y}}_{1, t}, \dots, {\hat{y}}_{l, t}]

. Equation (2) simply and clearly presents the prediction task of multivariate nonlinear dynamic processes.

[x = (\begin{matrix} u_{t - n} & \dots & u_{t} \\ y_{t - n} & \dots & y_{t} \end{matrix})] \underset{\to}{m o d e l} (\begin{matrix} {\hat{y}}_{t + 1} & \dots & {\hat{y}}_{t + q} \end{matrix}),

(1)

[x = (\begin{matrix} u_{1, t - n} & \dots & u_{1, t} \\ ⋮ & ⋱ & ⋮ \\ u_{r, t - n} & \dots & u_{r, t} \\ y_{1, t - n} & \dots & y_{1, t} \\ ⋮ & ⋱ & ⋮ \\ y_{l, t - n} & \dots & y_{l, t} \end{matrix})] \underset{\to}{model} (\begin{matrix} {\hat{y}}_{1, t + 1} & \dots & {\hat{y}}_{1, t + q} \\ ⋮ & ⋱ & ⋮ \\ {\hat{y}}_{l, t + 1} & \dots & {\hat{y}}_{l, t + q} \end{matrix}) .

(2)

2.2. Modified Transformer Model Structure

2.2.1. Positional Encoding

The structural setup of the attention mechanism in the Transformer results in the inability of the model to learn the positional information in the input sequences, thus the input layer needs to encode the positions. For nonlinear dynamic processes in industrial processes, the position information in the sequence represents the moment information, and this information plays an extremely important role in the prediction effect of the industrial process. As shown in Equations (3) and (4).

P_{l o c} (l o c, 2 i) = \sin (\frac{l o c}{10000^{2 i / d_{m o d e l}}}),

(3)

P_{l o c} (l o c, 2 i + 1) = \cos (\frac{l o c}{10000^{2 i / d_{m o d e l}}}),

(4)

where

l o c

is the position of the input sequence,

d_{m o d e l}

represents the dimension of the input sequence,

2 i

and

2 i + 1

represent parity, and

i

denotes the

i t h

dimension of the input sequence feature vector.

2.2.2. Multi-Head Self-Attention Mechanism

The core principle of the encoder is the self-attention mechanism, and its computational process is commonly used in the Query-Key-Value (QKV) mode, as shown in Equation (5).

S e l f A t t e n t i o n (Q, K, V) = S o f t M a x (\frac{Q K^{T}}{\sqrt{d_{k}}}) \cdot V,

(5)

where

Q

denotes the query feature matrix,

K

denotes the key feature matrix, and

V

denotes the value feature matrix, which are obtained by multiplying the input feature matrix with the corresponding weight matrix, respectively; and

d_{k}

represents the dimensions of

Q

,

K

, and

V

, and the SoftMax function is a commonly used activation function that converts any real vector into a probability distribution.

Based on the self-attention mechanism, the Transformer model uses a multi-head self-attention mechanism. As shown in Equations (6) and (7).

M u l t i H e a d (Q, K, V) = C o n c a t (h e a d_{1}, \dots, h e a d_{m}) \cdot W^{O},

(6)

h e a d_{i} = S e l f A t t e n t i o n (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}),

(7)

where

W_{i}^{Q}

,

W_{i}^{K}

, and

W_{i}^{V}

are weight matrices of

Q

,

K

, and

V

,

i \in [1, m]

, Concat function stitches together matrices generated by multiple attention headers,

W^{O}

denotes the output parameter matrix, and

m

is the total number of attention heads.

2.2.3. Residual Linking and Normalization

Residual connection can solve the problem of network degradation by overlaying input and output. Normalization can not only eliminate the effects of gradient disappearance and gradient explosion but also accelerate convergence. The specific calculation equations are shown in Equations (8)–(11)

X_{M n o r m, o} = L a y e r N o r m (x_{a d d}) = L a y e r N o r m (X + M u l t i H e a d (Q, K, V)),

(8)

L a y e r N o r m (x_{a d d}) = \frac{x_{a d d, a z} - μ_{a}}{\sqrt{{(σ_{a})}^{2} + ε}},

(9)

μ_{a} = \frac{1}{d_{t}} \sum_{z = 1}^{d_{t}} x_{a d d, a z},

(10)

{(σ_{a})}^{2} = \frac{1}{d_{t}} \sum_{z = 1}^{d_{t}} {(x_{a d d, a z} - μ_{a})}^{2},

(11)

where

X_{M n o r m, o}^{g}

is the output of residual connection and normalization layer behind the multi-head self-attention layer, LayerNorm is the normalization function,

x_{a d d, a z}

is the value of row

a

and column

z

of output matrix

x_{a d d}

after residual connection,

μ_{a}

is the mean value of row

a

of

x_{a d d}

,

{(σ_{a})}^{2}

is the variance of row

a

of

x_{a d d}

, and the smoothing parameter

ε

is used to prevent the denominator from being zero in Equation (9).

2.2.4. Feed-Forward Neural Network

The feed-forward neural network is another important component in the Transformer. In each Transformer block, the feed-forward neural network accepts the output from the multi-head self-attention mechanism as input and maps it into a new representation vector using two linear transformations and a nonlinear activation function. As shown in Equation (12).

X_{f n n, o} = R e L U (W_{1} X_{M n o r m, o} + b_{1}) \cdot W_{2} + b_{2},

(12)

where

W_{1}

and

W_{2}

are trainable weights of the first and second layers, respectively,

b_{1}

and

b_{2}

are bias vectors, ReLU is the activation function, and

X_{f n n, o}

is the output of the feed-forward neural network. At the same time, the dropout operation [13] is adopted in the feed-forward layer to prevent the overfitting of training.

2.2.5. Encoder and Simplified Decoder

The input sequence designed in this paper contains the data of the process input vector and the process output vector. Since the multi-head self-attention mechanism of the encoder can learn the dependence between vectors of the input sequence itself, hence, only one encoder can achieve the effect of a classical Transformer containing encoding and decoding structure. In order to avoid the redundancy of encoding–decoding the multi-head attention mechanism in the decoder and the multi-head self-attention mechanism in the encoder, a fully connected neural network with two hidden layers is used as the decoder to obtain predictive output

\hat{y}

of the modified transformer in this paper. This simplifies the structure of traditional transformers. As shown in Equations (13) and (14).

X_{F n o r m, o} = L a y e r N o r m (X_{M n o r m, o} + X_{f n n, o}) .

(13)

\hat{y} = (R e L U (W_{1} X_{F n o r m, o}^{N} + b_{1})) W_{2} + b_{2},

(14)

where

X_{F n o r m, o}^{N}

is the output matrix of the encoder.

2.3. Learning Rate Optimization Algorithm

In this paper, the Adaptive Moment Estimation with Weight-decay (AdamW) algorithm is used to train and optimize the model. AdamW was proposed by Ilya Loshchilov and Frank Hutter in 2018 [37]. Weight attenuation was added to the parameter update in order to solve the problem of the non-convergence or slow convergence of the Adaptive Moment Estimation (Adam) algorithm [38]. Llugsi et al. proved that the AdamW optimizer has higher predictive accuracy than the Adam optimizer in a time series prediction task [39]. It is also proved that AdamW has smaller oscillation in the process of convergence and can produce better generalization performance. The updated equations of AdamW are as Equations (15)–(20).

L (θ_{t}) = \frac{1}{2} {(θ_{t})}^{2} = \frac{1}{2} {(y_{t} - {\hat{y}}_{t})}^{2},

(15)

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) \nabla L (θ_{t - 1}),

(16)

υ_{t} = β_{2} υ_{t - 1} + (1 - β_{2}) {(\nabla L (θ_{t - 1}))}^{2},

(17)

{\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}},

(18)

{\hat{υ}}_{t} = \frac{υ_{t}}{1 - β_{2}^{t}},

(19)

θ_{t + 1} = θ_{t} - α (\frac{{\hat{m}}_{t}}{\sqrt{{\hat{υ}}_{t} + ε}} + ω θ_{t}),

(20)

where

y_{t}

is the actual output at time

t

;

{\hat{y}}_{t}

is the predicted output at time

t

,

m_{t}

is the update of the 1-st moment estimate,

υ_{t}

is the update of estimate of the 2-nd moment,

β_{1}

is the exponential decay rate of the 1-st moment (generally,

β_{1} = 0.9

),

β_{2}

is the exponential decay rate of the 2-nd moment (generally,

β_{2} = 0.999

),

ω

is the weight attenuation factor (generally,

ω = 0.01

),

\nabla L (θ_{t})

is the gradient of loss function

L (θ_{t})

,

{\hat{m}}_{t}

and

{\hat{υ}}_{t}

are the bias correction of

m_{t}

and

υ_{t}

at initial value of 0, respectively.

3. Evaluation Mechanisms of Models and Methods of Parameter Integration in the Prediction Time Domain

3.1. Evaluation Mechanism

In order to make a predictive effect of the modified transformer on nonlinear dynamic processes as good as possible, it is necessary to select a group of optimal network structure parameters from the set of structure parameters. Under optimal structure parameters, the predictive error of the model can be relatively minimal, and the fitting degree of the predictive curve and real curve can be relatively high. In order to achieve the above objectives, this paper proposes an evaluation mechanism for determining the optimal parameters of a neural network structure. The mechanism takes into account two performance evaluation indexes: RMSE and KL dispersion. The latter is obtained using a variational autoencoder (VAE). The basic idea is as follows:

Firstly, the performance indexes RMSE, Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are used to measure the deviation between the real value and the predictive value. The smaller their values, the better the prediction of the model. As shown in Equations (21)–(23).

R M S E = \sqrt{\frac{1}{v a} \cdot \frac{1}{q} \cdot \frac{1}{L} \sum_{k = 1}^{v a} \sum_{i = 1}^{L} \sum_{j = 1}^{q} {({\hat{y}}_{i, j}^{(k)} - y_{i, j}^{(k)})}^{2}},

(21)

M A E = \frac{1}{v a} \cdot \frac{1}{q} \cdot \frac{1}{L} \sum_{k = 1}^{v a} \sum_{i = 1}^{L} \sum_{j = 1}^{q} |{\hat{y}}_{i, j}^{(k)} - y_{i, j}^{(k)}|,

(22)

M A P E = \frac{100}{v a} \cdot \frac{1}{q} \cdot \frac{1}{L} \sum_{k = 1}^{v a} \sum_{i = 1}^{L} \sum_{j = 1}^{q} |\frac{{\hat{y}}_{i, j}^{(k)} - y_{i, j}^{(k)}}{y_{i, j}^{(k)}}| % .

(23)

where

L

represents the length of the dataset,

{\hat{y}}_{i, j}^{(k)}

is the model predictive output of the

k - th

output variable,

y_{i, j}^{(k)}

is the real output of the

k - th

output variable,

v a

represents the number of process output variables, and

q

represents the length of the predictive output sequence corresponding to each time.

RMSE is primarily used to measure the overall predictive accuracy of a model, reflecting the average deviation between predicted values and actual values. However, the model’s network structure may introduce specific types of errors that might not be fully captured by RMSE. Therefore, for selecting a relatively optimal set of network structure parameters within a defined parameter space, more detailed model analysis and evaluation are necessary, including further examination of residual distributions.

VAE was first proposed by Kingma and Welling [40], which consists of an encoder and a decoder. The encoder part takes the initial data input and compresses and encodes it to obtain a low-dimensional hidden variable. Specifically, the encoder takes a real sample

τ

as the input and outputs its mean

μ_{i} (τ)

and variance

σ_{i} (τ)

in the hidden variable space, where

i = 1, 2, \dots, m

and

m

is the dimension of the hidden variable space. Suppose the data in the hidden variable space satisfies a particular distribution

N (0, I)

with a mean value of 0 and a variance of

I

. By using the re-parameterization method, the hidden variables composed of mean and variance are taken as the input of the decoder, and reconstructed samples are finally output. Through repeated iterations, the RMSE of a real sample and a reconstructed sample is expected to be small enough, and the mapping of the real sample on the latent variable space can be as close as possible to the assumed data distribution. The loss function of VAE includes KLD and RMSE. KLD can be expressed as Equation (24).

K L D = - 0.5 \times \sum_{i = 1}^{m} (1 + \log (σ_{i} {(τ)}^{2}) - {(μ_{i} (τ))}^{2} - {(σ_{i} (τ))}^{2}),

(24)

K L D = - 0.5 \times \frac{1}{v a} \sum_{k = 1}^{v a} \sum_{i = 1}^{m} (1 + \log (σ_{i}^{(k)} {(τ)}^{2}) - {(μ_{i}^{(k)} (τ))}^{2} - {(σ_{i}^{(k)} (τ))}^{2}) .

(25)

In this paper, VAE was used to calculate the KLD between residual sequence and standard normal distribution, as shown in Equation (25). The residual distribution is tested based on KLD. Where

μ_{i, j}^{(k)} (τ)

and

σ_{i, j}^{(k)} (τ)

are the mean and variance of the

k - th

output variable in the hidden variable space. In a statistical sense, KLD can be used to measure the degree of difference between two distributions. The process studied in this paper is generally in a random noise environment, and random noise has certain distribution characteristics. If only residual distribution caused by noise is compared with standard normal distribution, the difference between the two should be the smallest. However, if the residual also has the contribution of model error, the residual distribution will be more complex and diverse, and the difference from the standard normal distribution will be greater. Hence, we can derive the following:

The larger the KLD → the larger the difference between the residual distribution and standard normal distribution → the stronger the non-randomness in the residual → the larger the output error → the larger the model error → the value of neural network structure parameters is inappropriate;
The smaller the KLD → the smaller the difference between the residual distribution and standard normal distribution → the weaker the non-randomness in residual → the smaller the output error → the smaller the model error → the more appropriate the selection of network structure parameters.

To sum up, a group of structure parameters minimizing KLD is selected from the set of structure parameters corresponding to the smaller RMSE. This group of parameters is the optimal solution for the structure parameters of the neural network. In the experiment, this paper selects the first ten groups with better RMSE. However, the optimal structure parameters are the group corresponding to the minimum KLD among these ten groups. Based on this, the specific value vector of structure parameters is determined. In general, according to the above analysis, the specific process of the evaluation mechanism is to establish the value matrix of all structure parameters. Each row represents a complete value of structure parameters. Under this definition, the loop body traverses each row, trains, and tests the model

N N (•)

in turn, in order to obtain residual sequence matrix. Firstly, the RMSE is obtained and sorted from small to large. Then, take out the smaller serial numbers of the first 10 groups. According to these serial numbers, an optimized residual sequence matrix is formed. Next, the KLD of the residual sequence under these serial numbers is calculated by the loop body. Finally, the serial numbers and the group of structure parameters corresponding to the minimum KLD are taken as the optimal group of structure parameters. It should be mentioned that RMSE and KLD are calculated after data normalization in this paper. Finally, the data are denormalized, and then the predictive and real curves are compared by drawing a graph. The following algorithmic pseudo-code (Algorithm 1) provides a more intuitive representation of the exact flow of the evaluation mechanism.

Algorithm 1. Algorithm for evaluation mechanism

Begin

1. Initialize the network structure parameter vector

p_{1} \in ℝ^{L_{1} \times 1}

,

p_{2} \in ℝ^{L_{2} \times 1}

,

p_{3} \in ℝ^{L_{3} \times 1}

.

2. Initialize

k l_o p t i m a l = 0

and

s e q_o p t i m a l = 0

.

3. Generate the parameter matrix

M \in ℝ^{L \times 3}

using

p_{1}

,

p_{2}

and

p_{3}

.

L = L_{1} \cdot L_{2} \cdot L_{3}

.
// Overall structure of selected neural networks //

4. class NN

5. NNstructure

6. train()

7. test()

8. End

9. NN1 = NN()
// Calling different structure parameters for network training and prediction, and getting the corresponding predictive error and residual sequence //

10. For

i = 1 : L

// Normalization of input data //

11. [data_normalize_train, data_normalize_test] = normalize(data)

12. NNstructure = M(i, :)

13. NN1.train(NNstructure, data_normalize_train)

14. residual(i,:) = NN1.test(NNstructure, data_normalize_test)

15. rmse(i) = RMSE(residual(i, :))

16. End for
// Rank RMSE from small to large and get the corresponding parameter serial //

17. [rmse, seq] = sort(rmse)
// Calculate the KLD of residual sequence corresponding to the top 10 RMSE after ranking by VAE //

18. For

i = 1 : 10

19. residual2(i, :) = residual(seq(i), :)

20. [vae_kl(i, 1), ... , vae_kl(i, n)] = VAE_KL(residual2(i, :))

21. kl(i) = average(vae_kl(i, j))
// j = 1, 2, …, n, n represents the number of output variables //

22. End for
// The minimum value of KLD and its corresponding structure parameter serial number are obtained by bubble sorting //

23. kl_optimal = kl(1)

24. For

i = 2 : 10

25. If kl(i) < kl_optimal

26. kl_optimal = kl(i)

27. seq_optimal = seq(i)

28. End if

29. End for

30. print(seq_optimal, M(seq_optimal, :))
// The network is modeled and predicted again under the optimal structure parameters //

31. NNstructure = M(seq_optimal, :)

32. NN1.train(NNstructure, data_normalize_train)

33. residual = NN1.test(NNstructure, data_normalize_test)
// Renormalized the predictive output, and draw the predictive curve and the real curve //

34. yy = renormalize(y-residual)

35. plot(yy)

End

3.2. Predictive Time-Domain Parameterisation Methods

After determining the optimal network based on the previous description, the following is to predict the key process variables according to the actual situation. When performing multi-step predictions, the prediction time domain is not chosen arbitrarily and needs to be analyzed specifically according to the actual situation. To address the problem of uncertainty in the prediction time domain, a hybrid network is designed in this paper, which consists of

q

sub-networks. Each sub-network is of the same chosen model class, except that they have different prediction steps.

For the case where the prediction time domain is

q

, the hybrid network contains

q

sub-networks capable of predicting the data

{\hat{y}}_{t}, \dots, {\hat{y}}_{t + q - 1}

in real time as well as at

q - 1

moments in the future, respectively, and the

q - th

sub-network has a prediction step of

q

. For the case where the prediction time domain is 1, the hybrid network has only 1 sub-network, which can only predict the data

{\hat{y}}_{t}

corresponding to the real-time moment with a prediction step of 1. Observing the fitting degree between the predicted output sequence at the last prediction time step of each subnetwork and the target sequence. When the predictive trend of

(q + 1) - th

subnet becomes chaotic, then the predictive time domain is determined to be

q

.

Remark 1.

For the I/O of the neural network, input is the historical time series data matrix of different variables at the current and forward moments, which does not change. But the output is the predictive value of output variables for the next

q

steps, and it has to change. In particular,

q

changes gradually from 1 to large, and different values of

q

correspond to different subnets. This paper build

N N (•)

with such a hybrid structure as described above. The evaluation index RMSE of

N N (•)

is still relatively accurate until

q = q_{\max}

. However, when

q = q_{\max} + 1

, RMSE increases significantly. It means that the loss function value becomes larger, and the predictive performance becomes worse, that is, the predictive time domain is determined to be

q_{\max}

. In addition, it is worth noting that the predictive time domain value should be further determined after the overall modeling, evaluation, and prediction of

N N (•)

is completed, that is, after the optimal neural network is determined (including the value of optimal structure parameters and the specific form of optimal

N N (•))

because it is closely related to the internal dynamic characteristics of the system or process itself, and it needs to be further effectively determined on the basis of the optimal network.

Modeling and evaluation are mainly used in the offline training and testing phase to determine the optimal neural network, while prediction is mainly used in the online phase to provide effective online prediction and estimation of key output indicators or assay indicators. In the implementation process of multi-step prediction, the ability to predict the next few moments is mainly determined by the prediction time domain, which determines the upper limit of the prediction step size, the flow chart on the adjustment of the prediction time domain parameters is shown in Figure 3a.

The online prediction stage can predict the future moments of data arriving at the prediction time domain, and it can also determine the future prediction moments or real-time moments according to the actual needs of the project and output the predicted values of the real-time moments and the future moments in their entirety. The specific online prediction stage flow chart is shown in Figure 3b, where “actual prediction time domain” means a prediction time domain determined according to the actual production needs within the ideal maximum prediction time domain. After obtaining the prediction time domain for the actual production process, the corresponding hybrid network structure is established, and all the internal characteristic parameters of the corresponding sub-networks are extracted from all the internal characteristic parameters of the hybrid network stored in the parameterization stage of the prediction time domain, so as to generate several optimal sub-networks, and then predict the real-time data and the data at different moments in the future based on the various optimal sub-networks, so as to realize the task of online prediction.

4. Integrated Modelling, Evaluation, and Prediction Experiment

In this section, a Transformer–VAE model is established, which integrates the proposed Transformer model for industrial time series I/O design with the evaluation mechanism introduced in Section 3. This integration allows for obtaining an optimal set of network structure parameters for network prediction, along with the corresponding predictive results. The structure is illustrated in Figure 4. Where solid black arrows indicate the traversal of all group structure parameters within the structural matrix. Through network training and prediction, the RMSE corresponding to each group of network structure parameters is obtained, and the top few groups with smaller RMSE are selected for calculating the KL divergence. The group of network structure parameters corresponding to the minimum KL divergence represents the optimal structural parameters. Dashed orange arrows signify that after obtaining the optimal structural parameters, the network is retrained and predicted using these optimal parameters to achieve more accurate final prediction results.

In this section of the experiment, this paper models, evaluates, and predicts a real nonlinear dynamic process in the petrochemical field based on a modified transformer model. Compared with two baseline models, the predictive performance and effectiveness of the proposed model are verified. In the end, the determination of the predictive time domain under different conditions is explored based on the proposed model. The experiment environments are as follows: MATLAB R2021a software is used to realize the baseline models, open-source machine learning library Scikit-learn and deep learning framework Torch are used to implement our model. The experimental hardware environment of the baseline models is configured with Intel(R) Core (TM) i7-8550U CPU @ 1.80 GHz 1.99 GHz (Intel Corporation, Santa Clara, CA, USA), and the experimental hardware environment of our model is configured with Intel(R) Xeon(R) Gold 5118 CPU @ 2.30 GHz (Intel Corporation, Santa Clara, CA, USA), each of the 4 GPUs is NVIDIA GeForce RTX 2080Ti (Nvidia Corporation, Santa Clara, CA, USA), and the memory is 96 G.

4.1. Selection of Research Subjects

In this example, this paper uses actual data of catalytic cracking fractionation system in Tianjin Dagang Oil Refinery to verify the predictive ability of modified transformer model for nonlinear dynamic process. The catalytic cracking fractionation system is mainly composed of the main fractionator, oil and gas condensation cooling system at the top of the tower, diesel stripping tower, recycle oil tank, and middle circulating reflux. The high-temperature reaction oil-gas mixture exiting from the top of the reactor is fed into the desuperheater section of the lower part of the main fractionator. After heat exchange and washing, the catalyst powder enters the fractionator body. The partial process flow of FCC fractionator C2201 in this plant is shown in Figure 5. Its main fractionator is equipped with top circulating reflux, middle circulating reflux, partial refining reflux, and oil slurry circulating reflux.

This paper mainly considers three closed-loop control loops TC2202, FC2207, and LC2209, which have mutual coupling and mutual influence. To be specific, the change in the branch valve lift controlled by TC2202 affects flow and temperature, and the change in the main pipeline valve lift controlled by FC2207 also affects flow and temperature. Secondly, whether the opening of the main or branch pipeline changes, the input flow changes. At the same time, the liquid level under the LC2209 will change. Hence, three closed-loop control loops are mutually influenced and coupled. This paper set three manipulated variables of the above three closed-loop control loops as the process input variables

u_{1}

,

u_{2}

, and

u_{3}

, and set three controlled variables as the process output variables

y_{1}

,

y_{2}

, and

y_{3}

. The sampling interval is 1 min. This paper collected 2823 data of each input and output variable for three days from 12 August 2021 10:43 to 12 October 2021 09:44 as the dataset. The dataset is partitioned into training, validation, and test data by a ratio of 3:1:1 in this experiment. Then, based on the proposed model and evaluation mechanism, the multivariate nonlinear dynamic process with three inputs and three outputs is modeled, evaluated, and predicted.

In this experiment, to evaluate the forecasting performance of our proposed method, two baseline models are compared, including RNN and Gated Recurrent Unit (GRU). RNN is a classical deep neural network and also a memory model where connections between computational units form a directed circle. Unlike feed-forward networks, RNNs can use their internal memory to process arbitrary sequences of inputs, thus providing dynamic memory. GRU is a variation in Long Short-Term Memory (LSTM). It can not only solve the problem of gradient disappearance in RNNs but also simplify the network structure of LSTM and improve the convergence speed [41].

4.2. Determination of Optimal Structural Parameters of Transformer

In terms of model parameter settings, the optimal structural parameters of the relevant models in this section are derived according to the experimental procedure for optimal structural parameter determination in Section 3.

For the Transformer model incorporating industrial timing I/O design, the batch size is set to 100, dropout is set to 0.1, the learning rate is set to 0.009, and the AdamW function is used as the model optimizer and the loss function is the same as that of the baseline model for evaluating the model. In addition, the number of forward moments

n

in the Transformer model designed in this paper is set to 30. The number of encoder layers (

n_{e}

), the number of attention heads (

n_{h}

), and the number of hidden layer neurons (

n_{n}

) in the feedforward neural network are set as variable parameters as the network structure parameters. The range of initialized network structure parameters in this subsection of the experiment:

n_{e} = {1, 2, 3, 4, 5, 6}

,

n_{h} = {2, 4, 6, 10, 15, 20}

,

n_{n} = {50, 60, 70, 80, 90}

.Then, this subsection determines the optimal network structure parameters based on the Transformer–VAE model that makes the designed Transformer predictions the best for subsequent experiments.

Firstly, we perform a single-step prediction of the validation set under 180 groups of different structure parameters and obtain the first ten groups of parameter combinations with smaller RMSE through the evaluation mechanism. In order to minimize the interference of model errors on predictive results, we further calculate the KLD of the residual sequence corresponding to each group of parameters through the VAE network. Table 1 shows the predictive results of the modified transformer under these ten parameter combinations. Through the performance index RMSE, it can be seen that the proposed model has high predictive accuracy under these ten parameter combinations. The optimal parameter combination is (1,80,4). Its RMSE is 0.01357. However, in order to reduce the influence of model error on predictive results as much as possible, residual characteristics are further excavated by calculating KLD. Finally, the optimal solution of neural network structure parameters is obtained as (1,90,2). Hence, we set

n_{e}

,

n_{h}

and

n_{n}

to 1, 90, and 2, respectively, for the follow-up experiments of this example. It should be mentioned that RMSE and KLD are calculated after data normalization in this paper, and so are the performance indexes used in the following experiments. Subsequently, the data need to be denormalized.

4.3. Comparison Experiment

In this paper, single-step prediction of a multivariate nonlinear dynamic process is performed using the proposed model and three reference models. The number of forward propagation units of the reference model GRU is set to 30, the number of backward updating units to 11, the dimensionality of the state variables within the module to 30, the RMSE is used as the loss function, and the Adam algorithm is chosen to optimize the network. The reference model RNN contains a hidden recurrent layer consisting of 30 neural units, with a backward update unit count of 30, and the rest of the setup is consistent with GRU. The reference model Transformer uses a traditional data input–output structure, and the rest of the parameters are consistent with the modified Transformer model proposed in this paper. The training process is conducted for 200 epochs.

Table 2 shows the single-step predictive results of four models on the test set. As shown in Table 2, the experimental results show that compared with the three baseline models, the modified transformer has a smaller predictive error and higher predictive accuracy. More particularly, the RMSE, MAE, and MAPE of the modified Transformer model designed in this paper are reduced by 70.9%, 63.6%, and 65.8%, respectively, over Transformer, 82.7%, 85.7%, and 62.0% over GRU, and 89%, 91.5%, and 67.7% over RNN.

Figure 6 visually presents the comparisons between predictive curves and real curves of three models in the case of single-step prediction, where legend

t

represents the prediction at the current moment in time. We collect 100 moments of data ranging from 12 October 2021 00:30 to 12 October 2021 02:09 in the test set. It can be seen from Figure 6 that the predictive value based on the modified transformer can well fit the trend of the actual value and has higher modeling accuracy.

4.4. Predictive Time-Domain Parameterisation of Modify Transformer

This subsection performs multistep prediction based on a Transformer model incorporating industrial timing I/O design in the context of a catalytic cracking fractionation system. The objective is to explore the laws for predicting time-domain parameters for multivariate nonlinear dynamic processes.

As shown in Figure 7, the comparisons of predictive errors are produced by the designed hybrid network on the test set, where

t + p

is used to indicate the future

p - th

moments. This hybrid network includes four subnets. This means that the

j - th

subnet can provide the comparative errors at the future

j - th

step, which are described in sequence in Figure 7a–d. As shown in Figure 8, this hybrid network includes five subnets. This means that the

j - th

subnet can provide the comparative errors at the future

j - th

step, which are described in sequence in Figure 8a–e. It can be seen from Figure 8 that when the predictive step increases to 5, the predictive trend begins to become chaotic.

Hence, it is more appropriate to choose 4 as the predictive time domain in this example. The performance indexes in Table 3 can also illustrate the above conclusion. Table 3 shows the performance indexes of modified transformers under different predictive time domains. It can also be seen that with the increase in the predictive step, the predictive effect of our model is becoming worse and worse, and the predictive accuracy is becoming lower and lower. The longer the predictive step is, the harder it is to predict the corresponding moment in the future. More specifically, when the predictive step is 4, RMSE, MAE, and MAPE are 2.28, 2, and 2.03 times for those when the predictive step is 1 (

R M S E = 0.01683, M A E = 0.01195,

M A P E = 28.64307

). The experimental results of Figure 7 and Figure 8 and Table 3 show that a modified transformer can accurately predict multivariate nonlinear dynamic processes in multiple steps when the predictive time domain is selected properly.

5. Discussion

In this paper, we propose a Transformer model that incorporates industrial timing I/O design. On this basis, we conduct a comparative experiment on a real multivariate nonlinear dynamic process through a modified transformer model and three baseline models. These results corroborate the conclusions of a large number of previous studies [19] that the Transformer model can be used in time series forecasting.

It can be concluded from Table 2 and Figure 6 that the method proposed in this paper is superior in the task of predicting nonlinear dynamic processes. This result is consistent with the findings of Reza et al. [42]. This is because the ability of GRU and RNN to extract information features is relatively poor, and it is not easy to accurately predict long-time series [15,41]. The Transformer, which is designed by integrating industrial time series I/O, can fully capture the impact of the input sequence on the output sequence. In other words, it can not only effectively learn hidden nonlinear correlation characteristics of multivariate time series data, but also learn different effects of different input sequences on output sequence at different times. From Table 2, it can be seen that GRU has a better prediction than RNN. This is consistent with the statement that GRU has a longer memory than RNN, and GRU solves the problem of vanishing gradient to some extent [43].

The present paper proposes an integrated approach to the modeling, evaluation, and prediction of Transformer–VAE. The optimal parameter solution of the modified transformer model is obtained through an evaluation mechanism.

In addition, we carry out an exploratory experiment on the predictive time domain. It can be seen from Figure 7 that with the continuous increase in the predictive step, the fitting degree of the predictive curve and the real curve is becoming worse and worse, and the tracking effect is also becoming worse and worse. However, under each future step of this time domain, the predictive curve can track the trend of the actual curve well, so there is no way to determine the predictive time domain of the process. As shown in Figure 8, the predictive trend begins to become chaotic. It means that the predictive data cannot track the trend of the actual data and are difficult to predict more accurately. This also reflects a practical problem of multi-step prediction. The predictive time domain needs to be selected according to the actual situation. This selection depends on the internal dynamics of the system or process itself, rather than the fact that a modified transformer model can well predict the data at any time in the future.

The prediction model proposed in this paper can provide a valuable reference for industrial early warning and other applications based on multivariate time series data.

6. Conclusions

This paper presents an integrated approach to modeling, evaluation, and prediction using Transformer–VAE. The research object is an industrial catalytic cracking fractionation system. The comparative experiments and analyses demonstrate that the proposed evaluation system is capable of identifying the optimal network structure parameters. Furthermore, the Transformer model, which integrates the industrial time-sequence I/O design exhibits enhanced prediction accuracy, while the prediction time domain is investigated through the hybrid network to ascertain the optimal prediction time domain range. The Transformer model, which integrates industrial time series I/O, offers a valuable reference point for industrial early warning based on multivariate time series data and other applications. However, this paper still has areas that need to be studied in depth, the subsequent research is described below: this paper lacks interpretability in the prediction of nonlinear dynamic processes using deep learning models, visualizations can then be used to improve the interpretability of the model; this paper does not consider the fitness of the optimal network structure of the model to the system when the system undergoes dynamic changes. In the future, an adaptive adjustment mechanism for the network structure could be designed to accommodate changing industrial process conditions or significant fluctuations in process loads. In addition, this paper is limited regarding the prediction time domain and relies heavily on experimental determinations. Subsequent studies will address the relationship between the prediction time domain and industrial process modeling.

Author Contributions

Conceptualization, J.Z., Z.W., X.L. and M.C.; Data curation, J.Z. and J.L.; Formal analysis, J.Z.; Funding acquisition, Z.W.; Investigation, J.Z.; Methodology, J.Z. and Z.W.; Project administration, Z.W.; Resources, Z.W.; Software, J.Z. and J.L.; Supervision, Z.W., X.L. and M.C.; Validation, J.Z.; Visualization, J.Z.; Writing—original draft, J.L.; Writing—review and editing, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 61703434) and the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023).

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Jiaxuan Liu was employed by the company Research Institute of Petroleum Exploration & Development, PetroChina Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Richalet, J. Industrial applications of model based predictive control. Automatica 1993, 29, 1251–1274. [Google Scholar] [CrossRef]
Zhu, Y.; Patwardhan, R.; Wagner, S.B.; Zhao, J. Toward a low cost and high performance MPC: The role of system identification. Comput. Chem. Eng. 2013, 51, 124–135. [Google Scholar] [CrossRef]
Wang, Z.; Zhao, L.; Luo, X. Wiener structure based adaptive control for dynamic processes with approximate monotonic nonlinearities. J. Frankl. Inst. 2020, 357, 13534–13551. [Google Scholar] [CrossRef]
Kadlec, P.; Gabrys, B.; Strandt, S. Data-driven soft sensors in the process industry. Comput. Chem. Eng. 2009, 33, 795–814. [Google Scholar] [CrossRef]
Qin, S.J.; Badgwell, T.A. A survey of industrial model predictive control technology. Control. Eng. Pract. 2003, 11, 733–764. [Google Scholar] [CrossRef]
Wang, Y.; Xu, B.; Pang, C. Intelligent identification method of chemical processes based on maximum mean discrepancy domain generalization. J. Taiwan Inst. Chem. Eng. 2023, 150, 105075. [Google Scholar] [CrossRef]
Jain, A.K.; Mao, J.; Mohiuddin, K.M. Artificial neural networks: A tutorial. Computer 1996, 29, 31–44. [Google Scholar] [CrossRef]
Kim, K.H.; Kwak, B.I.; Han, M.L.; Kim, H.K. Intrusion detection and identification using tree-based machine learning algorithms on DCS network in the oil refinery. IEEE Trans. Power Syst. 2022, 37, 4673–4682. [Google Scholar] [CrossRef]
Kumpati, S.N.; Kannan, P. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Netw. 1990, 1, 4–27. [Google Scholar]
Chen, X.; Chen, X.; She, J.; Wu, M. A hybrid time series prediction model based on recurrent neural network and double joint linear–nonlinear extreme learning network for prediction of carbon efficiency in iron ore sintering process. Neurocomputing 2017, 249, 128–139. [Google Scholar] [CrossRef]
Xie, J.; Zhou, P. Robust stochastic configuration network multi-output modeling of molten iron quality in blast furnace ironmaking. Neurocomputing 2020, 387, 139–149. [Google Scholar] [CrossRef]
Xu, K.-K.; Yang, H.-D.; Zhu, C.-J. A novel extreme learning machine-based Hammerstein-Wiener model for complex nonlinear industrial processes. Neurocomputing 2019, 358, 246–254. [Google Scholar] [CrossRef]
Vaswani, A. Attention is all you need. Adv. Neural Inform. Process. Syst. 2017. [Google Scholar]
Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-based Models for NLP Tasks. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; pp. 179–183. [Google Scholar]
Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X. A comparative study on transformer vs rnn in speech applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar]
Maurício, J.; Domingues, I.; Bernardino, J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl. Sci. 2023, 13, 5521. [Google Scholar] [CrossRef]
Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125. [Google Scholar]
Wu, N.; Green, B.; Ben, X.; O′Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv 2020, arXiv:2001.08317. [Google Scholar]
Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A transformer-based framework for multivariate time series representation learning. In Proceedings of the Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 2114–2124. [Google Scholar]
Tuli, S.; Casale, G.; Jennings, N.R. Tranad: Deep transformer networks for anomaly detection in multivariate time series data. arXiv 2022, arXiv:2201.07284. [Google Scholar] [CrossRef]
Hongmei, R.; Xuemin, T.; Ping, W. Dynamic soft sensor method based on joint mutual information. CIESC J. 2014, 65, 4497. [Google Scholar]
Wang, Y.; Liu, D.; Liu, C.; Yuan, X.; Wang, K.; Yang, C. Dynamic historical information incorporated attention deep learning model for industrial soft sensor modeling. Adv. Eng. Inform. 2022, 52, 101590. [Google Scholar] [CrossRef]
Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
Allen, D.M. Mean square error of prediction as a criterion for selecting variables. Technometrics 1971, 13, 469–475. [Google Scholar] [CrossRef]
Fischer, C.; Peglow, M.; Tsotsas, E. Restoration of particle size distributions from fiber-optical in-line measurements in fluidized bed processes. Chem. Eng. Sci. 2011, 66, 2842–2852. [Google Scholar] [CrossRef]
Belov, D.I.; Armstrong, R.D. Distributions of the Kullback–Leibler divergence with applications. Br. J. Math. Stat. Psychol. 2011, 64, 291–309. [Google Scholar] [CrossRef] [PubMed]
Youssef, A.; Delpha, C.; Diallo, D. An optimal fault detection threshold for early detection using Kullback–Leibler divergence for unknown distribution data. Signal Process. 2016, 120, 266–279. [Google Scholar] [CrossRef]
Ensor, L.A.; Robeson, S.M. Statistical characteristics of daily precipitation: Comparisons of gridded and point datasets. J. Appl. Meteorol. Climatol. 2008, 47, 2468–2476. [Google Scholar] [CrossRef]
Artyushenko, V.M.; Volovach, V.I. Statistical characteristics of envelope outliers duration of non-Gaussian information processes. In Proceedings of the East-West Design & Test Symposium (EWDTS 2013), Rostov-on-Don, Russia, 27–30 September 2013; pp. 1–4. [Google Scholar]
Rao, R.S.; Kalabarige, L.R.; Alankar, B.; Sahu, A.K. Multimodal imputation-based stacked ensemble for prediction and classification of air quality index in Indian cities. Comput. Electr. Eng. 2024, 114, 109098. [Google Scholar] [CrossRef]
Rao, R.S.; Kalabarige, L.R.; Holla, M.R.; Sahu, A.K. Multimodal Imputation based Multimodal autoencoder framework for AQI classification and prediction of Indian cities. IEEE Access 2024, 14, 108350–108363. [Google Scholar]
Luo, Z.; Xiong, Y.; Zuo, R. Recognition of geochemical anomalies using a deep variational autoencoder network. Appl. Geochem. 2020, 122, 104710. [Google Scholar] [CrossRef]
Dallal, G.E.; Wilkinson, L. An analytic approximation to the distribution of Lilliefors’s test statistic for normality. TAS 1986, 40, 294–296. [Google Scholar] [CrossRef]
Gupta, A.K.; Chen, T. Goodness-of-fit tests for the skew-normal distribution. Commun. Stat.-Simul. Comput. 2001, 30, 907–930. [Google Scholar]
Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Jais, I.K.M.; Ismail, A.R.; Nisa, S.Q. Adam optimization algorithm for wide and deep neural network. Knowl. Eng. Data Sci. 2019, 2, 41–46. [Google Scholar] [CrossRef]
Llugsi, R.; El Yacoubi, S.; Fontaine, A.; Lupera, P. Comparison between Adam, AdaMax and Adam W optimizers to implement a Weather Forecast based on Neural Networks for the Andean city of Quito. In Proceedings of the 2021 IEEE Fifth Ecuador Technical Chapters Meeting (ETCM), Virtual, 12–15 October 2021; pp. 1–6. [Google Scholar]
Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Lavrova, D.; Zegzhda, D.; Yarmak, A. Using GRU neural network for cyber-attack detection in automated process control systems. In Proceedings of the 2019 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom), Sochi, Russia, 3–6 June 2019; pp. 1–3. [Google Scholar]
Reza, S.; Ferreira, M.C.; Machado, J.J.; Tavares, J.M.R. A multi-head attention-based transformer model for traffic flow forecasting with a comparative analysis to recurrent neural networks. Expert Syst. Appl. 2022, 202, 117275. [Google Scholar] [CrossRef]
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]

Figure 1. The transformer predictive model integrated with industrial timing I/O design.

Figure 2. The I/O design of transformer predictive model.

Figure 3. The flow chart of predictive time domain. (a) The flow chart of tuning predictive time domain parameter; (b) the flow chart of online predictive stage.

Figure 4. Transformer–VAE model.

Figure 5. Part of process flow diagram of catalytic cracking fractionator.

Figure 6. The comparisons of predictive curves and real curves of different models in single-step prediction. (a) RNN; (b) GRU; (c) Transformer; (d) modified transformer.

Figure 7. The comparisons of predictive errors in different future steps using different subnets (predictive time domain is 4). (a) The predictive results in future 1st step using subnet 1; (b) the predictive results in future 2nd step by using subnet 2; (c) the predictive results in future 3rd step by using subnet 3; (d) the predictive results in future 4th step by using subnet 4.

Figure 8. The comparisons of predictive errors in different future steps using different subnets (predictive time domain is 5). (a) The predictive results in future 1st step using subnet 1; (b) the predictive results in future 2nd step by using subnet 2; (c) the predictive results in future 3rd step by using subnet 3; (d) the predictive results in future 4th step by using subnet 4; (e) the predictive results in future 5-th step by using subnet 5.

Table 1. Evaluation results of evaluation mechanism.

$(n_{e}$ $, n_{n}$ $, n_{h}$ )	RMSE	KLD
(1,80,4)	0.01357	0.2730
(1,90,4)	0.01364	0.2673
(1,60,4)	0.01455	0.2801
(1,90,2)	0.01473	0.2571
(1,50,2)	0.01483	0.2699
(2,80,2)	0.01499	0.2753
(2,70,2)	0.01501	0.2740
(2,50,2)	0.01545	0.2777
(1,50,4)	0.01620	0.2804
(1,50,6)	0.01631	0.2913

Table 2. Performance comparisons of different models.

Models	RMSE	MAE	MAPE (%)
RNN	0.1238	0.1125	71.4814
GRU	0.0786	0.0671	60.8175
Transformer	0.0468	0.0264	67.5926
Modified transformer	0.0136	0.0096	23.1028

Table 3. Performance comparisons of modified transformer under different predictive time domains.

Indexes	t + 1	t + 2	t + 3	t + 4	t + 5
RMSE	0.01683	0.02261	0.03023	0.03838	0.04940
MAE	0.01195	0.01517	0.01881	0.02385	0.03063
MAPE (%)	28.64307	36.51211	45.86090	58.12561	75.83628

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, J.; Wang, Z.; Liu, J.; Luo, X.; Chen, M. Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data. Processes 2025, 13, 184. https://doi.org/10.3390/pr13010184

AMA Style

Zhou J, Wang Z, Liu J, Luo X, Chen M. Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data. Processes. 2025; 13(1):184. https://doi.org/10.3390/pr13010184

Chicago/Turabian Style

Zhou, Jianqiao, Zhu Wang, Jiaxuan Liu, Xionglin Luo, and Maoyin Chen. 2025. "Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data" Processes 13, no. 1: 184. https://doi.org/10.3390/pr13010184

APA Style

Zhou, J., Wang, Z., Liu, J., Luo, X., & Chen, M. (2025). Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data. Processes, 13(1), 184. https://doi.org/10.3390/pr13010184

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Modeling and Evaluation of Attention Mechanism Neural Network Based on Industrial Time Series Data

Abstract

1. Introduction

2. Modified Transformer Model Incorporating Industrial Timing I/O Design

2.1. Incorporating Industrial Timing I/O Design

2.2. Modified Transformer Model Structure

2.2.1. Positional Encoding

2.2.2. Multi-Head Self-Attention Mechanism

2.2.3. Residual Linking and Normalization

2.2.4. Feed-Forward Neural Network

2.2.5. Encoder and Simplified Decoder

2.3. Learning Rate Optimization Algorithm

3. Evaluation Mechanisms of Models and Methods of Parameter Integration in the Prediction Time Domain

3.1. Evaluation Mechanism

3.2. Predictive Time-Domain Parameterisation Methods

4. Integrated Modelling, Evaluation, and Prediction Experiment

4.1. Selection of Research Subjects

4.2. Determination of Optimal Structural Parameters of Transformer

4.3. Comparison Experiment

4.4. Predictive Time-Domain Parameterisation of Modify Transformer

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI