3.1. Prediction Effects of Data-Driven Models
The process and results of the parameter optimization of the data-driven model are shown in
Table A1 and
Table A2. The collected hydrocracking product data were fed into the seven data-driven models, and their performance was analyzed and compared using several assessment metrics. The calculation process of the data-driven model is shown in
Figure 6.
A total of 104 sets of operational data collected from industrial hydrocracking units were fed into the model and divided into 80 sets for training and 24 sets for testing. After normalization, the input data, together with the optimized and selected parameters, were used for the model-training iterations. The calculation results of the various models on the test set are shown in
Table 3.
Based on the analysis of RMSE, R2, and MAE values, it can be concluded that the optimal data-driven model for gas prediction is RBF. For all other hydrocracking products, CNN-LSTM emerges as the optimal choice. Although the input data do not fully meet the prediction requirements of the model, relatively favorable results are achieved for heavy naphtha, kerosene, and residue. These results can be attributed to the learning of spatiotemporal information enabled by the CNN-LSTM model. The CNN component extracts spatial features, while the LSTM component processes time series data, thus improving the accuracy of the model.
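The three assessment metrics referenced above can be computed directly; the following sketch (plain NumPy, not the code used in the paper) shows one conventional definition of each:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A lower RMSE/MAE and an R2 closer to 1 indicate a better fit, which is the basis of the comparison in Table 3.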
According to the calculation results in
Table 3 and
Figure 7 and considering that the gas product exhibits a lower flowrate and is not the primary product of hydrocracking, the CNN-LSTM model was selected as the optimal data-driven model type in this paper. Subsequent optimization and adjustments will be built upon the CNN-LSTM model.
3.2. Combined Model of CNN-LSTM and Discrete Lumping
The mechanism model was solved by the Runge–Kutta method, and the quasi-Newton method combined with a genetic algorithm was used to obtain accurate optimal parameter estimates. The objective function for optimization was the sum of the squared absolute errors between the calculated and actual values. The kinetic parameters obtained from the optimization are shown in
Table 4.
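As an illustration of the Runge–Kutta solution of a discrete lumped model, the sketch below integrates a toy first-order lumping scheme (feed cracking into three product lumps) with classical RK4; the lumps and rate constants are invented for illustration and are not the fitted values of Table 4:

```python
import numpy as np

# Illustrative first-order rate constants (1/h); NOT the fitted values in Table 4.
k = np.array([0.5, 0.3, 0.2])  # feed -> three example product lumps

def rhs(c):
    """Right-hand side of the lumped kinetic ODEs: the feed lump c[0]
    cracks into three product lumps in parallel first-order reactions."""
    dc = np.empty(4)
    dc[0] = -k.sum() * c[0]  # feed consumption
    dc[1:] = k * c[0]        # product formation
    return dc

def rk4_step(c, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = rhs(c)
    k2 = rhs(c + 0.5 * h * k1)
    k3 = rhs(c + 0.5 * h * k2)
    k4 = rhs(c + h * k3)
    return c + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(c0, t_end, n_steps):
    """Integrate the lump concentrations from t = 0 to t_end."""
    c, h = np.asarray(c0, float), t_end / n_steps
    for _ in range(n_steps):
        c = rk4_step(c, h)
    return c
```

Because the scheme is linear and mass-conserving, the total mass of the lumps stays constant along the integration, a useful sanity check for any lumped kinetic solver.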
Multiple data sets distinct from those used for parameter estimation were employed to verify the reliability of the discrete lumped model. The verification results comparing calculated and actual values are shown in
Figure 8. The statistical relative errors are shown in
Table 5. The calculation errors were basically less than 10% for product prediction, demonstrating the relatively high accuracy of the discrete lumped model. Compared with CNN-LSTM, however, the error of the mechanism model was still relatively high, especially for the prediction of the heavy naphtha and residue flowrates, mainly because of the limited input information learned by the single mechanism model. A combined mechanism and data-driven model can therefore be established to unite the advantages of the two models.
By coupling the data-driven model and mechanism model, a new hybrid model network comprising 14 layers is proposed, including double convolutional layers, double pooling layers, LSTM layers, fully connected layers, and a data processing section. The hybrid model network is shown in
Figure 9.
The main idea in establishing the mechanism and data-driven hybrid model was to further strengthen the transmission of input-variable information. As described above, the input variables were fed into the CNN-LSTM model to obtain its calculated values. The calculated value of the mechanism model was obtained by entering the input variables into the discrete lumped model, which carries additional input-variable information, since the mechanism model processes the input variables differently from the data-driven model. The Convolutional Neural Network–Long Short-Term Memory Network–Discrete Lumping Model (CNN-LSTM-DLM) was constructed by incorporating the mechanism model's calculated results into the CNN-LSTM model. There are two main ways to combine the mechanism model and the data-driven model, as shown in
Figure 10. In one, the calculation result of the mechanism model is taken as an additional input variable of the data-driven model, establishing a series structure model. In the other, the residual between the mechanism model's calculation result and the actual value is taken as the output variable of the data-driven model, establishing a parallel structure model [
20]. The initial values of the hyperparameters used by the CNN-LSTM-DLM are consistent with the CNN-LSTM model.
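The two coupling structures can be expressed compactly. The sketch below, with an invented linear stand-in for the mechanism model, shows how the series structure appends the mechanism prediction as an extra input feature, while the parallel structure trains the data-driven model on the residual:

```python
import numpy as np

def mechanism_model(x):
    """Stand-in for the discrete lumped model's calculated yield
    (an arbitrary linear function, purely illustrative)."""
    return 0.8 * x[..., 0] + 0.1

def series_features(x):
    """Series structure: append the mechanism prediction as one extra
    input feature for the data-driven model."""
    return np.concatenate([x, mechanism_model(x)[..., None]], axis=-1)

def parallel_target(x, y):
    """Parallel structure: the data-driven model is trained to predict the
    residual between the actual value and the mechanism prediction."""
    return y - mechanism_model(x)
```

In the series case the final prediction comes straight from the data-driven model; in the parallel case it is the mechanism prediction plus the learned residual.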
The industrial hydrocracking data were fed into the established CNN-LSTM-DLM model, and its performance was compared with that of the single data-driven model to verify the effectiveness of the hybrid model. The correlation coefficient R2 of the two types of models was compared, and the results are shown in
Figure 11.
According to the preliminary calculation results, the prediction effect of the mechanism-data-driven combined model in series form was significantly better than that of both the single data-driven model and the parallel model. The poor prediction effect of the parallel model may stem from the large calculation bias of some mechanism model results, which deviates the final fitting direction. Therefore, the series structure model was adopted as the mechanism-data-driven hybrid model for the subsequent research. The comparison of prediction effects is shown in
Figure 12.
According to the RMSE, R2, and MAE values of the key indicators used to assess model performance, the accuracy of the established CNN-LSTM-DLM model is significantly improved compared with the single data-driven model, mainly because the existing data information is extracted more effectively after the discrete lumped model is added. The improved prediction accuracy can help operators adjust reaction conditions (such as temperature, pressure, catalyst amount, etc.) more precisely to optimize the process. Meanwhile, the training time required by the various models is shown in
Figure 13.
In terms of computational cost, the calculation time of the CNN-LSTM-DLM model is longer than that of the data-driven model. On the 2.30 GHz CPU used here, training the model takes about 40 s. Although the computational time has increased, it still meets the efficiency requirements of real-time calculation and does not raise the calculation cost excessively.
To clarify the impact of the mechanism model calculated values on the hybrid model results, a SHAP analysis and MIC analysis are conducted on the input variables of the CNN-LSTM-DLM model. SHAP is an interpretability method based on cooperative game theory, used to quantify the contribution of each feature to the model’s predictions. It is grounded in Shapley values, which fairly allocate the impact of predictions by calculating the marginal contribution of features across all possible feature combinations. MIC is a statistical method for measuring nonlinear relationships between two variables. It is based on mutual information, which quantifies the reduction in uncertainty of one variable given knowledge of another. These analyses deepen the understanding of the relationship between the mechanism model calculated values and the hybrid model results, shedding light on the significance and role of these calculated values within the CNN-LSTM-DLM model.
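For intuition, Shapley values can be computed exactly for a small model by enumerating all feature coalitions, as in the brute-force sketch below. Practical SHAP implementations use efficient approximations, and the baseline-replacement convention for "absent" features here is one common choice, not necessarily the one used in the paper:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of prediction f(x), enumerating all coalitions.
    Features absent from a coalition are replaced by their baseline value."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for s in combinations(others, r):
                # Shapley weight for a coalition of size r
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                with_i, without_i = baseline.copy(), baseline.copy()
                for j in s:
                    with_i[j] = x[j]
                    without_i[j] = x[j]
                with_i[i] = x[i]
                # marginal contribution of feature i to this coalition
                phi[i] += w * (f(with_i) - f(without_i))
    return phi
```

By construction, the values sum to f(x) minus f(baseline), which is the "fair allocation" property that makes the feature-importance ranking in Figure 14 well defined.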
In Figure 14, positive SHAP values indicate promotion (yellow points) and negative values indicate inhibition (blue points). The newly introduced calculated-value features from the mechanism model exhibit a highly significant influence on the final calculation results: the mechanism model calculation values were consistently among the top two most important features for all the output variables. This result can be attributed to the fact that the mechanism model calculations effectively extract relevant information, producing a strong correlation with the product yield. The SHAP analysis thus provides evidence of the mechanism model's crucial role in the CNN-LSTM-DLM calculations and highlights that its integration contributes to the enhanced prediction accuracy of the hybrid model. The MIC values between the mechanism model calculation results and the predictions were all greater than 0.2 (dashed line in
Figure 14f) in the hybrid model, and for light naphtha, kerosene, and residue they were all above 0.6. This shows that adding the mechanism model improves the correlation between the input and output variables and increases the feature information available for prediction, which helps reveal the correlation between important data features and the model and improves its prediction accuracy. By applying the SHAP and MIC analyses to the CNN-LSTM-DLM model, insights are gained into how the calculated values from the mechanism model influence the overall predictions.
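The idea behind MIC can be illustrated with a single-grid mutual information estimate; the full MIC additionally searches over many grid resolutions and normalizes the score, which is omitted in this simplified sketch:

```python
import numpy as np

def grid_mutual_information(x, y, bins=8):
    """Histogram-based mutual information estimate (in bits) on one fixed grid.
    MIC proper maximizes a normalized version of this quantity over many
    grid resolutions; this single-grid estimate only illustrates the idea."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                     # joint distribution over cells
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))
```

A strong functional dependence gives a value near log2(bins), while an uninformative input gives a value near zero, mirroring how the 0.2 threshold in Figure 14f separates weak from meaningful associations.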
The prediction of reactor bed temperature is of great significance for improving the operation efficiency of the reactor, ensuring production safety, improving product quality, realizing energy savings, and reducing consumption. The prediction results and evaluation effects are shown in
Figure 15. These established models were applied to the bed temperature prediction of the hydrocracking reactor. Compared with the prediction effect of hydrocracking product yields, various data-driven models and the CNN-LSTM-DLM model had better effects on the bed outlet temperature prediction. The main reason is that the outlet temperature is highly correlated with the inlet temperature, thus it is relatively easy to extract data features.
By comparing the prediction effects of the established models, it can be found that the comprehensive effect of the established CNN-LSTM-DLM has the best performance. The CNN-LSTM-DLM model showed the minimum average RMSE value and the maximum average R2 value, indicating that the hybrid model can maximize the extraction of information. The calculated results of the CNN-LSTM-DLM model have a good consistency with the actual values of the hydrocracking reactor bed temperature in both the test set and training set.
3.3. PSO Optimization Combination Model
The combination of the CNN-LSTM model and the mechanism model greatly improves the prediction accuracy, but the prediction is still relatively poor at some marginal points, especially for gas product prediction. One reason is that the same learning rate, number of hidden neurons, regularization coefficient, and dropout rate were set for every calculation. These hyperparameters have a significant impact on the fitting effect of the simulation, but they are difficult to adjust manually.
The Particle Swarm Optimization (PSO) algorithm is a heuristic optimization technique. The PSO algorithm is known for its simplicity in implementation and high computational efficiency, making it a popular choice in various fields. Over recent years, it has gained widespread adoption and has proved to be a valuable tool in solving complex optimization problems [
46]. The specific process of PSO optimization and the hybrid model is shown in
Figure 16.
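A minimal PSO loop looks like the following sketch; the inertia and acceleration coefficients (w, c1, c2) and the test function are illustrative defaults, not the settings used in the paper:

```python
import numpy as np

def pso(f, bounds, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal Particle Swarm Optimization (minimization). Each particle keeps
    its personal best; the swarm shares a global best that steers velocities."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, float).T
    dim = len(lo)
    x = rng.uniform(lo, hi, (n_particles, dim))   # initial positions
    v = np.zeros_like(x)                          # initial velocities
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()          # global best position
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)                # keep particles in bounds
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, float(pbest_val.min())
```

In the hyperparameter-tuning setting of this section, f would evaluate the CNN-LSTM-DLM validation error for a candidate (learning rate, hidden neurons, regularization coefficient, dropout rate) vector, and bounds would be the search ranges of Table 6.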
For the gas yield prediction model,
Table 6 shows the search range and the corresponding optimized values of its hyperparameters. This optimization process ensures that the CNN-LSTM-DLM model is fine-tuned to achieve the best possible performance in predicting gas product yield.
Increasing the number of hidden neurons may improve the ability of the model to capture the complex relationship of the data, but too many hidden neurons can cause the overfitting of the model and reduce the generalization ability of the model [
47]. The appropriate initial learning rate can help the model converge to the optimal solution faster [
39]. If the initial learning rate and the parameter update are too large, the model has difficulty converging during training. Too small a learning rate results in slow model convergence and a longer training time. The regularization coefficient penalizes excessive weight values by introducing the L2 norm of weights into the loss function, which can make the model tend to learn smaller weight values and avoid overfitting [
48]. Dropout was applied in the fully connected layer following the convolutional and pooling layers of the CNN module, which can reduce overfitting and improve the model's stability. Therefore, a suitable choice of key hyperparameter ranges enhances the flexibility and adaptability of the model, enabling it to handle diverse product prediction scenarios effectively. The results of the PSO-optimized gas flowrate prediction are shown in
Figure 17.
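The roles of the regularization coefficient and dropout rate discussed above can be sketched as follows (illustrative NumPy definitions, not the paper's implementation):

```python
import numpy as np

def l2_regularized_mse(y_true, y_pred, weights, lam):
    """MSE loss plus an L2 weight penalty: the coefficient lam discourages
    large weights, pushing the model toward smoother, less overfit solutions."""
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * np.sum(weights ** 2)
    return float(mse + penalty)

def dropout(a, rate, rng):
    """Inverted dropout: zero a random fraction of activations during training
    and rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)
```

Raising lam or the dropout rate trades some training-set accuracy for better generalization, which is exactly the balance the PSO search navigates.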
After optimization by the PSO algorithm, the R2 value on the gas product test set increased from 0.332 to 0.492, effectively improving the prediction accuracy. Analysis of the optimized parameters shows that, for a data set of this small size, suppressing overfitting is the priority, so some accuracy on the training set has to be sacrificed. Considering limitations such as insufficient input variables and the weak correlation between many inputs and product yields, the computational results of the new strategy were satisfactory.
The CNN-LSTM-DLM models of kerosene, residue, heavy naphtha, and light naphtha were also optimized using the PSO algorithm and compared with the test set data. The predicted results are shown in
Figure 18 and
Table 7.
The prediction accuracy improved to a certain extent, and the R2 values were basically above 0.8. Given the low correlation between the input and output variables revealed by the characteristic analysis, these prediction results were satisfactory for industrial hydrocracking prediction. PSO can effectively avoid the overfitting or under-training problems that arise when parameters are adjusted manually, thereby improving the prediction accuracy and generalization ability of the model.