1. Introduction
Groundwater, a vital freshwater resource on Earth, plays a critical role in supporting human life and economic development. The flow and solute transport processes in groundwater are significantly influenced by complex and highly nonlinear changes in water density [1]. To address the challenges posed by the complexity and variability of groundwater, establishing a reliable and highly accurate long-term groundwater model based on physical information is deemed effective [2,3,4,5]. However, the development of such models necessitates extensive detailed information regarding aquifer characteristics. Even with the requisite data for a physics-based model, challenges persist, such as lengthy calibration times and low computational efficiency.
Hydrologists are increasingly utilizing machine learning methods to address challenges associated with physics-based models [6,7,8,9]. The data-driven modeling approach offers an advantage by eliminating the need to explicitly define the physical relationships and parameters required to describe the physical environment. Machine learning algorithms approximate the relationship between model inputs and outputs through an iterative learning process [10], significantly enhancing the efficiency of model operation. Neural networks (NNs) have proven effective in modeling and predicting nonlinear time series data, such as groundwater levels, and have demonstrated comparable or superior performance to physics-based models in certain cases [11]. This study introduces a novel coupled neural network model as a surrogate for existing unsteady groundwater flow numerical models, aiming to enhance model efficiency for nonlinear water head time series while ensuring prediction accuracy.
Convolutional neural networks (CNNs) are essential for image recognition and natural language processing due to their powerful feature extraction capabilities [12]. Accordingly, 3D convolutional neural networks can be used to predict the head field at a specific time point [13]. Wunsch et al. [14] applied a CNN-based model to predict the highly variable discharge behavior of the heterogeneous Gottesacker karst system in the Northern Alps. Elmorsy et al. [15] indicated that increasing dataset size and diversity, utilizing multi-scale feature aggregation, and optimizing architectures can surpass the accuracy of existing state-of-the-art CNN permeability prediction models. However, in studies of groundwater flow, the flow field must be predicted at each time point over long periods, and a single CNN model cannot effectively predict long-term sequences. To address the limitations of CNNs, which prioritize local feature detection and struggle with long-term temporal dependencies [16], recurrent neural networks (RNNs) are often combined with CNNs to leverage their superior memory capabilities. In previous studies, Lei et al. [17] accurately predicted streambed water flux (SWF) by integrating CNNs with a Bayesian data-worth analysis (DWA) framework. Similarly, Tian et al. [18] effectively evaluated groundwater recharge in the North China Plain using attention-based gated recurrent units (Attention-GRU) and CNNs. In addition, Ali et al. [19] combined CNN layers with bidirectional long short-term memory (Bi-LSTM) models for long-term groundwater level prediction at each time point. Recently, Zhang et al. [20] showed that the convolutional neural network gated recurrent unit (CNN-GRU) model can extract hidden features of the coupling relationship between groundwater depth and time series, allowing further prediction of groundwater depth across different time series.
This study proposes a new architecture to address the computational and representational limitations of CNNs. The method combines a depthwise separable convolutional neural network (DSCNN) with a GRU framework to construct an efficient surrogate model specifically for predicting groundwater flow. The hybrid model leverages the CNN's feature extraction capabilities and uses GRUs to capture long-term temporal dependencies in time series data. DSCNNs are recognized for their lower computational load and reduced parameter requirements, providing a computationally efficient surrogate without sacrificing representational capability [21]. Additionally, this study conducted experiments on the groundwater conditions in the Penola region of southeastern Australia. Using only a CNN model to predict groundwater head fields resulted in lower accuracy than a CNN combined with long short-term memory networks (LSTMs), and replacing the CNN and LSTM with a DSCNN and GRU achieved better results still. Compared to a single CNN model, the DSCNN-GRU model is more efficient in predicting groundwater head fields over continuous time series in this region.
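The parameter saving behind this choice can be illustrated with a simple count (a minimal sketch; the layer sizes below are hypothetical and bias terms are ignored). A depthwise separable convolution replaces the full kernel with a per-channel spatial filter followed by a 1 × 1 pointwise convolution:

```python
# Weight counts for a single 3x3 convolution with 64 input and 128 output
# channels (illustrative sizes only; bias terms ignored).
k, c_in, c_out = 3, 64, 128

standard  = k * k * c_in * c_out         # full convolution: 73,728 weights
separable = k * k * c_in + c_in * c_out  # depthwise + pointwise: 8,768 weights

print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"reduction: {1 - separable / standard:.1%}")  # about 88% fewer weights
```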
The hybrid DSCNN-GRU model is designed to offer a balanced solution, leveraging the strengths of CNNs and RNNs in feature representation and temporal sequence learning, while addressing the associated high computational demands, providing an efficient approach for groundwater flow prediction. The subsequent sections of this study are organized as follows:
Section 2 presents an overview of the hydrogeological conditions in the study area and details the groundwater model layout for three cases. Section 3 briefly introduces the standard CNN and LSTM models, followed by a description of the proposed DSCNN-GRU model. Section 4 outlines the predictive results of the various surrogate models and compares their performance through numerical evaluations. Finally, the main conclusions are summarized in Section 5.
3. Results
The mean squared error (MSE) is employed as the loss function to evaluate the performance of surrogate models. A lower MSE value indicates reduced loss during the training process, signifying a heightened understanding of the relevant parameters and characteristics of the training samples generated using the physics-based model. Furthermore, MSE is used to evaluate the similarity between the surrogate model’s predicted outputs and the original outputs from the forward model.
However, MSE simply measures the pointwise gap between outputs at each grid cell, making it highly sensitive to outliers and blind to structural information. The calculation formula for MSE is given as follows:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - y_{ij} \right)^2$$

In the above formula, $x$ represents the predicted water head field data, $y$ represents the input water head field data, and $m$ and $n$ represent the grid dimensions of the predicted and input water head fields, respectively.
To address this limitation, the coefficient of determination (R²) is introduced for a more comprehensive evaluation of prediction accuracy. A higher R² value, closer to 1, indicates a stronger correlation between the outputs from the surrogate and the forward model, reflecting superior prediction accuracy. The calculation formula for R² is given as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( \hat{y}_i - \bar{\hat{y}} \right)^2}$$

In this equation, $\hat{y}_i$ is the predicted value, $\bar{\hat{y}}$ is the average of the predicted values, and $y_i$ is the observed value.
Since the CNN surrogate experiments span Cases 1, 2, and 3, whose datasets differ in size, R² is particularly useful here: it is less affected by data size than MSE, making it more appropriate for cross-case comparisons. To further evaluate prediction accuracy, the structural similarity index (SSIM) is also introduced, which comprehensively evaluates the overall similarity of two images based on brightness, contrast, and structure. The formula for SSIM is as follows:

$$\mathrm{SSIM}(x, y) = \frac{\left( 2\mu_x \mu_y + C_1 \right)\left( 2\sigma_{xy} + C_2 \right)}{\left( \mu_x^2 + \mu_y^2 + C_1 \right)\left( \sigma_x^2 + \sigma_y^2 + C_2 \right)}$$

In this formula, $\mu_x$ is the average value of $x$, $\mu_y$ is the average value of $y$, $\sigma_x^2$ is the variance of $x$, $\sigma_y^2$ is the variance of $y$, $\sigma_{xy}$ is the covariance between $x$ and $y$, and $C_1$ and $C_2$ are constants used to maintain stability.
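A minimal NumPy sketch of the three metrics as defined above follows. Note two hedges: the R² normalization mirrors the definition stated in the text, which uses the mean of the predicted values (standard implementations use the mean of the observations), and the SSIM uses a single global window with placeholder constants, whereas practical implementations average over local sliding windows.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between predicted (x) and input (y) head fields."""
    return np.mean((x - y) ** 2)

def r2(y_pred, y_obs):
    """Coefficient of determination as defined in the text; the denominator
    uses the mean of the *predicted* values, per the paper's definition."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_pred - y_pred.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM; c1 and c2 are placeholder stability constants."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))
```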
In addition, considering that the generation time for each set of predicted images (six images) by a trained model is less than 5.625 ms, which is negligible, the training time is used to evaluate model efficiency.
3.1. CNN Surrogate Model Results
To evaluate the effectiveness of the CNN surrogate model, experiments are conducted on three initial models: Case 1, Case 2, and Case 3. The initial model designs vary from Case 1 to Case 3 in the complexity of their hydraulic conductivity fields, which increases from Case 1 to Case 3 as described in Section 2.1.2. Case 1 involves 10 randomly generated values for each hydraulic conductivity volume within a field. Case 2 divides the hydraulic conductivity field into 20 blocks, each with randomly generated values in the range of 10 to 100 m/month. In Case 3, Kriging interpolation is employed to obtain the hydraulic conductivity field from 40 sampling points, whose values are determined from available materials of the study area. Case 3 also incorporates pumping rate data from two pumping wells that are not considered in Cases 1 and 2. Consequently, the overall complexity of the initial model increases progressively from Case 1 to Case 3, accompanied by a sequential rise in the volume of data the surrogate model must process.
The training and testing sample sets are modified for each case, while the network architecture and parameters remain constant.
Figure 10 illustrates that the MSE value consistently decreases as the training period extends. Generally, the increasing model complexity from Case 1 to Case 3 leads to a gradual rise in MSE during training. However, an intriguing exception is observed at the 30th iteration: the MSE value of model c_3 is lower than that of model c_2. This may be attributed to the more complex Case 3 scenario, where the surrogate model processes a larger number of input features and nonlinear relationships, leading it to focus on global data patterns rather than overfitting to local noise and thus achieving better prediction performance.
To capture the temporal characteristics of the head data, an additional fully connected layer is incorporated into the network structure. In the test sample set, as depicted in Figure 11, the MSE values of the surrogate model's predictions for Cases 1, 2, and 3 increase sequentially while the R² values decrease. This indicates that as the complexity of the initial model rises, the prediction accuracy of the same surrogate model declines, accompanied by an increase in training time. To further validate this finding, the SSIM is introduced as an additional metric to assess the structural similarity between the predicted and original image data in the test set, as shown in Figure 11b. Notably, the SSIM also decreases, from 0.969 to 0.917, as the complexity of the initial model increases. Despite this gradual decrease in accuracy from Case 1 to Case 3, the CNN surrogate model still performs well in Case 3, with an average R² of 0.902 and an SSIM of 0.917 when comparing predicted and original head field data.
3.2. Coupled Surrogate Model Results
To enhance the accuracy and effectiveness of the surrogate model, this study evaluates three coupled models: CNN-LSTM, DSCNN-LSTM, and DSCNN-GRU. All three models employ the same coupling scheme as the DSCNN-GRU, described in Section 2.2.4. The optimal coupled model is identified by assessing the relevant parameters. Given that the training and test sample sets for all three coupled models are identical to those of c_3, identical evaluation criteria can be used for performance comparison.
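For concreteness, the Keras sketch below shows one plausible realization of this coupling, assuming illustrative dimensions and layer widths rather than the exact configuration of Section 2.2.4: depthwise separable convolutions extract spatial features independently at each time step via TimeDistributed wrappers, and a GRU then propagates information across time steps.

```python
from tensorflow.keras import layers, models

# Illustrative dimensions (hypothetical, not the study's actual grid):
# six time steps of a 32 x 32 single-channel head field.
T, H, W, C = 6, 32, 32, 1

model = models.Sequential([
    layers.Input(shape=(T, H, W, C)),
    # Depthwise separable convolutions extract spatial features per time step.
    layers.TimeDistributed(layers.SeparableConv2D(32, 3, padding="same", activation="tanh")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.SeparableConv2D(64, 3, padding="same", activation="tanh")),
    layers.TimeDistributed(layers.MaxPooling2D(2)),
    layers.TimeDistributed(layers.Flatten()),
    # The GRU captures temporal dependencies across the six time steps.
    layers.GRU(128, return_sequences=True),
    # Map each time step's hidden state back to a full head field.
    layers.TimeDistributed(layers.Dense(H * W)),
    layers.Reshape((T, H, W, C)),
])
model.compile(optimizer="adam", loss="mse")
```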
As depicted in Figure 12 and Figure 13, the CNN-LSTM model exhibits a significant improvement in accuracy compared to the traditional CNN model: the MSE value decreases from 0.0018 to 0.0012 during training and from 0.009 to 0.0034 during testing. However, this improvement comes at the expense of increased computational load, with the training time rising 4.9%, from 3348 s to 3514 s. Furthermore, the MSE value of the CNN-LSTM model fluctuates significantly throughout the training cycle, with an absolute difference of up to 0.0008 between consecutive cycles. These fluctuations can primarily be attributed to the nonlinear characteristics of the transient groundwater flow time series.
After integrating DSCNN-LSTM, the training time decreased to 2892 s. However, the limited sample set of only 540 samples impedes the LSTM's capability to effectively leverage its memory for acquiring long-term sequence information. Consequently, the MSE value increased to 0.0042 during training and rose to 0.059 during testing.
The integration of DSCNN and GRU reduced the training time to 2858 s, which is 490 s lower than that of the CNN, 656 s lower than the CNN-LSTM, and 34 s lower than the DSCNN-LSTM, a significant improvement in computational efficiency. The MSE value of the proposed DSCNN-GRU model remained consistently lower than that of the traditional CNN model throughout the entire training period. Additionally, the MSE values of the DSCNN-GRU model showed smaller fluctuations than those of the CNN-LSTM model during the initial 100 to 200 cycles.
The proposed DSCNN-GRU model has a slightly higher test MSE (0.0096) than the CNN-LSTM model (0.0034), but its training time is 18.6% shorter. The R² value of the DSCNN-GRU model is 0.949, which is 0.84% lower than the 0.957 achieved by the CNN-LSTM model. Additionally, the SSIM values for the DSCNN-GRU and CNN-LSTM are 0.968 and 0.971, respectively, indicating that the average prediction error of both models is within 0.003 m, a marked improvement over the CNN model's SSIM of 0.923. Although the proposed DSCNN-GRU model shows a slight reduction in prediction accuracy compared to the CNN-LSTM, it significantly enhances computational efficiency. Considering the trade-off between accuracy and efficiency, the proposed DSCNN-GRU model outperforms the other three models in overall performance.
3.3. Further Optimization of the DSCNN-GRU Surrogate Model
The previous analysis compared the various coupled models, demonstrating that the DSCNN-GRU not only maintained the prediction accuracy of the surrogate model but also significantly improved its efficiency. To further improve the proposed DSCNN-GRU model, the impact of network parameters was examined, including variations in the number of convolutional layers, the optimizer, and the activation function. Initially, the experiment assessed three, four, and five convolutional layers to identify the optimal configuration. Subsequently, Adam, Nadam, and RMSprop were employed as optimizers to determine the most effective option. Finally, ReLU, Softmax, and Tanh were evaluated as activation functions to ascertain the best choice.
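This sweep can be organized as a one-factor-at-a-time loop, sketched below. The `build_model` factory is a hypothetical stand-in with illustrative sizes, not the study's exact network constructor.

```python
from tensorflow.keras import layers, models

def build_model(n_conv, optimizer, activation, T=6, H=32, W=32):
    """Hypothetical factory for DSCNN-GRU variants (illustrative sizes only)."""
    m = models.Sequential([layers.Input(shape=(T, H, W, 1))])
    for i in range(n_conv):
        # Depthwise separable convolution applied to each time step.
        m.add(layers.TimeDistributed(
            layers.SeparableConv2D(16 * (i + 1), 3, padding="same",
                                   activation=activation)))
    m.add(layers.TimeDistributed(layers.Flatten()))
    m.add(layers.GRU(64, return_sequences=True))
    m.add(layers.TimeDistributed(layers.Dense(H * W)))
    m.add(layers.Reshape((T, H, W, 1)))
    m.compile(optimizer=optimizer, loss="mse")
    return m

# One-factor-at-a-time sweep mirroring Sections 3.3.1-3.3.3:
layer_variants      = [build_model(n, "adam", "relu") for n in (3, 4, 5)]
optimizer_variants  = [build_model(5, opt, "relu") for opt in ("adam", "nadam", "rmsprop")]
activation_variants = [build_model(5, "adam", act) for act in ("relu", "softmax", "tanh")]
# Each variant would then be fit on the training samples and scored with
# MSE, R^2, and SSIM on the held-out test set.
```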
The Nadam optimizer combines the principles of the Adam optimizer with Nesterov momentum. Its primary objective is to accelerate convergence and enhance training stability, making it particularly effective for handling large datasets with high similarity. In contrast, RMSprop is an adaptive learning rate method that facilitates timely adjustments to the learning rate, thereby reducing potential data loss caused by inappropriate learning rates, especially when the complexity of the water head field data in the training samples varies.
Among the various activation functions, the Softmax function is frequently employed for multi-class classification problems. However, it is important to note that the exponential calculations inherent in the Softmax function can cause the output value to approach 1 when the input value is large and to approach 0 when the input value is small. When there is a substantial disparity between the input values, this can lead to saturation effects and potential vanishing gradient issues. Given the highly complex head field data present in the training samples, the differences between individual samples may be significant, resulting in increased training time and reduced prediction accuracy for models utilizing this function. Conversely, the Tanh (hyperbolic tangent) function is an S-shaped function that maps real number inputs to a range between −1 and 1. It exhibits symmetry in handling both positive and negative inputs; however, the data in the training samples of this study are highly nonlinear and do not demonstrate symmetry across a wide range. Consequently, the prediction accuracy of models employing this function may not be optimal.
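These saturation and symmetry properties are easy to verify numerically (illustrative values only, unrelated to the study's actual head data):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# A large disparity between inputs drives softmax outputs toward 1 and 0,
# leaving near-zero gradients for the smaller entries (saturation).
z = np.array([12.0, 1.0, 0.5])
print(softmax(z))    # approx [1.0, 1.7e-05, 1.0e-05]

# Tanh maps inputs symmetrically into (-1, 1); strongly asymmetric,
# highly nonlinear head data gain little from this symmetry.
x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.tanh(x))    # approx [-0.995, -0.462, 0.0, 0.462, 0.995]
```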
3.3.1. Optimization of Convolution Layer
To determine the optimal number of convolutional layers for the proposed DSCNN-GRU model, this study experimented with three, four, and five layers. Because excessive pooling layers proved unnecessary as the number of convolutional layers changed, the number of paired convolutional and pooling units (as shown in Figure 6) was correspondingly reduced to three or four when only three or four convolutional layers were used.
Table 1 demonstrates minimal changes in MSE values across the three models during extended training, likely due to the slight differences in the number of convolutional layers. However, reducing the number of convolutional layers increased the training time, from 2858 s with five layers to 3064 s with three layers. Despite constant regularization parameters, batch size, and other model parameter settings, several possible reasons could explain this abnormal behavior: (1) insufficient parallel optimization of the GPU, possibly due to an inappropriate model scale, limiting GPU utilization and slowing training; (2) the reduction in the number of layers may have decreased the caching efficiency of data and model parameters in memory, lengthening training; (3) with fewer convolutional layers, the feature extraction capability is diminished, so the model needs more time to learn the same feature representation.
The test-set MSE likewise rose as the number of layers decreased: 0.0062 for the five-layer model versus 0.0096 for the three-layer model, indicating a decline in prediction accuracy. The R² value exhibited a similar trend, decreasing from 0.954 for the five-layer model to 0.948 for the three-layer model. These findings suggest that increasing the number of convolutional layers enhances both the computational efficiency and the accuracy of the model. The models in Section 3.3.2 and Section 3.3.3 therefore both use five convolutional layers.
3.3.2. Optimization of Optimizers
Following the decision to use five convolutional layers, this study further investigates the impact of different optimizers on model efficiency and accuracy. The results, presented in Table 1, reveal that with Adam the model's training time is 2858 s, with the lowest test MSE recorded at 0.0062. In contrast, with RMSprop and Nadam the MSE values are 0.043 and 0.068, respectively, an order of magnitude higher than with Adam. This difference in performance may be attributed to RMSprop's high learning rate, which causes the optimization process to oscillate around the optimal solution without converging. It is noteworthy, however, that the training time is 2598 s for RMSprop and 3212 s for Nadam. Overall, Nadam yields a stable MSE value of 0.0031 after the 50th cycle during training, yet the MSE on the test set increases to 0.068 and the R² value reaches only 0.853, inferior to RMSprop. This implies that Nadam, with its accelerated convergence and no appropriate regularization or early stopping strategy, may lead to overfitting: even though the training MSE is low, the model assimilates a significant amount of noise, diminishing prediction accuracy.
The experimental data clearly demonstrate that the choice of optimizer significantly influences the model's training efficiency and prediction accuracy. Although Adam prolongs the training time by 260 s compared to RMSprop, the substantial disparity in prediction accuracy favors Adam for this study, while Nadam's performance in this experiment was unsatisfactory.
3.3.3. Optimization of Activation Functions
After selecting Adam as the optimizer, this study examined the impact of different activation functions on model efficiency and accuracy. Specifically, only the activation function of the convolutional layers was modified, while the other layers remained unchanged. The results in Table 1 show that with Softmax as the activation function, the model's training time is 3548 s, the prediction MSE is 0.057, and the R² value is 0.826. Since Softmax yielded less favorable results than the other two activation functions in both training time and prediction accuracy, it is not discussed further.
On the test set, the MSE with Tanh as the activation function is 0.0052, a 16% decrease compared to the MSE of 0.0062 with ReLU. Additionally, the R² values for Tanh and ReLU are 0.974 and 0.954, respectively. These results indicate that using Tanh as the activation function significantly improves the prediction accuracy of the model in this study. However, the training time for the Tanh model is 2876 s, whereas it is only 2598 s for the ReLU model. Despite ReLU's 278 s reduction in training time, the improvement in prediction accuracy was deemed more important for this study, given that the difference in computational efficiency is not substantial. Therefore, the Tanh activation function is considered more suitable here: it yields higher prediction accuracy than ReLU, and the reduction in training efficiency is acceptable.
3.4. Illustrative Example
As shown in Figure 14, the rows y, y₁, and y₂ each consist of six images showing the head field of the study area at the 10th, 20th, 30th, 40th, 50th, and 60th time steps. Row y is a randomly selected set of head field data obtained from the physical model run of Case 3. Rows y₁ and y₂ are the head fields predicted by the CNN and DSCNN-GRU models, respectively, corresponding to the data in row y. The rows |y − y₁| and |y − y₂| are contour maps showing the absolute differences between the head fields at the corresponding time steps of rows y, y₁, and y₂.
Comparing the water level differences at the same time step in rows |y − y₁| and |y − y₂|, it is clear that the prediction accuracy of the CNN is lower than that of the DSCNN-GRU. Nevertheless, the head fields predicted by both models generally agree with the physical model's head distribution across the entire study area. In addition, the two models exhibit noticeable prediction errors in the southeast and northeast regions of the study area, respectively. This may be because the water level changes in the eastern part of the area during the sampling period were larger than those in the western part, making the time-varying head field data in the east more complex and reducing the model's learning accuracy.
Comparing the DSCNN-GRU model's predictions across time steps shows that the model achieves high accuracy over continuous time series. However, the prediction error of the head field is not uniform across time steps, an inconsistency that may stem from the input samples varying unevenly over time.
4. Discussion
This study presents an innovative DSCNN-GRU surrogate modeling framework for simulating transient groundwater head fields through continuous time series prediction. A comprehensive evaluation of surrogate modeling approaches, including CNN-LSTM, DSCNN-LSTM, and the proposed DSCNN-GRU framework, reveals that the proposed DSCNN-GRU model achieves an optimal balance between computational efficiency and predictive accuracy.
The comparative analysis demonstrates that while the CNN-LSTM architecture attains comparable predictive accuracy, the DSCNN-GRU surrogate achieves a significant reduction in training time. This enhanced efficiency corroborates the theoretical advantages of GRU architectures over conventional LSTM networks, as originally theorized by Chung et al. (2014) [16], particularly regarding GRU's simplified gating mechanism, which effectively mitigates the characteristic "memory saturation" phenomenon observed in LSTM networks. Furthermore, the DSCNN-GRU exhibits superior training stability compared to CNN-LSTM, with reduced error fluctuations that align with previous findings regarding LSTM instability in large-scale hydrological modeling applications. This study extends the application of GRU networks in groundwater modeling beyond the pure time series prediction demonstrated by Gharehbaghi et al. [26] through a novel integration with DSCNN, which unites spatial feature extraction with temporal sequence modeling. This hybrid architecture provides a robust solution for simultaneous spatiotemporal analysis of transient groundwater flow dynamics, achieving enhanced predictive capability while maintaining computational efficiency comparable to conventional approaches.
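For reference, the GRU collapses the LSTM's three gates and separate cell state into two gates acting on a single hidden state (standard formulation following Chung et al. [16]; bias terms omitted):

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right), \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right), \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $z_t$ and $r_t$ are the update and reset gates and $\odot$ denotes element-wise multiplication. Because the update gate directly interpolates between the previous state $h_{t-1}$ and the candidate $\tilde{h}_t$, the GRU needs only three recurrent weight blocks to the LSTM's four and avoids maintaining a separate cell state, which underlies the training-time savings reported above.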
Systematic parameter optimization analysis demonstrates that the DSCNN-GRU model exhibits pronounced sensitivity to three critical hyperparameters: (1) the number of convolutional layers, (2) optimizer selection, and (3) activation function choice. Increasing the convolutional layers from three to five significantly enhances prediction accuracy, corroborating Ali et al.'s [19] finding that deeper networks better capture complex relationships in groundwater systems. Notably, contrary to conventional patterns, the five-layer architecture demonstrates higher computational efficiency than the three-layer version, likely due to optimized memory caching that reduces data fragmentation and an improved feature extraction hierarchy that accelerates convergence. In optimizer comparisons, Adam exhibits superior performance despite marginally longer training times than RMSprop. The suboptimal performance of Nadam contrasts with some conclusions of Kannan [27], suggesting that its accelerated convergence may adversely affect groundwater modeling applications requiring smooth loss landscapes and highlighting the importance of problem-specific optimizer selection. The activation function evaluation reveals that Tanh provides superior performance for transient hydraulic head prediction, particularly in capturing nonlinear aquifer responses, while Softmax proves fundamentally unsuitable for this surrogate modeling task. These results quantitatively confirm Chen et al.'s [28] hypothesis about activation function specialization, demonstrating that hydrological data characteristics should drive function selection rather than default choices from other domains.
While the simplified 2D heterogeneous groundwater flow model validates the feasibility of the DSCNN-GRU approach, several limitations merit discussion. First, the current implementation simplifies the complex hydrogeological characteristics of karst limestone aquifers (e.g., the Gambier Limestone aquifer system) by representing hydraulic conductivity through range parameters rather than explicitly modeling dominant flow pathways through karst conduits and fracture networks. Future integration of discrete fracture network (DFN) modeling with conditional geostatistical simulation techniques could better capture the hierarchical organization of karst systems and preferential flow dynamics. Second, the current fixed optimization strategy for the network architecture may limit performance in complex hydrogeological settings. Subsequent studies could systematically optimize critical parameter combinations in hybrid neural architectures using advanced optimization algorithms while incorporating attention mechanisms to capture the multiscale flow characteristics unique to karst systems. Third, and equally important, future development of surrogate models should explicitly account for conceptual model uncertainty, particularly the impact of geological conceptualizations on flow dynamics [2]. This would involve generating multiple realizations of aquifer heterogeneity to train and test surrogate models under different geological scenarios, thereby enhancing their robustness and applicability in challenging hydrogeological environments.