1. Introduction
Seismic wavefield simulation is essential for high-precision imaging of the Earth’s subsurface [1]. Conventional seismic wavefield modeling is based on the numerical solution of the wave equation, which involves computing the time and space derivatives with numerical methods. Researchers have developed wave-equation-based imaging methods such as reverse-time migration [2,3] and full-waveform inversion [4,5,6]. The finite-difference (FD) scheme is the most basic and most widely used method for solving the wave equation [7,8,9]. However, the FD approximations of the time and space derivatives lead to numerical dispersion when the sampling interval is large, even if it satisfies the stability condition [10,11]. Accurate wavefield simulation therefore requires minimizing the dispersion errors in both space and time.
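To make the dispersion issue concrete, the following minimal sketch (our illustration, not the modeling program discussed later) advances the 1-D constant-velocity wave equation u_tt = c²·u_xx with second-order FD in both time and space. The explicit scheme is stable only when the Courant number c·dt/dx is at most 1, yet even a stable but coarse time step distorts the pulse shape, which is the numerical dispersion discussed above.

```python
import numpy as np

def fd_step_1d(u_prev, u_curr, c, dx, dt):
    """One explicit second-order finite-difference time step for
    u_tt = c^2 u_xx (fixed ends; boundary treatment omitted for brevity)."""
    lap = np.zeros_like(u_curr)
    lap[1:-1] = (u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]) / dx**2
    return 2.0 * u_curr - u_prev + (c * dt)**2 * lap
```

With c·dt/dx > 1 the update grows without bound; with c·dt/dx ≤ 1 it stays bounded, but the FD approximations still disperse the wavelet.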
Researchers have proposed methods to improve the spatial accuracy of finite differences, including lengthening the spatial difference operator and optimizing the spatial FD coefficients [12,13,14]. However, using long difference operators to approximate spatial partial derivatives requires more computation, and because of the saturation effect [15], the gain in simulation accuracy diminishes as the operator length increases. The FD coefficients of conventional FD methods are based on the Taylor series expansion, which can only ensure simulation accuracy in the low-wavenumber range. Optimized FD methods instead fit the dispersion relation to obtain the FD coefficients, using approaches such as least-squares (LS) minimization [16,17,18] and simulated annealing [12], which can significantly improve simulation accuracy in the medium-to-high wavenumber range. The pseudo-spectral (PS) methods [19,20] obtain spatial partial derivatives using forward and inverse Fourier transforms. PS methods can theoretically be accurate up to the Nyquist wavenumber but require more computational time than FD methods.
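As a concrete illustration of the PS approach (a sketch of the general technique, not the compiled program used later in this work), the second spatial derivative of periodic, band-limited data can be computed by multiplying the Fourier spectrum by −k² and transforming back:

```python
import numpy as np

def ps_second_derivative(u, dx):
    """Pseudo-spectral second spatial derivative: FFT, multiply by -k^2,
    inverse FFT. Accurate up to the Nyquist wavenumber for periodic input."""
    k = 2.0 * np.pi * np.fft.fftfreq(len(u), d=dx)
    return np.fft.ifft(-(k ** 2) * np.fft.fft(u)).real
```

For u = sin(x) this returns −sin(x) to machine precision, whereas a short FD stencil at the same grid spacing leaves an error of order dx².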
The time derivatives in wave equations are commonly approximated with a second-order FD scheme, but such low-order schemes can produce significant time dispersion errors. Researchers have proposed the Lax–Wendroff methods [21,22] to improve time accuracy by replacing high-order time derivatives with spatial derivatives. To increase the upper limit of the stable time step, optimized Lax–Wendroff methods were developed [13,23,24]. In addition, new FD stencil methods improve temporal accuracy by using time–space domain schemes with new stencils. Liu and Sen (2013) [25] presented a centered-grid FD method with a rhombus-shaped stencil, which can achieve arbitrary even-order accuracy in the time and space domains but is computationally demanding. To balance accuracy and efficiency, Wang (2016) [18] proposed an FD method based on a combination of cross and rhombus stencils. Tang and Huang (2014) [26] developed staggered-grid FD methods of fourth- and sixth-order accuracy in time. Ren (2017) [27] presented temporal high-order staggered-grid FD (SFD) schemes, and Ren (2022) [28] applied the staggered-grid FD method to source wavefield reconstruction.
Alternatively, applying a post-propagation filter is an efficient approach to reducing time dispersion. Stork (2013) [29] first proposed the time dispersion correction filter, demonstrated that time dispersion is independent of the propagation path, the medium, and spatial dispersion errors, and successfully filtered time dispersion errors out of seismograms. Wang and Xu (2015) [30] introduced a time dispersion correction method based on the analytically predicted time dispersion error, which improves the effectiveness of the time dispersion filter. The time dispersion filter applies the forward time-dispersion transform (FTDT) to predict the time dispersion error and the inverse time-dispersion transform (ITDT) to remove it. However, the FTDT is an adjoint rather than an inverse of the ITDT, so the ITDT with these filters cannot fully eliminate the time dispersion added by the FTDT. Li (2016) [31] suggested a time-varying filter to minimize time dispersion in shot recordings. Koene (2018) [32] derived the precise inverse of the ITDT and employed it as a new FTDT, so that the ITDT recovers (within proper band restrictions) the input signal; this ITDT can successfully reduce the temporal dispersion of wavefields modeled with FD, PS, and spectral element methods, enhancing temporal accuracy more effectively.
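The idea behind such transforms can be sketched as a frequency remapping. For second-order time stepping, a recorded component at angular frequency w corresponds approximately to a true frequency of (2/dt)·sin(w·dt/2). The sketch below is our illustration of this remapping only; the published filters [29,30,32] are constructed more carefully (including exact inverse pairs and band restrictions).

```python
import numpy as np

def remap_time_dispersion(trace, dt, mode="remove"):
    """Remap each Fourier component of a trace using the second-order
    FD time-dispersion relation w_true ~ (2/dt)*sin(w*dt/2) (assumed here).
    mode="remove" mimics an ITDT-style correction (sin mapping);
    mode="add" mimics an FTDT-style prediction (arcsin mapping)."""
    n = len(trace)
    D = np.fft.rfft(trace)
    w = 2.0 * np.pi * np.fft.rfftfreq(n, d=dt)
    if mode == "remove":
        w_new = (2.0 / dt) * np.sin(w * dt / 2.0)
    else:
        w_new = (2.0 / dt) * np.arcsin(np.clip(w * dt / 2.0, 0.0, 1.0))
    t = np.arange(n) * dt
    # direct inverse transform with the remapped (non-uniform) frequencies
    weights = np.full(len(D), 2.0)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[-1] = 1.0  # Nyquist bin is not doubled
    return (weights * D * np.exp(1j * np.outer(t, w_new))).real.sum(axis=1) / n
```

For a low-frequency trace (w·dt small) the mapping is close to the identity, consistent with time dispersion being negligible at fine sampling.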
These methods propagate seismic data through synthetic models, and the forward modeling methods can determine the dispersion relationship quantitatively. In practical production, however, highly integrated software is typically used, and its algorithms are a black box to users (i.e., the specific forward modeling method cannot be obtained from the compiled software). This makes it impossible to quantify the dispersion relationship of the forward modeling results, and therefore difficult to accurately eliminate the time dispersion of those results. Recently, machine learning and deep learning algorithms have been applied with some success to seismic impedance inversion [33,34], seismic modeling [35], and seismic data interpretation [36]. In this paper, we propose to use machine learning methods to overcome this problem.
The framework of our proposed neural network includes two main modules, the Inverse Model and the Forward Model, which transform between large time-step data (with time dispersion) and small time-step data (without time dispersion). We propose a semi-supervised machine learning strategy to eliminate time dispersion. Specifically, we use deep learning to learn the mapping patterns in temporally dispersed samples. The Inverse Model eliminates time dispersion, while the Forward Model regularizes the training. Although a deep-learning network is a “black box” that is difficult to interpret quantitatively, in this study we can generate as much training data as the network needs. The network is therefore able to perform as designed, and the proposed deep-learning method is considered feasible.
The rest of this paper is organized as follows.
Section 2 details our framework, which uses semi-supervised learning to train the proposed model, and then presents a training method based on transfer learning from the pre-trained model.
Section 3 provides and analyzes the experimental findings of our proposed network on the Marmousi and SEAM models.
Section 4 presents the discussion.
Section 5 summarizes the entire paper.
2. Methods
2.1. Theory
In this work, we use a modeling program based on the pseudo-spectral (PS) method as the black box, i.e., the compiled software, to generate the data sets with large and small time steps. Since the PS method is used, the simulated seismic data are free of spatial dispersion but contain temporal dispersion. A small amount of training data is first generated by running the modeling program with large and small time steps, which mimic the time-dispersed data and the time-dispersion-free data, respectively. The data with small time steps are used as the labels of the data with large time steps.
In this paper, we propose a semi-supervised framework for eliminating time dispersion based on Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs). A semi-supervised network can be trained not only on labeled data but also on unlabeled data; the unlabeled data act as a constraint that helps train the generative network and produces more accurate results.
CNNs are used to extract feature information. Dilated convolution is used to expand the receptive field, and the data are input point by point. In this work, the data size is large, and multiple dilated convolutions with different dilation rates are superimposed; the resulting different receptive fields provide multi-scale contextual information and also help to reduce the computational effort. Recurrent Neural Networks (RNNs) are used to extract temporal information, since time-dispersed data are a type of temporal data. A long short-term memory (LSTM) cell, a common RNN variant, typically has four inputs and one output; the inputs include three gate functions (the Input Gate, Forget Gate, and Output Gate) that train the network’s weights using the contextual information of the data to learn the mapping relationship between the inputs and the outputs. The GRU is an adaptation of this architecture that, to simplify computation, merges the Input Gate and Forget Gate into a single Update Gate and blends the cell state with the hidden state.
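The gating just described can be sketched in NumPy as follows (one common GRU convention; the weight matrices Wz, Uz, Wr, Ur, Wh, Uh are hypothetical, and biases are omitted):

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: the update gate z plays the combined role of the
    LSTM's Input and Forget gates, blending old and candidate hidden states."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate hidden state
    return (1.0 - z) * h + z * h_cand         # blended new hidden state
```

Stacking this step over the samples of a trace is how the Sequence Modeling submodule described below accumulates temporal context.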
2.2. Network Structures
We choose two main modules for the transformation between large time-step data and small time-step data: the Inverse Model and the Forward Model, both of which have learnable parameters. The Inverse Model is used to eliminate time dispersion, and the Forward Model is used to regularize the training. The proposed workflow is shown in Figure 1, which includes labeled and unlabeled data sets. The labeled data set pairs large time-step data with their corresponding small time-step data, and the unlabeled data set consists of large time-step data containing the time dispersion to be eliminated.
The network takes the input data (both labeled and unlabeled large time-step data) and feeds them to the proposed Inverse Model, which transforms the data from the source domain with time dispersion to the target domain without time dispersion; the loss function between the label of the input and the output of the Inverse Model is minimized. Next, the generated time-dispersion-free data of the unlabeled data set are fed into the proposed Forward Model, which converts them back to time-dispersed data, and the loss function between the original input and the output of the Forward Model is minimized. Furthermore, the network takes the other input data (labeled small time-step data) and feeds them to the proposed Forward Model, which transforms the data from the source domain without time dispersion to the target domain with time dispersion, and the loss function between the label of the input and the output of the Forward Model is minimized.
The architectures of the Inverse Model and the Forward Model are similar to that used by Motaz Alfarraj [37]; the differences are the adjustment of local parameters and the removal of the upsampling submodule.
Figure 2 shows the Inverse Model in the proposed workflow. The Inverse Model consists of three main submodules, denoted Sequence Modeling, Local Pattern Analysis, and Regression, each with a different role in the overall model.
Figure 3 shows the Forward Model architecture, which is similar to that of the Inverse Model in the proposed workflow. We note that the time-dispersion-free data require a more complex mapping than the time-dispersed data, so the Forward Model differs in two respects: its GRU output size is doubled, and it contains two fully-connected-layer modules. The fully-connected-layer modules mainly compensate for the resolution mismatch between the input data and the output data.
The Sequence Modeling submodule comprises multiple Gated Recurrent Units (GRUs) [38]. The input traces are modeled as sequential data trace by trace, and temporal features are computed from the temporal variations of these data, so the GRU can continually update its memory of past and future moments and adjust its weights during training. Considering that more hidden-layer neurons capture temporal features significantly better, one GRU module is set in the Sequence Modeling submodule, with the number of hidden-layer neurons set to the input data size. A fully connected layer is then used as a transition layer, which compensates for the dimensional mismatch between the outputs and the inputs.
The Local Pattern Analysis submodule comprises three parallel 1D CNN modules, each with the same convolutional kernel size but a different dilation coefficient. The three outputs of these parallel CNN modules are concatenated into one output, which is then fed into another 1D CNN module with three layers and decreasing convolutional kernel sizes. Dilation is defined as the interval between convolution kernel points in the convolution layers [39]. The superposition of multiple dilated convolutions with different dilation rates expands the receptive field, and the different receptive fields provide multi-scale temporal feature information. There is also a group normalization layer and an activation function layer between adjacent CNN layers. The CNNs are used mainly to capture high-frequency trends in the data, whereas the output of the deep GRUs mainly represents low-frequency trends. To capture the full frequency band, the outputs of the CNNs and GRUs are combined and fed into the Regression submodule.
The final submodule is Regression, which maps the features extracted by the other two submodules to the target data. Regression comprises a one-layer GRU followed by a linear layer. In addition, the GRU input layer and the output layer of this submodule both use the transpose function. The output size is set equal to the input size, i.e., one trace in and one trace out.
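The three submodules can be sketched in PyTorch as follows. All layer sizes, kernel sizes, and dilation rates here are illustrative assumptions for a single-trace input, not the exact values of our network:

```python
import torch
import torch.nn as nn

class InverseModel(nn.Module):
    """Sketch of the three-submodule design: Sequence Modeling (GRU),
    Local Pattern Analysis (parallel dilated CNNs), and Regression."""
    def __init__(self, hidden=8):
        super().__init__()
        # Sequence Modeling: bidirectional GRU over the trace samples
        self.gru = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.transition = nn.Linear(2 * hidden, hidden)  # transition layer
        # Local Pattern Analysis: three parallel dilated 1-D convolutions
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, hidden, 5, padding=2 * d, dilation=d),
                          nn.GroupNorm(1, hidden), nn.ReLU())
            for d in (1, 2, 4)])
        self.merge = nn.Conv1d(3 * hidden, hidden, 3, padding=1)
        # Regression: one-layer GRU followed by a linear layer
        self.reg = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, nt) single traces
        x = x.unsqueeze(-1)               # (batch, nt, 1)
        g, _ = self.gru(x)                # low-frequency trend features
        g = self.transition(g)            # (batch, nt, hidden)
        c = x.transpose(1, 2)             # (batch, 1, nt) for Conv1d
        c = torch.cat([b(c) for b in self.branches], dim=1)
        c = self.merge(c).transpose(1, 2)        # (batch, nt, hidden)
        f = torch.cat([g, c], dim=-1)            # combine both feature sets
        r, _ = self.reg(f)
        return self.out(r).squeeze(-1)           # (batch, nt), trace in/out
```

The padding of each dilated branch is chosen so that the output length matches the input length, keeping the one-trace-in, one-trace-out property described above.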
Finally, we propose a transfer learning training method to extend the trained model to another model, which helps the network better adapt to new seismic data. The GRU is designed to capture the sequential relationship of the input data when extracting features from the training set, and during training it continually adjusts the weights of the Local Pattern Analysis submodule. Thus, for new seismic data, we initialize the Local Pattern Analysis submodule with the trained model weights and keep it fixed, retraining only the parts containing the GRUs. This training method requires less training data than starting from scratch.
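This transfer step can be sketched as loading the pre-trained weights and freezing the Local Pattern Analysis parameters; the submodule attribute name used here is a hypothetical placeholder, not our network's actual identifier:

```python
import torch

def prepare_transfer(model, pretrained_state=None,
                     frozen_prefixes=("local_pattern",)):
    """Initialize from a pre-trained state dict and freeze the Local Pattern
    Analysis submodule, leaving only the GRU-containing parts trainable.
    The "local_pattern" prefix is an assumed attribute name."""
    if pretrained_state is not None:
        model.load_state_dict(pretrained_state)
    for name, p in model.named_parameters():
        p.requires_grad = not name.startswith(tuple(frozen_prefixes))
    return [p for p in model.parameters() if p.requires_grad]
```

The returned list can be passed directly to an optimizer so that only the retrained parts are updated.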
2.3. Loss Function
In this work, the input size and output size are both set to the length of a single trace. We denote a labeled single trace of time-dispersed data as x and an unlabeled trace as x_u, which is also the data whose time dispersion is to be eliminated. The time-dispersion-free data corresponding to x are denoted as y. The Inverse Model and the Forward Model are denoted as f_θ with weights θ and g_φ with weights φ, respectively.
As shown in Figure 1, the property loss is the sum of property loss 1 and property loss 2; it is calculated trace by trace between the predicted and target data and drives the mapping from temporally dispersed data to time-dispersion-free data. The unlabeled loss is calculated between the predicted and input unlabeled data; this loss constrains the prediction consistency on the dispersive time-series traces. Here, we modify the mean absolute error (L1) loss by calculating the property loss and the unlabeled loss separately to better fit the training of our proposed workflow. Formally, the loss function takes the following form:
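Writing the Inverse Model as f_θ, the Forward Model as g_φ, and using L1 norms as described above, one form of Equations (1)–(3) consistent with the text is (a reconstruction; α and β denote the two loss weights, and their pairing with the losses is assumed):

```latex
L_{\text{property}} = \lVert f_{\theta}(x) - y \rVert_{1}
                    + \lVert g_{\phi}(y) - x \rVert_{1} \quad (1)

L_{\text{unlabeled}} = \lVert g_{\phi}\big(f_{\theta}(x_u)\big) - x_u \rVert_{1} \quad (2)

L = \alpha \, L_{\text{property}} + \beta \, L_{\text{unlabeled}} \quad (3)
```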
The parameters of both the Inverse Model and the Forward Model are adjusted by combining both losses as in Equation (3). In this work, we set the two loss weights in Equation (3) to 0.2 and 1, as proposed by Motaz Alfarraj [37].
2.4. Training Procedure
We generate the data set using a compiled wavefield modeling program based on the pseudo-spectral (PS) method, which simulates wavefields with large and small time steps. Firstly, the proposed strategy is tested on a partial, modified Marmousi model; the corresponding velocity model is shown in Figure 4. The source wavelet is a Ricker wavelet with a dominant frequency of 15 Hz. The spatial locations of the sources are unequally spaced, and a split-spread geometry is applied. The receiver array has a receiver spacing of 20 m, and the minimum and maximum offsets are 0 m and 14,750 m, respectively. We choose time intervals of 0.2 ms and 1.2 ms, taking the solution obtained with the 0.2 ms interval as time-dispersion-free data. We use approximately 90% of the shot recordings for the training set and the remainder for the test set; the validation data are randomly picked from the test set. Furthermore, the proposed strategy is tested on the SEAM model, whose velocity model is shown in Figure 10. In the SEAM model test, we choose time intervals of 0.4 ms and 2 ms, taking the solution obtained with the 0.4 ms interval as time-dispersion-free data. We then apply transfer learning with the network trained on the Marmousi model and retrain on the SEAM model using 40% of the shot recordings as the training set. The training procedure for the proposed workflow is shown in Algorithm 1. The initial learning rate is 1 × 10⁻⁴; X represents the labeled time-dispersed data, X_U represents the unlabeled time-dispersed data, and Y represents the time-dispersion-free data.
Algorithm 1 Algorithm for updating weights θ and φ.
Input: time-dispersed data sets X and X_U, time-dispersion-free data set Y
Output: θ and φ
1: Randomly initialize parameters θ and φ
2: Set epoch = 500 and the loss weights to 0.2 and 1
3: for epoch steps do
4:  for all of the labeled data sampled do
5:   Apply the Inverse Model to x and the Forward Model to y
6:   Calculate the Property Loss in Equation (1)
7:   Randomly sample the unlabeled data
8:   Apply the Forward Model to the Inverse Model output of x_u
9:   Calculate the Unlabeled Loss in Equation (2)
10:  Calculate the Loss in Equation (3) using the property loss, the unlabeled loss, and the loss weights
11:  end for
12:  Update θ and φ to minimize the Loss
13: end for
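Algorithm 1 can be sketched as a PyTorch training step. The model classes, argument names, and the exact pairing of the two loss weights are assumptions consistent with the description above, not our released code:

```python
import torch

def train_step(inverse_model, forward_model, optimizer, x, y, x_u,
               w_property=0.2, w_unlabeled=1.0):
    """One update of both models: a supervised property loss on labeled
    pairs (x, y) plus a cycle-consistency loss on unlabeled traces x_u."""
    l1 = torch.nn.L1Loss()
    y_pred = inverse_model(x)                  # remove time dispersion
    x_pred = forward_model(y)                  # re-add time dispersion
    property_loss = l1(y_pred, y) + l1(x_pred, x)      # property loss
    x_u_cycle = forward_model(inverse_model(x_u))      # unlabeled cycle
    unlabeled_loss = l1(x_u_cycle, x_u)                # unlabeled loss
    loss = w_property * property_loss + w_unlabeled * unlabeled_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

One optimizer covering the parameters of both models matches the joint update of θ and φ in Algorithm 1.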
4. Discussion
In this work, the given network is designed to map between data simulated by a compiled program at two designated time intervals. Hence, re-training is required whenever the modeling algorithm, time interval, or frequency band changes. As in all deep learning methods, training a neural network on one data set and then applying it to another requires that both data sets have the same distribution; otherwise, the predictions on the test data will be inaccurate because the network has overfitted the training distribution. Considering the computational effort of re-training on a new seismic model, we propose transfer learning to overcome this problem. When we test the proposed strategy on the SEAM model with the transfer learning training approach, we regard Marmousi as the pre-training seismic model. However, the Marmousi model must have enough training data and perform well on its test sets; otherwise, the SEAM model data set may not work well with small amounts of training data.
Additionally, as a comparison, we built a standard GRU network to eliminate time dispersion. This GRU network is taken from the PyTorch machine learning library, and its input and output sizes are the same as those of our proposed network. The data from the SEAM model are used for the test, and the results are shown in Figure 16, Figure 17 and Figure 18. As can be seen, the GRU deep-learning model can also eliminate the time dispersion, but it is less effective than the proposed semi-supervised deep-learning model. This experiment demonstrates that many deep-learning models can achieve time-dispersion elimination, but their performance may differ.