1. Introduction
Elastic wavefields contain multiple wave modes, including compressional (P), shear (S), and converted waves, which naturally combine during propagation. This combination creates challenges in seismic imaging and inversion, especially in elastic reverse-time migration (RTM), where interactions between wave modes can produce unwanted artifacts. To enhance the image quality, separating these wave modes and processing them independently before applying imaging conditions can be helpful [1,2].
Wave mode separation and vector decomposition are the two main methods used to separate elastic wavefields. Wave mode separation uses differences in polarization properties to distinguish P- and S-waves. In isotropic media, P-waves have no curl and S-waves have no divergence, allowing for separation through Helmholtz decomposition, which uses divergence and curl operators [3,4]. In anisotropic media, the polarizations of quasi-P (qP) and quasi-S (qS) waves differ from their wave vectors due to directional dependencies in elastic properties, requiring more complex approaches beyond simple divergence and curl operations. Vector decomposition, introduced by Zhang and McMechan (2010) [5], takes a different approach by projecting wavefields onto polarization vectors that are calculated from the Christoffel equation in the wavenumber domain. While this method preserves the amplitude, phase, and vector characteristics of the separated waves, it shares a common challenge with wave mode separation methods: both approaches require solving the Christoffel equation, which is computationally demanding and must be repeated for each spatial location and propagation direction, creating challenges for large-scale or heterogeneous models.
To address the computational limitations of solving the Christoffel equation, several advancements have been proposed. Yan and Sava (2009) [1] introduced nonstationary spatial filters that adapt to local anisotropic properties. These filters still use polarization vectors from the Christoffel equation but represent them as localized filtering operators in the space domain, reducing the computational overhead while maintaining separation accuracy. Building on this, Yan and Sava (2011) [6] developed a mixed-domain algorithm that integrates space- and wavenumber-domain operations. By approximating the medium's heterogeneity with reference models, this approach reduces the calculations that are needed for polarization vectors. Cheng and Fomel (2014) [2] proposed using Fourier integral operators with low-rank approximations for both wave mode separation and vector decomposition methods. This modification enables more efficient wavefield separation without directly solving the Christoffel equation at every spatial point, making these methods more practical for large, heterogeneous models. Zhou and Wang (2016) [7] developed an approach that simplifies the separation operators to forms resembling divergence and curl operators in isotropic media. Their method uses rotated wave vectors and Poynting vectors to approximate polarization directions, reducing the computational complexity in complex geological settings. Further research has extended these methods to low-symmetry anisotropic media. Sripanich et al. (2017) [8] adapted vector decomposition for orthorhombic, monoclinic, and triclinic models, addressing challenges like polarization discontinuities and singularities through smoothing filters and local signal–noise orthogonalization. These improvements enhance the computational efficiency and broaden the application of elastic wavefield separation methods to more complex geological models.
Recent deep learning (DL) advances have provided useful tools for seismic wavefield processing, addressing challenges in wave mode separation and vector decomposition. U-Net-based convolutional neural networks (CNNs) have emerged as a common approach. Huang et al. (2021) [9] developed a U-Net model to separate qP and qS wavefields in anisotropic media. By incorporating techniques such as dilated convolutions and anti-aliasing, the model effectively handled complex anisotropic wavefields while maintaining the spatial resolution. Sun et al. (2023) [10] introduced a U-ConvNeXt architecture that integrates ConvNeXt blocks, known for their enhanced feature extraction and representation capabilities, into a U-Net framework. This architecture enabled cleaner separation of P- and S-wave components, reduced artifacts, and showed computational efficiency suitable for large-scale seismic data analysis. Generative Adversarial Networks (GANs) have also been applied to wave mode separation. Kaur et al. (2021) [11] proposed a Cycle-GAN framework to decouple qP and qS wavefields in anisotropic media. This method eliminated the need for solving the computationally intensive Christoffel equation and achieved reliable separation of wave modes in highly anisotropic environments.
While previous deep learning methods [9,10,11] have shown promise in wavefield decomposition, they do not incorporate mechanisms to explicitly capture the interactions between P- and S-wave components. To address this limitation, this study proposes a U-Net-based architecture enhanced with cross-attention mechanisms for elastic wavefield vector decomposition in anisotropic media. The model employs a dual-branch encoder to process horizontal and vertical displacement components separately, while the cross-attention enables dynamic feature exchange between them. This structure allows the network to preserve the physical characteristics of each wave mode and to focus on the most relevant features across components, thereby improving the decomposition accuracy and efficiency. The cross-attention mechanism is particularly effective in capturing the complex inter-mode relationships that are inherent to anisotropic wavefields.
2. Theoretical Framework
2.1. Elastic Wave Vector Decomposition in Anisotropic Media
Wave mode separation in elastic wavefields is commonly performed using Helmholtz decomposition, which allows the vector wavefield U = (U_x, U_y, U_z) to be expressed as the sum of P- and S-wave components:

$$\mathbf{U} = \mathbf{U}_P + \mathbf{U}_S. \qquad (1)$$
The P-wave component satisfies ∇ × U_P = 0, while the S-wave component satisfies ∇ ⋅ U_S = 0 [3]. However, in anisotropic media, the polarization directions of qP and qS waves are not necessarily aligned with the wave vector k, rendering simple divergence and curl operators insufficient for mode separation [4]. Instead, the vector wavefield must be projected onto the polarization vectors of each wave mode [2,5].
The polarization vectors p_qP, p_qSV, and p_qSH are computed by solving the Christoffel equation:

$$\left[\,\Gamma(\hat{\mathbf{k}}) - \rho V^2 \mathbf{I}\,\right]\mathbf{p} = \mathbf{0}, \qquad \Gamma_{ik} = c_{ijkl}\,\hat{k}_j \hat{k}_l, \qquad (2)$$

where C (with components c_ijkl) is the stiffness tensor, ρ is the mass density, V is the phase velocity of each wave mode, and p is the polarization vector. To obtain a wave vector decomposition that is analogous to Helmholtz decomposition, the vector wavefield is first expressed in the wavenumber domain. Applying the Fourier transform to the vector wavefield U yields its wavenumber-domain representation Ũ(k). The P- and S-wave separation expressions in isotropic media can be reformulated as follows:

$$\tilde{\mathbf{U}}_P = \hat{\mathbf{k}}\left(\hat{\mathbf{k}} \cdot \tilde{\mathbf{U}}\right), \qquad \tilde{\mathbf{U}}_S = \tilde{\mathbf{U}} - \tilde{\mathbf{U}}_P = -\hat{\mathbf{k}} \times \left(\hat{\mathbf{k}} \times \tilde{\mathbf{U}}\right), \qquad (3)$$

where k̂ = k/|k| is the unit wave vector.
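For intuition, Equation (3) can be applied directly with fast Fourier transforms. The following is a minimal NumPy sketch for a 2D isotropic wavefield on a regular grid; the function and variable names are illustrative, not part of any published implementation:

```python
import numpy as np

def separate_ps_isotropic(ux, uz, dx, dz):
    """Wavenumber-domain P/S separation for a 2D isotropic wavefield,
    following Equation (3): U_P = k_hat (k_hat . U), U_S = U - U_P."""
    nz, nx = ux.shape
    kx = 2 * np.pi * np.fft.fftfreq(nx, d=dx)
    kz = 2 * np.pi * np.fft.fftfreq(nz, d=dz)
    KZ, KX = np.meshgrid(kz, kx, indexing="ij")
    k2 = KX**2 + KZ**2
    k2[0, 0] = 1.0  # guard the DC component (numerator is zero there anyway)

    Ux, Uz = np.fft.fft2(ux), np.fft.fft2(uz)
    proj = (KX * Ux + KZ * Uz) / k2        # (k . U) / |k|^2
    UxP, UzP = KX * proj, KZ * proj        # projection onto k_hat
    uxP = np.real(np.fft.ifft2(UxP))
    uzP = np.real(np.fft.ifft2(UzP))
    return (uxP, uzP), (ux - uxP, uz - uzP)  # P part, S part as residual
```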
Since the wave vector k does not necessarily align with the polarization directions in anisotropic media, it is replaced with the polarization vectors obtained from the Christoffel equation, yielding the anisotropic wave vector decomposition:

$$\tilde{\mathbf{U}}_m = \mathbf{p}_m\left(\mathbf{p}_m \cdot \tilde{\mathbf{U}}\right), \qquad m \in \{\,qP,\ qSV,\ qSH\,\}. \qquad (4)$$

This formulation extends Helmholtz decomposition to anisotropic media by incorporating polarization vectors derived from the Christoffel equation, ensuring physically accurate separation of elastic wave modes.
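The polarization vectors entering Equation (4) come from an eigendecomposition of the Christoffel matrix in Equation (2). The sketch below, assuming the full 3 × 3 × 3 × 3 stiffness tensor is available at a grid point, shows one plausible way to obtain them with NumPy:

```python
import numpy as np

def christoffel_polarizations(c, rho, khat):
    """Solve the Christoffel equation (Equation (2)) for one propagation
    direction. c is the 3x3x3x3 stiffness tensor, rho the density, khat a
    unit wave vector. Returns phase velocities and polarization vectors,
    sorted so that the fastest mode (qP) comes last."""
    # Christoffel matrix: Gamma_ik = c_ijkl * khat_j * khat_l
    gamma = np.einsum("ijkl,j,l->ik", c, khat, khat)
    eigvals, eigvecs = np.linalg.eigh(gamma)  # symmetric -> real eigenpairs
    v = np.sqrt(eigvals / rho)                # phase velocities, ascending
    return v, eigvecs                         # columns: p_qS1, p_qS2, p_qP
```

This is exactly the per-point computation whose cost the next subsection seeks to avoid.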
2.2. Low-Rank Approximation for Wave Vector Decomposition
Wave vector decomposition, as demonstrated in Section 2.1, projects the vector wavefield onto the polarization vectors obtained from the Christoffel equation [5]. However, the direct implementation of these projections requires solving the Christoffel equation at every spatial location and propagation direction, leading to high computational costs. This complexity arises because polarization vectors are spatially dependent and must be computed individually for each grid point in a heterogeneous anisotropic medium.
Applying the inverse Fourier transform to Equation (4), the wavefield decomposition in the space domain is expressed as follows:

$$U_{qP,\alpha}(\mathbf{x}) = \sum_{\beta \in \{x,y,z\}} \int p_{qP,\alpha}(\mathbf{x},\mathbf{k})\, p_{qP,\beta}(\mathbf{x},\mathbf{k})\, \tilde{U}_{\beta}(\mathbf{k})\, e^{i\mathbf{k}\cdot\mathbf{x}}\, d\mathbf{k}, \qquad (5)$$

where p_qP(x, k), p_qSV(x, k), and p_qSH(x, k) represent the polarization vectors for the qP, qSV, and qSH waves at position x and wavenumber k, respectively; the qSV and qSH components follow by substituting the corresponding polarization vector. Each separated wavefield component U_qP,α is computed by summing over the spatial indices β ∈ {x, y, z}, where p_qP,α(x, k) denotes the α-th component of the polarization vector. This formulation allows the wave mode separation to be expressed as nonstationary filtering, but the computational cost remains high due to the need for grid-by-grid polarization vector computation.
To further improve the computational efficiency, the wave mode decomposition operator can be reformulated using matrix notation as follows [2]:

$$W(\mathbf{x}, \mathbf{k}) \approx \sum_{m=1}^{M} \sum_{n=1}^{N} B(\mathbf{x}, \mathbf{k}_m)\, A_{mn}\, C(\mathbf{x}_n, \mathbf{k}), \qquad (6)$$

where W(x, k) is the mixed-domain decomposition operator, B(x, k_m) and C(x_n, k) are mixed-domain matrices that capture variations in wavefield properties, A_mn is a low-rank matrix that approximates the decomposition operator, N_x is the total number of spatial grid points in the computational domain, and M and N are the numerical ranks used in the low-rank decomposition.
Since M, N ≪ N_x, this decomposition significantly reduces the computational cost. The direct computation of wave mode decomposition has a complexity of O(N_x²) due to the need for solving the Christoffel equation at every spatial location, whereas the low-rank decomposition reduces the complexity to O(N N_x log N_x). This decomposition is implemented using pivoted QR (orthogonal-triangular) decomposition, enabling the precomputation of mode separation operators in a compact form. While low-rank approximation improves efficiency, the rank N controls the balance between accuracy and cost. A higher rank provides greater accuracy but increases computational demands, whereas a lower rank accelerates calculations at the expense of some accuracy loss. To avoid the extensive computational cost of solving the Christoffel equation across large datasets, low-rank approximation was used to generate the ground truth labels for training the deep learning model applied in this study.
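As an illustration of the idea behind Equation (6), a skeleton-style factorization built from SciPy's pivoted QR might look as follows. This is a small-scale sketch only: practical implementations such as that of [2] sample rows and columns of W rather than forming the full mixed-domain matrix, and all names here are hypothetical:

```python
import numpy as np
from scipy.linalg import qr, pinv

def lowrank_decompose(W, rank):
    """Skeleton (CUR-style) low-rank factorization of a mixed-domain
    matrix W[x, k] via pivoted QR column/row selection, so that
    W ~= B @ A @ C, mirroring Equation (6)."""
    # Pivoted QR on W picks the most informative wavenumber columns k_m.
    _, _, col_piv = qr(W, pivoting=True, mode="economic")
    cols = np.sort(col_piv[:rank])
    # Pivoted QR on W^T picks the most informative spatial rows x_n.
    _, _, row_piv = qr(W.T, pivoting=True, mode="economic")
    rows = np.sort(row_piv[:rank])

    B = W[:, cols]             # B(x, k_m): selected columns
    C = W[rows, :]             # C(x_n, k): selected rows
    A = pinv(B) @ W @ pinv(C)  # middle matrix A_mn by least squares
    return B, A, C
```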
2.3. Cross-Attention for Vector Decomposition
In vector decomposition, seismic wavefields consist of multiple displacement components, each containing contributions from different wave modes. Traditional deep learning models for wave mode separation typically concatenate these components into a single input, allowing the network to process them jointly. However, each displacement component carries its own mixture of qP and qS contributions, and treating all components as a single input may limit the model's ability to disentangle these mixed wave modes.
A more structured approach is to encode each displacement component separately, ensuring that the model learns distinct features for each mode. While this enhances feature extraction within individual components, it introduces a drawback: the absence of direct interaction between components during encoding. Since displacement components share underlying physical relationships, ignoring these interdependencies could reduce the model’s ability to accurately decompose the wavefield.
To address this limitation, cross-attention enables dynamic feature exchange between separately encoded components. Unlike self-attention, which captures dependencies within a single input sequence, cross-attention establishes relationships across distinct inputs [12,13]. This mechanism allows each displacement component to selectively focus on relevant features from others, facilitating improved wave mode separation. Cross-attention is computed by exchanging information between different displacement components. This process is bidirectional, ensuring that each component benefits from interactions with the others. The general form of cross-attention is expressed as follows:
$$\mathbf{Z}_i = \frac{1}{D-1} \sum_{\substack{j \in \{x,y,z\} \\ j \neq i}} \operatorname{softmax}\!\left(\frac{Q_i K_j^{\top}}{\sqrt{d_k}}\right) V_j, \qquad (7)$$

where D = 3 represents the number of spatial components (x, y, z), and i, j ∈ {x, y, z} ensure that each component attends to the others. Here, Q_i is the query representation obtained from the feature embeddings of displacement component i, while K_j and V_j correspond to the key and value embeddings of component j. The scaling factor √d_k normalizes the attention scores to stabilize training. This formulation ensures that each displacement component retains its physical consistency while incorporating inter-component dependencies that would otherwise be lost when processing components independently.
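A compact PyTorch sketch of this mechanism, with one component's features serving as queries and another's as keys and values, is given below. The module name, head count, and tensor layout are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class ComponentCrossAttention(nn.Module):
    """Minimal sketch of Equation (7): the feature map of displacement
    component i queries that of component j."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feat_i, feat_j):
        # feat_*: (batch, channels, H, W) encoder features of components i, j
        b, c, h, w = feat_i.shape
        q = feat_i.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from i
        kv = feat_j.flatten(2).transpose(1, 2)  # keys/values from j
        out, _ = self.attn(q, kv, kv)           # softmax(Q K^T / sqrt(d_k)) V
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Running the module twice with the roles of i and j swapped realizes the bidirectional exchange described above.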
By integrating cross-attention, the model enhances its ability to maintain physical consistency across decomposed wave modes. This is particularly relevant in anisotropic media, where wave mode interactions vary due to differences in elastic properties. Furthermore, cross-attention has been widely used in multi-modal learning, where multiple input modalities must interact to enhance the feature extraction [13,14]. In seismic data processing, displacement components serve as distinct yet correlated feature spaces. When processed independently, they may lose valuable interactions, which cross-attention compensates for by selectively integrating relevant information across components.
2.4. Network Architecture
The proposed model is based on a U-Net architecture [15], enhanced with cross-attention mechanisms to improve vector decomposition while preserving inter-component relationships. Unlike conventional approaches that concatenate displacement components into a single input, this model processes them separately in the encoder and integrates cross-attention in the decoder to facilitate feature exchange. This structure ensures that the vector decomposition remains physically meaningful while improving the separation process.
The model takes a seismic wavefield snapshot as input, where the horizontal and vertical displacement components, u_x and u_z, serve as the input channels. Each displacement component contains contributions from both qP and qS waves, which must be properly separated to ensure accurate vector decomposition. Since this study focuses on 2D wavefield decomposition, the analysis is restricted to qP and qS waves, without considering separate shear wave polarizations such as qSH and qSV. The network is designed to output two channels, u_x_qS and u_z_qS, corresponding to the qS wave components of the horizontal and vertical displacements. The qP components, u_x_qP and u_z_qP, are then inferred using the physical constraint:

$$u_{x\_qP} = u_x - u_{x\_qS}, \qquad u_{z\_qP} = u_z - u_{z\_qS}. \qquad (8)$$
Empirical results indicate that learning S-wave components directly and computing P-wave components as residuals leads to improved separation performance, making this a preferred approach for training.
As illustrated in Figure 1, the network consists of three main components: an encoder, a cross-attention module, and a decoder. The encoder follows a dual-branch structure, where the two displacement components are processed independently through separate encoding paths. Each encoding path consists of multiple down-sampling stages, applying residual blocks [16] for feature extraction while progressively reducing the spatial resolution through pooling layers. This design allows the network to learn independent representations for each component while preserving their structural integrity. Unlike conventional U-Net models, which process displacement components jointly, this approach ensures that the unique polarization characteristics in u_x and u_z are retained during feature extraction.
To facilitate feature exchange between the encoded displacement components, cross-attention is introduced before the decoder. Instead of simply concatenating feature maps before decoding, cross-attention selectively integrates relevant information from one component into the other, ensuring that the decomposition process remains consistent with the underlying physical relationships between wave modes. The cross-attention mechanism enables dynamic feature interaction and reinforces inter-component dependencies, which are essential for robust wave mode separation, particularly in anisotropic media. Following cross-attention, the decoder mirrors the hierarchical structure of the encoders. For each up-sampling step, transposed convolutions restore spatial resolution, and feature maps from the corresponding encoder layers are concatenated through skip connections to retain fine-scale structural details [15]. This ensures that high-resolution, near-surface information, preserved in the encoder's earlier stages, remains accessible during wave mode separation.
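Putting these pieces together, the following PyTorch skeleton illustrates the dual-branch encoder, bidirectional cross-attention at depth, and a shared decoder with skip connections from both branches, reusing the ComponentCrossAttention sketch above. The depth, channel counts, and plain convolution blocks (standing in for the residual blocks of the paper) are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class DualBranchUNet(nn.Module):
    """Sketch of the dual-branch encoder / cross-attention / decoder layout."""
    def __init__(self, base=32):
        super().__init__()
        self.enc_x1, self.enc_z1 = conv_block(1, base), conv_block(1, base)
        self.enc_x2, self.enc_z2 = conv_block(base, 2 * base), conv_block(base, 2 * base)
        self.pool = nn.MaxPool2d(2)
        self.cross_xz = ComponentCrossAttention(2 * base)  # u_x queries u_z
        self.cross_zx = ComponentCrossAttention(2 * base)  # u_z queries u_x
        self.up = nn.ConvTranspose2d(4 * base, 2 * base, 2, stride=2)
        self.dec = conv_block(4 * base, base)  # skips from both branches
        self.head = nn.Conv2d(base, 2, 1)      # outputs u_x_qS, u_z_qS

    def forward(self, ux, uz):
        fx1, fz1 = self.enc_x1(ux), self.enc_z1(uz)
        fx2 = self.enc_x2(self.pool(fx1))
        fz2 = self.enc_z2(self.pool(fz1))
        # Bidirectional cross-attention at the deepest level
        ax = self.cross_xz(fx2, fz2)
        az = self.cross_zx(fz2, fx2)
        d = self.up(torch.cat([ax, az], dim=1))
        d = self.dec(torch.cat([d, fx1, fz1], dim=1))
        uqs = self.head(d)                      # qS components
        uqp = torch.cat([ux, uz], dim=1) - uqs  # qP as residual, Equation (8)
        return uqs, uqp
```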
2.5. Training
To evaluate the effectiveness of the proposed cross-attention-based wavefield decomposition, a modified version of the widely used BP 2007 anisotropic benchmark model was employed. Figure 2 shows the elastic properties of the medium: P-wave velocity (V_p), S-wave velocity (V_s), and Thomsen parameters (ϵ, δ). For training, the model domain was partitioned into 200 × 200 subregions, yielding 60 distinct patches. To increase variability and improve generalization, 30 additional patches were generated using standard data augmentation techniques, including random flips and minor scaling. The patch size (200 × 200) and the number of augmented samples (30) were chosen empirically to balance model performance and computational cost. These parameters were fixed during training for consistency but may be adjusted in future studies depending on the application. The process of extracting patches from the full model is illustrated in Figure 3, which provides an example of how the domain is segmented for training.
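A minimal sketch of the tiling and augmentation described above follows; the patch size matches the paper, while the scaling range and helper names are assumptions:

```python
import numpy as np

def extract_patches(model, patch=200, stride=200):
    """Tile a 2D property model into patch x patch training subregions."""
    nz, nx = model.shape
    patches = []
    for i in range(0, nz - patch + 1, stride):
        for j in range(0, nx - patch + 1, stride):
            patches.append(model[i:i + patch, j:j + patch])
    return patches

def augment(patch, rng):
    """Random flips and minor amplitude scaling for augmentation."""
    if rng.random() < 0.5:
        patch = np.fliplr(patch)
    if rng.random() < 0.5:
        patch = np.flipud(patch)
    return patch * rng.uniform(0.95, 1.05)  # assumed "minor scaling" range
```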
For each patch, wave propagation is simulated under five different source locations, ensuring that the network is exposed to wavefields from multiple illumination angles. From each source simulation, five temporal snapshots of the propagating wavefield are extracted, leading to a total of 25 wavefield snapshots per patch. The snapshots are randomly selected from a time window excluding the initial 0.2 s of simulation, with at least 0.15 s between selections to ensure sufficient temporal diversity. The ground truth labels are generated via the low-rank approximation method described in Section 2.2, allowing us to bypass explicit Christoffel equation solvers during training.
During training, two U-Net variants are compared: one with multi-head cross-attention integrated at the deepest encoder level and one without. Both variants utilize residual blocks in the encoder and decoder to facilitate gradient flow. The dataset is randomly partitioned into training, validation, and testing splits, with 64% of the data used for training, 16% for validation, and 20% for testing. All network weights are initialized by the Kaiming normal scheme [17] to stabilize the early training. Optimization is performed using the Adam optimizer [18], with an initial learning rate of 1.5 × 10⁻⁴. Training typically proceeds for up to 200 epochs with a patience-based early-stopping criterion set to 30 epochs. To further regularize the model, a minor physics-informed penalty (coefficient 0.01) is included within the loss function to encourage physically consistent separation. This term is particularly useful in noisy or highly heterogeneous scenarios, where purely data-driven separation may overfit spurious features.
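A condensed version of this training loop is sketched below. The exact form of the physics-informed penalty is not specified above; as one plausible stand-in, the sketch penalizes the divergence of the predicted qS field, which vanishes exactly only in the isotropic limit. The loader structure and all names are assumptions:

```python
import torch

def train(model, train_loader, val_loader, device="cuda"):
    """Sketch: Adam at 1.5e-4, L2 data loss on the qS prediction, a small
    physics penalty (weight 0.01), early stopping with 30-epoch patience."""
    opt = torch.optim.Adam(model.parameters(), lr=1.5e-4)
    best_val, wait, patience = float("inf"), 0, 30
    for epoch in range(200):
        model.train()
        for ux, uz, uqs_true in train_loader:  # assumed loader fields
            ux, uz, uqs_true = ux.to(device), uz.to(device), uqs_true.to(device)
            uqs, _ = model(ux, uz)
            data_loss = torch.mean((uqs - uqs_true) ** 2)
            # Assumed physics term: penalize div(u_qS) via finite differences.
            dux_dx = uqs[:, 0:1, :, 1:] - uqs[:, 0:1, :, :-1]
            duz_dz = uqs[:, 1:2, 1:, :] - uqs[:, 1:2, :-1, :]
            div_qs = dux_dx[:, :, :-1, :] + duz_dz[:, :, :, :-1]
            loss = data_loss + 0.01 * torch.mean(div_qs ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                torch.mean((model(ux.to(device), uz.to(device))[0]
                            - uqs_true.to(device)) ** 2).item()
                for ux, uz, uqs_true in val_loader
            ) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break  # early stopping
```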
The training was conducted using an NVIDIA RTX 4090 Graphics Processing Unit (GPU; NVIDIA Corporation, Santa Clara, CA, USA) with CUDA (Compute Unified Device Architecture) acceleration. For the baseline U-Net model (without cross-attention), the total training time was approximately 4925 s, while the cross-attention U-Net required approximately 6487 s. This increase in training time reflects the additional computational complexity that is introduced by the attention mechanism.
4. Discussion
While the proposed cross-attention U-Net shows promising results in elastic wavefield decomposition, several aspects require further consideration. One of the main challenges is the computational complexity that is introduced by the cross-attention mechanism. Since attention operations scale quadratically with input size, this can create a bottleneck for applications involving large seismic datasets. This limitation necessitates the exploration of optimization strategies such as sparse attention techniques, hierarchical attention implementations, or alternative attention mechanisms that can maintain performance while reducing the computational overhead.
Another area of concern is the model's dependence on labeled training data. Unlike physics-based approaches, the proposed method requires a considerable amount of labeled data for effective generalization. Although the use of low-rank approximations for generating labels has been effective in the current setting, this method may introduce potential bias due to the assumptions involved in the approximation. In particular, it is unclear how well these labels generalize to complex real data, where those assumptions may not hold. Investigating alternative label generation strategies, such as weak supervision or self-supervised learning, could help reduce such bias while improving model robustness.
Extending the method to three-dimensional wavefield decomposition is an important next step for real data applications. However, this also increases the computational and memory demands of the model. Future work should focus on improving the scalability for 3D domains, possibly by introducing hybrid attention architectures that combine local and global representations or by leveraging the structural characteristics of seismic data to reduce the processing overhead. On the other hand, inference with the trained network is considerably faster than conventional decomposition methods, such as low-rank approximations, which require intensive computation for each snapshot. Once training is complete, the model yields predictions almost instantly through a single forward pass, making it well suited for large-scale or time-sensitive applications.
In addition, when the model is applied to larger domains using a sliding-patch inference approach, some reduction in accuracy can be observed near patch boundaries. This appears to stem from the limited receptive field of the network, which is trained on relatively small patches (200 × 200) and may not capture long-range spatial dependencies effectively. This limitation becomes more noticeable when processing extended domains, as illustrated in Figure 7 and Figure 8. Potential improvements could include overlapping patch strategies with smooth blending, or the integration of global context modeling techniques such as attention modules that span wider spatial extents.
While the model performs well across different velocity structures and anisotropic characteristics, further development is needed to improve its performance in more complex geological settings. This could include incorporating physical constraints directly into the network architecture or loss function, helping the model align more closely with the underlying physics. Additionally, interpretability tools tailored to wavefield decomposition could offer useful insights into how the model makes decisions and support validation in practical scenarios.