1. Introduction
Turbulent flows are encountered in many scientific and engineering fields, such as
aerospace, weather sciences, and biophysical systems. Computational Fluid Dynamics
(CFD) simulations of turbulent dynamics need to resolve multiscale flow features
that arise from the interaction of strong and chaotic fluctuations over a wide range of scales. As the
flow Reynolds number increases, ever smaller scales develop in the energy cascade
as the inertial force overwhelms the viscous force, leading to a large scale separation in
both space and time. Therefore, fine mesh resolutions are required to obtain accurate
solutions. Consequently, time-marching such turbulent flows with high-fidelity numerical
techniques, such as Direct Numerical Simulation (DNS) and wall-resolved Large Eddy
Simulation (LES), is prohibitively expensive for many practical use cases. Alternatively,
Machine Learning (ML)-based methodologies can potentially offer significant speed-ups
for the estimation of turbulent fields [1,2]. Although fully replacing the CFD solver for all time integration steps with ML or a data-driven model is challenging and could lead to the accumulation of errors, ML can still offer computational efficiency by offloading a few time steps. In such a workflow, the idea is to use ML for acceleration rather than as a replacement for the underlying CFD solver. ML techniques have already been widely explored in CFD for such applications, for instance in model-order reduction [
3], super-resolution [
4], and also for temporal predictions [
5], among many other efforts [
6].
For unsteady problems, the Recurrent Neural Network (RNN) architecture is well suited to capture and model temporal dynamics. An RNN consists of hidden layers with a feedback loop, where each layer has an internal state vector, which is combined with the input vector to obtain the output. This output from a hidden layer is fed into the next layer, leading to the
recurrent architecture, which allows the processing of time signals. However, RNNs suffer from
vanishing and
exploding gradients, as described in Bengio et al. [
7]. Early remedies for the gradient problems in RNNs included gradient clipping and truncated backpropagation through time [8], which limits the number of steps over which gradients are propagated. Among the architectures proposed to improve RNNs, the Long Short-Term Memory (LSTM) architecture [9], which features so-called memory cells, helped to overcome the vanishing gradient issue and led to many developments in temporal predictions. Similarly, the gating mechanism in the Gated Recurrent Unit (GRU) [10] also tackled the vanishing gradient issue. In CFD, both RNNs and LSTMs have been used for temporal prediction, including for turbulent flows [
11,
12]. Although the vanishing and exploding gradient issues are addressed by these
developments, training such networks is usually slow because the networks rely on sequential computation, which makes it difficult to benefit from parallel systems for acceleration. This is a major drawback, since current ML applications are compute- and data-intensive and need to exploit parallel architectures, especially accelerators such as Graphics Processing Units (GPUs). To solve this issue, the Transformer (TR) architecture introduced in Vaswani et al. [13] dispenses with recurrence and instead relies entirely on a self-attention mechanism (explained in further detail in Section 3) to deduce global dependencies, which inherently allows parallel training. TRs have already been widely exploited in the Natural Language Processing (NLP) community [14,15], demonstrating excellent generative and predictive potential. Owing to their large uptake in the NLP community and in computer vision [16], there have been developments in the scientific domain as well [17,18].
In CFD, the use of TRs is still at a relatively early stage compared to other ML architectures. In one of the first applications [
19], a TR is coupled to a Generative Adversarial Network (GAN) to generate turbulent inflow conditions for Turbulent Boundary Layer (TBL) simulations. For the prediction of temporal dynamics in a Reduced Order Modelling (ROM)-based framework, TRs have been used to time-march compressed representations of the flow field. In Hemmasian and Barati Farimani [
20], an AutoEncoder (AE)-based network is used for compression, while, more recently, a β-variational autoencoder network is used with a TR to predict the encoded fields [21], where the superiority of the TR over the LSTM is also demonstrated. However, compressed representations tend to underestimate the high-frequency components during reconstruction and require careful treatment of the hyperparameters that define the network [
22]. Nonetheless, these investigations show that TRs can outperform other prediction methods for CFD applications. Also for long temporal sequences, TR has shown excellent prediction ability [
19], whereas LSTMs have been shown to reconstruct long-term dependencies only when separately predicting modes corresponding to different frequency ranges [
23].
Inspired by these developments surrounding the use and superiority of TRs in estimating CFD fields, this study analyzes the capability of TRs to predict the full velocity field in an actuated TBL problem. For this purpose, an encoder–decoder configuration of a TR architecture is employed. To limit the number of features that the TR model needs to predict, thus reducing the complexity of the self-attention mechanism, the inputs are reshaped into smaller cubic sub-domains. This also allows handling of non-uniform input shapes encountered in CFD simulations where the computational domain changes. The developed model is envisaged to be integrated into a coupled setup including the baseline LES solver. The idea is to offload a
few time-marching steps of the LES solver to the TR model, which yields a significant computational speed-up. The number of time steps over which the TR model can be used for this offloading depends on the accuracy of the model over different prediction horizons. This is analyzed in this study with the Dynamic Mode Decomposition (DMD) method [24,25,26].
This manuscript demonstrates, for the first time to the best knowledge of the authors, the use of a TR to speed up the time-marching of actuated TBL fields. The contribution of this study is a methodology that uses a TR architecture, in an envisioned hybrid workflow coupled to a CFD solver, to offload time-marching steps of the solver. This reduces the computational cost of time-marching turbulent fields while retaining an accuracy similar to that of the baseline solver.
The manuscript is structured as follows.
Section 2 provides an overview of the computational setup, details on the numerical solver, and the associated data that is employed for training the TR model. The TR architecture and methodology of the training are discussed in
Section 3. The performance of the TR model is shown and analyzed in
Section 4. Finally,
Section 5 concludes the manuscript with a summary of the findings and provides directions for future research.
2. TBL Problem Formulation
Since the aviation sector accounts for a significant share of energy demand and associated greenhouse gas emissions, and as political goals and rising energy costs pose environmental and economic challenges for aircraft, aerodynamic improvements are needed. A promising technique to actively, and therefore adaptively, reduce the viscous drag is the use of spanwise traveling transversal surface waves to manipulate the near-wall turbulent boundary layer [
27].
As a first-step approximation to more realistic and computationally more expensive application scenarios such as aircraft wings, a CFD model based on a validated zero-pressure gradient flat plate configuration is selected to study the underlying physics and potential of this active drag reduction technique. For that purpose, wall-resolved LES is performed using the in-house CFD solver m-AIA (
https://git.rwth-aachen.de/aia/m-AIA/m-AIA, accessed on 25 September 2024) [
28,29,30]. The physical domain of the flat plate model is depicted in Figure 1 for the actuated case, where the dimensions in the Cartesian directions are
,
, and
. The actuation parameters of the spanwise traveling wave are the wavelength
, the time period
T and the amplitude
A. At the inflow of the domain, the Reformulated Synthetic Turbulence Generation (RSTG) method is used to initiate a TBL flow [
31]. The onset of the surface actuation, analyzed in Albers et al. [
28], Fernex et al. [
29], is located at
, where a fully developed TBL is established. The surface area
for the integration of the wall-shear stress
is shaded in gray. Periodic Boundary Conditions (BC) are used in the spanwise direction
z, characteristic outflow conditions [
32] are applied on the downstream and upper boundaries, and the no-slip condition is imposed on the wall [
28].
For the solver, the unsteady compressible Navier–Stokes equations in the Arbitrary Lagrangian–Eulerian (ALE) formulation for time-dependent body-fitted deformable meshes are solved with the structured part of m-AIA. A second-order accurate Finite Volume (FV) approximation of the governing equations is used, in which the inviscid fluxes are computed by the Advection Upstream Splitting Method (AUSM) using a Monotonic Upstream-Centered Scheme for Conservation Laws (MUSCL) to reconstruct the cell-center values. The viscous fluxes are discretized by a modified cell-vertex scheme at second-order accuracy. The time integration is performed via a five-stage Runge–Kutta scheme with second-order accuracy. Additional volume fluxes are determined to satisfy the Geometric Conservation Law (GCL). According to the Monotonically Integrated Large Eddy Simulation (MILES) approach, the subgrid dissipative scales of the LES are implicitly modeled by the numerical dissipation of the AUSM scheme [
28].
For the three-dimensional turbulent flow, the wave parameters are non-dimensionalized and given in inner units
based on the friction velocity
and the kinematic viscosity
, both at
averaged in the spanwise direction. The actuation is characterized by the non-dimensional actuation parameters, wavelength
, amplitude
and time period
, such that the wall normal coordinate of the spanwise traveling transversal surface wave is described by Equation (
1). The piecewise-defined function
ensures a smooth spatial transition in the streamwise direction from the non-actuated to the actuated surface area and vice versa, which is given by:
In previous studies, 80 LESs, i.e., one reference/non-actuated and 79 actuated cases, were performed for grid-like distributed actuation parameter combinations within the bounds
[
28,
29].
The flow conditions are predefined by the momentum thickness-based
Reynolds number
at
and the
Mach number
using the free stream velocity
, the momentum-based boundary layer thickness
, the kinematic viscosity
, and the ideal gas speed of sound
a of air. The mesh resolution is
in the streamwise,
in the wall-normal direction gradually coarsening off the wall up to
at the boundary layer edge, and
in the spanwise direction. This yields a DNS-like resolution near the wall, rendering these simulations wall-resolved LESs. Further details on the numerical method, the computational setup, validation of the LES, BC and simulation data points including mesh independence studies and flow field statistics can be found in Albers et al. [
28]. For the purpose of this investigation with the TR model, a version of the actuated dataset with a high sampling rate is generated and used for training, in which snapshots are stored every 24 solver time steps for the exemplary actuation parameter combination of
,
and
.
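As an illustration of the non-dimensionalization used in this section, the following minimal Python sketch evaluates the inner-unit scaling of the wave parameters and the two numbers that define the flow conditions. It assumes the standard definitions, i.e., lengths scaled by the viscous length ν/u_τ, times scaled by ν/u_τ², a Reynolds number u_∞θ/ν, and a Mach number u_∞/a. The function names and all numerical values are hypothetical and are not taken from the simulations.

```python
def inner_units(wavelength, period, amplitude, u_tau, nu):
    """Scale wave parameters with friction velocity u_tau and kinematic viscosity nu."""
    length_scale = nu / u_tau        # viscous length scale
    time_scale = nu / u_tau ** 2     # viscous time scale
    return (wavelength / length_scale,
            period / time_scale,
            amplitude / length_scale)


def reynolds_theta(u_inf, theta, nu):
    """Momentum thickness-based Reynolds number."""
    return u_inf * theta / nu


def mach_number(u_inf, speed_of_sound):
    """Free-stream Mach number."""
    return u_inf / speed_of_sound


# Hypothetical dimensional values, chosen only to exercise the functions.
lam_plus, T_plus, A_plus = inner_units(
    wavelength=0.01, period=0.002, amplitude=0.0005, u_tau=0.5, nu=1e-5
)
print(lam_plus, T_plus, A_plus)
print(reynolds_theta(u_inf=10.0, theta=0.001, nu=1e-5))
print(mach_number(u_inf=34.0, speed_of_sound=340.0))
```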
3. TR Model Architecture
The TR model is required to provide temporal predictions of velocity fields of the TBL flow. As mentioned in
Section 1, this is achieved through an encoder–decoder configuration of a TR model that has been adapted from Wu et al. [
33]. The architecture is shown in
Figure 2, where the numbers of encoder layers, decoder layers, and attention heads are chosen by manually tuning these hyperparameters such that the loss is minimized. The input layer of the network is also configured such that the non-uniform velocity field tensors can be read as input to the transformer. The encoder consists of an input layer defined by a fully connected network and a stack of six encoder layers. Positional encoding is defined by sinusoidal functions. Each of the six layers consists of a self-attention sub-layer and a fully connected feed-forward sub-layer, each followed by a normalization layer. The self-attention mechanism captures dependencies between tokens, which in this context are the velocity field tensors at successive time instances. The TR model obtains these dependencies by representing each token in the form of query (Q), key (K), and value (V) vectors. These vectors are used to compute attention scores, which represent the relevance of a token with respect to the other tokens, thus enabling the capture of not only short-term but also long-term dependencies. Further details can be found in Vaswani et al. [
13]. The decoder has a fully connected input layer, again followed by six decoder layers, and a linear mapping to the target sequence at the output layer. In the decoder layers, an additional layer applies attention to the encoder outputs (encoder–decoder attention). To ensure that the decoder only sees information from the previous time steps, look-ahead masks are applied. The loss function for training the model consists of a Mean-Squared-Error (MSE) term,
, and an additional gradient loss term,
, defined by the first-order gradients of the velocity field. For example, if
and
are the target and TR-predicted velocity tensors at time instance
, and the velocity gradients are
,
, and
in three directions
x,
y, and
z, the loss
is defined by the following:
where
is the dimension of the velocity tensor summed over all the directions, and
and
are the weights assigned to
and
, and
and
[
34]. To avoid
dominating
, it is scaled in the first 100 epochs such that
.
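A minimal NumPy sketch of a loss of this form is given below. It combines an MSE term with a term that penalizes the mismatch of the first-order velocity gradients in x, y, and z. The finite-difference operator (`np.gradient`), the averaging used for normalization, the default weights, and the function names are assumptions made for illustration. The definition used for training is the one given in Equation (2), and the scaling applied in the first 100 epochs is not reproduced here.

```python
import numpy as np


def mse_loss(u_true, u_pred):
    """Mean squared error between target and predicted velocity tensors."""
    return float(np.mean((u_true - u_pred) ** 2))


def gradient_loss(u_true, u_pred, spacing=1.0):
    """Mean squared mismatch of first-order gradients along the last three axes.

    The last three axes are interpreted as the x, y, and z directions, so any
    leading axes (batch, time) are handled automatically.
    """
    total = 0.0
    for axis in (-3, -2, -1):
        g_true = np.gradient(u_true, spacing, axis=axis)
        g_pred = np.gradient(u_pred, spacing, axis=axis)
        total += np.mean((g_true - g_pred) ** 2)
    return float(total)


def total_loss(u_true, u_pred, w_mse=1.0, w_grad=1.0):
    """Weighted sum of the MSE term and the gradient term (weights are assumed)."""
    return w_mse * mse_loss(u_true, u_pred) + w_grad * gradient_loss(u_true, u_pred)


# Toy check on one 8 x 8 x 8 sub-domain filled with random values.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 8, 8))
prediction = target + 0.1 * rng.normal(size=(8, 8, 8))
print(total_loss(target, prediction))
```

A constant offset between the prediction and the target is penalized only by the MSE term, whereas errors in the small-scale structure are penalized by both terms. This is the motivation for adding a gradient term.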
Assuming that the training dataset consists of velocity field time instances, a subset sequence of
serves as the encoder input, where
m is the encoder sequence length. This is shown in
Figure 2. In this case, the decoder input consists of velocities at time instances
, and the decoder outputs the velocities at time-instances
, where
is the target sequence length. For the investigated TBL problem, these time sequences of the velocity field tensors vary in shape. Specifically, the width of the computational domains used as training data for the TR varies in the spanwise (
z) direction. To handle these non-uniform samples, the velocity field tensors are reshaped into smaller cubic sub-domains with an edge length of eight computational cells in each direction. This sub-domain size is a hyperparameter that can influence the accuracy of the developed model; in the present investigation, it is tuned manually. Cubic sub-domain sizes of four and 16 cells are also tested, but these lead to higher errors. Sub-domain sizes of more than 16 cells are not considered, as both the training costs and the complexity of self-attention increase. Importantly, this reshaping operation limits the number of features that the TR model needs to predict, which significantly reduces the computational complexity of the self-attention mechanism. It is also observed that the accuracy of the TR is significantly worse when the full velocity field is predicted.
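The reshaping into cubic sub-domains and the construction of encoder, decoder, and target sequences can be sketched as follows. This is an illustrative NumPy example and not the implementation used in this work. It assumes non-overlapping cubes, domain dimensions that are divisible by the cube size, and a decoder input that starts at the last encoder snapshot and is shifted by one step with respect to the target. The function names are hypothetical.

```python
import numpy as np


def split_into_cubes(field, size=8):
    """Split a (nx, ny, nz) field into non-overlapping cubes of edge length `size`.

    Returns an array of shape (n_cubes, size**3), so that domains of different
    width yield a different number of cubes but the same number of features.
    """
    nx, ny, nz = field.shape
    if nx % size or ny % size or nz % size:
        raise ValueError("each dimension must be divisible by the cube size")
    blocks = field.reshape(nx // size, size, ny // size, size, nz // size, size)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)  # group the three block indices first
    return blocks.reshape(-1, size ** 3)


def make_windows(series, m, n):
    """Build (encoder input, decoder input, target) windows from a time series.

    series has shape (n_times, n_features); m is the encoder sequence length
    and n is the target sequence length.
    """
    enc, dec, tgt = [], [], []
    for start in range(series.shape[0] - m - n + 1):
        enc.append(series[start:start + m])
        dec.append(series[start + m - 1:start + m + n - 1])  # shifted by one step
        tgt.append(series[start + m:start + m + n])
    return np.stack(enc), np.stack(dec), np.stack(tgt)


# Two domains with different spanwise widths give the same feature size.
narrow = split_into_cubes(np.zeros((16, 8, 16)))
wide = split_into_cubes(np.zeros((16, 8, 24)))
print(narrow.shape, wide.shape)
```

With a cube edge of eight cells, each token carries 512 values per velocity component, independent of the spanwise width of the domain.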
Furthermore, 16 attention heads are employed in the network architecture. To regularize each of the encoder and decoder layers, a dropout value of
is used. The Adam optimizer [
35] is used for training for 1700 epochs. The learning rate is scaled linearly with the number of workers (in this case, GPUs) that are used for the training. This is done to preserve the accuracy of the model when dealing with large batch sizes encountered in large-scale distributed training, which is a known issue in ML [
36,37]. The distributed training runs are conducted with the DeepSpeed framework (
https://github.com/microsoft/DeepSpeed, accessed on 25 September 2024), which is provided by the open-source library AI4HPC (
https://ai4hpc.readthedocs.io/en/latest/, accessed on 25 September 2024) for scaling ML workloads on High Performance Computing (HPC) systems [
38]. The library includes parsing options that allow the training parameters of the network to be configured. For training the TR model, the size of the cubic sub-domains is an important hyperparameter, as discussed above. Another important parameter is the seed of the network, which is also used to allow deterministic runs. This can be specified in AI4HPC with the nseed argument. To exploit the HPC environment used for training the model, multiple workers and prefetching of data are also used, set with the nworker and prefetch arguments, respectively. All of these options can be specified as input parser arguments to the library [
38]. The TR model (
https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc/-/blob/master/Cases/DS_ATBL_TR.py, accessed on 25 September 2024) and the training data [
39] are available open-source.
4. Results and Discussion
This section highlights the results obtained with the TR model and the TBL dataset described above. The time-marching capabilities of the TR model are evaluated by increasing the predicted future time steps. For analyzing the performance of the TR model,
of the entire dataset is used for test purposes, and the results shown in this section refer to this test dataset. It is to be noted that the convective time for the physical CFD solver time step is
, where
is the physical time covered by one CFD solver time step,
is the far-field velocity outside the boundary layer, and
is the momentum-based boundary layer thickness at the onset of the actuation at
. For training the TR model, a dataset with a high sampling rate is generated, where the velocity fields are stored every 24 physical CFD solver time steps. Hence, during inference, the convective TR time step in terms of the convective CFD solver time step is
. As mentioned in
Section 3, the TR has a target sequence length of
during training, where each of the steps of the target length corresponds to
.
During inference, the TR can be tested for longer time sequences, as the inference step runs the model iteratively, such that
, where
is the TR model operator. The implementation of the inference step is available in the AI4HPC repository (
https://gitlab.jsc.fz-juelich.de/CoE-RAISE/FZJ/ai4hpc/ai4hpc/-/blob/master/Cases/src/networks.py, accessed on 25 September 2024). Exemplary reconstructions provided by the TR model for the streamwise velocity (
u) are shown in
Figure 3. Qualitatively, it can be observed that the velocity fields predicted by the TR are in close agreement with the LES fields for
. The reconstructions above
are worse compared to
. The larger flow features are clearly reconstructed by the TR model. However, for
, the discrepancies in the TR-predicted fields are visible, which can be better observed in the line plots in the right panel. In particular, the values at locations with sharp gradients are poorly estimated. Since the sharpest gradients generally correspond to points in the velocity field with extreme values, such behavior can be expected from ML-based models, as models that generalize well capture the mean flow characteristics better than the outliers. However, up to
, the line plots are in close agreement. The coefficient of determination (
) is also computed, and an average
is found for
, after which it starts to drop. The drop in
, given by the first term in Equation (
2), with increasing
is shown in
Figure 4, where the
scores are also shown. As expected, a trend of increase in MSE and decrease in
is observed with increasing
. Given the reconstruction quality observed in
Figure 3c, it can be concluded that the TR predictions are poor for
. However, the time-marching of the TBL flow up to
already provides high computational speed-up (shown in
Section 4.2) with high accuracy. In terms of a measure to quantify the temporal evolution of the flow that the TR model is able to provide,
would correspond to a convective time of about 0.40, i.e., the time in which the free-stream flow travels about 40% of the momentum-based boundary layer thickness.
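The iterative inference and the two error measures discussed above can be sketched as follows. This is a simplified NumPy illustration and not the implementation in the AI4HPC repository. The `model` argument is any callable that maps a window of past snapshots to the next snapshot, and the toy model used in the example is hypothetical.

```python
import numpy as np


def rollout(model, history, n_steps):
    """Apply `model` iteratively; each prediction is fed back as the newest input."""
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        next_field = model(np.stack(window))
        predictions.append(next_field)
        window = window[1:] + [next_field]  # drop the oldest, append the newest
    return np.stack(predictions)


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical toy model: the next field is half of the most recent field.
def toy_model(window):
    return 0.5 * window[-1]


history = [np.array([8.0, 8.0]), np.array([4.0, 4.0])]
predicted = rollout(toy_model, history, n_steps=3)
print(predicted)
```

Because every prediction is reused as input, errors can accumulate with the number of predicted steps, which is consistent with the increase in MSE and the decrease in the coefficient of determination reported above.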
To quantitatively analyze the reconstruction ability of the TR, a modal decomposition of the generated velocity fields is performed in the following
Section 4.1. Subsequently, the computational speed-up achieved with the TR for time-marching the velocity fields is compared with the LES execution times in
Section 4.2.
4.1. Modal Decomposition Analysis
To further analyze and validate the performance of the TR model for predicting dynamics, the DMD method is used to compare the LES- and TR-predicted fields. DMD is a data-driven method, extensively used for analyzing dynamical systems, especially for high-dimensional data. It has been used widely in the field of CFD and turbulent dynamics to extract coherent structures and understand their temporal evolution [
40]. DMD decomposes the behavior of the system into modes and associated frequencies, which allows the evolution of the system to be interpreted in terms of the dominant modes. In particular, for dynamical systems, comparing the temporal behavior of the LES- and TR-based DMD modes provides an assessment of whether the predictions generated by the TR are stable and remain close to the expected behavior. Here, the modal information extracted from DMD is exploited to measure the accuracy of the TR model with respect to the baseline LES. The open-source Python package PyDMD [
41,
42] is used to extract the modes and their associated eigenvalues. The DMD modes
and eigenvalues
from the LES and TR fields are compared. Furthermore, the relative Frobenius norm error of the discrepancy between LES and TR fields, given by
is used to quantify the agreement between the DMD modes. The
mode dynamics for mode
r at time
is also shown to compare the temporal coefficients that explain the time evolution of each mode.
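For illustration, the quantities compared here (modes, eigenvalues, mode dynamics, and a relative Frobenius norm error) can be computed with a compact, SVD-based exact DMD written directly in NumPy. This sketch is a stand-in for PyDMD, which is the package used in this work, and it is not the analysis script. The function names, the rank truncation, and the toy data are assumptions. Note that DMD modes are defined only up to a complex scaling, so a mode-by-mode comparison generally requires a consistent normalization.

```python
import numpy as np


def dmd(snapshots, rank):
    """Exact DMD of a snapshot matrix of shape (n_features, n_times)."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Reduced linear operator: U^H Y V S^-1 (division scales the columns by 1/s).
    A_tilde = (U.conj().T @ Y @ Vh.conj().T) / s
    eigs, W = np.linalg.eig(A_tilde)
    modes = ((Y @ Vh.conj().T) / s) @ W
    # Mode amplitudes from a least-squares fit to the first snapshot.
    b = np.linalg.lstsq(modes, snapshots[:, 0], rcond=None)[0]
    # Row r holds b_r * eig_r**k for k = 0, ..., n_times - 1.
    dynamics = np.vander(eigs, snapshots.shape[1], increasing=True) * b[:, None]
    return modes, eigs, dynamics


def relative_frobenius_error(reference, approximation):
    """||reference - approximation||_F / ||reference||_F."""
    return float(np.linalg.norm(reference - approximation) / np.linalg.norm(reference))


# Toy linear system with known eigenvalues 0.9 and 0.5 (hypothetical data).
A = np.diag([0.9, 0.5])
x0 = np.array([1.0, 1.0])
data = np.stack([np.linalg.matrix_power(A, k) @ x0 for k in range(6)], axis=1)
modes, eigs, dynamics = dmd(data, rank=2)
reconstruction = (modes @ dynamics).real
print(np.sort(eigs.real), relative_frobenius_error(data, reconstruction))
```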
Figure 5 shows a contour plot of the mode shapes derived from the DMD of the LES and TR fields at
. It can be seen that the TR modes are in close qualitative agreement with the LES modes. The TR-predicted contours are less smooth than the LES contours, but the discrepancies are minimal and probably arise from the smallest flow structures, which are not captured by the TR. To further analyze the differences, a quantitative analysis is provided in
Table 1 for the dominant mode, where
,
, and
are compared for the LES and TR at different values of
. It is observed that for
,
is obtained, which shows the excellent reconstruction ability of the TR model. Close agreement between the modes signifies that the dominant flow features, which carry the largest share of the energy content of the system, are accurately captured by the TR. For
, the model starts to show worse performance and
gives unacceptable values. In terms of
and
, the values are in close agreement until
. The agreement in the mode dynamics suggests that the dynamic evolution of the velocity field predicted by the TR agrees closely with the LES fields, thus validating its use for making temporal predictions.
4.2. Computational Time per Snapshot and Memory Analysis
In this subsection, a comparison of the computational time and memory usage during inference for the TR and the LES time-marching step with
is provided. While the training and inference of the TR model are performed on the JURECA system [43], the LES solver uses the HAWK system (
https://www.hlrs.de/de/loesungen/systeme/hpe-apollo-hawk, accessed on 25 September 2024). Both systems feature two 64-core AMD EPYC 7742 Central Processing Units (CPUs) per node (
https://www.amd.com/en/products/cpu/amd-epyc-7742, accessed on 25 September 2024). The HAWK system consists of 4096 compute nodes. The JURECA-DC used in this work contains 192 accelerated nodes, each with four NVIDIA A100 GPUs. The nodes are interconnected via two InfiniBand HDR adapters (
https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf, accessed on 25 September 2024). The system peak performances of HAWK and JURECA are 26 Petaflops and 3.54 (CPU) + 14.98 (GPU) Petaflops, respectively. The LES solver uses CPUs for the computations, while the TR model uses GPUs. The comparison is based on the statistics for a single node. The LES computations are performed on 64 nodes, while the TR inference uses one node. Assuming perfect scaling, the wall-time on a single node and the memory usage are reported in
Table 2. As can be seen, the inference of the TR is about 53 times faster than the time-marching step of the LES solver. In terms of memory consumption, the TR uses almost 1100 times less memory. These results show the excellent computational performance of the TR compared to the LES solver. In practice, the computational gain achieved even with
is significant for long simulation times.
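The single-node normalization used for this comparison can be written down in a few lines of Python. Under the perfect-scaling assumption, a wall-time measured on several nodes is multiplied by the number of nodes to obtain the equivalent single-node time. The numerical values below are hypothetical placeholders and are not the entries of Table 2.

```python
def single_node_time(wall_time, n_nodes):
    """Equivalent wall-time on one node, assuming perfect scaling."""
    return wall_time * n_nodes


def speed_up(reference_time, accelerated_time):
    """Ratio of the reference wall-time to the accelerated wall-time."""
    return reference_time / accelerated_time


# Hypothetical timings in seconds for one stored snapshot.
les_time_on_64_nodes = 2.0
tr_time_on_one_node = 4.0

les_time_on_one_node = single_node_time(les_time_on_64_nodes, n_nodes=64)
print(speed_up(les_time_on_one_node, tr_time_on_one_node))
```

Since real solvers rarely scale perfectly, this normalization tends to underestimate the cost the LES would actually incur on fewer nodes only if the parallel efficiency exceeds one, and to overestimate it otherwise, so the resulting ratio should be read as an estimate.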
5. Conclusions
A TR architecture was exploited to time-march velocity fields in a TBL flow problem. The model is based on an encoder–decoder configuration, where a reconstruction strategy was proposed to handle non-uniform inputs and reduce the computational complexity of the self-attention mechanism. The TR-predicted fields were analyzed with the DMD method, and a prediction error of less than was achieved for a horizon of five future TR time steps. Furthermore, a computational performance comparison between the LES and the TR revealed that significant computational savings of up to about 53 times were possible during inference, while consuming 1100 times less memory. This study provides a new strategy for achieving computational speed-ups for time-marching TBL fields with a TR architecture in a hybrid setup with a traditional CFD solver.
Ongoing work is directed towards coupling the transformer model with the LES solver in m-AIA. This involves checking physical plausibility by tracking possible physical imbalances and testing the acceleration of time-stepping operations for predicting future velocity fields in a fully coupled scenario. The current work is aimed at the design and optimization of the actuation parameters in the considered TBL problem. Therefore, future work will examine how the transformer model generalizes to different combinations of actuation parameters. However, if such a model is to be used for a generic flow problem across a wider range of Reynolds and Mach numbers (the main factors influencing the flow conditions), future extensions of the TR model will need to be checked for their generalizability across different flow conditions. More realistic setups such as airfoil simulations also have to be considered, where the angle of attack is introduced as an additional influential factor.