2.3.1. Lightweight 3D U-Net Architecture
Recent studies have achieved excellent performance by applying deep convolutional networks to the weather prediction process. Among these, research on weather variable prediction based on the U-Net model introduced by Ronneberger et al. [30] has been actively conducted. The U-Net model was originally proposed for image segmentation tasks in the biomedical field. It is an extended version of the Fully Convolutional Network (FCN) proposed by Long et al. [31], characterized by a U-shaped architecture in which the encoder and decoder are symmetric, with more feature channels present during the upsampling process. The model has no fully connected layers, allowing it to utilize the valid feature information from each convolution. This enables the use of image information at various levels, leading to superior performance in segmentation tasks, as demonstrated by its outstanding results on the 2D transmitted light dataset of the ISBI Cell Tracking Challenge 2015.
Figure 5 details the U-Net architecture. It consists mainly of the contracting path on the left, which captures context, and the expansive path on the right, which restores the captured information. In the blocks of the contracting path, convolution operations are applied, followed by downsampling with max pooling before the data are passed to the next level; at each level the number of channels doubles, allowing more global information to be captured. In the blocks of the expansive path, the feature maps from the previous stage are upsampled, and the number of channels is then halved through convolution operations. The upsampled feature maps are simultaneously concatenated with the feature maps from the corresponding level of the contracting path. Finally, after applying convolution operations to the merged feature maps, the last layer uses a 1 × 1 convolution to map the output to the desired number of classes.
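As a concrete illustration of these blocks, the following is a minimal PyTorch sketch of one contracting-path step and one expansive-path step. It follows the channel-doubling and skip-connection scheme described above, but uses padded convolutions for simplicity (the original U-Net used unpadded convolutions with cropping); the class and variable names are ours.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions with ReLU, in the style of the original U-Net blocks
    (padded here for simplicity; the original used unpadded convolutions)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Contracting-path step: convolve (doubling the channels), then halve the resolution.
down = nn.Sequential(DoubleConv(64, 128), nn.MaxPool2d(2))

# Expansive-path step: upsample (halving the channels), then fuse the skip connection.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
fuse = DoubleConv(128, 64)                # 64 upsampled + 64 skip channels after concat

x_skip = torch.randn(1, 64, 64, 64)       # feature map kept from the contracting path
x_up = up(down(x_skip))                   # (1, 64, 64, 64)
out = fuse(torch.cat([x_up, x_skip], dim=1))
print(out.shape)                          # torch.Size([1, 64, 64, 64])
```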
Although the U-Net model was originally proposed as an image segmentation architecture in the biomedical field, its ability to extract features at various levels and deliver excellent performance has led to its application in numerous studies and tasks. In the meteorological field, several studies have aimed to predict weather variables by converting the U-Net architecture into a regressor. Kwok et al. [2] utilized a U-Net-based architecture to predict weather variables using the dataset from the geostationary weather satellite Meteosat, as part of the Weather4cast project. Inspired by the fourth-place solution in the Traffic4cast 2020 competition, which shares similarities with Weather4cast, they implemented a Variational U-Net structure. To achieve this, they applied a Variational Autoencoder (VAE)-style configuration at the bottleneck of the U-Net, i.e., at the end of the encoder and the beginning of the decoder. At the end of the encoder, the data are reduced to a vector of size 512, representing the mean and standard deviation as interpreted by the VAE. This extracted vector is then reconstructed into an image under the assumption that the latent variable follows a Gaussian distribution and is passed through the decoder. Another study, by Kaparakis et al. [32], proposed the Weather Fusion UNet (WF-UNet), a modified model based on the U-Net architecture, for the task of precipitation nowcasting. Unlike the traditional U-Net model, the WF-UNet employs 3D convolutional layers, which allow the model to extract not only spatial information from a single radar image but also temporal information from previous timesteps. Additionally, unlike other studies, this approach uses both precipitation and wind speed radar images as input to train individual U-Net models. The features extracted from each model are then combined, and further convolutional operations are performed to derive the final prediction.
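To make the VAE-style bottleneck concrete, the following is a minimal sketch of the reparameterization it relies on: the encoder output is projected to a 512-dimensional mean and log-variance, and a latent vector is sampled under the Gaussian assumption before being handed to the decoder. The module and dimension names are illustrative, not taken from the cited implementation.

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """VAE-style bottleneck: encode to a 512-d Gaussian latent, then sample."""
    def __init__(self, feat_dim, latent_dim=512):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)          # reparameterization trick
        z = mu + eps * std                   # latent sample passed on to the decoder
        return z, mu, logvar

h = torch.randn(1, 2048)                     # flattened encoder output (illustrative size)
z, mu, logvar = VAEBottleneck(2048)(h)
print(z.shape)                               # torch.Size([1, 512])
```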
Although many studies in meteorology have explored U-Net-based models, such as those by Kim [33] and Fernandez [34], additional computations or modules have often been added to enhance accuracy, increasing the complexity of the models. However, since our goal is to substitute certain modules of the existing NWP model with a deep learning-based solver, using a heavy convolutional network would double the execution time, defeating the purpose of the substitution. This study aims to verify the feasibility of substituting specific parts of the NWP internal process with deep learning models. We propose a lightweight 3D U-Net model that can substitute the BiCGStab method, a hotspot in Low GloSea6, while achieving computational performance similar to the original. Low GloSea6 can be run on KMA servers as well as small and medium-sized servers.
To achieve this, we introduce the CBAM-based Half-UNet (CH-UNet), a modification of the Half-UNet proposed by Lu et al. [35], which simplifies the decoder structure of the U-Net to reduce model complexity. As shown in Figure 6, the Half-UNet of Lu et al. [35] simplifies the decoding process of the original U-Net by combining the full-scale computations into a single step. Additionally, although the number of channels doubles with each downscaling in the original encoding process, Half-UNet unifies all levels to the same number of channels to simplify the network further. Furthermore, it introduces a Ghost module to generate the same feature maps at a lower cost. The proposed CH-UNet, shown in Figure 7, is inspired by the lightweight Half-UNet model. It has a similar overall architecture, with all convolution operations performed as 3D convolutions to match the characteristics of the data. In traditional U-Net or Half-UNet architectures, the initial number of channels in the encoding blocks is 64 or 32. In the case of U-Net, if the initial channels are 64 and the model undergoes four levels of downscaling, the bottleneck layer has 1024 channels. Although Half-UNet uses the same number of channels at every level, at 64 channels it remains too heavy for application in NWP. Therefore, as shown in Figure 7, we used only three levels and drastically reduced the number of channels, setting the initial number of channels to 8. Following the Half-UNet architecture, we kept the number of channels consistent across all levels. Additionally, we adopted the decoding process of Half-UNet, upsampling the feature maps from all levels to the original image size and combining them, thereby reducing computational costs. The Ghost module was not used.
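A minimal sketch of this encoder and simplified decoder is given below: three levels at a constant width of 8 channels, 3D convolutions throughout, and a Half-UNet-style decoding step that upsamples every level to full resolution and fuses them. The kernel size, the use of summation for the fusion (as in Half-UNet), and all names are our assumptions; CBAM and the final 1 × 1 convolution, described next, would follow the fused output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3d_block(ch_in, ch_out):
    """3x3x3 convolution + ReLU; the kernel size is an illustrative choice."""
    return nn.Sequential(nn.Conv3d(ch_in, ch_out, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

class CHUNetSkeleton(nn.Module):
    """Three encoder levels with a constant channel width of 8; the decoder
    upsamples every level to full resolution and sums them (Half-UNet style)."""
    def __init__(self, in_ch=1, width=8, levels=3):
        super().__init__()
        self.stem = conv3d_block(in_ch, width)
        self.encoders = nn.ModuleList(conv3d_block(width, width) for _ in range(levels - 1))
        self.pool = nn.MaxPool3d(2)

    def forward(self, x):
        feats = [self.stem(x)]
        for enc in self.encoders:
            feats.append(enc(self.pool(feats[-1])))
        full_size = feats[0].shape[2:]
        # Half-UNet-style decoding: bring all levels to full resolution and fuse.
        fused = sum(F.interpolate(f, size=full_size, mode="trilinear",
                                  align_corners=False) for f in feats)
        return fused              # CBAM and a 1x1x1 convolution would follow

x = torch.randn(1, 1, 16, 32, 32)         # (batch, channel, depth, height, width)
print(CHUNetSkeleton()(x).shape)          # torch.Size([1, 8, 16, 32, 32])
```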
After this simplified decoding process, we added a Convolutional Block Attention Module (CBAM) introduced by Woo et al. [36]. As shown in Figure 8, CBAM is a lightweight attention module that can be attached to convolutional blocks; it applies a Channel Attention Module and a Spatial Attention Module in sequence to learn which channels and which positions to focus on, improving accuracy at minimal cost. Therefore, at the end of the decoding process, we added CBAM and then applied a 1 × 1 convolution operation, whose output was used directly for the regression task.
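The following is a compact sketch of CBAM adapted to the 3D feature maps of CH-UNet: channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise average and max maps. The reduction ratio (set to 2 so the small channel count stays valid), the spatial kernel size, and the 3D adaptation itself are assumptions on our part.

```python
import torch
import torch.nn as nn

class CBAM3D(nn.Module):
    """CBAM on 3D features: channel attention followed by spatial attention."""
    def __init__(self, ch, reduction=2, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(            # shared MLP for channel attention
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch))
        self.spatial = nn.Conv3d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c = x.shape[:2]
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3, 4)))
        mx = self.mlp(x.amax(dim=(2, 3, 4)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1, 1)
        # Spatial attention: convolution over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

f = torch.randn(1, 8, 16, 32, 32)
print(CBAM3D(8)(f).shape)                # torch.Size([1, 8, 16, 32, 32])
```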
2.3.2. Deep Learning Utilization Method for NWP Models
The trained deep learning model is designed solely to substitute the BiCGStab computations. Therefore, to apply it to the NWP model, it must be adapted to the NWP execution environment. We converted the model to run in the Fortran 90 environment of the UM model of Low GloSea6.
In this study, Python 3.8.19 and PyTorch 2.2.2 were used to train the various deep learning models. To integrate the trained model with Low GloSea6, we used the FTorch library provided by Cambridge-ICCS [37], which allowed us to load the trained deep learning model, written in Python, into Low GloSea6 within a Fortran environment. FTorch enables models created and saved with Python-based PyTorch to be called directly from Fortran code through libtorch, the Torch C++ interface, without needing to invoke the Python executable.
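Before the Fortran side can load a model, it must be serialized in a form libtorch can read. A minimal sketch of this export step is shown below, tracing a stand-in module to TorchScript (FTorch's documentation describes this workflow; the module and file name here are illustrative):

```python
import torch

# Stand-in for the trained CH-UNet; any trained nn.Module would take its place.
model = torch.nn.Conv3d(1, 1, kernel_size=3, padding=1)
model.eval()

# Trace with a dummy input of the deployed shape: (batch, channel, depth, height, width).
dummy = torch.randn(1, 1, 16, 32, 32)
scripted = torch.jit.trace(model, dummy)

# The resulting TorchScript file is what the Fortran side loads through libtorch.
scripted.save("ch_unet.pt")
```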
We followed the FTorch documentation precisely to convert the deep network model and integrate it into Fortran. Combining it with Low GloSea6 required the following additional steps. First, as shown in Figure 1, Low GloSea6 integrates hundreds of Fortran code modules during the “KMA_LINUX” stage, which are then compiled using a make build process. This build produces the um-atmos.exe file executed in the “UM_MODEL” stage. We therefore had to write code that calls the FTorch-converted model and include it in the “KMA_LINUX” stage. However, FTorch requires gfortran version 11, while Low GloSea6 uses version 9. To address this, we pre-built the FTorch library to obtain the necessary mod files. Additionally, the “torch_tensor_from_array” function provided by FTorch supports tensors of up to 4 dimensions, but our 3D convolution operations require a 5-dimensional tensor (batch, channel, depth, height, width). We therefore modified “torch_tensor_from_array” to support 5-dimensional tensors before the build. Finally, to use the pre-built FTorch mod files during the make build of Low GloSea6, we specified their path in the ROSE make config file using a flag option. This allowed us to successfully integrate the deep network model, originally written in Python-based PyTorch, into the Fortran-based NWP model, Low GloSea6.
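The 5-dimensional layout that motivated the change to “torch_tensor_from_array” mirrors what PyTorch itself expects for batched 3D convolutions, as the following illustrative check shows:

```python
import torch

conv = torch.nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

# A batched 3D-convolution input carries five dimensions:
# (batch, channel, depth, height, width); hence the FTorch-side modification.
x = torch.randn(1, 1, 16, 32, 32)
print(conv(x).shape)                 # torch.Size([1, 8, 16, 32, 32])
```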
As mentioned in Section 2.1.3, the BiCGStab method is called four times in the UM model of Low GloSea6 within the dual loop structure of ENDGame. We analyzed the execution time of each call, as shown in Table 7. During one timestep, BiCGStab was called a total of 4609 times, and we measured and averaged the CPU time required for the BiCGStab computations in each loop. The most time-consuming case was the first outer loop with the first inner loop, taking an average of 1.9284 s, while the second outer loop with the second inner loop required the least CPU time, at 0.1548 s. As shown in Table 7, the first outer and first inner loops took more than twice as long as the other loops. We concluded that applying the deep network in the other cases, rather than in the first outer and first inner loops, could increase execution times due to the complexity of the network, potentially negating the intended benefits of replacing BiCGStab.
Therefore, as illustrated in Figure 9, we integrated the deep network model converted with the FTorch library into the first outer and first inner loops. After performing computations with the deep network model, we ran the BiCGStab computations again. This additional step was necessary because the deep network model alone did not fully satisfy the physical constraints of the numerical computations, leading to errors in other modules; the traditional numerically based convergence method therefore had to be retained. Nevertheless, in line with the goals of this study, we successfully integrated the deep learning model into Low GloSea6, the NWP model currently used by the KMA, applying it to the most CPU-intensive part, namely the first outer and first inner loops of BiCGStab, without significant performance degradation. By following it with further BiCGStab computations, we ensured that the physical constraints of the NWP model were faithfully maintained.
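This hybrid pattern can be sketched in Python as follows, with SciPy's BiCGStab standing in for the UM solver and a placeholder function standing in for the CH-UNet forward pass. Whether the network output seeds the subsequent iteration as an initial guess is our assumption about the coupling; the sketch adopts that interpretation.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import bicgstab

# Stand-in linear system Ax = b (the UM's actual solve is far larger).
n = 1000
A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

def net_predict(b):
    """Placeholder for the CH-UNet forward pass; returns an approximate solution."""
    return b / 4.0                       # crude approximation of A^{-1} b

# Hybrid step: seed BiCGStab with the network output, then iterate to convergence,
# so the physical constraints enforced by the numerical solver are preserved.
x0 = net_predict(b)
x, info = bicgstab(A, b, x0=x0)
print(info, np.linalg.norm(A @ x - b))   # info == 0 indicates convergence
```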