1.1. Background
Rainfall is a natural phenomenon that varies in both space and time. This spatiotemporal variability, together with its chaotic dynamics, makes future rainfall difficult to predict. Moreover, rainfall data commonly contain missing values, because the data are collected from different gauging stations spread over a wide geographical area. Equipment malfunction, relocation of gauging stations, network interruptions, natural hazards, and emergencies such as pandemics may all interrupt the continuous measurement of rainfall [1].
Ignoring the stations with missing data leads to information loss, because meteorological stations are strongly spatially correlated. On the other hand, modeling rainfall data with missing values reduces the accuracy of rainfall models because the time sequence is discontinuous [2]. Hence, identifying a suitable mechanism to address this issue is a crucial step in effective rainfall forecasting.
When the missing values occur at random, modeling them to estimate their values is not required. However, if the missing values are not random, the missingness cannot be ignored and must be modeled and predicted [3]. This is known as missing value imputation, and it is carried out as a data preprocessing step. Much research has been conducted using different types of missing value imputation methods. However, studies on imputing missing values in spatiotemporal data (including meteorological records) are rare [4,5,6,7]. This study presents a novel and more efficient spatiotemporal methodology to impute missing values in rainfall data when only a few rain gauging stations are available. The proposed methodology was tested using rainfall data collected from six neighbouring rain gauging stations (considered as reference stations) of Ratnapura, which was identified as the target rain gauging station in our main study of rainfall forecasting. We chose Ratnapura because it is a flood-prone area in Sri Lanka, frequently exposed to heavy rainfall throughout the year [8,9]. In 2003, flash flood events in Ratnapura caused financial damage of LKR 50.6 million [9].
1.2. Related Work
Much past research has applied traditional statistical and/or machine learning techniques to missing value imputation of meteorological data. Some studies captured only the spatial variation (e.g., [10,11,12,13]) using spatial interpolation methods such as the inverse distance method (ID), normal ratio method (NR), geographical coordinates method (GC), and spatial kriging, also known as Gaussian process regression, while others focused only on the variation over time (e.g., [14,15,16]). The most common temporal method is the autoregressive integrated moving average (ARIMA) model.
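To make the two simplest spatial methods named above concrete, the following is a minimal sketch of the inverse distance and normal ratio estimators in their common textbook forms. It is not taken from the cited studies. The function names, the planar coordinates, and the distance exponent of 2 are assumptions made for illustration; real station coordinates in latitude and longitude would need a suitable distance measure.

```python
import math


def inverse_distance_estimate(target_xy, stations, power=2.0):
    """Estimate a missing value at target_xy from neighbouring stations.

    stations: list of ((x, y), observed_value) pairs in planar coordinates
    (an assumption of this sketch). Weights are 1 / distance**power.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in stations:
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0.0:
            # The target coincides with a station: use its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def normal_ratio_estimate(target_normal, stations):
    """Estimate a missing value with the normal ratio method.

    target_normal: long-term normal rainfall at the target station.
    stations: list of (station_normal, observed_value) pairs.
    Each observation is rescaled by target_normal / station_normal,
    and the rescaled values are averaged.
    """
    scaled = [target_normal / normal * observed for normal, observed in stations]
    return sum(scaled) / len(scaled)
```

Both estimators use only the values observed at neighbouring stations at the same time step, which is why they capture spatial but not temporal variation.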
Some studies (e.g., [17]) pointed out that predictions obtained from spatiotemporal methods are more accurate than purely spatial predictions. This is mainly because spatiotemporal interpolation can be applied to geo-referenced positions over a space–time grid. Some researchers have modified the traditional spatial kriging model to capture the spatiotemporal behavior of the data [4,5].
There have been several spatiotemporal kriging-based studies conducted with weather variables. The study in [18] developed a spatiotemporal kriging model to predict wind speed and compared the results with the autoregressive moving average (ARMA) and Monte Carlo methods. Spatiotemporal kriging gave a better fit to the data.
Moreover, the authors noted several advantages of kriging models over other regression methods: they provide estimates together with the mean squared error of the estimation (kriging variance), require no distributional assumption about the data, can use the complete set of spatial and temporal information, and can estimate the value of a variable over a continuous spatial field from a limited number of sampled data points.
Due to these advantages of kriging, recent studies have integrated kriging with conventional methods (e.g., regression and time series models) to obtain more accurate predictions for the missing observations. Nevertheless, only a few studies using such hybrid approaches were found. The study [17] applied a spatiotemporal kriging method to model total monthly rainfall data from 269 rain gauge stations, of which nearly 80% had missing data. The deterministic trend component was estimated with multiple linear regression on several covariates, including latitude, longitude, and the quadratic effects of the corresponding longitude and latitude pairs. Since this trend captured only 29% of the temporal variability in rainfall, the residuals produced by the model were analyzed using a generalized product-sum spatiotemporal variogram, and the interpolation was carried out with spatiotemporal kriging. The final predictor was obtained by combining the trend estimate with the kriging-interpolated residual. In another study [5], the spatial association of atmospheric temperature data was modeled with universal kriging, and the temporal variability was captured with autoregressive (AR) techniques. The two were then combined spatiotemporally to predict k-steps (days) ahead temperature in a given spatial domain. A comparison of the forecasted values with those obtained from an ARIMA model indicated that the hybrid model performed better than the ARIMA model.
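The trend-plus-residual structure described for [17] can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited study: the trend is a least-squares fit on longitude, latitude, and their quadratic terms, and inverse distance weighting is swapped in for spatiotemporal kriging of the residuals to keep the example short. All function names are hypothetical.

```python
import numpy as np


def design_matrix(lon, lat):
    """Columns: intercept, lon, lat, lon^2, lat^2, lon*lat (assumed covariates)."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    return np.column_stack([np.ones_like(lon), lon, lat, lon**2, lat**2, lon * lat])


def fit_trend(lon, lat, rain):
    """Estimate the deterministic trend by ordinary least squares."""
    X = design_matrix(lon, lat)
    beta, *_ = np.linalg.lstsq(X, np.asarray(rain, dtype=float), rcond=None)
    return beta


def idw(lon, lat, values, lon0, lat0, power=2.0):
    """Inverse distance weighting, used here as a stand-in for kriging."""
    values = np.asarray(values, dtype=float)
    dist = np.hypot(np.asarray(lon, dtype=float) - lon0,
                    np.asarray(lat, dtype=float) - lat0)
    if np.any(dist == 0.0):
        # The target coincides with a station: return its residual directly.
        return float(values[dist == 0.0][0])
    weights = 1.0 / dist**power
    return float(np.sum(weights * values) / np.sum(weights))


def hybrid_predict(lon, lat, rain, lon0, lat0):
    """Final predictor = trend estimate + interpolated residual."""
    beta = fit_trend(lon, lat, rain)
    residuals = np.asarray(rain, dtype=float) - design_matrix(lon, lat) @ beta
    trend0 = (design_matrix([lon0], [lat0]) @ beta)[0]
    return float(trend0 + idw(lon, lat, residuals, lon0, lat0))
```

Calling `hybrid_predict` with the station coordinates, their observed rainfall, and a target location returns the combined estimate. A kriging implementation would replace `idw` with a predictor built from a fitted variogram and would also return the kriging variance.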
Recent advances in machine learning (ML) and deep learning (DL) can be used effectively to predict missing values in meteorological data. Both approaches handle large-scale problems well and provide a flexible architecture for capturing spatial and temporal features [1,2]. In addition, they do not rely on hand-crafted feature engineering or prior assumptions about the input data. Moreover, the suitability of deep neural networks for large volumes of spatiotemporal data, compared with statistical and other classical machine learning techniques, has been recognized.
A convolutional bidirectional long short-term memory (LSTM) model was applied to capture spatial and temporal patterns in traffic flow data in the study [2]. This model outperformed state-of-the-art missing data imputation models. Some studies, including the study carried out in [19], demonstrated the effectiveness and efficiency of deep learning (DL) methods compared to the ARIMA model and the back propagation neural network model. A convolutional neural network (CNN)-based DL model was proposed in [14] for imputing missing values in the weather data of an individual weather station on a temporal basis. The performance of the model was evaluated using various stations nearest to the stations without missing values. This study used five optimizers (RMSprop, Adam, Nadam, stochastic gradient descent (SGD), and Adagrad) and found that the SGD optimizer provided the most accurate results in predicting missing values.
Past studies reveal that the application of spatial kriging and DL methodology to missing value imputation of weather data has promising results. However, studies that impute missing values in spatiotemporal weather data using hybrid models are extremely rare, especially when only a small number of weather gauging stations is available. The existing research that applies hybrid models [5,17,20] used more than 50 weather stations when interpolating spatiotemporal missing data. Most importantly, no study so far has applied a spatiotemporal hybrid model to impute missing values in rainfall data at the Ratnapura gauging station. Therefore, to reduce this gap in the literature and to utilize the potential of spatial kriging, machine learning, and deep learning techniques in dealing with missing values, this study proposes a hybrid model that integrates them. Our aim is to verify its suitability when the periods with missing values differ between rainfall gauging stations and are discontinuous. We also focus on the scenario where data are available for only a few rainfall stations.
The rest of this paper is organized as follows. The next section describes the adopted missing value imputation methods and the development of the new approach. The results and discussion section follows, and the conclusion closes the paper.