A Comparison of Different Methods for Rainfall Imputation: A Galician Case Study

Vidal-Paz, José; Rodríguez-Gómez, Benigno Antonio; Orosa, José A.

doi:10.3390/app132212260


by José Vidal-Paz ¹,*, Benigno Antonio Rodríguez-Gómez ² and José A. Orosa ³

¹ Department of Computer Engineering, Facultad de Informática, Universidade da Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain

² Department of Industrial Engineering, Universidade da Coruña, Campus de Esteiro, 15403 Ferrol, Spain

³ Department of N.S. and M.E., Universidade da Coruña, Paseo de Ronda, 51, 15011 A Coruña, Spain

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(22), 12260; https://doi.org/10.3390/app132212260

Submission received: 7 October 2023 / Revised: 4 November 2023 / Accepted: 11 November 2023 / Published: 13 November 2023

(This article belongs to the Section Environmental Sciences)


Abstract:

With the ultimate goal of developing models that use environmental variables, a GIS-based application restricted to the region of Galicia (Spain) is being developed. Ten-minute data for six meteorological variables were collected from 150 stations of the MeteoGalicia network over a period of 18 years, but the time series are not complete. To estimate missing rainfall data, four imputation methods were evaluated in this study: missForest, MICE, Amelia II, and inverse distance weighting (IDW). Crossvalidation results show that precipitation is out of phase between stations because of their geographical locations, so the imputation can be improved by shifting the time series. In addition, missForest gave better results in the imputation of this meteorological variable than MICE, Amelia II, or IDW.

Keywords:

imputation; missing data; rainfall

1. Introduction

The analysis of time series data depends in the first stage on the availability of reliable data sources; however, it is common that, even coming from official organizations, the series presents erroneous data that must be eliminated, or there are missing data. In both cases, it is necessary to apply imputation methods that allow us to estimate coherent values to complete the series.

In particular, rainfall is an important meteorological variable in civil engineering, hydrology, and environmental science applications, and for certain research works it is necessary to have complete ten-minute data sets. More specifically, the estimation of flows in hydrographic basins is essential for the design of sanitation networks or the construction of dams. The IDF (intensity–duration–frequency) curves used in Spain date back to the late 1970s, although the isolines were updated in 1987 [1] using data from the 21 rain gauge stations available at that time. There is a need for more up-to-date regional rainfall ten-minute data that allow us to estimate the IDF curves [2] and adjust them to the Gumbel and GEV distributions, develop synthetic hyetograms, etc.

Early methods that were used to fill in time series data simply averaged the precipitation from neighbouring stations to obtain an estimate of the missing data at the target station, such as the normal ratio method [3] or the inverse distance weighting (IDW) method.

Subsequent studies interpolate precipitation using different methods, such as multiple discriminant analysis [4], neural networks [5], regression [6], spatial interpolation [7,8,9], and spatial–temporal modelling [10,11]. In all cases, these methods fill in daily or monthly rainfall data; we have found no study that attempts to complete ten-minute data, possibly because of the difficulty of handling the high variability of rainfall on this time scale.

On the other hand, the previous imputation methods can be considered as single imputations, since they replace each missing data point with a single estimated data point. This can lead to a loss of variability in the samples. To solve this problem, Rubin [12] proposed the multiple imputation method. What is more, previous research works [13,14,15] demonstrated that multiple imputations were more appropriate when the data pattern was arbitrary, as is the case with rainfall. Consequently, the multiple imputation method is of interest for this case study. In this method, each missing data item can be replaced by a set of plausible values, thereby resulting in several complete data sets. We can find several methods that implement the multiple imputation procedure, such as MICE, which uses the fully conditional specification approach (FCS), or Amelia, which uses a joint modelling (JM) approach; in [16], a discussion about both approaches can be found.

In this study, four imputation methods are compared for filling in the rainfall time series of multiple meteorological stations in Galicia, taking as reference the data from their nearest neighbouring stations. Methods implemented in R packages were chosen: missForest [17], which is based on the Random Forest algorithm [18]; MICE (Multiple Imputation by Chained Equations [19]); and Amelia II [20], which combines the classic EM algorithm with a bootstrap approach to impute missing values. IDW was included as a classic imputation method against which to compare the others. A crossvalidation methodology is proposed to evaluate the estimates provided by these imputation methods; it consists of creating incomplete time series from real data loss patterns for each month and imputing the missing data under different circumstances.

The main aim of this study is to determine the best imputation method for filling in the rainfall data series of the network of meteorological stations in Galicia. The results show that missForest gave better results in the imputation of this meteorological variable than MICE, Amelia II, or IDW.

2. Materials and Methods

2.1. Study Region and Data

Galicia is a region located in the northwest of Spain, ranging from 41°49′ N to 43°47′ N in latitude and from 6°42′ W to 9°18′ W in longitude, with a total area of 29,574 km² (Figure 1). Its relief is low on the coast and higher in the interior, with high plains in the north and mountains and depressions in the southeast (Figure 2).

As described in [21], the length of its coastline is 2555 km, which is divided into 1659 km of coastal perimeter, 432 km of islands, and 464 km of marshes. The coastline is abrupt and jagged, with a multitude of inlets and outlets, as well as small islands. The estuaries are one of the distinctive features of Galicia: they are inlets of the sea that were formed by the flooding of river valleys when the land level descended. The largest urban centres are located next to the estuaries, which are of great fishing importance, making the Galician coast one of the most important fishing areas in the world.

The interior is made up of low, blunt mountains with a multitude of short-distance rivers with gentle slopes that sometimes give way to rugged slopes. The highest mountains are in the eastern limit of the region: these are Peña Trevinca, with an altitude of 2127 m, and Cabeza de Manzaneda, with an altitude of 1178 m. The central area of Galicia is made up of various lower mountain ranges, between which opens the Lugo plateau with an elevation of about 500–600 m and the Miño valley to the south, with a maximum altitude of 1177 m.

As a peninsular oceanic region, Galicia is among the rainiest in Western Europe, but its rainfall is irregularly distributed. The region passes progressively from the pure oceanic domain, which produces a homogeneous annual distribution of rainfall, to suboceanic climatic margins and even areas with a Mediterranean tendency. Because it lies at the southern end of the usual route of the disturbances associated with the west winds, its annual rainfall should be lower than that actually recorded, but the orography plays a decisive role: first, it intensifies precipitation as air masses rise over the relief, creating areas of maximum catchment on windward slopes and rain shadows to leeward; second, it conditions the uneven distribution of rainfall. For these reasons, the greatest amount of rainfall is collected in the Atlantic areas, while inland areas receive less because of the discharge at the first orographic barriers. The distribution of rainfall is quite homogeneous in the north because the valleys and rivers barely modify it; in the south, however, it is very heterogeneous because of the great discontinuities of the terrain.

The average rainfall in Galicia is 1180 mm, ranging from 500–600 mm in the south to maximums above 1800–2000 mm in the coastal mountains. Of the average total, 337 mm (28%) are collected in winter, 280 mm (24%) in spring, 156 mm (13%) in summer, and 407 mm (35%) in autumn, which is the true rainy season. It is common for springs and summers to be less rainy than winters and autumns: this is the seasonality of rainfall in this region, which increases from north to south. The increase in rainfall in autumn–winter responds to the predominance of an intense zonal circulation with associated frontal systems that sweep the territory to cause abundant rainfall, especially in the north of the region, because it is where the tail fronts pass with associated storms. In the south, the rains are more seasonal, and they are more related to the advection of highly unstable polar maritime cyclonic air masses from the northwest, as well as maritime tropical air masses from the southwest.

Floods are a major problem that can occasionally lead to catastrophic events; they can be caused by a high concentration of rainfall within short periods of time or by prolonged sequences of rainy days. These events usually occur in particularly susceptible geological areas (impermeable substrates), such as some tectonic depressions in the interior of Galicia or on the Atlantic coast. The study of the daily rainfall series makes it possible to detail the relationship between the duration and the intensity of rainfall. Short rainy periods predominate in spring and summer, while periods of more than 10 days of rain are more characteristic of autumn and winter.

The regional government created the MeteoGalicia station network in January 2000 and publishes its data at https://www.meteogalicia.es (accessed on 1 November 2023). Over the years, the network has grown from the initial 14 stations to approximately 150 stations in 2013 (Figure 1). During this time, some stations have been dropped, others have changed their location, and, in other cases, different sensors have been added. Ten-minute data of temperature, humidity, solar radiation, wind, and rainfall were downloaded from 148 stations, covering 1 January 2000 to 31 December 2018, and stored in a PostgreSQL database of more than 400 million records, of which more than 90 million correspond to rainfall. Table 1 details the amount of data downloaded for each of these meteorological variables. All stations are equipped with rain gauges to collect rainfall data; however, some of them may lack the sensors needed to collect data on other variables.

For this study, the rainfall data between 1 January 2000 and 31 December 2006 were removed because the station network was very sparse and the number of missing data items was very high, since the stations were in test mode during their first years. Only 62 stations were selected, namely those with a registration date before 1 January 2007; the remaining 86 stations were rejected because their time series did not reach 10 years, which is the minimum needed to create the intensity–duration–frequency curves. As can be seen in Figure 3, the chosen stations have a homogeneous geographical distribution and cover the entire region.

Erroneous and missing data are both treated as data to be imputed, since we do not have the information needed to correct them; outliers were also discarded. For this reason, data preprocessing was necessary before proceeding with the study. The data were then analyzed, and many days with missing data were observed (Figure 4). In that figure, each row represents one of the 365 days of the year, each column represents a station, the blue cells of the matrix represent the days on which all 144 rainfall data items were available, and the red cells represent the days on which at least some missing or erroneous data appeared. It can also be seen how the network of stations has grown from around 70 stations in 2007 to 150 stations in 2018, and that there were many missing data in the first years of operation of the stations. Finally, blank columns in the matrix correspond to stations that have been dropped over these years.

By means of algorithms developed ad hoc for the GIS tool, it has been possible to carry out a detailed study observing in which stations a greater number of missing data appeared. Figure 5 indicates, for each of the selected stations, the ratio of missing/complete data during the year 2007 as an example to describe the process.

Another observed characteristic is that these missing data do not usually appear isolated, but rather appear in the form of blocks as shown in Figure 6. This is because when a sensor fault occurs, it may take some time for the fault to be corrected. For example, at the Sambreixo station shown in Figure 6g, the largest block has 1584 missing data items, which corresponds to 11 consecutive days without collecting rainfall data at that station; at the same station, you can see another block of 1424 missing data items that corresponds to almost 10 days. Other stations, such as Marco da Curra in Figure 6a, Mabegondo in Figure 6c or Punta Candieira in Figure 6e, contain a block of approximately 1000 missing data items, which corresponds to one week without data.
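The block structure described above can be measured with a simple run-length scan. The following is a minimal sketch, not the authors' GIS tooling: it assumes a ten-minute series held as a Python list in which `None` marks a missing value, and the function name `missing_blocks` and the sample series are hypothetical.

```python
from itertools import groupby

SAMPLES_PER_DAY = 144  # one value every ten minutes


def missing_blocks(series):
    """Return (start_index, length) for each run of consecutive None values."""
    blocks = []
    index = 0
    for is_missing, run in groupby(series, key=lambda value: value is None):
        length = sum(1 for _ in run)
        if is_missing:
            blocks.append((index, length))
        index += length
    return blocks


# Hypothetical series: three observed values, a gap of 1584 items, two observed values.
series = [0.0, 0.2, 0.0] + [None] * 1584 + [0.1, 0.0]
blocks = missing_blocks(series)
longest_start, longest_length = max(blocks, key=lambda block: block[1])
longest_days = longest_length / SAMPLES_PER_DAY  # 1584 / 144 = 11 days
```

Dividing a block length by 144 converts it to days, which is how a block of 1584 items corresponds to 11 consecutive days without data.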

Based on these results, it was decided to carry out a station-by-station analysis, thus noticing in which months missing data appeared and in which months data were complete. Figure 7 shows the monthly distribution of missing data for the stations indicated in Figure 6.

2.2. Methods

2.2.1. Imputation Methods

missForest

The missForest algorithm uses an iterative imputation scheme: it trains a random forest (RF) on the observed values in a first step, then predicts the missing values, and proceeds iteratively. Stekhoven [17] chose RF because it can handle mixed-type data and is known to perform very well under barren conditions such as high dimensions, complex interactions, and nonlinear data structures. Unlike the plain RF algorithm, missForest directly predicts the missing values using an RF trained on the observed parts of the data set.

Breiman [18] defines RF as a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges to a limit as the number of trees in the forest becomes large.

In R, the algorithm was run with the variablewise argument set to TRUE; the data matrix was supplied, and the number of iterations varied depending on the case.

MICE

In MICE, each variable has its own imputation model. More specifically, for continuous data the available methods include pmm (predictive mean matching), norm (Bayesian linear regression), norm.nob (non-Bayesian linear regression), mean (unconditional mean imputation), and 2L.norm (two-level linear model) [16]. Predictive mean matching was chosen because, compared with other imputation methods, it generally imputes fewer implausible values (e.g., negative rainfall) and accounts for heteroscedastic data more appropriately [22,23].

In R, the algorithm was run with $m = 5$ as the number of multiple imputations, $\mathit{maxit} = 50$ as the maximum number of iterations, and $\mathit{seed} = 500$ for the random number generator; it was not necessary to use a predictor matrix because the selection process was conducted previously, as explained below. To pool the five estimated values, their mean was calculated.
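To illustrate why predictive mean matching cannot produce negative rainfall, the following is a deliberately simplified sketch with a single predictor station. It is not the implementation in the mice package, which also draws the regression parameters from their posterior distribution; here a plain least-squares line is used instead. The function names, the number of donors, and the sample values are all hypothetical, and the seed of 500 only mirrors the value quoted above (Python's generator differs from R's).

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def pmm_impute(target, predictor, donors=3, rng=None):
    """Fill None entries of target with observed values borrowed from donors."""
    rng = rng or random.Random(500)
    observed = [(p, t) for p, t in zip(predictor, target) if t is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed target values")
    intercept, slope = fit_line([p for p, _ in observed], [t for _, t in observed])
    # Pair each observed target value with its own regression prediction.
    predicted_observed = [(intercept + slope * p, t) for p, t in observed]
    completed = []
    for p, t in zip(predictor, target):
        if t is not None:
            completed.append(t)
            continue
        prediction = intercept + slope * p
        nearest = sorted(predicted_observed, key=lambda pair: abs(pair[0] - prediction))
        completed.append(rng.choice(nearest[:donors])[1])  # borrow a real observation
    return completed


def pool_by_mean(imputations):
    """Pool several completed series position by position using the mean."""
    return [sum(values) / len(values) for values in zip(*imputations)]


predictor = [0.0, 0.2, 0.4, 0.0, 1.0, 0.6, 0.0, 0.2]
target = [0.0, 0.2, None, 0.0, 0.8, None, 0.0, 0.4]
rng = random.Random(500)
imputations = [pmm_impute(target, predictor, rng=rng) for _ in range(5)]  # m = 5
pooled = pool_by_mean(imputations)
```

Because every imputed value is copied from an observed value at the target station, each individual imputation stays within the observed range, and so does the pooled mean.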

Amelia II

Amelia II implements expectation–maximization with bootstrapping. The algorithm applies the familiar expectation–maximization (EM) algorithm to multiple bootstrapped samples of the original incomplete data to draw values of the complete data parameters. It then draws imputed values from each set of bootstrapped parameters, replacing the missing values with these draws. It also has features that make valid and more accurate imputations for cross-sectional, time series, and time series cross-section data, and it allows the incorporation of observation-level and data matrix cell-level prior information [20]. Its support for time series data is the reason why this multiple imputation method was chosen; the timestamp associated with each rainfall value was therefore added.

The number of multiple imputations was $m = 5$, the parallel argument was set to multicore, the lags and leads arguments were set to the station columns, and the mean was again calculated to pool the five estimated values.

IDW (Inverse Distance Weighting)

The IDW method estimates the value at the target point from the values at neighbouring points, giving more weight to points closer to the target. This method uses the Cartesian coordinates of the stations and takes the inverse of the distance to the target station, raised to the power P, as the weighting factor for each neighbouring value, as shown in Equation (1) [7]:

$z_{p} = \frac{\sum_{i=1}^{n} \frac{z_{i}}{d_{i}^{P}}}{\sum_{i=1}^{n} \frac{1}{d_{i}^{P}}}$ (1)

To impute with the IDW algorithm in R, a Euclidean distance matrix was supplied, and the power P was set to 2.
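As an illustration of Equation (1), the following minimal sketch computes an IDW estimate for one time step with the power set to 2. It is not the R code used in the study; the function name, the coordinates, and the rainfall values are hypothetical, and neighbours with missing values are simply skipped, which is an assumption.

```python
import math


def idw_estimate(target_xy, neighbours, power=2.0):
    """Estimate the target value from ((x, y), value) pairs; None marks missing."""
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in neighbours:
        if value is None:
            continue  # a neighbour without data contributes nothing
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0.0:
            return value  # co-located station: use its value directly
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator if denominator else None


# Two hypothetical neighbours at distances 1 and 2 from the target.
neighbours = [((1.0, 0.0), 2.0), ((0.0, 2.0), 4.0)]
estimate = idw_estimate((0.0, 0.0), neighbours)
# Weights are 1/1^2 = 1 and 1/2^2 = 0.25, so the estimate is (2 + 1) / 1.25 = 2.4.
```

With a power of 2, a station twice as far away receives a quarter of the weight, so the estimate is pulled towards the nearest neighbour.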

2.2.2. Proposed Crossvalidation Method

Given the seasonality of the meteorological variable rainfall, with higher values in autumn–winter months and lower ones in the spring–summer months [9], it was decided to carry out the imputation process month by month. Data loss patterns observed in a particular month/year were applied in different years to the same calendar month.

To compare the different imputation methods, experimental data sets were created using the following crossvalidation method (Figure 8):

1. Select several stations in Galicia that have different characteristics, for example, a station located on the coast, another in the interior, another high in the mountains, etc. We call these stations target stations.
2. For each target station, select 9 neighbouring stations by proximity, so that they surround the target station and have a sufficient number of data points. We call these stations candidate stations.
3. Repeat the following steps for each month of all years for which records are available:
   (a) Read the ten-minute rainfall data of the corresponding month for the target station.
   (b) If that month is complete for the target station, it becomes part of the data set we can experiment with. If that month is incomplete for the target station, save its data loss pattern, that is, the positions of its NAs (the value used to indicate missing data).
   (c) Having read the rainfall data for that month for all the years for which we have records, we end up with two sets of data: a set of full months and a set of months with loss patterns.
      i. Apply one of the loss patterns to each month of the complete data set for the target station, generating a month with missing data, which we call an experimental month.
      ii. For that experimental month, choose the predictors among the candidate stations based on the following criteria:
         - Complete ten-minute data for that month.
         - Crosscorrelation with the target station.
      iii. Perform a time shift of the series to improve correlation.
      iv. Preprocess the experimental month by filling the NAs with 0s when 0s appear in all the predictor stations for that specific position.
      v. Carry out data imputation with missForest, MICE, Amelia II, and IDW.
4. Calculate the precision and fit metrics between the full data sets that have been generated and the original full data sets from the target station.
5. Compare the results obtained.
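The creation of an experimental month and the zero pre-filling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: series are Python lists with `None` standing in for NA, the three function names are hypothetical, and positions of the pattern that fall beyond the end of a shorter month (for example, February in different years) are ignored.

```python
def loss_pattern(month):
    """Positions of the missing values (None) in an incomplete month."""
    return [i for i, value in enumerate(month) if value is None]


def apply_pattern(full_month, pattern):
    """Copy a complete month and blank out the positions given by the pattern."""
    experimental = list(full_month)
    for i in pattern:
        if i < len(experimental):
            experimental[i] = None
    return experimental


def prefill_zeros(experimental, predictors):
    """Set a missing value to 0 when every predictor station reports 0 there."""
    filled = list(experimental)
    for i, value in enumerate(filled):
        if value is None and predictors and all(s[i] == 0 for s in predictors):
            filled[i] = 0.0
    return filled


# Hypothetical six-value "months" to keep the example short.
incomplete_month = [0.0, None, None, 0.4, 0.0, None]
full_month = [0.2, 0.0, 0.6, 0.0, 0.0, 0.0]
predictors = [
    [0.0, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.8, 0.0, 0.0, 0.0],
]

pattern = loss_pattern(incomplete_month)
experimental_month = apply_pattern(full_month, pattern)
ready_for_imputation = prefill_zeros(experimental_month, predictors)
```

In this example only position 2 is left for the imputation methods, because the predictors recorded rain there; the other two gaps fall in a dry interval at every predictor station and are set to zero beforehand.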

3. Results

After a detailed analysis of the rainfall data from the different stations of the network, 62 target stations without large blocks of missing data were chosen in Step 1, and the proposed validation method was applied to them for the years 2007 to 2018, both inclusive. Taking the Marco da Curra station as an example of a target station, Figure 9 shows that the total percentage of missing data was only 1.04% and that the percentages of missing data per month were low, except in December 2011 with 22.58%, in August 2014 with 19.20%, and in April 2011 with 18.03%.

In Step 2, nine candidate stations were chosen by proximity for each target station. In Figure 10, you can see the selection of the candidates for the Marco da Curra station.

Figure 11 shows the rainfall in March 2007 at the Marco da Curra target station. Two blocks of missing data were observed at this station, the first much larger than the second; together they account for the 414 missing data items in that month.

In Step 3, the experimental months were created. For March 2007, a loss pattern corresponding to the 414 missing data items was saved (Figure 7a). In the other years there were no missing data items in this month, so this loss pattern was applied to them to generate the experimental months. March 2017 serves as an example of this process: Figure 12 shows the rainfall at the target station, with no missing data. Figure 13 shows, superimposed in red, the 2007 loss pattern applied to 2017, and Figure 14 shows the final result. For March at this station, a total of 11 experimental data sets were created, from exp.mar.2008.p2007 to exp.mar.2018.p2007. For January, 3 patterns were saved, for the years 2009, 2010, and 2017; they were then applied to the other 9 years, so 27 experimental months were created for this month. In summary, a total of 218 experimental months were created for the Marco da Curra station. Once the process was finished for all the selected target stations, a total of 7555 experimental months were created.
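The counts of experimental months quoted for March and January follow from pairing each saved pattern with each complete year. The sketch below enumerates them using the naming scheme seen in exp.mar.2008.p2007; the month abbreviations other than "mar", the function name, and the assumption that every year without a saved pattern is complete are hypothetical.

```python
MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def experimental_names(month, pattern_years, first_year=2007, last_year=2018):
    """Name one experimental month per (loss pattern, complete year) pair."""
    full_years = [year for year in range(first_year, last_year + 1)
                  if year not in pattern_years]
    return [f"exp.{MONTH_ABBR[month - 1]}.{year}.p{pattern_year}"
            for pattern_year in pattern_years
            for year in full_years]


march = experimental_names(3, [2007])                # 1 pattern x 11 full years
january = experimental_names(1, [2009, 2010, 2017])  # 3 patterns x 9 full years
```

For March this yields 11 names, from exp.mar.2008.p2007 to exp.mar.2018.p2007, and for January 3 × 9 = 27 names.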

Among the candidate stations, those stations that have complete data for that month must be chosen as predictors. Figure 15 shows the rainfall data for the nine candidate stations. It can be seen that there were no missing data, and the time series data were very similar, although in the CIS Ferrol station, the amount of rainfall was greater than in other stations. In the lower part of Figure 15, the experimental month for the Marco da Curra target station is shown in blue with the missing data that we have artificially created.

The next criterion applied is to choose from among the candidate stations those that are well correlated with the target station, thereby discarding the others. In Figure 16, it can be seen that it was possible to improve this correlation by shifting the series before imputing. This is logical because the rain moves from east to west with the fronts, and, depending on the geographical situation of each station, this rain can occur a few minutes before or after, which is why the maximum correlation between the time series is sometimes achieved by advancing a series and other times by delaying it.
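The search for the shift that maximises the correlation between a candidate series and the target series can be sketched as follows. This is a minimal illustration, not the authors' procedure: it uses the Pearson correlation over the overlapping, non-missing positions, the maximum lag of 6 steps (one hour) is an arbitrary assumption, and all names and sample values are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists; 0.0 if undefined."""
    n = len(a)
    if n < 2:
        return 0.0
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_lag(target, candidate, max_lag=6):
    """Return (lag, r) maximising the correlation of target[i] with candidate[i + lag]."""
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(t, candidate[i + lag]) for i, t in enumerate(target)
                 if 0 <= i + lag < len(candidate)
                 and t is not None and candidate[i + lag] is not None]
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best[1]:
            best = (lag, r)
    return best


def shift_series(candidate, lag):
    """Align the candidate with the target; positions without data become None."""
    n = len(candidate)
    return [candidate[i + lag] if 0 <= i + lag < n else None for i in range(n)]


# Hypothetical shower that reaches the candidate station two steps (20 min) later.
target = [0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
candidate = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0]
lag, correlation = best_lag(target, candidate)
aligned = shift_series(candidate, lag)
```

A positive lag means the rain is recorded later at the candidate station, so its series is advanced before imputation; a negative lag means it is delayed.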

Once the candidate selection process is finished, the data set consists of a matrix whose columns contain only the data of the target station and the predictor stations, ready to be passed to the different imputation algorithms.

After the imputation with the four methods, the precision of each reconstructed time series was checked against the complete original series by means of the MAE, RE, RMSE, and NRMSE errors. The results obtained are shown in Figure 17, which confirms that missForest is the method that offered the best precision.

The bias produced by each imputation method was also calculated. Figure 18 shows that missForest and Amelia II introduced the least bias, with similar values.
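The error and bias measures can be computed at the imputed positions as in the sketch below. The exact definitions used in the study are not given in the text, so these are common textbook forms and should be read as assumptions: the bias is taken as the mean of imputed minus original values, and the NRMSE is the RMSE divided by the range of the original values. RE is omitted because its definition is not stated. All names and sample values are hypothetical.

```python
import math


def mae(original, imputed):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(original, imputed)) / len(original)


def rmse(original, imputed):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(original, imputed)) / len(original))


def nrmse(original, imputed):
    """RMSE normalised by the range of the original values (assumed convention)."""
    span = max(original) - min(original)
    return rmse(original, imputed) / span if span else float("nan")


def bias(original, imputed):
    """Mean signed error; negative values indicate underestimation."""
    return sum(p - o for o, p in zip(original, imputed)) / len(original)


# Hypothetical original and imputed rainfall values (mm per ten minutes).
original = [0.0, 0.2, 0.4, 1.0]
imputed = [0.0, 0.4, 0.4, 0.6]
scores = {
    "MAE": mae(original, imputed),      # (0 + 0.2 + 0 + 0.4) / 4 = 0.15
    "RMSE": rmse(original, imputed),    # sqrt((0.04 + 0.16) / 4) = sqrt(0.05)
    "NRMSE": nrmse(original, imputed),  # RMSE / (1.0 - 0.0)
    "BIAS": bias(original, imputed),    # (0.2 - 0.4) / 4 = -0.05
}
```

The RMSE penalises the single large miss at the 1.0 mm peak more heavily than the MAE does, which matters for rainfall because most ten-minute values are zero and errors concentrate in a few wet intervals.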

Finally, the goodness of fit between the series was also evaluated using the coefficient of determination ($R^{2}$) and the Willmott fit index (d). The correlation was found to be low for the first three methods and very low for the IDW method. In Figure 19, we can see that missForest and Amelia II are the best-fitting methods, with the former being slightly better.
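Both fit measures can be sketched in a few lines. The study does not state the formulas it used, so the following are assumptions: the coefficient of determination is taken as the squared Pearson correlation, and d follows the original form of the Willmott index of agreement, $d = 1 - \frac{\sum (P_i - O_i)^2}{\sum (|P_i - \bar{O}| + |O_i - \bar{O}|)^2}$, with O the original and P the imputed values. Function names and sample values are hypothetical.

```python
def r_squared(original, imputed):
    """Squared Pearson correlation between original and imputed values."""
    n = len(original)
    mean_o = sum(original) / n
    mean_p = sum(imputed) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(original, imputed))
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_p = sum((p - mean_p) ** 2 for p in imputed)
    if var_o == 0 or var_p == 0:
        return 0.0
    return cov * cov / (var_o * var_p)


def willmott_d(original, imputed):
    """Willmott index of agreement: 1 is a perfect fit, 0 is no agreement."""
    mean_o = sum(original) / len(original)
    squared_error = sum((p - o) ** 2 for o, p in zip(original, imputed))
    potential_error = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
                          for o, p in zip(original, imputed))
    if potential_error == 0:
        return 1.0  # both series are constant and equal to the mean
    return 1.0 - squared_error / potential_error


original = [0.0, 0.2, 0.4, 1.0]
imputed = [0.0, 0.4, 0.4, 0.6]
fit_r2 = r_squared(original, imputed)
fit_d = willmott_d(original, imputed)  # 1 - 0.2 / 1.32
```

Unlike $R^{2}$, which only measures linear association, d also drops when the imputed series is systematically too high or too low, so the two indices complement each other.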

In Table 2, the obtained statistical data are detailed.

4. Discussion and Conclusions

The objective of this study is to propose a method that allows us to compare the results obtained using different methods of imputation for missing values. Unlike other studies that have worked with daily, monthly, or annual data calculated from the ten-minute data provided by the rain sensors of the meteorological stations, in the present work, ten-minute data from 62 meteorological stations of Galicia were used, thereby seeking greater precision in the results. This implies greater complexity when making imputations because very large data sets have to be handled, and computing times are high.

In this study, the imputation methods were validated with existing data; the experimental data sets were created using real loss patterns because the mechanisms that caused the losses are unknown.

Table 2 summarizes the model validation statistics in terms of the MAE, RE, RMSE, NRMSE, BIAS, $R^{2}$, and d for reconstructed precipitation based on ten-minute values in Galicia. These parameters were calculated station by station and allowed for comparison of the simulated time series with the real series. The data of the original series are public and can be downloaded from https://www.meteogalicia.gal (accessed on 1 November 2023) to reproduce the study.

According to the results obtained, missForest presented the lowest error percentages and introduced a bias similar to that of Amelia II; both biases were smaller than those of the other two imputation methods. These errors are quite satisfactory in statistical terms.

Regarding the similarity of the reconstructed time series with the original series, missForest also yielded better results than the other three methods. The means, variances, and quantiles can be also calculated station by station to determine the uncertainty of the imputation methods.

This methodology lets researchers incorporate new imputation methods in Step 3.c.v and compare them with the results obtained here. The next step planned for the future is to compare runs of this algorithm with different parameters to determine whether other values can improve the results. This will affect the execution time, which also has to be studied; if it is very slow, the runs will be carried out on a cluster of computers to reduce it.

To study the influence of geographical circumstances on the results of the imputations, we want to classify stations by height and/or coordinates and compare the results obtained. Given the seasonal influence on the rainfall variable, the results of the imputations carried out in spring, summer, autumn, and winter will also be compared. Combining the two analyses is also being considered, given that seasonality in Galicia increases from north to south, as mentioned previously.

Author Contributions

Conceptualization, J.V.-P., B.A.R.-G. and J.A.O.; Methodology, J.V.-P., B.A.R.-G. and J.A.O.; Software, J.V.-P. and B.A.R.-G.; Validation, J.V.-P. and B.A.R.-G.; Formal analysis, J.V.-P. and B.A.R.-G.; Investigation, J.V.-P. and B.A.R.-G.; Resources, J.V.-P. and B.A.R.-G.; Data curation, J.V.-P. and B.A.R.-G.; Writing—original draft, J.V.-P. and B.A.R.-G.; Writing—review & editing, J.V.-P., B.A.R.-G. and J.A.O.; Visualization, J.V.-P. and B.A.R.-G.; Supervision, J.V.-P., B.A.R.-G. and J.A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.meteogalicia.gal/observacion/rede/redeIndex.action?request_locale=gl.

Conflicts of Interest

The authors declare no conflict of interest.

References

Ministerio de Obras Públicas; Transporte y Medio Ambiente. Dirección General de Carreteras. In Cálculo Hidrometeorológico de Caudales Máximos en Pequeñas Cuencas de Carreteras; Alanmer: Madrid, Spain, 1987. [Google Scholar]
Koutsoyiannis, D.; Kozonis, D.; Manetas, A. A mathematical framework for studying rainfall intensity-duration-frequency relationships. J. Hydrol. 1998, 206, 118–135. [Google Scholar] [CrossRef]
Paulhus, J.L.H.; Kholer, M.A. Interpolation of missing precipitation records. Mon. Weather. Rev. 1952, 80, 129–133. [Google Scholar] [CrossRef]
Young, K.C. A three-way model for interpolating for monthly precipitation values. Mon. Weather. Rev. 1992, 120, 2561–2569. [Google Scholar] [CrossRef]
Coulibaly, P.; Evora, N.D. Comparison of neural network methods for infilling missing daily weather records. J. Hydrol. 2007, 341, 27–41. [Google Scholar] [CrossRef]
Lo Presti, R.; Barca, E.; Passarella, G. A methodoloy for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy). Environ. Monit. Assess. 2019, 160, 1–22. [Google Scholar] [CrossRef] [PubMed]
Helmi, A.M.; Elgamal, M.; Farouk, M.I.; Abdelhamed, M.S.; Essawy, B.T. Evaluation of Geospatial Interpolation Techniques for Enhancing Spatiotemporal Rainfall Distribution and Filling Data Gaps in Asir Region, Saudi Arabia. Sustainability 2023, 15, 14028. [Google Scholar] [CrossRef]
Di Piazza, A.; Lo Conti, F.; Noto, L.V.; Viola, F.; La Loggia, G. Comparative analysis of different techniques for spatial interpolation of rainfall data to create a serially complete monthly time series of precipitation for Sicily, Italy. Int. J. Appl. Earth Obs. Geoinf. 2011, 13, 396–408. [Google Scholar] [CrossRef]
Eisched, J.K.; Pasteris, P.A.; Diaz, H.F.; Plantico, M.S.; Lott, N.J. Creating a serially complete, national daily time series of temperature and precipitation for the western United States. J. Appl. Meteorol. 2000, 39, 1580–1591. [Google Scholar] [CrossRef]
Feng, L.; Nowak, G.; O’Neill, T.J.; Welsh, A.H. CUTOFF: A spatio-temporal imputation method. J. Hydrol. 2014, 519, 3591–3605. [Google Scholar] [CrossRef]
Yang, C.; Chandler, R.E.; Isham, V.S.; Wheater, H.S. Spatial-temporal rainfall simulation using generalized linear models. Water Resour. Res. 2005, 41, 1–13. [Google Scholar] [CrossRef]
Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley: New York, NY, USA, 1987. [Google Scholar]
Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177. [Google Scholar] [CrossRef] [PubMed]
Wayman, J.C. Multiple Imputation For Missing Data: What Is It and How Can I Use It? In Proceedings of the Annual Meeting of the American Educational Research Association, Chicago, IL, USA, 21–25 April 2003. [Google Scholar]
Yang, Y. Multiple Imputation for Missing Data: Concepts and New Development; SAS Institute Inc.: Rockville, MD, USA, 2005. [Google Scholar]
van Buuren, S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat. Methods Med. Res. 2007, 16, 219–242. [Google Scholar] [CrossRef] [PubMed]
Stekhoven, D.J.; Bühlmann, P. MissForest: Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118. [Google Scholar]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
van Buuren, S.; Oudshoorn, K. Flexible Multivariate Imputation by MICE; TNO Prevention and Health: Leiden, The Netherlands, 1999. [Google Scholar]
Honaker, J.; King, G.; Blackwell, M. Amelia II: A Program for Missing Data. J. Stat. Softw. 2011, 45, 1–47. [Google Scholar] [CrossRef]
Martinez, A.; Perez, A. Atlas Climatico de Galicia; Xunta de Galicia. Servicio Central: A Coruña, Spain, 1999. [Google Scholar]
Modarres, R.; Ouarda, T. Generalized autoregressive conditional heteroscedasticity modelling of hydrologic time series. Hydrol. Process. 2013, 27, 3174–3191. [Google Scholar] [CrossRef]
Quian, Z.; Wang, L.; Chen, X.; Zhang, H.; Li, Z. Heteroscedastic Characteristics of Precipitation with Climate Changes in China. Atmosphere 2022, 13, 2116. [Google Scholar] [CrossRef]

Figure 1. Location of automatic weather stations of MeteoGalicia.

Figure 2. Elevation map of Galicia.

Figure 3. Automatic weather stations with registration data before 2007.

Figure 4. Data missing in red from 2007 (a) to 2018 (l).

Figure 5. Percentage of missing data (red color) in 2007.

Figure 6. Data missing in blocks in several stations: (a) Marco da Curra, (b) Guitiriz, (c) Mabegondo, (d) CIS Ferrol, (e) Punta Candieira, (f) Fragavella, (g) Sambreixo, (h) Serra da Faladoira, (i) Corno do Boi, (j) Olas.

Figure 7. Monthly missing data distribution in several stations: (a) Marco da Curra, (b) Guitiriz, (c) Mabegondo, (d) CIS Ferrol, (e) Punta Candieira, (f) Fragavella, (g) Sambreixo, (h) Serra da Faladoira, (i) Corno do Boi, (j) Olas.

Figure 8. Crossvalidation method flow chart.

Figure 9. Missing data ratio in Marco da Curra.

Figure 10. The target station in yellow and candidate stations for Marco da Curra.

Figure 11. Rainfall in March 2007 for target station.

Figure 12. Rainfall in March 2017 for target station.

Figure 13. Rainfall in March 2017 with 2007 missing pattern applied.

Figure 14. Experimental month generated after applying the missing pattern in the target station (exp.mar.2017.p2007).

Figure 15. Rainfall in March 2017 for candidate stations and experimental month for target station.

Figure 16. Crosscorrelation between the target station (Marco da Curra) and the candidate stations: (a) Guitiriz, (b) Mabegondo, (c) CIS Ferrol, (d) Punta Candieira, (e) Fragavella, (f) Sambreixo, (g) Serra da Faladoira, (h) Corno do Boi, (i) Olas.

Figure 17. Precision measures: (a) mean absolute error, (b) relative error, (c) root mean square error, (d) normalized root mean square error.

Figure 18. Statistical bias.

Figure 19. Goodness-of-fit measures: (a) coefficient of determination, $R^{2}$; (b) Willmott fit index.

Table 1. Registers stored in the database.

Total data	440,250,897
Mean air temperature	79,185,732
Mean relative humidity	79,142,133
Accumulated global solar radiation	61,199,261
Mean wind speed	64,931,543
Mean wind direction	65,299,572
Accumulated rainfall	90,492,656

Table 2. Statistical results.

Method	MAE	RE	RMSE	NRMSE	BIAS	$R^{2}$	D	Statistic
missForest	0.015	2.606	0.065	0.649	−0.006	0.119	0.476	Min
	0.024	3.333	0.095	0.763	−0.002	0.309	0.666	Q1
	0.029	3.933	0.108	0.790	−0.001	0.376	0.710	Median
	0.031	4.043	0.112	0.798	−0.001	0.364	0.705	Mean
	0.036	4.376	0.122	0.842	0.000	0.419	0.758	Q3
	0.062	6.261	0.209	0.965	0.003	0.580	0.855	Max
MICE	0.020	2.868	0.071	0.724	0.000	0.078	0.438	Min
	0.029	3.686	0.101	0.829	0.002	0.211	0.613	Q1
	0.038	4.344	0.122	0.866	0.004	0.288	0.683	Median
	0.038	4.396	0.122	0.869	0.004	0.285	0.670	Mean
	0.044	4.773	0.137	0.917	0.006	0.343	0.724	Q3
	0.074	6.831	0.222	1.027	0.011	0.485	0.812	Max
Amelia	0.034	2.713	0.070	0.701	−0.006	0.107	0.503	Min
	0.045	3.539	0.100	0.786	−0.001	0.261	0.668	Q1
	0.053	4.088	0.115	0.833	−0.001	0.339	0.728	Median
	0.053	4.242	0.118	0.838	0.000	0.334	0.716	Mean
	0.058	4.617	0.130	0.889	0.001	0.411	0.774	Q3
	0.092	6.454	0.215	1.004	0.004	0.525	0.840	Max
IDW	0.024	3.411	0.084	0.850	−0.022	0.032	0.237	Min
	0.036	4.059	0.111	0.940	−0.004	0.071	0.345	Q1
	0.044	4.678	0.137	0.960	0.002	0.090	0.378	Median
	0.045	4.855	0.136	0.962	0.000	0.098	0.380	Mean
	0.050	5.249	0.148	0.975	0.005	0.118	0.417	Q3
	0.089	6.954	0.267	1.135	0.019	0.290	0.598	Max
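
As a reading aid for Table 2, the following minimal Python sketch shows how several of the tabulated measures could be computed from paired observed and imputed values. It is not the authors' code. The function name `imputation_metrics` and the sample values are hypothetical, and the formulas are assumed textbook definitions: NRMSE is taken as RMSE divided by the standard deviation of the observations, $R^{2}$ as the squared Pearson correlation, and D as the Willmott index of agreement. RE is left out because its exact definition is not restated here.

```python
import math
from statistics import mean, pstdev


def imputation_metrics(observed, imputed):
    """Compare imputed values against the observed values they replaced."""
    if len(observed) != len(imputed) or len(observed) < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, imputed)]
    obs_mean = mean(observed)
    imp_mean = mean(imputed)

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # positive: imputation overestimates on average

    # Assumption: NRMSE normalises RMSE by the std. dev. of the observations.
    obs_sd = pstdev(observed)
    nrmse = rmse / obs_sd if obs_sd > 0 else float("nan")

    # Assumption: R^2 is the squared Pearson correlation coefficient.
    cov = sum((o - obs_mean) * (p - imp_mean) for o, p in zip(observed, imputed))
    var_o = sum((o - obs_mean) ** 2 for o in observed)
    var_p = sum((p - imp_mean) ** 2 for p in imputed)
    r2 = cov * cov / (var_o * var_p) if var_o > 0 and var_p > 0 else float("nan")

    # Willmott index of agreement d; 1 means perfect agreement.
    denom = sum((abs(p - obs_mean) + abs(o - obs_mean)) ** 2
                for o, p in zip(observed, imputed))
    d = 1.0 - sum(e * e for e in errors) / denom if denom > 0 else float("nan")

    return {"MAE": mae, "RMSE": rmse, "NRMSE": nrmse,
            "BIAS": bias, "R2": r2, "D": d}


# Hypothetical 10-minute rainfall values (mm): observed vs. imputed.
obs = [0.0, 0.2, 0.4, 0.0, 0.1, 0.0]
imp = [0.0, 0.1, 0.5, 0.0, 0.1, 0.1]
metrics = imputation_metrics(obs, imp)
```

Under these assumed definitions, lower MAE, RMSE, and NRMSE and a bias closer to zero indicate a better imputation, while $R^{2}$ and D closer to 1 indicate a better fit. This is the direction in which the rows of Table 2 are compared.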

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
