1. Introduction
This paper proposes a method for short-term weekly water-demand forecasting that combines several statistical techniques. First, we visually examine the effects of several meteorological factors on water demand through exploratory data analysis (EDA) and select those that can serve as inputs to the forecasting models. Next, several data preprocessing steps are performed: missing value imputation, data normalization, data transformation, and data selection. Then, among the initial input candidates, which consist of demand values measured from the past to the present, we select only those that are highly correlated with future demand values. Outliers are also detected and removed by considering the correlations between the selected inputs and the output. Once the training datasets are prepared, forecasting models are constructed using support vector regression (SVR), and weekly water-demand forecasts are finally calculated using the iterated and direct strategies that are widely used for multi-step ahead forecasting. To verify its performance, the proposed method is applied to the hourly urban water-demand datasets provided by the Battle of Water Demand Forecasting (BWDF), organized at the 3rd International WDSA-CCWI Joint Conference in Ferrara, Italy.
2. Proposed Method
Figure 1 describes the proposed short-term weekly water-demand forecasting procedure. Step 1 is to determine which of four weather factors, namely rainfall depth (mm), air temperature (°C), air humidity (%), and wind speed (km/h), can be used as input variables for the forecasting models; to do this, scatter plots of each of the four factors against the net inflow (L/s) of a target area are examined. From these plots, we confirmed a clear correlation between air temperature and net inflow; thus, it makes sense to employ air temperature as one of the input variables. Step 2 is to generate one-hot encoded input vectors $\mathbf{h}_t$ and $\mathbf{w}_t$ to reflect the daily and weekly periodicities of water demand in the forecasting models. Step 3 is to perform several data preprocessing steps: missing value imputation, max-min normalization, data transformation, and data selection. Step 3-1 imputes missing values. Step 3-2 applies max-min normalization so that air temperature and water-demand values lie in [0, 1]. Step 3-3 transforms the one-dimensional water-demand time series into multidimensional matrices. Step 3-4 selects a portion of the data vectors for model training to account for seasonality in water demand; the selected data vectors correspond to days that are seasonally similar to the evaluation weeks, that is, the $7 \times w$ days immediately before the evaluation weeks, the days falling on the same dates as the evaluation weeks in past years, and the $7 \times w$ days immediately before and after these dates. A proper value of the window size $w$ can be determined by prior experiments; in this paper, $w$ is set to five.
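Steps 2, 3-2, and 3-3 can be illustrated with a minimal NumPy sketch. The function names (`one_hot`, `min_max`, `lag_matrix`), the 24-slot hour encoding, the 7-slot day encoding, and the toy series are assumptions made for illustration; they are not taken from the authors' implementation, and missing value imputation (Step 3-1) is left out because the imputation rule is not specified here.

```python
import numpy as np

def one_hot(index, size):
    """Return a length-`size` vector with a single 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def min_max(x):
    """Max-min normalization to [0, 1]; NaNs are ignored when finding the range."""
    lo, hi = np.nanmin(x), np.nanmax(x)
    return (x - lo) / (hi - lo)

def lag_matrix(y, n_lags, h):
    """Turn a 1-D series into rows [y_t, y_{t-1}, ..., y_{t-n_lags+1}] with target y_{t+h}."""
    rows, targets = [], []
    for t in range(n_lags - 1, len(y) - h):
        rows.append(y[t - n_lags + 1 : t + 1][::-1])  # newest value first
        targets.append(y[t + h])
    return np.array(rows), np.array(targets)

# Toy series (synthetic, not BWDF data); the paper uses 168 lags and h in {1, 24, 168}.
demand = np.arange(20, dtype=float)
X, target = lag_matrix(min_max(demand), n_lags=3, h=2)

# One training vector: lagged demands plus hour-of-day and day-of-week encodings.
features = np.concatenate([X[0], one_hot(13, 24), one_hot(2, 7)])
```

Each row of `X` pairs with one entry of `target`, so the same matrix layout can feed any regression model.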
Step 4 is to select significant inputs for the forecasting models among historical water-demand values. As initial input candidates for a model whose output is the demand value $y_{t+h}$ at future time $t + h$ (where $h$ = 1, 24, or 168), we consider the 168 measured demand values $y_t, y_{t-1}, \ldots, y_{t-167}$ from past time $t - 167$ to present time $t$. After calculating the absolute values of the correlation coefficients between the 168 candidates and the output and sorting them in descending order, the first $d$ candidates are selected. Since water-demand datasets can be contaminated by statistical outliers, robust covariance matrices [1] are used to calculate the correlation coefficients. A proper number of inputs $d$ can be chosen by cross validation (CV); in this paper, $d$ is set to six using CV. Step 5 is to detect and remove statistical outliers based on the correlations between the selected inputs and the output. To do this, the Mahalanobis distance parameterized by robust covariance matrices [1] is applied. In Step 6, the following forecasting model is constructed using SVR [2]:

$$\hat{y}_{t+h} = f\left(\mathbf{y}, x_{t+h}, \mathbf{h}_{t+h}, \mathbf{w}_{t+h}; \boldsymbol{\theta}\right) \tag{1}$$

where $\hat{y}_{t+h}$ is the forecasted water-demand value at future time $t + h$, $\mathbf{y}$ is the vector composed of the $d = 6$ relevant inputs selected in Step 4, $x_{t+h}$ is the value of air temperature at future time $t + h$, $\mathbf{h}_{t+h}$ and $\mathbf{w}_{t+h}$ are the one-hot encoded input vectors at future time $t + h$, and $\boldsymbol{\theta}$ is the model parameter vector estimated from the training datasets. Step 7 is to forecast weekly water-demand values based on Equation (1) using multi-step ahead forecasting strategies.
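Steps 4 to 6 can be sketched with scikit-learn as follows. This is an illustration under several assumptions rather than the authors' implementation: the robust covariance is taken to be the minimum covariance determinant estimate (`MinCovDet`), the outlier cutoff is an empirical 97.5% quantile of the squared Mahalanobis distances, and the SVR kernel and hyperparameters are placeholders. The helper names and the synthetic data are hypothetical, and the air temperature and one-hot inputs of the full model are omitted for brevity.

```python
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.svm import SVR

def robust_abs_corr(x, y, seed=0):
    """|correlation| of x and y computed from a robust (MCD) 2x2 covariance matrix."""
    c = MinCovDet(random_state=seed).fit(np.column_stack([x, y])).covariance_
    return abs(c[0, 1]) / np.sqrt(c[0, 0] * c[1, 1])

def select_inputs(X, y, d, seed=0):
    """Step 4: indices of the d candidates with the largest robust |correlation| to y."""
    scores = np.array([robust_abs_corr(X[:, j], y, seed) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]

def remove_outliers(X, y, quantile=0.975, seed=0):
    """Step 5: drop rows whose robust squared Mahalanobis distance is above a quantile.

    The quantile-based cutoff is an assumption; the text does not state a threshold.
    """
    Z = np.column_stack([X, y])
    d2 = MinCovDet(random_state=seed).fit(Z).mahalanobis(Z)  # squared distances
    keep = d2 <= np.quantile(d2, quantile)
    return X[keep], y[keep]

# Synthetic data (not BWDF): 8 candidate inputs, only columns 3 and 5 drive the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 3] - 1.5 * X[:, 5] + 0.5 * rng.normal(size=300)

idx = select_inputs(X, y, d=2)                     # Step 4
X_clean, y_clean = remove_outliers(X[:, idx], y)   # Step 5
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_clean, y_clean)  # Step 6
y_hat = model.predict(X_clean[:5])
```

In the setting described above, `X` would hold the 168 lagged demand candidates and `d` would be six.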
To perform multi-step ahead forecasting, iterated and direct strategies have been widely used [3]. An iterated strategy computes multi-step ahead forecasts by recursively feeding the forecasted values back as input values; in general, this strategy requires a forecasting model with the output $y_{t+1}$ (i.e., $h = 1$ in Equation (1)). A direct strategy calculates multi-step ahead forecasts directly from a single model or multiple models. In this strategy, only observed water-demand values are used as inputs, and forecasts are never reused as inputs. For example, to compute the 168 weekly forecasts of hourly water demand, 168 models with outputs $y_{t+1}, y_{t+2}, \ldots, y_{t+168}$ can be employed, or a single model with output $y_{t+168}$ can be employed. In the former case, training 168 forecasting models takes a long time; in the latter, since only a single model is trained and used for forecasting, accuracy can be degraded. A compromise between using a single model and using 168 models is also possible. In other words, we can train seven models with outputs $y_{t+24}, y_{t+48}, \ldots, y_{t+168}$ (here, $h$ is set to a multiple of 24) and then calculate the weekly demand forecasts: the model with output $y_{t+24}$ predicts the 24 demand values $y_{t+1}, \ldots, y_{t+24}$, the model with output $y_{t+48}$ predicts the 24 demand values $y_{t+25}, \ldots, y_{t+48}$, and so on.
In this paper, we use three multi-step ahead forecasting strategies: an iterated strategy based on a single model with output $y_{t+1}$, a direct strategy based on a single model with output $y_{t+168}$, and a direct strategy based on seven models with outputs $y_{t+24}, y_{t+48}, \ldots, y_{t+168}$.
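The difference between the three strategies lies mainly in which input window each forecast uses, which the following sketch tries to make explicit. It assumes that a trained model can be treated as a function from a window of past demands (oldest first) to one forecast; the `Model` alias, the function names, and the seasonal-naive toy models are hypothetical stand-ins for the SVR models, and the exogenous inputs of Equation (1) are omitted.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], float]  # input window (oldest first) -> one forecast

def iterated_forecast(model_h1: Model, history: Sequence[float],
                      n_lags: int, steps: int) -> List[float]:
    """One model with output y_{t+1}; each forecast is fed back as the newest input."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        y_next = model_h1(window)
        forecasts.append(y_next)
        window = window[1:] + [y_next]
    return forecasts

def direct_forecast(models: Dict[int, Model], history: Sequence[float],
                    n_lags: int, steps: int) -> List[float]:
    """Models keyed by horizon h; step k uses the smallest h >= k and an input
    window ending h steps before the target, so every input is an observed value."""
    horizons = sorted(models)
    last = len(history) - 1                      # index of the present time t
    forecasts = []
    for k in range(1, steps + 1):
        h = next((h for h in horizons if h >= k), None)
        if h is None:
            raise ValueError(f"no model covers step {k}")
        end = last + k - h                       # newest observed input, end <= last
        start = end - n_lags + 1
        if start < 0:
            raise ValueError("history is too short for this horizon")
        forecasts.append(models[h](history[start:end + 1]))
    return forecasts

# Toy check on a perfectly 24-periodic series with seasonal-naive stand-in models.
history = [float(i % 24) for i in range(24 * 14)]
truth = [float(i % 24) for i in range(168)]      # continuation, since 336 % 24 == 0

iterated = iterated_forecast(lambda w: w[0], history, n_lags=24, steps=168)
single = direct_forecast({168: lambda w: w[-1]}, history, n_lags=24, steps=168)
seven = direct_forecast({24 * j: (lambda w: w[-1]) for j in range(1, 8)},
                        history, n_lags=24, steps=168)
```

On this toy series all three strategies reproduce the continuation exactly; on real demand data they differ because the iterated strategy accumulates forecast errors while the direct strategies rely on older observations.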
3. Experimental Results
This section provides the results of applying the proposed method in Figure 1 to the BWDF dataset [4]. This dataset consists of a weather dataset, a calendar dataset, and an urban hourly water-demand dataset (hourly net inflow dataset) collected from 10 district metered areas (DMAs). More detailed information on the BWDF dataset can be found in [4]. The aim of the battle is to forecast the weekly water-demand values for four evaluation weeks (W1, W2, W3, and W4) as accurately as possible. To evaluate the accuracy of the weekly forecasts, the following three performance indices and their sum $PI$ ($= PI_1 + PI_2 + PI_3$) are used:

$$PI_1 = \frac{1}{24}\sum_{i=1}^{24}\left|y_i - \hat{y}_i\right|, \qquad PI_2 = \max_{i = 1, \ldots, 24}\left|y_i - \hat{y}_i\right|, \qquad PI_3 = \frac{1}{144}\sum_{i=25}^{168}\left|y_i - \hat{y}_i\right|$$

where $\hat{y}_i$ ($i = 1, \ldots, 168$) are the 168 weekly forecasts and $y_i$ are the 168 actual demand values. In this paper, it is assumed that the demand values in the 7 days immediately before the evaluation weeks are unknown, and the aim is to forecast these values. Only the results for the 7 days immediately before W1 are presented; those for the other weeks are omitted due to space constraints.
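The index computation can be sketched as below, assuming that $PI_1$ is the mean absolute error over the first 24 h, $PI_2$ is the maximum absolute error over the first 24 h, and $PI_3$ is the mean absolute error over hours 25 to 168. The function name and the toy numbers are illustrative only.

```python
import numpy as np

def performance_indices(y_true, y_pred):
    """Return (PI1, PI2, PI3, PI) for 168 hourly actual values and forecasts."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    if err.shape != (168,):
        raise ValueError("expected 168 hourly values")
    pi1 = err[:24].mean()   # mean absolute error, first day
    pi2 = err[:24].max()    # maximum absolute error, first day
    pi3 = err[24:].mean()   # mean absolute error, days 2 to 7
    return pi1, pi2, pi3, pi1 + pi2 + pi3

# Toy example (made-up numbers): two forecast errors in an otherwise perfect week.
y_true = np.zeros(168)
y_pred = np.zeros(168)
y_pred[0] = 2.4      # error in the first day
y_pred[100] = 1.44   # error in a later day
pi1, pi2, pi3, pi = performance_indices(y_true, y_pred)
```

Because $PI_2$ is a maximum, a single large error in the first day can dominate the sum.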
Table 1 lists the performance indices obtained by applying the three strategies described in Section 2 to weekly demand forecasting for the 7 days immediately before W1. As the table shows, the average PI values of the two direct strategies (7.05 for the single model and 7.29 for the multiple models) are lower than that of the iterated strategy (7.71); the direct strategy with the single model thus achieves an average PI that is 0.24 lower than that of the direct strategy with the multiple models. It is also worth emphasizing that the direct strategy that achieves the better performance indices varies from DMA to DMA, which indicates that the characteristics of the water-demand time series differ from DMA to DMA.
Figure 2 shows the actual water-demand curves and the curves forecasted by the two direct strategies; for DMAs A, E, H, and J, the direct strategy using the single model is applied, and for the remaining DMAs, the direct strategy using the multiple models is applied.