**Feature Papers of Forecasting**

Editor

**Sonia Leva**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Sonia Leva Department of Energy, Politecnico di Milano, Via Lambruschini 4 Italy

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Forecasting* (ISSN 2571-9394) (available at: https://www.mdpi.com/journal/forecasting/special_issues/FP_Forecast).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-1030-9 (Hbk) ISBN 978-3-0365-1031-6 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editor**

**Sonia Leva** is Full Professor in "Elettrotecnica" (Electrical Engineering-Circuit Theory) at Politecnico di Milano (Italy). She was born in 1970 in Tradate (Italy) and received the M.Sc. degree in 1997 and the Ph.D. degree in 2001, both in Electrical Engineering, from the Faculty of Engineering, Politecnico di Milano, Italy. In 1998, she registered as a professional Engineer in Italy. From 1999 to 2010, she was a Research Associate in Electrical Engineering at the Department of Electrical Engineering, Politecnico di Milano. She qualified as Associate Professor in July 2010, taking up the position on 16 December 2010, and qualified as Full Professor in "Elettrotecnica" (Electrical Engineering-Circuit Theory) in February 2014, taking up that position on 6 January 2016.

She has been an IEEE member since 2000 and a Senior Member since 2013. She has chaired sessions at international conferences organized by the Institute of Electrical and Electronics Engineers (IEEE). She is the author of about 100 papers, mainly published in international and national journals or conference proceedings. She has served as Editor-in-Chief of *Forecasting* since 2019.

Sonia Leva has been a member of the Italian Standards Authority (CEI) Technical Committee CT 82 "Sistemi di conversione fotovoltaica dell'energia solare" (Photovoltaic Systems) since 2008 and contributed to writing the second edition of its technical guide.

### *Editorial* **Editorial for Special Issue: "Feature Papers of Forecasting"**

**Sonia Leva**

Department of Energy, Politecnico di Milano, 20156 Milano, Italy; sonia.leva@polimi.it

Nowadays, forecasting applications are receiving unprecedented attention thanks to their capability to improve decision-making processes by providing useful indications. A large number of forecasting approaches, tailored to different forecast horizons and to the specific quantity to be predicted, have been proposed in the recent scientific literature, ranging from physical models to data-driven statistical and machine learning approaches. Hybrid approaches combining two or more of the previously mentioned methods have also been investigated; in general, many of these methods are based on AI techniques.

This Special Issue collects recent, high-quality research on forecasting. A total of nine papers have been selected to represent a wide range of applications, from weather and environmental predictions to economic and management forecasts, including applications to forecasting the different phases of the COVID-19 epidemic in Spain and to photovoltaic power production.

Perdigão et al. [1] compare the Direct Normal Irradiance (DNI) predictions over one year (from 1 August 2018 to 31 July 2019) from the Integrated Forecasting System of the European Centre for Medium-Range Weather Forecasts (IFS/ECMWF) against the corresponding observed values at the Évora station, in the south of Portugal, for different time steps (hourly and daily) and for different forecast horizons (up to four days ahead). The comparison highlights a similar magnitude and trend between forecast and observed data, with an overestimation of the predicted DNI by IFS/ECMWF and a general error increase with forecast lead time. In addition, a methodology based on the DNI attenuation index (DAI) is proposed to estimate the transparency of the atmosphere, and a post-processing methodology is adopted to reduce the bias of the IFS/ECMWF model. The former revealed the tendency of the IFS/ECMWF approach to underestimate the effects of clouds on DNI, while the bias-correction post-processing allowed a large improvement in the DNI forecast, with a 30% decrease in all error metrics.

Maroccu et al. [2] propose a nowcasting method for precipitation intensity predictions. The method is based on a generative neural network, which is trained with a specific loss function and adopts the PredNet architecture, successfully applied in other fields such as computer vision and natural language processing. Its forecast performance is compared against those of state-of-the-art optical flow procedures in a real case study, a public-domain dataset of radar images from Japan covering a time span of five years. The results demonstrate the neural network to be by far the most effective forecast model among all those proposed.

Gunter et al. [3] investigate the accuracy of individual and combined statistical methods in forecasting tourism demand for the European Union. This research, contracted by the European Commission, required the analysis of models with a low degree of complexity that are easily rebuildable for practical application. The accuracy assessment was performed over eight different periods spanning two years each, in order to ensure stable results within a changing macroeconomic context. The results demonstrate that the combination of Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models, and Error Trend Seasonal (ETS) models performs better than the single models, and that the combination based on Bates–Granger weights (VAriance-COvariance method, VACO) outperforms the one based on uniform weights.

Ghimire et al. [4] analyze the accuracy of low-complexity, data-driven, persistence-based approaches for streamflow forecasting in the Himalayan region of Nepal, proposed

**Citation:** Leva, S. Editorial for Special Issue: "Feature Papers of Forecasting". *Forecasting* **2021**, *3*, 135–137. https://doi.org/10.3390/ forecast3010009

Received: 18 February 2021 Accepted: 20 February 2021 Published: 21 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

as benchmarks for a real-time streamflow forecasting system. In detail, a simple persistence approach, a streamflow climatology approach, and an anomaly persistence approach are discussed. In general, the forecast skill of persistence-based methods shows a strong spatio-temporal dependence: it is higher in rivers with constant baseflow than in intermittent ones, for moderate flows than for extreme ones, and for larger river basins than for smaller ones. Finally, the study demonstrates that the proposed persistence-based forecast approaches are difficult to outperform even with complex mechanistic hydrologic models.

Rezazadeh [5] proposes a Machine-Learning (ML) workflow implemented on the cloud-based computing platform Microsoft Azure Machine-Learning Service (Azure ML) with the aim of predicting the probability of winning sales opportunities in Business-to-Business (B2B) sales, a task that nowadays mostly relies on human evaluation. To investigate the effectiveness of the discussed approach, it was applied to a multi-business consulting firm: the performance of the ML workflow was compared with user-entered predictions made by salespersons. The results demonstrate that decision-making based on the ML predictions is more accurate than the subjective human predictions.

Comi et al. [6] investigate the variability of bus travel time by means of a time-series-based approach applied to data from the Automated Vehicle Monitoring (AVM) systems of bus lines sharing road lanes with private traffic in Rome (Italy) and Lviv (Ukraine). The analysis of the results points out the efficiency of the proposed forecast approach, also highlighting the significant effects of the time of day and the day of the week on travel-time variability. Moreover, the key structural differences between Rome and Lviv are considered, showing the need to account for them when developing a forecast model.

Hogan et al. [7] describe a diminishing learning-rate model, namely Boone's learning curve, with the aim of improving end-item cost evaluation and of proposing an alternative to popular but outdated learning-curve models adopted by the U.S. Department of Defense (DoD) in cost estimation. Using production data from 169 DoD end-items, the research demonstrates that Boone's learning curve significantly reduces the error in modeling observed learning curves.

Mora et al. [8] develop and calibrate a semi-empirical model based on logistic maps with the aim of forecasting the different phases of the COVID-19 epidemic in Spain. In detail, the model predicts the number of infected, hospitalized, Intensive Care Unit (ICU) patients and deaths in four different epidemic phases, namely: non-controlled evolution, total lock-down, partial easing of the lock-down, and phased lock-down easing. A reliable forecast of COVID-19 development, for whole countries as well as smaller regions, allows an optimization of sanitary resources and reduces the economic and social impact of Non-Pharmaceutical Interventions (NPIs), such as lock-downs. The proposed model was capable of providing reasonably accurate results for the different phases of the epidemic.

Pan et al. [9] propose a data-driven Photovoltaic (PV) output power estimation approach using only net-load, temperature, and solar-irradiation data and involving a decomposition method for the Behind-The-Meter (BTM) PV output power curve. To illustrate the effectiveness of the described approach, the PV output decomposition was simulated and tested on datasets of 300 real customers from Ausgrid.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **Assessment of Direct Normal Irradiance Forecasts Based on IFS**/**ECMWF Data and Observations in the South of Portugal**

#### **João Perdigão 1,\*, Paulo Canhoto 1,2, Rui Salgado 1,2 and Maria João Costa 1,2**


Received: 19 April 2020; Accepted: 14 May 2020; Published: 16 May 2020

**Abstract:** Direct Normal Irradiance (DNI) predictions obtained from the Integrated Forecasting System of the European Centre for Medium-Range Weather Forecasts (IFS/ECMWF) were compared against ground-based observational data for one location in the south of Portugal (Évora). Hourly and daily DNI values were analyzed for different temporal forecast horizons (1 to 3 days ahead), and the results show that the IFS/ECMWF slightly overestimates DNI for the period of analysis (1 August 2018 until 31 July 2019), with a fairly good agreement between model and observations. The hourly-basis evaluation shows relatively high errors, independently of the forecast day. The root mean square error increases as the forecast time increases, with a relative error of ~45% between the first and the last forecast. Similar patterns are observed in the daily analysis, with errors of comparable magnitude. The correlation coefficients between forecast and observed data are above 0.7 for both hourly and daily data. A methodology based on a new DNI attenuation Index (DAI) was developed to estimate cloud fraction from hourly values integrated over a day and, with that, to correlate the accuracy of the forecast with sky conditions. This correlation with DAI reveals that the IFS/ECMWF model represents the atmosphere as more transparent than it actually is, since cloud cover is underestimated in the majority of the months of the year, taking the ground-based measurements as a reference. The use of the DAI estimator confirms that the errors in IFS/ECMWF are larger under cloudy skies than under clear sky. The development and application of a post-processing methodology improves the DNI predictions from the IFS/ECMWF outputs, with a decrease of error of the order of ~30% when compared with the raw data.

**Keywords:** Direct Normal Irradiance (DNI); IFS/ECMWF; forecast; evaluation; DNI attenuation Index (DAI); bias correction

#### **1. Introduction**

Solar energy is becoming a crucial renewable resource in modern societies, contributing to the sustainability of the planet through the mitigation of greenhouse gas emissions by reducing the consumption of coal or fuel oil for electricity production; however, the availability of solar resources over time in a given region of interest determines the cost/benefit of solar power plant implementation. Since temporal series of solar radiation measurements are spatially limited, and thus scarce and sometimes nonexistent, the prediction and validation of the solar resource is a key factor for such enterprises.

Several researchers have estimated the potential of renewable energies such as wind or solar radiation for electricity or thermal energy production around the world; for example, in Europe and Africa [1], in Chile [2], in the Iberian Peninsula [3], in the United Kingdom [4], and in Spain [5]. Solar power is a very promising energy source in the Iberian Peninsula (IP), and strong growth is expected in this area. In the IP there are multiple options for using renewable energy (solar, wind, and hydro) to generate electricity; moreover, the solar resource is high throughout the year [6–8].

Concerning solar energy, there are two main ways of converting solar energy into electricity: Photovoltaics (PV) and Concentrating Solar Power (CSP). PV panels convert both direct and diffuse solar irradiance, while CSP technology concentrates only the Direct Normal Irradiance (DNI). The focus of this work is on the prediction of DNI, because of its use in CSP plant management. The forecast of global solar radiation (direct + diffuse) for the same region was addressed, for example, in [6] and [9].

There are several approaches to predict solar irradiance, such as Numerical Weather Prediction (NWP), Cloud Motion Vectors (CMV), statistical time series analysis, and other methods [7,10,11]. In recent years, one of the major research challenges for the use of NWP in solar energy applications is the DNI forecast, aiming at the development and increase of CSP installed capacity and operation management. CSP requires the knowledge of DNI for specific sites [12,13], and one of the difficulties is the need to forecast the DNI several days ahead to increase efficiency and minimize the operational costs of the power plants [14,15]. For instance, Casado-Rubio et al. [16] proposed a simple methodology to obtain DNI forecasts, based on the Weather Research and Forecasting model (WRF model) and a radiative transfer simulation (for 1-day forecasts), and found that this procedure can be used as a diagnostic tool for operational power plants.

Until recently, DNI measurements were not available in many places or over long time series, and this variable was not a direct output of NWP models. Currently, the Integrated Forecast System of the European Centre for Medium-Range Weather Forecasts (IFS/ECMWF) provides the direct normal irradiance as an output; however, the use of NWP models in DNI forecasting is still not perfect and requires Model Output Statistics (MOS) methodologies [13]. Lopes et al. [17] used the IFS global model of the ECMWF to assess short-term (24 h) DNI in the south of Portugal and found relative differences in the range of ~7% to ~12% on an annual basis between predictions and observations at ground-based stations. Lara-Fanego et al. [18] found a relative root mean square error of 60% for hourly DNI forecasts in Spain for all sky conditions, using the Advanced Research Weather Research and Forecasting model (WRF). Troccoli and Morcrette [19] analyzed direct solar radiation data using two different radiation schemes of the IFS/ECMWF for four ground-based measuring stations in Australia and found mean absolute errors between 18% and 45% and correlation coefficients between 0.25 and 0.85. In that work, the use of a post-processing bias correction improved the results, yielding mean absolute errors between 10% and 15% and correlation coefficients of about 0.9. Ruiz-Arias et al. [7] also found better results for DNI forecasts from the WRF model by using a post-processing algorithm. Law et al. [13] present a comprehensive review of DNI forecasts obtained from several methods, together with examples of DNI forecast accuracies. According to Vick et al. [20], most studies on DNI models have assessed the annual and hourly mean bias and root mean square errors between measured and modeled DNI; however, according to the same authors, the accuracy of monthly and daily direct normal irradiation forecasts should also be assessed to detect gaps in DNI modeling that may be improved and correlated with sky conditions, time of the year, or location.

Since there are several on-going projects in Portugal to explore the solar resource, it is imperative to carry out studies that help understand the errors associated with direct normal irradiance predictions over several days ahead.

The Portuguese Institute for Sea and Atmosphere (IPMA), the Portuguese meteorological authority, uses the ECMWF global model predictions as its main forecast tool. Comparisons between operational global NWP models show that the ECMWF model performs best over Europe [21]. Good numerical predictions of near-surface weather conditions presuppose a good representation of the surface radiative balance; however, a correct forecast of the global irradiance does not necessarily mean an accurate partition between the direct and diffuse components, as this partition is not essential to solving the surface balance. In response to growing demand from the solar energy market, the ECMWF recently (in 2015) started to include DNI among the available predicted variables. These forecasts are likely to be used by solar plants in southern Portugal.

The main objective of this study was to assess the performance of the IFS/ECMWF global model (CY45R1 cycle—released at 5 June 2018) to predict DNI in the south of Portugal, by comparing its results with observational data of Évora station, on an hourly and daily basis and for various forecasting horizons (up to four days ahead). We present a method to predict sky conditions based on observational DNI data. A post-processing methodology was also tested to minimize the bias in the IFS/ECMWF model.

The paper is organized as follows: Section 2 describes the data and methods used in this study, the performance assessment of the ECMWF forecasts is presented and discussed in Section 3, and, finally, conclusions are provided in Section 4.

#### **2. Materials and Methods**

#### *2.1. DNI Observational Data*

The measurements used in this study were obtained from the observatory of Atmospheric Sciences located at the University of Évora (38.57° N, 7.9° W, 293 m a.m.s.l.). DNI was measured using a first-class pyrheliometer (Kipp & Zonen, model CHP01) [22], following the World Meteorological Organization (WMO) [23] and the International Organization for Standardization (ISO) 9060:1990 standards [24]. This model of pyrheliometer is designed to measure the solar irradiance with an opening half-angle of 2.5°.

A period of one year of DNI measurements was used in this study, from 1 August 2018 until 31 July 2019. The sensor output was sampled every 5 s, and one-minute mean, minimum, maximum, and standard deviation values were recorded. Hourly values were then computed by averaging the one-minute values whenever the number of records for that hour corresponds to at least fifty minutes. The data for solar zenith angles above 89° (twilight and nighttime) were not considered and were thus removed from the analysis. The daily mean was computed using a methodology similar to that used by Troccoli and Morcrette [19], i.e., if one or more hourly values are missing on a given day, then that day is not used in the analysis.
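The aggregation rules above (at least fifty valid minutes per hour; a day is discarded if any hourly value is missing) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the use of NaN to mark missing records are our assumptions.

```python
import math

import numpy as np

def hourly_mean(minute_values, min_records=50):
    """Average one-minute DNI values into an hourly mean.

    Returns NaN when fewer than `min_records` valid (non-NaN) minute
    values are available, mirroring the rule that an hour needs at
    least fifty minutes of data.
    """
    v = np.asarray(minute_values, dtype=float)
    valid = v[~np.isnan(v)]
    if valid.size < min_records:
        return float("nan")
    return float(valid.mean())

def daily_mean(hourly_values):
    """Daily mean of daytime hourly DNI; the whole day is discarded
    (NaN) if any hourly value is missing, as in the approach of
    Troccoli and Morcrette adopted by the authors."""
    v = np.asarray(hourly_values, dtype=float)
    if np.isnan(v).any():
        return float("nan")
    return float(v.mean())
```

For example, an hour with only forty valid one-minute records yields NaN, while a complete day of hourly values yields their plain average.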

All instruments of this measuring station were subject to maintenance and cleaning procedures following the recommendations of the World Meteorological Organization and data was subject to BSRN (Baseline Surface Radiation Network) quality filters [25] based on physically possible and extremely rare values.

In this work, the seasons were defined according to the WMO nomenclature i.e., winter (December–January–February: DJF), spring (March–April–May: MAM), summer (June–July–August: JJA), and autumn (September–October–November: SON).

#### *2.2. DNI Forecast Data*

Predicted DNI from the IFS/ECMWF, from 1 August 2018 until 31 July 2019, was obtained with a resolution of 0.125° × 0.125° (lat × lon grid). The forecast data were provided with an hourly time step for the first three days and with a three-hour time step for the 4th day. In this work, forecasts were separated into four intervals: day\_0 (1st day), day\_1 (2nd day), day\_2 (3rd day), and day\_3 (4th day). The solar irradiation, accumulated in hourly time steps over the entire forecast horizon, was converted into hourly mean DNI irradiance values.
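The conversion from accumulated irradiation to hourly mean irradiance amounts to differencing consecutive accumulated values and dividing by the step length. A minimal sketch (our illustration, assuming the model reports accumulated irradiation in J/m² at the end of each hourly step, starting from zero at the forecast start):

```python
import numpy as np

def accumulated_to_hourly_mean(accum_j_per_m2):
    """Convert accumulated irradiation (J/m^2), reported at the end of
    each hourly forecast step, into hourly mean irradiance (W/m^2):
    difference consecutive accumulated values, divide by 3600 s."""
    acc = np.asarray(accum_j_per_m2, dtype=float)
    return np.diff(acc, prepend=0.0) / 3600.0
```

For instance, accumulated values of 3.6 MJ/m² and 7.2 MJ/m² after the first two hours correspond to an hourly mean irradiance of 1000 W/m² in each hour.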

The shortwave radiation scheme of the IFS/ECMWF used in this study was the new radiation scheme implemented on 11 July 2017, called ecRad [26]. This scheme is faster than the previous McRad scheme [27] and can be executed more times during the forecast. It computes the profiles of shortwave and longwave irradiances at half levels, and these are interpolated horizontally back onto the model grid using cubic interpolation [26]. The aerosol distribution was adapted from

Tegen et al. [28], using a climatology of six hydrophobic aerosol species as well as the newer climatology obtained from a reanalysis of the atmospheric composition produced by the Copernicus Atmosphere Monitoring Services (CAMS), with 11 hydrophilic and hydrophobic species [29].

According to Hogan and Bozzo [26], ecRad incorporates a method to represent longwave scattering of clouds, which leads to an improvement in forecast skills. The default ice optical properties were computed using the Fu scheme [30], but two additional schemes were available.

In the work by Hogan and Bozzo [26], it was possible to find the evolution of the ECMWF Radiation Scheme after 2000 and the options available. More details on the physical processes (and the options available) were reported in the IFS documentation on the ECMWF web page [31].

The nearest-neighbor technique was used to select the forecast data for comparison with the measurements. To assess forecasting accuracy, the observational data were compared with the forecasts for the nearest model grid point. Wild and Schmucki [32] made several statistical tests on the points surrounding a grid point to analyze trends, and the results showed that the grid points surrounding a given point (selected by a lat/lon value) do not differ significantly from each other in the majority of cases.

#### *2.3. Statistical Indicators for Model Assessment*

The quality of the DNI forecasts was evaluated against observational data using common statistical parameters such as the Mean Bias Error (MBE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and correlation coefficient (r). In this work, errors were calculated based on hourly, daily, and monthly mean values. Similar to the analyses presented by Nonnenmacher et al. [14] and Perez et al. [33], night-time values (zero solar irradiance) were excluded from the model assessment. The ratio (RSR) between the root mean square error and the standard deviation of the observations (σ*obs*) was also determined.

These statistical parameters are defined as follows:

$$MBE = \frac{1}{N} \sum\_{i=1}^{N} (m\_i - o\_i) \tag{1}$$

$$MAE = \frac{1}{N} \sum\_{i=1}^{N} |m\_i - o\_i| \tag{2}$$

$$RMSE = \left[ \frac{1}{N} \sum\_{i=1}^{N} (m\_i - o\_i)^2 \right]^{\frac{1}{2}} \tag{3}$$

$$r = \frac{\sum\_{i=1}^{N} (o\_i - \overline{o}) \left(m\_i - \overline{m}\right)}{\left[\sum\_{i=1}^{N} (o\_i - \overline{o})^2 \sum\_{i=1}^{N} (m\_i - \overline{m})^2\right]^{\frac{1}{2}}} \tag{4}$$

$$RSR = \frac{RMSE}{\sigma\_{obs}} = \frac{\left[\frac{1}{N} \sum\_{i=1}^{N} (m\_i - o\_i)^2\right]^{\frac{1}{2}}}{\left[\frac{1}{N} \sum\_{i=1}^{N} (o\_i - \overline{o})^2\right]^{\frac{1}{2}}} \tag{5}$$

where *N* is the number of data points and *m* and *o* are the forecast and observed values, respectively. The MBE represents a systematic error between predicted and observational values, and the RMSE quantifies the spread in the distribution of errors. The MBE provides information on the underestimation (negative values) or overestimation (positive values) of forecasts using the measured values as reference. On the other hand, the RMSE is very sensitive to high-magnitude errors due to the higher statistical weight of large errors. The MAE represents the average magnitude of errors in a set of forecasts without considering their direction (bias) and gives the same weight to all errors (see, as an example, Chai and Draxler [34]), i.e., it is less sensitive to large deviations. RMSE is one of the most relevant statistical parameters for solar power plant analysis (e.g., Landelius et al. [35]). For all statistical

parameters, the best results are obtained when values are equal to or near zero, except for the correlation coefficient, for which values closer to one correspond to better performance. According to Moriasi et al. [36], RSR incorporates the benefits of error index statistics and includes a scaling/normalization factor, so that the resulting statistic and reported values can apply to various constituents. The RSR parameter varies between zero and a positive value, with values close to zero representing a better forecast simulation [36]. The same performance rating thresholds as shown in Table 4 of the article of Moriasi et al. [36] are used here, i.e., values of RSR < 0.5 indicate an optimal performance rating, while RSR > 0.7 represents an unsatisfactory model performance rating.
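Equations (1)–(5) translate directly into a few lines of array code. The following sketch (our illustration; the function name and dictionary output are assumptions, and night-time pairs are assumed to be removed beforehand, as the authors do) computes all five indicators with the population convention (1/N) for both RMSE and σ_obs:

```python
import numpy as np

def forecast_metrics(m, o):
    """MBE, MAE, RMSE, r and RSR for forecasts m vs observations o,
    following Equations (1)-(5)."""
    m = np.asarray(m, dtype=float)
    o = np.asarray(o, dtype=float)
    err = m - o
    mbe = float(err.mean())                     # Eq. (1): systematic bias
    mae = float(np.abs(err).mean())             # Eq. (2): mean magnitude
    rmse = float(np.sqrt((err ** 2).mean()))    # Eq. (3): spread of errors
    r = float(np.corrcoef(o, m)[0, 1])          # Eq. (4): correlation
    rsr = rmse / float(o.std())                 # Eq. (5): RMSE / sigma_obs
    return {"MBE": mbe, "MAE": mae, "RMSE": rmse, "r": r, "RSR": rsr}
```

A forecast with a constant +1 W/m² offset, for example, yields MBE = MAE = RMSE = 1 and r = 1, illustrating how the bias and correlation measures capture different aspects of forecast quality.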

#### *2.4. Cloud Area Fraction and DNI Attenuation Index (DAI)*

In most of the solar radiation studies, an index known as clear sky index or the clearness index is used to quantify the bulk atmosphere transmittance (see Iqbal [37], Lopes et al. [17], among others). This clearness index is defined as the ratio of global horizontal irradiance to extraterrestrial horizontal irradiance.

In this section, a new methodology is proposed to estimate the clearness of the atmosphere (termed DNI Attenuation Index, DAI) based exclusively on the observed direct normal irradiance. DNI varies during the day due to Sun–Earth geometry and atmospheric constituents, though the main factor of DNI variation is the cloud coverage, which can drastically reduce this component of solar radiation when the direct beam from the sun is intercepted, sometimes reaching a value of zero depending on the type of clouds. The DAI is an indicator of the cloud attenuation of DNI.

This method was based on the integration of the measured hourly mean values of DNI (see Figure 1a) obtained every day, for a given month, and it constitutes a measure of sky conditions for a particular month. This has the advantage of not relying on reanalysis, satellite data, or other products that could also be a source of bias.

**Figure 1.** Curves of hourly mean Direct Normal Irradiance (DNI) projected on the horizontal plane for two different days: (**a**) partially cloudy sky and (**b**) clear sky day. A is the area under the curve corresponding to the measured DNI (energy per unit area) and is obtained by numerical integration using the trapezoid rule.

In this way, a dimensionless quantity (in percentage) called DNI Attenuation Index (DAI) is defined as

$$DAI\_{i} = \left(1 - \frac{A\_{i}}{NF}\right) \times 100\% \tag{6}$$

where

$$A\_{i} = \int\_{t\_{0}}^{t\_{1}} I(t)\,dt \approx \frac{\Delta t}{2} \sum\_{k=1}^{24} (I\_{k-1} + I\_{k}) \tag{7}$$

with Δ*t* = 3600 s because a time step of one hour is used and NF is the normalization factor calculated as

$$NF = \max\_{1 \le i \le n} (A\_i) \tag{8}$$

in which *i* is the number of the day of the month.

To obtain the DAI, it is assumed that the maximum value of daily energy per unit area (integral) in a given month corresponds to a clear sky day in that month, so the DAI takes the value of zero for that particular day. Different normalization factors are expected for different months, with higher values during the summer. Although the DAI does not allow us to effectively distinguish the contributions of aerosols and cloud cover to DNI variations, it provides a clear idea of the transparency of the atmosphere for a specific day and helps to identify a clear day (or the clearness of the atmosphere) as opposed to an overcast day or an extreme aerosol event. The DAI varies between 0% (clear sky day) and 100% (overcast sky).

The relation between DAI and cloud fraction (in oktas) was established through three classes of days [23]: class I, clear sky day (0–2 oktas; DAI < 31.25%); class II, partially cloudy skies (3–5 oktas; 31.25% ≤ DAI < 68.75%); and class III, cloudy skies (6–8 oktas; 68.75% ≤ DAI ≤ 100%), in the same way as presented in Table 1 of the article of Jafariserajehlou et al. [38].
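The DAI computation of Equations (6)–(8) and the okta-based classification can be sketched as follows (our illustration; function names and the string labels are assumptions, and the input is the series of hourly mean DNI values of one day):

```python
import numpy as np

def daily_integral(hourly_dni):
    """Eq. (7): trapezoid-rule integral of the hourly mean DNI values
    (W/m^2, one value per hour) over a day, with dt = 3600 s, giving
    the daily energy per unit area A_i in J/m^2."""
    i = np.asarray(hourly_dni, dtype=float)
    return 0.5 * 3600.0 * float(np.sum(i[:-1] + i[1:]))

def dai(hourly_dni, nf):
    """Eq. (6): DNI Attenuation Index in percent, where nf is the
    normalization factor NF of Eq. (8), i.e. the largest daily
    integral observed in that month."""
    return (1.0 - daily_integral(hourly_dni) / nf) * 100.0

def sky_class(dai_percent):
    """Map a DAI value (%) onto the three okta-based sky classes."""
    if dai_percent < 31.25:
        return "I (clear sky, 0-2 oktas)"
    if dai_percent < 68.75:
        return "II (partially cloudy, 3-5 oktas)"
    return "III (cloudy, 6-8 oktas)"
```

By construction, the clearest day of the month (the one that defines NF) gets DAI = 0 and falls in class I, while a day receiving half that energy gets DAI = 50% and falls in class II.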

**Table 1.** Statistical indicators of comparison between observed and predicted hourly mean Direct Normal Irradiance (DNI) for the entire period (1 August 2018–31 July 2019). Bold values mean the best score.


The total cloud area fraction obtained from the Clouds and the Earth's Radiant Energy System (CERES) radiometer, combined with the Moderate Resolution Imaging Spectroradiometer (MODIS), both onboard the Terra and Aqua satellites, was also considered in this work for the assessment of the DAI estimates. The CERES–MODIS cloud mask data were obtained monthly (CERES\_SYN1deg\_Ed4.1) for the period available for this study (August 2017 until May 2019) from the CERES portal (see ceres.larc.nasa.gov). The cloud area fraction consists of the percentage of cloudy pixels identified in areas of 1° × 1° [39].

#### *2.5. Post Processing Correction*

A linear least-squares statistical method for bias correction of the daily direct solar radiation values obtained from the IFS/ECMWF was tested. This method is the simplest post-processing technique and has been applied in several studies over the past years (see, for example, Polo et al. [40]). Mejia et al. [41] found that a MOS linear-fit procedure outperformed quantile–quantile (Q–Q) mapping.

The linear regression parameters were computed for each month using forecast and observed daily values for the period of 1 August 2017 until 31 July 2018. The correction parameters are obtained, for each month, using the linear equation

$$y\_{model,j}^{i} = m\_{j}\, x\_{obs,j}^{i} + b\_{j}, \qquad i = 1, \ldots, 28/30/31;\; j = 1, \ldots, 12 \tag{9}$$

where *xobs*, *mj* and *bj* are, respectively, the observed DNI values, the slope of the fitted line, and the intercept.

The regression parameters were used to correct the IFS/ECMWF forecasts for the following year—the period of analysis (01/08/2018 until 31/07/2019)—using the following equation [42],

$$\begin{array}{l} y\_{BC\\_model,j}^{i} = y\_{model,j}^{i} - \left[ (m\_{j} - 1) \, x\_{obs,j}^{i} + b\_{j} \right] \\ i = 1, \ldots, 28/30/31; \; j = 1, \ldots, 12 \end{array} \tag{10}$$
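Assuming Equations (9) and (10) amount to an ordinary least-squares fit of monthly forecast versus observed daily values, followed by subtraction of the fitted bias, the two-step procedure can be sketched as follows (function names are illustrative, not the authors' code):

```python
import numpy as np

def fit_monthly_mos(y_model, x_obs):
    """Fit Eq. (9): y_model = m * x_obs + b over one month of daily values."""
    m, b = np.polyfit(x_obs, y_model, deg=1)
    return m, b

def bias_correct(y_model, x_obs, m, b):
    """Apply Eq. (10): y_BC = y_model - [(m - 1) * x_obs + b]."""
    return np.asarray(y_model) - ((m - 1.0) * np.asarray(x_obs) + b)
```

Note that if the forecast error were exactly linear in the observations, this correction would recover the observed values.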

#### **3. Results and Discussion**

#### *3.1. Assessment of Hourly and Daily DNI Forecasts*

As an example, Figure 2 shows the time series of predicted hourly mean DNI during four consecutive days and the corresponding observed values for two selected cases, one in JJA and the other in SON: forecasts issued on 1 August 2017 00:00 and on 27 November 2017 00:00.

**Figure 2.** Example of four consecutive days of observed (red line) and forecasted (blue line) hourly mean DNI in Évora starting at (**a**) 1 August 2017 00:00 and (**b**) 27 November 2017 00:00.

The IFS/ECMWF forecasts show behavior similar to that of the observational DNI for the two selected cases, with fairly good agreement, especially in the August case (Figure 2a). Figure 2b shows a partially cloudy day (day\_0) and a cloudy day (day\_1), making it evident that the model did not predict clouds correctly on November 28, since the observational data clearly show an overcast day. Another interesting feature in Figure 2 is that the IFS/ECMWF scheme slightly underestimated the DNI in the summer case (Figure 2a) and overestimated it in the autumn case (Figure 2b) on the partly cloudy day. It should be noted that Figure 2 presents only a selected example.

Figure 3 shows the comparison between ground-based measurements of hourly mean DNI and forecast data obtained from IFS/ECMWF for the entire period of study and the four forecasted days.

As expected, the errors associated with the hourly DNI forecast are quite significant, with strong scatter around the y = x line (dashed line). The slope of the regression line indicates the quality of the forecasts, and it is possible to conclude that the DNI IFS/ECMWF forecast is reasonable for the first three days ahead, since the density of points is higher around the y = x line (dashed line in Figure 3). The worst forecast is for day\_3. From Figure 3, it is also possible to verify that the IFS overestimates DNI for lower values and underestimates it for higher irradiance values. This underestimation can be explained by the use of a constant monthly aerosol climatology in the IFS/ECMWF, as argued by Lopes et al. [17], who concluded that the model tends to underestimate DNI under very clear sky atmospheric conditions, when the actual aerosol concentrations are below mean values.

**Figure 3.** Scatter plots of predicted vs. measured hourly mean DNI for: (**a**) day\_0, (**b**) day\_1, (**c**) day\_2, and (**d**) day\_3, during the entire period considered. The dashed line represents the y = x line and the solid line is the least-squares regression fit.

The statistical errors, on an hourly basis, are presented in Table 1.

The assessment between datasets shows that errors increase from the first day of the forecast to the last; day\_0 exhibited the best performance, with the lowest errors. The MBE between calculated and measured DNI is smaller than 18 W/m2. Regarding the MAE and RMSE, their values increase from day\_0 to day\_3 (the fourth day of the forecast), with differences between the two days of ~61 W/m2 (~45%) and ~72 W/m2 (~37%), respectively. High correlation coefficients (r ≥ 0.70) are obtained between the observations and forecasts for all forecast horizons (see Table 1).
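For reference, the indicators reported in Table 1 (MBE, MAE, RMSE, and the Pearson correlation coefficient r) can be computed from paired forecast/observation series as in the generic sketch below (not the authors' code):

```python
import numpy as np

def error_metrics(pred, obs):
    """Return MBE, MAE, RMSE and Pearson r between forecasts and observations."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    diff = pred - obs
    mbe = diff.mean()                      # Mean Bias Error
    mae = np.abs(diff).mean()              # Mean Absolute Error
    rmse = np.sqrt((diff ** 2).mean())     # Root Mean Square Error
    r = np.corrcoef(pred, obs)[0, 1]       # Pearson correlation coefficient
    return mbe, mae, rmse, r
```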

The boxplots of Figure 4 show the MBE, MAE, RMSE, and correlation coefficient based on hourly values for each forecast day and the entire period of data (365 days).

The MBE indicates a slight overestimation of hourly mean DNI for the majority of the forecast days (>50%) in the period. The length of the Interquartile Range (IQR) is a measure of the relative dispersion of a dataset, and Figure 4a shows a similar IQR length for the first two days of forecasts, with values in the range (~−80; ~100 W/m2). On the other hand, the difference between the IQR of the first forecast day and that of the last one in the same plot is about 24%. Concerning MAE and RMSE, as expected, a pattern similar to the MBE was found, with errors increasing as the lead time of the forecast increases, and a relative percentage error (relative to the mean) between day\_0 and day\_3, for both parameters, of the order of 30%. It is worth noting that a significant number of outliers exist after the second day of forecasting. As for the correlation coefficients, these values indicate good forecast performance, with the best results obtained for day\_0, which has the highest median value of ~0.98 (Figure 4d). The correlation coefficient (r) shows good performance for all forecast days in the analysis.

**Figure 4.** Boxplots of statistical indicators based on hourly values for (**a**) Mean Bias Error (MBE), (**b**) Mean Absolute Error (MAE), (**c**) Root Mean Square Error (RMSE), and (**d**) Correlation Coefficient (r), for different days ahead of forecast. The crosses represent the mean value of the sample, the horizontal solid line within the box represents the median, and the bottom and top of the boxes indicate the first and third quartiles, respectively. Boxes correspond to the Interquartile Range (IQR) where 50% of the data is located. The circles represent the outliers, and the lower and upper ends of the whiskers are the minimum and maximum values of the datasets, respectively.
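The boxplot statistics in Figure 4 (median, first and third quartiles, and IQR) follow the standard definitions; a minimal sketch of their computation (illustrative, not the authors' code):

```python
import numpy as np

def box_stats(values):
    """Median, quartiles and Interquartile Range (IQR) for one boxplot."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return {"q1": q1, "median": med, "q3": q3, "iqr": q3 - q1}
```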

Considering now the daily mean values, Figure 5 shows the comparison between measured and predicted DNI for the entire period.

As observed in the case of hourly values, the differences between observed and predicted DNI increase from the first day of the forecast to the last one. Another common feature is the DNI overestimation for lower values of direct normal irradiance, as can be seen through the trend lines. The statistical indicators obtained are comparable to those of the hourly analysis, showing a low forecast bias, with MBE values below 7 W/m2 for all forecast days. Regarding the RMSE, an increase of 35% (from ~61 W/m2 to ~76 W/m2) is observed between the first and the last day of the forecast. It is evident from the scatter plots of Figure 5 that between roughly 250 and 350 W/m2 the distribution of data points is closer to the y = x line (ratio 1:1), which reveals good agreement between observations and predictions. The overestimation occurs for observational DNI values below 200 W/m2, with larger dispersion, which may reflect inaccuracies in the IFS cloud representation.

Figure 6 shows the monthly mean of daily values of simulated and measured DNI for the period between 1 August 2018 and 31 July 2019, thus allowing us to analyze the similarity between datasets throughout the year for the different forecast days.

**Figure 5.** Comparison between predicted and measured daily mean DNI for the four prediction days: (**a**) day\_0, (**b**) day\_1; (**c**) day\_2; (**d**) day\_3. MBE, MAE, RMSE, and *r* are also presented in each plot. The solid lines are the linear fits and the dashed line represents the y = x line.

**Figure 6.** Monthly mean of predicted and observed daily mean DNI in Évora for the four different forecast days in the period from August 2018 to July 2019.

As shown in Figure 6, the IFS/ECMWF model overestimates the radiation in more than ~50% of the days throughout the year, independently of the forecast day, although with small differences between datasets.

The variation of statistical indicators between datasets (daily) grouped by months is presented in Figure 7.

**Figure 7.** Statistical indicators obtained from the comparison between measurements and predictions of daily mean DNI values. (**a**) MBE; (**b**) MAE; (**c**) RMSE; (**d**) correlation coefficient.

Overall, MBE, MAE, and RMSE present better results for the first day of the forecast (day\_0). The highest values of statistical errors correspond to the forecasts obtained for day\_3.

From Figure 7a, MBE values range from about −42 W/m2 to 35 W/m2, with ~60% of the months showing positive MBE values (independently of the forecast day). According to the same figure, the underestimation occurs in two-thirds of the months belonging to the MAM and SON seasons, probably as a consequence of a less accurate representation of the clouds (or aerosols) at short time scales in the radiative scheme of IFS/ECMWF.

According to Lopes et al. [17], the IFS global model from ECMWF tends to underestimate DNI in clear sky conditions due to the use of a monthly mean profile of aerosols. Perdigão et al. [6] also used the same argument in the assessment and characterization of the shortwave downward radiation incident at the Earth's surface over the Iberian Peninsula using the mesoscale Weather Research and Forecasting (WRF) model.

Figure 7b,c shows that MAE and RMSE present lower values, independently of the forecast day, in the JJA and SON seasons. MAE and RMSE present a variation similar to that found for MBE (Figure 7b,c), with values ranging from 33 W/m2 to 87 W/m2 and from 36 W/m2 to 102 W/m2, respectively. As mentioned above, the results show high correlation coefficients for all forecast days, although the IFS/ECMWF model performs better on the first day of the forecast.

The majority of the statistical errors found in this work are in line with the values obtained in DNI forecasting by Nonnenmacher et al. [43] and Lara-Fanego et al. [18] using the WRF model, and by Gala et al. [44] using a clear sky model, among other studies. Table 2 of the article of Law et al. [13] shows a summary of the state of the art of DNI accuracy obtained from NWP and other methodologies.

#### *3.2. Relation between the DNI Attenuation Index (DAI) and DNI Forecasts*

Inaccurate representation of clouds in the radiative transfer scheme of global numerical weather prediction models is the primary cause of errors in the prediction of solar radiation. In this section, the DNI attenuation index (DAI), as defined in Section 2.4 (Equation (6)), is proposed to assess and analyze the impact of cloud representation on the DNI forecast from the IFS/ECMWF model.

Before analyzing the relationship between the DAI index, computed using data from the Évora radiometric station (Section 2.1), and the quality of the DNI forecast errors, the reliability of DAI is assessed on a monthly basis using the Total Cloud Area Fraction from CERES (Section 2.4) for the same location. According to Almorox et al. [45], monthly global solar irradiation obtained from CERES provides very good accuracy for solar radiation studies, since their results show a good fit between CERES data and solar radiation data from different meteorological stations over Spain.

The linear regression between the CERES cloud fraction and DAI shows good agreement between these two indexes (Figure 8a), with a correlation coefficient of r ~0.92. When comparing the monthly temporal evolution of DAI and the cloud area fraction (CERES), both time series exhibit a similar pattern (Figure 8b) and show, as expected, a decrease of cloud cover in the JJA season in contrast with an increase of cloud cover in the DJF season.

**Figure 8.** (**a**) Monthly mean cloud cover from the Clouds and the Earth's Radiant Energy System (CERES) versus the DNI Attenuation Index (DAI) in Évora, and (**b**) temporal evolution of the CERES cloud fraction (red line) and DAI (blue line) in the period between August 2017 and July 2019 (twenty-two months). The black solid line represents the linear fit.

The major discrepancies between datasets occur in February 2018, February 2019, and in August 2018, corresponding to periods in which the region was affected by aerosol events (Saharan dust particles, in February, and forest fires in August).

The results suggest that DAI may be used as a proxy for cloud cover, particularly suitable for estimating the impact of clouds on the DNI forecast.

As for the observations, the DAI of the model predictions was also calculated and is hereafter referred to as DAI (IFS). Figure 9 shows a boxplot comparison between the DAI and DAI (IFS) indexes.

From Figure 9, it can be seen that in the majority of the analyzed months the DAI (IFS) is lower than the DAI, meaning that the cloud scheme in the IFS/ECMWF model underestimates cloud and aerosol events when compared with the DAI index. The DAIs are characterized by lower variability in summer, with more than 50% of the days presenting values lower than 31.25% (clear sky days), and by higher variability during the spring season (higher IQR values, with more than 50% of the days presenting values higher than 31.25%). These results are in line with the study by Royé et al. [46], in which low levels of cloudiness (clear sky days) over the Iberian Peninsula were found for the summer months, using satellite data from MODIS for the period 2001–2017, except for the Cantabrian coast. On the other hand, the variability of the observed DAI is higher, mainly due to higher variability in the actual cloud cover and aerosol concentration values. This variability also explains the relatively high errors reported in the previous section. Perdigão et al. [6] also found high variability in downward shortwave radiation during March.

**Figure 9.** Monthly boxplots of daily mean values of DAI based on observations (OBS) and IFS/ECMWF forecasts (IFS), for Évora. The red circles represent outliers (maximum value).

Figure 10 shows the relation between the cloud class, grouped according to cloud coverage (in oktas) estimated from the DAI as indicated in Section 2.4, and the RMSE statistical indicator (in three ranges of values), obtained for each day from hourly raw values. The day type class (I–clear, II–partially cloudy, and III–overcast) is obtained following the WMO guidelines [23].

From both plots of Figure 10, it becomes evident that RMSE strongly depends on the sky conditions, and it is possible to verify that:

(i) Approximately 19% of the days present RMSE values lower than 100 W/m2 (blue dots in Figure 10a). This percentage corresponds mostly to a cloud coverage lower than or equal to two oktas, i.e., clear sky days;


**Figure 10.** (**a**) Scatter plot of DAI versus Root Mean Square Error (RMSE) for day\_0 and (**b**) the number of days in a seasonal basis within different ranges of forecast errors grouped in classes, according to the cloud coverage–class I (0–2 oktas), class II (3–5 oktas), or class III (6–8 oktas) for RMSE.

The analysis of the climatology of cloud cover at Évora based on DAI for the period from 1 August 2018 to 31 July 2019 (Figure 10b) shows that about 45% of the days are in class I, considering only RMSE errors below 200 W/m2; these days occur mainly in the MAM and JJA seasons, when more clear sky days occur over the city of Évora. These values are consistent with those found by Sanchez-Lorenzo et al. [48] and by Perdigão et al. [6] for the same sky conditions over the Iberian Peninsula.

Concerning the cloud coverage of classes II and III, for the period under analysis and independently of the RMSE values, there are ~37% and ~13% of days, respectively. Cloud coverage of class III is mostly found in the DJF season.

It is important to mention that the quality and reliability of solar radiation forecasts are directly related to the accuracy of the representation of clouds, as well as of aerosols. For instance, Lara-Fanego et al. [18] found RMSE values ranging from 20% to 100% for clear and cloudy skies, respectively, using the WRF model over Andalusia.

Another feature can be seen in Figure 11, where it is possible to observe a relation between the RSR and the DAI for the first day of forecast (day\_0).

As expected, Figure 11 shows that as the DAI increases, the RSR increases. The forecast obtained from IFS/ECMWF presents better performance for clear sky days. In general, the DNI forecast tends to be more accurate (RSR < 0.5) for lower values of DAI, which represent ~50% of the days in Évora. This fraction of clear sky days is of the same order as in the study of Freile-Aranda et al. [49], who found, for a climatic region that includes Évora, a minimum cloud cover (on an annual average) of around 43%. The best performance (of IFS/ECMWF) is found for the summer months (smaller values of RSR ≤ 0.5 and DAI within class I, ≤2 oktas). For instance, Kraas et al. [12] also found that forecasts are generally more reliable during the summer season than in other seasons.
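The RSR used above is the ratio between the forecast RMSE and the standard deviation of the observations; a minimal sketch of its computation (illustrative, not the authors' code):

```python
import numpy as np

def rsr(pred, obs):
    """RMSE-to-observations-standard-deviation ratio (dimensionless)."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    return rmse / np.std(obs)
```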

#### *3.3. Statistical Bias Correction Analysis of Daily DNI Forecasts*

The bias between solar radiation forecasts from Numerical Weather Prediction models and observations can be decreased by applying a post-processing bias correction. This Bias Correction (BC) methodology has been used in solar radiation studies by several authors, such as Ruiz-Arias et al. [7], Polo et al. [40,49], Perdigão et al. [6], and Mejia et al. [41], among others.

The forecast values of DNI are corrected following the methodology described in Section 2.5. Figure 12 shows the result of the application of the MOS method for the various forecast horizons.

**Figure 11.** Relation between DAI and the RMSE-to-observations standard deviation ratio (RSR), based on hourly mean DNI forecasts and measurements for the first day of predictions between 1 August 2018 and 31 July 2019 (one year). RSR is dimensionless, varying between zero and a large positive number.

Error metrics are also presented in graphs using the new corrected predictions.

*Forecasting* **2020**, *2*

Overall, the MOS correction significantly improves the results. The corrected data exhibit less dispersion around the y = x line. Statistical errors decrease on the order of 30% when compared with the initial IFS predictions, independently of the forecast day. For example, MAE values decrease from the interval (49–60 W/m2) to (32–41 W/m2), while RMSE values decrease from (61–76 W/m2) to (43–57 W/m2). The correlation coefficient is in line with the previous values, i.e., all *r* values were improved, reaching values ≥0.89.

**Figure 12.** Comparison between predicted and measured daily mean DNI for the four prediction days: (**a**) day\_0; (**b**) day\_1; (**c**) day\_2; (**d**) day\_3, before (blue dots) and after Bias Correction (red dots). MBE, MAE, RMSE, and r, after BC, are also presented in each plot. The solid lines are the linear fits (green for the BC procedure and black for the IFS/ECMWF raw data) and the dashed line represents the y = x line.

For a better comparison, Figure 13 shows the cumulative distribution functions (CDF) of daily DNI, for each day of forecast, before and after the correction procedure.

The results in Figure 13 show the similarity of the CDFs between observational and IFS/ECMWF outputs before and after the bias correction. In general, and independently of the forecast day, the linear regression method successfully improved the DNI outputs, with the new corrected cumulative distribution functions lying closer to that of the observed DNI.
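Cumulative distribution functions of this kind are empirical CDFs built directly from the daily samples; a generic sketch (not the authors' plotting code):

```python
import numpy as np

def empirical_cdf(values):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    return x, p
```

Plotting (x, p) for observations, raw forecasts, and bias-corrected forecasts yields curves of the type compared above.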

**Figure 13.** Cumulative distribution functions (CDF) of daily mean DNI grouped by day of forecast, (**a**) Day\_0; (**b**) day\_1; (**c**) day\_2; (**d**) day\_3, from 1 August 2018 to 31 July 2019, original forecasts (black line), forecasts after Bias Correction (red line) and observations (blue line).

#### **4. Conclusions**

Given the importance of alternative energies in a sustainable economic, social, and environmental perspective, it is important to know in advance how solar, wind, or other renewable energy resources change on an hourly, daily, or monthly basis. In this work, DNI from the Integrated Forecasting System of the European Centre for Medium-Range Weather Forecasts (IFS/ECMWF) was evaluated over one year (1 August 2018 to 31 July 2019) against observed DNI, at hourly time scales, for one station located at Évora (south of Portugal) and for different days ahead of the forecast (up to three days ahead).

Statistical indicators (MBE, MAE, RMSE, RSR, and the correlation coefficient) were used to assess the relation between IFS/ECMWF and observational DNI. Additionally, this paper also describes a new methodology based on DNI observations (the so-called DNI attenuation index, DAI) to estimate the transparency of the atmosphere in a particular region. The DAI was evaluated against the total cloud cover parameter obtained from CERES product data, and the results showed high correlation coefficients between datasets, suggesting that DAI can be used as a proxy to classify the cloud coverage in direct solar radiation studies. This index was used to analyze the relation between the cloud coverage and the predicted DNI, as well as the respective associated error.

The IFS/ECMWF DNI forecasts present magnitudes and patterns similar to the observational data, but the errors increase with the forecast lead time (from 1 to 3 days ahead). Discrepancies between modelled and measured radiation are relatively small, mainly for the first three days of forecasts. The analysis of hourly data showed that the DNI is overestimated by IFS/ECMWF. Regarding the correlation coefficient, values above 0.7 are found, independently of the forecast day. This work also shows that the first day ahead forecast (day\_1) has error magnitudes similar to those of the first 24 h forecast (day\_0).

The hourly analysis also shows MBE values lower than 20 W/m2. Regarding the MAE and RMSE values, an increase from day\_0 to day\_3 was observed, with differences between the two days of ~45% and ~37%, respectively. High correlation coefficients (r ≥ 0.7) are found for all forecast days.

The daily analysis shows better results, with MBE values lower than 7 W/m2 for all forecast days. In the case of RMSE, values increase by about 35% from the first day of forecast to the last one. The correlation coefficients of the daily data are higher than those of the hourly data, ranging between 0.82 (day\_3) and 0.89 (day\_0).

The mean monthly cloud coverage is well captured by DAI throughout the year. As expected, the observed DNI is higher in the spring and summer months, with the lowest DAI values in the same seasons. The underestimation of cloud cover by the IFS/ECMWF seems evident, since the comparison between observed and predicted DAI reveals that the model tends to underestimate the effects of clouds on DNI. This relation was also found for Andalusia (in the Iberian Peninsula) using the WRF model by Lara-Fanego et al. [18] in the case of three-day-ahead DNI forecasts.

The accuracy of IFS/ECMWF in forecasting DNI is higher for clear or partially cloudy sky days. The DAI index confirms that the performance of the IFS model decreases with increasing cloud/aerosol effects.

A bias correction post-processing based on linear regression was used to correct the IFS/ECMWF predictions, which was shown to significantly improve the forecast for Évora, with a decrease on the order of 30% for all statistical error metrics except the correlation coefficient, independently of the number of days ahead under consideration.

The results obtained in this work are consistent with those obtained by Lopes et al. [17] for the same location, and by Nonnenmacher et al. [14] and Troccoli and Morcrette [19], among others, where it was found that errors increase with the forecast lead time. Overall, ECMWF DNI forecasts provide valuable information for the management and operation of CSP plants, especially after applying the post-processing bias correction.

**Author Contributions:** Conceptualization, J.P., P.C., R.S. and M.J.C.; formal analysis, J.P.; methodology, J.P., P.C., R.S. and M.J.C.; resources, P.C., R.S. and M.J.C.; supervision, P.C., R.S. and M.J.C.; writing—original draft, J.P.; writing—review and editing, J.P., P.C., R.S. and M.J.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work is co-funded by the European Union through the European Regional Development Fund, included in the COMPETE 2020 (Operational Program Competitiveness and Internationalization) through the ICT project (UIDB/04683/2020) with the reference POCI-01-0145-FEDER-007690 and also through the DNI-A (ALT20-03-0145-FEDER-000011) and LEADING (PTDC/CTA-MET/28914/2017) projects.

**Acknowledgments:** The authors acknowledge the ECMWF for providing DNI forecasts through its website, and the NCEP Reanalysis data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their website at https://www.esrl.noaa.gov/psd/. The authors also thank the CERES Science Team for the data provided.

**Conflicts of Interest:** The authors declare there is no conflict of interest with regard to this manuscript.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Performance Comparison between Deep Learning and Optical Flow-Based Techniques for Nowcast Precipitation from Radar Images**

#### **Marino Marrocu \* and Luca Massidda**

CRS4, Center for Advanced Studies, Research and Development in Sardinia, loc. Piscina Manna ed. 1, 09050 Pula, Italy; luca.massidda@crs4.it

**\*** Correspondence: marino.marrocu@crs4.it

Received: 26 May 2020; Accepted: 22 June 2020; Published: 24 June 2020

**Abstract:** In this article, a nowcasting technique for meteorological radar images based on a generative neural network is presented. This technique's performance is compared with that of state-of-the-art optical flow procedures. Both methods were validated using a public domain dataset of radar images covering an area of about 10<sup>4</sup> km<sup>2</sup> over Japan and a period of five years, with a sampling frequency of five minutes. The neural network, trained with three of the five years of data and producing forecasts with a time horizon of up to one hour, was evaluated over one year of the data; its performance proved to be significantly better than that obtained with the techniques currently in use.

**Keywords:** nowcast; meteorological radar data; optical flow; deep learning

#### **1. Introduction**

Forecasting precipitation at very high resolutions in space and time, particularly with the aim of providing the boundary conditions for hydrological models and estimating the associated risk, is one of the most difficult challenges in meteorology [1]. In short, the problem is to predict the field of precipitation on grids of 1 km or less and for horizons of less than a few hours (nowcast) [2]. Due to the high spatial resolutions required, methods based on meteorological models are not effective: they are computationally too onerous, and the time to perform a simulation is usually too long for operational purposes. They also depend critically on the initial conditions, as the level of uncertainty is high and it is difficult to assimilate locally recorded data [3,4]. Even if these difficulties were overcome, a fundamental difficulty would remain: the gap between the spatial scales provided by the meteorological models and those useful to hydrologists for operational purposes. To overcome this difficulty and quantify the uncertainty involved, probabilistic approaches have been proposed [5].

In contrast, simplified approaches based on image processing and related data extrapolation, using radar reflectivity or remote sensing data, have proven to be more effective. For example, in [6] an application of a semi-distributed hydrological model in Mediterranean basins using rainfall estimated by radar was discussed, and a more recent paper discussed an application with a coupled soil and groundwater model [7]. A study discussing the propagation of the uncertainty in weather radar estimates of rainfall through hydrological models can be found in [8]. This approach has proven even more promising with new methodologies based on recent advances in deep learning techniques [9].

The simplest techniques for nowcasting on meteorological radar data consist of two phases, tracking and extrapolation: first the velocity field is estimated, typically with image comparison techniques called Optical Flow (OF); this field is then used to extrapolate a prediction of radar images through semi-Lagrangian schemes or interpolation procedures [10]. This procedure estimates the precipitation as seen by the radar, assuming that the rate of advection, evaluated by comparing a sequence of two or more consecutive images, stays constant, with no sink or source terms altering the intensity of the field over the chosen forecast range. Unfortunately, given the high spatio-temporal variability of precipitation, these two assumptions almost never hold, and this is all the more true the longer the forecast time horizon is. For this reason, a forecast procedure based on optical flow has a limited validity, with lead times of the order of one hour. Despite this, several nowcasting procedures for radar images and related by-products based on OF are operational in meteorological centers and are able to provide predictions with appreciable skill with a time horizon of up to three hours [11,12], when used with additional information from numerical model data and real-time precipitation measurements. For this reason, the topic, although widely investigated, is still of interest for research [13]. This work will focus on the analysis of nowcast methods based only on radar data. The intent is to assess the predictive potential of the data itself, with the idea that this can add value even when operating in conjunction with additional information.

The first nowcasting methods based on OF [14] predicted the evolution of the image from one instant to the next by means of a rigid translation, with an advection field determined by the criterion of maximum correlation between two or more images immediately preceding the time of emission of the forecast. In the TREC method, the advection field is not constant but is evaluated by comparing boxes of pixels of equal size between two successive radar images and identifying similar boxes through cross-correlation. An advection vector connects the centers of two similar boxes. Repeating the process on boxes of optimal size distributed over the image, it is possible to determine a non-constant advection field that is then used for the extrapolation of the precipitation field [15]. Over the years these methods have been refined, providing the possibility to advect separately the features defined as "interesting" within the images, such as the position of the precipitating centers or the maximum gradient points, as in the Shi–Tomasi method [16]. Then the possibility of deformation of the advected pattern was added, applying for example an affine transformation [17]. All these methods and several other variants have the conceptual limitation of being deterministic, not taking into account the high uncertainty inherent in this overly simplistic approach. The so-called SPROG approach [18] takes into account the different predictability of the precipitation at different scales. It decomposes the logarithmic transformation of the initial field into a multiplicative cascade of fields of decreasing spatial scale and advects each of them separately, keeping coherence with the pre-forecast period. Finally, the field at each forecast deadline is calculated as the antilogarithm of the product of the fields at the individual scales. More details of this and other techniques used as benchmarks are provided in the following section.
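The rigid-translation idea described above can be illustrated with a pure-NumPy toy sketch that estimates a single integer displacement by maximum circular cross-correlation (computed via FFT) and then advects the latest frame; operational libraries such as *rainymotion* and *pysteps* instead use dense optical flow and semi-Lagrangian extrapolation, so this is only an illustration of the principle:

```python
import numpy as np

def estimate_shift(prev, curr):
    """Integer (dy, dx) displacement maximizing circular cross-correlation."""
    cc = np.fft.ifft2(np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))).real
    dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
    return dy, dx

def extrapolate(curr, shift, steps=1):
    """Advect the latest frame by the estimated shift for `steps` lead times."""
    dy, dx = shift
    return np.roll(curr, (steps * dy, steps * dx), axis=(0, 1))
```

Note the periodic boundary conditions implied by `np.roll` and the FFT; a real nowcast would handle borders and sub-pixel motion explicitly.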

More recently, the potential offered by the highly non-linear approach of neural networks has been investigated. These techniques use a large amount of past data to "learn" and devise a predictive model. The first studies in this line of research on forecasting precipitation from radar images were introduced by the Hong Kong Observatory group [19,20]. In a work published in 2015 [19], the authors introduced the ConvLSTM model and demonstrated that it outperforms the OF-based operational model, called SWIRLS (Short-range Warning of Intense Rainstorms in Localized Systems), for forecasting horizons of up to two hours. In a follow-up [20], the same authors introduced the TrajGRU model, which uses the same convolutional and recurrent deep networks as the previous one, but where a location-dependent structure for radar image processing is introduced, improving the performance over both the SWIRLS and ConvLSTM models. Recently, a variant of the PredNet architecture was proposed in [21], while in [22] a fully convolutional deep neural network is used, investigating in particular the importance of data pre-processing, network structure, and loss function.

With the aim of investigating and comparing the predictive potential of different approaches, and in order to find methods that offer room for improvement, this work proposes and evaluates the performance of a PredNet deep neural network [23]. The architecture is borrowed from computer vision, where it represents one of the state-of-the-art approaches for short-term video prediction.

Compared to [21], where a similar network was used, this version adopts a modified loss function for the training phase that aims to maximize the predictive performance of the architecture. A quantitative and qualitative comparison of the results obtained with PredNet versus the reference OF methods is presented, highlighting not only the different performance but also the distinct characteristics, and limits, of the forecasts obtained.

The rest of the paper is organized as follows: Section 2 illustrates the methodologies used as benchmarks, Section 3 presents the modified neural network used in this analysis, Section 4 describes the data used for the numerical comparison while results are discussed in Section 5. Section 6 hosts the conclusions and a discussion of potential future developments.

#### **2. Benchmark Nowcast Techniques**

The methodologies used as benchmarks to assess the performance of the approach proposed in this paper are those implemented in two public domain libraries called *rainymotion* [10] and *pysteps* [24].

The *rainymotion* library, built on *opencv* [25], implements four different nowcasting methods called Dense, Dense Rotation, SparseSD and Sparse. The Dense and Dense Rotation methods are based on standard Optical Flow estimation techniques. The Sparse and SparseSD methods, instead of tracking the displacement of all pixels in the image, first detect a chosen number of "good features to track" [16], typically the boundaries of moving objects against the constant background, and trace their most likely displacement with optical flow techniques [14]. Next, they calculate and apply the affine transformation that maps the "interesting" features from one frame to the following [17] and finally fill the resulting image by interpolation [26]. The Sparse method uses a sequence of images of variable length (in our case 12) preceding the nowcast instant to trace the "interesting" features of the field, while SparseSD uses only the last two images. Since, in all the tests we carried out, the Sparse (SP henceforth) and Dense Rotation (DR henceforth) methods have proven to perform better than the corresponding SparseSD and Dense methods, the results presented below refer only to SP and DR.
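The core of the local optical-flow step used by these libraries is the Lucas-Kanade constraint, which can be illustrated with a minimal numpy sketch. This is not the rainymotion implementation: it estimates a single rigid translation from one global window, with none of the pyramids, feature selection or interpolation of the real methods.

```python
import numpy as np

def lucas_kanade_shift(img0, img1):
    """Least-squares estimate of a single translation (u, v) between two
    frames from the optical-flow constraint Ix*u + Iy*v + It = 0.
    (Sketch only: one global window, no pyramids or feature tracking.)"""
    Iy, Ix = np.gradient(img0.astype(float))      # spatial gradients (rows, cols)
    It = img1.astype(float) - img0.astype(float)  # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v                                   # displacement in pixels
```

Applied to a smooth feature displaced by one pixel, the least-squares solution recovers the advection vector to first order; the Sparse methods do this per tracked feature rather than globally.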

Pysteps [24] is also a public domain library, managed and developed by a large community, which besides being an operational tool for nowcasting is used as a development and study platform for the implementation of new techniques. For the estimation of the advection field, pysteps implements three methodologies: a local one based on the classic Lucas-Kanade (LK) technique [14,27], a global variational one (VET) [28,29] and a spectral one (DARTS) [30]. Extrapolation by advection is done with a backward-in-time semi-Lagrangian scheme [29]. The most sophisticated technique of the pysteps library, used here as a benchmark and called STEPS, takes into account the different predictability of precipitation phenomena at different scales. In practice, following the SPROG approach [18], after a logarithmic transformation the initial field is decomposed into a multiplicative cascade of fields of decreasing spatial scale, and each of these fields is advanced with an advection field estimated consistently over the period before the nowcast. The decomposition at different scales can be achieved with techniques of varying complexity, such as the Discrete Cosine Transform or wavelets. In this case an FFT was used, in which the different portions of the spectrum are weighted by Gaussian functions [31] and then brought back into the spatial domain by inverse FFT. The additional uncertainty due to the growth, decay, creation and/or definitive disappearance of precipitating cells during the forecast validity period, alongside the uncertainties linked to the different spatial scales, is taken into account through a stochastic noise, built on the space and time variability observed in the period preceding the forecast issue, added to each partial field. For the implementation details please refer to [24] and the references therein.
This approach makes it possible to obtain not a single deterministic forecast but a set of equally probable forecasts (ensemble), and therefore also to quantify its space-time uncertainty. In this work we limited ourselves to studying the performance of deterministic forecasts, deliberately leaving aside the aspects related to space-time uncertainty that emerge from the discussion at the end of Section 5. To allow the comparison with the other methodologies, we use the deterministic forecast obtained by averaging all the STEPS ensemble members, indicated hereafter as ST.

#### **3. Proposed Nowcast Technique**

The nowcasting problem can be tackled with artificial intelligence techniques, exploiting the recent results of neural network approaches in the fields of computer vision and natural language processing. As a matter of fact, radar imagery nowcasting can be thought of as a supervised learning problem, where the parameters of a neural network are adjusted to minimize an objective function that depends on the differences between the estimated precipitation image and the real field. Among the various network architectures proposed in computer vision and, in some cases, already applied to the problem of meteorological nowcasting, we opted for the PredNet model, developed in computer vision for the forecasting of future frames in a video sequence. A key insight of this and other deep neural network methods is that, in order to predict how the visual scenery will change over time, an agent must have at least some implicit structural model of the objects represented by the sequence of images and of the possible transformations that these objects may undergo. In the case of precipitation nowcasting, this translates into the need for the network to assimilate a representation of the movement of the clouds and of the development of the related precipitation. In the proposed architecture, this means the ability to interpret the cloud movements, capturing the advective part of the phenomena, and also to predict variations in intensity linked, for example, to the orography of the domain.

The PredNet architecture [23] continuously strives to predict the appearance of future video frames, using a deep recurrent convolutional network with connections from large scale to small scale (top-down) and vice versa. The approach exploits the experience of similar models based on recurrent and convolutional networks for the prediction of successive video frames, but draws particular inspiration from the concept of "predictive coding" in the neuroscience literature, which assumes that the brain continuously makes predictions by anticipating incoming sensory stimuli. The top-down connections of the PredNet network transmit these predictions, gradually increasing the details, to be compared with real observations and generate an error signal. This is then propagated back to coarser scales (bottom-up) across the network, ultimately leading to a forecast update.

The PredNet architecture used is shown schematically in Figure 1 and consists of a ladder structure with a descending branch that generates the prediction of the precipitation field image at progressively more detailed scales, and an ascending branch where the prediction is compared with the real data.

The generative block *G<sub>l</sub>* receives as input a representation of the state coming from the upper block, *r<sub>l+1</sub>*, performs an upscaling (by a factor of 2) and concatenates the image thus obtained with the representation of the error *e<sub>l</sub>* for that scale, calculated at the previous iteration and stored in memory. This data is passed to a recurrent Long Short-Term Memory cell with convolutional filters [19], which in turn retains a memory of its own state. The result of this processing is the state at layer *l*, *r<sub>l</sub>*, which is further processed by a convolutional layer to obtain a representation of the precipitation, *h<sub>l</sub>*.

The discriminant block *D<sub>l</sub>* receives as input the precipitation forecast *h<sub>l</sub>* for the scale under examination, and the representation of the real precipitation field at the same scale, *a<sub>l</sub>*. From the comparison of the two images an error image is generated. This is stored in memory as the input of the generative block at the next iteration of the model, and is further processed by convolution and pooling, generating a lower-detail representation of the precipitation, *a<sub>l+1</sub>*, which is then passed to the upper discriminative block.
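The two-branch data flow of the ladder can be sketched with a toy numpy routine. This is emphatically not the PredNet implementation: simple averages stand in for the convolutional and ConvLSTM blocks, but the flow (generate top-down, compare bottom-up, store the errors for the next iteration) mirrors the description above.

```python
import numpy as np

def up2(x):    # nearest-neighbour upscaling by a factor of 2
    return np.kron(x, np.ones((2, 2)))

def down2(x):  # 2x2 mean pooling
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def prednet_step(a0, r, e):
    """One top-down/bottom-up pass of a PredNet-like ladder (toy sketch:
    the averages below are hypothetical stand-ins for the real blocks)."""
    n = len(r)
    h = [None] * n
    # Generative branch G_l: upscale the state from the coarser level and
    # blend it with the error representation stored at this scale.
    for l in reversed(range(n)):
        top = up2(r[l + 1]) if l + 1 < n else np.zeros_like(r[l])
        r[l] = 0.5 * (top + e[l])   # stand-in for the ConvLSTM update
        h[l] = r[l]                 # stand-in for the output convolution
    # Discriminative branch D_l: compare the prediction with the actual field
    # at each scale, store the error, and pass a coarser field upward.
    a = a0
    for l in range(n):
        e[l] = np.abs(h[l] - a)     # error image fed back next iteration
        a = down2(a)                # representation a_{l+1} for the level above
    return h[0], r, e
```

The stored errors make the model recurrent across iterations: the next pass generates its prediction from what the previous comparison got wrong at each scale.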

**Figure 1.** PredNet architecture: in the left the general architecture with the Discriminator blocks and the Generator blocks, in the right details of blocks.

Table 1 shows the sizes of the tensors *a<sub>l</sub>*, *h<sub>l</sub>*, *r<sub>l</sub>*, *e<sub>l</sub>* used in the model. The convolution filters have a kernel size of 3 with a padding of 1. The hidden state of the LSTM network has a size equal to that of the tensors *r<sub>l</sub>*.

**Table 1.** PredNet tensors: size of the tensors for the different levels of the PredNet architecture expressed as (channels, width, height).


The model, as already said, provides more accurate results as new inputs are provided. The procedure presented here uses the precipitation images of the hour preceding the forecast issue time; the model is then further iterated over the following hour's forecast times, replacing the real precipitation image *a*<sub>0</sub> with the model estimate *h*<sub>0</sub>. With a time sampling of 5 min, 12 images *a*<sub>0</sub> are provided for the first hour, and 24 images *h*<sub>0</sub> are obtained by model iteration: the reconstruction of the first hour of precipitation plus the estimates (without radar input) for the following hour.
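The warm-up-then-iterate scheme can be sketched as follows, assuming a hypothetical `model(frame, state) -> (prediction, state)` interface (not the actual PredNet API):

```python
def rollout(model, observed, n_forecast=12):
    """Warm up on the 12 observed 5-minute frames, then iterate the model on
    its own predictions: after the last radar image a_0, the estimate h_0 is
    fed back as input in place of the (unavailable) observation."""
    state, outputs = None, []
    for frame in observed:           # reconstruction phase (real a_0 inputs)
        pred, state = model(frame, state)
        outputs.append(pred)
    for _ in range(n_forecast):      # forecast phase (h_0 replaces a_0)
        pred, state = model(outputs[-1], state)
        outputs.append(pred)
    return outputs                   # 12 reconstructions + 12 forecasts
```

The feedback loop is also the source of the error accumulation discussed in Section 5: any smoothing introduced at one step becomes the input of the next.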

More details about the architecture can be found in Lotter, Kreiman and Cox [23]. That work focuses on forecasting the next frame in a video sequence and shows how the architecture is able to generate increasingly accurate forecasts as new frames are input, demonstrating its ability to refine forecasts and learn from the mistakes made during the process. Training is performed with a Gradient Descent procedure in which the network parameters are changed to minimize the errors *e<sub>l</sub>* at the different levels of image representation, with a loss function based on the Mean Absolute Error. The results obtained in [23] show that, in the case of real video sequences, the best results are achieved by minimizing *e*<sub>0</sub> only, i.e., the MAE between the calculated and the actual image.

Based on those results, here the loss function is decoupled from the internal error representation of the network, and the Mean Squared Error between the estimated and the actual precipitation image is minimized instead.

Since the forecast is extended beyond the next frame, the loss function used in the training phase is defined as the Mean Squared Error of the 24 *h*<sub>0</sub> images generated by the network with respect to the real precipitation images *a*<sub>0</sub>:

$$Loss = \sum_{t=1}^{24} \mathrm{MSE}\left(a_0(t), h_0(t)\right)\tag{1}$$

where MSE is an operator that returns the Mean Squared Error between the actual radar image and the one estimated by the model; the sum extends both over the 12 steps used for learning the current weather situation and over the 12 steps of the forecast, in order to train the network to "optimally" reconstruct both the radar images used as initial conditions and the predicted ones. Other experiments were also carried out, for example using the MAE as loss function, obtaining, in the verification, an improvement limited to that specific metric (i.e., the MAE) but a more marked worsening of all the other metrics considered (i.e., RMSE, *r* and ETS). Using both MSE and MAE, we verified that training a specific network for each forecast horizon did not improve performance. Furthermore, tests were conducted feeding the NN with both the normalized rain intensity and a logarithmic transformation of it. The best results were obtained by using directly the color indexes of the image and applying the rain rate transformation function described in Table 2 only after the predictive phase.

The weights of the neural network are optimized using a Gradient Descent procedure with the Adam optimizer. Having data from 2013 to 2017, we used the images of the first 3 years for training and those of 2016 for validation, while the tests were conducted on the radar images of 2017. The neural network model and training procedures are implemented in Python [32] using the PyTorch library [33] and are available in open source (https://github.com/lmssdd/RadarNowcast); the calculation was run on an NVIDIA Tesla K80 GPU with 12 GB of onboard RAM. The model will be referenced hereafter as NN.
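Equation (1) amounts to summing the per-frame MSE over the whole 24-frame sequence; a minimal numpy sketch (the function name is ours, not from the released code):

```python
import numpy as np

def sequence_loss(a_seq, h_seq):
    """Loss of Eq. (1): mean squared error summed over the frames of the
    sequence (12 reconstruction steps plus 12 forecast steps)."""
    return sum(float(np.mean((a - h) ** 2)) for a, h in zip(a_seq, h_seq))
```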

#### **4. Data**

The data used in this application come from a meteorological radar located in the region between Osaka and Kyoto. These public domain data, already used in [21], have been collected from an online archive available on Yahoo (https://developer.yahoo.co.jp/webapi/map/openlocalplatform/v1/js/) and span five consecutive years, from 2013 to 2017. Their spatial resolution is 1 km<sup>2</sup>, the temporal frequency is 5', and the domain covers an area of about 10<sup>4</sup> km<sup>2</sup>, as shown in Figure 2 (https://www.jma.go.jp/en/highresorad/ and https://www.google.com/maps). The climate of the area is classified as hot-temperate; precipitation is intense (1730 mm cumulative annual) and particularly heavy during the summer season, as shown for 3 meteorological stations in Figure 3 (http://ds.data.jma.go.jp/gmd/tcc/tcc/products/climate/).

The dimension of the images obtained through the API is 1000 × 1000 pixels. After a crop necessary to remove text characters from the image, and since the actual resolution of the radar is lower, a down-sampling of the image is performed, finally generating a 96 × 96 matrix. The data represent the precipitation intensity per pixel in a palette of 15 false colors. It is thus possible to map the colors to a progressive index of rain intensity ranging from 0 to 14, as reported in Table 2.

**Figure 2.** On the left the map of Japan in which is highlighted the area shown in detail on the right that represents the domain covered by the Japanese meteorological radar whose data is used in this work.

**Figure 3.** Climate data of monthly average temperature (**a**) and precipitation (**b**) for three meteorological stations in the domain.


**Table 2.** Table of correspondence between the colors of the radar image representation and the precipitation intensity.

The array containing the color index is normalized to a unit maximum value to be processed by the neural network, and is mapped back to rain intensity values during post-processing. The use of precipitation intensity, or its elaboration through a logarithmic transform, proved less effective than the direct processing of the color scale index. The data have also been filtered by eliminating all weather events where the radar images were incomplete (i.e., either totally or partially missing).
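The index normalization and its inverse can be sketched as follows, assuming (as stated above) that 14 is the maximum color index; the function names are ours:

```python
import numpy as np

def encode(idx):
    """Normalise the 0-14 colour index to [0, 1] for the network
    (assumes 14 is the maximum index of the palette)."""
    return idx / 14.0

def decode(x):
    """Map the network output back to the nearest valid colour index,
    to be converted to rain intensity via Table 2 in post-processing."""
    return np.clip(np.rint(x * 14.0), 0, 14).astype(int)
```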

All other methods have been applied to the logarithm transform of the rain rate estimates and only on rainy events, defined as those with at least 5% of the domain with precipitation rate greater or equal to 1 mm/h. The comparison between the different methodologies is limited to 2017, to maximize the use of previous years for the neural network training and favour a direct comparison with the results of optical flow methodologies that, instead, do not require training.

In the operational phase, the methods need to process different numbers of the most recent radar images. Table 3 lists the number of time steps needed as input and the number of forecast steps obtained by each procedure. The NN and SP methodologies need, as initial conditions, a number of steps equal to the number of forecast steps, while all the others can work with only two time steps, except ST, which needs three steps as initial condition.

**Table 3.** Comparison of the number of images processed by the different methodologies to obtain a forecast of 12 steps, equivalent to 1 h.


#### **5. Results**

The performance of the NN, ST, SP, DR and LK nowcasting techniques described in Sections 2 and 3 was assessed for 2017. The 2013, 2014 and 2015 data were used for training, and the data of year 2016 were used for validation of the NN method. The performance indicators were evaluated against the precipitation value at forecast verification time as estimated by the radar itself. Therefore, although expressed in mm/h, they do not represent in any way an estimate of the true precipitation forecast error: as a matter of fact, they gauge the error with respect to the radar measures once transformed into mm/h. The indicators used are: the root mean square error (RMSE), the mean absolute error (MAE), the Pearson correlation coefficient (*r*), and the equitable threat score (ETS) for precipitation exceeding 0, 1, 2 and 4 mm/h, respectively.

Before starting the discussion of the results, a preliminary consideration needs to be made to understand both the strengths and the limitations of nowcasts based on optical flow extrapolation techniques. All the methods used (ST, SP, DR and LK) fall into this category, except of course NN. Figure 4 shows the spatial autocorrelation of radar images as a function of the delay, in minutes, for 200 randomly chosen "rainy" events of 2017. As can be seen, the autocorrelation decays quickly but remains appreciable (the orange line represents 1/*e*) up to a delay of about 60'. This can be taken as a predictability limit for persistence, a particular case of optical flow with zero advection, and therefore the optical flow procedures are expected to have non-negligible skill for up to one hour. For the same reason, 60' is also the maximum delay that it makes sense to use as input for any extrapolation procedure based on recent history. That is why, for both the SP and NN procedures, the 12 images (at 5' intervals) preceding the nowcast time were used as initial conditions.
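The lagged autocorrelation underlying Figure 4 can be sketched as the Pearson correlation between frames a fixed number of steps apart, averaged over a sequence (a simplified reading of the analysis, with a function name of our choosing):

```python
import numpy as np

def lag_correlation(frames, lag):
    """Pearson correlation between radar frames `lag` steps apart, averaged
    over the sequence; the lag at which it drops below 1/e gives a rough
    persistence-predictability limit."""
    rs = [np.corrcoef(frames[t].ravel(), frames[t + lag].ravel())[0, 1]
          for t in range(len(frames) - lag)]
    return float(np.mean(rs))
```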

**Figure 4.** Autocorrelation of precipitation field as function of the delay expressed in minutes for a sample of 200 "rainy" events.

To clarify some aspects on which the discussion of the results will be based, Figure 5 shows a sequence of images covering one hour and related to the forecast of a precipitation event that occurred on 8 January 2017. The first column shows the 12 frames that precede the forecast issue (train, from minute −55 to 0); the second column (NN\_train) shows the reconstruction of the precipitation field by the neural network, which produces an estimate at each time step, starting from a zero field and gradually learning from the radar images provided. The third column (verif) shows the 12 images recorded by the radar in the following hour (minutes 5 to 60), against which the forecasts are compared. The remaining columns show, except for the fourth (NN\_flip), which will be discussed later, the nowcasts of the procedures NN, ST, DR, SP and LK.

The domain of existence of the nowcast depends on the procedure used. The NN technique, being based on a generative neural network, returns a prediction for the whole domain, while the SP and DR methods, after the advective and extrapolative part, try to "fill" the domain through interpolation, although this is generally never complete. Conversely, the ST and LK techniques, based only on advection, have a domain of existence limited to the areas where the initial field has been "moved", and are undefined in the complementary areas. Note, in particular, in Figure 5 the frames for the forecast from 30' onwards, in which the area covered by the forecast is progressively reduced for all methods except NN. When comparing the performance of the different methods, this fact is taken into account by evaluating the errors both in the Domain of existence of each Single forecast (DS hereafter) and in the Domain resulting from their Intersection (DI henceforth). This problem is also exacerbated by the limited extension of the domain covered by the radar data with respect to the typical advection speed of that climate zone, as confirmed by the depicted event. This inconvenience could be avoided, or at least mitigated, by considering a domain large enough to cover the central zone of greatest interest. To simplify the analysis of the results, the discussion that follows is organized in subsections.

**Figure 5.** Frame sequence illustrating the mode of operation of the different nowcast techniques calculated for the one-hour forecast starting on January 8 at 16:25 UTM. See the text for a detailed description of the figure.

#### *5.1. RMSE*

The use of the RMSE metric is not completely appropriate for a field as discontinuous as precipitation. Nonetheless, it provides an interesting "bulk" measure to compare the performance of the various methods. Box (**a**) of Figure 6 shows the RMSE of the forecast averaged over the whole year 2017, as a function of the forecast time from 1' to 60', for the procedures NN, ST, SP, DR and persistence, evaluated on the DS domains. It is clear that all techniques add value compared to persistence for the entire duration of the forecast. NN and ST have an RMSE significantly lower than all the other techniques, and in particular ST prevails over NN only in the second half hour. However, since the mean RMSE has been evaluated, and then averaged again, over a different domain for each technique, a direct comparison is not possible. For this reason, box (**b**) of Figure 6 shows the average RMSE evaluated only on the DI domain. In this case the trend is confirmed, but NN prevails over the other techniques for the whole 60' duration of the forecast. The slight prevalence of ST over NN in the second half hour of the forecast can be related to ST's ability to incorporate the average effects of stochastic noise, whose influence increases with the forecasting time. On the other hand, NN has the clear advantage of providing a prediction for the entire domain covered by the radar. In addition, this ST prevalence, as will be shown shortly, disappears when the performance is evaluated using scores more suitable to assess the quality of the precipitation forecast, such as the MAE, *r* and, especially, the ETS.

**Figure 6.** Average RMSE as a function of the forecast time from 1' to 60', for all nowcast methods and persistence (2017). In box (**a**) the indicators are evaluated in the DS domain, in (**b**) the same indicators are calculated in DI.

#### *5.2. MAE and Pearson Correlation Coefficient r*

Figure 7 shows the values of the MAE and of Pearson's correlation coefficient *r* as a function of the forecast time, from 1' to 60', for the nowcasts of the 5 procedures and for persistence, for 2017. The superiority of NN over all the other procedures is clear, both in terms of MAE and of *r*, even when these are calculated over the DS domain (boxes (**a**) and (**b**)). To verify the robustness of the results obtained, we repeated the calculation of the MAE and *r*, limiting the analysis to the DI domain. The results are shown in boxes (**c**) and (**d**) of Figure 7. In this case too, the performance of the NN method proved better than that of each of the others.

**Figure 7.** As in the previous figure for MAE and Pearson's correlation coefficient. First row of plots (boxes (**a**) and (**b**)) shows results evaluated in the domain DS, second row (boxes (**c**) and (**d**)) those in DI.

#### *5.3. ETS*

Figure 8 shows in boxes (**a**), (**b**), (**c**), (**d**) the ETS for the exceedance of the 0, 1, 2 and 4 mm/h thresholds. While the advantage of the tested procedures with respect to persistence is clear, the NN procedure stands out for the 0 mm/h (rain/no-rain) and 1 mm/h thresholds. For higher thresholds there is a slight prevalence of the LK method over all the other procedures. However, the maximum values of the ETS are significantly lower, as are the differences between the LK, NN and DR methods, which makes this lead not significant. The results reported in Figure 8 are obtained by calculating the ETS in the DS domain (therefore different for each method), but the results obtained by limiting the calculation to the common DI domain are basically identical and therefore not shown for brevity.
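The ETS is the hit rate corrected for the hits expected by chance; the standard contingency-table formula can be sketched as follows (function name ours):

```python
import numpy as np

def equitable_threat_score(obs, fcst, thr):
    """ETS for threshold exceedance: hits corrected for those expected by
    chance, H_r = (H + M)(H + F) / N, with H hits, M misses, F false alarms
    over the N pixels of the verification domain."""
    o, f = obs >= thr, fcst >= thr
    hits = int(np.sum(o & f))
    misses = int(np.sum(o & ~f))
    false_al = int(np.sum(~o & f))
    h_rand = (hits + misses) * (hits + false_al) / o.size
    denom = hits + misses + false_al - h_rand
    return (hits - h_rand) / denom if denom != 0 else 0.0
```

A perfect forecast scores 1, while a forecast with no skill beyond chance scores 0, which is what makes the score "equitable" across wet and dry climatologies.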

**Figure 8.** ETS, for 0, 1, 2, 4 mm/h thresholds shown on boxes (**a**), (**b**), (**c**), (**d**) respectively, as function of the forecast time from 1' to 60', for all nowcast methods and persistence, for 2017.

#### *5.4. Skill Score*

To have a measure of the accuracy of the nowcasting methods, for the entire duration of the forecast, we used two skill scores (SS) based on RMSE and MAE.

The accuracy is evaluated relative to persistence using the following expression:

$$SS(RMSE)_{meth} = 1 - \frac{RMSE_{meth}(t)}{RMSE_{pers}(t)}\tag{2}$$

where *meth* = {*NN*, *ST*, *SP*, *DR*, *LK*} is the method for which the skill score is calculated. A similar formula was used to calculate the SS(MAE). The values of the two SS obtained are summarized in Table 4. Looking at the table, the plots and the discussion of the previous paragraphs, a clear superiority of the NN method is evident, despite the fact that both the RMSE and the MAE have been evaluated in the DS domain of each of the methods used. Once more, it is also worth noting that the only method able to produce a prediction on the entire radar domain is NN, while all the others are limited to a portion of it, depending on the advection speed of the event and the forecast time. This is also the reason why the SS(RMSE) for ST is slightly better than the one for NN (see the discussion of Figure 6).
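Equation (2) reduces to a one-line function (name ours):

```python
def skill_score(err_method, err_persistence):
    """SS of Eq. (2): 1 - err_method / err_persistence. Positive values mean
    the method beats persistence; 1 would be a perfect forecast."""
    return 1.0 - err_method / err_persistence
```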

**Table 4.** RMSE- and MAE-based skill scores relative to the persistence for all nowcasting method used. Both RMSE and MAE are calculated within the domain DS of existence of each nowcast.


#### *5.5. NN vs. ST*

The results clearly show that the NN nowcast procedure, based on a generative neural network, performs significantly better than all the algorithms that, to the best of our knowledge, represent the current state of the art, at least in the chosen configuration: a domain of relatively small size compared to the typical advection speed of the climate zone where the radar operates. To investigate the origin of this performance difference, the input data for the nowcast were flipped with respect to the increasing direction of latitude and longitude, and the nowcast was repeated for the whole year 2017. All methods based on optical flow are invariant with respect to this transformation.

Not so for NN, as demonstrated by the MAE and *r* shown in Figure 9. From a quantitative point of view, it is clear that the nowcasts for the data flipped in latitude and longitude are completely wrong (see the MAE for NN\_flip in box (**a**)), while the correlation coefficient worsens considerably but remains in line with that of the other methods, in particular for forecast times over 30' (see box (**b**)). Most likely this result indicates that the neural network has correctly "learned", from the three years of training data, the climatic characteristics of the phenomena related to the particular domain. The behavior of NN applied to the flipped data, with respect to the natural domain, can be verified through the test case shown in Figure 5. Comparing the NN frames with the corresponding NN\_flip ones (see the fifth column of images), the similarity of the two predicted patterns is clear, and might be related to the advective component of NN. Regarding the amount of rain, NN\_flip provides much more intense precipitation, which can be related to the generative component of the network. This is consistent with the considerably higher values of the MAE and, at the same time, with the *r* values similar to those of the other procedures.

**Figure 9.** As in Figure 7. On box (**a**) the MAE and on box (**b**) the correlation coefficient of Pearson r as a function of the forecast time from 1 to 60 min, this time for the nowcast obtained by inverting the axes of longitude and latitude of the domain. The results for NN, the only ones that differ from those shown in Figure 7, are indicated with NN\_flip.

#### *5.6. Space-Time Behavior of the Nowcast*

The improvement in performance achieved with NN comes with a reduction in the spatial detail of the forecast. This can be seen, from a qualitative standpoint, in the case shown in Figure 5, observing how the smoothing of the NN solution increases with the lead time more than for the other methods. To address the quantitative aspect, we calculated the power spectral density of the logarithmic transformation of both the measured and the predicted fields. In Figure 10 the results obtained by averaging the PSD over all the "rainy" events of 2017, for 5', 30' and 60' lead times, are shown in boxes (**a**), (**b**) and (**c**) respectively. NN starts to lose power already at a scale of about 16 km for the 5' forecast. This loss of detail moves up to 32 km at 30' lead time, and to 48 km at 60' lead time, in line with the results shown in [34] for the RainNet method.
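A radially averaged PSD of the kind compared in Figure 10 can be sketched in a few lines of numpy (an illustrative sketch, not the exact spectral estimator used for the figure):

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectral density of a 2-D field; comparing
    the PSD of forecasts and observations at increasing lead times
    quantifies the loss of small-scale detail."""
    psd2 = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    ny, nx = field.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx // 2, y - ny // 2).astype(int)  # integer radius bins
    counts = np.bincount(r.ravel())
    counts[counts == 0] = 1                             # guard empty bins
    return np.bincount(r.ravel(), weights=psd2.ravel()) / counts
```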

This unintended feature of the NN forecast derives from the network optimization procedure which, while minimizing the loss function, effectively smooths the solution. This behavior, implicit in each 5' forecast, is amplified by numerical diffusion as the forecast time progresses, because of the iterative nature of NN.

This is also reflected in NN's reduced ability to predict extreme values. As an example, Figure 11 shows a time series of precipitation data for a severe event (29 June to 2 July 2017) at the radar point nearest to Kyoto. The radar estimates of precipitation in mm/h are shown along with the corresponding NN predictions at lead times of 5', 30' and 60'. The smoothing effect of the forecast can be seen here as well, but now in time. For events longer than an hour, in this case the first and third day of the event, the procedure is able to reproduce the evolution of the precipitation field even at lead times of 60'. The central part, instead, is poorly predicted, as it lasts less than the minimum needed learning time. The power spectra tell us that this is caused by the pixel scale not being resolved. However, for hydrological applications, if the basin has a surface as large as the scales resolved by the nowcast, this effect is minimized, since all points within the basin contribute to the runoff.

The increasing space-time "smoothness" of the NN forecast with lead time is related to the increasing uncertainty of the forecast, an aspect which is out of the scope of the present paper and on which we are currently working as an extension of the present research.

**Figure 10.** Power spectral density averaged over all the precipitation events of 2017, for times of forecast of 5', 30' and 60' on boxes (**a**), (**b**) and (**c**) respectively.

**Figure 11.** Time series of an event of intense precipitation occurred above Kyoto in the period from the 29 of June to the 2 of July 2017. Radar observations and corresponding NN forecasts at lead times of 5', 30' and 60' are shown.

#### **6. Conclusions**

In this work, a comparison is presented between state-of-the-art methods for the nowcast of precipitation intensity derived from weather radar images, based on optical flow, and a deep learning methodology based on the PredNet architecture, trained with a specific loss function.

The detailed analysis of this extensive test case shows that the proposed algorithm, based on a generative neural network architecture, clearly outperforms the other methods that, to the best of our knowledge, represent the state of the art for this task. It must be said, according to a precautionary principle, that these conclusions cannot be generalized a priori to the nowcast of images from another type of radar, or from a similar radar positioned in a different geographical region. This limit originates from the small size of the domain covered by the radar compared to the typical advection velocity obtained through the application of OF techniques. On the other hand, there is no valid reason the procedure used here could not be applied to other domains and/or other radar types, after an appropriate learning phase, to produce similarly encouraging results.

It would certainly be interesting to further develop the proposed concepts and methodologies. In that respect, a modification of the architecture, in order to formally separate the advective component from the source/sink term in the generative branch of the network, will be applied in upcoming work. This should facilitate a transfer learning process and the construction of a general model, able to interpret and predict radar measurements, that can then be specialized to specific instruments and/or different meteo-climatic regions. We also intend to study the potential for a probabilistic nowcast, estimating the spatial and temporal uncertainty that can be obtained from the neural network model, for instance through quantile regression and the use of a pinball loss function [35].
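The pinball loss mentioned for the probabilistic extension can be sketched as follows (a minimal pure-Python version for illustration; in practice it would be implemented inside the network's training loop):

```python
def pinball_loss(y_true, y_pred, q):
    # Average quantile (pinball) loss for quantile level q in (0, 1):
    # under-predictions are penalized with weight q, over-predictions
    # with weight (1 - q), so minimizing it targets the q-th quantile.
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)
```

Training one model (or output head) per quantile level q then yields a predictive interval around the deterministic nowcast.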

**Author Contributions:** M.M. and L.M. jointly conceived and designed the methodologies, performed the analysis and wrote the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was financed by the project "Tessuto Digitale Metropolitano" funded by the regional agency Sardegna Ricerche (POR FESR 2014-2020, Azione 1.2.2) and by Sardinian Regional Authorities.

**Acknowledgments:** The authors would like to thank Ryoma Sato of Kyoto University for sharing his scripts, necessary to download the dataset.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Are Combined Tourism Forecasts Better at Minimizing Forecasting Errors?**

#### **Ulrich Gunter <sup>1,\*</sup>, Irem Önder <sup>2</sup> and Egon Smeral <sup>1</sup>**


Received: 8 June 2020; Accepted: 25 June 2020; Published: 29 June 2020

**Abstract:** This study, which was contracted by the European Commission and is geared towards easy replicability by practitioners, compares the accuracy of individual and combined approaches to forecasting tourism demand for the total European Union. The evaluation of the forecasting accuracies was performed recursively (i.e., based on expanding estimation windows) for eight quarterly periods spanning two years in order to check the stability of the outcomes during a changing macroeconomic environment. The study sample includes Eurostat data from January 2005 until August 2017, and out-of-sample forecasts were calculated for the last two years, three and six months ahead. The analysis of the out-of-sample forecasts for arrivals and overnights showed that forecast combinations that take the historical forecasting performance of individual approaches, such as Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models, into account deliver the best results.

**Keywords:** Bates–Granger weights; uniform weights; (REG) ARIMA; ETS; Hodrick–Prescott trend; Google Trends indices

#### **1. Introduction**

The perishable nature of tourism products and services such as hotel overnights, airplane seats, or restaurant tables makes forecasting an important prerequisite for setting efficient strategies to ensure business success. The special characteristics of tourism products and services such as perishability, intangibility, and consumption at the point of service delivery, external factors such as natural and man-made disasters, as well as unsteadiness of human nature make forecasting an important issue for international government bodies, national governments, academics, and practitioners alike.

In the past decades, many studies have dealt with the challenge of improving tourism demand forecasting accuracy, yet all these research efforts have led merely to the conclusion that no single forecasting method outperforms all others in all situations [1]. Furthermore, discussions of complexity have become increasingly relevant in the academic literature aiming to improve forecasting accuracies. Green and Armstrong [2] note that the trend to develop increasingly complex approaches has a long history, yet is at odds with scientific principles that advocate simplicity. An alternative way to use complex techniques to improve forecasting accuracy would be to combine the forecasts of individual forecasting models with the help of various combination techniques, as it has been shown that combined methods minimize the risk of extreme inaccuracy by "averaging out" the weaknesses of single models [3]. Forecast combinations are also capable of introducing adjustments and additional information balancing out measurement errors, which could negatively affect forecasting power [4].

Despite the many tourism studies carried out to date, the application of forecast combination techniques as a tool to create complexity remains rare. In contrast to other disciplines, research into combination methodologies for tourism demand forecasting has a short history, which started only in the early 1980s. Research into combining forecast methodologies was stimulated significantly earlier in various other economic and business fields by the seminal work of Bates and Granger [5]. They examined the performance of combining two sets of forecasts of airline passenger data, whereby the weights of the individual forecasts were calculated based on the historic predictive performance of each individual approach, and found that the combined forecasts showed lower errors than the individual forecasts. Only 20 years after the 1969 Bates and Granger paper, Clemen [6] summarized the intensive work that had been done around the topic of forecast combinations in the interim, and delivered an encompassing literature survey about these activities.

The manageable number of studies about forecast combinations in tourism have, in general, delivered the outcome that the combined forecast outperformed the forecasts generated by individual models [3,7–13]. Our work compares forecasting accuracies for eight quarterly report periods spanning the period from March 2015 until August 2017 in order to check the stability of the results during a changing macroeconomic environment. In other words, the forecast evaluation exercise is not only carried out once for different forecasting horizons but eight times recursively (i.e., based on expanding estimation windows), which corresponds to a "natural" practitioner's situation.
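The recursive (expanding-window) evaluation just described can be sketched as follows. This is a simplified illustration; `fit_and_forecast` is a hypothetical callable standing in for any of the forecasting models compared in this study:

```python
def expanding_window_mae(series, fit_and_forecast, first_origin, horizon):
    # Recursive out-of-sample evaluation: at each forecast origin t the
    # model is re-estimated on series[:t] (an expanding window) and its
    # h-step-ahead forecast is scored against the realized value.
    errors = []
    for t in range(first_origin, len(series) - horizon + 1):
        forecast = fit_and_forecast(series[:t], horizon)
        errors.append(abs(series[t + horizon - 1] - forecast))
    return sum(errors) / len(errors)  # mean absolute error
```

With a naive last-value forecaster, `fit_and_forecast = lambda history, h: history[-1]`, the routine reproduces the familiar naive benchmark against which more elaborate models are compared.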

To meet these different challenges, we developed forecasting models for arrivals and overnights and used Eurostat data according to our commissioned study for the European Union as a whole. These models were tested across eight different quarterly periods for their stability and accuracy. In addition, we calculated combined forecasts based on the forecasts produced by the various individual forecasting models, and assessed the accuracies of these combined forecasts. Therefore, the objective of this study was to analyze whether combined forecasts of selected models are able to outperform the forecasts generated by the individual models in terms of forecasting accuracy.

Forecasting for the European Union as a whole for the indicated period, predicting three and six months ahead, and using both single forecasting models and forecast combination techniques that are easily replicable by practitioners were requirements of the European Commission, which contracted the present study [14]. Tourism plays a major economic role in the European Union: 13.6 million people (9.5% of all employees in the non-financial sector) were employed by 2.4 million tourism businesses (around 10% of all non-financial businesses) in 2016 [15]. In the same year, the tourism industry accounted for 3.9% of turnover and 5.8% of value added in the non-financial sector.

The remainder of this paper is as follows: After the literature review (Section 2), we discuss the advantages of forecast combinations and the most-used combination methods (Section 3). In the methodological section (Section 4), we present the chosen modeling approaches, the accuracy measures and their application, as well as the data used. The subsequent section provides and discusses the forecasting performance of the different approaches (Section 5). Conclusions and recommendations are the final section of the paper (Section 6).

#### **2. Related Literature**

The prediction competition for time series forecasting already has a history of about 50 years. According to Hyndman [16], the earliest scientific study of time series forecasting accuracy—the Nottingham study—was done by David Reid [17]. Paul Newbold and Clive Granger took the next step by conducting a study of forecasting accuracy involving 106 time series [16,18]: one important result of their study was that forecast combination improves forecasting accuracy. Comparing forecasts became fashionable and, over the years, "forecasting competition" has become an important term in the forecasting literature.

The first big forecasting competition took place in 1982 and was organized by Spyros Makridakis and Michele Hibon. For this competition, known in the forecasting literature as the "M(akridakis) competition", anyone could submit forecasts related to 1001 time series taken from demography, industry, and economics [16,19]. The following M-2 competition was organized in collaboration with four companies, this time using only 29 time series and with the main purpose of simulating real-world forecasting. A peculiarity of this M-2 competition was that forecasters were allowed to use personal judgements, to ask questions about the data, and to revise their previous forecasts for the next forecast [20]. The succeeding M-3 competition had the objectives of replicating and extending the features of the previous competitions, with more methods, more researchers, and more (i.e., 3003) time series [21]. The most recent competition completed while this study was being written, M-4, ran from January to May 2018, used 100,000 time series, and considered all major forecasting methods, including those based on Artificial Intelligence as well as traditional statistical ones [22]. Informing the major objective of our paper, a stable result across all the M competitions was that combined approaches, on average, outperformed individual forecasts [17–22].

Other competitions have also been organized in parallel to the M competitions. Mathematicians and physicists interested in forecasting ran their own competition at the Santa Fe Institute, beginning in January 1992 [23]. Other examples for forecasting competitions are the application of neural networks [24] or the global energy forecasting competitions [25,26].

In tourism, research into combination approaches and their efficacy started significantly later than in other disciplines. One of the first studies about forecast combinations in tourism was that by Fritz et al. [10] about the combination of time series and econometric forecasts. Their paper presents parsimonious methods of improving forecasting accuracy by combining various forecasting techniques. The Box-Jenkins stochastic time-series method was combined with a traditional econometric technique to forecast airline visitors to the State of Florida. Some years later, Calantone et al. [8] confirmed the results of Fritz et al. [10] and showed, also for the State of Florida, that forecasts of tourist arrivals obtained by forecast combination were more accurate than forecasts based on individual approaches. Shen et al. [3] tested the accuracy of forecast combinations compared to the forecasting results of seven different techniques over different forecasting horizons and demonstrated that combinations were superior to the best of the individual forecasts. Song et al. [27] shed further light on these results by showing that combined forecasts may be more beneficial for long-term forecasting. Shen et al. [28] compared six different combination methods and found that those that consider the historical performance of individual forecasts perform better than simple uniform average methods. In contrast, Gasmi [11] demonstrated that from three combination techniques, the Granger-Ramanathan regression method [29] delivered superior results in comparison to the simple uniform average technique and the Bates–Granger variance-covariance technique [5].

Andrawis et al. [7] suggested combining forecasts based on diverse information using different time aggregations (e.g., monthly and annual data). In comparing several forecast combination techniques, they show that the approach using forecasts based on time series with diverse time aggregations outperformed the combined individual forecasts based on time series with the same time structure. For improving tourism forecasts, Cang [30] proposed a non-linear combination method using multilayer perception neural networks, which can map the non-linear relationship between inputs and outputs.

On the other hand, a minority of studies has shown that combined forecasts do not always outperform the best individual forecasts, but are almost certain to outperform the worst individual forecasts [27]. Furthermore, Song et al. [27] stated that combined methods outperform the best single forecast in fewer than 50% of cases on average. A few years later, Song et al. [31] similarly stated that, according to their results, forecast combination only improved forecasting performance in the tourism context in just over 50% of all cases compared with the most accurate single prediction. In a similar vein, Gunter and Önder [12] evaluated combined forecasts based on Bates–Granger weights, on multiple forecast encompassing tests, as well as on a combination of the two approaches [32]. The authors applied the aforementioned forecast combination techniques to Google Analytics indicators used as leading indicators for forecasting tourist arrivals to Vienna. However, these quite complex techniques only performed well for longer forecasting horizons.

To the best of the authors' knowledge, only one true forecasting competition focused on tourism time series data has been held to date [33]. The data set included 366 monthly, 427 quarterly, and 518 annual time series, all supplied by either tourism bodies or academics who had used them in previous tourism forecasting studies. The forecasting methods implemented in the competition were univariate and multivariate time series approaches, as well as econometric models. Surprisingly, however, this competition did not evaluate the accuracies of combined forecasts compared to individual forecasts.

#### **3. Advantages of Forecast Combination**

Why do combined forecasts perform better than individual forecasts in many contexts? Bates and Granger [5] stated that the simple portfolio diversification argument justifies the idea of combining forecasts. Forecast combinations offer diversification gains that make it efficient to combine individual forecasts rather than taking forecasts from just one single model. The information set underlying the individual forecasts is often unobservable for the forecast user: potentially because it comprises private information. Differences between the subjective judgements of various forecasters could therefore reflect differences in their respective information sets. In this situation, it is not possible to pool the underlying information set and construct a superior model that captures each of the underlying forecasting models. On the other hand, the higher the degree of overlap in the information set used to produce the underlying forecasts, the less useful a combination of forecasts is likely to be [34]. Furthermore, when forecast users have access to the full information set used to construct the individual forecasts, combinations are sub-optimal and it might be better to recommend finding a superior single model [35,36].

A second reason for using forecast combinations is that individual forecasts are differently influenced by structural breaks caused, for example, by institutional change or technological developments. Some models may adapt fast and only be affected by the structural break for a short time, while other models have parameters with slower adaptation speeds. Since it is typically difficult to detect structural breaks in real time, it is plausible that, on average (i.e., across periods with varying degrees of stability), combinations of forecasts from models with different degrees of adaptability will outperform forecasts from individual models [37]. Similarly, Stock and Watson [38] report that in cases of structural breaks, the performance of combined forecasts tends to be far more stable than that of individual forecasts.

Third, forecast combination could be viewed in the sense that additional forecasts act like intercept corrections (ICs) relative to a baseline forecast. ICs can improve forecasting not only if there are structural breaks, but also if there are deterministic misspecifications [39].

Fourth, pooling of forecasts can also be understood as a shrinkage estimation. According to this approach, the unknown future value is viewed as a "meta-parameter" of which all the individual forecasts are estimates [39]. In these cases, averaging may improve the estimates.

Often, we also measure the wrong things. Demand data are rarely, if ever, available: thus, instead of measuring demand, we measure supply data (e.g., in periods with over-utilization of production capacities). However, it is obvious that such proxies of apparent demand introduce systematic biases in measuring real demand and therefore increase forecasting errors [4]. Averaging forecasts, in turn, would balance out these potential errors. Similar conclusions can be drawn for measurement errors and unknown misspecifications.

Moreover, statistical models assume that patterns and relationships remain constant. This is not always given: especially in the real world, where events and actions or fashions bring systematic changes and therefore introduce non-random errors in forecasting. Combining forecasts would help to increase their accuracy. Combining is also expected to be useful when experts are uncertain of which method to choose. This may be because we encounter novel situations (e.g., Brexit, the COVID-19 pandemic, stock market crashes, etc.) or have to make forecasts for a longer time horizon.

A further argument for combining forecasts is that the underlying forecasts may be based on different loss functions [40]: let us assume there are two forecasters, both with the same information set for forecasting a specific variable; however, forecaster #1 dislikes large negative forecasting errors, while forecaster #2 dislikes large positive forecasting errors. As a consequence, forecaster #1 will under-predict, while forecaster #2 will over-predict. If the bias is not constant over time, a forecast user with a rather symmetric loss function would find a combination of these two forecasts better than following the individual ones.

While forecast combination has advantages, there are also several arguments against it. According to the literature, forecast combinations are at a disadvantage compared with single forecasting models because they introduce parameter estimation errors in cases where the weights used to combine the different forecasts need to be estimated [40]. This important consideration of avoiding errors in estimating the weights for the forecast combination has led simple uniform weighting methods to dominate more complex combination methods in mainstream scientific practice. The important advantage is that the weights are known and therefore do not have to be estimated: this plays a role if there is little evidence on the performance of the individual forecasts or if the parameters of the forecasting models are time-varying. Furthermore, in many situations, a simple uniform average of forecasts will result in a significant reduction in variance and bias through averaging out the individual biases [40,41]. In the literature, the most used simple combination approaches are the Simple Average Combination (SAC) method and the VAriance-COvariance (VACO) method [3,5].

The SAC method assigns equal weights to each of the individual forecasts instead of using any optimal weights to minimize the variance of the combined forecasts. Although forecast combinations with equal weights could be biased, they might contribute to the reduction of the forecasting error as these weights are not influenced by other errors accruing from the estimation of optimal weights [3,42]. According to Palm and Zellner [41], the SAC method has the following advantages. When there is little evidence on the performance of the individual forecasts, it is an important advantage that the weights are known and therefore, no estimation is necessary. Furthermore, in many situations, the application of the SAC method contributes to the reduction of the variance and the bias through averaging out individual biases. Another advantage of the SAC method is its avoidance of sampling errors and model uncertainty in estimating optimal weights.

According to the VACO method, past errors of each individual forecast are used to determine the weights in forming the combined forecasts [5,6]. Bates and Granger [5] suggest assigning higher weights to good forecasts (low errors) and lower weights to poor ones (high errors).
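The two combination schemes can be sketched in a few lines of Python. This is a minimal illustration; the VACO weights are computed here as inverse-MSE Bates–Granger weights, ignoring the error covariances as in the commonly used simplified variant:

```python
def vaco_weights(past_errors_by_model):
    # Bates-Granger weights: each model's weight is proportional to the
    # inverse of its past mean squared forecast error, so good forecasts
    # (low errors) receive higher weights and poor ones lower weights.
    inv_mse = [len(e) / sum(x * x for x in e) for e in past_errors_by_model]
    total = sum(inv_mse)
    return [w / total for w in inv_mse]

def combine(forecasts, weights=None):
    # SAC when weights is None (equal weights for every model),
    # VACO when the estimated inverse-MSE weights are supplied.
    if weights is None:
        weights = [1.0 / len(forecasts)] * len(forecasts)
    return sum(w * f for w, f in zip(weights, forecasts))
```

Because the weights scale with inverse squared errors, a model with half the past errors of another receives four times its weight in the VACO combination.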

#### **4. Forecasting Tourism Demand for the European Union**

The objective of the commissioned study based on the research period 2005–2017 was to find the model best at forecasting arrivals and overnights for the total European Union in the short term. The competing models should have a low degree of complexity, as the "winning" model was to be applied by the European Commission and Eurostat to an actual database in order to look ahead into the near future and to mitigate the lack of tourism data reporting [14]. This study was designed for the European Commission in order to help them forecast tourism demand for the European Union as a whole. Thus, similar models were used to test the forecasting accuracy at each time period.

In doing so, we analyze if the combined forecasting approaches according to the SAC and VACO methods based on the outcome of Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models (a state-space framework comprising traditional exponential smoothing models) outperform the single models, in general and over all periods, or if specific individual models show superiority [4,43–45]. Similar to the choice of the forecast combination techniques, the single forecasting models were also chosen based on the criterion of easy replicability by practitioners.

The database for the model estimations comprised the monthly arrivals and overnights statistics from Eurostat for the total of the EU-27 (unfortunately, no data were available for Ireland) from January 2005 until August 2017. In the estimations and forecasts, we distinguished between non-residents and residents (see also Figures 1 and 2). For calculating the forecasting accuracy, we performed out-of-sample forecasting for three and six months ahead. For the computations, we used EViews 9.5, and EViews 10 for the final quarterly report.

**Figure 1.** Arrivals of residents (solid line) and non-residents (dashed line) for the EU-27 (seasonally adjusted by the multiplicative ratio to moving average method). Source: Eurostat and own illustration.

**Figure 2.** Overnights of residents (solid line) and non-residents (dashed line) for the EU-27 (seasonally adjusted by the multiplicative ratio to moving average method). Source: Eurostat and own illustration.

#### *4.1. An Outline of the Models Used*

For solving the forecasting problem, we used four different approaches:

- ARIMA models;
- REGARIMA models with Hodrick–Prescott trends;
- REGARIMA models with Google Trends indices;
- Error Trend Seasonal (ETS) models.


#### *4.2. ARIMA and REGARIMA Models*

A general ARIMA (*p*, *d*, *q*) model for a first and seasonally differenced forecast variable ∇∇*syt* (i.e., *d* = 1) in period *t* (*t* = 1, ... , *T*) reads as follows:

$$
\varphi(L)\nabla\nabla^s y_t = a + \vartheta(L)e_t \tag{1}
$$

where ϕ(*L*) and ϑ(*L*) in Equation (1) denote lag polynomials of finite orders *p* and *q*, while *a* denotes the intercept and *et* denotes the random error term. In this study, *yt* corresponds either to overnights or to arrivals for the total EU-27.

A general REGARIMA (*p*, *d*, *q*) model for a first and seasonally differenced forecast variable ∇∇*syt* (i.e., *d* = 1) with a contemporaneous exogenous variable *trendt*, in turn, reads as follows:

$$
\varphi(L)\nabla\nabla^s y_t = a + \vartheta(L)e_t + \mathit{trend}_t \tag{2}
$$

where the notation in Equation (2) corresponds to that of Equation (1). In this study, *trendt* corresponds either to a Hodrick–Prescott trend of the arrivals and the overnights or to various Google Trends indices. It should be noted that the forecast variable is only first or second-differenced in the REGARIMA models with Google Trends indices, but not seasonally differenced.

#### *4.3. Employed ARIMA Models*

For all variables used in the ARIMA models, we applied seasonal differencing to remove any deterministic or stochastic seasonal patterns. Because of the existence of non-seasonal unit roots, we additionally took first differences to achieve difference-stationary processes.
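The transformation just described, seasonal differencing followed by first differencing, can be sketched as follows (a minimal illustration with a monthly seasonal period of 12):

```python
def seasonal_difference(y, s=12):
    # Remove the seasonal pattern: z_t = y_t - y_{t-s}.
    return [y[t] - y[t - s] for t in range(s, len(y))]

def first_difference(y):
    # Remove the non-seasonal unit root: z_t = y_t - y_{t-1}.
    return [y[t] - y[t - 1] for t in range(1, len(y))]
```

Applied to a series with a linear trend plus a fixed monthly pattern, the seasonal difference leaves a constant and the subsequent first difference leaves zeros, i.e., a stationary series, which is the state an ARIMA model with *d* = 1 and seasonal differencing requires.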

For the three and six month forecasting periods and to achieve the best model fit, we employed an ARIMA (2, 1, 0) approach to model the overnights of residents and non-residents. To model the arrivals of non-residents, we applied the ARIMA (2, 1, 1) model, while we chose the ARIMA (2, 1, 0) model for the residents. All the estimated coefficients were statistically significant. The estimated equations also had excellent results in terms of out-of-sample forecasting accuracy (see as an example the forecasting accuracies for arrivals and overnights for the study period December 2017 in Tables A1–A8 in Appendix A).

#### *4.4. Employed REGARIMA Models with Hodrick–Prescott Trends*

In order to manage the short-term forecast problem, a quasi-causal model was constructed to explain arrivals and overnights. This model is based on a REGARIMA approach, which uses the flexible trend of the overnights/arrivals being explained through the model as its contemporaneous exogenous variable. The flexible trend was identified by the Hodrick–Prescott (HP) filter method and indicates important exogenous aggregated information in the model [43,46]. Correcting the forecast based on the flexible trend with AR and MA errors optimized the results.

In doing so, the overnights/arrivals values were transformed into absolute previous-year differences of the moving 12-month averages of the log-transformed original values. The 12-month averages were used to adjust for seasonal fluctuations, calendar effects, and special events. The explanatory variables are absolute previous-year differences of the flexible overnights/arrivals trends identified by the HP filter method [43,46]. As the HP filter is based on the moving 12-month average of the log-transformed original values, these variables are easy to extrapolate by using exponential smoothing methods for forecasting purposes [4].
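The data transformation described above can be sketched as follows. This is a simplified stand-in: the HP-filter step itself is omitted, and only the moving 12-month log average and the previous-year differencing are shown:

```python
import math

def moving_12m_log_average(y):
    # Moving 12-month average of the log-transformed values; the first
    # 11 observations are consumed by the averaging window.
    logs = [math.log(v) for v in y]
    return [sum(logs[t - 11:t + 1]) / 12 for t in range(11, len(logs))]

def previous_year_difference(x, lag=12):
    # Absolute previous-year differences: d_t = x_t - x_{t-12}.
    return [x[t] - x[t - lag] for t in range(lag, len(x))]
```

For a series growing at a constant rate g per month, the previous-year differences of the log averages are constant at 12g, which illustrates why the transformed variables are easy to extrapolate with exponential smoothing.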

For the three and six month forecasting periods, we employed REGARIMA approaches and corrected the processes with AR (1) errors for modeling non-residents' and residents' overnights. To model the arrivals of non-residents, we applied approaches corrected with MA (1) errors, while for the residents we corrected with AR (1) errors. All the estimated coefficients were statistically significant. The estimated equations also had excellent results in terms of out-of-sample forecasting accuracy.

#### *4.5. Employed REGARIMA Models with GoogleTrends*

Based on the REGARIMA models outlined in the previous section, we also used model variants with Google Trends indices instead of HP trend variables. In order to achieve stationarity, we took the first or second difference of the variables (no seasonal differencing was performed). First differences were taken of arrivals, while the Google Trends indices were used as they came. Second differences were taken of overnights, while the Google Trends indices were employed in first differences.

For the three and six month forecasting periods, we employed the aforementioned REGARIMA approaches. Various correction processes were necessary for both the arrivals and overnights data. To model the arrivals of the non-residents, we applied approaches corrected with AR (2) and MA (1) errors, while for the residents, we corrected with AR (2) errors. For modeling overnights, we applied AR (2) errors throughout.

Next, we explain what Google Trends are, where we can find them, and how we can use them. Google provides search data at an aggregated level (as an index) on its Google Trends page (http://trends.google.com/trends/), where users can identify the topics trending in search results or investigate a specific search term to learn about its popularity in different parts of the world. These data are open and free of charge to Google account holders, and can be downloaded in common spreadsheet formats to be used for analytical purposes, including forecasting.

In order to determine which web search term was more useful, we collected and developed four types of Google Trends variables: (i) a Google Trends index with country name (EU\_trends), (ii) a Google Trends index with country names and flights (EU\_flights), (iii) a Google Trends index with country names and hotels (EU\_hotels), and (iv) a Google Trends index with country names, flights, and hotels (EU\_travel). There is no consensus among researchers about how to choose the keywords for analysis. One method is to choose the keywords directly by subjective assessment of a set of text or data [47]; in this study, we used keywords related to travel planning (i.e., flights, hotels) under the travel category of Google Trends.

To calculate the EU\_trends variable, the name of each of the 27 European Union countries was used as the search term to retrieve the respective monthly search indices from the Google Trends website. The data were retrieved in monthly intervals between January 2016 and December 2017 and compiled in a single Excel file. Then, to generate a regional EU\_trends variable, the average of the index values across the 27 countries was calculated for each month separately. The EU\_flights and EU\_hotels variables were also calculated in a similar way: the search terms being each country's name followed by "flights" or "hotels", respectively. As a last variable using Google Trends indices, we generated an EU\_travel variable by summing all the search indices used for the previously calculated monthly "EU" indices, such that EU\_trends + EU\_hotels + EU\_flights = EU\_travel.
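The construction of the regional indices can be sketched as follows (a minimal illustration with hypothetical data; the real inputs are the 27 monthly country-level index series downloaded from the Google Trends website):

```python
def eu_index(country_series):
    # Average the monthly Google Trends index values across countries,
    # producing one regional value per month.
    series = list(country_series.values())
    n_months = len(series[0])
    return [sum(s[m] for s in series) / len(series) for m in range(n_months)]

def eu_travel(trends, hotels, flights):
    # EU_travel is the month-by-month sum of the three component indices:
    # EU_trends + EU_hotels + EU_flights.
    return [a + b + c for a, b, c in zip(trends, hotels, flights)]
```

The same averaging is applied for EU\_flights and EU\_hotels, only with the search terms extended by "flights" or "hotels", respectively.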

In most cases, the estimated models had statistically significant parameters and satisfactory forecasting accuracy results (for an example, see Tables A1–A8 in Appendix A). The reason why insignificant variables were retained was so the models could be tested using the same variables for the whole duration of the study period. A second reason was that, although some variables were statistically insignificant in specific periods (but not over the whole duration of the study), this does not imply that these variables are unimportant for forecasting. In doing so, we follow a recent statement of the American Statistical Association (ASA) pointing out very clearly that the widespread use of statistical significance (to be understood as a 5% *p*-value threshold) as a justification for scientific findings leads to a biased perception of the scientific process [48]. The ASA statement has been an impulse to the scientific community to move further toward a world beyond *p* < 0.05 and a signal to recognize that statistical inference is not equivalent to scientific inference [49,50].

#### *4.6. Error Trend Seasonal (ETS) Models*

The Error Trend Seasonal (ETS) model class was developed by Hyndman et al. [45,51] and encompasses various well-known exponential smoothing methods (e.g., single exponential smoothing, double exponential smoothing, additive and multiplicative seasonal Holt-Winters) within a theoretically founded state-space framework which is estimated recursively by employing maximum-likelihood methods. The ETS framework consists of a signal equation for the forecast variable and a number of state equations for the three components that cannot be directly observed: level, trend, and seasonal. Since the ETS framework can automatically detect both trend and seasonal patterns and apply the most suitable model, further data transformation was not necessary.

*Forecasting* **2020**, *2*

Generally speaking, an ETS(·, ·, ·) model is represented by one of the following configurations of the error, trend, and seasonal components of the forecast variable, i.e., total EU-27 overnights or arrivals in this study [52]:

$$E(\text{Error}) \in \langle A, M \rangle, \ T(\text{Trend}) \in \langle N, A, A\_d, M, M\_d \rangle, \ S(\text{Seasonal}) \in \langle N, A, M \rangle \tag{3}$$

where *A* in Equation (3) corresponds to additive, *Ad* to additive damped, *M* to multiplicative, *Md* to multiplicative damped, and *N* to none.

This makes a total of 30 possible ETS specifications. From these, the most suitable specifications were automatically selected by employing information criteria such as the Akaike information criterion (AIC) and the Schwarz or Bayesian information criterion (BIC). Information criteria such as AIC and BIC are means for model selection and offer a relative estimate of the information lost in terms of the log likelihood function when a given model, e.g., a particular ETS specification relative to all other ETS specifications, is used to represent the process that generates the data [44]. Here, AIC and BIC have been calculated for all 30 possible ETS specifications, and the specifications characterized by the minimum AIC value and BIC values were then used for estimation and forecasting.
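The selection step itself is simple to sketch. The candidate labels and information-criterion values below are hypothetical; in practice, each (AIC, BIC) pair comes from maximum-likelihood estimation of one of the 30 ETS specifications.

```python
def select_specifications(candidates):
    """Pick the ETS specification minimizing AIC and the one minimizing BIC.

    candidates maps a specification label such as "ETS(M,N,A)" to its
    (AIC, BIC) pair obtained after estimation.
    """
    best_aic = min(candidates, key=lambda spec: candidates[spec][0])
    best_bic = min(candidates, key=lambda spec: candidates[spec][1])
    return best_aic, best_bic
```

AIC and BIC may disagree; in that case, both selected specifications would be carried forward, as in the AIC- and BIC-selected models reported in the appendix tables.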

As an example of an ETS specification for three months ahead forecasting of overnights of non-residents in the EU-27 as selected by BIC, please see Table A5 in Appendix A. The one signal and two state equations of the selected ETS (*M*, *N*, *A*) specification read as follows [52]:

$$y\_t = (l\_{t-1} + s\_{t-m})(1 + \varepsilon\_t) \tag{4}$$

$$l\_t = l\_{t-1} + \alpha (l\_{t-1} + s\_{t-m})\varepsilon\_t \tag{5}$$

$$s\_t = s\_{t-m} + \gamma (l\_{t-1} + s\_{t-m})\varepsilon\_t \tag{6}$$

where Equation (4) corresponds to the signal equation for the forecast variable *yt*, Equation (5) to the state equation for the unobservable level component *lt*, and Equation (6) to the state equation for the unobservable seasonal component *st*. α and γ denote the two smoothing constants, while the remaining notation in Equations (4) to (6) corresponds to that of Equations (1) and (2). Note that the ETS(*M*, *N*, *A*) specification does not contain a state equation for the unobservable trend component, since no trend is present in this specification.
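As an illustration of how the recursions in Equations (4) to (6) operate, consider the following minimal sketch; the helper name, initial states, and toy series are our assumptions, not the study's estimation code.

```python
def ets_mna_filter(y, alpha, gamma, m, level0, seasonal0):
    """One-step-ahead filtering with the ETS(M, N, A) recursions.

    y         : observed series (e.g., monthly overnights)
    alpha     : level smoothing constant
    gamma     : seasonal smoothing constant
    m         : seasonal period (12 for monthly data)
    level0    : initial level l_0
    seasonal0 : m initial seasonal terms, seasonal0[0] aligned with y[0]
    """
    level = level0
    seasonal = list(seasonal0)
    forecasts = []
    for t, y_t in enumerate(y):
        s_tm = seasonal[t % m]                 # s_{t-m}
        mu = level + s_tm                      # signal: l_{t-1} + s_{t-m}
        forecasts.append(mu)
        eps = (y_t - mu) / mu                  # relative (multiplicative) error
        level = level + alpha * mu * eps       # level update, Equation (5)
        seasonal[t % m] = s_tm + gamma * mu * eps  # seasonal update, Equation (6)
    return forecasts
```

When the observations follow the level-plus-seasonal pattern exactly, every relative error is zero and the states remain unchanged, so the one-step forecasts reproduce the series.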

#### *4.7. Forecast Combinations*

Apart from the individual forecasting models, the merits of two forecast combination techniques are also evaluated. Bates and Granger [5] indicate that combination forecasts can yield lower forecasting errors, a finding that was later confirmed by Clemen [6]. To this aim, the forecasts produced by the 64 models, separately for the three months ahead and six months ahead forecast horizons, were combined (or averaged) based on two methods that are common in the literature: simple uniform weights (unweighted average, or SAC) and so-called Bates–Granger weights (weighted average, or VACO). More complex combination approaches are available, but they carry the disadvantage that additional errors introduced through parameter estimation might flaw the results. To avoid these additional error sources, we used only combination methods based on calculated weights in this study.

More formally, a simple uniformly combined forecast *F<sup>U</sup><sub>h</sub>* of overnights/arrivals for non-residents/residents for a forecast horizon *h* (*h* = 3, 6) is calculated as follows:

$$F\_h^{U} = \sum\_{m=1}^{M} \frac{1}{M} F\_h^m \tag{7}$$

where *F<sup>m</sup><sub>h</sub>* in Equation (7) is the forecast value produced by one of the *m* (*m* = 1, ... , *M*) competing single forecasting models.

The Bates–Granger weight of an individual forecasting model is calculated as the inverse of the mean square error of that forecasting model relative to the sum of the inverses of the mean square errors of all forecasting models. Hence, a better individual forecasting model receives a relatively higher weight when calculating the average forecast.

More formally, a Bates–Granger [5] combined forecast *F<sup>BG</sup><sub>h</sub>* of overnights/arrivals for non-residents/residents for a forecast horizon *h* (*h* = 3, 6) is calculated as follows:

$$F\_h^{BG} = \sum\_{m=1}^{M} \frac{1/MSE\_h^m}{\sum\_{j=1}^{M} \left(1/MSE\_h^j\right)} F\_h^m \tag{8}$$

where *F<sup>m</sup><sub>h</sub>* in Equation (8) is the forecast value produced by one of the *m* (*m* = 1, ... , *M*) competing single forecasting models, and *MSE<sup>m</sup><sub>h</sub>* denotes the corresponding mean square error.
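Equations (7) and (8) can be sketched in a few lines; the function names are ours.

```python
def combine_uniform(forecasts):
    """Equation (7): uniform weights 1/M across the M model forecasts."""
    return sum(forecasts) / len(forecasts)

def combine_bates_granger(forecasts, mses):
    """Equation (8): weights proportional to each model's inverse MSE."""
    inv = [1.0 / mse for mse in mses]
    total = sum(inv)
    return sum((w / total) * f for w, f in zip(inv, forecasts))
```

With forecasts 100 and 200 and mean square errors 1 and 3, the Bates–Granger weights are 0.75 and 0.25, giving a combined forecast of 125, whereas the uniform combination gives 150; the historically better model pulls the combination toward its own forecast.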

#### **5. Results**

The forecasting models were assessed based on the comparison of their ex-post out-of-sample forecasting accuracy in terms of the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). The averaged (or combined) forecasts were then treated the same way as the forecast values produced by the individual forecasting models, which means that the same error measures are also calculated for them.
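The three accuracy measures can be computed as sketched below; these are generic pure-Python helpers with names of our choosing.

```python
import math

def rmse(actual, forecast):
    """Root mean square error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)
```

For the toy pair actual = [100, 100] and forecast = [90, 110], all three measures equal 10 (RMSE and MAE in the units of the series, MAPE in percent).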

To evaluate the forecasting accuracy of the different models, we ranked the values yielded by the various forecasting accuracy measures and summed the resulting rank scores; this procedure allows the interpretation that the forecasting model with the lowest total score delivers the best forecasting accuracy (see, as an example, the results for the December 2017 report in Tables A1–A8 in Appendix A). In the next step, we compared the total scores summed over eight report periods (March 2016, June 2016, September 2016, December 2016, March 2017, June 2017, September 2017, December 2017) and over the three forecasting accuracy measures (RMSE, MAE, and MAPE) to determine whether or not the combined forecasting methods outperformed the single forecasting models (see Tables 1 and 2). These tables present an evaluation by forecast horizon, separated between non-residents and residents, as well as an overall rank. In addition, one goal of the commissioned study was to give an overall recommendation across forecasting accuracy measures, report periods, and forecasting horizons; presenting an overall rank was therefore necessary as well.
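The rank-and-sum procedure can be sketched as follows. This is a simplified illustration: the tie handling here is arbitrary and not a documented detail of the study.

```python
def total_rank_scores(scores_by_measure):
    """Sum each model's rank (1 = lowest error) across accuracy measures.

    scores_by_measure maps a measure name (e.g., "RMSE") to a dict of
    {model: error value}; a lower total score means better overall accuracy.
    """
    totals = {}
    for scores in scores_by_measure.values():
        for rank, model in enumerate(sorted(scores, key=scores.get), start=1):
            totals[model] = totals.get(model, 0) + rank
    return totals
```

The same summation extends naturally over report periods and forecast horizons by adding further score dictionaries to the input.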

Comparing the forecasting performance between March 2015 and August 2017, incorporating eight different forecasting situations as well as different macroeconomic environments, the combined forecasts with Bates–Granger weights consistently deliver the most accurate forecasts (see Tables 1 and 2). Generally speaking, this holds for arrivals, overnights, all forecast horizons, and both tourist types. In one case, however (the six months ahead forecasts of overnights by non-residents), the combined forecast based on Bates–Granger weights ranked second behind the ARIMA (2,1,0) model, although the differences in the rank totals were so small that both methods could be considered as sharing the first rank.


**Table 1.** Summary ranking of the overall results for arrivals forecasts. Source: European Commission (2017) and own calculations. Boldface indicates the single forecasting model(s)/forecast combination technique(s) characterized by the lowest sum of ranks.

**Table 2.** Summary ranking of the overall results for overnights forecasts. Source: European Commission (2017) and own calculations. Boldface indicates the single forecasting model(s)/forecast combination technique(s) characterized by the lowest sum of ranks.



In the competition for the top three ranks, the combined forecasts with Bates–Granger weights ranked first seven times and came second once. The ARIMA models ranked first once, tightly beating the weighted combined forecasts, ranked second once, and ranked third twice. The combined forecasts with uniform weights achieved the second rank five times and the third rank once. The REGARIMA models with Google Trends indices for the European Union scored one second rank and two third ranks. The ETS (AIC) models and the REGARIMA models with Google Trends indices for hotels and travel each achieved the third rank a single time.

When analyzing the overall ranks of the two forecast combination techniques separately for the eight report periods, combined forecasts based on Bates–Granger weights achieved the lowest sum of ranks in 22 out of 64 possible cases, whereas combined forecasts based on uniform weights achieved the lowest sum of ranks in 16 cases (detailed results are available from the authors on request). Thus, in approximately 60% of all cases, and when considering only best ranks, averaging over the results of the individual forecasting models was definitely worthwhile. It should also be noted that none of the individual forecasting models performed extremely poorly and that the results were quite similar across report periods. For the given sample, mitigating the impact of slightly underperforming individual models by assigning them a lower weight in the Bates–Granger combined forecast proved to be sufficient. Consequently, more complex approaches (e.g., a formal screening procedure based on statistical criteria such as a forecast encompassing test) were not necessary (and would also have been at odds with the simplicity requirement for the methodology).

#### **6. Conclusions**

This study compared the accuracy of individual and combined approaches to forecasting tourism demand for the total European Union. The evaluation of the forecasting accuracies was performed recursively for eight periods spanning two years in order to check the stability of the outcomes during a changing macroeconomic environment. The analysis of the out-of-sample forecasts for arrivals and overnights showed that forecast combinations taking the historical forecasting performance of individual approaches such as Autoregressive Integrated Moving Average (ARIMA) models, REGARIMA models with different trend variables, and Error Trend Seasonal (ETS) models into account deliver the best results.

The analysis of the three months ahead and six months ahead out-of-sample forecasts for arrivals and overnights (non-residents, residents) showed that the VACO method was clearly the most accurate, followed by the ARIMA approach and the uniformly weighted combined forecasting method (SAC). The results and their stability over a two-year observation period demonstrate that taking the historical forecasting performance into account contributes to a significant improvement in forecasting accuracy and is recommended for practical application.

One particular advantage of the SAC and VACO methods in addition to their excellent performance is that they can be easily implemented by practitioners, which typically is not the case for more complex forecast combination techniques such as multiple encompassing tests. This simplicity also holds for the employed single forecasting models, which are typically part of any modern statistics and econometrics software. One further advantage is the ready availability of the different trend variables employed as exogenous variables in the REGARIMA models, such as HP trends and the various Google Trends indices.

Limitations of our study are that, for practical reasons, we used very simple models and did not test whether introducing complex models into the forecasting competition would have changed the rank orders in terms of forecasting accuracy. In line with the general need to improve forecasting accuracy, discussions of complexity have become increasingly relevant in the academic literature, although building complex approaches is at odds with scientific principles that advocate simplicity. On the other hand, one way to use complex techniques to improve forecasting accuracy is the combination of the forecasts of individual models, as employed in this study, since combined methods minimize the risk of high inaccuracy by "averaging out" the weaknesses of the single forecasting models. They are also capable of introducing adjustments and additional information, which can balance out measurement errors and thereby affect forecasting power.

Future research efforts could concentrate on developing forecasting models for the single European Union countries and testing if the forecasting accuracy performances of the different methods stay stable across countries or if significant differences can be detected. For practical reasons, such country-level approaches could even be more useful for governments and national tourism boards. Another research endeavor could investigate the sensitivity of the forecasting performances of the different methods in terms of different data frequencies (e.g., quarters instead of months).

**Author Contributions:** Conceptualization, U.G., I.Ö., and E.S.; methodology, U.G., I.Ö., and E.S.; software, U.G., I.Ö., and E.S.; validation, U.G., I.Ö., and E.S.; formal analysis, U.G., I.Ö., and E.S.; investigation, U.G., I.Ö., and E.S.; resources, I.Ö.; data curation, I.Ö.; writing—original draft preparation, U.G., I.Ö., and E.S.; writing—review and editing, U.G.; visualization, U.G.; supervision, E.S.; project administration, E.S.; funding acquisition, E.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the European Commission (Directorate General for Internal Market, Industry, Entrepreneurship & SMEs – Directorate F: Innovation & Manufacturing industries) under the title "Statistical Report on Tourism Accommodation Establishments", tender number GRO/SME/15/C/N125C.

**Acknowledgments:** The authors would like to thank the two anonymous reviewers of this journal, whose comments and suggestions greatly improved the quality of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Appendix A**

*Appendix A.1. Arrivals (December 2017 Results)*

**Table A1.** Forecasting accuracy (December 2017): arrivals, non-residents, three months ahead. Source: European Commission (2017) and own calculations. Boldface indicates the single forecasting model(s)/forecast combination technique(s) characterized by the lowest sum of ranks.



[Table A1 body not reproduced: the extracted columns (error values and ranks for the REGARIMA (2,1,1) + Google Trends (EU, EU\_flights, EU\_hotels, EU\_travel) models and the combined forecasts with uniform and Bates–Granger weights) could not be realigned into a table.]












[The body of a further appendix table, covering the REGARIMA (2,2,0) + Google Trends models and the combined forecasts with uniform and Bates–Granger weights, is likewise not reproduced: the extracted columns could not be realigned.]


**Table A8.** Forecasting accuracy (December 2017): overnights, residents, six months ahead. Source: European Commission (2017) and own calculations. \* AIC and BIC selected.

[Table A8 body not reproduced: the extracted columns (error values and ranks for the combined forecasts with uniform and Bates–Granger weights) could not be realigned into a table.]

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Benchmarking Real-Time Streamflow Forecast Skill in the Himalayan Region**

**Ganesh R. Ghimire 1,\*, Sanjib Sharma 2, Jeeban Panthi 3, Rocky Talchabhadel 4, Binod Parajuli 5, Piyush Dahal <sup>6</sup> and Rupesh Baniya <sup>7</sup>**


Received: 15 June 2020; Accepted: 6 July 2020; Published: 8 July 2020

**Abstract:** Improving decision-making in various areas of water policy and management (e.g., flood and drought preparedness, reservoir operation and hydropower generation) requires skillful streamflow forecasts. Despite the recent advances in hydrometeorological prediction, real-time streamflow forecasting over the Himalayas remains a critical issue and challenge, especially with complex basin physiography, shifting weather patterns and sparse and biased in-situ hydrometeorological monitoring data. In this study, we demonstrate the utility of low-complexity data-driven persistence-based approaches for skillful streamflow forecasting in the Himalayan country Nepal. The selected approaches are: (1) simple persistence, (2) streamflow climatology and (3) anomaly persistence. We generated the streamflow forecasts for 65 stream gauge stations across Nepal for short-to-medium range forecast lead times (1 to 12 days). The selected gauge stations were monitored by the Department of Hydrology and Meteorology (DHM) Nepal, and they represent a wide range of basin size, from ~17 to ~54,100 km2. We find that the performance of persistence-based forecasting approaches depends highly upon the lead time, flow threshold, basin size and flow regime. Overall, the persistence-based forecast results demonstrate higher forecast skill in snow-fed rivers over intermittent ones, moderate flows over extreme ones and larger basins over smaller ones. The streamflow forecast skill obtained in this study can serve as a benchmark (reference) for the evaluation of many operational forecasting systems over the Himalayas.

**Keywords:** Himalayan region; streamflow forecast verification; persistence; snow-fed rivers; intermittent rivers

#### **1. Introduction**

Skillful streamflow forecasts are critically important in improving decision-making in water-related policy and management (e.g., flood and drought preparedness, reservoir operation and hydropower generation). Real-time streamflow forecasting over the Himalayas remains a critical challenge (e.g., [1,2]) because of complex basin physiography, shifting weather patterns and the sparse distribution of hydrometeorological monitoring stations. The verification of streamflow forecasts can provide insight into hydrologic forecasting by elucidating the level of forecast skill that can be reasonably expected from the forecasting system [3]. In this study, we demonstrate the utility of a persistence-based approach to benchmark the forecast skill of the operational streamflow forecasting system in the Himalayan country Nepal.

Nepal is an important part of the Hindu Kush Himalayas (HKH), which are referred to as the water towers of Asia due to the largest concentration of snow and glaciers outside the two poles [4–7]. The snow cover and glaciers make significant contributions to the hydrology of glacierized basins. Khadka et al. [5] showed, in a simulation study of the Tamakoshi basin in eastern Nepal for the years 2000–2009, that snowmelt contributes about 18% of the annual runoff. Further, they showed that snowmelt represents about 17% of the runoff in the summer season (June to September) and about 25% in the spring season, when the streamflow in these rivers is low. The study by Bhattarai and Regmi [8] in the Langtang river basin in central Nepal shows similar numbers for the contribution of snowmelt to the runoff (~19% of the annual runoff). Dhami et al. [9] used the Soil and Water Assessment Tool and a snowmelt runoff model in the Karnali river basin in western Nepal to simulate the components of the water balance. They reported that about 12% of the annual runoff is contributed by snowmelt and about 29% by groundwater base flow.

In Nepal, the Department of Hydrology and Meteorology (DHM) is responsible for monitoring streamflow in rivers, providing a real-time river watch, and developing decision support systems. At present, the real-time river watch system at the DHM generates siren alerts at critical stations, leveraging the internet and mobile devices for alert delivery [10]. In addition, a community-based early warning system (CBEWS) has been integrated into the early warning system managed by the DHM with the aid of intergovernmental organizations, UN organizations and international non-governmental organizations [11]. CBEWS is inherently people-centric, helping communities use local resources and capacities for flood preparedness efforts [10–12]. A simplified data-based mechanistic forecast model was adopted by the DHM to cover the major river basins of Nepal that stem from the mountains to the Terai plains. The integration of the data-based mechanistic model with CBEWS provides an additional lead time of 3–5 h for larger basins and 1–3 h for smaller basins, in addition to the previous 2–3 h [13]. Owing to the major contribution of snow to streamflow, particularly in the glacierized basins, persistence-based forecasts could serve as a reference to evaluate the skill of the current operational forecasting system, which this study explores. Moreover, this approach could serve as a tool to guide decision-making pertaining to flood preparedness in the short-to-medium forecast range.

Hydrologic forecasts are inherently uncertain (e.g., [3,14]). The uncertainty can originate from different sources, including shortcomings in forcing data (e.g., quantitative precipitation forecasts), model structure and parameters, as well as model initial conditions (e.g., [1,14–16]). In addition, complex hydrologic processes, such as snowmelt, sub-surface flow and flow routing, make it more difficult to produce skillful streamflow forecasts in the HKH region (e.g., [1,17]). Most countries in the HKH region are prone to floods (e.g., [1,18]); therefore, achieving skillful forecasts up to the medium range is of utmost importance for flood preparedness efforts. Because of such uncertainties and challenges, evaluating streamflow forecasts in this region requires skillful reference forecasting systems. For example, if a reference forecasting system with low forecast skill is chosen, the operational streamflow (flood) forecasting system might appear to have higher forecast skill, essentially giving a false sense of the skill associated with the operational system.

The persistence-based approach uses the notion that "tomorrow will be like today" (e.g., [19]). It is tied to the concept of the "memory" of the analyzed system [20]: in our case, the basins. One would expect the notion of streamflow persistence to be more pertinent in glacierized basins due to the relatively slow snowmelt process over time. Though persistence-based forecasts are used as references in the hydrologic forecasting community, their utility across different hydroclimatic conditions and scales has not been fully explored.

Forecasters use various reference forecasts for operational forecast skill verification, such as simple persistence, streamflow climatology, a hydrological model fed with zero rainfall and a hydrological model fed with an ensemble of resampled historical rainfall [21]. Owing to the coupling of hydrologic systems with the weather and climate system, streamflow shows some likelihood of repetition at seasonal to daily time scales, and hence some inherent predictability [19,20,22]. Den Dool [23] and Fraedrich and Ziehmann-Schlumbohm [24] argue that only forecasts depicting skill better than persistence can handle the forecast of the time derivative. The degree of persistence, however, might vary with location, time of the year and the system studied [23,25]. Den Dool [23] indicated that processes with a long time scale show higher skill with persistence forecasts, which makes persistence a hard-to-beat method for many complex mechanistic models. Though some authors (e.g., [26]) attribute streamflow persistence to the carryover storage of water in lakes and below the land surface, a comprehensive evaluation of persistence-based forecasts was lacking until recently. Palash et al. [1], motivated by the concept of requisite simplicity, showed the utility of a simple linear flood-forecasting system for the Ganges, Brahmaputra and Meghna Rivers, using streamflow persistence as the mechanism of prediction. Recently, Ghimire and Krajewski [19] and Krajewski et al. [27] showed, through a comprehensive study of 140 mid-western agricultural watersheds in the United States, that persistence-based forecast skill depends strongly on the basin scale, with a weaker but non-negligible dependence on the properties of the river network.

In the context of the lack of a comprehensive evaluation of persistence-based reference forecasts in the data-scarce Himalayan region, their quantitative assessment can provide a benchmark across scales for any operational forecasting system (e.g., DHM's forecasting system). This paper specifically explores the following questions: (1) what is the utility of data-driven persistence-based approaches for skillful streamflow forecasting in the Himalayan region? (2) Which forecast conditions, such as lead time, flow threshold, basin size and flow regime (e.g., perennial/snow-fed and intermittent), offer the greatest potential increase in forecast skill? We organize the paper as follows: in Section 2, we discuss the materials and methods used in this study; Section 3 presents the results, and Section 4 discusses these findings; in Section 5, we summarize and present some conclusions and limitations of this work.

#### **2. Materials and Methods**

#### *2.1. Study Area and Data*

Our domain of interest for this study was Nepal, which lies in the central part of the Himalayan mountain range. The total area of Nepal is about 147,520 km<sup>2</sup>, located approximately between 80°03′–88°12′ E and 26°21′–30°26′ N (Figure 1). The topographic elevation ranges from 8848 m above mean sea level (the elevation of Mt. Everest) in the north to 70 m in the south, featuring diverse climatic conditions varying from polar to tropical [28]. Forest (39.1%), agriculture (29.8%), barren land (10.7%) and snow/glaciers (8.2%) are the major land cover types in Nepal [29]. Most of the climate variability in Nepal is attributed to the high relief of the river catchments (e.g., [4]). The climate in Nepal is dominated by the southwestern monsoon (June–September), which originates from the Bay of Bengal; about 80% of the precipitation in Nepal occurs in the monsoon season alone [30]. The months of October–November occasionally experience post-monsoon rainfall, November–March typically remains dry, and April–May experiences pre-monsoon rainfall [4,5,9,31]. The summer monsoon is more active in the eastern part of Nepal, as the monsoon enters and departs from that region, while the winter monsoon is more active in the western region because of the influence of western disturbances [28]. The average annual precipitation (1990–2010) ranges from ~270 mm to ~5500 mm [32]. The consequent runoff during the monsoon accounts for 70%–90% of the annual water balance, as shown in [33].

**Figure 1.** Location of the study domain, Nepal. The white patches in the terrain map (not to scale) depict the snow cover (Nepal Himalaya). The land cover is from Uddin et al. [29]. The red dots represent stream gauge stations monitored by the Department of Hydrology and Meteorology (DHM), Nepal.

Most rivers in Nepal drain from north to south and eventually to the Ganges River in India. The rivers in Nepal can be broadly categorized into five major river systems: Karnali, Narayani, Koshi, Mahakali, and the southern and Mahabharat rivers [33]. The former four river systems are combinations of rain- and snow-fed rivers, while the latter mostly comprises mid-size rivers, such as the Mechi, Kankai, Kamala, Bagmati, Babai and West Rapti, mostly fed by rainfall and characterized by frequent flash floods [33]. The Karnali, Gandaki and Koshi river systems highlighted in Figure 1 represent about two-thirds of the area of Nepal. These river systems show variability in their hydroclimatic conditions, in addition to topographic characteristics such as average basin slope and snow cover. In this study, we used historic daily streamflow data at the stream gauge stations (red dots in Figure 1) monitored by the DHM across Nepal. Daily streamflow data is the highest-resolution data publicly available from the DHM; it corresponds to the mean of three readings recorded daily at 8 AM, 12 noon and 4 PM, local time. The streamflow data across stations span the period between 1962 and 2010, with varying record lengths. Thirty-eight stations have more than 30 years of historical daily streamflow records (Figure A1, Appendix A). We screened the stream gauge stations based on the criteria of at least 10 years of streamflow records and at least fair data quality as categorized by the DHM. The stations that are not shown within the three major river systems correspond to the rivers of the southern and Mahabharat river systems. In total, we used streamflow data from 65 unregulated stream gauge stations. The corresponding basin size varies from ~17 to ~54,100 km<sup>2</sup>, with more than half of the stations monitoring catchments greater than 1000 km<sup>2</sup> (Figure A1, Appendix A). Note that the hydropower projects built on some of these rivers are run-of-river schemes, which typically do not store water.

#### *2.2. Experimental Design*

Several persistence-based forecast approaches have been discussed in den Dool [23], Wu and Dickinson [25] and Ghimire and Krajewski [19]. Here, we present three of these approaches, used to generate and evaluate the reference (benchmark) forecasts in the time domain: simple persistence, streamflow climatology and anomaly persistence. Our experimental setup mostly follows the methods outlined in Ghimire and Krajewski [19]. If *t* represents the time of observation of streamflow *Q*(*t*), the simple persistence forecast for lead time Δ*t* is given by

$$Q(t + \Delta t) = Q(t) \tag{1}$$

Note that the forecasts are generated for the entire streamflow time-series on record for the 65 stream gauge stations in Nepal. By considering the entire time-series, we account for the interannual variability of streamflow.

Here, we use the "climatology" as the average of streamflow on record at time *t*. The corresponding forecast for lead time Δ*t* is given by

$$Q(t + \Delta t) = \frac{\sum\_{i=1}^{n} Q(t + \Delta t)}{n} \tag{2}$$

where the *Q*(*t* + Δ*t*) terms in the sum refer to the streamflow at time *t* + Δ*t* in each of the previous years of record, and *n* refers to the number of years of record.

The anomaly persistence forecast scheme assumes that streamflow anomalies persist over the lead time Δ*t*. The anomalies are computed with respect to the climatology. The anomalies at *t* and *t* + Δ*t*, i.e., *Q*′(*t*) and *Q*′(*t* + Δ*t*), respectively, are computed as

$$Q'(t) = Q(t) - \text{clim}(t) \tag{3}$$

$$Q'(t + \Delta t) = Q(t + \Delta t) - \text{clim}(t + \Delta t) \tag{4}$$

where clim(*t*) and clim(*t* + Δ*t*) are the climatology forecasts obtained from Equation (2) at *t* and *t* + Δ*t*, respectively. Setting *Q*′(*t* + Δ*t*) = *Q*′(*t*) (the persistence assumption), Equations (3) and (4) yield the forecast at *t* + Δ*t* as

$$Q(t + \Delta t) = Q(t) - \text{clim}(t) + \text{clim}(t + \Delta t) \tag{5}$$
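Equation (5) translates directly into code. The sketch below is an illustration, not the authors' implementation, and assumes the observed series and its climatology are aligned daily arrays.

```python
import numpy as np

def anomaly_persistence(q, clim, lead):
    """Anomaly persistence (Eq. 5): Q(t + lead) = Q(t) - clim(t) + clim(t + lead).
    `q` and `clim` are aligned daily arrays; the anomaly observed at t is
    assumed to persist over the lead time."""
    q, clim = np.asarray(q, float), np.asarray(clim, float)
    f = np.full_like(q, np.nan)
    f[lead:] = q[:-lead] - clim[:-lead] + clim[lead:]
    return f
```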

The forecasts computed in Equations (1)–(5) use the entire streamflow time-series in record, i.e., at least 10 years of streamflow data. Here, we are also interested in exploring the forecast skill associated with the direct runoff component of streamflow; in other words, we computed the forecast skill of the persistence-based approaches for rainfall-runoff events. Acknowledging that separating direct runoff from baseflow is not straightforward, we used the automated separation technique of the United States Geological Survey (USGS), the hydrograph separation program (HYSEP) [34]. HYSEP offers three methods to separate stormflow from baseflow in the hydrograph: fixed interval, sliding interval, and local minimum. The duration of surface runoff is computed empirically as

$$K = A^{0.2} \tag{6}$$

where *K* is the number of days after which surface runoff ceases and *A* is the upstream drainage area of the basin in square miles [34]. The interval used to separate the stormflow is 2*K*\*, the odd integer nearest to 2*K*. The hydrograph separation starts one interval, i.e., 2*K*\* days, before the date selected as the start of the separation and ends 2*K*\* days after the selected date. For further details of the separation technique, refer to Sloto and Crouse [34]. Figure A2 of Appendix A demonstrates stormflow separation for two river basins (small and large scale) using the sliding interval method.
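As a small worked example of Equation (6) and the 2*K*\* rounding rule, the sketch below implements only the rule as stated above; the USGS program applies further conventions not reproduced here.

```python
import math

def hysep_interval(area_sq_miles):
    """Stormflow separation interval 2K* in days: K = A**0.2 (Eq. 6),
    and 2K* is the odd integer nearest to 2K."""
    two_k = 2.0 * area_sq_miles ** 0.2
    lo = math.floor(two_k)
    if lo % 2 == 0:                 # step down to the nearest odd integer
        lo -= 1
    lo = max(lo, 1)
    hi = lo + 2                     # the next odd integer above
    return lo if two_k - lo <= hi - two_k else hi
```

For instance, a 100-square-mile basin gives 2*K* ≈ 5.02 days, so the separation interval is 5 days.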

#### *2.3. Evaluation Metrics*

We used several standard hydrologic forecast evaluation measures to evaluate the forecasts described in Section 2.2. The metrics we computed were Kling–Gupta efficiency (KGE), mean absolute error (*MAE*), normalized *MAE* (*nMAE*), timing of the hydrographs (*TH*) and peak difference (*PD*). The KGE comprises three components: Pearson's correlation coefficient (*r*), variance ratio (α) and mean ratio (β) (refer to [35]). It is given by

$$KGE = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta - 1)^2} \tag{7}$$

where α = σ<sub>f</sub>/σ<sub>o</sub> and β = μ<sub>f</sub>/μ<sub>o</sub>. σ<sub>f</sub> and σ<sub>o</sub> are the standard deviations of the forecasts and the observed streamflow, respectively, and μ<sub>f</sub> and μ<sub>o</sub> are the corresponding means. The ideal value of the KGE is 1, attained when all three components equal 1. If the values of α and β are close to 1, the corresponding KGE is dominated by the correlation component; by construction, the KGE associated with the persistence-based forecasts is dominated by correlation. When the mean of the observations is used as a benchmark, the KGE threshold separating a good model from a bad one is −0.41 (see [36]). In our discussion, we primarily focus on the KGE.
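Equation (7) and its three components can be computed as follows; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def kge(forecast, observed):
    """Kling-Gupta efficiency (Eq. 7) from its three components:
    r (Pearson correlation), alpha (variance ratio), beta (mean ratio)."""
    f, o = np.asarray(forecast, float), np.asarray(observed, float)
    r = np.corrcoef(f, o)[0, 1]
    alpha = f.std() / o.std()
    beta = f.mean() / o.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

A perfect forecast gives KGE = 1; a forecast with perfect correlation but a doubled mean and spread gives 1 − √2 ≈ −0.41.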

We compute the *MAE* associated with the forecasts as below:

$$MAE = \frac{\sum \left| Q\_f - Q\_o \right|}{N} \tag{8}$$

where *Qf* is the streamflow forecast, *Qo* is the observation and *N* is the number of data pairs. For fair comparison across the basin scales, we normalize *MAE* by the upstream drainage area to compute *nMAE*. The hydrograph timing (*TH*) is computed as the number of hours a hydrograph requires to be shifted so that the cross-correlation between the forecasts and the observations is maximized. By construction, *TH* is close to the lead time of the forecast. The peak difference (*PD*) in percentage is computed as

$$PD = \frac{peak\_f - peak\_o}{peak\_o} \times 100\tag{9}$$

where *peakf* and *peako* are peaks of the forecast and the observations, respectively.
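The error and timing metrics above admit a short implementation. This is a sketch, not the authors' code; `max_shift` (an assumed parameter) bounds the search for the timing offset.

```python
import numpy as np

def mae(qf, qo):
    """Mean absolute error (Eq. 8)."""
    return np.mean(np.abs(np.asarray(qf, float) - np.asarray(qo, float)))

def nmae(qf, qo, area_km2):
    """MAE normalized by upstream drainage area, for fair cross-scale comparison."""
    return mae(qf, qo) / area_km2

def peak_difference(qf, qo):
    """Peak difference in percent (Eq. 9), relative to the observed peak."""
    return (np.max(qf) - np.max(qo)) / np.max(qo) * 100.0

def hydrograph_timing(qf, qo, max_shift):
    """Timing TH: the shift (in time steps) of the forecast hydrograph that
    maximizes its cross-correlation with the observations."""
    qf, qo = np.asarray(qf, float), np.asarray(qo, float)

    def xcorr(s):
        a, b = (qf[s:], qo[:len(qo) - s]) if s >= 0 else (qf[:len(qf) + s], qo[-s:])
        return np.corrcoef(a, b)[0, 1]

    return max(range(-max_shift, max_shift + 1), key=xcorr)
```

For a simple persistence forecast, `hydrograph_timing` recovers a shift close to the lead time, consistent with the observation above that *TH* is close to the forecast lead time by construction.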

#### **3. Results**

In this section, we present key results from our experiment evaluating persistence-based reference forecasts at 65 stream gauge stations in the Himalayan region in Nepal. First, we present results for three major river systems (basin-wise forecasts), and then for the entire region (regional forecasts).

#### *3.1. Basin-Wise Forecasts*

Our motivation for exploring basin-wise forecast skill is the variability of snow/glacier cover, storage conditions and hydroclimatic conditions across the three major river basins. In Figure 2, we show how the KGE varies with drainage area for the three major river systems of Nepal (Karnali, Narayani and Koshi) at forecast lead times of 1, 3, 6 and 12 days. The results correspond to simple persistence forecasts, for which the KGE is dominated by the correlation component *r*. For all three basins, KGE > 0.8 at most stations for a lead time of 1 day. A stronger relationship between the KGE and basin size emerges with increasing forecast lead times. Note that the pattern is not as strong in the Koshi river basin, although the KGE values there are relatively high across basin scales. In addition, the Koshi river system contains few monitored small basins, making it difficult to establish the pattern relative to the other two river systems. Overall, the results across the three river systems clearly show a strong spatio-temporal dependence of simple persistence-based forecast skill.

**Figure 2.** Relationship between the Kling–Gupta Efficiency (KGE) and drainage area across three major basins in Nepal. The columns show the Karnali, Narayani and Koshi river basins while the rows show lead times of 1, 3, 6 and 12 days, respectively.

#### *3.2. Regional Streamflow Forecasts*

The results from the basin-wise forecasts show that no distinct pattern of forecast skill emerges across the major river systems in Nepal. Therefore, we pooled all stations from the three major river systems, including the mid-sized river stations from other river basins (see Figure 1), and refer to their forecasts as regional forecasts. In Figure 3, we present results in terms of two skill metrics, i.e., KGE (first row) and *nMAE* (second row), based on simple persistence forecasts (Equation (1)). The relationship between the KGE and basin size is more prominent for the pooled stations across forecast lead times. There are, however, a few stations which diverge from the overall pattern of the KGE observed in the basin-wise forecast evaluation. We explore the forecast performance associated with these stations separately (see Figure A3, Appendix A). We find that their forecast skill shows more variability and decays much faster with forecast lead time. In addition, these stations mostly monitor mid-sized river basins with an intermittent flow regime; we discuss the origin of their forecast skill in detail later. Note that basins > 1000 km<sup>2</sup> show relatively high forecast skill (KGE ~0.7) even at a lead time of 12 days. The values of *nMAE* depict a similar dependence on basin size as the KGE: the relationship is much stronger at shorter lead times, while it starts to diverge at longer lead times, particularly for basins smaller than 1000 km<sup>2</sup>. By construction, *PD* for the simple persistence forecast is close to 0. Also by construction, *PD* for the climatology-based forecasts does not vary with lead time (see Figure A4, Appendix A); the corresponding median value of *PD* across basins in Nepal is about −76% (underestimation of the peak). For anomaly persistence, however, the median values of *PD* across basins range from about −1.5% to −3% at lead times of 1 day to 12 days, respectively (see Figure A4, Appendix A). Note that climatology-based forecasts show a sizable dependence of *PD* on basin size, while this is not apparent for anomaly persistence-based forecasts.

**Figure 3.** Relationship of skill metrics with drainage area for the entire Nepal. The columns show 1, 3, 6 and 12 day lead times, while the rows show Kling–Gupta efficiency (KGE) and normalized mean absolute error (*nMAE*), respectively.

In this section, we evaluate the forecast performance of the various reference forecast schemes described in Section 2. Figure 4 presents a one-to-one comparison of forecast skill between these schemes in terms of KGE (first row) and *nMAE* (second row). Evidently, persistence-based forecasts outperform climatology-based forecasts at shorter lead times across all basin scales, while simple persistence and climatology-based forecasts perform similarly at longer lead times. Anomaly persistence performs similarly to simple persistence at shorter lead times, while it outperforms simple persistence at longer lead times and smaller basin scales (also see Figure 5). Note that climatology is an integral component of the anomaly persistence forecast. Therefore, the improved skill of anomaly persistence at longer lead times and smaller basin scales could be attributed to the improved performance of climatology-based forecasts at longer lead times.

**Figure 4.** One-to-one relationship of skill metrics between climatology and simple persistence-based forecasts (blue), and between anomaly persistence-based forecasts and simple persistence-based forecasts (red). The columns show 1, 3, 6 and 12 day lead times, while the rows show Kling–Gupta efficiency (KGE) and normalized mean absolute error (*nMAE*), respectively.

**Figure 5.** Evolution of KGE metric with lead time across increasing basin scales (**a**–**d**) shown in the inset. The panels show the KGE comparison between three methods: simple persistence, anomaly persistence and climatology.

From a real-time streamflow forecasting point of view, it is informative to explore how the KGE evolves with lead time across basin scales. In Figure 5, we show the evolution of the KGE for the three methods across four nested basins in the Karnali river system (see inset). In addition to the apparent strong dependence of the KGE on basin scale, the temporal evolution of the KGE varies across basin scales. For example, the smallest basin (*A*~159 km<sup>2</sup>) shows a sharp drop in the KGE up to a lead time of four days and remains stable thereafter, whereas the rate of decrease of the KGE becomes slower as basin size increases. Another important observation is that up to about four days of lead time, both persistence-based forecast schemes outperform climatology-based forecasts; beyond four days, the difference between climatology and anomaly persistence-based forecast skill is negligible. Note, however, that the difference in KGE between simple persistence and the other two forecast schemes decreases systematically with increasing basin scale.

The results presented above are associated with the entire streamflow time-series; in other words, the computed forecast skill reflects the contributions of both baseflow and stormflow. A natural question is whether we can achieve similar forecast performance for the stormflow (i.e., the direct runoff component of the hydrographs). In Figure 6, we present the forecast performance based on simple persistence for the direct runoff obtained using HYSEP, described in Section 2. As we demonstrate through the stormflow hydrographs in Figure A5, Appendix A, at a lead time of one day the simple persistence-based forecasts are essentially delayed by one day: the timing of the forecast hydrographs and of the extreme event peak is delayed by one day, while the magnitude of the hydrograph peak is preserved. Note the damping of smaller rainfall-runoff event stormflows in the smaller nested basin (see a, Figure A5, Appendix A) by the river network aggregation process in the larger basin (see b, Figure A5, Appendix A). The resulting pattern of both KGE and *nMAE* with basin scale is similar to that presented before. Notably, a reasonable KGE can be achieved for one-day-ahead stormflow forecasts, while skill declines significantly at longer lead times. Basins larger than 1000 km<sup>2</sup> still show KGE > 0.3 for 12-day-ahead forecasts, illustrating the potential of simple persistence as a reference forecast for the evaluation of stormflow forecasts, particularly at short range.

**Figure 6.** Relationship of skill metrics with drainage area for event-based streamflow. The columns show 1, 3, 6 and 12 day lead times, while the rows show Kling–Gupta efficiency (KGE) and normalized mean absolute error (*nMAE*), respectively.

As we highlighted in Section 2, our region predominantly contains rivers in the perennial flow regime (snow/glacier fed). There are, however, some rivers originating in the southern plains of Nepal which are predominantly in the intermittent flow regime. Since we established that forecast skill shows a strong spatial scale dependence, it is of interest to the forecasting community to explore its dependence on flow regime as well. The ideal comparison is to evaluate forecasts conditional on basin scale. We identified four basins of size ~2500 km<sup>2</sup>: two perennial river basins in northern Nepal and two intermittent river basins in southern Nepal (see inset of Figure 7). Figure 7 shows the evolution of the KGE (simple persistence-based) with lead time for these four basins, and a distinct pattern emerges. The perennial rivers show higher forecast performance with a gradual decline over the forecast horizon (lead time), while the intermittent rivers show a sharp decline with forecast horizon. Since intermittent rivers are mostly driven by rainfall-runoff events, the evolution of their forecast skill resembles the behavior elucidated by the results in Figure 6. Note the sudden spike in KGE for the Babai river at a lead time of four days. Though it is hard to pinpoint the reason, one could attribute it to the ability of the four-day-ahead forecast to capture two consecutive historic floods (separated by four days) in the year 2015.

It is extremely important for forecasters to achieve good predictability of the higher flow quantiles for flood preparedness and mitigation efforts. For the same four basins, we evaluate the forecast performance associated with different streamflow quantiles, using the flow quantiles as flow thresholds. Figure 8 shows the evolution of the KGE across lead times for various flow quantiles; the KGE reported here is computed for the streamflow forecasts exceeding the corresponding flow quantile. Note that as the flow quantile increases, the sample size used to compute the forecast skill systematically decreases. Given that we evaluate the forecast skill using the continuous streamflow time-series in record, we consider the sample size sufficient for the evaluation. Clearly, the KGE for the intermittent rivers (lower panel) declines sharply for the higher flow quantiles. The perennial rivers (upper panel), however, depict much better forecast performance even for the higher flow quantiles at longer lead times. Of the two, the Marshyangdi basin at Bhakundebesi shows a relatively faster decline in performance for flows exceeding the 60th percentile. As Ghimire and Krajewski [19] highlighted, this variability in KGE could be explained by differences in river network geometry, typically characterized by the network width function (see [37,38] for details). Explicit attribution is beyond the scope of this paper.
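A minimal sketch of this conditional evaluation is given below; it is not the authors' code, and taking the threshold as a percentile of the observed record is our assumption about the exact conditioning.

```python
import numpy as np

def kge(f, o):
    # Kling-Gupta efficiency (Eq. 7)
    f, o = np.asarray(f, float), np.asarray(o, float)
    r = np.corrcoef(f, o)[0, 1]
    alpha, beta = f.std() / o.std(), f.mean() / o.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def kge_above_quantile(qf, qo, pct):
    """Conditional KGE: keep only the forecast/observation pairs whose
    forecast exceeds the pct-th percentile of the observed record,
    mirroring the percentile curves of Figure 8."""
    qf, qo = np.asarray(qf, float), np.asarray(qo, float)
    mask = qf > np.percentile(qo, pct)
    return kge(qf[mask], qo[mask])
```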

**Figure 7.** Evolution of Kling-Gupta efficiency (KGE) metric with lead time. The lines in the top (red and blue) represent snow-fed river basins while the lines in the bottom (green and magenta) represent intermittent river basins. Note that all four river basins are of a similar size of ~2500 km2.

**Figure 8.** Percentile plots showing the evolution of Kling–Gupta efficiency (KGE) with lead time for four river basins described in Figure 7. Each percentile refers to the KGE computed for streamflow forecasts exceeding corresponding percentile flow. For example, 90th percentile KGE represents the KGE computed for streamflow forecasts >90th percentile flow. The envelope corresponds to each 10th quantile on either side of the median flow. The higher and lower flow quantiles shown in the interval correspond to the lower and upper bounds of the envelope, respectively. For example, the interval [90, 10] corresponds to the KGE computed for 90th and 10th percentile flows, respectively.

#### **4. Discussion**

A clear signature emerging from our persistence-based streamflow forecast skill is its strong spatio-temporal dependence, and our results substantiate this further. The results from Figures 3 and 6 demonstrate that a significant share of the persistence-based forecast skill originates from the baseflow contribution. In other words, the contributions from snowpack (snow/glacier), groundwater and storage in the contributing catchments play an important role in the streamflow hydrographs, particularly for the perennial rivers in the Himalayan region of Nepal. Typically, streamflow predictability at small basin scales is more closely tied to rainfall than at large basin scales; we highlighted this fact through our results for the intermittent rivers. The perennial rivers, however, receive sizable contributions from baseflow relative to rainfall even at small basin scales, hence their relatively good forecast skill. Moreover, persistence-based forecasts derive skill from the memory of the system, in this case the river catchments: the larger the basin scale, the longer the catchment memory and hence the longer-range persistence. Our results show that high forecast skill (KGE > 0.7) can be achieved for basins larger than 1000 km<sup>2</sup>, even at longer forecast horizons. Another important contribution to the skillful forecasts of larger basins at longer lead times comes from the water transport in the river network, where the streamflow aggregation process controls the shape of the streamflow fluctuations [27,39–41]. In addition to land surface (e.g., catchment) memory, streamflow predictability has a strong connection to the persistence of the initial hydrologic conditions of the land surface (e.g., soil moisture, groundwater, current streamflow and snowpack) [42–44], which provide the main source of skillful streamflow forecasts.

The dependence of forecast skill on basin size and forecast lead time depicted in this study is consistent with the results presented in Ghimire and Krajewski [19]. Given the large number of glacierized basins in the region, as opposed to the agricultural watersheds used in their study, more stations depict high streamflow forecast skill here. For instance, at a lead time of one day, we find a median KGE of about 0.92 in the region, compared with a median KGE of about 0.78 in their study. Our results therefore demonstrate a clear implication of persistence-based forecasts for real-time streamflow forecasting in the Himalayan region. The reasonably good streamflow forecast skill across spatial and temporal scales provides a reliable benchmark (reference) for evaluating the efficacy of operational real-time streamflow forecasting. Particularly for larger basin scales, we show that it is difficult for many operational mechanistic hydrologic models to outperform persistence-based forecast skill. Any operational forecasting scheme that depicts forecast skill better than the three reference forecast schemes presented in this study can be considered skillful. We show quantitatively that persistence can serve as a skillful reference for the evaluation of real-time flood forecasts (higher flow quantiles) up to the medium range for perennial rivers and up to the short range for intermittent rivers. Our findings clearly show that a persistence-based forecast scheme, anomaly persistence in particular, could provide skillful reference forecasts for the evaluation of current operational streamflow forecasting systems in the Himalayan country of Nepal. Moreover, it could provide guidance to the flood-related decision-making process, especially in the short-to-medium forecast range.

#### **5. Conclusions**

In this study, we explored the utility of persistence-based forecasting schemes to benchmark real-time streamflow forecasting in the Himalayan region of Nepal. We used daily streamflow data at 65 stream gauge stations monitored by the DHM, Nepal, to generate the forecasts and evaluate the associated skill over the short-to-medium forecast horizon. To this end, we employed three reference forecast schemes: (i) simple persistence, (ii) streamflow climatology and (iii) anomaly persistence. Based on the results of this study, we reach the following conclusions:



This study highlights the fact that a persistence-based forecast scheme is difficult to outperform using many mechanistic hydrologic models, particularly for larger basin scales in the Himalayan region. The findings from this study have implications for evaluating real-time streamflow forecasts from the operational forecasting system across basin scales and lead times in this region. Moreover, it provides insights on designing the streamflow monitoring network for future applications.

Our study, however, is not without limitations. We did not account for the measurement uncertainty associated with streamflow observations. We considered the published streamflow data to be within the acceptable limit of observational uncertainty. We did not account for the uncertainty associated with the rating curves. Moreover, we could not perform the sub-daily forecast skill evaluation due to the unavailability of published sub-daily streamflow observations. We expect that the analysis using sub-daily streamflow data will not lead to a significantly different inference regarding the forecasting performance of persistence-based systems, which we could explore further in our future research.

**Author Contributions:** Conceptualization, G.R.G.; methodology, G.R.G., and S.S.; software, G.R.G.; formal analysis, G.R.G.; investigation, G.R.G., S.S., J.P., R.T., B.P., P.D., and R.B.; resources, G.R.G., B.P., J.P.; writing—original draft preparation, G.R.G.; writing—review and editing, G.R.G., S.S., J.P., R.T., B.P., P.D., and R.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors greatly acknowledge the Department of Hydrology and Meteorology (DHM), Nepal for providing the streamflow data used for the study. The authors are grateful to the anonymous reviewers for their reviews and constructive comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A1.** Histograms showing the distribution of upstream drainage area (**a**) and streamflow record length (**b**) associated with the stream gauge stations in Nepal.

**Figure A2.** Demonstration of streamflow separation using HYSEP. Area highlighted in the blue represents the baseflow while the remaining area represents the direct run-off component. (**a**,**b**) correspond to the Langurkhola in Chhana (159 km2) and Karnali river in Chisapani (42,890 km2), respectively.

**Figure A3.** Relationship between KGE and drainage area across lead times for the intermittent flow rivers.

**Figure A4.** Relationship between peak difference, *PD* and drainage area across lead times. Blue dots correspond to the climatology forecasts and red dots correspond to the anomaly persistence-based forecasts.

**Figure A5.** Demonstration of observed and forecasted stormflow hydrographs for the largest rainfall-runoff event of the year 2008 between 9/1/2008 and 11/1/2008 based on simple persistence for two basins shown in Figure A2 at a lead time of one day: Langurkhola in Chhana (159 km2) and Karnali river in Chisapani (42,890 km2), respectively.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Generalized Flow for B2B Sales Predictive Modeling: An Azure Machine-Learning Approach**

#### **Alireza Rezazadeh**

Electrical and Computer Engineering Department, University of Illinois at Chicago, Chicago, IL 60607, USA; arezaz2@uic.edu

Received: 3 July 2020; Accepted: 5 August 2020; Published: 6 August 2020

**Abstract:** Predicting the outcome of sales opportunities is a core part of successful business management. Conventionally, undertaking this prediction has relied mostly on subjective human evaluations in the process of sales decision-making. In this paper, we addressed the problem of forecasting the outcome of Business to Business (B2B) sales by proposing a thorough data-driven Machine-Learning (ML) workflow on a cloud-based computing platform: Microsoft Azure Machine-Learning Service (Azure ML). This workflow consists of two pipelines: (1) An ML pipeline to train probabilistic predictive models on the historical sales opportunities data. In this pipeline, data is enriched with an extensive feature enhancement step and then used to train an ensemble of ML classification models in parallel. (2) A prediction pipeline to use the trained ML model and infer the likelihood of winning new sales opportunities along with calculating optimal decision boundaries. The effectiveness of the proposed workflow was evaluated on a real sales dataset of a major global B2B consulting firm. Our results implied that decision-making based on the ML predictions is more accurate and brings a higher monetary value.

**Keywords:** customer relationship management; business to business sales prediction; machine learning; predictive modeling; microsoft azure machine-learning service

#### **1. Introduction**

In the Business to Business (B2B) commerce, companies compete to win high-valued sales opportunities to maximize their profitability. In this regard, a key factor for maintaining a successful B2B enterprise is the task of forecasting the outcome of sales opportunities. B2B sales process typically demands significant costs and resources and, hence, requires careful evaluations in the very early steps. Quantifying the likelihood of winning new sales opportunities is an important basis for appropriate resource allocation to avoid wasting resources and sustain the company's financial objectives [1–4].

Conventionally, forecasting the outcome of sales opportunities is carried out mostly relying on subjective human rating [5–8]. Most of the Customer Relationship Management (CRM) systems allow salespersons to manually assign a probability of winning for new sales opportunities [9]. This probability is then used at various stages of the sales pipeline, i.e., calculating a weighted revenue of the sales records [10,11]. Often each salesperson develops a non-systematic intuition to forecast the likelihood of winning a sales opportunity with little to no quantitative rationale, neglecting the complexity of the business dynamics [9]. Moreover, as often as not, sales opportunities are intentionally underrated to avoid any internal competition with other salespersons or overrated to circumvent the pressure from sales management to maintain a higher performance [12].

Even though the abundance of data and improvements in statistical and machine-learning (ML) techniques have led to significant enhancements in data-driven decision-making, the literature is scarce on the subject of B2B sales outcome forecasting. Yan et al. [12] explored predicting the win-propensity of sales opportunities using a two-dimensional Hawkes dynamic clustering technique. Their approach allowed for live assessment of active sales, although it relied heavily on regular updates and inputs from salespersons in the CRM system. This solution is hard to maintain in larger B2B firms, considering that each salesperson often handles multiple opportunities in parallel and puts less effort into frequent interaction with each sales record [13].

Tang et al. [9] built a sales forecast engine consisting of multiple models trained on snapshots of historical data. Although their paradigm focused on revenue forecasting, they demonstrated the effectiveness of hybrid models for sales predictive modeling. Bohane et al. [5] explored the idea of single- and double-loop learning in B2B forecasting using ML models coupled with general explanation methods. Their main goal was to actively involve users in the process of model development and testing. Building on their earlier work on effective feature selection [14], they concluded that random forest models were the most promising for B2B sales forecasting.

Here, we proposed an end-to-end cloud-based workflow to forecast the outcome of B2B sales opportunities by reframing this problem into a binary classification framework. First, an ML pipeline extracts sales data and improves them through a comprehensive feature enhancement step. The ML pipeline optimally parameterizes a hybrid of probabilistic ML classification models trained on the enhanced sales data and eventually outputs a voting ensemble classifier. Second, a prediction pipeline makes use of the optimal ML model to forecast the likelihood of winning new sales opportunities. Importantly, the prediction pipeline also performs thorough statistical analysis on the historical sales data and specifies appropriate decision boundaries based on sales monetary value and industry segment. This helps to maximize the reliability of predictions by binding the interpretation of model results to the actual data.
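To make the training step concrete, the snippet below sketches a soft-voting ensemble of probabilistic classifiers with scikit-learn. The synthetic data, base models and parameters are illustrative stand-ins, not the pipeline's actual configuration (which runs in Azure ML).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (proprietary) enhanced sales features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Soft voting averages the win-probabilities predicted by each base
# classifier, mirroring the pipeline's "voting ensemble" output.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
win_prob = ensemble.predict_proba(X_te)[:, 1]   # likelihood of winning each opportunity
```

In the actual workflow, the predicted probabilities would then be compared against decision boundaries derived from the historical sales data rather than a fixed 0.5 cut-off.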

The proposed workflow was implemented and deployed to a global B2B consulting firm's sales pipeline using Microsoft Azure Machine-Learning Service (Azure ML). Such a cloud-based solution readily integrates into the existing CRM systems within each enterprise and allows for more scalability. Finally, we compared the performance of the proposed solution with salespersons' predictions using standard statistical metrics (e.g., accuracy, AUC, etc.). To make the comparison more concrete, we also investigated the financial aspect of implementing this solution and compared the monetary value of our ML solution with salespersons' predictions. Overall, we have found that the proposed ML solution results in a superior prediction both in terms of statistical and financial evaluations; therefore, it would be a constructive complement to the predictions made by salespersons.

This paper is organized as follows: In Section 2, materials and methods used in this work are introduced in detail. Section 3 summarizes the results of this work. Section 4 presents discussion on the results, limitations of the current work and potential future directions.

#### **2. Materials and Methods**

#### *2.1. Data and Features*

Data for this study were obtained from a global multi-business B2B consulting firm's CRM database in three main business segments: Healthcare, Energy, and Financial Services (Finance for short). This section, first, gives an overview of the data and then explains a data enhancement technique used to infer additional relevant information from the dataset.

#### Data

A total of 25,578 closed sales opportunity records from January 2015 through August 2019 were used in this work (Figure 1a). Each closed opportunity record contained a status label (won/lost) corresponding to its ultimate outcome; records still active in the sales pipeline were labeled as open. Out of all closed sales records, ∼58% were labeled as "won" in their final sales status (Figure 1b).

A total of 20 relevant variables (features) were extracted for each sales opportunity from the raw CRM database. Table 1 describes these features in more detail. Specifically, a subset of the features described the sales project (Opportunity Type, General Nature of Work, Detailed Nature of Work, Project Location, Project Duration, Total Contract Value, and Status). The remaining features provided further information on the customer (Account, Account Location, Key Account Energy, Key Account Finance, and Key Account Healthcare) and on the internal project segmentation and resource allocation (Business Unit, Engagement Manager, Sales Lead, Probability, Sub-practice, Practice, Group Practice, Segment, and User-entered Probability).

**Figure 1.** Data Exploration: (**a**) Distribution of sales opportunity records across the three business segments: Healthcare, Energy, and Finance. (**b**) Closed sales opportunities final status.


Once a sales opportunity profile was created in the CRM system, users were required to input their estimate of the probability of winning that opportunity. Please note that the user-entered probabilities were not used in the process of training the ML models; they served only as a point of reference against which to compare the performance of the ML workflow. All the features listed in Table 1 were required fields in the CRM system; therefore, less than 1% of the dataset contained missing values. Sales records with missing values were dropped from the dataset.

#### *2.2. Feature Enhancement*

The raw CRM dataset was enhanced by inferring additional relevant features calculated across the sales records. These additional features were derived through statistical analysis of the categorical features (Sales Lead, Account, Account Location, etc.). The idea was to build a lookup table of relevant statistics, calculated across the sales records, for each unique value of each categorical feature.

By collecting the historical data for each unique value of each categorical feature (i.e., for each individual Sales Lead, Account, Project Location, etc.), we calculated the following statistical metrics: (1) total number of sales opportunities; (2) total number of won sales; (3) total number of lost sales; (4) average contract value (value for short) of won sales; (5) standard error of the mean won-sales value; (6) winning rate, calculated as the ratio of won to total sales counts; and (7) coefficient of variation (the ratio of the standard deviation to the mean) [15] of won-sales value, to capture the extent of variability in the won sales contract values.

The aforementioned statistics were calculated and stored in feature enhancement lookup tables for each categorical feature (see Table 1 for a list of these features). Table 2 provides an example of a feature enhancement lookup table calculated based on the "Sales Lead" feature in the raw CRM dataset. These lookup Tables (13 tables in total for all categorical features) were appropriately merged back to the raw CRM sales dataset.
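The construction of these lookup tables can be sketched with a pandas group-by aggregation. This is a minimal illustration; the column names `sales_lead`, `status`, and `value` are placeholders, not the actual CRM schema:

```python
import pandas as pd

def enhancement_lookup(df, cat_col):
    """Build a feature-enhancement lookup table for one categorical column.

    Expects a 'status' column (won/lost) and a 'value' column (contract value);
    these names are illustrative placeholders for the CRM schema.
    """
    won = df[df["status"] == "won"].groupby(cat_col)["value"]
    stats = pd.DataFrame({
        "n_total": df.groupby(cat_col).size(),
        "n_won": df[df["status"] == "won"].groupby(cat_col).size(),
        "n_lost": df[df["status"] == "lost"].groupby(cat_col).size(),
        "won_value_mean": won.mean(),
        "won_value_sem": won.sem(),              # standard error of the mean
        "won_value_cv": won.std() / won.mean(),  # coefficient of variation
    }).fillna(0)
    stats["win_rate"] = stats["n_won"] / stats["n_total"]
    return stats

# Merging a lookup table back into the raw dataset:
# df.merge(stats, left_on=cat_col, right_index=True, how="left")
```

One such table would be built per categorical feature and joined back to the sales records, as described in the text.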

In the last feature enhancement step, the Mahalanobis distance [16] was calculated between each sales opportunity's value and the distribution of all won sales values that shared a similar categorical feature (individually for each of the 13 categorical features). This quantifies how far a sales value lies from the family of won sales with the same characteristics (i.e., the same Sales Lead, Project Location, Segment, etc.). The process of feature enhancement increased the total number of features to 137 for each sales record (20 features originally from the raw CRM dataset + 9 × 13 = 117 additional features from the lookup tables).
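For a single (univariate) contract value, the Mahalanobis distance to the distribution of comparable won-sales values reduces to the absolute z-score. A minimal sketch under that reading:

```python
import numpy as np

def value_distance(value, won_values):
    """Distance of one sales value from the distribution of won-sales values
    in the same category. For a univariate distribution, the Mahalanobis
    distance reduces to the absolute z-score |x - mu| / sigma."""
    won_values = np.asarray(won_values, dtype=float)
    mu = won_values.mean()
    sigma = won_values.std(ddof=1)  # sample standard deviation
    return abs(value - mu) / sigma if sigma > 0 else 0.0
```

For example, a value of 20 against won values [10, 10, 10, 30] (mean 15, sample standard deviation 10) gives a distance of 0.5.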

**Table 2.** Feature Enhancement Lookup Table: An example of the statistics calculated based on "Sales Lead" including counts of the total, won, and lost opportunities along with the mean and standard error of the mean (SEM) for won sales value and their coefficient of variation (CV).


The enhanced CRM dataset (25,578 sales opportunities in total) was randomly split into a "train set" (70%) and a "test set" (30%). The train set was used to train the ML models; performance on the train set is reported using 10-fold cross-validation. The test set was used to report the performance of the trained ML model on the unseen portion of the dataset. For further evaluation, after the proposed framework was deployed to the sales pipeline of the enterprise, a "validation set" of new sales records was collected over a period of 3 months (846 closed sales opportunities).
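The 70/30 split can be sketched as a random permutation of row indices. This is illustrative only; the paper does not specify its exact splitting tool, and the seed is an assumption:

```python
import numpy as np

def split_indices(n, test_frac=0.30, seed=42):
    """Random 70/30 split of row indices into train and test sets
    (a minimal sketch; the paper additionally reports 10-fold
    cross-validation on the train set)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]  # (train indices, test indices)

train_idx, test_idx = split_indices(25578)
```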

#### *2.3. Machine-Learning Overview*

Our solution to predicting the likelihood of winning sales opportunities essentially reframes the problem as supervised binary classification (won/lost). Hence, we made use of two of the most promising supervised classification algorithms: XGBoost and LightGBM. These two models were selected from the family of probabilistic classification models due to their higher classification accuracy on our problem. A second motivation for using these two models was that distributed versions of both easily integrate into cloud platforms such as Azure ML. Lastly, to attain superior performance, multiple iterations of both models were combined in a voting ensemble.

#### 2.3.1. Binary Classification

Probabilistic classification algorithms [17], given pairs of samples and their corresponding class labels (*X*1, *y*1), ... ,(*Xn*, *yn*), capture a conditional probability distribution over the output classes *P*(*yi* ∈ *Y* | *Xi*) where for a binary classification scenario *Y* ∈ {0, 1} (maps to lost/won in our problem). Given the predicted probability of a data sample, a decision boundary is required to define a reference point and predict which class the sample belongs to. In a standard binary classification, the predicted class is the one that has the highest probability [18]. This translates to a standard decision boundary of 0.5 for predicting class labels.

However, the decision boundary can be calibrated to better reflect the distribution of the data. The influence of the decision boundary on the number of true positives (*TP*), false positives (*FP*), true negatives (*TN*), and false negatives (*FN*) in binary classification is illustrated in Figure 2 (see Table 3 for definitions). In this work, we find the optimal decision boundary for a classification model by maximizing all true conditions (both *TP* and *TN*), which in turn minimizes all the false conditions (*FP* and *FN*). Visually, this decision boundary is a vertical line passing through the intersection of *P*(*X*|*Y* = 0) and *P*(*X*|*Y* = 1) in Figure 2.
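Maximizing *TP* + *TN* over a grid of candidate thresholds can be sketched as follows. This is an illustrative implementation; the paper does not detail its exact optimization procedure, and the grid is an assumption:

```python
import numpy as np

def optimal_boundary(p_won, y_true, grid=None):
    """Decision boundary that maximizes TP + TN (equivalently, minimizes
    FP + FN) over a grid of candidate thresholds."""
    p_won = np.asarray(p_won, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 91)  # assumed candidate thresholds
    best_t, best_correct = 0.5, -1
    for t in grid:
        pred = (p_won >= t).astype(int)
        correct = int((pred == y_true).sum())  # TP + TN at this threshold
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t
```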

**Figure 2.** Decision Boundary: impact of the decision boundary on different scenarios of the binary classification.

The performance of a binary classifier can be evaluated using standard statistical metrics such as accuracy, precision, recall, and F1-score (see Table 3). For the case of binary classification, the area under ROC curve (AUC) measures the robustness of the classification (a higher AUC suggests more robust classification performance) [19]. As shown by Hand et al. [20], the AUC of a classifier *G* can be calculated as:

$$\hat{A} = \frac{S_0 - n_0(n_0 + 1)/2}{n_0 n_1} \tag{1}$$

where *n*0 and *n*1 are the numbers of positive and negative samples, respectively, and *S*0 = ∑ *ri*, where *ri* is the rank of the *i*th positive example when all samples are ranked in ascending order of score (so that positive examples tend to receive the higher ranks).
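Equation (1) can be verified numerically in a few lines. The function name and the ascending-rank convention are illustrative assumptions, and ties are not handled:

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC via the rank-sum formula of Equation (1).

    S0 is the sum of the ranks of the positive examples when all samples
    are ranked in ascending order of score (rank 1 = lowest score).
    Ties are not handled in this minimal sketch.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    s0 = ranks[labels == 1].sum()
    n0 = int((labels == 1).sum())  # number of positive samples
    n1 = int((labels == 0).sum())  # number of negative samples
    return (s0 - n0 * (n0 + 1) / 2) / (n0 * n1)
```

For scores [0.1, 0.4, 0.35, 0.8] with labels [0, 0, 1, 1], this gives 0.75, matching the standard trapezoidal AUC on the same data.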

We took a further step to obtain more insight into the classification results and measured the performance of the classifier from a monetary perspective, i.e., we calculated the value created by adopting a classification algorithm in the decision-making process. In particular, we aggregated the total sales values in each of the four classification scenarios (*TPm*, *TNm*, *FPm*, *FNm*) and defined monetary performance metrics with a formulation similar to the statistical metrics (see Table 3). For instance, the monetary precision is the fraction of the sales value predicted as won that was actually won.
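The monetary metrics can be sketched by aggregating contract values into the four classification buckets. This illustrates the idea; Table 3 should be consulted for the authoritative formulas:

```python
def monetary_metrics(values, y_true, y_pred):
    """Aggregate contract values into the four classification buckets
    (TPm, TNm, FPm, FNm) and compute monetary metrics analogous to the
    statistical accuracy, precision, and recall."""
    tp = sum(v for v, t, p in zip(values, y_true, y_pred) if t == 1 and p == 1)
    tn = sum(v for v, t, p in zip(values, y_true, y_pred) if t == 0 and p == 0)
    fp = sum(v for v, t, p in zip(values, y_true, y_pred) if t == 0 and p == 1)
    fn = sum(v for v, t, p in zip(values, y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```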

**Table 3.** Statistical and monetary classifier performance metrics.


#### 2.3.2. XGBoost and LightGBM Classifiers

XGBoost, introduced by Chen and Guestrin [21], is a supervised classification algorithm that iteratively combines weak base learners into a stronger learner. With this algorithm, the objective function *J* is defined as

$$J(\Theta) = L(y, \hat{y}) + \Omega(\Theta) \tag{2}$$

where Θ denotes the model's parameters. The training loss function *L* quantifies the difference between the prediction *y*ˆ and the actual target value *y*. The regularization term Ω penalizes the complexity of the model (here, via the number of leaves and an L2 penalty on the leaf weights; see Equation (6)) to smooth the learned model and avoid over-fitting. The model's prediction is an ensemble of *K* decision trees from a space of trees F:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F} \tag{3}$$

The objective function at iteration *t* for *n* instances can be simplified as:

$$J^{(t)} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{t} \Omega(f_k) \tag{4}$$

where, according to Equation (3), *y*ˆ*i* can be written iteratively as

$$
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(\mathbf{x}_i) = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i) \tag{5}
$$

The regularization term can be defined as

$$
\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{6}
$$

where the coefficient *γ* penalizes the complexity added by each leaf and *T* is the total number of leaves in the decision tree; *λ* scales the penalization of the leaf weights. Using a second-order Taylor expansion of the loss (with *gi* and *hi* denoting its first and second derivatives) and assuming a mean square error (MSE) loss function, Equation (4) can be written as

$$J^{(t)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(\mathbf{x}_i)} + \frac{1}{2} h_i w_{q(\mathbf{x}_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{7}$$

Since each incident of data corresponds to only one leaf, according to [22], this can also be simplified as

$$J^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \tag{8}$$

where *Ij* represents all instances of data in leaf *j*. As can be seen in Equation (8), minimizing the objective function reduces to finding the minimum of a quadratic function of each leaf weight.
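Concretely, writing *Gj* = ∑ *gi* and *Hj* = ∑ *hi* over the instances in leaf *j*, each leaf's contribution to Equation (8) is the quadratic *Gj w* + ½(*Hj* + *λ*)*w*², whose minimizer is *w** = −*Gj*/(*Hj* + *λ*). A two-line numerical check (function name is ours):

```python
def optimal_leaf_weight(g, h, lam):
    """Minimizer of the per-leaf quadratic in Equation (8):
    J_j(w) = G_j * w + 0.5 * (H_j + lam) * w**2  =>  w* = -G_j / (H_j + lam),
    where G_j and H_j are the sums of the first and second loss derivatives
    over the instances in leaf j."""
    G, H = sum(g), sum(h)
    return -G / (H + lam)
```

For example, with gradients [1, 2], Hessians [1, 1], and λ = 1, the optimal leaf weight is −1.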

In an endeavor to reduce the computation time of the XGBoost algorithm, Ke et al. proposed LightGBM [23]. The main difference between XGBoost and LightGBM is how they grow the decision tree (see Figure 3 for a high-level comparison). In XGBoost, decision trees are grown horizontally (level-wise), while in LightGBM they are grown vertically (leaf-wise). Importantly, this makes LightGBM an effective algorithm for handling datasets with high dimensionality.

**Figure 3.** Comparison between XGBoost level-wise horizontal tree growth and LightGBM vertical leaf-wise tree growth.

#### 2.3.3. Voting Ensemble

The main idea behind voting ensembles is to combine various classification models so as to balance out their individual classification errors. Voting ensembles predict the class label either by taking a majority vote over the individual models' predictions (hard vote) [24] or by averaging their predicted probabilities (soft vote) [25]. A voting ensemble was used to integrate the predictions of multiple iterations of both the XGBoost and LightGBM classifiers with different parameterizations. Specifically, a weighted-average soft-voting ensemble was used to combine the predictions of the models (Figure 4). A soft-voting ensemble is a meta-classifier that computes its prediction *y*ˆ*i* by taking the weighted average (with weights *wj*) of the probability predicted by each classifier (*pij*):

$$\hat{y}_i = \operatorname{argmax}_i \sum_{j=1}^{m} w_j p_{ij} \tag{9}$$
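Equation (9) can be sketched in a few lines of NumPy. The function name and the weight normalization are illustrative assumptions:

```python
import numpy as np

def soft_vote(probas, weights):
    """Weighted-average soft voting (Equation (9)).

    probas: list of (n_samples, n_classes) probability arrays, one per model.
    weights: one weight per model (normalized here for clarity).
    Returns the predicted class labels and the averaged probabilities.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    avg = sum(w * np.asarray(p, dtype=float) for w, p in zip(weights, probas))
    return avg.argmax(axis=1), avg
```

With two models predicting [0.9, 0.1] and [0.2, 0.8] for one sample and weights 2:1, the averaged probabilities are [2/3, 1/3] and the predicted class is 0.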

**Figure 4.** Voting Ensemble combines predictions *pi* of multiple classifiers *mi* using weighted average *wi* to compute a final prediction *P*.

#### *2.4. Workflow and Pipelines*

A pipeline is defined as an executable workflow of data encapsulated in a series of steps. In this work, the proposed workflow consists of two main pipelines: (1) the ML pipeline and (2) the prediction pipeline. All pipeline code was custom-written in Python 3.7 on the Microsoft Azure Machine-Learning Service [26] cloud platform. The XGBoost v1.1 and LightGBM v2.3.1 libraries were used in Python to create the ML classification models. The voting ensemble was created using the Microsoft Automated Machine-Learning tool [27].

#### 2.4.1. Machine-Learning Pipeline

The main objective of the ML pipeline is to train predictive models on the closed sales opportunities data. As illustrated in Figure 5, there are four main steps in this pipeline:

(1) *Data Preparation*: Raw data on all closed sales opportunities are extracted from the CRM cloud database. Relevant features are selected for each sales record (see Table 1) and paired with its sales outcome (won/lost) as a class label. Please note that the user-entered probabilities are dropped to avoid biasing the model's predictions.


**Figure 5.** ML Pipeline: In four major steps the pipeline extracts and enhances sales data from a cloud database, trains an ensemble of ML classification models on the data, and eventually creates a cloud endpoint for the model.

#### 2.4.2. Prediction Pipeline

The prediction pipeline, as illustrated in Figure 6, was designed to use the trained ML model and make predictions on the likelihood of winning new sales opportunities in four main steps:


(4) *Decision Boundaries*: All historical data on closed sales opportunities, along with their ML-predicted probabilities, are grouped by business segment (Healthcare, Energy, and Finance). Next, within each business segment, closed sales records are split into four quartiles based on their value. Then, the optimal decision boundary is calculated for each business segment's value quartile as described in Section 2.3.1. A total of 12 decision boundaries are calculated (3 business segments × 4 quartiles). Finally, all predicted probabilities and decision boundaries are stored back in the cloud database.
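The per-segment, per-quartile boundary computation can be sketched with pandas. The column names `segment`, `value`, `p_won`, and `won` are placeholders, and `find_boundary` stands for any threshold optimizer such as the *TP* + *TN* maximization of Section 2.3.1:

```python
import pandas as pd

def segment_quartile_boundaries(df, find_boundary):
    """One decision boundary per (business segment, value quartile) cell:
    3 segments x 4 quartiles = 12 boundaries in the paper's setting.
    `find_boundary` is any threshold optimizer applied to the predicted
    probabilities and actual outcomes of the closed sales in a cell."""
    df = df.copy()
    # Assign each record a value quartile within its own segment.
    df["quartile"] = df.groupby("segment")["value"].transform(
        lambda v: pd.qcut(v, 4, labels=[1, 2, 3, 4]))
    return {
        key: find_boundary(g["p_won"].to_numpy(), g["won"].to_numpy())
        for key, g in df.groupby(["segment", "quartile"], observed=True)
    }
```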

**Figure 6.** Prediction Pipeline: new sales opportunities data are transformed and enhanced, probability of winning is inferred using the trained ML model, and finally decision boundaries are optimized based on historical sales data.

#### **3. Results**

This section gives an overview of the proposed workflow's performance. The workflow was implemented in the CRM system of a global B2B consulting firm. The two pipelines were scheduled for automated runs on a recurring basis on the Azure ML platform. The ML pipeline was scheduled for a weekly rerun to retrain ML models on updated sales data and generate updated feature enhancement lookup tables. The prediction pipeline was scheduled for a daily rerun to calculate and store predictions for new sales opportunities.

#### *3.1. Training the ML Model*

A total of 34 iterations of XGBoost and LightGBM were individually trained on the data and then combined in a voting ensemble classifier (see Section 2.3.3 for more details). Training accuracy was calculated using 10-fold cross-validation. The accuracy of each of the 35 iterations (the last iteration being the voting ensemble classifier) is shown in Figure 7a. The training accuracies of the top five model iterations are listed in Figure 7b. As expected, the voting ensemble had a slightly higher training accuracy than each individual classifier.

The voting ensemble classifier had a training accuracy of 81% (other performance metrics are listed in Table 4). On the train set, approximately 83% of the won sales and 79% of the lost sales were classified correctly (Figure 7d). For more insight into the training performance, the ROC curve is also illustrated (Figure 7c). The area under the ROC curve (AUC) was 0.87; in other words, the model ranks a randomly drawn won sample from the train set above a randomly drawn lost sample with 87% probability.

**Figure 7.** ML Training Results: (**a**) Training accuracy for all model iteration. (**b**) Accuracy of the top five iterated models sorted by the training accuracy. (**c**) ROC curve of the voting ensemble classifier. (**d**) Confusion matrix showing the four scenarios of classification for the voting ensemble model.



#### *3.2. Setting the Decision Boundaries*

As explained in Section 2.4.2, statistical analysis of historical sales data is performed in each business segment (Healthcare, Energy, and Finance) to determine the decision boundaries. Specifically, the decision boundary was optimized for each of the four sales value quartiles of each business segment. The decision boundaries, demonstrated in Figure 8, ranged from 0.41 (Finance business segment—3rd value quartile) to 0.75 (Energy business segment—1st value quartile).

Interestingly, the decision boundaries were lower for sales opportunities with a higher monetary value which implies a more optimistic decision-making for more profitable opportunities. This sensible trend observed in the optimal decision boundaries provides more evidence to substantiate the idea of tailoring the boundaries uniquely to each business segment and value quartile due to their inherent decision-making differences.

**Figure 8.** Decision Boundaries.

#### *3.3. Model's Performance*

The voting ensemble was used to make predictions on the unseen test set. In particular, after the probability of winning was inferred for each sales opportunity, it was classified according to the decision boundary corresponding to its business segment and value quartile: if the inferred probability of winning exceeded the decision boundary, the sales opportunity was classified as won; otherwise, it was classified as lost. To make a concrete comparison between user-entered and ML predictions, statistical and monetary performance metrics were calculated for both approaches.

All four classification scenarios in the test set for both user-entered and ML prediction are depicted in Figure 9a. Qualitatively, the ML workflow made fewer false classifications (i.e., compare the true positive *TP* slice proportions in Figure 9a). More specifically, the ML workflow accurately classified 87% of the unseen sales data while the user-entered predictions only had an accuracy of 67%. In fact, all statistical performance metrics (precision, recall, and F1 score) were in favor of the ML predictions (see Table 5).

The performance of the user-entered and ML predictions was also compared with reference to the monetary metrics (see Section 2.3.1 for more details). As shown in Figure 9b, sales opportunities falsely predicted to be won by the ML workflow had a considerably lower cumulative monetary value (compare the false positive *FPm* slice proportions). This implies a lower monetary loss due to prediction error when using the ML predictions. Quantitatively, the monetary accuracy of the ML model was notably higher than that of the user-entered predictions (90% versus 74%). Other monetary performance metrics are listed in Table 5.

**Figure 9.** Test Set Results: (**a**) Statistical Performance: Actual outcome of all won (light green) and lost (light red) sales opportunities along with the corresponding predictions (solid green and red). (**b**) Monetary Performance: Cumulative contract value of won (light green) and lost (light red) sales opportunities along with the cumulative value of opportunities in each of the four classification scenarios (solid green and red). In both panels, mismatched colors indicate false classification.


**Table 5.** Test Set Performance Metrics.

#### *3.4. Analysis of the Workflow Implementation*

Similar to the previous section, a performance comparison between the user-entered and ML predictions was performed on a validation set. The validation set was collected while the workflow was implemented in the sales pipeline of a B2B consulting firm over a period of three months (see Section 2.1 for further details). A qualitative comparison in terms of statistical and monetary performance is presented in Figure 10. In the validation set, the ML workflow retained a substantially higher prediction accuracy (83% versus 63%). Also, there was an evident gap between the number of won sales misclassified by each approach (compare the true positive *TP* slices in Figure 10a).

The monetary accuracy of the ML predictions was marginally lower than the user-entered predictions (75% versus 77%). However, the cumulative value of the won sales opportunities correctly classified by the ML workflow was still considerably higher than the user-entered predictions (compare the true positive *TPm* slices in Figure 10b). All performance metrics are listed in Table 6.

**Figure 10.** Validation Set Results: (**a**) Statistical and (**b**) Monetary performance of user-entered and ML predictions. Refer to Figure 9 caption for further explanations.



#### **4. Conclusions**

In this paper, we proposed a novel ML workflow implemented on a cloud platform for predicting the likelihood of winning sales opportunities. With this approach, sales data were extracted from the CRM cloud database and then improved by an extensive feature enhancement approach. The data were then used to train an ensemble of probabilistic classification models in parallel. The resulting meta classification model was then used to infer the likelihood of winning new sales opportunities. Lastly, to maximize the interpretability of the predictions, optimal decision boundaries were calculated by performing statistical analysis on the historical sales data.

To inspect the effectiveness of the ML approach, it was deployed to a multi-business B2B consulting firm for over three months. The performance of the ML workflow was compared with the user-entered predictions made by salespersons. Standard statistical performance metrics confirmed that the ML workflow provided far superior predictions. From a monetary standpoint, the value created from decision-making was also higher when incorporating the ML workflow.

The proposed ML workflow is a cloud-based solution that can readily be integrated into the existing cloud-based CRM system of enterprises. On top of that, this workflow is highly sustainable and scalable since it relies on cloud computing power instead of on-premise computing resources. Although our proposed workflow is mainly built around the Azure ML platform, future work can explore implementing this workflow on other cloud computing resources such as Amazon Web Services, Google Cloud Platform, etc.

A potential issue with the proposed workflow is handling the scenario of an imbalanced dataset. An imbalanced dataset is characterized by having more instances of a certain class compared to others [29]. In our problem, this would translate to a dataset that has more lost sales records than won (or vice versa). For instance, consider the Energy or Finance business segments (Figure 1) in our dataset, where the numbers of won and lost sales records are imbalanced. Solutions that deal with an imbalance problem at the data level involve oversampling the smaller class or undersampling the larger class [30–32]. Galar et al. [33] showed that combining random undersampling techniques with ensemble models stands out among data-level solutions. In future work, we hope to explore this idea by incorporating an undersampling technique into the existing ensemble models in the workflow.
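Random undersampling of the majority class, the data-level remedy mentioned above, can be sketched in NumPy. This is a minimal illustration; libraries such as imbalanced-learn offer production-ready implementations:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly undersample every class down to the minority-class size,
    a data-level remedy for class imbalance."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, keep a random sample of n_min row indices.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```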

The results obtained in this work suggest that a data-driven ML solution for predicting the outcome of sales opportunities is a more concrete and accurate approach than salespersons' subjective predictions. However, it is worth mentioning that ML solutions should not be used to override sensible or justifiable judgments of salespersons in evaluating a sales opportunity. A data-driven approach, such as the workflow presented in this work, can provide a reliable reference point for further human assessment of the feasibility of a sales opportunity.

**Funding:** This research received no external funding.

**Acknowledgments:** The author would like to thank Saeed Fotovat for his valuable suggestions and comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Bus Travel Time: Experimental Evidence and Forecasting**

#### **Antonio Comi \* and Antonio Polimeni**

Department of Enterprise Engineering, University of Rome Tor Vergata, 00118 Rome, Italy; antonio.polimeni@uniroma2.it

**\*** Correspondence: comi@ing.uniroma2.it; Tel.: +39-06-7259-7061

Received: 28 July 2020; Accepted: 26 August 2020; Published: 28 August 2020

**Abstract:** Bus travel time analysis plays a key role in transit operation planning, and methods are needed for investigating its variability and for forecasting it. Nowadays, telematics is opening up new opportunities, given that large datasets can be gathered through automated monitoring, and this topic can be studied in more depth with new experimental evidence. The paper proposes a time-series-based approach for travel time forecasting, using data from automated vehicle monitoring (AVM) of bus lines that share road lanes with other traffic in Rome (Italy) and Lviv (Ukraine). The results show the suitability of such an approach for the analysis and reliable forecasting of bus travel times. The similarities and dissimilarities in terms of travel time patterns and city structure are also pointed out, showing the need to take them into account when developing forecasting methods.

**Keywords:** travel time forecasting; time series; bus service; transit systems; sustainable urban mobility plan; bus travel time

#### **1. Introduction**

Travel time plays a key role in assuring the reliability and quality of service in a transit system. The use of this variable ranges from operation planning (e.g., for short- and long-term planning) to service monitoring (e.g., for real-time information). Moreover, in large historical cities (e.g., Rome, Italy), the influence of traffic congestion on travel time variability has been pointed out. It is linked, on the one hand, to the city structure and, on the other, to the growing number of private and freight vehicles travelling into the city [1–3]. Therefore, transit operators have to take travel time variability into consideration when they design new lines or work to improve existing ones. Usually, in cities, transit services share the road lanes with other vehicles; consequently, their travel times are highly impacted by congestion, and their temporal patterns can be similar to private/freight ones, as has emerged from surveys carried out around the world [4,5], in which seasonality and trend/cycle components were revealed. Time series analysis captures these patterns [6].

Therefore, the availability of reliable travel time forecasting is a relevant attribute for transit operators [6–10] to use when designing or updating service timetables. This can also contribute to attracting more passengers and increasing their satisfaction [11,12]. Moreover, in order to provide accurate information to the passenger, models and procedures must be developed to forecast travel time.

According to the Guidelines for Developing and Implementing a Sustainable Urban Mobility Plan (SUMP; [13]), a SUMP has to improve urban accessibility and offer users high-quality, sustainable mobility services from/to the study area. It addresses the needs of the *functioning city* and its hinterland rather than a municipal administrative region. Furthermore, a SUMP has to foster a balanced development of all relevant transport modes, while encouraging a shift towards more sustainable modes (e.g., transit). The plan puts forward an integrated set of technical, infrastructure, policy-based, and soft measures to improve performance and cost-effectiveness with regard to the declared goal and specific objectives [13]. Among the topics typically addressed, one of the most relevant is public transport, for which the plan is requested to present a strategy to enhance the quality, security, integration, and accessibility of public transport services, covering infrastructure, rolling stock, and services. Public transport is a good way to reduce congestion and environment- and health-harming emissions in urban areas, especially when vehicles run on alternative, cleaner fuels. Therefore, information on the state of the public network, together with reliable timetables and service schedules, may be an effective tool for improving the quality and effectiveness of services as perceived by users [14] and, hence, for diverting people to public transport modes.

Currently, one of the new challenges facing urban planners is to find solutions that reduce the impacts of urban mobility without penalising passengers' mobility needs. Since suitable time estimates are needed for designing timetables and scheduling services on the different routes, the development of travel time forecasting methodologies (including models) must account for the complexity of urban systems as well as city sizes and characteristics. In fact, cities can differ from each other in both mobility/transport patterns and traffic conditions, as well as in other factors including geographical, environmental, demographic, and socioeconomic conditions, cultural backgrounds, and institutional and legal frameworks. Hence, an emerging challenge is to develop high-performing forecasting methods and test their transferability [14,15]. Therefore, an outline of city similarities or dissimilarities with respect to the travel time of transit services could be useful for supporting the development and implementation of project/scenario actions. Such an analysis could serve as a pre-guideline for planners in an ex-ante assessment. It can verify whether the experimental results and forecasting methods used in a city match those obtained in other cities with respect to the defined goals (e.g., shifting users to more sustainable transit services) and identify factors that need in-depth, specific investigation.

Using comparable surveys carried out in two cities that are very different in terms of spatial and economic patterns, the paper highlights the similarities and dissimilarities that exist in bus travel time and points out an easy-to-apply methodology for forecasting. Such a tool can help transit agencies achieve further advances in transit systems, from both the travellers' perspective (i.e., transit trip planners) and the operators' perspective (i.e., decision support systems for operations control). The paper thus aims to identify whether the results obtained by travel time forecasting methods can be easily transferred from one city to another.

Information about travel time, thus, benefits operators and passengers [16]. It assists operators in defining optimal slack times to maximise the on-time arrival performance of buses [17], in determining the reliability of systems [18], and in defining timetables [19]. Accuracy in travel time forecasting for defining service timetables and schedules can contribute to increasing the perceptions of service quality and help users in their decision-making about departure time and route choice [20–22]. This is found to be as valuable [23], or even more valuable, than a reduction in travel time [24].

In this paper, a time-series approach is applied to investigate bus travel times; the analysed data refer to the automatic vehicle location (AVL) of some transit lines in the cities of Rome (Italy, [5]) and Lviv (Ukraine, [25]). The first objective is to correlate bus travel time with general traffic patterns and other explanatory variables. Once the bus travel time has been analysed, the second objective is to provide a travel time forecasting procedure based on time series and to point out if similarities exist among travel time patterns in cities that are quite different.

This paper is organised as follows. Section 2 briefly reviews the current literature on bus travel time forecasting methods. Section 3 summarises the available data, while Section 4 describes the methodology used and presents the analyses performed and the results obtained. Finally, Section 5 draws conclusions and outlines the road ahead.

#### **2. Literature Review**

As noted, travel time and, of course, onboard loads [26,27] are two of the most commonly used performance indicators in public transport systems. In particular, travel time (or commercial speed) is generally proposed as one of the fundamental parameters for assessing the effectiveness of the transport service. On the other hand, providing users with accurate and reliable travel forecasts can be a valid driver in attracting new demand and, therefore, favouring modal shift.

In the following, unless otherwise specified, travel time between two successive time points is defined as the total time elapsed between departure (or passage) from one time point and departure (or passage) from the next. The definition of travel time is shown in Figure 1.

**Figure 1.** Notation on bus travel time [28].

Travel time forecasts can be classified as short-term/real-time forecasts and long-term forecasts. The difference between the two types lies mainly in their forecast horizons. A short-term travel time forecast predicts travel time over a horizon equal to or shorter than a time value *T* [22], where *T* is defined case by case, e.g., 15 min. A long-term travel time forecast predicts travel time over a horizon longer than *T*, such as the following day, week, or year [29,30]. Long-term forecasting uses only historical average traffic condition data to predict future traffic status and travel time. Figure 2 illustrates the definitions of travel time estimation and forecast/prediction.

**Figure 2.** Estimation and prediction notations on bus travel time [28].

Therefore, given the values of the travel time observed up to time *T*, the forecast/prediction at time *T* of the future realisation can be of two types:

- a *short-term (real-time) forecast*, over a horizon equal to or shorter than *T*;
- a *long-term forecast*, over a horizon longer than *T* (e.g., the following day, week, or year).
Real-time forecasting can, in turn, follow the one-step-ahead approach: at time *T* + 1, realisations up to time *T* + 1 are used to forecast the value of the variable at time *T* + 2. When the weights attributed to previous realisations decrease over time, the approach is known in the literature as "exponential smoothing" [31].
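The one-step-ahead idea with geometrically decaying weights can be sketched as simple exponential smoothing. The function name and the smoothing parameter `alpha` are illustrative choices, not values from the paper:

```python
# Simple exponential smoothing: one-step-ahead forecasts in which the weight
# given to past observations decays geometrically with their age.
# alpha is an illustrative smoothing parameter, not a value from the paper.

def ses_one_step(y, alpha=0.3):
    """Return forecasts f, where f[t] is the one-step-ahead forecast of y[t].

    f[0] = y[0] (initialisation); f[t] = alpha * y[t-1] + (1 - alpha) * f[t-1].
    """
    f = [y[0]]  # initialise the first forecast with the first observation
    for t in range(1, len(y)):
        # blend the newest observation with the previous forecast
        f.append(alpha * y[t - 1] + (1 - alpha) * f[t - 1])
    return f
```

For example, `ses_one_step([10, 12, 11])` yields forecasts `[10, 10.0, 10.6]`: the third forecast mixes the observation 12 (weight 0.3) with the running forecast 10 (weight 0.7).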

In transit systems, travel time (*TT*) forecasts are used both for operations planning and for monitoring [32–37], and different methods and models have been developed for both the long and the short term. Below, the literature is reviewed, and the pros and cons of each proposed approach are pointed out. Long-term travel time forecasts for bus services [38–40] have been studied through *regression methods* and *time series methods*. Regression methods explain travel time (the dependent variable) through a set of independent variables (e.g., loads, street characteristics). Because nonlinear relationships between independent and dependent variables can exist [38], more complex models were developed, e.g., k-nearest neighbour regression, support vector regression, projection pursuit regression, and artificial neural networks. Although such methods do not give satisfactory results when traffic conditions are not stable, they are widely applied [38–40] because they reveal which independent variables are relevant or most important for reproducing/forecasting travel times.

Time-series-based methods focus on the variable to be predicted (travel time) through the analysis of its historical data [41–44]. Their strength lies in their high calculation speed, due to the simple formulation of the analysis algorithm and the small number of operating variables required: only the bus travel times and the instants at which they were observed. They allow the structure of travel time variability to be highlighted and reveal the effects over time (e.g., hour of the day, day of the week, time of year), which are relevant on links where buses travel alongside other traffic components [5,45].

Turning to studies developed for short-term forecasting, these works can be classified into time series [46,47], regression models [8,48], artificial neural networks (ANNs) [8,49–52], Kalman filters [52–55], and nonparametric regression models (NPRs) [47,56–58]. Time series models, such as autoregressive integrated moving average (ARIMA) models and exponential smoothing models, forecast based on historical values. Large variations in historical data can lead to significant differences between observations and forecasts, because time series models depend on the transferability of historical trends (patterns) to future trends [50]. Regression models predict travel time through linkage with context-specific independent variables (e.g., traffic conditions, road link characteristics). Given that the explanatory variables must be statistically independent of each other, while many of the variables relating to transport systems are highly correlated [52], ANN models have been developed. ANN models can capture complex nonlinear relationships and can also generate better results than other models such as Kalman filter-based methods, smoothing models, historical profiles, and real-time profiles [51]. However, ANN models require longer computation time for the training process than other models.

#### **3. Forecasting Methodology and Data Analysis**

The collected data are related to two lines in the city of Rome (Italy; Lines A and B) and one line (Line C) in the city of Lviv (Ukraine). Table 1 reports the characteristics of the investigated bus service lines (Rome and Lviv). Figure 3 shows the observed travel times during working days (i.e., from Monday to Friday). The travel time for each bus line is analysed in two cases: *to the city centre* and *from the city centre*. According to the time of day, we can observe a similarity in the shape of the data whether travel is to or from the city centre: two peaks are present, one in the morning and one in the afternoon. In particular, for the investigated bus lines, the variance is higher in the afternoon peak hour than in the morning (e.g., for Line A, direction to city centre: 229,800 sec² vs. 160,284 sec²; direction from city centre: 155,034 sec² vs. 110,413 sec²). Moreover, given that the roads are congested during peak hours and less congested during nonpeak hours, and, consequently, travel time is longer in rush hours than in nonpeak hours, time series allow these patterns to be captured. In the following sections, the time-series components of the bus travel time data are studied.


**Table 1.** The investigated bus lines.

ATT: daily average travel time; TL: travel length; N: one-way number of stops; T: observation period. (\*) to the city centre; (#) from the city centre.

**Figure 3.** Analysed bus lines: hourly fluctuation of travel time during working days.

#### **4. Bus Travel Time Analysis and Forecasting**

#### *4.1. Bus Travel Time Analysis*

Given a bus line, the bus travel time (*TT*) from one terminal to the other is assumed to be the sum of the running times (*RT*) between successive stops and the dwell times at stops (*DW*):

$$TT = \sum_{i} RT_i + \sum_{i} DW_i \tag{1}$$

with *RTi* the running time between stop *i* and the successive stop *i*+1, and *DWi* the dwell time spent at stop *i*.
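Equation (1) can be checked with a tiny worked example; the times below are illustrative values in seconds, not data from the paper:

```python
# Travel time as the sum of running times between stops and dwell times at
# stops, following Equation (1). All times are illustrative, in seconds.

running_times = [120, 95, 140]  # RT_i between successive stops
dwell_times = [20, 25, 15]      # DW_i at each stop

travel_time = sum(running_times) + sum(dwell_times)
print(travel_time)  # 415
```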

Running time and dwell time depend on a set of determinants. Running time is a function of speed (related to flow composition and link flow), link characteristics (infrastructural and functional), and context conditions (e.g., weather conditions; [59]), while dwell time depends on on-board flow, alighting and boarding users, and bus features (e.g., number of doors, lift operations). Therefore, the analysis of bus travel time variability has to capture the fluctuations of these determinants. This pattern can be pointed out by time-series analysis.

A time series *Yt* can be composed of three components: a trend-cycle component *Tt*, a seasonal component *St*, and a remainder component *Et* that contains anything else in the time series. Therefore, assuming an additive relation among the three above components, the time series can be expressed as

$$Y_t = f(T_t, S_t, E_t) = T_t + S_t + E_t \tag{2}$$
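The additive decomposition of Equation (2) can be sketched with a minimal classical procedure: a centred moving average for the trend and period-wise means of the detrended series for the seasonal component. This is only an illustration of the idea; the paper itself uses the more robust STL method, and the function name is mine:

```python
# A minimal classical additive decomposition Y_t = T_t + S_t + E_t,
# using a centred moving average as the trend estimate and period-wise
# means of the detrended series as the seasonal component.

def additive_decompose(y, period):
    n = len(y)
    half = period // 2
    # centred moving average as the trend estimate (None where undefined)
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half : t + half + 1]
        trend[t] = sum(window) / len(window)
    # seasonal component: mean detrended value at each position in the period
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(y[t] - trend[t])
    means = [sum(b) / len(b) if b else 0.0 for b in buckets]
    # centre the seasonal component so it sums to zero over one period
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]
    remainder = [y[t] - trend[t] - seasonal[t] if trend[t] is not None else None
                 for t in range(n)]
    return trend, seasonal, remainder
```

On a synthetic series built as a linear trend plus a zero-mean seasonal pattern, the function recovers both components exactly and leaves a zero remainder.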

Once the output of time series decomposition is used for forecasting, forecast accuracy has to be evaluated. An approach to evaluating the accuracy of forecasts relies on the application of the model on new data that were not used in the model calibration [31]. To do this, the available data are divided into two sets, *training* and *test* data. The training data is used to set up the forecasting model (i.e., identification of time-series components) and the test data is used to evaluate its accuracy (e.g., one week ahead). Since the test data is not used in determining the model, it should provide a reliable indication of how well the model is able to forecast by using new data. With this approach, the forecast error (*ei*) is assumed to be the difference between an observed value and its forecast, as follows:

$$e_i = Y_i - \hat{Y}_i \tag{3}$$

where *Yi* denotes the *i*th observation and *Ŷi* denotes its modelled value.

The best-known accuracy measures [31] are the MAE (mean absolute error), RMSE (root mean squared error), and MAPE (mean absolute percentage error). MAE and RMSE are scale-dependent measures and are based on *absolute errors* |*ei*| or *squared errors* *ei*²:

$$\begin{aligned} \text{mean absolute error}: \text{ MAE} &= \text{mean}(|e\_i|) \\ \text{root mean squared error}: \text{ RMSE} &= \sqrt{\text{mean}(e\_i^2)} \end{aligned} \tag{4}$$

On the other hand, MAPE has the advantage of being *scale-independent*:

$$\text{mean absolute percentage error}: \text{ MAPE} = \text{mean}(|p_i|) \tag{5}$$

with *pi* = 100 ·*ei*/*Yi*, which measures the percentage error.
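The three accuracy measures of Equations (4) and (5) can be computed directly; the observed and forecast values below are illustrative, not from the paper:

```python
# MAE, RMSE and MAPE as defined in Equations (4) and (5), computed on
# illustrative observed/forecast travel times (in seconds).

import math

def accuracy(observed, forecast):
    errors = [y - f for y, f in zip(observed, forecast)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    # p_i = 100 * e_i / Y_i, so MAPE is scale-independent
    mape = sum(abs(100 * e / y) for e, y in zip(errors, observed)) / len(errors)
    return mae, rmse, mape

mae, rmse, mape = accuracy([100, 200, 400], [110, 190, 400])
print(mae, rmse, mape)  # approximately 6.67, 8.16, 5.0
```

Note the scale dependence: doubling all observed and forecast values doubles MAE and RMSE but leaves MAPE unchanged.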

#### *4.2. Bus Travel Time Forecasting*

As mentioned above, the data used refer to observed values of three bus lines during the working days of 10 consecutive weeks in Rome and Lviv. The last two weeks were set up as the test set, while the remaining weeks formed the training set. Therefore, the time-series period was set to one week (five working days).

The time series decomposition (Figures 4 and 5) was performed through seasonal and trend decomposition using loess (STL; [60,61]), as implemented in the R software [31]. The results show similarities with patterns revealed in other urban contexts ([32,45,62]); in particular, travel time variability is:

	- quite relevant for different hours of the day; as expected, it is influenced by the variance of traffic flows (i.e., buses share the lanes with other traffic components and bus travel time is influenced);
	- quite different for routes to or from the city centre; higher values were revealed in the morning due to high concentrations of constrained trip arrivals (e.g., systemic trips, such as to work or school); on the other hand, the effects are more spread along the hours in the afternoon.
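The train/test evaluation described above can be sketched as follows: estimate a systematic (seasonal) profile from the training weeks, forecast the test weeks with it, and score with MAPE. The data and function names are mine, and the simple period-position averages stand in for the STL-decomposed systematic components used in the paper:

```python
# A sketch of the evaluation procedure: learn a seasonal profile from the
# training data, forecast the test set with it, and score with MAPE.
# Synthetic data; the paper uses STL components of AVL travel times.

def seasonal_profile(train, period):
    """Average value at each position of the period (e.g., hour of the week)."""
    return [sum(train[i::period]) / len(train[i::period]) for i in range(period)]

def forecast_and_mape(train, test, period):
    # assumes the test series starts at a period boundary
    profile = seasonal_profile(train, period)
    forecasts = [profile[t % period] for t in range(len(test))]
    mape = sum(abs(100 * (y - f) / y) for y, f in zip(test, forecasts)) / len(test)
    return forecasts, mape
```

With a perfectly periodic series the profile reproduces the test set exactly and the MAPE is zero; real travel times leave a remainder, which is what Table 3 quantifies.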

**Figure 4.** Travel time STL decomposition with weekly period (8 weeks): Rome.

**Figure 5.** Travel time STL decomposition with weekly period (8 weeks): Lviv.

The capability to reproduce the observed data is then tested by using trend/cycle (*Tt*) and daily/hourly seasonality (*St*; i.e., time-series systematic components). The modelled error *e* can be assumed to be the remainder *E* (i.e., *e* ≡ *E*).

Figure 7 shows an example of a comparison between observed and modelled travel times for the test set, while Table 3 reports the accuracy for the investigated period. As synthesised by the MAPE (smaller than 8%), the systematic component of travel time (i.e., trend/cycle and seasonality) explains the main part of the variance. This corresponds to an average error in reproducing travel time of less than 3 min (and about 1 min for Line B from the city centre) for the investigated lines. These accuracy metrics have the same magnitude as those found in the literature ([19,27,45,54]).

The results of these analyses refer to the improvement of long-term travel time forecasting, but they can suggest new research opportunities for short-term forecasting, as also suggested by Cristobal et al. [8], who proposed short-term bus travel time methods based on similarity with historical profiles.

**Figure 6.** Seasonal components of the analysed bus lines.

**Table 2.** Variance (σ²) and mean (μ) of the observed travel time (*Y*) and the remainder (*E*).


**Figure 7.** Example of a comparison between observed and forecasted/modelled values.


**Table 3.** Accuracy indicators of a route of Lines A, B and C.

#### *4.3. Discussion*

The comparison of the bus lines operating in two different (both in size and economic structure) cities pointed out a number of major findings. Bus lines in Lviv operate within the historical centre of the city, while in Rome, the study area merges suburbs with the inner-city area.

As regards spatial form, the inner area of Rome is surrounded by radially distributed roads. Lviv is an Eastern European city whose urbanisation shows a strong influence from Western cities. Both study areas are characterised by narrow streets on which buses share the lanes; therefore, their commercial speed is highly influenced by private traffic. In relation to bus travel time, there are no significant differences in patterns, with high values during morning and afternoon peak hours.

An in-depth comparison between Lviv and Rome showed that the travel time pattern is not strictly dependent on the size of the cities but is related to traffic jams (due mainly to lifestyle, e.g., the start of the working day). Lviv presents a moderate perturbation during the morning, as also happens in Rome, where a high concentration of traffic dominates. There is thus a very similar pattern in the time distribution along the day, although Rome sees activity in the early morning and afternoon, and Lviv does not. The same is reflected in departures from the centre in the afternoon.

Finally, the differences between Rome and Lviv are primarily due to city morphology and lifestyle. The Rome study area has a road network with narrow streets, which favours an increase in travel time during the morning peak hour due to high demand, with a high concentration of trips required to arrive on time at the workplace or school. The buses share the lanes, which necessarily lengthens travel time. The second reason is lifestyle and the start of working hours, in particular of offices and the opening times of shops and stores. While Rome has two main peak hours and more specific regulations (in Rome, a limited traffic zone has been implemented), in Lviv, traffic circulation is always allowed in the morning.

According to these first results, the proposed forecasting methods perform well in explaining variance. Of course, the transferability of results in terms of accuracy is not direct, and city-specific surveys are needed. The general conclusion is that a more systematic approach could be used in forecasting travel time, by trying to define more comprehensive methods able to capture the characteristics of traffic while taking into account that some features are city-specific. Results such as those derived in this paper can contribute to more effective and rational management of resources, offering mobility agency technicians an easy-to-apply method for designing or revising timetables and scheduled services.

Further analyses are also in progress to improve these first results through the inclusion of zonal and level-of-service attributes (e.g., traffic volumes, passive and active accessibility) and the characteristics of users who reach these areas for daily activities, as well as alighting and boarding flows. Currently, past travel time has been the only explanatory variable taken into consideration; it does not capture the influence of factors such as weather conditions, road pavement conditions, and pedestrian flows. A time series-based regression model that also includes independent variables capturing the above effects is under development. Finally, the degree of transferability of the models and the obtainable accuracy is a work in progress.

#### **5. Conclusions**

An investigation of bus travel time variability was carried out, and the first results towards an easy-to-apply methodology for the urban areas of Rome and Lviv are presented. The findings can be used by urban mobility agency technicians for designing transit service timetables and vehicle schedules when replanning existing lines. The developed approach focuses on the systematic components of travel time. The results are mainly devoted to bus lines that share road lanes with other traffic components (e.g., cars, freight vehicles, and so on). The analyses were performed through time-series methods, which allowed factors such as trends, seasonal variations, cycles, and irregular components to be pointed out. The time of day as well as the day of the week (e.g., Wednesday vs. Thursday) were found to have significant effects on travel time variability, and, if properly modelled, well-performing forecasts can be obtained. Moreover, the findings can be integrated into a short-term approach, which, traditionally, is not considered. In fact, the development of models for short-term travel time forecasts that consider a variety of factors, such as demand, transport conditions, and weather conditions, is a research challenge. Further development of this study therefore hinges on the availability of further data on bus line operations and loads, data on the other vehicles with which lanes are shared (e.g., floating car data for revealing local traffic patterns), an in-depth analysis of the residuals coming from the time-series decomposition, and an approach to incorporate covariates such as weather data, as this factor might significantly impact bus line travel time. Finally, bus travel time forecasts could then be based on profile similarities through time-series and regression methods.

**Author Contributions:** Conceptualization, A.C.; methodology, A.C.; software, A.P.; validation, A.C. and A.P.; formal analysis, A.C. and A.P.; investigation, A.C. and A.P.; data curation, A.C. and A.P.; writing—original draft preparation, A.C. and A.P.; writing—review and editing, A.C.; visualization, A.C. and A.P.; supervision, A.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding. The APC was funded by MDPI.

**Acknowledgments:** The authors wish to thank M. Zhuk, V. Kovalyshyn and V. Hilevych, for having shared data from Lviv. The authors wish to thank the anonymous reviewers for their suggestions, which were most useful in revising the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Cost Estimating Using a New Learning Curve Theory for Non-Constant Production Rates**

**Dakotah Hogan 1, John Elshaw 2,\*, Clay Koschnick 2, Jonathan Ritschel 2, Adedeji Badiru <sup>3</sup> and Shawn Valentine <sup>4</sup>**


Received: 27 August 2020; Accepted: 9 October 2020; Published: 16 October 2020

**Abstract:** Traditional learning curve theory assumes a constant learning rate regardless of the number of units produced. However, a collection of theoretical and empirical evidence indicates that learning rates decrease as more units are produced in some cases. These diminishing learning rates cause traditional learning curves to underestimate required resources, potentially resulting in cost overruns. A diminishing learning rate model, namely Boone's learning curve, was recently developed to model this phenomenon. This research confirms that Boone's learning curve systematically reduced error in modeling observed learning curves using production data from 169 Department of Defense end-items. However, high amounts of variability in error reduction precluded concluding the degree to which Boone's learning curve reduced error on average. This research further justifies the necessity of a diminishing learning rate forecasting model and assesses a potential solution to model diminishing learning rates.

**Keywords:** learning curve; forecasting; production cost; cost estimating

#### **1. Introduction**

The U.S. Government Accountability Office (GAO) critiqued the cost and schedule performance of the Department of Defense (DoD)'s \$1.7 trillion portfolio of 86 major weapons systems in their 2018 "Weapons System Annual Assessment." The GAO cited realistic cost estimates as a reason for the relatively low cost growth of the portfolio in comparison to earlier portfolios [1]. Congress and its oversight committees maintain a watchful eye on the DoD's complex and expensive weapons system portfolio. Inefficient programs are scrutinized and may be terminated if inefficiencies persist. Funding of inefficient programs will also lead to the underfunding of other programs. In the public sector, these terminated and underfunded programs may result in capability gaps that negatively impact our nation's defense. In the private sector, the inefficient use of resources often spells failure for a company.

A key to the efficient use of resources is accurately estimating the resources required to produce an end-item. Learning curves are a popular method of forecasting required resources as they predict end-item costs using the item's sequential unit number in the production line. Learning curves are especially useful when estimating the required resources for complex products. The most popular learning curve models used in the government sector are over 80 years old and may be outdated in today's technology-rich production environment. Additionally, researchers have demonstrated both theoretically and empirically that the effects of learning slow or cease over time [2–4].

A new model, named Boone's learning curve, has recently been proposed to account for diminishing rates of learning as more units are produced [5]. The purpose of this research is to survey the need for alternative learning curve models and to further examine how Boone's learning curve performs, in comparison to the traditional learning curve theories, in predicting required resources. This research uses a large number of diverse production items to compare Boone's model to the traditional theories of Wright and Crawford. While many other learning curve models exist (e.g., DeJong, Stanford B, Sigmoid), some of them may not be as accurate in cases where the learning rate decreases over time. The next section is a review of the learning curve literature relevant to diminishing learning rates, followed by a description of our methodology and analysis comparing Boone's learning curve to traditional models. We conclude the paper by discussing managerial implications and limitations, followed by recommendations for the way forward.

#### **2. Literature Review and Background**

The two learning curve models cited by the GAO Cost Estimating and Assessment Guide (2009) are Wright's cumulative average learning curve theory developed in 1936 and Crawford's unit learning curve theory developed in 1947. Although both learning curve theories use the same general equation, the theories have contrasting variable definitions. Wright's learning curve is shown in Equation (1):

$$\overline{Y} = A x^b \tag{1}$$

where *Y* is the cumulative average cost of the first *x* units, *A* is the theoretical cost to produce the first unit, *x* is the cumulative number of units produced, and *b* is the natural logarithm of the learning curve slope (LCS) divided by the natural logarithm of two. Note, the LCS is the complement of the percent decrease in cost as the number of units produced doubles. For example, with a learning curve slope of 80% and a first unit cost of 100 labor hours, the average cost of the first two units would be 80 labor hours, or 60 labor hours for the second unit. Regardless of the number of units produced, there is a constant decrease in labor costs with each doubling of units due to the constant learning rate.
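The numerical example above can be verified directly from Equation (1); the function name is mine, but the inputs (LCS = 80%, first unit cost A = 100 labor hours) are the ones in the text:

```python
# Numerical check of Wright's cumulative average theory (Equation (1)):
# LCS = 80%, first unit cost A = 100 labor hours.

import math

def wright_cum_avg(A, lcs, x):
    """Cumulative average cost of the first x units under Wright's theory."""
    b = math.log(lcs) / math.log(2)  # b = ln(LCS) / ln(2)
    return A * x ** b

avg_first_two = wright_cum_avg(100, 0.80, 2)  # cumulative average of units 1-2
# back out the second unit's cost: total of first two units minus unit 1
second_unit = 2 * avg_first_two - wright_cum_avg(100, 0.80, 1)
print(avg_first_two, second_unit)  # approximately 80 and 60 labor hours
```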

Several years following the creation of Wright's cumulative average learning curve theory, J.R. Crawford formulated the unit learning curve theory. Crawford's theory deviates from Wright's by assuming that the individual unit cost (as opposed the cumulative average unit cost) decreases by a constant percentage as the number of units produced doubles. Crawford's model is shown in Equation (2):

$$Y = A x^b \tag{2}$$

where *Y* is the individual cost of unit *x*, *A* is the theoretical cost of the first unit, *x* is the unit number of the unit whose cost is being forecasted, and *b* is the natural logarithm of the LCS divided by the natural logarithm of two. For example, with a learning curve slope of 80% and a first unit cost of 100 labor hours, the cost of the second unit would be 80 labor hours. Note that Crawford's unit theory is similar to Wright's in functional form, but the difference in variable interpretation leads to a different forecast.

Figure 1 below shows a comparison between Wright's and Crawford's theories using the two numerical examples provided. Cumulative average theory and unit theory will produce different predicted costs given the same set of data, despite all predicted costs being normalized to unit costs. Figure 1 demonstrates this point: unit theory was used to generate data with a first unit cost of 100 and a learning curve slope of 90%, and the original unit theory data were converted to cumulative averages in order to estimate cumulative average theory learning curve parameters.

**Figure 1.** Wright's Cumulative Average Theory vs. Crawford's Unit Theory.

Cumulative average theory estimated a learning curve slope of 93% and a first unit cost of 101.24. These cumulative average theory parameters were then used to predict cumulative average costs, which were then converted to unit costs. This conversion allows the cumulative average predictions to be directly compared to the original unit theory generated data. As shown in Figure 1, the cumulative average learning curve predictions first overestimate, then underestimate, and ultimately overestimate the generated unit theory data for all remaining units. Together, Wright's and Crawford's theories form the basis of traditional learning curve theory.
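The experiment behind Figure 1 can be sketched as follows. Unit costs are generated with Crawford's theory (A = 100, LCS = 90%), converted to cumulative averages, and Wright's model is fitted to those averages by least squares in log-log space. The fitted slope comes out shallower than 90%; the exact fitted values (the text reports roughly 93% and 101.24) depend on how many units are generated, and the 30-unit run here is an assumption of mine:

```python
# Generate unit-theory costs, convert to cumulative averages, and fit
# Wright's cumulative average model to them in log-log space.

import math

A, lcs, n = 100.0, 0.90, 30  # n = 30 units is an illustrative assumption
b = math.log(lcs) / math.log(2)
unit_costs = [A * x ** b for x in range(1, n + 1)]

# convert individual unit costs to cumulative average costs
cum_avg, running = [], 0.0
for i, c in enumerate(unit_costs, start=1):
    running += c
    cum_avg.append(running / i)

# ordinary least squares of log(cumulative average) on log(unit number)
xs = [math.log(x) for x in range(1, n + 1)]
ys = [math.log(y) for y in cum_avg]
mx, my = sum(xs) / n, sum(ys) / n
b_hat = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
lcs_hat = 2 ** b_hat  # fitted cumulative-average learning curve slope
print(round(lcs_hat, 3))
```

The fitted cumulative-average slope lies between the generating 90% and 100%, illustrating why the two theories disagree on the same data.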

One assumption of these traditional learning curve theories is that they apply only to processes that may benefit from learning. Typically, these costs are only a subset of total program costs; hence, appropriate costs must be selected when applying learning curve theory to yield viable parameter estimates. In a complex program, costs can be viewed in a variety of ways, including recurring and non-recurring costs, direct and indirect costs, and costs for various activities and combinations of end-items, stated in units of hours or dollars. Learning curve analysis focuses solely on recurring costs in estimating parameters because these costs are incurred repeatedly for each unit produced [6]. Researchers have also focused solely on direct labor costs due to the theoretical underpinnings of learning occurring at the laborer level [2,3]. Additionally, researchers have historically studied end-items that include only the manufactured or assembled hardware and software elements of the end-item [2,3]. Lastly, labor hours in lieu of labor dollars are generally used in analysis so that data can be compared across fiscal years without the need to adjust for inflation. Therefore, the literature indicates using direct, recurring labor costs in units of labor hours, considered only for those elements that involve the manufacturing or assembly of an end-item's hardware and software.

An implicit assumption in the traditional learning curve theories is that knowledge obtained through learning does not depreciate. However, empirical evidence demonstrates that knowledge depreciates in organizations [7,8]. Argote [7] showed that knowledge depreciation occurs at both the individual and the organizational levels. Many variations of the traditional models make use of the concept of performance decay (commonly called forgetting) to model non-constant rates of learning. Forgetting and its relationship to learning can take many forms and is essential to consider in contemporary learning curve analysis.

Forgetting is the concept that an individual or organization will experience a decline in performance over time resulting in non-constant rates of learning. Badiru [4] theorizes that forgetting and resulting performance decay is a result of factors "including lack of training, reduced retention of skills, lapse in performance, extended breaks in practice, and natural forgetting" (p. 287). According to Badiru [4], these factors may be caused by internal processes or external factors. Badiru [4] lists three cases in which forgetting arises. First, forgetting may occur continuously as a worker or organization progresses down the learning curve due in part to natural forgetting [4]. The impact of forgetting may not wholly eclipse the impact of learning but will hamper the learning rate while performance continues to increase at a slower rate. Second, forgetting may occur at distinct and bounded intervals, such as during a scheduled production break [4] or towards the end of production as workers are transferred to other duties. Finally, forgetting may intermittently occur at random times and for stochastic intervals such as during times of employee turnover [4]. Others have expanded on the causes of forgetting and have drawn similar conclusions to Badiru [4,9–11]. This decline in performance decays the learning rate and causes longer manufacturing times and higher costs than would be forecasted using traditional learning curve theory.

The concept of forgetting and its impact on non-constant rates of learning has proven relevant in contemporary learning curve research. Several forgetting models have been developed, including the learn-forget curve model (LFCM) [11], the recency model (RCM) [12], the power integration and diffusion (PID) model [13], and the Depletion-Power-Integration-Latency (DPIL) model [13], among others [10]. However, these forgetting models focus solely on the phenomenon of forgetting due to interruptions of the production process [9,10,14]. Jaber [9] states that "there has been no model developed for industrial settings that considers forgetting as a result of factors other than production breaks" (pp. 30–31) and mentions this as a potential area of future research. Although forgetting models have emerged after Jaber's [9] article, a review of the popular forgetting models cited confirms Jaber's statement.

A related concept to the forgetting phenomenon is the plateauing phenomenon. According to Jaber [9], plateauing occurs when the learning process ceases and manufacturing enters a production steady state. This ceasing of learning results in a flattening or partial flattening of the learning curve, corresponding to rates of learning at or near zero. There remains debate as to when plateauing occurs in the production process, or whether learning ever ceases completely [3,9,15–17]. Jaber [9] provides several explanations for the plateauing phenomenon that include concepts related to forgetting. Baloff [18,19] recognized that plateauing is more likely to occur when capital is used in the production process as opposed to labor. According to some researchers, plateauing can be explained either by having to absorb the efficiencies already learned before making additional improvements along the learning curve, or by forgetting altogether [20]. According to other researchers, plateauing can be caused by labor ceasing to learn or by management's unwillingness to invest in capital to foster induced learning [21]. Related to this underinvestment, management's doubt as to whether learning efficiencies can occur is cited as another hindrance to constant rates of learning [22]. Li and Rajagopalan [23] investigated these explanations and concluded that no empirical evidence supports or contradicts them, while ascribing plateauing to depreciation in knowledge or forgetting. Jaber [9] concludes that "there is no tangible consensus among researchers as to what causes learning curves to plateau" and suggests that this is a topic for future research (pp. 30–39).

Despite the controversy in the research surrounding forgetting and plateauing effects, empirical studies have shown learning curves to exhibit diminishing rates of learning. For instance, the plateauing phenomenon at the tail end of production was investigated by Harold Asher in a 1956 RAND study. The U.S. Air Force contracted RAND after the service noticed traditional learning curves were underestimating labor costs at the tail end of production [3]. Asher set out to test whether the logarithmically transformed traditional learning curves were approximately linear; such linearity would indicate constant rates of learning throughout the production cycle. The alternative hypothesis was a convexity of the logarithmically transformed traditional learning curves, which would indicate diminishing rates of learning as the number of units increased [3]. An example of a learning curve with a diminishing learning rate is shown in Figure 2 in logarithmic scale. The first unit cost is 100 with an initial learning curve slope of 80% decaying at a rate of 0.25% with each additional unit. For example, the second unit's learning curve slope is 80.25%.
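Read literally, the quoted 0.25% decay adds 0.25 percentage points to the slope with each additional unit, which is consistent with the 80.25% slope stated for unit two. Under that reading (our interpretation; the function name is illustrative, not from the original study), the decaying-slope curve of Figure 2 can be sketched as:

```python
import math

def decayed_unit_cost(A, s0, decay_pp, n):
    """Unit-theory cost of unit n when the learning curve slope grows by
    `decay_pp` (in slope units, e.g. 0.0025 = 0.25 percentage points)
    with each additional unit; unit 1 uses slope s0."""
    s = min(s0 + decay_pp * (n - 1), 1.0)  # a slope above 100% makes no sense
    b = math.log(s) / math.log(2)          # learning curve exponent for unit n
    return A * n ** b

# Figure 2's stated parameters: first unit cost 100, 80% initial slope,
# 0.25% decay per unit (so unit 2 is priced on an 80.25% slope)
first = decayed_unit_cost(100, 0.80, 0.0025, 1)
second = decayed_unit_cost(100, 0.80, 0.0025, 2)
```

Because 2 raised to log₂(0.8025) is simply 0.8025, the sketch prices unit two at 80.25 labor hours, matching the text.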

**Figure 2.** Unit Theory learning curve with a Decaying Learning Curve Slope.

Asher investigated this hypothesis of convex logarithmically transformed learning curves by analyzing the learning curves of the various shops within a manufacturing department producing aircraft. Asher used airframe cost data with the appropriate amount of detail to perform a learning curve analysis on the lower-level job shops within the manufacturing department. He divided the eleven major kinds of aircraft manufacturing operations into four shop groups, each with its own set of direct labor cost data [3]. If non-constant rates of learning were present, the shop group curves would differ in their rates of learning and might themselves be convex in logarithmic scale, indicating that their aggregate learning curve would also be convex in logarithmic scale.

Asher's results showed that the learning curves of the manufacturing shop groups had different learning slopes and were convex in logarithmic scale [3]. Asher attributed the convexity within the shop group learning curves to the disparate operations within the job shops, stating that each had its own unique learning curve [3]. He asserted that a linear approximation is reasonable for a relatively small quantity of airframes produced but becomes increasingly unwarranted for larger quantities, in part because larger quantities of produced end-items are likely to experience diminishing rates of learning. Moreover, highly aggregated learning curves are also likely to experience diminishing rates of learning. Because the aggregated manufacturing cost curve is usually the lowest level of detail at which learning curve analysis is performed, the manufacturing cost curve will exhibit diminishing rates of learning as cumulative output increases. These results further justify a learning curve model with diminishing rates of learning.

Wright's and Crawford's learning curve theories provided the basis for the traditional approach, under which learning occurs at a constant rate as the number of units produced increases. Since this initial discovery, several log-linear learning curve models have been developed in attempts to more accurately model data from manufacturing processes. These contemporary models diverge from constant rates of learning by including adjustments in various forms. The six most popular models (including the traditional model) are shown in Figure 3 in logarithmic scale, with log-log gridlines to more clearly illustrate the differences between models. The illustrated models are the traditional log-linear model (the Wright/Crawford curves), the plateau model [19], the Stanford-B model [24], the De Jong model [25], the S-curve model [21], and Knecht's upturn model [26].

**Figure 3.** Comparison of Learning Curve Models (adapted from Badiru [27]).

Recent studies have investigated whether the Stanford-B, De Jong, and S-Curve models more accurately predict program costs in comparison to the traditional theories. Moore [16] and Honious [17] studied how prior experience in the manufacturing of an end-item along with the proportion of touch labor in the manufacturing process affected the accuracy of the Stanford-B, De Jong, and S-curve models in comparison to the traditional models. The authors concluded that these models improved upon the traditional curves for only a narrow range of parameter values. Their research provided insight that the traditional learning curve models become less accurate at the tail-end of production when the proportion of human labor is high in the manufacturing process. Moreover, Honious [17] explicitly references a plateauing effect at the end of production. These findings provide further justification for investigating non-constant rates of learning.

The Stanford-B, De Jong, and S-Curve univariate models illustrated in Figure 3 alter the resulting learning curve slope through adjustments involving the theoretical first unit cost parameter *A*. However, the learning curve slopes of these models are not directly a function of the number of cumulative units produced. The plateau model and Knecht's upturn model, also illustrated in Figure 3, each produce a learning curve whose slope is directly affected by the number of cumulative units produced. The plateau model uses a step function to reduce the learning rate to 0% (i.e., a learning curve slope of 100%) past a certain number of cumulative units produced. In contrast, Knecht's upturn model multiplies the traditional learning curve by Euler's number *e* raised to the product of a constant and the number of cumulative units produced. Mathematically, this is expressed as *Y* = *Ax*<sup>*b*</sup>·*e*<sup>*cx*</sup>, where *Y* is the cumulative average unit cost, *A* is the theoretical first unit cost, *x* is the number of cumulative units produced, *b* is the natural logarithm of the learning curve slope divided by the natural logarithm of 2, and *c* is a constant. The forgetting models cited above also amend the learning curve slope based indirectly on the number of cumulative units, but they apply only when interruptions to the production process occur.
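Assuming the multiplicative reading of Knecht's model, *Y* = *Ax*<sup>*b*</sup>·*e*<sup>*cx*</sup>, a minimal sketch (function names are ours) shows that the model collapses to the traditional curve when *c* = 0 and turns upward for positive *c*:

```python
import math

def wright(A, b, x):
    """Traditional cumulative average cost: Y = A * x^b."""
    return A * x ** b

def knecht(A, b, c, x):
    """Knecht's upturn model: Y = A * x^b * e^(c*x).
    A positive c bends the curve upward at large cumulative quantities."""
    return A * x ** b * math.exp(c * x)

b = math.log(0.8) / math.log(2)   # exponent for an 80% learning curve slope
baseline = wright(100, b, 500)
upturned = knecht(100, b, 0.001, 500)
```

With *c* = 0 the exponential factor is exactly 1, so the two functions agree for every *x*; the upturn only appears once *cx* becomes non-negligible.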

In response to these researchers' findings, Boone [5] developed a learning curve model with a learning rate that diminishes as more units are produced. Under the traditional learning curve theories, by contrast, costs fall by a constant percentage with each doubling of the number of units produced. However, the existing literature provides evidence that the cost reductions achieved with each doubling may not remain constant as the number of units produced increases. Therefore, Boone [5] sought to attenuate the cost reductions that occur with each doubling of units produced by decreasing the learning rate as the number of units increases.

Boone [5] devised a model that decreases the learning curve exponent *b* as the number of units produced *x* increases. He first considered reducing the exponent *b* directly by the unit number, without an additional parameter. However, he decided to temper the effect each additional unit has on *b* by adding a parameter *c*. The resulting learning curve is shown in Equation (3):

$$
\overline{Y} = A x^{\frac{b}{1 + \frac{x}{c}}} \tag{3}
$$

where *Y* is the cumulative average cost of the first *x* units, *A* is the theoretical cost to produce the first unit, *x* is the cumulative number of units produced, *b* is the natural logarithm of the learning curve slope (LCS) divided by the natural logarithm of two, and *c* is a positive decay value. For example, with a learning curve slope of 80%, a first unit cost of 100 labor hours, and a decay value of 100, Boone's model yields a cumulative average cost at the second unit of 80.35 labor hours, which implies 60.70 labor hours for the second unit itself. What began as an 80% learning curve has decayed to an 80.35% learning curve at the second unit. In comparison to Wright's learning curve with the same parameters, the effect of learning has decreased slightly in the production of unit two. The inclusion of the decay value increases the learning curve slope, and hence decreases the learning rate, as more units are produced. Note that Boone's model can also be modified to incorporate Crawford's unit theory, in which case Equation (3) gives the unit cost rather than the cumulative average cost.
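To make the worked example concrete, a minimal sketch (the function name is ours) evaluates Equation (3) at the stated parameters and at a very large decay value:

```python
import math

def boone_cum_avg(A, lcs, c, x):
    """Equation (3): Y-bar = A * x^(b / (1 + x/c)), with b = ln(LCS)/ln(2),
    A the theoretical first unit cost, and c the positive decay value."""
    b = math.log(lcs) / math.log(2)
    return A * x ** (b / (1 + x / c))

# worked example: 80% slope, first unit cost 100 labor hours, decay value 100
cum_avg_2 = boone_cum_avg(100, 0.80, 100, 2)               # ~80.35 labor hours
unit_2 = 2 * cum_avg_2 - boone_cum_avg(100, 0.80, 100, 1)  # ~60.70 labor hours

# as the decay value grows without bound, the model collapses to Wright's curve
wright_limit = boone_cum_avg(100, 0.80, 1e12, 2)           # ~80.00
```

The second unit's cost is recovered as the difference between cumulative totals (2·Y̅₂ − 1·Y̅₁), and the large-*c* limit reproduces the behavior described below for the traditional curve.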

Boone's learning curve diverges from the constant learning assumptions in both Wright's and Crawford's learning curve models by incorporating the unit number in the denominator of the exponent—thus decreasing the effect of *b* as the number of units produced increases. Furthermore, the decay value moderates this diminishing effect, so the amount of learning decreases more slowly. In general, Boone's model is flatter near the end of production and steeper in the early stages compared to the traditional theories. Note, as the decay value approaches zero (holding other factors constant), the exponent term approaches zero representing a learning curve slope approaching 100%. As the decay value approaches infinity, the parameter *b* remains constant, and Boone's learning curve simplifies to the traditional learning curve [5].

Boone [5] tested his learning curve using unit theory to provide a consistent comparison to Crawford's learning curve. Given the limited scope of his research and the lack of a comparison using cumulative average theory, a more robust examination and analysis of Boone's learning curve is warranted.

#### **3. Methodology**

One goal of this research is to examine the accuracy of Boone's learning curve in comparison to the popular Wright and Crawford learning curve theories. In order to perform this analysis, production cost and quantity data from a diverse set of DoD systems were collected from government Functional Cost-Hour Reports, Progress Curve Reports, and the Air Force Life Cycle Management Center Cost Research Library. The dataset consisted of recurring costs (either in dollars or labor hours) by production lot for 169 unique end-items. Our data included end-items from a variety of systems (i.e., bomber, cargo, and fighter aircraft, missiles, and munitions), contractors, and time periods (1957–2018). Additionally, only production runs with at least four lots were included. The dataset for the Cumulative Average Theory analysis includes only 140 of the 169 end-items; this theory relies on continuous data because each lot's cumulative average cost and cumulative quantity is a function of all previous lots' costs and quantities. In order to compare Boone's model to the traditional theories, each model will be fitted to the data: (1) Boone's and Wright's models using cumulative average theory, and (2) Boone's and Crawford's models using unit theory. Then, the predicted values for each model will be compared to the actual costs using root mean squared error (RMSE) and mean absolute percentage error (MAPE).

Labor costs were collected from the work breakdown structure (WBS) for the specific item being manufactured (e.g., aircraft frame) or from the documentation provided by the government. Our data included three broad functional cost categories: labor, material, and other. Each category is reported in both recurring and non-recurring form. Four functional labor categories are also delineated: manufacturing, tooling, engineering, and quality control labor. These four labor category costs, when summed with the material and other costs, comprise the total cost of each WBS element for recurring and non-recurring costs.

Of these, the manufacturing labor cost category aligns most clearly with the extant literature and is therefore the pertinent labor cost category for learning curve research. According to the WBS elements, the manufacturing labor category "includes the effort and costs expended in the fabrication, assembly, integration, and functional testing of a product or end item. It involves all the processes necessary to convert raw materials into finished items" [28]. This manufacturing labor category aligns with the categories examined by Wright, which he called "assembly operations" [2], along with the cost categories Crawford studied, which he called "airframe-manufacturing processes" [3]. Therefore, the manufacturing labor cost category as defined by the government is associated with the types of labor costs studied by traditional learning curve theorists and succeeding research.

The learning curve parameters for each model (i.e., Equations (1)–(3)) will be estimated by minimizing the sum of squares error (SSE) using Excel's generalized reduced gradient (GRG) nonlinear solver and evolutionary solver. The SSE is calculated by squaring the vertical difference of the observed data and predicted data for each lot and summing these squared differences across all lots.

With lot data, cumulative theory models can be estimated directly. Conversely, when utilizing unit learning curve theory, Crawford's and Boone's models are estimated using an iterative process based on lot midpoints, adapted from Hu and Smith [29]. The algebraic lot midpoint is defined as "the theoretical unit whose cost is equal to the average unit cost for that lot on the learning curve" [6]. The lot midpoint supplants using sequential unit numbers when using lot cost data.

Lot midpoints and model parameters are calculated iteratively due to the lack of a closed-form solution for the lot midpoint. First, an initial lot midpoint (for each lot) is determined using a parameter-free approximation formula [6]—see Equation (4):

$$\text{Lot Midpoint}(\text{LMP}) = \frac{F + L + 2\sqrt{FL}}{4} \tag{4}$$

where *F* is the first unit number in a lot and *L* is the last unit number in a lot. These lot midpoint estimates are then used to estimate the learning curve parameters for Crawford's model (Equation (2)) using the GRG non-linear optimization algorithm. Next, using the estimated parameter *b*, a new set of lot midpoints are determined using a simple and popular formula—Asher's Approximation [6]; see Equation (5):

$$\text{Lot Midpoint} \approx \left[ \frac{\left(L + \frac{1}{2}\right)^{b+1} - \left(F - \frac{1}{2}\right)^{b+1}}{\left(L - F + 1\right)\left(b + 1\right)} \right]^{\frac{1}{b}} \tag{5}$$

where *F* is the first unit number in a lot, *L* is the last unit number in a lot, and *b* is the estimated value from Equation (2). Learning curve parameters will then be re-estimated using these more precise lot midpoint estimates. The iterative process is repeated until changes between successive values of the estimated lot midpoints and *b* are sufficiently small [29] (see Appendix A for a summary of this process). In order to use an iterative process for Boone's model, Asher's Approximation from Equation (5) was adapted to incorporate Boone's decaying learning curve slope. This adaptation allows the lot costs of Boone's learning curve to decrease as more units are produced which affects the lot midpoint estimates; the formula is shown in Equation (6):

$$\text{Lot Midpoint}_{i} \approx \left[ \frac{\left(L + \frac{1}{2}\right)^{b'+1} - \left(F - \frac{1}{2}\right)^{b'+1}}{(L - F + 1)(b' + 1)} \right]^{\frac{1}{b'}}\tag{6}$$

where *F* is the first unit number in a lot, *L* is the last unit number in a lot, *b*′ = *b*/(1 + LMP<sub>*i*−1</sub>/*c*), and *i* is the iteration number.
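The three midpoint formulas, Equations (4) through (6), can be sketched as follows (function names are ours, and the 10-unit first lot is an invented example, not from the dataset):

```python
import math

def lmp_initial(F, L):
    """Equation (4): parameter-free first guess of a lot's midpoint."""
    return (F + L + 2 * math.sqrt(F * L)) / 4

def lmp_asher(F, L, b):
    """Equation (5): Asher's Approximation, given the estimated exponent b."""
    num = (L + 0.5) ** (b + 1) - (F - 0.5) ** (b + 1)
    den = (L - F + 1) * (b + 1)
    return (num / den) ** (1 / b)

def lmp_boone(F, L, b, c, lmp_prev):
    """Equation (6): Asher's Approximation with Boone's decayed exponent
    b' = b / (1 + lmp_prev / c), using the previous iteration's midpoint."""
    b_prime = b / (1 + lmp_prev / c)
    return lmp_asher(F, L, b_prime)

b = math.log(0.8) / math.log(2)  # 80% learning curve slope
guess = lmp_initial(1, 10)       # parameter-free estimate for a 10-unit lot
refined = lmp_asher(1, 10, b)    # refined estimate once b is available
```

For a huge decay value, Equation (6) reduces to Equation (5), mirroring the way Boone's curve collapses to the traditional curve.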

This iterative process of calculating the lot midpoints and then solving a non-linear least squares problem requires the execution of a series of non-linear optimization algorithms. Boone's model requires the GRG algorithm, which found solutions in a longer but still reasonable amount of time; it is more burdensome than the traditional models because of the longer run time and the requirement to provide bounds for the parameters. For Boone's model, the bounds for *A* and *b* have a fairly straightforward basis. In practice, the *A* parameter is often supported by a point estimate of the cost of the first theoretical unit, so a bound can be built around this value with tools such as a confidence interval. The *b* parameter is defined by the learning curve slope, which for all practical purposes lies in the (0, 1) interval, most likely on the higher end. The basis for bounding the *c* parameter is more of a challenge. From a model implementation standpoint, the bound can be arbitrarily large if a long solve time is not limiting; practically, the bound should be set to a reasonable value, and this aspect of the model is an avenue of future research discussed in the conclusion. The GRG algorithm does allow the analyst to define stopping conditions such as a convergence threshold, a maximum number of iterations, or a maximum amount of time. Additionally, there is a multi-start option that uses multiple initial solutions to help locate a global solution rather than possibly only a local one. These options allow the user to mitigate the extra burden if necessary. Overall, the computing burden to calculate these models was on the order of minutes per weapon system.
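Excel's GRG and evolutionary solvers have no direct analogue outside Excel, so as a rough, hedged illustration of bounded SSE minimization (a stand-in, not the authors' implementation), a coarse grid search over the required parameter bounds can recover Wright's parameters from synthetic data:

```python
import math

def sse(A, b, lots):
    """Sum of squared vertical errors of Y = A * x^b against (x, y) lot data."""
    return sum((y - A * x ** b) ** 2 for x, y in lots)

# synthetic cumulative-average data generated from a known 80% curve
b_true = math.log(0.8) / math.log(2)
lots = [(x, 100 * x ** b_true) for x in (10, 25, 60, 120)]

# bounded search over A in [50, 150] and b in [-1, 0]; the exhaustive grid
# crudely mimics GRG's need for explicit bounds plus the multi-start option
best = min(
    ((sse(A, b / 100, lots), A, b / 100)
     for A in range(50, 151)
     for b in range(-100, 1)),
    key=lambda t: t[0],
)
_, A_hat, b_hat = best
```

A gradient-based bounded optimizer (e.g., a least-squares routine with box constraints) would replace the grid in any serious implementation; the point here is only that the bounds define the search space.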

The final estimated parameters for Boone's model and the traditional learning curves were used to create predicted learning curves, which were then compared to observed data. Total model error was calculated by comparing the difference between observations and predicted values to understand how accurately the models explained variability in the data. Two measures were used to determine the overall model error. The first error measure was root mean square error (RMSE), calculated by taking the square root of the total SSE divided by the number of lots. RMSE is not robust to outliers; that is, the effects of outliers may unduly influence this measure. RMSE is often interpreted as the average amount of model error, stated in the model's original units.

The second measure was mean absolute percentage error (MAPE). MAPE is calculated by subtracting the predicted value from the observed value, dividing this difference by the observed value, taking the absolute value, and multiplying by 100%. These absolute percent errors are then summed over all observations and divided by the total number of observations. MAPE provides a unit-less measure of accuracy and is interpreted as the average percent of model inaccuracy. Unlike RMSE, MAPE is robust to outliers.
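The two error measures, as defined above, can be computed as follows (the four-lot actual and predicted series are invented for illustration):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(SSE / number of lots)."""
    total_sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total_sse / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) * 100 / len(actual)

actual = [100.0, 80.0, 70.0, 64.0]     # observed lot costs (illustrative)
predicted = [102.0, 78.0, 70.0, 66.0]  # model predictions (illustrative)
```

RMSE stays in the data's original units (dollars or hours), while MAPE is unitless, which is what allows end-items of very different cost magnitudes to be summarized together.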

After calculating these measures of overall model error, a series of paired difference *t*-tests are conducted to determine if reductions in error from Boone's learning curve are statistically significant. In order to conduct the first paired difference *t*-test, Boone's learning curve RMSE using cumulative average theory will be subtracted from Wright's learning curve RMSE, and the difference will be divided by Wright's learning curve RMSE. This calculation yields a percentage difference rather than a raw difference so that end-items of varying magnitudes can be compared equitably. The null hypothesis posits that Boone's learning curve results in an equal amount (or more) of error in predicting observed values compared to Wright's learning curve. The alternative hypothesis is that the percentage difference is greater than zero; support for the alternative hypothesis signifies that Boone's learning curve results in less error predicting observed values than Wright's learning curve. This methodology will be repeated five times to examine each learning curve theory using the two error measures and the different units of production costs—see Table 1.
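A minimal sketch of the percentage differencing and the *t* statistic it feeds (the RMSE values are invented; a real test would also derive the one-sided *p*-value from the *t* distribution with *n* − 1 degrees of freedom):

```python
import math

def pct_differences(base_errors, boone_errors):
    """Per-item percentage reduction: (base - boone) / base * 100."""
    return [(b0 - b1) / b0 * 100 for b0, b1 in zip(base_errors, boone_errors)]

def paired_t(diffs):
    """One-sample t statistic for H0: mean percentage difference <= 0."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# illustrative per-end-item RMSEs for the two models
wright_rmse = [10.0, 8.0, 12.0, 9.0, 11.0]
boone_rmse = [9.0, 8.0, 10.5, 8.8, 10.0]
t_stat = paired_t(pct_differences(wright_rmse, boone_rmse))
```

A positive *t* statistic points in the direction of the alternative hypothesis (Boone's model having less error); significance then depends on the resulting *p*-value.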


An assumption to utilize the paired difference *t*-test is that the data are approximately normally distributed. For hypothesis tests with large sample sizes, the central limit theorem can be invoked. Alternatively, a Shapiro–Wilk test will be used to evaluate the normality assumption for small samples. If the Shapiro–Wilk test does not support the normality assumption, the non-parametric Wilcoxon signed-rank test will be used. A 0.05 level of significance will be used for all statistical tests.

#### **4. Analysis & Results**

The detailed results for Wright's and Boone's learning curves using cumulative average theory are provided in Appendix B Tables A1 and A2. A total of 118 end-items in units of total dollars and 22 components in units of labor hours were analyzed. Each entry lists the program number, number of production lots, number of items produced, type of end-item, and units of the production costs. Additionally, each entry lists both error measures and the respective percent difference between the models. Positive (negative) differences indicate Boone's model has less (more) error than Wright's.

Boone's curve can perform better for two reasons. First, with its extra parameter, Boone's model can explain costs to at least the same degree of accuracy as the traditional learning curve theories. Second, increased accuracy could also be explained by Boone's functional form. Despite these theoretical explanations, Boone's model had more error than Wright's for some observations; these negative percentage differences occur because an upper bound was placed on Boone's decay value. An upper bound of 5000 was used for the decay value (the same as in Boone's original paper); this bound was necessary because the GRG algorithm requires bounds on the estimated parameters. The practical effect of this particular bound can be observed in the number of end-items where the traditional models significantly outperformed Boone's (i.e., a MAPE difference larger than 0.5%): 7 out of 140 for cumulative average theory and 15 out of 169 for unit theory. Thus, the majority of the results were not affected by this artificial limitation, which was chosen by trial and error; in practice, the bound could be set arbitrarily large so that it does not bind Boone's learning curve.

Some percentage error differences are approximately (but not exactly) zero. Observations with percentage error differences of approximately zero were defined as those within the bounds (−0.25%, 0.25%). These bounds were used by the researchers to distinguish between observations with approximately zero and non-zero percentage error differences in order to inform the descriptive statistics.
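This classification rule can be sketched as follows (the tolerance mirrors the stated ±0.25% bounds; the sample differences are illustrative):

```python
def classify(pct_diff, tol=0.25):
    """Bucket a percentage error difference: values inside (-tol, tol) count
    as 'approximately equal'; positive means Boone's model had less error."""
    if -tol < pct_diff < tol:
        return "equal"
    return "less error" if pct_diff >= tol else "more error"

# illustrative percentage error differences for six end-items
diffs = [0.0, 0.1, -0.2, 5.3, -1.8, 0.25]
counts = {k: sum(1 for d in diffs if classify(d) == k)
          for k in ("less error", "equal", "more error")}
```

These counts are exactly the kind of tallies summarized in the percentages reported below.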

Boone's model had less error for 41% of observations, was approximately equal to Wright's for 50% of observations, and had more error for 9% of observations. While Boone's model is an improvement on Wright's for some observations, the models often fit the data equally well (i.e., an approximately zero difference).

The results of the paired difference *t*-tests for cumulative average theory are shown in Table 2 and a sample graph is shown in Figure 4. No outliers, as defined by a value which fell more than three interquartile ranges from the upper 90% and lower 10% quantiles, were present in any of the tests.


**Table 2.** Cumulative Average Theory Descriptive and Inferential Statistics.

**Figure 4.** Comparison of Program 20 PME Air Vehicle.

The results of these hypothesis tests were mixed. For the RMSE percentage difference (measured in total dollars) and the MAPE percentage difference, the paired difference *t*-tests led to rejection of the null hypothesis, indicating the increase in accuracy is statistically significant. Conversely, the RMSE percentage difference (measured in hours) failed to reject the null hypothesis. Due to the small sample size, large sample theory could not be invoked, and the data failed a Shapiro–Wilk test (*p*-value = 0.721); therefore, a Wilcoxon signed-rank test was used. This indicates that Boone's improvement in accuracy over Wright's is not statistically significant when costs are measured in labor hours. However, small sample sizes can cause paired difference tests to have low power, which may cause hypothesis tests to incorrectly fail to reject the null hypothesis [30].

Now considering unit theory, the results from Crawford's and Boone's learning curve models are presented in Appendix B. A total of 141 end-items (measured in total dollars) and 28 end-items (measured in labor hours) were analyzed.

Similar to cumulative average theory, observations with percent error differences of approximately zero were defined as those within the bounds (−0.25%, 0.25%). Boone's model had less error than Crawford's learning curve for 43% of observations across all percent difference error measures, was approximately equal for 52% of observations, and had more error for 5% of observations.

The results of the paired difference testing for unit theory are provided in Table 3 and a sample graph is shown in Figure 5. Again, no outliers were present in any of the paired difference *t*-tests.


**Table 3.** Unit Theory Descriptive and Inferential Statistics.

**Figure 5.** Comparison of Program 1 PME Air Vehicle.

The results of these paired difference tests indicate the improvement with Boone's model is statistically significant. Again, the RMSE percentage difference (for labor hours) used a Wilcoxon signed-rank test (due to the failure of the Shapiro–Wilk test with a *p*-value less than 0.001).

#### **5. Conclusions**

A large, diverse dataset of DoD production programs was used to test whether Boone's learning curve explained costs more accurately than the traditional learning curve theories. The direct recurring cost data from bomber, cargo, and fighter aircraft along with missile and munitions programs, in units of total dollars and labor hours, were analyzed using Cumulative Average and Unit Learning Curve theories. Various components of these programs were analyzed, from wings and data link systems to airframes and air vehicles. Boone's learning curve was tested against both cumulative average and unit learning curve theories using two different measures of model error, resulting in six paired difference tests. This methodology yielded 998 total observations across all measures and ensured the generalizability of Boone's learning curve was tested.

Boone's learning curve improved upon the traditional learning curve estimates for approximately 42% of the sampled program components while approximately equaling the traditional learning curve error for approximately 51% of program components. Boone's learning curve resulted in a range of mean percentage difference reductions of 6% to 18.6% across all measures. The standard deviations of these improvements were high with coefficients of variation ranging from 150% to 247% across all measures. Absent additional analysis, these high amounts of variability make it challenging to conclude the degree to which Boone's learning curve will improve the accuracy of explaining program component costs in comparison to the traditional estimation methods. Specifically, more research is needed to understand the shape of the learning curve and how it behaves related to production circumstances. It remains unclear which programs are more accurately modelled using Boone's learning curve and to what degree Boone's learning curve will more accurately model program component costs.

The paired difference tests between Boone's learning curve and the traditional theories indicate that Boone's learning curve reduces error to a significant degree across a wide range of measures. Five of the six paired difference tests resulted in rejecting the null hypothesis that Boone's learning curve had an equal amount or more error than the traditional theories at a significance level of 0.05.

Due to data availability, program lot data was used instead of unitary data. Although Boone's learning curve should perform just as well using either type of data, this research cannot conclusively state that Boone's learning curve will more accurately explain programs in unitary data. Also, the majority of data utilized were end-item components in units of total dollars. The total dollar cost includes all cost categories rather than solely labor costs. These data are not ideal when applying learning curve theory and may bias learning curves to display diminishing rates of learning. Despite these potential issues, total dollar cost data are regularly utilized by cost estimators in the field due to data availability. Therefore, the practical applications of this analysis remain valid despite the limitations of using imperfect total dollar cost data in learning curve analysis.

Boone's learning curve was tested on programs whose lot costs were already known and whose parameters can be directly estimated. In other words, Boone's learning curve was tested against the traditional theories on how well it explained, rather than predicted, program costs. In order to utilize Boone's learning curve to predict costs, a decay value would have to be selected a priori. Similar to the learning curve slope, an analyst could use the decay values from similar programs to provide a range of values with which to make predictions. Additionally, future research should investigate whether Boone's decay value can be predicted using various attributes of a program. Tests could be performed on how well Boone's learning curve predicts costs for a program using analogous programs in comparison to the traditional theories. Lastly, additional labor hour data should be collected and analyzed in order to dispel the potential bias of learning curves displaying diminishing rates of learning when analyzed in units of total dollars.

**Author Contributions:** Conceptualization, D.H., J.E., C.K. and J.R.; methodology, D.H., J.E., C.K., J.R. and A.B.; software, D.H. and S.V.; validation, D.H., J.E., C.K. and J.R.; formal analysis, D.H., J.E., C.K. and J.R.; investigation, D.H., J.E., C.K., J.R. and S.V.; resources, D.H., A.B. and S.V.; data curation, D.H., J.E. and S.V.; writing—original draft preparation, D.H., J.E. and C.K.; writing—review and editing D.H., J.E., C.K., J.R. and A.B.; visualization, D.H. and A.B.; supervision, J.E., C.K., J.R. and A.B.; project administration, D.H., J.E., C.K. and J.R.; funding acquisition, J.E., J.R. and A.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Calculation Process for Lot Midpoint Estimation**

The following process was implemented to estimate parameters for lot midpoint estimation:

1. Collect recurring cost and quantity data by production lot.
2. Compute an initial midpoint for each lot using the parameter-free approximation in Equation (4).
3. Estimate the learning curve parameters by minimizing the SSE of the model evaluated at the lot midpoints.
4. Recompute the lot midpoints using Asher's Approximation in Equation (5) (or its adaptation in Equation (6) for Boone's model) with the estimated parameters.
5. Re-estimate the learning curve parameters by minimizing the SSE using the updated lot midpoints.

Steps 4 and 5 were repeated until the iterative process converged on a solution to produce final estimates of Crawford's learning curve parameters and lot midpoint approximations.

#### **Appendix B. Learning Curve Error Comparisons Using Cumulative Average and Unit Theories**


**Table A1.** Error Comparison using Cumulative Average Theory.







**Table A2.** Error Comparison using Unit Theory.


**Table A2.** *Cont.*


| Program | Number of Lots | Number of Units | Component | Estimated Units | Traditional RMSE | Boone RMSE | RMSE Percentage Difference | Traditional MAPE | Boone MAPE | MAPE Percentage Difference |
|---|---|---|---|---|---|---|---|---|---|---|
| Program 18 | 11 | 83 | PME–Air Vehicle | Dollars | 82,138.6 | 82,143.3 | 0.0% | 23.2% | 23.2% | 0.0% |
| Program 19 | 5 | 45 | Airframe | Dollars | 501.2 | 501.2 | 0.0% | 53.9% | 53.9% | 0.0% |
| Program 19 | 5 | 45 | PME–Air Vehicle | Dollars | 649.0 | 649.0 | 0.0% | 17.6% | 17.6% | 0.0% |
| Program 19 | 5 | 45 | Mission Computer (1) | Dollars | 61.7 | 59.7 | 3.2% | 9.8% | 9.7% | 1.2% |
| Program 20 | 9 | 76 | PME–Air Vehicle | Dollars | 1108.7 | 522.5 | 52.9% | 7.2% | 3.6% | 49.9% |
| Program 21 | 5 | 50 | PME–Air Vehicle | Dollars | 24,625.3 | 6362.0 | 74.2% | 7.4% | 2.3% | 69.5% |
| Program 22 | 9 | 31 | PME–Air Vehicle | Dollars | 16,636.3 | 16,636.4 | 0.0% | 6.6% | 6.6% | 0.0% |
| Program 23 | 5 | 14 | PME–Air Vehicle | Dollars | 14,475.8 | 14,476.0 | 0.0% | 8.7% | 8.7% | 0.0% |
| Program 24 | 6 | 98 | PME–Air Vehicle | Dollars | 2259.9 | 2260.1 | 0.0% | 3.3% | 3.3% | 0.0% |
| Program 25 | 7 | 59 | Electronic Warfare (5) | Dollars | 2808.4 | 2805.2 | 0.1% | 14.8% | 15.4% | −4.0% |
| Program 25 | 11 | 84 | PME–Air Vehicle | Dollars | 5083.2 | 4228.8 | 16.8% | 8.7% | 9.2% | −5.2% |
| Program 25 | 11 | 84 | Electronic Warfare (2) | Dollars | 248.9 | 248.6 | 0.1% | 13.9% | 14.3% | −2.9% |
| Program 25 | 7 | 59 | Electronic Warfare (1) | Dollars | 1259.1 | 653.3 | 48.1% | 16.1% | 7.1% | 55.6% |
| Program 26 | 7 | 344 | Airframe | Dollars | 11,474.7 | 8294.9 | 27.7% | 22.7% | 21.5% | 5.3% |
| Program 26 | 7 | 344 | Avionics | Dollars | 2218.8 | 2102.8 | 5.2% | 29.5% | 26.9% | 8.8% |
| Program 26 | 7 | 344 | PME–Air Vehicle | Dollars | 12,898.4 | 8742.1 | 32.2% | 20.7% | 16.9% | 18.4% |
| Program 27 | 14 | 453 | PME–Air Vehicle | Hours | 54,142.9 | 53,766.4 | 0.7% | 59.9% | 63.1% | −5.4% |
| Program 27 | 14 | 453 | Airframe | Hours | 70,415.0 | 69,426.8 | 1.4% | 58.8% | 59.1% | −0.5% |
| Program 28 | 8 | 538 | PME–Air Vehicle | Hours | 3828.8 | 3829.8 | 0.0% | 9.8% | 9.9% | 0.0% |
| Program 28 | 8 | 538 | Airframe | Hours | 3865.3 | 3866.2 | 0.0% | 7.6% | 7.6% | 0.0% |
| Program 29 | 8 | 529 | Hydraulic | Dollars | 156.9 | 156.4 | 0.3% | 22.3% | 22.9% | −2.8% |
| Program 29 | 12 | 477 | Airframe | Dollars | 6490.2 | 5974.2 | 7.9% | 14.2% | 14.4% | −1.8% |
| Program 29 | 12 | 477 | Wing | Dollars | 712.3 | 712.7 | −0.1% | 27.8% | 27.8% | −0.1% |
| Program 29 | 11 | 433 | Electronic Warfare (1) | Dollars | 57.5 | 57.5 | 0.0% | 13.5% | 13.5% | −0.1% |
| Program 29 | 8 | 309 | Electrical | Dollars | 230.6 | 230.7 | −0.1% | 8.2% | 8.2% | 0.0% |
| Program 29 | 12 | 1045 | Body | Dollars | 1922.2 | 1826.7 | 5.0% | 26.0% | 25.9% | 0.7% |
| Program 29 | 5 | 177 | Empennage | Dollars | 32.3 | 22.0 | 31.8% | 6.1% | 4.6% | 24.5% |
| Program 29 | 12 | 477 | PME–Air Vehicle | Dollars | 8218.5 | 5525.3 | 32.8% | 15.0% | 10.2% | 32.0% |
| Program 29 | 8 | 309 | Alighting Gear | Dollars | 205.7 | 42.2 | 79.5% | 11.6% | 2.0% | 83.1% |
| Program 30 | 5 | 469 | PME–Air Vehicle | Dollars | 1283.8 | 891.8 | 30.5% | 13.5% | 8.3% | 38.3% |
| Program 31 | 10 | 59 | PME–Air Vehicle | Dollars | 11,978.9 | 11,979.3 | 0.0% | 8.6% | 8.6% | 0.0% |
| Program 32 | 9 | 348 | PME–Air Vehicle | Dollars | 430.6 | 430.8 | 0.0% | 3.5% | 3.5% | −0.1% |
| Program 33 | 5 | 109 | PME–Air Vehicle | Hours | 993.9 | 994.0 | 0.0% | 9.5% | 9.5% | 0.0% |
| Program 33 | 5 | 109 | PME–Air Vehicle | Dollars | 6824.7 | 6824.8 | 0.0% | 28.2% | 28.2% | 0.0% |
| Program 34 | 18 | 631 | PME–Air Vehicle | Dollars | 6926.7 | 2799.9 | 59.6% | 17.0% | 6.6% | 61.0% |
| Program 35 | 6 | 425 | PME–Air Vehicle | Dollars | 1135.8 | 1137.5 | −0.2% | 3.5% | 3.5% | −0.2% |
| Program 35 | 7 | 522 | PME–Air Vehicle | Hours | 4615.3 | 4458.5 | 3.4% | 6.3% | 6.1% | 3.1% |
| Program 35 | 7 | 522 | Airframe | Hours | 6757.0 | 6280.7 | 7.0% | 5.7% | 5.4% | 4.8% |
| Program 36 | 9 | 358 | PME–Air Vehicle | Hours | 5118.7 | 5120.1 | 0.0% | 6.8% | 6.8% | 0.0% |
| Program 36 | 9 | 358 | Airframe | Hours | 12,155.2 | 11,257.1 | 7.4% | 15.5% | 14.3% | 7.6% |
| Program 37 | 5 | 204 | PME–Air Vehicle | Dollars | 1468.7 | 921.0 | 37.3% | 2.9% | 1.9% | 36.4% |
| Program 38 | 5 | 605 | PME–Air Vehicle | Dollars | 2641.9 | 1527.7 | 42.2% | 14.9% | 8.1% | 46.0% |



**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Application of a Semi-Empirical Dynamic Model to Forecast the Propagation of the COVID-19 Epidemics in Spain**

#### **Juan Carlos Mora 1,\*, Sandra Pérez <sup>2</sup> and Alla Dvorzhak <sup>1</sup>**


Received: 9 September 2020; Accepted: 16 October 2020; Published: 22 October 2020

**Abstract:** A semi-empirical model, based on the logistic map, was developed to forecast the different phases of the COVID-19 epidemic. This paper shows the mathematical model and a proposal for its calibration. Specific results are shown for Spain. Four phases were considered: non-controlled evolution; total lock-down; partial easing of the lock-down; and a phased lock-down easing. With no control, the model predicted the infection of 25% of the Spanish population, of whom 1 million would need intensive care and 700,000 would die directly. For total lock-down the model predicted 194,000 symptomatic infected, 85,700 hospitalized, 8600 patients needing an Intensive Care Unit (ICU) and 19,500 deaths, with the peak predicted between 29 March and 3 April. For the third phase, with a daily rate *r* = 1.03, the model predicted 400,000 infections and 46,000 ± 15,000 deaths. The real daily growth was below 1%, and a revision with updated parameters provided a prediction of 250,000 infected and 29,000 ± 15,000 deaths. The reported values by the end of May were 282,870 infected and 28,552 deaths. After the easing of the lock-down, the model predicted that the health system would not saturate if *r* was kept below 1.02. This model provided good accuracy throughout the development of the epidemic.

**Keywords:** semi-empirical model; logistic map; COVID-19; SARS-CoV-2

#### **1. Introduction**

A new respiratory disease, initially dominated by pneumonia and caused by a coronavirus, was detected in the province of Hubei, China, at the end of 2019. It was initially named by the World Health Organization (WHO) as 2019-nCoV [1] and renamed in February 2020 by the International Committee on Virus Taxonomy as Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), recognizing it as a sister of the SARS-CoV viruses [2,3]. The same day the WHO [4] named the disease Coronavirus Disease 2019 (COVID-19).

Many efforts have been made since then to mathematically model the spread of the disease in the whole world and in the different countries where the infection arrived. Modelling the epidemic has many practical uses: preparing national health systems; making provisions of the necessary sanitary material; predicting whether and when a saturation of the health system could occur; deciding when and to what extent Non-Pharmaceutical Interventions (NPI) [5] should be applied; predicting the day when those countermeasures can be relaxed, and so forth. These theoretical approaches to predicting the evolution of epidemics often use compartment models as simple as the SIR model (Susceptible, Infectious and Recovered—sometimes called Removed) [6], but this model can be increased in complexity to include different characteristics of an infectious epidemic. For example, the model can include individuals who can infect others without presenting symptoms, which is known as the SEIR model (Susceptible, Exposed, Infectious and Removed); the model can also assume that people who have recovered from

the disease lose their immunity after a given time, and could therefore be infected again, giving rise to the SEIRS models (Susceptible, Exposed, Infectious, Removed and Susceptible); deaths and births can also be included for long-term epidemics, as is the case with influenza; and compartments to distinguish deaths, recovered, hospitalized, and other situations are often included by using empirical ratios (see for instance References [7,8] for further information).

As in any consolidated scientific field, there is abundant literature describing the different mathematical models which can be applied to different diseases, to different population behaviours, or even to define optimal control strategies [9–13]. Reference [14] reviews the models used in India to forecast the behaviour of COVID-19, including among others compartment (SIR-type) models, ARIMA models, machine-learning-based approaches and logistic curves. This review highlights the very important discrepancies found between the predictions of those models after lockdowns were applied.

Since the SARS-CoV-2 outbreak, many efforts have been made to adapt these SIR-type compartment models to the behaviour of this particular virus. For example, a conceptual representation of a compartment model for the spread of the COVID-19 disease, developed by the authors of this paper, is shown in Figure 1, where the immunization of recovered patients is assumed to be lost after a given period of time, as happens in other infectious diseases (typically, immunity is lost after less than 12 months in the case of the coronaviruses causing the common cold).

**Figure 1.** Example of a Susceptible, Infectious and Recovered (SIR) type compartment model adapted to simulate the behaviour of SARS-CoV-2. In this case we assumed that immunity would be lost in a given period of time.

Due to the difficulty of developing and calibrating these compartment models at the early stages of an epidemic, wrong conclusions are often reached, for instance when predicting the timing of the epidemic, and many times the uncertainties associated with the results of the models are so wide that they are not well accepted by the public. Sometimes the authors of such predictions blamed the quality of the statistical data [15,16]. However, this quality is severely affected by the urgency of the epidemic, and this can probably not be avoided in this or future outbreaks. The continuous publication of medical and epidemiological studies on COVID-19, and on the associated virus, does not make it easy to extract good quality information to adapt the models. But it must be accepted that this situation will always be the case—or even worse—when new diseases appear.

Some other problems associated with future predictions of the behaviour of COVID-19, like the influence of ambient temperature or humidity on the seasonality of the disease [17], remain unsolved. In the early stages, many aspects of the behaviour of the SARS-CoV-2 virus were inferred from previous studies on similar viruses such as the SARS virus, for example its persistence on fomites [18] or the immunization of patients recovered from it [19]. Many other aspects which would also affect the choice of the best model are still under investigation, for example the possible loss of immunity after some months, as happens with other human coronaviruses such as those causing about 15% of common cold cases [20].

During the outbreak of the epidemic in Spain several models were tested, and a follow-up of the published results was performed to support the Spanish national authorities in the decision-making process [21], producing a preliminary work covering all the phases, which was published as a preprint [22]. The best results were obtained by using the semi-empirical approach presented herein, which has the advantage of producing accurate predictions with the minimum amount of information available during this epidemic, which will very likely be the situation in future outbreaks.

This paper presents the mathematical development of the proposed semi-empirical model and the results obtained using it, focusing on the Spanish case. Some results obtained for other countries are also presented.

#### **2. Materials and Methods**

The semi-empirical model presented in this paper, with a proper calibration, produces accurate results at every stage of the epidemic: during the first spread of the disease, after the application of NPIs (specifically the total lock-down, applied in many countries), and after the easing of the lock-down.

Although the instant reproduction number (Rt) used by epidemiologists to estimate the severity and evolution of an epidemic [23] is not used in this model, the basic reproduction number (R0) was derived for 10 different countries, ranging from 2.0 to 9.3, which is in good agreement with previous estimations [24]. The R0 values derived from real data are found in Table 1, giving an average of 5.8 ± 2.4, almost double the early estimations [25].


**Table 1.** Basic Reproductive Number calculated for some studied countries.

The model proposed in this paper applies the well-known logistic map, often used to describe the growth of populations and frequently mentioned as an example of chaotic behaviour. This behaviour depends on a single parameter *r* (Figure 2 shows the bifurcation diagram, a fractal, created with the logistic map as a function of *r*). In this equation, values of *r* < 1 make an epidemic die out. Any *r* greater than 1 but below 3 leads to an equilibrium in the size of the population in the long term, while values of *r* higher than 3.56995 produce a chaotic behaviour in the size of the population.

**Figure 2.** Bifurcation diagram for the logistic map as a function of *r*.

Therefore to determine the number of infected diagnosed cases Equation (1) is used:

$$I(t) = r \cdot \left(1 - \frac{I(t-1)}{N}\right) \cdot I(t-1),\tag{1}$$

where *I*(*t*) is the number of infected diagnosed cases at day *t*, *I*(*t* − 1) the infected diagnosed cases of the preceding day *t* − 1, *r* is the growth parameter of the logistic map (hereafter named the daily infection rate), and *N* the number of individuals susceptible to be infected (Figure 3 shows a simplified example of this function with constant parameters). It should be noted that the number of susceptible individuals used here is not necessarily the same as the number used by modellers applying SIR-type models. The subscript *n* will be used below to indicate the *n*-th day after the outbreak.

The behaviour of this function gives rise to the logistic function and the typical sigmoid shape of its cumulative distribution if *r* < 2, while it shows a chaotic behaviour if *r* > 3.56995 (see Figure 3). Other authors have studied the behaviour of this logistic function applied to the COVID-19 epidemic [26,27].
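The iteration in Equation (1) is straightforward to reproduce numerically. The sketch below (with illustrative parameters only, not the calibrated values used in the paper) iterates the map for a fixed *r* in the stable range and shows the convergence to the long-term equilibrium described above:

```python
def logistic_map_series(r, n_susceptible, i0, days):
    """Iterate Equation (1): I(t) = r * (1 - I(t-1)/N) * I(t-1)."""
    series = [float(i0)]
    for _ in range(days - 1):
        prev = series[-1]
        series.append(r * (1.0 - prev / n_susceptible) * prev)
    return series

# For 1 < r < 3 the series converges to the equilibrium N * (r - 1) / r;
# here N = 1,000,000 and r = 1.5 give an equilibrium of about 333,333.
trajectory = logistic_map_series(r=1.5, n_susceptible=1_000_000, i0=10, days=200)
print(round(trajectory[-1]))  # → 333333
```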

**Figure 3.** Number of infected obtained for the logistic map as a function of *r* for the different options, from *r* = 1.9 to *r* = 3.8.

*Forecasting* **2020**, *2*

In order to compare with the values of the basic reproduction number in Table 1, the empirically determined values of the growth parameter *r* are shown for the same countries in Table 2. This *r* parameter is simply measured by dividing the number of infected on a given day by the number of infected on the previous day. To avoid statistical biases, *r* was taken for each country as an average over the first 7 days after the initial detection of infected cases in that country. The *r* in these countries was 1.9 ± 0.5, ranging from 1.3 to 3.0. In this approach, a value of *r* < 3 implies that, in the absence of countermeasures, and independently of the initial value *I*(0), a unique equilibrium in the number of infected would be reached: *I*(∞) = *N* · (*r* − 1)/*r*. Worldwide, on average, an equilibrium value of 3.65 billion infected would be reached applying *r* = 1.9 and *N* = 7.7 × 10<sup>9</sup> to the equation. The equilibrium values which would be reached for each country, if no intervention was applied, are shown in Table 2.
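As a quick numerical check of this equilibrium expression, using only the worldwide figures quoted in this paragraph:

```python
def equilibrium_infected(r, n_susceptible):
    """Fixed point of Equation (1) for 1 < r < 3: I(inf) = N * (r - 1) / r."""
    if not 1 < r < 3:
        raise ValueError("a unique non-zero equilibrium requires 1 < r < 3")
    return n_susceptible * (r - 1) / r

# Worldwide average quoted in the text: r = 1.9, N = 7.7e9 -> ~3.65 billion
print(round(equilibrium_infected(1.9, 7.7e9) / 1e9, 2))  # → 3.65
```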


**Table 2.** Growth parameter *r* for the logistic map, empirically calculated for the same countries during the first days of the spread of the COVID-19 epidemic, and the equilibrium value for the infected people if no countermeasures were applied.

The basic quantity used to make predictions is therefore the number of infected *I*(*t*) reported by each country or region. This model does not need to consider asymptomatic infected, nor to question what the real number of infected is, but makes use of the reported data. However, as the case of the "Diamond Princess" cruise demonstrated, nearly 70% of the infected would be asymptomatic and undiagnosed [28].

Other quantities needed to provide advice to the authorities are the number of inpatients who would need medical attention at the hospital (*Hn*), the number of those who would need intensive care (*Cn*) and the number of deaths (*Dn*), all of them at each time *t*. *Hn* and *Dn* are calculated as fractions of the number of diagnosed infected cases at time *t* (*In*), and *Cn* as a fraction of *Hn*. The number of recovered patients (*Rn*) is obviously given by the fraction (1 − *Dn*). The fraction used to calculate *Dn* in this way is referred to as the case fatality rate (CFR), determined as *CFR* = *Dn*/(*Dn* + *Rn*). During the outbreak this proved more practical than other approaches, such as the mortality rate for the whole population, which can only be experimentally known at the end of the epidemic.

A delay must be included to represent the time elapsed between a death and its report to the authorities, including the time needed to perform the diagnosis (usually by using the polymerase chain reaction (PCR) technique).
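A minimal sketch of how these empirical fractions and the reporting delay combine; the function name and the example series are illustrative assumptions, not values from the paper:

```python
def derived_series(infected, h_frac, c_frac, cfr, delay_days):
    """Hospitalized H_n = H * I_n, ICU C_n = C * H_n, deaths D_n = CFR * I_{n-d}."""
    hospitalized = [h_frac * i for i in infected]
    icu = [c_frac * h for h in hospitalized]
    # Deaths lag the diagnosed cases by the reporting delay d.
    deaths = [cfr * infected[max(0, n - delay_days)] for n in range(len(infected))]
    return hospitalized, icu, deaths

# Illustrative cumulative infected series, a 2-day reporting delay, and the
# empirical fractions quoted later in the paper (H = 0.467, C = 0.1497).
infected = [100, 150, 225, 330, 470]
hosp, icu, deaths = derived_series(infected, h_frac=0.467, c_frac=0.1497,
                                   cfr=0.0578, delay_days=2)
```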

All of these numbers are crucial for policy makers in order to make well-founded decisions. However, to perform reliable predictions an appropriate calibration of the model is needed, which will depend on the specific situation of each phase of the epidemic.

#### *2.1. Initial Parametrization*

All the parameters of the model are empirically calibrated by averaging the available information in the studied region. This calibration is feasible at the early stages, when the available data cover only a few days, but it can also be dynamically adjusted during the whole evolution of the epidemic. To perform reliable predictions at the very beginning of the outbreak, information from previously infected countries can be used for the initial calibration of the model.

SARS-CoV-2 is assumed to infect every human with the same probability, regardless of sex or age. Being a new human virus, no immunization was previously acquired by natural or artificial (vaccine) means. For that reason, the number of people who can be infected, *N*, was initially assumed to be the whole population of the studied region, whatever the size of that region. For the sake of simplicity, the total population of a country (or a region, such as the so-called autonomous communities in Spain) is initially used. In the case of Spain, the total population *N* = 46.6 × 10<sup>6</sup> was initially used.

The daily infection rate, *r*, can be dynamically determined, using all the data collected to average over a given period, by dividing the number of infected on day *n* + 1 (*In*+1) by the number of infected on day *n* (*In*). For Spain, averaging the daily infection rate during the first 7 days of the outbreak, from 26 February to 3 March, an *r* = 1.5 was obtained. Following the same method, *r* values were obtained for 10 countries (see Table 2). This parameter, however, changes with the actions taken by governments and the population, such as social distancing, frequent hand washing or the use of masks.
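This dynamic estimate of *r* is just an average of ratios of consecutive cumulative counts; a sketch with hypothetical counts (not the Spanish data):

```python
def average_daily_rate(cumulative_infected, days=7):
    """Average of I_{n+1} / I_n over the first `days` daily ratios."""
    ratios = [b / a for a, b in zip(cumulative_infected, cumulative_infected[1:])
              if a > 0]
    return sum(ratios[:days]) / min(days, len(ratios))

# Hypothetical early counts growing roughly 50% per day -> estimate near 1.5
counts = [10, 15, 23, 34, 51, 77, 115, 172]
print(round(average_daily_rate(counts), 2))  # → 1.5
```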

The fraction of the infected who need to be hospitalized (*H* = *Hn*/*In*) is dynamically determined using the data acquired in each region (or state, or country), averaged over the whole period since the beginning of the epidemic. The same was done for the fraction of inpatients needing an ICU (*C* = *Cn*/*Hn*). An initial factor of patients needing an Intensive Care Unit (ICU) out of the number of infected cases, *H* · *C* = 0.05 to 0.15, was computed from the studies in Eastern countries.

As the reported data for the infected patients were given as accumulated values since the beginning of the epidemic, all the other quantities were also obtained as cumulative functions. Due to the special configuration of its health system, this was a problem in Spain, as different regions decided to report different quantities, and initially it was impossible to obtain accumulated national data for hospitalized persons or for patients who needed an ICU. Daily rates were interesting to calculate, for instance, the day when the maximum of infections or deaths (the peak of the curve) would occur.

For the CFR, the value measured near the equilibrium in China was used (*CFR* = 0.0578, as measured on 4 March—see Figure 4 for the evolution of the CFR in China). This parameter presented a similar behaviour in many other countries, reaching a value at equilibrium of nearly 5%. The high CFR values observed at the beginning of the epidemic in every country are probably due to several joint factors, including the weakness of the most vulnerable population (very old or already sick people) and the lack of knowledge of which medical treatments were most effective. Those factors improved with time. In European countries a similar evolution was observed, although the decrease was slower than in China (at the beginning of April 2020 the CFR was 0.3998 for the USA, 0.3557 for Italy, and 0.2052 for Spain). The time delay from the diagnosis of an infected patient to the possible death was adjusted for each country using real data. In the initial stages the observed delay was 5 to 10 days for all the countries, and it was reduced after some days.

**Figure 4.** Case Fatality Rate as measured in China since the beginning of February 2020. The value of 5.78% measured on 4 March was taken for the model (Source of information: https://www.worldometers.info/coronavirus/ (consulted on 11 March)).

#### *2.2. Parametrization during the Lock-Down*

A non-pharmaceutical intervention used in China and many other countries, including Spain, was the so-called 'lock-down', in which the population is required to stay at home and only leave if essential. This NPI has been partially implemented in some regions and totally in others, including the region of Hubei in China (58.5 million inhabitants), Spain (46.6 million inhabitants) and Italy (60.4 million inhabitants).

In each region or country the initial value used for *N* was its total population, but after the lock-down the number of people already infected, or in contact with infectious people, is fixed, and therefore *N* would be smaller. This number cannot be determined before the lock-down, but it can be calculated on the very day the lock-down is implemented by using the number of infected measured at that exact time. A first estimation was made using the number of infected estimated with the model 14 days after the beginning of the lock-down (14 days being the assumed incubation period for COVID-19) and multiplying that number by a factor of 10, which would provide the total infected. This method provides a rough estimation which needs further refinement when new data are obtained; however, it provided valid estimations for forecasting the time when the maximum (peak) of daily new diagnosed cases or deaths would be expected.
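The rough post-lock-down estimate of *N* described above is simple arithmetic; in this sketch the modelled day-14 count is a hypothetical figure chosen only for illustration:

```python
# Infected estimated by the model 14 days after the lock-down begins
# (14 days = assumed incubation period); this figure is hypothetical.
modelled_infected_day14 = 110_000
# Multiply by 10 to approximate the total infected, giving the reduced pool N.
n_after_lockdown = 10 * modelled_infected_day14
print(n_after_lockdown)  # → 1100000
```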

As expected, the daily infection rate *r* was observed to decrease, from the rate on the day before the NPI was applied (typically around 1.3) to a number slightly higher than 1.0, as observed in South Korea. The same behaviour was observed in every country and at every scale. After the lock-down is implemented, the *r* parameter can be fitted by least squares to the curve given in Equation (2) for the given region or country.

$$r = 1 + A \cdot e^{-\alpha \cdot t},\tag{2}$$

where *t* is the time in days since the lock-down and *A* and *α* are constants empirically determined at each location. Table 3 shows the values of *A* and *α* adjusted for different countries after a lock-down was implemented, together with examples at a smaller scale for regions within Spain (Andalusia and Catalonia were chosen for this example because they are the two most populated regions in Spain). Those values were obtained by fitting Equation (2) to the experimental data of the different regions or countries.
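The least-squares fit of Equation (2) can be done in closed form after a log transform, since ln(*r* − 1) = ln *A* − *αt* is linear in *t*. A self-contained sketch on synthetic data (the constants below are illustrative, not the fitted country values of Table 3):

```python
import math

def fit_decay(days, rates):
    """Fit r = 1 + A * exp(-alpha * t) by least squares on ln(r - 1)."""
    xs = list(days)
    ys = [math.log(r - 1.0) for r in rates]  # requires every observed r > 1
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a_const = math.exp(mean_y - slope * mean_x)
    return a_const, -slope  # (A, alpha)

# Synthetic rates generated with A = 0.3, alpha = 0.1; the fit recovers them.
days = range(10)
rates = [1 + 0.3 * math.exp(-0.1 * t) for t in days]
a_fit, alpha_fit = fit_decay(days, rates)
print(round(a_fit, 3), round(alpha_fit, 3))  # → 0.3 0.1
```

With noisy real data the same closed form applies; the log transform simply weights small (*r* − 1) values more heavily than a direct nonlinear fit would.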

**Table 3.** Values obtained by fitting Equation (2) to the experimental data in different countries and two Spanish regions (Catalonia and Andalusia).


\* (Source of information: https://www.worldometers.info/coronavirus/ (consulted on 17 April)); \*\* ('Instituto de Salud Carlos III' (ISCIII): https://covid19.isciii.es/ (consulted on 17 April)).

The number of individuals infected before the lock-down (*N*) and the constants *A* and *α* cannot be determined prior to the lock-down, as different groups of individuals or societies behave differently under exactly the same government instructions, and different governments also provided slightly different instructions. So the only way to obtain good predictions after the lock-down is to wait several days for experimental data to be used to fit the curve of Equation (2). It should also be pointed out that some sources of information provided data shifted in time, or simply different data, and consequently the fitting could provide different values for the parameters.

The parameters *H* and *C* were determined by averaging the empirical values from the studied region. In Spain the values obtained as an average up to 6 April were *H* = 0.467 and *C* = 0.1497, which indicates that a high rate of the diagnosed infected needed to be hospitalized or, more likely, that only severe cases were diagnosed at the hospitals, half of which needed to be admitted. Also, a high percentage of the inpatients (almost 15%) needed intensive care with ventilators, in agreement with the pattern observed in China and other countries. In this case *H* · *C* = 0.07 (7%), which was in good agreement with the initial range observed for the inpatients needing an ICU. As all parameters were dynamically recalculated every day, the predictions were slightly recalibrated daily.

Empirical values were also used for the CFR, as the equilibrium value taken from China was well surpassed in the initial stages in many European countries, although it tended to the same equilibrium value (nearly 5%). Although initially it reached values of even 50%, the experimental CFR in Spain, as in the UK, Belgium or Italy, was 0.12 (12%) by April. The delay applied from the report of the positive diagnosis of a patient to the death (when it occurred) was reduced to 3 days.

#### *2.3. Parametrization after Easing the Lock-Down*

When a region or a country decides to relax the confinement, the parameters need a new calibration to take the new situation into account.

When the lock-down is completely abandoned, *N* returns to being the whole population of the region or country. However, this was not the situation in every country.

For example, in Spain the lock-down was established on 15 March. Although the ideal situation would have been to maintain the total lock-down until *r* reached a value close to 1, which the model expected to happen by the end of April, on 13 April some relaxation was adopted, allowing most workers to return to their normal activities. A large part of the population remained confined, but a graded approach was established to remove the confinement before the end of May. This being the situation, the parameters can only be inferred after some data are collected, following the same methodology established during the lock-down. Therefore *r* should be fitted to an exponential decrease, following the same Equation (2), after obtaining enough data. To perform initial conservative estimations a value of *r* = 1.03 can be used.

$$r = 1.01 + B \cdot e^{-\beta \cdot t}.\tag{3}$$

In the final stages of the easing of the lock-down, Equation (3) was used for *r*, considering it impossible to achieve a value *r* < 1.01 (as the experience in other countries showed that reducing the level of daily infections below that value was, at least, very difficult). *B* and *β* are again constants empirically determined at each location.

The rest of the parameters (*H*, *C* and the CFR) continue to be averaged over the whole period with real data.

This parametrization can be used to assess the evolution of the situation after easing the lock-down, or to design a strategy that minimizes the number of infected, hospitalized or inpatients needing an ICU, to avoid the saturation of the health system of a country.

#### **3. Results**

As pointed out, three phases were considered:

- the initial phase, before any countermeasure is applied;
- the lock-down phase;
- the unlocking phase, after the easing of the lock-down.


As an example, the application of the model with the appropriate parametrization in Spain is presented to show the performance of the model. The same methodology was also used for some of the regions of Spain, and can be used for any other region or country of any size.

#### *3.1. Initial Phase*

The initial phase is the one before any countermeasure is applied. Figure 5 shows the cumulative number of infected and the total number of deaths reported in Spain (in red) from 29 February to 20 March. From 29 February to 14 March, reported data (in red) are shown against modelled values (in green). School closures were established from 11 March and the total lock-down on 15 March.

**Figure 5.** Predicted number of infected (**a**) and deaths (**b**) in Spain from 29 February to 20 March. The data from 14 March onwards are based on the model, assuming no interventions were implemented. Red dots—reported. Green bars—modelled with uncertainty.

As explained, during this initial stage *r* = 1.05 was calculated as an average of the values measured during the initial days of the spread of the epidemic; *N* was the total population of Spain (46.6 × 10<sup>6</sup> inhabitants); and the *CFR* = 0.0578 was taken from the Chinese experience. In this phase the only parameter dynamically calibrated to adjust to the daily reported data was the delay from the number of infected to the number of deaths which, as explained before, was reduced from an initial value of 7 days to a 2-day delay applied from 5 March. In this specific case, the forecast indicated a number of infected cases of 26,600 ± 500 and a number of deaths of 1230 ± 150 to occur 6 days later (11 March). The real number of reported infected was 21,571 (19% difference), and the number of reported deaths was 1093 (11% difference). The number of inpatients needing an ICU and a ventilator was calculated as *I*(*t*) · *H*(*t*) · *C*(*t*), providing a range of [1330–3990]. The reported number of inpatients who needed an ICU on 20 March was 1630 (within the calculated range). Although in this initial stage many factors altered the real numbers, the accuracy was reasonably good.
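The quoted ICU range follows directly from applying the initial combined factor *H* · *C* = 0.05 to 0.15 to the forecast number of infected (a check of the arithmetic only; integer arithmetic avoids float rounding):

```python
forecast_infected = 26_600  # forecast number of infected cases
# Initial combined ICU factor H * C between 5% and 15%
icu_low = forecast_infected * 5 // 100
icu_high = forecast_infected * 15 // 100
print(icu_low, icu_high)  # → 1330 3990
```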

Predictions of the likely number of infected, hospitalized inpatients and total deaths were carried out using these conditions for the initial phase (uncontrolled spread of the disease). The model predicted that, if a severe NPI (total lock-down) was not adopted and the virus was instead left to spread without control, by the end of the epidemic in Spain 12 million people would have been infected, of whom nearly 1 million would need intensive care and about 700,000 would die directly because of the COVID-19 disease. However, the number of deaths would likely be higher due to the saturation of the health system, as these numbers would occur within a very short period.

These results could provide an early idea of the urgent necessity of applying extreme NPIs like the total lock-down; they could also be used to predict the consequences of not applying severe NPIs, and to plan the capabilities of the ICUs, including the number of ventilators.

#### *3.2. Lock-Down Phase*

In Spain the closing of schools began on 11 March and the lock-down was established on 15 March. All factors were re-calibrated for this second phase as stated, including a fitting of the daily infection rate to the curve given in Equation (2). For the number of susceptible individuals *N*, an initial estimation (*N* = 1.1 × 10<sup>6</sup>) was carried out using the results of the model 14 days after the lock-down. An average value *r* = 1.32 was used. Results of these early predictions are presented in Figure 6. These values were later calibrated dynamically to *r* = 1.20 and *N* = 9 × 10<sup>6</sup> using data up to 25 March. Results of these predictions are presented in Figure 7.
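Combining Equations (1) and (2) gives the lock-down phase model: the map is iterated with a rate that decays towards 1. A sketch with illustrative constants (the *A*, *α*, *N* and *I*(0) below are not the calibrated Spanish values) that locates the peak of daily new cases:

```python
import math

def lockdown_trajectory(i0, n_pool, a_const, alpha, days):
    """Iterate Equation (1) with the decaying rate of Equation (2)."""
    infected = [float(i0)]
    for t in range(1, days):
        r_t = 1.0 + a_const * math.exp(-alpha * t)
        prev = infected[-1]
        infected.append(r_t * (1.0 - prev / n_pool) * prev)
    return infected

# Illustrative run: reduced pool N fixed at the lock-down, decaying r(t)
traj = lockdown_trajectory(i0=10_000, n_pool=9_000_000, a_const=0.32,
                           alpha=0.07, days=60)
daily_new = [b - a for a, b in zip(traj, traj[1:])]
peak_day = daily_new.index(max(daily_new)) + 1
print(peak_day)  # day on which daily new cases peak
```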


**Figure 6.** Total number of infected (**a**), total deaths (**b**) and daily deaths (**c**) in Spain predicted from 29 February to 19 April. These preliminary results assumed the total lock-down from 15 March, with data up to 14 March. Red dots—reported data. Green bars—modelled values with uncertainty.



**Figure 7.** Total number of infected (**a**), total deaths (**b**) and daily deaths (**c**) in Spain predicted from 29 February to 19 April. These second modelled results assumed the total lock-down from 15 March and were calibrated with data up to 25 March. Red dots—reported data. Green bars—modelled values with uncertainty.

The application of the total lock-down in the model reduced the predictions carried out on 26 March to a total of 194,000 infected, 85,700 hospitalized, nearly 8600 needing an ICU and 19,500 ± 1400 deaths, to occur by 17 April. The real numbers reported on that date were 197,142 infected (1.6% difference), 7548 inpatients needing an ICU (12% difference) and 20,043 deaths (2.7% difference). The model predicted the peak of the rate of daily deaths to occur between 29 March and 3 April. In reality the peak, after the administration revised the data (two months later), occurred on 31 March.

#### *3.3. Unlocking Phase*

The last phase is the easing of the lock-down. In this case, predictions based on some hypotheses, carried out during April, are presented in this paper and compared with the real evolution. Again, the case of Spain is presented as an example. Using the models presented in this paper, recommendations were provided to ease the lock-down around 21 April [21]. In reality, a partial unlocking was decreed on 13 April for non-essential workers, followed by a phased total unlocking from 30 April, in which some activities were allowed gradually each week until 21 June, when normal activity was restored, although the population was still required to follow NPI countermeasures such as social distancing, use of masks, hand washing, and so forth.

After the unlocking, many uncertainties appear, but the results of the model depend largely on the daily infection rate *r*.
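The dependence on *r* can be illustrated with a toy iteration. Equation (2) is not reproduced in this excerpt, so the sketch below assumes one common discrete logistic form, in which the cumulative number of infected grows by a factor *r* while far below the susceptible pool *N* and saturates as it approaches *N*; the function name and exact update rule are illustrative, not the paper's:

```python
def simulate(i0: float, r: float, n_susceptible: float, days: int) -> list:
    """Iterate a discrete logistic growth rule:
    I[t+1] = I[t] * (1 + (r - 1) * (1 - I[t] / N)).
    For I << N this is plain geometric growth at rate r;
    the trajectory levels off as I approaches N."""
    traj = [i0]
    for _ in range(days):
        i = traj[-1]
        traj.append(i * (1.0 + (r - 1.0) * (1.0 - i / n_susceptible)))
    return traj

# Lock-down phase figures from the text: r = 1.32, N = 1.1e6
traj = simulate(i0=1_000.0, r=1.32, n_susceptible=1.1e6, days=200)
```

With these numbers the early daily growth is close to 32% and the curve saturates near *N*; lowering *r* toward 1 flattens the trajectory, which is why the daily infection rate dominates the forecasts after unlocking.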

#### 3.3.1. Partial Unlock

On 13 April, a partial easing of the total lock-down was applied in Spain.

In order to obtain conservative figures, an initial forecast was carried out with the data available on 16 April [22]. At that point, only some reasonable hypotheses could be applied to calibrate *N* and *r*.

For *N*, it was assumed that about 20% of the total workforce (20 million workers in Spain in 2020) went back to work, as only some industries were allowed to resume their activities after that day. As those people could also infect their families (two further members on average), an initial *N* = 1.2 × 10<sup>7</sup> was taken. In order to carry out conservative predictions, *r* = 1.03 (a 3% daily increase in the number of infected) was taken. The results in Figure 8 were obtained under those conservative assumptions.
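The conservative value of *N* follows from simple arithmetic under the stated assumptions:

```python
workforce = 20_000_000   # Spanish workforce in 2020
share_back = 0.20        # fraction assumed to return to work on 13 April
household_size = 3       # each worker plus 2 family members on average

n_conservative = workforce * share_back * household_size
print(f"{n_conservative:.1e}")  # 1.2e+07
```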


**Figure 8.** Total number of infected (**a**), total deaths (**b**) and daily deaths (**c**) in Spain predicted from 29 February to 29 May. These modelled results assumed the easing of the lock-down from 13 April, with conservative assumptions for *N* and *r*. Red dots—reported data. Green bars—modelled values with uncertainty.

Using these conservative values of *N* and *r*, some consequences of the partial easing of the lock-down could be extracted. First of all, the number of diagnosed infected people would increase continuously beyond May. In fact, there would have been a peak in the daily rate of infected by 28 May and a peak in the daily rate of deaths by 1 June. The total number of deaths in Spain by the end of May would reach 46,000 ± 15,000 under this scenario. That was the conservative value we published in April [22]. Had this been the case (*r* > 1.03), a saturation of the health system would have occurred again in Spain, so that value of *r* could be regarded as an upper bound which should be avoided.

In reality, the very good behaviour of the Spanish population caused *r* to decrease continuously, even after the total easing of the lock-down.

#### 3.3.2. Total Unlocking

From 4 May, a total unlocking was applied in Spain, with a progressive increase in the mobility of the population since then, and therefore a recalibration was needed. For this phase, *N* was again considered to be the total population (46.6 million inhabitants).

Once some more data were obtained, a least-squares fit was performed using an initial daily infection rate *r* = 1.02 (a 2% daily increase of infected) on the day before the unlocking, and Equation (3) to fit *r*. The calibration resulted in *B* = 0.03 and *β* = 0.153, which gives the equation *r* = 1.01 + 0.03 · *e*<sup>−0.153·*t*</sup>. The results obtained with this fit are shown in Figure 9.
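A shape such as *r*(*t*) = 1.01 + *B*·*e*<sup>−β·*t*</sup> can be recovered from a series of daily rates by a log-linear least-squares fit. The helper below is an illustrative reconstruction (Equation (3) itself is not reproduced in this excerpt), assuming the asymptotic rate 1.01 is known and subtracting it before taking logarithms:

```python
import math

def fit_decay(ts, rs, baseline=1.01):
    """Fit r(t) = baseline + B * exp(-beta * t) by ordinary least squares
    on the log-linearized model: ln(r - baseline) = ln(B) - beta * t."""
    ys = [math.log(r - baseline) for r in rs]
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) \
        / sum((t - t_mean) ** 2 for t in ts)
    beta = -slope
    b = math.exp(y_mean - slope * t_mean)
    return b, beta

# Synthetic check with the calibrated values from the text
ts = list(range(30))
rs = [1.01 + 0.03 * math.exp(-0.153 * t) for t in ts]
b, beta = fit_decay(ts, rs)
```

On a real daily series the subtraction *r* − 1.01 can go non-positive because of noise, so in practice such points must be filtered out or a nonlinear fit used instead.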


**Figure 9.** Total number of infected (**a**), total deaths (**b**) and daily deaths (**c**) in Spain predicted from 29 February to 28 June. These modelled results assumed the easing of the lock-down from 13 April. Red dots—reported data. Green bars—modelled values with uncertainty.

The series of data finishes at the end of May, as official aggregated data for Spain were no longer provided. In fact, the least-squares fit was extremely difficult, as an attempt to homogenize the data between the different regions of the country caused the whole data series to be revised almost every day. That is one of the reasons why, on this occasion, the fitting was not as good as in previous phases for the total number of infected and the total number of deaths: the focus was put on the daily number of deaths to obtain a good fit there. The daily number of deaths was the main endpoint in this phase because it is the best indicator for future surveillance of the situation of the epidemic.

The result obtained using this calibration for the final stage was that the number of diagnosed infected people would slowly but continuously increase beyond May. This is a logical result, as the infection will always be present at a slow rate until the virus is eradicated or there is a vaccine to control the spread of the disease. The predicted number of total infected was 317,500 ± 1700 and the total number of deaths would be 37,100, with a huge uncertainty, by 1 August. The real numbers on that day were 335,602 reported infected (5.7% difference) and 28,445 deaths (23% difference).

During the summer the situation was controlled; however, if the good practices in the application of NPIs (hand washing, social distancing, use of masks, and so forth) are abandoned, *r* could easily reach values above 1.03, and another increase in the number of infected will surely occur. This will also be the case when borders are reopened and new infectious people (even asymptomatic ones) enter inadvertently from countries in the initial phase of the epidemic.

Of course, these predictions did not consider important changes, such as the discovery of a vaccine—which seems extremely difficult in a short period of time—or an increase in temperatures during the summer, which could reduce the infectivity of SARS-CoV-2, or a higher insolation, which could reduce the severity of COVID-19 due to a higher production of vitamin D [29,30], or any other unforeseen circumstance. During that summer, of course, the treatment of hospitalized patients improved, and therefore a smaller fraction of inpatients needed an ICU or died.

#### *3.4. Percentage of Infected Population in the Regions*

There was an additional result extracted from this model. The need to recalibrate it during the locking phase by fitting the parameter *N*, the number of people susceptible to infection in that phase, offers the possibility of using that number to infer the percentage of the population of a country or region which could have been infected by this particular virus, most of them showing no symptoms.

As an example, these numbers were extracted for the autonomous communities (administrative regions) in Spain and transformed into three levels of infection (low—below 5%; medium—from 5% to 10%; high—above 10%), giving rise to the result shown in Figure 10a.
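The mapping from the calibrated fraction of infected population to the three qualitative levels can be sketched as follows (the function name is illustrative, not the paper's):

```python
def infection_level(fraction: float) -> str:
    """Classify the estimated fraction of a region's population infected:
    low (< 5%), medium (5% to 10%), high (> 10%)."""
    if fraction < 0.05:
        return "low"
    if fraction <= 0.10:
        return "medium"
    return "high"

# e.g. a region where the calibrated N is 12% of its population
print(infection_level(0.12))  # high
```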

This qualitative result has an important use for the authorities, as a population with a high level of infection by SARS-CoV-2 probably developed immunity against the virus (at least temporarily, as discussed before), and therefore there is little chance of them being infected again in the near future. On the contrary, there is a bigger chance of further outbreaks of the disease in those regions where the percentage of the infected population was smaller.

The model provided some counter-intuitive results. For example, the capital region of Spain, Madrid, presented a "medium" level of infection of the population, whereas Catalonia showed a "high" level, even though both the number of diagnosed cases and the number of deaths were higher in Madrid, which would imply a higher level of infection there. However, this could be explained by the different criteria followed by the different regions for reporting the numbers. For example, Catalonia decided by mid-May to include in the statistics the deaths of people occurring outside hospitals, while it was unknown whether Madrid was already including those deaths in its statistics. The same occurs with the number of diagnosed infected cases: an increase was observed in the number of tests performed in Madrid on people who finally did not need a hospital, whereas the percentage of diagnosed population finally needing a hospital in Catalonia was still around 65% by May.

(**a**) Forecasted levels of infection [22]. (**b**) Measured levels of infection [31].

**Figure 10.** Levels of infection in the population of each region of Spain: calculated in April 2020 using the model presented in this paper (**a**), and measured in the seroprevalence study ENE-COVID finished in July 2020 (**b**).

These predictions were compared against the extensive statistical programme of immunity prevalence carried out on the Spanish population from April to June 2020, published in early July [31] (see Figure 10b). This study measured the percentage of people infected during the epidemic, taking account of asymptomatic infected persons, whereas the previously reported numbers included just the hospitalized inpatients showing severe symptoms for whom a polymerase chain reaction (PCR) test returned a positive result. The results of both studies can be compared in Figure 10.

As can be seen, the results provided by the model in April were accurate in most of the regions where low infection occurred, as in the south (Andalucia, Extremadura, Ceuta, Melilla, the Canary Islands, the Balearic Islands, Valencia and Murcia) and north-west (Galicia, Asturias and Cantabria) of Spain. However, the model predicted a medium-to-high infection level across the whole north and north-east, whereas the measured levels of seroprevalence gave low levels in some regions (La Rioja, Navarra and the Basque Country). In general, the forecast provided a good overall view of the infection level.

#### **4. Discussion and Conclusions**

As it is unlikely that a vaccine against SARS-CoV-2 or a cure for the associated disease COVID-19 will be developed in the next months, the only way of reducing the consequences of the epidemic at this moment is an optimal application of the available NPIs.

Traditional epidemiological models, such as compartment models, need good-quality data to offer proper predictions. However, during the outbreak of an epidemic the quality of the data is often not the primary concern of the health systems, and therefore many epidemiologists were discouraged and abandoned their efforts to offer predictions. This fact, together with the need for fewer data, made it interesting to use a semi-empirical model based on the logistic map. The application of this methodology to the different phases of the COVID-19 epidemic in Spain was shown, and it provided good results in forecasting the evolution of the disease in every situation.

The use of extreme non-pharmaceutical interventions, such as the total lock-down, showed their effectiveness during the period in which they were applied. However, easing the countermeasures allows new outbreaks of the infection to appear. This situation forces the application of many simultaneous techniques to reduce the effect of the disease, if that is the case. One of those techniques could be the application of the methodology described in this paper to provide early alerts of outbreaks in countries or smaller units of population, allowing an optimization of sanitary resources and reducing the economic and social impact of future NPIs applied locally.

All the data used in the paper were officially published in real time by the official sources of the Spanish Government [32,33]. As was shown, reasonably accurate results can be produced by applying the model presented in this paper to the different phases of an epidemic. In a previous preprint, assuming a daily infection rate *r* of 3%, a total number of 400,000 diagnosed infected and a total number of 46,000 ± 15,000 deaths were forecast in Spain by the end of May [22]. Those predictions overestimated the real values due to a stricter reduction of the daily infection rate in the country, which reached values below 1%.

The forecasts covered the number of infected, hospitalized, inpatients needing an ICU and deaths, as well as the time at which the peak of daily deaths would occur and the level of infection in a given region. In the last prediction, carried out for the beginning of August, 317,500 ± 1700 infected and a total number of deaths of 37,100 were predicted, with a huge uncertainty, to be compared with the real numbers of 335,602 reported infected (5.7% difference) and 28,445 deaths (23% difference).

Any policy dealing with the application and withdrawal of NPIs should carefully consider daily infection rates. In the case of COVID-19, a daily infection rate *r* lying within the range of 1.01 to 1.02 (a 1% to 2% daily increase), as was shown in countries like South Korea, would produce a manageable level of people needing an ICU in hospitals, avoiding the saturation of national healthcare systems and therefore unnecessary deaths.

A qualitative prediction of the percentage of the population infected in the different regions of Spain was also performed using the suggested semi-empirical model. These predictions were compared against the extensive statistical programme of immunity prevalence carried out on the Spanish population from April to June 2020, published in early July, showing that the model provided reasonable results in April for most of the regions, although it predicted a medium-to-high infection level across the whole north and north-east, while the measured levels of seroprevalence gave low levels there. Some results obtained with this methodology were not intuitive according to the official information, the most counter-intuitive probably being the higher level of infection of Catalonia compared with the Madrid region. As said, in general the forecast provided a good view of the infection level.

The COVID-19 epidemic is still ongoing and knowledge will increase with time. In the near future, new outbreaks are foreseen in the countries where the first one was controlled, unless a vaccine or a cure is developed. Therefore, models will be needed to forecast the evolution again and to advise the authorities on the needs of the country's health system. Some characteristics of the virus needed to perform better predictions are still unknown, such as the loss of immunity of cured individuals or the influence of vitamin D on the severity of the disease.

A continuous watch of the disease is still needed to provide proper advice which can be used by policy makers.

**Author Contributions:** Conceptualization, Data Curation, Methodology, Software and Writing by J.C.M.; Validation, Writing—review and editing and Formal analysis by S.P. and A.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank all the colleagues who have constructively read and discussed the paper to improve its content, especially Ignacio Rodríguez and Asunción Núñez from Dragonfly Innovation Technologies, UK; Justin Smith from Public Health England, UK; and Alejandro Ubeda from the Ramon y Cajal Hospital, Spain.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*
