## *5.1. Simulation Set-Up*

The simulation study covers a wide range of scenarios, each evaluated using 30 replicates. Each replicate is constructed as follows. First, the actual Spanish gross domestic product data to be predicted in our data analysis are used to obtain parameter estimates ($\hat{\mu}$, $\hat{\phi}$ and $\hat{\sigma}$) for a standard autoregressive process of order one, $y_t = \mu + \phi y_{t-1} + \sigma \epsilon_t$. These estimates are used in each replicate to generate a preliminary simulated target time series, $\{\tilde{y}^*_t\}$. This preliminary target is then used to generate simulated predictions for each institution, $\{\tilde{y}_{i,t}\}$, by adding noise with different intensities parameterized by its standard deviation, that is, $\tilde{y}_{i,t} = \tilde{y}^*_t + \sigma_\eta \eta_t$ (all noises considered, i.e., $\epsilon_t$, $\eta_t$ and $\varepsilon_t$, are independent standard Gaussian noises). These simulated predictions are then aggregated using simulated weights, $\{\tilde{\omega}_i\}$. The simulated weights depend on the number of key agents (institutions) considered. For 100% of key agents, the simulated weights are set to equal weights. For 40% and 10% of key agents, that percentage of the total number of institutions is randomly selected and randomly assigned uniform weights between 0.5 and 1; the remaining institutions are assigned a negligible weight, and all weights are rescaled to add up to one. These simulated weights are used to produce the final simulated target time series $\tilde{y}_t = \sum_i \tilde{\omega}_i \tilde{y}_{i,t} + \sigma_\varepsilon \varepsilon_t$ (with $\sigma_\varepsilon$ fixed at 0.1 to introduce some, but not much, deviation from the direct aggregate). Algorithm performance for different such simulated target time series is analyzed by varying the following parameters: the number of institutions, the sample size, the percentage of key agents and the noise standard deviation.
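For concreteness, the following minimal sketch (Python/NumPy) illustrates one way of generating a single replicate under the set-up just described. The AR(1) parameter values `mu_hat`, `phi_hat` and `sigma_hat` are placeholders: the actual values are the estimates obtained from the Spanish GDP data and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_replicate(n_inst, T, key_pct, sigma_eta,
                       mu_hat=2.0, phi_hat=0.5, sigma_hat=1.0, sigma_eps=0.1):
    """Generate one replicate: final target, institution predictions, true weights."""
    # Preliminary target from the fitted AR(1): y*_t = mu + phi * y*_{t-1} + sigma * e_t
    y_star = np.empty(T)
    y_star[0] = mu_hat / (1.0 - phi_hat)  # start near the unconditional mean
    for t in range(1, T):
        y_star[t] = mu_hat + phi_hat * y_star[t - 1] + sigma_hat * rng.standard_normal()

    # Institution predictions: preliminary target plus Gaussian noise of SD sigma_eta
    preds = y_star + sigma_eta * rng.standard_normal((n_inst, T))

    # Simulated weights: equal for 100% key agents; otherwise a random subset of
    # key agents gets Uniform(0.5, 1) weights and the rest a negligible weight
    if key_pct >= 1.0:
        w = np.full(n_inst, 1.0 / n_inst)
    else:
        w = np.full(n_inst, 1e-6)
        key = rng.choice(n_inst, size=round(key_pct * n_inst), replace=False)
        w[key] = rng.uniform(0.5, 1.0, size=key.size)
        w /= w.sum()  # rescale to add up to one

    # Final target: weighted aggregate plus a small perturbation of SD sigma_eps
    y = w @ preds + sigma_eps * rng.standard_normal(T)
    return y, preds, w
```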

The number of institutions or agents takes values 10, 20 and 40. The first two values are slightly under and slightly over the number of institutions in our data analysis (Section 4, Table 1). The third value corresponds to an ideal, large number of institutions. The sample size ($T$) takes values 6, 10 and 20. The first value matches the number of observations available in our data analysis, and the other two values correspond to reasonable and desirable horizons, respectively. The percentages of key agents considered are 10%, 40% and 100%, with the latter corresponding to all institutions weighting equally in generating the target time series. The noise standard deviation (SD), $\sigma_\eta$, takes values 0.1, 0.2 and 0.3. While the first two values are appropriate for near-future forecasts (e.g., forecasts for a given year made in December of that same year), the last value corresponds to forecasts further into the future (e.g., for a given year made in July of the preceding year).
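Combining these design values with the replicate generator sketched above, the full grid of scenarios can be enumerated as follows (30 replicates per scenario, as stated at the start of Section 5.1):

```python
from itertools import product

N_REP = 30  # replicates per scenario
grid = product((10, 20, 40),        # number of institutions
               (6, 10, 20),         # sample size T
               (0.10, 0.40, 1.00),  # percentage of key agents
               (0.1, 0.2, 0.3))     # noise SD sigma_eta

for n_inst, T, key_pct, sigma_eta in grid:
    for rep in range(N_REP):
        y, preds, w = simulate_replicate(n_inst, T, key_pct, sigma_eta)
        # ... apply the proposed algorithm and the naive average here ...
```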

## *5.2. Simulation Results*

The results from the simulation study are as expected (Tables 3 and 4). The proposed algorithm becomes preferable to the simpler, naive overall average as the length of the target time series increases and as the number of both institutions and key institutions decreases. The simulation study reveals that the root average square error can more than double when using the naive algorithm instead of the proposed one. Also, while the results show a good number of improvements in relative error of over 20%, the losses do not seem to exceed around 12%.
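As a point of reference, a plausible reading of the Table 3 metric is sketched below: positive values indicate that the naive average has a larger root average square error than the proposed algorithm. The exact convention used in the table may differ slightly.

```python
import numpy as np

def root_average_square_error(forecast, target):
    """Root average square error, averaging over years."""
    return float(np.sqrt(np.mean((np.asarray(forecast) - np.asarray(target)) ** 2)))

def relative_change_pct(rase_naive, rase_proposed):
    """Relative change (%) of the naive average w.r.t. the proposed algorithm."""
    return 100.0 * (rase_naive - rase_proposed) / rase_proposed
```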

The results in terms of weight recovery are shown in Table 4. We assess weight recovery via the Kullback–Leibler divergence between true and recovered weights. A small Kullback–Leibler divergence between these weights is linked to the improvements in forecast error that the simulation study identifies when applying the proposed algorithm. The results from Table 4 are in agreement with those from Table 3.
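For completeness, the weight-recovery metric can be computed as below. We assume the divergence is taken from the true weights to the recovered ones (the paper's orientation may be the reverse), and a small floor avoids taking logarithms of the near-zero weights assigned to non-key agents.

```python
import numpy as np

def kl_divergence(w_true, w_recovered, eps=1e-12):
    """Kullback-Leibler divergence D(w_true || w_recovered) between two
    weight vectors (non-negative, renormalized to sum to one)."""
    p = np.clip(np.asarray(w_true, dtype=float), eps, None)
    q = np.clip(np.asarray(w_recovered, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```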

The so-called *forecast combination puzzle* refers to the repeated empirical finding that simple combinations of point forecasts outperform elaborate weighted combinations [28]. Smith and Wallis [28] pointed to finite-sample errors in weight estimation as a likely culprit. More recently, Genre et al. [13] establish that "we would not conclude that there exists a strong case for considering combinations other than equal weighting as a means of better summarizing the information collected as part of the regular quarterly rounds" of the Survey of Professional Forecasters. Our findings agree with this literature, both from the empirical perspective and from that of the simulation study. This agreement complements the main contribution of this paper, namely connecting the information theory literature with the machine learning literature in the context of forecast combination. The success of equal weighting for forecast combination can also be linked to the fact that forecasting institutions tend to form a well-informed consensus, which benefits simultaneously from a herd effect [29] and a wisdom-of-the-crowds effect [30].


**Table 3.** Relative changes (in %) of root average square error (averaging over years) of the arithmetic average of simulated institution predictions ("naive2" in Table 2) with respect to the proposed algorithm. The parameters are the number of institutions or agents and the sample size $T$ (inner subtable dimensions), and the percentage of key agents and the noise standard deviation (outer dimensions).


**Table 4.** Kullback–Leibler divergence between true and recovered weights.

## **6. Concluding Remarks**

According to prediction and sampling theories, the forecasting errors and variances of single forecasts can be reduced by combining individual predictions. Traditional methods for combining forecasts are based on assessing the relative past performance of the forecasters to be combined. The problem, however, becomes indeterminate as soon as the number of forecasters is larger than the number of past results. To overcome this issue, an alternative is to assume some set of a priori weights and to apply the principle of maximum entropy to obtain a set of a posteriori weights, subject to the constraint that the combined predictions equal the realized values. Unfortunately, this is a complex problem that grows with the cardinality of the variables, and finding a solution is not guaranteed.
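To make the indeterminacy concrete, the following sketch poses the maximum-entropy problem numerically: minimize the Kullback–Leibler divergence of the weights from the a priori weights, subject to the combined predictions exactly matching the realized values. This is an illustration of the generic formulation, not the paper's exact program; as the comments note, the solver may simply fail when the constraints cannot be met.

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_weights(Y, a, w0):
    """Weights w minimizing KL(w || w0) subject to exact reproduction of the
    realized values: Y.T @ w == a, where Y holds one row per forecaster and
    one column per period. Feasibility is NOT guaranteed, which is precisely
    the difficulty discussed in the text."""
    eps = 1e-12

    def rel_entropy(w):
        w = np.clip(w, eps, None)
        return float(np.sum(w * np.log(w / w0)))

    cons = ({"type": "eq", "fun": lambda w: Y.T @ w - a},    # match realized values
            {"type": "eq", "fun": lambda w: w.sum() - 1.0})  # weights sum to one
    res = minimize(rel_entropy, w0, bounds=[(0.0, 1.0)] * len(w0),
                   constraints=cons, method="SLSQP")
    return res.x if res.success else None  # None: no solution found
```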

In order to reach a solution within the information theory framework, we propose a fresh approach to the problem and, inspired by the machine learning literature, we suggest a new specification based on regularization regression and an algorithm to solve it. The new approach always produces a solution and is, moreover, quite flexible: it permits the use of different norms to measure the discrepancies between the combined predictions and the realized values, and to weight the relative importance of those discrepancies. Our regularization approach also has the advantage of producing, as a by-product, the weights assigned to the different forecasters. These weights can be understood as a measure of the forecasters' ability and used as a tool to decide which methodologies deserve more credit.
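A minimal sketch of the regularization idea, assuming a ridge penalty that shrinks the weights toward the a priori weights $w_0$ (not necessarily the paper's exact Equation (2)): replacing the hard constraint with a penalized least-squares objective guarantees a closed-form solution for any penalty $\lambda > 0$.

```python
import numpy as np

def ridge_weights(Y, a, w0, lam):
    """Solve  min_w ||a - Y.T @ w||^2 + lam * ||w - w0||^2.
    Setting the gradient to zero gives the closed form
        w = (Y @ Y.T + lam * I)^{-1} (Y @ a + lam * w0),
    which always exists for lam > 0 -- in contrast with the
    exactly-constrained maximum-entropy problem above."""
    n = Y.shape[0]
    return np.linalg.solve(Y @ Y.T + lam * np.eye(n),
                           Y @ a + lam * np.asarray(w0, dtype=float))
```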

Further flexibility could be introduced in our model, for instance, by substituting in Equation (2) the single prediction values with prediction functions (for example, regression equations). In this case, the parameters of such prediction functions would be estimated simultaneously, during the cross-validation step. This would enable us to apply our proposal in one step when, for instance, we try to obtain, from a set of national forecasts, a prediction for a regional economy where single forecasts are not available. We could replace the (unavailable) single regional forecasts with a parametrized function (e.g., a dynamic regression equation) of the national values.

In our algorithm, we have considered a quadratic norm (a ridge penalty) and rolling-origin evaluation as the cross-validation strategy. Obviously, other penalties (e.g., lasso or elastic net) are also possible and, likewise, there is room for implementing other methods of cross-validation. For instance, we could explicitly ignore the temporal order of the data in the training sets and carry out leave-one-out cross-validation. In the end, the relative importance of the most recent predictions can be implicitly included in our specification through the $\delta$ coefficients.
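As an illustration of the rolling-origin strategy mentioned above (reusing the hypothetical `ridge_weights` helper sketched earlier), the penalty can be chosen by fitting on an expanding window and scoring one-step-ahead errors; the paper's actual implementation may differ in its details.

```python
import numpy as np

def select_lambda_rolling_origin(Y, a, w0, lambdas, min_train=3):
    """Pick the ridge penalty by rolling-origin evaluation: for each candidate
    lambda, fit on periods 0..t-1, predict period t, and average the squared
    one-step-ahead errors over the rolling origins."""
    T = Y.shape[1]
    scores = []
    for lam in lambdas:
        errs = [(ridge_weights(Y[:, :t], a[:t], w0, lam) @ Y[:, t] - a[t]) ** 2
                for t in range(min_train, T)]
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]
```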

Regarding our application, as is common practice, we have used the latest reliable available GDP figures as realized values, $a_t$. (All countries elaborate several vintages of GDP, and national accounts are regularly revised as statistical information is enlarged. For instance, in the case of Spain, the estimates for each year undergo three revisions before they are considered definitive [31].) In our opinion, however, this is not the best strategy for a "combiner" of macroeconomic forecasts to follow. Instead, flash estimates should be used. Flash estimates, though the most provisional and least reliable figures, are the most appealing and attract the strongest attention. On the one hand, they occupy the front pages of the media and are the ones most analysed, debated and commented on; revised and definitive data, published three to four years later, attract little public interest. On the other hand, and more importantly, the flash estimates serve as a framework for decision-making by economic stakeholders, decisions which may give rise to rights and obligations: budgetary stability commitments in the EU, ceilings on general government expenditure, and the size of the deficit or government debt allowed. Using flash estimates may thus entail marked consequences for the weights each forecaster receives.

The key contribution of this paper is to link the maximum-entropy inference methodology from the information theory literature with regularization from the machine learning literature, with the ultimate goal of combining forecasts. Although one might envisage linking forecast combination algorithms other than regularization (e.g., boosting or bagging) with the information theory literature, it does not seem immediately clear how this could be done. Such immediacy seems to be one of the advantages of regularization over alternative algorithms when it comes to connecting the machine learning and information theory literatures.

**Author Contributions:** Conceptualization, C.B., P.E., P.H. and J.M.P.; Methodology, C.B., P.E., P.H. and J.M.P.; Software, C.B., P.E., P.H. and J.M.P.; Formal analysis, C.B., P.E., P.H. and J.M.P.; Investigation, C.B., P.E., P.H. and J.M.P.; Resources, C.B., P.E., P.H. and J.M.P.; Data curation, P.E. and J.M.P.; Writing–original draft preparation, C.B., P.E., P.H. and J.M.P.; Writing–review and editing, C.B., P.E., P.H. and J.M.P.; Supervision, C.B., P.H. and J.M.P.; Project administration, J.M.P.; Funding acquisition, P.H. and J.M.P.

**Funding:** The authors acknowledge the support of the Generalitat Valenciana through the agreement "Desarrollo y mantenimiento de las previsiones macroeconómicas de la Comunitat Valenciana" (Consellería de Economía Sostenible, Sectores Productivos, Comercio y Trabajo) and the project AICO/2019/053 (Consellería d'Innovació, Universitats, Ciència i Societat Digital). The authors also acknowledge the support of the Spanish Ministry of Science, Innovation and Universities and the Spanish Agency of Research, co-funded with FEDER funds, project ECO2017-87245-R.

**Acknowledgments:** The authors wish to thank two anonymous reviewers for their valuable comments and suggestions, and the Guest Editors and Journal Editors for their help and kindness. They would also like to thank Marie Hodkinson for revising the English of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.
