*3.3. Ensemble Assessment*

The proportion of simulated VLF counts equal to or below the observed value varied by region and time period. How often this proportion fell within the central 95% of the simulated VLF counts indicates the overall quality of the multi-model average forecasts. By this metric, ensemble quality was highest in regions where the central 95% of simulated VLF counts covered the observed VLF counts in all three time periods, which was the case in the Marine Regime Mountains Redwood Forest, Prairie, Hot Continental, Temperate Desert Regime Mountains, and Savanna divisions. In as many regions, the observed counts were covered in only two time periods, either testing and tuning or testing and training. This was observed in the Temperate Steppe Regime Mountains, Temperate Steppe, Temperate Desert, Mediterranean Regime Mountains, and Tropical/Subtropical Steppe divisions. In the Warm Continental, Subtropical, and Tropical/Subtropical Desert divisions, predictive performance was poor in the tuning and training time periods but good during the testing time period. In the Mediterranean division, the ensembles performed well on the tuning and training time periods but poorly when predicting data they had not been optimized on. The lowest model quality was seen in the Tropical/Subtropical Regime Mountains, where the central 95% of simulated VLF counts covered the observed VLF counts in the tuning time period only, and in the Hot Continental Regime Mountains, where it never covered the observed count.
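The coverage check described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and it assumes that "central 95%" means the proportion of simulated counts at or below the observed count lies between 0.025 and 0.975, endpoints included.

```python
def proportion_at_or_below(simulated_counts, observed_count):
    """Fraction of simulated VLF counts that are <= the observed count."""
    n_at_or_below = sum(1 for c in simulated_counts if c <= observed_count)
    return n_at_or_below / len(simulated_counts)


def observed_is_covered(simulated_counts, observed_count, level=0.95):
    """True when the proportion lies in the central `level` band.

    Assumption: for level=0.95 the band is [0.025, 0.975], endpoints included.
    """
    tail = (1.0 - level) / 2.0
    p = proportion_at_or_below(simulated_counts, observed_count)
    return tail <= p <= 1.0 - tail


# Toy example with made-up counts (not data from the study).
simulated = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(proportion_at_or_below(simulated, 2))  # 6 of 10 counts are <= 2
print(observed_is_covered(simulated, 2))     # inside the central band
print(observed_is_covered(simulated, 9))     # above every simulated count
```

Under this reading, a region would rank highest when `observed_is_covered` holds for the tuning, training, and testing periods alike.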

**Figure 7.** Simulated and observed very-large fire (VLF) month counts for each region and three time periods: 1984–2005, 2006–2015, and 2016. For each of the historical (grey), RCP 4.5 (blue), and RCP 8.5 (red) scenarios, a sample of 100,000 simulated VLF counts was produced by randomly selecting a VLF probability time series from the posterior and then randomly generating a VLF occurrence time series from it. The observed VLF counts are represented with arrows.
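The two-step sampling in the caption of Figure 7 might look like the following sketch. It rests on assumptions not stated in the text: the posterior is stored as a list of probability time series (a hypothetical layout), each time step is treated as an independent Bernoulli trial, and the function name and arguments are illustrative only.

```python
import random


def simulate_vlf_counts(posterior_prob_series, n_sim=100_000, seed=None):
    """Draw simulated VLF counts for one region, period, and scenario.

    posterior_prob_series: list of posterior draws, each a list of VLF
    probabilities with one entry per time step (hypothetical layout).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sim):
        # Step 1: pick one VLF probability time series from the posterior.
        probs = rng.choice(posterior_prob_series)
        # Step 2: generate an occurrence series (one Bernoulli draw per
        # time step, an assumption) and count the steps with a VLF.
        counts.append(sum(1 for p in probs if rng.random() < p))
    return counts


# Toy posterior with two draws over three time steps (made-up values).
toy_posterior = [[0.1, 0.2, 0.05], [0.3, 0.1, 0.2]]
counts = simulate_vlf_counts(toy_posterior, n_sim=1000, seed=0)
print(min(counts), max(counts))
```

Drawing a new probability series for every simulated count carries the posterior uncertainty in the probabilities through to the spread of the counts.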

Consistent underestimation, where the observed VLF count was equal to or greater than the median simulated VLF count in all three time periods, occurred in nine of the sixteen regions considered. The magnitude of these underestimates ranged from very minor, as in the Temperate Desert, to quite severe, as in the Hot Continental Regime Mountains. Consistent overestimation was much less frequent: only in the Marine Regime Mountains Redwood Forest and Warm Continental divisions were the observed VLF counts equal to or less than the median simulated VLF counts in every time period. In the remaining five regions, the observed VLF count fell above or below the simulated median depending on the time period considered. Of these, simulations for the Temperate Steppe, Prairie, and Tropical/Subtropical Steppe divisions tended to underestimate the observed VLF counts, while those for the Temperate Steppe Regime Mountains and Hot Continental divisions tended to overestimate them.
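The grouping of regions by bias can be expressed as a small rule. The sketch below is illustrative only. The dictionary layout, function name, and labels are hypothetical, and the handling of a region whose observed count equals the median in every period is an assumption, since the definitions above do not separate that case.

```python
from statistics import median


def classify_bias(simulated_by_period, observed_by_period):
    """Label a region by where observed counts sit relative to simulated medians.

    Both arguments are dicts keyed by time-period name (hypothetical layout).
    """
    medians = {k: median(v) for k, v in simulated_by_period.items()}
    at_or_above = all(observed_by_period[k] >= medians[k] for k in medians)
    at_or_below = all(observed_by_period[k] <= medians[k] for k in medians)
    if at_or_above and at_or_below:
        # Observed equals the median in every period; labeling is an assumption.
        return "on median"
    if at_or_above:
        return "consistent underestimate"
    if at_or_below:
        return "consistent overestimate"
    return "mixed"


# Toy example with made-up counts (not data from the study).
simulated = {"tuning": [1, 2, 3], "training": [0, 1, 2], "testing": [0, 0, 1]}
observed = {"tuning": 4, "training": 1, "testing": 1}
print(classify_bias(simulated, observed))  # observed >= median in all periods
```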

The simulated distributions did not appear to be strongly sensitive to the choice of RCP scenario during the training (2006–2015) and testing (2016) time periods, as the distributions under RCP 4.5 and RCP 8.5 differed only slightly in those periods (Figure 7).

#### **4. Discussion**
