*3.3. Dichotomous Evaluation*

For the dichotomous evaluation, Figure 8 compares the number of 24-h extreme precipitation events recorded at all stations at different precipitation thresholds. Because the main emphasis of this study was on extreme events, the maximum of the ensemble forecasts was determined in addition to the ensemble mean. At the 25 mm threshold, the NWP models estimated precipitation better than IMERG, and among the NWP models, ECMWF was closest to the in situ observations. However, all products underestimated the number of events above 25 mm. At the 50 mm threshold, the UKMO predictions of the number of events were closest to the observations. In terms of the mean ensemble forecast, however, IMERG estimated the number of events more closely than the NWP models. It is worth noting that the performance of the NCEP model decreased significantly as the threshold increased. At the 75 mm threshold, UKMO was better than the other products, whereas NCEP had difficulty forecasting precipitation amounts of this magnitude. From 15 March 2019 to 2 April 2019, precipitation amounts of over 100 mm were reported on a number of days. At the 100 mm threshold, the in situ observations recorded 22 events, while NCEP detected none. The UKMO and ECMWF maximum ensembles and IMERG detected 11, 10, and 4 events, respectively.

**Figure 8.** Comparison of the number of extreme events estimated by the NWP models and the IMERG satellite product with observations at different precipitation thresholds.
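The event counts compared in Figure 8 can be illustrated with a minimal sketch. It assumes that an "event" is a station-day whose 24-h total is strictly above the threshold, and that ensemble members are collapsed to a single value per station-day by taking either the maximum or the mean. The function names and the toy values are hypothetical and are not taken from the study's data or code.

```python
THRESHOLDS_MM = (25, 50, 75, 100)  # thresholds used in this section


def collapse_ensemble(members_mm, mode="max"):
    """Reduce ensemble members to one value per station-day.

    members_mm: list of members, each a list of 24-h totals (mm)
    ordered by the same station-days.
    """
    per_case = list(zip(*members_mm))  # regroup values by station-day
    if mode == "max":
        return [max(vals) for vals in per_case]
    if mode == "mean":
        return [sum(vals) / len(vals) for vals in per_case]
    raise ValueError(f"unknown mode: {mode}")


def count_events(precip_mm, thresholds=THRESHOLDS_MM):
    """Count station-days above each threshold (strict '>' is an assumption)."""
    return {t: sum(1 for p in precip_mm if p > t) for t in thresholds}


# Hypothetical toy data: 3 ensemble members x 4 station-days (mm per 24 h)
members = [
    [10, 30, 60, 110],
    [20, 55, 80, 90],
    [5, 26, 40, 70],
]

max_counts = count_events(collapse_ensemble(members, "max"))
mean_counts = count_events(collapse_ensemble(members, "mean"))
print("max ensemble :", max_counts)
print("mean ensemble:", mean_counts)
```

In this toy case the maximum ensemble retains one event above 100 mm, whereas averaging the members smooths it away. This is one reason the maximum may be preferred when the focus is on extremes.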

The performance of all precipitation products was then evaluated using dichotomous evaluation criteria (Figure 9). Based on the POD metric, which measures the fraction of observed events that were correctly detected, ECMWF at the 25 mm threshold and UKMO at the 50, 75, and 100 mm thresholds showed the highest PODs over Iran. The decline in the performance of NCEP with increasing threshold is clearly evident in this criterion. Based on the FAR metric, which expresses the fraction of forecast events that were false alarms, ECMWF obtained the lowest FAR in the maximum ensemble forecasts at all thresholds, although it did not perform well at high thresholds in the mean ensemble forecast mode. Based on the ETS, which measures the quality of a forecast in detecting both the occurrence and non-occurrence of an event, the mean ensemble forecast of the ECMWF model achieved the best score at the 25 mm threshold. At the other thresholds, the maximum ensemble forecast of the ECMWF model achieved the highest ETS scores.

**Figure 9.** Verification statistics (POD, FAR, and ETS) between ECMWF, NCEP, UKMO, IMERG, and the in situ observations for different thresholds over Iran for 15 March 2019 to 2 April 2019.
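As an illustration of how the three scores in Figure 9 relate to each other, the following minimal sketch builds a 2 x 2 contingency table at one threshold and computes POD, FAR, and ETS. It assumes the standard contingency-table definitions of these scores, which may differ in detail from the formulation given in the study's methods, and it treats an event as a value strictly above the threshold. The function names and the toy forecast and observation values are hypothetical.

```python
def contingency(forecast_mm, observed_mm, threshold_mm):
    """Return (hits, misses, false_alarms, correct_negatives) at one threshold."""
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecast_mm, observed_mm):
        f_yes = f > threshold_mm  # forecast event (strict '>' is an assumption)
        o_yes = o > threshold_mm  # observed event
        if f_yes and o_yes:
            hits += 1
        elif o_yes:
            misses += 1
        elif f_yes:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def scores(hits, misses, false_alarms, correct_negatives):
    """Compute POD, FAR, and ETS from contingency-table counts."""
    nan = float("nan")
    n = hits + misses + false_alarms + correct_negatives
    # POD: fraction of observed events that were forecast
    pod = hits / (hits + misses) if hits + misses else nan
    # FAR: fraction of forecast events that did not occur
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    # Hits expected by chance, used to correct the threat score
    hits_random = (hits + misses) * (hits + false_alarms) / n if n else nan
    denom = hits + misses + false_alarms - hits_random
    ets = (hits - hits_random) / denom if denom else nan
    return {"POD": pod, "FAR": far, "ETS": ets}


# Hypothetical toy data: six station-days of 24-h precipitation (mm)
observed = [30, 60, 10, 80, 5, 40]
forecast = [35, 20, 28, 90, 2, 10]

table = contingency(forecast, observed, threshold_mm=25)
result = scores(*table)
print(table, result)
```

In this toy case the forecast detects half of the observed events (POD = 0.5) with a FAR of one third, yet the ETS is zero because the number of hits equals the number expected by chance. A product can therefore rank well on POD or FAR while ranking poorly on ETS.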

Finally, the number of events estimated by the precipitation products at different thresholds was evaluated for individual basins. According to Figure 10, in the Gorganrud Basin at the 25 mm threshold, both the UKMO and ECMWF models performed well. IMERG and NCEP performed similarly, detecting seven of the 13 events at this threshold, while ECMWF and UKMO were more robust and closer to the in situ observations, detecting 12 events. At the 50 mm threshold, both ECMWF and UKMO still performed better than the other products, detecting seven of the 11 events recorded by the in situ observations, while the skill of NCEP declined and IMERG showed an improvement, with the detection of two and six events, respectively. At the 75 mm threshold, both the ECMWF and UKMO models performed well and forecast five of the seven events (about 71%). The NCEP model did not record any event, and IMERG estimated only one of the seven events. At the 100 mm threshold, both ECMWF and UKMO forecast five of the six events above 100 mm. IMERG and the NCEP model were weak in this category and did not estimate any events in the Gorganrud Basin in northeast Iran.

**Figure 10.** Comparison of the number of extreme events estimated by ECMWF, NCEP, UKMO, and IMERG with in situ observations.

In the Karkheh Basin, the NCEP model performed well at the 25 mm threshold and estimated the number of events better than the other products. However, the performance of NCEP decreased dramatically as the threshold increased (e.g., only one of 15 events was forecast at the 75 mm threshold, and none of five events at the 100 mm threshold). Meanwhile, IMERG performed better at higher thresholds. Generally, ECMWF produced better results than the other products in all precipitation categories.

In the Karun Basin, NCEP again performed better at lower thresholds than at higher thresholds, whereas UKMO was slightly better than ECMWF. However, extreme precipitation greater than 100 mm was recorded only once in this basin.

#### **4. Discussion**

In this study, the evaluation of three precipitation forecast models, namely ECMWF, NCEP, and UKMO, as well as the IMERG-RT V05B satellite product, provided new insight into how errors vary with extreme precipitation events within different climate zones of Iran. Overall, the examined products agreed well with the in situ observations in some instances and differed significantly in others. Regarding possible causes of the differences in model performance, precipitation depends on the available atmospheric moisture and is driven by moisture convergence. Thus, the models need to be correctly initialized and parameterized with respect to several factors, such as (i) the gross condensation rate, (ii) the latent heat energy exchange within the atmosphere, and (iii) the microphysical behavior of clouds [30]. However, individual clouds typically occur at subgrid scales and must be parameterized based on resolved variables such as average humidity and temperature [31]. The parameterization of clouds, and thus precipitation, continues to be one of the greatest sources of uncertainty in NWP models [32].

Another issue is the significant difference in the detection of extreme precipitation amounts among the products when compared with the in situ observations. Although the models generally captured the spatial distribution of heavy precipitation events, the hot spots were not located in the correct areas. Moreover, orography and local effects can affect the accuracy of the products. These issues should be addressed by improving the models' algorithms [33].

Another factor in interpreting the differences between the model/satellite products and the in situ observations might be related to the precipitation thresholds. Overestimation or underestimation at a given threshold means that a precipitation product was not able to estimate/detect precipitation within that particular category, although it may have estimated/detected the precipitation within a lower or higher category.
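The threshold effect described above can be made concrete with a small sketch. It classifies a single forecast and observation pair at each of the four thresholds, assuming that an event means a value strictly above the threshold. The function name and the 48 mm and 52 mm values are hypothetical and chosen only for illustration.

```python
def classify(forecast_mm, observed_mm, threshold_mm):
    """Label one forecast/observation pair at a single threshold."""
    f_yes = forecast_mm > threshold_mm  # strict '>' is an assumption
    o_yes = observed_mm > threshold_mm
    if f_yes and o_yes:
        return "hit"
    if o_yes:
        return "miss"
    if f_yes:
        return "false alarm"
    return "correct negative"


# Hypothetical case: the forecast is only 4 mm below the observation
forecast_mm, observed_mm = 48.0, 52.0
outcomes = {t: classify(forecast_mm, observed_mm, t) for t in (25, 50, 75, 100)}
print(outcomes)
```

Here a forecast that is only 4 mm too low counts as a hit at 25 mm but as a miss at 50 mm, so a small amount error near a threshold boundary can appear as a detection failure in that category.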
