**Rainfall Intensity-Duration-Frequency Relationship. Case Study: Depth-Duration Ratio in a Semi-Arid Zone in Mexico**

**Ena Gámez-Balmaceda 1,2, Alvaro López-Ramos 3, Luisa Martínez-Acosta 2,3, Juan Pablo Medrano-Barboza 3, John Freddy Remolina López 4, Georges Seingier 1, Luis Walter Daesslé <sup>1</sup> and Alvaro Alberto López-Lambraño 1,2,5,6,\***


Received: 9 September 2020; Accepted: 12 October 2020; Published: 15 October 2020

**Abstract:** Intensity–Duration–Frequency (IDF) curves describe the relationship between rainfall intensity, rainfall duration, and return period. They are commonly used in the design, planning and operation of hydrologic, hydraulic, and water resource systems. Considering the intense rainfall presence with flooding occurrences, limited data used to develop IDF curves, and importance to improve the IDF design for the Ensenada City in Baja California, this research study aims to investigate the use and combinations of pluviograph and daily records, to assess rain behavior around the city, and select a suitable method that provides the best results of IDF relationship, consequently updating the IDF relationship for the city for return periods of 10, 25, 50, and 100 years. The IDF relationship is determined through frequency analysis of rainfall observations. Also, annual maximum rainfall intensity for several duration and return periods has been analyzed according to the statistical distribution of Gumbel Extreme Value (GEV). Thus, Chen's method was evaluated based on the depth-duration ratio (*R*) from the zone, and the development of the IDF relationship for the rain gauges stations was focused on estimating the most suitable (*R*) ratio; chosen from testing several methods and analyzing the rain in the region from California and Baja California. The determined values of the rain for one hour and return period of 2 years (*P*<sup>2</sup> <sup>1</sup>) obtained were compared to the values of some cities in California and Baja California, with a range between 10 and 16.61 mm, and the values of the (*R*) ratio are in a range between 0.35 and 0.44; this range is close to the (*R*) ratio of 0.44 for one station in Tijuana, a city 100 km far from Ensenada. The values found here correspond to the rainfall characteristics of the zone; therefore, the method used in this study can be replicated to other semi-arid zones with the same rain characteristics. Finally, it is suggested that these results of the IDF relationship should be incorporated on the Norm of the State of Baja California as the recurrence update requires it upon recommendation. This study is the starting point to other studies that imply the calculation of a peak flow and evaluation of hydraulic structures as an input to help improve flood resilience in the city of Ensenada.

**Keywords:** hydrologic statistics; flood design; extreme rainfall intensity time series

#### **1. Introduction**

Intensity–Duration–Frequency (IDF) relationship, or IDF curves, is a representation of intense rainfall events that allows for calculation of a peak flow needed to design hydraulic structures (e.g., storm sewers, culverts, drainage systems), to assess and predict flood hazard, and design flood protection structures [1–3]. Most of these structures were designed in many developing countries a long time ago without an updated IDF—remarkable, since rain is a variable that changes with space and time. Consequently, any update from the IDF relationship in urban catchments will necessarily imply revision and modification of the local standard structural designs [4]. The updated IDF design would also help to predict flood risk occurrences and map flood hazards of the expected peak flow. This is especially relevant in arid and semiarid zones where rain characteristics indicate that yearly variation in storms is very large, and the intensity of rare storms is always very high for a brief period, and so therefore the flooding is of sudden occurrence and rapid rise [5]. Northwest Mexico is a semi-arid region with low annual average rainfall; however, with the presence of rainfall intensities associated with climate variability, that has caused flooding [6–8].

The Intensity-Duration-Frequency relationship design is based on the measurement of peak rainfall events, regarding duration and developed for a certain recurrence interval or return period [9], where accuracy depends on the rainfall characteristics, such as magnitude, frequency, and duration [10,11]. The analyzed data is the precipitation time series, modeled for future projections at a regional scale. These projections indicate that the precipitation return period tends to increase if the climate conditions do not change [12,13]. However, due to Climate Change, there are uncertainties of intense rainfall occurrence affecting the return period of the IDF design, which in turn tends to decrease in some global regions [14]. This issue makes it necessary for trend analysis of the extreme rainfall events to design or update the IDF relationship.

The best estimation of the precipitation intensity (unit/time) is directly obtained from the automatic (pluviograph) weather station that measures the sub-hourly rain, and for which the records are automatically transmitted every five or ten minutes [15,16]. There is a low density of this kind of weather station in the Mexican territory, though. The separation between stations should be between 5 and 30 km [17,18] according to the Manual on the Global Observing System proposed by the World Meteorological Organization. However, distance between the stations in Mexico is 70 km on average; thus, there is a lack of information between stations along space and time. In the country, and only since 1999, the available sub-hourly rain records have been registered through automatic weather stations (EMAs), by the National Meteorological Service (SMN).

The Ensenada's city has two IDF designs: one published as an Official Norm by the State of Baja California, with daily rainfall records from 1948 to 2008, just for one station [19]; and the other, by the Federal Communication and Transportation Ministry (SCT) published in 2000, and whose analyzed time series are unknown [20]. These are the only IDF studies found in the literature for the study area, and both serve as official Mexican documents [21,22]. Due to a lack of automatic stations, none of these studies used the pluviograph records from the last two decades (2000–2020), to assess extreme rainfall events necessary to develop the IDF curves. This problem of a lack of data and weather stations is evident in several cities in Mexico; however, to minimize the inconvenience of the periodicity of measurements and the distance between weather stations, most studies in Mexico have used different empirical methods to estimate the IDF relationship based on daily rainfall from standard rain gauges (pluviometric), suggesting that the Chen method [23] is the most appropriate for the IDF estimation [24,25].

Considering flood occurrence caused by intense rains, the insufficient studies on the IDF relationship, lack of pluviographic information, and other important aspects for planning hydraulic structures in Ensenada city and its surroundings, it is necessary to carry out a detailed analysis of the Intensity–Duration–Frequency relationship for the area. Therefore, the purpose of this study is to analyze, estimate, and propose the Depth–Duration–Ratio (*R*), appropriate for the characteristics of the rains that occur in the study area, to obtain representative IDF curves. For this, it is necessary to estimate the rainfall of one hour, and the return period (*P*<sup>2</sup> <sup>1</sup>) of two years. (*P*<sup>2</sup> <sup>1</sup>) is traditionally derived from the proposals made by Hershfield [26], Reich [27], and Bell [28] in areas where only pluviometric information is available. However, based on pluviographic information obtained in automatic stations, it was possible to adjust the expression of Bell [28] that represents the characteristics of the rainfall occurring in the study area. Based on this, the Chen method [23] is used to obtain the IDF curves and consequently their updating for the return periods of 10, 25, 50, and 100 years. Additionally, the spatial distribution of the rainfall-duration ratio (*R*) was obtained, which facilitates obtaining the IDF relationship with the Chen method [23] in places where there are no pluviographic and standard rain gauge stations.

Finally, the adequate estimation of IDF curves will allow and guarantee the planning and optimal design of hydraulic structures. Future flood hazard studies and consequently disaster risk management will also benefit directly from this work.

#### **2. Materials and Methods**

The urban zone of Ensenada and its surroundings has been selected as the study area, located on the Pacific coast of Baja California (31◦30'–31◦60'N; 116◦50'–116◦10'W. See Figure 1). This zone, in the northwest of Mexico, has a semiarid climate, with convective type rains characterized for being intense and of short term. The average annual rainfall is 273 mm, where the rainy season usually occurs between November and April. Topographic landscape is variable, with steep slopes, alluvial valleys, and an alluvial coastal plain [29]. This area is divided into four urban and seven rural subbasins that drain toward the Pacific coast. There are 11 weather stations inside the basins located at different elevations, that contain the main data to achieve the purpose of the study (Figure 1).

**Figure 1.** Study area location showing sub-basins and distribution of weather stations.

The required data to develop the IDF relationship in the study zone were the maximum rainfall events, annually recorded at different durations; obtained from four automatic stations (pluviograph records). On the other hand, historical rainfall data of ten standard rain gauges (daily rainfall records) were collected and included in the calculations (Table 1).


**Table 1.** Rain gauge network from the study zone.

5-min: automatic stations, 24 h (daily): rain gauge stations. Lat: latitude, Long: longitude.

Based on the characteristics of the study area and available precipitation data, the methodology was defined with the steps shown in Figure 2.

**Figure 2.** General diagram of the methodology.

Table 1 introduces four automatic stations. The Emilio Lopez Zamora station is the automatic station with more recorded data. It has 20 years of records, with measures from every ten minutes. This station is managed by a governmental agency called Comisión Nacional del Agua (CONAGUA). The other three automatic stations scarcely have 10 years of records, with measures for every five minutes. They are CICESE (CIC), Guadalupe (2036), and Ojos Negros (2035); all three were administered by Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE). From these data, the maximum rainfall events of each year were identified and classified by their magnitude and duration in order to estimate the IDF curves.

Identification of each event was based on the continuous pluviograph records, from the rain start, until it stopped for more than one hour; e.g., if the rain ended at any time, but it suddenly started before an hour passed, this record was considered to be part of the previous event. Nonetheless, it was also considered a separate event during the day if, after one event had occurred, there were two hours without rain, and it started to rain again. Once all events were identified, they were classified by duration and magnitude to select the most intense of every year. Intensity is defined by the rain magnitude in mm, registered for a determined time; e.g., there are events with the same magnitude and different durations in hours, where the most intense is always the one with the shortest duration. Measures of the events recorded by the four automatic stations had different durations. In some automatic stations, the longest event recorded (not the most intense), was one of 180 min. Other intense events had a duration of only 10 minutes. However, most of the intense events had a duration of 120 min; therefore, this duration was selected to identify the intense events of each year. López–Lambraño developed this method to identify and classify the intense events [30], and it has been satisfactorily applied by Maldonado et al. [31].

Regarding rainfall information, there are 10 stations around the city of Ensenada (Table 1), which have records of maximum rainfall in 24 h, from 1940 to 2011. In the same location of these stations there are three automatic stations (2072, 2036 and 2035), which have data from 1999 to 2019. Thus, the time series of the pluviometric data of these three stations were completed from the pluviograph records (automatic stations); e.g., for station 2072 a time series of 79 years was obtained. This was achieved through the daily accumulation of rainfall recorded every 10 min in that station until 2019. This same procedure was performed for stations 2036 and 2035.

A trend analysis was carried out on the pluviographic and pluviometric data, to determine the trend lines and verify if the registered rainfall tends to increase or decrease in the area. The stationarity in all the stations was also verified through the Augmented Dickey–Fuller test, which is a unit root test that allows accepting or rejecting the stationarity hypothesis in a time series [32]. Subsequently, the Chi square goodness and fit test was performed for the precipitation data analyzed with the probability distribution functions of Gumbel type I, Log-normal, Frechet and double Frechet to verify which of them presented the best fit.

The development of the IDF relationship is a procedure that starts with data availability, type of records, years of records, quality, and coverage, to choose an efficient method and claim the best results. Two methods, for different conditions, were applied in this case to estimate the IDF relationship for the return periods of 10, 25, 50, and 100 years. These periods, proposed by the National Agency for Prevention of Disasters, are suitable for the design of minor urban hydraulic structures and flood hazard assessments [21]. The first method to estimate the IDF relationship is applied to pluviograph data (from automatic stations). These data provide the real intensity that occurred for each rain event through the years and can be extrapolated applying a probability distribution. The Generalized Extreme Value Distribution (GEV) is usually applied to estimate IDF curves, like the extrapolation of the available data to estimate the peak value of the sample. The Gumbel's method, based on the theory of extreme values, is the first method applied here, where the probability of an event of a determined

magnitude not being equaled or exceeded, can safely be adopted and have been widely used [30,33]. This is expressed by the Equation (1).

$$F(X) = e^{\left[-\epsilon^{\left(-\frac{X-\mu}{4}\right)}\right]} \left(-\infty < X \le \infty\right) \tag{1}$$

where 0 <α< ∞ is the scale parameter, −∞ <μ< ∞ is the location parameter or central value, e is the base of the Naperian logarithms, α, *X*, and μ; corresponding to the parameters of the statistic moments of the distribution. The distribution derivative provides the probability density function where the values of *X*, for different return periods (*T*), are estimated by means of:

$$X\_T = \mu + a\mathcal{Y}\_T \tag{2}$$

$$\mathcal{Y}\_T = -\ln\left[\ln\left(\frac{T}{T-1}\right)\right] \tag{3}$$

Confidence intervals are important to estimate the return period and data accuracy. Generally, the data to calculate the IDF curves have a standard error of 10% for short return periods and 20% for periods of 50 and 100 years [28]. By Gumbel's method, applied to the automatic data, the IDF curves can be estimated with a confidence value for a return period of 20 years, considering the 10 and 20 years of records from the three automatic stations. The IDF curves for a return period of 20 years are very useful for minor urban structures design. However, to develop the IDF curves for large return periods, such as 50 or 100 years, there would be uncertainty when applying the Gumbel method. Therefore, the requirement to apply Gumbel's method with a good confidence value, would be to have a long series of precipitation data.

Pluviograph records were useful to have real data of the events to estimate IDF for short return periods but were not enough for extrapolating rainfall intensities for the larger return periods. Therefore, the use of daily rainfall depth, from the standard rain gauges, was necessary to make the calculation of the rainfall intensity for all the return periods proposed. This allows us to counteract the calculations with the automatic data and disseminate better results.

Consequently, a second method to develop the IDF curves recommended for urban hydrological design in the Mexican Republic [24] is applied to daily rainfall records (from the 10 standard rain gauges). This method was developed by Chen as an alternative of the absent pluviograph records [23], by the following equation:

$$P\_t^{Tr} = \frac{aP\_1^{10} \log\left(10^{2-\chi} Tr^{\chi-1}\right)t}{60(t+b)^c} \tag{4}$$

where, *PTr <sup>t</sup>* is the intensity of precipitation in mm/h, *<sup>P</sup>*<sup>10</sup> <sup>1</sup> <sup>=</sup> *<sup>R</sup>*(*P*<sup>10</sup> <sup>24</sup>), *<sup>P</sup>*<sup>10</sup> <sup>1</sup> is the rain in mm, generated in one hour for a return period of 10 years, (*X*) is the ratio of the rain-return period *<sup>X</sup>* <sup>=</sup> *<sup>P</sup>*<sup>100</sup> *t P*<sup>10</sup> *t* , *P*<sup>100</sup> *<sup>t</sup>* and *<sup>P</sup>*<sup>10</sup> *t* is the rain of 24 h and return period of 100 and 10 years respectively. (*Tr*) is the return period in years, (*t*) is the duration in minutes, (*a*), (*b*), and (*c*) are parameters of regional characterization of the rain defined by the (*R*) ratio.

The depth-duration ratio (*R*) is the most important parameter to estimate the IDF relationship, from daily records, by using the Chen equation for a specific geographic location, because this ratio is related to rain characteristics of the zone [34]. As shown before, Equation (4) is based on (*R*) ratio to determine the rain of one hour and a return period of ten years *P*<sup>10</sup> <sup>1</sup> , and therefore, the rain for a given return period. The (*R*) ratio for any average condition of rainfall over any geographical areas, also, has been proposed by Chen [23], through the following equation:

$$R = \frac{P\_1^2}{P\_{24}^2} \tag{5}$$

where, (*P*<sup>2</sup> <sup>1</sup>) is the rain in mm, generated in one hour for a return period of 2 years. (*P*<sup>2</sup> <sup>24</sup>) is the rain in mm, generated in 24 h for a return period of 2 years. This value is easily estimated by applying Gumbel distribution for each standard rain gauge. (*P*<sup>2</sup> <sup>1</sup>) can be estimated from pluviograph records applying the Gumbel distribution for each station. However, the issue to apply formula 5 is finding the value of the rain of one hour and a return period of two years (*P*<sup>2</sup> <sup>1</sup>), when there are only daily records, not pluviograph records. Therefore, to validate Chen's method in this region, it was considered to make a good estimation of (*R*) ratio through the value of (*P*<sup>2</sup> 1).

Several alternatives to apply Equation (5) for any geographic area, based only on daily records, have been given by Hershfield and Wilson through a diagram that relates the mean of maximum annual observations of precipitation days with the mean annual number of thunderstorm days; Reich, through a proposed world isopluvial map of 2-year/1-h maximum precipitation; and Bell, based on Hershfield method through the following equation:

$$P\_1^2 = 0.17 \text{MN}^{0.33} \tag{6}$$

Equation (6) could be applied if 0 <sup>&</sup>lt; *<sup>M</sup>* <sup>≤</sup> 2.0, 1 <sup>&</sup>lt; *<sup>N</sup>* <sup>≤</sup> 80, in which (*P*<sup>2</sup> <sup>1</sup>) <sup>=</sup> 2-yr, 1-h rainfall in inches: *M* = mean of maximum annual observational—precipitation day in inches: and *N* = mean annual number of thunderstorm days.

The purpose was to find the *R*-value to 10 rain gauge stations applying Equation (5), where (*P*<sup>2</sup> 24) was easily estimated using daily records, by applying Gumbel distribution (Equations (2) and (3)). In the case of (*P*<sup>2</sup> <sup>1</sup>) there are three stations with pluviograph records and seven stations with daily records, the issue was to find (*P*<sup>2</sup> <sup>1</sup>) for the seven stations with daily records. Consequently, it was decided to evaluate the methods available for estimating the value of (*P*<sup>2</sup> <sup>1</sup>) and choosing the best way of finding the accurate (*P*<sup>2</sup> <sup>1</sup>) to calculate the (*R*) ratio.

The process started by knowing the approximated value of the (*P*<sup>2</sup> <sup>1</sup>) along the regional area to compare them with the estimated values through the mentioned methods, started the process. Approximated values of (*P*<sup>2</sup> <sup>1</sup>) in the regional area were examined through a literature review of regional studies. This review, with the same rainfall characteristics, includes California in the US (CA), near to the international border, and Baja California in Mexico, as shown in Figure 1.

The first method evaluated to calculate (*P*<sup>2</sup> <sup>1</sup>) was the Gumbel distribution using pluviograph records from stations: CIC, 2072, 2036, and 2035. Then, a weighted average of (*P*<sup>2</sup> <sup>1</sup>) was estimated from the four automatic stations, through the Thiessen polygon. The averaged value was assigned to the seven rain gauges stations, to replace it in Equation (5) and have the first (*R*) value in the zone, as (*R*1).

The next step was to calculate (*P*<sup>2</sup> <sup>1</sup>) for the 10 rain gauges stations through the methods previously described methods; Hershfield, Bell, and Reich [26–28]. Each value was replaced in Equation (5) to have different values of (*R*) like (*R*2) and (*R*3). Three different (*R*) values were estimated. Nevertheless, from the different values of (*P*<sup>2</sup> <sup>1</sup>), it was considered to search for a relationship to provide values of (*P*<sup>2</sup> <sup>1</sup>) for the rain gauges stations close to the values of (*P*<sup>2</sup> <sup>1</sup>) estimated from pluviograph records. The Hershfield and Bell methods provide the same results. This way, a new method to find (*P*<sup>2</sup> 1) was derived from Bell´s formula (Equation (6)). After several tests of parameters variation, the new relationship resulted from modifying Equation (6) in the following equation:

$$P\_1^2 = 0.12MN^{0.33} \tag{7}$$

The (*P*<sup>2</sup> <sup>1</sup>) values estimated from Equation (7) were replaced in Equation (5) to have new values of (*R*), such as (*R*4). Therefore, the evaluated methods provided four different results of (*R*<sup>1</sup> ... 4). The challenge was to choose the best method of finding (*R*); thus, to address this challenge, an average of all calculated (*R*) was made, and this average was proposed as the best option to determine the (*R*)

ratio for the city of Ensenada (Equation (8)). A summary of the process to find (*R*) ratio for each station in the zone is represented in Figure 3.

$$
\overline{R} = \frac{\sum\_{i=1}^{n} R}{n} \tag{8}
$$

**Figure 3.** Procedure to determine the (*R*) ratio proposed. *R* is the average ratio and n is the number of (*R*<sup>1</sup> ... 4) calculated for each (*P*<sup>2</sup> 1).

Once the (*R*) ratio was determined for each rain gauge station, the spatial distribution of the (*R*) ratio was drawn on a map by interpolation, using the Inverse Distance Weighting (IDW) method. The error analysis of this spatial interpolation was carried out between actual and interpolated values using three statistical parameters: Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and coefficient of determination (r2). Consequently, the distribution of (*R*) ratio for the study area can be successfully used to calculate the IDF curves in the zone.

The procedure to develop the IDF relationship for the proposed return periods includes the combination of pluviograph and daily rainfall records, through the methods described above, testing different values of estimated (*R*), and comparing their results, to choose the most suitable, according to rain characteristics.

The first step was the calculations of the IDF relationship in three automatic stations (2072, 2036, and 2035), using the maximum rainfall events for each year, and applying the Gumbel method (Equation (1)).

The second step was the estimation of the IDF relationship for the 10 rain gauges stations, with daily records for more than 40 years, applying the Chen method (Equation (4)). There are three pluviographic

stations (2072, 2036, and 2035) on the same location from the previous stations. The calculations using Chen equation included the use of the (*R*) ratios found with the different methods already mentioned, for all stations, followed by the comparison of the different results of IDF developments with the IDF values by the (SCT), and the comparison with the IDF relationship developed in 2011 by the by Norm of the State of Baja California, that only has results for station 2072.

Therefore, for each rain gauge station, with daily records for more than 40 years, there are several IDF calculations to enhance the comparison between each other, and have more options to choose the best results, developed through the following methods:


Once the IDF relationships were estimated from the previously established methods, two comparisons were made. The first comparison was made from the use of methods A, B, C, D, E, F, and G. The second comparison was made from the use of methods A, C, H, and I. From the first comparison method A and C were chosen again, considering that method A is very important because it was developed with pluviograph information (2072, 2036, and 2035 stations), representing the continuous measure of rainfall intensity. However, the rainfall time series data at stations do not exceed 20 years of record (Table 1). Thus, the method C was chosen because it corresponds to the official IDF relationship from the country. The chosen method and the corresponding results of the IDF curve for the city of Ensenada and its surroundings are shown in the results and discussion section.

#### **3. Results and Discussion**

Peak rainfall events of each year were identified from the records of the four automatic stations, following the previously described method. Table 2 provides the events extracted from the Emilio Lopez Zamora station (2072). Since this station has the most extensive time series of rainfall data (20 years), it will be taken to illustrate the maximum precipitation events.


**Table 2.** Peak rainfall events for ELZ station (2072).

It is clearly seen that the highest precipitation depth of each event occurs during the first 10 and 20 min. These observations confirm the short periods of rainfall events, as a local rain characteristic. Moreover, to analyze the increase or decrease of extreme events throughout the years, a rainfall trend has been drawn. The line trend of time series was analyzed from data of the station that have pluviograph and daily rainfall record. Figure 4 shows the line trend for data of station (2072). Figure 4a shows the trend of the high daily rainfall depth of every year, and Figure 4b presents the trend of the extreme rainfall events of the years at different durations, not necessarily the most intense (pluviograph records). The series presents a positive trend. In both cases, the trend indicates that maximum daily rainfall depth and peak rainfall events have been increasing with the years and will continue occurring in the future. In the trend analysis shown in Figure 4a, the slope establishes that the average increase in the maximum daily rainfall is 18%, i.e., if the current conditions that govern the occurrence of rainfall in the study area were to be maintained over time, then precipitation would increase at a rate of 1.8 mm per decade. For the case of Figure 4b there would be an increase of approximately 1.2 mm. The above magnitudes are considered significant if we consider that precipitation falls over a given coverage area, which translates into the generation of more direct runoff volume over the city of Ensenada.

**Figure 4.** Trend analysis for station 2072 using automatic and daily records. (**a**) Annual maximum daily rainfall depth. (**b**) Annual Peak rainfall events.

These results can also be complemented by a detailed analysis of the rainfall time series for the State of Baja California [35].

The traditional methods used in hydrology to estimate rainfall and extreme flows for different return periods are based on the stationarity hypothesis in the probability distribution function of the series. According to Poveda et al. [36], this hypothesis could not be valid given the effects of climate change, climate variability, changes in land use, and the records of hydrological variables. Therefore, the Augmented Dickey–Fuller test was performed to evaluate the rainfall time series in the city of Ensenada fulfilling the hypothesis of stationarity or non-stationarity (Table 3).


**Table 3.** Augmented Dickey–Fuller hypothesis test for precipitation data from stations used in the study area.

Table 3 shows the results of the Augmented Dickey–Fuller test. The analysis was carried out for a 5% significance level, finding that none of the time series corresponding to weather stations comply with the stationarity assumption because the *p*-value is greater than 0.05 [32]. Given the previous analysis, the Chi-square goodness and fit test was used for the precipitation data analyzed with the Gumbel type I, Log-normal, Frechet and double Frechet probability distribution functions to verify which of them had the best fit. In the case of the rain gauge stations, eight were better adjusted to the Gumbel type I distribution function, and two stations to the double Frechet. The previous analysis was carried out with a 0.05 confidence level. In the case of the automatic station 2072, both the Gumbel type I distribution function and the double Frechet distribution function presented equivalent results. Given the above, it was decided to use the Gumbel distribution function for the analysis of the relationship Intensity-Duration-Frequency of rainfall in the study area. In addition, it has been found that the chosen distribution fits well for precipitation in semi-arid areas [11].

The (*R*) ratio was estimated based on the (*P*<sup>2</sup> <sup>1</sup>) following the process shown in Figure 3. The result of the regional review of (*P*<sup>2</sup> <sup>1</sup>) is shown in Table 4. The regional review reveals that values of (*P*<sup>2</sup> <sup>1</sup>) from Southern California to Northern Baja California are in the range between 10 and 16.61 mm, indicating that values of (*P*<sup>2</sup> <sup>1</sup>) for the study area should be estimated in this range. For this reason, the evaluation of the different methods to calculate (*P*<sup>2</sup> <sup>1</sup>) and estimate (*R*) followed by the comparison of their results was carefully analyzed to determine the accurate (*R*) ratio. The results of (*P*<sup>2</sup> <sup>1</sup>) are in the range of the regional review, when (*P*<sup>2</sup> 1) <sup>1</sup> is calculated by the average of pluviograph and daily records applying Gumbel, (*P*<sup>2</sup> 1) <sup>2</sup> is calculated by Hershfield [26], (*P*<sup>2</sup> 1) <sup>3</sup> is calculated by Reich [27], and (*P*<sup>2</sup> 1) <sup>4</sup> is calculated by the adjusting and modification of Bell [28]. The results of (*R*1...4) ratios estimated for each station through the different methods of (*P*<sup>2</sup> <sup>1</sup>) were averaged to define and support the proposed (*R*) for each station. These values are shown in Table 5.


**Table 4.** Literature review of (*P*<sup>2</sup> <sup>1</sup>) for California and Baja California.

**Table 5.** Depth-duration ratio (*R*) estimated through the average ratios *R1* ... *<sup>4</sup>* estimated from (*P*<sup>2</sup> <sup>1</sup>) and (*P*<sup>2</sup> <sup>24</sup>). Where (*P*<sup>2</sup> 1) <sup>1</sup> was averaged of pluviograph and daily records applying Gumbel, (*P*<sup>2</sup> 1) <sup>2</sup> calculated by Hershfield [26] and Bell [28], (*P*<sup>2</sup> 1) <sup>3</sup> calculated by Reich [27], and (*P*<sup>2</sup> 1) <sup>4</sup> calculated by the adjusting and modification of Bell [28] (Equation (7)).


The (*R*) ratio calculated for each station varied from 0.35 to 0.44; this range is close to the (*R*) ratio of 0.44 for one station in Tijuana reported by Campos–Aranda [24], which indicates that both cities share the same rain characteristics. Thus, once the (*R*) ratio was estimated, the spatial distribution of this ratio was projected in a map for the study area (Figure 5). The error analysis was carried out between actual and interpolated values using three statistical parameters: Mean Absolute Error (MAE <sup>=</sup> 2.341 <sup>×</sup> <sup>10</sup><sup>−</sup>14), Root Mean Square Error (RMSE <sup>=</sup> 1.53011 <sup>×</sup> <sup>10</sup><sup>−</sup>7), and coefficient of determination (R2=1). There is a significant positive correlation between actual and interpolated values estimated for the depth-duration ratio. Values for MAE and RMSE indicate that the IDW method created a good interpolated surface for the whole area of interest based on the observed data. It can be established that in Ensenada City the average value of *R* is about 0.39 and this value is proposed as (*R*) ratio for the area.

It is highly important to highlight that in the absence of pluviograph data, the (*R*) ratio calculated becomes the key to develop a satisfactory IDF curves for the standard rain gauge stations, by applying Chen equation.

IDF curves are commonly developed using historical annual maximum precipitation, this involves utilization of long-term historical rainfall observations. When sub-daily rainfall records are not available, the characteristics of extreme rainfall intensities, and subsequently, their distribution functions corresponding to the short durations might not be captured [39–43]. This problem is clearly addressed by applying Equation (7) to rain duration ratio (*R4*) estimated (Table 5), likewise distribution of the depth-duration ratio (Figure 5). In this way, empirical formulas (e.g., Chen) can be used in areas where there are no pluviographs (high temporal resolution), rain gauges, and/or information available.

**Figure 5.** Distribution of the Depth-Duration Ratio (*R*) for the city of Ensenada.

The IDF relationship were calculated with pluviograph records, by applying Gumbel distribution and, calculated with daily records by applying Chen methods through different options of (*R*) ratio. The different results allowed us to make comparisons to choose the most suitable method for calculations of the IDF relationship.

The comparisons of results were focused on the three stations that have pluviograph and daily records (2072, 2036, and 2035). However, station 2072 has been chosen to define the methods of comparisons, because it has the widest time series of pluviograph and daily records.

Table 6 presents the results of the first comparison that tested methods A, B, C, D, E, F, and G previously described, and its average likelihood method (H), to assess the IDF relationship. In this table the values of the IDF relationship obtained by the different methods vary considerably, and for this reason it was decided to carry out an average of these methods. However, it is important to highlight that method G presents the highest magnitudes and correspond to the IDF reported in the Baja California State Standards. It should be noted that these magnitudes are far from the average. With these results, it was identified that these IDFs are over-estimated, which implies a greater over determination of dimensions in the designs of hydraulic structures, and so therefore higher construction costs.

From this comparison, another comparison was selected to assess and estimate the IDF relationship for the rest of the rain gauge stations. The comparison defines the selection of the best method of IDF estimation, that includes the IDF estimated using pluviograph data (method A), the IDF reported by the SCT (method C), the IDF estimated with the *R* ratio proposed in this zone (method I), and the IDF from method H. This comparison is shown in Figures 6–8 for the stations 2072, 2036, and 2035.


**Table 6.** First comparison of IDF relationship, estimated through different methods for station 2072 for different return periods.

(A) Estimated from pluviograph records using Gumbel (Equation (1)). (B) Estimated from the Chen equation using *R1* from Table 5. (C) Obtained from the isohyetal map of the SCT. (D) Estimated from the Chen equation using *R4* from Table 5 (Equation (7)). (E) Estimated from the Chen equation using *R3* from Table 5 (Reich, 1963). (F) Estimated from the Chen equation using *R2* from Table 5 (Equation (6)). (G) Obtained from the of Norm by the State of Baja California.

The comparisons of IDF relationship for the 2072 station showed similar values calculated with methods (I) and (H). The method (I) is calculated by using Chen with the proposed *(R*) ratio, and the method (H) is calculated from the average of IDF developed by methods A, B, C, D, E, F, and G described before. On the other hand, the intensities obtained with method C are lower than those obtained with method I. This may be because the IDFs obtained with method C have not been updated since 2000, therefore, they do not involve the analyzes performed in Figure 4a,b.

Figures 6–8 show that the results of the method (I) are closer to methods (C) and (H) than method (A). Therefore, considering the position of the results of (I) on the range of values of all methods, it could be suggested as the most suitable to calculate the IDF curves. The values to estimate Figures 6–8 can be seen in Tables S1–S3 in Supplementary Materials. Method (I) was chosen to calculate IDF for the 10 rain gauge stations like the best option of the methods tested. Table 7 shows the IDF relationship for the 10 stations calculated by the Chen method through the proposed (*R*) ratio.

**Figure 6.** Established comparison of the IDF relationship estimated through different methods for station 2072 for different return periods. (**a**) shows the IDF curves from method (A), estimated from pluviograph records using Gumbel (Equation (1)). (**b**) shows the IDF curves from method (C) estimated from isohyetal map of the SCT. (**c**) shows the IDF curves from method (H) estimated from the average of IDF developed by methods A, B, C, D, E, F, and G, and (**d**) shows the IDF curves from method (I) estimated from Chen equation using the (*R*) ratio proposed (Equation (8)).

**Figure 7.** Established comparison of the IDF relationship estimated through different methods for station 2036 for different return periods. (**a**) shows the IDF curves from method (A), estimated from pluviograph records using Gumbel (Equation (1)). (**b**) shows the IDF curves from method (C) estimated from isohyetal map of the SCT. (**c**) shows the IDF curves from method (H) estimated from the average of IDF developed by methods A, B, C, D, E, and F, and (**d**) shows the IDF curves from method (I) estimated from Chen equation using the (*R*) ratio proposed (Equation (8)).

**Figure 8.** Established comparison of the IDF relationship estimated through different methods for station 2035 for different return periods. (**a**) shows the IDF curves from method (A), estimated from pluviograph records using Gumbel (Equation (1)). (**b**) shows the IDF curves from method (C) estimated from isohyetal map of the SCT. (**c**) shows the IDF curves from method (H) estimated from the average of IDF developed by methods A, B, C, D, E, and F, and (**d**) shows the IDF curves from method (I) estimated from Chen equation using the (*R*) ratio proposed (Equation (8)).

For the stations 2036 and 2035, the results of the IDF relationship were similar by methods (C) and (I). However, when graphing the IDF curves with the C method, it is noted that they do not show the typical fit of said curves. This may be because there are errors in the process of estimating them. A comparison has also been made for the stations 2001, 2005, and 2045, that provided equivalent results by applying methods (C) and (I). From the analysis of the comparison of results, method I is recommended as the most suitable for the IDF design in the evaluated stations.

For the IDF relationship calculation from daily records to be used in Mexico, the Chen's method has been mostly recommended by other authors [24,44]. However, considering that rain changes in space and time, the use of this method has been validated through the accurate calculation of (*R*) ratio, where it depended on the (*P*<sup>2</sup> <sup>1</sup>). The results of (*P*<sup>2</sup> <sup>1</sup>) obtained by pluviograph data, Reich and the adjusting from Bell equation, are similar; however, the method widely used is Hershfield that provides the highest values. Therefore, it was considered to calculate (*R*) with each value of (*P*<sup>2</sup> <sup>1</sup>) of each method to make an average of all estimated (*R*). This average was proposed as the (*R*) ratio of the zone that can be safe to calculate the IDF curves in other area stations.

The differences between the various methods in this study can be attributed to the length of the time series used. Finally, it has been verified that the quality and length of the pluviographic and rainfall precipitation records are important aspects in the study of the intensity-duration-frequency relationship in an area.


**Table 7.** IDF relationship selected for the rain gauge stations estimated through the Chen Method with the (*R*) ratio proposed in this study.

#### **4. Conclusions**

The procedure and methods for developing IDF relationship in Ensenada were carefully tested making effective use of the available rainfall data, including pluviograph and daily records that allowed to achieve the research objectives.

Despite the absence of enough pluviograph data in this study, the IDF relationship was carried out successfully using the combination of Chen's methods, through the average (*R*) calculated from the average value of the one-hour rainfall and the two-year return period. The values of (*P*<sup>2</sup> <sup>1</sup>) obtained were compared to the values of some cities in California and Baja California, with a range between 10 and 16.61 mm, and the values of the (*R*) ratio in a range between 0.35 and 0.44; this range is close to the (*R*) ratio of 0.44 for one station in Tijuana, a city 100 km farther North from Ensenada. The values found here correspond to the rainfall characteristics of the zone; therefore, the method used in this study can be replicated to another semi–arid zones with the same rain characteristics. The parameterization of the IDF relation for different durations, allows better understanding and realization of spatio-temporal analysis of the characteristics of rainfall in an area.

Chen's method is applicable to any zone if the (*R*) ratio is well defined. This value was carefully reviewed in the region and carefully tested with updated pluviograph and daily data. Therefore, the (*R*) ratio estimated is proposed to develop IDF curves in the absence of pluviograph data.

After analyzing the results of the IDF relationship for the station 2072, it was observed that the IDF relationship published by the Norm of the State of Baja California presents the highest values of all methods. Although they are also safe for designing, they imply the highest cost of construction—greater sizing—that could be minimized considering the new results. The document review indicates that the IDF relationships were developed with data available up to the year 2011. Therefore, it is suggested that these results from the IDF relationship should be included in the Norm of the State of Baja California, as it requires the recurrence update upon its recommendation.

This study guarantees the following aspects: input to rain–runoff models to improve the information available for an adequate design, planning, estimation of dimensions of civil works, and the integral management of water resources in this semi-arid region. Also, the proposed intensity–duration–frequency relationship will facilitate the evaluation of the flood hazard. The city of Ensenada periodically suffers the effects of floods, is out of electricity service, it is not possible to travel, there are drainage problems, health hazards, and work activities are often stopped. Therefore, proper risk management is important to protect not only the lives of citizens, but also the public and private assets, as well as the progress made in the development process. These results constitute a starting point for risk management and flood resilience.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2306-5338/7/4/78/s1, Table S1: Established comparison of the IDF relationship estimated through different methods for station 2072 for different return periods, Table S2: Established comparison of IDF relationship estimated through different methods for the station 2036 for different return periods, Table S3: Established comparison of the IDF relationship estimated through different methods for the station 2035 for different return periods.

**Author Contributions:** Conceptualization, E.G.-B.; A.A.L.-L.; methodology, E.G.-B., A.A.L.-L., J.P.M.-B., A.L.-R. and J.F.R.L; formal analysis, E.G.-B., G.S., L.W.D., A.A.L.-L., L.M.-A., J.P.M.-B., and A.L.-R.; writing—original draft preparation, E.G.-B., A.A.L.-L., L.M.-A., J.P.M.-B., A.L.-R. and J.F.R.L.; writing—review and editing, E.G.-B., A.A.L.-L., L.M.-A., A.L.-R. and J.F.R.L.; visualization, E.G.-B. and L.M.-A.; supervision, A.A.L.-L.; project administration, E.G.-B., A.A.L.-L., L.M.-A., A.L.-R. and J.F.R.L.; funding acquisition, J.F.R.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Universidad Autónoma de Baja California and HIDRUS S.A de C.V, Grupo HIDRUS S.A.S and Universidad Pontificia Bolivariana Campus Montería, the APC was funded by Universidad Pontificia Bolivariana Campus Montería.

**Acknowledgments:** We would like to thank the Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), and especially Santiago Higareda from the meteorology laboratory, who provided the precipitation data of automatic station.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Intra-Storm Pattern Recognition through Fuzzy Clustering**

**Konstantinos Vantas and Epaminondas Sidiropoulos \***

Department of Rural and Surveying Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; kon.vantas@gmail.com

**\*** Correspondence: nontas@topo.auth.gr

**Abstract:** The identification and recognition of temporal rainfall patterns is important and useful not only for climatological studies, but mainly for supporting rainfall–runoff modeling and water resources management. Clustering techniques applied to rainfall data provide meaningful ways for producing concise and inclusive pattern classifications. In this paper, a timeseries of rainfall data coming from the Greek National Bank of Hydrological and Meteorological Information are delineated to independent rainstorms and subjected to cluster analysis, in order to identify and extract representative patterns. The computational process is a custom-developed, domain-specific algorithm that produces temporal rainfall patterns using common characteristics from the data via fuzzy clustering in which (a) every storm may belong to more than one cluster, allowing for some equivocation in the data, (b) the number of the clusters is not assumed known a priori but is determined solely from the data and, finally, (c) intra-storm and seasonal temporal distribution patterns are produced. Traditional classification methods include prior empirical knowledge, while the proposed method is fully unsupervised, not presupposing any external elements and giving results superior to the former.

**Keywords:** rainfall; rainstorm events; inter-event time; intra-storm patterns; fuzzy clustering; clustering analysis; clustering tendency; Greece

#### **1. Introduction**

Knowledge of the temporal and spatial distribution of rainfall is essential both for climatological studies, especially regarding climate change, and for purposes of flood studies and water resources planning. Effective and illuminating studies of rainfall data are achieved through the detection of patterns or groupings. Stochastic precipitation models utilize Markov chains to simulate the occurrence of wet or dry days and consequently the daily precipitation depth [1–3]. Numerous models, extensively developed, are based on the concept of rectangular pulses point process (RPPP), which can be categorized into two types, the Bartlett–Lewis [4] and the Neyman–Scott [5,6] precipitation models. Both of them use the assumption that storms arrive according to a Poisson process, the most basic example of continued-time Markov chains, of which a concise description can be found in Onof et al. [7]. The application of RPPP methods concerns the fitting of a small set of parameters that define the distributions that follow different rainfall quantities, such as the rainstorm and rain-cell origins, durations and intensities, a multi-objective problem without analytical solution and, sometimes, erratic results [5,7–9].

A different family of models are the profile-based ones [10,11] that utilize the internal structure of rainstorms in their entirety and not as derived parameters of statistical distributions. These rainfall models employ the construction of Huff or mass curves [12], a probabilistic method for sub-daily precipitation records, in which rainstorm data are represented in the form of normalized time, versus normalized cumulative precipitation depth and classified by the quartile where the maximum intensity occurs. The use of mass curves offers several advantages: (a) adequate stability regarding the sample size of rainstorms [13]; (b) coarser hourly data giving nearly identical results to data with

**Citation:** Vantas, K.; Sidiropoulos, E. Intra-Storm Pattern Recognition through Fuzzy Clustering. *Hydrology* **2021**, *8*, 57. https://doi.org/10.3390/ hydrology8020057

Academic Editor: Davide Luciano De Luca

Received: 28 February 2021 Accepted: 22 March 2021 Published: 25 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

finer time-step in the order of minutes [13]; (c) similarity in very long distances [12–14]; (d) the fact that they are not affected by elevation, type of storm, storm duration or storm precipitation depth [15].

Given that precipitation records contain both wet and dry periods, an important issue is the delineation or the extraction of individual rainstorm events from the contiguous recorded precipitation data. A systematic review [16] of different methods that are based on various rainstorm events characteristics (depth, intensity, generated runoff) reveals the necessity of selecting a criterion that defines objectively the rainless intervals, or minimum inter-event time, and not the arbitrary selection of constant time length. The latter is a common practice: Huff used 6 h to separate storms [12], Yu et al. used 2 h [17] and the calculation of rainfall erosivity in the universal soil loss equation (USLE) and its revisions [18] uses, also, 6 h. A different approach was proposed by Restrepo-Posada and Eagleson [19], commonly used today by many authors [13,20], in which "an empirical and inexact", but easily applied, method was developed for the separation of precipitation timeseries, into statistically independent rainstorms.

Unsupervised learning methods, clustering in particular, are now employed for pattern recognition in hydrological data [21]. In general, clustering analysis is a popular exploratory task aimed at partitioning the content of databases into smaller groups on the basis of inherent similarity criteria [22]. Clustering yields meaningful patterns to be further utilized for understanding of processes or for simulations. Clustering methods are applied in many fields, motivated by the large amounts of accumulated data and the developed needs for better data management through grouping and detection of patterns. The unsupervised nature of clustering analysis is a definite advantage, but it raises additional difficulties regarding suitable choice of methods and metrics. The challenge of clustering also lies in adapting the method and its parameters to the nature of the specific application.

A variety of such unsupervised learning methods are reviewed by Sheikholeslami et al. [23], including descriptions of clustering approaches. Among these, the well-known k-means and k-medoids methods divide the data into distinct clusters, such that each element belongs to exactly one cluster. This type of clustering may be characterized as hard, crisp or non-fuzzy. On the contrary, in fuzzy clustering, each point-element is allowed to belong to more than one cluster, albeit with a varying degree of membership. The membership degree of an element is a number in the interval [0,1]. Points closer to the center of a cluster are assigned a higher membership degree than points further away. One of the most widely used fuzzy clustering algorithms is fuzzy c-means clustering. It is used frequently in pattern recognition, and a related review can be found in Nayak et al. [24]. A major problem in clustering is the determination of the optimal number of clusters, which, in real-world applications, is generally not known in advance. As a result, a number of methods have been developed for the determination of the optimal number of clusters. Many of them use the concept of relative cluster validity [25], where results from different clustering methods are compared using a predefined metric. A number of these methods can be found in Milligan et al. [26] and Charrad et al. [27].

In hydrology, machine learning and clustering methods are being applied with increasing frequency, but, apparently, not to their full potential yet. More specifically, cluster analysis has been applied for the identification of hydrologically homogeneous regions, as noted in several literature publications: (a) hierarchical clustering on monthly rainfall erosivity density [21]; (b) fuzzy c-means clustering on annual precipitation [28] and annual maximum intensities [29]; self-organizing maps on monthly precipitation [30], while less work has appeared in terms of temporal pattern investigations applying: (a) the AL algorithm on rainstorms mass curves [31]; (b) self-organizing maps on design hyetographs; (c) k-means, also, on rainstorms mass curves [32]. On the other hand, rainfall mass curves have been utilized, as previously reported, in the stochastic generation of rainfall events [10,11,33,34], also as design storms [14,35–37], for the simulation of floods [38,39] and in changes in storm properties due to climate change [40].

In view of the above considerations, this paper aims to investigate the presence of intra-storm temporal patterns using timeseries of rainfall data coming from the Greek National Bank of Hydrological and Meteorological Information. Prior to clustering analysis, a more precise statistical analysis is applied in order to define a seasonal model to aid in the extraction of statistically independent rainstorm events, which has never been presented for Greece. The computational process that follows is a custom-developed, domain-specific algorithm that produces temporal rainfall patterns using common characteristics from the data via fuzzy clustering in which (a) every storm may belong to more than one cluster, allowing for some equivocation in the data, (b) the number of the clusters is not assumed known a priori, but is determined solely from the data and, finally, (c) intra-storm and seasonal temporal distribution patterns are produced. The optimal temporal rainfall distribution curves presented recently by the authors employed hierarchical clustering on principal components and utilized data from a specific water region of Greece [41]. Additionally, a preliminary research that utilizes self-organizing maps and crisp clustering has been presented by the authors in a recent European Geosciences Union General Assembly [42]. The present paper (a) adopts the more general approach of fuzzy clustering, which, to our knowledge, has not been applied so far to rainstorm timeseries; (b) the data utilized extend country wide; (c) the advantages of mass curves mentioned in the literature for different parts of the word are materialized for the case of Greece, regarding regionalization and the effect of elevation. Moreover, the overall result of clustering is verified and compared with the established Huff's classification, which has also never been presented for the country, via visualization effected through non-linear projection and topographic maps that are created using emergent self-organizing maps, a method that can reveal patterns in high-dimensional datasets.

#### **2. Materials and Methods**

The methodology that was applied in the study is presented in Figure 1 in the form of a flowchart. Precipitation data with a time step of 30 min were imported; rainstorms were extracted from the dataset and preprocessed in order to have an appropriate form as input to clustering analysis. A cluster tendency assessment was applied, so as to examine if clustering is meaningful, and the optimal number of clusters were determined by a custom method that utilizes fuzzy c-means. Additionally, the Huff's curves were compiled for the same dataset. The resulted patterns were visualized using a nonlinear projection of the data in two dimensions, and finally, their characteristics were examined in terms of internal and seasonal structure.

**Figure 1.** Flowchart of the applied methodology.

#### *2.1. Data Acquisition and Processing*

The data used in the analysis were taken from the Greek National Bank of Hydrological and Meteorological Information [43] for 108 meteorological stations across Greece (Figure 2). The timeseries comprised a total of 2926 years of 30 min records for the time period from 1953 to 1997, with a mean length of 26.6 years per station. The timeseries were

checked for consistency and errors as follows: (a) in records of repetitive values near zero (i.e., ≤0.01 mm), these values were set to zero and (b) records of aggregated values, where the time step was larger than the reported, were removed. The pluviograph data coverage was 43.2% on average. The missing values in the timeseries are not random points but contiguous data related to entire months or years.

**Figure 2.** Station locations in Greece used in the analysis obtained from the Greek National Bank of Hydrological and Meteorological Information.

Greece, given its geographic characteristics, the distribution of sea and land and the complex and rich relief, has a mosaic of different micro-climates and regional variations, all in the Mediterranean context. Precipitation varies from its maximum values during winter to a minimum during summer. The highest values are observed westwards and on the mountain range of Pindos and its expansion on Peloponnesus (two to three times higher). This fact reveals the importance of relief on the distribution of rainfall over the country. Furthermore, higher precipitation depth is connected to the movement of Mediterranean depressions that follow a characteristic path from west to east. Northerly, at the valleys of central Macedonia, smaller amounts of precipitation are recorded, due to prevailing dry winds. At the northeastern part, at Thrace, higher values are observed also due to relief. Finally, eastern parts of Greece are drier. During summer months when convective activity prevails and over northern Greece, it produces higher precipitation amounts than in the drier southern parts.

#### *2.2. Extraction of Rainstorms*

A Poisson process hypothesis is assumed for the separation of the precipitation timeseries to statistically independent rainstorm events, where:


• The probability density function is:

$$f(t\_a) = \omega \cdot e^{-\omega \cdot t\_a}, \quad t\_a \ge 0 \tag{1}$$

where *ω* is the average storm arrival rate and

$$t\_a = t\_r + t\_b \tag{2}$$

where *ta* is the storm interarrival time, *tr* is the storm duration, and *tb* is the dry time between storms.

The estimation of **CD** that separates rainstorms is based on an iterative procedure of statistical tests (Algorithm 1). Inter-month data per station are used to ensure homogeneity and: (a) a test value of *cd* is used, coming from a predefined vector of values **CD**, (b) a vector **T** *<sup>a</sup>* of *ta* values is computed for each *cd*, (c) *ω* values for **T** *<sup>a</sup>* are estimated by means of the maximum likelihood estimation method, (d) the goodness-of-fit between **T** *<sup>a</sup>* and the exponential distribution is estimated via parametric bootstrapping of s samples that utilizes the Kolmogorov–Smirnov test [44], and (e) the *cd* value with the maximum *p*-value from the empirical non-parametric distribution is selected. In Algorithm 1, a threshold of 50 *ta* values was imposed empirically prior to statistical testing.

A first version of the above process was presented in previous publications of the authors [41,45]. The basic hypothesis concerns the application of the probability density function (1). A recent related publication [20] adopted the same exponential distribution to estimate inter-event times in a region of China, but using the "inexact" Restrepo-Posada and Eagleson approach.

#### *2.3. Preprocessing*

Prior to clustering analysis, preprocessing of data is necessary, in order to transform them into standardized uniform representations that will facilitate the recognition of the common features. More specifically, each one of the storms is a vector of different length, so that a method must be applied to transform the dataset to one with a fixed number of features that represent time. Thus, the storms were scaled and interpolated to unitless

form in which (a) the time expresses the ratio of the rainstorm duration and (b) the rainfall expresses, also, the ratio of total rainstorm depth:

$$p'\_i = \frac{p\_i}{\sum\_{i=1}^{n} p\_i} \tag{3}$$

$$t'\_i = \frac{t\_i}{\sum\_{i=1}^{n} t\_i} \tag{4}$$

where *pi* and *ti* are the precipitation depth and time for the timestep *i*, respectively, and *p i* and *t <sup>i</sup>* their scaled values.

In the sequel, since the scaled rainstorm vectors have variable length, linear interpolation was applied to compute the unitless rainfall for every 1% of unitless time values. Finally, a matrix of unitless rainstorms **U** was produced with the values of unitless rainfall, in which every row represents a storm and each column the unitless time values. In the clustering analysis, use was made only of the rainstorms with cumulative rainfall greater than 12.7 mm (i.e., with the exception of light rain) that were no single points, due to the timestep of 30 min, and also erosive by definition [18], as in other similar publications [12–14].

#### *2.4. Clustering Tendency*

Before the application of a clustering algorithm, it is advisable to have a preliminary look into the dataset in order to detect any existing clustering tendency. This was done in the following ways: (a) a visual assessment of cluster tendency (VAT [46]) and (b) application of the Hopkins index, *H* [47].

The VAT method creates an image matrix that can be used to visually assess the cluster tendency. In the method, as it was applied, the pairwise dissimilarity values of the rainstorms were computed, and these values were reordered in the form of a matrix and displayed as an intensity image. In this image, the clusters are indicated as dark blocks of pixels along the diagonal [46]. As a dissimilarity measure, the Euclidean distance *d* was used

$$d(u,v) = \sqrt{\sum\_{i=1}^{100} (u\_i - v\_i)^2} \tag{5}$$

where *u* and *v* are two row vectors from the unitless rainstorms **U** matrix. The re-ordering was achieved applying agglomerative hierarchical clustering using the Ward's minimum variance criterion, an algorithm that minimizes the total within-cluster variance [48]. At the beginning of the algorithm, the number of the clusters is equal to the number of data points (all clusters contain a single point). At every step, the algorithm finds the pair of clusters that result after merging to the minimum increase in the total within-cluster variance, which is expressed as the sum of squared differences between the clusters' centers. Finally, all clusters are combined to one cluster that contains all the data using a hierarchical method [21].

The Hopkins index, *H*, can be used to test the null hypothesis of randomly and uniformly distributed data, generated by a Poisson point process and is calculated with

$$H = \frac{\sum\_{j=1}^{m} u\_j^d}{\sum\_{j=1}^{m} w\_j^d + \sum\_{j=1}^{m} u\_j^d} \tag{6}$$

where *X* is a collection of *n* data points that have *d* dimensions. A random sample from *X* without replacement with members *xi*(*i* = 1 *to m*, *m n*) is formed. *Y* is a set of uniformly random data points, also with *d* dimensions and members *yj*(*j* = 1 *to m*), *uj* in turn is the Euclidean distance from *yj* to its nearest neighbor in *X*, and *wj* is also the Euclidean distance from *xi* to its nearest neighbor in *X*. A value of *H* close to one indicates that the data are highly clustered, 0.5 indicates randomly distributed data, and zero indicates regularly spaced data [21,25].

#### *2.5. Fuzzy Clustering*

The unitless rainstorm data were clustered by the fuzzy c-means (FCM) algorithm, which was developed by J. Dunn [49] and improved by J. Bezdek [50]. FCM aims to minimize the objective function

$$\underset{\mathbf{C}}{\operatorname{argmin}} \sum\_{i=1}^{n} \sum\_{j=1}^{c} w\_{ij}^{\text{pr}} ||\mathbf{x}\_{i} - \mathbf{c}\_{j}||^{2} \tag{7}$$

where *wij*[0, 1] is the degree or membership of item *xi* from a set of *n* elements *X* = *x*1,..., *xn* that belong to cluster *cj*, *C* is the set of *c* cluster centers *C* = *c*1, ... , *cc*, and ... denotes any norm that expresses similarity, such as the Euclidean distance.

From an iterative optimization procedure on (7), each element *wij* of the partition matrix *W* is equal to

$$w\_{ij} = \frac{1}{\sum\_{k=1}^{\zeta} \left(\frac{||x\_i - c\_j||}{||x\_i - c\_k||}\right)^{\frac{2}{m-1}}} \tag{8}$$

where *<sup>m</sup>* is the "fuzzifier" that determines the level fuzziness with *<sup>m</sup>* <sup>R</sup> and *<sup>m</sup>* <sup>≥</sup> 1, commonly set to two [51]. The larger the *m* value is, the fuzzier the membership values of the clustered data points are. The centers *cj* of the clusters are the mean of all the elements weighted by their degrees:

$$\mathbf{c}\_{j} = \frac{\sum\_{i=1}^{n} w\_{ij}^{m} \mathbf{x}\_{i}}{\sum\_{i=1}^{n} w\_{ij}^{m}} \tag{9}$$

FCM stops when the maximum number of iterations given a priori is reached, or when the algorithm is unable to reduce the current value of the objective function further to a predefined, usually very small, value.

#### *2.6. Optimal Number of Clusters*

The most common and fundamental problem in clustering analysis is the determination of the number of clusters in an unlabeled dataset, as the one used in the analysis. As previously reported and due to the fact that most clustering algorithms, including FCM, require the number c of clusters to be known a priori, the next step, after answering the question of clustering tendency, has to do with the determination of c. The proposed method uses statistical testing (Algorithm 2). Variations of it have been presented by the authors in the context of different clustering algorithms, namely, (a) self-organizing maps [42] and (b) a dendrogram produced by hierarchical clustering on principal components [41].

A different approach has been presented [52], using k-means (crisp) clustering, which involves two different parameters: (a) an initial threshold that indicates similarity, through a Pearson correlation coefficient, between all pairs of the unitless cumulative rainstorms and (b) an additional one, based on the distances matrix of all pairs of the unitless cumulative rainstorms, which creates the initial seeds used in the k-means clustering algorithm. In other studies, the number of clusters was set without any estimation [32] or through a visual, ambiguous method involving the density maps derived from self-organizing maps [53]. Comparing the reported methodologies to the proposed one, Algorithm 2 presents the following advantages: (a) It is easier to implement, and it can be combined with any clustering algorithm that uses as a parameter the number of clusters directly or indirectly, as is for example the *eps* parameter in the DBSCAN algorithm [54]. (b) It utilizes hypothesis testing about the centers of the clusters, and (c) it has only one parameter to choose, the significance level of well-known statistical tests.

In Algorithm 2, firstly, the cumulative values of the unitless rainstorms matrix are computed, as that kind of transformation was found to help the computational procedures of clustering algorithms. At each step of the iteration, FCM is applied using a trial number of clusters, greater than two and smaller than a predefined maximum value *nmax*. Afterwards, the cluster centers that resulted are represented as cumulative rainfall distributions.

These centers, and for all possible pairs, are tested to find out whether they are drawn from the same distribution using the two-sample Kolmogorov–Smirnov test [55]. The test quantifies the distance between the two samples' empirical distribution functions and has been used in similar comparisons [56,57]. Due to the multiple statistical testing that arise, the resulted *p*-values are adjusted using the Benjamini and Hochberg method [58], which controls the false discovery rate. The algorithm stops when any *p*-value is greater than a predefined significance level *α*, and in that case, the clusters from the previous step are returned.


#### *2.7. Projecting Data Using Non-Linear Mapping*

The non-linear method that was used to visualize the rainstorm data in a twodimension scatterplot was the generalized U-Matrix [59]. This method uses the results from a dimensional reduction method, such as principal components analysis, or a nonlinear method, such as t stochastic neighbor embedding (t-SNE, [60]), in conjunction with emergent self-organizing maps (ESOM [61]). This step is a necessity due to the Johnson– Lindenstrauss lemma [62] stating that the "low-dimensional similarities do not represent high-dimensional distances coercively" [63]. In other words, specific measures are taken to avoid the common mistake of assuming that, when the projected points are similar to each other after dimensional reduction, such is the case in the actual high-dimensional space. The generalized U-Matrix after its calculation by ESOM is visualized by a threedimensional space called "topographic map with hypsometric tints" or colors on surface that represent elevation ranges [64]. The topographic map has no actual real borders and is toroidal, which means that the left border is connected to the right one and the top to the bottom.

#### **3. Results and Discussion**

#### *3.1. Rainstorms Extraction*

Empirically, a minimum set of 50 values per month and station were used in Algorithm 1 and some basic statistics; results about CD are presented in Table 1. Due to the dry summer period in Greece, these monthly CD values could be computed only in a small set of stations. A seasonal sinusoidal model (Figure 3) was developed to use in Greece:

$$f(\mathbb{C}D) = \theta\_1 \sin\left(\frac{2\pi}{12}m\right) + \theta\_2 \sin\left(\frac{4\pi}{12}m\right) + \theta\_3 \cos\left(\frac{2\pi}{12}m\right) + \theta\_4 \cos\left(\frac{4\pi}{12}m\right) \tag{10}$$

where *m* is the month, and *θ*1, ... , *θ*<sup>4</sup> parameters that were estimated using linear regression.

Using the CD sinusoidal model, a set of 174,883 rainstorms were extracted from the dataset. With the exception of summer months and September, the model gives values around 6 h, close to the value selected empirically by Huff [12] and in rainfall erosivity calculations [65]. A subset of 26,678 rainstorms that met the criterion of minimum depth were preprocessed and used to compile the *U* matrix.


**Table 1.** Average statistical properties of monthly **CD** values for the stations. SD is an abbreviation for standard deviation and CV for coefficient of variation (the ratio of the standard deviation to the mean) and h for hours.

**Figure 3.** Red represents the monthly sinusoidal model of CD (Equation (11)). The grey area corresponds to the standard error of the model using linear regression. The grey dots represent the monthly CD values per station, as computed from Algorithm 1.

#### *3.2. Clustering Tedency*

Due to computation issues, a random sample of 10% was used to compute the value of the Hopkins index with H = 0.886, so the null hypothesis of random data could safely be rejected. This result indicated that there was a physical meaning in the categorization of the rainstorms based on their internal structure. The VAT method (Figure 4) created an image matrix, where the clusters are indicated as at least four darker blocks along the diagonal. Given the results from these two methods, there is strong evidence that the dataset contains meaningful clusters and the next step can follow, to apply the proposed method and select the optimal number of clusters.

**Figure 4.** Image matrix created by the VAT method, in order to visually inspect the clustering tendency of data. Distances (dissimilarities) among unitless rainstorms are unitless. The rainstorms that belong to the same cluster are displayed in successive order.

#### *3.3. Clustering Results and Visualization*

After the application of Algorithm 2, the optimal number of clusters are proposed to be four, a suitable number comparing it with VAT results. In Figure 5, the matrix of unitless rainstorms **U** is depicted, with its rows, the rainstorms, reordered by the cluster they belong to. This representation shows that the rainstorms belonging to clusters number one and four are more similar. The latter statement is strengthened by Figure 6 that illustrates the distribution of the membership of every unitless rainstorm that belongs to a cluster (that value must be >0.25 to be classified to a cluster). In the same figure, the mean values of the memberships for each one of the clusters are {0.67, 0.60, 0.59, 0.67}. The latter values are not normally distributed, and cluster one and four are left-skewed meaning less equivocality, in contrast to clusters two and three that are right-skewed.

**Figure 5.** Unitless rainstorms values re-order by clustering results. The names of the clusters were used in order to match with Huff's classification. White symbolizes zero precipitation depth, and different blues depict the presence of unitless precipitation.

**Figure 6.** Distribution of the membership values from fuzzy c-means (FCM) of every unitless rainstorm that belongs to a cluster.

From Figure 5, initially, it seems that the selection of Huff about the grouping in four clusters has physical meaning, in contrast to criticism about being artificial [66].

The average values related to precipitation depth, duration and intensity per cluster are presented in Table 2. Cluster one has on average shorter duration rainstorms, with lower precipitation depth, but higher mean intensity. Clusters two and three have both rainstorms with similar statistics, with larger duration and precipitation depth than one and four. Cluster four has analogous statistics to one with the exception of intensity, which is lower.

**Table 2.** Average values of occurrence of clusters, duration and precipitation depth intensity of clusters' rainstorms.


Cluster number one has notably higher variance and maximum occurrence during spring and summer months that match convective activity over Greece (Figure 7). Clusters two and three have distributions similar to each other and inverse to the first one, related to the prevailing winter weather systems, having their minima during summer. Cluster four has a more uniform distribution with the exception of winter months, when it has its maximum. The latter cluster, with its relatively higher intensity values and, also, higher ratio of occurrence during the wetter winter months in Greece, can be utilized in hydrologic design in the construction of design storms.

**Figure 7.** Seasonal variability of clusters' monthly occurrence.

In order to examine if elevation affects the distribution of rainstorms in the clusters, density plots were compiled (Figure 8) for each cluster. The distribution of the elevations of the stations related to rainstorms appears to be almost identical, despite the fact that altimetry is closely connected to the precipitation regime in Greece. This result about elevation is in accordance with the reported one for Huff's curves by Loukas et al. in Canada [15].

**Figure 8.** Distribution of elevation values from FCM of every unitless rainstorm given its cluster.

In order to examine the temporal patterns in every station using the results from FCM, correlation matrices were computed using Pearson's r coefficient [67]:

$$\mathbf{r} = \frac{1}{n-1} \sum\_{i=1}^{n} \left( \frac{\mathbf{x}\_i - \overline{\mathbf{x}}}{s\_x} \right) \left( \frac{\mathbf{y}\_i - \overline{\mathbf{y}}}{s\_y} \right) \tag{11}$$

where *xi* is the average values of unitless rainstorms coming from a station *x* and cluster *c*, *yi*, similarly, is the average values of unitless rainstorms coming from a station *y* and the same cluster *c*, *n* = 100 (the ordinates of unitless rainstorms), *sx* and *sy* are the standard deviations of *xi* and *yi* and, finally, *x* and *y* the mean values, respectively. The patterns per station and given cluster showed very high similarity with r ≥ 0.974 and on average for all values r = 0.999. These results indicate that despite the rich topography and the variability of micro-climates in Greece, the clusters have regional stability in long distances.

In the sequel (Figure 9), the cumulative values of the centers of the four clusters are compared to the ones calculated using Huff's method of classification that is based on the quartile with the highest intensity. The two methods produce different results, while fuzzy clustering creates unitless cumulative curves that do not overlap. The four Huff curves were also tested, for all possible pairs, whether or not are drawn from the same distribution using the Kolmogorov–Smirnov test and three pairs failed to reject that hypothesis for both significance levels *a* = 5% and *a* = 10%.

**Figure 9.** (**a**) Unitless cumulative values using FCM. (**b**) Unitless cumulative curves using Huff's classification.

Finally, the topographic maps utilizing the generalized U-matrix using ESOM, as described in Section 2.7, using both FCM and Huff's classification results were produced (Figure 10). In both maps, the unitless rainstorms with membership >80% are used to remove equivocation from the data. The visualization indicates the presence of four clusters and that Huff's method misclassifies the data. The topographic map produced valleys with blue and green color (indicating smaller distances of the points) that are separated by ridges of brown and white color (representing larger distances).

**Figure 10.** Topographic maps with hypsometric tints using t stochastic neighbor embedding (t-SNE) and the generalized U-matrix. The topology of the map is toroidal (the top and bottom and as well the right and left are joined). Each colored sphere represents a rainstorm with its different classification, as it has already computed. (**a**) Results using FCM: the rainstorms that belong to the same cluster are separated by mountain ranges. (**b**) Results using Huff's classification: the rainstorms of different classification are mixed, despite the visual separation of mountain ridges in the map.

#### **4. Conclusions**

In this paper, the timeseries of rainfall data were processed in order to detect patterns. The data, coming from the Greek National Bank of Hydrological and Meteorological

Information, cover the totality of the country's regions. In the various stages of the project, novel methods were presented and implemented. First, extraction of rainstorms was executed via a stochastic method. Then, the timeseries were properly transformed to unitless form. The data were tested for clustering tendency prior to the application of clustering algorithms. An iterative form of fuzzy clustering was presented that does not assume a priori knowledge of the number of clusters. It consists of a repeated execution of fuzzy c-means clustering in combination with statistical testing for the choice of the relevant number of clusters. Finally, a verification of the clustering results and a comparison to the widely used Huff's method was attained through topographic maps. The overall approach extends previous rainfall pattern recognition efforts of the authors and of the literature in the sense of the unsupervised learning and in the direction of fuzzy analysis in rainfall and in any follow-up study of floods or water management problems.

**Author Contributions:** Conceptualization, methodology and software K.V.; writing—original draft preparation, K.V. and E.S.; writing—review and editing, E.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** Precipitation data are available from the Greek National Bank of Hydrological and Meteorological Information.

**Acknowledgments:** The data importing, analysis and presentation were done using the open source R language for statistical computing and graphics [68] using the packages: hydroscoper [43], hyetor [69], e1071 [70], FCPS [71], GeneralizedUmatrix [63], factoextra [72] and ggplot2 [73].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

