**4. Simulation Study**

A Monte Carlo (MC) study was conducted. Two strategies were developed to deal efficiently with a very large amount of data and, specifically, to solve the order statistics problem (that is, first sampling from the uniform distribution, and later applying Equations (7)–(10)). One of these alternatives has been described in [14]; the other is described below. Table 2 shows the details of the conducted MC study.


**Table 2.** Details of the MC simulation on "g1" outlier detection statistic.

For each observed sample size *n*, *m* samples per run (see Table 2) were generated from the standard continuous uniform distribution (i.e., on the [0, 1] interval). The outlier detection statistic "g1" was calculated (Equations (9) and (10)). From the large pool of sampled and resampled data (m·resa·repe = 7·10<sup>9</sup> in Table 2), the repetitions were joined by extracting (*n*, *p*, *g1*) triplets at the *p*·*n* control points, that is, with the probability *p* running from 0.001 to 0.999 in steps of 0.001 for each *n* (from 2 to 12). The external repetitions (resa = 7 in Table 2) were combined by taking the median (the median being a sufficient statistic [31] for any order statistic, such as in the extraction of the (*n*, *p*, *g1*) triplets at the *p*·*n* control points). The MC simulation was conducted with the configuration defined in Table 2. The obtained data were recorded in separate files by sample size and analyzed as such.
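The core of the sampling scheme above can be sketched in a few lines of Python (a minimal illustration with our own function names and with far fewer repetitions than in Table 2, not the actual simulation code):

```python
import random

def g1(sample):
    # "g1" statistic (Equations (9) and (10)): the largest absolute
    # deviation of the sampled uniform probabilities from the median 0.5
    return max(abs(p - 0.5) for p in sample)

def simulate_g1_quantiles(n, m=100_000, seed=42):
    # Draw m samples of size n from the standard uniform distribution,
    # compute g1 for each, and extract the empirical quantiles at the
    # control probabilities p = 0.001, 0.002, ..., 0.999
    rng = random.Random(seed)
    stats = sorted(g1([rng.random() for _ in range(n)]) for _ in range(m))
    return [stats[int(i * m / 1000)] for i in range(1, 1000)]

quantiles = simulate_g1_quantiles(n=5)
```

In the study itself this is repeated `resa` times and the per-control-point medians are kept; the sketch above performs a single repetition.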

The objective associated with any statistic is to obtain the cumulative distribution function (CDF, Equation (5)), so that evaluating the CDF at the value of the statistic obtained from the sample (Equations (9) and (10)) yields a probability for the sampling. Please note that only in fortunate cases is this possible; generally, only the critical values (values corresponding to certain risks of being in error) or approximation formulas are available (see for instance [21,24,26,28,29]). Here, the analytical CDF formula was obtained for the "g1" outlier detection statistic.

#### **5. The Analytical Formula of CDF for g1**

The "g1" statistic has a very simple calculation formula (see Equation (9)) and, as expected, its CDF formula is also very simple (see Equation (11)). Thus, for a calculated sample statistic g1 (x ← g1 in Equation (11)), the significance level (α ← 1 − p) is immediate (Equation (11), where P represents the probability that the random variable X takes a value less than or equal to x).

$$p = \text{CDF}\_{\text{"g1"}}(x; n) = \text{P}(X \le x) = (2x)^n, \quad \alpha = 1 - p = 1 - (2x)^n \tag{11}$$

#### **6. Simulation Results for the Distribution of the "g1" Statistic**

The results of the simulation for n varying from 2 to 10 were sufficient to provide a clear indication of the analytical formula for the CDF of "g1". Descriptive statistics, including the Standard Error (SE, Equation (12)) between the expected probability (from the MC simulation) and the calculated probability (from Equation (11), *p̂<sub>i</sub>* ← (2·*x<sub>i</sub>*)<sup>n</sup>), as well as the highest positive and highest negative departures, are given in Table 3.

$$SE = \sqrt{\frac{1}{999} \sum\_{i=1}^{999} \left(p\_i - \hat{p}\_i\right)^2}, \quad p\_i = \frac{i}{1000} \tag{12}$$
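As an illustration of how Equation (12) can be evaluated, the sketch below (our own code, scaled down from the configuration of Table 2) regenerates the empirical quantiles of "g1" and measures their gap to the analytic CDF of Equation (11):

```python
import math
import random

def empirical_quantiles_g1(n, m, rng):
    # g1 = max |u - 0.5| over n standard uniform draws (Equations (9)-(10))
    stats = sorted(max(abs(rng.random() - 0.5) for _ in range(n))
                   for _ in range(m))
    return [stats[int(i * m / 1000)] for i in range(1, 1000)]

def standard_error(n, m=200_000, seed=1):
    # Equation (12): RMS gap between the control probabilities p_i = i/1000
    # and the analytic CDF (2*x_i)^n evaluated at the empirical quantiles x_i
    xs = empirical_quantiles_g1(n, m, random.Random(seed))
    return math.sqrt(sum((i / 1000 - (2 * x) ** n) ** 2
                         for i, x in enumerate(xs, start=1)) / 999)

se_n5 = standard_error(5)  # small: on the order of 1e-3 at this m
```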

**Table 3.** Descriptive statistics for the agreement in the calculation of the "g1" statistic (Equation (10) vs. Equation (11)).


As can be observed in Table 3, the standard error (SE) slowly decreases beginning with n = 7, being two orders of magnitude smaller (actually about 200 times smaller) than the step of the MC experiment. Since the standard error alone is not proof that Equation (11) is the true CDF formula providing the probability for the g1 statistic, the smallest and the largest differences between the observed and the expected probabilities are also given in Table 3. They substantiate that Equation (11) is indeed the right estimate for the CDF of g1. For convenience, Figure 1 shows the value of the error at each observation point (999 points corresponding to p = 0.001 up to p = 0.999 for each n from 2 to 12).

**Figure 1.** Departures between expected and observed probabilities for g1 statistic (Equation (10) vs. Equation (11)).

Regarding the estimation error (of the "g1" statistic) depicted in Figure 1, the error is rarely larger than 10<sup>−5</sup>, never larger than 1.5 × 10<sup>−5</sup>, and tends to become smaller as the sample size (n) increases. Using Equation (11), Figure 2 depicts the shape of CDF<sub>"g1"</sub>(x;n).

With regard to the "g1" statistic (depicted in Figure 2), a variable distributed by the "g1" statistic (see Equation (11)) takes values between 0 and 0.5, with the mode at p = 0 (a vertical asymptote at p = 0), a median of 2<sup>−1</sup>·2<sup>−1/n</sup> (with a left asymmetry that decreases as n increases and converges (for n → ∞) to symmetry) and a mean of n/(2(n+1)).
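The median and mean of the "g1" distribution follow directly from the CDF of Equation (11) (density n·2<sup>n</sup>·x<sup>n−1</sup> on [0, 0.5]); as a short check:

$$\left(2 x\_{\text{med}}\right)^n = \frac{1}{2} \Rightarrow x\_{\text{med}} = 2^{-1} \cdot 2^{-1/n}, \qquad \text{E}[X] = \int\_0^{1/2} x \cdot n 2^n x^{n-1}\, dx = \frac{n}{2(n+1)}$$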

**Figure 2.** CDF"g1"(x;n) for n = 2 to n = 20.

The expression of CDF"g1" is easily inverted (see Equation (13)).

$$\text{CDF}\_{\text{"g1"}}(x;n) = (2x)^n \to \text{InvCDF}\_{\text{"g1"}}(p;n) = \sqrt[n]{p}/2 \tag{13}$$
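Equations (11) and (13) translate directly into code; the following snippet (our own sanity check) confirms the round-trip and reproduces the critical value used in the n = 205 example of Section 7:

```python
def cdf_g1(x, n):
    # Equation (11): P(X <= x) = (2x)^n for x in [0, 0.5]
    return (2 * x) ** n

def inv_cdf_g1(p, n):
    # Equation (13): x = p^(1/n) / 2
    return p ** (1 / n) / 2

crit = inv_cdf_g1(0.95, 205)  # ≈ 0.499875, as quoted in Section 7
```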

#### **7. From the "g1" Statistic to "g1" Confidence Intervals for the Extreme Values**

Equation (13) can be used to calculate the critical values of the "g1" statistic for any values of α (α ← 1 − p) and n. The critical values of the "g1" statistic act as the boundaries of the confidence intervals.

By setting the risk of being in error α (usually at 5%), then p = 1 − α and Equation (13) can be used to calculate the statistic associated with it (InvCDF<sub>"g1"</sub>(1−α;n) = <sup>n</sup>√(1 − α)/2). By placing this value into Equations (9) and (10), the (extreme) probabilities can be extracted (Equation (14)).

$$\max\_{1 \le i \le n} |p\_i - 0.5| = \sqrt[n]{1 - \alpha}/2 \to p\_{\text{extreme}}(\alpha) = 0.5 \pm \sqrt[n]{1 - \alpha}/2 \tag{14}$$

One should note that the confidence interval defined by Equation (14) is symmetric.
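A small helper (the function name is ours) evaluates the symmetric bounds of Equation (14):

```python
def p_extreme(alpha, n):
    # Equation (14): symmetric probability bounds 0.5 ± (1 - alpha)^(1/n) / 2
    # for the extreme cumulative probabilities in a sample of size n
    half = (1 - alpha) ** (1 / n) / 2
    return 0.5 - half, 0.5 + half

p_min, p_max = p_extreme(0.05, 10)  # ≈ (0.00256, 0.99744) for n = 10, α = 5%
```

These are the probabilities that are then fed through the inverse CDF of the distribution under test (Equation (15)).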

In order to arrive at the confidence intervals for the extreme values in the sampled data (Equation (15)) it is necessary to use the inverse of the CDF (again), and for the distribution of the sampled data.

$$x\_{\text{extreme}}(\alpha) = \text{InvCDF}\_{\text{"Distribution"}}(0.5 \pm \sqrt[n]{1 - \alpha}/2; \text{"parameters"}) \tag{15}$$

To illustrate the calculation of the confidence intervals for the extreme values in the sampled data, a series of 206 data points was chosen from [32]. The data were tested against the assumption that they follow a generalized Gauss-Laplace distribution (Equation (16), a symmetrical distribution), and then checked for observations suspected to be outliers. The steps of this analysis and the obtained results are given in Table 4.

$$\text{PDF}\_{\text{"GL"}}(x;\mu,\sigma,k) = c\_1 \sigma^{-1} e^{-|c\_0 z|^k}, \quad c\_0 = \left(\frac{\Gamma(3/k)}{\Gamma(1/k)}\right)^{1/2}, \quad c\_1 = \frac{k c\_0}{2\Gamma(1/k)}, \quad z = \frac{x-\mu}{\sigma} \tag{16}$$

The greatest departure from the median (0.5) for the 206 PCB dataset (Table 4) was for the observation 9.603 (CDF"GL"(9.603; μ = 6.47938, σ = 0.82828, k = 1.79106) = 0.9998). Given the magnitude of this deviation from the median, 9.603 was suspected of being an outlier and was removed (it should be noted that, in a broader context, an outlier can also be seen as an atypical observation, correctly collected from the population as part of the data generation process, and thus it may be kept in the sample, though probably with a smaller weight). The same procedure (as in Table 4) can be applied to the remaining data (205 observations). Then, InvCDF"g1"(1−0.05; 205) = 0.499875, pmin(n=205) = 0.0001251 and pmax(n=205) = 0.9998749. The MLE estimates for the parameters of the Gauss-Laplace distribution remain unchanged (μ = 6.47938, σ = 0.82828, k = 1.79106) and the removed observation (9.603) is still not an outlier (xmax = InvCDF"GL"(0.9998749; μ = 6.47938, σ = 0.82828, k = 1.79106) = 9.7166 > 9.603).

**Table 4.** Distribution analysis for a series of 206 measurements of the octanol-water partition coefficient (K<sub>ow</sub>) of polychlorinated biphenyls, expressed in logarithmic scale (log<sub>10</sub>(K<sub>ow</sub>)).




#### **8. Proposed Procedure for Detecting the Outliers**

The procedure for detecting the outliers should start with measuring the agreement between the observed and estimated probabilities (Figure 3).

Figure 3 contains a statistical "trick": when there are no outliers, the statistics measuring the gap between the observations and the model (order statistics, Equation (6)) are in agreement (their associated probabilities are not too far from each other). When outliers exist, the order statistics are sensitive to their presence. Since this is a separate subject, for further discussion please see the series of papers beginning with [32–34].

**Figure 3.** The procedure for detecting outliers.

#### **9. Second Simulation Assessing "Grubbs" and "g1" Outlier Detection Alternatives**

Another MC study was designed to test the claim that the proposed method provides consistent results. This second MC simulation is much simpler than the one used to derive the data for constructing the outlier statistics (Figure 4).

**Figure 4.** The procedure for testing the outlier statistics.

The data used here as proof are from [7] and all cases involve a Normal distribution (Distribution = Normal in Equation (15); the PDF and CDF of the Normal distribution are given in Equation (18); a symmetrical distribution) with an α = 5% risk of being in error. The parameters of the Normal distribution (μ and σ) are determined for each case, as well as the sample size (Equation (17)).

$$x\_{\text{extreme}}(\alpha) = \text{InvCDF}\_{\text{"Normal"}}(0.5 \pm 0.5 \cdot \sqrt[n]{1 - \alpha}; \mu, \sigma) \tag{17}$$

$$\text{PDF}\_{\text{"Normal"}}(x;\mu,\sigma) = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}, \quad \text{CDF}\_{\text{"Normal"}}(x;\mu,\sigma) = \int\_{-\infty}^{x} \text{PDF}\_{\text{"Normal"}}(t;\mu,\sigma)\,dt \tag{18}$$

For comparison, the same strategy for calculating the confidence intervals of the extreme values for the Normal distribution with the Grubbs test statistic (Equation (2)) was used to provide an alternate result (Equation (19)).

$$x\_{\text{crit}}(\alpha) = \overline{x} \pm G\_{\text{crit}}(\alpha) \cdot s, \quad G\_{\text{crit}}(\alpha) = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t\_G^2(\alpha)}{n - 2 + t\_G^2(\alpha)}}, \quad t\_G = \text{InvCDF}\_{\text{"Student"}}\left(\frac{\alpha}{2n}; n-2\right) \tag{19}$$
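For illustration, the Grubbs bounds of Equation (19) can be reproduced with Python's standard library alone; since the stdlib has no Student-t quantile, the sketch below computes it numerically (Simpson integration plus bisection) — adequate for a check, not a production implementation:

```python
import math

def t_pdf(x, df):
    # Student-t probability density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    # CDF via Simpson's rule on [0, |x|]: CDF(x) = 0.5 + sign(x) * integral
    b = abs(x)
    if b == 0:
        return 0.5
    h = b / steps
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + math.copysign(s * h / 3, x)

def t_inv_cdf(p, df):
    # Quantile by bisection for p in (0, 1)
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return (lo + hi) / 2

def grubbs_bounds(data, alpha=0.05):
    # Equation (19): mean ± G_crit*s, t_G = InvCDF_Student(alpha/(2n); n-2)
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    t2 = t_inv_cdf(alpha / (2 * n), n - 2) ** 2
    g = (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))
    return mean - g * s, mean + g * s

bounds = grubbs_bounds([568, 570, 570, 570, 572, 572, 572, 578, 584, 596])
# upper bound ≈ 595.13, matching the value quoted for this dataset in Section 9
```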

The steps followed in this analysis are given in Table 5.

**Table 5.** Comparison of the steps of the analysis and simulation for the confidence intervals of the extreme values (proposed method vs. Grubbs test).


Results of the analysis using the steps given in Table 5 for the first dataset are given in Table 6.


**Table 6.** Outlier analysis results for {568, 570, 570, 570, 572, 572, 572, 578, 584, 596} dataset.

In regard to the results given in Table 6:

At step 1, CPs are the cumulative probabilities ({p1, ... , p10} in Figure 3) associated with the series of the observations from the sample ({x1, ... , x10} in Figure 3).

At step 2, the data passes the normality test (α<sub>FCS</sub> = 7% > 5% = α, see Figure 3).

Step 3 was performed for n = 10 (see Figure 4). (a) The proposed method does not detect outliers in the sample (552.086 < 568, 596 < 598.314); (b) the Grubbs test detects 596 as being an outlier (596 > 595.13).
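For reference, the proposed-method bounds in (a) can be reproduced with Python's standard library (`statistics.NormalDist`); note that matching the reported values requires taking the Normal parameters as the sample MLEs, i.e., σ estimated with the 1/n variance — an assumption of this sketch:

```python
import math
from statistics import NormalDist

def proposed_bounds(data, alpha=0.05):
    # Equations (14) and (17): feed the symmetric probability bounds
    # 0.5 ± (1 - alpha)^(1/n) / 2 into the inverse Normal CDF, with the
    # parameters estimated from the sample (MLE: mean, and 1/n variance)
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    half = (1 - alpha) ** (1 / n) / 2
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(0.5 - half), dist.inv_cdf(0.5 + half)

lo, hi = proposed_bounds([568, 570, 570, 570, 572, 572, 572, 578, 584, 596])
# lo ≈ 552.09, hi ≈ 598.31: all ten observations fall inside, so no outliers
```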

At step 4 (see Figure 4), since {510, 526} are comparable with 500 and {1977, 2009} are much greater than 500, the results lead to the conclusion that the existing method produces type I errors by leading to false positive detection of outliers in the samples while the proposed method does not.

#### **10. Going Further with the Outlier Analysis**

What if 596 is removed from the sample? Table 7 provides mirror-like results for this scenario.


**Table 7.** Outlier analysis results for {568, 570, 570, 570, 572, 572, 572, 578, 584} dataset.

As can be observed in Table 7, the data are now in better agreement with normality (α<sub>FCS</sub> in Table 6 is 7%, while in Table 7 it is 16%) and there is no change in the accuracy of the classification ({563, 543} are comparable with 500, while {2341, 2333} are much greater than 500; the existing method produces type I errors by leading to false positive detection of outliers in the samples, while the proposed method does not). When comparing the results given in Table 6 with those given in Table 7, it should be noted that both tests (Grubbs and the newly proposed g1) produce somewhat confusing results (see Table 8 for side-by-side outcomes).


**Table 8.** Side-by-side comparison of the analysis of the samples.

Table 8 highlights the fact that based on the {568, 570, 570, 570, 572, 572, 572, 578, 584} sample, the g1 test may be interpreted as identifying 596 as being an outlier. This is not quite true because the g1 test was not intended to be used in this way. That is, 596 is outside of the dataset, so at the time of constructing the confidence intervals for the extreme values, the information regarding its observation was missing.

Another trial was done, this time with **601** replacing **596** in the initial dataset (Table 9).

**Table 9.** Outlier analysis results for the {568, 570, 570, 570, 572, 572, 572, 578, 584, **601**} dataset.


In a further trial, **604** replaced **596** in the initial dataset (Table 10).


**Table 10.** Outlier analysis results for the {568, 570, 570, 570, 572, 572, 572, 578, 584, 604} dataset.

The conclusion is simple (see the results in Tables 6, 7, 9 and 10): a test will hardly ever detect an outlier in a small sample; it is more likely to reject the hypothesis that the sample was drawn from the distribution itself!

The same trick was used on a bigger sample and the results are shown in Table 11 (the dataset is from Table 4).


**Table 11.** Outlier analysis results for Table 4 dataset under the assumption of normal distribution.

On one hand, as the results in Table 11 prove, the proposed method correctly identifies the confidence interval for the extreme values, while the existing method does not.

On the other hand, the results in Table 11 also show that the likelihood of identifying the outliers increases with the sample size, making it perfectly possible to identify outliers with the proposed method, although this is not the case in small samples. It is possible to detect the outliers in small samples as well, but not when the parameters of the distribution are derived from the sample data—only when the parameters of the distribution are known a priori or determined from other samples (the results given in Tables 6–10 are proof of this).

## **11. Further Discussion**

The obtained expression for CDF of "g1" (Equation (11)) reveals the domain of a random variable distributed by the "g1" statistic ([0, 0.5]), which is consistent with the definition of "g1" (Equations (9) and (10)).

Independently of the shape of the theoretical distribution being tested (the generic case is defined by Equation (5)), as defined by Equations (9) and (10), the newly proposed statistic "g1" defines a symmetric confidence interval for the extreme values in samples in the probability space (Equation (14)). Later, this symmetric confidence interval may be changed back into an asymmetrical one when it is expressed in the domain of the theoretical distribution being tested (Equation (15)). It should be recognized that "g1" uses a symmetrization strategy to obtain the confidence interval for the extreme values in samples.

It might seem that the literature on robust statistics was ignored in this work; however, this is not entirely true. In fact, a whole pool of robust statistics was used extensively in the study (see Equation (8)), introduced as a tool in Table 5 and involved in the later calculations (Tables 6, 7 and 9–11). Also, it should be noted that the substitution of the mean by the median is not a new idea; it is well known in the field of robust statistics (for example, Watson's U<sup>2</sup> [29], the WU statistic in Equation (8), uses it).

A short literature survey provides several examples of current real applications that require the proposed method. Thus, in signal processing, non-stationary, non-Gaussian, spiky signals are usually regarded as outliers and thus discarded (see [35–38] for typical cases). In this context, it should be noted that Mood's median test is preferred to the Kruskal-Wallis test when outliers are present [39]. The identification of outliers is also recognized as an issue in the validation of protein structures, and the current methods are reviewed in [40]. Other examples can be found in [41].

In the wider context, an alternate window-based strategy has been proposed in which outliers are detected in each window by the Tukey method and labeled so that they can be excluded from the realization of the process points to be used for model identification [42]. A contingency-based strategy proposes maximization of true positive (TP) values and minimization of false negative (FN) and false positive (FP) values [43]. Finally, another distribution testing procedure has been proposed in [44].
