#### **1. Introduction**

Many statistical techniques are sensitive to the presence of outliers; even basic quantities such as the mean and standard deviation can be distorted by a single grossly inaccurate data point. Checking for outliers should therefore be a routine part of any data analysis.

To date, several tests have been developed for identifying outliers in samples from particular distributions. Most studies concern the Normal (or Gauss) distribution [1]. The first paper to attract attention to this matter was [2], followed by studies deriving the distribution of the extreme values in samples taken from Normal distributions [3]. A series of tests was then developed by Thompson in 1935 [4]; these were subsequently evaluated [5] and revised [6,7].

For other distributions, such as the Gamma distribution, procedures for detecting outliers were proposed [8] and revised [9], but unfortunately proved to be inefficient [10].

The first attempt to generalize the criterion for detecting outliers to any distribution can be found in [11], but further research on this subject is scarce, apart from a notable recent attempt by Bardet and Dimby [12].

The Grubbs test is a frequently used test for detecting outliers from a Normal distribution [7]. For a sample (x), the Grubbs test statistic takes the largest absolute deviation from the sample mean (x̄), in units of the sample standard deviation (s), in order to calculate the risk of error (α<sub>G</sub>) when stating that the values furthest from the mean (min(x), max(x), or both) are not outliers (see Table 1). The associated probabilities of the observed statistic (p<sub>G</sub>) are obtained from the Student t distribution [13].


**Table 1.** The Grubbs statistic (Equations (1) and (2)).

One should note that the Grubbs test statistic produces a symmetrical confidence interval (see Equations (1) and (2)). The Grubbs statistic as given in Table 1 is intended to be used with the parameters of the population (μ and σ), which are determined using the central moments (CM) method ($\hat{\mu} = \bar{x} = \sum\_{i} x\_i / n$; $\hat{\sigma} = s = (\sum\_{i} (x\_i - \bar{x})^2 / n)^{1/2}$).
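As an illustration (a minimal Python sketch, not the paper's MathCad source; the function name is an assumption, and the division by n follows the CM estimate above, whereas common implementations divide by n − 1):

```python
import numpy as np

def grubbs_statistic(x):
    """Largest absolute deviation from the sample mean, in units of the
    sample standard deviation (CM estimate: division by n, as in the text)."""
    x = np.asarray(x, dtype=float)
    s = np.sqrt(np.mean((x - x.mean()) ** 2))
    return np.max(np.abs(x - x.mean())) / s

print(grubbs_statistic([4.1, 5.0, 5.3, 9.9]))  # the suspect value 9.9 dominates
```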

Here, a method is proposed for constructing the confidence intervals for the extreme values of any continuous distribution for which the cumulative distribution function is also obtainable. The method involves the direct application of a simple test for detecting the outliers. The proposed method is based on deriving the statistic for the extreme values for the uniform distribution. Also, the proposed method provides a symmetrical confidence interval in the probability space.

#### **2. Materials and Methods**

The Grubbs test (Table 1) is based on the fact that if outliers exist, then these are "localized" as the maximum value and/or the minimum value in the dataset. Thus, the Grubbs test is essentially a sort of order statistic [14].

Some introductory elements are required to describe the proposed procedure. When a sample of data is tested under the null hypothesis that it follows a certain distribution, it is intrinsically assumed that the distribution is known. The usual assumption is that we possess its probability density function (PDF, for a continuous distribution) or its probability mass function (PMF, for a discrete distribution). The discussion below relates to continuous distributions, although the treatment of discrete distributions is similar to a certain degree. Nevertheless, a major distinction between continuous and discrete distributions in the treatment of data is made here: a continuous distribution is "dense", i.e., between any two distinct observations it is possible to observe another, while in the case of a discrete distribution this is generally not true.

Even when the PDF is known (possibly intrinsically), its (statistical) parameters may not be known, and this raises the complex problem of estimating the parameters of the (population) distribution from the sample; however, this issue is outside the scope of this paper. In general, the estimation of the parameters of the distribution is biased by the presence of outliers in the data, and thus identifying the outliers along with estimating the parameters is a difficult task, because two statistical hypotheses are operating at once. Assuming that the parameters of the distribution (of the PDF) are obtained using the maximum likelihood estimation method (MLE, Equation (3); see [15]), the uncertainty accompanying this estimation is arguably transmitted to the process of detecting the outliers.

$$\prod\_{i=1}^{n} \text{PDF}(x\_i;\ \text{parameters}) \to \text{max.} \iff -\sum\_{i=1}^{n} \ln\left(\text{PDF}(x\_i;\ \text{parameters})\right) \to \text{min.} \tag{3}$$

It should be noted that Equation (3) is a simplified statement of the MLE method, since its practical use involves partial derivatives with respect to the parameters; see the source code (MathCad language) for the MLE estimations in the Supplementary Materials available online.
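For readers without MathCad, a minimal sketch of the MLE step of Equation (3) in Python (assuming SciPy and, purely for illustration, a Normal population; the function name is illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mle_normal(xs):
    """Equation (3) in practice: minimize the negative log-likelihood
    (sigma is kept positive by optimizing over its logarithm)."""
    xs = np.asarray(xs, dtype=float)
    nll = lambda t: -np.sum(norm.logpdf(xs, loc=t[0], scale=np.exp(t[1])))
    res = minimize(nll, x0=[xs.mean(), np.log(xs.std())], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])  # (mu_hat, sigma_hat)

print(mle_normal([4.1, 5.0, 5.3, 9.9]))
```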

Either way (whether or not the uncertainty accompanying this estimation is transmitted to the process of detecting the outliers), once an estimate of the parameters of the distribution is available, a test (most desirably, a test based on a statistic) for detecting the presence of an outlier must provide the probability of observing the (assumed) "outlier" as a randomly drawn value from the distribution. What to do next with this probability is another statistical "trick": observing a value with a probability less than an imposed level (usually 5%) is defined as an unlikely event, and therefore the suspicion regarding the presence of the outlier is justified. With regard to this statistical "trick", in the opinion of the author, one "observation" is not enough; there should be a series of observations, coming from a series of statistics, each providing a probability. The unlikeliness of the event can then be safely ascertained by using Fisher's method of combining probabilities from independent tests (FCS, Equation (4); see [16–18]):

$$-2\sum\_{i=1}^{\tau} \ln(p\_i) \sim \chi^2(2\tau) \implies p\_{\text{FCS}} = 1 - \text{CDF}\_{\chi^2}\left(-2\sum\_{i=1}^{\tau} \ln(p\_i);\ 2\tau\right) \tag{4}$$

where $p\_1, \ldots, p\_\tau$ are the probabilities from τ independent tests, $\text{CDF}\_{\chi^2}$ is the χ² cumulative distribution function (for the CDF, see Equations (5) and (6) below), and $p\_{\text{FCS}}$ is the combined probability from the independent tests.
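A minimal sketch of Equation (4) in Python (assuming SciPy is available; the function name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined(pvals):
    """Fisher's method (Equation (4)): under H0, -2*sum(ln p_i) follows
    a chi-squared distribution with 2*tau degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    statistic = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(statistic, df=2 * len(pvals))  # sf(x) = 1 - CDF(x)

print(fisher_combined([0.04, 0.10, 0.07]))  # combines three borderline p-values
```

Note how three individually borderline probabilities combine into a much smaller overall probability, which is exactly the rationale given above for using several statistics rather than one.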

Taking the general case, let $(x\_1, \ldots, x\_n)$ be $n$ independent draws (or observations) from an (assumed known) continuous distribution defined by its probability density function $\text{PDF}(x;\ (\pi\_j)\_{1 \le j \le m})$, where $(\pi\_j)\_{1 \le j \le m}$ are the (assumed unknown) $m$ statistical parameters of the distribution. By integration over the (assumed known) domain (D) of the distribution, we have access to the associated cumulative distribution function (CDF), $\text{CDF}(x;\ (\pi\_j)\_{1 \le j \le m};\ \text{PDF})$, simply expressed as Equation (5):

$$\text{CDF}(x;\ (\pi\_j)\_{1 \le j \le m}) = \int\_{\inf(D)}^{x} \text{PDF}(t;\ (\pi\_j)\_{1 \le j \le m})\, dt \tag{5}$$

where inf(D) is used instead of min(D) to include unbounded domains (e.g., when inf(D) = −∞; "inf" stands for infimum, "min" for minimum). Please note that having the PDF and CDF does not necessarily imply that we have an explicit formula (or expression) for either of them. However, with access to numerical integration methods [19], it is enough to be able to evaluate them at any point (x).
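As a sketch of this remark (assuming SciPy for the quadrature; names are illustrative), Equation (5) can be evaluated at any point directly from the PDF:

```python
import numpy as np
from scipy.integrate import quad

def cdf_numeric(pdf, x, lower=-np.inf):
    """Equation (5) by numerical quadrature: integrate the PDF from inf(D) to x."""
    value, _abs_err = quad(pdf, lower, x)
    return value

# e.g., the standard Normal PDF, evaluated without any closed-form CDF
pdf = lambda t: np.exp(-t * t / 2) / np.sqrt(2 * np.pi)
print(cdf_numeric(pdf, 1.96))  # ~0.975
```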

Unlike $\text{PDF}(x;\ (\pi\_j)\_{1 \le j \le m})$, $\text{CDF}(x;\ (\pi\_j)\_{1 \le j \le m})$ is a bijective function (strictly increasing from the domain D onto its range of probabilities) and is therefore always invertible, even when no explicit formula is available; let "InvCDF" denote its inverse (Equation (6)):

$$\text{if } p = \text{CDF}(x;\ (\pi\_j)\_{1 \le j \le m}), \text{ then } x = \text{InvCDF}(p;\ (\pi\_j)\_{1 \le j \le m}), \text{ and vice versa} \tag{6}$$
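Even without an explicit formula, Equation (6) can be realized numerically; a minimal sketch by bisection (assuming a bracketing interval [lo, hi] is known; the function name is illustrative):

```python
def inv_cdf(cdf, p, lo, hi, tol=1e-10):
    """Equation (6) numerically: bisection on a monotone CDF,
    finding x such that CDF(x) = p within [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2

# e.g., median of Uniform(0, 10): InvCDF(0.5) = 5
print(inv_cdf(lambda x: x / 10, 0.5, lo=0.0, hi=10.0))
```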

$\text{CDF}(x;\ (\pi\_j)\_{1 \le j \le m};\ \text{PDF})$ is a powerful tool that greatly simplifies the problem at hand: the analysis of any distribution function (PDF) is translated so that only one distribution needs to be analyzed, namely the continuous uniform distribution. That is, a series of observed data $(x\_i)\_{1 \le i \le n}$ is expressed through the associated probabilities $p\_i = \text{CDF}(x\_i;\ (\pi\_j)\_{1 \le j \le m})$ (for $1 \le i \le n$), and the analysis can be conducted on the $(p\_i)\_{1 \le i \le n}$ series instead.
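For example (with an Exponential population assumed purely for illustration, via SciPy):

```python
from scipy.stats import expon

# Translating observations into probability space: if the data really come
# from the assumed distribution, the p_i behave as draws from Uniform(0, 1).
xs = [0.3, 1.1, 2.4, 9.7]
ps = [expon.cdf(x, scale=2.0) for x in xs]  # p_i = CDF(x_i; parameters)
print(ps)
```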

Since the analysis of the $(p\_i)\_{1 \le i \le n}$ series of probabilities is a native case of order statistics, the discussion now turns to order statistics. The first studies in this area were by the fathers of modern statistics, Karl Pearson [20] and Ronald A. Fisher [3], while the first order statistic applicable to any distribution (not only the Normal distribution) was studied by Cramér and von Mises (see [21,22]).

An order statistic operating on probabilities $(p\_i)\_{1 \le i \le n}$ will sort the values (let $(q\_i)\_{1 \le i \le n}$ be the series of sorted $(p\_i)\_{1 \le i \le n}$ values, Equation (7), where SORT is assumed to be a procedure that sorts the values in ascending order) and will assess their departure from the continuous uniform distribution.

$$(\mathbf{q}\_{i})\_{1 \le i \le n} \leftarrow \text{SORT}((\mathbf{p}\_{i})\_{1 \le i \le n}) \tag{7}$$

Since the assessment of the departure from the continuous uniform distribution cannot be made directly, a series of order statistics has been proposed by several authors, including Cramér and von Mises [21,22], Kolmogorov and Smirnov [23–25], Anderson and Darling [26,27], Kuiper (V) [28], Watson (U²) [29], and the H1 statistic [18]; see Equation (8). They remain in use today.

For instance, the Kolmogorov-Smirnov (KS) method calculates the KS statistic (see Equation (8)) and then tests the value obtained from a sample against the threshold of a chosen significance level (usually 5%).

In order to obtain the thresholds for a series of significance levels, these statistics can be derived from Monte-Carlo ("MC") simulations [30] deployed on a large number of samples, in order to reflect, as closely as possible, the state of the population.

$$\begin{aligned}
KS\_{\text{Statistic}} &= \sqrt{n} \cdot \max\_{1 \le i \le n}\left( q\_i - \frac{i-1}{n},\ \frac{i}{n} - q\_i \right)\\
KV\_{\text{Statistic}} &= \sqrt{n} \cdot \left( \max\_{1 \le i \le n}\left( q\_i - \frac{i-1}{n} \right) + \max\_{1 \le i \le n}\left( \frac{i}{n} - q\_i \right) \right)\\
AD\_{\text{Statistic}} &= -n - \frac{1}{n} \cdot \sum\_{i=1}^{n} (2i-1) \cdot \ln\left( q\_i \cdot (1 - q\_{n+1-i}) \right)\\
CM\_{\text{Statistic}} &= \frac{1}{12n} + \sum\_{i=1}^{n} \left( \frac{2i-1}{2n} - q\_i \right)^{2}\\
WU\_{\text{Statistic}} &= CM\_{\text{Statistic}} + \left( \frac{1}{2} - \frac{1}{n}\sum\_{i=1}^{n} q\_i \right)^{2}\\
H1\_{\text{Statistic}} &= -\sum\_{i=1}^{n} q\_i \cdot \ln(q\_i) - \sum\_{i=1}^{n} (1 - q\_i) \cdot \ln(1 - q\_i)
\end{aligned} \tag{8}$$
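A minimal Python sketch of Equations (7) and (8) (the function name is illustrative):

```python
import numpy as np

def order_statistics(p):
    """The six statistics of Equation (8), computed from probabilities p_i."""
    q = np.sort(np.asarray(p, dtype=float))        # Equation (7)
    n = len(q)
    i = np.arange(1, n + 1)
    ks = np.sqrt(n) * np.max(np.maximum(q - (i - 1) / n, i / n - q))
    kv = np.sqrt(n) * (np.max(q - (i - 1) / n) + np.max(i / n - q))
    ad = -n - np.sum((2 * i - 1) * np.log(q * (1 - q[::-1]))) / n  # q[::-1] is q_{n+1-i}
    cm = 1 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - q) ** 2)
    wu = cm + (0.5 - q.mean()) ** 2
    h1 = -np.sum(q * np.log(q)) - np.sum((1 - q) * np.log(1 - q))
    return {"KS": ks, "KV": kv, "AD": ad, "CM": cm, "WU": wu, "H1": h1}

print(order_statistics([0.02, 0.31, 0.47, 0.66, 0.83]))
```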

#### **3. Proposed Outlier Detection Statistic**

A statistic was developed to be applicable to any distribution. For a series of probabilities $(p\_i)\_{1 \le i \le n}$ (or sorted probabilities $(q\_i)\_{1 \le i \le n}$) associated with a series of (repeatedly drawn) observations $(x\_i)\_{1 \le i \le n}$, the differences $(r\_i)\_{1 \le i \le n}$ are calculated as in Equation (9):

$$r\_i = \left| p\_i - 0.5 \right|, \text{ for } 1 \le i \le n \tag{9}$$

The statistic, called "g1" (see below), is built from the differences given in Equation (9) and is defined in Equation (10).

$$\mathbf{g1} = \max\_{1 \le i \le n} \mathbf{r}\_i \tag{10}$$

It should be noted that Equations (9) and (10) provide the same result regardless of whether the calculation is made on the sorted series of probabilities $(q\_i)\_{1 \le i \le n}$ or on the unsorted series $(p\_i)\_{1 \le i \le n}$.
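A minimal sketch of the "g1" computation (Equations (9) and (10)), assuming a hypothesized Normal population purely for illustration:

```python
import numpy as np
from scipy.stats import norm

def g1_statistic(xs, cdf):
    """Equations (9) and (10): the largest absolute departure, in probability
    space, from the equiprobable point 0.5 (the median)."""
    p = np.asarray([cdf(x) for x in xs])
    return np.max(np.abs(p - 0.5))

# e.g., against a hypothesized N(5, 1) population; 9.9 is the suspect value
print(g1_statistic([4.1, 5.0, 5.3, 9.9], lambda x: norm.cdf(x, loc=5, scale=1)))
```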

Regarding the name of the proposed statistic ("g1"): when Equations (1) and (2) (G<sub>min</sub>, G<sub>max</sub>, G<sub>all</sub>) are compared with Equation (9), for a standard Normal distribution N(x; μ = 0, σ = 1) the equation defining G<sub>all</sub> becomes very much like Equation (9). The difference is that in Equation (2) the sample mean (x̄) is used as an estimate for the mean of the population (μ) and the sample standard deviation (s) as an estimate for the standard deviation of the population (σ), while Equation (9) expresses essentially the same thing in terms of the associated probabilities ($p\_i = P(X \le x\_i) = \text{CDF}\_{\text{Normal}}(x\_i; \mu, \sigma)$, $0.5 = P(X \le \mu) = \text{CDF}\_{\text{Normal}}(\mu; \mu, \sigma)$).

Therefore, the proposed statistic very much resembles the Grubbs test for normality (hence its name). One difference is that the Grubbs test uses sample statistics (x̄ and s) to calculate the sample G<sub>all</sub> value, thereby reducing the degrees of freedom associated with the value (from n to n−2), while for the g1 value (and statistic) the degrees of freedom remain unchanged (n). The major difference, and the one that makes the proposed statistic generalizable to any distribution, is that the mean used in the Grubbs test is replaced by the median; the beauty of this change is that for symmetrical distributions (including the Normal distribution) the two coincide.

A further connection with other statistics must also be noted. If any sample is resampled by extracting only its smallest and largest values, then the Kolmogorov-Smirnov statistic for those subsamples almost perfectly resembles the proposed "g1" statistic (by setting n = 2 in the first line of Equation (8)).

Since the CDF is a bijective function (see Equation (6)), generalizing the Grubbs test for detecting outliers of the Normal distribution into the "g1" statistic for detecting outliers of any distribution is a natural extension. The "g1" test associated with the "g1" statistic operates in the probability space ($(p\_i)\_{1 \le i \le n}$ or $(q\_i)\_{1 \le i \le n}$) instead of the observation space ($(x\_i)\_{1 \le i \le n}$); its calculation formula (Equations (9) and (10)) is slightly different from those given in Equations (1) and (2); and the probability associated with the departure is no longer extracted from the Student t distribution (as in Equations (1) and (2)). The change from the mean (μ for G<sub>all</sub>) to the median (0.5 in Equation (9)) is a safe extension to any distribution type, since Equation (9) measures (or accounts for) the extreme departures from the equiprobable point: having an observation y (y ← X) with y ≤ InvCDF(0.5; parameters) and an observation z (z ← X) with z ≥ InvCDF(0.5; parameters) are equiprobable events.

One way to associate a probability with the "g1" statistic is to perform a Monte-Carlo (MC) simulation.
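A minimal sketch of such a simulation (drawing directly in probability space, since under the null hypothesis the probabilities are uniform on (0, 1); the sample size, seed, and trial count are illustrative, and the case where the parameters are themselves estimated from the sample would need an extra estimation step in each trial):

```python
import numpy as np

def g1_critical_value(n, alpha=0.05, trials=100_000, seed=42):
    """Monte-Carlo threshold for g1 at significance level alpha: simulate
    samples of size n from Uniform(0, 1) and take the (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    g1 = np.abs(rng.uniform(size=(trials, n)) - 0.5).max(axis=1)
    return np.quantile(g1, 1 - alpha)

print(g1_critical_value(10))  # reject the extreme value if the observed g1 exceeds this
```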
