Article

Entropy Measures for Stochastic Processes with Applications in Functional Anomaly Detection

Gabriel Martos, Nicolás Hernández, Alberto Muñoz and Javier M. Moguerza
1 Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires and CONICET, Buenos Aires C1428EGA, Argentina
2 Department of Statistics, Universidad Carlos III de Madrid, 28903 Getafe, Spain
3 Department of Computer Science and Statistics, University Rey Juan Carlos, 28933 Móstoles, Spain
* Authors to whom correspondence should be addressed.
Entropy 2018, 20(1), 33; https://doi.org/10.3390/e20010033
Submission received: 5 December 2017 / Revised: 29 December 2017 / Accepted: 2 January 2018 / Published: 11 January 2018
(This article belongs to the Special Issue Entropy in Signal Analysis)

Abstract: We propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets. These sets are relevant to detect anomalous or outlier functional data. A numerical experiment illustrates the performance of the proposed method; in addition, we conduct an analysis of mortality rate curves as an interesting application in a real-data context to explore functional anomaly detection.

1. Introduction

The family of α-entropies, originally proposed by Rényi [1], plays an important role in information theory and statistics. Consider a random variable Z distributed according to a measure F that admits a probability density function f. Then, for $\alpha \geq 0$ and $\alpha \neq 1$, the α-entropy of Z is computed as follows:

$$H_\alpha(Z) = \frac{1}{1 - \alpha} \log V_\alpha(Z), \tag{1}$$

where $V_\alpha(Z) = E_F\{f^{\alpha - 1}\}$ and $E_F$ stands for the expected value with respect to the measure F. Several renowned entropy measures in the statistical literature are particular cases of the family of α-entropies. For instance, when $\alpha = 0$, we obtain the Hartley entropy; when $\alpha \to 1$, $H_\alpha$ converges to the Shannon entropy; and when $\alpha \to \infty$, $H_\alpha$ converges to the Min-entropy measure. The contribution of this paper is two-fold. Firstly, we propose a natural definition of entropy for stochastic processes that extends the one above, together with a suitable sample estimator for partially observed realizations of the process, the typical framework when dealing with functional data. Secondly, we show that Minimal Entropy Sets (MES), as formally defined in Section 3, are useful for solving anomaly detection problems, a common task in almost all data analysis contexts.
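To make the definition concrete, the following minimal Python sketch (ours, not part of the original paper) estimates $H_\alpha(Z)$ by Monte Carlo for an assumed standard normal Z, using the identity $V_\alpha(Z) = E_F\{f^{\alpha-1}\}$; for α near 1, the estimate approaches the Shannon entropy $\frac{1}{2}\log(2\pi e) \approx 1.419$.

```python
import numpy as np
from scipy.stats import norm

def renyi_entropy_mc(sample, pdf, alpha):
    """Monte Carlo estimate of H_alpha via V_alpha(Z) = E_F{ f^(alpha - 1) }."""
    v_alpha = np.mean(pdf(sample) ** (alpha - 1.0))
    return np.log(v_alpha) / (1.0 - alpha)

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)   # sample from F = N(0, 1), an assumed example

# For alpha close to 1 the estimate approaches the Shannon entropy of
# N(0, 1), which is 0.5 * log(2 * pi * e) ~ 1.4189.
for alpha in (0.5, 0.999, 2.0):
    print(alpha, renyi_entropy_mc(z, norm.pdf, alpha))
```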
The paper is structured as follows: In Section 2, we introduce a definition of entropy for a stochastic process and suitable sample estimators for this measure. In Section 3, we show how to estimate minimum-entropy sets of a stochastic process in order to discover atypical functional data in a sample. Section 4 illustrates the theory with simulations and examples, and Section 5 concludes the work.

2. Entropy of a Stochastic Process

In this section, we extend the definition of entropy to a stochastic process. In the sequel, let $(\Omega, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is the σ-algebra in Ω and P a σ-finite measure. We consider random elements (functions) $X(\omega, t): \Omega \times T \to \mathbb{R}$ on a metric space $(T, \tau)$. As usual in the case of functional data, the realizations of the random elements $X(\omega, \cdot)$ are assumed to lie in $C(T)$, the space of real continuous functions on a compact domain $T \subset \mathbb{R}$ endowed with the uniform metric.
The first step is to consider a suitable representation of the stochastic process. We make use of the well-known Karhunen–Loève expansion [2] (p. 25, Theorem 1.5). Let $X(\omega, t)$ be a centered (zero-mean) stochastic process with continuous covariance function $K_X(s,t) = E(X(\omega,s)\,X(\omega,t))$; then there exists a basis $\{e_i\}_{i \geq 1}$ of $C(T)$ such that for all $t \in T$:

$$X(\omega, t) = \sum_{i=1}^{\infty} \xi_i(\omega)\, e_i(t), \tag{2}$$

where the sequence of random coefficients $\xi_i(\omega) = \int_T X(\omega, t)\, e_i(t)\, dt$ comprises zero-mean random variables with (co)variance $E(\xi_i \xi_j) = \delta_{ij} \lambda_j$, $\delta_{ij}$ being the Kronecker delta and $\{\lambda_j\}_{j \geq 1}$ the sequence of eigenvalues associated with the eigenfunctions of $K_X(s,t)$.

The equality in Equation (2) must be understood in the mean square sense, that is:

$$\lim_{d \to \infty} E\left\{ \left( X(\omega, t) - \sum_{i=1}^{d} \xi_i(\omega)\, e_i(t) \right)^{2} \right\} = 0,$$

uniformly in T. Therefore, we can always consider an ε-near representation $X_d(\omega, t) = \sum_{i=1}^{d} \xi_i(\omega)\, e_i(t)$: for every arbitrarily small ε, there exists an integer D such that for $d \geq D$, $\tau(X, X_d) = \sup_{t \in T} |X(\omega,t) - X_d(\omega,t)| \leq \varepsilon$. From this result, it is possible to approximate the entropy of a random element $X(\omega, t)$ through the distribution of the “representation coefficients” $\{\xi_i(\omega)\}_{i \leq d}$ obtained from $X_d(\omega, t)$.
Definition 1 (d-truncated entropy for stochastic processes).
Let X be a centered stochastic process with a continuous covariance function. Consider the truncation $X_d(\omega, t) = \sum_{i=1}^{d} \xi_i(\omega)\, e_i(t)$ and the random vector $Z = (\xi_1, \ldots, \xi_d)$; then, the d-truncated entropy of X is defined as $H_\alpha(X, d) = H_\alpha(Z)$.
The “approximation error” incurred when computing the entropy of the stochastic process X with Definition 1 decreases monotonically with the number of terms retained in the Karhunen–Loève expansion, at a rate that depends on the decay of the spectrum of the covariance function $K_X(s,t)$. In general, the more autocorrelated the process is, the more quickly the eigenvalues of $K_X(s,t)$ converge to zero. In practical functional data applications (see for instance the mortality-rate curves in Section 4), the autocorrelation is usually strong, and a small truncation parameter d suffices to approximate the entropy of the process. The next example illustrates the definition.
Example 1 (Gaussian process).
When X is a Gaussian Process (GP), the coefficients in the Karhunen–Loève expansion have the further property of being independent, zero-mean, normally distributed random variables. Therefore, the Shannon entropy (α = 1) of X can be approximated with the truncated version of the GP as follows:

$$H_1(X, d) = \frac{1}{2} \log\left( (2\pi e)^{d} \det(\Sigma) \right),$$

where Σ is the diagonal covariance matrix with elements $[\Sigma]_{i,j} = E(\xi_i \xi_j)$ for $i, j = 1, \ldots, d$.
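As a numerical illustration of Definition 1, the sketch below approximates the Karhunen–Loève eigenvalues by eigen-decomposing the covariance matrix on a fine grid and then evaluates the truncated Shannon entropy formula of Example 1 for increasing d. The squared-exponential covariance and its length-scale are assumed for the example; the paper does not fix a particular covariance here.

```python
import numpy as np

# Assumed example: squared-exponential covariance on a grid of T = [0, 1].
t = np.linspace(0.0, 1.0, 200)
K_X = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * 0.1**2))

# Discrete Karhunen-Loeve: eigenvalues of the covariance operator are
# approximated by those of the covariance matrix scaled by the grid spacing.
dt = t[1] - t[0]
eigvals = np.linalg.eigvalsh(K_X * dt)[::-1]   # descending order

def H1_truncated(lam, d):
    """Truncated Shannon entropy of Example 1 with Sigma = diag(lambda_1..lambda_d)."""
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.log(lam[:d]).sum())

for d in (2, 4, 8):
    print(d, H1_truncated(eigvals, d))
```

The fast decay of the printed eigen-entropies reflects the spectral decay discussed above: for a strongly autocorrelated process, only the first few terms contribute materially.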
In practice, we can only observe some realizations of the stochastic process X, and these observations are sparsely registered. Therefore, to estimate the entropy of $X(\omega, t)$ from a random sample of discrete realizations of a stochastic process, a first task is the representation of these paths by means of continuous functions. To this end, we consider a reproducing kernel Hilbert space $\mathcal{H}$ of functions, associated with a positive-definite and symmetric kernel function $K: T \times T \to \mathbb{R}$.

Estimating Entropy in a Reproducing Kernel Hilbert Space

Most functional data analysis approaches for representing raw data suggest proceeding as follows: (i) choose an orthogonal basis of functions $\Phi = \{\phi_1, \ldots, \phi_N\}$, where each $\phi_i$ belongs to a general function space $\mathcal{H}$; and (ii) represent each functional datum by means of a linear combination in $\mathrm{Span}(\Phi)$ [3,4]. Our choice is to consider $\mathcal{H}$ a Reproducing Kernel Hilbert Space (RKHS) of functions [5]. In this case, the elements of the spanning set Φ are the eigenfunctions associated with the positive-definite and symmetric kernel function $K: T \times T \to \mathbb{R}$ that spans $\mathcal{H}$ [5] (Moore–Aronszajn Theorem, p. 19).
In our setting, the functional representation problem can be framed as follows. We observe m discrete values, that is, a realization path $x(t_1), \ldots, x(t_m)$ of the stochastic element $X(\omega, t)$. We also assume that the discrete path $\{x(t_i), t_i\}_{i=1}^{m}$ contains, as is usual with real data, zero-mean i.i.d. measurement errors. Then, the functional data estimator, denoted onwards as $\tilde{x}(t)$, is obtained by solving the following regularization problem:

$$\tilde{x}(t) := \arg\min_{g \in \mathcal{H}} \; \sum_{i=1}^{m} V\big(x(t_i), g(t_i)\big) + \gamma\, \Omega(g), \tag{4}$$
where V is a functional that is strictly convex in its second argument, $\gamma > 0$ is a regularization parameter, frequently chosen by cross-validation, and $\Omega(g)$ is a regularization term. By the representer theorem [6,7] (Theorem 5.2, p. 91; Proposition 8, p. 51), the solution of the problem stated in Equation (4) exists, is unique and admits a representation of the form:

$$\tilde{x}(t) = \sum_{i=1}^{m} a_i\, K(t, t_i). \tag{5}$$
In the particular case of the squared loss function $V(w,z) = (w - z)^2$ with $\Omega(g) = \int_T g^2(t)\,dt$, the coefficients of the linear combination in Equation (5) are obtained by solving the following linear system:

$$(\gamma\, m\, I + K)\, a = y,$$

where $a = (a_1, \ldots, a_m)^T$, $y = (x(t_1), \ldots, x(t_m))^T$, $I$ is the identity matrix of order m and $K$ is the Gram matrix of kernel evaluations, $[K]_{k,l} = K(t_k, t_l)$ for $k, l = 1, \ldots, m$. To relate the Karhunen–Loève expansion in Equation (2) to the RKHS representation, we make use of Mercer's theorem [2] (Lemma 1.3, p. 24): $K_X(s,t) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(s)\, \phi_j(t)$, where $\lambda_j$ is the eigenvalue associated with the orthonormal eigenfunction $\phi_j$ for $j \geq 1$. Invoking the reproducing property, we then obtain:
$$X(\omega, t) = \langle X(\omega, s),\, K_X(s,t) \rangle = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(t) \int_T X(\omega, s)\, \phi_j(s)\, ds.$$

Therefore, following Equation (2), $\xi_j(\omega) := \sqrt{\lambda_j} \int_T X(\omega, s)\, \phi_j(s)\, ds$ and $e_j(t) = \sqrt{\lambda_j}\, \phi_j(t)$, and the connection is clearly established. When working with discrete realizations of a stochastic process, we must solve two sequential tasks: first, represent the raw data as functional data, and then find a truncated representation of each function. To this end, combining Equation (5) with Mercer's theorem and the reproducing property, we obtain:

$$\tilde{x}_d(t) = \sum_{j=1}^{d} \sqrt{\lambda_j}\, \phi_j(t) \left( \sqrt{\lambda_j} \sum_{i=1}^{m} a_i\, \phi_j(t_i) \right),$$

and now $z_j := \sqrt{\lambda_j} \sum_{i=1}^{m} a_i\, \phi_j(t_i)$ is the realization of the random variable $\xi_j$ for $j = 1, \ldots, d$; see [8] for further details. For some kernel functions, for instance the Gaussian kernel, the associated sequence of eigen-pairs $(\lambda_j, \phi_j)$ for $j \geq 1$ is known in closed form [9] (p. 10), and we can obtain an explicit value for every $z_j$. If not, let $(\lambda_j, v_j)$ be the j-th eigen-pair of the kernel matrix $K \in \mathbb{R}^{m \times m}$; then $z_j = \sqrt{\lambda_j} \sum_{i=1}^{m} a_i\, v_{i,j}$ for $j = 1, \ldots, d$.
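The computation chain just described, solving $(\gamma m I + K)a = y$ and reading off $z_j = \sqrt{\lambda_j} \sum_i a_i v_{i,j}$ from the eigen-pairs of the kernel matrix, condenses into a few lines of Python. This is a sketch with assumed values of σ, γ and d rather than the cross-validated ones used in the paper:

```python
import numpy as np

def rkhs_coefficients(y, t, sigma=10.0, gamma=1e-3, d=5):
    """Represent one discrete path and return its first d representation
    coefficients z_j = sqrt(lambda_j) * sum_i a_i v_{i,j}."""
    m = len(t)
    K = np.exp(-sigma * (t[:, None] - t[None, :]) ** 2)  # Gram matrix
    a = np.linalg.solve(gamma * m * np.eye(m) + K, y)    # (gamma m I + K) a = y
    lam, V = np.linalg.eigh(K)                           # eigen-pairs of K
    lam, V = lam[::-1], V[:, ::-1]                       # descending order
    return np.sqrt(np.clip(lam[:d], 0.0, None)) * (V[:, :d].T @ a)

# Hypothetical noisy path observed on a grid of m = 50 points.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=50)
print(rkhs_coefficients(y, t))
```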
In practice, given a sample of n discrete paths (realizations) of the stochastic process X, say $\{x_l(t_1), \ldots, x_l(t_m)\}$ for $l = 1, \ldots, n$, a suitable input for estimating the entropy in Definition 1 is the set of multivariate vectors $z_l = (z_{l,1}, \ldots, z_{l,d})$ for $l = 1, \ldots, n$, as formally proposed in the next definition.
Definition 2 (K-entropy estimation of a stochastic process).
Let $\{x_1(t_i), \ldots, x_n(t_i)\}$, $i = 1, \ldots, m$, be a discrete random sample of X, and let $\{(\lambda_j, v_j)\}_{j=1}^{d}$ be the eigen-pairs of the kernel matrix $K \in \mathbb{R}^{m \times m}$, where $d = \mathrm{rank}(K)$. Consider the corresponding finite-dimensional representation $S_n := \{z_1, \ldots, z_n\}$, where $z_l = (z_{l,1}, \ldots, z_{l,d}) \in \mathbb{R}^d$ for $l = 1, \ldots, n$ and $z_{l,j} = \sqrt{\lambda_j} \sum_{i=1}^{m} a_{l,i}\, v_{i,j}$ for $j = 1, \ldots, d$. Then, the estimated kernel entropy of X is defined as $\hat{H}_\alpha(X, K) = \hat{H}_\alpha(Z)$.
In Definition 2, $\hat{H}_\alpha(Z)$ denotes the entropy estimated from the (finite-dimensional) representation coefficients $S_n = \{z_1, \ldots, z_n\}$. In Section 3, we formally introduce two approaches to estimate entropy from $S_n$. The next example illustrates the estimation procedure in the context of the GPs of Example 1.
Illustration with Example 1:
Consider 100 realizations of GPs as follows: 50 curves from $X(t) = \sum_{i=1}^{3} \xi_i e_i(t)$ and another 50 curves from $Y(t) = \sum_{i=1}^{3} \zeta_i e_i(t)$, where $\{e_i(t)\}$ is a Fourier basis on $T = [0, 1]$, and $\xi_i \sim N(\mu = 0, \sigma^2 = 0.5)$ and $\zeta_i \sim N(\mu = 0, \sigma^2 = 2)$ are independent normally distributed random variables (r.v.) for $i = 1, 2, 3$.
In Figure 1 (left), we show the realizations of the stochastic processes: in black, the sample paths of $X(t)$; in red, the paths corresponding to $Y(t)$. In Figure 1 (right), we show the distribution of the linear-combination coefficients $\{(z_1, z_2, z_3)_l, (w_1, w_2, w_3)_l\}_{l=1}^{50}$ corresponding to these paths. Following Example 1, we estimate the covariance matrices $\hat{\Sigma}_X$ and $\hat{\Sigma}_Y$ from the respective coefficients and plug them into the Shannon entropy expression to obtain the estimated entropies $\hat{H}_1(X) = 1.402$ and $\hat{H}_1(Y) = 99.552$, close to the true entropies $H_1(X) = 1.428$ and $H_1(Y) = 91.420$, respectively. We formally state the estimation procedure in Algorithm 1.
Algorithm 1: Estimation of H α ( X , K ) from a sample of random paths.
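Algorithm 1 appears as a figure in the published version; the Python sketch below reproduces its logic as described in the text: Lines 2–8 (representation and truncation) followed by a Gaussian plug-in entropy as in Example 1. The sinusoidal basis of the usage example and the fixed values of σ, γ and d are assumptions made here for brevity; the algorithm selects the kernel parameters by cross-validation.

```python
import numpy as np

def kernel_entropy_gaussian(paths, t, sigma=10.0, gamma=1e-3, d=3):
    """Sketch of Algorithm 1 with a Gaussian plug-in entropy: represent each
    path in the RKHS (Lines 2-8), truncate to d coefficients, then evaluate
    Example 1's formula with the sample covariance of the coefficients."""
    m = len(t)
    K = np.exp(-sigma * (t[:, None] - t[None, :]) ** 2)      # Gram matrix
    lam, V = np.linalg.eigh(K)
    lam, V = lam[::-1], V[:, ::-1]                           # descending eigen-pairs
    A = np.linalg.solve(gamma * m * np.eye(m) + K, paths.T)  # all a_l at once, (m, n)
    Z = (np.sqrt(np.clip(lam[:d], 0.0, None))[:, None] * (V[:, :d].T @ A)).T
    Sigma = np.cov(Z, rowvar=False)                          # covariance of the scores
    H1 = 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])
    return H1, Z

# Example 1 revisited with an assumed sinusoidal basis: 50 low-variance
# and 50 high-variance GP paths on [0, 1].
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 100)
basis = np.array([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.sin(4 * np.pi * t)])
X = rng.normal(0.0, np.sqrt(0.5), (50, 3)) @ basis
Y = rng.normal(0.0, np.sqrt(2.0), (50, 3)) @ basis
print(kernel_entropy_gaussian(X, t)[0], kernel_entropy_gaussian(Y, t)[0])
```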
The kernel parameters in Algorithm 1 are chosen by cross-validation, which ensures that the curve-fitting method is asymptotically optimal. Although the selection of the kernel parameters affects the scale of the estimated entropy, the center-outward ordering induced by $H_\alpha(X, K)$, as formally proposed in the next section, is unaffected. In the Supplementary Material, we present experimental results illustrating this property, which makes the method robust to the selection of the kernel and regularization parameters.

3. Minimum Entropy for Anomaly Detection

Anomaly detection is a common task in almost all data analysis contexts. The unsupervised approach considers a sample $X_1, \ldots, X_n$ of random elements in which most instances follow a well-defined pattern and a small proportion, denoted here by $\nu \in [0, 1]$, presents an abnormal pattern. In recent works (see for instance [10,11,12,13]), the authors propose depth measures and related methods to deal with functional outliers. In this section, we propose a novel criterion to tackle the anomaly detection problem for functional data using the ideas and concepts developed in Section 2. For a real-valued d-dimensional random vector Z that admits a continuous density function $f_Z$, define $H_\alpha(A \mid Z) = \frac{1}{1-\alpha} \log \int_A f_Z^\alpha(z)\, dz$, the entropy of the Borel set A with respect to the measure $F_Z$. Then, the ν-Minimal-Entropy Set (MES) is formally defined as:

$$\mathrm{MES}_\nu(Z) := \arg\min_{A \subset \mathbb{R}^d} \; H_\alpha(A \mid Z) \quad \text{s.t.} \quad P(A) \geq 1 - \nu.$$

The $\mathrm{MES}_\nu$ is equivalent [14,15] to a ν-High Density Set (HDS) [16], formally defined as $\mathrm{HDS}_\nu(Z) = \{z \in \mathbb{R}^d \mid f_Z(z) > c_\nu\}$, where $c_\nu$ is the largest constant such that $P(\mathrm{HDS}_\nu(Z)) \geq 1 - \nu$, for $0 < \nu < 1$. Therefore, the complement of the MES is a suitable set for defining outliers in the sample, with $\tilde{x}(t) \notin \mathrm{MES}_\nu$ regarded as an atypical realization of X. Next, we give two approaches to estimate the MES.

3.1. Parametric Approach

Given a random sample of n discrete random paths $\{x_1(t_i), \ldots, x_n(t_i)\}$ for $i = 1, \ldots, m$, we transform this sample into the d-dimensional vectors $S_n = \{z_1, \ldots, z_n\}$ using the representation and truncation method proposed in this work, numerically implemented in Lines 2–8 of Algorithm 1. Assume further that $f_Z(z, \theta)$ is a suitable probability model for the random sample $z_1, \ldots, z_n$; we then estimate the parameters θ by Robust Maximum Likelihood (RML). For instance, in this paper, we take $f_Z(z, \theta)$ to be the normal density, so the RML-estimated parameters are $\hat{\theta} = (\hat{\mu}, \hat{\Sigma})$, the robust mean vector and covariance matrix, respectively. For details on robust estimation, we refer to [17]. After estimating the distribution parameters, $H_\alpha$ is computed by plugging the estimated density $f_Z(z, \hat{\theta})$ into Equation (1). Moreover, for the normal model, the estimated set $\mathrm{MES}_\nu$ is defined through the following expression:

$$\mathrm{MES}_\nu(S_n) = \left\{ z \in \mathbb{R}^d \mid (z - \hat{\mu})^T\, \hat{\Sigma}^{-1}\, (z - \hat{\mu}) \leq \chi_d^2(\nu) \right\},$$

where $\chi_d^2(\nu)$ is the $1 - \nu$ quantile of a Chi-square distribution with d degrees of freedom. If the coefficient vector $z_i$ representing $\tilde{x}_i(t)$ lies outside this ellipsoid, we declare the functional datum atypical. When the proportion of outliers ν in the sample is known a priori, the $\chi_d^2(\nu)$ quantile can be replaced by the corresponding sample $1 - \nu$ quantile of the Mahalanobis distances, as is done in Section 4.1.
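A compact sketch of this parametric rule follows, using the Minimum Covariance Determinant from scikit-learn as one concrete robust estimator of $(\hat{\mu}, \hat{\Sigma})$; the paper only requires some RML estimate, so this choice is an assumption for illustration:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def parametric_mes_outliers(Z, nu=0.10):
    """Flag the z_i lying outside the estimated MES ellipsoid, using the
    Minimum Covariance Determinant as the robust estimate of (mu, Sigma)."""
    mcd = MinCovDet(random_state=0).fit(Z)
    d2 = mcd.mahalanobis(Z)                        # squared Mahalanobis distances
    threshold = chi2.ppf(1.0 - nu, df=Z.shape[1])  # chi-square 1 - nu quantile
    return d2 > threshold                          # True -> atypical functional datum

# Hypothetical coefficients: 95 regular points plus 5 shifted outliers.
rng = np.random.default_rng(3)
Z = np.vstack([rng.normal(0.0, 1.0, (95, 3)), rng.normal(5.0, 1.0, (5, 3))])
print(np.where(parametric_mes_outliers(Z))[0])
```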

3.2. Non-Parametric Approach

The following definitions introduce the non-parametric estimation method. For the random vector $Z \in \mathbb{R}^d$ distributed according to $F_Z$, let $B_Z(z, r_\delta) \subset \mathbb{R}^d$ be the z-centered ball with radius $r_\delta$ satisfying the condition $\delta = \int_{B_Z(z, r_\delta)} f_Z(z)\, dz$; then the δ-neighbors of the point z comprise the open set $\Delta_z = \mathbb{R}^d \setminus B_Z(z, r_\delta)$.
Definition 3 (δ-local α-entropy).
Let $z \in \mathbb{R}^d$, and let $\alpha > 0$ with $\alpha \neq 1$; the δ-local α-entropy of the r.v. Z is:

$$h_\alpha(\Delta_z) = \frac{1}{1-\alpha} \log \int_{\Delta_z} f_Z^\alpha(z)\, dz \quad \text{for all } z \in \mathbb{R}^d.$$
Under mild regularity conditions on $f_Z$, the local entropy measure is a suitable metric to characterize the degree of abnormality of every point z in the support of $F_Z$. Several natural estimators of local entropy can be considered, for instance ones based on the (average) distance from the point z to its k-th nearest neighbor. We estimate the MES by combining the estimated δ-local α-entropies of the sample. As in the parametric case, let $\{x_1(t_i), \ldots, x_n(t_i)\}$ for $i = 1, \ldots, m$ be a random sample of n discrete random paths, which we transform into the d-dimensional vectors $S_n = \{z_1, \ldots, z_n\}$ following Lines 2–8 of Algorithm 1. Next, we estimate the local entropy of these data with the estimator $\hat{h}_\alpha(\Delta_{z_i}) = \exp(\bar{d}_k(z_i, S_n))$, where $\bar{d}_k(z_i, S_n)$ is the average distance from $z_i$ to its k-th nearest neighbor [18], and then estimate $\mathrm{MES}_\nu$ by solving the following optimization problem:

$$\max_{\rho,\, \epsilon_1, \ldots, \epsilon_n} \; (1-\nu)\, \rho - \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \quad \text{s.t.} \quad \hat{h}_\alpha(\Delta_{z_i}) \geq \rho - \epsilon_i, \;\; \epsilon_i \geq 0, \;\; i = 1, \ldots, n. \tag{8}$$
The solution of this problem, $\rho^*$, leads to the following decision function:

$$D(z) = \mathrm{sign}\big(\rho^* - \hat{h}_\alpha(\Delta_z)\big),$$

where $D(z) = +1$ if z belongs to the $(1-\nu)$ proportion of curves projected near the origin, that is, the set of curves lying in a low-entropy (high-density) set. The following theorem shows that, as the number of available curves increases, the estimation method asymptotically detects the proportion $1 - \nu$ of curves belonging to $\mathrm{MES}_\nu$.
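Since Theorem 1 below shows that $\rho^*$ coincides with the $1-\nu$ sample quantile $q^*$ of the estimated local entropies, the whole non-parametric rule can be sketched without a linear-programming solver. In the sketch, $\bar{d}_k$ is read as the average distance to the k nearest neighbors, one concrete interpretation of the estimator in [18]:

```python
import numpy as np
from scipy.spatial.distance import cdist

def npa_mes_outliers(Z, nu=0.10, k=5):
    """Local-entropy estimates from k-nearest-neighbor distances; rho* is
    taken as the (1 - nu) sample quantile, as justified by Theorem 1."""
    D = cdist(Z, Z)
    np.fill_diagonal(D, np.inf)                 # exclude self-distances
    d_bar = np.sort(D, axis=1)[:, :k].mean(axis=1)
    h_local = np.exp(d_bar)                     # h_hat_alpha(Delta_{z_i})
    rho_star = np.quantile(h_local, 1.0 - nu)   # rho* = q*
    return h_local > rho_star                   # D(z) = -1  ->  atypical

rng = np.random.default_rng(4)
Z = np.vstack([rng.normal(0.0, 1.0, (95, 3)), rng.normal(5.0, 1.0, (5, 3))])
print(np.where(npa_mes_outliers(Z))[0])
```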
Theorem 1.
At the solution of the optimization problem stated in Equation (8), the following equality holds:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(z_i) = 1 - \nu,$$

where $I(z) = 1$ if $\hat{h}_\alpha(\Delta_z) \leq \rho^*$ and $I(z) = 0$ otherwise.

4. Experimental Section

The aim of this section is to illustrate the performance of the proposed methodology for detecting abnormal observations in a sample of functional data. In what follows, for the representation of functional data, we consider the Gaussian kernel function $K(t_l, t_k) = e^{-\sigma \| t_l - t_k \|^2}$. The kernel parameter σ and the regularization coefficient γ in Algorithm 1 were selected through cross-validation.

4.1. Simulation Analysis

In a Monte Carlo study, we investigate the performance of the proposed method under three data configurations (Scenarios A, B and C). Specifically, we consider the following generating processes: a fraction $1 - \nu$ of $n = 400$ curves are realizations of the stochastic model:

$$X_l(t) = \sum_{j=1}^{4} \xi_j \sin(j \pi t) + \varepsilon_l(t), \quad l = 1, \ldots, (1-\nu)n, \;\; t \in [0, 1],$$

where $\xi = (\xi_1, \ldots, \xi_4)$ is a normally distributed multivariate random variable with mean $\mu_\xi = (4, 2, 4, 1)$ and diagonal covariance matrix $\Sigma_\xi = \mathrm{diag}(5, 2, 2, 1)$, and the $\varepsilon_l(t)$ are independent autocorrelated random error functions.
The remaining proportion $\nu n$ of the data, with $\nu \in \{1\%, 5\%, 10\%\}$, comprises outliers that contaminate the sample according to the following typical scenarios (see [19]); a code sketch of the generating process follows the list:
(A) Magnitude outliers: $Y_l(t) = \sum_{j=1}^{4} \zeta_j \sin(j \pi t) + \varepsilon_l(t)$, for $l = 1, \ldots, \nu n$ and $t \in [0, 1]$, where ζ is a normally distributed multivariate r.v. with parameters $\mu_\zeta = 2.5\, \mu_\xi$ and $\Sigma_\zeta = (2.5)^2\, \Sigma_\xi$.
(B) Shape outliers: $Y_l(t) = \sum_{j=1}^{4} \zeta_j \sin(j \pi t) + \varepsilon_l(t)$, for $l = 1, \ldots, \nu n$ and $t \in [0, 1]$, where ζ is a normally distributed multivariate r.v. with parameters $\mu_\zeta = (4, 2, 1, 3)$ and $\Sigma_\zeta = \Sigma_\xi$.
(C) A combination of $\nu n / 2$ outliers from Scenario A and $\nu n / 2$ outliers from Scenario B.
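The sketch below generates Scenario A data. The autocorrelated error $\varepsilon_l(t)$ is assumed here to be a discretized AR(1) process with arbitrary parameters, since the paper does not specify its exact form:

```python
import numpy as np

def simulate_scenario_A(n=400, nu=0.10, m=100, seed=0):
    """Generate (1 - nu) * n regular curves and nu * n magnitude outliers.
    The autocorrelated error is an assumed AR(1); phi and sd are arbitrary."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, m)
    S = np.array([np.sin(j * np.pi * t) for j in range(1, 5)])   # (4, m) basis

    def ar1_noise(n_curves, phi=0.5, sd=0.1):
        e = rng.normal(0.0, sd, (n_curves, m))
        for i in range(1, m):
            e[:, i] += phi * e[:, i - 1]
        return e

    n_out = int(nu * n)
    mu = np.array([4.0, 2.0, 4.0, 1.0])
    sd = np.sqrt(np.array([5.0, 2.0, 2.0, 1.0]))
    X = rng.normal(mu, sd, (n - n_out, 4)) @ S + ar1_noise(n - n_out)
    Y = rng.normal(2.5 * mu, 2.5 * sd, (n_out, 4)) @ S + ar1_noise(n_out)
    labels = np.r_[np.zeros(n - n_out), np.ones(n_out)]          # 1 -> outlier
    return np.vstack([X, Y]), labels, t
```

Scenario B follows by replacing the outlier mean and covariance with $\mu_\zeta = (4, 2, 1, 3)$ and $\Sigma_\zeta = \Sigma_\xi$, and Scenario C by mixing the two outlier types in equal halves.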
To illustrate the generating process, Figure 2 shows one instance of the simulated paths in Scenario C with ν = 10%. We test our Parametric entropy (PA) and Non-Parametric entropy (NPA) methods against several well-known depth measures for functional anomaly detection, namely the Modified Band Depth (MBD), the H-Mode Depth (HMD), the Random Tukey Depth (RTD) and the Functional Spatial Depth (FSD) (see [10,11,12,13], respectively), all implemented in the R package fda.usc [20]. For this experiment, the value of the parameter ν is assumed known in each scenario. The parameters σ and γ in Algorithm 1 were chosen by a 10-fold cross-validation procedure on a single data set, corresponding to the first instance of the simulations. The reference values, which remain fixed throughout the simulation exercise, are $\sigma = 10$ and $\gamma = 0.1^5$.
Let P and N be the numbers of outlying and normal observations in the sample, respectively, and let TP (True Positives) and TN (True Negatives) be the respective quantities detected by each method. In Table 1, we report the following average metrics over the M = 1000 replications of the Monte Carlo study: TPR = TP/P (True Positive Rate, or sensitivity), TNR = TN/N (True Negative Rate, or specificity) and the area under the ROC curve (aROC) of each method.
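For reference, these metrics can be computed as follows. This is a sketch; `labels`, `flagged` and `scores` are hypothetical names for the true outlier indicators, the detector's decisions, and a continuous atypicality score used for the ROC curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(labels, flagged, scores):
    """Per-replication metrics: TPR = TP/P, TNR = TN/N, and aROC computed
    from a continuous atypicality score (higher = more atypical)."""
    labels = np.asarray(labels, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tpr = (flagged & labels).sum() / labels.sum()       # sensitivity
    tnr = (~flagged & ~labels).sum() / (~labels).sum()  # specificity
    return tpr, tnr, roc_auc_score(labels, scores)
```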
As can be seen, the PA and NPA entropy methods proposed in this article outperform the other recently proposed depth measures in all three scenarios when $\nu \in \{0.10, 0.05\}$. In the remaining case ($\nu = 0.01$), PA and NPA also outperform the other methods; however, the standard errors are too high to confirm a significant difference between the methods.
Comparing the two proposed methods, the parametric approach seems slightly (but consistently) more effective than the non-parametric approach in Scenario A. For Scenarios B and C, both methods provide similar results. It is important to remark that the PA method is especially adequate for Gaussian data, while the NPA method does not assume any distributional hypothesis on the data. In this sense, the simulation results show the robustness of the non-parametric approach even when competing with parametric methods designed for specific distributions.

4.2. Outliers in the Context of Mortality-Rate Curve Analysis

We consider the French mortality rates database, available in the R package Demography [21], to study age-specific male death rates on a logarithmic scale. In Figure 3 (left), each curve corresponds to one year from 1901–2006 (106 paths in total) and records the number of deaths per 1000 of the mean population in each age group (from 0–101 years). As expected, mortality rates decrease over the low-age cohorts (until approximately 12 years) and then grow steadily into old age, where all cohorts reach a 100% mortality rate.
For some years, the evolution pattern of mortality presents an atypical behavior, mostly coinciding with the First and Second World Wars, together with the influenza pandemic of 1919.
In this experiment, we do not know a priori the proportion of atypical curves. Therefore, after conducting inference over a wide range of values of ν, as a way to assess the sensitivity and reliability of the inference when determining the number of abnormal curves, we fixed ν = 10%. For further details on how to choose the parameter ν (and an extended sensitivity analysis over its values), please refer to § 3.2 of the Supplementary Material. In Figure 3 (left), we highlight in red the anomalous curves detected by both the entropy-PA and NPA methods, corresponding to the years 1914–1919, 1940 and 1942–1945, which match the periods when men between 20 and 40 years old fought in the First and Second World Wars. In Figure 3 (right), we use the first two principal components of the kernel eigenfunctions to project the representation coefficients (in this experiment, in $\mathbb{R}^{14}$) into two dimensions. As can be seen, the points lying outside the $\mathrm{MES}_{\nu = 90\%}$, delimited by the dotted blue ellipse for the PA estimate and by the solid blue convex hull for the NPA estimate, correspond to the atypical curves in the sample.

5. Discussion

In this article, we propose a definition of entropy for stochastic processes. We provide a reproducing kernel Hilbert space model to estimate entropy from a random sample of realizations of a stochastic process, namely functional data, and introduce two approaches to estimate minimum entropy sets for functional anomaly detection.
In the experimental section, the Monte Carlo simulation illustrates the adequacy of the proposed method in the context of magnitude and shape outliers, outperforming other state-of-the-art methods for functional anomaly detection. In the study of French mortality rates, the parametric and non-parametric approaches for minimum entropy set estimation show their adequacy in capturing anomalous curves, principally associated with the First and Second World Wars and the influenza episode of 1919.
Beyond the results presented in the paper, how widely the method can be used in practice, especially with noisier data, is an open question. In this sense, as future work, we will consider testing the performance of the proposed method in other scenarios with different noise assumptions on the observations. Another natural extension for future work entails the study of the asymptotic properties of the $\mathrm{MES}_\nu$ estimators. The extension of the proposed method from stochastic processes to random fields, useful in several statistical and information science areas, seems straightforward, but a wide range of simulations and numerical experiments must be done in order to stress-test the performance of entropy methods in comparison to other techniques when dealing with abnormal fields. Another natural avenue for future work entails the study of the connections between entropy for stochastic processes, as formally defined here, and the maximum entropy principle when estimating the governing parameters of Gaussian processes.

Supplementary Materials

The following are available online at www.mdpi.com/1099-4300/20/1/33/s1.

Acknowledgments

We thank the referees and the editor for constructive comments and insightful recommendations. This work has been supported by CONICET Argentina Project 20020150200110BA, the Spanish Ministry of Economy and Competitiveness Projects ECO2015–66593-P, GROMA(MTM2015-63710-P), PPI (RTC-2015-3580-7) and UNIKO(RTC-2015-3521-7) and the “methaodos.org” research group at URJC.

Author Contributions

All authors have contributed equally to the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RML: Robust Maximum Likelihood.
MES, HDS: Minimum Entropy Set and High Density Set, respectively.
PA, NPA: Parametric and Non-Parametric Approaches.
MBD, HMD, RTD, FSD: Modified Band, H-Mode, Random Tukey and Functional Spatial Depths.

Appendix A

Proof of Theorem 1.
Consider the following optimization problem:
$$\min_{\beta_1, \ldots, \beta_n} \; \sum_{i=1}^{n} \beta_i\, \hat{h}_\alpha(\Delta_{z_i}) \quad \text{s.t.} \quad \sum_{i=1}^{n} \beta_i = n(1-\nu), \;\; 0 \leq \beta_i \leq 1, \;\; i = 1, \ldots, n. \tag{A1}$$
For the sake of simplicity, consider first the case where $n(1-\nu) \in \mathbb{N}$. Let $q^*$ be the $1 - \nu$ quantile of the sample $\{\hat{h}_\alpha(\Delta_{z_i})\}_{i=1}^{n}$. Then, it can be shown that $\beta_i^* = 1$ if $\hat{h}_\alpha(\Delta_{z_i}) \leq q^*$ and $\beta_i^* = 0$ if $\hat{h}_\alpha(\Delta_{z_i}) > q^*$ is a solution of the problem stated in Equation (A1). As a consequence:
$$\frac{1}{n} \sum_{i=1}^{n} I(z_i) = \frac{1}{n} \sum_{i=1}^{n} \beta_i^*.$$

From the constraint in Equation (A1), it holds that $\sum_{i=1}^{n} \beta_i^* = n(1-\nu)$, and then:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \beta_i^* = \lim_{n \to \infty} \frac{1}{n}\, n(1-\nu) = 1 - \nu.$$
For the case $n(1-\nu) \notin \mathbb{N}$, it holds that:

$$\beta_i^* = \begin{cases} 1, & \text{if } \hat{h}_\alpha(\Delta_{z_i}) < q^*, \\ n(1-\nu) - [n(1-\nu)], & \text{if } \hat{h}_\alpha(\Delta_{z_i}) = q^*, \\ 0, & \text{if } \hat{h}_\alpha(\Delta_{z_i}) > q^*, \end{cases}$$

where $[x]$ stands for the largest integer no greater than x. Therefore, the number of $\beta_i^*$'s equal to one is $[n(1-\nu)]$, and:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(z_i) = \lim_{n \to \infty} \frac{1}{n} \big( [n(1-\nu)] \times 1 + 1 \big) = \lim_{n \to \infty} \frac{[n(1-\nu)]}{n} = 1 - \nu.$$
Finally, we show that $\rho^* = q^*$. The dual problem of (A1) is:

$$\max_{b,\, \epsilon_1, \ldots, \epsilon_n} \; n(1-\nu)\, b - \sum_{i=1}^{n} \epsilon_i \quad \text{s.t.} \quad \hat{h}_\alpha(\Delta_{z_i}) \geq b - \epsilon_i, \;\; \epsilon_i \geq 0, \;\; i = 1, \ldots, n. \tag{A2}$$
By the fundamental theorem of duality, the objective functions of the problems stated in Equations (A1) and (A2) take the same value at their solutions, and as a consequence, $b^* = q^*$ (see [22]). Since Problem (A2) differs from Problem (8) only in the scaling of the objective function, it holds that $\rho^* = b^*$, which concludes the proof. ☐

References

1. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
2. Bosq, D. Linear Processes in Function Spaces: Theory and Applications; Springer Science & Business Media: New York, NY, USA, 2012.
3. Ramsay, J.O. Functional Data Analysis; Wiley: New York, NY, USA, 2006.
4. Ferraty, F.; Vieu, P. Nonparametric Functional Data Analysis: Theory and Practice; Springer: New York, NY, USA, 2006.
5. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer: New York, NY, USA, 2011.
6. Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–94.
7. Cucker, F.; Smale, S. On the mathematical foundations of learning. Bull. Am. Math. Soc. 2002, 39, 1–49.
8. Muñoz, A.; González, J. Representing functional data using support vector machines. Pattern Recognit. Lett. 2010, 31, 511–516.
9. Zhu, H.; Williams, C.; Rohwer, R.; Morciniec, M. Gaussian Regression and Optimal Finite Dimensional Linear Models; Aston University: Birmingham, UK, 1997.
10. López-Pintado, S.; Romo, J. On the concept of depth for functional data. J. Am. Stat. Assoc. 2009, 104, 718–734.
11. Cuevas, A.; Febrero, M.; Fraiman, R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Stat. 2007, 22, 481–496.
12. Sguera, C.; Galeano, P.; Lillo, R. Spatial depth-based classification for functional data. Test 2014, 23, 725–750.
13. Cuesta-Albertos, J.A.; Nieto-Reyes, A. The random Tukey depth. Comput. Stat. Data Anal. 2008, 52, 4979–4988.
14. Hero, A. Geometric entropy minimization (GEM) for anomaly detection and localization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007; pp. 585–592.
15. Xie, T.; Nasrabadi, N.; Hero, A.O. Robust training on approximated minimal-entropy set. arXiv 2016, arXiv:1610.06806.
16. Hyndman, R.J. Computing and graphing highest density regions. Am. Stat. 1996, 50, 120–126.
17. Maronna, R.; Martin, R.; Yohai, V. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2006.
18. Beirlant, J.; Dudewicz, E.; Györfi, L.; Van der Meulen, E. Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 1997, 6, 17–39.
19. Cano, J.; Moguerza, J.M.; Psarakis, S.; Yannacopoulos, A.N. Using statistical shape theory for the monitoring of nonlinear profiles. Appl. Stoch. Models Bus. Ind. 2015, 31, 160–177.
20. Febrero-Bande, M.; De la Fuente, M.O. Statistical computing in functional data analysis: The R package fda.usc. J. Stat. Softw. 2012, 51, 1–28.
21. Hyndman, R.J. Demography Package; R Foundation for Statistical Computing: Vienna, Austria, 2017.
22. Muñoz, A.; Moguerza, J.M. Estimation of high-density regions using one-class neighbor machines. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 476–480.
Figure 1. Gaussian process realizations (left) and coefficients for entropy estimation (right). The sizes of the balls on the right are proportional to the determinants of $\hat{\Sigma}_X$ (in black) and $\hat{\Sigma}_Y$ (in red).
Figure 2. (Left) Raw data: 400 curves corresponding to Scenario C with ν = 10%. (Right) Functional data: in black, the sample of regular paths X(t); in red, the abnormal curves Y(t).
Figure 3. French mortality data. Left: regular curves in black and detected outliers in red, for ν = 10%. Right: the first two principal components of the kernel eigenfunctions; the area inside the dotted blue ellipse corresponds to the PA estimate of $\mathrm{MES}_{\nu = 90\%}$, and the region inside the solid blue convex hull to the NPA estimate. Regular curves, shown as black dots, lie inside the $\mathrm{MES}_{\nu = 90\%}$; detected outliers, shown as red asterisks, lie outside it.
Table 1. Simulation analysis: scenarios and contamination percentages ν in columns; methods and average sensitivities (TPR), specificities (TNR) and areas under the ROC curve (aROC; this last on a scale of 10²) in rows. The corresponding standard error is reported in parentheses.

| Method | Metric | A: 10% | A: 5% | A: 1% | B: 10% | B: 5% | B: 1% | C: 10% | C: 5% | C: 1% |
|---|---|---|---|---|---|---|---|---|---|---|
| MBD | TPR | 74.867 (4.699) | 71.010 (7.712) | 55.300 (20.852) | 48.275 (5.914) | 39.395 (9.013) | 13.475 (16.180) | 67.787 (5.351) | 58.365 (7.772) | 36.300 (18.341) |
| MBD | TNR | 97.207 (0.522) | 98.474 (0.406) | 99.548 (0.210) | 94.252 (0.657) | 96.810 (0.474) | 99.126 (0.163) | 96.420 (0.594) | 97.808 (0.409) | 99.356 (0.185) |
| MBD | aROC | 96.662 (1.245) | 97.375 (1.517) | 97.735 (3.059) | 89.393 (2.033) | 91.693 (2.388) | 93.244 (4.425) | 95.272 (1.399) | 95.444 (1.831) | 95.354 (4.370) |
| HMD | TPR | 92.665 (3.295) | 91.545 (5.173) | 88.675 (14.793) | 66.532 (6.084) | 62.780 (8.809) | 47.475 (21.206) | 79.992 (4.562) | 76.765 (7.039) | 66.025 (18.004) |
| HMD | TNR | 99.185 (0.366) | 99.555 (0.272) | 99.885 (0.149) | 96.281 (0.676) | 98.041 (0.463) | 99.469 (0.214) | 97.776 (0.506) | 98.777 (0.370) | 99.656 (0.181) |
| HMD | aROC | 99.200 (0.851) | 99.256 (1.105) | 99.346 (2.391) | 94.980 (1.583) | 96.153 (1.812) | 96.969 (3.473) | 97.676 (1.089) | 97.924 (1.401) | 97.842 (3.542) |
| RTD | TPR | 83.555 (4.743) | 83.045 (0.694) | 76.400 (18.931) | 50.972 (9.409) | 43.940 (1.279) | 22.700 (2.1334) | 71.975 (7.178) | 65.225 (9.716) | 49.700 (1.834) |
| RTD | TNR | 98.174 (0.526) | 99.104 (0.365) | 99.762 (0.191) | 94.544 (1.045) | 97.049 (0.674) | 99.218 (0.215) | 96.889 (0.798) | 98.165 (0.511) | 99.491 (0.184) |
| RTD | aROC | 98.187 (1.094) | 98.605 (1.347) | 98.962 (2.538) | 90.426 (2.817) | 92.510 (2.967) | 94.154 (4.574) | 96.156 (1.580) | 96.345 (1.977) | 96.242 (4.085) |
| FSD | TPR | 81.472 (3.978) | 83.215 (5.947) | 81.925 (16.671) | 50.275 (5.238) | 46.550 (8.018) | 27.400 (19.547) | 74.775 (4.601) | 69.485 (6.859) | 53.775 (16.707) |
| FSD | TNR | 97.941 (0.442) | 99.116 (0.313) | 99.817 (0.168) | 94.475 (0.582) | 97.186 (0.421) | 99.267 (0.197) | 97.197 (0.511) | 98.396 (0.361) | 99.533 (0.168) |
| FSD | aROC | 97.934 (1.030) | 98.738 (1.232) | 99.163 (2.490) | 90.059 (1.794) | 93.279 (2.061) | 95.485 (3.723) | 96.777 (1.158) | 97.148 (1.477) | 97.125 (3.682) |
| Entropy-PA | TPR | 94.150 (3.078) | 93.215 (4.817) | 91.725 (12.591) | 80.740 (6.250) | 77.390 (8.550) | 66.925 (20.330) | 87.550 (4.632) | 84.935 (6.604) | 77.650 (17.015) |
| Entropy-PA | TNR | 99.350 (0.342) | 99.649 (0.253) | 99.916 (0.127) | 97.860 (0.694) | 98.810 (0.450) | 99.664 (0.205) | 98.616 (0.514) | 99.207 (0.347) | 99.774 (0.171) |
| Entropy-PA | aROC | 99.351 (0.788) | 99.353 (1.078) | 99.374 (2.474) | 97.549 (1.364) | 97.987 (1.495) | 98.301 (2.785) | 98.677 (0.944) | 98.752 (1.208) | 98.641 (3.081) |
| Entropy-NPA | TPR | 92.725 (3.325) | 91.505 (5.228) | 89.050 (14.630) | 74.215 (6.237) | 77.145 (7.904) | 71.250 (19.970) | 87.225 (4.217) | 85.805 (6.198) | 79.775 (16.788) |
| Entropy-NPA | TNR | 99.191 (0.369) | 99.552 (0.275) | 99.889 (0.147) | 97.135 (0.693) | 98.792 (0.416) | 99.709 (0.201) | 98.586 (0.468) | 99.252 (0.326) | 99.795 (0.169) |
| Entropy-NPA | aROC | 99.243 (0.815) | 99.266 (1.097) | 99.293 (2.528) | 97.240 (1.130) | 98.253 (1.250) | 98.685 (2.550) | 98.782 (0.856) | 98.880 (1.145) | 98.861 (2.880) |

