Article

A Metric Based on the Efficient Determination Criterion

by Jesús E. García, Verónica A. González-López * and Johsac I. Gomez Sanchez
Department of Statistics, University of Campinas, Campinas 13083-859, São Paulo, Brazil
* Author to whom correspondence should be addressed.
Entropy 2024, 26(6), 526; https://doi.org/10.3390/e26060526
Submission received: 16 March 2024 / Revised: 9 June 2024 / Accepted: 17 June 2024 / Published: 19 June 2024
(This article belongs to the Special Issue Bayesianism)

Abstract

This paper extends the concept of metrics based on the Bayesian information criterion (BIC), to achieve strongly consistent estimation of partition Markov models (PMMs). We introduce a set of metrics drawn from the family of model selection criteria known as efficient determination criteria (EDC). This generalization extends the range of options available in BIC for penalizing the number of model parameters. We formally specify the relationship that determines how EDC works when selecting a model based on a threshold associated with the metric. Furthermore, we improve the penalty options within EDC, identifying the penalty $\ln(\ln(n))$ as a viable choice that maintains the strongly consistent estimation of a PMM. To demonstrate the utility of these new metrics, we apply them to the modeling of three DNA sequences of dengue virus type 3, endemic in Brazil in 2023.

1. Introduction

This article explores the efficient determination criterion (EDC) introduced in [1], with the aim of formulating an EDC-based metric. The starting point is the metric based on the Bayesian information criterion (BIC) proposed in [2], which was designed to provide consistent estimation of partition Markov models. Our aim is to extend the scope of the BIC-based metric, thereby broadening the array of algorithms available for identifying partition Markov models.
To this end, we first present the theoretical framework underlying the BIC, the EDC, and the BIC-based metric, and we briefly survey related research to provide context for our approach.
Let $(X_t)$ be a discrete-time order $o$ Markov chain on a finite and discrete alphabet $\Delta$, with $o < \infty$; let us call $\Omega = \Delta^o$ the state space. Denote the string $a_k a_{k+1} \cdots a_m$ by $a_k^m$, where $a_i \in \Delta$, $k \le i \le m$. For each $a \in \Delta$ and $s \in \Omega$, the transition probability from the state $s$ to $a$ is
\[ P(a \mid s) = \mathrm{Prob}(X_t = a \mid X_{t-o}^{t-1} = s). \tag{1} \]
Let $\mathcal{P} = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a partition of $\Omega$; then, for each pair of parts $\Gamma_i$ and $\Gamma_j$, $i \ne j$, $i, j \in \{1, \ldots, |\mathcal{P}|\}$, $\Gamma_i \cap \Gamma_j = \emptyset$, and $\Omega = \cup_{i=1}^{|\mathcal{P}|} \Gamma_i$.
Note that a part $\Gamma$ of the partition is constituted by a collection of states coming from $\Omega$; we reformulate the notion introduced by Equation (1) as follows, for $a \in \Delta$, $\Gamma \in \mathcal{P}$:
\[ P(\Gamma, a) = \sum_{s \in \Gamma} \mathrm{Prob}(X_{t-o}^{t-1} = s, X_t = a), \quad P(\Gamma) = \sum_{s \in \Gamma} \mathrm{Prob}(X_{t-o}^{t-1} = s), \quad P(a \mid \Gamma) = \frac{P(\Gamma, a)}{P(\Gamma)} \ \text{if } P(\Gamma) > 0. \tag{2} \]
Given the previous notation, we appeal to a model for $(X_t)$ that allows a more efficient estimation of the transition probabilities introduced by Equation (1); see [2].
Definition 1.
Let $(X_t)$ be a discrete-time order $o$ Markov chain on a finite and discrete alphabet $\Delta$, $o < \infty$. Two states $s, r \in \Omega = \Delta^o$ are equivalent (denoted by $s \sim_p r$) if $P(a \mid s) = P(a \mid r)$ $\forall a \in \Delta$. For any $s \in \Omega$, the equivalence class of $s$ is given by the set of states $\{r \in \Omega : r \sim_p s\}$.
The previous notion allows the definition of a Markov chain with minimal partition $\mathcal{P}$, that is, a partition whose parts are the equivalence classes of this relationship.
Definition 2.
Let $(X_t)$ be a discrete-time order $o$ Markov chain on a finite and discrete alphabet $\Delta$, $o < \infty$, and let $\mathcal{P} = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a partition of $\Omega = \Delta^o$; $(X_t)$ is a Markov chain with minimal partition $\mathcal{P}$ if $\mathcal{P}$ is defined by the relationship $\sim_p$ introduced by Definition 1.
As previously indicated, the objective of this model is to allow a more efficient estimation of the probabilities introduced by Equation (1). The most efficient estimation is obtained by identifying the parts of the minimal partition (Definition 2), so that all the states in a part can be pooled to estimate a single set of transition probabilities per part. Identifying the partition $\mathcal{P}$ introduced in Definition 2 requires a strategy, which is described below.
In a given sample $x_1^n$, of size $n$, coming from the stochastic process $(X_t)$ under the assumptions of Definition 2, given the state $s \in \Omega$ and the element of the alphabet $a \in \Delta$, we denote the number of occurrences of $s$ followed by $a$ in the sample $x_1^n$ by $N_n(s, a) = |\{t : o < t \le n, x_{t-o}^{t-1} = s, x_t = a\}|$, and $N_n(s) = \sum_{a \in \Delta} N_n(s, a)$ is the number of occurrences of $s$ in the sample $x_1^n$. Also, given a partition $\mathcal{P}$ of $\Omega$, denote the number of occurrences of elements of $\Gamma$ (part of $\mathcal{P}$) followed by $a$ as
\[ N_n(\Gamma, a) = \sum_{s \in \Gamma} N_n(s, a), \quad a \in \Delta; \tag{3} \]
the accumulated number of values $N_n(s)$ for $s \in \Gamma$ is denoted by
\[ N_n(\Gamma) = \sum_{s \in \Gamma} N_n(s). \tag{4} \]
Note that $N_n(\Gamma, a)$ and $N_n(\Gamma)$ can be computed for any partition $\mathcal{P}$ of $\Omega$, not only for the partition introduced by Definition 2.
The counts of occurrences, in this case $N_n(\Gamma, a)$ and $N_n(\Gamma)$, allow the estimation of the probabilities in Equation (2) through a modification of the likelihood function of the sample. The likelihood of the sample is
\[ P(x_1^n) = P(x_1^o) \prod_{a \in \Delta, \Gamma \in \mathcal{P}} P(a \mid \Gamma)^{N_n(\Gamma, a)}; \]
then, the maximum of the modified log-likelihood, which omits the initial factor $P(x_1^o)$, is
\[ \sum_{a \in \Delta, \Gamma \in \mathcal{P}} N_n(\Gamma, a) \ln\left(\frac{N_n(\Gamma, a)}{N_n(\Gamma)}\right), \quad \text{with } N_n(\Gamma) > 0, \ \Gamma \in \mathcal{P}. \]
Moreover, $\frac{N_n(\Gamma, a)}{N_n(\Gamma)}$, with $N_n(\Gamma) > 0$, is the maximum likelihood estimator of $P(a \mid \Gamma)$ given in Equation (2).
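The counts and estimators above can be illustrated with a minimal Python sketch. The function names (`transition_counts`, `part_counts`, `max_log_likelihood`) and the toy sequence are hypothetical and chosen only for illustration; terms with zero count are skipped, which assumes the usual convention $0 \ln 0 = 0$.

```python
from collections import Counter
from math import log


def transition_counts(x, o):
    """N_n(s, a): number of times the length-o state s is followed by symbol a in x."""
    counts = Counter()
    for t in range(o, len(x)):
        counts[(x[t - o:t], x[t])] += 1
    return counts


def part_counts(counts, part, alphabet):
    """Pool counts over the states of one part: N_n(Gamma, a) for each a, and N_n(Gamma)."""
    n_ga = {a: sum(counts[(s, a)] for s in part) for a in alphabet}
    return n_ga, sum(n_ga.values())


def max_log_likelihood(counts, partition, alphabet):
    """Maximum of the modified log-likelihood for a partition (0 * ln 0 taken as 0)."""
    total = 0.0
    for part in partition:
        n_ga, n_g = part_counts(counts, part, alphabet)
        total += sum(c * log(c / n_g) for c in n_ga.values() if c > 0)
    return total


# Toy example (not data from the article): order o = 1 over the alphabet {a, c, g, t}.
alphabet = "acgt"
x = "acgtacgtaacc"
counts = transition_counts(x, o=1)

# Maximum likelihood estimate of P(c | Gamma) for the single-state part Gamma = {"a"}.
n_ga, n_g = part_counts(counts, {"a"}, alphabet)
p_c_given_a = n_ga["c"] / n_g

# Log-likelihood with every state isolated versus all states pooled into one part.
ll_isolated = max_log_likelihood(counts, [{"a"}, {"c"}, {"g"}, {"t"}], alphabet)
ll_pooled = max_log_likelihood(counts, [set(alphabet)], alphabet)
```

Pooling states can only lower (or keep) the maximized log-likelihood, which is why a penalty on the number of parts is needed to make coarser partitions competitive.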
As shown in [2], under the assumptions of Definition 2, the partition $\mathcal{P}$ can be consistently (strong consistency) retrieved using the Bayesian information criterion (BIC), defined as
\[ \mathrm{BIC}(x_1^n, \mathcal{P}) = \sum_{a \in \Delta, \Gamma \in \mathcal{P}} N_n(\Gamma, a) \ln\left(\frac{N_n(\Gamma, a)}{N_n(\Gamma)}\right) - \frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha} \ln(n), \tag{6} \]
with $\alpha > 0$, a constant value. Then, the BIC takes into consideration the maximum of the modified log-likelihood term penalized by $\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha} \ln(n)$, where $(|\Delta| - 1)|\mathcal{P}|$ is the number of probabilities to be estimated.
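As a rough illustration of how Equation (6) is used to compare candidate partitions, the following Python sketch computes the BIC for two partitions of a toy sample. The function name `bic`, the default $\alpha = 2$, and the toy sequence are assumptions made for this example, not part of the original text.

```python
from collections import Counter
from math import log


def bic(x, partition, o, alphabet, alpha=2.0):
    """BIC of Equation (6) for a sample x and a partition of the length-o states."""
    n = len(x)
    counts = Counter((x[t - o:t], x[t]) for t in range(o, n))
    loglik = 0.0
    for part in partition:
        n_ga = [sum(counts[(s, a)] for s in part) for a in alphabet]
        n_g = sum(n_ga)
        # 0 * ln 0 is taken as 0, so zero counts are skipped.
        loglik += sum(c * log(c / n_g) for c in n_ga if c > 0)
    penalty = (len(alphabet) - 1) * len(partition) / alpha * log(n)
    return loglik - penalty


# Toy sample (hypothetical), order o = 1 over the alphabet {a, c}.
alphabet = "ac"
x = "aacacaccacaacacc" * 4
isolated = [{"a"}, {"c"}]   # each state in its own part
pooled = [{"a", "c"}]       # both states in a single part

bic_isolated = bic(x, isolated, 1, alphabet)
bic_pooled = bic(x, pooled, 1, alphabet)

# The candidate with the higher BIC is preferred.
best = max([isolated, pooled], key=lambda p: bic(x, p, 1, alphabet))
```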
In practice, candidate partitions for Definition 2 are compared, and the partition with the higher BIC value is considered more suitable. In [2], a metric based on the BIC is also introduced, along with clustering algorithms that use it to obtain $\mathcal{P}$; the metric is defined below. To achieve consistent estimation, the metric operates on partitions of the state space that follow certain rules, and it is then able to refine such partitions until it identifies the one given by Definition 2. The partitions to which we apply the metric are made up of parts formed by states sharing all their transition probabilities. The following definition formalizes the concept.
Definition 3.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$. Let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a partition of $\Omega$;
i. given a part $\Gamma$ of $\mathcal{P}$, $\Gamma$ is a good part if, $\forall a \in \Delta$, $P(a \mid s) = P(a \mid r)$, $\forall r, s \in \Gamma$, $r \ne s$;
ii. $\mathcal{P}$ is a good partition of $\Omega$ if $\Gamma$ satisfies i. $\forall \Gamma \in \mathcal{P}$.
When Definition 3-i holds, the probabilities introduced by Equation (2) satisfy
\[ P(a \mid \Gamma) = \mathrm{Prob}(X_t = a \mid X_{t-o}^{t-1} = s) \quad \forall a \in \Delta, \ \forall s \in \Gamma, \tag{7} \]
since all the elements of the good part $\Gamma$ of $\mathcal{P}$ share the transition probabilities. Note that the partition identified by Definition 2 satisfies Definition 3-ii, but the converse does not hold in general. A straightforward example of a good partition is the one in which every state forms its own part.
The following introduces a notion used to estimate the minimal partition (Definition 2). This criterion operates on good parts (Definition 3-i).
Definition 4.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$; $x_1^n$ is a sample of the process and let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a good partition of $\Omega$,
\[ d_{\mathcal{P}}(i, j) = \frac{\alpha}{(|\Delta| - 1)\ln(n)} \sum_{a \in \Delta} \left\{ \sum_{k \in \{i, j\}} N_n(\Gamma_k, a) \ln\left(\frac{N_n(\Gamma_k, a)}{N_n(\Gamma_k)}\right) - N_n(\Gamma_{ij}, a) \ln\left(\frac{N_n(\Gamma_{ij}, a)}{N_n(\Gamma_{ij})}\right) \right\}, \]
where $\alpha$ is a constant and positive value, $N_n(\Gamma_{ij}) = N_n(\Gamma_i) + N_n(\Gamma_j)$, $N_n(\Gamma_{ij}, a) = N_n(\Gamma_i, a) + N_n(\Gamma_j, a)$, $a \in \Delta$.
In [2], it is proved that $d_{\mathcal{P}}$ of Definition 4 is a metric, meaning that, if $\Gamma_l \in \mathcal{P}$, $l \in \{i, j, k\}$,
i. $d_{\mathcal{P}}(i, j) \ge 0$, with equality if and only if $\frac{N_n(\Gamma_i, a)}{N_n(\Gamma_i)} = \frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)}$ $\forall a \in \Delta$;
ii. $d_{\mathcal{P}}(i, j) = d_{\mathcal{P}}(j, i)$;
iii. $d_{\mathcal{P}}(i, k) \le d_{\mathcal{P}}(i, j) + d_{\mathcal{P}}(j, k)$.
As a consequence of property i of a metric, the ability of $d_{\mathcal{P}}$ to operate adequately depends on the accuracy of the maximum likelihood estimation of the transition probabilities $P(a \mid \Gamma_i)$ and $P(a \mid \Gamma_j)$, $a \in \Delta$. That is, when the estimators $\frac{N_n(\Gamma_i, a)}{N_n(\Gamma_i)}$ and $\frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)}$, $a \in \Delta$, are close and the sample size $n$ is large enough, we have evidence of proximity between $P(a \mid \Gamma_i)$ and $P(a \mid \Gamma_j)$, $a \in \Delta$. Such a finding indicates that the states of both parts should be placed in the same part.
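A small Python sketch may help to see how the distance of Definition 4 behaves. It works directly on the pooled counts $N_n(\Gamma, \cdot)$ of each part; the helper names (`plogp`, `d_metric`), the toy counts, and the sample size are hypothetical, and zero counts are skipped under the convention $0 \ln 0 = 0$.

```python
from math import log


def plogp(counts):
    """Sum over symbols of N(a) * ln(N(a) / N), with 0 * ln 0 taken as 0."""
    total = sum(counts.values())
    return sum(c * log(c / total) for c in counts.values() if c > 0)


def d_metric(counts_i, counts_j, n, alphabet_size, alpha=2.0):
    """BIC-based distance of Definition 4 between two parts, from their pooled counts."""
    merged = {a: counts_i.get(a, 0) + counts_j.get(a, 0)
              for a in set(counts_i) | set(counts_j)}
    gain = plogp(counts_i) + plogp(counts_j) - plogp(merged)
    return alpha / ((alphabet_size - 1) * log(n)) * gain


# Toy pooled counts over the alphabet {a, c} (hypothetical values).
part_1 = {"a": 20, "c": 60}
part_2 = {"a": 10, "c": 30}   # same empirical proportions as part_1
part_3 = {"a": 45, "c": 5}    # clearly different proportions
n = 170                       # toy sample size

d12 = d_metric(part_1, part_2, n, alphabet_size=2)
d13 = d_metric(part_1, part_3, n, alphabet_size=2)
d31 = d_metric(part_3, part_1, n, alphabet_size=2)
```

Here `d12` is zero up to rounding because the two parts have identical empirical transition proportions, while `d13` is strictly positive.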
Partition Markov models, that is, the models given by Definition 2, have found application in diverse areas. For instance, they have been employed in data compression in conjunction with Huffman coding, as exemplified in [3]. Across these investigations, the BIC-based metric $d_{\mathcal{P}}$ has proven indispensable. Also, in [2], this metric is pivotal for modeling the behavior of internet users: the partition Markov model gives the chances of a user visiting a certain internet site in their next step, based on their history, and identifies equivalent histories in the sense introduced by Definition 2.
Since $d_{\mathcal{P}}$ is built on the BIC, the question arises whether there is a broader criterion than the BIC that maintains strong consistency in the estimation of $\mathcal{P}$. The next section shows that such a criterion, a generalization of the BIC, exists, and that its strong consistency was proved in [4]. The next question, which we propose to answer, is whether this generalization of the BIC allows the construction of a metric that generalizes the one introduced in [2].
Section 2 addresses the problem by introducing the efficient determination criterion and showing how this criterion is linked to a metric; it also introduces a cut-off point that enables the practical use of the EDC-based metric for sufficiently large values of $n$. Section 3 presents an application in which different fits of the model of Definition 2 are compared, each inferred by a variant of the efficient determination criterion recommended in Section 2. The article ends with the Conclusions (Section 4), in which we highlight the main contributions, and the Bibliography.

2. Efficient Determination Criterion

Ref. [1] proposes the efficient determination criterion (EDC), which generalizes the BIC. The proposal is to introduce a sequence $\{w_n\}_{n \ge 1}$ in place of $\{\ln(n)\}_{n \ge 1}$; see Equation (6). The generalization also offers more options in the penalty term of Equation (6): instead of the number of parameters itself, a function $\gamma(\cdot)$ acting on the number of parameters is introduced; this function is strictly increasing in the number of parameters. Under the assumptions of Definition 2, the criterion is formulated as follows:
\[ \mathrm{EDC}(x_1^n, \mathcal{P}) = \sum_{a \in \Delta, \Gamma \in \mathcal{P}} N_n(\Gamma, a) \ln\left(\frac{N_n(\Gamma, a)}{N_n(\Gamma)}\right) - \gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) w_n, \tag{8} \]
with $\alpha > 0$ a constant value, $\gamma(\cdot)$ a strictly increasing function, and $\{w_n\}$ a sequence of positive numbers depending on $n$. As with the BIC, candidate partitions for Definition 2 are compared, and the higher the EDC, the more suitable the partition. Note that if we choose $\gamma(\cdot)$ as the identity function, so that $\gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) = \frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}$, and $w_n = \ln(n)$, then Equation (6) is recovered. Hence, the EDC is a generalization of the BIC.
Ref. [4] proves that the EDC criterion provides a strongly consistent way to estimate the partition $\mathcal{P}$ of Definition 2 if
\[ \lim_{n \to \infty} \frac{w_n}{n} = 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{w_n}{\ln(\ln(n))} = \infty. \tag{9} \]
Note that if we take $w_n = n^a$ for $a \in (0, 1)$, the conditions given in Equation (9) are valid. Also, we can use $w_n = a \ln(n)$ for $a > 0$. Another option is to use $w_n = n^a \ln(n)$ for $a \in (0, 1)$. Figure 1 shows penalty functions $w_n$ verifying Equation (9). We see in the figure that the functions $w_n$ are positioned between the functions $n$ and $\ln(\ln(n))$. The $w_n$ related to the BIC criterion ($w_n = \ln(n)$) also lies between $n$ and $\ln(\ln(n))$.
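The two conditions in Equation (9) can be explored numerically. The sketch below is only an illustration (a finite table of ratios is not a proof of a limit); the dictionary `penalties` and the chosen sample sizes are assumptions made for this example. For each penalty it evaluates $w_n / n$, which should shrink, and $w_n / \ln(\ln(n))$, which should grow.

```python
from math import log, sqrt

# Candidate penalty sequences w_n (names used only as labels in this sketch).
penalties = {
    "n^(1/2)": lambda n: sqrt(n),
    "n^(1/3)": lambda n: n ** (1 / 3),
    "ln(n)": lambda n: log(n),
    "ln(ln(n))": lambda n: log(log(n)),
}

sample_sizes = [10**2, 10**4, 10**6, 10**8]

# For each penalty, store the pairs (w_n / n, w_n / ln(ln(n))) along the sample sizes.
ratios = {
    name: [(w(n) / n, w(n) / log(log(n))) for n in sample_sizes]
    for name, w in penalties.items()
}

for name, pairs in ratios.items():
    first = ", ".join(f"{r[0]:.2e}" for r in pairs)
    second = ", ".join(f"{r[1]:.2f}" for r in pairs)
    print(f"{name:>10}: w_n/n = [{first}]  w_n/ln(ln(n)) = [{second}]")
```

For $w_n = \ln(\ln(n))$ the second ratio is constantly 1, which shows why this penalty fails the second condition of Equation (9) and needs a separate argument.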
Clearly, the penalty $\ln(\ln(n))$ does not verify the second statement of Equation (9), but according to [5] it is an optimal penalty term for estimating the order of a Markov chain. With such inspiration in mind, the following proposition guarantees that $\ln(\ln(n))$ can also be used to obtain a consistent estimate of $\mathcal{P}$. To state the proposition we introduce the notion of relative entropy.
Definition 5.
Let $P$ and $Q$ be probability distributions on $\Delta$. The relative entropy between $P$ and $Q$ is given by $D\left(P(\cdot) \,\|\, Q(\cdot)\right) = \sum_{a \in \Delta} P(a) \ln\left(\frac{P(a)}{Q(a)}\right)$, with $Q(a) \ne 0$, $a \in \Delta$.
Proposition 1.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$; $x_1^n$ is a sample of the process and let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a partition of $\Omega$, and $P(\cdot \mid \Gamma)$ be the probability given by Equation (2) related to a good part $\Gamma$ (Definition 3-i). To any $\delta > 0$ there exists $\kappa > 0$ (depending on $P(\cdot \mid \cdot)$) such that, eventually, almost surely as $n \to \infty$,
\[ \left| \frac{N_n(\Gamma, a)}{N_n(\Gamma)} - P(a \mid \Gamma) \right| < \sqrt{\frac{\delta \ln(\ln(n))}{N_n(\Gamma)}}, \]
for all $\Gamma$, good part, with $N_n(\Gamma) \ge 1$ and $o < \kappa \ln(\ln(n))$.
Proof. 
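From the proof of Corollary 2 of [6] (on page 1621), we obtain that for any $\epsilon > 0$ there is $\kappa > 0$ (depending on $P(\cdot \mid \cdot)$) such that, eventually, almost surely as $n \to \infty$,
\[ \left| \frac{N_n(s, a)}{N_n(s)} - P(a \mid s) \right| < \sqrt{\frac{\epsilon \ln(\ln(n))}{N_n(s)}}, \tag{10} \]
for all $s \in \Omega$ with $N_n(s) \ge 1$ and $o < \kappa \ln(\ln(n))$.
Consider $\delta > 0$ and set $\epsilon = \frac{\delta}{|\Delta|^{2o}}$ in Equation (10); then
\[ \left| \frac{N_n(s, a)}{N_n(s)} - P(a \mid s) \right| < \sqrt{\frac{\delta \ln(\ln(n))}{|\Delta|^{2o} N_n(s)}} \ \Rightarrow \ \left| N_n(s, a) - N_n(s) P(a \mid s) \right| < \sqrt{\frac{\delta \ln(\ln(n))}{|\Delta|^{2o}} N_n(s)}. \]
Because $\Gamma$ is a good part of $\mathcal{P}$, $P(a \mid s) = P(a \mid \Gamma)$ for all $s \in \Gamma$; summing over $s \in \Gamma$, we obtain
\[ \left| \sum_{s \in \Gamma} N_n(s, a) - P(a \mid \Gamma) \sum_{s \in \Gamma} N_n(s) \right| < \sum_{s \in \Gamma} \sqrt{\frac{\delta \ln(\ln(n))}{|\Delta|^{2o}} N_n(s)}. \]
Following Equations (3), (4) and (7), we have
\[ \left| N_n(\Gamma, a) - P(a \mid \Gamma) N_n(\Gamma) \right| < \frac{\sqrt{\delta \ln(\ln(n))}}{|\Delta|^{o}} \sum_{s \in \Gamma} \sqrt{N_n(s)}, \]
then,
\[ \left| \frac{N_n(\Gamma, a)}{N_n(\Gamma)} - P(a \mid \Gamma) \right| < \frac{\sqrt{\delta \ln(\ln(n))}}{|\Delta|^{o} N_n(\Gamma)} |\Gamma| \max_{s \in \Gamma} \sqrt{N_n(s)} \le \frac{\sqrt{\delta \ln(\ln(n))}}{|\Delta|^{o} N_n(\Gamma)} |\Delta|^{o} \sqrt{\sum_{s \in \Gamma} N_n(s)} = \frac{\sqrt{\delta \ln(\ln(n))} \sqrt{N_n(\Gamma)}}{N_n(\Gamma)} = \sqrt{\frac{\delta \ln(\ln(n))}{N_n(\Gamma)}}. \]
□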
The next results show that despite $\ln(\ln(n))$ violating the second condition imposed by Equation (9), the EDC (with $w_n = \ln(\ln(n))$) provides a consistent estimate of the minimal partition.
Theorem 1.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$; $x_1^n$ is a sample of the process and let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a partition of $\Omega$, and suppose that there exist $i \ne j$ such that $\Gamma_i$ and $\Gamma_j$ are good parts (Definition 3-i). Then, $P(a \mid \Gamma_i) = P(a \mid \Gamma_j)$, $\forall a \in \Delta$ if, and only if, eventually, almost surely as $n \to \infty$,
\[ \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) > \mathrm{EDC}(x_1^n, \mathcal{P}), \]
where $\mathrm{EDC}(x_1^n, \mathcal{P})$ is defined by Equation (8), with $w_n = \ln(\ln(n))$, and $\mathrm{EDC}(x_1^n, \mathcal{P}^{ij})$ is given by Equation (8) (with $w_n = \ln(\ln(n))$) over the partition $\mathcal{P}^{ij} = (\mathcal{P} \setminus \{\Gamma_i, \Gamma_j\}) \cup \{\Gamma_{ij}\}$, with $\Gamma_{ij} = \Gamma_i \cup \Gamma_j$.
Proof. 
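The proof is a variant of the one presented in [2], Theorem 1. The implication ⇐ follows directly from that proof, considering (i) that $\frac{\ln(\ln(n))}{n} \to 0$ instead of $\frac{\ln(n)}{n} \to 0$ when $n \to \infty$, and (ii) that $\gamma(\cdot)$ is an increasing function. For ⇒, we have that $P(a \mid \Gamma_i) = P(a \mid \Gamma_j)$, $\forall a \in \Delta$, and we want to prove that $\mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) < 0$. Again, following the steps in that proof, we obtain that $\mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij})$ is bounded above by
\[ N_n(\Gamma_i) D\left(\frac{N_n(\Gamma_i, \cdot)}{N_n(\Gamma_i)} \,\Big\|\, P(\cdot \mid \Gamma_i)\right) + N_n(\Gamma_j) D\left(\frac{N_n(\Gamma_j, \cdot)}{N_n(\Gamma_j)} \,\Big\|\, P(\cdot \mid \Gamma_j)\right) - \left[\gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) - \gamma\left(\frac{(|\Delta| - 1)(|\mathcal{P}| - 1)}{\alpha}\right)\right] \ln(\ln(n)), \]
where $D(P(\cdot) \,\|\, Q(\cdot))$ is the relative entropy, given by Definition 5.
For each $\Gamma \in \{\Gamma_i, \Gamma_j\}$, $\frac{N_n(\Gamma, \cdot)}{N_n(\Gamma)}$ and $P(\cdot \mid \Gamma)$ are probabilities on $\Delta$; then, Equation (11) follows from Lemma 6.3 in [7]. On the other hand, since $\Gamma \in \{\Gamma_i, \Gamma_j\}$ is a good part, by hypothesis, from Proposition 1, for any $\delta > 0$ and large enough $n$, Equation (12) follows,
\[ D\left(\frac{N_n(\Gamma, \cdot)}{N_n(\Gamma)} \,\Big\|\, P(\cdot \mid \Gamma)\right) \le \sum_{a \in \Delta} \frac{\left(\frac{N_n(\Gamma, a)}{N_n(\Gamma)} - P(a \mid \Gamma)\right)^2}{P(a \mid \Gamma)} \tag{11} \]
\[ \le \sum_{a \in \Delta} \frac{\delta \ln(\ln(n))}{N_n(\Gamma) P(a \mid \Gamma)}. \tag{12} \]
Then, set $c_0 = \gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) - \gamma\left(\frac{(|\Delta| - 1)(|\mathcal{P}| - 1)}{\alpha}\right)$, which is $> 0$, since $\gamma(\cdot)$ is a strictly increasing function. For any $\delta > 0$ and large enough $n$,
\[ \mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) \le \frac{2 \delta |\Delta|}{p} \ln(\ln(n)) - c_0 \ln(\ln(n)) = \ln(\ln(n)) \left(\frac{2 \delta |\Delta|}{p} - c_0\right), \]
where $p = \min\{P(a \mid \Gamma) : a \in \Delta, \Gamma \in \{\Gamma_i, \Gamma_j\}\}$. In particular, taking $\delta < \frac{p c_0}{2 |\Delta|}$, for a large enough $n$, $\mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) < 0$. □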
As a result of the previous theorem, the EDC with the penalty term $w_n = \ln(\ln(n))$ allows the consistent estimation of the minimal partition. As a consequence, we have:
Corollary 1.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$; $x_1^n$ is a sample of the process. Let $\Psi$ be the set of all the partitions of $\Omega$. Define
\[ \mathcal{P}_n^* = \underset{\mathcal{P} \in \Psi}{\mathrm{argmax}} \{\mathrm{EDC}(x_1^n, \mathcal{P})\}, \]
where $\mathrm{EDC}(x_1^n, \mathcal{P})$ is defined by Equation (8), with $w_n = \ln(\ln(n))$. Then, eventually, almost surely as $n \to \infty$, $\mathcal{P}^* = \mathcal{P}_n^*$, where $\mathcal{P}^*$ is the partition of $\Omega$ following Definition 2.
Proof. 
The proof follows the same steps as the proof of Theorem 3 of [2]: it is enough to replace the BIC criterion with the EDC criterion (Equation (8)) with $w_n = \ln(\ln(n))$ and to apply Theorem 1 instead of Theorem 1 and Corollary 1 of [2]. □
Corollary 1 complements the results of [4], showing that the minimal partition (Definition 2) is consistently recovered by the EDC (Equation (8)) when it is formulated with a strictly increasing function $\gamma$ and $w_n$ follows Equation (9), or when $w_n = \ln(\ln(n))$.
In order to generalize the BIC-based metric $d_{\mathcal{P}}$, given by Definition 4, the following notion is introduced.
Definition 6.
Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and state space $\Omega = \Delta^o$; $x_1^n$ is a sample of the process, let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a good partition of $\Omega$, and $1 \le i, j \le |\mathcal{P}|$, $i \ne j$:
\[ \delta_{\mathcal{P}}(i, j) = v_n \sum_{a \in \Delta} \left\{ \sum_{k \in \{i, j\}} N_n(\Gamma_k, a) \ln\left(\frac{N_n(\Gamma_k, a)}{N_n(\Gamma_k)}\right) - N_n(\Gamma_{ij}, a) \ln\left(\frac{N_n(\Gamma_{ij}, a)}{N_n(\Gamma_{ij})}\right) \right\}, \]
where $N_n(\Gamma_{ij}) = N_n(\Gamma_i) + N_n(\Gamma_j)$, $N_n(\Gamma_{ij}, a) = N_n(\Gamma_i, a) + N_n(\Gamma_j, a)$, $a \in \Delta$, with $v_n^{-1} = w_n \left[\gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) - \gamma\left(\frac{(|\Delta| - 1)(|\mathcal{P}| - 1)}{\alpha}\right)\right]$, $\alpha$ a constant and positive value, $\gamma(\cdot)$ being a strictly increasing function, and $\{w_n\}$ a sequence of positive numbers depending on $n$.
It is evident that if we take $\gamma\left(\frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}\right) = \frac{(|\Delta| - 1)|\mathcal{P}|}{\alpha}$ and $w_n = \ln(n)$, Definition 6 coincides with Definition 4.
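The reduction of Definition 6 to Definition 4 can be written out in one line. Taking $\gamma$ as the identity function and $w_n = \ln(n)$ in the expression for $v_n^{-1}$ gives:

```latex
v_n^{-1}
  = \ln(n)\left[\frac{(|\Delta|-1)\,|\mathcal{P}|}{\alpha}
              - \frac{(|\Delta|-1)\,(|\mathcal{P}|-1)}{\alpha}\right]
  = \frac{(|\Delta|-1)\ln(n)}{\alpha}
\quad\Longrightarrow\quad
v_n = \frac{\alpha}{(|\Delta|-1)\ln(n)},
```

which is exactly the factor multiplying the sum in Definition 4. In this case $v_n$ does not depend on the number of parts $|\mathcal{P}|$; for a general strictly increasing $\gamma$ it may.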
The next result shows the relationship between the EDC criterion and the notion introduced in Definition 6.
Theorem 2.
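Let $(X_t)$ be a Markov chain of order $o$, with finite and discrete alphabet $\Delta$, $o < \infty$, and $\Omega = \Delta^o$; $x_1^n$ is a sample of the process. Let $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ be a good partition of $\Omega$, and $1 \le i, j \le |\mathcal{P}|$, $i \ne j$; then,
\[ \mathrm{EDC}(x_1^n, \mathcal{P}) < \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) \iff \delta_{\mathcal{P}}(i, j) < 1, \]
where $\delta_{\mathcal{P}}(i, j)$ is given by Definition 6, $\mathrm{EDC}(x_1^n, \mathcal{P})$ is defined by Equation (8), and $\mathrm{EDC}(x_1^n, \mathcal{P}^{ij})$ is given by Equation (8) over the partition $\mathcal{P}^{ij} = (\mathcal{P} \setminus \{\Gamma_i, \Gamma_j\}) \cup \{\Gamma_{ij}\}$, with $\Gamma_{ij} = \Gamma_i \cup \Gamma_j$.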
Proof. 
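\[ \mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) = \sum_{a \in \Delta} \sum_{k \in \{i, j\}} N_n(\Gamma_k, a) \ln\left(\frac{N_n(\Gamma_k, a)}{N_n(\Gamma_k)}\right) - \sum_{a \in \Delta} N_n(\Gamma_{ij}, a) \ln\left(\frac{N_n(\Gamma_{ij}, a)}{N_n(\Gamma_{ij})}\right) - v_n^{-1}. \tag{13} \]
Note that $\mathrm{EDC}(x_1^n, \mathcal{P}) < \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) \iff \mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) < 0$, and $\mathrm{EDC}(x_1^n, \mathcal{P}) - \mathrm{EDC}(x_1^n, \mathcal{P}^{ij}) < 0 \iff \delta_{\mathcal{P}}(i, j) < 1$, applying Equation (13), since $v_n > 0$. □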
Remark 1.
In order to guarantee the consistent estimation of the partition given by Definition 2, we note that Theorem 2 must be used for a large enough $n$ and with weights $w_n$ following Equation (9) or $w_n = \ln(\ln(n))$.
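The equivalence stated in Theorem 2 can be checked numerically on toy data. The following Python sketch is an illustration under stated assumptions: the function names (`edc`, `delta_metric`, `merge`), the toy count vectors, and the toy sample size are hypothetical, $\gamma$ defaults to the identity, and zero counts are skipped under the convention $0 \ln 0 = 0$.

```python
from math import log, sqrt


def plogp(counts):
    """Sum over symbols of N(a) * ln(N(a) / N), with 0 * ln 0 taken as 0."""
    total = sum(counts)
    return sum(c * log(c / total) for c in counts if c > 0)


def edc(parts, w_n, alphabet_size, alpha=2.0, gamma=lambda u: u):
    """EDC of Equation (8); parts is a list of pooled count vectors N_n(Gamma, .)."""
    loglik = sum(plogp(c) for c in parts)
    return loglik - gamma((alphabet_size - 1) * len(parts) / alpha) * w_n


def delta_metric(parts, i, j, w_n, alphabet_size, alpha=2.0, gamma=lambda u: u):
    """delta_P(i, j) of Definition 6."""
    merged = [ci + cj for ci, cj in zip(parts[i], parts[j])]
    gain = plogp(parts[i]) + plogp(parts[j]) - plogp(merged)
    k = alphabet_size - 1
    v_inv = w_n * (gamma(k * len(parts) / alpha) - gamma(k * (len(parts) - 1) / alpha))
    return gain / v_inv


def merge(parts, i, j):
    """Partition P^{ij}: parts i and j replaced by their union."""
    merged = [ci + cj for ci, cj in zip(parts[i], parts[j])]
    return [c for idx, c in enumerate(parts) if idx not in (i, j)] + [merged]


# Toy pooled counts over {a, c, g, t} for three parts (hypothetical values).
parts = [[30, 10, 5, 5], [28, 12, 6, 4], [5, 5, 10, 30]]
n = 153  # toy sample size

penalties = {"n^(1/2)": sqrt(n), "n^(1/3)": n ** (1 / 3),
             "ln(n)": log(n), "ln(ln(n))": log(log(n))}

# For each penalty and pair, record (EDC prefers merging, delta < 1).
checks = {}
for name, w_n in penalties.items():
    for i, j in [(0, 1), (0, 2)]:
        prefers_merge = edc(parts, w_n, 4) < edc(merge(parts, i, j), w_n, 4)
        checks[(name, i, j)] = (prefers_merge, delta_metric(parts, i, j, w_n, 4) < 1)
```

In every entry of `checks` the two booleans agree, as Theorem 2 states; the check itself says nothing about consistency, which additionally requires a large $n$ and an admissible $w_n$ (Remark 1).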
The following theorem characterizes the notion given by Definition 6 as being a metric.
Theorem 3.
Let $(X_t)$ be a Markov chain of order $o$ over a finite and discrete alphabet $\Delta$, $o < \infty$, $\Omega = \Delta^o$ the state space, and $x_1^n$ a sample of the Markov process. If $\mathcal{P} = \{\Gamma_1, \ldots, \Gamma_{|\mathcal{P}|}\}$ is a good partition of $\Omega$, for each $n$, and for any $i, j, k \in \{1, 2, \ldots, |\mathcal{P}|\}$, given $\delta_{\mathcal{P}}$ as in Definition 6,
i. $\delta_{\mathcal{P}}(i, j) \ge 0$, with equality if and only if $\frac{N_n(\Gamma_i, a)}{N_n(\Gamma_i)} = \frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)}$ $\forall a \in \Delta$;
ii. $\delta_{\mathcal{P}}(i, j) = \delta_{\mathcal{P}}(j, i)$;
iii. $\delta_{\mathcal{P}}(i, k) \le \delta_{\mathcal{P}}(i, j) + \delta_{\mathcal{P}}(j, k)$.
Proof. 
Here, we only prove iii., since item i. follows directly from Theorem 2 of [2] and ii. follows from the definition. Consider the relative entropy between two probabilities $P$ and $Q$ on the alphabet $\Delta$, $D(P(\cdot) \,\|\, Q(\cdot)) = \sum_{a \in \Delta} P(a) \ln(P(a) / Q(a))$. $D$ is non-negative; furthermore, $D$ is zero if and only if $P(\cdot) = Q(\cdot)$. Returning to our goal, iii. occurs if and only if
\[ 0 \le v_n^{-1} \left(\delta_{\mathcal{P}}(i, j) + \delta_{\mathcal{P}}(j, k) - \delta_{\mathcal{P}}(i, k)\right) = (*), \tag{14} \]
since $v_n > 0$. We inspect the right side of Equation (14),
\[
\begin{aligned}
(*) ={}& \sum_{s = i, k} \sum_{a \in \Delta} N_n(\Gamma_j, a) \left[\ln\left(\frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)}\right) - \ln\left(\frac{N_n(\Gamma_{js}, a)}{N_n(\Gamma_{js})}\right)\right] \\
&+ \sum_{s = i, k} \sum_{a \in \Delta} N_n(\Gamma_s, a) \left[\ln\left(\frac{N_n(\Gamma_{ik}, a)}{N_n(\Gamma_{ik})}\right) - \ln\left(\frac{N_n(\Gamma_{sj}, a)}{N_n(\Gamma_{sj})}\right)\right] \\
={}& \sum_{s = i, k} N_n(\Gamma_j) \sum_{a \in \Delta} \frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)} \ln\left(\frac{N_n(\Gamma_j, a)}{N_n(\Gamma_j)} \Big/ \frac{N_n(\Gamma_{js}, a)}{N_n(\Gamma_{js})}\right) \\
&+ \sum_{s = i, k} \sum_{a \in \Delta} N_n(\Gamma_s, a) \ln\left(\frac{N_n(\Gamma_{ik}, a)}{N_n(\Gamma_{ik})} \Big/ \frac{N_n(\Gamma_{sj}, a)}{N_n(\Gamma_{sj})}\right) \\
\overset{(1)}{=}{}& N_n(\Gamma_j) \sum_{s = i, k} D\left(\frac{N_n(\Gamma_j, \cdot)}{N_n(\Gamma_j)} \,\Big\|\, \frac{N_n(\Gamma_{js}, \cdot)}{N_n(\Gamma_{js})}\right) \\
&+ \sum_{s = i, k} \sum_{a \in \Delta} \frac{N_n(\Gamma_s, a) N_n(\Gamma_{ik})}{N_n(\Gamma_{ik}, a)} \frac{N_n(\Gamma_{ik}, a)}{N_n(\Gamma_{ik})} \ln\left(\frac{N_n(\Gamma_{ik}, a)}{N_n(\Gamma_{ik})} \Big/ \frac{N_n(\Gamma_{sj}, a)}{N_n(\Gamma_{sj})}\right) \\
\overset{(2)}{\ge}{}& N_n(\Gamma_j) \sum_{s = i, k} D\left(\frac{N_n(\Gamma_j, \cdot)}{N_n(\Gamma_j)} \,\Big\|\, \frac{N_n(\Gamma_{js}, \cdot)}{N_n(\Gamma_{js})}\right) + \sum_{s = i, k} \frac{1}{n} D\left(\frac{N_n(\Gamma_{ik}, \cdot)}{N_n(\Gamma_{ik})} \,\Big\|\, \frac{N_n(\Gamma_{sj}, \cdot)}{N_n(\Gamma_{sj})}\right) \\
\overset{(3)}{\ge}{}& 0,
\end{aligned}
\]
where (1) follows from the definition of the relative entropy $D$ between the two empirical laws $\frac{N_n(\Gamma_j, \cdot)}{N_n(\Gamma_j)}$ and $\frac{N_n(\Gamma_{js}, \cdot)}{N_n(\Gamma_{js})}$; (2) follows from $\frac{1}{N_n(\Gamma_{ik}, a)} \ge \frac{1}{n}$ and $N_n(\Gamma_s, a) N_n(\Gamma_{ik}) \ge 1$, and using the relative entropy $D$ between the two empirical laws $\frac{N_n(\Gamma_{ik}, \cdot)}{N_n(\Gamma_{ik})}$ and $\frac{N_n(\Gamma_{sj}, \cdot)}{N_n(\Gamma_{sj})}$. The last inequality (3) is valid since $D$ is non-negative. Then, Equation (14) is valid. □
We saw in Theorem 2 that an improvement in the construction of the partition obtained by joining the parts $\Gamma_i$ and $\Gamma_j$ is detected by the metric $\delta_{\mathcal{P}}$ taking a value less than 1, a value that can be used as a reference threshold. This property should only be used for large enough values of $n$ and with a penalization term following Equation (9), according to [4], or with $w_n = \ln(\ln(n))$, according to Theorem 1 and Corollary 1; these are the cases in which the EDC criterion consistently estimates the partition that follows Definition 2.
The following section shows an example of applying the metric (Definition 6) to real data. We seek to show how the model (Definition 2) varies as the penalty term $w_n$ varies, for cases where the consistency of the estimate is guaranteed when $n$ is large enough, that is, with terms $w_n$ following Remark 1.
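The threshold 1 suggests a simple greedy merging procedure. The sketch below is one possible agglomerative scheme written for illustration and is not necessarily the algorithm used in [2]: it starts from the good partition with every state isolated and repeatedly joins the closest pair of parts while their $\delta_{\mathcal{P}}$ is below 1. The function names, the toy counts, and the choice $w_n = \ln(n)$ are assumptions; $\gamma$ defaults to the identity and zero counts are skipped ($0 \ln 0 = 0$).

```python
from itertools import combinations
from math import log


def plogp(counts):
    """Sum over symbols of N(a) * ln(N(a) / N), with 0 * ln 0 taken as 0."""
    total = sum(counts.values())
    return sum(c * log(c / total) for c in counts.values() if c > 0)


def delta(ci, cj, n_parts, w_n, alphabet_size, alpha=2.0, gamma=lambda u: u):
    """delta_P(i, j) of Definition 6 from the pooled counts of two parts."""
    merged = {a: ci[a] + cj[a] for a in ci}
    gain = plogp(ci) + plogp(cj) - plogp(merged)
    k = alphabet_size - 1
    v_inv = w_n * (gamma(k * n_parts / alpha) - gamma(k * (n_parts - 1) / alpha))
    return gain / v_inv


def agglomerate(state_counts, w_n, alphabet_size, alpha=2.0):
    """Greedy sketch: join the closest pair of parts while its delta is below 1."""
    parts = [(frozenset([s]), dict(c)) for s, c in state_counts.items()]
    while len(parts) > 1:
        (i, j), best = min(
            (((i, j), delta(parts[i][1], parts[j][1], len(parts), w_n,
                            alphabet_size, alpha))
             for i, j in combinations(range(len(parts)), 2)),
            key=lambda item: item[1],
        )
        if best >= 1:
            break
        states = parts[i][0] | parts[j][0]
        counts = {a: parts[i][1][a] + parts[j][1][a] for a in parts[i][1]}
        parts = [p for idx, p in enumerate(parts) if idx not in (i, j)]
        parts.append((states, counts))
    return {p[0] for p in parts}


# Toy counts N_n(s, a) for order o = 2 over the alphabet {a, c} (hypothetical values).
state_counts = {
    "aa": {"a": 90, "c": 10},
    "ac": {"a": 90, "c": 10},
    "ca": {"a": 10, "c": 90},
    "cc": {"a": 10, "c": 90},
}
n = 402  # toy sample size: 400 transitions plus the first o = 2 symbols
estimated = agglomerate(state_counts, w_n=log(n), alphabet_size=2)
```

On these toy counts the procedure returns two parts, `{aa, ac}` and `{ca, cc}`, since states within each pair have identical empirical transition proportions and the two pairs differ markedly.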
In the application we consider genetic sequences of dengue virus in FASTA format. In [8], a variant of the model specified in Definition 2 is designed specifically to model the first DNA sequence (FASTA format) of SARS-CoV-2 virus, Genbank number MN908947 (accessible at: https://www.ncbi.nlm.nih.gov/nuccore/MN908947, accessed on 10 March 2024). Furthermore, partition Markov models have been instrumental in modeling the DNA of the SARS-CoV-2 virus, exposing the evolution of the various variants during the pandemic period (see, for example, [9]).

3. Application

We examine and model three DNA sequences sourced from dengue virus type 3 (DENV-3), originating from Brazil. These sequences were sequenced and made publicly available in early 2023 (https://www.ncbi.nlm.nih.gov/, accessed on 10 March 2024). We then compare the models derived from these sequences by applying the metric (Definition 6) and employing the agglomerative algorithm. Our analysis focuses on observing the variations in partition composition and probability magnitudes as we change the penalization term $w_n$.
According to [10], the genesis of the initial autochthonous case of DENV-3 (GIII-American-I lineage) in Brazil dates back to December 2000, specifically within Rio de Janeiro. Over the course of the 2000s multiple incursions of this lineage were documented from the Caribbean into Brazil. The northern and southeastern regions of Brazil swiftly emerged as the epicenters of dissemination. The advent of this lineage precipitated a significant dengue outbreak in Brazil, in Rio de Janeiro, in 2002, followed by subsequent outbreaks in diverse locales.
However, since 2010, publicly available data indicate a downward tendency in the prevalence of DENV-3; DENV-3 has represented a mere fraction (<1%) of the total dengue cases in Brazil, with scant confirmed instances reported. Consequently, the transmission of DENV-3 has not been substantiated in recent years, pointing to a potential extinction of the DENV-3 (GIII-American-I lineage) within Brazil. The resurgence of DENV-3 is a real challenge in Brazil, since the population is expected to lack immunity, given the length of time during which this virus has not been found in the region.
Table 1 shows the GenBank numbers, collection date, and origins of the three sequences, introduced by [10]. The records correspond to three complete genetic sequences in FASTA format (alphabet $\Delta = \{a, c, g, t\}$) of DENV-3, which is already native to Brazil.
We assume that each of these sequences is a sample of a process that meets Definition 2. We proceed to fit the model (Definition 2) using the metric (Definition 6) and the agglomerative algorithm. For this, we take into account the alphabet $\Delta = \{a, c, g, t\}$, with cardinal $|\Delta| = 4$, where the sequences take their values. In Table 2, we show the frequencies for each element of the alphabet.
Considering that min{10,697, 10,511, 10,553} = 10,511 and $\log_{|\Delta|}(10{,}511) = 6.68$, with integer part equal to 6, we adopt $o = 3$, since $3 < 6$ and the elements of the genetic alphabet $\Delta$ are organized in multiples of 3.
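The arithmetic behind the choice of the order can be reproduced in a few lines of Python. The variable names are illustrative only; the three lengths are the ones quoted above.

```python
from math import floor, log

lengths = [10_697, 10_511, 10_553]   # lengths of the three sequences quoted in the text
alphabet_size = 4                    # |Delta| for {a, c, g, t}

n_min = min(lengths)
bound = log(n_min, alphabet_size)    # log base |Delta| of the shortest length
max_order = floor(bound)             # integer part of the bound

o = 3                                # order adopted in the text
n_states = alphabet_size ** o        # size of the state space Omega = Delta^o

print(f"log_{alphabet_size}({n_min}) = {bound:.2f}, integer part {max_order}")
print(f"o = {o} gives |Omega| = {n_states} states")
```

With $o = 3$ the state space has $4^3 = 64$ states, so each estimated partition in the tables groups these 64 states into parts.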
We fit four scenarios for each of the three sequences OQ706226, OQ706227, and OQ706228, with each scenario governed by a different penalty $w_n$. All scenarios use Definition 6 with $\Delta = \{a, c, g, t\}$, $o = 3$, $\alpha = 2$ (see [11]), and $\gamma$ the identity function. For each penalty, we use the metric (Definition 6) to obtain the estimate of the partition given by Definition 2, and then determine the transition probabilities of each part for each element of $\Delta$. We denote by $\Gamma_i^v$ the part $i$ estimated for the sequence $v$, where $v$ can be $A$, $B$, $C$, corresponding to OQ706226, OQ706227, and OQ706228, respectively. In Table 3 and Table 4, we record the results for the three sequences with penalty $w_n = n^{1/2}$; Table 5 and Table 6 report the results for the three sequences with penalty $w_n = n^{1/3}$; and Table 7 and Table 8 show, for the three sequences, the results using the usual BIC penalty ($w_n = \ln(n)$). Finally, Table 9, Table 10 and Table 11 report the results with the penalty $w_n = \ln(\ln(n))$ (see Corollary 1) for the sequences OQ706226, OQ706227, and OQ706228, respectively.
We observe from Table 3, Table 5, Table 7 and Table 9 (right)–Table 11 (right) that as the penalty $w_n$ is reduced (that is, when $w_n$ approaches the lower limit $\ln(\ln(n))$), the model is allowed to acquire more parameters, in this case, more parts.
Given a penalization $w_n$, the three sequences, A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228), show a similar number of parts. More specifically, for the penalty $w_n = n^{1/2}$ the behavior of the three sequences is represented by two parts (Table 3), and for $w_n = n^{1/3}$ the behavior of the three sequences is described by four parts (Table 5). For $w_n = \ln(n)$, OQ706226 is modeled by five parts while the other two are modeled by six parts (see Table 7). For the penalty $w_n = \ln(\ln(n))$, OQ706226 is modeled by a partition with 13 parts (see Table 9, right) while the other two sequences are modeled by 14 parts; see Table 10 (right) and Table 11 (right).
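To get a feeling for why smaller penalties yield more parts, the following Python sketch evaluates the four penalties at $n = 10{,}511$ (the shortest of the three lengths, used here only as a representative value). With $\gamma$ the identity, $|\Delta| = 4$ and $\alpha = 2$, Definition 6 gives $v_n^{-1} = w_n (|\Delta| - 1)/\alpha$, and two parts are joined ($\delta_{\mathcal{P}} < 1$) exactly when the loss in maximized log-likelihood caused by the merge is below $v_n^{-1}$. The variable names are hypothetical.

```python
from math import log, sqrt

n = 10_511                      # representative sample size (shortest sequence)
alphabet_size, alpha = 4, 2.0   # settings used in the application

penalties = {
    "n^(1/2)": sqrt(n),
    "n^(1/3)": n ** (1 / 3),
    "ln(n)": log(n),
    "ln(ln(n))": log(log(n)),
}

# With gamma the identity, v_n^{-1} = w_n * (|Delta| - 1) / alpha: the largest
# log-likelihood loss a merge may cause while still giving delta < 1.
merge_threshold = {name: w * (alphabet_size - 1) / alpha for name, w in penalties.items()}

for name in penalties:
    print(f"{name:>10}: w_n = {penalties[name]:8.3f}   v_n^-1 = {merge_threshold[name]:8.3f}")
```

The thresholds decrease from $n^{1/2}$ down to $\ln(\ln(n))$, so fewer merges are accepted and the estimated partition keeps more parts, in line with the counts of parts reported above.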
The formal determination of whether the identified models, under each penalty, exhibit significant differences lies beyond the scope of this application. However, we acknowledge it as an open question worthy of further exploration.
The following observation applies to all three sequences. Observing the magnitudes of the transition probabilities, marked in bold, we note that, as reported in Table 6, Table 8, and Table 9 (left)–Table 11 (left), in most parts the most probable transition is to element $a$ of the alphabet $\{a, c, g, t\}$, followed by parts in which the transition to element $g$ prevails. As for Table 4, which reports the most penalized case ($w_n = n^{1/2}$), one part is recorded with prevalence for $a$ and another with prevalence for $g$, which is natural, since the model only has two parts. As Table 9, Table 10 and Table 11 show, under the penalty $w_n = \ln(\ln(n))$, the three sequences share the same part $\{agc, ggt, gag, ata\}$, with a prevalence for the element $t$ of the alphabet $\Delta$ that is of lower magnitude than those previously mentioned.

4. Conclusions

The main objective of this paper, developed in Section 2, is to introduce a new notion based on the Equation (8), as given in Definition 6. This concept is used to identify the minimal partition of a Markov chain—Definition 2. Theorem 3 proves that the concept in Definition 6 constitutes a metric. Furthermore, Theorem 2 establishes the relationship between this new metric and the operation of the EDC criterion, showing that in an iterative process, selecting a partition with a higher EDC value is equivalent to using the value 1 as a threshold in the metric. In this way, we achieve our main goal of proposing an EDC-based metric to estimate the minimal partition.
Our results add to those of [4], in the search to characterize penalty terms that can be used in the EDC criterion to obtain the consistent estimation of the minimal partition. Ref. [4] demonstrates that the EDC, under certain conditions on the term w n (Equation (9)), provides a strongly consistent estimate of the minimal partition, as defined in Definition 2. Building on the results from [5], we conjectured that using w n = ln ( ln ( n ) ) might preserve strong consistency, even though this term does not satisfy the second condition imposed by Equation (9). We confirm in Theorem 1 and Corollary 1 that strong consistency is indeed achieved using the EDC with the penalization term w n = ln ( ln ( n ) ) .
We conclude the article with an application demonstrating the effect of the metric introduced in Definition 6 on the estimation of the minimal partition (Definition 2), using the various penalty terms discussed in Remark 1. For this purpose, we analyze three dengue virus type 3 sequences in FASTA format, native to Brazil and collected in 2023. The application shows that relaxing the penalty results in higher cardinalities for the estimated partition. We identify which parts (collections of states) of the dengue sequences have a greater or lesser preference for transitioning to each element ($a$, $c$, $g$, or $t$) of the alphabet $\Delta = \{a, c, g, t\}$. As expected, the models identified for the three sequences exhibit similar features when the same penalty is applied, which is natural given that the sequences share the same collection year and region of origin.

Author Contributions

Conceptualization, J.E.G. and V.A.G.-L.; Methodology, J.E.G. and V.A.G.-L.; Software, J.E.G. and V.A.G.-L.; Validation, J.E.G., V.A.G.-L. and J.I.G.S.; Formal analysis, J.E.G. and V.A.G.-L.; Investigation, J.E.G., V.A.G.-L. and J.I.G.S.; Resources, J.E.G. and V.A.G.-L.; Data curation, J.E.G. and V.A.G.-L.; Writing—original draft, J.E.G. and V.A.G.-L.; Writing—review & editing, J.E.G. and V.A.G.-L.; Visualization, J.E.G. and V.A.G.-L.; Supervision, J.E.G. and V.A.G.-L.; Project administration, J.E.G. and V.A.G.-L.; Funding acquisition, J.E.G. and V.A.G.-L. All authors have read and agreed to the published version of the manuscript.

Funding

Johsac I. Gomez Sanchez gratefully acknowledges the financial support provided by CAPES through a fellowship from the Master's Graduate Program in Statistics, University of Campinas.

Data Availability Statement

The sequences analyzed (GenBank accessions OQ706226, OQ706227, and OQ706228; see Table 1) are available from the National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/ (accessed on 10 March 2024).

Acknowledgments

The authors wish to express their gratitude to the three referees and the editors for their helpful comments on an earlier draft of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, L.C.; Dorea, C.C.Y.; Gonçalves, C.R. On determination of the order of a Markov chain. Stat. Inference Stoch. Process. 2001, 4, 273–282.
  2. García, J.E.; González-López, V.A. Consistent Estimation of Partition Markov Models. Entropy 2017, 19, 160.
  3. García, J.E.; González-López, V.A.; Tasca, G.H.; Yaginuma, K.Y. An Efficient Coding Technique for Stochastic Processes. Entropy 2022, 24, 65.
  4. Pereira, D.F.S. Critério de Determinação Eficiente Para Estimação de Cadeias de Markov de Partição Mínima. Master’s Thesis, University of Brasilia, Brasilia, Federal District, Brazil, 2021. Available online: http://repositorio2.unb.br/jspui/handle/10482/42891 (accessed on 10 March 2024).
  5. Dorea, C.C.Y. Optimal penalty term for EDC Markov chain order estimator. Ann. de l’ISUP 2008, 52, 15–25. Available online: https://hal.science/hal-03633210 (accessed on 10 March 2024).
  6. Csiszár, I. Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inf. Theory 2002, 48, 1616–1628.
  7. Csiszár, I.; Talata, Z. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inf. Theory 2006, 52, 1007–1016.
  8. García, J.E.; González-López, V.A.; Tasca, G.H. Partition Markov Model for COVID-19 Virus. 4open 2020, 3, 13.
  9. García, J.E.; González-López, V.A.; Tasca, G.H. Multiple partition Markov model for B.1.1.7, B.1.351, B.1.617.2, and P.1 variants of SARS-CoV 2 virus. Comput. Stat. 2022.
  10. Naveca, F.G.; Santiago, G.A.; Maito, R.M.; Meneses, C.A.R.; do Nascimento, V.A.; de Souza, V.C.; do Nascimento, F.O.; Silva, D.; Mejía, M.; Gonçalves, L.; et al. Reemergence of dengue virus serotype 3, Brazil, 2023. Emerg. Infect. Dis. 2023, 29, 1482–1484.
  11. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
Figure 1. Top left: double-dashed line, $\ln(\ln(n))$; dotted-dashed line, $\ln(n)$; long-dashed line, $n$. Top right: $w_n = n^{1/2}$ (in red); $w_n = n^{1/3}$ (in magenta); $w_n = n^{1/5}$ (in blue); double-dashed line, $\ln(\ln(n))$; dotted-dashed line, $\ln(n)$. Bottom: $w_n = n^{1/2}\ln(n)$ (in red); $w_n = n^{1/3}\ln(n)$ (in magenta); $w_n = n^{1/10}\ln(n)$ (in blue); double-dashed line, $\ln(\ln(n))$; dotted-dashed line, $\ln(n)$; long-dashed line, $n$.
Table 1. Three autochthonous sequences of DENV-3 in FASTA format, $\Delta = \{a, c, g, t\}$.

| Origin | Collection Date | GenBank | Sequence Nickname | Size |
|---|---|---|---|---|
| Roraima, Canta (Brazil) | 4 March 2023 | OQ706226 | A | 10,697 |
| Roraima, Boa Vista (Brazil) | 22 January 2023 | OQ706227 | B | 10,511 |
| Roraima, Boa Vista (Brazil) | 3 January 2023 | OQ706228 | C | 10,553 |
Table 2. Frequencies $N_n(a)$, $a \in \Delta = \{a, c, g, t\}$.

| GenBank | Sequence Nickname | $N_n(a)$ | $N_n(c)$ | $N_n(g)$ | $N_n(t)$ |
|---|---|---|---|---|---|
| OQ706226 | A | 3435 | 2209 | 2782 | 2271 |
| OQ706227 | B | 3378 | 2164 | 2736 | 2233 |
| OQ706228 | C | 3395 | 2173 | 2743 | 2242 |
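As a quick consistency check, the counts in Table 2 can be compared with the sequence sizes in Table 1 and converted into relative frequencies. The sketch below uses only the values reported in those two tables; the names `COUNTS`, `SIZES`, and `relative_frequencies` are illustrative.

```python
# Counts N_n(a), N_n(c), N_n(g), N_n(t) from Table 2.
COUNTS = {
    "OQ706226": {"a": 3435, "c": 2209, "g": 2782, "t": 2271},
    "OQ706227": {"a": 3378, "c": 2164, "g": 2736, "t": 2233},
    "OQ706228": {"a": 3395, "c": 2173, "g": 2743, "t": 2242},
}

# Sequence sizes from Table 1.
SIZES = {"OQ706226": 10697, "OQ706227": 10511, "OQ706228": 10553}


def relative_frequencies(counts):
    """Convert symbol counts into relative frequencies."""
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


for accession, counts in COUNTS.items():
    # The four counts should add up to the sequence size.
    matches = sum(counts.values()) == SIZES[accession]
    freqs = relative_frequencies(counts)
    rounded = {symbol: round(freq, 4) for symbol, freq in freqs.items()}
    print(accession, "sum matches size:", matches, rounded)
```

In all three sequences the counts add up to the reported size, and $a$ is the most frequent element.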
Table 3. Minimal partition (Definition 2) estimated by Definition 6. From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $o = 3$, $w_n = n^{1/2}$, $\alpha = 2$, $\gamma$ function given by the identity.

| Sequence | Part | States |
|---|---|---|
| OQ706226 | $\Gamma_1^A$ | ccc, tgc, cgc, ggc, gac, gtc, acg, att, gct, gcc, atc, agc, ggt, gag, ata, tcc, agt, tct, ctc, ttc, cct, ctt, gtt, aaa, ttt, aag, ccg, gcg, tgg, gtg, tat, agg, ctg, gat, ttg, aat, tcg, cgg, cat, ggg, act, atg |
| | $\Gamma_2^A$ | cag, cac, taa, cgt, aga, gga, tta, tac, cga, acc, tag, gta, cca, tga, tca, cta, gca, aac, tgt, gaa, caa, aca |
| OQ706227 | $\Gamma_1^B$ | ccc, ggc, cgc, gac, gtc, acg, gct, tag, gcc, atc, agc, ggt, gag, ata, att, tcc, agt, tgc, tct, gtt, ctc, ttc, cct, ctt, aaa, ttt, aag, ccg, gcg, tgg, gtg, tat, agg, ctg, ttg, aat, gat, tcg, cgg, cat, act, ggg, atg |
| | $\Gamma_2^B$ | cag, cgt, aga, gga, tta, tac, cga, cac, acc, taa, tga, cta, cca, tca, gca, gta, aac, tgt, gaa, caa, aca |
| OQ706228 | $\Gamma_1^C$ | ccc, cgc, ggc, gac, tgc, gct, tcc, tct, agt, gtc, acg, att, tag, gcc, atc, agc, ggt, gag, ata, ctc, cct, ttc, aaa, ctt, gtt, ttt, aag, ccg, gcg, tgg, gtg, agg, ctg, ttg, aat, gat, tcg, cat, tat, cgg, ggg, atg, act |
| | $\Gamma_2^C$ | cag, cgt, tta, cac, aga, gga, cga, tac, acc, taa, tga, cta, gca, cca, tca, gta, aac, tgt, gaa, caa, aca |
Table 4. Transition probabilities (Equation (2)) estimated by Equation (5). From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $w_n = n^{1/2}$; full estimated partitions displayed in Table 3. In bold, the highest probability per part.

| OQ706226 | $i$ | $P(a \mid \Gamma_i^A)$ | $P(c \mid \Gamma_i^A)$ | $P(g \mid \Gamma_i^A)$ | $P(t \mid \Gamma_i^A)$ |
|---|---|---|---|---|---|
| | 1 | **0.37032** | 0.21336 | 0.21256 | 0.20376 |
| | 2 | 0.20595 | 0.19061 | **0.37152** | 0.23192 |

| OQ706227 | $i$ | $P(a \mid \Gamma_i^B)$ | $P(c \mid \Gamma_i^B)$ | $P(g \mid \Gamma_i^B)$ | $P(t \mid \Gamma_i^B)$ |
|---|---|---|---|---|---|
| | 1 | **0.36846** | 0.21441 | 0.21374 | 0.20339 |
| | 2 | 0.20723 | 0.18508 | **0.37341** | 0.23428 |

| OQ706228 | $i$ | $P(a \mid \Gamma_i^C)$ | $P(c \mid \Gamma_i^C)$ | $P(g \mid \Gamma_i^C)$ | $P(t \mid \Gamma_i^C)$ |
|---|---|---|---|---|---|
| | 1 | **0.36871** | 0.21373 | 0.21387 | 0.20369 |
| | 2 | 0.20760 | 0.18681 | **0.37167** | 0.23392 |
Table 5. Minimal partition (Definition 2) estimated by Definition 6. From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $o = 3$, $w_n = n^{1/3}$, $\alpha = 2$, $\gamma$ function given by the identity.

| Sequence | Part | States |
|---|---|---|
| OQ706226 | $\Gamma_1^A$ | ccc, tgc, cgc, ggc, gac, gtc, acg, att, gct, gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_2^A$ | tcc, tct, agt, ctc, ttc, cct, ctt, gtt, aaa, ttt |
| | $\Gamma_3^A$ | aag, ccg, gcg, tgg, gtg, agg, ctg, gat, ttg, aat, tcg, cgg, cat, ggg, tat, act, atg |
| | $\Gamma_4^A$ | cag, cac, taa, cta, cgt, aga, gga, tta, tac, cga, acc, tag, gta, cca, tga, tca, aac, tgt, gaa, caa, aca, gca |
| OQ706227 | $\Gamma_1^B$ | ccc, tgc, tct, gtt, cgc, ggc, gac, gtc, gct, acg, att, tag, gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_2^B$ | tcc, agt, ctc, ttc, cct, ctt, aaa, ttt |
| | $\Gamma_3^B$ | aag, ccg, gcg, tgg, gtg, agg, ctg, ttg, aat, gat, tcg, cgg, cat, tat, act, ggg, atg |
| | $\Gamma_4^B$ | cag, cgt, aga, gga, tta, tac, cga, cac, acc, tgt, gaa, caa, taa, tga, cta, cca, tca, aca, gca, gta, aac |
| OQ706228 | $\Gamma_1^C$ | ccc, cgc, ggc, gac, tcc, tct, agt, tgc, gct, gtc, acg, att, gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_2^C$ | ctc, cct, ttc, aaa, ctt, gtt, ttt |
| | $\Gamma_3^C$ | aag, ccg, gcg, tat, tgg, gtg, agg, ctg, ttg, aat, gat, tcg, cat, cgg, act, ggg, atg |
| | $\Gamma_4^C$ | cag, taa, tta, cac, cgt, cga, tac, acc, aga, gga, tag, gta, cca, tca, tga, cta, gca, aac, tgt, gaa, caa, aca |
Table 6. Transition probabilities (Equation (2)) estimated by Equation (5). From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $w_n = n^{1/3}$; full estimated partitions displayed in Table 5. In bold, the highest probability per part.

| OQ706226 | $i$ | $P(a \mid \Gamma_i^A)$ | $P(c \mid \Gamma_i^A)$ | $P(g \mid \Gamma_i^A)$ | $P(t \mid \Gamma_i^A)$ |
|---|---|---|---|---|---|
| | 1 | **0.29882** | 0.22627 | 0.25608 | 0.21882 |
| | 2 | **0.40166** | 0.18760 | 0.25987 | 0.15087 |
| | 3 | **0.41291** | 0.22673 | 0.11709 | 0.24328 |
| | 4 | 0.20595 | 0.19061 | **0.37152** | 0.23192 |

| OQ706227 | $i$ | $P(a \mid \Gamma_i^B)$ | $P(c \mid \Gamma_i^B)$ | $P(g \mid \Gamma_i^B)$ | $P(t \mid \Gamma_i^B)$ |
|---|---|---|---|---|---|
| | 1 | **0.30233** | 0.22596 | 0.25963 | 0.21208 |
| | 2 | **0.40612** | 0.18593 | 0.25902 | 0.14893 |
| | 3 | **0.41410** | 0.22668 | 0.11608 | 0.24314 |
| | 4 | 0.20723 | 0.18508 | **0.37341** | 0.23428 |

| OQ706228 | $i$ | $P(a \mid \Gamma_i^C)$ | $P(c \mid \Gamma_i^C)$ | $P(g \mid \Gamma_i^C)$ | $P(t \mid \Gamma_i^C)$ |
|---|---|---|---|---|---|
| | 1 | **0.31038** | 0.22205 | 0.26436 | 0.20321 |
| | 2 | **0.42182** | 0.17773 | 0.24578 | 0.15467 |
| | 3 | **0.41362** | 0.22573 | 0.11644 | 0.24422 |
| | 4 | 0.20761 | 0.19017 | **0.37147** | 0.23075 |
Table 7. Minimal partition (Definition 2) estimated by Definition 6. From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $o = 3$, $w_n = \ln(n)$, $\alpha = 2$, $\gamma$ function given by the identity.

| Sequence | Part | States |
|---|---|---|
| OQ706226 | $\Gamma_1^A$ | ccc, tgc, cgc, ggc, gac, gtc, acg, att, gct, gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_2^A$ | tcc, agt, ctc, tct, ttc, cct, ctt, gtt, aaa, ttt |
| | $\Gamma_3^A$ | aag, ccg, gcg, tgg, gtg, act, agg, ctg, gat, ttg, aat, tcg, cgg, cat, ggg, tat, atg |
| | $\Gamma_4^A$ | cag, cac, taa, cgt, aga, gga, tta, tac, cga, acc, tag, gta, cca, tca, tga, cta, aac |
| | $\Gamma_5^A$ | tgt, gaa, caa, aca, gca |
| OQ706227 | $\Gamma_1^B$ | ccc, tgc, tct, gtt, cgc, ggc, gac, gtc, gct, acg, att, tag |
| | $\Gamma_2^B$ | gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_3^B$ | tcc, agt, ctc, ttc, cct, ctt, aaa, ttt |
| | $\Gamma_4^B$ | aag, ccg, gcg, tgg, gtg, agg, ctg, ttg, aat, gat, tcg, cgg, cat, tat, act, ggg, atg |
| | $\Gamma_5^B$ | cag, cgt, aga, gga, tta, tac, cga, cac, acc |
| | $\Gamma_6^B$ | tgt, gaa, caa, taa, tga, cta, cca, tca, aca, gca, gta, aac |
| OQ706228 | $\Gamma_1^C$ | ccc, cgc, ggc, gac, tcc, tct, agt, tgc, gct, gtc, acg, att |
| | $\Gamma_2^C$ | gcc, atc, agc, ggt, gag, ata |
| | $\Gamma_3^C$ | ctc, cct, ttc, aaa, ctt, gtt, ttt |
| | $\Gamma_4^C$ | aag, ccg, gcg, tgg, gtg, agg, ctg, ttg, aat, gat, tcg, cgg, cat, tat, act, ggg, atg |
| | $\Gamma_5^C$ | cag, taa, tta, cac, cgt, cga, tac, acc, aga, gga |
| | $\Gamma_6^C$ | tag, gta, cca, tca, tga, cta, gca, aac, tgt, gaa, caa, aca |
Table 8. Transition probabilities (Equation (2)) estimated by Equation (5). From top to bottom, for the sequences A (GenBank OQ706226), B (GenBank OQ706227), and C (GenBank OQ706228). $\Delta = \{a, c, g, t\}$, $w_n = \ln(n)$; full estimated partitions displayed in Table 7. In bold, the highest probability per part.

| OQ706226 | $i$ | $P(a \mid \Gamma_i^A)$ | $P(c \mid \Gamma_i^A)$ | $P(g \mid \Gamma_i^A)$ | $P(t \mid \Gamma_i^A)$ |
|---|---|---|---|---|---|
| | 1 | **0.29882** | 0.22627 | 0.25608 | 0.21882 |
| | 2 | **0.40166** | 0.18760 | 0.25987 | 0.15087 |
| | 3 | **0.41291** | 0.22673 | 0.11709 | 0.24328 |
| | 4 | 0.18351 | 0.20596 | **0.35967** | 0.25086 |
| | 5 | 0.26507 | 0.15017 | **0.40273** | 0.18203 |

| OQ706227 | $i$ | $P(a \mid \Gamma_i^B)$ | $P(c \mid \Gamma_i^B)$ | $P(g \mid \Gamma_i^B)$ | $P(t \mid \Gamma_i^B)$ |
|---|---|---|---|---|---|
| | 1 | **0.30945** | 0.21394 | 0.28167 | 0.19493 |
| | 2 | **0.28468** | 0.25573 | 0.20507 | 0.25452 |
| | 3 | **0.40612** | 0.18593 | 0.25902 | 0.14893 |
| | 4 | **0.41410** | 0.22668 | 0.11608 | 0.24314 |
| | 5 | 0.18868 | 0.20926 | **0.31990** | 0.28216 |
| | 6 | 0.21860 | 0.17026 | **0.40620** | 0.20494 |

| OQ706228 | $i$ | $P(a \mid \Gamma_i^C)$ | $P(c \mid \Gamma_i^C)$ | $P(g \mid \Gamma_i^C)$ | $P(t \mid \Gamma_i^C)$ |
|---|---|---|---|---|---|
| | 1 | **0.31933** | 0.21040 | 0.28482 | 0.18545 |
| | 2 | **0.28451** | 0.25570 | 0.20528 | 0.25450 |
| | 3 | **0.42182** | 0.17773 | 0.24578 | 0.15467 |
| | 4 | **0.41362** | 0.22573 | 0.11644 | 0.24422 |
| | 5 | 0.19202 | 0.20975 | **0.31905** | 0.27917 |
| | 6 | 0.21932 | 0.17546 | **0.41088** | 0.19434 |
Table 9. Sequence OQ706226. Right: Minimal partition (Definition 2) estimated by Definition 6. Left: Transition probabilities (Equation (2)) estimated by Equation (5); in bold, the highest probability per part. $\Delta = \{a, c, g, t\}$, $o = 3$, $\gamma$ function given by the identity, $w_n = \ln(\ln(n))$.

| $i$ | $P(a \mid \Gamma_i^A)$ | $P(c \mid \Gamma_i^A)$ | $P(g \mid \Gamma_i^A)$ | $P(t \mid \Gamma_i^A)$ | Part | States |
|---|---|---|---|---|---|---|
| 1 | **0.31140** | 0.19342 | 0.28411 | 0.21107 | $\Gamma_1^A$ | ccc, tgc, cgc, ggc, gac |
| 2 | **0.32218** | 0.25314 | 0.19456 | 0.23013 | $\Gamma_2^A$ | gcc, atc |
| 3 | **0.38732** | 0.19249 | 0.27934 | 0.14085 | $\Gamma_3^A$ | tcc, tct, agt, ctc, ttc, cct |
| 4 | 0.23757 | 0.25967 | 0.21823 | **0.28453** | $\Gamma_4^A$ | agc, ggt, gag, ata |
| 5 | **0.28879** | 0.26078 | 0.27371 | 0.17672 | $\Gamma_5^A$ | gtc, acg, att, gct |
| 6 | **0.40686** | 0.21316 | 0.14272 | 0.23726 | $\Gamma_6^A$ | aag, ccg, gcg, tgg, gtg, ttg, tat |
| 7 | 0.19113 | 0.20930 | **0.32195** | 0.27762 | $\Gamma_7^A$ | cag, cac, taa, cgt, aga, gga, tta, tac, cga, acc |
| 8 | 0.17164 | 0.29104 | **0.41791** | 0.11940 | $\Gamma_8^A$ | tag, gta |
| 9 | **0.46675** | 0.21455 | 0.08532 | 0.23337 | $\Gamma_9^A$ | tcg, cgg, cat, ggg, act, atg |
| 10 | **0.34566** | 0.27172 | 0.11275 | 0.26987 | $\Gamma_{10}^A$ | agg, ctg, gat, aat |
| 11 | 0.26507 | 0.15017 | **0.40273** | 0.18203 | $\Gamma_{11}^A$ | tgt, gaa, caa, aca, gca |
| 12 | **0.43116** | 0.17754 | 0.21981 | 0.17150 | $\Gamma_{12}^A$ | ctt, gtt, aaa, ttt |
| 13 | 0.17246 | 0.18610 | **0.41439** | 0.22705 | $\Gamma_{13}^A$ | cca, tca, tga, cta, aac |
Table 10. Sequence OQ706227. Right: Minimal partition (Definition 2) estimated by Definition 6. Left: Transition probabilities (Equation (2)) estimated by Equation (5); in bold, the highest probability per part. $\Delta = \{a, c, g, t\}$, $o = 3$, $\gamma$ function given by the identity, $w_n = \ln(\ln(n))$.

| $i$ | $P(a \mid \Gamma_i^B)$ | $P(c \mid \Gamma_i^B)$ | $P(g \mid \Gamma_i^B)$ | $P(t \mid \Gamma_i^B)$ | Part | States |
|---|---|---|---|---|---|---|
| 1 | **0.34479** | 0.19313 | 0.27725 | 0.18483 | $\Gamma_1^B$ | ccc, tgc, gtt, tct |
| 2 | **0.31992** | 0.25636 | 0.19280 | 0.23093 | $\Gamma_2^B$ | gcc, atc |
| 3 | **0.39127** | 0.19183 | 0.27978 | 0.13712 | $\Gamma_3^B$ | tcc, agt, ctc, ttc, cct |
| 4 | 0.23810 | 0.25490 | 0.22129 | **0.28571** | $\Gamma_4^B$ | agc, ggt, gag, ata |
| 5 | **0.29094** | 0.19737 | 0.28509 | 0.22661 | $\Gamma_5^B$ | cgc, ggc, gac |
| 6 | 0.27672 | 0.26908 | **0.28435** | 0.16985 | $\Gamma_6^B$ | gtc, gct, acg, att, tag |
| 7 | **0.40705** | 0.21628 | 0.14945 | 0.22722 | $\Gamma_7^B$ | aag, ccg, gcg, tgg, gtg |
| 8 | 0.18868 | 0.20926 | **0.31990** | 0.28216 | $\Gamma_8^B$ | cag, cgt, aga, gga, tta, tac, cga, cac, acc |
| 9 | **0.47018** | 0.20642 | 0.08601 | 0.23739 | $\Gamma_9^B$ | tcg, cgg, cat, tat, act, ggg, atg |
| 10 | **0.35015** | 0.26558 | 0.11424 | 0.27003 | $\Gamma_{10}^B$ | agg, ctg, ttg, aat, gat |
| 11 | 0.27992 | 0.13996 | **0.42191** | 0.15822 | $\Gamma_{11}^B$ | tgt, gaa, caa |
| 12 | **0.43490** | 0.17450 | 0.21880 | 0.17181 | $\Gamma_{12}^B$ | ctt, aaa, ttt |
| 13 | 0.21127 | 0.17647 | **0.39188** | 0.22038 | $\Gamma_{13}^B$ | taa, tga, cta, cca, tca, aca, gca |
| 14 | 0.11330 | 0.20690 | **0.45321** | 0.22660 | $\Gamma_{14}^B$ | gta, aac |
Table 11. Sequence OQ706228. Right: Minimal partition (Definition 2) estimated by Definition 6. Left: Transition probabilities (Equation (2)) estimated by Equation (5); in bold, the highest probability per part. $\Delta = \{a, c, g, t\}$, $o = 3$, $\gamma$ function given by the identity, $w_n = \ln(\ln(n))$.

| $i$ | $P(a \mid \Gamma_i^C)$ | $P(c \mid \Gamma_i^C)$ | $P(g \mid \Gamma_i^C)$ | $P(t \mid \Gamma_i^C)$ | Part | States |
|---|---|---|---|---|---|---|
| 1 | **0.30740** | 0.18975 | 0.28463 | 0.21822 | $\Gamma_1^C$ | ccc, cgc, ggc, gac |
| 2 | **0.31924** | 0.25581 | 0.19450 | 0.23044 | $\Gamma_2^C$ | gcc, atc |
| 3 | **0.35319** | 0.20914 | 0.29501 | 0.14266 | $\Gamma_3^C$ | tcc, tct, agt |
| 4 | 0.23889 | 0.25556 | 0.21944 | **0.28611** | $\Gamma_4^C$ | agc, ggt, gag, ata |
| 5 | **0.30048** | 0.24642 | 0.27345 | 0.17965 | $\Gamma_5^C$ | tgc, gct, gtc, acg, att |
| 6 | **0.41354** | 0.17708 | 0.26875 | 0.14063 | $\Gamma_6^C$ | ctc, ttc, cct |
| 7 | **0.41304** | 0.21196 | 0.14565 | 0.22935 | $\Gamma_7^C$ | aag, ccg, gcg, tat, tgg, gtg |
| 8 | 0.19202 | 0.20975 | **0.31905** | 0.27917 | $\Gamma_8^C$ | cag, taa, tta, cac, cgt, cga, tac, acc, aga, gga |
| 9 | 0.17778 | 0.28889 | **0.41481** | 0.11852 | $\Gamma_9^C$ | tag, gta |
| 10 | **0.47315** | 0.20716 | 0.08440 | 0.23529 | $\Gamma_{10}^C$ | tcg, cat, cgg, act, ggg, atg |
| 11 | **0.34564** | 0.26588 | 0.11374 | 0.27474 | $\Gamma_{11}^C$ | agg, ctg, ttg, aat, gat |
| 12 | 0.27205 | 0.15097 | **0.41555** | 0.16144 | $\Gamma_{12}^C$ | tgt, gaa, caa, aca |
| 13 | **0.43154** | 0.17848 | 0.21883 | 0.17115 | $\Gamma_{13}^C$ | ctt, gtt, aaa, ttt |
| 14 | 0.18957 | 0.17653 | **0.40722** | 0.22668 | $\Gamma_{14}^C$ | cca, tca, tga, cta, gca, aac |
