Article

Probability Distribution on Full Rooted Trees

1 Center for Data Science, Waseda University, Shinjuku-ku 169-8050, Tokyo, Japan
2 Faculty of Informatics, Gunma University, Maebashi-shi 371-8510, Gunma, Japan
3 Department of Information Science, Shonan Institute of Technology, Fujisawa-shi 251-8511, Kanagawa, Japan
4 Department of Applied Mathematics, Waseda University, Shinjuku-ku 169-8555, Tokyo, Japan
* Author to whom correspondence should be addressed.
Entropy 2022, 24(3), 328; https://doi.org/10.3390/e24030328
Submission received: 27 January 2022 / Revised: 18 February 2022 / Accepted: 21 February 2022 / Published: 24 February 2022
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The recursive and hierarchical structure of full rooted trees is applicable to statistical models in various fields, such as data compression, image processing, and machine learning. In most of these cases, the full rooted tree is not a random variable; as such, model selection to avoid overfitting is problematic. One method to solve this problem is to assume a prior distribution on the full rooted trees. This enables the optimal model selection based on Bayes decision theory. For example, by assigning a low prior probability to a complex model, the maximum a posteriori estimator prevents the selection of the complex one. Furthermore, we can average all the models weighted by their posteriors. In this paper, we propose a probability distribution on a set of full rooted trees. Its parametric representation is suitable for calculating the properties of our distribution using recursive functions, such as the mode, expectation, and posterior distribution. Although such distributions have been proposed in previous studies, they are only applicable to specific applications. Therefore, we extract their mathematically essential components and derive new generalized methods to calculate the expectation, posterior distribution, etc.

1. Introduction

1.1. Review of Literature and Motivation

Full rooted trees are utilized in various fields of study. For example, for text compression in information theory, a full rooted tree represents a set of contexts, which are strings of the most recent symbols at each time point, and it is known as a context tree [1]. In image processing, it represents a variable block-size segmentation, and it is known as quadtree block partitioning [2]. In machine learning, it represents a nonlinear function comprising many conditional branches, and it is known as a decision tree [3]. In most of these studies, the rooted tree is not a random variable and serves as an index of a statistical model or function; i.e., one full rooted tree $\tau$ corresponds to one statistical model $p(x; \tau)$ or one function $f_\tau(x)$.
Full rooted trees’ recursive and hierarchical structures are suitable for representing complex statistical models or functional structures. For example, the expansion of the leaf nodes represents an increase in the contexts of a context tree [1], a division of a block on the image in quadtree partitioning [2], or the addition of a conditional branch in a decision tree [3]. The expressive ability and extensibility of full rooted trees render them widely applicable in various fields.
However, this hierarchical expressive capability causes a problem in tree selection, i.e., the selection of one statistical model or function. This is because the optimal tree under the criterion of the likelihood or squared loss for training data is inevitably the deepest one. This phenomenon is called overfitting in the field of machine learning. Therefore, most previous studies have applied a stopping rule for node expansion [2,3], introduced a normalization term into the objective function [4], or averaged the statistical models or the functions with some weights [1,4,5]. However, these algorithmic modifications are heuristic at times.
A theoretical way to solve this problem is to consider the full rooted tree as a random variable and assume a prior distribution on it. An appropriate prior distribution provides a unified method for selecting one full rooted tree or combining them based on Bayes decision theory (see, e.g., [6]). Although Bayes decision theory is typically applied to statistical models with unknown continuous parameters, it is also applicable to statistical models with unknown discrete random variables, such as full rooted trees (see, e.g., [7]). By assigning a high prior probability to a shallow tree and a low prior probability to a deep tree, we can avoid the complex statistical model corresponding to a deep tree.
As mentioned above, most previous studies regard a full rooted tree as a non-stochastic variable; however, a few studies have adopted the above-mentioned approach. In text compression, the complete Bayesian interpretation of the context tree weighting method was first investigated by the authors of [8]. Not only the theory but also the associated algorithms have been improved in the decades since it was first investigated (see, e.g., [9]). Moreover, similar results, accompanied by rich real-data analyses, have been reported recently [10,11] (note that the prior form reported in [10,11] is extremely restricted and cannot be updated as a posterior, in contrast to that reported in [8,9]). In image processing, the authors of [12] were the first to regard the quadtree as a stochastic model, and its optimal estimation was derived under the Bayes criterion. In machine learning, the authors of [13] redefined the decision tree as a stochastic generative model and improved most tree weighting methods (e.g., [5]).
However, these studies depend on specific data or generative models. This might have been the reason that more than 25 years passed before the first study [8] pertaining to text compression was applied to image processing [12] and machine learning [13].

1.2. The Objective of This Study

Therefore, we separate the mathematically essential component of the discussion from the modifiable component based on specific data or the generative model. Mathematically, a tree is defined as a connected graph without cycles (see, e.g., [14]). A rooted tree is a tree that has one node known as a root node, and a full rooted tree is a rooted tree in which each inner node has the same number of child nodes. Subsequently, we can define a finite set of subtrees of a full rooted tree. This full rooted tree, which contains all the subtrees in the finite set, is denoted as a base tree herein.
A trivial method to define a probability distribution for this set is to assign occurrence probabilities to all subtrees and regard these values as parameters. In other words, we can define the categorical distribution for the finite set of subtrees of the base tree. However, this definition requires the same number of parameters as the subtrees, which increases in a doubly exponential order with the depth of the base tree.
Therefore, we propose an efficient parametric representation of the probability distribution of a set of subtrees. It is suitable for the recursive structure of full rooted trees and allows the number of parameters to be reduced. Moreover, it enables us to calculate its mode, expectation, posterior distribution, etc., using recursive functions. Therefore, it is efficient from a computational viewpoint. Furthermore, we expect these recursive functions to be effective as a subroutine of the variational Bayesian method and the Markov chain Monte Carlo method in hierarchical Bayesian modeling (see, e.g., [15]).
Strictly speaking, our distribution has already been proposed independently in source coding [8], image processing [12], and machine learning [13], as mentioned above. The substantial novelty of our study is the extraction of the essence from the previous discussions, which depend on the objects of application, and its representation as a clear mathematical theory. This theoretically expands the potential application of probability distributions on full rooted trees. Subsequently, we derive new generalized methods to evaluate the characteristics of the probability distribution on full rooted trees, which could not be derived in previous studies pertaining to real-world applications. More precisely, only Theorems 1 and 3 and Corollary 2 have been used in previous studies; the other methods expand the possibility of applying the probability distribution on full rooted trees.

1.3. Organization of This Paper

The remainder of this paper is organized as follows: In Section 2, we present the notation used herein. In Section 3, we define the prior distribution on full rooted trees. In Section 4, we describe algorithms for calculating the properties of the proposed distribution, e.g., the marginal distribution for each node and efficient calculations of the expectation, the mode, and the posterior distribution. In Section 5, we discuss the usefulness of our distribution in statistical decision theory and hierarchical Bayesian modeling. In Section 6, we describe future work. In Section 7, we conclude the paper.

2. Notations Used for Full Rooted Trees

In this section, we define the notation for rooted trees, which is illustrated in Figure 1. Let $k \in \mathbb{N}$ denote the maximum number of child nodes and $d_{\max} \in \mathbb{N}$ denote the maximum depth. Let $\tau_p = (V_p, E_p)$ denote the perfect ("perfect" means that all inner nodes have exactly $k$ children and all leaf nodes have the same depth) $k$-ary rooted tree whose depth is $d_{\max}$ and whose root node is $v_\lambda$; $V_p$ and $E_p$ denote its sets of nodes and edges, respectively. Let $I_p \subset V_p$ and $L_p \subset V_p$ denote the set of inner nodes and the set of leaf nodes of $\tau_p$, respectively. For each node $v \in V_p$, $\mathrm{Ch}_p(v) \subset V_p$ denotes the set of child nodes of $v$ on $\tau_p$. The notation for the relation between two nodes $v, v' \in V_p$ is as follows: let $v' \prec v$ denote that $v'$ is an ancestor node of $v$ ($v$ is a descendant node of $v'$), let $v' \preceq v$ denote that $v'$ is an ancestor node of $v$ or $v$ itself, and define $\mathrm{An}(v) := \{ v' \in V_p \mid v' \prec v \}$ and $\mathrm{De}_p(v) := \{ v' \in V_p \mid v \prec v' \}$.
Subsequently, we consider rooted subtrees of $\tau_p$ whose root node is $v_\lambda$ and in which all inner nodes have exactly $k$ children. They are called full rooted subtrees, and $\tau_p$ is called a base tree. Let $\mathcal{T}$ denote the set of all full rooted subtrees of $\tau_p$. Let $V_\tau$ and $E_\tau$ denote the set of nodes and the set of edges of $\tau \in \mathcal{T}$, respectively, and let $I_\tau \subset V_\tau$ and $L_\tau \subset V_\tau$ denote the set of inner nodes and the set of leaf nodes of $\tau \in \mathcal{T}$, respectively.

3. Definition of Probability Distribution on Full Rooted Subtrees

In this section, we define a probability distribution on the set $\mathcal{T}$ of full rooted subtrees. Let $T$ denote the random variable on $\mathcal{T}$ and $\tau$ denote its realization.
Definition 1.
For $(\alpha_v)_{v \in V_p} \in [0, 1]^{|V_p|}$, we define the probability distribution $p(\tau)$ on $\mathcal{T}$ as follows:
$$p(\tau) := \prod_{v \in I_\tau} \alpha_v \prod_{v \in L_\tau} (1 - \alpha_v), \tag{1}$$
where $\alpha_v = 0$ for $v \in L_p$.
Intuitively, $\alpha_v$ represents the probability that $v$ has child nodes under the condition that $v$ is contained in the tree (this will be proved as a theoretical fact in Remark 2). Therefore, the occurrence probability of a full rooted subtree decays exponentially as its depth increases.
Example 1.
An example of the probability distribution on full rooted subtrees for $k = 2$ and $d_{\max} = 2$ is shown in Figure 2.
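To make Definition 1 concrete, the following minimal Python sketch enumerates the five full rooted subtrees of the $k = 2$, $d_{\max} = 2$ base tree and evaluates (1) for each of them. The string node labels (the empty string stands for $v_\lambda$) are an illustrative convention, and the parameter values $\alpha_{v_\lambda} = 0.7$, $\alpha_{v_0} = 0.4$, $\alpha_{v_1} = 0.8$ are the ones used in the later examples; the code is a sketch, not part of the original paper.

from itertools import product

K, D_MAX = 2, 2
# alpha is 0 for the leaf nodes of the base tree, as required by Definition 1
alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}

def children(v):
    return [v + str(i) for i in range(K)]

def enumerate_subtrees(v=""):
    """List every full rooted subtree below v as a pair (inner nodes, leaf nodes)."""
    trees = [([], [v])]                                  # v itself as a single leaf
    if len(v) < D_MAX:                                   # v can still be expanded
        for combo in product(*(enumerate_subtrees(c) for c in children(v))):
            inner = [v] + [u for t in combo for u in t[0]]
            leaves = [u for t in combo for u in t[1]]
            trees.append((inner, leaves))
    return trees

def prob(inner, leaves):
    """Equation (1): product of alpha over inner nodes and (1 - alpha) over leaves."""
    p = 1.0
    for v in inner:
        p *= alpha[v]
    for v in leaves:
        p *= 1.0 - alpha[v]
    return p

subtrees = enumerate_subtrees()
for inner, leaves in subtrees:
    print(sorted(inner), sorted(leaves), round(prob(inner, leaves), 3))
print("sum:", sum(prob(i, l) for i, l in subtrees))      # 1.0, anticipating Theorem 1

The five probabilities (approximately 0.3, 0.084, 0.056, 0.336, and 0.224) sum to one, which anticipates Theorem 1 below.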
Theorem 1.
The quantity $p(\tau)$ defined in (1) fulfills the condition of a probability distribution; that is, $\sum_{\tau \in \mathcal{T}} p(\tau) = 1$.
Example 2.
Before proving Theorem 1 in the general case, we describe an example with $d_{\max} = 2$ and $k = 2$ (see Figure 2). First, we factorize the sum as follows:
$$\begin{aligned}
\sum_{\tau \in \mathcal{T}} p(\tau) &= (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} (1 - \alpha_{v_0})(1 - \alpha_{v_1}) + \alpha_{v_\lambda} \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}})(1 - \alpha_{v_1}) \\
&\quad + \alpha_{v_\lambda} (1 - \alpha_{v_0}) \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) + \alpha_{v_\lambda} \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}}) \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) \\
&= (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} \big\{ (1 - \alpha_{v_0})(1 - \alpha_{v_1}) + \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}})(1 - \alpha_{v_1}) \\
&\qquad + (1 - \alpha_{v_0}) \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) + \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}}) \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) \big\} \\
&= (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} \big\{ (1 - \alpha_{v_0}) \big[ (1 - \alpha_{v_1}) + \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) \big] \\
&\qquad + \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}}) \big[ (1 - \alpha_{v_1}) + \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) \big] \big\} \\
&= (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} \big[ (1 - \alpha_{v_0}) + \alpha_{v_0} (1 - \alpha_{v_{00}})(1 - \alpha_{v_{01}}) \big] \times \big[ (1 - \alpha_{v_1}) + \alpha_{v_1} (1 - \alpha_{v_{10}})(1 - \alpha_{v_{11}}) \big].
\end{aligned}$$
Here, $\alpha_{v_{00}} = \alpha_{v_{01}} = \alpha_{v_{10}} = \alpha_{v_{11}} = 0$ since $v_{00}, v_{01}, v_{10}, v_{11} \in L_p$. Then,
$$\sum_{\tau \in \mathcal{T}} p(\tau) = (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} \big[ (1 - \alpha_{v_0}) + \alpha_{v_0} \big] \cdot \big[ (1 - \alpha_{v_1}) + \alpha_{v_1} \big] = (1 - \alpha_{v_\lambda}) + \alpha_{v_\lambda} = 1.$$
The general proof of Theorem 1 is given below. It also consists of two parts, namely, factorization and substitution. We first prove Lemma 1, which is the essential lemma, since it is used not only in the proof of Theorem 1 but also in the proofs of other theorems later.
Lemma 1.
Let $F : \mathcal{T} \to \mathbb{R}$ be a real-valued function on the set $\mathcal{T}$ of full rooted subtrees of the base tree $\tau_p$. If $F$ has the form
$$F(\tau) = \prod_{v \in I_\tau} G(v) \prod_{v \in L_\tau} H(v), \tag{9}$$
where $G : V_p \to \mathbb{R}$ and $H : V_p \to \mathbb{R}$ are real-valued functions on $V_p$, then the summation $\sum_{\tau \in \mathcal{T}} F(\tau)$ can be recursively decomposed as follows:
$$\sum_{\tau \in \mathcal{T}} F(\tau) = \phi(v_\lambda), \tag{10}$$
where $\phi : V_p \to \mathbb{R}$ is defined as
$$\phi(v) := \begin{cases} H(v), & v \in L_p, \\ H(v) + G(v) \prod_{v' \in \mathrm{Ch}_p(v)} \phi(v'), & v \in I_p. \end{cases} \tag{11}$$
Proof. 
Let $[v_\lambda]$ denote the tree that consists of only the root node $v_\lambda$ of the base tree $\tau_p$. Then, the sum is divided into cases as follows:
$$\sum_{\tau \in \mathcal{T}} F(\tau) = \sum_{\tau \in \mathcal{T}} \prod_{v \in I_\tau} G(v) \prod_{v \in L_\tau} H(v) \tag{12}$$
$$= \prod_{v \in I_{[v_\lambda]}} G(v) \prod_{v \in L_{[v_\lambda]}} H(v) + \sum_{\tau \in \mathcal{T} \setminus \{[v_\lambda]\}} \prod_{v \in I_\tau} G(v) \prod_{v \in L_\tau} H(v) \tag{13}$$
$$= H(v_\lambda) + \sum_{\tau \in \mathcal{T} \setminus \{[v_\lambda]\}} \prod_{v \in I_\tau} G(v) \prod_{v \in L_\tau} H(v) \tag{14}$$
$$= H(v_\lambda) + G(v_\lambda) \sum_{\tau \in \mathcal{T} \setminus \{[v_\lambda]\}} \prod_{v \in I_\tau \setminus \{v_\lambda\}} G(v) \prod_{v \in L_\tau} H(v), \tag{15}$$
where (14) holds because $[v_\lambda]$ has no inner node and its only leaf node is $v_\lambda$, and (15) holds because every tree in $\mathcal{T} \setminus \{[v_\lambda]\}$ contains $v_\lambda$ as an inner node and thus the corresponding factor $G(v_\lambda)$.
We have already pointed out that each tree $\tau \in \mathcal{T} \setminus \{[v_\lambda]\}$ contains $v_\lambda$ as its inner node. The rest of the structure of $\tau$ is determined by the shapes of the $k$ subtrees whose root nodes are the child nodes of $v_\lambda$ (see Figure 3). We index them in an appropriate order, and let $v_{\lambda i}$ denote the $i$-th child node of $v_\lambda$ for $i \in \{0, 1, \ldots, k-1\}$; i.e., $\{v_{\lambda 0}, \ldots, v_{\lambda (k-1)}\} = \mathrm{Ch}_p(v_\lambda)$. Let $\mathcal{T}_{v_{\lambda i}}$ denote the set of subtrees whose root node is $v_{\lambda i}$. Then, there is a natural bijection from $\mathcal{T} \setminus \{[v_\lambda]\}$ to $\mathcal{T}_{v_{\lambda 0}} \times \cdots \times \mathcal{T}_{v_{\lambda (k-1)}}$.
Therefore, the summation of (15) is further factorized. Consequently, we have
$$\sum_{\tau \in \mathcal{T}} F(\tau) = H(v_\lambda) + G(v_\lambda) \sum_{(\tau_0, \ldots, \tau_{k-1}) \in \mathcal{T}_{v_{\lambda 0}} \times \cdots \times \mathcal{T}_{v_{\lambda (k-1)}}} \left( \prod_{v \in I_{\tau_0}} G(v) \prod_{v \in L_{\tau_0}} H(v) \right) \cdots \left( \prod_{v \in I_{\tau_{k-1}}} G(v) \prod_{v \in L_{\tau_{k-1}}} H(v) \right) \tag{16}$$
$$= H(v_\lambda) + G(v_\lambda) \sum_{\tau_0 \in \mathcal{T}_{v_{\lambda 0}}} \cdots \sum_{\tau_{k-1} \in \mathcal{T}_{v_{\lambda (k-1)}}} \left( \prod_{v \in I_{\tau_0}} G(v) \prod_{v \in L_{\tau_0}} H(v) \right) \cdots \left( \prod_{v \in I_{\tau_{k-1}}} G(v) \prod_{v \in L_{\tau_{k-1}}} H(v) \right) \tag{17}$$
$$= H(v_\lambda) + G(v_\lambda) \prod_{i=0}^{k-1} \sum_{\tau' \in \mathcal{T}_{v_{\lambda i}}} \prod_{v \in I_{\tau'}} G(v) \prod_{v \in L_{\tau'}} H(v). \tag{18}$$
Then, from (12) and (18), we have
$$\underbrace{\sum_{\tau \in \mathcal{T}} \prod_{v \in I_\tau} G(v) \prod_{v \in L_\tau} H(v)}_{(a)} = H(v_\lambda) + G(v_\lambda) \prod_{i=0}^{k-1} \underbrace{\sum_{\tau' \in \mathcal{T}_{v_{\lambda i}}} \prod_{v \in I_{\tau'}} G(v) \prod_{v \in L_{\tau'}} H(v)}_{(b)}. \tag{19}$$
The underbraced parts $(a)$ and $(b)$ have the same structure except for the depth of the root node of the subtree. Therefore, $(b)$ can be decomposed in a similar manner to (12)–(18), and this decomposition can be continued down to the leaf nodes.
Then, let $\mathcal{T}_v$ denote the set of subtrees whose root node is $v \in V_p$ in general; i.e., we define a notion similar to $\mathcal{T}_{v_{\lambda i}}$ not only for $v_{\lambda 0}, v_{\lambda 1}, \ldots, v_{\lambda (k-1)}$ but also for any other node $v \in V_p$. Finally, we obtain an alternative definition of $\phi : V_p \to \mathbb{R}$, which is equivalent to (11):
$$\phi(v) := \sum_{\tau' \in \mathcal{T}_v} \prod_{v' \in I_{\tau'}} G(v') \prod_{v' \in L_{\tau'}} H(v'). \tag{20}$$
The equivalence is confirmed by substituting it into both sides of (19). Therefore, Lemma 1 is proved. □
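As a complement to the proof, the following short Python sketch evaluates the recursion $\phi$ of Lemma 1 on the same toy base tree as above ($k = 2$, $d_{\max} = 2$; the parameter values are assumptions for illustration). Setting $G \equiv H \equiv 1$ counts the full rooted subtrees, and setting $G(v) = \alpha_v$, $H(v) = 1 - \alpha_v$ reproduces the total probability of Theorem 1.

import math

K, D_MAX = 2, 2
alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}

def children(v):
    return [v + str(i) for i in range(K)]

def phi(v, G, H):
    """Recursion (11): phi(v) = H(v) at base-tree leaves,
    H(v) + G(v) * prod of phi over the children otherwise."""
    if len(v) == D_MAX:
        return H(v)
    return H(v) + G(v) * math.prod(phi(c, G, H) for c in children(v))

# G = H = 1 counts the full rooted subtrees: |T| = 5 when k = 2, d_max = 2
print(phi("", lambda v: 1.0, lambda v: 1.0))
# G(v) = alpha_v, H(v) = 1 - alpha_v gives the total probability of Theorem 1
print(phi("", lambda v: alpha[v], lambda v: 1.0 - alpha[v]))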
Then, the proof of Theorem 1 is as follows.
Proof of Theorem 1.
Using Lemma 1, we can divide the sum into cases and factor out the common terms of $\sum_{\tau \in \mathcal{T}} p(\tau)$ in the following recursive manner:
$$\sum_{\tau \in \mathcal{T}} p(\tau) = \phi(v_\lambda), \tag{21}$$
where
$$\phi(v) := \begin{cases} 1 - \alpha_v, & v \in L_p, \\ (1 - \alpha_v) + \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} \phi(v'), & v \in I_p. \end{cases} \tag{22}$$
Then, we prove $\phi(v) = 1$ for any node $v \in V_p$ by structural induction. For any leaf node $v \in L_p$, $\alpha_v = 0$ from Definition 1. Therefore,
$$\phi(v) = 1 - \alpha_v = 1, \quad v \in L_p. \tag{23}$$
For any inner node $v \in I_p$, assuming $\phi(v') = 1$ as the induction hypothesis for any descendant node $v' \in \mathrm{De}_p(v)$,
$$\phi(v) = (1 - \alpha_v) + \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} \phi(v') \tag{24}$$
$$= (1 - \alpha_v) + \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} 1 \tag{25}$$
$$= 1. \tag{26}$$
Therefore, $\sum_{\tau \in \mathcal{T}} p(\tau) = \phi(v_\lambda) = 1$ since $v_\lambda$ is also in $V_p$. □
Remark 1.
Although Theorem 1 is also proved in [12,13], we extract the essential part of those proofs as Lemma 1. In [10,11], a restricted case of Theorem 1 is proved, in which $\alpha_v$ takes a common value for all $v \in I_p$.

4. Properties of Probability Distribution on Full Rooted Subtrees

In this section, we describe properties of the probability distribution on full rooted subtrees and methods to calculate them. All the proofs are in Appendix A. Note that the motivation and usefulness of Conditions 1, 2, 3, and 4 in this section will be described in Section 5.

4.1. Probability of Events on Nodes

At the beginning, we explain why $v \in V_T$ determines a probabilistic event. We consider any $v \in V_p$ to be given as a non-stochastic constant and fixed. After that, a full rooted subtree is randomly chosen according to the probability distribution proposed in Section 3. Then, $V_T$ sometimes contains $v$ and sometimes does not, depending on the realization $\tau$ of the random variable $T$. Therefore, $v \in V_T$ determines a probabilistic event on $p(\tau)$. Although the probability of such an event is trivially represented as $\sum_{\tau \in \mathcal{T}} \mathbb{I}\{v \in V_\tau\} p(\tau)$, where $\mathbb{I}\{\cdot\}$ denotes the indicator function, we derive computationally efficient forms without the summation over $\tau$ in the following.
Theorem 2.
For any $v \in V_p$, we have the following:
$$\Pr\{v \in V_T\} = \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}, \tag{27}$$
$$\Pr\{v \in I_T\} = \alpha_v \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}, \tag{28}$$
$$\Pr\{v \in L_T\} = (1 - \alpha_v) \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}. \tag{29}$$
Example 3.
Let us consider $p(\tau)$ shown in Figure 2. Trivially, $\Pr\{v_{01} \in V_T\}$, $\Pr\{v_1 \in I_T\}$, and $\Pr\{v_0 \in L_T\}$ are calculated as
$$\Pr\{v_{01} \in V_T\} = p(\tau_2) + p(\tau_4) = 0.28,$$
$$\Pr\{v_1 \in I_T\} = p(\tau_3) + p(\tau_4) = 0.56,$$
$$\Pr\{v_0 \in L_T\} = p(\tau_1) + p(\tau_3) = 0.42.$$
The same probabilities are also given by
$$\Pr\{v_{01} \in V_T\} = \alpha_{v_\lambda} \alpha_{v_0} = 0.28,$$
$$\Pr\{v_1 \in I_T\} = \alpha_{v_\lambda} \alpha_{v_1} = 0.56,$$
$$\Pr\{v_0 \in L_T\} = \alpha_{v_\lambda} (1 - \alpha_{v_1}) = 0.42.$$
Remark 2.
Probabilities of many other events on nodes are derived from Theorem 2. For example,
$$\Pr\{v \in I_T \mid v \in V_T\} = \frac{\Pr\{(v \in I_T) \wedge (v \in V_T)\}}{\Pr\{v \in V_T\}} = \frac{\Pr\{v \in I_T\}}{\Pr\{v \in V_T\}} = \alpha_v.$$
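The following small Python check illustrates Theorem 2 for the Figure 2 parameters: each node-event probability is a product of $\alpha$ over the node's strict ancestors, and the values 0.28, 0.56, and 0.42 of Example 3 are recovered. The string encoding of nodes is the same illustrative convention as in the earlier sketches.

import math

alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}

def ancestors(v):
    return [v[:i] for i in range(len(v))]        # strict ancestors of v

def pr_in_tree(v):                               # Pr{v in V_T}, Equation (27)
    return math.prod(alpha[a] for a in ancestors(v))

def pr_inner(v):                                 # Pr{v in I_T}, Equation (28)
    return alpha[v] * pr_in_tree(v)

def pr_leaf(v):                                  # Pr{v in L_T}, Equation (29)
    return (1.0 - alpha[v]) * pr_in_tree(v)

# approximately 0.28, 0.56, and 0.42, as in Example 3
print(pr_in_tree("01"), pr_inner("1"), pr_leaf("0"))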

4.2. Mode

We describe an algorithm to find the mode of $p(\tau)$ with $O(k^{d_{\max}+1})$ computational cost. (Here, $O(\cdot)$ denotes the Big-O notation; i.e., $f(n) = O(g(n))$ means that there exist $\kappa > 0$ and $n_0 > 0$ such that $|f(n)| \le \kappa \cdot g(n)$ for all $n > n_0$.) Note that the size of the search space $\mathcal{T}$ is of the order of $\Omega\big(2^{k^{d_{\max}-2}}\big)$ in general. (Here, $\Omega(\cdot)$ denotes the Big-Omega notation in complexity theory; i.e., $f(n) = \Omega(g(n))$ means that there exist $\kappa > 0$ and $n_0 > 0$ such that $f(n) \ge \kappa \cdot g(n)$ for all $n > n_0$. The bound $|\mathcal{T}| = \Omega\big(2^{k^{d_{\max}-2}}\big)$ is proved by substituting $G(v) \equiv H(v) \equiv 1$ in Lemma 1.) First, by replacing all the sums in the proof of Lemma 1 with maxima, we can derive the following recursive expression of $\max_{\tau \in \mathcal{T}} p(\tau)$.
Proposition 1.
$$\max_{\tau \in \mathcal{T}} p(\tau) = \psi(v_\lambda), \tag{39}$$
where
$$\psi(v) := \begin{cases} 1, & v \in L_p, \\ \max\Big\{ 1 - \alpha_v, \; \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} \psi(v') \Big\}, & v \in I_p. \end{cases} \tag{40}$$
Example 4.
For $p(\tau)$ shown in Figure 2, the maximum probability is $p(\tau_3) = 0.336$. It is also calculated as follows:
$$\max\big\{ 1 - \alpha_{v_\lambda}, \; \alpha_{v_\lambda} \max\{1 - \alpha_{v_0}, \alpha_{v_0}\} \cdot \max\{1 - \alpha_{v_1}, \alpha_{v_1}\} \big\} = \max\big\{ 0.3, \; 0.7 \max\{0.6, 0.4\} \cdot \max\{0.2, 0.8\} \big\} = \max\{0.3, 0.336\} = 0.336.$$
In addition, we define a flag variable $\delta_v \in \{0, 1\}$ as follows.
Definition 2.
For any $v \in V_p$, we define
$$\delta_v := \begin{cases} 1, & 1 - \alpha_v < \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} \psi(v'), \\ 0, & \text{otherwise}. \end{cases} \tag{44}$$
We can calculate $\psi(v)$ and $\delta_v$ simultaneously. Then, the mode of $p(\tau)$ is given by the following proposition.
Proposition 2.
$\arg\max_{\tau \in \mathcal{T}} p(\tau)$ is identified as the tree $\tau$ that satisfies
$$\forall v \in I_\tau, \quad \delta_v = 1, \tag{45}$$
$$\forall v \in L_\tau, \quad \delta_v = 0. \tag{46}$$
Then, the following theorem holds.
Theorem 3.
The mode of $p(\tau)$ can be found via a backtracking search from $v_\lambda$ after the calculation of $\psi(v)$ and $\delta_v$. The procedure is detailed in Algorithm A1 in Appendix B.
Remark 3.
In [10,11], Papageorgiou et al. proposed the same algorithm as Algorithm A1, as well as an algorithm to find multiple most likely trees, in the context of text compression.
Example 5.
See Figure 4. The parameters are the same as those in Figure 2. The mode $\tau_3$ is found by the proposed algorithm.

4.3. Expectation

Let $f : \mathcal{T} \to \mathbb{R}$ denote a real-valued function on $\mathcal{T}$. Here, we discuss sufficient conditions on $f$ under which the following expectation can be calculated efficiently with $O(k^{d_{\max}+1})$ cost:
$$\mathbb{E}[f(T)] := \sum_{\tau \in \mathcal{T}} f(\tau) p(\tau). \tag{47}$$
Note that the size of $\mathcal{T}$ is of the order of $\Omega\big(2^{k^{d_{\max}-2}}\big)$ in general.
Condition 1.
There exist $g : V_p \to \mathbb{R}$ and $h : V_p \to \mathbb{R}$ such that
$$f(\tau) = \prod_{v \in I_\tau} g(v) \prod_{v \in L_\tau} h(v). \tag{48}$$
Theorem 4.
Under Condition 1, we define a recursive function $\phi : V_p \to \mathbb{R}$ as
$$\phi(v) := \begin{cases} h(v), & v \in L_p, \\ (1 - \alpha_v) h(v) + \alpha_v g(v) \prod_{v' \in \mathrm{Ch}_p(v)} \phi(v'), & v \in I_p. \end{cases} \tag{49}$$
Then, we can calculate $\mathbb{E}[f(T)]$ as $\mathbb{E}[f(T)] = \phi(v_\lambda)$.
Example 6.
The probabilities in Theorem 2 can be regarded as examples of Theorem 4.
Condition 2.
There exist $g : V_p \to \mathbb{R}$ and $h : V_p \to \mathbb{R}$ such that
$$f(\tau) = \sum_{v \in I_\tau} g(v) + \sum_{v \in L_\tau} h(v). \tag{50}$$
Theorem 5.
Under Condition 2, we define a recursive function $\xi : V_p \to \mathbb{R}$ as
$$\xi(v) := \begin{cases} h(v), & v \in L_p, \\ (1 - \alpha_v) h(v) + \alpha_v \Big( g(v) + \sum_{v' \in \mathrm{Ch}_p(v)} \xi(v') \Big), & v \in I_p. \end{cases} \tag{51}$$
Then, we can calculate $\mathbb{E}[f(T)]$ as $\mathbb{E}[f(T)] = \xi(v_\lambda)$.
Remark 4.
Theorem 5 is useful for calculating the Shannon entropy (see, e.g., [16]) of $p(\tau)$, as described in Section 4.4.

4.4. Shannon Entropy

Corollary 1.
By substituting $g(v) = -\log \alpha_v$ and $h(v) = -\log(1 - \alpha_v)$ into (51), the Shannon entropy $\mathrm{H}[T] := -\sum_{\tau \in \mathcal{T}} p(\tau) \log p(\tau)$ can be recursively calculated as follows:
$$\mathrm{H}[T] = \xi(v_\lambda), \tag{52}$$
where
$$\xi(v) := \begin{cases} 0, & v \in L_p, \\ -(1 - \alpha_v) \log(1 - \alpha_v) + \alpha_v \Big( -\log \alpha_v + \sum_{v' \in \mathrm{Ch}_p(v)} \xi(v') \Big), & v \in I_p. \end{cases} \tag{53}$$
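As an illustration of Corollary 1, the following Python sketch evaluates the recursion $\xi$ for the Figure 2 parameters (an assumed toy setting, with the natural logarithm); the result, about 1.43 nats, agrees with the brute-force entropy of the five subtree probabilities.

import math

K, D_MAX = 2, 2
alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}

def children(v):
    return [v + str(i) for i in range(K)]

def xi(v):
    """Recursion (53): xi(v) = 0 at base-tree leaves; otherwise the per-node
    branching entropy plus alpha_v times the children's contributions."""
    if len(v) == D_MAX:
        return 0.0
    a = alpha[v]
    child_sum = sum(xi(c) for c in children(v))
    return -(1 - a) * math.log(1 - a) + a * (-math.log(a) + child_sum)

print(xi(""))    # H[T], approximately 1.43 nats for these parameters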
Remark 5.
The Kullback–Leibler divergence (see, e.g., [16]) between two tree distributions $p(\tau)$ and $p'(\tau)$ can be calculated in a manner similar to Corollary 1. This fact may be useful for variational Bayesian inference, in which the Kullback–Leibler divergence is minimized. This will be future work.

4.5. Conjugate Prior for a Probability Distribution on Full Rooted Subtrees

Here, we consider that $\alpha_v \in [0, 1]$ is also a realization of a random variable. Let $\boldsymbol{\alpha}$ denote $\{\alpha_v\}_{v \in V_p}$, and we write $p(\tau)$ as $p(\tau | \boldsymbol{\alpha})$ to emphasize its dependency on $\boldsymbol{\alpha}$ in the following theorem. Then, a conjugate prior for $p(\tau | \boldsymbol{\alpha})$ is given as follows.
Theorem 6.
The following probability distribution is a conjugate prior for $p(\tau | \boldsymbol{\alpha})$:
$$p(\boldsymbol{\alpha}) := \prod_{v \in V_p} \mathrm{Beta}(\alpha_v | \beta_v, \gamma_v), \tag{54}$$
where $\mathrm{Beta}(\cdot | \beta_v, \gamma_v)$ denotes the probability density function of the beta distribution with parameters $\beta_v$ and $\gamma_v$. More precisely,
$$p(\boldsymbol{\alpha} | \tau) = \prod_{v \in V_p} \mathrm{Beta}(\alpha_v | \beta_{v|\tau}, \gamma_{v|\tau}), \tag{55}$$
where
$$\beta_{v|\tau} := \begin{cases} \beta_v + 1, & v \in I_\tau, \\ \beta_v, & \text{otherwise}, \end{cases} \tag{56}$$
$$\gamma_{v|\tau} := \begin{cases} \gamma_v + 1, & v \in L_\tau, \\ \gamma_v, & \text{otherwise}. \end{cases} \tag{57}$$
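A minimal sketch of the update in Theorem 6 is given below. The initial hyperparameters $\beta_v = \gamma_v = 1$ (a uniform beta prior) and the observed tree are illustrative assumptions, not values taken from the paper; the update simply increments $\beta_v$ for each inner node and $\gamma_v$ for each leaf node of the observed tree.

# initial Beta hyperparameters: beta_v = gamma_v = 1 is an assumed uniform prior
beta = {v: 1.0 for v in ["", "0", "1", "00", "01", "10", "11"]}
gamma = {v: 1.0 for v in beta}

def update(inner_nodes, leaf_nodes):
    """Equations (56) and (57): count inner/leaf memberships of the observed tree."""
    for v in inner_nodes:
        beta[v] += 1.0       # beta_{v|tau} = beta_v + 1 for v in I_tau
    for v in leaf_nodes:
        gamma[v] += 1.0      # gamma_{v|tau} = gamma_v + 1 for v in L_tau

# observe the tree with inner nodes {v_lambda, v_1} and leaves {v_0, v_10, v_11}
update(inner_nodes=["", "1"], leaf_nodes=["0", "10", "11"])
print(beta, gamma)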

4.6. Probability Distribution on a Full Rooted Tree as a Conjugate Prior

We define another random variable $X$ on a set $\mathcal{X}$ and assume that $X$ depends on $T$; i.e., it follows a distribution $p(x | \tau)$. Here, we discuss a sufficient condition on $p(x | \tau)$ under which $p(\tau)$ becomes a conjugate prior for it and we can efficiently calculate the posterior $p(\tau | x)$.
Condition 3.
There exist two functions $g : V_p \times \mathcal{X} \to \mathbb{R}$ and $h : V_p \times \mathcal{X} \to \mathbb{R}$ such that $p(x | \tau)$ has the following form:
$$p(x | \tau) = \prod_{v \in I_\tau} g(x, v) \prod_{v \in L_\tau} h(x, v). \tag{58}$$
Note that $g$ and $h$ are not necessarily probability density functions.
Example 7.
For given $\mu_1, \mu_2 \in \mathbb{R}$ and $\sigma_1, \sigma_2 \in \mathbb{R}_{>0}$, let $\mathcal{N}(x | \mu_1, \sigma_1^2)$ and $\mathcal{N}(x | \mu_2, \sigma_2^2)$ denote the probability density functions of the normal distributions governed by them. Let $x := (x_v)_{v \in V_p}$. If we assume
$$g(x, v) = \mathcal{N}(x_v | \mu_1, \sigma_1^2), \tag{59}$$
$$h(x, v) = \prod_{v' \in \mathrm{De}_p(v) \cup \{v\}} \mathcal{N}(x_{v'} | \mu_2, \sigma_2^2), \tag{60}$$
we can construct $p(x | \tau)$ that satisfies Condition 3. In other words, the elements of the $|V_p|$-dimensional vector $x$ follow a mixture of two normal distributions, and which of the two is chosen is determined by $\tau$.
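The following Python sketch samples from the generative model of Example 7. The tree is drawn by expanding each reached node with probability $\alpha_v$ (which realizes $p(\tau)$ by Remark 2), and then each $x_v$ is drawn from $\mathcal{N}(\mu_1, \sigma_1^2)$ if $v$ is an inner node of the sampled tree and from $\mathcal{N}(\mu_2, \sigma_2^2)$ otherwise. The numerical values of $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, and $\alpha$ are illustrative assumptions.

import random

K, D_MAX = 2, 2
alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
MU1, SIGMA1, MU2, SIGMA2 = 0.0, 1.0, 5.0, 1.0     # illustrative, not from the paper

def children(v):
    return [v + str(i) for i in range(K)]

def sample_tree(v="", inner=None):
    """Sample tau: a reached node becomes an inner node with probability alpha_v."""
    inner = set() if inner is None else inner
    if random.random() < alpha[v]:                # leaves of the base tree never expand
        inner.add(v)
        for c in children(v):
            sample_tree(c, inner)
    return inner

inner = sample_tree()
x = {}
for v in ["", "0", "1", "00", "01", "10", "11"]:
    # mu_1 for inner nodes of tau; mu_2 for leaves of tau and their descendants
    mu, sigma = (MU1, SIGMA1) if v in inner else (MU2, SIGMA2)
    x[v] = random.gauss(mu, sigma)
print(sorted(inner), x)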
Theorem 7.
Under Condition 3, we define $q(x | v)$ and $\alpha_{v|x}$ as follows:
$$q(x | v) := \begin{cases} h(x, v), & v \in L_p, \\ (1 - \alpha_v) h(x, v) + \alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v'), & v \in I_p, \end{cases} \tag{61}$$
$$\alpha_{v|x} := \begin{cases} \alpha_v, & v \in L_p, \\ \dfrac{\alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v')}{q(x | v)}, & v \in I_p. \end{cases} \tag{62}$$
Note that $\alpha_v = 0$ for $v \in L_p$ (see Definition 1). Then, the posterior $p(\tau | x)$ is represented as follows:
$$p(\tau | x) = \prod_{v \in I_\tau} \alpha_{v|x} \prod_{v \in L_\tau} (1 - \alpha_{v|x}). \tag{63}$$
It should be noted that the calculation of $q(x | v)$ and $\alpha_{v|x}$ requires $O(k^{d_{\max}+1})$ cost, whereas the direct calculation of the posterior requires $\Omega\big(2^{k^{d_{\max}-2}}\big)$ cost in general.
Moreover, if we assume the following condition, which is stronger than Condition 3, we can calculate the posterior $p(\tau | x)$ more efficiently, with $O(d_{\max})$ cost.
Condition 4.
In addition to Condition 3, we assume that there exist a path from $v_\lambda$ to a leaf node $v_{\mathrm{end}} \in L_p$ and another function $h' : V_p \times \mathcal{X} \to \mathbb{R}$ that satisfy
$$g(x, v) \equiv 1, \tag{64}$$
$$h(x, v) := h'(x, v)^{\mathbb{I}\{v \preceq v_{\mathrm{end}}\}}. \tag{65}$$
Here, $\mathbb{I}\{\cdot\}$ denotes the indicator function. In other words, only $h(x, v)$ on the path from $v_\lambda$ to $v_{\mathrm{end}}$ takes a value different from 1.
Corollary 2.
Under Condition 4, $q(x | v)$ and $\alpha_{v|x}$ are calculated as follows, more efficiently than by (61) and (62):
$$q(x | v) = \begin{cases} h'(x, v), & v = v_{\mathrm{end}}, \\ (1 - \alpha_v) h'(x, v) + \alpha_v q(x | v_{\mathrm{ch}}), & v \prec v_{\mathrm{end}}, \\ 1, & \text{otherwise}, \end{cases} \tag{66}$$
$$\alpha_{v|x} = \begin{cases} \alpha_v, & v \nprec v_{\mathrm{end}}, \\ \dfrac{\alpha_v q(x | v_{\mathrm{ch}})}{q(x | v)}, & v \prec v_{\mathrm{end}}, \end{cases} \tag{67}$$
where $v_{\mathrm{ch}}$ denotes the child node of $v$ on the path from $v_\lambda$ to $v_{\mathrm{end}}$. Note that, to update the posterior, we need not calculate $q(x | v)$ for the nodes off the path, so the update costs only $O(d_{\max})$.
Remark 6.
Condition 4 is effective at representing the generation of sequential data $x_1, x_2, \ldots, x_N$, in which there exists a path from the root node $v_\lambda$ to a leaf node $v_{\mathrm{end}}^n \in L_p$ for each $n \in \{1, 2, \ldots, N\}$ ($v_{\mathrm{end}}^n$ and $v_{\mathrm{end}}^{n'}$ may differ from each other for $n \ne n'$). The notable previous studies using Corollary 2 are [8,9,10,11,12,13] (in [10,11], only (66) is used and (67) is not); in other words, they treat only the case under Condition 4. The other theorems in this paper have potential applications to broader fields of study.
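To illustrate the $O(d_{\max})$ update of Corollary 2, the following sketch computes $q(x|v)$ bottom-up along a single path and then the posterior parameters $\alpha_{v|x}$; off-path nodes keep their prior values. The chosen path and the per-node values of $h'(x, v)$ are illustrative assumptions; in the context-tree setting of [8,9,10,11], $h'(x, v)$ would be the likelihood of the data under the sub-model attached to node $v$.

alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
path = ["", "0", "01"]                       # from the root v_lambda to v_end = v_01
hprime = {"": 0.5, "0": 0.6, "01": 0.9}      # illustrative likelihood values h'(x, v)

# q(x|v) from v_end back up to the root (Equation (66));
# q(x|v) = 1 for off-path nodes by (66) and need not be computed
q = {path[-1]: hprime[path[-1]]}
for v, v_ch in zip(reversed(path[:-1]), reversed(path[1:])):
    q[v] = (1 - alpha[v]) * hprime[v] + alpha[v] * q[v_ch]

# posterior parameters alpha_{v|x} (Equation (67)); off-path nodes keep alpha_v
alpha_post = dict(alpha)
for v, v_ch in zip(path[:-1], path[1:]):
    alpha_post[v] = alpha[v] * q[v_ch] / q[v]
alpha_post[path[-1]] = alpha[path[-1]]       # v_end is a leaf of the base tree, alpha = 0

print(q[""], alpha_post)                     # q(x|v_lambda) = p(x); updated parameters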

5. Discussion

In this section, we describe the usefulness of our results in statistical decision theory (see, e.g., [6]) and hierarchical Bayesian modeling (see, e.g., [15]). First, our results are useful for model selection and model averaging under the Bayes criterion in statistical decision theory (see, e.g., [6]). The proposed probability distribution $p(\tau)$ is a conjugate prior for stochastic models $p(x|\tau)$ satisfying Condition 3, as shown in Theorem 7, and the MAP estimate $\arg\max_\tau p(\tau|x)$ can be efficiently calculated by applying Theorem 3 to the posterior distribution $p(\tau|x)$ obtained by Theorem 7. This is the Bayes optimal model selection based on the posterior distribution. Furthermore, we can calculate $\sum_\tau p(x_{\mathrm{new}}|\tau) p(\tau|x)$, i.e., the weighting of the stochastic models based on the posterior distribution, by using Theorems 4 and 7, since the stochastic models $p(x|\tau)$ satisfying Condition 3 also satisfy Condition 1. This is model averaging of all possible trees with Bayes optimal weights, and it corresponds to methodologies that do not select a single tree but aggregate several trees, such as [4,5]. It should be noted that the occurrence probability of a deep tree decays exponentially in our proposed probability distribution. Therefore, we can avoid the deep tree, which often corresponds to a complex statistical model, as mentioned in Section 1.
Second, one example of the applications derived from our results is hyperparameter learning. As mentioned in Remark 6, Condition 4 has been applied to various stochastic models $p(x|\tau)$ in previous studies [8,9,10,11,12,13]. Conditions 1 and 3 are more general than Condition 4, since a stochastic model $p(x|\tau)$ satisfying Condition 4 also satisfies Conditions 1 and 3. In addition, the logarithm of a function $f(\tau)$ satisfying Conditions 1 and 3 (and of a stochastic model $p(x|\tau)$ satisfying Condition 4) satisfies Condition 2. Therefore, we can calculate $\sum_{\tau \in \mathcal{T}} p(\tau|x) \log p(x|\tau)$ by using Theorems 7 and 5. In particular, the fact that we can calculate the expectations $\mathbb{E}[p(x|T)] = \sum_{\tau \in \mathcal{T}} p(\tau|x) p(x|\tau)$ and $\mathbb{E}[\log p(x|T)] = \sum_{\tau \in \mathcal{T}} p(\tau|x) \log p(x|\tau)$ of a stochastic model $p(x|\tau)$ satisfying Condition 4 implies that we can learn the hyperparameters of the stochastic models in [8,9,10,11,12,13] by hierarchical Bayesian modeling with variational Bayesian methods (see, e.g., [15]). To the best of our knowledge, there are no unified studies treating hyperparameter learning for these models.

6. Future Work

Since the present study is theoretical, applying the derived theorems is left for future studies. Theorems 1 and 3 and Corollary 2 have already been used in previous studies [8,9,10,11,12,13]; the remaining theorems can now be applied as well.
In this study, we did not use approximate algorithms such as variational Bayes or Markov chain Monte Carlo methods (see, e.g., [15]). Such algorithms are required for learning hierarchical models that contain the probability distribution on full rooted subtrees, and the methods proposed herein may serve as their subroutines. Extending our methods to such approximate algorithms is another direction for future work.
In this study, the class of trees is restricted to that of full trees, in which every inner node has the same number of child nodes. Hence, generalizing the class to that of arbitrary rooted trees can be considered in future studies.

7. Conclusions

In this paper, we discussed probability distributions on full rooted subtrees. Although such distributions have been used in many fields of study, such as information theory [8,9,10,11], image processing [12], and machine learning [13], their treatment has depended significantly on the specific applications and data generative models. By contrast, we discussed them theoretically, collectively, and independently of any specific data generative model. Subsequently, we derived new generalized methods to evaluate the characteristics of the probability distribution on full rooted subtrees, which had not been derived in previous studies. The derived methods efficiently calculate the probabilities of events on the nodes, the mode, the expectation, the Shannon entropy, and the posterior distribution of a full rooted subtree. Therefore, this study expands the possibility of applying the probability distribution on full rooted subtrees.

Author Contributions

Conceptualization, Y.N., S.S., A.K., and T.M.; methodology, Y.N., S.S., A.K., and T.M.; validation, Y.N., S.S., A.K., and T.M.; formal analysis, Y.N., S.S., A.K., and T.M.; writing—original draft preparation, Y.N.; writing—review and editing, Y.N., S.S., A.K., and T.M.; visualization, Y.N., S.S., A.K., and T.M.; supervision, T.M.; project administration, Y.N.; funding acquisition, S.S. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, grant numbers JP17K06446, JP19K04914, and JP19K14989.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Proof of Theorem 2.
First, we prove (27). Let $\mathbb{I}\{\cdot\}$ denote the indicator function. Then, $\Pr\{v \in V_T\}$ is expressed as
$$\Pr\{v \in V_T\} = \sum_{\tau \in \mathcal{T}} \mathbb{I}\{v \in V_\tau\} p(\tau). \tag{A1}$$
Here, $v \in V_\tau$ is equivalent to the condition that no leaf node of $\tau$ is an ancestor node of $v$. Then,
$$\Pr\{v \in V_T\} = \sum_{\tau \in \mathcal{T}} \prod_{v' \in L_\tau} \mathbb{I}\{v' \nprec v\} p(\tau) \tag{A2}$$
$$= \sum_{\tau \in \mathcal{T}} \prod_{v' \in I_\tau} \alpha_{v'} \prod_{v' \in L_\tau} \mathbb{I}\{v' \nprec v\} (1 - \alpha_{v'}). \tag{A3}$$
Therefore, using Lemma 1,
$$\Pr\{v \in V_T\} = \phi_v(v_\lambda), \tag{A4}$$
where
$$\phi_v(v') := \begin{cases} \mathbb{I}\{v' \nprec v\} (1 - \alpha_{v'}), & v' \in L_p, \\ \mathbb{I}\{v' \nprec v\} (1 - \alpha_{v'}) + \alpha_{v'} \prod_{v'' \in \mathrm{Ch}_p(v')} \phi_v(v''), & v' \in I_p. \end{cases} \tag{A5}$$
We further transform this function.
If $v' \nprec v$, then $\mathbb{I}\{v' \nprec v\} = 1$; consequently,
$$\phi_v(v') = \begin{cases} 1 - \alpha_{v'}, & v' \in L_p, \\ (1 - \alpha_{v'}) + \alpha_{v'} \prod_{v'' \in \mathrm{Ch}_p(v')} \phi_v(v''), & v' \in I_p. \end{cases} \tag{A6}$$
This has the same form as (22), and every child node $v'' \in \mathrm{Ch}_p(v')$ also satisfies $v'' \nprec v$. Therefore, $\phi_v(v') = 1$ for $v' \nprec v$.
If $v' \prec v$, then $v'$ cannot be in $L_p$ and has exactly one child node in $\mathrm{An}(v) \cup \{v\}$; let $v_{\mathrm{ch}}$ denote it. Then, $\phi_v(v'') = 1$ for the other child nodes $v'' \in \mathrm{Ch}_p(v') \setminus \{v_{\mathrm{ch}}\}$. Therefore, (A5) is represented as follows:
$$\phi_v(v') = \alpha_{v'} \phi_v(v_{\mathrm{ch}}). \tag{A7}$$
Therefore, expanding $\phi_v(v_\lambda)$,
$$\Pr\{v \in V_T\} = \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}. \tag{A8}$$
Next, we prove (28). It is proved in a manner similar to the proof of (27) since
$$\Pr\{v \in I_T\} = \sum_{\tau \in \mathcal{T}} \prod_{v' \in L_\tau} \mathbb{I}\{v' \npreceq v\} p(\tau). \tag{A9}$$
Lastly, we prove (29). We have
$$\Pr\{v \in L_T\} = \Pr\{v \in V_T\} - \Pr\{v \in I_T\}. \tag{A10}$$
Therefore, (29) follows from (27) and (28). □
Proof of Theorem 4.
By substituting (48) into (47), $\mathbb{E}[f(T)]$ can be represented as follows:
$$\mathbb{E}[f(T)] = \sum_{\tau \in \mathcal{T}} \prod_{v \in I_\tau} \alpha_v g(v) \prod_{v \in L_\tau} (1 - \alpha_v) h(v). \tag{A11}$$
Then, using Lemma 1, Theorem 4 follows straightforwardly. □
Proof of Theorem 5.
First, we exchange the order of summation as follows:
$$\begin{aligned}
\mathbb{E}[f(T)] &= \sum_{\tau \in \mathcal{T}} p(\tau) \Big( \sum_{v \in I_\tau} g(v) + \sum_{v \in L_\tau} h(v) \Big) &\text{(A12)} \\
&= \sum_{\tau \in \mathcal{T}} p(\tau) \sum_{v \in V_p} \mathbb{I}\{v \in I_\tau\} g(v) + \sum_{\tau \in \mathcal{T}} p(\tau) \sum_{v \in V_p} \mathbb{I}\{v \in L_\tau\} h(v) &\text{(A13)} \\
&= \sum_{v \in V_p} g(v) \sum_{\tau \in \mathcal{T}} p(\tau) \mathbb{I}\{v \in I_\tau\} + \sum_{v \in V_p} h(v) \sum_{\tau \in \mathcal{T}} p(\tau) \mathbb{I}\{v \in L_\tau\} &\text{(A14)} \\
&= \sum_{v \in V_p} g(v) \Pr\{v \in I_T\} + \sum_{v \in V_p} h(v) \Pr\{v \in L_T\} &\text{(A15)} \\
&= \sum_{v \in V_p} g(v) \alpha_v \prod_{v' \in \mathrm{An}(v)} \alpha_{v'} + \sum_{v \in V_p} h(v) (1 - \alpha_v) \prod_{v' \in \mathrm{An}(v)} \alpha_{v'} &\text{(A16)} \\
&= \sum_{v \in V_p} \big( \alpha_v g(v) + (1 - \alpha_v) h(v) \big) \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}, &\text{(A17)}
\end{aligned}$$
where (A16) follows from Theorem 2.
Next, we decompose the right-hand side of (A17) until it has the same form as (51). Since $V_p = \{v_\lambda\} \cup \mathrm{De}_p(v_\lambda)$,
$$(\mathrm{A17}) = \alpha_{v_\lambda} g(v_\lambda) + (1 - \alpha_{v_\lambda}) h(v_\lambda) + \sum_{v \in \mathrm{De}_p(v_\lambda)} \big( \alpha_v g(v) + (1 - \alpha_v) h(v) \big) \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}. \tag{A18}$$
For any $v \in \mathrm{De}_p(v_\lambda)$, $\mathrm{An}(v)$ contains $v_\lambda$. Therefore,
$$(\mathrm{A18}) = (1 - \alpha_{v_\lambda}) h(v_\lambda) + \alpha_{v_\lambda} \Big( g(v_\lambda) + \sum_{v \in \mathrm{De}_p(v_\lambda)} \big( \alpha_v g(v) + (1 - \alpha_v) h(v) \big) \prod_{v' : v_\lambda \prec v' \prec v} \alpha_{v'} \Big). \tag{A19}$$
Further, since $\mathrm{De}_p(v_\lambda) = \bigcup_{v \in \mathrm{Ch}_p(v_\lambda)} \big( \mathrm{De}_p(v) \cup \{v\} \big)$,
$$(\mathrm{A19}) = (1 - \alpha_{v_\lambda}) h(v_\lambda) + \alpha_{v_\lambda} \Big( g(v_\lambda) + \sum_{v \in \mathrm{Ch}_p(v_\lambda)} \sum_{v' \in \mathrm{De}_p(v) \cup \{v\}} \big( \alpha_{v'} g(v') + (1 - \alpha_{v'}) h(v') \big) \prod_{v'' : v \preceq v'' \prec v'} \alpha_{v''} \Big). \tag{A20}$$
Comparing (A17) and (A20), we have
$$\underbrace{\sum_{v \in V_p} \big( \alpha_v g(v) + (1 - \alpha_v) h(v) \big) \prod_{v' \in \mathrm{An}(v)} \alpha_{v'}}_{(a)} = (1 - \alpha_{v_\lambda}) h(v_\lambda) + \alpha_{v_\lambda} \Big( g(v_\lambda) + \sum_{v \in \mathrm{Ch}_p(v_\lambda)} \underbrace{\sum_{v' \in \mathrm{De}_p(v) \cup \{v\}} \big( \alpha_{v'} g(v') + (1 - \alpha_{v'}) h(v') \big) \prod_{v'' : v \preceq v'' \prec v'} \alpha_{v''}}_{(b)} \Big). \tag{A21}$$
The underbraced parts $(a)$ and $(b)$ have the same structure. Therefore, $(b)$ can be decomposed in a similar manner to (A17)–(A20), and this decomposition can be continued down to the leaf nodes.
Finally, we obtain an alternative definition of $\xi : V_p \to \mathbb{R}$, which is equivalent to (51):
$$\xi(v) := \sum_{v' \in \mathrm{De}_p(v) \cup \{v\}} \big( \alpha_{v'} g(v') + (1 - \alpha_{v'}) h(v') \big) \prod_{v'' : v \preceq v'' \prec v'} \alpha_{v''}. \tag{A22}$$
The equivalence is confirmed by substituting it into both sides of (A21). Therefore, Theorem 5 is proved. □
Proof of Theorem 6.
By Bayes' theorem, we have
$$p(\boldsymbol{\alpha} | \tau) \propto p(\tau | \boldsymbol{\alpha}) p(\boldsymbol{\alpha}) \tag{A23}$$
$$= \prod_{v \in I_\tau} \alpha_v \prod_{v \in L_\tau} (1 - \alpha_v) \prod_{v \in V_p} \mathrm{Beta}(\alpha_v | \beta_v, \gamma_v) \tag{A24}$$
$$= \prod_{v \in I_\tau} \alpha_v \mathrm{Beta}(\alpha_v | \beta_v, \gamma_v) \prod_{v \in L_\tau} (1 - \alpha_v) \mathrm{Beta}(\alpha_v | \beta_v, \gamma_v) \prod_{v \in V_p \setminus V_\tau} \mathrm{Beta}(\alpha_v | \beta_v, \gamma_v) \tag{A25}$$
$$\propto \prod_{v \in V_p} \mathrm{Beta}(\alpha_v | \beta_{v|\tau}, \gamma_{v|\tau}), \tag{A26}$$
where we used the conjugacy between the Bernoulli distribution and the beta distribution for each term, and
$$\beta_{v|\tau} := \begin{cases} \beta_v + 1, & v \in I_\tau, \\ \beta_v, & \text{otherwise}, \end{cases} \tag{A27}$$
$$\gamma_{v|\tau} := \begin{cases} \gamma_v + 1, & v \in L_\tau, \\ \gamma_v, & \text{otherwise}. \end{cases} \tag{A28}$$
 □
Proof of Theorem 7.
We prove (63) from the right-hand side to the left-hand side:
$$\prod_{v \in I_\tau} \alpha_{v|x} \prod_{v \in L_\tau} (1 - \alpha_{v|x}) = \prod_{v \in I_\tau} \alpha_{v|x} \prod_{v \in L_\tau \cap L_p} (1 - \alpha_{v|x}) \prod_{v \in L_\tau \cap I_p} (1 - \alpha_{v|x}). \tag{A29}$$
In the following, we transform each of the above products in order. First, the first product is transformed by substituting (62) as follows:
$$\prod_{v \in I_\tau} \alpha_{v|x} = \prod_{v \in I_\tau} \frac{\alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v')}{q(x | v)}. \tag{A30}$$
Next, the second product is transformed as follows:
$$\prod_{v \in L_\tau \cap L_p} (1 - \alpha_{v|x}) = \prod_{v \in L_\tau \cap L_p} (1 - \alpha_v) \tag{A31}$$
$$= \prod_{v \in L_\tau \cap L_p} \frac{(1 - \alpha_v) h(x, v)}{q(x | v)}, \tag{A32}$$
where (A31) follows from (62) and (A32) follows from $q(x | v) = h(x, v)$ for $v \in L_p$.
Lastly, the third product is transformed as follows:
$$\prod_{v \in L_\tau \cap I_p} (1 - \alpha_{v|x}) = \prod_{v \in L_\tau \cap I_p} \left( 1 - \frac{\alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v')}{q(x | v)} \right) \tag{A33}$$
$$= \prod_{v \in L_\tau \cap I_p} \frac{q(x | v) - \alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v')}{q(x | v)} \tag{A34}$$
$$= \prod_{v \in L_\tau \cap I_p} \frac{(1 - \alpha_v) h(x, v)}{q(x | v)}, \tag{A35}$$
where (A33) follows from (62) and (A35) follows from substituting (61) into $q(x | v)$ in the numerator.
Therefore, combining (A30), (A32), and (A35), we obtain
$$(\mathrm{A29}) = \prod_{v \in I_\tau} \frac{\alpha_v g(x, v) \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v')}{q(x | v)} \prod_{v \in L_\tau} \frac{(1 - \alpha_v) h(x, v)}{q(x | v)}. \tag{A36}$$
Here, (A36) is a telescoping product; i.e., each $q(x | v)$ appears exactly once in the denominator and once in the numerator. Therefore, we can cancel them all except for $q(x | v_\lambda)$. Then,
$$(\mathrm{A36}) = \frac{1}{q(x | v_\lambda)} \prod_{v \in I_\tau} \alpha_v g(x, v) \prod_{v \in L_\tau} (1 - \alpha_v) h(x, v) \tag{A37}$$
$$= \frac{1}{q(x | v_\lambda)} \prod_{v \in I_\tau} g(x, v) \prod_{v \in L_\tau} h(x, v) \prod_{v \in I_\tau} \alpha_v \prod_{v \in L_\tau} (1 - \alpha_v) \tag{A38}$$
$$= \frac{p(x | \tau) p(\tau)}{q(x | v_\lambda)}, \tag{A39}$$
where we used (58) and Definition 1.
In addition, because of Theorem 4,
$$q(x | v_\lambda) = \mathbb{E}[p(x | T)] = \sum_{\tau' \in \mathcal{T}} p(x | \tau') p(\tau') = p(x). \tag{A40}$$
Therefore,
$$(\mathrm{A39}) = \frac{p(x | \tau) p(\tau)}{p(x)} = p(\tau | x). \tag{A41}$$
Then, Theorem 7 holds. □
Proof of Corollary 2.
We prove only that $q(x | v) = 1$ for $v \npreceq v_{\mathrm{end}}$; then, (66) and (67) are straightforwardly derived by substituting this, together with (64) and (65), into (61) and (62).
For $v \npreceq v_{\mathrm{end}}$, substituting (64) and (65) into (61) yields
$$q(x | v) = \begin{cases} 1, & v \in L_p, \\ (1 - \alpha_v) + \alpha_v \prod_{v' \in \mathrm{Ch}_p(v)} q(x | v'), & v \in I_p. \end{cases} \tag{A42}$$
Since this has the same form as (A6), $q(x | v) = 1$ for $v \npreceq v_{\mathrm{end}}$ is derived in a similar manner. □

Appendix B. Pseudocode to Calculate Mode

Algorithm A1 Calculation of the mode.
Require: $\{\alpha_v\}_{v \in V_p}$
Ensure: $\tau^* = \arg\max_\tau p(\tau)$

function FLAG_CALCULATION($v$)        ▹ Subroutine
    if $v \in L_p$ then
        $\delta_v \leftarrow 0$
        return $1$
    else if $v \in I_p$ then
        $tmp \leftarrow \prod_{v' \in \mathrm{Ch}_p(v)}$ FLAG_CALCULATION($v'$)
        if $1 - \alpha_v < \alpha_v \cdot tmp$ then
            $\delta_v \leftarrow 1$
            return $\alpha_v \cdot tmp$
        else
            $\delta_v \leftarrow 0$
            return $1 - \alpha_v$
        end if
    end if
end function

function BACKTRACKING($v$, $V$, $E$)        ▹ Subroutine
    if $\delta_v = 0$ then
        return
    else if $\delta_v = 1$ then
        $V \leftarrow V \cup \mathrm{Ch}_p(v)$
        $E \leftarrow E \cup \bigcup_{v' \in \mathrm{Ch}_p(v)} \{(v, v')\}$
        for all $v' \in \mathrm{Ch}_p(v)$ do
            BACKTRACKING($v'$, $V$, $E$)
        end for
        return
    end if
end function

procedure        ▹ The main procedure
    FLAG_CALCULATION($v_\lambda$)
    $V \leftarrow \{v_\lambda\}$
    $E \leftarrow \emptyset$
    BACKTRACKING($v_\lambda$, $V$, $E$)
    return $\tau^* = (V, E)$
end procedure
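For reference, a runnable Python transcription of Algorithm A1 is given below, using the same string-labelled toy base tree and the Figure 2 parameters as in the sketches in Section 4 (these concrete values are assumptions for illustration). The mode is returned as its sets of inner and leaf nodes rather than as a pair $(V, E)$.

import math

K, D_MAX = 2, 2
alpha = {"": 0.7, "0": 0.4, "1": 0.8, "00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
delta = {}

def children(v):
    return [v + str(i) for i in range(K)]

def flag_calculation(v):
    """Compute psi(v) bottom-up and set the flag delta_v (Definition 2)."""
    if len(v) == D_MAX:                       # v is a leaf of the base tree
        delta[v] = 0
        return 1.0
    tmp = math.prod(flag_calculation(c) for c in children(v))
    if 1 - alpha[v] < alpha[v] * tmp:
        delta[v] = 1
        return alpha[v] * tmp
    delta[v] = 0
    return 1 - alpha[v]

def backtracking(v, inner, leaf):
    """Expand nodes with delta_v = 1 starting from the root to recover the mode."""
    if delta[v] == 1:
        inner.add(v)
        for c in children(v):
            backtracking(c, inner, leaf)
    else:
        leaf.add(v)

flag_calculation("")                          # psi(v_lambda) = max_tau p(tau) = 0.336
inner, leaf = set(), set()
backtracking("", inner, leaf)
print(sorted(inner), sorted(leaf))            # the mode of Example 5: inner {v_lambda, v_1}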

References

  1. Willems, F.M.J.; Shtarkov, Y.M.; Tjalkens, T.J. The context-tree weighting method: Basic properties. IEEE Trans. Inf. Theory 1995, 41, 653–664.
  2. Sullivan, G.J.; Ohm, J.; Han, W.; Wiegand, T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668.
  3. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
  4. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
  5. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  6. Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
  7. Matsushima, T.; Inazumi, H.; Hirasawa, S. A class of distortionless codes designed by Bayes decision theory. IEEE Trans. Inf. Theory 1991, 37, 1288–1293.
  8. Matsushima, T.; Hirasawa, S. A Bayes coding algorithm using context tree. In Proceedings of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, 27 June–1 July 1994; p. 386.
  9. Matsushima, T.; Hirasawa, S. Reducing the space complexity of a Bayes coding algorithm using an expanded context tree. In Proceedings of the 2009 IEEE International Symposium on Information Theory, Seoul, Korea, 28 June–3 July 2009; pp. 719–723.
  10. Papageorgiou, I.; Kontoyiannis, I.; Mertzanis, L.; Panotopoulou, A.; Skoularidou, M. Revisiting Context-Tree Weighting for Bayesian Inference. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021; pp. 2906–2911.
  11. Kontoyiannis, I.; Mertzanis, L.; Panotopoulou, A.; Papageorgiou, I.; Skoularidou, M. Bayesian Context Trees: Modelling and exact inference for discrete time series. arXiv 2020, arXiv:stat.ME/2007.14900.
  12. Nakahara, Y.; Matsushima, T. A Stochastic Model for Block Segmentation of Images Based on the Quadtree and the Bayes Code for It. Entropy 2021, 23, 991.
  13. Dobashi, N.; Saito, S.; Nakahara, Y.; Matsushima, T. Meta-Tree Random Forest: Probabilistic Data-Generative Model and Bayes Optimal Prediction. Entropy 2021, 23, 768.
  14. Rosen, K.H. Discrete Mathematics and Its Applications, 7th ed.; McGraw-Hill Science: New York, NY, USA, 2011.
  15. Bishop, C. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  16. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley & Sons, Inc.: New York, NY, USA, 2006.
Figure 1. The notation for the rooted trees.
Figure 2. An example of the probability distribution on full rooted subtrees. Here, $k = 2$ and $d_{\max} = 2$. The parameters of the distribution are in the upper right figure. When $k = 2$ and $d_{\max} = 2$, $|\mathcal{T}| = 5$. The probability of each full rooted subtree is calculated under its graph.
Figure 3. Examples of trees in $\{[v_\lambda]\}$ and $\mathcal{T} \setminus \{[v_\lambda]\}$, where $k = 2$. The left side shows the structure of the tree in $\{[v_\lambda]\}$; there is only one tree, $[v_\lambda]$. The right side shows the structure of the trees in $\mathcal{T} \setminus \{[v_\lambda]\}$. All of them have the root node $v_\lambda$ as an inner node. The rest of the structure is determined by choosing subtrees from $\mathcal{T}_{v_{\lambda 0}}$ and $\mathcal{T}_{v_{\lambda 1}}$.
Figure 4. An example of the mode calculation. The parameters are the same as those in Figure 2 and are shown in the lower right diagram. Diagrams in the top half show the process of calculating the flag variable $\delta_v$, which is determined from the leaf nodes upward. Diagrams in the lower half show the process of backtracking: if $\delta_v = 1$, expand the edges; if $\delta_v = 0$, stop the expansion.