1. Introduction
In learning Bayesian networks, one of the main concerns is structure learning. Many criteria for detecting the network structure have been proposed, such as the minimum description length (MDL) [1], the Bayesian information criterion (BIC) [2], the Akaike information criterion (AIC) [3], and the marginal likelihood [4]. Most of these criteria assume statistical regularity, which means that the parameter of the network is identifiable and all nodes are observable.
The nodes of the network are not always observable in practical situations; there may be underlying factors that are difficult to observe and do not appear in the given data. In such cases, the criteria for structure learning must be designed by taking the existence of hidden nodes into account. However, statistical regularity does not hold when the network contains hidden nodes [5,6].
Probabilistic models fall into two types: regular and singular. If the mapping between the parameter and the probability function it expresses is one-to-one, the model has statistical regularity and is referred to as regular. Otherwise, there are singularities in the parameter space and the model is referred to as singular. Due to the singularities, the Fisher information matrix is not positive definite, which means that the conventional analysis based on the Laplace approximation or on asymptotic normality does not work in singular models. Many probabilistic models, such as mixture models, hidden Markov models, and neural networks, are singular. To cope with the problem of singularities, an analysis method based on algebraic geometry has been proposed [7], and asymptotic properties of the generalization performance and of the marginal likelihood have been investigated in mixture models [8], hidden Markov models [9], neural networks [7,10], etc.
It is known that a Bayesian network with hidden nodes is singular, since its parametrization changes compared with the network without hidden nodes. Even in a simple structure such as the naive Bayesian network, the parameter space has singularities [5,11]. A method to select the optimal structure from candidate networks has been proposed using the algebraic geometrical method [5]. For general singular models, new criteria have been developed; the widely applicable information criterion (WAIC) is based on the asymptotic form of the generalization error, and the widely applicable Bayesian information criterion (WBIC) is derived from the asymptotic form of the marginal likelihood. BIC has also been extended to singular models [12].
The structure learning of Bayesian networks with hidden nodes is a widely studied problem. Observable constraints arising from a Bayesian network with hidden nodes are considered in [13]. A model based on observable conditional independence constraints is proposed in [14]. For causal discovery, the related fast causal inference (FCI) algorithm has been developed, e.g., [15]. In the present paper, we consider a two-step method; the first step obtains the optimal structure with observable nodes and the second step detects the hidden nodes in each partial graph.
Figure 1 shows the hidden-node detection. The left side of the figure describes the optimal structure with observable nodes only, obtained by some method of structure learning. Then, as the second step, we focus on the connections between the observable nodes shown on the right side of the figure. In this example, the parent node X and the child node Y have given domains. If the value of the child node is determined by only three factors, a middle node Z whose domain has three elements simplifies the conditional probability tables (CPTs). It is known that the smaller the dimension of the network parameter is, the more accurate the parameter learning becomes. It is therefore practically useful to find the simplest expression of the CPTs.
The issue comes down to the detection of a hidden node between observable nodes. We compare the two network structures shown in Figure 2. The left and the right panels are the networks without and with the hidden node, respectively, where X and Y are observable and Z is hidden. Since the evidence data on X and Y are given and there is no information on Z, we need to consider whether the hidden node exists and what its range is. We propose a method to examine whether the middle hidden node should exist or not using Bayesian clustering. In order to obtain the simplest structure, one could use a regularization technique [16], but it is not straightforward to prove that the selected structure is theoretically optimal. Our method is justified by a property of the entropy term in the asymptotic form of the marginal likelihood, which plays an essential role in the clustering. The result of the clustering shows the labels necessary to express the relation between the observable nodes X and Y. By counting the number of used labels, we can determine the existence of the hidden node. Note that we do not consider all possible structures of the network, in order to reduce the computational complexity; in the present paper, we optimize the network over a limited set of structures in which, for example, there are no multiple inserted hidden nodes and no connections between hidden nodes.
The remainder of this paper is organized as follows. Section 2 presents a formal definition of the network. Section 3 summarizes Bayesian clustering. Section 4 proposes the method to select the structure based on Bayesian clustering and derives its asymptotic behavior. Section 5 shows the results of numerical experiments validating the behavior. Finally, we present a discussion and our conclusions in Section 6 and Section 7, respectively.
2. Model Settings
In this section, the network structure and its parameterization are formalized. The naive structure has been applied to classification and clustering tasks and its mathematical properties have been studied [5], since it is expressed as a mixture model. As mentioned in the previous section, we consider a hidden node with both a parent and a child observable node. One of the simplest such networks is shown in the right panel of Figure 2. The probabilities of X, Z, and Y are defined by the distribution of the parent node and the conditional distributions of Z given X and of Y given Z. Since they are probabilities, each entry is nonnegative and each distribution sums to one. The entries of the conditional distribution of Z given X are the elements of the CPT for Z, and those of Y given Z are the elements of the CPT for Y. Let w be the parameter consisting of all of these probabilities; its dimension is determined by the sizes of X, Z, and Y.
We also define the probabilities of the network shown in the left panel of Figure 2, which consist of the distribution of X and the CPT of Y given X directly. The parameter u consists of these probabilities, and its dimension is again determined by the node sizes.
If the relation between X and Y can be simplified, the full degree of freedom of u is not necessary and is reduced to that of w, as in the case shown in Figure 1. This is similar to dimension reduction of data with sandglass-type neural networks or non-negative matrix factorization, which have fewer nodes in the middle layer than in the input and output layers. The relation between the necessary dimension of the parameter and the probability of the output is not always trivial [17]. The present paper focuses on the case that is sufficient for dimension reduction, where the dimension of w is smaller than that of u; this condition is rewritten as Equation (11). Recall that X and Y are observable and Z is hidden, where the sizes of X and Y are given and the size of Z is unknown. When the minimum size of Z is detected from the given evidence pairs of X and Y and it satisfies Equation (11), the network structure with the hidden node expresses the pairs with a smaller dimension of the parameter. We use the Bayesian clustering technique to detect this minimum size.
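As a concrete illustration of the dimension comparison, the following sketch counts the free parameters of the two structures in Figure 2 under the standard assumption that a distribution over k values contributes k - 1 free entries; the function names and example sizes are illustrative, and Equation (11) in the text is the authoritative condition.

```python
def dim_without_hidden(x_size, y_size):
    # Parameter u: the distribution of X and the CPT of Y given X.
    # A distribution over k values has k - 1 free entries.
    return (x_size - 1) + x_size * (y_size - 1)

def dim_with_hidden(x_size, z_size, y_size):
    # Parameter w: the distribution of X, the CPT of Z given X,
    # and the CPT of Y given Z.
    return (x_size - 1) + x_size * (z_size - 1) + z_size * (y_size - 1)

def hidden_node_reduces_dimension(x_size, z_size, y_size):
    # Dimension-reduction condition in the spirit of Equation (11):
    # the structure with the hidden node has fewer free parameters.
    return dim_with_hidden(x_size, z_size, y_size) < dim_without_hidden(x_size, y_size)

if __name__ == "__main__":
    # With sizes 8 and 8 for X and Y, a 3-valued middle node reduces the
    # dimension (44 < 63), while a 5-valued one does not (74 >= 63).
    print(hidden_node_reduces_dimension(8, 3, 8))   # True
    print(hidden_node_reduces_dimension(8, 5, 8))   # False
```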
3. Bayesian Clustering
In this section, let us formally introduce Bayesian clustering. Let the evidence be described by pairs of values of X and Y, and let there be n such pairs. The corresponding values of the hidden node form the label assignment of the n data. We can estimate this assignment based on its posterior probability given the evidence. In Bayesian clustering, this probability is defined by marginalizing the complete-data likelihood over the parameter with respect to a prior distribution, which has a hyperparameter. In the network with the hidden node, the complete-data likelihood factorizes according to the structure in the right panel of Figure 2. If the prior distribution is expressed as a Dirichlet distribution for each of the CPTs, the numerator of the posterior is analytically computable. Based on this relation, the Markov chain Monte Carlo (MCMC) method provides samples of the label assignment from the posterior. This is a common way to estimate hidden variables in machine learning; for example, the underlying topics are estimated with the Gibbs sampler in topic models such as latent Dirichlet allocation [18].
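As an informal sketch of how the sampling can be carried out, the following collapsed Gibbs sampler draws the label assignment for the structure in the right panel of Figure 2, assuming symmetric Dirichlet priors on the CPTs of Z given X and of Y given Z; the function and variable names, as well as the hyperparameter values, are illustrative and are not taken from the paper.

```python
import numpy as np

def collapsed_gibbs(x, y, z_size, y_size, alpha=0.5, beta=0.5,
                    n_iter=1000, rng=None):
    """Sample a label assignment for the hidden node Z.

    x, y : integer arrays of observed values (0-based).
    alpha, beta : symmetric Dirichlet hyperparameters for P(Z|X) and P(Y|Z).
    Returns the final label assignment (one MCMC sample).
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    x_size = int(np.max(x)) + 1
    z = rng.integers(z_size, size=n)        # random initial assignment
    n_xz = np.zeros((x_size, z_size))        # counts of (x, z) pairs
    n_zy = np.zeros((z_size, y_size))        # counts of (z, y) pairs
    n_z = np.zeros(z_size)                   # counts of z values
    for i in range(n):
        n_xz[x[i], z[i]] += 1
        n_zy[z[i], y[i]] += 1
        n_z[z[i]] += 1
    for _ in range(n_iter):
        for i in range(n):
            # remove data point i from the counts
            n_xz[x[i], z[i]] -= 1
            n_zy[z[i], y[i]] -= 1
            n_z[z[i]] -= 1
            # full conditional of z_i with the CPT parameters integrated out
            p = (n_xz[x[i]] + alpha) * (n_zy[:, y[i]] + beta) / (n_z + y_size * beta)
            z[i] = rng.choice(z_size, p=p / p.sum())
            # add the data point back with its new label
            n_xz[x[i], z[i]] += 1
            n_zy[z[i], y[i]] += 1
            n_z[z[i]] += 1
    return z
```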
4. Hidden Node Detection
In this section, the algorithm to detect the hidden node is introduced, and its asymptotic property of reducing the number of used labels is revealed.
4.1. The Proposed Algorithm
When the size of the middle node is large, there is no reason to have the node Z; the middle node should reduce the degree of freedom coming from X. If only a single-valued middle node satisfies Equation (11), the middle node is not necessary. Note that a single-valued middle node means that there is no edge between X and Y, which is already excluded in the structure learning.
Example 1. When the sizes of X and Y are small enough, only a middle node with a single value satisfies Equation (11), which shows that there is no hidden node between X and Y.
The present paper proposes the following algorithm to determine the existence of Z;
Algorithm 2. Assume that there is a size of the middle node satisfying Equation (11) for the given sizes of X and Y. Apply the Bayesian clustering method to the given evidence and estimate the label assignment based on the MCMC sampling. Count the number of used labels in the assignment. If this number satisfies the dimension-reduction inequality of Equation (11), the hidden node reduces the parameter.
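A minimal sketch of Algorithm 2 is given below, reusing the hypothetical collapsed_gibbs and hidden_node_reduces_dimension helpers from the earlier sketches; the dimension comparison stands in for the inequality of Equation (11), and the use of several trials mirrors the experimental setting described later.

```python
import numpy as np

def detect_hidden_node(x, y, z_size_init, y_size, n_trials=10, rng=0):
    """Decide whether a hidden node Z between X and Y reduces the parameter.

    Runs the clustering several times (the sampling depends on the initial
    assignment), counts the labels actually used, and checks the
    dimension-reduction condition with the smallest count found.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    x_size = int(np.max(x)) + 1
    used_min = z_size_init
    for t in range(n_trials):
        z = collapsed_gibbs(x, y, z_size_init, y_size, rng=rng + t)
        used_min = min(used_min, len(np.unique(z)))
    # a single used label would mean that there is no edge between X and Y,
    # a case already excluded in the structure learning
    reduces = used_min >= 2 and hidden_node_reduces_dimension(x_size, used_min, y_size)
    return used_min, reduces
```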
4.2. Asymptotic Properties of the Algorithm
The MCMC method in Bayesian clustering is based on the posterior probability of the label assignment, as shown in Section 3. Since the proposed method depends on this clustering method, let us consider the properties of this probability. The negative logarithm of the probability is expressed as a sum of terms determined by the counts of the evidence and the label assignment, where the prior distribution consists of Dirichlet distributions; the expressions involve the Kronecker delta and the gamma function, and the hyperparameter consists of the Dirichlet parameters of the three distributions. The sampling result of the label assignment is dominantly taken from the area that makes this probability large. We therefore investigate which assignment minimizes the negative log probability for the given evidence.
Theorem 3. When the number of the given data n is sufficiently large, the negative log probability of the label assignment is dominated by a term of order n with coefficient -S, and the remaining lower-order terms depend on the number of used labels, that is, the number of values of Z assigned to at least one data point.
The proof is shown in Appendix A. The first term is the dominant factor, and its coefficient S is maximized in the clustering. This coefficient determines the number of used labels in the clustering result.
Assume that the true structure with the hidden node has the minimal expression, so that the range of Z has the true minimal size, and that the size of the model is larger than the true one. We can easily confirm that Bayesian clustering chooses the minimum structure as follows. The three terms in the coefficient S correspond to the negative entropy functions of the parameters of the parent node, the hidden node, and the child node, respectively. Then, the minimum size obviously maximizes the coefficient S, since the number of elements of the parameter should be minimized to keep the entropy small. When the hidden node has a redundant state, which means that two values of Z have completely the same output distribution of Y, the second term of S is larger than in the non-redundant situation. Based on the assumption that the true structure is minimal, the estimation therefore attains the minimum structure.
According to this property, the number of used labels asymptotically goes to the true minimal size. The proposed algorithm counts the essential number of values of Z and thus provides a criterion to select the proper structure when n is large. This property has so far been established only for Bayesian clustering; the effect of eliminating redundant labels has not been found in other clustering methods such as maximum-likelihood clustering based on the expectation-maximization algorithm.
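As an informal numerical check of this argument, the sketch below compares an entropy-based score for a redundant labeling (two values of Z with the same output distribution of Y) and for the labeling that merges them, under the assumption that the relevant part of S is minus the empirical entropies of the parent, hidden, and child distributions; the exact coefficient is the one in Theorem 3, so this only illustrates the direction of the effect.

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def coefficient_S(x, z, y, x_size, z_size, y_size):
    """Minus the empirical entropies of X, of Z given X, and of Y given Z
    (an assumed stand-in for the order-n coefficient S of Theorem 3)."""
    S = -entropy(np.bincount(x, minlength=x_size))
    for i in range(x_size):
        mask = (x == i)
        if mask.any():
            S -= mask.mean() * entropy(np.bincount(z[mask], minlength=z_size))
    for k in range(z_size):
        mask = (z == k)
        if mask.any():
            S -= mask.mean() * entropy(np.bincount(y[mask], minlength=y_size))
    return S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, x_size, y_size = 5000, 2, 3
    x = rng.integers(x_size, size=n)
    # redundant labeling: labels 1 and 2 share the same output distribution of Y
    z_red = np.where(rng.random(n) < 0.5, 0, rng.integers(1, 3, size=n))
    out = {0: [0.8, 0.1, 0.1], 1: [0.1, 0.1, 0.8], 2: [0.1, 0.1, 0.8]}
    y = np.array([rng.choice(y_size, p=out[int(k)]) for k in z_red])
    z_merged = np.minimum(z_red, 1)          # merge the redundant labels 1 and 2
    print(coefficient_S(x, z_red, y, x_size, 3, y_size))
    print(coefficient_S(x, z_merged, y, x_size, 3, y_size))  # typically larger
```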
5. Numerical Experiments
In this section, we validate the asymptotic property in numerical experiments. We set the data-generating model shown in Figure 3 and prepared ten evidence data sets. There was a single parent node X. The node sizes were fixed, with the true size of the hidden node equal to three, and the CPTs are described on the right side of the figure, where the true parameter consists of these probabilities. There were 2000 pairs of X and Y in each data set. Since the condition of Equation (11) is satisfied for these sizes, the structure of the data-generating model with the hidden node had a smaller dimension of the parameter than the one without a hidden node.
We applied Bayesian clustering to each data set, where the size of the hidden node in the model was set to six. According to the asymptotic property in Theorem 3, the MCMC method should take the label assignment from the area where the number of used labels is reduced to three. The estimated model size was determined by the assignment that minimized the negative log probability. Since the sampling of the MCMC method depended on the initial assignment, we conducted ten trials for each data set and regarded the minimum over the trials as the estimated size. The number of iterations in the MCMC method was 1000.
Table 1 shows the results of the experiments.
In all data sets, the size of the hidden node Z was reduced, and the correct size was estimated in more than half of the sets; we thus confirmed the effect of eliminating the redundant labels. Since the result of the MCMC method depends on the given data, the minimum size is not always found; the estimated size is four in some data sets instead of three. Even in such cases, however, we could estimate the correct size after setting the initial size of the model to the estimated four. Repeating this procedure, we will be able to avoid a locally optimal size and find the global one.
Figure 4 shows this estimation procedure in practical cases. The initial model size starts from six. The left panel is the case where the proper size is directly found and the estimated size does not change in the subsequent runs. The right panel is the case where the estimated size is first four and the next result is three, which is the fixed point.
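The repeated procedure in Figure 4 can be written as a simple fixed-point iteration; the sketch below again relies on the hypothetical collapsed_gibbs helper and is only an illustration of the loop, not the authors' implementation.

```python
import numpy as np

def estimate_hidden_size(x, y, y_size, z_size_init=6, max_rounds=10, rng=0):
    """Repeat the clustering, shrinking the model size of Z to the number of
    used labels, until the size no longer changes (the fixed point)."""
    size = z_size_init
    for r in range(max_rounds):
        z = collapsed_gibbs(np.asarray(x), np.asarray(y), size, y_size, rng=rng + r)
        used = len(np.unique(z))
        if used == size:              # fixed point reached
            return size
        size = used                   # rerun with the reduced size
    return size
```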
To investigate the properties of the estimated size, we tried different numbers of pairs, a skewed distribution of the parent node (Figure 5), and a nearly uniform distribution of the child node (Figure 6). Table 2 shows the results for the different numbers of pairs. Since these CPTs are a straightforward case for distinguishing the role of the hidden node, the smaller numbers of pairs do not adversely affect the estimation.
Table 3 shows the results for the different CPTs of the parent and the child nodes. The number of pairs was fixed. Due to the CPT of Z, the skewed distribution of the parent node still keeps sufficient variation of Z to estimate its size, which provides the same accuracy as the uniform distribution. On the other hand, the nearly uniform distribution of the child node makes the estimation difficult because every value of Z has a similar output distribution. The Dirichlet prior of Z then has a strong effect of eliminating the redundancy, which means that the estimated sizes tend to be smaller than the true one.
6. Discussion
In this section, we discuss the difference between the proposed method and other conventional criteria for model selection. In the proposed method, the label assignment is obtained from the MCMC method, which takes samples according to its posterior probability. The numerator of this probability is the marginal likelihood of the complete data, that is, of the evidence together with the label assignment. This looks similar to criteria based on the marginal likelihood, such as BDeu [19,20], and to its asymptotic forms, such as BIC [2] and MDL [1]. However, since those criteria assume that the network has statistical regularity or that all nodes are observable, many of them do not work on the network with hidden nodes.
WBIC has been proposed for singular models. The main difference is that it is based on the marginal likelihood of the incomplete data, that is, of the observable evidence only. Due to the marginalization over the hidden values, it requires the calculation of the criterion for all candidate structures. For example, assume that we have several candidate structures whose hidden-node sizes differ. In WBIC, we calculate the value for all of them and select the optimal structure. On the other hand, in the proposed method, we calculate the label assignment with a single structure and obtain the number of used labels, which shows the necessity of the node Z.
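The contrast can be made concrete as follows, where criterion is a placeholder for an existing structure-scoring computation such as WBIC (not implemented here) and detect_hidden_node is the hypothetical helper sketched in Section 4.

```python
# Criterion-based selection: every candidate size of Z must be scored.
def select_by_criterion(x, y, candidate_sizes, criterion):
    # criterion(x, y, z_size) stands in for, e.g., a WBIC computation
    scores = {k: criterion(x, y, k) for k in candidate_sizes}
    return min(scores, key=scores.get)

# Proposed method: a single clustering run with a large enough model size
# already reveals how many values of Z are necessary.
def select_by_clustering(x, y, y_size, z_size_init):
    used, reduces = detect_hidden_node(x, y, z_size_init, y_size)
    return used if reduces else None   # None: the hidden node is unnecessary
```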
Another difference from the conventional criteria is the dominant order of the objective function that determines the optimal structure. As shown in Corollary 6.1 of [6], the negative logarithm of the marginal likelihood of the incomplete data has an asymptotic form consisting of a term of order n, whose coefficient is the empirical entropy of the observations, and a term of order log n, whose coefficient depends on the data-generating distribution, the model, and the prior distribution. This form means that the optimal model is selected by the log n order term, while in the proposed method it is selected by the n order term with the coefficient S of Theorem 3. Since the largest terms are of order n in both objective functions, the proposed method will have a stronger effect in distinguishing the difference between the structures.
The asymptotic accuracy of Bayesian clustering has been studied in [21], which considers the error function between the true distribution of the label assignment and the estimated one, measured by the Kullback-Leibler divergence, where the expectation is taken over all evidence data and the true distribution is defined by the true network. The proposed method minimizes this error function, which means that the label assignment is optimized in the sense of density estimation. Even though the optimized function is not directly designed for model selection, due to the asymptotic property of Bayesian clustering simplifying the label use, the proposed method is computationally efficient for determining the existence of the hidden node, and its result asymptotically coincides with the proper structure.