Article

Consideration on Singularities in Learning Theory and the Learning Coefficient

Miki Aoyagi
Department of Mathematics, College of Science & Technology, Nihon University, 1-8-14, Surugadai, Kanda, Chiyoda-ku, Tokyo 101-8308, Japan
Entropy 2013, 15(9), 3714-3733; https://doi.org/10.3390/e15093714
Submission received: 21 June 2013 / Revised: 29 August 2013 / Accepted: 30 August 2013 / Published: 6 September 2013
(This article belongs to the Special Issue The Information Bottleneck Method)

Abstract

We consider the learning coefficients in learning theory and give two new methods for obtaining these coefficients in a homogeneous case: a method for finding a deepest singular point and a method to add variables. In application to Vandermonde matrix-type singularities, we show that these methods are effective. The learning coefficient of the generalization error in Bayesian estimation serves to measure the learning efficiency in singular learning models. Mathematically, the learning coefficient corresponds to a real log canonical threshold of singularities for the Kullback functions (relative entropy) in learning theory.

1. Introduction

The purpose of a learning system is to estimate an unknown true density function (a probability model) that generates the data. Real data associated with, for example, genetic analysis, data mining, image or speech recognition, artificial intelligence, robot control and time series prediction are very complicated and are usually not generated by a simple normal distribution. In Bayesian estimation, we set a learning model that is written in probabilistic form with parameters, and our goal is to estimate the true density function by a predictive function constructed from the learning model and such data. Therefore, the learning model should be rich enough to capture the structure of the true density function. Hierarchical learning models, such as the layered neural network, the Boltzmann machine, reduced rank regression and the normal mixture model, are known to be effective learning models for analyzing such data. These are, however, singular learning models, which cannot be analyzed using the classic theory of regular statistical models, because singular learning models have a singular Fisher metric that is not always approximated by any quadratic form [1,2,3,4]. Therefore, it is difficult to analyze their generalization errors, which indicate how precisely the predictive function approximates the true density function.
In recent studies, Watanabe showed, using algebraic geometry, that the generalization and training errors are subject to a universal law and defined the model selection method "widely applicable information criterion" (WAIC) as a generalized Akaike information criterion (AIC) [5,6,7,8,9]. WAIC can be applied even to singular learning models, whereas AIC cannot. Using the WAIC, we can estimate the generalization errors from the training errors without any knowledge of the true probability density functions. The generalization errors relate to the generalization losses via the entropy of the true distribution. Thus, we can select a suitable model from among several statistical models by this method.
Computing the WAIC requires the values of the learning coefficient and the singular fluctuation, which are both birational invariants. Mathematically, the learning coefficient is the log canonical threshold (Definition 1) of the Kullback function (relative entropy), and the singular fluctuation is known as a statistically generalized log canonical threshold, which is obtained theoretically from the learning coefficient (Equation (1) in Section 2). These values can be obtained by Hironaka's Theorem (Appendix A). However, it is still difficult to obtain them within learning theory for several reasons, such as degeneration with respect to their Newton polyhedra and the non-isolation of their singularities [10]. Moreover, in algebraic geometry and algebraic analysis, such studies are usually carried out over an algebraically closed field [11,12], and many differences exist between the real and complex cases. For example, log canonical thresholds over the complex field are at most one, whereas those over the real field are not necessarily so. We, therefore, cannot apply results over an algebraically closed field to our situation directly (Appendix B). Obtaining the learning coefficients and the singular fluctuation is one of the bottlenecks in learning theory.
In this paper, we consider the learning coefficient of "Vandermonde matrix-type singularities" in statistical learning theory. We focus on these singularities because the Vandermonde matrix type is generic and essential in learning theory. Their log canonical thresholds give the learning coefficients of normal mixture models, three-layered neural networks and mixtures of binomial distributions, which are widely used as effective learning models (Sections 3.1 and 3.2 and [13]). Moreover, we prove Theorem 2 (the method for finding a deepest singular point) and Theorem 3 (the method to add variables), which are very useful for obtaining log canonical thresholds in the homogeneous case. Theorem 2 identifies a point of the singular locus that attains the log canonical threshold and, therefore, reduces the number of blowup processes required. Theorem 3 improves our recursive blowup method by simplifying coordinate changes with added variables. These two theorems enable us to obtain new bounds for the log canonical thresholds of Vandermonde matrix-type singularities in Theorem 5. These bounds are much tighter than those in [14].
In the past few years, we have obtained the learning coefficients for reduced rank regression [15], for the three-layered neural network with one input unit and one output unit [16,17] and for normal mixture models of dimension one [18]. The paper [14] derived bounds on the learning coefficients for Vandermonde matrix-type singularities and explicit values under some conditions. The learning coefficients for the restricted Boltzmann machine [19] have also been considered recently. References [20,21] obtained them for naive Bayesian networks, and [22] for directed tree models with hidden variables. These results give partial answers for the learning coefficients.
The rest of the paper consists of three sections. Section 2 summarizes the framework of Bayesian learning models. In Section 3, we demonstrate our main theorems and consider the log canonical threshold of Vandermonde matrix-type singularities (Definition 3). We finish with our conclusions in Section 4.

2. Learning Coefficients and Singular Fluctuations

In this section, we present the theory of learning coefficients and singular fluctuations. Let $q(x)$ be a true probability density function of variables $x \in \mathbb{R}^N$, and let $x^n := \{x_i\}_{i=1}^n$ be $n$ training samples selected independently and identically from $q(x)$. Consider a learning model written in probabilistic form as $p(x|w)$, where $w \in W \subset \mathbb{R}^d$ is a parameter. The purpose of the learning system is to estimate $q(x)$ from $x^n$ using $p(x|w)$. Let $\psi(w)$ be an a priori probability density function on the parameter set $W$, and let $p(w|x^n)$ be the a posteriori probability density function:
$$p(w|x^n) = \frac{1}{Z_n}\,\psi(w)\prod_{i=1}^n p(x_i|w)$$
where:
$$Z_n = \int_W \psi(w)\prod_{i=1}^n p(x_i|w)\,dw$$
For the inverse temperature $\beta > 0$, let us define:
$$E_w[f(w)] = \frac{\int f(w)\,\psi(w)\prod_{i=1}^n p(x_i|w)^{\beta}\,dw}{\int \psi(w)\prod_{i=1}^n p(x_i|w)^{\beta}\,dw}$$
We usually set $\beta = 1$.
We then have a predictive density function, $p(x|x^n) = E_w[p(x|w)]$, which is the average inference of the Bayesian density function.
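As a concrete illustration of these definitions (our sketch, not part of the paper), the following code evaluates $Z_n$, the posterior $p(w|x^n)$ and the predictive density $E_w[p(x|w)]$ on a grid for an assumed one-parameter model $p(x|w) = N(x; w, 1)$ with prior $\psi(w) = N(w; 0, 10)$; the model, prior and data are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical setup (not from the paper): p(x|w) = N(x; w, 1), psi(w) = N(w; 0, 10).
rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=50)          # n training samples from some true q(x)
w_grid = np.linspace(-5.0, 5.0, 2001)      # grid over the parameter set W
dw = w_grid[1] - w_grid[0]

log_prior = -0.5 * w_grid**2 / 10.0 - 0.5 * np.log(2 * np.pi * 10.0)
log_lik = np.array([np.sum(-0.5 * (x - w)**2 - 0.5 * np.log(2 * np.pi)) for w in w_grid])
log_joint = log_prior + log_lik

# Z_n = int_W psi(w) prod_i p(x_i|w) dw, approximated by a Riemann sum
m = log_joint.max()
Z_n = np.exp(m) * np.sum(np.exp(log_joint - m)) * dw
posterior = np.exp(log_joint - m) / (np.sum(np.exp(log_joint - m)) * dw)

# Predictive density p(x|x^n) = E_w[p(x|w)] at a few test points
x_test = np.array([-1.0, 0.0, 1.0])
pred = [np.sum(np.exp(-0.5 * (xt - w_grid)**2) / np.sqrt(2 * np.pi) * posterior) * dw
        for xt in x_test]
print(Z_n, pred)
```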
We next introduce the Kullback function, $K(q\|p)$, and the empirical Kullback function, $K_n(q\|p)$, for density functions $p(x)$, $q(x)$:
$$K(q\|p) = \int q(x)\log\frac{q(x)}{p(x)}\,dx$$
$$K_n(q\|p) = \frac{1}{n}\sum_{i=1}^n \log\frac{q(x_i)}{p(x_i)}$$
The function $K(q\|p)$ is always non-negative and satisfies $K(q\|p) = 0$ if and only if $q(x) = p(x)$.
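For a quick numerical check (our illustration, with assumed densities), $K(q\|p)$ has the closed form $(\mu_q-\mu_p)^2/2$ when $q$ and $p$ are normal densities with unit variance, and the empirical Kullback function $K_n(q\|p)$ computed from samples of $q$ approaches it as $n$ grows:

```python
import numpy as np

# Assumed densities for illustration: q = N(0, 1), p = N(1, 1).
mu_q, mu_p = 0.0, 1.0
K = 0.5 * (mu_q - mu_p)**2                 # closed-form K(q||p) = 0.5

rng = np.random.default_rng(1)
x = rng.normal(mu_q, 1.0, size=100000)     # samples from q
log_q = -0.5 * (x - mu_q)**2
log_p = -0.5 * (x - mu_p)**2
K_n = np.mean(log_q - log_p)               # empirical Kullback function
print(K, K_n)                              # K_n is close to 0.5 for large n
```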
The Bayesian generalization error, $B_g$, the Bayesian training error, $B_t$, the Gibbs generalization error, $G_g$, and the Gibbs training error, $G_t$, are defined as follows:
$$B_g = K\bigl(q(x)\,\|\,E_w[p(x|w)]\bigr)$$
$$B_t = K_n\bigl(q(x)\,\|\,E_w[p(x|w)]\bigr)$$
$$G_g = E_w\bigl[K(q(x)\,\|\,p(x|w))\bigr]$$
and
$$G_t = E_w\bigl[K_n(q(x)\,\|\,p(x|w))\bigr]$$
The most important of these is the Bayesian generalization error. This error describes how precisely the predictive function approximates the true density function.
Watanabe [6,7,23] proved the following four relations:
$$E[B_g] = \frac{\lambda + \nu\beta - \nu}{n\beta} + o\!\left(\frac{1}{n}\right)$$
$$E[B_t] = \frac{\lambda - \nu\beta - \nu}{n\beta} + o\!\left(\frac{1}{n}\right)$$
$$E[G_g] = \frac{\lambda + \nu\beta}{n\beta} + o\!\left(\frac{1}{n}\right)$$
$$E[G_t] = \frac{\lambda - \nu\beta}{n\beta} + o\!\left(\frac{1}{n}\right)$$
Thus we have:
$$E[B_g] = E[B_t] + 2\beta\bigl(E[G_t] - E[B_t]\bigr) + o\!\left(\frac{1}{n}\right)$$
and
$$E[G_g] = E[G_t] + 2\beta\bigl(E[G_t] - E[B_t]\bigr) + o\!\left(\frac{1}{n}\right)$$
Eliminating the terms that involve the true probability density function from the above four errors and setting:
$$BL_g = -\int q(x)\log E_w[p(x|w)]\,dx$$
$$BL_t = -\frac{1}{n}\sum_{i=1}^n \log E_w[p(x_i|w)]$$
$$GL_g = -E_w\!\left[\int q(x)\log p(x|w)\,dx\right]$$
$$GL_t = -E_w\!\left[\frac{1}{n}\sum_{i=1}^n \log p(x_i|w)\right]$$
we then have:
$$E[BL_g] = E[BL_t] + 2\beta\bigl(E[G_t] - E[B_t]\bigr) + o\!\left(\frac{1}{n}\right)$$
and
$$E[GL_g] = E[GL_t] + 2\beta\bigl(E[G_t] - E[B_t]\bigr) + o\!\left(\frac{1}{n}\right)$$
These two equations constitute the WAIC and show that we can estimate the Bayesian and Gibbs generalization errors from the Bayesian and Gibbs training errors without any knowledge of the true probability density functions. Training errors are calculated from the training samples, $x_i$, using a learning model, $p$. In real applications or experiments, we usually do not know the true distribution, but only the values of the training errors; our purpose is to estimate the true distribution from the training samples, so these relations are practically important. We can select a suitable model from among several statistical models by observing these values.
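The sketch below (ours, not from the paper) shows how the first of these relations is used in practice with $\beta = 1$: since $G_t - B_t = GL_t - BL_t$, the correction term is computable from posterior samples alone. The toy model, prior and sampler are assumptions made only for illustration; in a real application the samples would come from MCMC.

```python
import numpy as np

# Hypothetical model p(x|w) = N(x; w, 1) with prior N(0, 10); beta = 1.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(0.3, 1.0, size=n)

# This toy posterior is Gaussian and can be sampled exactly; normally one would use MCMC.
post_var = 1.0 / (n + 1.0 / 10.0)
post_mean = post_var * x.sum()
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=5000)

# log p(x_i | w_s) for every data point i and posterior sample s
log_p = -0.5 * (x[:, None] - w_samples[None, :])**2 - 0.5 * np.log(2 * np.pi)

BL_t = -np.mean(np.log(np.mean(np.exp(log_p), axis=1)))   # Bayes training loss
GL_t = -np.mean(np.mean(log_p, axis=1))                   # Gibbs training loss
beta = 1.0
BL_g_est = BL_t + 2 * beta * (GL_t - BL_t)                # estimate of E[BL_g] (WAIC-type)
print(BL_t, GL_t, BL_g_est)
```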
Let λ denote a learning coefficient and ν a singular fluctuation, both of which are birational invariants. Mathematically, λ is equal to the log canonical threshold introduced in Definition 1 and Appendix B. For regular models, λ = ν = d / 2 holds, where d is the dimension of the parameter space.
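As a routine check of these relations in the regular case (our substitution, not a result of the paper), putting $\lambda = \nu = d/2$ and $\beta = 1$ into the four relations above gives
$$E[B_g] = \frac{d}{2n}+o\!\Bigl(\frac{1}{n}\Bigr),\quad E[B_t] = -\frac{d}{2n}+o\!\Bigl(\frac{1}{n}\Bigr),\quad E[G_g] = \frac{d}{n}+o\!\Bigl(\frac{1}{n}\Bigr),\quad E[G_t] = o\!\Bigl(\frac{1}{n}\Bigr),$$
which reproduces the familiar $d/(2n)$ asymptotics underlying AIC.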
The difference between the Gibbs and Bayesian training errors, scaled by $n\beta$, converges to $\nu$:
$$n\beta\bigl(E[G_t]-E[B_t]\bigr) \to \nu$$
These relations were shown using the resolution of singularities and the theory of Schwartz distributions.
From the learning coefficient, $\lambda$, and its order, $\theta$, the value $\nu$ is obtained theoretically as follows. Let $\xi(u)$ be an empirical process defined on the manifold obtained by a resolution of singularities, and let $u^*$ denote the sum of local coordinates that attain the minimum $\lambda$ and the maximum $\theta$. We then have:
$$\nu = \frac{1}{2}\,E_\xi\!\left[\frac{\int_0^\infty dt\int_{u^*}du\,\xi(u)\,t^{\lambda-1/2}\,e^{-\beta t+\beta\sqrt{t}\,\xi(u)}}{\int_0^\infty dt\int_{u^*}du\,t^{\lambda-1/2}\,e^{-\beta t+\beta\sqrt{t}\,\xi(u)}}\right]$$
Here, $\xi(u)$ is a random variable of a Gaussian process with mean zero and variance two. Our purpose in this paper is to obtain $\lambda$.
To achieve this aim, we use the desingularization approach from algebraic geometry (cf. Appendix A). Obtaining a desingularization of the Kullback functions is a new problem in algebraic geometry, because their singularities are very complicated and, as such, most of them have not yet been investigated.

3. Main Theorems and Vandermonde Matrix-Type Singularities

We denote constants, such as $a^*$, $b^*$ and $w^*$, by the suffix $*$. Additionally, for simplicity, we use the notation $w = \{a_{ki}, b_{ij}\}_{1\le i\le H}$ instead of $w = \{a_{ki}, b_{ij}\}_{1\le k\le M,\,1\le i\le H,\,1\le j\le N}$, because we always have $1\le k\le M$ and $1\le j\le N$ in this paper.
Define the norm of a matrix $C = (c_{ij})$ by $\|C\| = \sqrt{\sum_{i,j}|c_{ij}|^2}$. Set $\mathbb{N}_{+0} = \mathbb{N}\cup\{0\}$.
Definition 1 
For a real analytic function $f$ in a neighborhood $U$ of $w^*$ and a $C^\infty$ function $\psi$ with compact support, let $\lambda_{w^*}(f,\psi)$ be the largest pole of $\int_U |f|^z\,\psi\,dw$ and $\theta_{w^*}(f,\psi)$ its order. If $\psi(w^*)\ne 0$, then we write $\lambda_{w^*}(f) = \lambda_{w^*}(f,\psi)$ and $\theta_{w^*}(f) = \theta_{w^*}(f,\psi)$, because the log canonical threshold and its order are then independent of $\psi$.
Definition 2 
Fix $Q\in\mathbb{N}$. Define
$$[b_1^*, b_2^*, \ldots, b_N^*]_Q = \gamma_i\,(0,\ldots,0,b_i^*,\ldots,b_N^*) \quad \text{if } b_1^* = \cdots = b_{i-1}^* = 0,\ b_i^*\ne 0,$$
where
$$\gamma_i = \begin{cases} 1, & \text{if } Q \text{ is odd},\\ |b_i^*|/b_i^*, & \text{if } Q \text{ is even}.\end{cases}$$
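For example (our illustration), with $N = 3$ and the vector $(0,-2,3)$, the first nonzero entry is $b_2^* = -2$, so
$$[0,-2,3]_Q = \begin{cases} (0,-2,3), & Q\ \text{odd}\ (\gamma_2 = 1),\\ (0,2,-3), & Q\ \text{even}\ (\gamma_2 = |-2|/(-2) = -1).\end{cases}$$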
Definition 3 
Fix $Q\in\mathbb{N}$.
Let
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1H} & a^*_{1,H+1} & \cdots & a^*_{1,H+r}\\ a_{21} & \cdots & a_{2H} & a^*_{2,H+1} & \cdots & a^*_{2,H+r}\\ \vdots & & \vdots & \vdots & & \vdots\\ a_{M1} & \cdots & a_{MH} & a^*_{M,H+1} & \cdots & a^*_{M,H+r}\end{pmatrix}, \qquad I = (\ell_1,\ldots,\ell_N)\in\mathbb{N}_{+0}^N,$$
$$B_I = \Bigl(\prod_{j=1}^N b_{1j}^{\ell_j},\ \prod_{j=1}^N b_{2j}^{\ell_j},\ \ldots,\ \prod_{j=1}^N b_{Hj}^{\ell_j},\ \prod_{j=1}^N b_{H+1,j}^{*\,\ell_j},\ \ldots,\ \prod_{j=1}^N b_{H+r,j}^{*\,\ell_j}\Bigr)^t$$
and
$$B = (B_I)_{\ell_1+\cdots+\ell_N = Qn+1,\ 0\le n\le H+r-1} = \bigl(B_{(1,0,\ldots,0)}, B_{(0,1,\ldots,0)}, \ldots, B_{(0,0,\ldots,1)}, B_{(1+Q,0,\ldots,0)}, \ldots\bigr)$$
(the superscript $t$ denotes matrix transposition).
Here, $a_{ki}$ and $b_{ij}$ ($1\le k\le M$, $1\le i\le H$, $1\le j\le N$) are variables in a neighborhood of $a^*_{ki}$ and $b^*_{ij}$, where $a^*_{ki}$ and $b^*_{ij}$ are fixed constants.
Let $\mathcal{I}$ be the ideal generated by the entries of $AB$.
We call the singularities of $\mathcal{I}$ Vandermonde matrix-type singularities.
To simplify, we usually assume that
$$(a^*_{1,H+j}, a^*_{2,H+j},\ldots,a^*_{M,H+j})^t \ne 0, \qquad (b^*_{H+j,1}, b^*_{H+j,2},\ldots,b^*_{H+j,N}) \ne 0$$
for $1\le j\le r$, and
$$[b^*_{H+j,1}, b^*_{H+j,2},\ldots,b^*_{H+j,N}]_Q \ne [b^*_{H+j',1}, b^*_{H+j',2},\ldots,b^*_{H+j',N}]_Q$$
for $j\ne j'$.
Example 1 
If $N = Q = 1$ and $r = 0$, then we have
$$B = \begin{pmatrix} b_{11} & b_{11}^2 & \cdots & b_{11}^H\\ b_{21} & b_{21}^2 & \cdots & b_{21}^H\\ \vdots & \vdots & & \vdots\\ b_{H1} & b_{H1}^2 & \cdots & b_{H1}^H\end{pmatrix}.$$
This matrix is a Vandermonde matrix.
Example 2 
If $Q = 1$, $M = 1$, $H = 2$, $N = 2$ and $r = 1$, then we have $A = (a_{11}\ a_{12}\ a^*_{1,3})$ and
$$B = \begin{pmatrix} b_{11} & b_{12} & b_{11}^2 & b_{11}b_{12} & b_{12}^2 & b_{11}^3 & b_{11}b_{12}^2 & b_{11}^2 b_{12} & b_{12}^3\\ b_{21} & b_{22} & b_{21}^2 & b_{21}b_{22} & b_{22}^2 & b_{21}^3 & b_{21}b_{22}^2 & b_{21}^2 b_{22} & b_{22}^3\\ b_{31}^* & b_{32}^* & b_{31}^{*2} & b_{31}^* b_{32}^* & b_{32}^{*2} & b_{31}^{*3} & b_{31}^* b_{32}^{*2} & b_{31}^{*2} b_{32}^* & b_{32}^{*3}\end{pmatrix}.$$
In this paper, we denote
$$A_{M,H} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1H}\\ a_{21} & a_{22} & \cdots & a_{2H}\\ \vdots & & & \vdots\\ a_{M1} & a_{M2} & \cdots & a_{MH}\end{pmatrix}, \qquad B_{H,N,I} = \begin{pmatrix} \prod_{j=1}^N b_{1j}^{\ell_j}\\ \prod_{j=1}^N b_{2j}^{\ell_j}\\ \vdots\\ \prod_{j=1}^N b_{Hj}^{\ell_j}\end{pmatrix} \quad\text{and}$$
$$B_{H,N}(Q) = (B_{H,N,I})_{\ell_1+\cdots+\ell_N = Qn+1,\ 0\le n\le H-1}.$$
Furthermore, we denote $a^* = \begin{pmatrix} a^*_{1,H+1}\\ \vdots\\ a^*_{M,H+1}\end{pmatrix}$ and
$$(A_{M,H}, a^*) = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1H} & a^*_{1,H+1}\\ a_{21} & a_{22} & \cdots & a_{2H} & a^*_{2,H+1}\\ \vdots & & & \vdots & \vdots\\ a_{M1} & a_{M2} & \cdots & a_{MH} & a^*_{M,H+1}\end{pmatrix}.$$
Theorem 1 
([18]) Consider a sufficiently small neighborhood $U$ of
$$w^* = \{a^*_{ki}, b^*_{ij}\}_{1\le i\le H}$$
and variables $w = \{a_{ki}, b_{ij}\}_{1\le i\le H}$ in the set $U$.
Set $(b^{**}_{01}, b^{**}_{02},\ldots,b^{**}_{0N}) = (0,\ldots,0)$.
Let $(b^{**}_{11}, b^{**}_{12},\ldots,b^{**}_{1N})$, \ldots, $(b^{**}_{r'1}, b^{**}_{r'2},\ldots,b^{**}_{r'N})$ be the distinct real vectors in
$$\{[b^*_{i1}, b^*_{i2},\ldots,b^*_{iN}]_Q \ne 0,\ \text{for}\ i = 1,\ldots,H+r\}.$$
That is,
$$\{(b^{**}_{11},\ldots,b^{**}_{1N}),\ldots,(b^{**}_{r'1},\ldots,b^{**}_{r'N})\} = \{[b^*_{i1},\ldots,b^*_{iN}]_Q \ne 0,\ i = 1,\ldots,H+r\}.$$
Then, $r'$ is uniquely determined, and $r'\ge r$ by the assumption in Definition 3. Set $(b^{**}_{i1},\ldots,b^{**}_{iN}) = [b^*_{H+i,1},\ldots,b^*_{H+i,N}]_Q$ for $1\le i\le r$.
Assume that
$$[b^*_{i1},\ldots,b^*_{iN}]_Q = \begin{cases} 0, & 1\le i\le H_0,\\ (b^{**}_{11},\ldots,b^{**}_{1N}), & H_0+1\le i\le H_0+H_1,\\ (b^{**}_{21},\ldots,b^{**}_{2N}), & H_0+H_1+1\le i\le H_0+H_1+H_2,\\ \quad\vdots & \\ (b^{**}_{r'1},\ldots,b^{**}_{r'N}), & H_0+\cdots+H_{r'-1}+1\le i\le H_0+\cdots+H_{r'},\end{cases}$$
and $H_0+\cdots+H_{r'} = H$.
We then have:
$$\lambda_{w^*}(\|AB\|^2) = \frac{Mr}{2} + \lambda_{w_1^{(0)*}}\bigl(\|A_{M,H_0}B_{H_0,N}(Q)\|^2\bigr) + \sum_{\alpha=1}^{r}\lambda_{w_1^{(\alpha)*}}\bigl(\|(A_{M,H_\alpha-1}, a^{(\alpha)*})B_{H_\alpha,N}(1)\|^2\bigr) + \sum_{\alpha=r+1}^{r'}\lambda_{w_1^{(\alpha)*}}\bigl(\|A_{M,H_\alpha-1}B_{H_\alpha-1,N}(1)\|^2\bigr)$$
where $w_1^{(0)*} = \{a^*_{k,i}, 0\}_{1\le i\le H_0}$, $w_1^{(\alpha)*} = \{a^*_{k,H_0+\cdots+H_{\alpha-1}+i}, 0\}_{2\le i\le H_\alpha}$ and $a^{(\alpha)*} = \begin{pmatrix} a^*_{1,H+\alpha}\\ \vdots\\ a^*_{M,H+\alpha}\end{pmatrix}$ for $\alpha\ge 1$.
Theorem 2 (Method for finding a deepest singular point) 
Let $f_1(w_1,\ldots,w_d)$, \ldots, $f_m(w_1,\ldots,w_d)$ be homogeneous functions of $w_1,\ldots,w_j$ ($j\le d$), with $f_i$ of degree $n_i$ in $w_1,\ldots,w_j$. Furthermore, let $\psi$ be a $C^\infty$ function such that $\psi(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)\ge\psi(w^*_1,\ldots,w^*_d)$ and $\psi(w)$ is homogeneous in $w_1,\ldots,w_j$ in a small neighborhood of $(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)$.
Then, we have:
$$\lambda_{(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)}(f_1^2+\cdots+f_m^2,\psi) \le \lambda_{(w^*_1,\ldots,w^*_j,w^*_{j+1},\ldots,w^*_d)}(f_1^2+\cdots+f_m^2,\psi)$$
(Proof)
Let $d'$ be the degree of homogeneity of $\psi$ in $w_1,\ldots,w_j$ in a neighborhood of $(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)$. Let us construct the blowup of $f_1,\ldots,f_m$ along the submanifold $\{v=0,\ w_i=0,\ 1\le i\le j\}$. Let $w_i = v w'_i$ for $1\le i\le j$. We have $f_i(w_1,\ldots,w_d) = v^{n_i}f_i(w'_1,\ldots,w'_j,w_{j+1},\ldots,w_d)$ and
$$\int\bigl(f_1(w)^2+f_2(w)^2+\cdots+f_m(w)^2\bigr)^z\,\psi\,dw\,dv = \int\bigl(v^{2n_1}f_1(w')^2+v^{2n_2}f_2(w')^2+\cdots+v^{2n_m}f_m(w')^2\bigr)^z\,\psi(w')\,v^{d'+j}\,dw'\,dv.$$
Because $v^{2n_i}f_i(w')^2 \le f_i(w')^2$ for $|v|<1$, we have $v^{2n_1}f_1^2+\cdots+v^{2n_m}f_m^2 \le f_1^2+\cdots+f_m^2$ and, hence, by Lemma 1 in Appendix C:
$$\lambda_{(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)}(f_1^2+\cdots+f_m^2,\psi) \le \min\{d'+j+1,\ \lambda_{w^*}(f_1^2+\cdots+f_m^2,\psi)\}$$
Furthermore, we consider the construction of the blowup of $f_1,\ldots,f_m$ along the submanifold $\{w_i=0,\ 1\le i\le j\}$, for which we have
$$\lambda_{(0,\ldots,0,w^*_{j+1},\ldots,w^*_d)}(f_1^2+\cdots+f_m^2,\psi) \le d'+j. \qquad \text{Q.E.D.}$$
In general, it is not true that
$$\lambda_{w^0}(f_1^2+\cdots+f_m^2,\psi) \le \lambda_{w^*}(f_1^2+\cdots+f_m^2,\psi)$$
even if $w^0\in\mathbb{R}^d$ satisfies
$$f_i(w^0) = \frac{\partial f_i}{\partial w_j}(w^0) = 0, \qquad 1\le i\le m,\ 1\le j\le d.$$
Example 3 
Let $f_1 = x(x-1)^2$, $f_2 = (y^2+(x-1)^2)((y-1)^6+x)$ and $f_3 = (z^2+(x-1)^2)((z-1)^6+x)$. Then, we have $f_1 = f_2 = f_3 = \frac{\partial f_1}{\partial x} = \frac{\partial f_2}{\partial y} = \frac{\partial f_2}{\partial x} = \frac{\partial f_3}{\partial z} = \frac{\partial f_3}{\partial x} = 0$ if and only if $x = 1$, $y = 0$, $z = 0$.
In this case, we have $\lambda_{(1,0,0)}(f_1^2+f_2^2+f_3^2) = 1/4+1/4+1/4 > \lambda_{(0,1,1)}(f_1^2+f_2^2+f_3^2) = 1/2+1/12+1/12$.
Theorem 3 (Method to add variables) 
Let $f_1(w_1,\ldots,w_d)$, \ldots, $f_m(w_1,\ldots,w_d)$ be homogeneous functions of $w_1,\ldots,w_d$, with $f_i$ of degree $n_i$ in $w_1,\ldots,w_d$. Set $f_1'(w_2,\ldots,w_d) = f_1(1,w_2,\ldots,w_d)$, \ldots, $f_m'(w_2,\ldots,w_d) = f_m(1,w_2,\ldots,w_d)$. If $w_1^*\ne 0$, then we have:
$$\lambda_{(w_1^*,\ldots,w_d^*)}(f_1^2+\cdots+f_m^2) = \lambda_{(w_2^*/w_1^*,\ldots,w_d^*/w_1^*)}(f_1'^2+\cdots+f_m'^2)$$
(Proof) Set $w_2' = w_2/w_1,\ \ldots,\ w_d' = w_d/w_1$. Then, we have:
$$f_i(w_1,w_2,\ldots,w_d) = w_1^{n_i}f_i(1,w_2',\ldots,w_d') = w_1^{n_i}f_i'$$
Since $w_1^{n_i}\ne 0$ on a small neighborhood of $w_1^*$, there exist positive real numbers $C$, $C'$, such that:
$$C\,(f_1'^2+\cdots+f_m'^2) \le f_1^2+\cdots+f_m^2 \le C'\,(f_1'^2+\cdots+f_m'^2)$$
This completes the proof by Lemma 1 in Appendix C. Q.E.D.
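As a simple illustration of Theorem 3 (our example), take $f_1(w_1,w_2) = w_1w_2$, which is homogeneous of degree two, and a point with $w_1^*\ne 0$ and $w_2^* = 0$. Setting $f_1'(w_2) = f_1(1,w_2) = w_2$ gives
$$\lambda_{(w_1^*,0)}\bigl((w_1w_2)^2\bigr) = \lambda_0\bigl(w_2^2\bigr) = \frac{1}{2},$$
in agreement with a direct computation, since $w_1$ is a nonvanishing factor near $w_1^*$.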
Remark 1
The above theorem shows that, in the homogeneous case, we can exchange a nonzero constant and a variable without changing the log canonical threshold. In general, however, this is not true.
(1)
Consider the function $f(x,y) = (y-1+x^2)^2$. We have $\lambda_{f=0}(f) = 1/2$, whereas $\lambda_{f(x,1)=0}(f(x,1)) = 1/4$.
(2)
Consider the function $f(x_1,x_2,x_3,x_4,x_5,y) = (x_1^2+x_2^2+x_3^2+x_4^2+x_5^2+y-1)^2$. We have $\lambda_{f=0}(f) = 1/2$, whereas $\lambda_0(f(x_1,x_2,x_3,x_4,x_5,1)) = 5/4$.
The second example shows that the following theorem over the complex field is not true over the real field.
Theorem 4 [11]
Let $f(x_1,\ldots,x_d,y)$ be a holomorphic function near zero, and for a hyperplane $H$, let $g = f|_{y=0}$ (or $g = f|_H$) denote the restriction of $f$ to $y = 0$ (or $H$). Then, $\lambda_{g=0}(g) \le \lambda_{f=0}(f)$.
Define $\binom{k}{l} = \frac{k!}{l!(k-l)!}$.
Theorem 5 
We use the same notation as in Theorem 1. Let:
$$\mathrm{bound}_1 = \min\left\{\frac{(H-i+1)N + d_i(s) + d_i'(s) + d_i''(s)}{2(\mathrm{count}(i,s,k(s))-1)Q+2} : 1\le i\le s,\ 1\le k(1),\ldots,k(s)\le N,\ 1\le s\le H\right\}$$
where $\mathrm{count}(i,s,j) = \#\{i_1 : i\le i_1\le s,\ k(i_1)=j\}$, $C(i,s) = \#\{j : \mathrm{count}(i,s,j)=0,\ 1\le j\le N\}$,
$$d_i(s) = (N-1)Q\sum_{s_1=i}^{s}\bigl(\mathrm{count}(i,s_1,k(s_1))-1\bigr),$$
$$d_i'(s) = M(i-1)\bigl\{(\mathrm{count}(i,s,k(s))-1)Q+1\bigr\} + QM\!\!\!\sum_{\substack{s_1=i\\ \mathrm{count}(i,s,k(s))>\mathrm{count}(i,s_1,k(s_1))}}^{s-1}\!\!\!\bigl(\mathrm{count}(i,s,k(s))-\mathrm{count}(i,s_1,k(s_1))\bigr),$$
$$d_i''(s) = \begin{cases} 0, & \text{if } \mathrm{count}(i,s,k(s))=1,\\ (H-s)\{C(i,s)Q+(N-1)Q(\mathrm{count}(i,s,k(s))-2)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ N-1\le M,\\ (H-s)\{C(i,s)Q+MQ(\mathrm{count}(i,s,k(s))-2)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ C(i,s)\le M<N-1,\\ (H-s)\{MQ(\mathrm{count}(i,s,k(s))-1)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ M\le C(i,s).\end{cases}$$
Furthermore, let
$$\mathrm{bound}_2 = \frac{NH + \sum_{i=0}^{k-1}MQ(k-i)\binom{N+Qi}{N-1}}{2+2Qk}, \qquad\text{where}\quad k = \max\Bigl\{i\in\mathbb{Z} \;;\; NH \ge M\sum_{i'=0}^{i-1}(1+Qi')\binom{N+Qi'}{N-1}\Bigr\},$$
and let
$$\mathrm{bound}_3 = \frac{MH}{2}.$$
We have
$$\lambda_0\bigl(\|A_{M,H}B_{H,N}(Q)\|^2\bigr) \le \min\{\mathrm{bound}_1,\mathrm{bound}_2,\mathrm{bound}_3\}, \qquad \lambda_0\bigl(\|(A_{M,H-1},a^*)B_{H,N}(Q)\|^2\bigr) \le \min\{\mathrm{bound}_1,\mathrm{bound}_2\}.$$
The proof appears in Appendix C.
Figure 1a–d show the values of the new bounds, $\min\{\mathrm{bound}_1,\mathrm{bound}_2,\mathrm{bound}_3\}$, for (a) $H=7$, $N=6$; (b) $H=8$, $N=6$; (c) $H=7$, $N=7$; and (d) $H=8$, $N=7$, with $Q=2$. We compare these values with those obtained in the past work [14]. In the figures, the horizontal axis is the number $M$ and the vertical axis is the value of the bounds; the dashed lines indicate the bounds obtained in the past work. These figures show that the new bounds are never greater than the old ones.
Figure 1. The values of new bounds, min { bound 1 , bound 2 , bound 3 } , for (a) H = 7 , N = 6 ; (b) H = 8 , N = 6 ; (c) H = 7 , N = 7 and (d) H = 8 , N = 7 with Q = 2 , compared with the bounds obtained by the past work in [14].
In the paper [24], we obtained exact values for $N = 1$:
$$\lambda_0\bigl(\|A_{M,H}B_{H,1}(Q)\|^2\bigr) = \frac{MQk(k+1)+2H}{4(1+kQ)},$$
where $k = \max\{i\in\mathbb{Z} : 2H \ge M(i(i-1)Q+2i)\}$, and we had:
$$\theta = \begin{cases} 1, & \text{if } 2H > M(k(k-1)Q+2k),\\ 2, & \text{if } 2H = M(k(k-1)Q+2k).\end{cases}$$
We obtained other exact values, for small $H$, in the paper [14]. Both sets of exact values coincide with the corresponding bounds in Theorem 5.
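The following small script (ours) tabulates the exact $N = 1$ value and its order; it simply evaluates the formula quoted above, which is the only assumption it relies on.

```python
def lambda_theta_N1(M, H, Q):
    """Exact learning coefficient and order for ||A_{M,H} B_{H,1}(Q)||^2 at the origin,
    evaluating the N = 1 formula quoted above (the formula is assumed, not derived here)."""
    k = 0
    while 2 * H >= M * ((k + 1) * k * Q + 2 * (k + 1)):
        k += 1                      # k = max{i : 2H >= M(i(i-1)Q + 2i)}
    lam = (M * Q * k * (k + 1) + 2 * H) / (4.0 * (1 + k * Q))
    theta = 2 if 2 * H == M * (k * (k - 1) * Q + 2 * k) else 1
    return lam, theta

# Example: M = 1, Q = 1 gives 1/2, 3/4, 1 for H = 1, 2, 3.
print([lambda_theta_N1(1, H, 1) for H in (1, 2, 3)])
```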

3.1. A Learning Coefficient for a Three-Layered Neural Network

Consider the three-layered neural network with $N$ input units, $H$ hidden units and $M$ output units, trained to estimate a true distribution with $r$ hidden units. Its learning coefficient, $\lambda$, is as follows [14,24]:
$$\lambda = \min\Bigl\{\lambda_0\bigl(\|A_{M,H_0}B_{H_0,N}(2)\|^2\bigr) + \sum_{\alpha=1}^{r}\lambda_0\bigl(\|(A_{M,H_\alpha-1},a^*)B_{H_\alpha,N}(1)\|^2\bigr) : H_0+H_1+\cdots+H_r = H,\ a^*\ne 0\Bigr\}.$$
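The minimum above runs over all decompositions $H_0+H_1+\cdots+H_r = H$. The following skeleton (ours) only enumerates these decompositions; the callables `lam0_Q2` and `lam0_aug`, standing for $\lambda_0(\|A_{M,H_0}B_{H_0,N}(2)\|^2)$ and $\lambda_0(\|(A_{M,H_\alpha-1},a^*)B_{H_\alpha,N}(1)\|^2)$, are placeholders that must be supplied (e.g., from the bounds of Theorem 5 or from known exact values); they are not functions defined in the paper, and $H_\alpha\ge 1$ for $\alpha\ge 1$ is an assumption of this sketch.

```python
def compositions(total, parts, minimum):
    """All tuples of `parts` integers >= `minimum` summing to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    for h in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - h, parts - 1, minimum):
            yield (h,) + rest

def three_layer_lambda(H, r, lam0_Q2, lam0_aug):
    """min over H_0 + H_1 + ... + H_r = H (H_0 >= 0, H_alpha >= 1 assumed, r >= 1) of
    lam0_Q2(H_0) + sum_alpha lam0_aug(H_alpha); the two callables are placeholders."""
    best = float("inf")
    for H0 in range(0, H - r + 1):
        for rest in compositions(H - H0, r, 1):
            best = min(best, lam0_Q2(H0) + sum(lam0_aug(h) for h in rest))
    return best
```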

3.2. A Learning Coefficient for a Normal Mixture Model

Consider a normal mixture model with $H$ peaks and a true distribution with $r$ peaks. Then, its learning coefficient, $\lambda$, is as follows [14,18]:
$$\lambda = \frac{r-1}{2} + \min\Bigl\{\sum_{\alpha=1}^{r}\lambda_0\bigl(\|(A_{M,H_\alpha-1},a^*)B_{H_\alpha,N}(1)\|^2\bigr) : \sum_{\alpha=1}^{r}H_\alpha = H,\ a^*\ne 0\Bigr\}$$
In particular, we have for N = 1 :
$$\lambda = r-1+\frac{i+i^2+2(H-(r-1))}{4(i+1)}, \qquad \theta = \begin{cases} 1, & \text{if } i^2+i < 2(H-(r-1)),\ \text{or } H=r,\\ 2, & \text{if } i^2+i = 2(H-(r-1)),\end{cases}$$
where $i = \max\{j\in\mathbb{Z} \;;\; j^2+j \le 2(H-(r-1))\}$.
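A small script (ours) that evaluates this $N = 1$ formula; the formula itself, as displayed above, is the assumption here.

```python
def mixture_lambda_theta(H, r):
    """Learning coefficient and order for a one-dimensional normal mixture with H learning
    components and r true components, evaluating the formula quoted above (assumed)."""
    if H == r:
        return r - 0.5, 1           # regular case: lambda = (2r - 1)/2, theta = 1
    i = 0
    while (i + 1) ** 2 + (i + 1) <= 2 * (H - (r - 1)):
        i += 1                      # i = max{j : j^2 + j <= 2(H - (r - 1))}
    lam = (r - 1) + (i + i ** 2 + 2 * (H - (r - 1))) / (4.0 * (i + 1))
    theta = 2 if i ** 2 + i == 2 * (H - (r - 1)) else 1
    return lam, theta

# Examples: (H, r) = (2, 1) gives 3/4 and (H, r) = (3, 2) gives 7/4.
print(mixture_lambda_theta(2, 1), mixture_lambda_theta(3, 2))
```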

4. Conclusions

In this paper, we proved two theorems, Theorem 2 (the method for finding a deepest singular point) and Theorem 3 (the method to add variables), for obtaining learning coefficients in a homogeneous case. By applying these methods to Vandermonde matrix-type singularities, together with the inclusion of ideals and recursive blowups from algebraic geometry, we obtained new bounds on their learning coefficients. These bounds are much tighter than those in [14]. Our future aim is to improve these methods and to obtain exact values for general learning models.
The learning coefficients from our recent results have been used very effectively by Drton [25,26] for model selection, using a method called the "singular Bayesian information criterion" (sBIC), which can be applied to singular models, where the assumptions supporting the use of the standard BIC do not hold. Our theoretical results also provide a mathematical measure of precision for numerical calculations, such as the Markov chain Monte Carlo (MCMC) method. Nagata and Watanabe [27,28] gave a mathematical foundation for analyzing and improving the precision of the MCMC method using our theoretical values of marginal likelihoods.

Acknowledgments

This research was supported by the Ministry of Education, Culture, Sports, Science and Technology in Japan, Grant-in-Aid for Scientific Research 22540224.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

We introduce Hironaka’s Theorem on desingularization.
Theorem 6 
[Desingularization, Hironaka (1964), (Figure A1)]
Let $f$ be a real analytic function in a neighborhood of $w^* = (w_1^*,\ldots,w_d^*)\in\mathbb{R}^d$ with $f(w^*) = 0$. There exists an open set $V\ni w^*$, a real analytic manifold $U$ and a proper analytic map $\mu$ from $U$ to $V$, such that:
(1) 
$\mu : U\setminus E \to V\setminus f^{-1}(0)$ is an isomorphism, where $E = \mu^{-1}(f^{-1}(0))$,
(2) 
for each $u\in U$, there is a local analytic coordinate system $(u_1,\ldots,u_n)$, such that $f(\mu(u)) = \pm u_1^{s_1}u_2^{s_2}\cdots u_n^{s_n}$, where $s_1,\ldots,s_n$ are non-negative integers.
Figure A1. Hironaka's Theorem: diagram of the desingularization, $\mu$, of $f$: $E$ maps to $f^{-1}(0)$, and $U\setminus E$ is isomorphic to $V\setminus f^{-1}(0)$ by $\mu$, where $V$ is a small neighborhood of $w^*$ with $f(w^*) = 0$.
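As a small worked example of the theorem (ours, not from the paper), take $f(x,y) = x^2+y^2$ near $w^* = 0$. One blowup of the origin, in the chart $x = u_1$, $y = u_1u_2$, already gives
$$f(\mu(u_1,u_2)) = u_1^2 + u_1^2u_2^2 = u_1^2\,(1+u_2^2),$$
and since $1+u_2^2$ is a nonvanishing analytic factor, a further local change of coordinates puts $f\circ\mu$ in the monomial form $u_1^2$ of Theorem 6, with $E = \{u_1 = 0\}$ mapping onto $f^{-1}(0) = \{0\}$.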

Appendix B

The learning coefficient is the log canonical threshold of the Kullback function (relative entropy). In this section, we explain how it differs between the real and complex fields. Let $f$ be a nonzero holomorphic function over $\mathbb{C}$, or an analytic function over $\mathbb{R}$, on a smooth variety $Y$, and let $Z\subset Y$ be a closed subscheme. The log canonical threshold, $\lambda_Z(Y,f)$, is defined analytically as:
$$\lambda_Z(Y,f) = \sup\{c : |f|^{-c}\ \text{is locally}\ L^2\ \text{near}\ Z\}$$
over $\mathbb{C}$, and:
$$\lambda_Z(Y,f) = \sup\{c : |f|^{-c}\ \text{is locally}\ L^1\ \text{near}\ Z\}$$
over $\mathbb{R}$ [11,12]. It is known that if $f$ is a polynomial or a convergent power series, then $-\lambda_0(\mathbb{C}^d,f)$ is the largest root of the Bernstein-Sato polynomial, $b(s)\in\mathbb{C}[s]$, of $f$, where $b(s)f^s = Pf^{s+1}$ for a linear differential operator, $P$ [29,30,31]. The log canonical threshold, $\lambda_Z(Y,f)$, also corresponds to the largest pole of $\int_{\mathrm{near}\,Z}|f|^{2z}\,\psi(w)\,dw$ over $\mathbb{C}$ ($\int_{\mathrm{near}\,Z}|f|^{z}\,\psi(w)\,dw$ over $\mathbb{R}$), where $\psi(w)$ is a $C^\infty$ function with compact support, such that $\psi(w)\ne 0$ on $Z$.
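As a worked example of this difference (ours), take $f = x_1^2+\cdots+x_5^2$ at the origin. Over $\mathbb{R}$, in polar coordinates,
$$\int_{\|x\|\le 1}\bigl(x_1^2+\cdots+x_5^2\bigr)^{-c}\,dx = \mathrm{const}\times\int_0^1 r^{4-2c}\,dr < \infty \iff c < \frac{5}{2},$$
so $\lambda_0(\mathbb{R}^5,f) = 5/2$, whereas over $\mathbb{C}$ the same quadric has $\lambda_0(\mathbb{C}^5,f) = 1$; this is essentially the computation behind Remark 1 (2).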

Appendix C

Using the blowup process and the method to add variables, together with induction on $s$, we prove Theorem 5.
We give below Lemma 1, as it is frequently used in the proofs.
Lemma 1 
([18,24,32]) Let $U$ be a neighborhood of $w^*\in\mathbb{R}^d$. Let $\mathcal{I}$ be the ideal generated by $f_1,\ldots,f_n$, which are analytic functions defined on $U$.
(1) If $g_1^2+\cdots+g_m^2 \le f_1^2+\cdots+f_n^2$, then $\lambda_{w^*}(g_1^2+\cdots+g_m^2) \le \lambda_{w^*}(f_1^2+\cdots+f_n^2)$.
(2) If $g_1,\ldots,g_m\in\mathcal{I}$, then $\lambda_{w^*}(g_1^2+\cdots+g_m^2) \le \lambda_{w^*}(f_1^2+\cdots+f_n^2)$. In particular, if $g_1,\ldots,g_m$ generate the ideal $\mathcal{I}$, then $\lambda_{w^*}(f_1^2+\cdots+f_n^2) = \lambda_{w^*}(g_1^2+\cdots+g_m^2)$.
The following lemma is also used in the proofs.
Lemma 2 
([19]) Let $\mathcal{I}$, $\mathcal{J}$ be the ideals generated by $f_1(w),\ldots,f_n(w)$ and $g_1(w'),\ldots,g_m(w')$, respectively. If $w$ and $w'$ are different variables, then
$$\lambda_{(w^*,w'^*)}(f_1^2+\cdots+f_n^2+g_1^2+\cdots+g_m^2) = \lambda_{w^*}(f_1^2+\cdots+f_n^2) + \lambda_{w'^*}(g_1^2+\cdots+g_m^2).$$
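For instance (our example), with $f_1(w) = w$ and $g_1(w') = w'^2$ in separate variables, Lemma 2 gives
$$\lambda_{(0,0)}\bigl(w^2 + w'^4\bigr) = \lambda_0(w^2) + \lambda_0(w'^4) = \frac{1}{2}+\frac{1}{4} = \frac{3}{4}.$$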
Step 1
Let us consider the following procedure from $s=1$ to $s=H$, and the generators of the ideal:
$$J = \Bigl\langle (a_{i_0 1}\ \cdots\ a_{i_0 H})\begin{pmatrix} b_{11}^{\ell_1}\cdots b_{1N}^{\ell_N}\\ b_{21}^{\ell_1}\cdots b_{2N}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} : 1\le i_0\le M,\ \sum_{i=1}^{N}\ell_i = nQ+1,\ n\ge 0\Bigr\rangle.$$
By constructing the blowup repeatedly and choosing one branch of the blowup process, we show the following (i)∼(v) in this subsection:
(i)
$k(1),\ldots,k(s-1)\in\{1,2,\ldots,N\}$,
(ii)
$\mathrm{count}(i_1,i_2,j) = \#\{k(i)=j \mid i_1\le i\le i_2\}$ for $1\le j\le N$,
(iii)
$b_{ij} = \begin{cases} v_1v_2\cdots v_i\,b_{ij}', & i < s-1,\\ v_1v_2\cdots v_{s-1}\,b_{ij}', & i \ge s-1,\end{cases}$ and $b_{i\,k(i)}' = 1$,
(iv)
The Jacobian is:
$$\prod_{i=1}^{H}\prod_{j=1}^{N}db_{ij} = \prod_{i=1}^{s-1}v_i^{(H-i+1)N-1+d_i}\,dv_i\prod_{i=1}^{s-1}\prod_{j=1}^{N}db_{ij}'$$
and
$$d_i = (N-1)Q\sum_{s_1=i}^{s-1}\bigl(\mathrm{count}(i,s_1,k(s_1))-1\bigr),$$
(v)
$$\begin{aligned} J = {} & \Bigl\langle\, a_{i_0 i_1}\,b_{i_1 k(i_1)}\!\!\prod_{k(i)=k(i_1),\,1\le i\le i_1-1}\!\!\bigl(b_{i_1,k(i_1)}^Q-b_{i,k(i_1)}^Q\bigr) : 1\le i_0\le M,\ 1\le i_1\le s-1\Bigr\rangle\\ & +\Bigl\langle (a_{i_0 s}\ \cdots\ a_{i_0 H})\begin{pmatrix} b_{sj}^{nQ+1}\prod_{k(i)=j,\,1\le i\le s-1}(b_{sj}^Q-b_{ij}^Q)\\ \vdots\\ b_{Hj}^{nQ+1}\prod_{k(i)=j,\,1\le i\le s-1}(b_{Hj}^Q-b_{ij}^Q)\end{pmatrix} : 1\le i_0\le M,\ j=1,\ldots,N,\ n\ge 0\Bigr\rangle\\ & +\Bigl\langle (a_{i_0 s}\ \cdots\ a_{i_0 H})\begin{pmatrix} b_{s1}^{\ell_1}\cdots b_{sN}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} : 1\le i_0\le M,\ \sum_{i=1}^{N}\ell_i=nQ+1,\ n\ge 1,\ \ell_j<Qn+1\Bigr\rangle.\end{aligned}$$
By Theorem 3, we can set $b_{i\,k(i)}$ as a variable.
Now, we show the above by the inductive method.
Define $1\le k(s)\le N$. Construct the blowup along $\{b_{ij}=0,\ s\le i\le H,\ 1\le j\le N\}$. Set $b_{ij} = v_s b_{ij}$ for $s\le i\le H$, $1\le j\le N$, and set $b_{s\,k(s)} = 1$.
By constructing the blowup along $\{v_i = b_{sj} = 0,\ 1\le j\le N,\ j\ne k(s)\}$ repeatedly, and by choosing one branch of the blowup process, set $b_{sj} = b_{sj}'\prod_{i=1}^{s-1}v_i^{d_i}$ for $j\ne k(s)$, where $d_i = (\mathrm{count}(i,s,k(s))-1)Q$.
Consider a sufficiently small neighborhood of { v i = 0 } using Theorem 2.
Set $f_{i\,k(s)} = b_{i\,k(s)}\prod_{k(i_1)=k(s),\,1\le i_1\le s-1}\bigl(b_{i\,k(s)}^Q-b_{i_1\,k(s)}^Q\bigr)$ for $i\ge s$, $\tilde b_{ij} = b_{ij}-b_{sj}\dfrac{f_{i\,k(s)}}{f_{s\,k(s)}}$ and $\tilde b_{ij}' = \tilde b_{ij}\Big/\prod_{i_1=1}^{s}v_{i_1} = b_{ij}'-b_{sj}'\dfrac{f_{i\,k(s)}}{f_{s\,k(s)}}$ for $i\ge s+1$, $j\ne k(s)$.
We then have:
$$\begin{pmatrix} b_{s1}^{\ell_1}\cdots b_{sN}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} = \begin{pmatrix} \prod_{j=1}^{N}b_{sj}^{\ell_j}\\ b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j=1,\,j\ne k(s)}^{N}\bigl(\tilde b_{s+1,j}+b_{sj}\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\ell_j}\\ \vdots\\ b_{H,k(s)}^{\ell_{k(s)}}\prod_{j=1,\,j\ne k(s)}^{N}\bigl(\tilde b_{Hj}+b_{sj}\frac{f_{H\,k(s)}}{f_{s\,k(s)}}\bigr)^{\ell_j}\end{pmatrix}$$
which is an element of the vector ideal:
$$\left\langle\begin{pmatrix} \prod_{j=1}^{N}b_{sj}^{\ell_j}\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{s+1,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\\ \vdots\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{H\,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{H\,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\end{pmatrix}\right\rangle + \sum_{\substack{\ell'_{k(s)}=\ell_{k(s)},\ 0\le\ell'_j\le\ell_j\\ \sum_{j\ne k(s)}\ell'_j\ne 0}}\left\langle\begin{pmatrix} 0\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j-\ell'_j}\,b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{s+1,j}^{\ell'_j}\bigl(\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}(\ell_j-\ell'_j)}\\ \vdots\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j-\ell'_j}\,b_{H\,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{Hj}^{\ell'_j}\bigl(\frac{f_{H\,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}(\ell_j-\ell'_j)}\end{pmatrix}\right\rangle$$
Furthermore, we have that
$$\begin{pmatrix} \prod_{j=1}^{N}b_{sj}^{\ell_j}\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{s+1,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\\ \vdots\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{H\,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{H\,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\end{pmatrix}$$
is an element of
$$\left\langle\frac{\prod_{j\ne k(s)}b_{sj}^{\ell_j}}{f_{s\,k(s)}^{\sum_{j\ne k(s)}\ell_j}}\begin{pmatrix} b_{s,k(s)}^{nQ+1}\prod_{k(i_1)=k(s),\,1\le i_1\le s-1}(b_{s\,k(s)}^Q-b_{i_1\,k(s)}^Q)\\ b_{s+1,k(s)}^{nQ+1}\prod_{k(i_1)=k(s),\,1\le i_1\le s-1}(b_{s+1\,k(s)}^Q-b_{i_1\,k(s)}^Q)\\ \vdots\\ b_{H\,k(s)}^{nQ+1}\prod_{k(i_1)=k(s),\,1\le i_1\le s-1}(b_{H\,k(s)}^Q-b_{i_1\,k(s)}^Q)\end{pmatrix} : n\ge 0\right\rangle.$$
Since $b_{sj} = b_{sj}'\prod_{i=1}^{s-1}v_i^{d_i}$ for $j\ne k(s)$, where $d_i = (\mathrm{count}(i,s,k(s))-1)Q$, we have $b_{sj} = b_{sj}'\prod_{i=1}^{s-1}v_i^{d_i+1}$, and $\dfrac{\prod_{j\ne k(s)}b_{sj}^{\ell_j}}{f_{s\,k(s)}^{\sum_{j\ne k(s)}\ell_j}}$ is finite. That is, we have:
$$\begin{pmatrix} \prod_{j=1}^{N}b_{sj}^{\ell_j}\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{s+1,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\\ \vdots\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j}\,b_{H\,k(s)}^{\ell_{k(s)}}\bigl(\frac{f_{H\,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}\ell_j}\end{pmatrix} \in \left\langle\begin{pmatrix} b_{s,k(s)}^{nQ}f_{s\,k(s)}\\ b_{s+1,k(s)}^{nQ}f_{s+1,k(s)}\\ \vdots\\ b_{H\,k(s)}^{nQ}f_{H\,k(s)}\end{pmatrix} : n\ge 0\right\rangle.$$
If we assume that, for $\alpha$,
$$\begin{aligned} J_\alpha = {} & \Bigl\langle\begin{pmatrix} b_{s\,k(s)}^{nQ}f_{s\,k(s)}\\ \vdots\\ b_{H\,k(s)}^{nQ}f_{H\,k(s)}\end{pmatrix},\ n\ge 0\Bigr\rangle + \Bigl\langle\begin{pmatrix} b_{s1}^{\ell_1}\cdots b_{sN}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} : \sum_{j=1}^{N}\ell_j=nQ+1,\ n\ge 1,\ \sum_{j\ne k(s)}\ell_j\le\alpha\Bigr\rangle\\ = {} & \Bigl\langle\begin{pmatrix} b_{s\,k(s)}^{nQ}f_{s\,k(s)}\\ \vdots\\ b_{H\,k(s)}^{nQ}f_{H\,k(s)}\end{pmatrix},\ n\ge 0\Bigr\rangle + \Bigl\langle\begin{pmatrix} 0\\ b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{s+1,j}^{\ell_j}\\ \vdots\\ b_{H\,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{Hj}^{\ell_j}\end{pmatrix} : \sum_{j=1}^{N}\ell_j=nQ+1,\ n\ge 1,\ \sum_{j\ne k(s)}\ell_j\le\alpha\Bigr\rangle,\end{aligned}$$
then we have, for $\sum_{j\ne k(s)}\ell_j = \alpha+1$:
$$\begin{pmatrix} b_{s1}^{\ell_1}\cdots b_{sN}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} + J_\alpha = \begin{pmatrix} 0\\ b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{s+1,j}^{\ell_j}\\ \vdots\\ b_{H\,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{Hj}^{\ell_j}\end{pmatrix} + J_\alpha,$$
since, for $\ell_j'\ne 0$, we have:
$$\begin{pmatrix} 0\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j-\ell_j'}\,b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{s+1,j}^{\ell_j'}\bigl(\frac{f_{s+1,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}(\ell_j-\ell_j')}\\ \vdots\\ \prod_{j\ne k(s)}b_{sj}^{\ell_j-\ell_j'}\,b_{H\,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{Hj}^{\ell_j'}\bigl(\frac{f_{H,k(s)}}{f_{s\,k(s)}}\bigr)^{\sum_{j\ne k(s)}(\ell_j-\ell_j')}\end{pmatrix} = \frac{\prod_{j\ne k(s)}b_{sj}^{\ell_j-\ell_j'}}{f_{s\,k(s)}^{\sum_{j\ne k(s)}(\ell_j-\ell_j')}}\begin{pmatrix} 0\\ b_{s+1,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{s+1,j}^{\ell_j'}\,f_{s+1,k(s)}^{\sum_{j\ne k(s)}(\ell_j-\ell_j')}\\ \vdots\\ b_{H\,k(s)}^{\ell_{k(s)}}\prod_{j\ne k(s)}\tilde b_{Hj}^{\ell_j'}\,f_{H,k(s)}^{\sum_{j\ne k(s)}(\ell_j-\ell_j')}\end{pmatrix} \in J_\alpha.$$
Therefore, by setting:
$$a_{i_0 s}'\,b_{s\,k(s)}\!\!\prod_{k(i)=k(s),\,1\le i\le s-1}\!\!\bigl(b_{s,k(s)}^Q-b_{i,k(s)}^Q\bigr) = \sum_{i_1=s}^{H}a_{i_0 i_1}\,b_{i_1 k(s)}\!\!\prod_{k(i)=k(s),\,1\le i\le s-1}\!\!\bigl(b_{i_1 k(s)}^Q-b_{i,k(s)}^Q\bigr)$$
for $1\le i_0\le M$, and by setting $b_{ij} = \tilde b_{ij}$ again, we have:
$$\begin{aligned} J = {} & \Bigl\langle\, a_{i_0 i_1}\,b_{i_1 k(i_1)}\!\!\prod_{k(i)=k(i_1),\,1\le i\le i_1-1}\!\!\bigl(b_{i_1,k(i_1)}^Q-b_{i,k(i_1)}^Q\bigr) : 1\le i_0\le M,\ 1\le i_1\le s-1\Bigr\rangle\\ & +\Bigl\langle\, a_{i_0 s}\,b_{s\,k(s)}\!\!\prod_{k(i)=k(s),\,1\le i\le s-1}\!\!\bigl(b_{s,k(s)}^Q-b_{i,k(s)}^Q\bigr) : 1\le i_0\le M\Bigr\rangle\\ & +\Bigl\langle (a_{i_0\,s+1}\ \cdots\ a_{i_0 H})\begin{pmatrix} b_{s+1\,j}\prod_{k(i)=j,\,1\le i\le s}(b_{s+1\,j}^Q-b_{ij}^Q)\\ \vdots\\ b_{Hj}\prod_{k(i)=j,\,1\le i\le s}(b_{Hj}^Q-b_{ij}^Q)\end{pmatrix} : j=1,\ldots,N\Bigr\rangle\\ & +\Bigl\langle (a_{i_0\,s+1}\ \cdots\ a_{i_0 H})\begin{pmatrix} b_{s+1,1}^{\ell_1}\cdots b_{s+1,N}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} : 1\le i_0\le M,\ \sum_{i=1}^{N}\ell_i=nQ+1,\ n\ge 1,\ \ell_j<Qn+1\Bigr\rangle\end{aligned}$$
with (i)∼(iv).
Step 2
By Step 1, we need to consider the ideal:
$$\begin{aligned} & \Bigl\langle\, a_{i_0,i_1}\prod_{i=1}^{i_1}v_i^{(\mathrm{count}(i,i_1,k(i_1))-1)Q+1} : 1\le i_0\le M,\ i_1\le s\Bigr\rangle\\ & +\Bigl\langle (a_{i_0\,s+1}\ \cdots\ a_{i_0 H})\prod_{i=1}^{s}v_i^{\mathrm{count}(i,s,j)Q+1}\begin{pmatrix} b_{s+1\,j}\\ \vdots\\ b_{Hj}\end{pmatrix} : j=1,\ldots,N\Bigr\rangle\\ & +\Bigl\langle (a_{i_0\,s+1}\ \cdots\ a_{i_0 H})\prod_{i=1}^{s}v_i^{Q+1}\begin{pmatrix} b_{s+1,1}^{\ell_1}\cdots b_{s+1,N}^{\ell_N}\\ \vdots\\ b_{H1}^{\ell_1}\cdots b_{HN}^{\ell_N}\end{pmatrix} : 1\le i_0\le M,\ \sum_{i=1}^{N}\ell_i=Q+1,\ \ell_j<Q+1\Bigr\rangle\end{aligned}$$
with Jacobian:
$$\prod_{i=1}^{H}v_i^{(H-i+1)N-1+d_i(s)}\,dv_i\prod_{k=1}^{M}\prod_{i=1}^{H}da_{ki}\prod_{i=1}^{H}\prod_{j=1}^{N}db_{ij}$$
where:
$$d_i(s) = (N-1)Q\sum_{s_1=i}^{s}\bigl(\mathrm{count}(i,s_1,k(s_1))-1\bigr)$$
We have:
$$\lambda_0\bigl(\|A_{M,H}B_{H,N}(Q)\|^2\bigr) \le \min\left\{\frac{(H-i+1)N+d_i(s)+d_i'(s)+d_{i,i_{s+1},\ldots,i_H}(s)}{2(\mathrm{count}(i,s,k(s))-1)Q+2} : 1\le i\le s,\ i_\alpha\ge 0,\ 1\le s\le H\right\}$$
where:
$$\begin{aligned} d_i'(s) = {} & M(i-1)\bigl\{(\mathrm{count}(i,s,k(s))-1)Q+1\bigr\} + QM\!\!\!\sum_{\substack{s_1=i\\ \mathrm{count}(i,s,k(s))>\mathrm{count}(i,s_1,k(s_1))}}^{s-1}\!\!\!\bigl(\mathrm{count}(i,s,k(s))-\mathrm{count}(i,s_1,k(s_1))\bigr),\\ d_{i,i_{s+1},\ldots,i_H}(s) = {} & \sum_{\alpha=s+1}^{H}\Bigl\{M i_\alpha + \!\!\sum_{\substack{j=1,\,j\ne k(s)\\ \mathrm{count}(i,s,j)=0}}\!\!\max\{Q(\mathrm{count}(i,s,k(s))-1)-i_\alpha,\,0\} + \!\!\sum_{\substack{j=1,\,j\ne k(s)\\ \mathrm{count}(i,s,j)\ge 1,\ \mathrm{count}(i,s,k(s))\ge 2}}^{N}\!\!\max\{Q(\mathrm{count}(i,s,k(s))-2)-i_\alpha,\,0\}\Bigr\}.\end{aligned}$$
Set $C(i,s) = \#\{j : \mathrm{count}(i,s,j)=0,\ 1\le j\le N\}$. Then, we have:
$$d_i''(s) = \min\{d_{i,i_{s+1},\ldots,i_H}(s) : i_\alpha\ge 0\} = \begin{cases} 0, & \text{if } \mathrm{count}(i,s,k(s))=1,\\ (H-s)\{C(i,s)Q+(N-1)Q(\mathrm{count}(i,s,k(s))-2)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ N-1\le M,\\ (H-s)\{C(i,s)Q+MQ(\mathrm{count}(i,s,k(s))-2)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ C(i,s)\le M<N-1,\\ (H-s)\{MQ(\mathrm{count}(i,s,k(s))-1)\}, & \text{if } \mathrm{count}(i,s,k(s))\ge 2,\ M\le C(i,s).\end{cases}$$
By the above equation, we have $\mathrm{bound}_1$. By [19], we have $\mathrm{bound}_2$ and $\mathrm{bound}_3$, thus completing the proof.

References

  1. Hartigan, J.A. A Failure of Likelihood Ratio Asymptotics for Normal Mixtures. In Proceedings of the Berkeley Conference in Honor of J.Neyman and J.Kiefer, California, CA, USA, 1985; Volume 2, pp. 807–810.
  2. Sussmann, H.J. Uniqueness of the weights for minimal feed-forward nets with a given input-output map. Neural Netw. 1992, 5, 589–593. [Google Scholar] [CrossRef]
  3. Hagiwara, K.; Toda, N.; Usui, S. On the problem of applying AIC to determine the structure of a layered feed-forward neural network. In Proceedings of the IJCNN Nagoya Japan, Nagoya Congress Center, Japan, 25–29 October 1993; Volume 3, pp. 2263–2266.
  4. Fukumizu, K. A regularity condition of the information matrix of a multilayer perceptron network. Neural Netw. 1996, 9, 871–879. [Google Scholar] [CrossRef]
  5. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  6. Watanabe, S. Algebraic analysis for nonidentifiable learning machines. Neural Comput. 2001, 13, 899–933. [Google Scholar] [CrossRef] [PubMed]
  7. Watanabe, S. Algebraic geometrical methods for hierarchical learning machines. Neural Netw. 2001, 14, 1049–1060. [Google Scholar] [CrossRef]
  8. Watanabe, S. Algebraic geometry of learning machines with singularities and their prior distributions. J. Jpn. Soc. Artif. Intell. 2001, 16, 308–315. [Google Scholar]
  9. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Cambridge University Press: New York, NY, USA, 2009; Volume 25. [Google Scholar]
  10. Fulton, W. Introduction to Toric Varieties, Annals of Mathematics Studies; Princeton University Press: Princeton, NJ, USA, 1993. [Google Scholar]
  11. Kollár, J. Singularities of Pairs. In Algebraic Geometry-Santa Cruz 1995, Series Proceedings of Symposia in Pure Mathematics, 9–29 July 1995; American Mathematical Society: Providence, RI, USA, 1997; Volume 62, pp. 221–287. [Google Scholar]
  12. Mustata, M. Singularities of pairs via jet schemes. J. Am. Math. Soc. 2002, 15, 599–615. [Google Scholar] [CrossRef]
  13. Yamazaki, K.; Aoyagi, M.; Watanabe, S. Asymptotic analysis of Bayesian generalization error with Newton diagram. Neural Netw. 2010, 23, 35–43. [Google Scholar] [CrossRef] [PubMed]
  14. Aoyagi, M.; Nagata, K. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Comput. 2012, 24, 1569–1610. [Google Scholar] [CrossRef] [PubMed]
  15. Aoyagi, M.; Watanabe, S. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Netw. 2005, 18, 924–933. [Google Scholar] [CrossRef] [PubMed]
  16. Aoyagi, M.; Watanabe, S. Resolution of singularities and the generalization error with Bayesian estimation for layered neural network. IEICE Trans. J88-D-II 2005, 10, 2112–2124. [Google Scholar]
  17. Aoyagi, M. The zeta function of learning theory and generalization error of three layered neural perceptron. RIMS Kokyuroku Recent Top. Real Complex Singul. 2006, 1501, 153–167. [Google Scholar]
  18. Aoyagi, M. A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities. Commun. Stat. Theory Methods 2010, 39, 2667–2687. [Google Scholar] [CrossRef]
  19. Aoyagi, M. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. J. Algebr. Stat. 2013, in press. [Google Scholar] [CrossRef]
  20. Rusakov, D.; Geiger, D. Asymptotic Model Selection for Naive Bayesian Networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Alberta, Canada, 1–4 August 2002; pp. 438–445.
  21. Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. J. Mach. Learn. Res. 2005, 6, 1–35. [Google Scholar]
  22. Zwiernik, P. An asymptotic behavior of the marginal likelihood for general Markov models. J. Mach. Learn. Res. 2011, 12, 3283–3310. [Google Scholar]
  23. Watanabe, S. Equations of states in singular statistical estimation. Neural Netw. 2010, 23, 20–34. [Google Scholar] [CrossRef] [PubMed]
  24. Aoyagi, M. Log canonical threshold of Vandermonde matrix type singularities and generalization error of a three layered neural network. Int. J. Pure Appl. Math. 2009, 52, 177–204. [Google Scholar]
  25. Drton, M. Conference Lecture: Reduced Rank Regression. Workshop on Singular Learning Theory, AIM 2011. Available online: http://math.berkeley.edu/critch/slt2011/ (accessed on 16 December 2011).
  26. Drton, M. Conference Lecture: Bayesian Information Criterion for Singular Models. Algebraic Statistics 2012 in the Alleghenies at The Pennsylvania State University. Available online: http://jasonmorton.com/aspsu2012/ (accessed on 15 June 2012).
  27. Nagata, K.; Watanabe, S. Exchange Monte Carlo Sampling from Bayesian posterior for singular learning machines. IEEE Trans. Neural Netw. 2008, 19, 1253–1266. [Google Scholar] [CrossRef]
  28. Nagata, K.; Watanabe, S. Asymptotic behavior of exchange ratio in exchange Monte Carlo method. Int. J. Neural Netw. 2008, 21, 980–988. [Google Scholar] [CrossRef] [PubMed]
  29. Bernstein, I.N. The analytic continuation of generalized functions with respect to a parameter. Funct. Anal. Appl. 1972, 6, 26–40. [Google Scholar]
  30. Björk, J.E. Rings of Differential Operators; North-Holland: Amsterdam, The Netherlands, 1979. [Google Scholar]
  31. Kashiwara, M. B-functions and holonomic systems. Invent. Math. 1976, 38, 33–53. [Google Scholar] [CrossRef]
  32. Lin, S. Asymptotic approximation of marginal likelihood integrals. 2010; arXiv:1003.5338v2. [Google Scholar]
