Binary Classification with a Pseudo Exponential Model and Its Application for Multi-Task Learning

Takenouchi, Takashi; Komori, Osamu; Eguchi, Shinto

doi:10.3390/e17085673

by Takashi Takenouchi ¹,*, Osamu Komori ² and Shinto Eguchi ²

¹ Future University Hakodate, 116-2 Kamedanakano, Hakodate, Hokkaido 041-8655, Japan

² The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

* Author to whom correspondence should be addressed.

† This paper is an extended version of our paper published in Proceedings of the MaxEnt 2014 Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014.

Entropy 2015, 17(8), 5673-5694; https://doi.org/10.3390/e17085673

Submission received: 11 May 2015 / Revised: 31 July 2015 / Accepted: 3 August 2015 / Published: 6 August 2015

(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)


Abstract:

In this paper, we investigate the basic properties of binary classification with a pseudo model based on the Itakura–Saito distance and reveal that the Itakura–Saito distance is a unique appropriate measure for estimation with the pseudo model in the framework of general Bregman divergence. Furthermore, we propose a novel multi-task learning algorithm based on the pseudo model in the framework of the ensemble learning method. We focus on a specific setting of the multi-task learning for binary classification problems. The set of features is assumed to be common among all tasks, which are our targets of performance improvement. We consider a situation where the shared structures among the dataset are represented by divergence between underlying distributions associated with multiple tasks. We discuss statistical properties of the proposed method and investigate the validity of the proposed method with numerical experiments.

Keywords:

multi-task learning; Itakura–Saito distance; pseudo model; un-normalized model

1. Introduction

In multi-task learning, we assume that there are multiple related tasks (datasets) that share a common structure, and we utilize the shared structure to improve the generalization performance of the classifiers for these tasks [1,2]. This framework has been successfully employed in various kinds of applications, such as medical diagnosis. Most methods utilize the similarity among tasks to improve the performance of classifiers by representing the shared structure as a regularization term [3,4]. We tackle this problem using a boosting method, which makes it possible to adaptively learn complicated problems at low computational cost. Boosting methods are notable implementations of ensemble learning that try to construct a better classifier by combining weak classifiers. AdaBoost is the most popular boosting method, and many variations, including TrAdaBoost for multi-task learning [5], have been developed. In face recognition [6], as well as web search ranking [7], the computational efficiency of boosting has attracted attention in the framework of multi-task learning.

In this paper, we first show that AdaBoost can be derived by a sequential minimization of the Itakura–Saito (IS) distance between an empirical distribution and a pseudo measure model associated with a classifier. The IS distance is a special case of the Bregman divergence [8] between two positive measures and is frequently used for non-negative matrix factorization (NMF) in the field of signal processing [9,10]. Secondly, we propose a novel boosting algorithm for multi-task learning based on the IS distance. We utilize the IS distance as a discrepancy measure between pseudo models associated with tasks and incorporate the IS distance as a regularizer into AdaBoost. The proposed method can capture the shared structure, i.e., the relationship between underlying distributions, by considering the IS distance between pseudo models constructed by classifiers. We discuss the statistical properties of the proposed method and investigate the validity of the regularization by the IS distance with small experiments using synthetic datasets and a real dataset.

This paper is organized as follows. In Section 2, basic settings are described, and a divergence measure is introduced. In Section 3, we briefly introduce the IS distance, which is a special case of the Bregman divergence, and investigate the relationship between a well-known ensemble algorithm, AdaBoost and estimation with a pseudo model using the Itakura–Saito distance. In Section 4, we propose a method for multi-task learning, which is derived from a minimization of the weighted sum of divergence, and the performance of the proposed methods is examined in Section 5 using a synthetic dataset and a real dataset (a short version of this article has been presented as a conference paper [11]; some theoretical results and numerical experiments are added to the current version).

2. Settings

In this study, we focus on binary classification problems. Let x be an input and $y \in 𝒴 = \{\pm 1\}$ be a class label. Let us assume that J datasets $𝒟_{j} = \{x_{i}^{(j)}, y_{i}^{(j)}\}_{i = 1}^{n_{j}}$ ($j = 1, ..., J$) are given, and let $p_{j} (y | x) r_{j} (x)$ and ${\tilde{p}}_{j} (y | x) {\tilde{r}}_{j} (x)$ be the underlying distribution and the empirical distribution associated with the dataset $𝒟_{j}$, respectively. Here, we assume that each conditional distribution of y given x is written as:

p_{k} (y | x) = p_{0} (y | x) + δ_{k} (x) y

(1)

where $p_{0} (y | x)$ is a common conditional distribution for all datasets and $δ_{k} (x)$ is a term that is specific to the dataset $𝒟_{k}$. Note that $\sum_{y \in 𝒴} δ_{k} (x) y = 0$ holds, because $p_{k} (y | x)$ is a probability distribution. While a discriminant function $F_{k}$ is usually constructed using only the dataset $𝒟_{k}$, multi-task learning aims to improve the performance of the discriminant function for each dataset $𝒟_{k}$ with the help of the datasets $𝒟_{j}$ ($j \neq k$). For this purpose, we consider a risk minimization problem defined with a pseudo model and the Itakura–Saito (IS) distance, which is a discrepancy measure frequently used in the field of signal processing.

Let $ℳ = \{m (y) | 0 \leq \sum_{y \in 𝒴} m (y) < \infty\}$ be the space of all positive finite measures over $𝒴$. The Itakura–Saito distance between $p, q \in ℳ$ is defined as:

IS (p, q; r) = \int r (x) \sum_{y \in 𝒴} \{log \frac{q (y | x)}{p (y | x)} - 1 + \frac{p (y | x)}{q (y | x)}\} d x

(2)

where $r (x)$ is a marginal distribution of x shared by $p, q \in ℳ$. Note that the IS distance is a statistical version of the Bregman divergence [12], which makes it possible to directly plug in the empirical distribution. We observe that $IS (p, q; r) \geq 0$, and $IS (p, q; r) = 0$ if and only if $p = q$. Banerjee et al. [13] showed that there exists a unique Bregman divergence corresponding to every regular exponential family, and the Itakura–Saito distance is associated with the exponential distribution.

3. Itakura–Saito Distance and Pseudo Model

3.1. Parameter Estimation with the Pseudo Model

Let

q_{F} (y | x)

be an (un-normalized) pseudo model associated with a function

F (x)

,

q_{F} (y | x) = exp (F (x) y) .

(3)

Note that

q_{F} (y | x)

is not a probability function, i.e.,

\sum_{y \in 𝒴} q_{F} (y | x) \neq 1

in general. If

q_{F} (y | x)

is normalized, the model reduces to the classical logistic model as:

{\bar{q}}_{F} (y | x) = \frac{exp (F (x) y)}{exp (F (x)) + exp (- F (x))} .

(4)

When the function F is parameterized by θ, the maximum likelihood estimation (MLE)

{argmax}_{θ} \sum_{i = 1}^{n} log {\bar{q}}_{F} (y_{i} | x_{i})

or equivalently minimization of the (extended) Kullback–Leibler (KL) divergence is a powerful tool for the estimation of θ, and the MLE has properties such as asymptotic consistency and efficiency under some regularity conditions. Here, we consider parameter estimation with the pseudo model Equation (3) rather than the normalized model Equation (4).

Proposition 1. Let

p (y | x) = {\bar{q}}_{F_{0}} (y | x)

be the underlying distribution. Then, we observe:

\begin{matrix} \underset{F}{argmin} IS (p, q_{F}; r) & = F_{0}, \end{matrix}

(5)

\begin{matrix} \underset{F}{argmin} IS (q_{F}, p; r) & = F_{0} . \end{matrix}

(6)

Proof. See Appendix A.

On the other hand, when we consider an estimation based on the extended KL divergence, i.e.,

{argmin}_{F} KL (p, q_{F}; r)

where:

KL (p, q; r) = \int r (x) \sum_{y \in 𝒴} \{p (y | x) log \frac{p (y | x)}{q (y | x)} - p (y | x) + q (y | x)\} d x,

(7)

we observe the following.

Proposition 2. Let

F_{0}

be a function

F_{0}

(

\neq 0

) and

p (y | x) = {\bar{q}}_{F_{0}} (y | x)

be the underlying distribution. Then, we observe:

\begin{matrix} F_{KL, 1} = \underset{F}{argmin} KL (p, q_{F}; r) \neq & F_{0}, \end{matrix}

(8)

\begin{matrix} F_{KL, 2} = \underset{F}{argmin} KL (q_{F}, p; r) \neq & F_{0} . \end{matrix}

(9)

Proof. See Appendix B.

Remark 1. Let

p (y | x) = {\bar{q}}_{F_{0}} (y | x)

be the underlying distribution. Then, minimizer Equation (8) or (9) of the extended KL divergence attains the Bayes rule, i.e.,

\begin{matrix} sgn (F_{KL, 1} (x)) & = sgn (F_{KL, 2} (x)) = sgn (\frac{1}{2} log \frac{p (+ 1 | x)}{p (- 1 | x)}) . \end{matrix}

(10)

The proposition and the remark show that the extended KL divergence is not completely appropriate for estimation with the pseudo model.
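As a pointwise numerical illustration of Propositions 1 and 2 (a minimal sketch, not part of the original analysis), the following Python snippet fixes a single input x, so that the integral over r(x) drops out, and minimizes each divergence over a scalar F by ternary search. The value F0 = 0.8, the search interval and all helper names are arbitrary choices made for this example.

```python
import math

def logistic(F, y):
    """Normalized model of Equation (4) at a fixed x."""
    return math.exp(F * y) / (math.exp(F) + math.exp(-F))

def pseudo_model(F):
    """Un-normalized pseudo model q_F(y|x) = exp(F y) of Equation (3)."""
    return {y: math.exp(F * y) for y in (1, -1)}

def is_dist(p, q):
    """Pointwise IS distance between two positive measures on {+1, -1}."""
    return sum(math.log(q[y] / p[y]) - 1.0 + p[y] / q[y] for y in (1, -1))

def ext_kl(p, q):
    """Pointwise extended KL divergence of Equation (7)."""
    return sum(p[y] * math.log(p[y] / q[y]) - p[y] + q[y] for y in (1, -1))

def argmin_1d(g, lo=-5.0, hi=5.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if g(a) < g(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

F0 = 0.8                                     # assumed true function value at x
p = {y: logistic(F0, y) for y in (1, -1)}    # underlying distribution

F_is_1 = argmin_1d(lambda F: is_dist(p, pseudo_model(F)))   # Equation (5)
F_is_2 = argmin_1d(lambda F: is_dist(pseudo_model(F), p))   # Equation (6)
F_kl_1 = argmin_1d(lambda F: ext_kl(p, pseudo_model(F)))    # Equation (8)
print(F_is_1, F_is_2, F_kl_1)
```

Under these assumptions, both IS minimizers should recover F0, whereas the extended KL minimizer solves 2 sinh(F) = tanh(F0) and therefore differs from F0 while keeping the same sign, in line with Remark 1.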

3.2. Characterization of the Itakura–Saito Distance

In this section, we characterize the Itakura–Saito distance for estimation with the pseudo model in the framework of the Bregman U-divergence. Firstly, we briefly introduce the statistical version of the Bregman U-divergence [12], which is a discrepancy measure between positive measures in $ℳ$ defined by a generating function U and enables us to directly plug in the empirical distribution for estimation. Reference [12] proposed a general boosting-type algorithm for classification using the Bregman U-divergence and discussed properties of the method from the viewpoint of information geometry [14]. By changing the generating function U, the Bregman U-divergence can acquire useful properties, such as robustness against noise. For example, the β-divergence is a special case of the Bregman U-divergence and is frequently used for robust estimation in the context of unsupervised learning, such as clustering or component analysis [15,16]. Another example of the Bregman U-divergence is the η-divergence, which is employed to robustify the classification algorithm and is closely related to probability models of mislabeling [17,18].

Let U be a monotonically-increasing convex function and ξ be an inverse function of

U^{'}

, the derivative of U. From the convexity of the function U, the function ξ is a monotonically-increasing function. The statistical version of Bregman U-divergence between two measures

p, q \in ℳ

is defined as follows.

\begin{matrix} D_{U} (p, q; r) = \int r (x) \sum_{y \in 𝒴} \{U (ξ (q (y | x))) - U (ξ (p (y | x))) - p (y | x) (ξ (q (y | x)) - ξ (p (y | x)))\} d x . \end{matrix}

(11)

Note that the function ξ should be defined at least on

z > 0

.

Remark 2. The KL divergence and the Itakura–Saito distance are special cases of the Bregman U-divergence Equation (11) with generating functions

U (z) = exp (z)

and

U (z) = - log (c - z) + c_{1}

(

z < c

), where c and

c_{1}

are constants, respectively.
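Remark 2 can be checked numerically at a single x. The sketch below plugs the two generating functions into Equation (11) and compares the result with the direct definitions in Equations (2) and (7). It assumes c = c_1 = 0 for the IS case; the measures p and q and the function names are illustrative choices only.

```python
import math

def bregman_u(p, q, U, xi):
    """Pointwise Bregman U-divergence of Equation (11) on Y = {+1, -1}."""
    return sum(U(xi(q[y])) - U(xi(p[y])) - p[y] * (xi(q[y]) - xi(p[y]))
               for y in (1, -1))

# U(z) = exp(z): U'(z) = exp(z), so its inverse is xi(z) = log(z).
kl_pair = (math.exp, math.log)
# U(z) = -log(-z) for z < 0: U'(z) = -1/z, so its inverse is xi(z) = -1/z.
is_pair = (lambda z: -math.log(-z), lambda z: -1.0 / z)

p = {1: 0.9, -1: 0.4}   # arbitrary positive measures, not normalized
q = {1: 1.5, -1: 0.2}

kl_direct = sum(p[y] * math.log(p[y] / q[y]) - p[y] + q[y] for y in (1, -1))
is_direct = sum(math.log(q[y] / p[y]) - 1.0 + p[y] / q[y] for y in (1, -1))

kl_from_u = bregman_u(p, q, *kl_pair)
is_from_u = bregman_u(p, q, *is_pair)
print(kl_from_u, kl_direct)
print(is_from_u, is_direct)
```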

Here, we introduce the concept of reflection symmetry to characterize the IS distance.

Definition 3. A function

f (z)

is reflection-symmetric if:

\begin{matrix} f (z) = f (z^{- 1}) \end{matrix}

(12)

holds for all

z \neq 0

.

If the function f is reflection-symmetric, we observe that:

\begin{matrix} lim_{z \to 0} f (z) = lim_{z \to \infty} f (z) . \end{matrix}

(13)

Because of this property, the reflection-symmetric function often has a singular point at

z = 0

, and to investigate the behavior of the function, we can employ the Laurent series as:

\begin{matrix} f (z) = c + \sum_{k = 1}^{\infty} (a_{k} z^{k} + b_{k} z^{- k}) . \end{matrix}

(14)

Note that if the function f is holomorphic over R,

b_{k} = 0

for all k, and the Laurent series is equivalent to the Taylor series.

Remark 3. If the function f is reflection-symmetric and holomorphic over R,

a_{k} = b_{k} = 0

holds for all k, and then, f is a constant function.

For the Bregman U-divergence Equation (11), we observe the following Lemma.

Lemma 4. Let

F_{0}

be an arbitrary function,

p (y | x) = {\bar{q}}_{F_{0}} (y | x)

be the underlying distribution and

q_{F} (x)

be the pseudo model Equation (3). If the Bregman U-divergence associated with the function U attains:

\begin{matrix} F_{0} = \underset{F}{argmin} D_{U} (p, q_{F}; r), \end{matrix}

(15)

a function

ξ^{'} (z) z^{2}

derived from U is reflection-symmetric. In addition, if the Bregman U-divergence associated with the function U attains:

\begin{matrix} F_{0} = \underset{F}{argmin} D_{U} (q_{F}, p; r), \end{matrix}

(16)

a function

z \{ξ (z) - ξ (\frac{z}{z + z^{- 1}})\}

derived from U is reflection-symmetric.

Proof. See Appendix C.

Remark 4. Proposition 1 implies that the function ξ associated with the IS distance satisfies Lemma 4.

Remark 5. The propositions imply that the function U (i.e., the Bregman U-divergence) attaining Equation (15) or (16) is not unique; there exist divergences satisfying Equation (15) or (16) other than the Itakura–Saito distance. For example, a function:

\begin{matrix} ξ (z) = - 2 z^{- \frac{2}{3}} - z^{- \frac{4}{3}} \end{matrix}

(17)

satisfies

ξ^{'} (z) z^{2} = \frac{4}{3} (z^{1 / 3} + z^{- 1 / 3})

, and then,

ξ^{'} (z) z^{2}

is reflection-symmetric. The associated generating function U is written as:

\begin{matrix} U (z) & = \int^{z} ξ^{- 1} (z^{'}) d z^{'} = - 4 \frac{- 2 + \sqrt{1 - z}}{\sqrt{- 1 + \sqrt{1 - z}}} + C_{1} \end{matrix}

(18)

where

C_{1}

is a constant.

In the following theorem, we reveal the characterization of the Itakura–Saito distance for estimation with the pseudo model Equation (3) and the Bregman U-divergence.

Theorem 5. Let

p (y | x) = {\bar{q}}_{F_{0}} (y | x)

be the underlying distribution and

q_{F} (x)

be the pseudo model Equation (3). If conditions:

\begin{matrix} F_{0} & = \underset{F}{argmin} D_{U} (p, q_{F}; r), \end{matrix}

(19)

\begin{matrix} F_{0} & = \underset{F}{argmin} D_{U} (q_{F}, p; r) \end{matrix}

(20)

simultaneously hold, then

U (z) = - log (- z)

, i.e.,

D_{U} (p, q; r)

is the Itakura–Saito distance

IS (p, q; r)

.

Proof. See Appendix D.

Remark 6. If we assume that a function

ξ^{'} (z) z^{2}

derived from U is reflection-symmetric and holomorphic over R,

ξ^{'} (z) z^{2}

is a constant function from Remark 3. Then, we obtain

ξ (z) = c + \frac{b_{1}}{z}

where

c, b_{1}

are constants, implying that the associated divergence is equivalent to the Itakura–Saito distance.

3.3. Relationship with AdaBoost

The IS distance between the underlying conditional distribution

p (y | x)

and the pseudo model

q_{F} (y | x)

is written as:

\begin{matrix} IS (p, q_{F}; r) & = C + \int r (x) \sum_{y \in 𝒴} \{F (x) y + \frac{p (y | x)}{q_{F} (y | x)}\} d x \\ & = C + \int r (x) \sum_{y \in 𝒴} p (y | x) e^{- F (x) y} d x, \end{matrix}

(21)

where C is a constant, and Equation (21) is equivalent to the expected loss of AdaBoost, except for the constant term. Hence, sequential minimization of an empirical version of Equation (21) is equivalent to the AdaBoost algorithm, which is the most popular boosting method for binary classification. Furthermore, it was discussed in [12,19] that a gradient-based boosting algorithm can be derived from the minimization of the KL divergence or the Bregman U-divergence between the underlying distribution and a pseudo model. An important difference between these frameworks and our framework Equation (21) is the employed pseudo model. The pseudo model employed by the previous frameworks assumes a condition called the “consistent data assumption” and is defined with the empirical distribution, implying that the pseudo model varies depending on the dataset. On the other hand, the pseudo model Equation (3) employed in Equation (21) does not depend on the dataset, as is the case for usual statistical models.

The IS distance between two pseudo models

q_{F} (y | x)

and

q_{F^{'}} (y | x)

is written as,

\begin{matrix} IS (q_{F}, q_{F^{'}}; r) & = \int r (x) \sum_{y \in 𝒴} \{F^{'} (x) y - F (x) y - 1 + exp (F (x) y - F^{'} (x) y)\} d x \\ & = - 2 + \int r (x) \{exp (F (x) - F^{'} (x)) + exp (F^{'} (x) - F (x))\} d x . \end{matrix}

(22)

Note that

IS (q_{F^{'}}, q_{F}; r) = IS (q_{F}, q_{F^{'}}; r)

holds for arbitrary

q_{F}

and

q_{F^{'}}

, while the IS distance itself is not necessarily symmetric. Furthermore, note that the symmetric property does not hold for normalized models

{\bar{q}}_{F}

and

{\bar{q}}_{F^{'}}

.
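The symmetry of the IS distance between pseudo models, and its absence for normalized models, can be illustrated at a single x. In this sketch the values of F and F' are arbitrary, and the pointwise closed form exp(d) + exp(-d) - 2 with d = F - F' follows from summing the definition in Equation (2) over y = ±1.

```python
import math

def is_dist(p, q):
    """Pointwise IS distance between two positive measures on {+1, -1}."""
    return sum(math.log(q[y] / p[y]) - 1.0 + p[y] / q[y] for y in (1, -1))

def pseudo(F):
    """Pseudo model q_F of Equation (3)."""
    return {y: math.exp(F * y) for y in (1, -1)}

def normalized(F):
    """Normalized logistic model of Equation (4)."""
    z = math.exp(F) + math.exp(-F)
    return {y: math.exp(F * y) / z for y in (1, -1)}

F, F_prime = 0.3, 1.4          # arbitrary function values at a fixed x
d = F - F_prime

forward = is_dist(pseudo(F), pseudo(F_prime))
backward = is_dist(pseudo(F_prime), pseudo(F))
closed_form = math.exp(d) + math.exp(-d) - 2.0

forward_norm = is_dist(normalized(F), normalized(F_prime))
backward_norm = is_dist(normalized(F_prime), normalized(F))
print(forward, backward, closed_form)
print(forward_norm, backward_norm)
```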

4. Application for Multi-Task Learning

There are two main types of frameworks for multi-task learning [20,21].

Case 1 :: There is a target dataset $𝒟_{k}$ , and our interest is to construct a discriminant function $F_{k}$ utilizing remaining datasets $𝒟_{j}$ ( $j \neq k$ ) or a priori constructed discriminant functions $F_{j}$ ( $j \neq k$ ).
Case 2 :: Our interest is to simultaneously construct better discriminant functions $F_{1}, ..., F_{J}$ using all J datasets $𝒟_{1}, ..., 𝒟_{J}$ by utilizing shared information among datasets.

4.1. Case 1

In this section, we focus on the above first framework. Let us assume that discriminant functions

F_{j} (x)

(

j \neq k

) are given or are constructed by an arbitrary binary classification method. Then, let us consider a risk function:

\begin{matrix} L_{k} (F_{k}) = & IS (p_{k}, q_{F_{k}}; r_{k}) + \sum_{j \neq k} λ_{k, j} IS (q_{F_{k}}, q_{F_{j}}; r_{k}) \\ = & \int r_{k} (x) \{\sum_{y \in 𝒴} p_{k} (y | x) e^{- F_{k} (x) y} + \sum_{j \neq k} λ_{k, j} \{e^{F_{k} (x) - F_{j} (x)} + e^{F_{j} (x) - F_{k} (x)}\}\} d x, \end{matrix}

(23)

where

λ_{k, j} \geq 0

(

j \neq k

) are regularization constants. Note that the risk function depends on functions

F_{j}

(

j \neq k

), and the second term becomes small when the target discriminant function

F_{k}

is similar to functions

F_{j}

(

j \neq k

) in the sense of the IS distance; and the second term corresponds to a regularizer incorporating the shared information among datasets into the target function

F_{k}

. Furthermore, note that the marginal distribution

r_{k}

is shared in the second term for the ease of implementation and the simplicity of theoretical analysis.

An empirical version of Equation (23) is written as:

{\bar{L}}_{k} (F_{k}) = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} (e^{- F_{k} (x_{i}^{(k)}) y_{i}^{(k)}} + \sum_{j \neq k} λ_{k, j} (e^{F_{k} (x_{i}^{(k)}) - F_{j} (x_{i}^{(k)})} + e^{F_{j} (x_{i}^{(k)}) - F_{k} (x_{i}^{(k)})})) .

(24)

An algorithm is derived by sequential minimization of Equation (24) by updating

F_{k}

to

F_{k} + α f

, i.e.,

(α, f) = {argmin}_{α, f} {\bar{L}}_{k} (F_{k} + α f)

, where f is a weak classifier and α is a coefficient [22].

(1) Initialize the function to $F_{k}^{0}$, and define weights for the i-th example with a function F as:

\begin{matrix} w_{1} (i; F) & = \frac{e^{- F (x_{i}^{(k)}) y_{i}^{(k)}}}{Z_{1} (F)}, \\ w_{2} (i; F) & = \frac{\sum_{j \neq k} λ_{k, j} e^{f (x_{i}^{(k)}) (F (x_{i}^{(k)}) - F_{j} (x_{i}^{(k)}))}}{Z_{2} (F)} \end{matrix}

where f is the candidate weak classifier and:

\begin{matrix} Z_{1} (F) & = \sum_{i = 1}^{n_{k}} e^{- F (x_{i}^{(k)}) y_{i}^{(k)}}, \\ Z_{2} (F) & = \sum_{i = 1}^{n_{k}} \sum_{j \neq k} λ_{k, j} (e^{F (x_{i}^{(k)}) - F_{j} (x_{i}^{(k)})} + e^{- F (x_{i}^{(k)}) + F_{j} (x_{i}^{(k)})}) . \end{matrix}

(2) For $t = 1, ..., T$:

(a): Select a weak classifier $f_{k}^{t} \in \{\pm 1\}$ that minimizes the following quantity:

$ε (f) = \frac{Z_{1} (F_{k}^{t - 1})}{Z_{1} (F_{k}^{t - 1}) + Z_{2} (F_{k}^{t - 1})} ε_{1} (f) + \frac{Z_{2} (F_{k}^{t - 1})}{Z_{1} (F_{k}^{t - 1}) + Z_{2} (F_{k}^{t - 1})} ε_{2} (f) ,$

(25)

where $ε_{1} (f) = \sum_{i = 1}^{n_{k}} w_{1} (i; F_{k}^{t - 1}) I (f (x_{i}^{(k)}) \neq y_{i}^{(k)})$ and $ε_{2} (f) = \sum_{i = 1}^{n_{k}} w_{2} (i; F_{k}^{t - 1})$ .
(b): Calculate a coefficient of $f_{k}^{t}$ by $α_{k}^{t} = \frac{1}{2} log \frac{1 - ε (f_{k}^{t})}{ε (f_{k}^{t})}$ .
(c): Update the discriminant function as $F_{k}^{t} = F_{k}^{t - 1} + α_{k}^{t} f_{k}^{t}$ .

(3) Output $F_{k}^{T} (x) = F_{k}^{0} (x) + \sum_{t = 1}^{T} α_{k}^{t} f_{k}^{t} (x)$.
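The steps above can be sketched in plain Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation: the decision-stump family, the toy one-dimensional dataset, the fixed function from another task and the numerical guard on ε are all assumptions introduced for the example.

```python
import math

def stump(dim, thr, sign):
    """Decision stump with values in {+1, -1} (assumed weak-learner family)."""
    return lambda x: sign if x[dim] > thr else -sign

def boost_case1(X, Y, others, lams, stumps, T=20):
    """Sketch of the Case 1 algorithm for a target dataset (X, Y).

    others : fixed discriminant functions F_j (j != k)
    lams   : regularization constants lambda_{k,j}, one per F_j
    stumps : candidate weak classifiers f
    """
    n = len(X)
    F = [0.0] * n                                # F_k^0(x_i) = 0
    G = [[Fj(x) for Fj in others] for x in X]    # F_j(x_i), held fixed
    ensemble = []
    for _ in range(T):
        u1 = [math.exp(-F[i] * Y[i]) for i in range(n)]
        Z1 = sum(u1)
        Z2 = sum(l * (math.exp(F[i] - g) + math.exp(g - F[i]))
                 for i in range(n) for l, g in zip(lams, G[i]))
        best = None
        for f in stumps:
            out = [f(x) for x in X]
            e1 = sum(u1[i] for i in range(n) if out[i] != Y[i])  # Z1 * eps_1(f)
            e2 = sum(l * math.exp(out[i] * (F[i] - g))           # Z2 * eps_2(f)
                     for i in range(n) for l, g in zip(lams, G[i]))
            eps = (e1 + e2) / (Z1 + Z2)                          # Equation (25)
            if best is None or eps < best[0]:
                best = (eps, f, out)
        eps, f, out = best
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        F = [F[i] + alpha * out[i] for i in range(n)]
        ensemble.append((alpha, f))
    return lambda x: sum(a * f(x) for a, f in ensemble)

def empirical_risk(Fk, X, Y, others, lams):
    """Empirical risk of Equation (24)."""
    total = 0.0
    for x, y in zip(X, Y):
        total += math.exp(-Fk(x) * y)
        for l, Fj in zip(lams, others):
            d = Fk(x) - Fj(x)
            total += l * (math.exp(d) + math.exp(-d))
    return total / len(X)

# Toy usage on a hypothetical one-dimensional task.
X = [(-2.0,), (-1.0,), (0.5,), (1.0,), (2.0,)]
Y = [-1, -1, -1, 1, 1]
stumps = [stump(0, t, s) for t in (-1.5, 0.0, 0.75, 1.5) for s in (1, -1)]
others = [lambda x: 0.5 * x[0]]    # assumed classifier from another task
lams = [0.1]

F_k = boost_case1(X, Y, others, lams, stumps, T=10)
risk_before = empirical_risk(lambda x: 0.0, X, Y, others, lams)
risk_after = empirical_risk(F_k, X, Y, others, lams)
print(risk_before, risk_after)
```

Because the coefficient α is the exact minimizer of the empirical risk along the chosen weak classifier, each round of this sketch cannot increase the value of Equation (24). Passing empty lists for `others` and `lams` removes the regularizer, so the loop then performs plain AdaBoost updates.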

In Step 1, $F_{k}^{0}$ is typically initialized as $F_{k}^{0} (x) = 0$. The quantity in Equation (25) is a mixture of two terms: $ε_{1} (f)$ is a weighted error rate of the classifier f, and $ε_{2} (f)$ is the sum of the weights $w_{2} (i; F)$, which becomes large when f has the same sign as $F - F_{j}$, i.e., when updating F by f moves it away from $F_{j}$. Note that if we set $λ_{k, j} = 0$ for all j, the risk function Equation (24) coincides with that of AdaBoost, and the algorithm derived above reduces to the usual AdaBoost.

Because the IS distance between pseudo models appearing in the empirical risk function Equation (24) is convex with respect to F or $F^{'}$, we can consider another version of the risk function as:

\begin{matrix} {\bar{L}}_{k} (F_{k}) & = \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} (e^{- F_{k} (x_{i}^{(k)}) y_{i}^{(k)}} + λ_{k} (e^{F_{k} (x_{i}^{(k)}) - {\bar{F}}_{k} (x_{i}^{(k)})} + e^{- F_{k} (x_{i}^{(k)}) + {\bar{F}}_{k} (x_{i}^{(k)})})) \end{matrix}

(26)

where $λ_{k} = \sum_{j \neq k} λ_{k, j}$ and ${\bar{F}}_{k} (x) = \sum_{j \neq k} \frac{λ_{k, j}}{λ_{k}} F_{j} (x)$. By Jensen's inequality, the risk function Equation (26) is upper bounded by the risk function Equation (24), implying that the effect of regularization by the shared information is weakened. The derived algorithm is almost the same as the one derived from Equation (24).

4.2. Case 2

In this section, we consider simultaneous construction of discriminant functions

F_{1}, ..., F_{J}

by minimizing the following risk function:

L (F_{1}, ..., F_{J}) = \sum_{j = 1}^{J} π_{j} L_{j} (F_{j})

(27)

where

π_{j}

(

j = 1, ..., J

) is a positive constant satisfying

\sum_{j = 1}^{J} π_{j} = 1

and

L_{k}

is defined in Equation (23).

Though we can directly minimize the empirical version of Equation (27), the derived algorithm is complicated and computationally heavy. We therefore derive a simplified algorithm that utilizes the algorithm for Case 1, in which the target dataset is fixed.

(1)

Initialize functions

F_{1}, ..., F_{J}

.

(2)

For

t = 1, ..., T

:

(a): Randomly choose a target index $k \in {1, ..., J}$ .
(b): Update the function $F_{k}$ using the algorithm in Case 1 by S steps, with fixed functions $F_{j}$ ( $j \neq k$ ).

(3)

Output learned functions

F_{1}, ..., F_{J}

.
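The outer loop of this simplified procedure can be sketched as below. The callback `update_case1` is a hypothetical stand-in for S boosting steps of the Case 1 algorithm; its name and signature, the zero initialization and the random seed are assumptions made only for this example.

```python
import random

def multitask_boost(J, update_case1, T, S, seed=0):
    """Sketch of the simplified Case 2 procedure.

    update_case1(k, F, S) is a hypothetical callback that runs S steps of
    the Case 1 algorithm on F[k] with all F[j] (j != k) held fixed and
    returns the updated F[k].
    """
    rng = random.Random(seed)
    F = [(lambda x: 0.0) for _ in range(J)]   # step (1): initialize F_1..F_J
    for _ in range(T):                         # step (2)
        k = rng.randrange(J)                   # (a) random target index
        F[k] = update_case1(k, F, S)           # (b) S Case 1 steps on F_k
    return F                                   # step (3)

# Toy usage with a stand-in update that only counts boosting steps per task.
calls = [0, 0, 0]

def dummy_update(k, F, S):
    calls[k] += S
    return F[k]

learned = multitask_boost(J=3, update_case1=dummy_update, T=12, S=5)
print(calls)
```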

Note that the empirical risk function is not guaranteed to decrease monotonically, because the minimization of $L_{k} (F_{k})$ is a trade-off between the first term and the second (regularization) term, and a decrease of $L_{k} (F_{k})$ does not necessarily mean a decrease of the regularization term.

4.3. Statistical Properties of the Proposed Methods

In this section, we discuss the statistical properties of the proposed methods. Firstly, we focus on Case 1, and the minimizer

F_{k}^{*}

of the risk function Equation (23) satisfies the following:

\begin{matrix} {\frac{δ L_{k} (F_{k})}{δ F_{k} (x)}|}_{F_{k} = F_{k}^{*}} & \propto - p_{k} (+ 1 | x) e^{- F_{k}^{*} (x)} + p_{k} (- 1 | x) e^{F_{k}^{*} (x)} + \sum_{j \neq k} λ_{k, j} \{e^{F_{k}^{*} (x) - F_{j} (x)} - e^{F_{j} (x) - F_{k}^{*} (x)}\} = 0, \end{matrix}

(28)

which implies:

\begin{matrix} F_{k}^{*} (x) & = \frac{1}{2} log \frac{p_{k} (+ 1 | x) + \sum_{j \neq k} λ_{k, j} exp (F_{j} (x))}{p_{k} (- 1 | x) + \sum_{j \neq k} λ_{k, j} exp (- F_{j} (x))}, \end{matrix}

(29)

or equivalently:

\begin{matrix} p_{k} (y | x) & = p_{0, k} (y | x) (1 + \sum_{j \neq k} λ_{k, j} exp (- F_{j} (x) y)) - p_{0, k} (- y | x) \sum_{j \neq k} λ_{k, j} exp (F_{j} (x) y), \end{matrix}

(30)

where

p_{0, k} (y | x) = \frac{exp (F_{k}^{*} (x) y)}{exp (F_{k}^{*} (x)) + exp (- F_{k}^{*} (x))}

. This can be interpreted as a probabilistic model of asymmetric mislabeling [17,18]. In Equation (29), the confidence of classification is discounted by the results of the remaining discriminant functions when the classifier $sgn (F_{k}^{*} (x))$ makes a decision different from those of $sgn (F_{j} (x))$ ($j \neq k$).
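A small pointwise check of Equations (28) and (29) is sketched below for a single x and one other task. The probability p_k(+1 | x) = 0.8, the values of F_j and the constant λ = 0.5 are arbitrary illustrative choices, and the function names are not from the paper.

```python
import math

def f_star(p_plus, F_others, lams):
    """Minimizer of Equation (29) at a fixed x."""
    num = p_plus + sum(l * math.exp(Fj) for l, Fj in zip(lams, F_others))
    den = (1.0 - p_plus) + sum(l * math.exp(-Fj) for l, Fj in zip(lams, F_others))
    return 0.5 * math.log(num / den)

def stationarity(F, p_plus, F_others, lams):
    """Left-hand side of Equation (28) evaluated at F."""
    reg = sum(l * (math.exp(F - Fj) - math.exp(Fj - F))
              for l, Fj in zip(lams, F_others))
    return -p_plus * math.exp(-F) + (1.0 - p_plus) * math.exp(F) + reg

p_plus = 0.8                                   # p_k(+1 | x), arbitrary
bayes = 0.5 * math.log(p_plus / (1.0 - p_plus))
agree = f_star(p_plus, [1.0], [0.5])           # other task also favors +1
disagree = f_star(p_plus, [-1.0], [0.5])       # other task favors -1
print(bayes, agree, disagree)
```

With these particular values, the disagreeing task pulls the minimizer below the unregularized score and even changes its sign.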

Remark 7.

F_{k}^{*} (x) \geq 0

does not mean

p_{k} (+ 1 | x) \geq \frac{1}{2}

unless

F_{j} (x) = \frac{1}{2} log \frac{p_{k} (+ 1 | x)}{p_{k} (- 1 | x)}

holds.

Proposition 6. Let us assume that

F_{j} (x)

satisfies:

\frac{exp (F_{j} (x) y)}{exp (F_{j} (x)) + exp (- F_{j} (x))} = p_{0} (y | x) + ϵ_{j} (x) y, | | ϵ_{j} (x) | | ≪ 1 .

(31)

Then, Equation (29) can be approximated as:

F_{k}^{*} (x) ≃ \frac{1}{2} log \frac{p_{0} (+ 1 | x)}{p_{0} (- 1 | x)} + \frac{1}{2 P^{2}} \frac{P δ_{k} (x) + \sum_{j \neq k} λ_{k, j} ϵ_{j} (x)}{P + λ_{k}}

(32)

where

P = \sqrt{p_{0} (+ 1 | x) p_{0} (- 1 | x)}

and

λ_{k} = \sum_{j \neq k} λ_{k, j}

.

Proof. We obtain Equation (32) by considering the Taylor expansion of Equation (29).

We observe that a discrepancy derived by

δ_{k}

is moderated by the mixture of

ϵ_{j}

when perturbations

ϵ_{j}

are independently and identically distributed.
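The first-order approximation in Equation (32) can be compared with the exact minimizer of Equation (29) at a single x. In the sketch below, the values of p_0(+1 | x), δ_k, ϵ_j and λ_{k,j} are small arbitrary numbers chosen for illustration, and the F_j are constructed so that Equation (31) holds exactly.

```python
import math

p0 = 0.7                      # p_0(+1 | x), arbitrary
delta = 0.002                 # task-specific term delta_k(x)
eps = [0.001, -0.0005]        # perturbations eps_j(x) of the other tasks
lams = [0.3, 0.2]             # lambda_{k,j}

# Other tasks' F_j, chosen so that Equation (31) holds exactly.
F_others = [0.5 * math.log((p0 + e) / (1.0 - p0 - e)) for e in eps]

# Exact minimizer, Equation (29), with p_k(+1 | x) = p_0(+1 | x) + delta.
num = p0 + delta + sum(l * math.exp(F) for l, F in zip(lams, F_others))
den = 1.0 - p0 - delta + sum(l * math.exp(-F) for l, F in zip(lams, F_others))
exact = 0.5 * math.log(num / den)

# First-order approximation, Equation (32).
P = math.sqrt(p0 * (1.0 - p0))
lam_k = sum(lams)
base = 0.5 * math.log(p0 / (1.0 - p0))
shift = (P * delta + sum(l * e for l, e in zip(lams, eps))) / (2.0 * P**2 * (P + lam_k))
approx = base + shift
print(exact, approx)
```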

Proposition 7. Let

η_{j} (x) = F_{j} (x) - F_{k} (x)

be a difference between two functions. Then,

F_{k}^{*}

can be approximated as:

\begin{matrix} F_{k}^{*} (x) ≃ \frac{1}{2} log \frac{p_{k} (+ 1 | x)}{p_{k} (- 1 | x)} + \frac{1}{P} \sum_{j \neq k} λ_{k, j} η_{j} (x) . \end{matrix}

(33)

Proof. See Appendix E.

Proposition 8. Let

{\bar{F}}_{k}^{*}

be a minimizer of the risk function Equation (23) with

λ_{k, j} = 0

(

j \neq k

). Then, we observe:

\begin{matrix} {({\bar{F}}_{k}^{*} (x) - \frac{1}{2} log \frac{p_{0} (+ 1 | x)}{p_{0} (- 1 | x)})}^{2} & \geq {(F_{k}^{*} (x) - \frac{1}{2} log \frac{p_{0} (+ 1 | x)}{p_{0} (- 1 | x)})}^{2}, \end{matrix}

(34)

i.e., the proposed method improves the performance in the sense of the squared error, when:

| δ_{k} (x) | \geq \frac{| \sum_{j \neq k} λ_{k, j} ϵ_{j} (x) |}{λ_{k}}

(35)

holds.

Proof. See Appendix F.

Secondly, we consider a property of the algorithm for Case 2.

Proposition 9. Let

r (x) = r_{j} (x)

(

j = 1, ..., J

) be a common marginal distribution shared by all tasks. Then, the minimizer of the risk function is written as:

\begin{matrix} F_{k} (x) & = \frac{1}{2} log \frac{p_{k} (+ 1 | x) + \sum_{j \neq k} λ_{j k} e^{F_{j} (x)}}{p_{k} (- 1 | x) + \sum_{j \neq k} λ_{j k} e^{- F_{j} (x)}}, \end{matrix}

(36)

where

λ_{j k} = λ_{k, j} + \frac{π_{j}}{π_{k}} λ_{j, k}

.

Proof. See Appendix G.

The only difference from Equation (28) is that the regularization is strengthened by $\frac{π_{j}}{π_{k}} λ_{j, k}$, and thus the same propositions as those for Case 1 hold for Equation (36).

4.4. Comparison of Regularization Terms

The proposed method incorporates the regularization term defined by the IS distance into AdaBoost. In this section, we discuss a property of the regularization term.

Proposition 10. Let

ϵ (x)

be a perturbation function satisfying

| ϵ (x) | ≪ 1

. Then, we observe:

\begin{matrix} KL ({\bar{q}}_{F}, {\bar{q}}_{F + ϵ}; r) & ≃ \int 2 r (x) ϵ {(x)}^{2} {\bar{q}}_{F} (+ 1 | x) {\bar{q}}_{F} (- 1 | x) d x, \end{matrix}

(37)

\begin{matrix} KL (q_{F}, q_{F + ϵ}; r) & ≃ \int \frac{r (x)}{2} ϵ {(x)}^{2} \frac{1}{\sqrt{{\bar{q}}_{F} (+ 1 | x) {\bar{q}}_{F} (- 1 | x)}} d x, \end{matrix}

(38)

\begin{matrix} IS ({\bar{q}}_{F}, {\bar{q}}_{F + ϵ}; r) & ≃ \int 2 r (x) ϵ {(x)}^{2} \sum_{y \in 𝒴} {\bar{q}}_{F} {(y | x)}^{2} d x, \end{matrix}

(39)

\begin{matrix} IS (q_{F}, q_{F + ϵ}; r) & ≃ \int r (x) ϵ {(x)}^{2} d x . \end{matrix}

(40)

Proof. We obtain these approximations by considering the Taylor expansion up to second order.
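The four second-order approximations can be checked numerically at a single x, where each integrand reduces to a scalar. The sketch below uses arbitrary values F = 0.7 and ϵ = 0.001 and compares each divergence with its approximation; the helper names are introduced only for this example.

```python
import math

def normalized(F):
    """Normalized logistic model of Equation (4)."""
    z = math.exp(F) + math.exp(-F)
    return {y: math.exp(F * y) / z for y in (1, -1)}

def pseudo(F):
    """Pseudo model of Equation (3)."""
    return {y: math.exp(F * y) for y in (1, -1)}

def ext_kl(p, q):
    """Pointwise extended KL divergence of Equation (7)."""
    return sum(p[y] * math.log(p[y] / q[y]) - p[y] + q[y] for y in (1, -1))

def is_dist(p, q):
    """Pointwise IS distance of Equation (2)."""
    return sum(math.log(q[y] / p[y]) - 1.0 + p[y] / q[y] for y in (1, -1))

F, e = 0.7, 1e-3              # arbitrary function value and perturbation
qb = normalized(F)

pairs = {
    "(37)": (ext_kl(normalized(F), normalized(F + e)),
             2.0 * e**2 * qb[1] * qb[-1]),
    "(38)": (ext_kl(pseudo(F), pseudo(F + e)),
             0.5 * e**2 / math.sqrt(qb[1] * qb[-1])),
    "(39)": (is_dist(normalized(F), normalized(F + e)),
             2.0 * e**2 * (qb[1]**2 + qb[-1]**2)),
    "(40)": (is_dist(pseudo(F), pseudo(F + e)), e**2),
}
ratios = {name: exact / approx for name, (exact, approx) in pairs.items()}
for name, ratio in ratios.items():
    print(name, ratio)
```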

Figure 1. Values of divergences (regularization terms) against ${\bar{q}}_{F}$.

Figure 1 shows the values of the divergences against the value of ${\bar{q}}_{F} (x)$. These relations imply that the KL divergence Equation (37) emphasizes a region of input x whose conditional distribution ${\bar{q}}_{F} (x)$ is nearly equal to $\frac{1}{2}$, i.e., the classification boundary, while the IS distance Equation (39) focuses on a region of x whose conditional distribution is nearly equal to zero or one. The IS distance between pseudo models Equation (40), i.e., the proposed method, is intermediate between Equations (37) and (39). This implies that the regularization Equation (40) with the IS distance puts more focus on regions far from the classification boundary than Equation (37) does, while Equation (39) tends to relatively ignore the region near the classification boundary. Furthermore, note that the use of Equation (40) makes it possible to derive the simple algorithm shown in Section 4.1.

5. Experiments

In this section, we investigate the performance of the proposed multi-task algorithm with synthetic datasets and a real dataset.

5.1. Synthetic Dataset

Firstly, we investigate the performance of the proposed method using two synthetic datasets within the situation described in Case 2. We compared the proposed method with AdaBoost trained with an individual dataset and AdaBoost trained with all datasets simultaneously. We employed the boosting stump (a decision tree with only one node) as the weak classifier and fixed $π_{j} = 1 / J$. A boosting-type method has a hyper-parameter T, the number of boosting steps, and the proposed method additionally has the hyper-parameters $λ_{k, j}$. In the experiment, we determined T and $λ_{k, j}$ by validation. In particular, we investigated two scenarios for the determination of $λ_{k, j}$.

Scenario 1 :: We set $λ_{k, j} = λ$ for all $j, k$ and determined λ.
Scenario 2 :: We set $λ_{k, j} = \frac{λ}{IS (q_{{\hat{F}}_{k}}, q_{{\hat{F}}_{j}}; r_{k})}$, where ${\hat{F}}_{j}$ is a discriminant function constructed by AdaBoost with the dataset $𝒟_{j}$, and determined λ.
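The task-dependent weighting of the second scenario can be sketched as follows, using an empirical IS distance between pseudo models averaged over the inputs of the target dataset. The pointwise form exp(d) + exp(-d) - 2 with d = F_k(x) - F_j(x) follows from Equation (2); the `floor` guard against division by zero, the toy inputs and the three example functions are assumptions made for illustration.

```python
import math

def empirical_is_pseudo(F_k, F_j, X_k):
    """Empirical IS distance between q_{F_k} and q_{F_j} over inputs of D_k."""
    total = 0.0
    for x in X_k:
        d = F_k(x) - F_j(x)
        total += math.exp(d) + math.exp(-d) - 2.0
    return total / len(X_k)

def scenario2_lambda(lam, F_k, F_j, X_k, floor=1e-12):
    """lambda_{k,j} = lam / IS(q_{F_k}, q_{F_j}; r_k); floor is a hypothetical guard."""
    return lam / max(empirical_is_pseudo(F_k, F_j, X_k), floor)

# Toy usage: one similar task and one task with the opposite structure.
X_k = [(-1.0,), (0.5,), (1.0,)]
F_hat_k = lambda x: x[0]
F_hat_similar = lambda x: 1.1 * x[0]
F_hat_opposite = lambda x: -x[0]

lam_similar = scenario2_lambda(1.0, F_hat_k, F_hat_similar, X_k)
lam_opposite = scenario2_lambda(1.0, F_hat_k, F_hat_opposite, X_k)
print(lam_similar, lam_opposite)
```

In this toy setting the similar task receives a much larger weight than the opposite one, which matches the intent of down-weighting tasks that share little information.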

Scenario 2 can incorporate more detailed information about the relationship between tasks, and the proposed method can ignore the information of tasks having less shared information. In summary, we compared the following four methods:

A:: The proposed method with $λ_{k, j}$ determined by Scenario 1.
B:: The proposed method with $λ_{k, j}$ determined by Scenario 2.
C:: AdaBoost trained with an individual dataset.
D:: AdaBoost trained with all datasets simultaneously.

We utilized

80 %

of the training dataset for training of classifiers and the remaining

20 %

for the validation. We repeated the above procedure 20 times and observed the averaged performance of the methods.

5.1.1. Dataset 1

We set the number J of tasks to three and assume that a marginal distribution of x is a uniform distribution on

{[- 1, 1]}^{2}

, and a discriminant function

F_{j}

(

j = 1, 2, 3

) associated with each dataset is generated by

F_{j} (x) = (1 + c_{j, 2}) (x_{1} - c_{j, 1}) - x_{2}

, where

c_{j, 1} \sim 𝒩 (0, 0 . 2^{2})

and

c_{j, 2} \sim 𝒩 (0, 0 . 1^{2})

. In addition, we randomly added contamination noise to the label y. Under these settings, we generated a training dataset, including 400 examples, and a test dataset, including 600 examples. Generated datasets are shown in Figure 2. We observe that each discriminant function and noise structure are different from the other two.
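A data generator in the spirit of this setting might look like the sketch below. The text does not specify how labels are drawn from F_j or how the contamination noise is applied, so labeling by the sign of F_j and flipping each label with a fixed probability `flip_prob` are assumptions of this example, as are the seed and function names.

```python
import random

def make_task(rng, n, flip_prob=0.1):
    """Generate one task; flip_prob is an assumed label-noise rate."""
    c1 = rng.gauss(0.0, 0.2)      # c_{j,1} ~ N(0, 0.2^2)
    c2 = rng.gauss(0.0, 0.1)      # c_{j,2} ~ N(0, 0.1^2)

    def F(x):
        return (1.0 + c2) * (x[0] - c1) - x[1]

    X, Y = [], []
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        y = 1 if F(x) >= 0 else -1   # assumed labeling rule: sign of F_j
        if rng.random() < flip_prob:  # assumed contamination mechanism
            y = -y
        X.append(x)
        Y.append(y)
    return X, Y, (c1, c2)

rng = random.Random(0)
tasks = [make_task(rng, 400) for _ in range(3)]   # J = 3 training sets
print([sum(1 for y in Y if y == 1) for _, Y, _ in tasks])
```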

Figure 2. The three generated datasets and decision boundaries.

Figure 3 shows boxplots of the test errors of each method for datasets

𝒟_{j}

(

j = 1, 2, 3

). We observe that the proposed method consistently outperforms individually trained AdaBoost, and AdaBoost trained with all datasets simultaneously. The figure shows that the proposed method can incorporate shared information among datasets into classifiers.

Figure 3. Boxplots of the test error of each method: A—proposed method with λ in Scenario 1; B—proposed method with λ in Scenario 2; C—AdaBoost trained with the individual dataset; D—AdaBoost trained with all datasets simultaneously; for three datasets, over the 20 simulation trials.

5.1.2. Dataset 2

We set the number J of tasks to six and assume that the marginal distribution of x is a uniform distribution on ${[- 1, 1]}^{2}$. Discriminant functions associated with each dataset are generated by:

$$F_{j} (x) = \begin{cases} (1 + c_{j, 2}) (x_{1} - c_{j, 1}) - x_{2}, & j = 1, 2, 3, \\ - (1 + c_{j, 2}) (x_{1} - c_{j, 1}) + x_{2}, & j = 4, 5, 6, \end{cases}$$

where $c_{j, 1} \sim 𝒩 (0, 0.1^{2})$ and $c_{j, 2} \sim 𝒩 (0, 0.1^{2})$. In addition, we randomly added a contamination noise on label y. Under these settings, we generated a training dataset, including 400 examples, and a test dataset, including 600 examples. Generated datasets are shown in Figure 4. We observe that Datasets 1, 2 and 3 share a structure, and Datasets 4, 5 and 6 share another structure.

Figure 4. The six generated datasets and decision boundaries.

Figure 5 shows boxplots of the test errors of each method for datasets $𝒟_{j}$ ($j = 1, ..., 6$). We omitted the result of AdaBoost trained with all datasets simultaneously (D) from the figure, because its performance is significantly worse than those of the other methods: the median of classification errors is around 0.5. This is because the structures of Datasets 1, 2, 3 and Datasets 4, 5, 6 are opposite, and the labeling of the concatenated dataset seems to be random. We observe that the proposed method with Scenario 2 (B) improves performance against individually-trained AdaBoost (C) and the proposed method in Scenario 1 (A). This is because the structure shared among Datasets 1, 2 and 3 does not have information about Datasets 4, 5 and 6 (and vice versa), and Method (B) can ignore the influence of the irrelevant information by adjusting $λ_{k, j}$ in response to $IS (q_{{\hat{F}}_{j}}, q_{{\hat{F}}_{k}}; r_{k})$. Note that the performance of Method (A) is not so degraded, because the regularization parameter $λ_{k, j}$ was determined to be zero, which corresponds to AdaBoost trained with the individual dataset.
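The remark that pooling opposite tasks yields errors around 0.5 can be illustrated in an idealized case. The sketch below assumes no label noise, $c_{j, 1} = c_{j, 2} = 0$, and the same inputs in both tasks; these simplifications and all names are ours, not the experimental setting. Under them, each input appears once with each label, so any deterministic classifier is wrong on exactly half of the pooled data.

```python
import random

def sign(v):
    return 1 if v > 0 else -1

def F_pos(x):
    # boundary of the first group with c_{j,1} = c_{j,2} = 0 (idealized)
    return x[0] - x[1]

def F_neg(x):
    # the second group uses the sign-flipped discriminant function
    return -F_pos(x)

def error_rate(classifier, data):
    return sum(1 for x, y in data if classifier(x) != y) / len(data)

rng = random.Random(1)
xs = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(200)]
xs = [x for x in xs if F_pos(x) != 0.0]   # drop points exactly on the boundary

task_a = [(x, sign(F_pos(x))) for x in xs]
task_b = [(x, sign(F_neg(x))) for x in xs]
pooled = task_a + task_b

true_rule_a = lambda x: sign(F_pos(x))    # perfect on task_a, always wrong on task_b
constant = lambda x: 1                    # an arbitrary alternative classifier
pooled_errors = (error_rate(true_rule_a, pooled), error_rate(constant, pooled))
```

With label noise and task-specific boundaries, as in the actual experiment, the pooled error is only approximately 0.5, consistent with the median reported above.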

Figure 6 shows examples of classification boundaries estimated by Methods A, B, C and D, for Dataset 6.

Figure 5. Boxplots of the test error of each method: A, proposed method with λ in Scenario 1; B, proposed method with λ in Scenario 2; C, AdaBoost trained with the individual dataset; for six datasets, over the 20 simulation trials.

Figure 6. Classification boundaries by Methods A, B, C and D for Dataset 6. The blue line is the true classification boundary, and the red line represents the estimated classification boundary.

5.2. Real Dataset: School Dataset

In this section, we compared the proposed method (Scenario 2) to a binary decision tree-based ensemble method, called extremely randomized trees (ExtraTrees) [23], by applying both to a real dataset, “school data”, reported from the Inner London Education Authority [24]. The dataset consists of examination records of 15,362 students from 139 secondary schools, i.e., we had 139 tasks. The dimension of input x is 27, in which original categorical variables were transformed into dummy variables. The original target variable $y_{0}$ represents score values in the range $[1, 70]$, and we transformed the target variable $y_{0}$ to a binary variable as:

$$y = \mathrm{sgn} (y_{0} - 20).$$

We set the threshold to 20 to balance the ratio of classes ($- 1 : + 1 = 7930 : 7432$). We randomly divided the dataset of each task into a training dataset (80%) and a test dataset (the remaining 20%). In addition, we used 20% of the divided training dataset as a validation dataset to determine the hyper-parameter λ and step number T. We repeated the above procedure 20 times and observed the average performance of the methods. Figure 7 shows the medians of error rates over 20 trials, by the proposed method and the ExtraTrees for 139 tasks. The horizontal axis indicates an index of a task, which is ranked in increasing order of the median error rate of the ExtraTrees. We observe that the proposed method is comparable to the ExtraTrees and has an advantage especially for tasks in which the error rates of the ExtraTrees are large.

Figure 7. Medians of error rates by the proposed method and extremely randomized trees (ExtraTrees) for 139 tasks. The horizontal axis represents an index of a task, and the vertical axis indicates the median of error rates over 20 trials. Tasks are ranked in increasing order of the median error rate of the ExtraTrees.

6. Conclusions

In this paper, we investigated the properties of binary classification with the pseudo model and revealed that minimization of the Itakura–Saito distance between the empirical distribution and the pseudo model is equivalent to AdaBoost and provides suitable properties for the binary classification. In addition, we pointed out that the Itakura–Saito distance is the unique divergence having a suitable property for estimation with the pseudo model in the framework of the Bregman divergence. Based on the framework, we proposed a novel binary classification method for the multi-task learning, which incorporates shared information among tasks into the targeted task. The risk function of the proposed method is defined by a mixture of IS distances. The IS distance between pseudo models can be interpreted as the regularization term, incorporating shared information among tasks into the binary classifier for the target task. We investigated statistical properties of the risk function and derived computationally-feasible boosting-based algorithms. Furthermore, we considered a mechanism for the adjustment of the degree of information sharing and numerically investigated the validity of the proposed methods.

Acknowledgments

This study was partially supported by a Grant-in-Aid for Young Scientists (B), 25730018, from MEXT, Japan. Shinto Eguchi and Osamu Komori were supported by the Japan Science and Technology Agency (JST), Core Research for Evolutionary Science and Technology (CREST).

Author Contributions

Takashi Takenouchi made major contributions to employing the Itakura–Saito divergence, and Shinto Eguchi gave a proof for the characterization associated with the divergence. Takashi Takenouchi and Osamu Komori contributed to the statistical discussion for the multi-task learning.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

A. Proof of Proposition 1

By a variational calculation, a minimizer of Equation (5) satisfies:

$$\frac{δ IS (p, q_{F}; r)}{δ F (x)} \propto \frac{e^{F_{0} (x) - F (x)} - e^{- F_{0} (x) + F (x)}}{e^{F_{0} (x)} + e^{- F_{0} (x)}} = 0, \tag{41}$$

and $F = F_{0}$ satisfies the above equation for an arbitrary $F_{0}$, which concludes Equation (5). Furthermore,

$$\frac{δ IS (q_{F}, p; r)}{δ F (x)} \propto (e^{F_{0} (x)} + e^{- F_{0} (x)}) (e^{F (x) - F_{0} (x)} - e^{- F (x) + F_{0} (x)}) = 0, \tag{42}$$

and $F = F_{0}$ satisfies the above equation for an arbitrary $F_{0}$, concluding Equation (6).
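As a numerical sanity check of Equation (41), the sketch below evaluates the pointwise IS distance at a single x and locates its minimizer over F on a grid. It assumes $p (y | x) = e^{y F_{0} (x)} / (e^{F_{0} (x)} + e^{- F_{0} (x)})$ and $q_{F} (y | x) = e^{y F (x)}$; the function names and the grid are illustrative choices.

```python
import math

def pointwise_is(F, F0):
    """sum over y of { log(q/p) + p/q - 1 } at a single x."""
    Z = math.exp(F0) + math.exp(-F0)
    total = 0.0
    for y in (-1, 1):
        p = math.exp(y * F0) / Z      # normalized conditional probability
        q = math.exp(y * F)           # unnormalized pseudo model
        total += math.log(q / p) + p / q - 1.0
    return total

def argmin_on_grid(F0, lo=-3.0, hi=3.0, steps=6001):
    """Grid search for the F minimizing the pointwise IS distance."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda F: pointwise_is(F, F0))

minimizers = {F0: argmin_on_grid(F0) for F0 in (-1.2, 0.0, 0.5)}
```

Up to the grid resolution, the minimizer coincides with $F_{0}$ for each tested value, in line with the statement that $F = F_{0}$ solves Equation (41) for arbitrary $F_{0}$.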

B. Proof of Proposition 2

By a straightforward variational calculation, we observe that a minimizer $F_{KL, 1}$ of Equation (8) satisfies:

$$\begin{aligned} \frac{δ KL (p, q_{F}; r)}{δ F (x)} & \propto - p (+ 1 | x) + p (- 1 | x) + \exp (F_{KL, 1} (x)) - \exp (- F_{KL, 1} (x)) \\ & = \frac{- e^{F_{0} (x)} + e^{- F_{0} (x)}}{e^{F_{0} (x)} + e^{- F_{0} (x)}} + e^{F_{KL, 1} (x)} - e^{- F_{KL, 1} (x)} = 0, \end{aligned} \tag{43}$$

and $F_{KL, 1} = F_{0}$ implies $F_{0} (x) = 0$ ($\forall x$), which concludes Equation (8). Furthermore, for Equation (9), $F_{KL, 2}$ satisfies:

$$\begin{aligned} \frac{δ KL (q_{F}, p; r)}{δ F (x)} \propto {} & (F_{KL, 2} (x) - F_{0} (x)) (e^{F_{KL, 2} (x)} + e^{- F_{KL, 2} (x)}) + (e^{F_{KL, 2} (x)} - e^{- F_{KL, 2} (x)}) \log (e^{F_{0} (x)} + e^{- F_{0} (x)}) \\ = {} & 0, \end{aligned}$$

and $F_{KL, 2} = F_{0}$ implies $F_{0} (x) = 0$ ($\forall x$), concluding Equation (9).
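Equation (43) can be solved in closed form, which makes the mismatch explicit. Its first two terms equal $- \tanh (F_{0} (x))$ and the last two equal $2 \sinh (F_{KL, 1} (x))$, so $F_{KL, 1} (x) = \sinh^{- 1} (\tanh (F_{0} (x)) / 2)$. The sketch below is our own rearrangement of the displayed condition, with illustrative function names, and checks it numerically.

```python
import math

def kl_minimizer(F0):
    """Solve -tanh(F0) + 2*sinh(F) = 0 for F (rearranged Equation (43))."""
    return math.asinh(math.tanh(F0) / 2.0)

def stationarity_residual(F, F0):
    """Left-hand side of Equation (43), evaluated at a candidate F."""
    p_pos = math.exp(F0) / (math.exp(F0) + math.exp(-F0))
    p_neg = 1.0 - p_pos
    return -p_pos + p_neg + math.exp(F) - math.exp(-F)

# The gap F_KL,1 - F0 vanishes only at F0 = 0.
gaps = {F0: kl_minimizer(F0) - F0 for F0 in (-2.0, -0.5, 0.0, 0.5, 2.0)}
```

The solution has the same sign as $F_{0}$ but a different magnitude whenever $F_{0} (x) \neq 0$, which is the sense in which $F_{KL, 1} = F_{0}$ forces $F_{0} (x) = 0$.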

C. Proof of Lemma 4

If Equation (15) holds, $F_{0}$ satisfies:

$$\begin{aligned} \left. \frac{δ D_{U} (p, q_{F}; r)}{δ F (x)} \right|_{F = F_{0}} & = \left(1 - \frac{1}{\sum_{y \in 𝒴} q_{F_{0}} (y | x)}\right) \sum_{y \in 𝒴} y ξ' (q_{F_{0}} (y | x)) {q_{F_{0}} (y | x)}^{2} \\ & \propto ξ' (e^{F_{0} (x)}) e^{2 F_{0} (x)} - ξ' (e^{- F_{0} (x)}) e^{- 2 F_{0} (x)} \\ & = 0. \end{aligned}$$

By setting $z = e^{F_{0} (x)}$, we have $z^{2} ξ' (z) = z^{- 2} ξ' (z^{- 1})$, i.e., the function $ξ' (z) z^{2}$ is reflection-symmetric.

If Equation (16) holds, $F_{0}$ satisfies:

$$\begin{aligned} \left. \frac{δ D_{U} (q_{F}, p; r)}{δ F (x)} \right|_{F = F_{0}} = {} & \sum_{y \in 𝒴} y q_{F_{0}} (y | x) \{ξ (q_{F_{0}} (y | x)) - ξ ({\bar{q}}_{F_{0}} (y | x))\} \\ = {} & e^{F_{0} (x)} \left\{ξ (e^{F_{0} (x)}) - ξ \left(\frac{e^{F_{0} (x)}}{e^{F_{0} (x)} + e^{- F_{0} (x)}}\right)\right\} - e^{- F_{0} (x)} \left\{ξ (e^{- F_{0} (x)}) - ξ \left(\frac{e^{- F_{0} (x)}}{e^{F_{0} (x)} + e^{- F_{0} (x)}}\right)\right\} \\ = {} & 0, \end{aligned}$$

implying that the function $z \{ξ (z) - ξ (\frac{z}{z + z^{- 1}})\}$ is reflection-symmetric.

D. Proof of Theorem 5

For the proof of the theorem, we first prepare the following lemmas.

Lemma 11. Let $f (z)$ be a reflection-symmetric and holomorphic function on $z \neq 0$. Then, $a_{k} = b_{k}$ holds for all $k \geq 1$.

Proof. The function f can be expressed as Equation (14), and let us assume that there exists an integer $k_{0}$, such that $a_{k_{0}} \neq b_{k_{0}}$. From the reflection-symmetric property, we have:

$$(a_{k_{0}} - b_{k_{0}}) (z^{k_{0}} - z^{- k_{0}}) = 0 \tag{44}$$

for all $z > 0$, which contradicts $a_{k_{0}} \neq b_{k_{0}}$.

Lemma 12. Let $ξ (z)$ be a holomorphic function on $z \neq 0$. If the two functions:

$$ξ' (z) z^{2} \quad \text{and} \quad z \left\{ξ (z) - ξ \left(\frac{z}{z + z^{- 1}}\right)\right\} \tag{45}$$

are both reflection-symmetric, then $ξ (z) = \frac{c_{1}}{z} + c_{0}$.

Proof. We can express the function $ξ (z)$ by a Laurent series as:

$$ξ (z) = c + \sum_{k = 1}^{\infty} (a_{k} z^{k} + b_{k} z^{- k}). \tag{46}$$

Then, we have:

$$\begin{aligned} ξ' (z) z^{2} & = \sum_{k = 1}^{\infty} k (a_{k} z^{k + 1} - b_{k} z^{- k + 1}) \\ & = - b_{1} - 2 b_{2} z^{- 1} + \sum_{k = 1}^{\infty} (k a_{k} z^{k + 1} - (k + 2) b_{k + 2} z^{- k - 1}). \end{aligned} \tag{47}$$

Because of the assumption of reflection-symmetry for $z^{2} ξ' (z)$ and Lemma 11, we have $b_{2} = 0$ and $k a_{k} = - (k + 2) b_{k + 2}$ for all $k \geq 1$. Thus, we obtain:

$$\begin{aligned} ξ (z) & = \int \left\{- \frac{b_{1}}{z^{2}} + \sum_{k = 1}^{\infty} a_{k} (k z^{k - 1} + k z^{- k - 3})\right\} d z \\ & = c + b_{1} z^{- 1} + \sum_{k = 1}^{\infty} a_{k} \left(z^{k} - \frac{k}{k + 2} z^{- k - 2}\right). \end{aligned} \tag{48}$$

Then, we have:

$$\begin{aligned} & z \left\{ξ (z) - ξ \left(\frac{z}{z + z^{- 1}}\right)\right\} \\ & = b_{1} (1 - (z + z^{- 1})) + \sum_{k = 1}^{\infty} a_{k} \left\{z^{k + 1} (1 - {(z + z^{- 1})}^{- k}) - \frac{k}{k + 2} z^{- k - 1} (1 - {(z + z^{- 1})}^{k + 2})\right\}. \end{aligned} \tag{49}$$

From Equation (48) and the assumption of the reflection-symmetry of the function $z \{ξ (z) - ξ (\frac{z}{z + z^{- 1}})\}$, we observe that for all z,

$$z \left\{ξ (z) - ξ \left(\frac{z}{z + z^{- 1}}\right)\right\} - z^{- 1} \left\{ξ (z^{- 1}) - ξ \left(\frac{z^{- 1}}{z + z^{- 1}}\right)\right\} = \sum_{k = 1}^{\infty} a_{k} h_{k} (z) = 0 \tag{50}$$

where:

$$h_{k} (z) = (z^{k + 1} - z^{- k - 1}) \left\{1 - {(z + z^{- 1})}^{- k} + \frac{k}{k + 2} \{1 - {(z + z^{- 1})}^{k + 2}\}\right\}. \tag{51}$$

Since $\{h_{k} (z)\}_{k = 1}^{\infty}$ is functionally independent, we conclude that $a_{k} = 0$ for all $k \geq 1$ or, equivalently, $ξ (z) = c + \frac{b_{1}}{z}$.
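The two conditions of Lemma 12 can be probed numerically. The sketch below reads "reflection-symmetric" as $f (z) = f (z^{- 1})$, which is how the property is used in the proof of Lemma 4, and compares $ξ (z) = c + b_{1} / z$ with $ξ (z) = \log z$; the latter is chosen here only as a contrasting example. The names, test points and tolerance are illustrative.

```python
import math

def is_reflection_symmetric(f, zs, tol=1e-9):
    """Check f(z) == f(1/z) on a finite set of points z > 0."""
    return all(abs(f(z) - f(1.0 / z)) <= tol * max(1.0, abs(f(z))) for z in zs)

def lemma12_functions(xi, dxi):
    """Return the two functions of Equation (45) for a given xi and its derivative."""
    g1 = lambda z: dxi(z) * z * z
    g2 = lambda z: z * (xi(z) - xi(z / (z + 1.0 / z)))
    return g1, g2

zs = [0.25, 0.5, 2.0, 3.0, 7.5]

# xi(z) = c + b1/z, the form obtained at the end of the proof
b1, c = -1.0, 0.3
g1_is, g2_is = lemma12_functions(lambda z: c + b1 / z, lambda z: -b1 / (z * z))

# xi(z) = log z, used only as a contrasting example
g1_log, g2_log = lemma12_functions(math.log, lambda z: 1.0 / z)

results = {
    "c + b1/z": (is_reflection_symmetric(g1_is, zs), is_reflection_symmetric(g2_is, zs)),
    "log z": (is_reflection_symmetric(g1_log, zs), is_reflection_symmetric(g2_log, zs)),
}
```

For $ξ (z) = c + b_{1} / z$, the first function is the constant $- b_{1}$ and the second is $b_{1} (1 - (z + z^{- 1}))$, so both checks pass; for $\log z$ both fail on these points. A finite check of this kind illustrates the lemma but does not replace the proof.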

We now give a proof for Theorem 5 using Lemma 12.

Proof. If the conditions in Equations (19) and (20) hold, the functions $ξ' (z) z^{2}$ and $z \{ξ (z) - ξ (\frac{z}{z + z^{- 1}})\}$ are both reflection-symmetric from Lemma 4. From Lemma 12, the reflection-symmetric property of these two functions implies $ξ (z) = \frac{b_{1}}{z} + c$. Since the function should be defined on $z > 0$, the generating function U derived from ξ is written as:

$$U (z) = \int ξ^{- 1} (z) d z = b_{1} \log (c - z) + c_{1} \quad (z < c) \tag{52}$$

where $c_{1}$ is a constant and $b_{1} < 0$ holds because of the convexity of function U. Then, we have $U (ξ (z)) = b_{1} \log (- b_{1}) - b_{1} \log z + c_{1}$ ($z > 0$), and the associated divergence is equivalent to the IS distance up to the constant factor $- b_{1}$, i.e.,

$$\begin{aligned} D_{U} (p, q; r) & = \int r (x) \sum_{y \in 𝒴} \left\{- b_{1} \log \frac{q (y | x)}{p (y | x)} - p (y | x) \left\{\frac{b_{1}}{q (y | x)} - \frac{b_{1}}{p (y | x)}\right\}\right\} d x \\ & = - b_{1} \int r (x) \sum_{y \in 𝒴} \left\{\log \frac{q (y | x)}{p (y | x)} + \frac{p (y | x)}{q (y | x)} - 1\right\} d x \\ & = - b_{1} IS (p, q; r). \end{aligned} \tag{53}$$
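The identity in Equation (53) can be checked pointwise. The sketch below takes the integrand as $U (ξ (q)) - U (ξ (p)) - p \{ξ (q) - ξ (p)\}$, which is what the first line of Equation (53) shows once $U$ and $ξ$ are substituted, and compares it with $- b_{1}$ times the IS integrand. The names and numerical values are illustrative.

```python
import math

def make_generator(b1, c, c1=0.0):
    """xi(z) = b1/z + c for z > 0, and U(t) = b1*log(c - t) + c1 for t < c."""
    xi = lambda z: b1 / z + c
    U = lambda t: b1 * math.log(c - t) + c1
    return xi, U

def bregman_term(p, q, xi, U):
    """Pointwise integrand of D_U(p, q) for a single (x, y)."""
    return U(xi(q)) - U(xi(p)) - p * (xi(q) - xi(p))

def is_term(p, q):
    """Pointwise integrand of the IS distance."""
    return math.log(q / p) + p / q - 1.0

b1, c, c1 = -2.0, 0.3, 1.7            # b1 < 0, as required by convexity of U
xi, U = make_generator(b1, c, c1)
pairs = [(0.3, 0.7), (0.9, 0.2), (1.5, 0.4)]
differences = [bregman_term(p, q, xi, U) + b1 * is_term(p, q) for p, q in pairs]
```

The constants $c$ and $c_{1}$ cancel in the integrand, so the differences vanish up to rounding for any admissible choice.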

E. Proof of Proposition 7

From Equation (28), we observe:

$$\begin{aligned} F_{k}^{*} (x) & = \log \frac{\sqrt{p_{k} (+ 1 | x) + \frac{1}{4 p_{k} (- 1 | x)} {\left(\sum_{j \neq k} λ_{k, j} \{e^{- η_{j} (x)} - e^{η_{j} (x)}\}\right)}^{2}} - \frac{1}{2 \sqrt{p_{k} (- 1 | x)}} \sum_{j \neq k} λ_{k, j} \{e^{- η_{j} (x)} - e^{η_{j} (x)}\}}{\sqrt{p_{k} (- 1 | x)}} \\ & \simeq \frac{1}{2} \log \frac{p_{k} (+ 1 | x)}{p_{k} (- 1 | x)} + \frac{1}{P} \sum_{j \neq k} λ_{k, j} η_{j} (x). \end{aligned}$$
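The approximation step can be checked numerically under two assumptions that are ours rather than taken from this appendix: that the $η_{j} (x)$ are small, and that $P = \sqrt{p_{k} (+ 1 | x) p_{k} (- 1 | x)}$, which is the value a first-order expansion of the exact expression suggests. The symbol $P$ is defined outside this section, so this reading should be checked against its definition there.

```python
import math

def F_exact(p_pos, lams, etas):
    """Exact expression for F_k^*(x) at a single x."""
    p_neg = 1.0 - p_pos
    A = sum(l * (math.exp(-e) - math.exp(e)) for l, e in zip(lams, etas))
    numerator = math.sqrt(p_pos + A * A / (4.0 * p_neg)) - A / (2.0 * math.sqrt(p_neg))
    return math.log(numerator / math.sqrt(p_neg))

def F_first_order(p_pos, lams, etas):
    """Approximation with P = sqrt(p_pos * p_neg) (ASSUMED meaning of P)."""
    p_neg = 1.0 - p_pos
    P = math.sqrt(p_pos * p_neg)
    return 0.5 * math.log(p_pos / p_neg) + sum(l * e for l, e in zip(lams, etas)) / P

p_pos, lams = 0.7, [0.1, 0.2]
small = [0.01, -0.02]        # small eta_j: the two expressions nearly agree
large = [1.0, -2.0]          # large eta_j: the approximation degrades
gap_small = abs(F_exact(p_pos, lams, small) - F_first_order(p_pos, lams, small))
gap_large = abs(F_exact(p_pos, lams, large) - F_first_order(p_pos, lams, large))
```

When all $η_{j}$ are zero, both expressions reduce to $\frac{1}{2} \log \frac{p_{k} (+ 1 | x)}{p_{k} (- 1 | x)}$, the unregularized solution.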

F. Proof of Proposition 8

We observe that:

$$\begin{aligned} & {\left({\bar{F}}_{k}^{*} (x) - \frac{1}{2} \log \frac{p_{0} (+ 1 | x)}{p_{0} (- 1 | x)}\right)}^{2} - {\left(F_{k}^{*} (x) - \frac{1}{2} \log \frac{p_{0} (+ 1 | x)}{p_{0} (- 1 | x)}\right)}^{2} \\ & = \frac{1}{4 P^{4} {(P + λ_{k})}^{2}} \left(λ_{k} δ_{k} (x) - \sum_{j \neq k} λ_{k, j} ϵ_{j} (x)\right) \left((λ_{k} + 2 P) δ_{k} (x) + \sum_{j \neq k} λ_{k, j} ϵ_{j} (x)\right), \end{aligned}$$

which implies the proposition.

G. Proof of Proposition 9

The minimizer of the risk function Equation (27) satisfies:

$$\begin{aligned} \frac{δ L (F_{1}, ..., F_{J})}{δ F_{k}} \propto {} & e^{F_{k} (x)} \left\{π_{k} p_{k} (- 1 | x) + \sum_{j \neq k} (π_{k} λ_{k, j} + π_{j} λ_{j, k}) e^{- F_{j} (x)}\right\} \\ & - e^{- F_{k} (x)} \left\{π_{k} p_{k} (+ 1 | x) + \sum_{j \neq k} (π_{k} λ_{k, j} + π_{j} λ_{j, k}) e^{F_{j} (x)}\right\} \\ = {} & 0, \end{aligned}$$

implying Equation (36).
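Solving the displayed condition for $F_{k} (x)$ with the other $F_{j} (x)$ held fixed gives $F_{k} (x) = \frac{1}{2} \log (B / A)$, where $A$ and $B$ are the two bracketed terms. The sketch below implements this rearrangement at a single x. Equation (36) lies outside this section, so the code is only our reading of the stationarity condition, and all names are hypothetical.

```python
import math

def bracket_terms(k, F, p_pos, pi, lam):
    """Return (A, B): the brackets multiplying e^{F_k} and e^{-F_k}."""
    A = pi[k] * (1.0 - p_pos[k])     # pi_k * p_k(-1|x)
    B = pi[k] * p_pos[k]             # pi_k * p_k(+1|x)
    for j in range(len(F)):
        if j == k:
            continue
        w = pi[k] * lam[k][j] + pi[j] * lam[j][k]
        A += w * math.exp(-F[j])
        B += w * math.exp(F[j])
    return A, B

def solve_F_k(k, F, p_pos, pi, lam):
    """F_k(x) satisfying e^{F_k} A - e^{-F_k} B = 0 with the other F_j fixed."""
    A, B = bracket_terms(k, F, p_pos, pi, lam)
    return 0.5 * math.log(B / A)

# Illustrative values at a single x for J = 3 tasks.
F = [0.0, 0.4, -0.3]                 # current values F_j(x)
p_pos = [0.7, 0.6, 0.2]              # p_j(+1|x)
pi = [1.0 / 3.0] * 3                 # task weights pi_j
lam = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.3], [0.2, 0.3, 0.0]]
F0_new = solve_F_k(0, F, p_pos, pi, lam)
```

With all $λ_{k, j} = 0$, the solution reduces to $\frac{1}{2} \log \frac{p_{k} (+ 1 | x)}{p_{k} (- 1 | x)}$, the single-task solution; positive coupling pulls $F_{k} (x)$ toward the values of the other tasks.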

References

1. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75.
2. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
3. Argyriou, A.; Pontil, M.; Ying, Y.; Micchelli, C.A. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 19; MIT Press: Cambridge, MA, USA, 2007.
4. Argyriou, A.; Evgeniou, T.; Pontil, M. Multi-task feature learning. In Advances in Neural Information Processing Systems 19; MIT Press: Cambridge, MA, USA, 2007.
5. Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 193–200.
6. Wang, X.; Zhang, C.; Zhang, Z. Boosted multi-task learning for face verification with applications to web image and video search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 142–149.
7. Chapelle, O.; Shivaswamy, P.; Vadrevu, S.; Weinberger, K.; Zhang, Y.; Tseng, B. Multi-task learning for boosting with application to web search ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1189–1198.
8. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
9. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura–Saito divergence: With application to music analysis. Neural Comput. 2009, 21, 793–830.
10. Lefevre, A.; Bach, F.; Févotte, C. Itakura–Saito nonnegative matrix factorization with group sparsity. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 21–24.
11. Takenouchi, T.; Komori, O.; Eguchi, S. A novel boosting algorithm for multi-task learning based on the Itakura–Saito divergence. In Proceedings of the Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014; pp. 230–237.
12. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481.
13. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
14. Amari, S.; Nagaoka, H. Methods of Information Geometry; Translations of Mathematical Monographs; Oxford University Press: Providence, RI, USA, 2000; Volume 191.
15. Minami, M.; Eguchi, S. Robust blind source separation by beta divergence. Neural Comput. 2002, 14, 1859–1886.
16. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
17. Takenouchi, T.; Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural Comput. 2004, 16, 767–787.
18. Takenouchi, T.; Eguchi, S.; Murata, N.; Kanamori, T. Robust boosting algorithm against mislabeling in multi-class problems. Neural Comput. 2008, 20, 1596–1630.
19. Lebanon, G.; Lafferty, J. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14; MIT Press: Cambridge, MA, USA, 2002.
20. Evgeniou, T.; Pontil, M. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 109–117.
21. Xue, Y.; Liao, X.; Carin, L.; Krishnapuram, B. Multi-task learning for classification with Dirichlet process priors. J. Mach. Learn. Res. 2007, 8, 35–63.
22. Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. In Advances in Neural Information Processing Systems 11; MIT Press: Cambridge, MA, USA, 1999.
23. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
24. Goldstein, H. Multilevel modelling of survey data. J. R. Stat. Soc. Ser. D 1991, 40, 235–244.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

