Article

An Online Semi-Definite Programming with a Generalized Log-Determinant Regularizer and Its Applications †

1 Department of Informatics, Kyushu University, Fukuoka 819-0395, Japan
2 AIP RIKEN, Tokyo 103-0027, Japan
3 SMN Corporation, Tokyo 141-0032, Japan
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of the 13th Asian Conference on Machine Learning, 17–19 November 2021, PMLR 157:1113–1128.
Mathematics 2022, 10(7), 1055; https://doi.org/10.3390/math10071055
Submission received: 21 February 2022 / Revised: 23 March 2022 / Accepted: 24 March 2022 / Published: 25 March 2022
(This article belongs to the Special Issue Advanced Optimization Methods and Applications)

Abstract: We consider a variant of the online semi-definite programming (OSDP) problem. Specifically, in our problem, the decision space is a set of positive semi-definite matrices constrained by two norms in parallel: the $L_\infty$ norm on the diagonal entries and the $\Gamma$-trace norm, which is a generalized trace norm defined with a positive definite matrix $\Gamma$. Our setting recovers the original one when $\Gamma$ is the identity matrix. To solve this problem, we design a follow-the-regularized-leader algorithm with a $\Gamma$-dependent regularizer, which also generalizes the log-determinant function. Next, we focus on online binary matrix completion (OBMC) with side information and online similarity prediction with side information. By reducing both problems to the OSDP framework and applying our proposed algorithm, we remove the logarithmic factors in their previously known mistake bounds. In particular, for OBMC, our bound is optimal. Furthermore, our result implies a better offline generalization bound for the algorithm, similar to those of SVMs with the best kernel, when the side information is given in advance.

1. Introduction

Online binary matrix completion (OBMC) stands at the frontier of research on online matrix completion, which is currently an active field in the machine learning community [1,2,3,4]. Intuitively, the OBMC problem is a sequential game of predicting entries of an unknown $m \times n$ target binary matrix. More specifically, the problem can be formulated as a repeated game between the algorithm and the adversarial environment as follows: on each round t, (i) the environment announces the location of an entry $(i_t, j_t) \in [m] \times [n]$ of the target matrix, (ii) the algorithm predicts $\hat{y}_t \in \{-1, 1\}$, and then (iii) the environment reveals the true label $y_t \in \{-1, 1\}$. The goal of the algorithm is to minimize the total number of mistakes $\sum_{t=1}^{T} \mathbb{I}[\hat{y}_t \neq y_t]$. The OBMC model is widely applied in the real world, for example in the "Netflix Challenge" [5], where the rating matrix (target matrix) has rows representing the viewers and columns corresponding to the movies. Concretely, entry $(i,j)$ is the rating given by viewer i to movie j.
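The protocol above can be sketched as a short simulation. The majority-vote predictor below is purely illustrative (it is not an algorithm from this paper) and serves only to make the mistake count $\sum_{t=1}^{T} \mathbb{I}[\hat{y}_t \neq y_t]$ concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal sketch of the OBMC protocol (all names are illustrative).
# The environment draws entries of a hidden m x n sign matrix; the
# "algorithm" here is a trivial per-row majority-vote predictor.
m, n, T = 5, 4, 50
target = np.sign(rng.standard_normal((m, n)))  # unknown to the learner

row_votes = np.zeros(m)
mistakes = 0
for t in range(T):
    i, j = rng.integers(m), rng.integers(n)   # (i_t, j_t) from the environment
    y_hat = 1 if row_votes[i] >= 0 else -1    # algorithm's prediction
    y = target[i, j]                          # true label revealed
    if y_hat != y:
        mistakes += 1                         # I[y_hat != y]
    row_votes[i] += y                         # update from feedback

print(mistakes)  # total number of mistakes over T rounds
```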
For convenience, we define an underlying matrix U, the comparator matrix, as a sufficiently good approximation of the unknown $m \times n$ target matrix. More precisely, assume that $U \in \mathbb{R}^{m \times n}$ can be factorized as $U = P Q^\top$ for some matrices $P \in \mathbb{R}^{m \times d}$ and $Q \in \mathbb{R}^{n \times d}$ with some $d \ge 1$. Without loss of generality, we further assume that the rows of P and Q are normalized so that $\|P_i\| = \|Q_j\| = 1$ for all i and j, where $P_i$ is the i-th row vector of P (interpreted as a linear classifier associated with row i of U) and $Q_j$ is the j-th row vector of Q (interpreted as a feature vector associated with column j of U). Hence, the sign of $U_{i,j}$ can be viewed as the classification of the feature $Q_j$ by the classifier $P_i$. Next, to quantify the predictiveness of the comparator matrix U, we use the hinge loss $[1 - z_t/\gamma]_+$ for a given margin parameter $\gamma > 0$, where $[x]_+$ is x if $x > 0$ and 0 otherwise. Note that $z_t = y_t U_{i_t, j_t}$ can be viewed as the margin of the labeled instance $(Q_{j_t}, y_t)$ with respect to the hyperplane $P_{i_t}$. Moreover, the hinge loss converges to the 0–1 loss as $\gamma$ tends to 0.
Recently, Herbster et al. explored the OBMC problem with side information given to the algorithm in advance [2]. The side information provides prior knowledge about the target matrix or, more generally, about a comparator matrix U. Formally, side information is represented, according to the rows and columns of the comparator matrix, as two symmetric positive definite matrices $M \in \mathbb{R}^{m \times m}$ and $N \in \mathbb{R}^{n \times n}$. To measure the quality of the side information, Herbster et al. introduced the quasi-dimension of a comparator matrix U, defined as the minimum of $\mathcal{D} = \mathrm{Tr}(P^\top M P) + \mathrm{Tr}(Q^\top N Q)$ over all factorizations of U such that $U = \gamma P Q^\top$. They then proved a mistake bound given by the total hinge loss of U plus an additional term expressed in terms of $\gamma$, m, n, and $\mathcal{D}$. In particular, Herbster et al. obtained a bound $O(\mathcal{D} \ln(m+n)/\gamma^2)$ when the total hinge loss of U is zero (the realizable case). Moreover, they obtained a mistake bound $O(kl \ln(m+n))$ when U has a $(k,l)$-biclustered structure (see Appendix A for details) and the side information $(M, N)$ is in accordance with this particular structure. However, a logarithmic gap still remains from the lower bound of $\Omega(kl)$ [1].
In this paper, unlike the definition introduced by Herbster et al., we simplify the quasi-dimension to $\mathcal{D} = \mathrm{Tr}(P^\top M P) + \mathrm{Tr}(Q^\top N Q)$, i.e., only the sum of the trace terms. We then obtain a mistake bound $O(\mathcal{D}/\gamma^2)$, improving the mistake bound of Herbster et al. by a logarithmic factor $\ln(m+n)$; furthermore, our bound matches the lower bound up to a constant when the comparator matrix has a $(k,l)$-biclustered structure. The basic idea is to reduce the OBMC problem with side information to an online semi-definite programming (OSDP) problem specified by a matrix $\Gamma$. In particular, the symmetric positive definite matrix $\Gamma$ is obtained from the side information $(M, N)$ in our reduction. The reduced OSDP problem is a repeated game based on a sparse loss-matrix space and a decision space, which is a set of symmetric positive semi-definite matrices W. Note that our decision set is constrained simultaneously by the $L_\infty$-norm of the diagonal entries of W and by the $\Gamma$-trace norm $\mathrm{Tr}(\Gamma W \Gamma)$. Our reduced OSDP problem is thus a generalization of the standard OSDP problem [6,7], in which $\Gamma = E$. We design and analyze our algorithm for the generalized OSDP problem under the follow-the-regularized-leader (FTRL) framework (see, e.g., [8,9,10]). Note that to guarantee good performance of the proposed algorithm, we choose a specialized regularizer, as described later.
The OSDP framework solved by the FTRL approach is a classical method for various online matrix prediction problems, such as online gambling [6,11], online collaborative filtering [12,13,14], online similarity prediction [15], and especially a non-binary version of online matrix completion with no side information [6,7]. To measure the performance of an algorithm for these problems, the standard notion is the regret: the difference between the cumulative loss of the algorithm and that of the globally optimal comparator matrix in hindsight.
For the aforementioned results on non-binary online matrix completion with no side information, Hazan et al. [6] first proposed a reduction to the standard OSDP problem and then applied the FTRL algorithm with an entropic regularizer, obtaining a sub-optimal regret bound. Moridomi et al. [7] improved the regret bound by deploying a log-determinant regularizer, exploiting the observation that the loss matrices arising in the reduction are sparse.
Next, for the OBMC problem with side information, Herbster et al. [2] reduced the problem to another variant of the OSDP problem, with different or fewer constraints on the decision space than ours. They then applied an FTRL-based algorithm with the entropic regularizer, similar to [6]. Note that instead of a general regret analysis for their OSDP problem, they only gave a specific analysis of the OSDP instance obtained from the reduction. Inspired by the work of [7], we observe that the logarithmic gap in their mistake bound stems from the choice of the entropic regularizer. As in the case discussed in [7], the loss matrices in our reduction are sparse, which implies that a log-determinant regularizer can lead to better performance.
As mentioned previously, the OBMC problem with side information is reduced to an OSDP problem whose decision space is parameterized by the side information. This requires a new form of the log-determinant regularizer, since the standard log-determinant regularizer $R(W) = -\ln\det(W + \epsilon E)$ performs unsatisfactorily in this setting, as our examination shows. One might instead attempt to reduce our problem to the standard OSDP problem in a straightforward and natural way. Unfortunately, this reduction fails, as we show with a counterexample in a later section: the trivial reduction destroys the sparsity of the loss matrices and the bound on the diagonal entries of the decision space. Therefore, to solve our reduced OSDP problem, a generalized log-determinant regularizer, $R(W) = -\ln\det(\Gamma W \Gamma + \epsilon E)$, is required, and this specialized regularizer yields a successful regret bound. In conclusion, our reduction and solution not only show the power of the not-well-explored log-determinant regularizer compared with the entropic or Frobenius-norm regularizer in the OSDP framework, but also demonstrate that an appropriate choice of the regularizer, dependent on the form of the decision set (and hence on the side information) and on the loss space, can provably improve the performance of the algorithm. Note that although our derivation is similar to the analysis of Moridomi et al. [7], it is in fact a non-trivial generalization.
Furthermore, we apply our online algorithm to the statistical (batch) learning setting via the standard online-to-batch conversion framework (see, for example, the work of Mohri et al. [16]) and derive a generalization error bound with side information. Our generalization error bound is similar to the known margin-based bound for SVMs (e.g., Mohri et al. [16]) with the best kernel when the side information is vacuous. Remarkably, we obtain such a bound without knowing the best kernel; moreover, this implies that the error bound in the batch learning setting can be improved when the side information is given to the learner in advance.
Our main contribution is summarized as follows:
  • First, we generalize the OSDP problem by parameterizing the decision set with a symmetric positive definite matrix $\Gamma$. This generalization extends the standard OSDP problem and offers a more widely applicable framework. Next, we design an FTRL-based algorithm with a generalized log-determinant regularizer depending on the matrix $\Gamma$ from the decision set. Our result recovers the previously known bound [7] in the case where $\Gamma$ is the identity matrix.
  • We obtain refined mistake bounds for the OBMC problem with side information and for online similarity prediction with side information as applications of the above result. We reduce these problems to the OSDP framework and encode the side information into a symmetric positive definite matrix defining the decision space. Compared with the analysis of Herbster et al. [2], our reduction is explicit and easy to follow. Thanks to our result for the generalized OSDP problem, we improve the previously known mistake bounds by logarithmic factors for both problems. In particular, for the former problem, our mistake bound is optimal.
  • In addition, compared with the preliminary version [17], we add an online-to-batch conversion for the OBMC problem with side information. We apply the standard online-to-batch conversion framework, which guarantees that our online algorithm performs no worse than the traditional offline algorithm (SVM with the best kernel). As we demonstrate in a later section, our error bound recovers the best margin-based bound when the side information is vacuous. With the assistance of ideal side information, our proposed algorithm performs even better than the previous one. On the one hand, we improve the error bound for OBMC with side information in the batch setting; on the other hand, our result suggests that a more effective algorithm for the batch setting may exist if the algorithm can make good use of the side information.
This paper is organized as follows. In Section 2, we give some basic notation and formally define the generalized OSDP problem. Then, with a toy example, we show that a naive reduction from our problem to the standard OSDP can lead to a worse regret bound, which motivates our generalized log-determinant regularizer. The main algorithm and its regret bound for the generalized OSDP problem are given in Section 3. In Section 4, we give the reductions of the OBMC problem and of online similarity prediction with side information to our OSDP problem, respectively. Moreover, we show that the mistake bound for the OBMC problem is optimal in the realizable case where the comparator matrix has a biclustered structure. In Section 5, we derive the batch-setting result for the OBMC problem with side information. In Appendix A.1, we state some lemmata needed for the proofs of our results, and in Appendix A.2 we define the $(k,l)$-biclustered structure.

2. Preliminaries

For any positive integer T, the subset $\{1, 2, \ldots, T\} \subset \mathbb{N}$ is denoted by $[T]$. Let $\mathcal{S}^{N \times N}$, $\mathcal{S}_{+}^{N \times N}$ and $\mathcal{S}_{++}^{N \times N}$ denote the sets of $N \times N$ symmetric matrices, symmetric positive semi-definite matrices and symmetric strictly positive definite matrices, respectively. The identity matrix is denoted by E. For an $m \times n$ matrix $X \in \mathbb{R}^{m \times n}$ and $(i,j) \in [m] \times [n]$, we denote the i-th row vector of X and the $(i,j)$ entry of X by $X_i$ and $X_{i,j}$, respectively. Furthermore, $\mathrm{vec}(X)$ is the mn-dimensional vector obtained by arranging all entries $X_{i,j}$, $i \in [m]$ and $j \in [n]$, in some fixed order. For any matrices $X, Y \in \mathbb{R}^{m \times n}$, $X \bullet Y = \mathrm{Tr}(X^\top Y) = \mathrm{vec}(X)^\top \mathrm{vec}(Y)$ denotes their Frobenius inner product. The trace norm of a matrix $X \in \mathcal{S}_{+}^{N \times N}$ is defined as $\mathrm{Tr}(X) = \sum_{i=1}^{N} |\lambda_i(X)|$, where $\lambda_i(X)$ denotes the i-th largest eigenvalue of X; equivalently, $\mathrm{Tr}(X) = \sum_{i=1}^{N} X_{i,i}$. In addition, we generalize the trace norm of X to the $\Gamma$-trace norm $\mathrm{Tr}(\Gamma X \Gamma)$, for some $\Gamma \in \mathcal{S}_{++}^{N \times N}$. Note that the $\Gamma$-trace norm recovers the trace norm when $\Gamma = E$. For a vector x, the $L_p$-norm of x is denoted by $\|x\|_p$.
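As a quick numerical illustration of this notation (a sketch, not part of the paper's development), the following checks that, for a positive semi-definite matrix, the trace norm coincides with the sum of the diagonal entries, and that the $\Gamma$-trace norm reduces to the trace when $\Gamma = E$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
A = rng.standard_normal((N, N))
X = A @ A.T  # symmetric positive semi-definite

# For PSD X, the trace norm equals both the sum of |eigenvalues|
# and the sum of the diagonal entries.
trace_norm = np.abs(np.linalg.eigvalsh(X)).sum()
print(np.isclose(trace_norm, np.trace(X)))  # True

# The Gamma-trace norm Tr(Gamma X Gamma) with Gamma = E is just Tr(X).
Gamma = np.eye(N)
print(np.isclose(np.trace(Gamma @ X @ Gamma), np.trace(X)))  # True
```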

2.1. Generalized OSDP Problem with Bounded Γ -Trace Norm

Our generalized OSDP problem, specified with a symmetric positive definite matrix $\Gamma \in \mathcal{S}_{++}^{N \times N}$, is formulated by a pair $(\mathcal{K}, \mathcal{L})$, where
$$\mathcal{K} = \{ W \in \mathcal{S}_{+}^{N \times N} : \mathrm{Tr}(\Gamma W \Gamma) \le \tau, \ \|\mathrm{vec}(W)\|_{\infty} \le \beta \} \qquad (1)$$
is called the decision space/set, and
$$\mathcal{L} = \{ L \in \mathcal{S}^{N \times N} : \|\mathrm{vec}(L)\|_{1} \le g \} \qquad (2)$$
is called the loss space, where $\tau > 0$, $\beta > 0$ and $g > 0$ are parameters. The generalized OSDP problem $(\mathcal{K}, \mathcal{L})$ is a repeated game between the algorithm and the adversarial environment, as described below. On each round $t \in [T]$, we have the following:
  • The algorithm predicts a matrix $W_t \in \mathcal{K}$.
  • The algorithm receives a loss matrix $L_t \in \mathcal{L}$ returned by the environment.
  • The algorithm incurs the loss $W_t \bullet L_t$ on round t.
The goal of the algorithm is to minimize the following regret:
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = \sum_{t=1}^{T} W_t \bullet L_t - \min_{W \in \mathcal{K}} \sum_{t=1}^{T} W \bullet L_t. \qquad (3)$$
Note that the standard OSDP problem corresponds to the special case of our setting in which $\Gamma = E$.
By the definition of the OSDP problem above, this problem falls into the online linear optimization framework, since the decision space is convex and the Frobenius inner product is linear. Therefore, as Moridomi et al. [7] did for the standard OSDP problem, we can apply a standard FTRL algorithm. The FTRL algorithm outputs a matrix $W_t$ according to
$$W_t = \arg\min_{W \in \mathcal{K}} \left( R(W) + \eta \sum_{s=1}^{t-1} L_s \bullet W \right), \qquad (4)$$
where $R : \mathcal{K} \to \mathbb{R}$ is a strongly convex function called the regularizer. The entropic and Euclidean-norm regularizers are classical choices in online linear optimization [9]. In particular, Moridomi et al. chose the log-determinant regularizer, which is not well studied, defined as
$$R(W) = -\ln\det(W + \epsilon E),$$
where $\epsilon > 0$ is a parameter, and derived the following regret bound for the standard OSDP problem.
Theorem 1 
([7]). For the standard OSDP problem $(\mathcal{K}, \mathcal{L})$ with $\Gamma = E$, the FTRL algorithm with the log-determinant regularizer achieves
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = O(g \sqrt{\tau \beta T}).$$
In the next sub-section, we show that the standard log-determinant regularizer performs unsatisfactorily for our generalized OSDP problem, even though the generalized OSDP can be reduced to the standard OSDP.

2.2. A Naive Reduction

Given a generalized OSDP $(\mathcal{K}, \mathcal{L})$ as in Equations (1) and (2), there is a natural reduction to a standard OSDP problem $(\tilde{\mathcal{K}}, \tilde{\mathcal{L}})$, where
$$\tilde{\mathcal{K}} = \{ \tilde{W} \in \mathcal{S}_{+}^{N \times N} : \mathrm{Tr}(\tilde{W}) \le \tau, \ \|\mathrm{vec}(\tilde{W})\|_{\infty} \le \tilde{\beta} \}, \quad \tilde{\mathcal{L}} = \{ \tilde{L} \in \mathcal{S}^{N \times N} : \|\mathrm{vec}(\tilde{L})\|_{1} \le \tilde{g} \},$$
for some parameters $\tilde{\beta} > 0$ and $\tilde{g} > 0$. For convenience, we denote by $\mathcal{A}$ an FTRL-based algorithm with the log-determinant regularizer.
The reduction consists of two transformations: one transforms the decision matrix $\tilde{W}_t \in \tilde{\mathcal{K}}$ produced by $\mathcal{A}$ into the decision matrix $W_t = \Gamma^{-1} \tilde{W}_t \Gamma^{-1}$ for the OSDP $(\mathcal{K}, \mathcal{L})$; the other transforms the loss matrix $L_t \in \mathcal{L}$ from the environment of $(\mathcal{K}, \mathcal{L})$ into $\tilde{L}_t = \Gamma^{-1} L_t \Gamma^{-1}$, which is fed to the algorithm $\mathcal{A}$. Note that the loss is preserved under this reduction; that is, $W_t \bullet L_t = \mathrm{Tr}(W_t L_t) = \mathrm{Tr}(\Gamma^{-1} \tilde{W}_t \Gamma^{-1} \, \Gamma \tilde{L}_t \Gamma) = \mathrm{Tr}(\tilde{W}_t \tilde{L}_t) = \tilde{W}_t \bullet \tilde{L}_t$. Moreover, the $\Gamma$-trace norm of $W_t$ equals the trace norm of $\tilde{W}_t$, i.e., $\mathrm{Tr}(\Gamma W_t \Gamma) = \mathrm{Tr}(\tilde{W}_t)$. Therefore, setting $\tilde{\beta}$ and $\tilde{g}$ appropriately so that $\Gamma W \Gamma \in \tilde{\mathcal{K}}$ and $\Gamma^{-1} L \Gamma^{-1} \in \tilde{\mathcal{L}}$ for any $W \in \mathcal{K}$ and $L \in \mathcal{L}$, we have $\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) \le \mathrm{Regret}_{\mathrm{OSDP}}(T, \tilde{\mathcal{K}}, \tilde{\mathcal{L}})$.
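The two transformations can be checked numerically. The sketch below, with an arbitrary positive definite $\Gamma$, verifies that the Frobenius inner product and the $\Gamma$-trace norm are preserved:

```python
import numpy as np

# Sanity check (illustrative) of the naive reduction: with
# W = Gamma^{-1} W~ Gamma^{-1} and L~ = Gamma^{-1} L Gamma^{-1},
# the Frobenius inner product and the Gamma-trace norm are preserved.
rng = np.random.default_rng(3)
N = 5
B = rng.standard_normal((N, N))
Gamma = B @ B.T + N * np.eye(N)   # symmetric positive definite
Gi = np.linalg.inv(Gamma)

A = rng.standard_normal((N, N))
W_tilde = A @ A.T                 # decision of the standard-OSDP algorithm
L = rng.standard_normal((N, N))
L = (L + L.T) / 2                 # symmetric loss matrix

W = Gi @ W_tilde @ Gi
L_tilde = Gi @ L @ Gi

print(np.isclose(np.sum(W * L), np.sum(W_tilde * L_tilde)))        # loss preserved
print(np.isclose(np.trace(Gamma @ W @ Gamma), np.trace(W_tilde)))  # Gamma-trace preserved
```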
Hence, according to the regret bound for $(\tilde{\mathcal{K}}, \tilde{\mathcal{L}})$ with the log-determinant regularizer [7], Theorem 1 immediately gives
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = O(\tilde{g} \sqrt{\tau \tilde{\beta} T}).$$
In the following part, we give an example, which implies that the above reduction yields a worse regret bound than our proposed algorithm.
Example 1. 
Define $\Gamma \in \mathcal{S}_{++}^{N \times N}$ as
$$\Gamma = \begin{pmatrix} N & -1 & \cdots & -1 \\ -1 & N & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & N \end{pmatrix} \quad \text{with} \quad \Gamma^{-1} = \begin{pmatrix} \frac{2}{N+1} & \frac{1}{N+1} & \cdots & \frac{1}{N+1} \\ \frac{1}{N+1} & \frac{2}{N+1} & \cdots & \frac{1}{N+1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{N+1} & \frac{1}{N+1} & \cdots & \frac{2}{N+1} \end{pmatrix}$$
and let $\tau = N^3 + N^2 - N$, $\beta = 1$ and $g = 4$, so that $E \in \mathcal{K}$. Next, we define a loss matrix $L \in \mathcal{L}$, whose entries satisfy $L_{i,j} \neq 0$ only if $(i,j) \in \{1, N\} \times \{1, N\}$, as follows:
$$L = \begin{pmatrix} L_{1,1} & 0 & \cdots & L_{1,N} \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ L_{N,1} & 0 & \cdots & L_{N,N} \end{pmatrix}.$$
Then, a simple calculation gives $(\Gamma E \Gamma)_{i,i} = N^2 + N - 1$ for all $i \in [N]$, which implies that we need $\tilde{\beta} \ge N^2 + N - 1$. Meanwhile, we have $\|\mathrm{vec}(\Gamma^{-1} L \Gamma^{-1})\|_1 \le |L_{1,1}| + |L_{1,N}| + |L_{N,1}| + |L_{N,N}| \le 4$, which suggests that $\tilde{g} = 4$ suffices. In other words, the regret bound obtained via the naive reduction must be at least of order $4 N \sqrt{\tau T}$. In contrast, if we directly apply our proposed algorithm from the next section, the regret bound for this example is $O(\sqrt{\tau T})$, since $\rho = \max_{i,j} |(\Gamma^{-1} \Gamma^{-1})_{i,j}| \le 1$. Thus, our algorithm improves significantly on the FTRL-based algorithm with the standard log-determinant regularizer.
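The claims in Example 1 can be verified numerically for a concrete N; the following sketch checks the stated form of $\Gamma^{-1}$, the diagonal of $\Gamma E \Gamma$, the value of $\tau$, and that $\rho \le 1$ (here $N = 6$ is an arbitrary choice):

```python
import numpy as np

# Numerical check of Example 1 (illustrative): Gamma with N on the
# diagonal and -1 elsewhere has the stated inverse, (Gamma E Gamma)_{ii}
# equals N^2 + N - 1, Tr(Gamma E Gamma) = N^3 + N^2 - N, and
# rho = max_{i,j} |(Gamma^{-1} Gamma^{-1})_{ij}| <= 1.
N = 6
Gamma = (N + 1) * np.eye(N) - np.ones((N, N))        # diag N, off-diag -1
Gamma_inv = (np.eye(N) + np.ones((N, N))) / (N + 1)  # diag 2/(N+1), off 1/(N+1)

assert np.allclose(Gamma @ Gamma_inv, np.eye(N))
G2 = Gamma @ np.eye(N) @ Gamma
assert np.allclose(np.diag(G2), N**2 + N - 1)
assert np.isclose(np.trace(G2), N**3 + N**2 - N)     # tau = N^3 + N^2 - N
rho = np.abs(Gamma_inv @ Gamma_inv).max()
assert rho <= 1
print(round(rho, 6))  # → 0.183673 (i.e., 9/49 for N = 6)
```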

3. Algorithm for the Generalized OSDP Problem

In this section, we give the main algorithm and its regret bound for the generalized OSDP problem $(\mathcal{K}, \mathcal{L})$ specified by (1) and (2), with respect to some $\Gamma \in \mathcal{S}_{++}^{N \times N}$. We propose the FTRL algorithm (4) with the Γ-calibrated log-determinant regularizer:
$$R(W) = -\ln\det(\Gamma W \Gamma + \epsilon E),$$
where $\epsilon > 0$ is a parameter.
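Numerically, this regularizer can be evaluated with a sign-aware log-determinant. The sketch below (illustrative only, not the FTRL solver itself) also spot-checks midpoint convexity on a random PSD pair:

```python
import numpy as np

# Numerical sketch of the Gamma-calibrated log-determinant regularizer
# R(W) = -ln det(Gamma W Gamma + eps*E), with a midpoint-convexity check.
rng = np.random.default_rng(2)
N, eps = 4, 0.5
B = rng.standard_normal((N, N))
Gamma = B @ B.T + N * np.eye(N)  # symmetric positive definite

def reg(W):
    sign, logabsdet = np.linalg.slogdet(Gamma @ W @ Gamma + eps * np.eye(N))
    assert sign > 0              # argument is positive definite
    return -logabsdet

def random_psd():
    A = rng.standard_normal((N, N))
    return A @ A.T

X, Y = random_psd(), random_psd()
mid = reg(0.5 * X + 0.5 * Y)
avg = 0.5 * reg(X) + 0.5 * reg(Y)
print(mid <= avg + 1e-12)        # convexity: R((X+Y)/2) <= (R(X)+R(Y))/2
```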
The following theorem gives a regret bound of our algorithm.
Theorem 2 
(Main Theorem). Let $\Gamma \in \mathcal{S}_{++}^{N \times N}$ be given. For the generalized OSDP problem specified by Equations (1) and (2), denote $\rho = \max_{i,j} |(\Gamma^{-1} \Gamma^{-1})_{i,j}|$. Running the FTRL algorithm with the Γ-calibrated log-determinant regularizer for T rounds, the regret is bounded as follows:
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = O\left( g^2 (\beta + \rho\epsilon)^2 T \eta + \frac{\tau}{\epsilon \eta} \right).$$
In particular, letting $\eta = \sqrt{\frac{\tau}{g^2 (\beta + \rho\epsilon)^2 \epsilon T}}$ and $\epsilon = \beta/\rho$, we have
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = O(g \sqrt{\beta \rho \tau T}).$$
Note that when $\Gamma = E$ (and hence $\rho = 1$), we recover the regret bound of Theorem 1.
Before we give the proof of our main theorem, we need to introduce strong convexity, which will play a central role in our proof. The definition of strong convexity is as follows.
Definition 1. 
For a decision space $\mathcal{K}$ and a real number $s \ge 0$, a regularizer $R : \mathcal{K} \to \mathbb{R}$ is said to be s-strongly convex with respect to the loss space $\mathcal{L}$ if for any $\alpha \in [0, 1]$, any $X, Y \in \mathcal{K}$ and any $L \in \mathcal{L}$, the following holds:
$$R(\alpha X + (1-\alpha) Y) \le \alpha R(X) + (1-\alpha) R(Y) - \frac{s}{2} \alpha (1-\alpha) |L \bullet (X - Y)|^2.$$
This is equivalent to the following condition: for any $X, Y \in \mathcal{K}$ and $L \in \mathcal{L}$,
$$R(X) \ge R(Y) + \nabla R(Y) \bullet (X - Y) + \frac{s}{2} |L \bullet (X - Y)|^2.$$
Note that the notion of strong convexity defined above differs from the standard one: usually, strong convexity is defined with respect to some norm $\|\cdot\|$ on the decision set [9], whereas here it is defined with respect to the decision space and the loss space. The direct reason is that our decision set is constrained by two norms simultaneously; the same definition is used in [18].
The following lemma from [7] gives a general relation between the regret bound for the OSDP problem and the strong convexity of the regularizer used by the FTRL-based algorithm.
Lemma 1 
([7]). Let $R : \mathcal{K} \to \mathbb{R}$ be an s-strongly convex regularizer with respect to a loss space $\mathcal{L}$ for a decision space $\mathcal{K}$. Then the FTRL algorithm with regularizer R applied to $(\mathcal{K}, \mathcal{L})$ achieves
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) \le \frac{H_0}{\eta} + \frac{\eta T}{s},$$
where $H_0 = \max_{W, W' \in \mathcal{K}} (R(W) - R(W'))$.
Due to the lemma above, it suffices to analyze the strong convexity of our Γ-calibrated log-determinant regularizer with respect to our decision space (1) and loss space (2). We show the result in the main proposition.
Proposition 1 
(Main proposition). The Γ-calibrated log-determinant regularizer $R(W) = -\ln\det(\Gamma W \Gamma + \epsilon E)$ is s-strongly convex with respect to $\mathcal{L}$ for $\mathcal{K}$ with $s = 1/(576 e (\beta + \rho\epsilon)^2 g^2)$, where $\rho = \max_{i,j} |(\Gamma^{-1} \Gamma^{-1})_{i,j}|$.
We prove this proposition in the next sub-section. Based on this proposition, we first give a proof sketch of our main theorem; the details are in the next sub-section.
Proof Sketch of Theorem 2. 
According to Proposition 1 and Lemma 1, the only remaining step in proving this theorem is bounding $H_0$. As we show in the following subsection, $H_0 \le \tau/\epsilon$ by the definition of R. Note that the matrix size N has no effect on the regret bound in Theorem 2. □

Proof for Main Proposition and Theorem

Before proving Theorem 2, we introduce some lemmata and notation.
Given a distribution P over $\mathbb{R}^N$, the (differential) entropy of P is defined by $H(P) = -\mathbb{E}_{x \sim P}[\ln P(x)]$. We define the characteristic function of P by $\phi(u) = \mathbb{E}_{x \sim P}[e^{i u^\top x}]$, where i is the imaginary unit. For two distributions P and Q over $\mathbb{R}^N$, $\frac{1}{2} \int_x |P(x) - Q(x)| \, dx$ denotes the total variation distance between P and Q.
Lemma 2. 
Let $G_1$ and $G_2$ be two zero-mean Gaussian distributions with covariance matrices $\Gamma X \Gamma$ and $\Gamma Y \Gamma$, respectively, where $X, Y, \Gamma \in \mathcal{S}_{++}^{N \times N}$. If there exists $(i,j)$ such that
$$|X_{i,j} - Y_{i,j}| \ge \delta \, (X_{i,i} + Y_{i,i} + X_{j,j} + Y_{j,j})$$
for some $\delta > 0$, then the total variation distance between $G_1$ and $G_2$ is at least $\frac{\delta}{12 e^{1/4}}$.
Proof. 
Let $\phi_1(u)$ and $\phi_2(u)$ be the characteristic functions of $G_1$ and $G_2$, respectively. By Lemma A1, we have
$$\int_x |G_1(x) - G_2(x)| \, dx \ge \max_{u \in \mathbb{R}^N} |\phi_1(u) - \phi_2(u)|,$$
so we only need to lower bound $\max_{u \in \mathbb{R}^N} |\phi_1(u) - \phi_2(u)|$.
The characteristic functions of $G_1$ and $G_2$ are $\phi_1(u) = e^{-\frac{1}{2} u^\top \Gamma^\top X \Gamma u}$ and $\phi_2(u) = e^{-\frac{1}{2} u^\top \Gamma^\top Y \Gamma u}$, respectively. Let $\alpha_1 = (\Gamma v)^\top X (\Gamma v)$ and $\alpha_2 = (\Gamma v)^\top Y (\Gamma v)$, and set $\Gamma u = \Gamma v / \sqrt{\alpha_1 + \alpha_2}$ for $v \in \mathbb{R}^N$. Moreover, since $\Gamma$ is invertible, for any $\bar{v} \in \mathbb{R}^N$ there exists $v \in \mathbb{R}^N$ such that $\bar{v} = \Gamma v$; we write $\bar{u} = \Gamma u$ in the same way.
Now let us lower bound $\max_{u \in \mathbb{R}^N} |\phi_1(u) - \phi_2(u)|$:
$$\max_{u \in \mathbb{R}^N} |\phi_1(u) - \phi_2(u)| = \max_{u \in \mathbb{R}^N} \left| e^{-\frac{1}{2} u^\top \Gamma X \Gamma u} - e^{-\frac{1}{2} u^\top \Gamma Y \Gamma u} \right| = \max_{u \in \mathbb{R}^N} \left| e^{-\frac{1}{2} (\Gamma u)^\top X (\Gamma u)} - e^{-\frac{1}{2} (\Gamma u)^\top Y (\Gamma u)} \right| \ge \max_{\bar{v} \in \mathbb{R}^N} \left| e^{-\frac{\alpha_1}{2(\alpha_1 + \alpha_2)}} - e^{-\frac{\alpha_2}{2(\alpha_1 + \alpha_2)}} \right| \ge \max_{\bar{v} \in \mathbb{R}^N} \frac{1}{2 e^{1/4}} \cdot \frac{|\alpha_1 - \alpha_2|}{\alpha_1 + \alpha_2}.$$
The second inequality is due to Lemma A4, since $\min\left\{ \frac{\alpha_1}{\alpha_1 + \alpha_2}, \frac{\alpha_2}{\alpha_1 + \alpha_2} \right\} \in \left(0, \frac{1}{2}\right]$.
By the assumption of the lemma, for some $(i,j)$ we obtain
$$\delta \, (X_{i,i} + Y_{i,i} + X_{j,j} + Y_{j,j}) \le |X_{i,j} - Y_{i,j}| = \frac{1}{2} \left| (e_i + e_j)^\top (X - Y)(e_i + e_j) - e_i^\top (X - Y) e_i - e_j^\top (X - Y) e_j \right|.$$
This implies that one of $(e_i + e_j)^\top (X - Y)(e_i + e_j)$, $e_i^\top (X - Y) e_i$ and $e_j^\top (X - Y) e_j$ has absolute value greater than $\frac{2\delta}{3} (X_{i,i} + Y_{i,i} + X_{j,j} + Y_{j,j})$.
By the positive definiteness of $X$ and $Y$, we have, for all $v \in \{e_i + e_j, e_i, e_j\}$,
$$v^\top (X + Y) v \le 2 (X + Y)_{i,i} + 2 (X + Y)_{j,j},$$
and therefore
$$\max_{\bar{v} \in \mathbb{R}^N} \frac{1}{2 e^{1/4}} \cdot \frac{|\alpha_1 - \alpha_2|}{\alpha_1 + \alpha_2} \ge \max_{v \in \{e_i + e_j, e_i, e_j\}} \frac{1}{2 e^{1/4}} \cdot \frac{|v^\top (X - Y) v|}{v^\top (X + Y) v} \ge \frac{\delta}{6 e^{1/4}}.$$
Dividing by 2 (recall that the total variation distance carries a factor $\frac{1}{2}$) completes the proof. □
Lemma 3. 
Let $X, Y \in \mathcal{S}_{+}^{N \times N}$ be such that
$$|X_{i,j} - Y_{i,j}| \ge \delta \, (X_{i,i} + Y_{i,i} + X_{j,j} + Y_{j,j})$$
for some $(i,j)$ and $\delta > 0$, and let $\Gamma$ be a symmetric strictly positive definite matrix. Then the following inequality holds:
$$\ln\det(\alpha \Gamma X \Gamma + (1-\alpha) \Gamma Y \Gamma) - \alpha \ln\det(\Gamma X \Gamma) - (1-\alpha) \ln\det(\Gamma Y \Gamma) \ge \frac{\alpha(1-\alpha)}{2} \cdot \frac{\delta^2}{36 e^{1/2}}.$$
Proof. 
Let $G_1$ and $G_2$ be zero-mean Gaussian distributions with covariance matrices $\Gamma X \Gamma$ and $\Gamma Y \Gamma$, respectively. The total variation distance between $G_1$ and $G_2$ is lower bounded by $\frac{\delta}{12 e^{1/4}}$, by the assumption of this lemma together with Lemma 2. Consider the mixture distribution of $v$ that draws $v \sim G_1$ with probability $\alpha$ and $v \sim G_2$ otherwise. Its covariance matrix is $\alpha \Gamma X \Gamma + (1-\alpha) \Gamma Y \Gamma$.
By Lemma A3, we obtain
$$\ln\det(\alpha \Gamma X \Gamma + (1-\alpha) \Gamma Y \Gamma) \ge 2 H(\alpha G_1 + (1-\alpha) G_2) - \ln(2\pi e)^N.$$
By Lemma A2, we further have
$$2 H(\alpha G_1 + (1-\alpha) G_2) - \ln(2\pi e)^N \ge 2\alpha H(G_1) + 2(1-\alpha) H(G_2) - \ln(2\pi e)^N + 2\alpha(1-\alpha) \frac{\delta^2}{144 e^{1/2}} = 2\alpha H(G_1) + 2(1-\alpha) H(G_2) - \ln(2\pi e)^N + \alpha(1-\alpha) \frac{\delta^2}{72 e^{1/2}},$$
where the inequality uses the lower bound on the total variation distance above. Finally, since $G_1$ and $G_2$ are Gaussian, by the statement of Lemma A3 we have
$$2\alpha H(G_1) + 2(1-\alpha) H(G_2) - \ln(2\pi e)^N + \alpha(1-\alpha) \frac{\delta^2}{72 e^{1/2}} = \alpha \ln\det(\Gamma X \Gamma) + (1-\alpha) \ln\det(\Gamma Y \Gamma) + \alpha(1-\alpha) \frac{\delta^2}{72 e^{1/2}}.$$
Thus, we have the conclusion. □
Lemma 4 
(Lemma 5.4 [7]). Let $X, Y \in \mathcal{S}_{++}^{N \times N}$ be such that $|X_{i,i}| \le \beta$ and $|Y_{i,i}| \le \beta$ for all $i \in [N]$. Then for any $L \in \mathcal{L} = \{ L \in \mathcal{S}^{N \times N} : \|\mathrm{vec}(L)\|_1 \le g \}$ there exists $(i,j)$ such that
$$|X_{i,j} - Y_{i,j}| \ge \frac{|L \bullet (X - Y)|}{4 \beta g} \, (X_{i,i} + Y_{i,i} + X_{j,j} + Y_{j,j}).$$
Proposition 2 
(Main proposition). For $\Gamma \in \mathcal{S}_{++}^{N \times N}$, the generalized log-determinant regularizer $R(X) = -\ln\det(\Gamma X \Gamma + \epsilon E)$ is s-strongly convex with respect to $\mathcal{L}$ for $\mathcal{K}$ with $s = 1/(576 e (\beta + \rho\epsilon)^2 g^2)$. Here E is the identity matrix.
Proof. 
Since $\Gamma$ is positive definite by the assumption of the proposition, we have $\Gamma X \Gamma + \epsilon E = \Gamma (X + \epsilon \Gamma^{-1} \Gamma^{-1}) \Gamma$.
For any $X \in \mathcal{K}$, we obtain
$$\max_{i,j} |(X + \epsilon \Gamma^{-1} \Gamma^{-1})_{i,j}| \le \max_{i,j} |X_{i,j}| + \epsilon\rho \le \beta + \epsilon\rho,$$
where $\rho = \max_{i,j} |(\Gamma^{-1} \Gamma^{-1})_{i,j}|$.
Setting $\beta' = \beta + \epsilon\rho$ in Lemma 4 and combining Lemma 3 with Definition 1, the conclusion follows. □
The proof of the main theorem is given as follows:
Proof of Theorem 2. 
By Lemma 1 (in the main part), we obtain
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) \le \frac{H_0}{\eta} + \frac{\eta T}{s}.$$
By the main proposition, we know that $s = 1/(576 e (\beta + \rho\epsilon)^2 g^2)$.
Thus we only need to show $H_0 \le \tau/\epsilon$. Denoting by $W_0$ and $W_1$ the minimizer and maximizer of R, respectively, we obtain
$$\max_{W, W' \in \mathcal{K}} (R(W) - R(W')) = R(W_1) - R(W_0) = -\ln\det(\Gamma W_1 \Gamma + \epsilon E) + \ln\det(\Gamma W_0 \Gamma + \epsilon E) = \sum_{i=1}^{N} \ln \frac{\lambda_i(\Gamma W_0 \Gamma) + \epsilon}{\lambda_i(\Gamma W_1 \Gamma) + \epsilon},$$
where the last equality uses the fact that $\det(A) = \prod_{i=1}^{N} \lambda_i(A)$ for any $A \in \mathcal{S}_{++}^{N \times N}$. Further, we have
$$\sum_{i=1}^{N} \ln \frac{\lambda_i(\Gamma W_0 \Gamma) + \epsilon}{\lambda_i(\Gamma W_1 \Gamma) + \epsilon} \le \sum_{i=1}^{N} \ln\left( \frac{\lambda_i(\Gamma W_0 \Gamma)}{\epsilon} + 1 \right) \le \sum_{i=1}^{N} \frac{\lambda_i(\Gamma W_0 \Gamma)}{\epsilon} = \frac{\mathrm{Tr}(\Gamma W_0 \Gamma)}{\epsilon} \le \frac{\tau}{\epsilon},$$
where the second inequality follows from $\ln(1 + x) \le x$. Plugging in s, we obtain
$$\mathrm{Regret}_{\mathrm{OSDP}}(T, \mathcal{K}, \mathcal{L}) = O\left( g^2 (\beta + \rho\epsilon)^2 T \eta + \frac{\tau}{\epsilon \eta} \right). \qquad \square$$
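The bound $H_0 \le \tau/\epsilon$ derived above can be spot-checked numerically: for random PSD pairs, the gap $R(W) - R(W')$ never exceeds $\mathrm{Tr}(\Gamma W' \Gamma)/\epsilon$ (an illustrative sketch with arbitrary $\Gamma$ and $\epsilon$):

```python
import numpy as np

# Empirical sanity check (illustrative) of the bound H_0 <= tau/eps:
# for random PSD W, W', the gap R(W) - R(W') is at most Tr(Gamma W' Gamma)/eps.
rng = np.random.default_rng(5)
N, eps = 4, 0.3
B = rng.standard_normal((N, N))
Gamma = B @ B.T + N * np.eye(N)  # symmetric positive definite

def reg(W):
    return -np.linalg.slogdet(Gamma @ W @ Gamma + eps * np.eye(N))[1]

for _ in range(100):
    A1, A2 = rng.standard_normal((N, N)), rng.standard_normal((N, N))
    W, W_prime = A1 @ A1.T, A2 @ A2.T
    gap = reg(W) - reg(W_prime)
    assert gap <= np.trace(Gamma @ W_prime @ Gamma) / eps + 1e-9
print("H_0 <= tau/eps holds on random instances")
```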

4. Application to OBMC with Side Information

We now demonstrate an explicit reduction from OBMC with side information to the aforementioned OSDP problem $(\mathcal{K}, \mathcal{L})$. The reduction has two steps. First, we reduce OBMC with side information to an online matrix prediction (OMP) problem with side information, using the hinge loss and a mistake-driven technique. Second, we reduce OMP with side information to the OSDP problem, encoding the side information into the $\Gamma$-trace norm.
Before we show the reductions, we need to define some necessary notation and state the OBMC problem with side information formally.

4.1. The Problem Statement

In principle, our problem statement simplifies the setting in the work of Herbster et al. [2].
Given $m, n \in \mathbb{N}_+$, let the pair $(M, N)$ be the side information given to the algorithm, where $M \in \mathcal{S}_{++}^{m \times m}$ and $N \in \mathcal{S}_{++}^{n \times n}$.
The online binary matrix completion (OBMC) problem is a repeated game between the algorithm and the adversarial environment, formulated as follows. On each round $t \in [T]$:
  • The environment selects $(i_t, j_t) \in [m] \times [n]$.
  • The algorithm returns a prediction $\hat{y}_t \in \{-1, +1\}$.
  • The environment reveals the true label $y_t \in \{-1, +1\}$.
The target of the algorithm is to minimize the number of mistakes over the whole learning process, $M = \sum_{t=1}^{T} \mathbb{I}[y_t \neq \hat{y}_t]$. In particular, with the assistance of the side information $(M, N)$, the mistake bound should improve when the side information is constructive.
Equivalently, we may describe the selections and true labels from the environment as a sequence $S = \{((i_t, j_t), y_t)\}_{t=1}^{T} \in ([m] \times [n] \times \{-1, 1\})^T$. The problem can then be seen as a sequential prediction of the entries of an underlying $m \times n$ target matrix: on each round t, the environment announces the location of the entry $(i_t, j_t)$, and the algorithm is required to predict the label of that entry. However, we relax the consistency of the unknown target matrix; that is, $y_t \neq y_{t'}$ may happen even if $(i_t, j_t) = (i_{t'}, j_{t'})$.
Instead of the 0–1 loss of the OBMC problem, we first need to introduce a convex surrogate loss function for the FTRL framework. Specifically, for a positive parameter γ, we define the hinge loss h_γ : R → R with respect to γ as follows:
h_γ(x) = 0 if x ≥ γ, and h_γ(x) = 1 − x/γ otherwise,
where γ is also called the margin parameter.
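As a quick illustration (ours, not part of the original analysis), the hinge loss above can be written directly in code; note that h_γ(yx) upper-bounds the 0–1 loss I(sgn(x) ≠ y), which is the property used in the proof of Lemma 5 below.

```python
def hinge_loss(x, gamma):
    """h_gamma(x): zero once x clears the margin gamma, else 1 - x/gamma."""
    if x >= gamma:
        return 0.0
    return 1.0 - x / gamma
```

For any y ∈ {−1, 1}, h_γ(yx) ≥ 1 whenever sgn(x) ≠ y, so the number of sign mistakes is dominated by the cumulative hinge loss.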
Then, for the sequence S, we factorize the comparator matrix as PQ^⊤ ∈ R^{m×n}, where P ∈ R^{m×d} and Q ∈ R^{n×d} for some d. Combining this with the definition of the hinge loss, we define the hinge loss of the sequence S as
hloss(S, (P, Q), γ) = ∑_{t=1}^T h_γ( y_t P_{i_t}^⊤ Q_{j_t} / (‖P_{i_t}‖₂ ‖Q_{j_t}‖₂) ),
in terms of the factorization pair (P, Q) and the margin parameter γ. This hinge loss measures how well the predictions induced by the comparator matrix PQ^⊤ match the true labels y_t. In the following, we may assume without loss of generality that each row of P and Q is normalized, i.e., ‖P_i‖₂ = ‖Q_j‖₂ = 1 for every (i, j) ∈ [m] × [n]. We sometimes call the pair (P, Q) the comparator matrix. Furthermore, for any matrix X with k rows, we write X̄ for the row-normalized matrix
X ¯ = diag 1 X 1 2 , , 1 X k 2 X .
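A minimal NumPy sketch of the row normalization X ↦ X̄ (an illustration we add here, not code from the paper):

```python
import numpy as np

def row_normalize(X):
    """X-bar = diag(1/||X_1||_2, ..., 1/||X_k||_2) X: scale each row to unit norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

X = np.array([[3.0, 4.0],
              [0.0, 2.0]])
Xbar = row_normalize(X)  # rows become (0.6, 0.8) and (0.0, 1.0)
```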
Now, for each comparator matrix, we define the quasi-dimension, which measures the quality of the side information. Specifically, the quasi-dimension of a comparator matrix (P, Q) with respect to the side information (M, N) is defined as
D_{M,N}(P, Q) = α_M Tr(P^⊤ M P) + α_N Tr(Q^⊤ N Q),
where α_M = max_{i∈[m]} (M^{−1})_{i,i} and α_N = max_{j∈[n]} (N^{−1})_{j,j}. When the side information is empty, M and N are set to identity matrices; in this case, the quasi-dimension equals m + n for any comparator matrix. Nevertheless, the quasi-dimension becomes smaller when the rows of P and of Q are correlated with M and N, respectively, which indicates that the side information is appropriate and reflects useful information about the comparator matrix.
Note that the notion of quasi-dimension is defined in a different way in [2].
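The quasi-dimension is straightforward to compute numerically. The sketch below (our illustration, with a hypothetical helper name) checks the remark above: with identity side information and row-normalized factors, D_{M,N}(P, Q) = m + n.

```python
import numpy as np

def quasi_dimension(P, Q, M, N):
    """D_{M,N}(P, Q) = alpha_M Tr(P^T M P) + alpha_N Tr(Q^T N Q),
    with alpha_M = max_i (M^{-1})_{ii} and alpha_N = max_j (N^{-1})_{jj}."""
    alpha_M = np.diag(np.linalg.inv(M)).max()
    alpha_N = np.diag(np.linalg.inv(N)).max()
    return alpha_M * np.trace(P.T @ M @ P) + alpha_N * np.trace(Q.T @ N @ Q)

rng = np.random.default_rng(0)
m, n, d = 4, 3, 2
P = rng.standard_normal((m, d))
P /= np.linalg.norm(P, axis=1, keepdims=True)   # row-normalized factors
Q = rng.standard_normal((n, d))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
# vacuous side information (identities) gives quasi-dimension m + n
D = quasi_dimension(P, Q, np.eye(m), np.eye(n))
```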

4.2. Reduction from OBMC with Side Information to an Online Matrix Prediction (OMP)

We now formulate the OMP problem, to which our problem is first reduced. The OMP problem is specified by a decision space X ⊆ [−1, 1]^{m×n} and a margin parameter γ > 0, and it is again described as a repeated game between the algorithm and the adversary. On each round t ∈ [T]:
  • The algorithm predicts a matrix X t R m × n .
  • The adversary returns a triple ( i t , j t , y t ) [ m ] × [ n ] × { 1 , 1 } , and
  • the algorithm suffers the loss h_γ(y_t X_{t,(i_t,j_t)}).
The goal of the algorithm is to minimize the regret:
Regret_OMP(T, X, X*) = ∑_{t=1}^T h_γ(y_t X_{t,(i_t,j_t)}) − ∑_{t=1}^T h_γ(y_t X*_{i_t,j_t}),
and we write Regret_OMP(T, X) = sup_{X*∈X} Regret_OMP(T, X, X*). Note that, unlike the standard setting of online prediction, we do not require X_t ∈ X.
Below we reduce the OBMC problem with side information ( M , N ) to the OMP problem with the following decision space:
X = { P̄Q̄^⊤ : PQ^⊤ ∈ R^{m×n}, D_{M,N}(P̄, Q̄) ≤ D̂ },
where D̂ is an arbitrary parameter. Assume that we have an algorithm A for the OMP problem (X, γ). In this reduction, we employ the mistake-driven technique; that is, the algorithm A is invoked only when a prediction mistake occurs. The reduced algorithm is as follows.
Run the algorithm A and receive the first prediction matrix X 1 from A . Then, in each round t [ T ] , we do the following:
  • Observe an index pair ( i t , j t ) [ m ] × [ n ] .
  • Predict y ^ t = sgn ( X t , ( i t , j t ) ) .
  • Observe a true label y t { 1 , 1 } .
  • If y ^ t = y t then X t + 1 = X t , and if y ^ t y t , then feed ( i t , j t , y t ) to A to let it proceed and receive X t + 1 .
Note that due to the mistake-driven technique, we run the algorithm A for at most M = t = 1 T I y ^ t y t rounds, where M is the number of mistakes of the reduction algorithm above.
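The mistake-driven reduction can be sketched as a thin wrapper around any OMP algorithm A. The interface below (`matrix()`, `update()`) and the toy inner algorithm are our own hypothetical stand-ins, only meant to show that A is advanced exactly once per mistake:

```python
import numpy as np

def run_obmc(omp, rounds):
    """Mistake-driven reduction: the inner OMP algorithm `omp` (hypothetical
    interface: .matrix() returns X_t, .update((i, j, y)) advances it) is only
    advanced on rounds where the sign prediction is wrong."""
    mistakes = 0
    for (i, j), y in rounds:
        X = omp.matrix()
        y_hat = 1 if X[i, j] >= 0 else -1   # sgn with sgn(0) taken as +1
        if y_hat != y:
            mistakes += 1
            omp.update((i, j, y))           # A only sees mistake rounds
    return mistakes

class ToyOMP:
    """Toy stand-in for A: sets the offending entry to the revealed label."""
    def __init__(self, m, n):
        self.X = np.zeros((m, n))
    def matrix(self):
        return self.X
    def update(self, triple):
        i, j, y = triple
        self.X[i, j] = y

rounds = [((0, 0), -1), ((0, 0), -1), ((1, 1), 1)]
mistakes = run_obmc(ToyOMP(2, 2), rounds)   # only the first round is a mistake
```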
The next lemma shows the performance of the reduction.
Lemma 5. 
Let Regret OMP ( M , X , X * ) denote the regret of the algorithm A in the reduction above for a competitor matrix X * X , where M = t = 1 T I ( y ^ t y t ) . Then,
M ≤ inf_{P̄Q̄^⊤ ∈ X} ( Regret_OMP(M, X, P̄Q̄^⊤) + hloss(S, (P, Q), γ) ) ≤ Regret_OMP(M, X) + hloss(S, γ),
where we define
hloss(S, γ) = min_{P̄Q̄^⊤ ∈ X} hloss(S, (P, Q), γ).
Remark 1. 
If M and N are identity matrices, then D_{M,N}(P̄, Q̄) = m + n for every comparator pair; thus, taking D̂ = m + n, the decision space is the unconstrained set X = { P̄Q̄^⊤ : PQ^⊤ ∈ R^{m×n} }.
Proof. 
Let P and Q be arbitrary matrices such that P̄Q̄^⊤ ∈ X. Since I(sgn(x) ≠ y) ≤ h_γ(yx) for any x ∈ R and y ∈ {−1, 1}, we have
M = ∑_{t=1}^T I(ŷ_t ≠ y_t) ≤ ∑_{t: ŷ_t ≠ y_t} h_γ(y_t X_{t,(i_t,j_t)}) = Regret_OMP(M, X, P̄Q̄^⊤) + ∑_{t: ŷ_t ≠ y_t} h_γ(y_t (P̄Q̄^⊤)_{i_t,j_t}) ≤ Regret_OMP(M, X, P̄Q̄^⊤) + ∑_{t=1}^T h_γ(y_t (P̄Q̄^⊤)_{i_t,j_t}) = Regret_OMP(M, X, P̄Q̄^⊤) + hloss(S, (P, Q), γ),
where the second equality follows from the definition of regret, and the third equality follows from the fact that (P̄Q̄^⊤)_{i,j} = P_i^⊤ Q_j / (‖P_i‖₂ ‖Q_j‖₂). Since the choice of P and Q is arbitrary, the following inequality holds:
M ≤ inf_{P̄Q̄^⊤ ∈ X} ( Regret_OMP(M, X, P̄Q̄^⊤) + hloss(S, (P, Q), γ) ).
Now, let P and Q be the matrices that attain (26). Hence, our lemma follows from
M ≤ Regret_OMP(M, X, P̄Q̄^⊤) + hloss(S, γ) ≤ sup_{X*∈X} Regret_OMP(M, X, X*) + hloss(S, γ). □

4.3. Reduction from OMP to the Generalised OSDP Problem

In this subsection, we reduce OMP with side information to the OSDP problem parameterized by Γ. Our reduction is similar to those of [1,6]. For convenience, we write N = m + n in the following.
First of all, we encode the side information M ∈ S++^{m×m}, N ∈ S++^{n×n} into Γ ∈ S++^{N×N} for our generalized OSDP problem via the following block-diagonal matrix:
Γ = [ √α_M M^{1/2}   0 ; 0   √α_N N^{1/2} ].
Next, we define the decision space K. For any comparator matrix (P, Q) with PQ^⊤ ∈ R^{m×n}, we define
W_{P,Q} = [P̄; Q̄][P̄; Q̄]^⊤ = [ P̄P̄^⊤   P̄Q̄^⊤ ; Q̄P̄^⊤   Q̄Q̄^⊤ ].
Trivially, W_{P,Q} is an N × N symmetric positive semi-definite matrix. Intuitively, any comparator matrix (P, Q) with P̄Q̄^⊤ ∈ X is embedded into the upper-right block of W_{P,Q}. In addition, by the normalization of P̄ and Q̄, (W_{P,Q})_{i,i} ≤ 1 for all i ∈ [N]. Hence, we need to find a convex decision space K ⊆ S++^{N×N} which satisfies
K ⊇ { W_{P,Q} : P̄Q̄^⊤ ∈ X }.
First, we invoke the following lemma:
Lemma 6 
(Lemma 8 of [2]). For any pair of side information matrices (M, N), where M ∈ S++^{m×m} and N ∈ S++^{n×n}, and for Γ induced as in Equation (28), it holds that
Tr(Γ W_{P,Q} Γ) = α_M Tr(P̄^⊤ M P̄) + α_N Tr(Q̄^⊤ N Q̄).
Due to this lemma and the definition of X, we can directly define K as follows:
K = { W ∈ S++^{N×N} : ‖vec(W)‖_∞ ≤ 1 ∧ Tr(ΓWΓ) ≤ D̂ } ⊇ { W_{P,Q} : P̄Q̄^⊤ ∈ X }.
Then, we describe the loss matrix class L. For any pair (i, j) ∈ [m] × [n], we first define a sparse symmetric matrix Z_{i,j} ∈ S^{N×N} whose only nonzero entries are at positions (i, m+j) and (m+j, i), each equal to 1/2. More formally,
Z_{i,j} = (1/2)( e_i e_{m+j}^⊤ + e_{m+j} e_i^⊤ ),
where e_k is the k-th standard basis vector of R^N. Note that, by the definition of the Frobenius inner product, we have
W_{P,Q} • Z_{i,j} = (P̄Q̄^⊤)_{i,j},
which is exactly the quantity we focus on. Thus, L is defined as
L = { c Z_{i,j} : c ∈ {−1/γ, 1/γ}, i ∈ [m], j ∈ [n] }.
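The embedding and the loss matrices are easy to check numerically. The sketch below (our illustration) builds W_{P,Q} and Z_{i,j} and verifies that the Frobenius inner product W_{P,Q} • Z_{i,j} extracts the entry (P̄Q̄^⊤)_{i,j}, and that the diagonal of W_{P,Q} is bounded by 1:

```python
import numpy as np

def embed(Pbar, Qbar):
    """W_{P,Q} = [Pbar; Qbar] [Pbar; Qbar]^T: an (m+n) x (m+n) PSD matrix
    whose upper-right m x n block is Pbar Qbar^T."""
    V = np.vstack([Pbar, Qbar])
    return V @ V.T

def Z(i, j, m, n):
    """Sparse symmetric matrix with entries 1/2 at (i, m+j) and (m+j, i)."""
    Zm = np.zeros((m + n, m + n))
    Zm[i, m + j] = Zm[m + j, i] = 0.5
    return Zm

rng = np.random.default_rng(1)
m, n, d = 3, 2, 2
Pbar = rng.standard_normal((m, d))
Pbar /= np.linalg.norm(Pbar, axis=1, keepdims=True)
Qbar = rng.standard_normal((n, d))
Qbar /= np.linalg.norm(Qbar, axis=1, keepdims=True)
W = embed(Pbar, Qbar)
# the Frobenius inner product picks out entry (1, 0) of Pbar Qbar^T
val = np.sum(W * Z(1, 0, m, n))
```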
Now we state the reduction from the OMP problem with side information to the OSDP problem ( K , L ) specified by Γ . Assume that there is an algorithm A for the OSDP problem.
Run the algorithm A and receive the first prediction matrix W 1 K from A .
In each round t,
  • let X t be the upper right m × n component matrix of W t .
    // X t , ( i , j ) = W t Z i , j
  • observe a triple ( i t , j t , y t ) [ m ] × [ n ] × { 1 , 1 } ,
  • suffer the loss ℓ_t(W_t), where ℓ_t : W ↦ h_γ( y_t (W • Z_{i_t,j_t}) ),
  • let L_t = ∇_W ℓ_t(W_t) = −(y_t/γ) Z_{i_t,j_t} if y_t X_{t,(i_t,j_t)} ≤ γ, and L_t = 0 otherwise,
  • feed L t to the algorithm A to let it proceed and receive W t + 1 .
Due to the convexity of ℓ_t, a standard linearization argument ([9]) gives
ℓ_t(W_t) − ℓ_t(W*) ≤ W_t • L_t − W* • L_t
for any W* ∈ K. Moreover, since in our reduction ℓ_t(W_t) = h_γ(y_t X_{t,(i_t,j_t)}) and ℓ_t(W_{P,Q}) = h_γ(y_t (P̄Q̄^⊤)_{i_t,j_t}), the following lemma immediately follows.
Lemma 7. 
Let Regret_OSDP(T, K, L, W_{P,Q}) = ∑_{t=1}^T (W_t − W_{P,Q}) • L_t denote the regret of the algorithm A in the reduction above for a competitor matrix W_{P,Q}, and let Regret_OMP(T, X, P̄Q̄^⊤) = ∑_{t=1}^T ( h_γ(y_t X_{t,(i_t,j_t)}) − h_γ(y_t (P̄Q̄^⊤)_{i_t,j_t}) ) denote the regret of the reduction algorithm for P̄Q̄^⊤. Then,
Regret OMP ( T , X , P ¯ Q ¯ ) Regret OSDP ( T , K , L , W P , Q ) .
Combining Lemmas 5 and 7, we have the following corollary.
Corollary 1. 
There exists an algorithm for the OBMC problem with side information with the following mistake bounds.
M ≤ inf_{P̄Q̄^⊤ ∈ X} ( Regret_OSDP(M, K, L, W_{P,Q}) + hloss(S, (P, Q), γ) ) ≤ Regret_OSDP(M, K, L) + hloss(S, γ).

4.4. Application to Matrix Completion

Combining the previous two reductions, our ultimate reduction immediately follows: we reduce OBMC with side information to the OSDP problem specified by Γ. Compared with the analysis of Herbster et al. [2], our reduction is explicit and easy to follow. Note that, in our reduction, the side information M and N for OBMC is parameterized by Γ; again, Γ is the identity matrix if the side information is vacuous. Finally, running our proposed FTRL-based algorithm with the Γ-calibrated log-determinant regularizer, we remove the logarithmic factor from the previous mistake bound [2]. In particular, our mistake bound is optimal.
Remark 2. 
By the definition of Γ in Equation (28), we have ρ = 1.
According to our analysis, the parameters of the reduced OSDP problem (K, L), with Γ defined in Equation (28), can be set as g = 1/γ, β = ϵ = ρ = 1 and τ = D̂; utilizing Theorem 2, we then obtain the following result:
Regret_OSDP(T, K, L, W) = O( Tη/γ² + D̂/η ).
Next, we give our main algorithm for the OBMC problem with side information (M, N) in Algorithm 1, obtained by putting the two reductions together and proceeding with the FTRL-based algorithm (4).
Theorem 3. 
Running Algorithm 1 with parameter η = γ²√(D̂/T) and γ ∈ (0, 1], the hinge loss of OBMC with side information is bounded as follows:
∑_{t=1}^T h_γ(y_t · ŷ_t) − ∑_{t=1}^T h_γ(y_t · (P̄Q̄^⊤)_{i_t,j_t}) ≤ O( √(D̂T)/γ² ).
Compared with [2], our regret bound with respect to the hinge loss removes the logarithmic factor ln(m + n). Meanwhile, since the mistake-driven technique is involved in our reduction, the horizon T is replaced by M, the number of mistakes, which is unknown in advance. By choosing η independent of M, we can then derive a good mistake bound from the above theorem, resulting in Equation (32).
Algorithm 1 Online binary matrix completion with side information.
1: Parameters: γ > 0, η > 0, side information matrices M ∈ S++^{m×m} and N ∈ S++^{n×n}, quasi-dimension estimator 1 ≤ D̂. Γ is composed as in Equation (28), and the decision set K is given by (30).
2: Initialize W ∈ K and set W_1 = W.
3: for t = 1, 2, …, T do
4:   Receive (i_t, j_t) ∈ [m] × [n].
5:   Let Z_t = (1/2)( e_{i_t} e_{m+j_t}^⊤ + e_{m+j_t} e_{i_t}^⊤ ).
6:   Predict ŷ_t = sgn(W_t • Z_t) and receive y_t ∈ {−1, 1}.
7:   if ŷ_t ≠ y_t then
8:     Let L_t = −(y_t/γ) Z_t and W_{t+1} = argmin_{W ∈ K} ( −ln det(ΓWΓ + E) + η ∑_{s=1}^t W • L_s ).
9:   else
10:    Let L_t = 0 and W_{t+1} = W_t.
11:  end if
12: end for
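To make the FTRL update concrete, the sketch below evaluates the objective with the Γ-calibrated log-determinant regularizer; here we take the regularizer to be the negative log-determinant −ln det(ΓWΓ + E) with E the identity matrix (our reading of the update), and we only evaluate the objective — the actual argmin over K requires a convex (semi-definite) solver, which we do not show:

```python
import numpy as np

def ftrl_objective(W, Gamma, losses, eta):
    """FTRL objective with the Gamma-calibrated log-determinant regularizer:
    -ln det(Gamma W Gamma + E) + eta * sum_s <W, L_s>  (E: identity matrix)."""
    N = W.shape[0]
    reg = -np.linalg.slogdet(Gamma @ W @ Gamma + np.eye(N))[1]
    lin = eta * sum(np.sum(W * L) for L in losses)
    return reg + lin

# tiny illustration: with no losses yet, a "larger" PSD matrix yields a
# smaller objective, since the regularizer rewards large determinants
Gamma = np.eye(2)
obj_small = ftrl_objective(0.1 * np.eye(2), Gamma, [], eta=1.0)
obj_big = ftrl_objective(0.9 * np.eye(2), Gamma, [], eta=1.0)
```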
Theorem 4. 
Algorithm 1 with η = c γ 2 for some c > 0 achieves
M = ∑_{t=1}^T I(ŷ_t ≠ y_t) ≤ O( D̂/γ² ) + 2 hloss(S, γ).
Proof. 
Combining Corollary 1 and the regret bound (32), we have
M ≤ O( Mη/γ² + D̂/η ) + hloss(S, γ).
Choosing η = cγ² for a sufficiently small constant c, we get
M ≤ M/2 + O( D̂/γ² ) + hloss(S, γ),
from which (34) follows. □
Again, if the side information is vacuous, i.e., M and N are identity matrices, then from Remark 1 and Theorem 4 we can set D̂ = m + n and obtain the following mistake bound:
M ≤ O( (m + n)/γ² ) + 2 inf_{PQ^⊤ ∈ R^{m×n}} hloss(S, (P, Q), γ).
Nevertheless, the side information indeed matters in non-trivial cases. When the comparator matrix U contains latent structure, specifically when U is (k, l)-biclustered, the quasi-dimension estimator satisfies D̂ ≤ O(k + l), which is strictly smaller than O(m + n), provided the side information (M, N) is chosen as a special matrix pair according to the structure of (P, Q) (the details are in Appendix A). This instance shows that the accuracy of the prediction can be effectively improved when the side information is selected in correlation with the structure of the underlying matrix.
Note that our mistake bound performs better than the previous bound, especially in the realizable case. Compared with the bound O( (D̂/γ²) ln(m + n) ) in [2], our mistake bound is O( D̂/γ² ), which removes the logarithmic factor ln(m + n). In addition, our bound recovers the known lower bound of Herbster et al. [1] up to a constant factor: if U contains a (k, l)-biclustered structure (k ≥ l), then by setting γ = 1/√l, our mistake bound becomes O(kl), while the lower bound of Herbster et al. is Ω(kl). Thus, the mistake bound of Theorem 4 is optimal.
In the next subsection, we show an example, online similarity prediction with side information, where the comparator matrix is ( k , l ) -biclustered. With the ideal side information, we can effectively improve the mistake bound.

4.5. Online Similarity Prediction with Side Information

In this subsection, we show the application of our reduction method and generalized log-determinant regularizer to online similarity prediction with side information.
Before we introduce the online similarity prediction problem, we need some notation and basic concepts. Let G = (V, E) be an undirected, connected graph with n = |V| and m = |E|. Suppose all vertices of G are assigned to K different classes; i.e., there is an n-dimensional vector Y = (y_1, …, y_n), where y_i ∈ {1, …, K} represents the class of vertex v_i. We denote the graph together with the assignment y by (G, y).
Next, we define the cut-edges of (G, y) as Φ_G(y) = { (i, j) ∈ E : y_i ≠ y_j }, abbreviated Φ_G; the cardinality |Φ_G(y)| is the cut size. For a graph G, the adjacency matrix A is defined by A_{ij} = A_{ji} = 1 if (i, j) ∈ E(G) and A_{ij} = 0 otherwise. The degree matrix D ∈ R^{n×n} of G is the diagonal matrix whose entry D_{ii} is the degree of vertex i. The Laplacian is L = D − A, and we define the PD-Laplacian as L̃ = L + α_L 1_n 1_n^⊤, where 1_n is the n-dimensional all-ones vector. Given a graph G = (V, E) with Laplacian L, and regarding each edge of G as a unit resistor, the effective resistance of any pair of vertices (i, j) ∈ V × V is R^G_{i,j} = (e_i − e_j)^⊤ L^+ (e_i − e_j), where e_i is the standard basis vector in R^n and L^+ is the pseudo-inverse of L.
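The graph quantities above are simple to compute with NumPy; the following sketch (our illustration) builds L = D − A and evaluates the effective resistance R^G_{i,j} = (e_i − e_j)^⊤ L^+ (e_i − e_j) on a 3-vertex path, where the endpoints see two unit resistors in series:

```python
import numpy as np

def laplacian(n, edges):
    """Graph Laplacian L = D - A for an undirected graph on n vertices."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def effective_resistance(L, i, j):
    """R^G_{i,j} = (e_i - e_j)^T L^+ (e_i - e_j), with L^+ the pseudo-inverse."""
    n = L.shape[0]
    v = np.zeros(n)
    v[i], v[j] = 1.0, -1.0
    return float(v @ np.linalg.pinv(L) @ v)

# path graph 0 - 1 - 2
L = laplacian(3, [(0, 1), (1, 2)])
R02 = effective_resistance(L, 0, 2)   # two unit resistors in series: 2.0
```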
Now, online similarity prediction is formulated as follows. Given a K-class-labeled graph G = (V, E), on each round t ∈ [T]:
1. The environment selects a pair of vertices (i_t, j_t) ∈ V × V.
2. The algorithm predicts whether these two vertices belong to the same class: it predicts ŷ_{i_t,j_t} = 1 if it believes they are in the same class, and −1 otherwise.
3. The environment reveals the true answer y_{i_t,j_t}: y_{i_t,j_t} = 1 if the two vertices are in the same class, and −1 otherwise.
The goal of the algorithm is to minimize the number of prediction mistakes M = ∑_{t=1}^T I( ŷ_{i_t,j_t} ≠ y_{i_t,j_t} ). By this formulation, online similarity prediction is a special case of the OBMC problem. We also write S = ( (i_t, j_t), y_t )_{t=1}^T ∈ ( [n] × [n] × {−1, 1} )^T for the sequence of our online similarity prediction problem.
In our paper, the side information for online similarity prediction is the PD-Laplacian L̃ of the graph G. Note that the side information concerns only the graph G itself and is independent of the classification vector Y. Gentile et al. [15] explored this problem by providing the graph G as prior information to the algorithm, which is equivalent to our problem setting. The mistake bound from [15] is given in the following proposition:
Proposition 3. 
Let ( G , y ) be a labeled graph. If we run Matrix Winnow with G as an input graph, we have the following mistake bound:
M W = O | Φ G | max ( i , j ) V 2 R i , j G ln n .
From the definitions of online similarity prediction with side information and of OBMC with side information, online similarity prediction is a special case of OBMC, obtained by taking the comparator matrix U ∈ {−1, 1}^{n×n}, where U_{ij} indicates whether vertices i and j are in the same class: U_{ij} = 1 if they are, and U_{ij} = −1 otherwise. Meanwhile, the side information unifies the symmetric positive definite matrices M, N into L̃.
Moreover, due to [2,15] (see the details in Appendix A.2), the comparator matrix U is in fact a (K, K)-biclustered n × n binary matrix. By the aforementioned result from [2], there exists a matrix R ∈ B_{n,K} such that U = R U* R^⊤, where U* = 2 I_K − 1 1^⊤, with 1 the K-dimensional all-ones vector.
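The decomposition U = R U* R^⊤ with U* = 2 I_K − 1 1^⊤ can be verified directly; the class assignment below is an arbitrary example of ours:

```python
import numpy as np

K, n = 3, 6
cls = np.array([0, 0, 1, 2, 1, 2])            # class of each vertex
R = np.zeros((n, K))
R[np.arange(n), cls] = 1.0                    # R in B_{n,K}: one-hot rows
U_star = 2.0 * np.eye(K) - np.ones((K, K))    # U* = 2 I_K - 1 1^T
U = R @ U_star @ R.T
# U_{ij} = +1 iff vertices i and j share a class, -1 otherwise
```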
As in the previous reductions, the Γ-specified OSDP problem corresponding to online similarity prediction with side information is obtained as follows. First, we define the Γ parameterized by the side information L̃ as the block-diagonal matrix
Γ = [ √α_{L̃} L̃^{1/2}   0 ; 0   √α_{L̃} L̃^{1/2} ],
where α_{L̃} = max_i (L̃^{−1})_{ii}.
Next the decision space K and the loss space L are defined as previously as in Equations (30) and (31), respectively.
Thus, the following proposition demonstrates that the online similarity prediction with side information L ˜ can be reduced to an OSDP problem ( K , L ) parameterized with Γ .
Proposition 4. 
Given an online similarity prediction problem with graph ( G , y ) , set the side information as L ˜ . Then we can reduce this problem to a generalized OSDP problem ( K , L ) with bounded Γ-trace norm such that
K = { X ∈ S++^{2n×2n} : ‖vec(X)‖_∞ ≤ 1 ∧ Tr(ΓXΓ) ≤ D̂ },  L = { −(1/γ) Z_{i,j}, (1/γ) Z_{i,j} : i ∈ [n], j ∈ [n] },
where Γ is defined as above, and
Z_{i,j} = (1/2)( e_i e_{n+j}^⊤ + e_{n+j} e_i^⊤ ).
Therefore, the mistake bound of online similarity prediction is bounded as follows:
M = ∑_{t=1}^T I( ŷ_{i_t,j_t} ≠ y_{i_t,j_t} ) ≤ Regret_OSDP(M, K, L) + hloss(S, γ),
where γ is the margin parameter.
According to [2], there exist P = RP* and Q = RQ*, where
P* = ( P*_{ij} = √2 · I(i = j) + I(j = K + 1) )_{i∈[K], j∈[K+1]}  and  Q* = ( Q*_{ij} = √2 · I(i = j) − I(j = K + 1) )_{i∈[K], j∈[K+1]},
such that U = PQ^⊤. This implies that the hinge loss satisfies hloss(S, γ) = 0 whenever γ ≤ O(1); more specifically, it holds for γ = 1/3.
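A quick NumPy check of this factorization (with √2 diagonal blocks, as we read the construction from [2]): every row of P* and Q* has norm √3, so after row normalization the product P̄*Q̄*^⊤ has entries ±1/3, which is exactly the margin γ = 1/3:

```python
import numpy as np

K = 4
P_star = np.hstack([np.sqrt(2.0) * np.eye(K), np.ones((K, 1))])
Q_star = np.hstack([np.sqrt(2.0) * np.eye(K), -np.ones((K, 1))])
U_star = P_star @ Q_star.T           # equals 2 I_K - 1 1^T
margins = np.abs(U_star / 3.0)       # normalized product: entries +-1/3
```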
Remark 3. 
According to Theorem 3 and Section 4.2 in [2], together with the characterization of online similarity prediction with side information, the quasi-dimension estimator satisfies D̂ ≤ O( Tr(R^⊤ L R) α_{L̃} ), where L is the Laplacian of the corresponding graph G, if we set the side information to be L̃.
Finally, by our Theorem 4, running Algorithm 1 of the main part yields the mistake bound
M ≤ O( Tr(R^⊤ L R) α_{L̃} ).
Remark 4. 
In the work of Herbster et al. [2], the resulting mistake bound is O( Tr(R^⊤ L R) α_{L̃} ln n ), which recovers the bound O( |Φ_G| max_{(i,j)∈V²} R^G_{i,j} ln n ) of [15] up to a constant factor. Moreover, our bound removes the logarithmic factor ln n compared with [2].

5. Connection to a Batch Setting

In this section, we employ the well-known online-to-batch conversion technique (see, for example, [16]) and obtain a batch learning algorithm with generalization error bounds. The results imply that the algorithm performs nearly as well as the support vector machine (SVM) running over the optimal feature space, even when the side information is vacuous. Moreover, with the assistance of the side information, a more refined bound for the batch setting follows from our online analysis.
First, we describe our setting formally. We consider the problem in the standard probably approximately correct (PAC) learning framework [16,19,20]. The algorithm is given the side information matrices M ∈ S++^{m×m} and N ∈ S++^{n×n} and a sample sequence S:
S = ( ( i 1 , j 1 , y 1 ) , ( i 2 , j 2 , y 2 ) , , ( i T , j T , y T ) )
where each triple ( i t , j t , y t ) is randomly and independently generated according to some unknown probability distribution V over [ m ] × [ n ] × { 1 , 1 } . Then the algorithm outputs a hypothesis f : [ m ] × [ n ] [ 1 , 1 ] . The goal is to find, with high probability, a hypothesis f that has small generalization error
R ( f ) = Pr ( i , j , y ) V ( sgn ( f ( i , j ) ) y ) .
In particular, we consider a hypothesis of the form
f_W : (i, j) ↦ W • Z_{i,j},
where W ∈ K = { W ∈ S++^{N×N} : W_{i,i} ≤ 1 for all i ∈ [N], Tr(ΓWΓ) ≤ D̂ }, and Γ is defined as in Equation (28).
In Algorithm 2, we give the algorithm obtained by the online-to-batch conversion.
Algorithm 2 Binary matrix completion in the batch setting.
1: Parameter: γ > 0.
2: Input: a sample S of size T.
3: Run Algorithm 1 over S and get its predictions W_1, W_2, …, W_T.
4: Choose W from { W_1, W_2, …, W_T } uniformly at random.
5: Output f_W.
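Step 4 of Algorithm 2 is the only randomized part of the conversion; a minimal sketch (our illustration):

```python
import random

def online_to_batch(predictions, seed=0):
    """Pick one of the online iterates W_1, ..., W_T uniformly at random;
    its expected error equals the average error of the iterates."""
    rng = random.Random(seed)
    return rng.choice(predictions)

Ws = ["W1", "W2", "W3", "W4"]
W = online_to_batch(Ws)
```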
To bound the generalization error, we use the following lemma, which is straightforward from Lemma 7.1 of [16].
Lemma 8. 
Let L : [ 1 , 1 ] × { 1 , 1 } [ B , B ] be a function and W 1 , , W T and W be the matrices obtained in Algorithm 2. Then, for any δ > 0 , with probability at least 1 δ , the following holds:
E_{(i,j,y)∼V, W}[ L(f_W(i,j), y) ] = (1/T) ∑_{t=1}^T E_{(i,j,y)∼V}[ L(f_{W_t}(i,j), y) ] ≤ (1/T) ∑_{t=1}^T L(f_{W_t}(i_t,j_t), y_t) + B √( 2 ln(1/δ) / T ).
Applying the lemma with the zero-one loss L ( r , y ) = 1 ( sgn ( r ) y ) combined with the mistake bound (34) of Theorem 4, we have the following generalization bound.
Theorem 5. 
For any δ > 0 , with probability of at least 1 δ , Algorithm 2 produces f W with the following property:
E_W[ R(f_W) ] ≤ O( ( D̂/γ² + hloss(S, γ) ) / T ) + √( 2 ln(1/δ) / T ).
On the other hand, applying the lemma with the hinge loss L(r, y) = h_γ(ry), combined with the O( √(D̂T)/γ² ) regret bound of (32) under the minimizer η = γ²√(D̂/T), we have
E_W[ R(f_W) ] ≤ E_{(i,j,y)∼V, W}[ h_γ( y f_W(i,j) ) ] ≤ O( √(D̂/T)/γ² + hloss(S, γ)/T ) + (1 + 1/γ) √( 2 ln(1/δ) / T ),
which is slightly worse than (40). Note that D ^ = O ( m ) , if the side information is vacuous.
Now, we examine some implications of our generalization bounds. First, we assume without loss of generality that m ≥ n, since otherwise we can transpose everything.
As explained in the preceding sections, we can think of each Q̄_j as a feature vector of item j. Assume all feature vectors Q̄ ∈ R^{n×k} are given, and consider the problem of finding a good linear classifier P̄_i for each user i independently. A natural way is to use the SVM, which solves
inf_{γ>0, P_i ∈ R^k} ( 1/γ² + C ∑_{t: i_t = i} h_γ( y_t P̄_i^⊤ Q̄_{j_t} ) )
for every i ∈ [m], for some constant C > 0. Now, if we fix γ to be a constant common to all i, then the optimization problems above are summarized as
inf_{P ∈ R^{m×k}} ∑_{i=1}^m ( 1/γ² + C ∑_{t: i_t=i} h_γ( y_t P̄_i^⊤ Q̄_{j_t} ) ) = inf_{P ∈ R^{m×k}} ( m/γ² + C ∑_t h_γ( y_t P̄_{i_t}^⊤ Q̄_{j_t} ) ) = m/γ² + C inf_P hloss(S, (P, Q), γ).
So, if we further optimize feature vectors, we obtain
m/γ² + C inf_{PQ^⊤ ∈ R^{m×n}} hloss(S, (P, Q), γ) = m/γ² + C hloss(S, γ),
which is roughly proportional to our generalization bound (40) when the side information is vacuous (i.e., M and N are identity matrices). This result implies that our generalization bound is upper bounded by the objective value of the SVM running over the optimal choice of feature vectors. Meanwhile, we expect a more refined bound in the batch setting when the side information is not vacuous.
Moreover, a well known generalization bound for linear classifiers (see, for example, [16]) gives
Pr_{(j,y)∼V_i}( sgn(P̄_i^⊤ Q̄_j) ≠ y ) ≤ (1/T_i) ∑_{t: i_t=i} h_γ( y_t P̄_{i_t}^⊤ Q̄_{j_t} ) + 2√( 1/(γ² T_i) ) + √( ln(1/δ) / (2 T_i) )
for every i [ m ] , where V i is the conditional distribution of V given that the first component is i, and T i is the number of t [ T ] that satisfies i t = i . Assume for simplicity that V ( i ) = j , y V ( i , j , y ) = 1 / m and T i = T / m for every i. Then,
R(f_{W_{P,Q}}) = ∑_{i=1}^m V(i) Pr_{(j,y)∼V_i}( sgn(P̄_i^⊤ Q̄_j) ≠ y ) ≤ ∑_{i=1}^m V(i) ( (1/T_i) ∑_{t: i_t=i} h_γ( y_t P̄_{i_t}^⊤ Q̄_{j_t} ) + 2√( 1/(γ² T_i) ) + √( ln(1/δ) / (2 T_i) ) ) = (1/T) hloss(S, (P, Q), γ) + 2√( m/(γ² T) ) + √( m ln(1/δ) / (2T) ).
Minimizing this bound on R(f_{W_{P,Q}}) over all P and Q such that PQ^⊤ ∈ R^{m×n}, the bound obtained is very similar to our bound (41). This observation implies that our hypothesis has generalization ability competitive with the optimal linear classifiers P̄ over the optimal feature vectors Q̄.

6. Conclusions

In this paper, on the one hand, we considered a variant of the OSDP problem in which the decision space is constrained by a bounded Γ-trace norm, a generalization of the trace norm in the standard OSDP problem. Using an FTRL-based algorithm with a Γ-calibrated log-determinant regularizer, we achieved a regret bound independent of the size of the matrix. On the other hand, we applied our result to the online binary matrix completion problem with side information: we reduced OBMC with side information to our new OSDP framework and parameterized the side information by Γ. Combining our proposed algorithm with the result for the generalized OSDP framework, we obtained a tighter mistake bound than the previous work by removing the logarithmic factor. Furthermore, our result in the offline setting compares favorably with traditional margin-based SVMs with the best kernel when the side information is not vacuous.

Author Contributions

Conceptualization, K.-i.M., Y.L., K.H. and E.T.; methodology, K.-i.M. and Y.L.; validation, K.H. and E.T.; formal analysis, Y.L., K.H. and E.T.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L., K.H. and E.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI Grant Numbers JP19H04174, JP19H04067 and JP20H05967, respectively.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank the reviewers for their suggestions and comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Proof of Main Proposition and Main Theorem

Lemma A1 
([7]). Let P and Q be probability distributions over R N and ϕ P ( u ) and ϕ Q ( u ) be their characteristic functions, respectively. Then
max_{u ∈ R^N} | φ_P(u) − φ_Q(u) | ≤ ∫ | P(x) − Q(x) | dx,
where the right-hand side is the total variation distance between the distributions P and Q.
Lemma A2 
([18]). Let P and Q be probability distributions over R N with total variation distance δ . Then
H( αP + (1−α)Q ) ≥ α H(P) + (1−α) H(Q) + α(1−α) δ²,
where H(P) = E_{x∼P}[ −ln P(x) ].
Lemma A3 
([7]). For any probability distribution P over R^N with zero mean and covariance matrix Σ, its entropy is bounded by the log-determinant of the covariance matrix; that is,
H(P) ≤ (1/2) ln( det(Σ) (2πe)^N ).
Note that the equality holds if and only if P is a Gaussian distribution.
Lemma A4 
([7]).
e^{−x/2} − e^{−1} x/2 ≥ ( e^{−1/4}/2 )( 1 − 2x ),
for 0 ≤ x ≤ 1/2.

Appendix A.2. Definition of Biclustered Structure and Ideal Quasi Dimension

As in [2], we define the class of ( k , l ) -biclustered structure matrices as follows.
Definition A1. 
For m ≥ k and n ≥ l, the class of (k, l)-binary biclustered matrices is defined as
B^{m×n}_{k,l} = { U ∈ {−1, +1}^{m×n} : ∃ r ∈ [k]^m, ∃ c ∈ [l]^n, ∃ V ∈ {−1, 1}^{k×l} such that U_{i,j} = V_{r_i, c_j} for all i ∈ [m], j ∈ [n] }.
Visually, if U ∈ B^{m×n}_{k,l}, there exists a permutation of the rows and columns of U after which U becomes a k × l block matrix in which the entries of each block are uniformly labeled −1 or +1. Formally, any matrix U ∈ B^{m×n}_{k,l} can be decomposed as U = R U′ C^⊤ for some U′ ∈ {−1, +1}^{k×l}, R ∈ B_{m,k} and C ∈ B_{n,l}, where B_{m,d} = { R ∈ {0, 1}^{m×d} : ‖R_i‖₂ = 1 for all i ∈ [m], rank(R) = d }.
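Definition A1 can be instantiated in a few lines; the sketch below (our example values) builds a (2, 2)-biclustered 4 × 3 matrix U with U_{ij} = V_{r_i, c_j}:

```python
import numpy as np

def biclustered(r, c, V):
    """U in B^{m x n}_{k,l}: U_{ij} = V_{r_i, c_j} for row/column class
    assignments r in [k]^m, c in [l]^n and a k x l sign pattern V."""
    return V[np.ix_(r, c)]

V = np.array([[1, -1], [-1, 1]])   # 2 x 2 block sign pattern
r = np.array([0, 0, 1, 1])         # row classes
c = np.array([0, 1, 1])            # column classes
U = biclustered(r, c, V)           # 4 x 3, (2, 2)-biclustered
```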
In the following theorem, we demonstrate that in the OBMC problem with side information, the quasi-dimension can be effectively bounded with an appropriate choice of side information, if the underlying matrix U is biclustered.
Theorem A1 
([2]). For U ∈ B^{m×n}_{k,l}, define D°_{M,N}(U) as
D°_{M,N}(U) = 2 Tr(R^⊤ M R) α_M + 2 Tr(C^⊤ N C) α_N + 2k + 2l,
where M, N are PD-Laplacians, minimized over all decompositions U = R U* C^⊤ for some U* ∈ {−1, +1}^{k×l}, R ∈ B_{m,k} and C ∈ B_{n,l}. Then, for U ∈ B^{m×n}_{k,l},
D^γ_{M,N}(U) ≤ D°_{M,N}(U),
provided ‖U‖_max ≤ 1/γ.
Moreover, we define the max-norm of a matrix U ∈ R^{m×n} as follows:
‖U‖_max = min_{PQ^⊤ = U} ( max_{1≤i≤m} ‖P_i‖ ) ( max_{1≤j≤n} ‖Q_j‖ ).
Furthermore, we define the quasi-dimension of a matrix U with M ∈ S++^{m×m} and N ∈ S++^{n×n} at margin γ as
D^γ_{M,N}(U) = min_{P̄Q̄^⊤ = γU} ( α_M Tr(P̄^⊤ M P̄) + α_N Tr(Q̄^⊤ N Q̄) ).
See Section 4.1 of [2]: if U is a (k, l)-biclustered structured matrix, they give an example where D°_{M,N}(U) ≤ O(k + l) with ideal side information. This implies that in a realizable case, e.g., for a sequence S = ((i_t, j_t), y_t)_{t=1}^T with y_t = (P̄Q̄^⊤)_{i_t,j_t} = U_{i_t,j_t} for some U satisfying the conditions in [2] and (P̄, Q̄) = argmin_{P,Q} D^γ_{M,N}(U), with appropriate side information the quasi-dimension estimator satisfies D̂ ≤ O(k + l). It follows that, with ideal side information, our proposed algorithm in the main part (Algorithm 1) is effective when the binary comparator matrix U is (k, l)-biclustered.

References

  1. Herbster, M.; Pasteris, S.; Pontil, M. Mistake bounds for binary matrix completion. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3954–3962.
  2. Herbster, M.; Pasteris, S.; Tse, L. Online Matrix Completion with Side Information. Adv. Neural Inf. Process. Syst. 2020, 33, 20402–20414.
  3. Zhang, X.; Wang, L.; Yu, Y.; Gu, Q. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5862–5871.
  4. Beckerleg, M.; Thompson, A. A divide-and-conquer algorithm for binary matrix completion. Linear Algebra Its Appl. 2020, 601, 113–133.
  5. Bennett, J.; Lanning, S. The Netflix prize. In Proceedings of the KDD Cup and Workshop, Halifax, NS, Canada, 13–17 August 2007; Volume 2007, p. 35.
  6. Hazan, E.; Kale, S.; Shalev-Shwartz, S. Near-optimal algorithms for online matrix prediction. SIAM J. Comput. 2016, 46, 744–773.
  7. Moridomi, K.i.; Hatano, K.; Takimoto, E. Online linear optimization with the log-determinant regularizer. IEICE Trans. Inf. Syst. 2018, 101, 1511–1520.
  8. Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning, and Games; Cambridge University Press: Cambridge, UK, 2006.
  9. Shalev-Shwartz, S. Online learning and online convex optimization. Found. Trends Mach. Learn. 2012, 4, 107–194.
  10. Hazan, E. Introduction to Online Convex Optimization. Found. Trends Optim. 2016, 2, 157–325.
  11. Abernethy, J.D. Can We Learn to Gamble Efficiently? In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), Haifa, Israel, 27–29 June 2010; pp. 318–319.
  12. Shamir, O.; Shalev-Shwartz, S. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In Proceedings of the 24th Annual Conference on Learning Theory, Budapest, Hungary, 9–11 July 2011; pp. 661–678.
  13. Cesa-Bianchi, N.; Shamir, O. Efficient online learning via randomized rounding. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 343–351.
  14. Koltchinskii, V.; Lounici, K.; Tsybakov, A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 2011, 39, 2302–2329.
  15. Gentile, C.; Herbster, M.; Pasteris, S. Online Similarity Prediction of Networked Data from Known and Unknown Graphs. In Proceedings of the Conference on Learning Theory, Princeton, NJ, USA, 12–14 June 2013; pp. 662–695.
  16. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; Adaptive Computation and Machine Learning Series; MIT Press: Cambridge, MA, USA, 2012.
  17. Liu, Y.; Moridomi, K.I.; Hatano, K.; Takimoto, E. An online semi-definite programming with a generalised log-determinant regularizer and its applications. In Proceedings of the Asian Conference on Machine Learning, PMLR, Virtual, 17–19 November 2021; pp. 1113–1128.
  18. Christiano, P. Online local learning via semidefinite programming. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 31 May–3 June 2014; pp. 468–474.
  19. Valiant, L.G. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142.
  20. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


