A Customized ADMM Approach for Large-Scale Nonconvex Semidefinite Programming

Sun, Chuangchuang

doi:10.3390/math11214413


by

Chuangchuang Sun

^†

Mississippi State University, Starkville, MS 39762, USA

^†

Current address: 75 B. S. Hood Rd, Mississippi State, MS 39762, USA.

Mathematics 2023, 11(21), 4413; https://doi.org/10.3390/math11214413

Submission received: 23 September 2023 / Revised: 16 October 2023 / Accepted: 16 October 2023 / Published: 24 October 2023

(This article belongs to the Section Computational and Applied Mathematics)


Abstract


We investigate a class of challenging general semidefinite programming problems with extra nonconvex constraints such as matrix rank constraints. This problem has extensive applications, including combinatorial graph problems, such as MAX-CUT and community detection, reformulated as quadratic objectives over nonconvex constraints. A customized approach based on the alternating direction method of multipliers (ADMM) is proposed to solve the general large-scale nonconvex semidefinite programming efficiently. We propose two reformulations: one using vector variables and constraints, and the other further reformulating the Burer–Monteiro form. Both formulations admit simple subproblems and can lead to significant improvement in scalability. Despite the nonconvex constraint, we prove that the ADMM iterates converge to a stationary point in both formulations, under mild assumptions. Additionally, recent work suggests that in this matrix form, when the matrix factors are wide enough, the local optimum with high probability is also the global optimum. To demonstrate the scalability of our algorithm, we include results for MAX-CUT, community detection, and image segmentation.

Keywords:

semidefinite optimization; symmetric matrix factorization; nonconvex optimization; large-scale graph problems

MSC:

90C22; 90C26

1. Introduction

We consider rank-constrained semidefinite optimization problems (SDPs) of the type:

\begin{matrix} \min_{Z, X} & f (Z), \\ s . t . & A (Z) = b, Z = X X^{T}, X \in C, \end{matrix}

(1)

where the matrix variable

Z \in S_{+}^{n}

is an

n \times n

symmetric semidefinite matrix, and

X \in R^{n \times r}

a low-rank symmetric factor. The linear constraints

A (Z) = b

constrain either the diagonal or trace of Z, and the set

C

controls desirable features of the factor—e.g., nonnegativity, integer, norm-1, etc. (

C

may be nonconvex.) The objective function

f (Z)

is convex, differentiable everywhere, with

L_{f}

-Lipschitz gradient, but the overall problem (1) is nonconvex.

This problem is equivalent to many important nonconvex SDPs, such as the MAX-CUT problem and its related applications [1,2,3], rank-constrained nonnegative matrix factorization problem [4,5], and constrained eigenvalue problems [6,7,8]. It is known that exactly solving (1) globally is in general a very difficult problem, as it includes many NP-hard problems. Methods for heuristically solving (1) fall in three categories: (i) solving the convexified SDP, where (1) does not have the rank-r or

X \in C

constraint, using any convex optimization method [9,10,11], (ii) approximately solving (1) using an alternating minimization method [12,13] and relying on statistical arguments suggesting that the local optimum obtained is also the global optimum [13], or (iii) using other application-specific approaches [2,14]. The methods investigated in this paper fall into the second category. Specifically, we investigate solving (1) using ADMM and linearized ADMM on two reformulations. We find that these flexible reformulations allow easy incorporation of low-rank and sparse structures, making the resulting algorithm scalable in both memory and computation, which we demonstrate on a number of popular applications.

However, nonconvex formulations of SDPs are often not favored because the convergence behavior of standard algorithms is not well understood. Specifically, an iterative procedure can do one of four things: diverge, oscillate within a bounded interval, converge to an arbitrary point, or converge to a useful point. We show that linearized ADMM on a nonsymmetric reformulation of (1) can either converge to a stationary point or diverge to

\pm \infty

; it cannot oscillate or converge to a non-stationary point. Additionally, for the case without linear constraints, vanilla ADMM is guaranteed to converge to a stationary point with a monotonically decreasing augmented Lagrangian term, and at a linear rate if the objective is strongly convex.

2. Applications

It is well known that many convex optimization problems can be reformulated as SDPs (e.g., [15]). In nonconvex optimization, SDPs are studied in several key areas, as tight convex relaxations of otherwise NP-hard problems.

2.1. Combinatorial Problems

A simple reparameterization of the constraint

x \in R^{n}

,

x_{i} \in {- 1, 1}

is as

X = x x^{T}

,

diag (X) = 1

. This property has been heavily exploited for finding lower bounds in combinatorial optimization [9,16,17] and generalized further to polynomial optimization [18,19]. Of high interest is the MAX-CUT problem:

\min_{x \in R^{n}} x^{T} C x, s . t . x_{i} \in {- 1, 1}, i = 1, \dots, n,

(2)

where

C = (A - diag (A 1)) / 4

and

A \in S^{n}

is the symmetric adjacency matrix of an undirected graph. Written in this way, the solution to (2) is exactly the maximum cut of an undirected graph with nonnegative weights

A_{i j}

.
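As a small sanity check of this construction, the following sketch builds C = (A - diag(A1))/4 for a toy graph and verifies by brute force that x^T C x equals the negative cut weight for every sign vector x, so minimizing (2) maximizes the cut. The helper names (`maxcut_cost_matrix`, `quad_form`, `cut_weight`, `brute_force_maxcut`) and the 4-cycle example are illustrative assumptions, not part of the original method, and the enumeration is only feasible for very small n.

```python
from itertools import product

def maxcut_cost_matrix(A):
    """Return C = (A - diag(A 1)) / 4 for an adjacency matrix given as a list of lists."""
    n = len(A)
    degree = [sum(row) for row in A]
    return [[(A[i][j] - (degree[i] if i == j else 0.0)) / 4.0 for j in range(n)]
            for i in range(n)]

def quad_form(C, x):
    """Evaluate x^T C x."""
    n = len(x)
    return sum(C[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def cut_weight(A, x):
    """Total weight of edges whose endpoints receive different signs."""
    n = len(x)
    return sum(A[i][j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])

def brute_force_maxcut(A):
    """Enumerate all sign vectors (hypothetical helper, exponential in n)."""
    C = maxcut_cost_matrix(A)
    return min(product((-1, 1), repeat=len(A)), key=lambda x: quad_form(C, x))

# Toy example (assumption): an unweighted 4-cycle, whose maximum cut has weight 4.
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
best = brute_force_maxcut(A)
print(best, cut_weight(A, best))
```

The identity follows because x_i^2 = 1 makes x^T diag(A1) x equal to the sum of all entries of A, leaving only the cut edges with a nonzero contribution.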

This seemingly simple framework appears in many other applications, such as community detection [20] and image segmentation [21], and is equivalent to the nonconvex SDP:

\min_{Z} Tr (C Z), s . t . Z_{k k} = 1, Z ⪰ 0, rank (Z) = 1 .

(3)

Lifting

x \in R^{n}

to a skinny matrix

X \in R^{n \times k}

generalizes this technique to partitioning [22] and graph coloring problems [23].

2.1.1. Related Works on MAX-CUT

More generally, combinatorial problems can be solved using branch-and-bound schemes, using a linear relaxation of (1) as a bound [24,25], where the binary constraint

x \in {- 1, 1}

is relaxed to

0 \leq (x + 1) / 2 \leq 1

. Historically, these “polyhedral methods” were the main approach to finding exact solutions to the MAX-CUT problem. Though this is an NP(non-deterministic polynomial-time)-hard problem, if the graph is sparse enough, branch-and-bound converges quickly even for very large graphs [25]. However, when the graph is not very sparse, the linear relaxation is loose, and finding efficient branching mechanisms is challenging, causing the algorithm to run slowly. The MAX-CUT problem can also be approximated by one pass of the linear relaxation (with bound

\frac{f_{relax}}{f_{exact}} \geq 2 \times #

edges) [26].

A tighter approximation can be found with the semidefinite relaxation, which is also used for better bounding in branch-and-bound techniques [27,28,29,30,31]. In particular, the rounding algorithm of [9] returns a feasible

\hat{x}

given optimal Z, and is shown in expectation to satisfy

\frac{x^{T} C x}{{\hat{x}}^{T} C \hat{x}} \geq 0.878

. For this reason, the semidefinite relaxation for problems of type (1) is heavily studied (e.g., [11,32,33]).

2.1.2. Specialization to Community Detection

A small modification of the matrix C generalizes problems of form (2) and (3) to community detection in machine learning. Here, the problem is to identify node clusters in undirected graphs that are more likely to be connected with each other than with nodes outside the cluster. This prediction is useful in many graphical settings, such as interpreting online communities through social networks or linking behavior [34], interpreting biological ecosystems [35], finding disease sources in epidemiology [36], and many more. There are many varieties and methodologies in this field, and it would be impossible to list them all, though many comprehensive overviews exist (e.g., [2]).

The stochastic binary model [37] is one of the simplest generative models for this application. Given a graph with n nodes and parameters

0 < q < p < 1

, the model partitions the nodes into two communities and generates an edge between nodes in a community with probability p and nodes in two different communities with probability q. Following the analysis in [20], we can define

C = \frac{p + q}{2} 1 1^{T} - A

, where A is the graph adjacency matrix, and the solution to (1) gives a solution to the community detection problem with sharp recovery guarantees.
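To make the role of this cost matrix concrete, here is a minimal sketch on a noiseless toy graph: two disjoint triangles, which can be viewed as the idealized limit of the model with p close to 1 and q close to 0. The helper names, the values p = 0.75 and q = 0.25, and the brute-force search are assumptions for illustration only; they are not the solver used in this paper.

```python
from itertools import product

def community_cost_matrix(A, p, q):
    """Return C = (p + q)/2 * 1 1^T - A for an adjacency matrix A (list of lists)."""
    n = len(A)
    shift = (p + q) / 2.0
    return [[shift - A[i][j] for j in range(n)] for i in range(n)]

def quad_form(C, x):
    """Evaluate x^T C x."""
    n = len(x)
    return sum(C[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def best_labeling(C):
    """Brute-force minimizer of x^T C x over sign vectors (tiny n only)."""
    return min(product((-1, 1), repeat=len(C)), key=lambda x: quad_form(C, x))

def two_cliques(size):
    """Adjacency matrix of two disjoint cliques with `size` nodes each."""
    n = 2 * size
    return [[1.0 if i != j and (i < size) == (j < size) else 0.0 for j in range(n)]
            for i in range(n)]

A = two_cliques(3)
C = community_cost_matrix(A, p=0.75, q=0.25)
labels = best_labeling(C)
print(labels)  # one sign per node; nodes with equal signs form a community
```

Here x^T C x = ((p + q)/2)(sum of x)^2 - x^T A x, so the first term favors balanced partitions and the second rewards keeping connected nodes together.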

2.2. Nonnegative Factorization

For a symmetric matrix C, the maximum eigenvalue/eigenvector pair of C is the solution to the nonconvex optimization problem:

\max_{x \in R^{n}} x^{T} C x, s . t . {∥ x ∥}_{2} = 1 .

(4)

By inverting the sign of C, we can transform this into a minimization problem or equivalently acquire the minimum eigenvalue/eigenvector pair. Interestingly, despite the nonconvex nature of (4), we have many efficient globally optimal methods for finding x, e.g., Lanczos, Arnoldi, etc. However, once any additional constraint is added, such as nonnegativity of x [38], simple methods generally do not work without heavy data assumptions [39]. This is of interest in problems such as phase retrieval, recommender systems with positive-only observations, clustering and topic models, etc. Here, we discuss three variations of the nonnegative factorization problem appearing in the literature, all of which are special instances of (1).

2.2.1. Optimization over Spectrahedron

We can frame (4) as a linear objective over the spectrahedron:

\min_{Z \in S^{n}} Tr (C Z), s . t . Tr (Z) = 1, Z ⪰ 0 .

(5)

If additionally the maximum eigenvalue of C is isolated (corresponding only to one leading eigenvector), then

Z = x x^{T}

and

C x = λ_{max} (C) x

. To see this, by definition,

\begin{matrix} λ_{max} (C) & = & \max_{x : {∥ x ∥}_{2} = 1} x^{T} C x \\ & = & \max_{Z : Z = x x^{T}, {∥ x ∥}_{2} = 1} Tr (C Z) \\ & = & \max_{Z : Tr (Z) = 1, Z ⪰ 0} Tr (C Z) . \end{matrix}

(6)
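The chain of equalities in (6) can be checked numerically on a small matrix. The sketch below uses plain power iteration (an assumption made for brevity; it stands in for Lanczos or Arnoldi and only recovers the largest eigenvalue when that eigenvalue also dominates in magnitude) and then confirms that Z = x x^T has unit trace and attains Tr(CZ) equal to the eigenvalue estimate. The function names and the 2 x 2 test matrix are illustrative.

```python
import math

def mat_vec(C, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in C]

def power_iteration(C, iters=200):
    """Estimate the dominant eigenpair of a symmetric matrix C."""
    n = len(C)
    x = [1.0] + [0.0] * (n - 1)          # arbitrary starting vector (assumption)
    for _ in range(iters):
        y = mat_vec(C, x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    lam = sum(xi * yi for xi, yi in zip(x, mat_vec(C, x)))  # Rayleigh quotient
    return lam, x

C = [[2.0, 1.0],
     [1.0, 2.0]]                          # eigenvalues 3 and 1
lam, x = power_iteration(C)

# Lift to Z = x x^T and evaluate the spectrahedron objective Tr(C Z).
Z = [[xi * xj for xj in x] for xi in x]
trace_Z = sum(Z[i][i] for i in range(len(x)))
trace_CZ = sum(C[i][j] * Z[j][i] for i in range(len(x)) for j in range(len(x)))
print(lam, trace_Z, trace_CZ)
```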

As a consequence, note that though (5) is convex, the solution

Z^{*}

will always have rank 1 when

λ_{max} (C)

has multiplicity 1. A simple extension of (5) often used in nonnegative PCA [40] is:

\begin{matrix} \min_{Z \in S^{n}, x \in R^{n}} Tr (C Z), \\ s . t . Tr (Z) = 1, Z ⪰ 0, Z = x x^{T}, x \geq 0, \end{matrix}

(7)

which is an instance of (1) with

C

the nonnegative orthant.

2.2.2. Factorization with Partial Observations

An equivalent way of formulating the top-k nonnegative-eigenvector problem is as the nonnegative minimizer X to

{∥ X X^{T} - C ∥}_{2}

where

X \in R^{n \times k}

. However, in many applications, we may not have a full view of the matrix C (e.g., when C is a rating matrix). Suppose that an index set

Ω

defines the observed entries, e.g.,

{i, j} \in Ω

implies that

C_{i j}

is known. Then, the nonnegative factorization problem can be written as:

\begin{matrix} \min_{Z \in S^{n}, x \in R^{n}} \sum_{i, j \in Ω} {(Z_{i j} - C_{i j})}^{2}, \\ s . t . Z = x x^{T}, x \geq 0 . \end{matrix}

(8)

This formulation appears in [41].

2.2.3. Projective Nonnegative Matrix Factorization

A third method toward this goal is to optimize over the low-rank projection matrix itself [42], a variant of nonnegative matrix factorization, solving:

\begin{matrix} min_{Z \in S^{n}, X \in R^{n \times k}} {∥ B - Z B ∥}_{2}, \\ s . t . Z = X X^{T}, X \geq 0, \end{matrix}

(9)

Here, the data matrix may not even be symmetric, but

\frac{1}{Tr (Z)} Z B

will approximate the projection of B to its top-k singular vectors.

3. Related Work

3.1. Convex Relaxations

If

r = n

and

C = S^{n}

, then (1) is a convex problem, and can be solved using many conventional methods with strong convergence guarantees. However, even in this case, if n is large, traditional semidefinite solvers are computationally limiting. In the most general case, an interior point method solves at each iteration a KKT system of at least order

n^{6}

, and most first-order methods for general SDPs require eigenvalue decompositions, which are of order

O (n^{3})

per iteration.

3.2. Low-Rank Convex Cases

In fact, assuming low-rank solutions often allows for the construction of faster SDP methods. In [43], it is noted that the rank of the primal PSD matrix variable is equal to the multiplicity of the matrix variable arising from the gauge dual formulation, and finding only those r corresponding eigenvectors can recover the primal solution. In [10], a similar observation is made of the Lagrange dual variable and thus the dual problem can be solved via a modified bundle method. More generally, the recently popularized conditional gradient algorithm (also called the Frank–Wolfe algorithm) efficiently solves norm-constrained problems for nonsymmetric matrices [44], exploiting the fact that the dual norm minimizer can be computed efficiently; see also [45,46,47].

3.3. Nonconvex Cases

In close connection with these observations, [12,48] proposed simply reformulating semidefinite matrix variables

Z = X X^{T}

, solving the “standard” nonconvex SDP:

\min_{X \in R^{n \times r}} 〈 C, X X^{T} 〉, s . t . A (X X^{T}) = b,

(10)

by sequentially optimizing the Lagrangian. However, solving (10) is still numerically burdensome; in the augmented Lagrangian term, the objective is quartic in X, and is usually solved using an iterative numerical method, such as L-BFGS.

3.4. Global Optimality of a Nonconvex Problem with Linear Objective

The main motivation behind solving rank-constrained problems using convex optimization methods comes from key results in [49,50] which show that for a linear SDP, when

X^{*}

is the optimum and

r = rank (X^{*})

, then

\frac{r (r + 1)}{2} \geq m

where m is the number of linear constraints. Furthermore, a recent work [13] shows that almost all local optima of the factored SDP (10), abbreviated FSDP, are also global optima, suggesting that any stationary point of the FSDP is also a reasonable approximation of (1), if the constraint space of (10) is compact and sufficiently smooth, e.g.,

A_{i} Y

linearly independent whenever

〈 A_{i}, Y Y^{T} 〉 = b_{i}

for all

i = 1, \dots, m

. The MAX-CUT problem satisfies this constraint; an example of a linear SDP without this condition is the phase retrieval problem [51], when

m > n

.

3.5. Nonconvex Constraint $C$

Although there are many cases where the linear constraint in (1) serves a distinct purpose, largely it is introduced to tighten the convex relaxation. When working in the nonconvex formulation, for many applications, the linear constraint becomes superfluous, and a more useful reformulation may be:

\min_{x, y} g (x), s . t . x = y, y \in C,

for some nonconvex set

C

(e.g.,

C = {- 1, 1}^{n}

). Note that the projection on

C

is extremely easy, despite its nonconvexity. Although less explored, this idea is not new; see [52] (chapter 9).

3.6. ADMM for Nonconvex Problems

The alternating direction method of multipliers (ADMM) [53,54] is now a popular method [52] for convex large-scale distributed optimization problems with understood convergence rates [55] and variations [56,57,58]. It is closely related to dual decomposition methods, but alternates its subproblems and makes use of augmented Lagrangians, which smooth the subproblems and reduce the influence of the dual ascent step size. Although there are extensions to many variable blocks, most ADMM implementations use two-block decompositions, solving:

\min_{x, y} g (x) + h (y), s . t . A x = B y,

by alternatingly minimizing over each variable in the augmented Lagrangian:

L_{ρ} (x, y; u) = g (x) + h (y) + u^{T} (A x - B y) + \frac{ρ}{2} {∥ A x - B y ∥}_{2}^{2},

and then incrementally updating the dual variable:

\begin{matrix} x^{+} & = & arg min_{x} L_{ρ} (x, y; u), \\ y^{+} & = & arg min_{y} L_{ρ} (x^{+}, y; u), \\ u^{+} & = & u + ρ (A x^{+} - B y^{+}) . \end{matrix}

Here, any

ρ > 0

will achieve convergence.
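The three updates can be illustrated on a scalar convex toy problem. The sketch below assumes g(x) = (x - a)^2 / 2, h(y) = λ|y|, and A = B = 1, so that both subproblems have closed forms (a weighted average and a soft threshold); these choices and the function names are assumptions made for illustration and are unrelated to the SDP formulations studied later.

```python
def soft_threshold(v, kappa):
    """Proximal operator of kappa * |.| evaluated at v."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def admm_scalar(a, lam, rho=1.0, iters=200):
    """Two-block ADMM for min_x,y 0.5*(x - a)^2 + lam*|y|  s.t.  x = y."""
    x = y = u = 0.0
    for _ in range(iters):
        # x-update: minimize 0.5*(x - a)^2 + u*(x - y) + (rho/2)*(x - y)^2
        x = (a - u + rho * y) / (1.0 + rho)
        # y-update: minimize lam*|y| - u*y + (rho/2)*(x - y)^2
        y = soft_threshold(x + u / rho, lam / rho)
        # dual ascent step on the constraint residual x - y
        u = u + rho * (x - y)
    return x, y, u

x, y, u = admm_scalar(a=3.0, lam=1.0)
print(x, y, u)  # x and y should agree with the soft threshold of a at level lam
```

For a = 3 and λ = 1 the minimizer of the original problem is the soft threshold of 3 at level 1, and the iterates approach it for any positive ρ, although the speed depends on ρ.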

In general, there is a lack of theoretical justification for ADMM on nonconvex problems despite its good numerical performance. Almost all works concerning ADMM on nonconvex problems investigate when nonconvexity is in the objective functions ([59,60,61,62,63], and also [64,65] for matrix factorization). Under a variety of assumptions (e.g., convergence or boundedness of dual objectives) they are shown to converge to a KKT stationary point.

In comparison, relatively fewer works deal with nonconvex constraints. Ref. [66] tackles polynomial optimization problems by minimizing a general objective over a spherical constraint

{∥ x ∥}_{2} = 1

, Ref. [67] solves general QCQPs, and Ref. [68] solves the low-rank-plus-sparse matrix separation problem. In all cases, they show that all limit points are also KKT stationary points, but do not show that their algorithms will actually converge to the limit points. In this work, we investigate a class of nonconvex constrained problems and show with much milder assumptions that the sequence always converges to a KKT stationary point.

We now present our main results, the algorithms, and convergence analysis for different formulations.

4. Linearized ADMM on Full SDP

We first investigate a reformulation of (1) as:

\begin{matrix} \min_{Z, X, Y} & f (Z) + δ_{{0}} (A (Z) - b) + δ_{C} (Y), \\ s . t . & Z = {(X Y^{T})}_{Ω}, X = Y, \end{matrix}

(11)

with variables

Z \in S^{n \times n}

,

X \in R^{n \times r}

, and

Y \in R^{n \times r}

. The affine and

C

constraints are lifted to the objective via an indicator function:

δ_{C} (x) = \begin{cases} 0, & \text{if } x \in C, \\ \infty, & \text{otherwise} . \end{cases}

The notation

A_{Ω}

for a symmetric matrix A is the projection of A on the sparsity pattern

Ω

:

{(A_{Ω})}_{i j} = \begin{cases} A_{i j}, & \text{if } \{i, j\} \in Ω, \\ 0, & \text{otherwise}, \end{cases}

and we write

A \in S_{Ω}^{n}

if

A_{Ω} = A

. Specifically,

Ω

captures the effective sparsity of the problem; that is,

f (Z) = f (Z_{Ω})

and

A (Z) = A (Z_{Ω})

. We assume

{i, i} \in Ω

for all i, so the second is trivially true.
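A minimal sketch of the projection onto a sparsity pattern is given below. It assumes Ω is supplied as a collection of unordered index pairs (including the diagonal pairs) and stores matrices as lists of lists; the function name `project_onto_pattern` is hypothetical, and a practical implementation would use a sparse matrix format instead of a dense one.

```python
def project_onto_pattern(A, omega):
    """Keep A[i][j] when the unordered pair {i, j} is in omega, zero it otherwise."""
    n = len(A)
    # frozenset makes (i, j) and (j, i) identical, so the result stays symmetric
    pattern = {frozenset(pair) for pair in omega}
    return [[A[i][j] if frozenset((i, j)) in pattern else 0.0 for j in range(n)]
            for i in range(n)]

A = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 5.0],
     [3.0, 5.0, 6.0]]
omega = [(0, 0), (1, 1), (2, 2), (0, 1)]   # diagonal plus one off-diagonal pair
print(project_onto_pattern(A, omega))
```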

4.1. Duality

As shown in [69], a notion of a dual problem can be established via the augmented Lagrangian of (11):

\begin{matrix} L_{ρ} (Z, X, Y; S, U) = \\ f (Z) + δ_{C} (Y) + 〈 U, X - Y 〉 + 〈 S, Z - X Y^{T} 〉 \\ + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} + \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2}, \end{matrix}

(12)

where the dual problem is:

\max_{S, U} min_{Z, X, Y} L_{ρ} (Z, X, Y; S, U) .

The minimization of

L_{ρ}

over Z and X is the solution to:

\begin{matrix} \nabla f (Z) - A^{*} (ν) + S + ρ (Z - X Y^{T}) & = & 0 \\ U - S Y + ρ (X Y^{T} Y - Z Y) + ρ (X - Y) & = & 0 \\ A (Z) & = & b, \end{matrix}

(13)

where

ν > 0

is a Lagrange dual variable for the local constraint

A (Z) = b

. The minimization of

L_{ρ}

over Y is the solution to the generalized projection problem:

min_{Y \in C} {〈 Y - \hat{Y}, Y - \hat{Y} 〉}_{H} = Tr ((Y - \hat{Y}) H {(Y - \hat{Y})}^{T}),

(14)

where:

\hat{Y} = U + S X + ρ (X + Z^{T} X), H = ρ (I + X^{T} X) .

For general nonconvex problems, it is difficult to guarantee global minimality. Here, we introduce two sought-after properties that are more reasonably attainable.

Definition 1

([70]). The tangent cone of a nonconvex set

C

at x is given by

T_{C} (x) = {d : for all t \to 0, \hat{x} \to x, \hat{x} \in C, there exists \hat{d} \to d, \hat{x} + t \hat{d} \in C} .

The normal cone of

C

at x (

: = N_{C} (x)

) is the polar of the tangent cone.

Definition 2.

For a minimization of a smooth constrained function

min_{x \in C} f (x)

we say that

x^{*}

is a KKT-stationary point if

- \nabla f (x^{*}) \in N_{C} (x^{*})

.

Definition 3.

For a function defined over m variables

L (X_{1}, \dots, X_{m})

, we say that

X_{1}^{*}, \dots, X_{m}^{*}

are (block) coordinatewise minimum points if for each

k = 1, \dots, m

,

X_{k}^{*} = \underset{X}{argmin} L (X_{1}^{*}, \dots, X_{k - 1}^{*}, X, X_{k + 1}^{*}, \dots, X_{m}^{*}) .

Note that it is not always the case that stationarity is stronger than coordinatewise minimum. A simple example is

C = {- 1, 1}^{n}

. Then, for all points

x \in C

, the tangent cone is

{0}

and the normal cone is

R^{n}

. Then, every point in

C

is stationary, no matter the objective function.

Proposition 1.

If Algorithm 1 converges to coordinatewise minimum points

({(X, Z)}^{*}, Y^{*}, S^{*}, U^{*})

, then the primal points (i) satisfy (13) for some choice of

ν \geq 0

, (ii) minimize (14), (iii) and are primal-feasible, e.g.,

X^{*} = Y^{*}

and

{(X^{*} {(Y^{T})}^{*})}_{Ω} = Z^{*}

. Furthermore,

(X^{*}, Y^{*}, Z^{*}, S^{*}, U^{*})

are stationary points of (12).

Proof.

It is clear that the convergent points of Algorithm 1 exactly satisfy the three conditions. To show that these points are stationary, note that the augmented Lagrangian is convex with respect to

X, Z

jointly, and is a projection on a compact set

C

with respect to Y. Therefore:

\begin{matrix} \nabla_{X, Z, S, U} L_{ρ} (Z^{*}, X^{*}, Y^{*}; S^{*}, U^{*}) = 0, \\ - \nabla_{Y} {\bar{L}}_{ρ} (Z^{*}, X^{*}, Y^{*}; S^{*}, U^{*}) \in N_{C} (Y^{*}), \end{matrix}

(15)

where

{\bar{L}}_{ρ} (Z, X, Y; S, U) = - 〈 U, Y 〉 - 〈 S, X Y^{T} 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} + \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2},

with all the differentiable terms of

L_{ρ}

involving Y. □

4.2. Linearized ADMM

We propose to solve (11) via linearized ADMM, in which the objective is replaced at each iteration by its current linearization:

f (Z) \approx {\hat{f}}^{k} (Z) : = f (Z^{k - 1}) + 〈 \nabla f (Z^{k - 1}), Z - Z^{k - 1} 〉 .

We then build the linearized augmented Lagrangian function as:

\begin{matrix} {\hat{L}}^{k} (Z, X, Y; S, U) = g_{k} (X, Z) + h (Y) + 〈 U, X - Y 〉 \\ + 〈 S, Z - X Y^{T} 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} + \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2} \end{matrix}

(16)

where

g_{k} (X, Z) = {\hat{f}}^{k} (Z) + δ_{{0}} (A (Z) - b), h (Y) = δ_{C} (Y)

and

S \in R^{n \times n}

and

U \in R^{n \times r}

are the dual variables corresponding to the two coupling constraints. The full algorithm is given in Algorithm 1.

Algorithm 1 ADMM for solving (11)

1: Inputs: $ρ_{0} > 0$, $α > 1$, tol $ϵ > 0$
2: Initialize: $Z^{0}, X^{0}; S^{0}, U^{0}$ as random matrices
3: Outputs: Z, $X = Y$
4: for $k = 1, 2, \dots$ do
5: Update $Y^{k + 1}$ as the solution of:

$\begin{matrix} \min_{Y \in R^{n \times r}} {∥ Z^{k} - X^{k} Y^{T} + \frac{S^{k}}{ρ^{k}} ∥}_{F}^{2} + {∥ X^{k} - Y + \frac{U^{k}}{ρ^{k}} ∥}_{F}^{2}, \\ s . t . Y \in C \end{matrix}$

(17)
6: Update ${(Z, X)}^{k + 1}$ as the solutions of:

$\begin{matrix} \min_{X, Z \in S_{Ω}^{n}} L_{k + 1} (Z, X, Y^{k + 1}; S^{k}, U^{k}; ρ^{k}), \\ s . t . A (Z) = b \end{matrix}$

(18)

where $L$ is the linearized augmented Lagrangian as defined in (16).
7: Update $S, U$ and $ρ$ via:

$\begin{matrix} S^{k + 1} = & S^{k} + ρ^{k} {(Z^{k + 1} - X^{k + 1} {(Y^{k + 1})}^{T})}_{Ω} \\ U^{k + 1} = & U^{k} + ρ^{k} (X^{k + 1} - Y^{k + 1}) \\ ρ^{k + 1} = & α ρ^{k} \end{matrix}$

(19)
8: if $\max \{{∥ X^{k} - Y^{k} ∥}_{F}, {∥ {(Z^{k} - X^{k} {(Y^{k})}^{T})}_{Ω} ∥}_{F}\} \leq ϵ$ then
9: break
10: end if
11: end for

4.2.1. Minimizing over Y

The generalized projection (14) can be solved a number of ways. Note that if

r = 1

, then H is a positive scalar, and the problem reduces to

Y^{+} = {proj}_{Y \in C} (\frac{1}{H} \hat{Y})

. When

C = {- 1, 1}^{n}

, this process reduces to recovering the signs of

\hat{Y}

i.e.,

Y_{i} = {sign}_{C} ({\hat{Y}}_{i})

, and when

C = {u : ∥ u ∥_{2} = 1}

the set of unit-norm vectors, Y is just a properly scaled version of

\hat{Y}

:

Y = \frac{1}{∥ \hat{Y} ∥_{2}} \hat{Y} .
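The two closed-form projections for r = 1 can be sketched as follows. The function names are hypothetical, and the tie-breaking rule that maps a zero entry to +1 is an assumption, since the sign of zero is not specified above. Because H is a positive scalar when r = 1, dividing by H changes neither the signs nor the normalized direction, so it can be skipped for these two sets.

```python
import math

def project_binary(y_hat):
    """Nearest point in {-1, 1}^n: entrywise sign (zero mapped to +1 by assumption)."""
    return [1.0 if v >= 0 else -1.0 for v in y_hat]

def project_unit_sphere(y_hat):
    """Nearest point on the unit sphere: rescale to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in y_hat))
    if norm == 0.0:
        raise ValueError("projection of the zero vector onto the sphere is not unique")
    return [v / norm for v in y_hat]

print(project_binary([0.3, -2.0, 0.0]))
print(project_unit_sphere([3.0, 4.0]))
```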

However, in general, it is difficult to compute the generalized projection over a nonconvex set. When

C

is convex, the generalized projection problem (14) can be computed using projected gradient descent. Note that the objective of (14) is 1-strongly convex; thus, we expect fast convergence in this subproblem. In practice, we find that if r is not too large, often a few tens of iterations is enough.

4.2.2. Minimizing over X and Z

Using standard linear algebra techniques, the linear system (13) can be reduced to a few simple instructions. First, we solve for the Lagrange dual variable

ν

associated with the linear constraints (and localized to the minimization of X and Z):

\begin{matrix} A (A^{*} (ν) (Y Y^{T} + I)) = \\ ρ (b - A (D Y^{T} + Y Y^{T})) + A ((G + S) (I + Y Y^{T})), \end{matrix}

(20)

where

D = \frac{1}{ρ} (S Y - U) + Y

and

G = \nabla f (Z^{k - 1})

the local gradient estimate. When

A = diag

, (20) reduces to n scalar element-wise computations

ν_{i} = \frac{ρ (b - {(D Y^{T})}_{i i}) + {((G + S) (I + Y Y^{T}))}_{i i}}{{(Y Y^{T})}_{i i} + 1} .

When

A = Tr

,

ν = \frac{ρ (b - Tr (D Y^{T})) + Tr ((G + S) (I + Y Y^{T}))}{Tr (Y Y^{T}) + 1}

. Note that in both cases, no

n \times n

matrix need ever be formed, so the memory requirement remains

O (n r)

. (See Appendix A for elaboration). Then, the primal variables are recovered via

X = B Y + D, and Z = {(X Y^{T})}_{Ω} + B,

with

B = - \frac{1}{ρ} (G - A^{*} (ν) + S) .

In these cases, the complexity is dominated by multiplications between

n \times n

and

n \times r

matrices. Thus, the method is especially efficient when

r ≪ n

.

4.3. Convergence Analysis

Theorem 1.

Assume that

f (Z)

is

L_{f}

-smooth. Assume the dual variables are bounded, e.g.,

\max_{k} \{{∥ S^{k} ∥}_{F}, {∥ U^{k} ∥}_{F}, {∥ Y^{k} ∥}_{F}\} \leq B_{P} < + \infty,

and

\frac{L_{f}}{σ_{max}}

is bounded above, where

σ_{max} = 1 - \frac{\sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - σ_{Y}^{2}}{2}, σ_{Y} = {∥ Y^{k + 1} ∥}_{2} .

Then, by running Algorithm 1 with

ρ^{k} = α ρ^{k - 1} = α^{k} ρ_{0}

, if

L_{k}

is bounded below, then the sequence

{P^{k}, D^{k}}

converges to a stationary point of (12).

Proof.

See Appendix B. □

Corollary 1.

If

r \geq ⌈\sqrt{2 n}⌉

and Algorithm 1 converges to a second-order critical point of (1), then this point is globally optimal for the convex relaxation of (10) [13].

Unfortunately, the extension of KKT stationary points to global minima is not yet known when

\frac{r (r + 1)}{2} < n

(i.e.,

r = 1

). However, our empirical results suggest that even when

r = 1

, often a local solution to (10) well-approximates the global solution to (1).

5. ADMM on Simplified Nonconvex SDP

When the linear constraints are not present, (1) can be reformulated without Z, into:

\min_{X, Y} g (X) + δ_{C} (Y), s . t . X = Y,

(21)

with matrix variables

X \in R^{n \times r}, Y \in R^{n \times r}

, and where

g (X) = f (X X^{T})

is smooth. We can also define an augmented Lagrangian of (21) as

L_{ρ} (X, Y; U) = g (X) + δ_{C} (Y) + 〈 U, X - Y 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} .

Theorem 2.

The coordinatewise minimum points

X^{*} = Y^{*}

satisfying:

\begin{matrix} 0 & = & \nabla g (X^{*}) + U + ρ (X - Y) \\ Y & = & {proj}_{C} (X + \frac{1}{ρ} U) \\ X & = & Y, \end{matrix}

(22)

are the stationary points of the problem:

\min_{X} g (X), s . t . X \in C .

(23)

Proof.

The KKT stationary points of (23) can be characterized in terms of the normal cone of

C

at

X^{*}

; specifically,

X^{*}

is stationary if:

〈 \nabla g (X^{*}), X - X^{*} 〉 \geq 0, \forall X \in C \cap N_{ϵ} (X^{*}),

where

N_{ϵ} (X^{*})

is some small neighborhood containing

X^{*}

. (This is an equivalent definition of the Clarke stationary point [70], since in a close enough neighborhood to

X^{*}

, the subdifferential of

δ_{C} (x)

is

N_{C} (x)

).

Combining terms in (22) gives

X^{*} = Y^{*}

satisfying

X^{*} = {proj}_{C} (X^{*} - \frac{1}{ρ} \nabla g (X^{*})) .

The optimality condition of the projection is

〈 (X^{*} - \frac{1}{ρ} \nabla g (X^{*})) - X^{*}, X - X^{*} 〉 \leq 0, \forall X \in C \cap N_{ϵ} (X^{*})

which reduces to the desired condition. □

5.1. ADMM

The alternating steps in minimizing the augmented Lagrangian over the primal variables are extremely simple, compared with the previous matrix formulation. In general, we are considering

f (X)

linear (in which case the update of X involves only addition) or quadratic with strictly positive diagonal Hessian (which adds a small scaling step). The update of Y is a projection onto C, which remains simple for sets such as

C = {- 1, 1}^{n}, C = {{x : ∥ x ∥}_{2} = 1},

even when

r > 1

.

5.2. Convergence Analysis

Definition 4.

A differentiable convex function

g (X)

is

L_{g}

-smooth and

H_{g}

-strongly convex over

R^{n}

if for any X, Y,

g (X) - g (Y) \geq 〈 \nabla g (X), X - Y 〉 - \frac{L_{g}}{2} {∥ X - Y ∥}_{F}^{2}

and

g (X) - g (Y) \leq 〈 \nabla g (X), X - Y 〉 - \frac{H_{g}}{2} {∥ X - Y ∥}_{F}^{2} .

Theorem 3.

Assume

g (X)

is lower bounded over

C

, and is

L_{g}

-smooth. Given a sequence

{ρ^{k}}

such that:

\frac{ρ^{k} - 3 L_{g}}{2} - L_{g}^{2} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} > 0, ρ^{k} > L_{g}

for all k, then under Algorithm 2 the augmented Lagrangian

L (X^{k}, Y^{k}; U^{k})

is lower bounded and convergent, with

{X^{k}, Y^{k}, U^{k}} \to {X^{*}, Y^{*}, U^{*}}

a stationary and feasible solution of (23).

Proof.

See Appendix C. □

Remark 1.

Convergence is guaranteed under a constant penalty coefficient

ρ_{k} \equiv ρ^{0} \geq \frac{3 + \sqrt{17}}{2} L_{g}, α = 1 .

However, in implementation, we find empirically that increasing

{ρ^{k}}

from a relatively small

ρ^{0}

can encourage convergence to more useful global minima.
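The constant in Remark 1 can be recovered from the condition of Theorem 3: with a constant penalty, the left-hand side reduces to (ρ - 3 L_g)/2 - L_g^2/ρ, which is positive exactly when ρ^2 - 3 L_g ρ - 2 L_g^2 > 0. The short check below evaluates that left-hand side numerically; the function names and the sample value L_g = 2 are assumptions for illustration.

```python
import math

def theorem3_margin(rho_k, rho_next, L_g):
    """Left-hand side of the penalty condition in Theorem 3."""
    return (rho_k - 3.0 * L_g) / 2.0 - L_g ** 2 * (rho_next + rho_k) / (2.0 * rho_k ** 2)

def constant_rho_threshold(L_g):
    """Positive root of rho^2 - 3*L_g*rho - 2*L_g^2 = 0, as quoted in Remark 1."""
    return (3.0 + math.sqrt(17.0)) / 2.0 * L_g

L_g = 2.0
rho = constant_rho_threshold(L_g)
print(rho, theorem3_margin(rho, rho, L_g))                 # margin is zero up to rounding
print(theorem3_margin(1.01 * rho, 1.01 * rho, L_g) > 0)    # slightly larger rho satisfies it
```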

Algorithm 2 ADMM for solving (23)

1: Inputs: $ρ_{0} > 0$, $α > 1$, tol $ϵ > 0$
2: Initialize: $X^{0}, U^{0}$ as random matrices
3: Outputs: $X = Y$
4: for $k = 1, 2, \dots$ do
5: Update $Y^{k + 1}$ as the solution of:

$\min_{Y \in R^{n \times r}} {∥ X^{k} - Y + \frac{U^{k}}{ρ^{k}} ∥}_{F}^{2}, s . t . Y \in C .$

(24)
6: Update $X^{k + 1}$ as the solution of:

$0 = \nabla g (X) + U^{k} + ρ^{k} (X - Y^{k + 1}) .$

(25)
7: Update U and $ρ$ via:

$\begin{matrix} U^{k + 1} = & U^{k} + ρ^{k} (X^{k + 1} - Y^{k + 1}), \\ ρ^{k + 1} = & α ρ^{k} . \end{matrix}$

(26)
8: if ${∥ X^{k} - Y^{k} ∥}_{F} \leq ϵ$ then
9: break
10: end if
11: end for
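The structure of Algorithm 2 can be sketched for r = 1 on a deliberately simple instance. The sketch assumes a linear objective g(x) = c^T x (so L_g = 0 and the X-update (25) is a single addition) and the binary set C = {-1, 1}^n; it starts from zero rather than random iterates so that the run is reproducible. The function names, the toy vector c, and the default parameters are assumptions, not the experimental setup of this paper.

```python
def sign_pm1(v):
    """Projection onto {-1, 1}^n (zero mapped to +1 by assumption)."""
    return [1.0 if vi >= 0 else -1.0 for vi in v]

def admm_linear_binary(c, rho0=0.1, alpha=1.05, tol=1e-9, max_iter=500):
    """Algorithm 2 for min c^T x s.t. x in {-1, 1}^n, started from zero iterates."""
    n = len(c)
    x, u, rho = [0.0] * n, [0.0] * n, rho0
    y = list(x)
    for _ in range(max_iter):
        # (24): project X + U/rho onto C
        y = sign_pm1([xi + ui / rho for xi, ui in zip(x, u)])
        # (25): solve 0 = c + u + rho*(x - y) for x
        x = [yi - (ci + ui) / rho for yi, ci, ui in zip(y, c, u)]
        # (26): dual update, then increase the penalty
        u = [ui + rho * (xi - yi) for ui, xi, yi in zip(u, x, y)]
        rho *= alpha
        if max(abs(xi - yi) for xi, yi in zip(x, y)) <= tol:
            break
    return x, y, u

c = [2.0, -1.0, 0.5]
print(admm_linear_binary(c, rho0=0.1)[1])   # small initial penalty
print(admm_linear_binary(c, rho0=1.0)[1])   # larger initial penalty
```

In this toy instance the run with the small initial penalty ends at the sign pattern opposite to c, which minimizes c^T x over the binary set, while the run with the larger initial penalty stops at a feasible point with a worse objective. This is consistent with the earlier observation that every point of the binary set is stationary, and with the empirical remark about starting from a small penalty.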

Theorem 4.

If

g (X)

is

H_{g}

-strongly convex and

ρ^{k} = ρ

constant, with

\frac{ρ + H_{g}}{2} \geq \frac{L_{g}^{2}}{ρ}, ρ > L_{g}

then under Algorithm 2 the augmented Lagrangian

L (X^{k}, Y^{k}; U^{k})

converges to

L (X^{*}, Y^{*}, U^{*})

at a linear rate.

Proof.

See Appendix C.1. □

6. Numerical Experiments

In this section, we give numerical results on the proposed methods for community detection, MAX-CUT, image segmentation, and symmetric matrix factorization. In each application, we evaluate and compare these four methods. (i) SD: the solution to a semidefinite relaxation of (1) (SDR), where

C = R^{n \times r}

. The binary vector factor x where

x x^{T} = Z

is recovered using a Goemans–Williamson-style rounding technique [9]. This is our baseline method and is described in more detail below. (ii) MR1: Algorithm 1 with

r = 1

. (iii) MRR: Algorithm 1 with

r = ⌈\sqrt{2 n}⌉

, then rounded to a binary vector using a nonsymmetric version of the Goemans–Williamson style rounding [9] technique. Both MR1 and MRR have the following stopping criterion

max {P^{(k)}, D^{(k)}} \leq ϵ

for some tolerance parameter

ϵ > 0

, where:

P^{(k)} : = \max \{\frac{{∥ Z^{k} - Z^{k - 1} ∥}_{2}}{{∥ Z^{k} ∥}_{2}}, \frac{{∥ X^{k} - X^{k - 1} ∥}_{2}}{{∥ X^{k} ∥}_{2}}, \frac{{∥ Y^{k} - Y^{k - 1} ∥}_{2}}{{∥ Y^{k} ∥}_{2}}\},

D^{(k)} : = \max \{\frac{{∥ Z^{k} - X^{k} {(Y^{k})}^{T} ∥}_{2}}{{∥ Z^{k} ∥}_{2}}, \frac{{∥ X^{k} - Y^{k} ∥}_{2}}{{∥ X^{k} ∥}_{2}}\} .

(Here,

D^{(k)}

is also proportional to the difference in dual iterates, and thus

P^{(k)}

and

D^{(k)}

can be interpreted as primal and dual residuals, respectively). (iv) V: Algorithm 2, with stopping criterion

max {P^{(k)}, D^{(k)}} \leq ϵ

where

P^{(k)} : = \max \{\frac{{∥ x^{k} - x^{k - 1} ∥}_{2}}{{∥ x^{k} ∥}_{2}}, \frac{{∥ y^{k} - y^{k - 1} ∥}_{2}}{{∥ y^{k} ∥}_{2}}\}, D^{(k)} : = \frac{{∥ x^{k} - y^{k} ∥}_{2}}{{∥ x^{k} ∥}_{2}} .

The same primal and dual residual interpretation can be used here as well. In all cases, we use the following scheme for

ρ

:

ρ^{k} = min {ρ_{max}, ρ^{k - 1} * γ},

where

ρ_{max} \approx 10000

and

γ \approx 1.05

(slightly larger than 1).
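The stopping test and penalty schedule for the vector form (method V) can be sketched as follows. This is a minimal illustration, not the author's code: the function names are hypothetical, the relative changes in P are aggregated with a maximum (an assumption about how the set is reduced to a scalar), and the iterates are assumed to be nonzero so that the ratios are defined.

```python
import numpy as np

def residuals_vector(x, x_prev, y, y_prev):
    """Relative residuals P^(k) and D^(k) for the vector form (hypothetical helper)."""
    def rel_change(a, b):
        return np.linalg.norm(a - b) / np.linalg.norm(a)
    P = max(rel_change(x, x_prev), rel_change(y, y_prev))  # assumed: max over the set
    D = np.linalg.norm(x - y) / np.linalg.norm(x)
    return P, D

def next_rho(rho, gamma=1.05, rho_max=1e4):
    """Geometric penalty growth, capped at rho_max."""
    return min(rho_max, rho * gamma)

def should_stop(P, D, eps=1e-3):
    """Stop once both residuals fall below the tolerance."""
    return max(P, D) <= eps
```

With the default values, the penalty grows by 5% per iteration until it reaches the cap.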

6.1. Solving the Baseline (SDR)

As a baseline, we compare against the solution of the semidefinite relaxed problem without factor variables X (e.g.,

C = R^{n, n}

):

\min_{Z} f (Z), s . t . A (Z) = b, Z ⪰ 0 .

(27)

For a fair comparison, we use a first-order splitting method very similar to ADMM, which is the Douglas–Rachford Splitting (DRS) method ([71,72], see also [73,74]). We introduce dummy variables and solve the reformulation of (27):

\min_{Z_{1}, Z_{2}, Z_{3}} g_{1} (Z_{1}) + g_{2} (Z_{2}) + g_{3} (Z_{3}), s . t . Z_{1} = Z_{2} = Z_{3},

where

g_{1} (Z_{1}) = Tr (C Z_{1}), g_{2} (Z_{2}) = \{\begin{matrix} 0, & A (Z_{2}) = b \\ + \infty, & else, \end{matrix} g_{3} (Z_{3}) = \{\begin{matrix} 0, & Z_{3} ⪰ 0 \\ + \infty, & else . \end{matrix}

An application of the DRS to this reformulation (see also Algorithm 3.1 in [75]) then gives the following iteration scheme: for

i = 1, 2, 3

,

\begin{matrix} X_{i}^{(k + 1)} & = & {prox}_{t g_{i}} (Z_{i}), {\hat{Y}}_{i} = 2 X_{i}^{(k + 1)} - Z_{i}^{(k)}, \\ Y^{(k + 1)} & = & \frac{1}{3} (X_{1}^{(k + 1)} + X_{2}^{(k + 1)} + X_{3}^{(k + 1)}), \\ Z_{i}^{(k + 1)} & = & Z_{i}^{(k)} + ρ (Y^{(k + 1)} - X_{i}^{(k + 1)}) \end{matrix}

and for a convex function f

{prox}_{t f} (u) : = \underset{z}{argmin} f (z) + \frac{1}{2 t} {∥ z - u ∥}_{2}^{2} .
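For the three terms of the reformulation, the proximal maps have simple closed forms. The sketch below is illustrative only: it assumes the matrix norm in the prox definition is the Frobenius norm, and it assumes the specific linear constraint diag(Z) = b (the source covers general A). The function names are hypothetical.

```python
import numpy as np

def prox_linear(Z, C, t):
    """prox of t * Tr(C Z): the gradient of Tr(C X) in X is C^T, so the minimizer is Z - t C^T."""
    return Z - t * C.T

def proj_diag(Z, b):
    """prox of the indicator of {Z : diag(Z) = b} (assumed constraint): overwrite the diagonal."""
    X = Z.copy()
    np.fill_diagonal(X, b)
    return X

def proj_psd(Z):
    """prox of the indicator of the PSD cone: clip negative eigenvalues of the symmetric part."""
    w, Q = np.linalg.eigh((Z + Z.T) / 2)
    return (Q * np.maximum(w, 0.0)) @ Q.T
```

The PSD projection needs a full eigendecomposition of an n × n matrix at every iteration, which is the main per-iteration cost of this baseline.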

6.2. Rounding

Following the technique in [9], we can estimate x from a rank r matrix

X \approx x x^{T}

by randomly projecting the main eigenspaces on the unit sphere. The exact procedure is as follows. (i) For the symmetric SDP solution X, we first perform an eigenvalue decomposition

X = Q Λ Q^{T}

and form a factor

F = Q Λ^{1 / 2}

where the diagonal elements of

Λ

are in decreasing magnitude order. Then, we scan

k = 1, \dots, n

and find

x_{k, t} = sign (F_{k} z_{t})

for trials

t = 1, \dots, 10

. Here,

F_{k}

contains the first k columns of F, and each element of

z_{t} \in R^{k}

is drawn i.i.d. from a standard normal distribution. We report the values for

x_{r} = \underset{x_{k, t}}{argmin} x_{k, t}^{T} C x_{k, t}

. (ii) For the MRR method, we repeat the procedure using a factor

F = U Σ^{1 / 2}

where

X = U Σ V^{T}

is the SVD of X. (iii) For MR1 and V, we simply take

x_{r} = sign (x)

as the binary solution.
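Procedure (i) can be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the function name, the seed handling, the clipping of small negative eigenvalues to zero before taking square roots, and the convention sign(0) = +1 are all choices made here.

```python
import numpy as np

def gw_style_round(X, C, trials=10, seed=0):
    """Randomized rounding of a symmetric solution X to a vector in {-1, +1}^n."""
    rng = np.random.default_rng(seed)
    w, Q = np.linalg.eigh((X + X.T) / 2)
    order = np.argsort(-np.abs(w))            # decreasing magnitude
    w, Q = np.maximum(w[order], 0.0), Q[:, order]
    F = Q * np.sqrt(w)                        # F = Q Lambda^{1/2}
    n = X.shape[0]
    best_x, best_val = None, np.inf
    for k in range(1, n + 1):                 # scan the leading k columns of F
        for _ in range(trials):
            z = rng.standard_normal(k)
            x = np.sign(F[:, :k] @ z)
            x[x == 0] = 1.0                   # assumed tie-breaking rule
            val = x @ C @ x
            if val < best_val:
                best_x, best_val = x, val
    return best_x, best_val
```

For an exactly rank-one input X = x xᵀ with binary x, the k = 1 scan already recovers x up to a global sign.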

6.3. Computer Information

The following simulations are performed on a standard desktop computer with an Intel Xeon processor (3.6 GHz) and 32 GB of RAM, running MATLAB R2017a.

6.4. MAX-CUT

Table 1 gives the best MAX-CUT values using best-of-random-guesses and our approaches over four examples from the seventh DIMACS Implementation Challenge in 2002 (see http://dimacs.rutgers.edu/Workshops/7thchallenge/, problems downloaded from http://www.optsicom.es/maxcut/). The recovered solutions are often close in quality to the best-known solutions and often achieve suboptimality similar to that of the rounded SDR solutions. However, the runtime comparison (Figure 1) suggests that the ADMM methods (especially MR1 and SDR) are much more computationally efficient and scalable. All experiments are performed with

ϵ = 1 \times 10^{- 3}

.

6.5. Image Segmentation

Both community detection and MAX-CUT can be used in image segmentation, where each pixel is a node and the similarity between pixels forms the weight of the edges. Generally, solving (1) for this application is not preferred, since the number of pixels in even a moderately sized image is extremely large. However, because of our fast methods, we successfully performed image segmentation on several thumbnail-sized images, as seen in Figure 2.

The C matrix is composed as follows. For each pixel, we compose two feature vectors:

f_{c}^{i j}

containing the RGB values and

f_{p}^{i j}

containing the pixel location. Scaling

f_{p}^{i j}

by some weight c, we form the concatenated feature vector

f^{i j} = [f_{c}^{i j}, c f_{p}^{i j}]

, and form the weighted adjacency matrix as the squared distance matrix between each feature vector

A_{(i j), (k l)} = {∥ f^{i j} - f^{k l} ∥}_{2}^{2}

. For MAX-CUT, we again form

C = A - Diag (A 1)

as before. For community detection, since we do not have exact p and q values, we use an approximation as

C = a 1 1^{T} - A

where

a = \frac{1}{n^{2}} 1^{T} A 1

is the mean value of A. Sweeping the weight c and

ρ_{0}

, we give the best qualitative result in Figure 2.
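The construction of A and the two C matrices can be sketched as below. This is a minimal illustration: the function name and the (h, w, 3) image layout are assumptions, the weight c multiplies the pixel-location features as in the concatenated feature vector above, and the dense n × n matrices are only practical for thumbnail-sized images.

```python
import numpy as np

def segmentation_matrices(img, c=1.0):
    """Build A, C for MAX-CUT, and C for community detection from an (h, w, 3) image."""
    h, w, _ = img.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    f_color = img.reshape(h * w, 3).astype(float)            # RGB features
    f_pos = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    F = np.hstack([f_color, c * f_pos])                      # concatenated features
    sq = np.sum(F**2, axis=1)
    # squared Euclidean distances; clip tiny negatives caused by round-off
    A = np.maximum(sq[:, None] + sq[None, :] - 2.0 * F @ F.T, 0.0)
    n = A.shape[0]
    ones = np.ones(n)
    C_maxcut = A - np.diag(A @ ones)                         # C = A - Diag(A 1)
    a = ones @ A @ ones / n**2                               # mean value of A
    C_community = a * np.ones((n, n)) - A                    # C = a 1 1^T - A
    return A, C_maxcut, C_community
```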

6.6. Symmetric Factorization with Partial Observations

Recall the factorization with partial observations formulation as follows:

\min_{Z \in S^{n}, X \in R^{n \times r}} \sum_{i, j \in Ω} {(Z_{i j} - C_{i j})}^{2}, s . t . Z = X X^{T}, X \geq 0 .

(28)

Note that here we generalize the aforementioned formulation with

r = 5

. In this setting, the strongly convex

Y

-update in the proposed algorithm can no longer be solved in closed form, so projected gradient descent is applied to it. The relative error, defined as

∥ {(Z^{*} - C)}_{Ω} ∥ / ∥ C_{Ω} ∥

, and the CPU time for varying problem size and sparsity are reported in Table 2.
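The relative error on the observed entries can be computed as in the short sketch below. The function name and the boolean-mask representation of Ω are assumptions made for illustration.

```python
import numpy as np

def relative_error(Z, C, mask):
    """|| (Z - C)_Omega || / || C_Omega ||, with Omega given as a boolean mask."""
    return np.linalg.norm((Z - C)[mask]) / np.linalg.norm(C[mask])
```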

7. Conclusions

We present two methods for solving quadratic combinatorial problems using ADMM on two reformulations. Though the problem has a nonconvex constraint, we give convergence results to KKT solutions under mild conditions. From this, we give empirical solutions to several graph-based combinatorial problems, specifically MAX-CUT and community detection; both can be used in additional downstream applications, like image segmentation.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Derivation of X, Z Update

In a linearized case, consider

G = \nabla f (Z^{k - 1}) = G_{Ω}

. Then, the optimality conditions are:

\begin{matrix} G - A^{*} (ν) + S + ρ (Z - {(X Y^{T})}_{Ω}) & = & 0 \\ U - S Y + ρ ({(X Y^{T})}_{Ω} Y - Z Y) + ρ (X - Y) & = & 0 \\ A (Z) & = & b . \end{matrix}

Using

D = ρ^{- 1} (S Y - U) + Y, B = - ρ^{- 1} (G - A^{*} (ν) + S),

we obtain:

\begin{matrix} - B + Z - {(X Y^{T})}_{Ω} & = & 0 \\ - D + {(X Y^{T})}_{Ω} Y - Z Y + X & = & 0 \\ A (Z) & = & b . \end{matrix}

Substitute for Z:

Z = {(X Y^{T})}_{Ω} + B \Rightarrow D + ({(X Y^{T})}_{Ω} + B) Y = {(X Y^{T})}_{Ω} Y + X \Rightarrow D + B Y = X .

Since we assume the diagonal is in

Ω

,

A (X_{Ω}) = A (X)

, so to solve for

ν

:

\begin{matrix} A ({(X Y^{T})}_{Ω} + B) = A (X Y^{T} + B) \\ = A ((D + B Y) Y^{T} + B) = b, \end{matrix}

and therefore

A (B (Y Y^{T} + I)) = b - A (D Y^{T}) .

Insert B and simplify:

\begin{matrix} b - A (D Y^{T}) \\ = & A ((- ρ^{- 1} (G - A^{*} (ν) + S)) (Y Y^{T} + I)) \\ = & - ρ^{- 1} A ((G - A^{*} (ν) + S) (Y Y^{T} + I)), \end{matrix}

and thus:

\begin{matrix} b - A (D Y^{T}) + ρ^{- 1} A ((G + S) (Y Y^{T} + I)) \\ = ρ^{- 1} A (A^{*} (ν) (Y Y^{T} + I)) = ρ^{- 1} H ν, \end{matrix}

(A1)

where H is an

m \times m

matrix with:

H_{i j} = 〈 A_{i}, A_{j} (Y Y^{T} + I) 〉 .

Thus this system reduces to

ν = H^{- 1} (b - A (D Y^{T}) + ρ^{- 1} A ((G + S) (Y Y^{T} + I))) .

Implicit Inverse of H

When

A = diag

, (20) reduces to n scalar element-wise computations

ν_{i} = \frac{ρ (b - {(D Y^{T})}_{i i}) + {((G + S) (I + Y Y^{T}))}_{i i}}{{(Y Y^{T})}_{i i} + 1} .

When

A = Tr

,

ν = \frac{ρ (b - Tr (D Y^{T})) + Tr ((G + S) (I + Y Y^{T}))}{Tr (Y Y^{T}) + 1} .

Note that in both cases, the computation for

ν

can be done without ever forming an

n \times n

matrix. For example, for

A = diag

,

{(D Y^{T})}_{i i} = ρ^{- 1} {(S Y Y^{T})}_{i i} - ρ^{- 1} {(U Y^{T})}_{i i} + {(Y Y^{T})}_{i i} .

Recall that for any two matrices A,

B \in R^{n \times r}

,

{(A B^{T})}_{i i} = A_{i}^{T} B_{i}

where

A_{i}

,

B_{i}

are the ith rows of A and B; thus, an efficient way of computing

ν

is as follows. (i) Compute the skinny matrices

F_{1} = S Y

,

F_{2} = G Y

. (ii) Compute the element-wise products

G_{1} = F_{1} \circ Y

,

G_{2} = U \circ Y

,

G_{3} = F_{2} \circ Y

, and

G_{4} = Y \circ Y

, where

{(A \circ B)}_{i j} = A_{i j} B_{i j}

(element-wise multiplication). (iii) Compute the row sums

g_{i} = G_{i} 1

,

i = 1, \dots, 4

. (iv) Compute the “numerator vector”

h_{1} = ρ (b - (ρ^{- 1} (g_{1} - g_{2}) + g_{4})) + diag (G) + diag (S) + g_{3} + g_{1}

and “denominator vector”

h_{2} = g_{4} + 1

. (v) Then,

ν_{i} = \frac{{(h_{1})}_{i}}{{(h_{2})}_{i}}

.

A similar procedure can be executed for

A = Tr

to keep memory requirements low.
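Steps (i) to (v) for the diagonal case can be sketched as follows, together with a dense reference that forms the n × n products explicitly. Both function names are hypothetical, and the dense version is included only to check the skinny computation on small inputs.

```python
import numpy as np

def nu_diag(G, S, U, Y, b, rho):
    """Element-wise nu for A = diag, using only n x r intermediate products."""
    F1, F2 = S @ Y, G @ Y                      # (i) skinny matrices
    g1 = np.sum(F1 * Y, axis=1)                # (ii)-(iii) diag(S Y Y^T)
    g2 = np.sum(U * Y, axis=1)                 # diag(U Y^T)
    g3 = np.sum(F2 * Y, axis=1)                # diag(G Y Y^T)
    g4 = np.sum(Y * Y, axis=1)                 # diag(Y Y^T)
    # (iv) numerator and denominator vectors
    h1 = rho * (b - ((g1 - g2) / rho + g4)) + np.diag(G) + np.diag(S) + g3 + g1
    h2 = g4 + 1.0
    return h1 / h2                             # (v)

def nu_diag_dense(G, S, U, Y, b, rho):
    """Reference computation that forms D Y^T and Y Y^T explicitly."""
    n = Y.shape[0]
    D = (S @ Y - U) / rho + Y
    YYt = Y @ Y.T
    num = rho * (b - np.diag(D @ Y.T)) + np.diag((G + S) @ (np.eye(n) + YYt))
    return num / (np.diag(YYt) + 1.0)
```

In practice G and S would be stored as sparse matrices, so the products in step (i) cost time proportional to the number of nonzeros times r.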

Appendix B. Convergence Analysis for Matrix Form

To simplify notation, we first collect the primal and dual variables

P^{k} = {(Z, X, Y)}^{k}

and

D^{k} = {(Λ_{1}, Λ_{2})}^{k}

. We define the augmented Lagrangian at iteration k as:

\begin{matrix} L^{k} : & = & L (P^{k}; D^{k}; ρ^{k}) = f (Z^{k}) + δ_{C} (Y) \\ + & 〈 U, X - Y 〉 + 〈 S, Z - X Y^{T} 〉 \\ + & \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} + \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2}, \end{matrix}

(A2)

and its linearization at iteration k as:

\begin{matrix} {\bar{L}}^{k} & : = & \bar{L} (P^{k}; D^{k}; ρ^{k}; {\bar{f}}^{k}) = {\bar{f}}^{k} + δ_{C} (Y) \\ + & 〈 U, X - Y 〉 + 〈 S, Z - X Y^{T} 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2}, \\ + & \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2} \end{matrix}

(A3)

Here,

{\bar{f}}^{k} : = f (Z^{k - 1}) + 〈 G^{k - 1}, Z - Z^{k - 1} 〉

such that

{\bar{f}}^{k}

is the linearization of f at

Z^{k - 1}

.

Lemma A1.

\nabla^{2} L_{Y} = \nabla^{2} {\bar{L}}_{Y} ⪰ ρ^{k} I

.

Proof.

Given the definition of

L

, we can see that the Hessian

\nabla^{2} L_{Y} = ρ^{k} (M + I) ⪰ ρ^{k} I

where

M = blkdiag (X^{T} X, X^{T} X, . . .) ⪰ 0 .

□

Lemma A2.

\nabla^{2} {\bar{L}}_{(X, Z)} ⪰ ρ^{k} (1 - \frac{\sqrt{λ_{N}^{2} + 4 λ_{N}} - λ_{N}}{2}) I

.

Proof.

For

(X, Z)

, we have

\nabla_{(X, Z)}^{2} L_{k} = ρ^{k} [\begin{matrix} I + N N^{T} & - N \\ - N^{T} & I \end{matrix}]

where

N = blkdiag (Y^{T}, \dots, Y^{T}) \in R^{n r \times n^{2}}

. Note that for block diagonal matrices,

{∥ N ∥}_{2} = {∥ Y ∥}_{2}

. Note also that the determinant of

\frac{1}{ρ^{k}} \nabla_{(X, Z)}^{2} L_{k}

is

\det ((I + N N^{T}) - N N^{T}) = 1 \geq 0

, so

\nabla_{(X, Z)}^{2} {\tilde{L}}_{k} ≻ 0

and equivalently

λ_{min} (\nabla_{(X, Z)}^{2} L_{k}) > 0

.

To find the smallest eigenvalue

λ_{min} (\nabla_{(X, Z)}^{2} L_{k})

, it suffices to find the largest

σ > 0

such that:

\begin{matrix} H_{2} & = & {(ρ^{k})}^{- 1} \nabla_{(X, Z)}^{2} {\tilde{L}}_{k} - σ I \\ = & [\begin{matrix} (1 - σ) I + N N^{T} & - N \\ - N^{T} & (1 - σ) I \end{matrix}] ⪰ 0 . \end{matrix}

(A4)

Equivalently, we want to find the largest

σ > 0

where

(1 - σ) I ⪰ 0

and the Schur complement of

H_{2}

, i.e.,

H_{3} = (1 - σ) I + N N^{T} (1 - {(1 - σ)}^{- 1}) ⪰ 0 .

Defining

σ_{Y} = {∥ Y ∥}_{2}

the largest singular value of

Y^{k + 1}

, and noting that

λ_{min} (α I + A) = α + λ_{min} (A)

for any positive semidefinite matrix A, we have

λ_{min} (H_{3}) = (1 - σ) + {(σ_{Y})}^{2} (1 - {(1 - σ)}^{- 1}) .

We can see that

(1 - σ) λ_{min} (H_{3})

is a convex function in

(1 - σ)

, with two zeros at

1 - σ = \frac{\pm \sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - {(σ_{Y})}^{2}}{2} .

In between the two roots,

λ_{min} (H_{3}) < 0

. Since the smaller root cannot satisfy

1 - σ > 0

, we choose

σ_{max} = 1 - \frac{\sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - {(σ_{Y})}^{2}}{2} > 0

as the largest feasible

σ

that maintains

λ_{min} (H_{3}) \geq 0

. As a result,

λ_{min} (\nabla_{(X, Z)}^{2} \bar{L}) = ρ^{k} σ_{max} = ρ^{k} (1 - \frac{\sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - σ_{Y}^{2}}{2})

. Figure A1 shows how this term behaves according to the spectral norm of Y. □
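The closed form in Lemma A2 can be checked numerically on a small instance. The sketch below builds the scaled Hessian block matrix explicitly and compares its smallest eigenvalue with the formula; the function names are hypothetical, and the explicit matrix has size nr + n², so this is only meant for tiny n.

```python
import numpy as np

def lemma_a2_value(Y):
    """1 - (sqrt(s^4 + 4 s^2) - s^2) / 2, with s the spectral norm of Y."""
    s = np.linalg.norm(Y, 2)
    return 1.0 - (np.sqrt(s**4 + 4.0 * s**2) - s**2) / 2.0

def scaled_hessian_min_eig(Y):
    """Smallest eigenvalue of [[I + N N^T, -N], [-N^T, I]] with N = blkdiag(Y^T, ..., Y^T)."""
    n, r = Y.shape
    N = np.kron(np.eye(n), Y.T)                # n diagonal blocks, shape (n r, n^2)
    top = np.hstack([np.eye(n * r) + N @ N.T, -N])
    bot = np.hstack([-N.T, np.eye(n * n)])
    return np.linalg.eigvalsh(np.vstack([top, bot]))[0]
```

The value stays in (0, 1] and decreases toward zero as the spectral norm of Y grows, which is the behavior plotted in Figure A1.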

Figure A1. Strong convexity with respect to X, Z. Smallest eigenvalue of

\nabla_{X, Z}^{2} L

as a function of the spectral norm of Y.

We now prove the main theorem.

Lemma A3.

Consider the sequence:

\begin{matrix} L^{k} & : = & L (P^{k}; D^{k}) \\ = & f (Z^{k - 1}) + 〈 \nabla f (Z^{k - 1}), Z - Z^{k - 1} 〉 + δ_{C} (Y) \\ + & 〈 U, X - Y 〉 + 〈 S, Z - X Y^{T} 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} \\ + & \frac{ρ}{2} {∥ Z - X Y^{T} ∥}_{F}^{2} . \end{matrix}

If

f (Z)

is

L_{f}

-Lipschitz smooth, then sequence

L^{k}

generated from Algorithm 1 satisfies:

\begin{matrix} L^{k + 1} - L^{k} \leq \\ - c_{1}^{k} ∥ X^{k + 1} - X^{k} ∥_{F}^{2} - c_{2}^{k} ∥ Z^{k + 1} - Z^{k} ∥_{F}^{2} - c_{3}^{k} {∥ Y^{k + 1} - Y^{k} ∥}_{F}^{2} \\ + \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} (∥ S^{k + 1} - S^{k} ∥_{F}^{2} + {∥ U^{k + 1} - U^{k} ∥}_{F}^{2}) . \end{matrix}

(A5)

with

c_{1}^{k} = \frac{ρ^{k}}{2} (1 - \frac{\sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - σ_{Y}^{2}}{2})

,

c_{2}^{k} = c_{1}^{k} - \frac{L_{f}}{2}

, and

c_{3}^{k} = \frac{ρ^{k}}{2} > 0

.

Proof.

The proof outline of Lemma A3 is to show that each update step is a non-ascent step in the linearized augmented Lagrangian, and at least one update step is descent. We can describe the linearized ADMM in terms of four groups of updates: the primal variable Y, the primal variables X and Z, the dual variables U, S, and coefficient

ρ

. In other words, at iteration k, taking:

$Δ X^{k} = X^{k + 1} - X^{k}$ , $Δ Y^{k} = Y^{k + 1} - Y^{k}$ , and $Δ Z^{k} = Z^{k + 1} - Z^{k}$ .
$L^{k} = L (Z^{k}, X^{k}, Y^{k}; D^{k}; ρ^{k}; G^{k})$ ,
$L^{Y} = L (Z^{k}, X^{k}, Y^{k + 1}; D^{k}; ρ^{k}; G^{k})$ ,
$L^{X Z} = L (P^{k + 1}; D^{k}; ρ^{k}; G^{k})$ , and
$L^{k + 1} = L (P^{k + 1}; D^{k + 1}; ρ^{k + 1}; G^{k})$

and

L^{k + 1} - L^{k} = (L^{Y} - L^{k}) + (L^{X Z} - L^{Y}) + (L^{k + 1} - L^{X Z}) .

We now upper bound each term.

Update Y. For the update of Y in (17), taking $L^{Y} = L (Z^{k}, X^{k}, Y^{k + 1}; D^{k}; ρ^{k}; G^{k})$ , we have:

$\begin{matrix} L^{Y} - L^{k} & \overset{(a)}{\leq} 〈 \nabla_{Y} L^{Y}, Y^{k + 1} - Y^{k} 〉 \\ - \frac{λ_{min} (\nabla_{vec Y}^{2} L^{Y})}{2} {∥ Y^{k + 1} - Y^{k} ∥}_{F}^{2} \\ \overset{(b)}{\leq} - \frac{ρ^{k}}{2} {∥ Y^{k + 1} - Y^{k} ∥}_{F}^{2}, \end{matrix}$

(A6)

where (a) follows from the definition of strong convexity, and (b) the optimality of $Y^{k + 1}$ .
Update X, Z. Similarly, the update of $(Z, X)$ in (17), denoting $L^{X Z} = L (P^{k + 1}; D^{k}; ρ^{k}; G^{k})$ , we have:

$\begin{matrix} {\bar{L}}^{X Z} - L^{Y} \\ \overset{(a)}{\leq} 〈 \nabla_{Z} {\bar{L}}^{X Z}, Z^{k + 1} - Z^{k} 〉 + 〈 \nabla_{X} {\bar{L}}^{X Z}, X^{k + 1} - X^{k} 〉 \\ - \frac{λ_{min} (\nabla_{(X, Z)}^{2} L^{X Z})}{2} (∥ Δ Z^{k} ∥_{F}^{2} + {∥ Δ X^{k} ∥}_{F}^{2}) \\ \overset{(b)}{\leq} - \frac{λ_{min} (\nabla_{(X, Z)}^{2} {\bar{L}}^{X Z})}{2} (∥ Δ Z^{k} ∥_{F}^{2} + {∥ Δ X^{k} ∥}_{F}^{2}), \end{matrix}$

(A7)

where (a) follows from the definition of strong convexity, and (b) the optimality of $X^{k + 1}$ and $Z^{k + 1}$ . To further bound $L^{X Z} - {\bar{L}}^{X Z}$ , we use the linearization definitions:

$\begin{matrix} L^{X Z} - {\bar{L}}^{X Z} \\ = & f (Z^{k + 1}) - f (Z^{k}) - 〈 \nabla f (Z^{k}), Δ Z^{k} 〉 \\ \overset{(a)}{\leq} & \frac{L_{f}}{2} {∥ Z^{k + 1} - Z^{k} ∥}_{F}^{2}, \end{matrix}$

(A8)

where (a) comes from the $L_{f}$ Lipschitz smooth property of f. For a function f with Lipschitz constant $L_{f}$ , the following holds $f (y) \leq f (x) + 〈 y - x, \nabla f (x) 〉 + \frac{L_{f}}{2} {∥ y - x ∥}_{2}^{2}$ .
Update S, U, and $ρ$ . For the update of the dual variables and the penalty coefficient, with $L^{k} = L (P^{k}; D^{k}; ρ^{k})$ , we have:

$\begin{matrix} L^{k + 1} - L^{X Z} \\ \overset{(a)}{=} 〈 S^{k + 1} - S^{k}, Z^{k + 1} - X^{k + 1} {(Y^{k + 1})}^{T} 〉 \\ + 〈 U^{k + 1} - U^{k}, X^{k + 1} - Y^{k + 1} 〉 \\ + \frac{ρ^{k + 1} - ρ^{k}}{2} (∥ Z^{k + 1} - X^{k + 1} {(Y^{k + 1})}^{T} ∥_{F}^{2}) \\ + \frac{ρ^{k + 1} - ρ^{k}}{2} (∥ X^{k + 1} - Y^{k + 1} ∥_{F}^{2}) \\ \overset{(b)}{=} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} (∥ S^{k + 1} - S^{k} ∥_{F}^{2} + {∥ U^{k + 1} - U^{k} ∥}_{F}^{2}), \end{matrix}$

(A9)

where (a) follows the definition of $L$ and (b) from the dual update procedure.

The lemma statement results by incorporating (A6)–(A9). □

Lemma A4.

If

L_{k}

is unbounded below, then either problem (1) is unbounded below, or the sequence

L_{f} {∥ Z_{k} - Z_{k - 1} ∥}_{F}

diverges.

Proof.

Suppose that

L^{k}

is unbounded below. First, rewrite

L^{k}

equivalently as:

\begin{matrix} L^{k} = f (Z^{k - 1}) + 〈 \nabla f (Z^{k - 1}), Z^{k} - Z^{k - 1} 〉 + δ_{C} (Y^{k}) \\ + \frac{ρ}{2} ∥ X^{k} - Y^{k} + \frac{1}{ρ^{k}} U^{k} ∥_{F}^{2} + \frac{ρ}{2} ∥ Z^{k} - X^{k} {(Y^{k})}^{T} \\ + \frac{1}{ρ^{k}} S^{k} ∥_{F}^{2} - \frac{1}{2 ρ^{k}} ∥ U^{k} ∥_{F}^{2} - \frac{1}{2 ρ^{k}} {∥ S^{k} ∥}_{F}^{2} . \end{matrix}

Since

∥ U^{k} ∥_{F}

and

∥ S^{k} ∥_{F}

are bounded above, this implies that the linearization

g^{k} : = f (Z^{k - 1}) + 〈 \nabla f (Z^{k - 1}), Z^{k} - Z^{k - 1} 〉

is unbounded below.

Note that:

\begin{matrix} g^{k} - f (Z^{k}) \\ = f (Z^{k - 1}) - f (Z^{k}) - 〈 \nabla f (Z^{k - 1}), Z^{k - 1} - Z^{k} 〉 \\ \geq - \frac{L_{f}}{2} {∥ Z^{k} - Z^{k - 1} ∥}_{F}^{2}, \end{matrix}

which implies either

f (Z^{k}) \to - \infty

or

L_{f} {∥ Z^{k} - Z^{k - 1} ∥}_{F}^{2} \to + \infty

. □

Corollary A1.

If

L_{k}

is unbounded below and the objective

f (Z) = Tr (C Z)

, then it must be that (1) is unbounded below. This follows immediately since

L_{f} = 0

.

Theorem A1.

Assume the dual variables are bounded, e.g.,

\max_{k} \{∥ S^{k} ∥_{F}, ∥ U^{k} ∥_{F}, ∥ Y^{k} ∥_{F}\} \leq B_{P} < + \infty,

and

\frac{L_{f}}{σ_{max}}

is bounded above, where

σ_{max} = 1 - \frac{\sqrt{σ_{Y}^{4} + 4 σ_{Y}^{2}} - σ_{Y}^{2}}{2}, σ_{Y} = {∥ Y^{k + 1} ∥}_{2} .

Then, by running Algorithm 1 with

ρ^{k} = α ρ^{k - 1} = α^{k} ρ_{0}

, if

L_{k}

is bounded below, then the sequence

{P^{k}, D^{k}}

converges to a stationary point of (12).

Proof.

If

f (Z)

is linear, take

K_{0} = 0

. If

f (Z)

is

L_{f}

-smooth with

L_{f} > 0

, take

K_{0}

large enough such that for all

k > K_{0}

,

α^{k} ρ \geq \frac{L_{f}}{σ_{max}}

. By assumption,

K_{0}

is always finite.

Taking

Δ_{X Y Z}^{k} = (∥ Δ Z^{k} ∥_{F}^{2} + ∥ Δ X^{k} ∥_{F}^{2} + {∥ Δ Y^{k} ∥}_{F}^{2})

and

c^{k} = min {c_{1}, c_{2}, c_{3}}

, the summation of (A5) leads to:

\begin{matrix} L^{K} - L^{K_{0}} = \sum_{k = K_{0}}^{K - 1} L^{k + 1} - L^{k} \\ \leq \sum_{k = K_{0}}^{K - 1} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} (∥ S^{k + 1} - S^{k} ∥_{F}^{2} + {∥ U^{k + 1} - U^{k} ∥}_{F}^{2}) \\ - \sum_{k = K_{0}}^{K - 1} c^{k} Δ_{X Y Z}^{k} \\ \overset{(a)}{\leq} 4 B_{P} \sum_{k = K_{0}}^{K - 1} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} - \sum_{k = K_{0}}^{K - 1} c^{k} Δ_{X Y Z}^{k} \\ \overset{(b)}{\leq} 4 B_{P} \sum_{k = K_{0}}^{K - 1} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}}, \end{matrix}

(A10)

where (a) follows from the boundedness assumption of the dual variables, and (b) follows from Lemmas A1 and A2, and careful construction of

ρ

with respect to

L_{f}

and

∥ Y^{k + 1} ∥_{2}

. Further simplifying, we see that

L^{K}

is thus bounded above, since:

\begin{matrix} L^{K} - L^{K_{0}} & \leq & lim_{K \to \infty} 4 B_{P} \sum_{k = K_{0}}^{K - 1} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} \\ = & 4 B_{P} \frac{1 + α}{2 α^{K_{0}} ρ} (1 + \frac{1}{α} + \frac{1}{α^{2}} + \dots) \\ = & \frac{4 B_{P} α (1 + α)}{2 α^{K_{0}} ρ (α - 1)} < + \infty . \end{matrix}

If

L^{k}

is bounded below, then:

0 \leq \sum_{k = K_{0}}^{K - 1} (c_{1} ∥ Δ X^{k} ∥_{F}^{2} + c_{2} ∥ Δ Z^{k} ∥_{F}^{2} + c_{3} {∥ Δ Y^{k} ∥}_{F}^{2}) < + \infty .

(A11)

Recall

c_{3}^{k} = \frac{ρ^{k}}{2}

, and by boundedness assumption on

∥ Y^{k + 1} ∥_{2}

, for

k > K_{0}

,

c_{1}^{k}, c_{2}^{k} \propto ρ^{k}

. Since additionally

\sum_{k} ρ_{k} = + \infty

, then this immediately yields

Z^{k + 1} - Z^{k} \to 0, X^{k + 1} - X^{k} \to 0, Y^{k + 1} - Y^{k} \to 0

.

Therefore, since the primal variables are convergent, this implies that:

\begin{matrix} Z^{k + 1} - {(X^{k + 1} {(Y^{k + 1})}^{T})}_{Ω} = \frac{1}{ρ^{k}} (S^{k + 1} - S^{k}), \\ X^{k + 1} - Y^{k + 1} = \frac{1}{ρ^{k}} (U^{k + 1} - U^{k}), \end{matrix}

converges to a constant. But since

ρ^{k} \to \infty

and the dual variables are all bounded, then it must be that:

Z^{k + 1} - {(X^{k + 1} {(Y^{k + 1})}^{T})}_{Ω} \to 0, X^{k + 1} - Y^{k + 1} \to 0 .

Therefore, the limit points

X^{*}, Y^{*}

, and

Z^{*}

are all feasible, and simply checking the first optimality condition will verify that this accumulation point is a stationary point of (12). □

Appendix C. Convergence Analysis for Vector Form

Lemma A5.

For two adjacent iterations of Algorithm 2, we have:

\begin{matrix} ∥ U^{k + 1} - U^{k} ∥_{2}^{2} \leq L_{g}^{2} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2} . \end{matrix}

(A12)

Proof.

From the first-order optimality conditions for the update of X:

\begin{matrix} \nabla g (X^{k + 1}) + U^{k} + ρ^{k} (X^{k + 1} - Y^{k + 1}) = 0 . \end{matrix}

(A13)

Combining with the dual update, we obtain

\nabla g (X^{k + 1}) + U^{k + 1} = 0 .

Then, the result follows from the definition of

L_{g}

. □
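Lemma A5 can be checked numerically on a toy instance. The sketch below is not the paper's implementation: it assumes a hypothetical quadratic objective g(x) = ½ xᵀQx with symmetric Q (so that L_g is the spectral norm of Q), the constraint set {−1, +1}ⁿ, a constant ρ larger than the spectral norm of Q, and an update order (Y, then X, then U) inferred from (A13) and the dual update. The dual variable is initialized so that ∇g(X⁰) + U⁰ = 0.

```python
import numpy as np

def admm_vector_quadratic(Q, rho=10.0, iters=20, seed=0):
    """Toy vector-form ADMM for g(x) = 0.5 x^T Q x over {-1, +1}^n (assumed setup)."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.standard_normal(n)
    u = -Q @ x                                   # makes grad g(x) + u = 0 at k = 0
    L_g = np.linalg.norm(Q, 2)
    history = []
    for _ in range(iters):
        # Y-update: projection of x + u / rho onto {-1, +1}^n
        y = np.where(x + u / rho >= 0, 1.0, -1.0)
        # X-update: solve Q x + u + rho (x - y) = 0, cf. (A13)
        x_new = np.linalg.solve(Q + rho * np.eye(n), rho * y - u)
        # dual update
        u_new = u + rho * (x_new - y)
        history.append((np.linalg.norm(Q @ x_new + u_new),      # stationarity residual
                        np.linalg.norm(u_new - u),               # ||U^{k+1} - U^k||
                        L_g * np.linalg.norm(x_new - x)))        # L_g ||X^{k+1} - X^k||
        x, u = x_new, u_new
    return x, y, u, history
```

In each recorded triple, the first entry should vanish up to round-off and the second should not exceed the third, which is the statement of (A12).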

Next, we will show that the augmented Lagrangian is monotonically decreasing and lower bounded.

Lemma A6.

Each step in the augmented Lagrangian update is decreasing, e.g., for:

\begin{matrix} L (X, Y; U; ρ) \\ : = g (X) + δ_{C} (Y) + 〈 U, X - Y 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2} \end{matrix}

(A14)

we have:

\begin{matrix} L (Y^{k + 1}, X^{k + 1}; U^{k + 1}; ρ^{k + 1}) \leq L (Y^{k + 1}, X^{k + 1}; U^{k}; ρ^{k}) \\ \leq L (Y^{k + 1}, X^{k}; U^{k}; ρ^{k}) \leq L (Y^{k}, X^{k}; U^{k}; ρ^{k}) . \end{matrix}

(A15)

Furthermore, the amount of decrease is:

\begin{matrix} L (Y^{k + 1}, X^{k + 1}; U^{k + 1}; ρ^{k + 1}) - L (Y^{k}, X^{k}; U^{k}; ρ^{k}) \\ \leq - ρ^{k} ∥ Y^{k + 1} - Y^{k} ∥_{F}^{2} - c^{k} {∥ X^{k + 1} - X^{k} ∥}_{F}^{2} . \end{matrix}

(A16)

Here,

if $g (X)$ is $H_{g}$ -strongly convex (where $H_{g} = 0$ if g is convex but not strongly convex) then $c^{k} = \frac{ρ^{k} + H_{g}}{2} - L_{g}^{2} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}}$ , and
if $g (X)$ is nonconvex but $L_{g}$ -smooth, then $c^{k} = \frac{ρ^{k} - 3 L_{g}}{2} - L_{g}^{2} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}}$ .

Proof.

Both the updates of Y and X globally minimize

L

with respect to those variables. To minimize Y at

(X, U) = (X^{k}, U^{k})

:

\begin{matrix} L (Y^{k + 1}, X; U; ρ) - L (Y^{k}, X; U; ρ) \\ \overset{(a)}{\leq} & 〈 \nabla_{Y} L (Y^{k + 1}, X; U; ρ), Δ Y^{k} 〉 - \frac{ρ^{k}}{2} {∥ Δ Y^{k} ∥}_{2}^{2} \end{matrix}

(A17)

\begin{matrix} \overset{(b)}{\leq} & - \frac{ρ^{k}}{2} {∥ Δ Y^{k} ∥}_{2}^{2} . \end{matrix}

(A18)

To minimize X at

(Y, U) = (Y^{k + 1}, U^{k})

, we consider two cases. If g is

H_{g}

-strongly convex, then:

\begin{matrix} L (Y, X^{k + 1}; U; ρ) - L (Y, X^{k}; U; ρ) \\ \overset{(a)}{\leq} & 〈 \nabla_{X} L (Y, X^{k + 1}; U; ρ), X^{k + 1} - X^{k} 〉 \\ - \frac{ρ^{k} + H_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2} \\ \overset{(b)}{\leq} & - \frac{ρ^{k} + H_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2}, \end{matrix}

(A19)

where (a) follows from the strong convexity of

L (Y, X; U; ρ)

with respect to X, and (b) follows from the optimality condition of the update. If g is nonconvex but

L_{g}

-Lipschitz, then note that:

\begin{matrix} g (X^{k + 1}) - g (X^{k}) \\ \leq & 〈 \nabla g (X^{k}), X^{k + 1} - X^{k} 〉 + \frac{L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{F}^{2} \\ \overset{(a)}{=} & 〈 \nabla g (X^{k}) - \nabla g (X^{k + 1}), X^{k + 1} - X^{k} 〉 \\ + \frac{L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{F}^{2} + 〈 \nabla g (X^{k + 1}), X^{k + 1} - X^{k} 〉 \\ \overset{(b)}{\leq} & ∥ \nabla g (X^{k}) - \nabla g (X^{k + 1}) ∥_{F} {∥ X^{k + 1} - X^{k} ∥}_{F} \\ + \frac{L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{F}^{2} + 〈 \nabla g (X^{k + 1}), X^{k + 1} - X^{k} 〉 \\ \overset{(c)}{\leq} & \frac{3 L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{F}^{2} + 〈 \nabla g (X^{k + 1}), X^{k + 1} - X^{k} 〉 \end{matrix}

where (a) follows from adding and subtracting a term, (b) from Cauchy–Schwarz, and (c) from the Lipschitz gradient condition on g. Therefore:

\begin{matrix} L (Y, X^{k + 1}; U; ρ) - L (Y, X^{k}; U; ρ) \\ \overset{(a)}{\leq} & 〈 \nabla_{X} L (Y, X^{k + 1}; U; ρ), X^{k + 1} - X^{k} 〉 \\ - \frac{ρ^{k} - 3 L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2} \\ \overset{(b)}{\leq} & - \frac{ρ^{k} - 3 L_{g}}{2} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2} . \end{matrix}

In the dual variables, using

{X, Y} = {X^{k + 1}, Y^{k + 1}}

we have:

\begin{matrix} L (Y, X; U^{k + 1}; ρ^{k + 1}) - L (Y, X; U^{k}; ρ^{k}) \\ \overset{(a)}{\leq} & 〈 U^{k + 1} - U^{k}, X - Y 〉 + \frac{ρ^{k + 1} - ρ^{k}}{2} {∥ X - Y ∥}_{F}^{2} \\ \overset{(b)}{\leq} & \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} {∥ U^{k + 1} - U^{k} ∥}_{2}^{2} \\ \overset{(c)}{\leq} & L_{g}^{2} \frac{ρ^{k + 1} + ρ^{k}}{2 {(ρ^{k})}^{2}} {∥ X^{k + 1} - X^{k} ∥}_{2}^{2}, \end{matrix}

where (a) follows the definition of

L

, (b) follows from the update of U, and (c) follows from Lemma A5 since

ρ^{k} > 0

for all k. Incorporating these observations completes the proof. □

Lemma A7.

If

ρ^{k} \geq L_{g}

and the objective

g (X)

is lower-bounded over

C

, then the augmented Lagrangian (A14) is lower bounded.

Proof.

From the

L_{g}

-Lipschitz continuity of

\nabla g (X)

, it follows that:

\begin{matrix} g (X) \geq g (Y) + 〈 \nabla g (X), X - Y 〉 - \frac{L_{g}}{2} {∥ X - Y ∥}_{F}^{2} \end{matrix}

(A20)

for any X and Y. By definition:

\begin{matrix} L (Y^{k}, X^{k}; U^{k}; ρ^{k}) \\ = & g (X^{k}) + 〈 U^{k}, X^{k} - Y^{k} 〉 + \frac{ρ^{k}}{2} {∥ X^{k} - Y^{k} ∥}_{F}^{2} \\ \overset{(a)}{=} & g (X^{k}) - 〈 \nabla g (X^{k}), X^{k} - Y^{k} 〉 + \frac{ρ^{k}}{2} {∥ X^{k} - Y^{k} ∥}_{F}^{2} \\ \overset{(b)}{\geq} & g (Y^{k}) + \frac{ρ^{k} - L_{g}}{2} {∥ X^{k} - Y^{k} ∥}_{F}^{2}, \end{matrix}

(A21)

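where (a) follows from the optimality in updating X and (b) follows from (A20). Suppose, for contradiction, that

L^{k}

is unbounded below. Since the quadratic term in (A21) is nonnegative, this implies that

g (Y^{k})

is unbounded below. Since

Y^{k} \in C

for all k, g would then be unbounded below over

C

, contradicting the assumption. □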

Thus, if

g (X)

is lower-bounded over

C

, since the sequence

{L (X^{k}, Y^{k}; U^{k})}

is monotonically decreasing and lower bounded, the sequence

{L (X^{k}, Y^{k}; U^{k})}

converges. Given the monotonic descent of each subproblem (Lemma A6) and strong convexity of

L^{k}

with respect to X and Y, it is clear that

X^{k} \to X^{*}

,

Y^{k} \to Y^{*}

fixed points. Combining with Lemma A5 gives also

U^{k} \to U^{*}

.

The proof of Theorem 3 easily follows from Lemma A7.

Appendix C.1. Linear Rate of Convergence when g Is Strongly Convex

Lemma A8.

Consider Algorithm 2 with

ρ^{k}

constant. Then, collecting the variables all vectorized

x = (X, Y, U)

,

L^{k + 1} - L^{k} \leq - c_{3} {∥ x^{k + 1} - x^{k} ∥}^{2},

where g is

H_{g}

strongly convex and:

\begin{matrix} c_{3} = max_{θ \in (0, 1)} min \\ \{θ (\frac{ρ + H_{g}}{2} - \frac{L_{g}^{2}}{ρ}), (1 - θ) (\frac{ρ + H_{g}}{2 H_{g}} - \frac{L_{g}^{2}}{ρ H_{g}}), ρ\} . \end{matrix}

Proof.

From Lemma A6, we already have that:

L^{k + 1} - L^{k} \leq - ρ ∥ Y^{k + 1} - Y^{k} ∥^{2} - c {∥ X^{k + 1} - X^{k} ∥}^{2},

where for constant

ρ

,

c = \frac{ρ + H_{g}}{2} - \frac{L_{g}^{2}}{ρ}

. Moreover, when

g (X)

is

H_{g}

-strongly convex,

\begin{matrix} ∥ U^{k + 1} - U^{k} ∥_{2} = {∥ \nabla g (X^{k + 1}) - \nabla g (X^{k}) ∥}_{2} \\ \geq H_{g} {∥ X^{k + 1} - X^{k} ∥}_{2} . \end{matrix}

Therefore:

L^{k + 1} - L^{k} \leq - θ \frac{c}{H_{g}} ∥ U^{k + 1} - U^{k} ∥_{2}^{2} - (1 - θ) c {∥ X^{k + 1} - X^{k} ∥}^{2} .

for any

θ \in (0, 1)

. We thus have:

\begin{matrix} L^{k + 1} - L^{k} \\ \leq & - θ c ∥ Δ X^{k} ∥_{F}^{2} - (1 - θ) \frac{c}{H_{g}} {∥ Δ U ∥}_{F}^{2} - ρ {∥ Δ Y^{k} ∥}_{F}^{2} \\ \leq & - min \{θ c, (1 - θ) \frac{c}{H_{g}}, ρ\} {∥ [\begin{matrix} X^{k + 1} - X^{k} \\ Y^{k + 1} - Y^{k} \\ U^{k + 1} - U^{k} \end{matrix}] ∥}_{F}^{2}, \end{matrix}

with

Δ U = U^{k + 1} - U^{k}

. Note that this does not mean

L

is strongly convex with respect to the collected variables

x = (X, Y, Z)

(

L

is not even convex). But with respect to each variable X, Y, and Z, it is strongly convex. □

Lemma A9.

Again with

ρ^{k} > 1

constant and collecting

x = (X, Y, U)

, we have:

L^{k + 1} - L^{*} \leq c_{4} {∥ x^{k + 1} - x^{*} ∥}^{2}, c_{4} = max {L_{g} + ρ + 2, 2 ρ, 1}

whenever

Y^{k + 1}

and

Y^{*}

are both in

C

.

Proof.

Over the domain

C

, the augmented Lagrangian can be written as:

L (x) = g (X) + 〈 U, X - Y 〉 + \frac{ρ}{2} {∥ X - Y ∥}_{F}^{2},

with gradient

\nabla L (x) = [\begin{matrix} \nabla g (X) + U + ρ (X - Y) \\ - U + ρ (Y - X) \\ X - Y \end{matrix}]

and thus:

\begin{matrix} ∥ \nabla L (x_{1}) - \nabla L (x_{2}) ∥_{F}^{2} \\ = & ∥ \nabla_{X} L (x_{1}) - \nabla_{X} L (x_{2}) ∥_{F}^{2} \\ + ∥ \nabla_{Y} L (x_{1}) - \nabla_{Y} L (x_{2}) ∥_{F}^{2} \\ + ∥ \nabla_{U} L (x_{1}) - \nabla_{U} L (x_{2}) ∥_{F}^{2} \\ \leq & (L_{g} + ρ + 2) ∥ X_{1} - X_{2} ∥_{F}^{2} + (2 ρ) {∥ Y_{1} - Y_{2} ∥}_{F}^{2} \\ + ∥ U_{1} - U_{2} ∥_{F}^{2} \\ \leq & max {L_{g} + ρ + 2, 2 ρ, 1} {∥ x_{2} - x_{1} ∥}_{2}^{2}, \end{matrix}

which reveals the Lipschitz smoothness constant for

L

as

c_{4} = max {L_{g} + ρ + 2, 2 ρ, 1} .

Then, using first-order optimality conditions,

\begin{matrix} L^{k + 1} & \leq & L^{*} + 〈 \nabla L (x^{*}), x^{k + 1} - x^{*} 〉 + c_{4} {∥ x^{k + 1} - x^{*} ∥}_{2}^{2} \\ \overset{(a)}{\leq} & L^{*} + c_{4} {∥ x^{k + 1} - x^{*} ∥}_{2}^{2}, \end{matrix}

where (a) follows from the optimality of

L^{*}

. □

Lemma A10.

Consider

g (x)

H_{g}

-strongly convex in x, and ρ large enough so that

c_{3} > 0

. Then, the number of steps for

L^{k} - L^{*} \leq ϵ

is

O (log (1 / ϵ))

.

This proof is standard in the linear convergence of block coordinate descent when the objective is strongly convex. Note that

L

is not strongly convex or even convex, but still all the steps hold.

Proof.

Take

x^{k} = {X^{k}, Y^{k}, U^{k}}

and

x^{*} = {X^{*}, Y^{*}, U^{*}}

. Then:

\begin{matrix} L (x^{k}) - L (x^{*}) & = & L (x^{k}) - L (x^{k + 1}) + L (x^{k + 1}) - L (x^{*}) \\ \geq & c_{3} {∥ x^{k + 1} - x^{k} ∥}^{2} + L (x^{k + 1}) - L (x^{*}) \\ \geq & (\frac{c_{3}}{c_{4}} + 1) (L (x^{k + 1}) - L (x^{*})) \end{matrix}

Therefore:

\frac{L (x^{k}) - L (x^{*})}{L (x^{0}) - L (x^{*})} \leq {(\frac{c_{4}}{c_{4} + c_{3}})}^{k}

and so:

L (x^{k}) - L (x^{*}) \leq ϵ

if:

k \geq D_{1} log (1 / ϵ) + D_{2}

where:

D_{1} = {log}^{- 1} (\frac{c_{4} + c_{3}}{c_{4}}), D_{2} = \frac{log (L (x^{0}) - L (x^{*}))}{log (\frac{c_{4} + c_{3}}{c_{4}})} .

□
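The iteration bound from the proof can be evaluated directly. The sketch below is a small illustration with hypothetical names; it takes the constants c₃ and c₄ and the initial gap L(x⁰) − L(x*) as given inputs, since computing them depends on the problem instance.

```python
import math

def iterations_for_tolerance(c3, c4, gap0, eps):
    """Smallest integer k with k >= D1 * log(1 / eps) + D2, and the contraction factor."""
    rate = c4 / (c4 + c3)                        # per-iteration contraction factor
    log_ratio = math.log((c4 + c3) / c4)
    D1 = 1.0 / log_ratio
    D2 = math.log(gap0) / log_ratio
    k = max(0, math.ceil(D1 * math.log(1.0 / eps) + D2))
    return k, rate
```

For example, with c₃ = c₄ the gap halves at every iteration, so reaching a tolerance of 10⁻³ from an initial gap of 1 takes 10 iterations.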

Figure 1. Time comparisons for DIMACS problems. (top): average runtime per iteration. (bottom): total runtime. We observe that both V and MRR converge in relatively few iterations, with MR1 taking slightly longer. However, as previously observed with splitting methods, the convergence rate is sensitive to the choice of the penalty parameters $ρ^{(t)}$. For best performance, we start with a relatively small initial penalty coefficient and increase it over the iterations until it reaches its upper bound.
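
The penalty strategy described in the Figure 1 caption can be sketched as follows. This is only an illustration under stated assumptions: the caption says the penalty starts small and increases up to an upper bound, but it does not give the update rule or any numerical values, so the geometric growth and the numbers below (`rho_init`, `rho_max`, `growth`) are hypothetical.

```python
def penalty_schedule(rho_init, rho_max, growth, num_iters):
    """Return the penalty used at each iteration: increasing, then capped.

    Geometric growth is an assumption made for this sketch; the source
    only states that the penalty increases until an upper bound is hit.
    """
    schedule = []
    rho = rho_init
    for _ in range(num_iters):
        schedule.append(rho)
        rho = min(rho * growth, rho_max)   # increase, but never exceed the cap
    return schedule


# Hypothetical values, for illustration only.
schedule = penalty_schedule(rho_init=0.01, rho_max=1.0, growth=1.5, num_iters=20)
for t, rho in enumerate(schedule):
    print(t, rho)
```

A smaller `growth` keeps the early iterations loosely coupled for longer, while a larger one reaches the cap sooner; which trade-off works best is problem dependent.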

Figure 2. Image segmentation. The center and right columns are the best MAX-CUT and community detection results, respectively.

Table 1. MAX-CUT values for graphs from the 7th DIMACS Challenge. MRR = matrix formulation, $r = ⌈\sqrt{2 n}⌉$. SDR = SDP relaxation + rounding technique.

Database	n	Sparsity	BK	V	MR1	MRR	SDR
g3-8	512	0.012	41,684,814	34,105,231	36,780,180	35,943,350	33,424,095
g3-15	3375	0.018	281,029,888	235,893,612	255,681,256	241,740,931	212,669,181
pm3-8-50	512	0.012	454	394	346	378	416
pm3-15-50	3375	0.018	2964	2594	1966	2140	2616
G1	800	0.0599	11,624	10,938	11,047	11,321	11,360
G2	800	0.0599	11,620	10,834	11,082	11,144	11,343
G3	800	0.0599	11,622	10,858	10,894	11,174	11,367
G4	800	0.0599	11,646	10,849	10,760	11,192	11,429
G5	800	0.0599	11,631	10,796	10,783	11,352	11,394
G6	800	0.0599	2178	1853	1820	1949	1941
G7	800	0.0599	2003	1694	1644	1705	1774
G8	800	0.0599	2003	1688	1641	1728	1766
G9	800	0.0599	2048	1771	1681	1807	1830
G10	800	0.0599	1994	1662	1641	1737	1732
G11	800	0.005	564	496	460	480	506
G12	800	0.005	556	486	448	480	512
G13	800	0.005	580	516	476	498	528
G14	800	0.0147	3060	2715	2768	2861	2901
G15	800	0.0146	3049	2625	2810	2803	2884
G16	800	0.0146	3045	2667	2736	2862	2910
G17	800	0.0146	3043	2638	2789	2840	2920
G18	800	0.0147	988	798	768	841	858
G19	800	0.0146	903	700	641	694	780
G20	800	0.0146	941	723	691	766	788
G21	800	0.0146	931	696	713	810	794
G22	2000	0.01	13,346	12,461	12,548	12,751	12,926
G23	2000	0.01	13,317	12,540	12,528	12,853	12,889
G24	2000	0.01	13,314	12,540	12,447	12,723	12,904
G25	2000	0.01	13,326	12,447	12,558	12,733	12,874
G26	2000	0.01	13,314	12,445	12,475	12,718	12,847
G27	2000	0.01	3318	2824	2508	2807	2909
G28	2000	0.01	3285	2753	2518	2796	2845
G29	2000	0.01	3389	2864	2628	2901	2896
G30	2000	0.01	3403	2887	2639	2937	2971
G31	2000	0.01	3288	2833	2518	2902	2825
G32	2000	0.002	1398	1220	1066	1204	1254
G33	2000	0.002	1376	1202	1054	1166	1250
G34	2000	0.002	1372	1208	1096	1170	1222
G35	2000	0.0059	7670	6605	6914	6764	7209
G36	2000	0.0059	7660	6564	6943	6598	7228
G37	2000	0.0059	7666	6478	6839	6789	7183
G38	2000	0.0059	7681	6486	6759	6768	7212
G39	2000	0.0059	2395	1616	1697	1840	1997
G40	2000	0.0059	2387	1617	1438	1921	1890
G41	2000	0.0059	2398	1606	1656	1778	1899
G42	2000	0.0059	2469	1707	1756	1862	1971
G43	1000	0.02	6659	6222	6236	6398	6475
G44	1000	0.02	6648	6275	6192	6447	6458
G45	1000	0.02	6652	6243	6255	6407	6454
G46	1000	0.02	6645	6217	6233	6398	6407
G47	1000	0.02	6656	6221	6266	6433	6454
G48	3000	0.0013	6000	5882	5006	5402	6000
G49	3000	0.0013	6000	5844	5038	5362	6000
G50	3000	0.0013	5880	5814	4994	5410	5880
G51	1000	0.0118	3846	3317	3446	3524	3642
G52	1000	0.0118	3849	3360	3471	3499	3662
G53	1000	0.0118	3846	3323	3510	3516	3660
G54	1000	0.0118	3846	3306	3428	3509	3651
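
To make the MRR factor width $r = ⌈\sqrt{2 n}⌉$ concrete, the sketch below evaluates it for the graph sizes that appear in Table 1 and computes, for the G1 row, the ratio of each method's value to the BK column. The helper names are illustrative and not from the paper; the numbers are copied from the table.

```python
import math


def factor_width(n):
    """Exact integer ceiling of sqrt(2 * n), avoiding floating-point rounding."""
    root = math.isqrt(2 * n)
    return root if root * root == 2 * n else root + 1


# Graph sizes n that occur in Table 1.
sizes = [512, 800, 1000, 2000, 3000, 3375]
widths = {n: factor_width(n) for n in sizes}
for n, r in widths.items():
    print(f"n = {n:5d}  ->  r = {r}")

# G1 row of Table 1: value in the BK column and values per method.
bk_g1 = 11624
g1_values = {"V": 10938, "MR1": 11047, "MRR": 11321, "SDR": 11360}
ratios = {name: value / bk_g1 for name, value in g1_values.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f} of BK")
```

The width grows only with the square root of n, so the factor matrix has on the order of n^{1.5} entries rather than the n^{2} entries of the full matrix variable.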

Table 2. Results for nonnegative factorization with partial observations using linearized ADMM (5 trials). STD = standard deviation.

n	1000			3000			5000			8000
$|Ω| / n^{2}$	0.1	0.5	0.8	0.1	0.5	0.8	0.1	0.5	0.8	0.1	0.5	0.8
CPU time (s)	9.74	13.53	13.97	61.15	78.99	64.76	117.54	85.24	131.64	212.26	220.42	337.74
$\frac{∥ {(Z^{*} - C)}_{Ω} ∥}{∥ C_{Ω} ∥}$	0.86	0.85	0.86	0.89	0.89	0.89	0.89	0.88	0.87	0.88	0.90	0.89
STD	0.043	0.020	0.021	0.010	0.006	0.008	0.008	0.012	0.018	0.004	0.008	0.008
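
The error metric reported in Table 2 compares the recovered matrix with the data only on the observed index set Ω. The sketch below shows one way to compute it on a tiny example. It assumes the norm is the Frobenius norm restricted to Ω, which the table does not state explicitly; the matrices and the index set are made up for illustration.

```python
import math


def relative_error_on_observed(Z, C, omega):
    """Return ||(Z - C)_Omega|| / ||C_Omega|| over the observed entries.

    Z, C: square matrices as nested lists; omega: iterable of (i, j) pairs.
    The Frobenius norm over omega is an assumption of this sketch.
    """
    residual_sq = sum((Z[i][j] - C[i][j]) ** 2 for i, j in omega)
    reference_sq = sum(C[i][j] ** 2 for i, j in omega)
    return math.sqrt(residual_sq / reference_sq)


# Hypothetical 2 x 2 example with three observed entries.
C = [[1.0, 2.0], [2.0, 4.0]]
Z = [[1.0, 1.0], [1.0, 4.0]]
omega = [(0, 0), (0, 1), (1, 1)]

n = len(C)
sampling_ratio = len(omega) / n ** 2   # plays the role of |Omega| / n^2
rel_err = relative_error_on_observed(Z, C, omega)
print("sampling ratio:", sampling_ratio)
print("relative error on observed entries:", rel_err)
```

Entries outside Ω, here the (1, 0) entry, do not influence the metric even though Z and C differ there.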

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
