Article

The Proximal Gradient Method for Composite Optimization Problems on Riemannian Manifolds

School of Sciences, Civil Aviation Flight University of China, Guanghan 618300, China
Mathematics 2024, 12(17), 2638; https://doi.org/10.3390/math12172638
Submission received: 1 August 2024 / Revised: 22 August 2024 / Accepted: 23 August 2024 / Published: 25 August 2024
(This article belongs to the Special Issue Advances in Nonlinear Analysis: Theory, Methods and Applications)

Abstract

In this paper, the composite optimization problem is studied on Riemannian manifolds. To tackle this problem, the proximal gradient method for composite optimization problems is proposed on Riemannian manifolds. Under some reasonable conditions, the convergence of the proximal gradient method with the backtracking procedure is established in the nonconvex case. Furthermore, a sublinear convergence rate and a complexity result of the proximal gradient method for the convex case are also established on Riemannian manifolds.

1. Introduction

Optimization is a fundamental concept in various fields, ranging from engineering and economics to machine learning and logistics. In the case of composite optimization, the objective function or the constraints are composed of multiple individual functions or constraints, making the problem more complex. Composite optimization problems [1,2,3,4] arise in diverse applications, such as signal processing, image reconstruction, and data analysis. The challenge in composite optimization lies in effectively optimizing these composite functions, potentially involving iterative algorithms or gradient-based methods. Researchers in optimization continually develop new techniques and algorithms to efficiently solve composite optimization problems and improve decision-making processes in various domains.
Solving composite optimization problems efficiently requires specialized algorithms that can exploit the problem's composite structure to achieve optimal solutions. Many popular methods, including the proximal gradient method, the alternating direction method of multipliers (ADMM), block coordinate descent, and primal–dual methods, can be applied to composite optimization problems [5,6,7,8,9,10,11]. The proximal gradient method is a widely used optimization algorithm for solving composite optimization problems. It combines gradient descent with proximal operators to handle composite objective functions in which the objective is the sum of a smooth and a nonsmooth component. By iteratively updating the variables through gradient steps and proximal steps, proximal gradient methods can efficiently solve composite optimization problems. For example, Sahu et al. [8] proposed first-order methods, such as the proximal gradient method, that use forward–backward splitting techniques; they derived convergence rates for the proposed formulations and showed that the speed of convergence of these algorithms is significantly better than that of the traditional forward–backward algorithm. In [9], Parikh and Boyd discussed many different interpretations of proximal operators and algorithms and described their connections to many other topics in optimization and applied mathematics.
On the other hand, extending theoretical results and methods for optimization problems from Euclidean spaces to Riemannian manifolds has attracted significant attention in recent years; see, e.g., [12,13,14,15,16,17,18,19,20,21]. For instance, Riemannian proximal gradient methods have been developed in [22,23], which utilize the Riemannian metric and curvature information to iteratively minimize the objective function. In [22], Bento et al. presented the proximal point method for finding minima of a special class of nonconvex functions on Hadamard manifolds. The well-definedness of the sequence generated by the proximal point method was established, and its convergence to a minimizer was obtained. Based on the results of [22], Feng et al. proposed a monotone proximal gradient algorithm with a fixed step size on Hadamard manifolds in [23]. They also established the convergence theorem of the proposed method under a reasonable definition of the proximal gradient mapping on manifolds. There are many advantages to transferring algorithms from Euclidean spaces to Riemannian manifolds: First, Riemannian manifolds capture the underlying geometry of the data space, allowing algorithms to leverage this structure for more accurate and efficient computations; second, Riemannian manifolds can model nonlinear and curved data more effectively than Euclidean spaces, enabling algorithms to better represent and process complex data distributions; third, algorithms designed for Riemannian manifolds are often more suitable for curved surfaces or non-Euclidean spaces, leading to improved performance and results.
Composite optimization problems on Riemannian manifolds pose a unique computational challenge due to the non-Euclidean geometry of the manifold. In such problems, the objective function is composed of several terms defined on the Riemannian manifold, which requires specialized optimization techniques that respect the manifold structure. In this paper, we propose the proximal gradient method for composite optimization problems on Riemannian manifolds. The proximal gradient method on Riemannian manifolds performs a gradient descent step and a proximal operator step. For the gradient descent step on Riemannian manifolds, the algorithm computes the gradient of the smooth function with respect to the Riemannian metric at the current iteration. This gradient is then used to update the iteration in the direction that minimizes the smooth component of the objective function on Riemannian manifolds. After the gradient descent step, the algorithm applies a proximal operator to the current iteration on Riemannian manifolds. The proximal operator is a mapping that projects the updated iteration onto the set of feasible points on the manifold, taking into account the nonsmooth component of the objective function. By iteratively alternating between these two steps, the proximal gradient method aims to efficiently minimize the composite objective function on the Riemannian manifold while respecting the manifold’s geometry and constraints. This approach is particularly useful for optimization problems in machine learning, computer vision, and other fields where data lie on complex geometric structures.
The proximal gradient method is a powerful optimization algorithm commonly used to solve composite optimization problems on Euclidean spaces. However, extending this method to Riemannian manifolds involves some challenges due to the non-Euclidean geometry of these spaces. One approach to address this is the Riemannian proximal gradient method, which combines ideas from optimization theory with Riemannian geometry to efficiently solve composite optimization problems on Riemannian manifolds. Since a Riemannian manifold, in general, does not have a linear structure, the usual techniques of Euclidean spaces cannot be applied directly, and new techniques have to be developed. Our contributions are as follows. First, the proximal gradient method for composite optimization problems is introduced on Riemannian manifolds, and its convergence results are established, which generalizes some algorithmic results in [8,9] from $\mathbb{R}^n$ to Riemannian manifolds. Second, some global convergence results of the proximal gradient method for composite optimization problems are proved using the backtracking procedure, which makes the algorithm more efficient. Furthermore, a sublinear convergence rate of the generated sequence of function values to the optimal value is established on Riemannian manifolds, and the complexity result of the proximal gradient method for the convex case is also obtained. Third, since the computation of exponential mappings and parallel transports can be quite expensive on manifolds, and many convergence results show that the nice properties of some algorithms hold for all suitably defined retractions and general vector transports on Riemannian manifolds [22,23], geodesics and parallel transports are replaced by retractions and general vector transports, respectively.
This work is organized as follows. In Section 2, some necessary definitions and concepts are provided for Riemannian manifolds. In Section 3, the proximal gradient method for composite optimization problems is presented for Riemannian manifolds. In Section 4, under some reasonable conditions, some convergence results of the proximal gradient method for composite optimization problems are provided for Riemannian manifolds.

2. Preliminaries

In this section, some standard definitions and results from Riemannian manifolds are recalled, which can be found in some introductory books on Riemannian geometry; see, for example, [24,25].
Let $M$ be a finite-dimensional differentiable manifold and $x \in M$. The tangent space of $M$ at $x$ is denoted by $T_xM$, and the tangent bundle of $M$ by $TM = \bigcup_{x \in M} T_xM$. The inner product on $T_xM$ is denoted by $\langle \cdot, \cdot \rangle_x$, with associated norm $\|\cdot\|_x$; if there is no confusion, the subscript $x$ is omitted. If $M$ is endowed with a Riemannian metric $g$, then $M$ is a Riemannian manifold. Given a piecewise smooth curve $\gamma : [t_0, t_1] \to M$ joining $x$ to $y$, that is, $\gamma(t_0) = x$ and $\gamma(t_1) = y$, the length of $\gamma$ is defined by $l(\gamma) = \int_{t_0}^{t_1} \|\gamma'(t)\| \, dt$. Minimizing this length functional over the set of all such curves yields the Riemannian distance $d(x, y)$, which induces the original topology on $M$.
A Riemannian manifold is complete if, for any $x \in M$, all geodesics emanating from $x$ are defined for all $t \in \mathbb{R}$. By the Hopf–Rinow theorem [13], any pair of points $x, y \in M$ can be joined by a minimal geodesic. The exponential mapping $\exp_x : T_xM \to M$ is defined by $\exp_x v = \gamma_v(1, x)$ for each $v \in T_xM$, where $\gamma(\cdot) = \gamma_v(\cdot, x)$ is the geodesic starting at $x$ with velocity $v$, that is, $\gamma(0) = x$ and $\gamma'(0) = v$. It is easy to see that $\exp_x tv = \gamma_v(t, x)$ for each real number $t$.
The exponential mapping $\exp_x$ provides a local parametrization of $M$ via $T_xM$. However, the systematic use of the exponential mapping may not be desirable in all cases. Other local mappings from $T_xM$ to $M$ may reduce the computational cost while preserving the useful convergence properties of the considered method.
Definition 1 
([21]). Given $x \in M$, a retraction is a smooth mapping $R_x : T_xM \to M$ such that
(i)
$R_x(0_x) = x$ for all $x \in M$, where $0_x$ denotes the zero element of $T_xM$;
(ii)
$DR_x(0_x) = \mathrm{id}_{T_xM}$, where $DR_x$ denotes the derivative of $R_x$ and $\mathrm{id}$ denotes the identity mapping.
It is well known that the exponential mapping is a special retraction, and some retractions are approximations of the exponential mapping.
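As a concrete illustration (not part of the paper's development), the mapping $R_x(\eta_x) = (x + \eta_x)/\|x + \eta_x\|$ on the unit sphere, which reappears in Example 1 of Section 4, satisfies both conditions of Definition 1. The following NumPy snippet checks them numerically at a random point; it is a minimal sketch, and the finite-difference tolerance is an assumption made for illustration.

```python
import numpy as np

def sphere_retraction(x, eta):
    # R_x(eta) = (x + eta) / ||x + eta|| is a retraction on the unit sphere.
    v = x + eta
    return v / np.linalg.norm(v)

# Numerical check of Definition 1 at a random point (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal(5); x /= np.linalg.norm(x)
xi = rng.standard_normal(5); xi -= (x @ xi) * x        # project xi onto T_x S^{n-1}

assert np.allclose(sphere_retraction(x, np.zeros(5)), x)            # (i) R_x(0_x) = x
t = 1e-6
deriv = (sphere_retraction(x, t * xi) - sphere_retraction(x, np.zeros(5))) / t
assert np.allclose(deriv, xi, atol=1e-4)                            # (ii) DR_x(0_x) = id
```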
The parallel transport is often too expensive to compute in a practical method, so a more general vector transport, built upon the retraction $R_x$, can be considered; see, for example, [14,15]. A vector transport $\mathcal{T} : TM \oplus TM \to TM$, $(\eta_x, \xi_x) \mapsto \mathcal{T}_{\eta_x}\xi_x$, with associated retraction $R_x$ is a smooth mapping such that, for all $\eta_x$ in the domain of $R_x$ and all $\xi_x, \zeta_x \in T_xM$: (i) $\mathcal{T}_{\eta_x}\xi_x \in T_{R_x(\eta_x)}M$; (ii) $\mathcal{T}_{0_x}\xi_x = \xi_x$; and (iii) $\mathcal{T}_{\eta_x}$ is a linear mapping. Let $\mathcal{T}_S$ denote the isometric vector transport (see, for example, [15,20]) with $R_x$ as the associated retraction. Then, it satisfies (i), (ii), (iii), and
$$\mathrm{(iv)} \quad g\left(\mathcal{T}_S(\eta_x)\xi_x, \mathcal{T}_S(\eta_x)\zeta_x\right) = g(\xi_x, \zeta_x).$$
In most practical cases, $\mathcal{T}_S(\eta_x)$ exists for all $\eta_x \in T_xM$, and this assumption is made throughout the paper. Furthermore, let $\mathcal{T}_{\eta_x}$ denote the derivative of the retraction, i.e.,
$$\mathcal{T}_{\eta_x}\xi_x = DR_x(\eta_x)[\xi_x] = \left.\frac{d}{dt}R_x(\eta_x + t\xi_x)\right|_{t=0}.$$
Let $\mathcal{L}(TM, TM)$ denote the fiber bundle with base space $M \times M$ such that the fiber over $(x, y) \in M \times M$ is $\mathcal{L}(T_xM, T_yM)$, the set of all linear mappings from $T_xM$ to $T_yM$. From [21], it follows that a transporter $L$ on $M$ is a smooth section of the bundle $\mathcal{L}(TM, TM)$. Furthermore, $L^{-1}(x, y) = L(y, x)$ and $L(x, z) = L(y, z) \circ L(x, y)$. Given a retraction $R_x$, for any $\eta_x, \xi_x \in T_xM$, the isometric vector transport $\mathcal{T}_S$ can be defined by
$$\mathcal{T}_S(\eta_x)\xi_x = L(x, R_x(\eta_x))(\xi_x).$$
In this paper, the locking condition proposed by Huang [20], namely
$$\mathcal{T}_{\eta_x}\xi_x = \mathcal{T}_S(\eta_x)\xi_x,$$
is required. On some manifolds there exist retractions such that the above equality holds, e.g., the Stiefel manifold and the Grassmann manifold [20]. Furthermore, from the above results, it follows that
$$\|\xi_x\| = \|\mathcal{T}_S(\eta_x)\xi_x\| = \|L(x, R_x(\eta_x))(\xi_x)\| = \|\mathcal{T}_{\eta_x}\xi_x\| = \|DR_x(\eta_x)[\xi_x]\|.$$

3. The Proximal Gradient Method

In this paper, the following composite optimization problem is studied on Riemannian manifolds:
$$\min_{x \in M} \; F(x) = f(x) + g(x). \qquad (1)$$
Suppose the following assumption holds.
Assumption 1. 
(i)
$g : M \to (-\infty, +\infty]$ is proper, closed, and convex;
(ii)
$f : M \to (-\infty, +\infty]$ is proper and closed, $\mathrm{dom}(f)$ is convex, $\mathrm{dom}(g) \subseteq \mathrm{int}(\mathrm{dom}(f))$, and $f$ is $L_f$-smooth over $\mathrm{int}(\mathrm{dom}(f))$, that is, $\|\mathrm{grad} f(x)\| \le L_f$ for any $x \in M$ and some $L_f > 0$;
(iii)
The optimal set of (1) is nonempty and denoted by $X^*$, and the optimal value of problem (1) is denoted by $F^*$.
Remark 1. 
There are some special cases of problem (1).
(i)
If $g = 0$ and $\mathrm{dom}(f) = M$, then (1) reduces to the unconstrained smooth minimization problem on Riemannian manifolds
$$\min_{x \in M} f(x),$$
where $f : M \to \mathbb{R}$ is an $L_f$-smooth function on Riemannian manifolds.
(ii)
If $g = \delta_C$, where $C$ is a nonempty, closed, and convex set on $M$, then (1) reduces to the problem of minimizing a differentiable function over a nonempty, closed, and convex set on Riemannian manifolds.
Let $\hat{f} = f \circ R$ and $\hat{g} = g \circ R$ denote the pullbacks of $f$ and $g$ through $R$, respectively. For any $x \in M$, let
$$\hat{f}_x = f \circ R_x, \qquad \hat{g}_x = g \circ R_x$$
denote the restrictions of $\hat{f}$ and $\hat{g}$ to $T_xM$. From [21], it follows that
$$\mathrm{grad}\,\hat{f}_x(0_x) = \mathrm{grad} f(x).$$
For problem (1), it is natural to define the following iteration:
$$\eta_k = \operatorname*{argmin}_{\eta \in T_{x_k}M} \left\{ \hat{f}_{x_k}(0_{x_k}) + \langle \mathrm{grad}\,\hat{f}_{x_k}(0_{x_k}), \eta \rangle + \hat{g}_{x_k}(\eta) + \frac{1}{2t_k}\|\eta\|^2 \right\}. \qquad (4)$$
After some simple manipulation, (4) can be rewritten as
$$\eta_k = \operatorname*{argmin}_{\eta \in T_{x_k}M} \left\{ t_k \hat{g}_{x_k}(\eta) + \frac{1}{2}\left\|\eta + t_k \mathrm{grad}\,\hat{f}_{x_k}(0_{x_k})\right\|^2 \right\},$$
which, by the definition of the proximal operator, is the same as
$$\eta_k = \mathrm{prox}_{t_k \hat{g}_{x_k}}\left(-t_k \mathrm{grad}\,\hat{f}_{x_k}(0_{x_k})\right).$$
Now, the proximal gradient method for composite optimization problems is introduced for Riemannian manifolds.
Let $T_L^{\hat{f}_x, \hat{g}_x}(\eta) := \mathrm{prox}_{\frac{1}{L}\hat{g}_x}\left(\eta - \frac{1}{L}\mathrm{grad}\,\hat{f}_x(\eta)\right)$ for $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$. Then, the general update step of the proximal gradient method can be written as
$$x_{k+1} = R_{x_k}\left(T_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right).$$
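This update can be summarized in a short Python sketch, assuming tangent vectors at $x$ are represented as NumPy arrays and that the user supplies the pullback gradient, the proximal operator of $\hat{g}_x$ (taking a point and a scaling parameter), the Riemannian gradient of $f$, and a retraction; the function names are illustrative placeholders, not part of any library.

```python
import numpy as np

def prox_grad_map(eta, L, grad_f_hat, prox_g_hat):
    # T_L(eta) = prox_{(1/L) g_hat_x}(eta - (1/L) grad f_hat_x(eta)):
    # a forward (gradient) step followed by a backward (proximal) step,
    # carried out entirely in the tangent space, represented as a NumPy vector.
    return prox_g_hat(eta - grad_f_hat(eta) / L, 1.0 / L)

def gradient_mapping(eta, L, grad_f_hat, prox_g_hat):
    # G_L(eta) = L * (eta - T_L(eta)); it vanishes exactly at stationary points.
    return L * (eta - prox_grad_map(eta, L, grad_f_hat, prox_g_hat))

def proximal_gradient_step(x, L, riem_grad_f, prox_g_hat, retraction):
    # One outer iteration x_{k+1} = R_{x_k}(T_L(0_{x_k})).  At eta = 0_{x_k} the
    # pullback gradient coincides with the Riemannian gradient of f at x_k.
    eta = prox_g_hat(-riem_grad_f(x) / L, 1.0 / L)   # eta_k = T_L(0_{x_k})
    return retraction(x, eta)
```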
Lemma 1 
([11]). Let $f : \mathbb{E} \to (-\infty, +\infty]$ be an $L$-smooth function ($L \ge 0$) over a given convex set $D$. Then, for any $x, y \in D$,
$$f(y) \le f(x) + \langle \mathrm{grad} f(x), y - x \rangle + \frac{L}{2}\|x - y\|^2.$$
Lemma 2 
([11]). Let $f : \mathbb{E} \to (-\infty, +\infty]$ be a proper, closed, and convex function. Then, for any $x, u \in \mathbb{E}$, the following three claims are equivalent:
(i)
$u = \mathrm{prox}_f(x)$;
(ii)
$x - u \in \partial f(u)$;
(iii)
$\langle x - u, y - u \rangle \le f(y) - f(u)$ for any $y \in \mathbb{E}$.
Lemma 3. 
Suppose that $f$ and $g$ satisfy Assumption 1. Let $\hat{F}_x = \hat{f}_x + \hat{g}_x$. Then, for any $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$ and $L \in \left(\frac{L_{\hat{f}_x}}{2}, +\infty\right)$, the following inequality holds:
$$\hat{F}_x(\eta) - \hat{F}_x\left(T_L^{\hat{f}_x, \hat{g}_x}(\eta)\right) \ge \frac{L - \frac{L_{\hat{f}_x}}{2}}{L^2}\left\|G_L^{\hat{f}_x, \hat{g}_x}(\eta)\right\|^2,$$
where $G_L^{\hat{f}_x, \hat{g}_x} : T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x)) \to T_xM$ is the operator defined by
$$G_L^{\hat{f}_x, \hat{g}_x}(\eta) := L\left(\eta - T_L^{\hat{f}_x, \hat{g}_x}(\eta)\right).$$
Proof. 
Using the notation $\eta^+ = T_L^{\hat{f}_x, \hat{g}_x}(\eta)$, by Lemma 1, it follows that
$$\hat{f}_x(\eta^+) \le \hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), \eta^+ - \eta \rangle + \frac{L_{\hat{f}_x}}{2}\|\eta^+ - \eta\|^2. \qquad (6)$$
By Lemma 2, since $\eta^+ = \mathrm{prox}_{\frac{1}{L}\hat{g}_x}\left(\eta - \frac{1}{L}\mathrm{grad}\,\hat{f}_x(\eta)\right)$, it follows that
$$\left\langle \eta - \frac{1}{L}\mathrm{grad}\,\hat{f}_x(\eta) - \eta^+, \, \eta - \eta^+ \right\rangle \le \frac{1}{L}\hat{g}_x(\eta) - \frac{1}{L}\hat{g}_x(\eta^+),$$
which implies that
$$\langle \mathrm{grad}\,\hat{f}_x(\eta), \eta^+ - \eta \rangle \le -L\|\eta^+ - \eta\|^2 + \hat{g}_x(\eta) - \hat{g}_x(\eta^+),$$
which, together with (6), implies that
$$\hat{f}_x(\eta^+) + \hat{g}_x(\eta^+) \le \hat{f}_x(\eta) + \hat{g}_x(\eta) + \left(-L + \frac{L_{\hat{f}_x}}{2}\right)\|\eta^+ - \eta\|^2.$$
Therefore, since $\eta^+ - \eta = -\frac{1}{L}G_L^{\hat{f}_x, \hat{g}_x}(\eta)$,
$$\hat{F}_x(\eta) - \hat{F}_x\left(T_L^{\hat{f}_x, \hat{g}_x}(\eta)\right) \ge \frac{L - \frac{L_{\hat{f}_x}}{2}}{L^2}\left\|G_L^{\hat{f}_x, \hat{g}_x}(\eta)\right\|^2.$$
   □
Definition 2. 
Suppose that $f$ and $g$ satisfy Assumption 1. Then, for any $x \in M$, the gradient mapping is the operator $G_L^{\hat{f}_x, \hat{g}_x} : T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x)) \to T_xM$ defined by
$$G_L^{\hat{f}_x, \hat{g}_x}(\eta) := L\left(\eta - T_L^{\hat{f}_x, \hat{g}_x}(\eta)\right).$$
The update step of the proximal gradient method can be rewritten as
$$\eta_k = -\frac{1}{L_k}G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k}),$$
and
$$x_{k+1} = R_{x_k}(\eta_k) = R_{x_k}\left(-\frac{1}{L_k}G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right).$$
Theorem 1. 
Let $f$ and $g$ satisfy Assumption 1, and let $L > 0$. Then,
(i)
$G_L^{\hat{f}_x, \hat{g}_x}(\eta) = \mathrm{grad}\,\hat{f}_x(\eta)$ for all $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$ when $g = 0$;
(ii)
For $0_{x^*} \in T_{x^*}M \cap \mathrm{int}(\mathrm{dom}(\hat{f}_{x^*}))$, it holds that $G_L^{\hat{f}_{x^*}, \hat{g}_{x^*}}(0_{x^*}) = 0$ if and only if $x^*$ is a stationary point of problem (1).
Proof. 
(i) For $g = 0$, since
$$\mathrm{prox}_{\frac{1}{L}\hat{g}_x}(\eta) = \operatorname*{argmin}_{\xi \in T_xM}\left\{\frac{1}{L}\hat{g}_x(\xi) + \frac{1}{2}\|\xi - \eta\|^2\right\} = \operatorname*{argmin}_{\xi \in T_xM}\left\{\frac{1}{2}\|\xi - \eta\|^2\right\} = \eta,$$
it follows that
$$G_L^{\hat{f}_x, \hat{g}_x}(\eta) = L\left(\eta - T_L^{\hat{f}_x, \hat{g}_x}(\eta)\right) = L\left(\eta - \mathrm{prox}_{\frac{1}{L}\hat{g}_x}\left(\eta - \frac{1}{L}\mathrm{grad}\,\hat{f}_x(\eta)\right)\right) = L\left(\eta - \left(\eta - \frac{1}{L}\mathrm{grad}\,\hat{f}_x(\eta)\right)\right) = \mathrm{grad}\,\hat{f}_x(\eta)$$
for all $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$.
(ii) $G_L^{\hat{f}_{x^*}, \hat{g}_{x^*}}(0_{x^*}) = 0$ if and only if $0_{x^*} = \mathrm{prox}_{\frac{1}{L}\hat{g}_{x^*}}\left(0_{x^*} - \frac{1}{L}\mathrm{grad}\,\hat{f}_{x^*}(0_{x^*})\right)$; from Lemma 2, the latter relation holds if and only if
$$0_{x^*} - \frac{1}{L}\mathrm{grad}\,\hat{f}_{x^*}(0_{x^*}) - 0_{x^*} \in \frac{1}{L}\partial\hat{g}_{x^*}(0_{x^*}),$$
that is,
$$0_{x^*} \in \mathrm{grad}\,\hat{f}_{x^*}(0_{x^*}) + \partial\hat{g}_{x^*}(0_{x^*}).$$
This implies that $0_{x^*}$ is a stationary point of $\hat{f}_{x^*} + \hat{g}_{x^*}$. Then, $x^*$ is a stationary point of problem (1).    □
Next, the monotonicity of $G_L$ with respect to the parameter $L$ is established on Riemannian manifolds.
Theorem 2. 
Suppose that $f$ and $g$ satisfy Assumption 1, and let $L_1 \ge L_2 > 0$. Then, for any $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$, it holds that
$$\left\|G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta)\right\| \ge \left\|G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|.$$
Proof. 
For any $x \in M$, $\xi_1, \xi_2 \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$, and $L > 0$, from Lemma 2, the following inequality holds:
$$\left\langle \xi_1 - \mathrm{prox}_{\frac{1}{L}\hat{g}_x}(\xi_1), \, \mathrm{prox}_{\frac{1}{L}\hat{g}_x}(\xi_1) - \xi_2 \right\rangle \ge \frac{1}{L}\hat{g}_x\left(\mathrm{prox}_{\frac{1}{L}\hat{g}_x}(\xi_1)\right) - \frac{1}{L}\hat{g}_x(\xi_2).$$
Plugging $L = L_1$, $\xi_1 = \eta - \frac{1}{L_1}\mathrm{grad}\,\hat{f}_x(\eta)$, and $\xi_2 = \mathrm{prox}_{\frac{1}{L_2}\hat{g}_x}\left(\eta - \frac{1}{L_2}\mathrm{grad}\,\hat{f}_x(\eta)\right) = T_{L_2}(\eta)$ into the last inequality, it follows that
$$\left\langle \eta - \frac{1}{L_1}\mathrm{grad}\,\hat{f}_x(\eta) - T_{L_1}(\eta), \, T_{L_1}(\eta) - T_{L_2}(\eta) \right\rangle \ge \frac{1}{L_1}\hat{g}_x(T_{L_1}(\eta)) - \frac{1}{L_1}\hat{g}_x(T_{L_2}(\eta)),$$
or
$$\left\langle \frac{1}{L_1}G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) - \frac{1}{L_1}\mathrm{grad}\,\hat{f}_x(\eta), \, \frac{1}{L_2}G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) - \frac{1}{L_1}G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) \right\rangle \ge \frac{1}{L_1}\hat{g}_x(T_{L_1}(\eta)) - \frac{1}{L_1}\hat{g}_x(T_{L_2}(\eta)).$$
Exchanging the roles of $L_1$ and $L_2$ yields the following inequality:
$$\left\langle \frac{1}{L_2}G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) - \frac{1}{L_2}\mathrm{grad}\,\hat{f}_x(\eta), \, \frac{1}{L_1}G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) - \frac{1}{L_2}G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) \right\rangle \ge \frac{1}{L_2}\hat{g}_x(T_{L_2}(\eta)) - \frac{1}{L_2}\hat{g}_x(T_{L_1}(\eta)).$$
Multiplying the first inequality by $L_1$ and the second by $L_2$ and adding them, it follows that
$$\left\langle G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) - G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta), \, \frac{1}{L_2}G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) - \frac{1}{L_1}G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) \right\rangle \ge 0.$$
That is,
$$\frac{1}{L_1}\left\|G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|^2 + \frac{1}{L_2}\left\|G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|^2 \le \left(\frac{1}{L_1} + \frac{1}{L_2}\right)\left\langle G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta), G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) \right\rangle \le \left(\frac{1}{L_1} + \frac{1}{L_2}\right)\left\|G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|\left\|G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|. \qquad (8)$$
Note that if $G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) = 0$, then, by (8), $G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta) = 0$. Assume that $G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta) \neq 0$, and define $t = \frac{\left\|G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|}{\left\|G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|}$. Then, by (8), it follows that
$$\frac{1}{L_1}t^2 - \left(\frac{1}{L_1} + \frac{1}{L_2}\right)t + \frac{1}{L_2} \le 0.$$
This implies that
$$1 \le t \le \frac{L_1}{L_2}.$$
Therefore, $\left\|G_{L_1}^{\hat{f}_x, \hat{g}_x}(\eta)\right\| \ge \left\|G_{L_2}^{\hat{f}_x, \hat{g}_x}(\eta)\right\|$.    □

4. The Convergence Result

4.1. The Non-Convex Case

In this section, the convergence of the proximal gradient method is analyzed for Riemannian manifolds. Now, the backtracking procedure B1 is considered as follows.
The procedure requires three parameters $(s, r, q)$, where $s > 0$, $r \in (0, 1)$, and $q > 1$. The choice of $L_k$ is performed as follows. First, $L_k$ is set equal to the initial value $s$. Then, while
$$\hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}(T_{L_k}(0_{x_k})) < \frac{r}{L_k}\left\|G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2,$$
we set $L_k := qL_k$. In other words, $L_k$ is chosen as $L_k := sq^{i_k}$, where $i_k$ is the smallest non-negative integer for which the condition
$$\hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}(T_{sq^{i_k}}(0_{x_k})) \ge \frac{r}{sq^{i_k}}\left\|G_{sq^{i_k}}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2$$
is satisfied.
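To make the procedure concrete, here is a minimal Python sketch of B1 wrapped around the update of Algorithm 1 (stated after Lemma 4 below). The interface is an assumption for illustration, not taken from any library: tangent vectors are NumPy arrays, `F_hat_at(x, v)` evaluates the pullback objective $\hat{F}_x(v)$, `T_L_at(x, L)` returns $T_L(0_x)$ (e.g., built from the prox-grad map sketched in Section 3), and `retraction(x, v)` implements $R_x(v)$.

```python
import numpy as np

def backtracking_b1(x, s, r, q, F_hat_x, T_L_at, max_iter=50):
    # Procedure B1: start from L = s and multiply by q until the sufficient
    # decrease condition
    #   F_hat_x(0) - F_hat_x(T_L(0)) >= (r / L) * ||G_L(0)||^2
    # holds, where G_L(0) = L * (0 - T_L(0)) = -L * T_L(0).
    L = s
    zero = np.zeros_like(x)
    F0 = F_hat_x(zero)
    for _ in range(max_iter):
        eta = T_L_at(x, L)                    # T_L(0_x), a tangent vector at x
        G = -L * eta                          # gradient mapping G_L(0_x)
        if F0 - F_hat_x(eta) >= (r / L) * (G @ G):
            return L, eta
        L *= q
    return L, eta

def proximal_gradient_b1(x0, s, r, q, F_hat_at, T_L_at, retraction,
                         tol=1e-8, max_iter=500):
    # Algorithm 1 with step sizes chosen by procedure B1 (illustrative sketch).
    x = x0
    for _ in range(max_iter):
        L, eta = backtracking_b1(x, s, r, q, lambda v: F_hat_at(x, v), T_L_at)
        if np.linalg.norm(eta) <= tol:        # stopping criterion eta_k = 0
            return x
        x = retraction(x, eta)                # x_{k+1} = R_{x_k}(eta_k)
    return x
```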
Lemma 4. 
Suppose that Assumption 1 holds. Let $\{x_k\}$ be the sequence generated by Algorithm 1 with a step size chosen by the backtracking procedure B1. Then, for any $k \ge 0$,
$$F(x_k) - F(x_{k+1}) \ge m\left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2,$$
where $m = \dfrac{r}{\max\left\{s, \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}\right\}}$.
Algorithm 1 The proximal gradient method for Riemannian manifolds.
Initialization: pick $x_0 \in \mathrm{int}(\mathrm{dom}(f))$.
General step: for any $k = 0, 1, \ldots$, execute the following steps:
$$\eta_k := \mathrm{prox}_{\frac{1}{L_k}\hat{g}_{x_k}}\left(-\frac{1}{L_k}\mathrm{grad}\,\hat{f}_{x_k}(0_{x_k})\right),$$
and set $x_{k+1} = R_{x_k}(\eta_k)$, where $L_k > 0$.
Stopping criterion: $\eta_k = 0$.
Proof. 
It follows from Lemma 3 that
$$\hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}\left(T_L^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right) \ge \frac{L - \frac{L_{\hat{f}_{x_k}}}{2}}{L^2}\left\|G_L^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2. \qquad (12)$$
If $L \ge \frac{L_{\hat{f}_{x_k}}}{2(1-r)}$, then $\frac{L - \frac{L_{\hat{f}_{x_k}}}{2}}{L} \ge r$; hence, by (12), it follows that
$$\hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}\left(T_L^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right) \ge \frac{r}{L}\left\|G_L^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2 \qquad (13)$$
holds. This implies that the backtracking procedure B1 must end once $L_k \ge \frac{L_{\hat{f}_{x_k}}}{2(1-r)}$. An upper bound on $L_k$ can be computed: either $L_k$ is equal to $s$, or the backtracking procedure B1 is invoked, meaning that $\frac{L_k}{q}$ did not satisfy the backtracking condition, which implies that $\frac{L_k}{q} < \frac{L_{\hat{f}_{x_k}}}{2(1-r)}$; so, $L_k < \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}$. That is,
$$L_k \le \max\left\{s, \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}\right\}.$$
This, together with (13), implies that
$$\hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}\left(T_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right) \ge \frac{r}{L_k}\left\|G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2 \ge \frac{r}{\max\left\{s, \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}\right\}}\left\|G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2.$$
By Theorem 2, since $L_k \ge s$, it follows that
$$\left\|G_{L_k}^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\| \ge \left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|.$$
This, together with (13), (14), and Theorem 2, implies that
$$F(x_k) - F(x_{k+1}) = \hat{F}_{x_k}(0_{x_k}) - \hat{F}_{x_k}(T_{L_k}(0_{x_k})) \ge m\left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2,$$
where $m = \dfrac{r}{\max\left\{s, \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}\right\}}$.
 □
Theorem 3. 
Suppose that Assumption 1 holds, and let $\{x_k\}$ be the sequence generated by Algorithm 1 with a step size chosen by the backtracking procedure B1. Then,
(i)
The sequence $\{F(x_k)\}$ is non-increasing;
(ii)
$\min_{n = 0, 1, \ldots, k}\left\|G_s^{\hat{f}_{x_n}, \hat{g}_{x_n}}(0_{x_n})\right\| \le \sqrt{\dfrac{F(x_0) - F^*}{m(k+1)}}$;
(iii)
All limit points of $\{x_k\}$ are stationary points of (1).
Proof. 
(i) By Lemma 4, it follows that
$$F(x_k) - F(x_{k+1}) \ge m\left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\|^2,$$
where $m = \dfrac{r}{\max\left\{s, \frac{qL_{\hat{f}_{x_k}}}{2(1-r)}\right\}}$. From the above inequality, it follows that $F(x_k) \ge F(x_{k+1})$.
(ii) Since the sequence $\{F(x_k)\}$ is non-increasing and bounded below, it converges. Thus,
$$F(x_k) - F(x_{k+1}) \to 0,$$
which implies that
$$\left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\| \to 0.$$
Summing the inequality
$$F(x_n) - F(x_{n+1}) \ge m\left\|G_s^{\hat{f}_{x_n}, \hat{g}_{x_n}}(0_{x_n})\right\|^2$$
over $n = 0, 1, \ldots, k$ implies that
$$F(x_0) - F(x_{k+1}) \ge \sum_{n=0}^{k} m\left\|G_s^{\hat{f}_{x_n}, \hat{g}_{x_n}}(0_{x_n})\right\|^2 \ge m(k+1)\min_{n = 0, 1, \ldots, k}\left\|G_s^{\hat{f}_{x_n}, \hat{g}_{x_n}}(0_{x_n})\right\|^2.$$
Since $F(x_{k+1}) \ge F^*$, it follows that (ii) holds.
(iii) Let $\bar{x}$ be a limit point of $\{x_k\}$. Then, there exists a subsequence $\{x_{k_j}\}$ converging to $\bar{x}$. From (15), it follows that $\left\|G_s^{\hat{f}_{x_k}, \hat{g}_{x_k}}(0_{x_k})\right\| \to 0$. It is easy to check that, when $x_{k_j} \to \bar{x}$,
$$G_s^{\hat{f}_{x_{k_j}}, \hat{g}_{x_{k_j}}}(0_{x_{k_j}}) \to G_s^{\hat{f}_{\bar{x}}, \hat{g}_{\bar{x}}}(0_{\bar{x}}).$$
Since
$$\left\|G_s^{\hat{f}_{\bar{x}}, \hat{g}_{\bar{x}}}(0_{\bar{x}})\right\| \le \left\|G_s^{\hat{f}_{x_{k_j}}, \hat{g}_{x_{k_j}}}(0_{x_{k_j}}) - G_s^{\hat{f}_{\bar{x}}, \hat{g}_{\bar{x}}}(0_{\bar{x}})\right\| + \left\|G_s^{\hat{f}_{x_{k_j}}, \hat{g}_{x_{k_j}}}(0_{x_{k_j}})\right\|, \qquad (16)$$
and the right-hand side of (16) goes to $0$ as $j \to +\infty$, it follows that $G_s^{\hat{f}_{\bar{x}}, \hat{g}_{\bar{x}}}(0_{\bar{x}}) = 0$. Therefore, from Theorem 1 (ii), it follows that $\bar{x}$ is a stationary point of (1). □

4.2. The Convex Case

In this section, suppose that f is convex on M. Under some conditions, some convergence results of the proximal gradient method for composite optimization problems are obtained for Riemannian manifolds.
Definition 3 
([11]). A function $f : \mathbb{E} \to (-\infty, +\infty]$ is called σ-strongly convex for a given $\sigma > 0$ if $\mathrm{dom}(f)$ is convex and the following inequality holds for any $x, y \in \mathrm{dom}(f)$ and $\lambda \in [0, 1]$:
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) - \frac{\sigma}{2}\lambda(1-\lambda)\|x - y\|^2.$$
Lemma 5 
([11]). Let $f : \mathbb{E} \to (-\infty, +\infty]$ be a proper, closed, and σ-strongly convex function ($\sigma > 0$). Then,
$$f(x) - f(x^*) \ge \frac{\sigma}{2}\|x - x^*\|^2, \quad \forall x \in \mathrm{dom}(f),$$
where $x^*$ is the unique minimizer of $f$.
Theorem 4. 
Suppose that $f$ and $g$ satisfy Assumption 1. For any $x \in M$, $\xi, \eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$, and $L > 0$ satisfying
$$\hat{f}_x(T_L(\eta)) \le \hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), T_L(\eta) - \eta \rangle + \frac{L}{2}\|T_L(\eta) - \eta\|^2, \qquad (18)$$
it holds that
$$\hat{F}_x(\xi) - \hat{F}_x(T_L(\eta)) \ge \frac{L}{2}\|\xi - T_L(\eta)\|^2 - \frac{L}{2}\|\eta - \xi\|^2 + l_{\hat{f}_x}(\xi, \eta), \qquad (19)$$
where
$$l_{\hat{f}_x}(\xi, \eta) = \hat{f}_x(\xi) - \hat{f}_x(\eta) - \langle \mathrm{grad}\,\hat{f}_x(\eta), \xi - \eta \rangle.$$
Proof. 
Consider the function
$$\phi_x(u) = \hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), u - \eta \rangle + \hat{g}_x(u) + \frac{L}{2}\|u - \eta\|^2.$$
It is easy to check that $\phi_x$ is $L$-strongly convex and that
$$T_L(\eta) = \operatorname*{argmin}_{u \in T_xM} \phi_x(u).$$
From Lemma 5, it follows that
$$\phi_x(\xi) - \phi_x(T_L(\eta)) \ge \frac{L}{2}\|\xi - T_L(\eta)\|^2.$$
By (18), it is easy to check that
$$\phi_x(T_L(\eta)) = \hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), T_L(\eta) - \eta \rangle + \hat{g}_x(T_L(\eta)) + \frac{L}{2}\|T_L(\eta) - \eta\|^2 \ge \hat{f}_x(T_L(\eta)) + \hat{g}_x(T_L(\eta)) = \hat{F}_x(T_L(\eta)).$$
This, together with (20), implies that
$$\phi_x(\xi) \ge \hat{F}_x(T_L(\eta)) + \frac{L}{2}\|\xi - T_L(\eta)\|^2.$$
By the definition of $\phi_x(\xi)$, it follows that
$$\hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), \xi - \eta \rangle + \hat{g}_x(\xi) + \frac{L}{2}\|\xi - \eta\|^2 \ge \hat{F}_x(T_L(\eta)) + \frac{L}{2}\|\xi - T_L(\eta)\|^2,$$
which is equivalent to
$$\hat{F}_x(\xi) - \hat{F}_x(T_L(\eta)) \ge \frac{L}{2}\|\xi - T_L(\eta)\|^2 - \frac{L}{2}\|\xi - \eta\|^2 + \hat{f}_x(\xi) - \hat{f}_x(\eta) - \langle \mathrm{grad}\,\hat{f}_x(\eta), \xi - \eta \rangle. \qquad \square$$
The following result is a direct consequence of Theorem 4.
Corollary 1. 
Suppose that $f$ and $g$ satisfy Assumption 1. For any $x \in M$ and $\eta \in T_xM \cap \mathrm{int}(\mathrm{dom}(\hat{f}_x))$ for which
$$\hat{f}_x(T_L(\eta)) \le \hat{f}_x(\eta) + \langle \mathrm{grad}\,\hat{f}_x(\eta), T_L(\eta) - \eta \rangle + \frac{L}{2}\|T_L(\eta) - \eta\|^2,$$
it holds that
$$\hat{F}_x(\eta) - \hat{F}_x(T_L(\eta)) \ge \frac{1}{2L}\left\|G_L^{\hat{f}_x, \hat{g}_x}(\eta)\right\|^2. \qquad (22)$$
Next, the backtracking procedure B2 for the case where $f$ is convex is considered. The procedure requires two parameters $(s, q)$, where $s > 0$ and $q > 1$. Define $L_{-1} = s$. The choice of $L_k$ is obtained as follows. First, $L_k$ is set to $L_{k-1}$. Then, while
$$\hat{f}_{x_k}(T_{L_k}(0_{x_k})) > \hat{f}_{x_k}(0_{x_k}) + \langle \mathrm{grad}\,\hat{f}_{x_k}(0_{x_k}), T_{L_k}(0_{x_k}) \rangle + \frac{L_k}{2}\|T_{L_k}(0_{x_k})\|^2,$$
we set $L_k := qL_k$. That is, $L_k$ is chosen as $L_k = L_{k-1}q^{i_k}$, where $i_k$ is the smallest non-negative integer for which the condition
$$\hat{f}_{x_k}(T_{L_{k-1}q^{i_k}}(0_{x_k})) \le \hat{f}_{x_k}(0_{x_k}) + \langle \mathrm{grad}\,\hat{f}_{x_k}(0_{x_k}), T_{L_{k-1}q^{i_k}}(0_{x_k}) \rangle + \frac{L_{k-1}q^{i_k}}{2}\|T_{L_{k-1}q^{i_k}}(0_{x_k})\|^2 \qquad (24)$$
is satisfied. Under Assumption 1 and Lemma 1, it follows that
$$s \le L_k \le \max\{qL_f, s\}.$$
The bound $s \le L_k$ is obvious. For the inequality $L_k \le \max\{qL_f, s\}$, note that if $L_k > s$, then the inequality (24) is not satisfied with $\frac{L_k}{q}$ replacing $L_k$. By Lemma 1, it follows that $\frac{L_k}{q} < L_f$; so, $L_k \le \max\{qL_f, s\}$.
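A minimal sketch of procedure B2 is given below, under the same assumed interface as the earlier sketches (tangent vectors as NumPy arrays, `f_hat_at(x, v)` and `grad_f_hat_at(x)` evaluating the pullback of $f$ and its gradient at $0_x$, `T_L_at(x, L)` returning $T_L(0_{x_k})$); the names are illustrative placeholders, not library calls.

```python
import numpy as np

def backtracking_b2(x, L_prev, q, f_hat_at, grad_f_hat_at, T_L_at, max_iter=50):
    # Procedure B2 (convex case): start from the previous step-size parameter and
    # multiply by q until the descent-type condition
    #   f_hat(T_L(0)) <= f_hat(0) + <grad f_hat(0), T_L(0)> + (L/2)||T_L(0)||^2
    # holds; by Lemma 1 this keeps L_k within [s, max{q * L_f, s}].
    L = L_prev
    f0 = f_hat_at(x, np.zeros_like(x))   # f_hat_{x_k}(0_{x_k}) = f(x_k)
    g0 = grad_f_hat_at(x)                # grad f_hat_{x_k}(0_{x_k}) = grad f(x_k)
    for _ in range(max_iter):
        eta = T_L_at(x, L)               # T_L(0_{x_k})
        if f_hat_at(x, eta) <= f0 + g0 @ eta + 0.5 * L * (eta @ eta):
            break
        L *= q
    return L, eta
```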
Next, an $O(1/k)$ rate of convergence of the generated sequence of function values to the optimal value is established for Riemannian manifolds. This rate of convergence is called a sublinear rate.
Theorem 5. 
Suppose that Assumption 1 holds and that $f$ is convex on $M$. Let $\{x_k\}$ be the sequence generated by Algorithm 1 with the backtracking procedure B2. Then, for any $x^* \in X^*$ and $k > 0$, there exists $\alpha > 0$ such that
$$F(x_k) - F^* \le \frac{\alpha L_f d^2(x_0, x^*)}{2k}. \qquad (26)$$
Proof. 
For any $n \ge 0$, it follows from (19) that
$$\frac{2}{L_n}\left(F(x^*) - F(x_{n+1})\right) = \frac{2}{L_n}\left(\hat{F}_{x_n}(R_{x_n}^{-1}(x^*)) - \hat{F}_{x_n}(T_{L_n}(0_{x_n}))\right) \ge \left\|R_{x_n}^{-1}(x^*) - T_{L_n}(0_{x_n})\right\|^2 - \left\|0_{x_n} - R_{x_n}^{-1}(x^*)\right\|^2 + \frac{2}{L_n}l_{\hat{f}_{x_n}}(R_{x_n}^{-1}(x^*), 0_{x_n}) \ge \left\|R_{x_n}^{-1}(x^*) - T_{L_n}(0_{x_n})\right\|^2 - \left\|R_{x_n}^{-1}(x^*)\right\|^2, \qquad (27)$$
where the last inequality follows from the convexity of $f$. From [21], it is easy to check that there exists $m > 0$ such that
$$\left\|R_{x_n}^{-1}(x^*) - T_{L_n}(0_{x_n})\right\|^2 - \left\|0_{x_n} - R_{x_n}^{-1}(x^*)\right\|^2 = \left\|R_{x_n}^{-1}(x^*) - R_{x_n}^{-1}(x_{n+1})\right\|^2 - \left\|R_{x_n}^{-1}(x^*)\right\|^2 \ge m\left[d^2(x_{n+1}, x^*) - d^2(x_n, x^*)\right]. \qquad (28)$$
Summing (27) over $n = 0, 1, \ldots, k-1$ and using $L_n \le pL_f$, where $p = \max\left\{q, \frac{s}{L_f}\right\}$, together with (28), implies that
$$\frac{2}{pL_f}\sum_{n=0}^{k-1}\left(F(x^*) - F(x_{n+1})\right) \ge m\left[d^2(x_k, x^*) - d^2(x_0, x^*)\right].$$
Thus,
$$\sum_{n=0}^{k-1}\left(F(x_{n+1}) - F(x^*)\right) \le \frac{mpL_f}{2}d^2(x_0, x^*) - \frac{mpL_f}{2}d^2(x_k, x^*) \le \frac{mpL_f}{2}d^2(x_0, x^*).$$
From (22), it follows that $F(x_{n+1}) \le F(x_n)$ for all $n \ge 0$; so,
$$k\left(F(x_k) - F(x^*)\right) \le \sum_{n=0}^{k-1}\left(F(x_{n+1}) - F(x^*)\right) \le \frac{mpL_f}{2}d^2(x_0, x^*).$$
Therefore,
$$F(x_k) - F(x^*) \le \frac{mpL_f}{2k}d^2(x_0, x^*).$$
Let $\alpha = mp$. Then,
$$F(x_k) - F^* \le \frac{\alpha L_f d^2(x_0, x^*)}{2k}. \qquad \square$$
To derive the complexity result of the proximal gradient method on Riemannian manifolds, assume that $d(x_0, x^*) \le R$ for some $x^* \in X^*$ and some constant $R > 0$. For example, if $\mathrm{dom}(g)$ is bounded, then $R$ might be taken as its diameter. In order to obtain an ε-optimal solution of (1), by (26), it is enough to require that $\frac{\alpha L_f R^2}{2k} \le \epsilon$. The following complexity result for the proximal gradient method is a direct consequence of Theorem 5.
Theorem 6. 
Suppose that Assumption 1 holds and that $f$ is convex on $M$. Let $\{x_k\}$ be the sequence generated by Algorithm 1 with the backtracking procedure B2. Then, for $k$ satisfying
$$k \ge \left\lceil \frac{\alpha L_f R^2}{2\epsilon} \right\rceil,$$
it holds that $F(x_k) - F^* \le \epsilon$, where $R$ is an upper bound on $d(x_0, x^*)$ for some $x^* \in X^*$.
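As an illustration with hypothetical values (not taken from the paper): if $\alpha = 2$, $L_f = 10$, $R = 1$, and $\epsilon = 10^{-3}$, then Theorem 6 guarantees an $\epsilon$-optimal solution after at most $\left\lceil \frac{2 \cdot 10 \cdot 1^2}{2 \cdot 10^{-3}} \right\rceil = 10{,}000$ iterations.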
Example 1. 
On the unit sphere $S^{n-1}$, considered as a Riemannian submanifold of $\mathbb{R}^n$, the inner product inherited from the standard inner product on $\mathbb{R}^n$ is given by
$$\langle \xi_x, \eta_x \rangle = \xi_x^T\eta_x, \quad \xi_x, \eta_x \in T_xS^{n-1}.$$
The tangent space to $S^{n-1}$, viewed as a subspace of $T_x\mathbb{R}^n \simeq \mathbb{R}^n$, is
$$T_xS^{n-1} = \{\xi \in \mathbb{R}^n : x^T\xi = 0\},$$
the normal space is
$$(T_xS^{n-1})^{\perp} = \{x\alpha : \alpha \in \mathbb{R}\},$$
and the corresponding projections are given by
$$P_x\xi_x = (I - xx^T)\xi_x, \qquad P_x^{\perp}\xi_x = xx^T\xi_x, \quad x \in S^{n-1}.$$
From Section 4 in [21], it follows that $R_x(\eta_x) = \frac{x + \eta_x}{\|x + \eta_x\|}$ for $\eta_x \in T_xS^{n-1}$. The following functions are considered on the unit sphere $S^{n-1}$, viewed as a Riemannian submanifold of the Euclidean space $\mathbb{R}^n$:
$$f : S^{n-1} \to \mathbb{R} : x \mapsto \frac{1}{2}\|Ax - b\|^2,$$
and
$$g : S^{n-1} \to \mathbb{R} : x \mapsto \mu\|x\|_1.$$
Furthermore, let
$$\bar{f} : \mathbb{R}^n \to \mathbb{R} : x \mapsto \frac{1}{2}\|Ax - b\|^2,$$
and
$$\bar{g} : \mathbb{R}^n \to \mathbb{R} : x \mapsto \mu\|x\|_1,$$
whose restrictions to $S^{n-1}$ are $f$ and $g$, respectively. From Section 4 in [21], it follows that
$$\mathrm{grad} f(x) = P_x\,\mathrm{grad}\bar{f}(x) = P_x\left[A^T(Ax - b)\right] = (I - xx^T)A^T(Ax - b).$$
It is easy to obtain
$$\mathrm{prox}_{t\bar{g}}(x) = \mathrm{sign}(x) \odot \max\{|x| - t\mu, 0\}.$$
From Algorithm 1, pick $L_k > 0$ as in (9), and compute
$$\xi_k := -\frac{1}{L_k}\mathrm{grad} f(x_k) = -\frac{1}{L_k}(I - x_k x_k^T)A^T(Ax_k - b),$$
$$\eta_k := P_{x_k}\left(\mathrm{prox}_{\frac{1}{L_k}\bar{g}}(\xi_k)\right) = (I - x_k x_k^T)\left(\mathrm{sign}(\xi_k) \odot \max\left\{|\xi_k| - \frac{\mu}{L_k}, 0\right\}\right),$$
and $x_{k+1} = R_{x_k}(\eta_k) = \frac{x_k + \eta_k}{\|x_k + \eta_k\|}$. Set $\mu = 10^{-3}$ and $L_k = \lambda_{\max}(A^TA)$. Let $\{x_k\}$ be the sequence generated by Algorithm 1. It is easy to check that all the assumptions of Theorem 5 are satisfied, so the sequence of function values $\{F(x_k)\}$ converges to the optimal value sublinearly.
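A minimal NumPy implementation of this example is sketched below, using synthetic random data $A$ and $b$ and the constant parameter $L_k = \lambda_{\max}(A^TA)$ chosen above; it is an illustrative sketch under these assumptions, not the author's original code, and the stopping tolerance and iteration cap are arbitrary.

```python
import numpy as np

def sphere_proximal_gradient(A, b, mu=1e-3, max_iter=500, tol=1e-10, seed=0):
    # Proximal gradient iteration of Example 1 on the unit sphere S^{n-1}:
    # f(x) = 0.5 * ||Ax - b||^2, g(x) = mu * ||x||_1, with retraction
    # R_x(eta) = (x + eta) / ||x + eta|| and constant L_k = lambda_max(A^T A).
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                        # start on the sphere
    L = np.linalg.eigvalsh(A.T @ A)[-1]           # L_k = lambda_max(A^T A)
    for _ in range(max_iter):
        P = np.eye(n) - np.outer(x, x)            # projection onto T_x S^{n-1}
        xi = -(P @ (A.T @ (A @ x - b))) / L       # xi_k = -(1/L) grad f(x_k)
        eta = P @ (np.sign(xi) * np.maximum(np.abs(xi) - mu / L, 0.0))  # prox then project
        if np.linalg.norm(eta) <= tol:            # stopping criterion eta_k = 0
            break
        x = (x + eta) / np.linalg.norm(x + eta)   # retraction x_{k+1} = R_{x_k}(eta_k)
    return x

# Example usage with synthetic data (illustrative only):
A = np.random.default_rng(1).standard_normal((20, 10))
b = np.random.default_rng(2).standard_normal(20)
x_star = sphere_proximal_gradient(A, b)
```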

5. Conclusions

The proximal gradient method is a popular optimization algorithm for solving composite optimization problems for Riemannian manifolds. In this context, the algorithm minimizes the sum of a smooth function and a nonsmooth function, where both functions are defined on the Riemannian manifold. In this paper, the global convergence result of the proximal gradient method for the composite optimization problem is established, and the sublinear convergence rate and the complexity result of the proximal gradient method for the convex case are also obtained for Riemannian manifolds. Future research directions involve establishing explicit asymptotic and non-asymptotic convergence rates and developing a numerically competitive proximal gradient method for the composite optimization problem for Riemannian manifolds. Furthermore, a version of the block proximal gradient method in which, at each iteration, a prox-grad step is performed at a randomly chosen block will be considered on Riemannian manifolds.

Funding

This research was funded by the Young Scientists Fund of the National Natural Science Foundation of China (No. 11901485), the Natural Science Foundation of Sichuan (No.2023NSFSC1354), and the Fundamental Research Funds for the Central Universities (No. PHD2023-057).

Data Availability Statement

Some or all data, models, or code generated or used during the study are available in a repository or online in accordance with funder data retention policies.

Acknowledgments

The author expresses gratitude for the support of the Young Scientists Fund of the National Natural Science Foundation of China (No. 11901485), the Natural Science Foundation of Sichuan (No. 2023NSFSC1354), and the Fundamental Research Funds for the Central Universities (No. PHD2023-057).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Nesterov, Y. Gradient methods for minimizing composite functions. Math. Program. 2013, 140, 125–161. [Google Scholar] [CrossRef]
  2. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
  3. Liu, J.; Zhao, S.; Ji, P.; Ye, Y.; Luo, Z.Q. Block successive upper-bound minimization for solving a class of composite optimization problems. Math. Program. 2015, 149, 371–404. [Google Scholar]
  4. Chen, X.; Lin, L. Smoothing methods for nonsmooth, nonconvex minimization. J. Comput. Appl. Math. 2012, 134, 71–99. [Google Scholar] [CrossRef]
  5. Becker, S.; Candes, E.J. An Iteratively Reweighted Least Squares Algorithm for Sparse Regularization. J. Comput. Graph. Stat. 2011, 22, 985–1008. [Google Scholar]
  6. Boyd, S.; Chu, N.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  7. Tseng, P. Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization. J. Optim. Theory Appl. 2001, 109, 475–494. [Google Scholar] [CrossRef]
  8. Sahu, D.R.; Yao, J.C.; Verma, M.; Shukla, K.K. Convergence rate analysis of proximal gradient methods with applications to composite minimization problems. Optim. J. Math. Program. Oper. Res. 2021, 70, 75–100. [Google Scholar] [CrossRef]
  9. Parikh, N.; Boyd, S. Proximal Algorithms. Found. Trends Optim. 2013, 1, 123–231. [Google Scholar]
  10. Li, Q.J.; Li, T.; Guo, K. A note on the (accelerated) proximal gradient method for composite convex optimization. J. Nonlinear Convex Anal. 2022, 23, 2847–2857. [Google Scholar]
  11. Beck, A. First-Order Methods in Optimization; Society for Industrial and Applied Mathematics and the Mathematical Optimization Society: Philadelphia, PA, USA, 2017. [Google Scholar]
  12. Boumal, N.; Mishra, B.; Absil, P.A.; Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 2014, 15, 1455–1459. [Google Scholar]
  13. Sakai, T. Riemannian Geometry; Translations of Mathematical Monographs, American Mathematical Society: Providence, RI, USA, 1996. [Google Scholar]
  14. Huang, W.; Gallivan, K.A.; Absil, P.A. A Broyden class of quasi-newton methods for Riemannian optimization. SIAM J. Optim. 2015, 25, 1660–1685. [Google Scholar] [CrossRef]
  15. Ring, W.; Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim. 2012, 22, 596–627. [Google Scholar] [CrossRef]
  16. Li, X.; Ge, X.; Tu, K. The generalized conditional gradient method for composite multiobjective optimization problems on Riemannian manifolds. J. Nonlinear Var. Anal. 2023, 7, 839. [Google Scholar]
  17. Bento, G.C.; Ferreira, O.P.; Oliveira, P.R. Unconstrained steepest descent method for multicriteria optimization on Riemannian manifolds. J. Optim. Theory Appl. 2012, 154, 88–107. [Google Scholar] [CrossRef]
  18. Neto, J.X.d.; Oliveira, P.R. Geodesic Methods in Riemannian Manifolds; Research Report; PESC-COPPE-UFRJ: Rio de Janeiro, Brazil, 1995. [Google Scholar]
  19. Burago, D.; Burago, Y.; Ivanov, S. A Course in Metric Geometry; American Mathematical Society: Providence, RI, USA, 2001. [Google Scholar]
  20. Huang, W. Optimization Algorithms on Riemannian Manifolds with Applications. Ph.D. Thesis, Department of Mathematics, Florida State University, Tallahassee, FL, USA, 2013. [Google Scholar]
  21. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008. [Google Scholar]
  22. Bento, G.C.; Ferreira, O.P.; Oliveira, P.R. Proximal point method for a special class of nonconvex functions on Hadamard manifolds. Optim. J. Math. Program. Oper. Res. 2015, 3, 289–319. [Google Scholar] [CrossRef]
  23. Feng, S.; Huang, W.; Song, L.; Ying, S.; Zeng, T. Proximal gradient method for nonconvex and nonsmooth optimization on Hadamard manifolds. Optim. Lett. 2022, 8, 2277–2297. [Google Scholar] [CrossRef]
  24. Chavel, I. Riemannian Geometry-A Modern Introduction; Cambridge University Press: London, UK, 1993. [Google Scholar]
  25. Klingenberg, W. A Course in Differential Geometry; Springer: Berlin/Heidelberg, Germany, 1978. [Google Scholar]