1. Introduction
Breast cancer is the most common type of cancer in Thai women. Worryingly, although breast cancer can be treated, the treatment carries a very high risk of diseases that affect the heart or blood vessels.
The three most common methods for treating breast cancer are surgery, chemotherapy and radiotherapy. However, radiotherapy often involves some incidental exposure of the heart to ionizing radiation. It was shown in [1] that exposure of the heart to ionizing radiation during therapy increases the subsequent rate of ischemic heart disease, which begins within a few years after exposure and continues for at least 20 years. Thus, women with preexisting cardiac risk factors experience higher absolute increases in risk from this therapy than other women.
Therefore, a patient who is diagnosed with heart disease early can avoid the risks of this type of treatment. Similarly, when cancer is detected at an early stage, the malignant cells can be treated before they spread to other parts of the body. To support the diagnosis of breast cancer and heart disease, our objective in this work is to develop an algorithm for predicting such patients.
It is well known that symmetry serves as the foundation for fixed-point and optimization theory and methods. We first recall the background of some mathematical models. Consider the constrained minimization problem:
  min_{x ∈ S} ω(x),  (1)
where H is a real Hilbert space, ω : H → ℝ is a strongly convex differentiable function with convexity parameter σ, and S is the nonempty set of minimizers of the unconstrained minimization problem:
  min_{x ∈ H} f(x) + g(x),  (2)
where g : H → ℝ ∪ {+∞} is a proper convex and lower semicontinuous function and f : H → ℝ is a smooth function. Problems (1) and (2) are called the outer-level and inner-level problems, respectively. In [2,3,4,5], such a problem is labeled a simple bilevel optimization problem.
In 2017, Sabach and Shtern [6] introduced the Bilevel Gradient Sequential Averaging Method (BiG-SAM) for solving (1) and (2), as defined by Algorithm 1.
Algorithm 1 BiG-SAM: Bilevel Gradient Sequential Averaging Method
1: Initial step. Let λ ∈ (0, 1/L_f] and s ∈ (0, 2/(L_ω + σ)], and let {α_k} be a sequence in (0, 1] satisfying the conditions assumed in [7]. Select an arbitrary x_1 ∈ H, where L_f is the Lipschitz constant of ∇f and L_ω is the Lipschitz constant of ∇ω.
2: Step 1. For k ≥ 1, compute
  y_k = prox_{λg}(x_k − λ∇f(x_k)),
  z_k = x_k − s∇ω(x_k),
  x_{k+1} = α_{k+1} z_k + (1 − α_{k+1}) y_k,
where ∇f and ∇ω are the gradients of f and ω, respectively.
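For readers who want to experiment, the following is a minimal NumPy sketch of one possible BiG-SAM loop for an ℓ1-regularized least-squares inner problem with outer objective ω(x) = ½‖x‖²; the data (A, b), the step sizes and the choice α_k = 1/(k + 1) are illustrative assumptions, not the exact setting of [6].

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def big_sam(A, b, lam=1e-3, n_iter=500):
    # Inner level (2): f(x) = ||Ax - b||^2 (smooth), g(x) = lam*||x||_1 (nonsmooth).
    # Outer level (1): w(x) = 0.5*||x||^2, so grad w(x) = x and L_w = sigma = 1.
    Lf = 2 * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
    step = 1.0 / Lf                           # lambda in (0, 1/L_f]
    s = 0.5                                   # s in (0, 2/(L_w + sigma)] = (0, 1]
    x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        alpha = 1.0 / (k + 1)                 # alpha_k -> 0, sum alpha_k = infinity
        grad_f = 2 * A.T @ (A @ x - b)
        y = soft_threshold(x - step * grad_f, step * lam)  # prox-gradient step
        z = x - s * x                         # gradient step on the outer objective
        x = alpha * z + (1 - alpha) * y       # sequential averaging
    return x
```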
They showed that BiG-SAM is simpler and cheaper than the method described in [8]. Moreover, the authors in [6] used a numerical example to show that BiG-SAM outperforms the method in [8] for solving problems (1) and (2). Up to that point, the algorithm in [6] seemed to be the most efficient method for convex simple bilevel optimization problems.
In 2019, Shehu et al. [9] utilized the inertial technique proposed by Polyak [10] to accelerate the convergence rate of BiG-SAM; the resulting method, called iBiG-SAM, is defined by Algorithm 2.
Algorithm 2 iBiG-SAM: Inertial Bilevel Gradient Sequential Averaging Method
1: Initial step. Let L_f and L_ω be the Lipschitz constants of ∇f and ∇ω, respectively. Given λ ∈ (0, 1/L_f], s ∈ (0, 2/(L_ω + σ)], let {α_k} be a sequence in (0, 1), {ε_k} a positive sequence with ε_k = o(α_k) and θ ≥ 3. Select arbitrary points x_0, x_1 ∈ H.
2: Step 1. Choose θ_k such that 0 ≤ θ_k ≤ θ̄_k, where, for k ≥ 1,
  θ̄_k = min{(k − 1)/(k + θ − 1), ε_k/‖x_k − x_{k−1}‖} if x_k ≠ x_{k−1}, and θ̄_k = (k − 1)/(k + θ − 1) otherwise.
3: Step 2. Compute
  w_k = x_k + θ_k(x_k − x_{k−1}),
  y_k = prox_{λg}(w_k − λ∇f(w_k)),
  z_k = w_k − s∇ω(w_k),
  x_{k+1} = α_k z_k + (1 − α_k) y_k,
where ∇f and ∇ω are the gradients of f and ω, respectively.
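The essential difference from Algorithm 1 is the inertial extrapolation w_k = x_k + θ_k(x_k − x_{k−1}). Below is a hedged sketch of one common way to pick θ_k with the cap (k − 1)/(k + θ − 1) and a summable safeguard ε_k; the exact constants used in [9] may differ.

```python
import numpy as np

def inertial_parameter(k, x_curr, x_prev, theta=3.0):
    # theta_k in [0, theta_bar_k]; the eps_k/||x_k - x_{k-1}|| safeguard keeps
    # the inertial term small enough to preserve convergence (assumed rule).
    cap = (k - 1) / (k + theta - 1)
    eps_k = 1.0 / k ** 2                      # summable safeguard sequence (assumption)
    diff = np.linalg.norm(x_curr - x_prev)
    return cap if diff == 0 else min(cap, eps_k / diff)

# One inertial step, then proceed exactly as in Algorithm 1:
# w = x_curr + inertial_parameter(k, x_curr, x_prev) * (x_curr - x_prev)
```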
They also proved that the sequence {x_k} generated by iBiG-SAM converges to the optimal solution of problems (1) and (2) when the sequence {α_k} satisfies the following conditions:
- (1) lim_{k→∞} α_k = 0;
- (2) Σ_{k=1}^∞ α_k = ∞.
The above assumptions are derived from [7] by relaxing some of the conditions there.
Recently, to accelerate the convergence of the iBiG-SAM algorithm, Duan and Zhang [11] proposed three inertial approximation methods based on the proximal gradient algorithm, defined by Algorithms 3–5.
Algorithm 3 aiBiG-SAM: The alternated inertial Bilevel Gradient Sequential Averaging Method
1: Initial step. Let L_f and L_ω be the Lipschitz constants of ∇f and ∇ω, respectively. Given θ ≥ 3, let {α_k} be a sequence in (0, 1) satisfying the conditions assumed in [9]. Select arbitrary points x_0, x_1 ∈ H and set k = 1.
2: Step 1.
3: When k is odd, choose θ_k such that 0 ≤ θ_k ≤ θ̄_k, with θ̄_k defined as in Algorithm 2.
4: When k is even, set θ_k = 0.
5: Step 2. Compute
  w_k = x_k + θ_k(x_k − x_{k−1}),
  y_k = prox_{λg}(w_k − λ∇f(w_k)),
  z_k = w_k − s∇ω(w_k),
  x_{k+1} = α_k z_k + (1 − α_k) y_k,
where ∇f and ∇ω are the gradients of f and ω, respectively.
6: Step 3. If x_{k+1} = x_k, then stop. Otherwise, set k := k + 1 and go to Step 1.
Algorithm 4 miBiG-SAM: The multi-step inertial Bilevel Gradient Sequential Averaging Method
1: Initial step. Let L_f and L_ω be the Lipschitz constants of ∇f and ∇ω, respectively. Given θ ≥ 3 and the number of inertial steps, let {α_k} be a sequence in (0, 1) satisfying the conditions assumed in [9]. Select arbitrary points x_0, x_1 ∈ H and set k = 1.
2: Step 1. Given the previous iterates x_k, x_{k−1}, …, compute the multi-step inertial extrapolation w_k, choosing the inertial weights θ_k such that 0 ≤ θ_k ≤ θ̄_k, with θ̄_k defined as in Algorithm 2.
3: Step 2. Compute
  y_k = prox_{λg}(w_k − λ∇f(w_k)),
  z_k = w_k − s∇ω(w_k),
  x_{k+1} = α_k z_k + (1 − α_k) y_k,
where ∇f and ∇ω are the gradients of f and ω, respectively.
4: Step 3. If x_{k+1} = x_k, then stop. Otherwise, set k := k + 1 and go to Step 1.
Algorithm 5 amiBiG-SAM: The multi-step alternated inertial Bilevel Gradient Sequential Averaging Method
1: Initial step. As in Algorithm 4, with the inertial extrapolation applied in the alternated fashion of Algorithm 3.
2: Step 1. Given the previous iterates, compute the multi-step inertial extrapolation w_k on odd iterations, choosing θ_k such that 0 ≤ θ_k ≤ θ̄_k with θ̄_k defined as in Algorithm 2, and set w_k = x_k on even iterations.
3: Step 2. Compute
  y_k = prox_{λg}(w_k − λ∇f(w_k)),
  z_k = w_k − s∇ω(w_k),
  x_{k+1} = α_k z_k + (1 − α_k) y_k,
where ∇f and ∇ω are the gradients of f and ω, respectively.
4: Step 3. If x_{k+1} = x_k, then stop. Otherwise, set k := k + 1 and go to Step 1.
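The distinguishing feature of Algorithms 3–5 is when the inertial term is applied. A schematic, self-contained sketch of the alternated rule of Algorithm 3 (extrapolation only on odd iterations), under the same illustrative safeguard as in the iBiG-SAM sketch above:

```python
import numpy as np

def alternated_inertial_parameter(k, x_curr, x_prev, theta=3.0):
    # Alternated inertia (Algorithm 3): extrapolate only on odd iterations.
    if k % 2 == 0:
        return 0.0
    cap = (k - 1) / (k + theta - 1)            # same cap as in the iBiG-SAM sketch
    diff = np.linalg.norm(x_curr - x_prev)
    return cap if diff == 0 else min(cap, (1.0 / k ** 2) / diff)
```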
The convergence behavior of Algorithms 3–5 was shown in [11] to be better than that of BiG-SAM and iBiG-SAM.
It is known that the variational inequality
  ⟨∇ω(x*), x − x*⟩ ≥ 0 for all x ∈ S
implies that x* is a solution of the convex bilevel optimization problem (1); for more details, see [12]. For recent results, see [13,14] and the references therein.
It is worth noting that x* can be described by the fixed-point equation:
  x* = prox_{λg}(x* − λ∇f(x*)),
where λ > 0 and prox is the proximity operator, which was introduced by Moreau [15]. This means that solving the bilevel problem is equivalent to finding a fixed point of the proximal operator. It is well known that fixed-point theory plays a very crucial role in solving many real-world problems, such as problems in engineering, economics, machine learning and data science; see [16,17,18,19,20,21,22,23,24] for more details. Over the past three decades, several fixed-point algorithms have been introduced and studied by many authors; see [25,26,27,28,29,30,31,32,33,34]. Some of these algorithms have been applied to various problems in image and signal processing, data classification and regression; see, for example, [19,20,21,22,23]. In addition, fuzzy classification is another important data classification mechanism; see [35,36].
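As a small numerical illustration of this fixed-point characterization, take g = λ‖·‖₁, whose proximity operator has the closed form of soft-thresholding; a point returned by the plain forward–backward iteration is left essentially unchanged by one further application of the operator. The random data below are purely illustrative.

```python
import numpy as np

def prox_l1(v, tau):
    # Moreau's proximity operator of tau*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
lam = 0.1
L = 2 * np.linalg.norm(A, 2) ** 2             # Lipschitz constant of grad f
FB = lambda x: prox_l1(x - (1 / L) * (2 * A.T @ (A @ x - b)), lam / L)

x = np.zeros(10)
for _ in range(2000):                          # forward-backward iteration
    x = FB(x)
print(np.linalg.norm(x - FB(x)))               # near zero: x is (numerically) a fixed point
```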
All of the works mentioned above motivate and inspire us to establish a new accelerated algorithm to solve a convex bilevel optimization problem and apply it for solving data classification problems.
We organize the paper as follows: In Section 2, we provide some basic definitions and useful lemmas used in the later sections. The main results of the paper are given in Section 3, in which we introduce and study a new accelerated algorithm for solving a convex bilevel optimization problem and prove the strong convergence of the proposed algorithm. After that, we apply our main results to solving a data classification problem in Section 4. Finally, a brief conclusion of the paper is given in Section 5.
2. Preliminaries
Throughout this paper, H denotes a real Hilbert space with inner product ⟨·,·⟩ and induced norm ‖·‖.
A mapping T : H → H is called L-Lipschitz if there exists L > 0 such that ‖Tx − Ty‖ ≤ L‖x − y‖ for all x, y ∈ H. If L ∈ [0, 1), then T is called a contraction. It is called nonexpansive if L = 1. We denote by Fix(T) the set of all fixed points of T, that is, Fix(T) = {x ∈ H : Tx = x}. For a sequence {x_k} in H, we denote the strong convergence and the weak convergence of {x_k} to x by x_k → x and x_k ⇀ x, respectively.
Let {T_k} and ℑ be families of nonexpansive operators from C into itself with ∅ ≠ Fix(ℑ) ⊆ ⋂_{k=1}^∞ Fix(T_k), where Fix(ℑ) is the set of all common fixed points of ℑ and Fix(T_k) is the set of all fixed points of T_k. The sequence {T_k} is said to satisfy the NST-condition (I) with ℑ if, for every bounded sequence {x_k} in C, lim_{k→∞} ‖x_k − T_k x_k‖ = 0 implies lim_{k→∞} ‖x_k − T x_k‖ = 0 for every T ∈ ℑ; see [37] for more details. In particular, if ℑ = {T}, then {T_k} is a sequence satisfying the NST-condition (I) with T.
Later, the NST*-condition was proposed by Nakajo et al. [38]; it is a weaker condition than the NST-condition (I). A sequence {T_k} is said to satisfy the NST*-condition if, for every bounded sequence {x_k} in C, lim_{k→∞} ‖x_k − T_k x_k‖ = 0 and lim_{k→∞} ‖x_k − x_{k+1}‖ = 0 imply ω_w(x_k) ⊆ ⋂_{k=1}^∞ Fix(T_k), where ω_w(x_k) is the set of all weak cluster points of {x_k}. It is easy to see that if {T_k} satisfies the NST-condition (I), then it satisfies the NST*-condition.
In a real Hilbert space H, the following properties hold for any x, y ∈ H and λ ∈ [0, 1]:
- (1) ‖x + y‖² ≤ ‖x‖² + 2⟨y, x + y⟩;
- (2) ‖λx + (1 − λ)y‖² = λ‖x‖² + (1 − λ)‖y‖² − λ(1 − λ)‖x − y‖².
If C is a nonempty closed convex subset of H, then for each x ∈ H, there exists a unique element in C, denoted P_C x, such that ‖x − P_C x‖ ≤ ‖x − y‖ for all y ∈ C. The mapping P_C is known as the metric projection of H onto C, and it is also nonexpansive. Moreover, ⟨x − P_C x, y − P_C x⟩ ≤ 0 holds for all x ∈ H and y ∈ C.
The following results are also essential for proving our main results.
Lemma 1 ([39]). Let {a_k} and {b_k} be sequences of nonnegative real numbers, {α_k} a sequence in (0, 1) and {t_k} a sequence of real numbers such that
  a_{k+1} ≤ (1 − α_k)a_k + α_k t_k + b_k for all k ∈ ℕ.
If all the following conditions hold:
- (1) Σ_{k=1}^∞ α_k = ∞;
- (2) lim sup_{k→∞} t_k ≤ 0;
- (3) Σ_{k=1}^∞ b_k < ∞,
then lim_{k→∞} a_k = 0.
Lemma 2 ([40]). Let H be a real Hilbert space and T : H → H a nonexpansive mapping with Fix(T) ≠ ∅. Then, for any sequence {x_k} in H, x_k ⇀ x and lim_{k→∞} ‖x_k − T x_k‖ = 0 imply x = Tx.
Lemma 3 ([41]). Let {Γ_k} be a sequence of real numbers that does not decrease at infinity, in the sense that there exists a subsequence {Γ_{k_j}} of {Γ_k} which satisfies Γ_{k_j} < Γ_{k_j + 1} for all j ∈ ℕ. Define the sequence {τ(k)} of integers as follows:
  τ(k) = max{i ≤ k : Γ_i < Γ_{i+1}},
where k ≥ k₀ is such that {i ≤ k₀ : Γ_i < Γ_{i+1}} ≠ ∅. Then, the following hold:
- (1) τ(k₀) ≤ τ(k₀ + 1) ≤ ⋯ and τ(k) → ∞;
- (2) Γ_{τ(k)} ≤ Γ_{τ(k)+1} and Γ_k ≤ Γ_{τ(k)+1} for all k ≥ k₀.
Proposition 1 ([6]). Suppose ω : H → ℝ is strongly convex with convexity parameter σ and continuously differentiable such that ∇ω is Lipschitz continuous with constant L_ω. Then, the mapping I − s∇ω is a contraction for all s ∈ (0, 2/(L_ω + σ)], where I is the identity operator.
Definition 1 ([15]). Let ψ : H → ℝ ∪ {+∞} be a proper convex and lower semicontinuous function. The proximity operator of parameter λ > 0 of ψ at x ∈ H is denoted by prox_{λψ}(x) and is defined by
  prox_{λψ}(x) = argmin_{y ∈ H} { ψ(y) + (1/(2λ))‖x − y‖² }.
The operator T = prox_{λψ}(I − λ∇ϕ) is known as the forward–backward operator of ϕ and ψ with respect to λ, where λ > 0 and ∇ϕ is the gradient operator of the function ϕ. Moreover, T is a nonexpansive mapping whenever λ ∈ (0, 2/L), where L is the Lipschitz constant of ∇ϕ.
Lemma 4 ([42]). For a real Hilbert space H, let ψ : H → ℝ ∪ {+∞} be a proper convex and lower semicontinuous function, and let ϕ : H → ℝ be convex differentiable with ∇ϕ being L-Lipschitz for some L > 0. If {T_k} is the family of forward–backward operators of ϕ and ψ with respect to a sequence {c_k} ⊂ (0, 2/L) such that {c_k} converges to c ∈ (0, 2/L), then {T_k} satisfies the NST-condition (I) with T, where T is the forward–backward operator of ϕ and ψ with respect to c.
3. Main Results
We start this section by introducing a new common fixed-point algorithm using the inertial technique together with the modified Ishikawa iteration (see [43,44,45] for more details) to obtain a strong convergence theorem for two countable families of nonexpansive mappings in a real Hilbert space, as seen in Algorithm 6.
Algorithm 6 IVAM (I): Inertial Viscosity Approximation Method for Two Families of Nonexpansive Mappings
1: Input. Let {α_k} be a positive sequence and f a contraction with constant c. Choose x_0, x_1 ∈ H.
2: Step 1. Select the inertial parameter θ_k such that, for k ≥ 1, the prescribed bound holds.
3: Step 2. Compute the next iterate via the inertial viscosity steps.
Lemma 5. Let {T_k} and {S_k} be two countable families of nonexpansive mappings from H into itself such that F ≠ ∅, where F is the set of their common fixed points, and let f be a contraction. If the parameter condition of Algorithm 6 holds, then the sequence {x_k} generated by Algorithm 6 is bounded. Furthermore, {y_k} and {z_k} are bounded.
Proof. Let p ∈ F. Then, by the definitions of y_k and z_k in Algorithm 6, for every k ∈ ℕ, we can estimate ‖y_k − p‖ and ‖z_k − p‖. It follows from (8) and (11) that ‖x_{k+1} − p‖ admits a corresponding bound. Using the contraction property of f and (7), we obtain that there exists N ∈ ℕ such that this bound holds for all k ≥ N. By mathematical induction, we conclude that ‖x_k − p‖ ≤ M for all k ∈ ℕ, where M > 0 is a suitable constant. It follows that {x_k} is bounded. This implies that the sequences {y_k} and {z_k} are bounded. □
We now prove a strong convergence theorem of the sequence generated by Algorithm 6 to solve a common fixed point problem as follows.
Theorem 1. Let {T_k} and {S_k} be two countable families of nonexpansive mappings from H into H such that F ≠ ∅. Let {x_k} be a sequence generated by Algorithm 6. Suppose {T_k} and {S_k} satisfy the NST-conditions and the following conditions hold:
- (1) …;
- (2) …;
- (3) …;
- (4) …;
- (5) …,
where the constants involved are real positive numbers. Then, {x_k} converges strongly to x* ∈ F, where x* = P_F f(x*).
Proof. Let x* ∈ F be such that x* = P_F f(x*). It follows from (11) that an estimate on ‖z_k − x*‖ holds; together with the definitions of the iterates, this gives us a bound on ‖x_{k+1} − x*‖². Because α_k → 0, there exists k₀ ∈ ℕ such that the required inequality holds for all k ≥ k₀. Putting this together with (13) and (14) yields (15). We now set Γ_k := ‖x_k − x*‖² and collect the remaining terms into an error sequence, so that (15) takes the recursive form (16). Next, we analyze the convergence of the sequence {x_k} by considering the following two cases:
Case 1. Suppose {Γ_k} is nonincreasing for some k ≥ k₀. Because {Γ_k} is bounded from below by zero, lim_{k→∞} Γ_k exists, and hence Γ_k − Γ_{k+1} → 0. To apply Lemma 1, we need to claim that the limit superior of the error term in (16) is nonpositive. Indeed, by the definition of w_k, we can bound ‖w_k − x_k‖. By Algorithm 6, (10) and (17), we obtain an estimate which implies a bound on ‖x_{k+1} − x_k‖ for any k ≥ k₀. Because α_k → 0 and Γ_k − Γ_{k+1} → 0, we derive ‖x_{k+1} − x_k‖ → 0. From this, (20) and (21), we obtain ‖w_k − x_k‖ → 0. Moreover, we have from (9), (18) and the nonexpansiveness of T_k a further estimate; together with assumptions (3) and (4), the existence of lim_{k→∞} Γ_k and ‖x_{k+1} − x_k‖ → 0, this implies ‖y_k − w_k‖ → 0. From the definition of y_k and assumption (3), it then follows from (22) and (23) that ‖x_k − T_k x_k‖ → 0. Using the definition of z_k, together with α_k → 0, (24) and the boundedness of {x_k} and {z_k}, we obtain ‖x_k − S_k x_k‖ → 0.
Let t := lim sup_{k→∞} ⟨f(x*) − x*, x_k − x*⟩. The boundedness of {x_k} implies that there exists a subsequence {x_{k_j}} such that x_{k_j} ⇀ x and t = lim_{j→∞} ⟨f(x*) − x*, x_{k_j} − x*⟩. It follows from the nonexpansiveness of T, (19) and (21) that ‖x_{k_j} − T x_{k_j}‖ → 0. Using Lemma 2, we obtain x ∈ Fix(T). Due to S being nonexpansive, we have, for any j, an analogous estimate, which implies ‖x_{k_j} − S x_{k_j}‖ → 0 by employing (22) and (23). By Lemma 2, we obtain x ∈ Fix(S). Because x ∈ F, it follows that {x_{k_j}} converges weakly to x. In addition, utilizing x* = P_F f(x*) together with (6) gives us that t ≤ 0. Invoking this and (28), we obtain that the limit superior of the error term in (16) is nonpositive. Coming back to (16), by Lemma 1, we can conclude that Γ_k → 0, that is, x_k → x*.
Case 2. Suppose that {Γ_k} is not a monotonically decreasing sequence. To apply Lemma 3, note that there exists a subsequence {Γ_{k_j}} of {Γ_k} such that Γ_{k_j} < Γ_{k_j + 1} for all j ∈ ℕ. In this case, let {τ(k)} be defined by τ(k) := max{i ≤ k : Γ_i < Γ_{i+1}}. Therefore, {τ(k)} satisfies the condition in Lemma 3. Hence, we have Γ_{τ(k)} ≤ Γ_{τ(k)+1} and Γ_k ≤ Γ_{τ(k)+1} for all k ≥ k₀. As in the proof of Case 1, we also have, for any k, the corresponding estimates along {τ(k)}. Because Γ_{τ(k)} ≤ Γ_{τ(k)+1} for all k, the above inequality leads to a bound on the consecutive differences. Using α_{τ(k)} → 0 and the boundedness of the iterates, we obtain ‖x_{τ(k)+1} − x_{τ(k)}‖ → 0. Similar to the proof of Case 1, we conclude ‖x_{τ(k)} − T_{τ(k)} x_{τ(k)}‖ → 0, and so ‖x_{τ(k)} − S_{τ(k)} x_{τ(k)}‖ → 0.
Put t := lim sup_{k→∞} ⟨f(x*) − x*, x_{τ(k)} − x*⟩. Due to {x_{τ(k)}} being bounded, there exists a subsequence of {x_{τ(k)}}, still denoted {x_{τ(k)}}, such that x_{τ(k)} ⇀ x and t = lim_{k→∞} ⟨f(x*) − x*, x_{τ(k)} − x*⟩ for some x ∈ H. The nonexpansiveness of T and S implies the estimates (35) and (36). Taking the limit as k → ∞ in (35) and (36), we derive from (30)–(33) that ‖x_{τ(k)} − T x_{τ(k)}‖ → 0 and ‖x_{τ(k)} − S x_{τ(k)}‖ → 0. By Lemma 2, we obtain x ∈ Fix(T) ∩ Fix(S) = F. Due to x* = P_F f(x*), it follows from (6) that t ≤ 0, and thus the limit superior of the error term along {τ(k)} is nonpositive. Because Γ_{τ(k)} ≤ Γ_{τ(k)+1}, as in the proof of Case 1, we have, for every k, the recursive inequality along {τ(k)}. From this and α_{τ(k)} > 0, we obtain lim sup_{k→∞} Γ_{τ(k)} ≤ 0, which implies Γ_{τ(k)} → 0. Invoking this and (39), we obtain Γ_{τ(k)+1} → 0, and hence it follows from (34) that Γ_k → 0. By Lemma 3, we obtain x_k → x*.
Therefore, {x_k} converges strongly to x* = P_F f(x*). □
We observe that Algorithm 6 reduces to Algorithm 7 by setting S_k = T_k for all k, for finding a common fixed point of a single countable family of nonexpansive mappings of H.
Corollary 1. Let {T_k} be a countable family of nonexpansive mappings from H into itself such that F := ⋂_{k=1}^∞ Fix(T_k) ≠ ∅. Suppose {T_k} satisfies the NST-conditions and the following conditions hold:
- (1) …;
- (2) …;
- (3) …;
- (4) …;
- (5) …,
where the constants involved are real positive numbers. Then, the sequence {x_k} generated by Algorithm 7 converges strongly to x* ∈ F, where x* = P_F f(x*).
Algorithm 7 IVAM (II): Inertial Viscosity Approximation Method for a Family of Nonexpansive Mappings
1: Input. Let {α_k} be a positive sequence and f a c-contraction. Choose x_0, x_1 ∈ H.
2: Step 1. Select the inertial parameter θ_k such that, for k ≥ 1, the prescribed bound holds.
3: Step 2. Compute the next iterate via the corresponding inertial viscosity steps.
4. Application to Convex Bilevel Optimization Problems
The aim of this section is to apply our proposed algorithm to solve the following convex bilevel optimization problem:
  min_{x ∈ Ω} ω(x),  (44)
where ω is strongly convex differentiable with ∇ω being L_ω-Lipschitz continuous and Ω is the set of all common minimizers of the following unconstrained minimization problems:
  min_{x ∈ H} f_i(x) + g_i(x), i = 1, 2,  (45)
where g_i, i = 1, 2, are proper convex and lower semicontinuous functions and f_i, i = 1, 2, are differentiable functions. Problem (45) reduces to (2) if f_1 = f_2 and g_1 = g_2. As in the literature, we know that x* ∈ Ω if and only if x* = prox_{c g_i}(x* − c∇f_i(x*)) for i = 1, 2, where c > 0, while ∇f_1 and ∇f_2 are the Lipschitz gradients of f_1 and f_2, respectively. In addition, x* is also a solution of problem (44) if it satisfies the following form:
  ⟨∇ω(x*), x − x*⟩ ≥ 0 for all x ∈ Ω.  (46)
Therefore, we solve the convex bilevel optimization problems (44) and (45) by finding a common fixed point x* of the two forward–backward operators, which satisfies formulation (46).
Next, we present the algorithm derived from our main result for solving the convex bilevel optimization problem as defined by Algorithm 8.
In order to solve (44) and (45), we suppose the following conditions hold:
- (1) f is a c-contraction with c ∈ [0, 1);
- (2) …;
- (3) …;
- (4) {α_k} and {β_k} are sequences in (0, 1);
- (5) g_1 and g_2 are two proper convex and lower semicontinuous functions from H into ℝ ∪ {+∞};
- (6) f_1 and f_2 are two smooth convex loss functions, differentiable with L_i-Lipschitz continuous gradients ∇f_i, respectively;
- (7) ω is strongly convex differentiable with ∇ω being L_ω-Lipschitz, where σ is a parameter such that ω is σ-strongly convex.
Theorem 2. Let {x_k} be a sequence generated by Algorithm 8 such that all conditions as in Theorem 1 hold. Let Ω be the set of all solutions of (44). Then, {x_k} converges strongly to x* ∈ Ω, which satisfies ⟨∇ω(x*), x − x*⟩ ≥ 0 for all x ∈ Ω.
Algorithm 8 iVMBi (I): Inertial Viscosity Method for Bilevel Optimization Problem (I)
1: Input. Let {α_k} be a positive sequence. Choose x_0, x_1 ∈ H.
2: Step 1. Select the inertial parameter θ_k such that, for k ≥ 1, the prescribed bound holds.
3: Step 2. Compute the next iterate via the forward–backward and viscosity steps.
Proof. Let T_1 := prox_{c_1 g_1}(I − c_1∇f_1) and T_2 := prox_{c_2 g_2}(I − c_2∇f_2) be the forward–backward operators as in Algorithm 6, where ∇f_1 and ∇f_2 are the Lipschitz gradients of f_1 and f_2, respectively. Using Proposition 1, we get that I − s∇ω is a contraction mapping. By Theorem 1, setting f = I − s∇ω, we obtain that {x_k} converges strongly to x* ∈ Ω, where x* = P_Ω(I − s∇ω)(x*). Observe that x* is then a fixed point of the projected gradient mapping. It is derived from (6) that, for any x ∈ Ω, ⟨x* − (I − s∇ω)(x*), x − x*⟩ ≥ 0. Because s > 0, we conclude ⟨∇ω(x*), x − x*⟩ ≥ 0 for all x ∈ Ω; that is, x* is an optimal solution of problem (44). Hence, we obtain the desired result. □
Furthermore, our algorithm can be applied to solving the convex bilevel optimization problems (1) and (2) by using the same proximity operator in Steps 2 and 3, as seen in Algorithm 9.
Algorithm 9 iVMBi (II): Inertial Viscosity Method for Bilevel Optimization Problem (II)
1: Input. Let {α_k} be a positive sequence. Choose x_0, x_1 ∈ H.
2: Step 1. Select the inertial parameter θ_k such that, for k ≥ 1, the prescribed bound holds.
3: Step 2. Compute the next iterate via the forward–backward and viscosity steps.
The following result is obtained immediately from Theorem 2.
Theorem 3. Let {x_k} be a sequence generated by Algorithm 9 such that all conditions as in Corollary 1 hold. Then, {x_k} converges strongly to x*, which satisfies ⟨∇ω(x*), x − x*⟩ ≥ 0 for all x ∈ Ω; that is, x* ∈ Ω, where Ω is the set of all solutions of problems (1) and (2).
Next, we use Algorithm 9 as a machine learning algorithm for solving some data classification problems, applied to the UCI datasets of breast cancer and heart disease. Moreover, we compare the performance of Algorithm 9 with BiG-SAM, iBiG-SAM, aiBiG-SAM, miBiG-SAM and amiBiG-SAM.
In order to employ Algorithm 9 for solving data classification, we need to know the objective function of the inner level. To obtain this, we use a single-hidden-layer feedforward neural network (SLFN) model and the concept of the extreme learning machine (ELM) introduced by Huang et al. [46].
In supervised learning, we start with a training set of N samples {(x_i, t_i) : i = 1, …, N}, where x_i ∈ ℝⁿ is the input data and t_i ∈ ℝᵐ is the target. The mathematical model of ELM for SLFNs with M hidden nodes and activation function G is given by
  Σ_{j=1}^M β_j G(⟨w_j, x_i⟩ + b_j) = o_i, i = 1, …, N,
where β_j is the weight vector connecting the j-th hidden node and the output node, b_j is a bias and w_j is the weight vector connecting the j-th hidden node and the input node.
Let H be the N × M matrix whose (i, j) entry is G(⟨w_j, x_i⟩ + b_j). This matrix is known as the hidden-layer output matrix.
For a prediction or classification problem using the ELM model, we want the training error to be zero, that is, Σ_{i=1}^N ‖o_i − t_i‖ = 0. Hence, Σ_{j=1}^M β_j G(⟨w_j, x_i⟩ + b_j) = t_i for i = 1, …, N. We can write the above system of linear equations in M variables and N equations as a matrix equation as follows:
  Hm = O,  (49)
where m = [β_1, …, β_M]ᵀ and O = [t_1, …, t_N]ᵀ is the training data. To solve the ELM is to find a weight m satisfying (49). If the Moore–Penrose generalized inverse H† of H exists, then m = H†O. However, in the case that H† does not exist, we can find m as the minimizer of the following convex minimization problem:
  min_m ‖Hm − O‖²₂.  (50)
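A minimal sketch of the ELM pipeline just described: random hidden parameters, the hidden-layer output matrix H, and the pseudoinverse solve m = H†O. The sizes, the seed and the default number of hidden nodes are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def elm_train(X, T, M=100, seed=0):
    # X: (N, n) inputs, T: (N, m) targets, M: number of hidden nodes (illustrative).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], M))   # input-to-hidden weights w_j
    bias = rng.standard_normal(M)              # hidden biases b_j
    H = sigmoid(X @ W + bias)                  # hidden-layer output matrix (N, M)
    m = np.linalg.pinv(H) @ T                  # m = H^+ O (Moore-Penrose pseudoinverse)
    return W, bias, m

def elm_predict(X, W, bias, m):
    return sigmoid(X @ W + bias) @ m
```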
Using the least squares model (50) may cause an overfitting problem. In order to prevent this, regularization methods were proposed. The classical one is Tikhonov regularization [47], which was employed to solve the following minimization problem:
  min_m ‖Hm − O‖²₂ + λ‖Km‖²₂,  (51)
where λ > 0 is the regularization parameter and K is the Tikhonov matrix. In the standard form, K is set to be the identity.
Another regularization method is the least absolute shrinkage and selection operator (LASSO), which was proposed by Tibshirani [48] for solving the following convex minimization problem:
  min_m ‖Hm − O‖²₂ + λ‖m‖₁,  (52)
where λ > 0 is the regularization parameter. In this work, we set f(m) = ‖Hm − O‖²₂ and g(m) = λ‖m‖₁. Based on model (52), we can apply Algorithm 9 for solving the convex bilevel optimization problems (1) and (2), where the outer-level objective is the strongly convex function ω. We now conduct some numerical experiments for classification on the following datasets.
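To connect model (52) with the bilevel template of problems (1) and (2), the following sketch collects the ingredients one would feed to Algorithm 9 or to any of the baselines above; H and O come from the ELM training step, and the outer objective ω(m) = ½‖m‖² is an assumption chosen here for its strong convexity, not necessarily the authors' exact choice.

```python
import numpy as np

def bilevel_ingredients(H, O, lam):
    # Inner level (52): f(m) = ||Hm - O||^2 (smooth), g(m) = lam*||m||_1 (nonsmooth).
    f_grad = lambda m: 2 * H.T @ (H @ m - O)
    g_prox = lambda v, c: np.sign(v) * np.maximum(np.abs(v) - c * lam, 0.0)
    # Outer level: w(m) = 0.5*||m||^2 is 1-strongly convex with grad w(m) = m.
    w_grad = lambda m: m
    Lf = 2 * np.linalg.norm(H, 2) ** 2         # Lipschitz constant of grad f
    return f_grad, g_prox, w_grad, Lf
```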
In these experiments, we aim to classify the datasets of breast cancer and heart disease from https://archive.ics.uci.edu, accessed on 12 June 2022.
Breast cancer dataset [49]. This dataset contains 699 samples, each of which has 11 attributes. In this dataset, we classify two classes of data.
Heart disease dataset [50]. This dataset contains 303 samples, each of which has 13 attributes. In this dataset, we classify two classes of data.
Throughout these experiments, all results were obtained in MATLAB 9.6 (R2019a) running on a MacBook Air (13.3-inch, 2020) with an Apple M1 processor, an 8-core GPU and 8 GB of RAM.
In all the experiments, the sigmoid function is used as the activation function, and we set the number of hidden nodes M accordingly. The accuracy of the data classification is given by
  Accuracy = (TP + TN)/(TP + TN + FP + FN) × 100%,
where TP (true positive) is the model successfully predicting a patient as positive, TN (true negative) denotes the model successfully predicting a patient as negative, FN (false negative) represents the prediction of a diseased patient as healthy by a negative test result and FP (false positive) means the prediction of a healthy patient as diseased by a positive test result.
We also compute the success probability of making a correct positive-class classification (precision):
  Precision = TP/(TP + FP).
In addition, we measure the sensitivity of the model toward identifying the positive class (recall):
  Recall = TP/(TP + FN).
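These three measures reduce to a few lines of code; a sketch follows, where the counts are assumed to come from a confusion matrix computed elsewhere.

```python
def classification_metrics(tp, tn, fp, fn):
    # Accuracy (in percent), precision and recall as defined above.
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Example: classification_metrics(tp=80, tn=90, fp=10, fn=20) -> (85.0, 0.888..., 0.8)
```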
The Lipschitz constant L of ∇f is computed by L = 2λ_max(HᵀH), twice the largest eigenvalue of HᵀH. When the dimension of H is very large, it is hard to compute L exactly. All parameters for each algorithm in our experiments are given in Table 1.
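When H is too large for an exact eigenvalue computation, L = 2λ_max(HᵀH) can be estimated cheaply; a sketch using a few power iterations (the iteration count is an arbitrary choice):

```python
import numpy as np

def lipschitz_estimate(H, n_iter=50, seed=0):
    # Power iteration on H^T H approximates its largest eigenvalue,
    # giving L = 2 * lambda_max(H^T H) for f(m) = ||Hm - O||^2.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = H.T @ (H @ v)
        v /= np.linalg.norm(v)
    return 2 * v @ (H.T @ (H @ v))             # Rayleigh quotient of the normalized v
```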
From Table 1, we select the best choice of parameters for each algorithm in order to achieve the highest performance. It is worth noting that all parameters satisfy the assumptions of each convergence theorem; see [6,9,11] for more details. In addition, we set the regularization parameter λ of problem (52). In Algorithm 9, we choose one parameter setting for the experimentation on the breast cancer dataset, while the classification of heart disease uses another setting. We compare the performance of each method at the 100th and 500th iterations and obtain the results seen in Table 2 and Table 3, respectively.
Table 2 shows that our algorithm achieves the best accuracy at the 100th iteration. Moreover, Table 3 shows the performance of each algorithm at the 500th iteration; we found that Algorithm 9 again has better accuracy than the others.
Next, we show the prediction performance of each algorithm in terms of the number of iterations and the training time at which each algorithm achieves its highest accuracy.
From Table 4, compared with Algorithm 1 (BiG-SAM), Algorithm 2 (iBiG-SAM), Algorithm 3 (aiBiG-SAM), Algorithm 4 (miBiG-SAM) and Algorithm 5 (amiBiG-SAM), Algorithm 9 provides a higher training accuracy. In the testing case, we found that the accuracy of Algorithm 2 (iBiG-SAM) is better than that of our algorithm on the breast cancer experiment. However, our method requires the fewest iterations and the shortest training time compared with the others.
We also perform a 10-fold cross validation to appraise the performance of each algorithm, using the average accuracy as the appraisal tool. It is defined as follows:
  Average accuracy = (1/N) Σ_{i=1}^N (c_i/n_i) × 100%,
where N is the number of folds considered during the cross validation (N = 10), c_i is the number of correctly predicted data at fold i and n_i is the number of all data at fold i.
Let Err_train be the sum of errors in all 10 training sets, Err_test the sum of errors in all 10 testing sets, D_train the sum of all data in the 10 training sets and D_test the sum of all data in the 10 testing sets. Then,
  Accuracy_train = (1 − Err_train/D_train) × 100% and Accuracy_test = (1 − Err_test/D_test) × 100%.
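A sketch of the 10-fold evaluation loop implied by these formulas; `train_and_predict` is a hypothetical stand-in for any of the compared algorithms and is not part of the paper.

```python
import numpy as np

def ten_fold_average_accuracy(X, y, train_and_predict, n_folds=10, seed=0):
    # Average accuracy = (1/N) * sum over folds of (c_i / n_i) * 100.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        y_hat = train_and_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(y_hat == y[test]) * 100.0)
    return float(np.mean(accuracies))
```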
We split the data into training sets and testing sets using the 10-fold cross validation, as seen in Table 5. In Table 6, we show the average accuracy of each algorithm at the 500th iteration. Table 6 demonstrates that Algorithm 9 performs better than Algorithm 1 (BiG-SAM), Algorithm 2 (iBiG-SAM), Algorithm 3 (aiBiG-SAM), Algorithm 4 (miBiG-SAM) and Algorithm 5 (amiBiG-SAM) in terms of accuracy in all the experiments conducted.