Article

A Novel Reinforcement Learning-Based Particle Swarm Optimization Algorithm for Better Symmetry between Convergence Speed and Diversity

1 The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210001, China
2 College of Automotive Engineering, Changzhou Institute of Technology, Changzhou 213032, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(10), 1290; https://doi.org/10.3390/sym16101290
Submission received: 31 July 2024 / Revised: 11 September 2024 / Accepted: 25 September 2024 / Published: 1 October 2024
(This article belongs to the Special Issue Symmetry in Intelligent Algorithms)

Abstract

This paper introduces a novel reinforcement learning-based Particle Swarm Optimization (RLPSO) algorithm that embodies a fundamental symmetry between global and local search, a symmetry aimed at addressing the trade-off between convergence speed and diversity in traditional algorithms. Traditional Particle Swarm Optimization (PSO) algorithms often struggle to maintain both good convergence speed and particle diversity when solving multi-modal function problems. To tackle this challenge, we propose a new algorithm that incorporates the principles of reinforcement learning, enabling particles to learn intelligently and adjust their behavior for faster convergence and richer exploration of the search space. The algorithm guides particle learning through online updating of a Q-table, allowing particles to selectively learn effective information from other particles and dynamically adjust their strategies during the learning process, thereby finding a better balance between convergence speed and diversity. The results demonstrate the superior performance of this algorithm on 16 benchmark functions of the CEC2005 test suite compared with three other algorithms: the RLPSO algorithm finds all global optima within a certain error range on all 16 benchmark functions, exhibiting outstanding performance and better robustness. Additionally, the algorithm’s performance was tested on 13 benchmark functions from CEC2017, where it outperformed six other algorithms by achieving the minimum value on 11 of them. Overall, the RLPSO algorithm shows significant improvements over traditional PSO algorithms in local search strategy, adaptive parameter adjustment, convergence speed, and multi-modal problem handling, resulting in better performance and robustness in solving optimization problems. This study provides new insights and methods for the further development of Particle Swarm Optimization algorithms.

1. Introduction

The PSO algorithm is a population-based stochastic optimization technique developed by Kennedy and Eberhart in 1995 [1]. Inspired by the foraging behavior of birds, it finds the optimal solution by simulating the movement of particles in the search space. The PSO algorithm offers simple implementation, high computational efficiency, and easy convergence, and has quickly become an effective tool for solving complex optimization problems. In the algorithm, each particle represents a potential solution and iteratively updates its velocity and position. Each particle adjusts its trajectory according to its own experience and the experience of the swarm to find the global optimal solution. This simulation of natural swarm intelligence allows the PSO algorithm to perform well on high-dimensional, nonlinear, and multi-modal optimization problems.
With its fast computation and relatively good stability, the PSO algorithm has been widely applied in fields such as neural network training [2,3], fault diagnosis [4,5], economics, pattern recognition, power system optimization, signal processing, data mining, image processing, finance, and medicine. In power systems, it can be applied to generation scheduling, grid planning, and power market analysis, helping to optimize efficiency, reliability, and economy. Although the traditional PSO algorithm performs well on many problems, it also has limitations. It tends to fall into local optima when dealing with complex optimization problems, especially high-dimensional, non-convex, nonlinear problems, where particles may prematurely converge to local optima and fail to discover the global optimum. Its convergence can be slow, particularly when the search space is large or the number of particles is small. The traditional PSO algorithm also involves many parameters, such as the inertia weight and acceleration coefficients, whose values have a significant impact on performance; lacking a universal selection method, experimentation and tuning are required to determine suitable values. Additionally, the PSO algorithm is sensitive to the initial conditions and parameter settings of the problem, which makes it unstable in some cases and susceptible to noise and interference. Finally, the PSO algorithm may miss better solutions because of premature convergence, or stagnate before reaching the global optimum.
Currently, various improved and optimized PSO algorithms have been proposed by researchers to address the shortcomings of traditional PSO algorithms, such as with parameter adjustments [6,7,8,9], multi-strategy cooperative PSOs [10,11,12,13,14,15,16,17], and hybrid evolutionary algorithms [18,19,20,21,22,23,24]. Based on existing research, improvements in PSO algorithms can be mainly classified into the following three categories:
Class one: Adaptive parameter adjustment. This category of algorithms mainly adjusts the design parameters of the PSO algorithm, such as the inertia weight of the velocity and the weights given to the individual best (pbest) and global best (gbest), to change the convergence speed of the algorithm. For example, the PSO with inertia weight (PSO-w) algorithm increases the convergence speed [6], while Ratnaweera et al. proposed self-organizing strategies with time-varying acceleration coefficients [7]. Additionally, algorithms such as the Q-learning-based Particle Swarm Optimization (QLPSO) [8] and Adaptive Weighted PSO (AWPSO) [9] algorithms adjust the convergence speed by controlling algorithm parameters.
Class two: Comprehensive learning from other particles. The main idea of such algorithms is to fully utilize information from other particles, including their current positions and pbests, to update the flight speed and position of the current particle. Every particle in the swarm can comprehensively learn from other particles. In the Fully Informed Particle Swarm (FIPS) algorithm, each particle utilizes information from all neighboring particles, not just from gbest [10]. In the PSO-w-local algorithm, each particle compares the performance of every other member in the social network structure and imitates the best-performing particle [11]. The Cooperative Particle Swarm Optimization (CPSO) algorithm divides the particle swarm into multiple subgroups to optimize different components of the solution vector cooperatively. In each iteration, the CPSO algorithm uses each dimension of the particles to update gbest, avoiding the issue of “two steps forward, one step back” [12]. The Comprehensive Learning Particle Swarm Optimization (CLPSO) algorithm utilizes the pbests of all other particles to update the velocity of particles; this learning strategy can prevent premature convergence [13]. The Example-based Learning Particle Swarm Optimization (ELPSO) algorithm introduces a strategy to update particle positions using an example set of multiple global best particles [14]. The Heterogeneous Comprehensive Learning Particle Swarm Optimization (HCLPSO) algorithm divides the swarm into two subgroups, one focusing on exploration and the other on exploitation [15]. The Terminal Crossover and Direction-based Particle Swarm Optimization (TCSPSO) algorithm utilizes pbest to enhance population diversity at the terminal stage of iteration, helping particles jump out of local optima [16]. Xu et al. proposed the Two-Swarm Learning PSO (TSLPSO) algorithm, which is based on a dimensional learning strategy (DLS) for discovering and integrating promising information from the population’s best solution [17].
Class three: Hybrid particle swarm optimization. This category of algorithms integrates different optimization ideas to improve traditional PSO algorithms. For example, the PSO-GA algorithm incorporates mutation operators from genetic algorithms into the PSO algorithm [18,19]. Uriarte et al. integrated the gradient descent method (BP algorithm) as an operator into the PSO algorithm [20]. Other hybrid particle swarm optimization algorithms combine the PSO algorithm with optimization techniques such as simulated annealing [21]. Aydilek proposed a hybrid (HFPSO) algorithm that combines the advantages of the firefly algorithm and the PSO algorithm [22]. Moreover, the PSO-CL algorithm adopts a crossover learning strategy, utilizing a comprehensive learning strategy (CCL) and a stochastic example learning strategy (SEL), to balance global exploration and local exploitation capabilities [23]. Zhang et al. constructed the TLS-PSO algorithm, utilizing a worst–best example learning strategy to achieve a hybrid learning mechanism with three learning strategies in PSO [24].
Despite the improved performance of these PSO variants on their respective problems, they still have limitations when solving complex problems. For example, a high convergence speed lets particles quickly approach the individual best (pbest) and global best (gbest) points, but it may cause a loss of diversity in the particle swarm, especially when gbest and pbest are far from the global optimum but close to each other [14]. Good swarm diversity ensures the algorithm’s global search capability but may result in slow convergence. According to the “no free lunch” theorem [25], many improved PSO algorithms may still fall into local optima or converge too slowly when solving complex problems. In particular, when balancing convergence speed and diversity, existing algorithms often fail to achieve ideal results.
The RLPSO algorithm proposed in this paper is inspired by the comprehensive learning strategy of CLPSO and integrates reinforcement learning policies. Its intelligent learning strategy gives particles effective global search capabilities, dynamic adjustment of learning behavior, a good balance between convergence speed and diversity, and wide applicability. This enables more effective information sharing and utilization during the learning process, providing an efficient and intelligent optimization method for problem-solving. RLPSO has a wide range of applications and can play an important role in engineering [4], science [11], finance [19], medicine [26], and other fields by improving the performance and robustness of optimization algorithms, thus providing effective solutions to practical problems.
The remaining parts of this paper are as follows. In Section 2, the theoretical foundation of the RLPSO algorithm is briefly introduced. In Section 3, the execution process of the RLPSO algorithm is explained in detail. In Section 4, we discuss parameter selection and the role of Q-learning, while also selecting 29 benchmark functions to validate the RLPSO algorithm. Finally, conclusions with a discussion and summary are given in Section 5.

2. The Basic Principle of the CLPSO Algorithm

In the PSO algorithm, the velocity update $V_i^d$ and position update $X_i^d$ of the dth dimension of particle i are given by Equations (1) and (2), respectively [19].
$V_i^d = V_i^d + c_1 \times r_1^d \times (pbest_i^d - X_i^d) + c_2 \times r_2^d \times (gbest^d - X_i^d)$ (1)
$X_i^d = X_i^d + V_i^d \Delta t$ (2)
$X_i = (X_i^1, X_i^2, \ldots, X_i^D)$ and $V_i = (V_i^1, V_i^2, \ldots, V_i^D)$ represent the position and velocity of particle i, D is the number of dimensions, $pbest_i = (pbest_i^1, pbest_i^2, \ldots, pbest_i^D)$ is the best-so-far position of particle i, $gbest = (gbest^1, gbest^2, \ldots, gbest^D)$ is the best-so-far position of the whole swarm, $c_1$ and $c_2$ are constant weights for pbest and gbest, respectively, $r_1^d$ and $r_2^d$ are two random numbers in the range [0, 1], and $\Delta t = 1$. If $|V_i^d| > V_{max}^d$, then $V_i^d = V_{max}^d \cdot \mathrm{sign}(V_i^d)$, where $V_{max}^d$ is the maximum allowable velocity of the dth dimension.
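As a concrete reading of Equations (1) and (2), the minimal NumPy sketch below performs one velocity and position update for a single particle; the array names and the velocity-clamping helper are illustrative, not part of the original formulation.

```python
import numpy as np

def pso_update(x, v, pbest, gbest, c1=2.0, c2=2.0, v_max=None, dt=1.0):
    """One standard PSO step (Equations (1) and (2)) for a single particle.

    x, v, pbest are 1-D arrays of length D; gbest is the swarm's best position.
    """
    d = x.shape[0]
    r1, r2 = np.random.rand(d), np.random.rand(d)          # r1^d, r2^d in [0, 1]
    v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Equation (1)
    if v_max is not None:                                   # clamp |V_i^d| to V_max^d
        v = np.clip(v, -v_max, v_max)
    x = x + v * dt                                          # Equation (2), with dt = 1
    return x, v
```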
The CLPSO algorithm [13] has a solid theoretical foundation and wide application; it maintains good diversity within the swarm and is particularly effective on multimodal problems, which makes it a reliable and effective theoretical foundation for the RLPSO algorithm. At the core of CLPSO lies its updating rule, given in Equations (3) and (4).
$V_i^d = w_k \times V_i^d + c \times r \times (pbest_{f_i(d)}^d - X_i^d)$ (3)
$X_i^d = X_i^d + V_i^d \Delta t$ (4)
$w_k$ is the inertia weight at iteration k; $f_i = [f_i(1), f_i(2), \ldots, f_i(D)]$ defines which particles’ pbests particle i should follow. $pbest_{f_i(d)}^d$ can be the corresponding dimension of any particle’s pbest, as the selection of $f_i(d)$ depends on the probability $Pc_i$. The probability of selecting particle i’s own pbest is $(1 - Pc_i)$, while the probability of selecting another particle’s pbest is $Pc_i$. The value of $Pc_i$ is computed as in Equation (5).
$Pc_i = 0.05 + 0.45 \times \dfrac{\exp\left(\dfrac{10(i-1)}{ps-1}\right) - 1}{\exp(10) - 1}$ (5)
where ps is the population size of the swarm and i is the particle’s index.
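A brief sketch of how Equation (5) and the CLPSO update of Equations (3) and (4) can be coded is given below; the 0-based particle index, the (ps, D) array layout, and the default values of w and c are illustrative assumptions rather than the reference CLPSO settings.

```python
import numpy as np

def clpso_pc(i, ps):
    """Learning probability Pc_i of Equation (5), with a 0-based particle index i."""
    return 0.05 + 0.45 * (np.exp(10.0 * i / (ps - 1)) - 1.0) / (np.exp(10.0) - 1.0)

def clpso_update(x, v, pbests, fi, w=0.7, c=1.49445):
    """CLPSO update (Equations (3) and (4)) for one particle.

    pbests: (ps, D) array of personal bests; fi[d] is the index of the particle
    whose pbest dimension d learns from.
    """
    d = x.shape[0]
    r = np.random.rand(d)
    exemplar = pbests[fi, np.arange(d)]      # pbest_{f_i(d)}^d for each dimension d
    v = w * v + c * r * (exemplar - x)       # Equation (3)
    x = x + v                                # Equation (4), dt = 1
    return x, v
```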
In this paper, the objective of the PSO algorithm and its various variants is to locate the global minimum [3,4]. Figure 1 illustrates the flowchart of the CLPSO algorithm.
Figure 1 illustrates how the CLPSO algorithm updates pbest and gbest by learning from other particles, although the particles to learn from are randomly selected. This may leave certain particles unable to learn from superior ones, or learning nothing valuable from inferior ones. Consequently, learning may occasionally prove ineffective, causing the CLPSO algorithm to converge slowly and underutilize swarm information. For instance, consider the function $f(X) = \sum_{i=1}^{D} X_i^2$ and assume the following conditions:
$V_1 = (0, 0, 0)$, $w = 1$, $c = 2$, $r = 0.5$, $X_1 = (6, 2, 0)$, $pbest_1 = (4, 1, 4)$, $pbest_2 = (1, 2, 5)$, $pbest_3 = (1, 2, 1)$
The first, second, and third dimensions of this particle learn from $pbest_1$, $pbest_2$, and $pbest_3$, respectively. Updating $X_1$ according to Equations (3) and (4) yields $X_{1\_new} = (4, 2, 1)$ and $f(X_{1\_new}) = 21$, which is better than $f(X_1) = 40$ and $f(pbest_1) = 33$. So $pbest_1 = (4, 1, 4)$ is updated to $pbest_1 = X_{1\_new} = (4, 2, 1)$. Although the new fitness value is better, the second dimension of $X_{1\_new}$ is not updated (still at 2), and the third dimension moves farther away from the optimal point (0, 0, 0), shifting from 0 to 1. Each dimension of a particle can learn from a different particle, but within the CLPSO algorithm the particles chosen for learning are not necessarily good ones. Therefore, we adjust the particle learning strategy based on the CLPSO algorithm, optimizing the learning objects and enhancing learning efficiency.
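For readers who want to reproduce the numerical example above, the short script below applies Equations (3) and (4) dimension by dimension with w = 1, c = 2, r = 0.5 and checks the fitness values quoted in the text.

```python
def f(x):                      # f(X) = sum of squares
    return sum(v * v for v in x)

w, c, r = 1.0, 2.0, 0.5
x1, v1 = [6.0, 2.0, 0.0], [0.0, 0.0, 0.0]
pbests = [[4.0, 1.0, 4.0], [1.0, 2.0, 5.0], [1.0, 2.0, 1.0]]  # pbest_1..pbest_3
fi = [0, 1, 2]                 # dimension d learns from pbest_{d+1}

for d in range(3):             # Equations (3) and (4), dt = 1
    v1[d] = w * v1[d] + c * r * (pbests[fi[d]][d] - x1[d])
    x1[d] = x1[d] + v1[d]

print(x1, f(x1))               # -> [4.0, 2.0, 1.0] 21.0, better than f(X_1) = 40
```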

3. Reinforcement Learning-Based PSO Algorithm

To address the limitations of the aforementioned PSO algorithms, we introduce a novel PSO algorithm incorporating reinforcement-learning principles. In the RLPSO algorithm, particles consistently learn from superior peers while preserving diversity. Rather than randomly selecting particles from the swarm, each particle chooses its learned peers based on the Q table. Moreover, to maximize the effectiveness of each learning instance, the RLPSO algorithm updates every dimension of the global best (gbest) using all recently updated personal bests (pbests). Further insights into the RLPSO algorithm are provided below.

3.1. Q-Learning in RLPSO

Reinforcement learning (RL) is a branch of machine learning that dictates how an agent should act in its current environment to maximize cumulative rewards. Reinforcement learning has three fundamental components: state, action, and reward [26]. Q-learning, first introduced by Watkins in 1989 [27], is a specific type of RL algorithm. It involves the creation and updating of a Q table, which guides the agent’s actions based on the current state: in most cases, the agent selects the action with the highest Q value from the Q table and updates the Q table during training. The update rule is given in Equation (6).
$Q(s, a) = Q(s, a) + \alpha \times [R(s, a) + \gamma \times \max_{a'} Q(s', a') - Q(s, a)]$ (6)
α is the learning rate, γ is the discount factor, R(s, a) is the immediate reward acquired from executing action a in state s, Q(s, a) is the accumulated reward, s′ is the next state reached after executing action a in state s, a′ is a candidate action in state s′, and $\max_{a'} Q(s', a')$ is the maximum Q value obtainable in state s′. The model of Q-learning is shown in Figure 2.
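A minimal tabular Q-learning update corresponding to Equation (6) is sketched below; the 2-D array Q table and the integer state/action encoding are illustrative assumptions. The default α = 0.1 and γ = 0.95 follow the values reported in Section 4.2.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Tabular Q-learning update of Equation (6); Q is indexed as Q[state, action]."""
    td_target = reward + gamma * np.max(Q[s_next])       # R(s,a) + gamma * max_a' Q(s',a')
    Q[s, a] = Q[s, a] + alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
    return Q
```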
In the RLPSO algorithm, we randomly generate a (D × ps)-size Q table for each particle of the swarm. D is the number of particle dimensions; ps is the population size of the swarm. The Q table dictates from which particle’s pbest each dimension of every particle learns during each iteration. When updating each dimension of the particle, the particle with the highest Q value becomes the focal point of learning. In instances where the particle with the highest Q value is itself, the particle can still learn from its own past experiences. Each particle maintains its own Q table, as illustrated in Figure 3.
As shown in Figure 3, each dimension of particle i has only one state, namely how that dimension learns from other particles’ pbests. Every dimension of particle i has ps actions, that is, it selects which of the ps particles’ pbests to learn from. Thus, Equation (6) can be simplified to Equation (7):
$Q_{d, f_i(d)} = Q_{d, f_i(d)} + \alpha \times [R + \gamma \times \max Q_d - Q_{d, f_i(d)}]$ (7)
d denotes the dth dimension of particle i, $f_i(d)$ indicates which particle’s pbest the dth dimension of particle i learns from, $Q_{d, f_i(d)}$ is the Q value that the dth dimension of particle i obtains when learning from the pbest of particle $f_i(d)$, and $\max Q_d$ is the largest Q value of the dth dimension in the Q table of particle i.
To effectively harness the collective knowledge within the particle swarm, we employ Q-learning for selecting learned particles, rather than resorting to random selection from the swarm. Within the RLPSO algorithm, particles initially choose particles to learn from randomly with a certain probability, thereby exploring the solution space extensively and updating the Q table at the onset of iterations. As the iterations progress, particles gradually adjust their selection of particles to learn from based on the Q table, thereby expediting convergence and preventing divergence in certain dimensions.
Depending on the outcomes of the updates, different rewards are assigned during the Q table update process. The update strategy is outlined as follows: when gbest undergoes an update, it receives the highest “global reward” as an immediate reward for updating the Q table. Conversely, when pbest is updated, it receives a larger “local reward” as an immediate incentive. In cases where no update occurs, a “penalty” is assigned. The Q table is updated according to Equation (7).
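A sketch of the per-dimension Q-table update of Equation (7) with the reward scheme described above (a global reward when gbest improves, a local reward when only pbest improves, and a penalty otherwise) is given below. The reward values follow the setting reported in Section 4.2; the function signature is an illustrative assumption.

```python
import numpy as np

GLOBAL_REWARD, LOCAL_REWARD, PENALTY = 10.0, 2.0, -1.0    # values used in Section 4.2

def update_q_entry(Q_i, d, fid, gbest_updated, pbest_updated, alpha=0.1, gamma=0.95):
    """Equation (7): update Q_{d, f_i(d)} in the (D x ps) Q table of particle i."""
    if gbest_updated:
        R = GLOBAL_REWARD
    elif pbest_updated:
        R = LOCAL_REWARD
    else:
        R = PENALTY
    max_q_d = np.max(Q_i[d])                               # max Q value in row d
    Q_i[d, fid] += alpha * (R + gamma * max_q_d - Q_i[d, fid])
    return Q_i
```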

3.2. The Strategy of Selecting Learned Particles

During each particle’s update, the RLPSO algorithm randomly generates a number dimup from 1 to D as the number of dimensions that need updating, randomly selects dimup dimensions of the particle to update, and leaves the unselected dimensions unchanged. In the learning process of each selected dimension, the algorithm randomly selects a particle from the swarm to learn from with probability $\varepsilon_k$, and selects a particle to learn from according to the Q table with probability $(1 - \varepsilon_k)$. Then, $V_i^d$ and $X_i^d$ are updated according to Equations (3) and (4), respectively. After each iteration, $\varepsilon_k$ is updated according to Equation (8), where des is the magnitude of the decrease per iteration and k is the number of iterations.
$\varepsilon_{k+1} = \varepsilon_k \times (1 - des)$ (8)
Equation (8) reveals that as iterations progress, the likelihood of selecting learned particles based on the Q table gradually escalates. Consequently, particles begin to incrementally glean insights from superior counterparts, as dictated by the Q table.
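The ε-greedy choice of the learned particle and the decay of Equation (8) can be sketched as follows; the per-dimension selection mirrors Figure 4, the function names are illustrative, and the default des = 0.001 follows the setting discussed in Section 4.1.

```python
import numpy as np

def select_learned_particle(Q_i, d, eps_k, ps, rng=np.random.default_rng()):
    """Choose f_i(d): a random particle with probability eps_k, otherwise the argmax of row d."""
    if rng.random() < eps_k:
        return int(rng.integers(ps))      # explore: random particle index
    return int(np.argmax(Q_i[d]))         # exploit: particle with the highest Q value

def decay_epsilon(eps_k, des=0.001):
    """Equation (8): eps_{k+1} = eps_k * (1 - des)."""
    return eps_k * (1.0 - des)
```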
In contrast to the CLPSO algorithm, which relies on the fitness values of two randomly selected particles to determine learned particles, the RLPSO algorithm expedites this process by leveraging the Q table. The selection process of fi(d) for the particle i is shown in Figure 4. Simultaneously, we update the Q table of particle i in response to rewards or penalties incurred during the updating of gbest and pbest. If the dimension that requires updating is denoted by d, we update Qd,fi(d) in accordance with Equation (7). Otherwise, no updates are made. We repeat this process from d = 1 until d = D, resulting in the complete update of particle i’s Q table.

3.3. The Full Flow of the RLPSO Algorithm

In this section, we present the specific process of RLPSO, outlined as follows:
Step 1: We initialize the parameters of the RLPSO algorithm, including the position X, associated velocities V, Q table, ε0, pbest, and gbest of population; set k = 0.
Step 2: Each dimension of the particle i selects the learned particle fi(d) according to Figure 4 to obtain a new velocity Vi and position Xi.
Step 3: For the particle i, if the pbest is updated, we update each dimension of gbest based on the 1 to D dimensions of the pbest. This ensures that certain dimensions of gbest do not stray too far from the optimal point. We can view each dimension of pbest as a gene, with gbest selectively inheriting useful genes from pbest. If gbest is not updated, we set flagi = flagi + 1. If flagi ≥ m, run the PSO algorithm and reset flagi = 0. Different from the CLPSO algorithm where flagi is reset to 0 when pbest is updated, the RLPSO algorithm only resets flagi to 0 when gbest is updated. This encourages particles to choose learned particles based on Q-learning more frequently.
Step 4: If gbest is updated, the particle i receives a global reward; if gbest is not updated but pbest is, the particle i receives a local reward; if neither gbest nor pbest is updated, particle i incurs a penalty. Subsequently, we update the Q table of particle i based on the rewards or penalties obtained from updating gbest and pbest.
Step 5: We set i = i +1; repeat Step 2, Step 3, and Step 4 until i equals the population size ps.
Step 6: We set k = k + 1; repeat Step 2, Step 3, Step 4, and Step 5 until k equals the maximum number of iterations.
A detailed description of algorithm pseudocode is shown in Algorithm 1, and the entire flowchart of the RLPSO algorithm is illustrated in Figure 5. The parameters shown in Algorithm 1 and Figure 5 are consistent with those set in Figure 1.
Algorithm 1: RLPSO algorithm
Input: Initialize the position X, associated velocities V, Q table, ε0, pbest, and gbest of the population; set k = 0, flag = 0, m = 10;
Output: Optimal solution (gbest);
for k < max_gen do
  for i < ps do
    if flagi ≥ m then
      run the PSO algorithm and reset flagi = 0;
    end
    for d < D do
      each dimension of particle i selects the learned particle fi(d) according to Figure 4 to obtain the new velocity Vi and position Xi;
    end
    if Fit(Xi) < Fit(pbesti) then
      update the pbest of particle i;
    end
    if the pbest of particle i is updated then
      update each dimension of gbest based on the 1 to D dimensions of the pbest of particle i;
    else
      set flagi = flagi + 1;
    end
    if gbest is updated then
      R = global reward;
    else if gbest is not updated but pbest is updated then
      R = local reward;
    else if neither gbest nor pbest is updated then
      R = penalty;
    end
    update the Q table of particle i according to the reward or penalty obtained from the updating of gbest and pbest;
    i = i + 1;
  end
  k = k + 1;
end
Output: gbest
The pseudocode of the RLPSO algorithm reveals a structure consisting of two nested loops. Each iteration within the outer loop corresponds to a particle within the swarm, and within each particle’s iteration, every dimension is iterated over. Thus, the time complexity of the algorithm is influenced by the particle dimension D, the population size ps, and the number of algorithm iterations max_gen. Consequently, the algorithm’s time complexity can be inferred as O(max_gen × ps × D).

4. Experimental Validations

To test the performance of the RLPSO algorithm, a total of 16 well-known benchmark functions were selected from CEC2005 [28], and the performance of the RLPSO algorithm was compared with other PSO algorithms. The 16 benchmark functions include seven unimodal functions and nine multimodal functions to ensure the comprehensiveness of the experiment. The presented algorithm is implemented in Python 3.9, and the program was run on an Intel(R) Core(TM) i7-8565U processor @ 1.80 GHz with 8 GB of Random Access Memory (RAM). The tested benchmark functions are listed as follows:
Unimodal functions:
(1) f1: sphere model:
$f_1(x) = \sum_{i=1}^{30} x_i^2, \quad -100 \le x_i \le 100, \quad \min(f_1) = f_1(0, \ldots, 0) = 0$
(2) f2: Schwefel’s problem 2.22:
$f_2(x) = \sum_{i=1}^{30} |x_i| + \prod_{i=1}^{30} |x_i|, \quad -10 \le x_i \le 10, \quad \min(f_2) = f_2(0, \ldots, 0) = 0$
(3) f3: Schwefel’s problem 1.2:
$f_3(x) = \sum_{i=1}^{30} \left( \sum_{j=1}^{i} x_j \right)^2, \quad -100 \le x_i \le 100, \quad \min(f_3) = f_3(0, \ldots, 0) = 0$
(4) f4: Schwefel’s problem 2.21:
$f_4(x) = \max_i \{ |x_i|,\ 1 \le i \le 30 \}, \quad -100 \le x_i \le 100, \quad \min(f_4) = f_4(0, \ldots, 0) = 0$
(5) f5: generalized Rosenbrock’s function:
$f_5(x) = \sum_{i=1}^{29} \left[ 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \right], \quad -30 \le x_i \le 30, \quad \min(f_5) = f_5(1, \ldots, 1) = 0$
(6) f6: step function:
$f_6(x) = \sum_{i=1}^{30} \left( \lfloor x_i + 0.5 \rfloor \right)^2, \quad -100 \le x_i \le 100, \quad \min(f_6) = f_6(0, \ldots, 0) = 0$
(7) f7: quartic function, i.e., noise:
$f_7(x) = \sum_{i=1}^{30} i x_i^4 + \mathrm{random}[0, 1), \quad -1.28 \le x_i \le 1.28, \quad \min(f_7) = f_7(0, \ldots, 0) = 0$
Multimodal functions:
(8) f8: generalized Schwefel’s problem 2.26:
$f_8(x) = \sum_{i=1}^{30} \left( -x_i \sin\left( \sqrt{|x_i|} \right) \right), \quad -500 \le x_i \le 500, \quad \min(f_8) = f_8(420.9687, \ldots, 420.9687) = -12569.5$
(9) f9: generalized Rastrigin’s function:
$f_9(x) = \sum_{i=1}^{30} \left[ x_i^2 - 10 \cos(2 \pi x_i) + 10 \right], \quad -5.12 \le x_i \le 5.12, \quad \min(f_9) = f_9(0, \ldots, 0) = 0$
(10) f10: Ackley’s function:
$f_{10}(x) = -20 \exp\left( -0.2 \sqrt{\tfrac{1}{30} \sum_{i=1}^{30} x_i^2} \right) - \exp\left( \tfrac{1}{30} \sum_{i=1}^{30} \cos 2 \pi x_i \right) + 20 + e, \quad -32 \le x_i \le 32, \quad \min(f_{10}) = f_{10}(0, \ldots, 0) = 0$
(11) f11: generalized Griewank function:
$f_{11}(x) = \tfrac{1}{4000} \sum_{i=1}^{30} x_i^2 - \prod_{i=1}^{30} \cos\left( \tfrac{x_i}{\sqrt{i}} \right) + 1, \quad -600 \le x_i \le 600, \quad \min(f_{11}) = f_{11}(0, \ldots, 0) = 0$
(12) f12: generalized penalized function:
$f_{12}(x) = \tfrac{\pi}{30} \left\{ 10 \sin^2(\pi y_1) + \sum_{i=1}^{29} (y_i - 1)^2 \left[ 1 + 10 \sin^2(\pi y_{i+1}) \right] + (y_{30} - 1)^2 \right\} + \sum_{i=1}^{30} u(x_i, 10, 100, 4)$,
where $y_i = 1 + \tfrac{1}{4}(x_i + 1)$ and
$u(x_i, a, k, m) = \begin{cases} k (x_i - a)^m, & x_i > a \\ 0, & -a \le x_i \le a \\ k (-x_i - a)^m, & x_i < -a \end{cases}$,
$-50 \le x_i \le 50, \quad \min(f_{12}) = f_{12}(-1, \ldots, -1) = 0$
(13) f13: generalized penalized function:
$f_{13}(x) = 0.1 \left\{ \sin^2(3 \pi x_1) + \sum_{i=1}^{29} (x_i - 1)^2 \left[ 1 + 10 \sin^2(3 \pi x_{i+1}) \right] + (x_{30} - 1)^2 \left[ 1 + \sin^2(2 \pi x_{30}) \right] \right\} + \sum_{i=1}^{30} u(x_i, 5, 100, 4), \quad -50 \le x_i \le 50, \quad \min(f_{13}) = f_{13}(1, \ldots, 1) = 0$
The function u is the same as above.
(14) f14: six-hump camel-back function:
$f_{14}(x) = 4 x_1^2 - 2.1 x_1^4 + \tfrac{1}{3} x_1^6 + x_1 x_2 - 4 x_2^2 + 4 x_2^4, \quad -5 \le x_i \le 5$, $x_{\min} = (0.08983, -0.7126), (-0.08983, 0.7126), \quad \min(f_{14}) = -1.0316285$
(15) f15: Branin function:
$f_{15}(x) = \left( x_2 - \tfrac{5.1}{4 \pi^2} x_1^2 + \tfrac{5}{\pi} x_1 - 6 \right)^2 + 10 \left( 1 - \tfrac{1}{8 \pi} \right) \cos x_1 + 10, \quad -5 \le x_1 \le 10, \quad 0 \le x_2 \le 15$, $x_{\min} = (-3.142, 12.275), (3.142, 2.275), (9.425, 2.425), \quad \min(f_{15}) = 0.398$
(16) f16: Goldstein–Price function:
$f_{16}(x) = \left[ 1 + (x_1 + x_2 + 1)^2 (19 - 14 x_1 + 3 x_1^2 - 14 x_2 + 6 x_1 x_2 + 3 x_2^2) \right] \times \left[ 30 + (2 x_1 - 3 x_2)^2 (18 - 32 x_1 + 12 x_1^2 + 48 x_2 - 36 x_1 x_2 + 27 x_2^2) \right], \quad -2 \le x_i \le 2, \quad \min(f_{16}) = f_{16}(0, -1) = 3$
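As an illustration of how such benchmarks can be coded for testing, the snippet below implements two of the functions above (the sphere f1 and Rastrigin f9) in NumPy; it is a verification sketch, not the CEC2005 reference implementation.

```python
import numpy as np

def f1_sphere(x):
    """f1: sphere model, global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def f9_rastrigin(x):
    """f9: generalized Rastrigin's function, global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0))

print(f1_sphere(np.zeros(30)), f9_rastrigin(np.zeros(30)))  # -> 0.0 0.0
```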
To evaluate the performance of the RLPSO algorithm, comparisons with other algorithms were conducted under the same test parameters. The problem dimension, population size, maximum number of iterations, and number of independent runs were set uniformly to 30, 40, 5000, and 30, respectively [14]. For the RLPSO algorithm, we randomly generated a Q table of size (D × ps) for each particle of the population, and each value in the Q table was randomly generated as an integer between −40 and 0. We then compared the performance of the PSO, CLPSO, ELPSO, and RLPSO algorithms, using the experimental data reported in [14]. Table 2 presents the parameter settings for all comparison algorithms, as obtained from their respective studies.

4.1. An Analysis of the Role of Q-Learning

To explore the role that Q-learning plays in the RLPSO algorithm, we set some parameters, such as ε0 = 1 and des = 0, while keeping other parameters constant, to evaluate performance. Under these conditions, the RLPSO algorithm randomly selects learned particles without considering the Q table. We denote this configuration of the RLPSO algorithm as “Random RLPSO”. The performance of both “Random RLPSO” and “RLPSO” on 16 benchmark function problems is depicted in Figure 6, while corresponding experimental results are presented in Table 3.
Figure 6 illustrates that “RLPSO” achieves earlier convergence compared to “Random RLPSO” across almost all 16 benchmark functions. Additionally, according to the experimental results in Table 3, “RLPSO” outperforms “Random RLPSO” in terms of convergence for eight benchmark functions (f3–f5, f7, f10–f13). For the remaining eight benchmark functions (f1–f2, f6, f8–f9, f14–f16), both “RLPSO” and “Random RLPSO” converge to the optimal values, but “RLPSO” does so sooner.
The above analysis demonstrates that Q-learning can enhance the RLPSO algorithm’s fitness value by accelerating convergence at appropriate intervals, thereby achieving a better balance between convergence speed and diversity. Strategic learning proves to be more efficient than random learning, as evidenced by experimental results. Initially, random learning with a certain probability of εk allows for comprehensive exploration and utilization of population diversity. As εk decreases, particles gradually learn from superior particles based on the Q table, further expediting convergence. The adjustment of parameters ε0 and des enables control over the timing of convergence acceleration. With a maximum of 5000 iterations, we aim for enhanced convergence speed after 2000 iterations (about 40% of the total) by setting parameters to ε0 = 0.6 and des = 0.001. Consequently, εk becomes very small (εk < 0.078) after 2000 iterations.

4.2. Parameter Setting of RLPSO

The configurations of global reward, local reward, and penalty are crucial in determining the performance of the RLPSO algorithm. This section outlines an experimental approach for configuring these parameters.
We evaluated the performance of five sets of parameter configurations across benchmark functions. Table 4 presents these parameter sets and compares their performance across 16 benchmark functions. Table 4 also includes the mean and standard deviation of the optimal solutions, with the best results among the five algorithms highlighted in bold.
The results indicate that the RLPSO algorithm performs best when Global Reward = 10, Local Reward = 2, and Penalty = −1. In this scenario, the RLPSO algorithm outperforms the others on 13 benchmark functions, except for f3, f10, and f11. When Global Reward = 10 and Penalty = −1, if gbest is updated, the particle receives the Global Reward and follows the same learning strategy at least 10 times. Similarly, if pbest is updated, the particle receives the Local Reward and follows the same learning strategy at least two times. If gbest and pbest remain unchanged for an extended period, the particle adjusts its learning strategy and begins to learn from other particles’ pbests. The RLPSO algorithm balances convergence speed and particle diversity through reward and penalty. Therefore, setting Global Reward = 10, Local Reward = 2, and Penalty = −1 achieves a desirable balance between convergence speed and particle diversity. Additionally, we set α = 0.1 and γ = 0.95, following conventional Q-learning parameter settings.

4.3. Experimental Results and Analysis

In this section, we compare the RLPSO algorithm and the Random RLPSO algorithm with classical PSO, CLPSO, and ELPSO algorithms. Table 3 presents a comparison of the above five algorithms in 16 benchmark functions. The table displays the mean and standard deviation of the optimal solutions for the PSO, CLPSO, ELPSO, Random RLPSO, and RLPSO algorithms, with the best results among the five algorithms highlighted in bold.
The results of the t-test are presented in the last column of Table 3. At a 95% confidence level, when s = 1, this indicates that the performance differences between the RLPSO algorithm and the PSO, CLPSO, and ELPSO algorithms are statistically significant. Conversely, when s = 0, it suggests no statistically significant differences. Among the comparisons for the 16 benchmark functions in Table 3, 12 exhibited statistically significant disparities.
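A two-sample t-test of the kind reported in Table 3 can be reproduced, for example, with SciPy as sketched below; the arrays of best fitness values over 30 runs are illustrative placeholders, and the exact test variant used in the paper is not stated.

```python
import numpy as np
from scipy import stats

# best fitness values over 30 independent runs (illustrative placeholder data)
rlpso_runs = np.random.rand(30) * 1e-6
other_runs = np.random.rand(30) * 1e-3

t_stat, p_value = stats.ttest_ind(rlpso_runs, other_runs, equal_var=False)
s = 1 if p_value < 0.05 else 0   # s = 1: difference significant at the 95% confidence level
print(t_stat, p_value, s)
```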
As shown in Table 3, the RLPSO algorithm demonstrates superior performance in solving unimodal function problems (f1–f5, f7) compared to all other algorithms. When tackling multimodal function problems (f8–f16), RLPSO outperforms in seven cases (f8, f9, f12–f16). Overall, the RLPSO algorithm emerges as the best solution in 14 out of 16 function problems.
Despite the RLPSO algorithm’s average best solution value being lower than that of the CLPSO and ELPSO algorithms in the multimodal function problem f11, RLPSO successfully identifies 12 global optimal solutions of f11 across 30 runs. To provide a comprehensive assessment of the RLPSO algorithm, Figure 7 illustrates the frequency with which RLPSO finds the global optimum in 30 runs across 16 benchmark functions.
Figure 7 shows that, when the RLPSO algorithm is applied to each benchmark function problem 30 times, it successfully identifies the global minimum for 12 of the 16 benchmark function problems. For the remaining 4 benchmark function problems (f3–f4, f7, f10) where the global minimum is not found, the gbests discovered by the RLPSO algorithm are very close to the global optimum. For instance, the average value of the optimal solution for f3 is 3.77 × 10−209, which is close to 0, the global optimal solution of f3. Similarly, the average value of the optimal solution for f4 is 1.91 × 10−214, also close to 0, the global optimal solution of f4. The same applies to f7 and f10. Based on the above analysis, we can conclude that the RLPSO algorithm is able to identify the global optimum within a certain error range when run multiple times.
Furthermore, it is evident that various PSO algorithms exhibit different performances across different benchmark functions. At times, they may successfully locate the global optimum, while in other instances, they may become trapped in local optima or converge too slowly. The experimental results confirm that the RLPSO algorithm manages to obtain nearly all global optima within a certain margin of error. Consequently, the RLPSO algorithm demonstrates greater stability compared to other PSO algorithms when addressing diverse problem sets.
To further validate the performance of the RLPSO algorithm, we selected 13 test functions from the CEC2017 benchmark set as the benchmark functions and compared the RLPSO algorithm with four PSO algorithms and two evolutionary algorithms on the test set. To evaluate the RLPSO algorithm’s performance, we compared it with other algorithms using identical test parameters. All algorithms were configured with the same settings: a population size of 50, a maximum iteration count of 1000, a particle dimension of 30, and each algorithm was executed 50 times [24]. The selected test functions are listed in Table 5, algorithm parameters are provided in Table 6, and the computational results are presented in Table 7.
The data from Table 7 clearly indicate that the RLPSO algorithm outperforms the other six algorithms (CLPSO, HPSO, THSPSO, TCSPSO, BOA, and OSA algorithms) in the testing of the 13 benchmark functions of CEC2017, achieving the minimum value on 11 of these benchmarks. This demonstrates the superior performance of the RLPSO algorithm, suggesting its capability to find solutions closer to the global optimum in most scenarios. These results imply that the RLPSO algorithm could be an effective and reliable choice for addressing complex optimization problems.

4.4. Particle Swarm Diversity Analysis

We conducted Principal Component Analysis (PCA) on functions f17 to f29 over 1000 iterations, tracking the positions of 50 particles at the 200th, 400th, 600th, and 800th iterations. To evaluate particle diversity during the iterative process, we utilized PCA to reduce the 30-dimensional particle position data to 3 dimensions. This analysis of the principal components offered valuable insights into particle behavior and enhanced swarm optimization. The findings are illustrated in Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20.
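A minimal sketch of the dimensionality reduction used here is given below, assuming the particle positions at a given iteration are stored in a (50, 30) array; scikit-learn’s PCA is one possible implementation (the paper does not state which library was used), and the random positions are placeholder data.

```python
import numpy as np
from sklearn.decomposition import PCA

positions = np.random.uniform(-100, 100, size=(50, 30))  # 50 particles x 30 dimensions (placeholder)
pca = PCA(n_components=3)
reduced = pca.fit_transform(positions)                   # (50, 3) coordinates for 3-D plotting
print(reduced.shape, pca.explained_variance_ratio_)
```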
From Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20, it is evident that the algorithm continues to improve beyond 600 iterations and has not yet converged at that point. For certain functions (f17, f18, f25, f26, f28, f29), convergence is not achieved even after 800 iterations. Furthermore, the comparisons across different runs highlight the algorithm’s stability on these functions.
To further analyze swarm diversity, a diversity measurement [33] is considered and defined as follows:
$div_j = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{median}(x^j) - x_i^j \right|$
$div = \frac{1}{d} \sum_{j=1}^{d} div_j$
N and d represent the number of particles and dimensions, respectively, $x_i^j$ represents the jth dimension of the ith particle, and $\mathrm{median}(x^j)$ is the median of dimension j over the whole swarm; $div_j$ is the diversity in each dimension, and the diversity of the whole population (div) is calculated by averaging all $div_j$.
Furthermore, with the help of diversity measurement, we can calculate the percentage of exploration and exploitation during each iteration using the following equations:
$exploration\% = \left( \frac{div}{div_{max}} \right) \times 100\%$
$exploitation\% = 100\% - exploration\%$
$div_{max}$ is defined as the maximum diversity value achieved throughout the optimization process. The exploration% relates the diversity in each iteration to this maximum diversity; it is inversely related to the exploitation level, and exploitation% is calculated as the complementary percentage of exploration%.
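A direct implementation of the diversity measure and the exploration/exploitation percentages defined above might look as follows, assuming positions is an (N, d) array of particle positions; div_max has to be tracked across iterations by the caller.

```python
import numpy as np

def swarm_diversity(positions):
    """Diversity div: mean absolute deviation from the per-dimension median, averaged over dimensions."""
    med = np.median(positions, axis=0)                  # median(x^j) for each dimension j
    div_j = np.mean(np.abs(med - positions), axis=0)    # div_j for each dimension
    return float(np.mean(div_j))                        # div

def exploration_exploitation(div, div_max):
    """Exploration% and exploitation% relative to the maximum diversity reached so far."""
    exploration = 100.0 * div / div_max
    return exploration, 100.0 - exploration
```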
We conducted research on the percentage of exploration and exploitation in 1000 iterations for f17f29. The results are shown in Figure 21.
From Figure 21, we can observe that the exploitation% of the algorithm starts to increase around the 400th iteration. For f20 and f23, when the number of iterations reaches 800, the exploitation% nearly approaches 100%. This observation is consistent with what is shown in Figure 11 and Figure 14, where the particles are in a clustered state. The data in Figure 21 indicate that as the number of iterations increases, the algorithm’s performance gradually improves and eventually stabilizes. This suggests that in the later stages of the algorithm’s execution, the clustering of particles significantly enhances the exploitation rate, reaching nearly full exploitation.
Moreover, the analysis of the percentages of exploration and exploitation reveals that the algorithm predominantly explores in the early stages, with exploration% reaching nearly 100%. In the later stages, the algorithm shifts towards exploiting the accumulated information to accelerate convergence. This behavior is consistent across different problems, highlighting the algorithm’s stability and its ability to effectively balance exploration and exploitation in various scenarios.

4.5. Engineering Problem

In this section, the RLPSO algorithm was applied to solve the three-bar truss problem [33], with a maximum iteration count of 1000 and a population size of 30. The mathematical formulation is as follows:
$x = (x_1, x_2)$
Objective function:
$\mathrm{Min.}\; f(x) = L \times (x_2 + 2 \sqrt{2}\, x_1)$
These are subject to the following:
$h_1(x) = \dfrac{x_2}{\sqrt{2}\, x_1^2 + 2 x_1 x_2} P - \delta \le 0,$
$h_2(x) = \dfrac{x_2 + \sqrt{2}\, x_1}{\sqrt{2}\, x_1^2 + 2 x_1 x_2} P - \delta \le 0,$
$h_3(x) = \dfrac{1}{x_1 + \sqrt{2}\, x_2} P - \delta \le 0,$
where $0 \le x_1, x_2 \le 1$, $P = 2$, $L = 100$, and $\delta = 2$.
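The objective and constraints of the three-bar truss problem, as reconstructed above, can be expressed directly in code; the simple penalty-based fitness shown here is one common way to let a PSO variant handle the constraints and is an illustrative assumption, not necessarily the constraint-handling scheme used in [33].

```python
import math

P, L, DELTA = 2.0, 100.0, 2.0

def truss_objective(x1, x2):
    return L * (x2 + 2.0 * math.sqrt(2.0) * x1)

def truss_constraints(x1, x2):
    denom = math.sqrt(2.0) * x1 ** 2 + 2.0 * x1 * x2
    h1 = x2 / denom * P - DELTA
    h2 = (x2 + math.sqrt(2.0) * x1) / denom * P - DELTA
    h3 = 1.0 / (x1 + math.sqrt(2.0) * x2) * P - DELTA
    return [h1, h2, h3]                  # feasible when all h_i <= 0

def penalized_fitness(x1, x2, penalty=1e6):
    """Objective plus a large penalty for any violated constraint (illustrative handling)."""
    violation = sum(max(0.0, h) for h in truss_constraints(x1, x2))
    return truss_objective(x1, x2) + penalty * violation
```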
This engineering problem is solved by using our proposed RLPSO algorithm and compared with the methods mentioned in reference [33]. The results of the comparison are shown in Table 8.
As shown in Table 8, RLPSO performs well in solving the engineering problem. However, the complexity of the boundary conditions in this engineering problem often leads to particles not satisfying the constraints after updating, resulting in ineffective learning. Additionally, due to the problem’s low dimensionality, particles lose a significant amount of their inherent information after updating, resulting in substantial changes in particle positions and the loss of valuable data. Consequently, addressing practical engineering problems with complex constraint conditions will be a major focus of our future research.

5. Conclusions

This study explores the application of strategic learning in optimization by proposing a Reinforcement Learning-based Particle Swarm Optimization (RLPSO) algorithm aimed at improving the performance and convergence speed of traditional PSO algorithms. The research on the RLPSO algorithm involves knowledge from multiple theoretical domains, including Particle Swarm Optimization, Reinforcement Learning, Q-learning, multi-modal optimization, and adaptive algorithms. This study is based on a deep understanding of these theories and their effective integration.
Under the same testing parameter settings, performance comparisons were conducted between the RLPSO algorithm and the PSO, CLPSO, and ELPSO algorithms on 16 benchmark functions from CEC2005. The results revealed that, compared to the other algorithms, the RLPSO algorithm exhibited the fastest convergence speed. It also found the global optimum in 14 out of the 16 benchmark functions, with statistically significant differences. In the testing of 13 benchmark functions from CEC2017, performance comparisons were made between the RLPSO algorithm and six other algorithms (CLPSO, HPSO, THSPSO, TCSPSO, BOA, and OSA). The results demonstrated that the RLPSO algorithm found the global optimum in 11 out of the 13 benchmark functions, with statistically significant differences. This indicates that the RLPSO algorithm exhibits excellent performance across multiple benchmark function problems, finding the global optimum in almost all cases. The Q-learning mechanism introduced from reinforcement learning plays a crucial role in enhancing algorithm performance.
This algorithm selects particles to update their velocities based on an online updated Q-table. At the beginning of each iteration, particles randomly choose particles to learn from with a certain probability, exploring the solution space as much as possible to update the Q-table. This process filters out particles worth learning from and stores the information in the Q-table. As iterations proceed, particles determine which particles they want to learn from based on the Q-table and store the learning results in the Q-table to guide the next step of learning. The algorithm continuously adjusts the learning targets and updates the learning strategy online to accelerate convergence speed at the right time, thus striking a good balance between convergence speed and diversity. In this paper, comparisons with random algorithms that do not incorporate Q-table learning reveal that the RLPSO algorithm converges faster. This indicates the crucial role of the Q-learning mechanism introduced in reinforcement learning in enhancing algorithm performance.
By combining Q-learning with particle swarm optimization, we achieve effective learning and experience sharing among particles, accelerating the algorithm’s convergence speed, and obtaining better fitness values at the appropriate times. This demonstrates the effectiveness of strategic learning compared to random learning, providing strong support for further research and the application of reinforcement learning in optimization algorithms. Compared with traditional PSO algorithms and other improved versions, the RLPSO algorithm exhibits strong adaptability, fast convergence speed, strong global search capability, ease of implementation and application, and demonstrates more stable and efficient performance, making it more applicable and versatile when facing different types of optimization problems. The RLPSO algorithm has a wide range of applications, such as in engineering design, data mining, and artificial intelligence. By improving the performance and robustness of optimization algorithms, the RLPSO algorithm provides effective solutions for solving practical problems.
Additionally, in terms of performance on the CEC2005 and CEC2017 test functions, the RLPSO algorithm performs excellently in handling both multi-modal and single-modal problems, and is particularly outstanding in solving multi-modal problems, indicating its applicability to complex problems and further validating its superiority over other PSO algorithms in solving practical problems. The diversity graphs also reveal that the algorithm maintains the diversity of the particle swarm throughout the computation and balances exploration and exploitation. The engineering problem verification shows that the algorithm has practical application value. This suggests that the RLPSO algorithm has broad potential in practical applications, especially in engineering optimization and data mining.
Despite the significant achievements of the RLPSO algorithm on optimization problems, there are limitations that need to be considered. The reinforcement learning parameters in the RLPSO algorithm require appropriate adjustment, including the learning rate, rewards, and penalties. The choice of these parameters may significantly affect the algorithm’s performance and convergence speed, but determining the optimal parameter settings typically requires a large number of experiments and considerable experience. The updating and maintenance of the Q-table in the RLPSO algorithm also increase the computational cost, especially when dealing with high-dimensional problems or large-scale optimization tasks; practical applications may therefore face limitations in computational resources. Although the RLPSO algorithm demonstrates many advantages on optimization problems, further research and improvements are still needed to overcome its limitations and enhance the algorithm’s performance and applicability.
Future research can focus on optimizing the parameter settings of Q-learning in the RLPSO algorithm to further enhance the algorithm’s performance and robustness. Additionally, ideas from other novel PSO algorithms can be borrowed, such as the Particle Swarm Optimization algorithm with priority-based sorting [34] and the DOADAPO algorithm [35], to extend Q-learning into the multi-objective optimization domain, thereby exploring a wider problem space. These research directions will contribute to a deeper understanding of the working principles of the RLPSO algorithm and further promote the application and development of reinforcement learning-based optimization algorithms in practical problems.
In summary, as a reinforcement learning-based optimization algorithm, the RLPSO algorithm not only has significant theoretical significance but also has broad prospects in practical applications. We believe that through further research and exploration, the RLPSO algorithm will play a more important role in the field of optimization and provide effective solutions for practical problems.

Author Contributions

F.Z. and Z.C. proposed the algorithm and were responsible for writing the article. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Social Science Fund of China (2022-SKJJ-B-112).

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

We would like to thank all the authors whose articles are referenced in our study. We extend special thanks to the professors of the Sixty-third Institute of the National University of Defense Technology.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the 1995 IEEE International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; pp. 1942–1948. [Google Scholar]
  2. Sabir, Z.; Raja, M.A.Z.; Umar, M.; Shoaib, M. Design of neuro-swarming-based heuristics to solve the third-order nonlinear multi-singular Emden-Fowler equation. Eur. Phys. J. Plus 2020, 135, 410. [Google Scholar] [CrossRef]
  3. Che, H.; Wang, J. A Two-Timescale Duplex Neurodynamic Approach to Mixed-Integer Optimization. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 36–48. [Google Scholar] [CrossRef] [PubMed]
  4. Shao, H.D.; Jiang, H.K.; Zhang, X.; Niu, M.G. Rolling bearing fault diagnosis using an optimization deep belief network. Meas. Sci. Technol. 2015, 26, 115002. [Google Scholar] [CrossRef]
  5. Yan, X.A.; Jia, M.P. A novel optimized SVM classification algorithm with multi-domain feature and its application to fault diagnosis of rolling bearing. Neurocomputing 2018, 313, 47–64. [Google Scholar] [CrossRef]
  6. Mohammad, Y.; Eberhart, R.; Mohammad, H.S. A Novel Flexible Inertia Weight Particle Swarm Optimization Algorithm. PLoS ONE 2016, 11, e0161558. [Google Scholar]
  7. Ratnaweera, J.A.; Mousa, S.; Watson, H.C. Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Trans. Evol. Comput. 2004, 8, 240–255. [Google Scholar] [CrossRef]
  8. Liu, Y.; Lu, H.; Cheng, S.; Shi, Y. An Adaptive Online Parameter Control Algorithm for Particle Swarm Optimization Based on Reinforcement Learning. In Proceedings of the 2019 IEEE Congress on Evolutionary Computation, Wellington, New Zealand, 10–13 June 2019; pp. 815–822. [Google Scholar]
  9. Liu, W.; Wang, Z.; Yuan, Y.; Zeng, N.; Hone, K.; Liu, X. A novel sigmoid-function-based adaptive weighted particle swarm optimizer. IEEE Trans. Cybern. 2021, 51, 1085–1093. [Google Scholar] [CrossRef]
  10. Mendes, R.; Kennedy, J.; Neves, J. The fully informed particle swarm: Simpler, maybe better. IEEE Trans. Evol. Comput. 2004, 8, 204–210. [Google Scholar] [CrossRef]
  11. Yang, C.H.; Lin, Y.S.; Chang, L.Y.; Chang, H.W. A Particle Swarm Optimization-Based Approach with Local Search for Predicting Protein Folding. J. Comput. Biol. 2017, 24, 981–994. [Google Scholar] [CrossRef]
  12. Bergh, F. A Cooperative approach to particle swarm optimization. IEEE Trans. Evol. Comput. 2004, 8, 225–239. [Google Scholar]
13. Liang, J.J.; Qin, A.K.; Suganthan, P.N.; Baskar, S. Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput. 2006, 10, 281–295.
14. Huang, H.; Qin, H.; Hao, Z.F.; Lim, A. Example-based learning particle swarm optimization for continuous optimization. Inf. Sci. 2012, 182, 125–138.
15. Lynn, N.; Suganthan, P.N. Heterogeneous comprehensive learning particle swarm optimization with enhanced exploration and exploitation. Swarm Evol. Comput. 2015, 24, 11–24.
16. Zhang, X.M.; Lin, Q.Y. Three-learning strategy particle swarm algorithm for global optimization problems. Inf. Sci. 2022, 593, 289–313.
17. Xu, G.P.; Cui, Q.L.; Shi, X.H.; Ge, H.W.; Zhan, Z.H.; Lee, H.P.; Liang, Y.C.; Tai, R.; Wu, C.G. Particle swarm optimization based on dimensional learning strategy. Swarm Evol. Comput. 2019, 45, 33–51.
18. Cai, L.; Hou, Y.; Zhao, Y.; Wang, J. Application research and improvement of particle swarm optimization algorithm. In Proceedings of the 2020 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 28–30 July 2020; pp. 238–241.
19. Garg, H. A hybrid PSO-GA algorithm for constrained optimization problems. Appl. Math. Comput. 2016, 274, 292–305.
20. Uriarte, A.; Melin, P.; Valdez, F. An improved Particle Swarm Optimization algorithm applied to Benchmark Functions. In Proceedings of the 2016 IEEE 8th International Conference on Intelligent Systems (IS), Sofia, Bulgaria, 4–6 September 2016; pp. 128–132.
21. Wang, X.H.; Li, J.J. Hybrid particle swarm optimization with simulated annealing. In Proceedings of the 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826), Shanghai, China, 26–29 August 2004; pp. 2402–2405.
22. Aydilek, I.B. A hybrid firefly and particle swarm optimization algorithm for computationally expensive numerical problems. Appl. Soft Comput. 2018, 66, 232–249.
23. Liang, B.X.; Zhao, Y.; Li, Y. A hybrid particle swarm optimization with crisscross learning strategy. Eng. Appl. Artif. Intell. 2021, 105, 104418.
24. Zhang, X.W.; Liu, H.; Zhang, T.; Wang, Q.W.; Tu, L.P. Terminal crossover and steering-based particle swarm optimization algorithm with disturbance. Appl. Soft Comput. 2019, 85, 105841.
25. Babak, Z.A. No-free-lunch-theorem: A page taken from the computational intelligence for water resources planning and management. Environ. Sci. Pollut. Res. Int. 2023, 30, 57212–57218.
26. Xu, L.; Zhu, S.; Wen, N. Deep reinforcement learning and its applications in medical imaging and radiation therapy: A survey. Phys. Med. Biol. 2022, 67, 22.
27. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University, Cambridge, UK, 1989.
28. Suganthan, P.N.; Hansen, N.; Liang, J.J.; Deb, K.; Tiwari, S. Problem Definitions and Evaluation Criteria for the CEC 2005 Special Session on Real-Parameter Optimization. In Natural Computing; Nanyang Technological University: Singapore, 2005; pp. 341–357.
29. Liu, H.; Zhang, Y.; Tu, L.; Wang, Y. Human Behavior-Based Particle Swarm Optimization: Stability Analysis. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 3139–3144.
30. Liu, Y.J. A hierarchical simple particle swarm optimization with mean dimensional information. Appl. Soft Comput. 2019, 76, 712–725.
31. Arora, S.; Singh, S. Butterfly optimization algorithm: A novel approach for global optimization. Soft Comput. 2019, 23, 715–734.
32. Jain, M.; Maurya, S.; Rani, A.; Singh, V. Owl search algorithm: A novel nature-inspired heuristic paradigm for global optimization. J. Intell. Fuzzy Syst. 2018, 34, 1573–1582.
33. Sahoo, S.K.; Saha, A.K.; Nama, S.; Masdari, M. An improved moth flame optimization algorithm based on modified dynamic opposite learning strategy. Artif. Intell. Rev. 2022, 56, 2811–2869.
34. Wang, Y.J.; Yang, Y.P. Particle swarm optimization with preference order ranking for multi-objective optimization. Inf. Sci. 2009, 179, 1944–1959.
35. Deng, W.; Zhao, H.M.; Yang, X.H.; Xiong, J.X.; Sun, M.; Li, B. Study on an improved adaptive PSO algorithm for solving multi-objective gate assignment. Appl. Soft Comput. 2017, 59, 288–302.
Figure 1. A flowchart of the CLPSO algorithm. Note: the meanings of the parameters in the figure are shown in Table 1.
Figure 2. The model of Q-learning.
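For readers unfamiliar with the update behind Figure 2, the following minimal Python sketch illustrates the standard tabular Q-learning rule of Watkins [27], using the learning rate α = 0.1, discount factor γ = 0.95, and initial exploration rate ε0 = 0.6 listed in Table 2. The state and action encoding (and the per-particle Q table of Figure 3) are simplified placeholders, not the paper's exact implementation.

import numpy as np

# Hypothetical sizes: in RLPSO each particle keeps its own Q table (Figure 3);
# the state/action encoding below is illustrative only.
n_states, n_actions = 4, 10
alpha, gamma, epsilon = 0.1, 0.95, 0.6   # values taken from Table 2

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def choose_action(state):
    # epsilon-greedy selection: explore with probability epsilon, otherwise exploit.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    # Standard Watkins Q-learning update [27]:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

a = choose_action(0)
q_update(0, a, reward=10.0, next_state=1)   # e.g., the "global reward" value of Table 2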
Figure 3. The Q table of particle i.
Figure 4. The flowchart of selecting learned particles for particle i.
Figure 5. A flowchart of the RLPSO algorithm (see also Figure 4).
Figure 6. A comparison of Random RLPSO and RLPSO on the convergence of f1–f16.
Figure 7. The number of times the RLPSO algorithm found the global minimum.
Figure 8. PCA was applied to the positions of 50 particles in f17 of CEC2017 for Shifted and Rotated Zakharov. Note: The colored dots are the positions of the particles.
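Figures 8–20 visualize the swarm distribution by projecting the 30-dimensional particle positions onto two principal components. A minimal sketch of such a projection is shown below, using scikit-learn's PCA on a random 50 × 30 position matrix that stands in for an actual swarm snapshot; variable names are illustrative.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# positions: 50 particles x 30 dimensions (D = 30 for the CEC2017 functions, Table 5).
# Random data stands in here for the swarm snapshot plotted in Figures 8-20.
positions = np.random.default_rng(1).uniform(-100, 100, size=(50, 30))

pca = PCA(n_components=2)            # project the swarm onto its first two principal axes
coords = pca.fit_transform(positions)

plt.scatter(coords[:, 0], coords[:, 1])   # each dot is one particle, as in the figures
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Swarm distribution after PCA")
plt.show()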
Figure 9. PCA was applied to the positions of 50 particles in f18 of CEC2017 for Shifted and Rotated Rastrigin. Note: The colored dots are the positions of the particles.
Figure 10. PCA was applied to the positions of 50 particles in f19 of CEC2017 for Shifted and Rotated Lunacek Bi-Rastrigin. Note: The colored dots are the positions of the particles.
Figure 11. PCA was applied to the positions of 50 particles in f20 of CEC2017 for Shifted and Rotated Non-Continuous Rastrigin. Note: The colored dots are the positions of the particles.
Figure 12. PCA was applied to the positions of 50 particles in f21 of CEC2017 for Shifted and Rotated Schwefel. Note: The colored dots are the positions of the particles.
Figure 13. PCA was applied to the positions of 50 particles in f22 of CEC2017 for Hybrid Function 2 (N = 3). Note: The colored dots are the positions of the particles.
Figure 14. PCA was applied to the positions of 50 particles in f23 of CEC2017 for Hybrid Function 4 (N = 4). Note: The colored dots are the positions of the particles.
Figure 15. PCA was applied to the positions of 50 particles in f24 of CEC2017 for Hybrid Function 5 (N = 4). Note: The colored dots are the positions of the particles.
Figure 16. PCA was applied to the positions of 50 particles in f25 of CEC2017 for Hybrid Function 6 (N = 4). Note: The colored dots are the positions of the particles.
Figure 17. PCA was applied to the positions of 50 particles in f26 of CEC2017 for Composition Function 1 (N = 3). Note: The colored dots are the positions of the particles.
Figure 18. PCA was applied to the positions of 50 particles in f27 of CEC2017 for Composition Function 2 (N = 3). Note: The colored dots are the positions of the particles.
Figure 19. PCA was applied to the positions of 50 particles in f28 of CEC2017 for Composition Function 5 (N = 5). Note: The colored dots are the positions of the particles.
Figure 20. PCA was applied to the positions of 50 particles in f29 of CEC2017 for Composition Function 6 (N = 5). Note: The colored dots are the positions of the particles.
Figure 21. Exploration% and exploitation% of f17–f29.
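The exploration% and exploitation% curves in Figure 21 are commonly derived from a dimension-wise diversity measure of the swarm. The sketch below shows one widely used definition (diversity relative to its maximum over the run); whether RLPSO uses exactly this measure is an assumption made here for illustration.

import numpy as np

def exploration_exploitation(position_history):
    # position_history: list of (N particles x D dims) arrays, one per iteration.
    # Diversity: mean absolute deviation from the dimension-wise median, averaged over dims.
    div = []
    for X in position_history:
        median = np.median(X, axis=0)
        div.append(np.mean(np.abs(X - median)))
    div = np.asarray(div)
    div_max = div.max()
    xpl = 100.0 * div / div_max                      # exploration%
    xpt = 100.0 * np.abs(div - div_max) / div_max    # exploitation%
    return xpl, xpt

# Example with random snapshots standing in for a real run:
rng = np.random.default_rng(2)
history = [rng.uniform(-100, 100, size=(50, 30)) * (0.95 ** t) for t in range(100)]
xpl, xpt = exploration_exploitation(history)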
Table 1. The meanings of the parameters for the CLPSO algorithm.
Parameter | Meaning
ω0 | inertia weight of the first iteration
ω1 | inertia weight of the last iteration
PSO | the PSO algorithm
ps | population size
max_gen | maximum number of generations
k | generation counter
i | particle id counter
d | dimension
ω(k) | inertia weight at generation k
gbestd | the dth dimension's value of gbest
flagi | the number of generations for which particle i has not improved its pbest (initialized to 0)
Table 2. Parameter settings for PSO algorithms.
Algorithm | Parameter settings
PSO [1] | ω = 0.729, c1 = c2 = 1.494
CLPSO [13] | ω = 0.9–0.4, c = 1.494, m = 7
ELPSO [14] | ω = 0.729, c1 = 1.49445, c2 = 1.494, m = 7, Bm = 4
RLPSO | ω = 0.9–0.4, c = 1.49445, m = 10, dimup = 30, global reward = 10, local reward = 2, penalty = −1, α = 0.1, γ = 0.95, ε0 = 0.6, des = 0.001
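For reference, the baseline PSO [1] row of Table 2 corresponds to the canonical velocity and position update sketched below. RLPSO differs in how the learning exemplars (the pbest/gbest terms) are chosen via its Q table, which is not reproduced in this sketch.

import numpy as np

# Canonical PSO update [1] with the Table 2 settings for the baseline PSO.
omega, c1, c2 = 0.729, 1.494, 1.494
rng = np.random.default_rng(3)

def pso_step(x, v, pbest, gbest):
    # x, v, pbest: (N, D) arrays; gbest: (D,) array.
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = x + v_new
    return x_new, v_new

# One update for a swarm of 50 particles in 30 dimensions:
x = rng.uniform(-100, 100, (50, 30))
v = np.zeros_like(x)
pbest = x.copy()
gbest = x[0]
x, v = pso_step(x, v, pbest, gbest)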
Table 3. Comparison of the values obtained by the PSO, CLPSO, ELPSO, RLPSO, and Random RLPSO algorithms over 30 trials.
Function | Metric | PSO | CLPSO | ELPSO | RLPSO | Random RLPSO | t-test s value
f1 | Mean | 6.75 × 10−96 | 4.46 × 10−14 | 4.8 × 10−93 | 0 | 0 | 1
f1 | Std | 3.5 × 10−100 | 1.73 × 10−14 | 1.26 × 10−93 | 0 | 0 |
f2 | Mean | 5.53 × 10−16 | 3.79 × 10−12 | 1.41 × 10−12 | 0 | 0 | 1
f2 | Std | 1.72 × 10−20 | 2.19 × 10−12 | 2.51 × 10−12 | 0 | 0 |
f3 | Mean | 6.98 × 10−9 | 4.68 × 10−3 | 8.45 × 10−12 | 3.77 × 10−209 | 4.71 × 10−185 | 1
f3 | Std | 1.25 × 10−10 | 3.83 × 10−3 | 6.58 × 10−12 | 2.02 × 10−208 | 1.93 × 10−184 |
f4 | Mean | 1.52 × 10−6 | 2.6 | 2.71 × 10−7 | 1.91 × 10−214 | 2.2 × 10−207 | 1
f4 | Std | 1.84 × 10−7 | 2.4 | 2.81 × 10−7 | 7.2 × 10−214 | 7.15 × 10−207 |
f5 | Mean | 1.01 × 101 | 2.1 × 101 | 9.82 | 6.67 × 10−30 | 1.17 × 10−29 | 1
f5 | Std | 1.66 | 2.98 | 1.66 | 1.49 × 10−29 | 1.37 × 10−29 |
f6 | Mean | 0 | 0 | 0 | 0 | 0 | 0
f6 | Std | 0 | 0 | 0 | 0 | 0 |
f7 | Mean | 7.49 × 10−3 | 5.78 × 10−3 | 3.9 × 10−3 | 1.97 × 10−3 | 2.4 × 10−3 | 1
f7 | Std | 9.4 × 10−4 | 2.34 × 10−3 | 1.42 × 10−3 | 8.1 × 10−4 | 1.09 × 10−3 |
f8 | Mean | −8.44 × 103 | −9.54 × 103 | −1.22 × 104 | −1.25 × 104 | −1.25 × 104 | 1
f8 | Std | 5.68 × 102 | 2.15 × 102 | 3.29 × 102 | 9.01 × 101 | 8.5 × 101 |
f9 | Mean | 4.69 × 101 | 4.85 × 10−10 | 0 | 0 | 0 | 0
f9 | Std | 1.59 × 101 | 3.63 × 10−10 | 0 | 0 | 0 |
f10 | Mean | 1.21 | 0 | 0 | 1.47 × 10−14 | 1.8 × 10−14 | 1
f10 | Std | 8.6 × 10−1 | 0 | 0 | 4.1 × 10−15 | 5.2 × 10−15 |
f11 | Mean | 2.88 × 10−2 | 3.14 × 10−10 | 0 | 1.82 × 10−2 | 2.62 × 10−2 | 1
f11 | Std | 3.18 × 10−2 | 4.64 × 10−10 | 0 | 2.36 × 10−2 | 3.01 × 10−2 |
f12 | Mean | 3.52 | 1.12 × 10−11 | 3.02 × 10−17 | 0 | 0 | 1
f12 | Std | 5.2 × 10−1 | 1.12 × 10−10 | 1.35 × 10−18 | 0 | 0 |
f13 | Mean | 8.46 × 101 | 1.07 × 10−11 | 2.88 × 10−17 | 0 | 3.66 × 10−4 | 1
f13 | Std | 2.35 × 10−1 | 1.7 × 10−27 | 2.06 × 10−22 | 0 | 1.97 × 10−3 |
f14 | Mean | −1.0316285 | −1.0316285 | −1.0316285 | −1.0316285 | −1.0316285 | 0
f14 | Std | 0 | 0 | 0 | 0 | 4.44 × 10−16 |
f15 | Mean | 0.398903 | 0.398903 | 0.398874 | 0.397887 | 0.397887 | 1
f15 | Std | 0 | 0 | 0 | 0 | 0 |
f16 | Mean | 3.00 | 3.00 | 3.00 | 3.00 | 3.00 | 0
f16 | Std | 0 | 0 | 0 | 0 | 1.22 × 10−15 |
Note: a t-test s value of 1 indicates statistically significant differences in performance between the RLPSO algorithm and the PSO, CLPSO, and ELPSO algorithms at the 95% confidence level, while a t-test s value of 0 suggests no statistically significant differences. The best results among the five algorithms are highlighted in bold.
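The s values in Table 3 (and later in Table 7) flag statistical significance at the 95% confidence level. A minimal sketch of how such a flag could be computed from the 30 per-run results is given below using SciPy's two-sample t-test; the synthetic arrays and the use of Welch's variant are assumptions, not the authors' exact test script.

import numpy as np
from scipy import stats

# Hypothetical arrays of 30 final objective values per algorithm.
rng = np.random.default_rng(4)
rlpso_runs = rng.normal(0.0, 1e-6, 30)
pso_runs = rng.normal(1e-3, 1e-4, 30)

t_stat, p_value = stats.ttest_ind(rlpso_runs, pso_runs, equal_var=False)
s = 1 if p_value < 0.05 else 0   # s = 1 means a significant difference at the 95% level
print(s)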
Table 4. Comparisons of 5 groups of parameter settings. All groups use global reward = 10; the columns are:
S1: local reward = 5, penalty = −1
S2: local reward = 5, penalty = −2
S3: local reward = 2, penalty = −1
S4: local reward = 2, penalty = −2
S5: local reward = 1, penalty = −2
Function | Metric | S1 | S2 | S3 | S4 | S5
f1 | Mean | 0 | 0 | 0 | 0 | 0
f1 | Std | 0 | 0 | 0 | 0 | 0
f2 | Mean | 0 | 0 | 0 | 0 | 0
f2 | Std | 0 | 0 | 0 | 0 | 0
f3 | Mean | 8.62 × 10−197 | 4.3 × 10−218 | 3.77 × 10−209 | 2.6 × 10−201 | 1.15 × 10−194
f3 | Std | 4.64 × 10−196 | 2.31 × 10−217 | 2.02 × 10−208 | 1.4 × 10−200 | 5.52 × 10−194
f4 | Mean | 1.3 × 10−204 | 2.99 × 10−209 | 1.91 × 10−214 | 3.39 × 10−210 | 5.13 × 10−206
f4 | Std | 6.99 × 10−204 | 8.97 × 10−209 | 7.2 × 10−214 | 1.82 × 10−209 | 2.53 × 10−205
f5 | Mean | 1.53 × 10−29 | 2.25 × 10−29 | 6.67 × 10−30 | 1.05 × 10−29 | 1.01 × 10−29
f5 | Std | 2.28 × 10−29 | 7.23 × 10−29 | 1.49 × 10−29 | 2.09 × 10−29 | 1.57 × 10−29
f6 | Mean | 0 | 0 | 0 | 0 | 0
f6 | Std | 0 | 0 | 0 | 0 | 0
f7 | Mean | 2.47 × 10−3 | 2.21 × 10−3 | 1.97 × 10−3 | 2.43 × 10−3 | 2.51 × 10−3
f7 | Std | 9.26 × 10−4 | 1.43 × 10−3 | 8.1 × 10−4 | 1.06 × 10−3 | 9.22 × 10−4
f8 | Mean | −1.25 × 104 | −1.25 × 104 | −1.25 × 104 | −1.25 × 104 | −1.25 × 104
f8 | Std | 6.23 × 101 | 7.95 × 101 | 9.01 × 101 | 6.23 × 101 | 6.23 × 101
f9 | Mean | 0 | 0 | 0 | 3.32 × 10−2 | 0
f9 | Std | 0 | 0 | 0 | 1.79 × 10−1 | 0
f10 | Mean | 1.44 × 10−14 | 1.43 × 10−14 | 1.47 × 10−14 | 1.41 × 10−14 | 1.38 × 10−14
f10 | Std | 3.99 × 10−15 | 2.95 × 10−15 | 4.1 × 10−15 | 2.76 × 10−15 | 3.86 × 10−15
f11 | Mean | 2.12 × 10−2 | 1.81 × 10−2 | 1.82 × 10−2 | 1.52 × 10−2 | 2.29 × 10−2
f11 | Std | 2.4 × 10−2 | 2.08 × 10−2 | 2.36 × 10−2 | 1.91 × 10−2 | 2.56 × 10−2
f12 | Mean | 0 | 0 | 0 | 0 | 0
f12 | Std | 0 | 0 | 0 | 0 | 0
f13 | Mean | 0 | 0 | 0 | 0 | 0
f13 | Std | 0 | 0 | 0 | 0 | 0
f14 | Mean | −1.0316285 | −1.0316285 | −1.0316285 | −1.0316285 | −1.0316285
f14 | Std | 0 | 9.57 × 10−16 | 0 | 0 | 9.62 × 10−15
f15 | Mean | 0.397887 | 0.397887 | 0.397887 | 0.397887 | 0.397887
f15 | Std | 0 | 3.19 × 10−16 | 0 | 0 | 4.66 × 10−14
f16 | Mean | 3.00 | 3.00 | 3.00 | 3.00 | 3.00
f16 | Std | 0 | 0 | 0 | 0 | 0
Note: the best results among the five parameter settings are highlighted in bold. S3 (global reward = 10, local reward = 2, penalty = −1) corresponds to the RLPSO defaults listed in Table 2.
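Table 4 varies the global reward, local reward, and penalty that feed the Q-learning update. The sketch below illustrates one plausible way such a reward signal could be assigned; the mapping from swarm events (improving gbest, improving only pbest, or neither) to these three values is an assumption made for illustration, not the paper's verified logic.

def reward_signal(improved_gbest, improved_pbest,
                  global_reward=10.0, local_reward=2.0, penalty=-1.0):
    # Assumed rule using the default Table 2/Table 4 values:
    # large reward if the swarm's gbest improved, a smaller reward if only the
    # particle's own pbest improved, and a penalty otherwise.
    if improved_gbest:
        return global_reward
    if improved_pbest:
        return local_reward
    return penalty

# Example: the particle improved its personal best but not the global best.
r = reward_signal(improved_gbest=False, improved_pbest=True)   # r == 2.0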
Table 5. CEC2017 benchmark functions.
No. | Function | D | Range | fopt
f17 | Shifted and Rotated Zakharov | 30 | [−100, 100] | 300
f18 | Shifted and Rotated Rastrigin | 30 | [−100, 100] | 500
f19 | Shifted and Rotated Lunacek Bi-Rastrigin | 30 | [−100, 100] | 700
f20 | Shifted and Rotated Non-Continuous Rastrigin | 30 | [−100, 100] | 800
f21 | Shifted and Rotated Schwefel | 30 | [−100, 100] | 1000
f22 | Hybrid Function 2 (N = 3) | 30 | [−100, 100] | 1200
f23 | Hybrid Function 4 (N = 4) | 30 | [−100, 100] | 1400
f24 | Hybrid Function 5 (N = 4) | 30 | [−100, 100] | 1500
f25 | Hybrid Function 6 (N = 4) | 30 | [−100, 100] | 1600
f26 | Composition Function 1 (N = 3) | 30 | [−100, 100] | 2100
f27 | Composition Function 2 (N = 3) | 30 | [−100, 100] | 2200
f28 | Composition Function 5 (N = 5) | 30 | [−100, 100] | 2500
f29 | Composition Function 6 (N = 5) | 30 | [−100, 100] | 2600
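The CEC2017 functions in Table 5 are shifted and rotated variants of classical benchmarks with an additive bias fopt. As a hedged illustration, the sketch below evaluates only the base (unshifted, unrotated) Zakharov function underlying f17; the official CEC2017 implementation additionally applies a shift vector, a rotation matrix, and the bias, so the values here are not directly comparable to Table 7.

import numpy as np

def zakharov(x):
    # Base Zakharov function: sum(x_i^2) + (sum(0.5*i*x_i))^2 + (sum(0.5*i*x_i))^4.
    i = np.arange(1, x.size + 1)
    s = np.sum(0.5 * i * x)
    return np.sum(x ** 2) + s ** 2 + s ** 4

x = np.random.default_rng(5).uniform(-100, 100, 30)   # D = 30, range from Table 5
f_opt = 300.0                                         # bias of f17 in Table 5
reported = zakharov(x) + f_opt   # the ideal reported value equals f_opt at the optimum
error = reported - f_opt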
Table 6. Some variants of PSO and other compared evolutionary algorithms.
Algorithm | Year | Parameter settings
CLPSO [13] | 2006 | ω = 0.9–0.4, c = 1.49445, m = 7
HPSO [29] | 2014 | ω = 0.9–0.4, c1 = c2 = 2, c3 = randn[0, 1]
THSPSO [30] | 2019 | c1 = c2 = c3 = 2, ω = 0.9
TCSPSO [24] | 2019 | ω = 0.9–0.4, c1 = c2 = 2
BOA [31] | 2018 | sensor modality c = 0–1, power exponent = 0.1–0.3, switch probability p = 0–8
OSA [32] | 2018 | β = 1.9–0
Table 7. Comparison of the values obtained by the CLPSO, HPSO, THSPSO, TCSPSO, BOA, OSA, and RLPSO algorithms over 50 trials.
Function | Metric | CLPSO | HPSO | THSPSO | TCSPSO | BOA | OSA | RLPSO | t-test s value
f17 | Mean | 9.17 × 104 | 4.57 × 104 | 8.07 × 104 | 2.2 × 104 | 8.23 × 104 | 9.16 × 104 | 3.13 × 102 | 1
f17 | Std | 1.46 × 104 | 1.75 × 104 | 4.45 × 103 | 4.72 × 103 | 7.23 × 103 | 2.79 × 103 | 1.75 × 101 |
f18 | Mean | 6.57 × 102 | 6.12 × 102 | 8.99 × 102 | 6.02 × 102 | 8.99 × 102 | 9.38 × 102 | 5 × 102 | 1
f18 | Std | 1.05 × 101 | 5.04 × 101 | 3.8 × 101 | 2.54 × 101 | 2.59 × 101 | 2.35 × 101 | 7.36 × 10−4 |
f19 | Mean | 9.53 × 102 | 8.58 × 102 | 1.38 × 103 | 8.51 × 102 | 1.36 × 103 | 1.47 × 103 | 9.06 × 102 | 1
f19 | Std | 1.64 × 101 | 4.07 × 101 | 5.56 × 101 | 3.98 × 101 | 4.04 × 101 | 4.32 × 101 | 5.96 × 101 |
f20 | Mean | 9.6 × 102 | 9.01 × 102 | 1.13 × 103 | 8.91 × 102 | 1.13 × 103 | 1.15 × 103 | 8.09 × 102 | 1
f20 | Std | 1.26 × 101 | 5.11 × 101 | 2.93 × 101 | 2.53 × 101 | 1.95 × 101 | 2.43 × 101 | 5.18 |
f21 | Mean | 6.26 × 103 | 8.05 × 103 | 8.75 × 103 | 4.85 × 103 | 8.84 × 103 | 9.04 × 103 | 1.31 × 103 | 1
f21 | Std | 2.88 × 102 | 5.99 × 102 | 5.95 × 102 | 9.05 × 102 | 3.11 × 102 | 4.36 × 102 | 3.14 × 102 |
f22 | Mean | 1.16 × 107 | 8.87 × 105 | 1.07 × 1010 | 3.18 × 106 | 1.27 × 1010 | 1.44 × 1010 | 4.18 × 103 | 1
f22 | Std | 3.54 × 106 | 9.07 × 105 | 2.88 × 109 | 4.1 × 106 | 3.34 × 109 | 2.16 × 109 | 3.03 × 103 |
f23 | Mean | 6.13 × 104 | 6.8 × 104 | 1.56 × 106 | 6.6 × 104 | 1.22 × 107 | 2.14 × 107 | 3.83 × 103 | 1
f23 | Std | 4.41 × 104 | 5.24 × 104 | 9.77 × 105 | 1.02 × 105 | 2.09 × 107 | 2.03 × 107 | 2.16 × 103 |
f24 | Mean | 3.72 × 104 | 1.03 × 104 | 1.48 × 108 | 9.3 × 103 | 6.83 × 108 | 8.28 × 108 | 1.63 × 103 | 1
f24 | Std | 1.84 × 104 | 1 × 104 | 1.52 × 108 | 8.85 × 103 | 4.37 × 108 | 3.43 × 108 | 1.88 × 102 |
f25 | Mean | 2.47 × 103 | 2.65 × 103 | 5.04 × 103 | 2.67 × 103 | 7.3 × 103 | 6.17 × 103 | 1.84 × 103 | 1
f25 | Std | 1.6 × 102 | 3.51 × 102 | 6.82 × 102 | 3.1 × 102 | 1.39 × 103 | 9.21 × 102 | 2.92 × 102 |
f26 | Mean | 2.44 × 103 | 2.41 × 103 | 2.71 × 103 | 2.41 × 103 | 2.65 × 103 | 2.76 × 103 | 2.24 × 103 | 1
f26 | Std | 3.7 × 101 | 5.12 × 101 | 5.49 × 101 | 2.77 × 101 | 1.23 × 102 | 4.56 × 101 | 2.71 × 101 |
f27 | Mean | 2.94 × 103 | 7.04 × 103 | 8.2 × 103 | 2.74 × 103 | 5.44 × 103 | 1.01 × 104 | 2.44 × 103 | 1
f27 | Std | 1.07 × 103 | 3.2 × 103 | 1.29 × 103 | 1.36 × 103 | 1.08 × 103 | 6.9 × 102 | 9.51 × 101 |
f28 | Mean | 2.93 × 103 | 2.89 × 103 | 4.43 × 103 | 2.94 × 103 | 5.64 × 103 | 4.71 × 103 | 3.32 × 103 | 1
f28 | Std | 9.74 | 1.45 | 3.71 × 102 | 2.57 × 101 | 5.27 × 102 | 3.61 × 102 | 2.65 × 101 |
f29 | Mean | 4.81 × 103 | 4.99 × 103 | 1.01 × 104 | 4.7 × 103 | 1.14 × 104 | 1.14 × 104 | 3.14 × 103 | 1
f29 | Std | 4.67 × 102 | 5.5 × 102 | 6.96 × 102 | 1.19 × 103 | 9.03 × 102 | 8.81 × 102 | 8.63 |
Note: a t-test s value of 1 indicates a statistically significant difference in performance between the RLPSO algorithm and the other algorithm at the 95% confidence level, while a t-test s value of 0 suggests no statistically significant difference.
Table 8. Comparison of the performance of RLPSO and other algorithms on the three-bar truss problem.
Algorithm | Optimal weight
RLPSO | 209.173475679612
m-DMFO | 174.2761613819025
MFO | 263.895979682
DEDS | 263.8958434
MBA | 263.8958522
Tsa | 263.68
PSO-DE | 263.8958433
CS | 263.9716
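Table 8 refers to the classical three-bar truss design problem. The sketch below encodes its standard formulation (cross-sectional areas A1 and A2, bar length l = 100 cm, load and stress limit P = σ = 2 kN/cm²) with a simple static penalty for constraint violations; this generic formulation and penalty scheme are assumptions, since the exact constraint handling used with RLPSO is not shown here. Evaluated at the classical optimum (A1 ≈ 0.7887, A2 ≈ 0.4082), the objective returns roughly 263.896, which matches several of the reference values in Table 8.

import numpy as np

def three_bar_truss_weight(x):
    # Standard formulation assumed: x = (A1, A2), l = 100 cm, P = sigma = 2 kN/cm^2.
    A1, A2 = x
    l, P, sigma = 100.0, 2.0, 2.0
    weight = (2.0 * np.sqrt(2.0) * A1 + A2) * l
    # Stress constraints g_i <= 0 of the classical problem statement.
    g1 = (np.sqrt(2.0) * A1 + A2) / (np.sqrt(2.0) * A1 ** 2 + 2.0 * A1 * A2) * P - sigma
    g2 = A2 / (np.sqrt(2.0) * A1 ** 2 + 2.0 * A1 * A2) * P - sigma
    g3 = 1.0 / (A1 + np.sqrt(2.0) * A2) * P - sigma
    # Static penalty for violations (one common choice, not necessarily the one used with RLPSO).
    penalty = 1e6 * sum(max(0.0, g) for g in (g1, g2, g3))
    return weight + penalty

print(three_bar_truss_weight(np.array([0.7887, 0.4082])))   # approximately 263.9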