1. Introduction
Optimization methods automatically search for the best solution in a problem's solution space. Converting a real problem into a mathematical model requires a detailed description of the conversion process so that the model captures the actual physical characteristics of the problem, and the resulting model is often complex. Currently, three classes of methods exist for solving optimization problems: numerical methods, enumerative methods, and random search.
Numerical methods use the derivative to find the best value in the search space. For example, the traditional family of neural network algorithms relies on gradient descent, a numerical method, to find the best parameters [1]. However, the numerical approach has two shortcomings. First, it searches for the best solution from a local point of view, so there is no guarantee that the solution found is globally optimal. Second, numerical methods are not applicable to search spaces that are not smooth or continuous [2,3]. Because a search space that is not smooth or continuous usually contains many local optima, the search process tends to converge early to a local optimum.
The enumeration method, such as grid search, selects a segmentation level and uses the objective function to test all solutions in the search space at that level. This method has a better chance of obtaining the best solution, but it requires considerable computation time; therefore, when the search space is large, the enumeration method is inefficient. The random search method is currently a commonly used optimization approach, which finds the best solution by imitating natural or biological behavior; particle swarm optimization (PSO) [4] is one of the optimization methods that imitate biological behavior.
PSO is a swarm intelligence algorithm inspired by observing the behavior of swarming creatures, and it is used to find the best position in the current space. After PSO was published, many scholars proposed different methods to improve the algorithm, and these methods have been applied in many fields [5,6,7].
For example, Shi and Eberhart [8] proposed a constant inertia weight to improve the moving direction of particles. Different inertia weights make it possible to balance local and global search, so that the algorithm considers the best solution over the whole domain. Shi and Eberhart [9] later proposed linearly decreasing inertia weights. In the same year, Suganthan [10] applied a dynamic linear decrease to the individual learning parameter c1 and the group learning parameter c2, which effectively improved global search. Clerc [11,12] put forward the concept of the constriction coefficient, whose idea is to change the moving direction of particles to increase local search. Shi and Eberhart [13] proposed the maximum velocity method to improve search. Ratnaweera et al. [14] improved the method proposed by Suganthan by changing the group learning parameter from linearly decreasing to linearly increasing. Chatterjee and Siarry [15] proposed changing the inertia weight in a nonlinearly decreasing way. Ko et al. [16] extended the concept of nonlinear change to the individual and group learning parameters: they made the individual learning parameter nonlinearly decreasing and the group learning parameter nonlinearly increasing.
With many scholars proposing improvements to the original algorithm, the search ability of the particle swarm algorithm has improved dramatically. In 2002, Clerc and Kennedy [12] used the dynamic system notation of control theory to explore the internal operation of the particle swarm algorithm. Many scholars have since analyzed the stability of particle swarm algorithms under different conditions based on this dynamic system representation [17,18,19,20]. In particular, in 2014, Lin [21] proposed a PSO algorithm with controllability, which explores the particle swarm algorithm from the viewpoint of state controllability in dynamic systems. When the controllability conditions are met, the particle swarm algorithm's position and velocity vectors are controlled by the particle's own best solution position vector and the global best solution position vector; this yields better convergence than the original particle swarm algorithm.
However, the search ability of the integer-order particle swarm algorithm proposed by Lin [21] is poor and unstable in high-dimensional or complex spaces. Therefore, this study combines the fractional-order particle swarm optimizer with the PSO algorithm with controllability, yielding what we call the "controllable fractional-order particle swarm algorithm." The proposed algorithm improves the PSO algorithm with controllability so that it performs better in high-dimensional or complex spaces.
Hyperparameters of machine learning algorithms need to be optimized efficiently and systematically. Therefore, this study applies the controllable fractional-order particle swarm algorithm to optimize machine learning hyperparameters. We demonstrate how to efficiently and systematically find hyperparameters of the extreme gradient boosting (XGBoost) [22] machine learning algorithm using our recommended method, on the heart disease data set downloaded from the UCI website. Models trained with the six best hyperparameters found using our recommended method and with the hyperparameters officially recommended by XGBoost [23] were tested and compared. Experimental results showed that the recommended method performed better and produced no false-negative results. This method can help physicians quickly determine whether a patient has heart disease using the learned model.
2. Materials and Methods
The particle swarm algorithm was originally inspired by Kennedy and Eberhart's [4] observation of the foraging behavior of birds. It is a nature-mimicking optimization algorithm built on the concept of swarm intelligence. Suppose a flock of birds is randomly scattered in a space that contains many food piles of different sizes; the largest food pile is then the best position (gbest) in this space. Each bird starts searching for food piles at a random location, searches along routes using its own experience, and records the largest food pile (pbest) that it has found so far. When a particular bird finds a better food pile than any found by the flock, it notifies the other birds to move toward that best food pile. Therefore, each bird's subsequent search route is affected by three factors: the direction from its own experience (the direction of its own velocity), the direction of the best food pile position it has found itself (the direction of its own best solution), and the direction of the best food pile position found by the whole flock (the direction of the global best solution).
2.1. Particle Swarm Algorithm
In the particle swarm algorithm, each bird in the space is regarded as a particle in the "solution space." The current position of each particle is a candidate solution to the optimization problem, and each solution corresponds to a value called the objective function value or fitness value. Each particle has its own velocity (Vi) and uses its own velocity direction, the best solution (pbest) it has found so far, and the best solution (gbest) found by the group to generate a new particle velocity. After the update velocity and direction of the particle are determined, a new position is generated. Subsequently, the objective function is evaluated at the position of each particle to judge the quality of the current position: if it is better than the previously found solution, the best solution is replaced; otherwise, the original best solution is kept. This mechanism is iterated to search the solution space for the best solution, and the search mechanism is expressed by the following Equations (1) and (2):

V_i^θ(k+1) = V_i^θ(k) + c1 r1 (pbest_i^θ(k) − X_i^θ(k)) + c2 r2 (gbest^θ(k) − X_i^θ(k)), (1)

X_i^θ(k+1) = X_i^θ(k) + V_i^θ(k+1), (2)

where i = 1, 2, …, m (m denotes the number of particles); k represents the iteration index; V_i^θ(k) denotes the velocity vector of the i-th particle in the θ dimension; X_i^θ(k) represents the position vector of the i-th particle in the θ dimension; pbest_i^θ(k) denotes the position vector of each particle's best solution in the θ dimension at each iteration; gbest^θ(k) represents the position vector of the group's best solution in the θ dimension; c1 denotes the individual learning parameter; c2 represents the group learning parameter; r1 and r2 denote random numbers between 0 and 1; and θ represents the dimensionality of the search space.
The individual learning parameter c1 and the group learning parameter c2 represent the acceleration weights with which a particle advances toward its own best solution and the group's best solution, respectively, in each iteration. When the value of c1 or c2 is small, the particle performs multiple searches near the target area before reaching its own best solution or the group's best solution, which increases the probability of finding the global best solution but at the cost of more computation and time. When the value is large, the particle reaches its own best solution or the group's best solution faster, which saves unnecessary calculations and improves the convergence speed. Moreover, when c1 or c2 is 0, the particle swarm algorithm exhibits different characteristics [24].
The first part of Equation (1) is the particle’s previous inertia, i.e., the velocity of its previous experience. The second part is the “cognition” part, which represents the thinking of the particle itself. Finally, the third part is the “social” part, which implies that the information among particles is shared such that the particles can cooperate. Therefore, the core of the particle swarm algorithm is to use these three parts to update the particle speed and position in a linear combination and to calculate the fitness value to complete the problem optimization.
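As an illustration, the linear-combination update of Equations (1) and (2) can be sketched in Python. This is a minimal sketch: the function and parameter names are our own, and the sphere function serves only as a toy objective.

```python
import numpy as np

def pso(objective, bounds, m=30, theta=2, iters=100, c1=2.0, c2=2.0, seed=0):
    """Minimal PSO sketch following Eqs. (1) and (2):
    velocity = previous velocity (inertia) + cognition + social."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (m, theta))          # particle positions
    v = np.zeros((m, theta))                     # particle velocities
    pbest = x.copy()                             # each particle's best position
    pbest_val = np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()     # global best position
    for _ in range(iters):
        r1 = rng.random((m, theta))
        r2 = rng.random((m, theta))
        v = v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (1)
        x = x + v                                               # Eq. (2)
        val = np.apply_along_axis(objective, 1, x)
        improved = val < pbest_val               # replace pbest only if better
        pbest[improved] = x[improved]
        pbest_val[improved] = val[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: minimize the sphere function f(x) = sum(x**2)
best, best_val = pso(lambda p: float(np.sum(p**2)), (-5.0, 5.0))
```

Note that this bare form, without the inertia weight or velocity limit discussed below, can oscillate; the pbest/gbest bookkeeping nevertheless guarantees that the returned value never worsens across iterations.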
2.2. Fractional-Order Particle Swarm Algorithm
Fractional calculus generalizes traditional calculus [25]. Equation (3) gives the fractional derivative based on the Grünwald–Letnikov definition, where D stands for the differential operator, λ denotes the fractional order, and Γ is the Euler gamma function:

D^λ[x(t)] = lim_{h→0} (1/h^λ) Σ_{j=0}^{∞} (−1)^j (Γ(λ+1) / (Γ(j+1) Γ(λ−j+1))) x(t − jh). (3)

According to Solteiro Pires et al. [26], when expressed in discrete terms, Equation (3) can be approximated as Equation (4):

D^λ[x(t)] = (1/T^λ) Σ_{j=0}^{r} (−1)^j (Γ(λ+1) / (Γ(j+1) Γ(λ−j+1))) x(t − jT), (4)

where T denotes the sampling period, and r represents the truncation order.
In contrast to the integer-order derivative, which is a finite series, the fractional-order derivative requires an infinite number of terms. This implies that the information captured by the fractional order is more global than that of integer-order differentiation. Therefore, the fractional order can explore the solution space more finely than the integer order and is expected to achieve better solution accuracy. Further, fractional differentiation gives the particle swarm algorithm a memory of past positions, so the velocity vector is affected by the positions of several previous generations. This makes the fractional-order particle swarm algorithm more conservative in the search process, yielding more consistent and stable results across runs.
Using r = 4 as an example, the velocity and position vectors are updated with the following Equations (5) and (6):

V_i^θ(k+1) = λ V_i^θ(k) + (1/2) λ (1−λ) V_i^θ(k−1) + (1/6) λ (1−λ)(2−λ) V_i^θ(k−2) + (1/24) λ (1−λ)(2−λ)(3−λ) V_i^θ(k−3) + c1 r1 (pbest_i^θ(k) − X_i^θ(k)) + c2 r2 (gbest^θ(k) − X_i^θ(k)), (5)

X_i^θ(k+1) = X_i^θ(k) + V_i^θ(k+1). (6)
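Assuming Equation (5) keeps the first four Grünwald–Letnikov memory terms, the fractional velocity update can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def fractional_velocity(v_hist, x, pbest, gbest, lam, c1=2.0, c2=2.0, rng=None):
    """Fractional velocity update with truncation order r = 4 (Eq. (5)).
    v_hist holds the last four velocities [v(k), v(k-1), v(k-2), v(k-3)]."""
    rng = rng or np.random.default_rng()
    # Grünwald–Letnikov memory coefficients for j = 1..4
    coeffs = [lam,
              lam * (1 - lam) / 2,
              lam * (1 - lam) * (2 - lam) / 6,
              lam * (1 - lam) * (2 - lam) * (3 - lam) / 24]
    memory = sum(a * v for a, v in zip(coeffs, v_hist))
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    return memory + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
```

With λ = 1 the coefficients reduce to [1, 0, 0, 0] and the update collapses to the integer-order rule of Equation (1), which is a quick sanity check on the memory terms.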
To improve the direction from which the particles search, Shi and Eberhart [8] added an inertia weight term w to the velocity V to control the contribution of the particle's own velocity to the update. When w is large, the direction of the updated velocity depends mainly on the direction of the previous generation's velocity. The search direction is then more stable, which improves the global search ability of the particles in the space. However, when w is excessively large, overcorrection occurs: the velocity correction becomes too large and the particle deviates from better solutions, resulting in "flying" trajectories.
When w is small, the update velocity direction is dominated by the directions of the local best solution and the global best solution, which provides local search capability. However, because the solutions searched by the particles are not explored globally, the obtained solution may not be the global best. Therefore, Shi and Eberhart [9] proposed changing the "constant inertia weight" to a "linearly decreasing inertia weight." When w is set to a larger value in the initial stage, the particle swarm has a better ability to perform an extended search for the region of the global best solution. As the number of iterations increases, the value of w is gradually reduced, and the swarm switches from an extended search to a local search to refine the best solution found so far. The formula changing the constant inertia weight w to the time-varying linear inertia weight w(k) is shown in Equation (7):

w(k) = w_max − (w_max − w_min) · k / iter_max, (7)

where k denotes the iteration number; w_max represents the maximum value of the inertia weight; w_min denotes the minimum value of the inertia weight; and iter_max represents the maximum number of iterations.
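A direct transcription of the linearly decreasing inertia weight of Equation (7) (the symbol names are ours):

```python
def inertia_weight(k, w_max=0.9, w_min=0.4, iter_max=100):
    """Linearly decreasing inertia weight, Eq. (7): starts at w_max for
    broad exploration and falls to w_min for late local refinement."""
    return w_max - (w_max - w_min) * k / iter_max
```

At k = 0 this returns w_max, at k = iter_max it returns w_min, and it interpolates linearly in between.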
To avoid excessive velocities that exceed the search space during the particle update, Shi and Eberhart [13] used the maximum velocity (Vmax) method to limit particle velocity and improve particle search ability. The value of Vmax cannot be set too large, because the particles could then move at high speed and fly out of the search range. The value of Vmax cannot be set too small either, because the particle swarm would then search the space too slowly, fail to cover the global space, and be limited to the best solution in a local range. The maximum velocity method is given by Equation (8):

V_i^θ(k+1) = Vmax if V_i^θ(k+1) > Vmax; V_i^θ(k+1) = −Vmax if V_i^θ(k+1) < −Vmax. (8)

In this study, Vmax was set to 0.2 times the maximum search range.
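The Vmax clamp amounts to a componentwise clip of the velocity; a minimal sketch, with the 0.2 ratio used in this study as the default:

```python
import numpy as np

def clamp_velocity(v, x_min, x_max, ratio=0.2):
    """Vmax method: clip each velocity component to [-Vmax, Vmax],
    with Vmax set to `ratio` times the search range (0.2 in this study)."""
    v_max = ratio * (x_max - x_min)
    return np.clip(v, -v_max, v_max)
```

For a search range of [-5, 5], Vmax = 0.2 × 10 = 2, so a velocity of 10 is clipped to 2 while velocities already inside the band pass through unchanged.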
2.3. Robust and Controllable
Combining the "linearly decreasing inertia weight" of Equation (7) with Equation (5), and expanding Equation (6), yields Equations (9) and (10). These can then be rewritten in the state-space form of Equation (11):
where the system state vector and the control input vector are defined for each particle i = 1, 2, …, m (m denotes the number of particles) and each dimension θ of the search space, and k is the iteration index.
AI denotes the system matrix, and BI represents the input matrix, as shown in Equation (12), where I denotes the θ × θ identity matrix, and r1 and r2 denote random numbers between 0 and 1. Equations (9) and (10) are equivalent to Equation (11) by Equation (12) and the rules of matrix multiplication.
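Controllability of a matrix pair can be checked numerically with the standard Kalman rank test. The sketch below is generic (it is not specific to the matrices of Equation (12)) and only illustrates the criterion that underlies the controllable pairs discussed here:

```python
import numpy as np

def is_controllable(A, B, tol=1e-9):
    """Kalman rank test: (A, B) is controllable iff the controllability
    matrix [B, AB, ..., A^(n-1)B] has full row rank n."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])           # next block A^j B
    return np.linalg.matrix_rank(np.hstack(blocks), tol) == n
```

For example, a double integrator with A = [[0, 1], [0, 0]] and B = [[0], [1]] is controllable, whereas A = I with B = [[1], [0]] is not, since the input never reaches the second state.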
If each pair (AI, BI) is controllable, then the state Equation (11) is said to be robustly controllable [27]. Suppose that a fixed fractional order λ, an inertia weight constant reference value, an individual learning parameter constant reference value, and a group learning parameter constant reference value are selected as the nominal values of the fractional-order particle swarm algorithm. In that case, the state Equation (11) can be rewritten as an uncertain linear system, i.e., the system is transformed into a nominal fractional-order particle swarm optimization (FPSO) system combined with uncertainty matrices, where ∆A and ∆B denote the uncertainty matrices of the system matrix and the input matrix, respectively, as shown in the following equation:
In this study, a sufficient condition is proposed to show that the linear system with unstructured parametric uncertainties is robustly controllable: assume that the nominal linear system of Equation (11) is controllable. If the following conditions hold, then Equations (13) and (15) are robustly controllable, where I denotes the identity matrix and the remaining quantity is the number of uncertainty matrices. The matrices involved are defined through a singular value decomposition with unitary matrices and their corresponding singular values. The proof of this sufficient condition is given in Appendix A.
2.4. Uncertain Parameter Range Corresponds to a Random Number Range
This section uses the sufficient condition proved in Appendix A to analyze the uncertain linear system of the fractional-order particle swarm algorithm and find the range of uncertain parameters corresponding to the range of random numbers. Herein, we set the maximum inertia weight w_max of the fractional-order particle swarm algorithm to 0.9, the minimum inertia weight w_min to 0.4, the individual learning parameter c1 to 2, and the group learning parameter c2 to 2. Moreover, we set the individual learning parameter constant reference value to 1 and the group learning parameter constant reference value to 1. The following Equation (20) can then be obtained. In this system, the inertia weight parameter of each generation is a constant value, which does not affect the robust controllability of Equation (20).
As the fractional-order particle swarm algorithm produces different values in Equation (20) for different values of λ, the corresponding ranges of the random numbers r1 and r2 also differ. As λ usually lies between 0 and 2, this study divides λ into 20 values at intervals of 0.1 and derives the ranges one by one. Table 1 shows the r1 and r2 ranges for different λ values according to Equation (20), where one random number range remains between 0 and 1, and the other changes with the value of λ. Among them, the range obtained with λ = 0.3 is the widest and least conservative; thus, it is adopted as the final range of r1 and r2 in Equation (21). The fractional-order particle swarm algorithm using the random numbers of Equation (21) is called the controllable fractional-order particle swarm optimizer (CFPSO) algorithm.
This study follows Lin [21] in deciding when the controllable fractional-order particle swarm update should be applied: when the conditions below are met, it is executed. The thresholds in the conditions are set to 10^−4. The execution steps of the controllable fractional-order particle swarm algorithm are as follows:
Step 1: Set the number of particles, the maximum value w_max and minimum value w_min of the inertia weight in Equation (7), the individual learning parameter c1, the group learning parameter c2, the fractional order λ, the number of function evaluations, and the maximum number of iterations iter_max in Equation (7);
Step 2: Initialize random particle positions and set the initial velocities to 0;
Step 3: Calculate particle fitness;
Step 4: Update each particle's best solution and the global best solution;
Step 5: Check whether the condition of Equation (22) is satisfied. If it is, obtain the controllable random number range according to Equation (21) and update Equations (6) and (9);
Step 6: Check whether the stop condition is met; if not, return to Steps 3–5 until it is met.
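Steps 1–6 can be sketched as follows. This is our own minimal reading, not the authors' implementation: the controllability check of Equation (22) and the exact random-number ranges of Equation (21) are omitted (plain uniform random numbers are used), and the inertia weight of Equation (7) is assumed to replace the first fractional memory coefficient.

```python
import numpy as np

def cfpso(objective, bounds, m=20, theta=2, iters=100, lam=0.3,
          w_max=0.9, w_min=0.4, c1=2.0, c2=2.0, seed=1):
    """Simplified fractional-order PSO loop following Steps 1-6."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    v_max = 0.2 * (hi - lo)                      # Vmax clamp limit
    a = [lam * (1 - lam) / 2,                    # memory coefficients j = 2..4
         lam * (1 - lam) * (2 - lam) / 6,
         lam * (1 - lam) * (2 - lam) * (3 - lam) / 24]
    x = rng.uniform(lo, hi, (m, theta))          # Step 2: random positions
    v_hist = [np.zeros((m, theta)) for _ in range(4)]  # v(k)..v(k-3), all zero
    pbest = x.copy()
    pbest_val = np.apply_along_axis(objective, 1, x)   # Step 3
    g = pbest_val.argmin()
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for k in range(iters):
        w = w_max - (w_max - w_min) * k / iters        # Eq. (7)
        r1 = rng.random((m, theta))
        r2 = rng.random((m, theta))
        memory = w * v_hist[0] + sum(c * vh for c, vh in zip(a, v_hist[1:]))
        v = memory + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        v = np.clip(v, -v_max, v_max)                  # Vmax clamp
        x = x + v                                      # Eq. (6)
        val = np.apply_along_axis(objective, 1, x)     # Step 3
        better = val < pbest_val                       # Step 4
        pbest[better], pbest_val[better] = x[better], val[better]
        g = pbest_val.argmin()
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
        v_hist = [v] + v_hist[:3]                      # shift velocity memory
    return gbest, gbest_val

best, best_val = cfpso(lambda p: float(np.sum(p**2)), (-5.0, 5.0))
```

The velocity memory is shifted each generation so that the four most recent velocities feed the fractional terms, which is what distinguishes this loop from the integer-order sketch earlier in the section.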
3. XGBoost
Ensemble machine learning algorithms combine many "weak learners" into one "strong learner," using two main integration methods. The first is "bagging" [28]: each weak learner randomly selects some samples for independent training, and the final classification result is the category chosen by the most weak learners (majority voting). The most representative algorithm is random forest [29]. The second method is "boosting" [30]: the weak learners have a sequential relationship, and each weak learner learns the information that the previous one failed to learn. After N repetitions, the N weak learners are weighted and combined into a strong learner. The most representative algorithm is the adaptive boosting (AdaBoost) algorithm.
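The majority-voting aggregation step of bagging can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def majority_vote(predictions):
    """Combine weak learners by majority voting: each row of `predictions`
    holds one weak learner's class labels for all samples."""
    predictions = np.asarray(predictions)
    out = np.empty(predictions.shape[1], dtype=predictions.dtype)
    for j in range(predictions.shape[1]):
        labels, counts = np.unique(predictions[:, j], return_counts=True)
        out[j] = labels[counts.argmax()]         # most frequent class wins
    return out
```

For three weak learners predicting [0, 1, 1], [0, 0, 1], and [1, 0, 1], the ensemble output is [0, 0, 1]: each sample takes the class that the majority of learners chose.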
Chen and Guestrin [22] proposed the XGBoost algorithm. It combines the advantages of bagging and boosting and improves the boosting method, which only optimizes the loss function, by introducing a regularization function. The regularization function mainly limits the complexity of the model; with it, the model is less complicated and less likely to overfit. XGBoost uses classification and regression trees (CART) [31] as weak learners. CART can be applied to classification tasks because it uses binary splits and features can be reused to generate trees; it can also be applied to regression tasks. CART uses the Gini index as the criterion for selecting features, which reduces the number of calculations.
For a given training set S, its Gini index is

Gini(S) = 1 − Σ_{k=1}^{K} (|C_k| / |S|)^2, (23)

where C_k denotes the subset of samples belonging to the k-th category in S, and K represents the number of categories. The greater the Gini index, the greater the uncertainty of the data.
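The Gini index defined above is a one-liner over the class proportions; a minimal sketch:

```python
import numpy as np

def gini_index(labels):
    """Gini index of a label set: 1 - sum over categories of (|C_k|/|S|)^2."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size                     # class proportions |C_k|/|S|
    return 1.0 - float(np.sum(p ** 2))
```

A pure single-class set has Gini 0, while a 50/50 binary split has Gini 0.5, reflecting the greater uncertainty noted above.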
CART uses a binary tree as the decision tree and classifies node features only as "yes" or "no." The decision tree is therefore equivalent to recursively partitioning each feature: the feature space is divided into a finite number of units, and the prediction probability distribution is determined on these units. The overall process comprises two steps: decision tree generation and pruning. Generation starts from the root node and splits nodes recursively until a stopping condition is met. The stopping conditions are as follows: (1) the number of samples in the node is less than a preset threshold; (2) the Gini index of the sample set is less than a preset threshold; (3) the depth of the decision tree meets the specified condition; (4) no further split is possible after a feature has been used. Pruning starts from the bottom of the generated decision tree and prunes one node at a time until the root node is reached, forming a subtree sequence. Subsequently, cross-validation is used to evaluate the subtree sequence on the validation data set, and the best subtree is selected.
Suppose that a new tree f_n is to be constructed in the n-th iteration; the objective function is

Obj^(n) = Σ_i l(y_i, ŷ_i^(n)) + Ω(f_n), with Ω(f_n) = γT + (1/2) λ Σ_{j=1}^{T} w_j^2, (24)

where l denotes a loss function, which is convex; Ω represents the regularization term; ŷ_i^(n) denotes the model prediction for the n-th round; T represents the number of leaf nodes; f_n denotes the structure of the n-th tree; γ represents the penalty coefficient for the number of leaf nodes; λ denotes the penalty coefficient for the leaf node scores; and w_j represents the score of each leaf node of the tree.
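The regularization term of the objective above can be evaluated numerically; this sketch uses the standard XGBoost form Ω(f) = γT + (1/2)λ‖w‖², with illustrative coefficient values:

```python
import numpy as np

def regularization(leaf_scores, gamma=1.0, lam=1.0):
    """Regularization term of the XGBoost objective: gamma penalizes the
    number of leaves T, lam penalizes the squared leaf scores w_j."""
    w = np.asarray(leaf_scores, dtype=float)
    return gamma * w.size + 0.5 * lam * float(np.sum(w ** 2))
```

For a two-leaf tree with scores [1, -2], γ = 0.5, and λ = 2, the penalty is 0.5·2 + 0.5·2·(1 + 4) = 6, showing how both tree size and leaf magnitudes raise the objective and discourage overfitting.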