3.1. Multi-Objective Chicken Swarm Optimization Solution Strategy Based on Flow Collaboration
To solve the collaborative optimization model constructed in
Section 2.4 and to determine the optimal process indicators that guide the adjustment of the operating parameters, an improved multi-objective chicken swarm optimization algorithm based on flow collaboration (IMOCSO) is proposed in this paper. The specific contents are as follows:
(1) Hierarchical relationship update between chicken populations: In the MOCSO, the synergy degree ($SE$) is selected as the aggregation function of the multiple objectives. The algorithm sorts the chicken population according to the values of this aggregate objective function and divides it, in the prescribed proportions, into the rooster (NR), hen (NH), and chick (NC) subgroups. The order parameters can have two opposite effects: a positive effect means that the degree of order of the subsystem increases as the order parameter increases [31], whereas a negative effect means that the degree of order of the subsystem decreases as the order parameter increases [32]. Based on the efficacy coefficient, the degree of synergy among the MF–EF–IF can be introduced to characterize the overall performance of the milling system. The efficacy coefficient ($F_s$) and synergy degree ($SE$) of the order parameters are calculated as follows:
where $e_i$ is the order parameter of the $i$-th flow, and $e_i^{\max}$ and $e_i^{\min}$ are the maximum and minimum values of $e_i$, respectively.
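The display forms of Equations (12) and (13) are not reproduced above. For reference, a commonly used efficacy-coefficient and synergy-degree formulation, which those equations are assumed to resemble (the geometric-mean aggregation shown here is one common choice, not necessarily the exact definition used in the paper), is:

```latex
F_s(e_i) =
\begin{cases}
\dfrac{e_i - e_i^{\min}}{e_i^{\max} - e_i^{\min}}, & \text{positive effect}, \\[6pt]
\dfrac{e_i^{\max} - e_i}{e_i^{\max} - e_i^{\min}}, & \text{negative effect},
\end{cases}
\qquad
SE = \Bigg( \prod_{i=1}^{n} F_s(e_i) \Bigg)^{1/n},
```

where $n$ is the number of flows (here, $n = 3$ for the MF, EF, and IF).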
(2) Update of the position of each chicken subgroup: A forward learning mechanism is introduced into the rooster subgroup to accelerate convergence, and the rooster position is updated as follows:
where $SE_i \geq SE_k$ indicates that the $i$-th rooster weakly dominates the $k$-th rooster, $Randn(0, \sigma^2)$ is a Gaussian distribution with a mean of zero and a standard deviation of $\sigma$, $\varepsilon$ is a small constant that prevents the denominator from being zero, $x_i^t$ is the position of the $i$-th rooster at the $t$-th iteration, $x_i^{t+1}$ is the position of the $i$-th rooster at the $(t+1)$-th iteration, $x_{gbest}^t$ is the globally optimal individual at the $t$-th iteration, i.e., the individual with the largest degree of collaboration in the archive, and $c$ is the learning factor of forward learning. According to Equations (12) and (13), the $SE$ of each rooster is calculated, where $SE_i$ is the synergy degree of the $i$-th rooster and $SE_k$ is the synergy degree of the $k$-th individual. Each hen randomly selects a rooster to follow, and its position is updated as follows:
where $x_i^t$ is the position of the $i$-th hen at the $t$-th iteration, $x_i^{t+1}$ is the position of the $i$-th hen at the $(t+1)$-th iteration, $x_{r1}^t$ is the rooster followed by the $i$-th hen at the $t$-th iteration, $x_{r2}^t$ is a rooster or hen randomly selected from the whole flock, and $r1 \neq r2$; $SE_i$, $SE_{r1}$, and $SE_{r2}$ are the synergy degrees of the $i$-th, $r1$-th, and $r2$-th individuals, respectively. The parental guidance mechanism and adaptive factors are introduced into the chick's position update as follows:
where $x_i^t$ is the position of the $i$-th chick at the $t$-th iteration, $x_i^{t+1}$ is the position of the $i$-th chick at the $(t+1)$-th iteration, $x_m^t$ is the hen followed by the $i$-th individual, $x_r^t$ is the rooster followed by the $i$-th chick, $\omega$ is the adaptive weight, and $FL$ and $C$ are the learning factors from the hens and roosters, respectively.
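As an illustration of how these updates fit together, the following is a minimal sketch of the three position-update rules, assuming the standard CSO update forms adapted to maximize the synergy degree $SE$, with a forward-learning term toward the archive's best individual and the parental-guidance term described above; the function names and the factors `c`, `fl`, and `w` are illustrative placeholders rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-12  # small constant preventing division by zero (the epsilon in the text)

def update_rooster(x_i, se_i, se_k, x_gbest, c=0.5):
    """Rooster update: CSO-style Gaussian perturbation plus a forward-learning
    step toward the archive's best individual x_gbest (illustrative form)."""
    if se_i >= se_k:                # i-th rooster (weakly) dominates the k-th one
        sigma2 = 1.0
    else:                           # otherwise shrink the step according to the SE gap
        sigma2 = np.exp((se_i - se_k) / (abs(se_k) + EPS))
    x_new = x_i * (1.0 + rng.normal(0.0, np.sqrt(sigma2), size=x_i.shape))
    x_new += c * rng.random(x_i.shape) * (x_gbest - x_i)   # forward learning
    return x_new

def update_hen(x_i, x_r1, x_r2, se_i, se_r1, se_r2):
    """Hen update: follow the group rooster r1 and a random individual r2 (r1 != r2)."""
    s1 = np.exp((se_i - se_r1) / (abs(se_i) + EPS))
    s2 = np.exp(se_r2 - se_i)
    r = rng.random(2)
    return x_i + s1 * r[0] * (x_r1 - x_i) + s2 * r[1] * (x_r2 - x_i)

def update_chick(x_i, x_mother, x_rooster, w=0.7, fl=0.5, c=0.4):
    """Chick update with parental guidance: learn from both its hen and its rooster."""
    return w * x_i + fl * (x_mother - x_i) + c * (x_rooster - x_i)
```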
(3) Maintenance of the external archive: The obtained non-dominated solution set is stored in an external archive. An exponential function is introduced to maintain information sharing between the particles, preventing the archive population from exploding while preserving its diversity. The Euclidean distance $d_{ij}$ is used to measure the degree of aggregation between the $i$-th particle and the $j$-th particle, after which an exponential distance update is introduced [33].
where $d_{ij}$ is the distance between the $i$-th particle and the $j$-th particle, $u_k$ and $l_k$ are the upper and lower limits of the $k$-th variable, respectively, and the function $randn(\cdot)$ returns a randomly selected value from a normal distribution.
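A minimal sketch of the archive-maintenance idea is given below, assuming that pairwise Euclidean distances (normalized by the variable ranges) are aggregated with an exponential function and that the most crowded member is discarded when the archive exceeds its capacity; the exact exponential distance update of [33] is not reproduced here, so the aggregation rule is illustrative only.

```python
import numpy as np

def prune_archive(archive, lower, upper, max_size):
    """Keep the external archive within max_size by discarding its most crowded member.

    Crowding is measured from pairwise Euclidean distances between archive members,
    normalized by the variable ranges, and aggregated with exp(-d_ij) so that close
    pairs contribute large crowding values (illustrative rule, see lead-in)."""
    archive = [np.asarray(x, dtype=float) for x in archive]
    lo = np.asarray(lower, dtype=float)
    span = np.asarray(upper, dtype=float) - lo
    while len(archive) > max_size:
        pts = (np.stack(archive) - lo) / span
        diff = pts[:, None, :] - pts[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
        np.fill_diagonal(d, np.inf)                # ignore self-distances
        crowding = np.exp(-d).sum(axis=1)          # exponential crowding measure
        archive.pop(int(np.argmax(crowding)))      # drop the most crowded solution
    return archive
```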
The IMOCSO algorithm is used to solve the established model, and the solution process is shown in
Figure 8.
3.2. Adaptive Optimization of Operating Parameters Based on Deep Reinforcement Learning
The sugarcane milling process is a continuous, 24/7 production process. During production, when the order parameters of the milling process (i.e., grinding capacity, electric consumption per ton of sugarcane, and sucrose extraction) fluctuate, the operating parameters must be adjusted so that the order parameters quickly return to near their optimal targets. Because the working conditions change constantly throughout the production process, the operating parameters need to be adjusted continuously during the production cycle to keep production stable. Adaptive optimization of the sugarcane milling process therefore means that the operating parameters are adjusted continuously according to the real-time measured values of the order parameters whenever the working conditions change, ensuring that the order parameters remain within the optimal range.
In
Section 2.4, a data-driven MF–EF–IF model is presented to provide real-time values of the order parameters, while in
Section 3.1 the IMOCSO is used to solve for the optimal values of the order parameters under all working conditions. However, there are many conflicts among the order parameters, and constraints such as the production boundary conditions change over time, so the optimal solution set and the Pareto frontier also change over time. Traditional multi-objective optimization methods cannot adapt to such a changing production environment, and they struggle to quickly track the Pareto frontier and Pareto solution set after detecting environmental changes. Therefore, on this basis, a deep reinforcement learning technique is introduced and applied to the sugarcane milling process to optimize its operating parameters.
Deep reinforcement learning (DRL) is a technique for training an agent that interacts with its environment and learns the mapping from states to actions, based on the powerful fitting capability of neural networks. DRL uses the Markov decision process (MDP) to model the training process, which comprises four basic elements $(S, A, P, R)$, where $S$ is the set of all states of the process, $A$ is the set of all possible actions, $P$ denotes the probability of transferring from one state to another, and $R$ is the reward function by which the action taken by the agent affects the environmental state. Li et al. developed a deep-reinforcement-learning-based online path-planning approach for unmanned aerial vehicles (UAVs) and used Markov decision processes to define and explain the UAV state space, UAV action space, and reward functions [
34]. Zhang et al. proposed a deep-reinforcement-learning-based energy scheduling strategy to optimize multiple targets, taking diversified uncertainties into account; an integrated power, heat, and natural gas system consisting of energy-coupling units and wind power generation interconnected via a power grid was modeled as a Markov decision process [
35]. Liu et al. proposed an adaptive uncertain dynamic economic dispatch method based on deep deterministic policy gradient (DDPG); on the basis of the economic dispatch model, they built a Markov decision process for power systems [
36]. In this paper, the operational optimization of the sugarcane milling process is described as an MDP, which is modeled as follows:
(1) State space $S$: The state space determines the agent's perception of the environment. On the basis of the MF–EF–IF state parameters of the milling system described in
Section 2.3, 14 parameters with a certain influence on the order parameters, such as the #2 crusher current (West) ($x_3$) and the #3 crusher current ($x_4$), are selected as the state space. The state space is expressed as follows:
(2) Action space $A$: The action space of the agent is the algorithm's output, comprising the operating parameters that need to be adjusted adaptively. Based on the principle that the selected actions should correspond to the actual control variables, the key process parameters of the sugarcane milling process, i.e., the first-level belt speed ($x_6$), second-level belt speed ($x_8$), #3 squeezer speed ($x_{15}$), #4 squeezer speed ($x_{17}$), and #6 double roller speed ($x_{21}$), are selected as the action space. Assuming that the speed control actions for these five variables are $v_1$, $v_2$, $v_3$, $v_4$, and $v_5$, respectively, and that the control action for the ratio of osmotic water to sugarcane is $h$, the action space is expressed as follows:
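The display form of the action-space expression is not reproduced above; under the definitions just given, it presumably collects the six control actions into one vector, e.g.:

```latex
A = \{\, a \mid a = [\, v_1,\ v_2,\ v_3,\ v_4,\ v_5,\ h \,] \,\}.
```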
(3) Reward function $R$: The agent evaluates the action it takes through the reward function. Considering that the optimization objective is to minimize the deviation between the optimal values of the order parameters obtained in
Section 3.1 (i.e., grinding capacity, electric consumption per ton of sugarcane, and sucrose extraction) and their actual values, the reward function $r(s, a)$ for different actions under different states is determined as follows:
where $f(\cdot)$ is the mathematical model of the MF–EF–IF based on the DK-ELM method, and $y^{*}$ represents the optimal order parameters of each flow solved by the IMOCSO based on flow collaboration.
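To make the MDP elements concrete, the following is a minimal sketch of how the state, action, and reward defined above could be wrapped into an environment. The class name `MillingMDP`, the squared-deviation reward, and the callable `dk_elm_model` are illustrative assumptions; the paper's exact reward expression is not reproduced here.

```python
import numpy as np

class MillingMDP:
    """Illustrative wrapper for the milling-process MDP described above.

    state  : the 14 monitored process parameters (e.g., x3, x4, ...);
    action : [v1, v2, v3, v4, v5, h], i.e., the five speed set-points plus the
             osmotic-water-to-sugarcane ratio;
    reward : negative squared deviation between the order parameters predicted by
             the DK-ELM model f(s, a) and their optimal values y* from the IMOCSO
             (assumed form; the paper's reward may differ in detail)."""

    def __init__(self, dk_elm_model, y_star, action_low, action_high):
        self.f = dk_elm_model              # data-driven MF-EF-IF model: (state, action) -> order parameters
        self.y_star = np.asarray(y_star)   # optimal order parameters from the IMOCSO
        self.action_low = np.asarray(action_low)
        self.action_high = np.asarray(action_high)

    def reward(self, state, action):
        a = np.clip(action, self.action_low, self.action_high)   # respect actuator limits
        y = self.f(state, a)   # predicted [grinding capacity, electricity per ton, sucrose extraction]
        return -float(np.sum((np.asarray(y) - self.y_star) ** 2))
```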
It is necessary to choose a specific deep reinforcement learning framework by considering the application scenarios of each algorithm along with its advantages and disadvantages. Common deep reinforcement learning methods include deep Q networks (DQNs), Actor–Critic (AC), policy gradient (PG), and deep deterministic policy gradient (DDPG) [37,38,39]. Considering that the optimization of the operating parameters in sugarcane milling is a continuous control process, the DDPG algorithm, built on an actor–critic framework, is selected. After DDPG perceives the environmental state $s_t$, the actor online policy network outputs the action $a_t = \mu(s_t \mid \theta^{\mu})$, and the critic online Q network evaluates the action value $Q(s_t, a_t \mid \theta^{Q})$, where $\theta^{\mu}$ and $\theta^{Q}$ are the parameters of the actor and critic online networks, respectively. In order to improve the stability of the algorithm, an actor target policy network and a critic target Q network are also constructed.
To update the actor and critic networks, DDPG samples a mini-batch of N transitions $(s_i, a_i, r_i, s_{i+1})$ from the experience replay pool M to train the model, and the critic network is updated in the direction that minimizes the loss function $L$, denoted as follows:
where $y_i$ is the target value, $i$ is the index of the sampled transition, $\gamma$ is the discount factor, $\mu'(s_{i+1} \mid \theta^{\mu'})$ is the action determined by the target policy network from the next state $s_{i+1}$, and $\theta^{\mu'}$ and $\theta^{Q'}$ represent the parameters of the actor target policy network and the target Q network, respectively. Meanwhile, the actor network is updated according to the policy gradient as follows:
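The corresponding display equations are not reproduced above. For reference, in the standard DDPG formulation, which the loss and policy gradient referenced here are assumed to follow, the target value, critic loss, and policy gradient are:

```latex
y_i = r_i + \gamma\, Q'\!\big( s_{i+1},\ \mu'(s_{i+1} \mid \theta^{\mu'}) \,\big|\, \theta^{Q'} \big), \qquad
L = \frac{1}{N} \sum_{i=1}^{N} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^{2},

\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i=1}^{N}
\nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)}\;
\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}.
```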
The parameters of the target Q network and the target policy network in DDPG are updated in a soft manner, as follows:
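The missing expression presumably takes the standard DDPG soft-update form:

```latex
\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}, \qquad \tau \ll 1,
```

where $\tau$ is the soft-update coefficient.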
Because of the soft update, the target network parameters change only by a small amount at each step, which makes the algorithm more stable and easier to converge.
In order to ensure that the diversity of samples in the experience pool is conducive to network convergence, a random sample-discarding mechanism based on sample similarity is introduced during network training to improve the DDPG algorithm. The sample similarity is calculated as follows:
where $\rho$ is the sample similarity, $s_t$ is the state of the running process, $s_j$ is a state stored in the sample pool, $d(s_t, s_j)$ is the Euclidean distance between $s_t$ and $s_j$, and $d_{\max}$ is the maximum of all such Euclidean distances; the greater the similarity, the higher the probability of discarding the sample.
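A minimal sketch of this similarity measure is given below, assuming the form $\rho = 1 - d/d_{\max}$ evaluated against the nearest stored state; the exact mapping from distance to similarity used in the paper is not reproduced, so this form is illustrative.

```python
import numpy as np

def sample_similarity(s_new, pool_states):
    """Similarity between an incoming state and the states already in the pool.

    Assumed form: 1 - d_min / d_max, where d_min and d_max are the smallest and
    largest Euclidean distances from s_new to the stored states. The returned value
    can be used either as a discard probability or compared against a threshold."""
    pool = np.asarray(pool_states, dtype=float)
    d = np.linalg.norm(pool - np.asarray(s_new, dtype=float), axis=1)
    return 1.0 - d.min() / (d.max() + 1e-12)

# usage (random discarding): drop the sample if rng.random() < sample_similarity(s_t, pool)
```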
The optimization framework for the operating parameters of the sugarcane milling process based on the improved DDPG is shown in
Figure 9. The improved DDPG algorithm is used to realize the adaptive adjustment of the operating parameters of the sugarcane milling process through the following steps:
Step 1: The experience pool D with capacity N, the action value network, and the policy network are initialized, with randomly generated weight parameters. The parameters of the action value network and the policy network are then copied to the corresponding target networks;
Step 2: The Ornstein–Uhlenbeck (OU) noise of the random process used for action exploration is initialized, and the current state $s_t$ is obtained. The action $a_t$ is selected based on the current policy network and the noise, and the action is then executed to update the environment and to obtain the reward $r_t$ and the next state $s_{t+1}$;
Step 3: The sample similarity between the current state and the states in the experience pool is calculated. The sample is discarded if the similarity is greater than a given threshold; otherwise, it is stored in the experience pool. It is then checked whether the inner loop has been completed; if not, Step 2 is repeated;
Step 4: After a certain number of samples have accumulated in the experience pool, a small batch of trajectory data is randomly sampled from the experience pool D at specific time intervals. The action value network and the policy network are updated according to Equations (25) and (26), and the target action value network and the target policy network are softly updated after a certain time interval;
Step 5: The above steps are repeated until the specified number of training iterations is reached, and the set values of the optimal operating parameters are output.
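The following is a compact sketch of Steps 1 to 5, assuming a PyTorch implementation, a hypothetical gym-style environment `env` (e.g., a wrapper around `MillingMDP` with `reset()` returning a state and `step(a)` returning the next state, reward, and a done flag), and the `sample_similarity` helper sketched above; network sizes, learning rates, $\tau$, $\gamma$, and the similarity threshold are illustrative values, not the paper's settings.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

class OUNoise:
    """Ornstein-Uhlenbeck noise for action exploration (Step 2)."""
    def __init__(self, dim, theta=0.15, sigma=0.2):
        self.theta, self.sigma, self.x = theta, sigma, np.zeros(dim)
    def sample(self):
        self.x += self.theta * (-self.x) + self.sigma * np.random.randn(len(self.x))
        return self.x

def soft_update(target, online, tau=0.005):
    """Soft update of the target networks (Step 4)."""
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

def train(env, s_dim=14, a_dim=6, episodes=50, steps=200,
          gamma=0.99, batch=64, sim_threshold=0.95, capacity=100_000):
    # Step 1: initialize the online networks and copy them to the target networks.
    actor = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = []                                            # experience pool D

    for _ in range(episodes):
        noise, s = OUNoise(a_dim), env.reset()           # Step 2: OU noise and current state
        for _ in range(steps):
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy() + noise.sample()
            s2, r, _ = env.step(a)
            # Step 3: store only samples that are not too similar to those already stored.
            states = [tr[0] for tr in pool]
            if not states or sample_similarity(s, states) <= sim_threshold:
                pool.append((np.asarray(s, dtype=float), a, float(r), np.asarray(s2, dtype=float)))
                pool = pool[-capacity:]
            s = s2
            if len(pool) < batch:
                continue
            # Step 4: sample a mini-batch, update critic and actor, then soft-update targets.
            idx = np.random.choice(len(pool), batch, replace=False)
            bs, ba, br, bs2 = map(np.array, zip(*[pool[i] for i in idx]))
            bs = torch.as_tensor(bs, dtype=torch.float32)
            ba = torch.as_tensor(ba, dtype=torch.float32)
            br = torch.as_tensor(br, dtype=torch.float32).unsqueeze(1)
            bs2 = torch.as_tensor(bs2, dtype=torch.float32)
            with torch.no_grad():
                y = br + gamma * critic_t(torch.cat([bs2, actor_t(bs2)], dim=1))   # target value
            loss_c = ((critic(torch.cat([bs, ba], dim=1)) - y) ** 2).mean()        # critic loss
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()
            loss_a = -critic(torch.cat([bs, actor(bs)], dim=1)).mean()             # policy objective
            opt_a.zero_grad(); loss_a.backward(); opt_a.step()
            soft_update(critic_t, critic); soft_update(actor_t, actor)
    return actor   # Step 5: the trained policy outputs the operating-parameter set values
```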