1. Introduction
In recent years, unmanned platforms have gradually been applied in the military field, and it is widely accepted that they will play a vital role in future warfare. In surface or subsurface operations, such as nearshore defense, Unmanned Surface Vehicles (USVs) may be used for escort cruising, collaborative strikes, and regional detection [
1]. How to foster collaboration within swarm systems has therefore become an important issue and has attracted considerable research interest. The Defense Advanced Research Projects Agency (DARPA) has sponsored several research programs, such as OFFensive Swarm-Enabled Tactics (OFFSET) [
2], Collaborative Operations in Denied Environment (CODE) [
3], etc., to advance capabilities for swarms in complex environments.
Modern war is an all-round war, and coastal defense is undoubtedly indispensable. For countries with territorial seas, coastal defense is the first line of defense for their sovereignty and security [
4]. From the maritime confrontation during World War II to the current territorial conflicts such as in the Black Sea and the Indian Ocean [
5,
6], the importance of maritime defense lines has been repeatedly demonstrated [
7]. Surface operations with USVs are a typical scenario in future maritime conflicts: a swarm of weapon-equipped USVs can be used to strike multiple targets. To accomplish such a mission, each USV must be assigned to at least one target, with a planned path to its attack position. In other words, target assignment and path finding are the key problems.
Many researchers have studied this problem under the name of weapon-target assignment. Traditional weapon-target assignment research mainly concerns the formulation of the model and of the utility function. Regarding the model, many scholars focus on the static target assignment problem; that is, they do not consider changes in the position of the strike target and simplify it to a target fixed at a specified location within a certain time window [
8]; or they fix a weapon firepower unit [
9]. However, in actual nearshore combat scenarios, both the USV formation and the strike targets are dynamically changing, so applying a static model introduces deviations into the strike results. Furthermore, traditional models mostly concentrate on the allocation while neglecting path planning, which makes them unsuitable for USV planning. Finally, regarding utility functions, most works only consider combat efficiency, such as the product of target threat and the probability of damaging the target, without considering the impact of time costs.
The target assignment problem is NP (non-deterministic polynomial)-hard [
10] and can be solved with heuristic algorithms, including construction algorithms and local search algorithms [
11]. Construction algorithms trade solution quality for efficiency and are usually used to generate initial solutions [
12,
13], while local search algorithms mainly search for better solutions through different search operators. Existing algorithms include simulated annealing [
14], tabu search [
15], genetic algorithms (GA), and large neighborhood search [
16] algorithms. For heuristic algorithms, the final result depends on the initial solution. Therefore, for large-scale problems [
17], an inappropriate initial solution can trap the entire algorithm in a local optimum and require a large amount of computation time. In recent years, reinforcement learning (RL) has made breakthroughs in the field of combinatorial optimization [
18,
19,
20] due to its ability to estimate useful patterns that are difficult to find manually, especially in large-scale problems [
21] and the fast route generation process [
22]. RL-based models are now widely used to solve NP-hard optimization problems such as the Vehicle Routing Problem (VRP) [
23]. For example, [
24] uses a Deep Deterministic Policy Gradient algorithm for longitudinal speed control to implement a trajectory tracking control method for self-driving vehicles.
Therefore, to solve the large-scale dynamic weapon firepower allocation problem, this paper proposes a dynamic weapon firepower allocation model based on deep reinforcement learning (DRL), since heuristic algorithms cannot accomplish the real-time assignment of dynamic targets. Because this moving-target problem emphasizes cooperation between different agents, we use multi-agent rather than single-agent reinforcement learning. The main contributions are as follows: ① We propose a new mathematical model that describes the fire assignment problem of surface unmanned craft in dynamic environments and provide an initial mathematical solution of the model, including the strike bearing and the strike time. ② We combine task completion time with target threat level to establish a task evaluation model for this problem. ③ In contrast to previous work based on heuristic algorithms, we propose a MADDPG-based algorithm for solving dynamic problems at different scales and validate it experimentally.
The remainder of this study is organized as follows:
Section 1 introduces the background and importance of the moving target assignment.
Section 2 provides a brief survey of related studies on model and solving algorithms.
Section 3 proposes a mathematical formulation.
Section 4 introduces the MTWTA algorithm to solve the formulation.
Section 5 tests and discusses the algorithm through several experiments. Finally, the conclusions and discussion are presented in
Section 6.
3. Moving Targets WTA Model
This study primarily focuses on constructing a target assignment model that targets dynamic objectives using a single firepower node. The model considers a relative positional motion model to determine the real-time positions of the firepower strike node and the dynamic objectives. Additionally, a task evaluation model is established to assess the threat level posed by the dynamic objectives. Finally, mathematical formulas are proposed to minimize the enemy threat level within the entire target assignment model.
In summary, this research aims to develop an effective firepower allocation strategy by considering the real-time positions of the targets and evaluating their threat levels. By minimizing the enemy threat level, the proposed model seeks to optimize the allocation of firepower resources.
3.1. Problem Description
The primary objective of this research is to establish a model for firepower allocation targeting dynamic objectives. In this scenario, the enemy firepower units are dispersed around bases and engage in reconnaissance and destruction missions. The strategy involves deploying a swarm of unmanned surface vessels (USVs) from the base to strike all enemy firepower units. The problem can be represented by a tuple <A, N, E, C>, where A represents the USV swarm, N represents the enemy firepower units together with their survival and threat assessment information (including the number of units), E represents the real-time position and situational information of both the firepower units and the enemy targets, and C represents the constraints imposed on the model during the mission.
In the context of dynamic target firepower allocation, determining the optimal path for firepower nodes faces two main challenges. Firstly, enemy targets typically move in different directions within a specific range during their reconnaissance and destruction operations. This necessitates continuous observation of their positions by these firepower nodes, emphasizing the need for minimal travel time. Secondly, each target possesses different characteristics in terms of size, velocity, direction, and distance. For example, priority may be given to swiftly advancing enemy units or larger targets. Therefore, it is crucial to prioritize targets with higher threat levels. Considering these factors, the model considers the time required for firepower nodes to reach each strike position and the threat levels of targets when determining the optimal firepower allocation path. The model is based on the following assumptions:
Real-time position, direction, and velocity information of targets can be obtained through intelligence reconnaissance systems. The model in this paper only considers the enemy speed and direction at the calculation time.
The time required for firepower strikes is not explicitly considered. It is assumed that once firepower nodes reach the designated strike positions, the attacks on enemy targets are successfully executed.
3.2. Time
In order to determine the transfer route of the fire nodes, we analyze the relative movement of the fire nodes and targets and establish a Cartesian coordinate system. Due to the dynamic nature of the enemy targets, it is necessary to determine the position points and times at which the enemy targets are within the striking range of the firepower strike equipment. The meanings of the parameters are shown in
Table 1.
3.2.1. The Position Points
If the distance between the firepower attacking nodes and the enemy target is greater than the attacking range, we can determine the position point based on the following two hypotheses.
Hypothesis 1. When the attacking point lies on the straight line connecting the enemy target and the initial position of the firepower attacking node, it is the optimal attacking point.
As shown in
Figure 1, we construct a circle with the enemy target
B as the center,
L as the optimal attacking point, and any point
on the attacking circle as the attacking point. According to the perpendicular theorem and Pythagorean theorem, we have:
Therefore, it can be concluded that the
L point is the optimal striking point as it results in the shortest possible path.
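The omitted inequality can be sketched as follows, using the triangle inequality in place of the original perpendicular/Pythagorean argument and writing r for the attack radius: for any candidate attacking point L′ on the attack circle centered at B,
\[ |AL'| \ge |AB| - |BL'| = |AB| - r = |AL|, \]
with equality exactly when A, L′, and B are collinear, that is, when L′ coincides with L.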
Hypothesis 2. The total time is shortest when the time for the strike equipment to reach the striking position equals the time for the enemy target to reach the corresponding position.
As shown in
Figure 2,
the optimal strike point is the one for which the two arrival times coincide. Choosing a different candidate point means that the time the firepower strike node needs to reach the impact point differs from the time the enemy needs to arrive at the corresponding target position. Analysis shows that when the fire strike node
A arrives at the chosen impact point ahead of time, it must wait until the enemy target comes within range, and the total duration is longer than that of the matched strike point; conversely, if the enemy target arrives at the corresponding position before the firepower node, the target is not yet within the node's firing range at that moment, so the strike cannot be executed. Therefore, the arrival times of the enemy target and the firepower node should be consistent.
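In symbols (a sketch in assumed notation rather than that of Table 1: v_A and v_B denote the constant speeds of the firepower node and the enemy target, L the strike point, and B′ the target position at the moment of the strike), the matching condition of Hypothesis 2 reads
\[ t^{*} = \frac{|AL|}{v_A} = \frac{|BB'|}{v_B}. \]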
3.2.2. The Time of Fight
In order to determine the transfer route for the fire strike node, we have established a Cartesian coordinate system based on the relative motion of fire strike node and the enemy target. Due to the dynamic nature of the enemy target, we need to determine the position and time at which the enemy target falls within the striking range of fire strike equipment.
Scenario 1: Initial Position Transfer.
As shown in
Figure 3, at the initial moment, we depart from the base to engage the first designated target. Let us assume that we arrive at the firing position after a certain time, denoted as
. Point
A represents the location of the base, point
B represents the initial location of the enemy’s first target,
L represents the designated strike position, and
represents the mapped point of the enemy target. In
AB, we can apply the cosine rule to calculate:
the length is:
where
represents the direction of the enemy target’s movement, which can be obtained from Equation (6):
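As an illustration of the computation described in Scenario 1 (this is not Equation (6) or the other omitted formulas; the function name, the bisection method, and all parameter values are assumptions), the transfer time t and strike point L implied by Hypotheses 1 and 2 can be computed numerically as follows:

import math

def intercept_time(A, B0, vB_dir, vA, vB, r, t_max=1000.0, tol=1e-6):
    # Earliest time t at which a node starting at A with speed vA can be exactly
    # r away from the moving target (initial position B0, unit heading vB_dir,
    # speed vB), assuming vA > vB. Returns (t, strike_point).
    def gap(t):
        bx = B0[0] + vB * t * vB_dir[0]          # target position at time t
        by = B0[1] + vB * t * vB_dir[1]
        d = math.hypot(bx - A[0], by - A[1])
        return d - r - vA * t                    # > 0 while the strike circle is unreachable
    if gap(0.0) <= 0:                            # target already within attack range
        return 0.0, tuple(A)
    lo, hi = 0.0, t_max
    while hi - lo > tol:                         # gap(t) is strictly decreasing when vA > vB
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    t = hi
    bx = B0[0] + vB * t * vB_dir[0]
    by = B0[1] + vB * t * vB_dir[1]
    d = math.hypot(bx - A[0], by - A[1])
    # Hypothesis 1: the strike point L lies on the line from the target toward A,
    # at distance r from the target.
    L = (bx + r * (A[0] - bx) / d, by + r * (A[1] - by) / d)
    return t, L

For example, intercept_time(A=(0.0, 0.0), B0=(6.0, 8.0), vB_dir=(1.0, 0.0), vA=10.0, vB=1.5, r=2.0) returns the earliest interception time and the corresponding strike point.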
Scenario 2: Intermediate Node Transfer.
After determining the time when the fire strike node completes the strike on target
i, we can calculate the position changes of each target. As shown in
Figure 4, when the fire strike node completes the strike on target i at time
, the positions of the fire strike node and target
j are at points
and
. Now, we consider that the fire strike node departs from
to complete the strike on target
J. Assuming that after the interval
, the fire strike node moves from point
to point
, and the enemy target moves from point
J to point
. At this moment, the enemy target enters the firing range. In
J, according to the cosine rule, we have:
The location of enemy target at
:
the
can be calculated by Equation (11):
According to Equations (8)–(12), we can calculate the time:
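Since Equations (8)–(12) are not reproduced above, the relation they describe can be sketched in assumed notation: let d denote the distance between the node position and the target position at time t_i, θ the angle at the target between the direction toward the node and the target's heading, r the strike radius, and v_A, v_B the two speeds. Combining the cosine rule with the equal-time and collinearity conditions of Hypotheses 1 and 2 gives
\[ \big(v_A\,\Delta t + r\big)^2 = d^{2} + \big(v_B\,\Delta t\big)^2 - 2\,d\,v_B\,\Delta t\cos\theta , \]
a quadratic equation in Δt whose positive root is the transfer time; the strike position then follows from the collinearity condition.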
3.2.3. Model for Threat Assessment
The traditional VRP typically measures the quality of a solution by the total path length. However, in the context of military operations, the main objective is to maximize the overall combat effectiveness by minimizing the threat level posed by the enemy. Therefore, this article primarily evaluates this criterion from two perspectives:
(a) Enemy threat level: the primary goal of the VRP in combat is to minimize the threat posed by the enemy. In offshore defense, the threat degree of a target is initialized according to the speed and type of the enemy attacking ships. The calculation of the enemy threat degree can be obtained from the literature [
45].
(b) Strike Time (time from target detection to engagement): The enemy targets may gather information about resources and infrastructure through reconnaissance and sabotage missions. As time goes on, the enemy can collect more intelligence, posing a greater threat to us. Modern warfare emphasizes quick response and mobility. Conducting operations within a short time frame can shorten the enemy’s reaction time, reducing their interference and obstruction of actions. Swiftly striking targets can increase mobility and flexibility.
Since different strike orders yield different rewards, we define the threat reduction of an enemy target as follows:
where the strike-order index denotes the position of the target in the strike sequence of the unmanned surface vessel; for example, if the target is the second one to be struck, the index equals 2.
Therefore, the reward for one of the unmanned surface vessels is defined as:
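The reward formula itself is not reproduced above. Purely as an illustration (the decay factor, the time-cost weight, and the functional form are assumptions, not the paper's definition), a strike-order-dependent threat-reduction reward for one vessel could be computed as follows:

def usv_reward(threat_levels, strike_order, time_costs, decay=0.9, time_weight=0.1):
    # Illustrative only: threat reduction discounted by strike order, minus a
    # penalty proportional to the transfer time of each strike.
    reward = 0.0
    for k, target in enumerate(strike_order):            # k = 0 for the first strike
        reward += threat_levels[target] * (decay ** k)   # later strikes reduce less threat
        reward -= time_weight * time_costs[target]       # time cost of reaching the target
    return reward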
3.2.4. Mathematical Formulation
Constraint (20): At most, one visit from i to j. This constraint states that in path planning, each target point can only be visited once in the path from the starting point i to the destination point j.
Constraints (21) and (22): Must depart from and return to the base.
These constraints ensure that the path planning must start from the base, pass through a series of target points, and finally return to the base.
Constraint (23): Must visit all target points. This constraint requires that the path planning must pass through all target points, without skipping or ignoring any of them.
Constraint (24): Distance between target and strike point must be within the strike radius. This constraint ensures that the selected strike points in the path planning lie within the strike radius of the firepower force, enabling effective strikes.
Constraint (25): Maximum endurance time. This constraint limits the maximum endurance time of the unmanned surface vessels in the path planning, ensuring that they can complete the mission within the specified time frame.
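The constraint formulas themselves are not shown above; a standard arc-flow sketch consistent with the textual descriptions (all notation here is assumed: x_{ij}^{k} = 1 if USV k travels from node i to node j, node 0 is the base, T is the target set, t_{ij} is the transfer time from i to j, d(L_j, B_j) is the distance between the strike point and target j at the strike moment, R is the strike radius, and T_max is the endurance limit) would be
\[ \sum_{k} x_{ij}^{k} \le 1 \ \ \forall i,j, \qquad \sum_{j} x_{0j}^{k} = \sum_{i} x_{i0}^{k} = 1 \ \ \forall k, \qquad \sum_{k}\sum_{i} x_{ij}^{k} = 1 \ \ \forall j \in T, \]
\[ d\big(L_j, B_j\big) \le R \ \ \forall j \in T, \qquad \sum_{i}\sum_{j} t_{ij}\, x_{ij}^{k} \le T_{\max} \ \ \forall k, \]
corresponding to Constraints (20)–(25), respectively.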
4. Reinforcement Learning Model
In this section, we will introduce the algorithm. As shown in
Figure 5, the MTWTA framework is based on MADDPG combined with an RNN and an attention mechanism. The framework consists of three parts: the encoder, the actor networks, and the critic networks. The encoder draws on RNN and self-attention mechanisms for state representation, the actor networks perform action selection, and the critic networks are used for training. Each component is described in detail below.
4.1. Multi-Agent Reinforcement Learning Setting
State: Since the environment is fully observable to our side, that is, all unmanned surface vessels share target location and attack information, we define the state of a single agent to include both its own information and environmental information. The state of each unmanned surface vessel includes its location, the sequence of targets it has already visited, and its remaining energy; the public information in the environment includes the location, speed, threat level, and attack status of each target, as well as the locations of the other unmanned surface vessels.
Action: Each unmanned surface vessel can choose an un-attacked target or stop moving, which is encoded as a discrete action space. We use one-hot encoding to represent each action value; for example, if the third target is selected, the one-hot vector has a 1 in its third position, and the stop action is represented by a dedicated code. When an unmanned surface vessel reaches its maximum capacity or there are no targets left to attack in the environment, the returned action is ‘−1’, that is, return to the origin. Each step assigns one target, and the remaining unassigned targets are reassigned afterwards to achieve online dynamic allocation.
Reward: When an unmanned surface vessel completes an attack on a target, a positive reward is given, whose size is related to the threat level of the target. At the same time, each executed action consumes a certain amount of energy, so each action also incurs a negative reward proportional to the energy consumption.
Mask: Due to the limits on target visit times and the maximum capability of the unmanned surface vessels, there may be moments when a target can no longer be selected or a vessel cannot carry out a strike mission. Therefore, we use a mask to indicate whether an unmanned surface vessel can perform a given action at a certain moment. The mask rule is defined as follows:
When an unmanned surface vessel reaches its maximum capability, the corresponding mask entries are set so that the vessel cannot perform any more strike actions.
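A minimal sketch of the action encoding and masking logic described above (array names, shapes, and the specific masking convention are assumptions):

import numpy as np

def one_hot(index, n_targets):
    # Encode the choice of target `index` as a one-hot vector;
    # index == n_targets stands for the stop/return action.
    v = np.zeros(n_targets + 1)
    v[index] = 1.0
    return v

def build_mask(attacked, usv_strikes, max_capability, n_targets):
    # 1 = selectable, 0 = masked out.
    # attacked: boolean array marking targets that have already been struck;
    # usv_strikes: number of strikes this vessel has already performed.
    mask = np.ones(n_targets + 1)
    mask[:n_targets][attacked] = 0.0        # already-attacked targets are unavailable
    if usv_strikes >= max_capability or attacked.all():
        mask[:n_targets] = 0.0              # only the stop/return action remains
    return mask

For example, build_mask(np.array([True, False, False]), usv_strikes=1, max_capability=5, n_targets=3) leaves only the second and third targets and the stop action selectable.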
4.2. Multi-Agent Deep Deterministic Policy Gradient
In an environment with multiple unmanned surface vessels, traditional heuristic algorithms face significant challenges: each unmanned surface vessel is an individual agent that needs to learn continuously to obtain the optimal strategy, and from the perspective of each agent the environment is no longer static but dynamic. The MADDPG algorithm [
46] can effectively solve such problems.
The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm is an extension of the Deep Deterministic Policy Gradient (DDPG) algorithm [
47]. It can handle multi-agent cooperation problems that traditional reinforcement learning methods cannot address. MADDPG adopts a “centralized training, decentralized execution” framework for learning, as shown in
Figure 6. In a multi-agent environment, the behavior of each agent affects the observations and rewards of the other agents, making the dynamics of the environment non-stationary and posing challenges to learning. MADDPG addresses this problem by giving each agent its own actor–critic network, where the actor is the action network and the critic is the evaluation network. During training, each agent’s policy network only uses the agent’s own observations, while the value function network uses the observations and actions of all agents.
The training process of MADDPG includes two main steps: policy update and value function update.
In the policy update step, each agent
i generates actions
according to its policy function, and then uses the value function to calculate the expected rewards of these actions
. Then, the parameters of the policy function are updated through gradient ascent to maximize the expected reward:
In the value function update step, each agent
i uses its policy function and the policy functions of other agents to generate actions
, and then uses these actions and observations to calculate the target value
. Then, the parameters of the value function are updated through gradient descent to minimize the difference between the target value and the actual value:
where
D is the experience replay buffer,
is the discount factor, and ′ indicates the next time step.
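The update equations themselves are omitted above; for reference, the standard MADDPG formulation [46] that the description follows is, with x = (o_1, …, o_N) the joint observation,
\[ \nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x,a\sim D}\Big[ \nabla_{\theta_i}\mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}\big(x, a_1,\dots,a_N\big)\big|_{a_i=\mu_i(o_i)} \Big], \]
\[ L(\theta_i) = \mathbb{E}_{x,a,r,x'}\Big[\big(Q_i^{\mu}(x, a_1,\dots,a_N) - y_i\big)^2\Big], \qquad y_i = r_i + \gamma\, Q_i^{\mu'}\big(x', a_1',\dots,a_N'\big)\big|_{a_j'=\mu_j'(o_j')}, \]
where μ_i is agent i's policy, Q_i^{μ} its centralized critic, and primes denote the target networks and the next time step.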
Furthermore, gradient vanishing may occur during training. Gradient vanishing is a common problem in the training of neural network models, especially when using back-propagation: as the number of network layers increases, the gradient may gradually shrink toward zero during back-propagation, resulting in slow or stagnant weight updates, which degrades training. Gradient vanishing is mainly caused by an unreasonable number of network layers, inappropriate activation functions, or poor weight initialization. It is suggested in the literature [
48] that the effect of gradient vanishing can be reduced by dynamically adjusting the learning rate and optimizing the loss function. On this basis, we reduce the possibility of gradient vanishing by tuning the number of network layers and using non-saturating activation functions such as ReLU.
4.3. Encoding
In this research, we propose a novel approach to handle variable-length navigation trajectory sequences. This method combines recurrent neural networks (RNNs) and self-attention mechanisms to capture long-distance dependencies in the sequence and to handle inputs of different lengths.
Firstly, we employ an RNN to process the input navigation trajectory sequence. RNNs are a type of neural network capable of handling sequence data, capturing temporal dependencies in the sequence. In the model, we use gated recurrent units (GRU) as the basic unit of the RNN, as GRUs can effectively handle long sequences and have relatively low computational complexity. The update equations for GRU are as follows:
where
and
are the reset and update gates,
is the hidden state,
is the input, ∗ denotes element-wise multiplication,
is the sigmoid function, and
W represents the weight matrices.
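For reference, the standard GRU update equations that match this description are (the symbols here may differ from the paper's original notation):
\[ r_t = \sigma\big(W_r[h_{t-1}, x_t]\big), \qquad z_t = \sigma\big(W_z[h_{t-1}, x_t]\big), \]
\[ \tilde{h}_t = \tanh\big(W_h[r_t * h_{t-1}, x_t]\big), \qquad h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t, \]
where r_t and z_t are the reset and update gates, h_t the hidden state, x_t the input, σ the sigmoid function, * element-wise multiplication, and W the weight matrices.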
Next, we employ a self-attention mechanism to further process the output of the RNN. The self-attention mechanism is a method capable of capturing long-distance dependencies in the sequence. It computes an interaction between each element and all other elements in the sequence to generate a context vector as a weighted average. The computation for self-attention is as follows:
where
Q,
K,
V are the query, key, and value,
h is the output of the RNN,
,
,
are the weight matrices,
is the dimension of the key, and softmax is the softmax function.
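For reference, the standard scaled dot-product self-attention consistent with this description is
\[ Q = W_Q h, \quad K = W_K h, \quad V = W_V h, \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. \]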
By combining RNN and self-attention mechanisms, the model can effectively handle variable-length navigation trajectory sequences and capture long-distance dependencies in the sequence.
4.4. Algorithm
In this section, we present the action-selection procedure and the overall MTWTA algorithm. Algorithm 1 shows the action selection of each agent using self-attention with an RNN, and Algorithm 2 shows the overall procedure of the proposed algorithm.
Algorithm 1 Self-Attention with RNN and Mask for Action Selection
Initialize random weights of the RNN
Initialize the self-attention mechanism
Initialize the action space A
For each episode:
 4.1 Get the initial state s0
 4.2 Pass s0 through the RNN to get the hidden state h0
 4.3 For each time step t:
  4.3.1 Pass ht−1 through self-attention to get the attention output ot
  4.3.2 Concatenate st and ot to form the RNN input xt
  4.3.3 Pass xt through the RNN to get ht
  4.3.4 Compute action probabilities pt using the gumbel-softmax function on ht
  4.3.5 Apply the mask to pt to get the masked probabilities
  4.3.6 Select action at from A according to the masked probabilities
  4.3.7 Execute action at and observe the reward rt and the new state
Algorithm 2 Processing procedure of WTA
Initialize the critic networks Q, the actor networks, their target networks, and the replay buffer R
For each episode, initialize N and receive the initial state
For each time step t:
 3.1 For each agent i, select an action with Algorithm 1 and apply the mask
 3.2 Execute the joint action, observe the reward r and the next state, store the transition in R, and update the state
 3.3 Sample a minibatch of N transitions from R and compute the target values
 3.4 Update the critics by minimizing the loss between the target and current Q-values
 3.5 Update the actor policies using the sampled policy gradient
 3.6 Soft-update the target networks
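A minimal PyTorch sketch of the action-selection step of Algorithm 1 (layer sizes, the single attention head, and all names are illustrative assumptions, not the paper's implementation):

import torch
import torch.nn.functional as F

class MaskedActor(torch.nn.Module):
    # GRU + single-head self-attention over the hidden-state history,
    # followed by a masked gumbel-softmax action choice (cf. Algorithm 1).
    def __init__(self, state_dim, hidden_dim, n_actions):
        super().__init__()
        self.gru = torch.nn.GRUCell(state_dim + hidden_dim, hidden_dim)
        self.q_proj = torch.nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = torch.nn.Linear(hidden_dim, hidden_dim)
        self.v_proj = torch.nn.Linear(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, n_actions)

    def forward(self, state, history, h_prev, mask):
        # attention of the previous hidden state over the stored history (step 4.3.1)
        q = self.q_proj(h_prev).unsqueeze(1)                 # (B, 1, H)
        k, v = self.k_proj(history), self.v_proj(history)    # (B, T, H)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        context = (attn @ v).squeeze(1)                      # (B, H)
        # concatenate state and attention output, feed the GRU cell (steps 4.3.2-4.3.3)
        h = self.gru(torch.cat([state, context], dim=-1), h_prev)
        logits = self.head(h)
        logits = logits.masked_fill(mask == 0, -1e9)         # step 4.3.5: forbid masked actions
        action = F.gumbel_softmax(logits, tau=1.0, hard=True)  # steps 4.3.4/4.3.6: one-hot sample
        return action, h

For example, MaskedActor(state_dim=8, hidden_dim=128, n_actions=6), applied to a batch of states, a history tensor of shape (batch, steps, 128), the previous hidden state, and a 0/1 mask, returns a one-hot action and the new hidden state.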
5. Experiment
The numerical experiments are divided into two parts. The first is a case study, in which we solve a (5, 25) WTA instance. The second is a comparative performance evaluation. The experiments are carried out on an AMD Ryzen 7 CPU and an RTX 3060 GPU with 16 GB RAM, with code written in Python 3.10.
Dataset: Due to the restrictions on military datasets, all data in this experiment are randomly generated. A specified number of targets is randomly generated according to the problem scale, with coordinates limited to (−10, 10) and speeds within (1, 2). Their initial threat levels are calculated through a lookup table, and all threat levels are normalized. The initial coordinates of the agents are all (0, 0), with a set speed of 10, and the maximum attack capability of each unmanned surface vessel is 5.
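A sketch of the instance generator implied by this setup (the uniform threat initialization is a placeholder for the paper's lookup table, and all names are assumptions):

import numpy as np

def generate_instance(n_targets, seed=0):
    rng = np.random.default_rng(seed)
    coords = rng.uniform(-10, 10, size=(n_targets, 2))     # target coordinates in (-10, 10)
    speeds = rng.uniform(1, 2, size=n_targets)              # target speeds in (1, 2)
    headings = rng.uniform(0, 2 * np.pi, size=n_targets)    # movement directions
    threats = rng.uniform(0, 1, size=n_targets)
    threats = threats / threats.sum()                       # normalized threat levels
    usv = {"start": (0.0, 0.0), "speed": 10.0, "max_attacks": 5}
    return coords, speeds, headings, threats, usv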
Baseline: The conventional solution to planning problems in the military field mainly uses the genetic algorithm (GA). Therefore, this experiment mainly compares against the genetic algorithm at different problem scales. When the number of targets is greater than or equal to the maximum allocation ability, we allocate by bit [
49]; that is, we group the chromosome encoding according to the maximum ability of the unmanned surface vessels and allocate the groups to the vessels in turn. When the number of enemy targets is less than the maximum allocation ability, following the literature, we insert virtual codes for the unmanned surface vessels into the target chromosome encoding and define the chromosome segment between two vessel codes as the attack segment of the preceding vessel; the order within the segment is that vessel's attack sequence. If the number of targets between two vessel codes exceeds the maximum ability of a single boat, only the first targets up to that maximum are taken as the attack sequence of the vessel.
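For illustration, the "allocate by bit" decoding described above might look like the following (this is one reading of the grouping scheme; the exact decoding in [49] may differ):

def decode_by_bit(chromosome, n_usvs, max_capability):
    # Split a flat target permutation into per-vessel attack sequences by taking
    # consecutive groups of at most `max_capability` genes for each vessel in turn.
    return [chromosome[k * max_capability:(k + 1) * max_capability] for k in range(n_usvs)]

For example, decode_by_bit(list(range(16)), n_usvs=5, max_capability=5) yields [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15], []].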
Grey Wolf Algorithm: The optimal solution is sought by simulating the social hierarchy and predatory behavior of grey wolves. Each wolf is ranked according to its fitness value, forming a hierarchical social model. The algorithm adjusts each wolf's movement strategy according to its current position and the target position, gradually approaching the optimal solution in the search space. We use the basic grey wolf algorithm with 300 iterations as a comparison algorithm.
Parameter setting: the parameter of MTWTA is shown in
Table 2.
We set the reward as defined in Section 3.2.3, and the fitness function of the GA is set to be the same as this reward, since it is our optimization goal. Unlike the RL approach, however, the GA fitness is only calculated after the entire pre-assignment is completed, without considering dynamic rewards.
5.1. Case Study
We use the MTWTA method to solve the fire distribution problem with 5 unmanned surface vessels and 25 enemy vessels. We set the number of training iterations to 10,000 and the dimension of the hidden layer to 128, and the model is updated every 50 iterations. We apply the trained model to solve the problem, and the results obtained are shown in
Figure 7; the assignment is that s1:
, s2:
, s3:
, s4:
, s5:
. The coordinates in
Table 3 represent the positions of each unmanned surface vessel when attacking its final target, i.e., the final position of each vessel before it returns to the base.
5.2. Comparative Performance Analysis
5.2.1. The Quality of Solution
To verify the scalability of the algorithm, three comparative tests are designed and conducted, respectively, for the cases where the enemy targets exceed the unmanned surface vessels' total allocation ability, where the enemy targets equal the maximum allocation ability, and where the enemy targets are fewer than the maximum allocation ability.
Experiment 1: When the target is less than maximum capability.
We set the numbers of unmanned surface vessels and enemy targets, respectively, to (30, 100), (20, 60), and (5, 16) for three sets of experiments. Here, GA100, GA200, and GA300 denote the GA with 100, 200, and 300 iterations, respectively. For reinforcement learning, we report the average over 100 runs, and the GA solution is the average of 10 runs. The quality of the solutions is shown in
Table 4. The results represent the ratio of the solution rewards compared to the proposed algorithm. The larger the result, the worse the quality of the solution.
Experiment 2: When the target is equal to the maximum capability.
We set the number of unmanned surface vessels and the enemy, respectively, as (30, 100), (20, 60), and (5, 16) for three sets of experiments. The quality of the solution is shown in
Table 5.
Experiment 3: When the target is more than maximum capability.
We set the number of unmanned surface vessels and the enemy, respectively, as (3, 17), (20, 120), and (30, 160) for three sets of experiments. The quality of the solution is shown in
Table 6.
Through experiments in three different scenarios, we found that the algorithm proposed in this paper is superior to the baseline algorithms in terms of solution quality when the scale is large. However, when the scale is small, the solution quality obtained by the MTWTA algorithm is lower than that of the GA; in particular, when the number of enemy targets exceeds the maximum capacity, the rewards obtained by MTWTA are much lower than those of the GA.
The reason for this phenomenon is that when the problem scale is small, the solution space of the entire experiment is also small. As a search algorithm, the GA is more likely to find the optimal solution in a smaller solution space. However, when the problem scale expands, the solution space also increases, and the quality of the solution cannot be guaranteed within a limited number of iterations. From the algorithm-design point of view, reinforcement learning interacts with the environment in real time and uses a trial-and-error learning mechanism: the agent learns the optimal strategy by continuously trying different actions and observing the results, and the balance between exploration and exploitation can lead the agent to a better strategy. The GA, by contrast, searches for optimal solutions through the evolution of individuals in a population. Although it has a certain exploration ability, its learning mechanism is relatively passive, relying mainly on the evaluation of the fitness function and on genetic operations; it therefore receives relatively little feedback and is slow to improve the quality of the solution.
5.2.2. The Running Time
In practical applications, the solution time is a key indicator to measure the performance of an algorithm. Therefore, we have calculated and compared the solution times of various algorithms at different scales, as shown in
Figure 8. We report the average time over ten runs of each algorithm, which helps us evaluate the performance of the algorithms more accurately.
In terms of solution time, the algorithm proposed in this paper performs significantly better than the GA. Although the genetic algorithm may find a better solution for small-scale problems, its solution time increases significantly as the problem scale grows. Experimental results show that the solution time of the proposed algorithm is much shorter than that of the GA, and this time advantage becomes more pronounced as the problem scale increases. For the GA, as the number of generations grows, the search space keeps expanding under the influence of operations such as crossover and mutation, and the fitness of every individual must be evaluated in each iteration, which increases the solution time of the whole problem.
5.2.3. The Stability
In this paper, we mainly focus on the solution stability. The quality of solutions provided by search algorithms largely depends on the quality of the initial solution space. If the initial solution space is of poor quality, even the best search algorithms may not be able to find satisfactory solutions. Therefore, the stability of solutions becomes an important criterion for measuring algorithm performance.
The stability of solutions reflects the consistency of the algorithm’s performance under different initial solution spaces. A highly stable algorithm can find satisfactory solutions consistently, regardless of the quality of the initial solution space. This is particularly important for real-world problems, as we often cannot guarantee the quality of the initial solution space.
Therefore, we use the stability of solutions as an important comparison criterion to evaluate and compare the performance of different algorithms. In this way, we can understand and evaluate the performance of algorithms more comprehensively and accurately, thereby selecting the algorithm that is most suitable for the actual problem. The result is shown in
Figure 9.
Experimental results show that the solutions of the GA are volatile, while the model trained by MTWTA is stable in solving the problem. This is because the MTWTA algorithm continuously optimizes its strategy by learning from environmental feedback; even when the initial solution space is of poor quality, it can find satisfactory solutions through continuous learning and iteration. By contrast, if the initial solution space of the genetic algorithm is of poor quality, crossover and mutation may not be enough to escape the local solution space, while the randomness of these operations leads to fluctuations in the solutions.
6. Summary and Discussion
This study proposes a new model to describe and solve the weapon-target assignment problem of unmanned surface vessels in the near sea. Unlike existing formulations, the targets of the surface unmanned craft are moving, so existing modeling methods have certain limitations; the model must consider the strike position of each target and the movement direction of the unmanned surface vessels at different times. To better describe this problem, this research establishes a new weapon-target assignment model and a reasonable performance evaluation system, and sets up a constrained mathematical model whose objective is to maximize the reduction of the target threat. To solve this problem, we modified the embedding of MADDPG following the Seq2seq principle and demonstrated the timeliness, stability, and solution quality of the proposed framework through comparison experiments with the genetic algorithm. The experimental results show that the proposed framework obtains a stable solution that is about 30% better in less than 10% of the time across problems of different sizes.
This study provides algorithmic and modeling support for subsequent work on dynamic fire distribution. On the current intelligent battlefield, the timeliness and stability of the algorithm can better assist commanders in making decisions. At the same time, this algorithm is not the only solution to the problem; in subsequent research, we will focus on fire distribution with continuous actions and consider adversarial confrontation.