Tracking and Data Association Based on Reinforcement Learning

Xiong, Wei; Gu, Xiangqi; Cui, Yaqi

doi:10.3390/electronics12112388

Open AccessArticle

Tracking and Data Association Based on Reinforcement Learning

by

Wei Xiong

,

Xiangqi Gu

^*

and

Yaqi Cui

Research Institute of Information Fusion, Naval Aviation University, Yantai 264001, China

^*

Author to whom correspondence should be addressed.

Electronics 2023, 12(11), 2388; https://doi.org/10.3390/electronics12112388

Submission received: 23 April 2023 / Revised: 19 May 2023 / Accepted: 23 May 2023 / Published: 25 May 2023

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Currently, most multi-target data association methods require the assumption that the target motion model is known, but this assumption is clearly not valid in a real environment. In the case of an unknown system model, the influence of environmental clutter and sensor detection errors on the association results should be considered, as well as the occurrence of strong target maneuvers and the sudden appearance of new targets during the association process. To address these problems, this paper designs a target tracking and data association algorithm based on reinforcement learning. First, this algorithm combines the dynamic exploration capability of reinforcement learning and the long-time memory function of LSTM network to design a policy network that predicts the probability of associating a point with its various possible source targets. Then, the Bayesian network and the multi-order least squares curve fitting method are combined to predict the location of target, and the results are fed into the Bayesian recursive function to obtain the reward. Simultaneously, some corresponding mechanisms are proposed for possible problems that interfere with the association process. Finally, the simulation experimental results show that this algorithm associates the results with higher accuracy compared to other algorithms when faced with the above problem.

Keywords:

data association; system model; LSTM network; Bayesian recursive function

1. Introduction

Data association is a key technique within the field of radar data processing. The technique is used to determine the correct track by establishing relationships between radar measurements at adjacent moments. In general, the traditional data association algorithms are centered on the predicted value of track and select the qualified points according to some specific criteria. These algorithms are generally related to filtering algorithms in target tracking. For example, Probability Data Association Filter (PDAF), Joint Probability Data Association Filter (JPDAF), Multiple Hypothesis Tracking Filter (MHTF), Interacting Multiple Model Filter (IMMF), and Probabilistic Hypothesis Density Filter (PHDF) [1,2,3,4,5]. Except for the PDAF algorithm, other algorithms can be used to handle the problem of multi-target data association.

Radar measurement data can be affected by many factors, such as sensor detection errors and clutter generated by the external environment. Additionally, the motion mode of target can increase the difficulty of data association. To solve these problems, scholars have proposed many related algorithms [6,7,8,9]. For example, a truncated JPDAF algorithm is proposed in reference [3], which can effectively filter the clutter in the data combined with the characteristics of target motion. In reference [4], fuzzy recursive least squares filtering is combined with JPDAF to achieve effective tracking of maneuvering targets in the environment where both the measurement error and the motion model are uncertain. In reference [5], the consistent distributed filtering process was incorporated into the JPDAF algorithm using the random Gibbs sampling algorithm to improve the accuracy of association results. In reference [6], a radiometric intensity PHDF algorithm was designed to solve the tracking problem of close targets. In reference [7], the DBSCAN algorithm and the sequential random sample consistency algorithm are used to preprocess the measurement data, which substantially reduces the computational resources of algorithm. In reference [8], a new multi-sensor PHDF algorithm is proposed that can fuse information from different sensors and overcome the difficulties of missing the priori statistical information such as clutter, measurement noise, and detection probability. In some specific environments, by some special means, these algorithms are able to handle some multi-target data association problems. However, none of them has completely escaped the limitation of traditional filter framework, and they need to realize the data association of multiple targets based on the system model. Obviously, the system model in the real environment is unpredictable, which is one of the reasons why the traditional filter algorithms are not very practical. Therefore, how to achieve accurate association of multiple targets in the complex environment where the priori information is scarce is an urgent problem.

In recent years, with the development of artificial intelligence, using artificial intelligence to analyze data has become the mainstream in the new era [10,11,12]. Similarly, there are many scholars who have incorporated artificial intelligence techniques into the data association process and have achieved some results. Compared with traditional filter algorithms, these results have some breakthroughs in some specific environments. However, in the final analysis, they still follow the traditional data association rules and combine with filter to process data. In this case, it must be assumed that the system model is known. However, in the real environment, the system model is difficult to predict in advance, so it is worthwhile to study how to solve the problem of target tracking and data association without the priori conditions.

Reinforcement Learning (RL) is an important branch in the field of machine learning. It emphasizes how to act based on the environment to obtain the maximum expected benefits. After decades of development, RL technology has many achievements, such as Q-learning, dynamic programming, Policy Gradients, Deep-Q-Network, etc. [13,14,15,16,17,18,19,20,21,22,23,24]. In essence, RL is a process in which the agent learns itself in an unknown environment under defined rules. The behavior of agent conforms to the real environment. The reward or punishment are fed by defined rules. The end mark is the arrival of “terminus “. In simple terms, the target data association process is to find the real points generated by the track, and all the points are sorted according to the time axis to form the complete track. This process is not only similar to finding the optimal path but can also be roughly regarded as a snake game. The environment of target requires the agent to adapt. The rules of the game are determined by the motion of target. Therefore, in theory, RL can solve the problem of target tracking and data association.

In the real environment, the target system model is unknown, and the association results are vulnerable to environmental clutter, detection errors of sensors, strong maneuvers, and the sudden appearance of new targets. Traditional data association algorithms have difficulty in achieving data association of multiple targets within the constraints of these conditions. In this paper, a novel tracking and data association network architecture based on RL networks is established. First, a policy network that can predict the probability of associating the point with its various possible source targets is designed, which exploits the dynamic exploration capability of RL techniques and the long-time memory function of LSTM networks. Then, the multi-order least square curve fitting method is used to predict the position of point. The order of least square curve fitting method is obtained by Bayesian network analysis. The results are wired into the Bayesian recursive function to obtain the reward for each point. Finally, some corresponding mechanisms are proposed for the characteristics of targets and measurements. For example, the ring wave gate mechanism that eliminates some of the effects of clutter, the detection mechanism that enables timely detection of new targets, the Compete mechanism that solves multiple target association problems, and the adjustment mechanism that automatically corrects the association results. This paper pioneers a data association idea of association while learning to achieve accurate association of measurement and target. This has high practical engineering significance and helps national defense construction, which is a trend of future development.

The paper is organized as follows. Section 2 introduces the overall network architecture of algorithm and briefly explains the learning approach of the network. Section 3 designs the state space and agent detection mechanism according to the motion trend of target. Section 4 designs the Ring Wave Gate Screening mechanism and the Compete mechanism, taking into account the characteristics of measurement. Section 5 defines the reward function and details the design ideas. Section 6 proposes an adaptive adjustment mechanism for the problems that may occur in the complex and variable multi-target data association process. Section 7 trains the network with a large amount of simulation data, tests the performance of algorithm, and makes the corresponding analysis. Section 8 summarizes the advantages and disadvantages of this algorithm and looks forward to the future development direction.

2. Network Architecture

In the process of tracking and data association, sensors continuously obtain measurement data in a certain time sequence, which contains both real points of target and clutter generated by sensor interference, external environment, and other factors. When at a certain moment, some of the clutter may be mixed with the points generated by target, and it is difficult to accurately associate the points originating from the target. Therefore, this paper designs a tracking and data association network architecture based on RL, as shown in Figure 1.

First, this architecture can output the association probability of targets and points in a real environment lacking a priori information using a policy network and select the point with the highest probability for association. When there are multiple agents associated with points in the detection area at the same time, competition may occur, so the Compete mechanism is set to let each agent make a suitable choice. Then, the reward function calculates the reward of current selection and feeds back the evaluation of policy network. Finally, the optimal policy that matches the real motion of target is found by learning to achieve accurate association between target and point. The network architecture contains four core components, namely Environment State Design, Action Selection, Reward Function Definition, and Policy Network Learning.

3. Environment State Design

3.1. Agent Definition

All moving targets have inertia, and inertia makes the motion trend of adjacent time targets similar. Based on the motion trend, agent can find the points originating from the target and achieve accurate association between measurement and target. A single point cannot be used to represent the motion trend of a target, and multiple consecutive moments are needed to approximate the motion trend of a target. Therefore, a “sliding window” state is designed in this paper. The state consists of association results for

N

consecutive moments.

N

is the “window”, and the value of

N

is determined by the sampling interval of sensor. In general, the larger value of the sampling interval is, the smaller value of

N

is.

As shown in Figure 1,

Z_{t - N}

is the set of measurements when

t - N

, and

s_{n}^{t}

is the state consisting of points in

{Z_{t - 1}, Z_{t - 2}, \dots, Z_{t - N}}

originating from the target

n

when

t

. Suppose

z_{i}^{t - N} = [\begin{matrix} x_{i}^{t - N} \\ y_{i}^{t - N} \end{matrix}]

is the point in

Z_{t - N}

.

x_{i}^{t - N}

, and

y_{i}^{t - N}

are the positions of the x and y axes in the Cartesian coordinate, respectively. If

s_{n}^{t}

consists of

{z_{i}^{t - 1}, z_{i}^{t - 2}, \dots, z_{i}^{t - N}}

, the expression of

s_{n}^{t}

is as follows.

s_{n}^{t} = [\begin{matrix} X_{n}^{t} \\ Y_{n}^{t} \end{matrix}] = [\begin{matrix} x_{i}^{t - 1} & x_{i}^{t - 2} & \dots & x_{i}^{t - N} \\ y_{i}^{t - 1} & y_{i}^{t - 2} & \dots & y_{i}^{t - N} \end{matrix}]

(1)

3.2. Agent Detection

According to Equation (1), it can be seen that each state is composed of points associated with

N

consecutive moments. However, at the beginning of data association process, the real information of agent is unknown, and the initial state of agent needs to be obtained from the measurement data of previous

N

moments. Similarly, during the data association process, some new agents may appear in the detection area at any time, and their real information is obtained from the measurement data. Therefore, an agent detection mechanism is designed in this paper. Considering that randomly distributed clutter is difficult to be associated into a complete track at

N

consecutive moments, this mechanism defines that the primary condition for determining the appearance of a new agent is the presence of points originating from the target at

N

consecutive moments. The flow of this mechanism is shown below.

An exhaustive enumeration is used to traverse the measurement data at $N$ moments to find all possible new agents. Based on the sampling interval $T_s a m p l e$ of the sensor, the measurement data $z$ , the maximum velocity $v_m a x$ , and the minimum velocity $v_m i n$ of target motion, the points at adjacent moments are required to meet the velocity threshold.

$v_m i n \leq \frac{‖\vec{z_{i}^{t - N} z_{i}^{t - N + 1}}‖}{T_s a m p l e} \leq v_m a x$

(2)

As shown in the above equation, the calculation of

z_{i}^{t - N}

and

z_{i}^{t - N + 1}

is used as an example. It can be calculated for all points at adjacent moments and thus filter all states of agent that meet the velocity threshold requirements.

2.: The agents are grouped according to the principle of “a target corresponds to a point and a point originates from a target”. Usually, in the process of multi-target data association, by default a target can only generate one point, and similarly, a point can only originate from one target. Based on this principle, all the identified agents are grouped to ensure that all agents do not select multiple points at the same time, and that a point is not selected by multiple agents in a group at the same time.
3.: If the agent already exists, the groups that are similar to the already existing agent are deleted according to the Cyclic Frechet Distance Clustering method. As shown in Figure 2, $S_n e w$ denotes the set of all agent groups; $S_{i}$ denotes the state set of the $i$ -th group of agents in $S_n e w$ ; $s_{j}$ denotes the state of the $j$ -th agent in $S_{i}$ ; $S_s u r v i v a l$ denotes the state set of already existing agent; and $s_{z}$ denotes the state of the $z$ -th agent in $S_s u r v i v a l$ . First, in order to compare the similarity of states $s_{j}$ and state $s_{z}$ more conveniently, the states $s_{j}$ and state $s_{z}$ are normalized. Then, $F r e c h e t (s_{j}, s_{z})$ is calculated according to reference [25]. Finally, the value domain of $F r e c h e t (s_{j}, s_{z})$ can be controlled at (0, 1). As shown in Figure 2, $d$ is the basis for judging whether states $s_{j}$ and state $s_{z}$ are similar, and $d = 0.5$ is usually set. If the new agent still exists in $S_n e w$ , the next step is continued. If not, the Agent Detection mechanism ends.
4.: The motion track of target in the environment can be represented by the Vector Space Model, where two points at adjacent moments form a vector. Three points at consecutive moments constitute two vectors, and the cosine of angle between these two vectors is assumed to be the motion trend factor of target during this time. The motion trend factor of track at three consecutive moments can be calculated according to the cosine theorem formula.

f_{i}^{t - N + 1} = \frac{\vec{z_{i}^{t - N} z_{i}^{t - N + 1}} • \vec{z_{i}^{t - N + 1} z_{i}^{t - N + 2}}}{‖\vec{z_{i}^{t - N} z_{i}^{t - N + 1}}‖ ‖\vec{z_{i}^{t - N + 1} z_{i}^{t - N + 2}}‖}, f_{i}^{t - N + 1} \in [- 1, 1]

(3)

According to the above equation, the motion trend

f = \{f_{i}^{t - 2}, f_{i}^{t - 3}, \dots, f_{i}^{t - N + 1}\}

of a new agent at

N

consecutive moments can be calculated, and thus the motion trend of all new agents can be obtained.

5.: Consider that variance can be used to measure the magnitude of fluctuations in a batch of data. With the same sample size, the larger the variance is, the more volatile and unstable the data are. For the track, the more stable the motion trend is, the higher its reliability is. Therefore, the variance of $f$ is obtained as

V a r i a n c e = var (f)

(4)

Finally, the data with the smallest

V a r i a n c e

from each category is found as the state of new agent, and the category is the number of new agents.

4. Action Selection

In the tracking and data association process, the agent needs to associate the points from the measurement data that are most likely to originate from the target. Such points are the actions that need to be selected, and the points in consecutive moments constitute the action space. As shown in Figure 1, the policy network can output the association probability of state and measurement, and the agent selects the action based on the association probability. Considering that clutter is randomly distributed, the spatial distance between some of the clutter and the current moment position of agent may far exceed the motion limit of agent in a single sampling interval. Therefore, a Ring Wave Gate Screening process is designed. In addition, a Compete mechanism is designed to solve the competition problem that may occur when there are multiple agents selecting actions simultaneously in the detection area.

4.1. Ring Wave Gate Screening

The points originating from the target necessarily follow the motion rules of target. The ring distribution area of points can be determined based on the maximum velocity

v_m a x

and minimum velocity

v_m i n

of target motion, and all points that may be generated by target can be found.

As shown in the Algorithm 1, the inputs are the measurement data

Z

at the current moment, the maximum velocity

v_m a x

, the minimum velocity

v_m i n

, the position

z

of target at the current moment, and the sampling interval

T_s a m p l e

of sensor.

z_{i}

is the

i

-th point of

Z

.

x_{i}

and

y_{i}

are the positions of

z_{i}

in the x-axis and y-axis, respectively.

Z

has the total of

n

points.

x

and

y

are the positions of

z

in the x-axis and y-axis, respectively. The output is the filtered measurement data

Z_

. The mechanism needs to calculate the distance between all points and

z

. Then, all points that meet the velocity threshold are filtered out.

Algorithm 1. Pseudo code for Ring Wave Gate Screening.
Input: Z = {z_i = [x_i, y_i]\|i = 1, 2,…, n}, v_max, v_min, z = [x,y], T_sample
Output: Z_
1:	i = 1
2:	I = []
3:	while i <= n do
4:	d = ((x − x_i)² + (y − y_i)²)^1/2
5:	if d >= v_min ∗ T_sample and d <= v_max ∗ T_sample then
6:	I.append(i)
7:	end if
8:	i = i + 1
9:	end while
10:	Z_ = {z_i\|i = I}
11:	return Z_

However, considering the effect of sensor detection error

σ_{v}

, the speed has to vary with the detection error so that

v_m a x = v_m a x + \frac{σ_{v}^{2}}{T_s a m p l e}

(5)

v_\min = v_\min - \frac{σ_{v}^{2}}{T_s a m p l e}

(6)

As shown in Figure 3, target 1 moves in the detection area, and there are nine points in the measurement data at the next moment. By drawing the ring wave gate area, it can be seen that only four points may originate from the target, namely

z_{1}

,

z_{2}

,

z_{3}

, and

z_{4}

.

The Ring Wave Gate Screening process uses basic information about target and sensor to remove some of the clutter from the measurement data. This not only increases the likelihood of selecting points originating from the target but also saves the computational resources of policy network.

4.2. Compete Mechanism

When there are multiple agents in the detection area, the association probability of agent and point can be obtained through the policy network. In this case, an agent will select the point with the highest association probability. However, if multiple agents select points at the same time, competition is bound to occur.

Suppose two targets in the detection area cross meet, and there are nine points in the measurement data at the next moment. The ring wave gate areas of target 1 and target 2 are shown in Figure 3 and Figure 4, respectively. Target 1 has four points that may originate from the target, namely

z_{1}

,

z_{2}

,

z_{3}

, and

z_{4}

. Target 2 also has four points that may originate from the target, namely

z_{1}

,

z_{3}

,

z_{4}

, and

z_{6}

. Obviously, when faced with

z_{1}

,

z_{3}

, and

z_{4}

, there are competitions between target 1 and target 2.

In the face of competing phenomena, the association probability matrix is generated based on the state of current moment and the measurement data of next moment.

P^{t} = [\begin{matrix} P_{1}^{t} \\ P_{2}^{t} \\ ⋮ \\ P_{n}^{t} \end{matrix}] = [\begin{matrix} p_{11}^{t} & p_{12}^{t} & \dots & p_{1 m}^{t} \\ p_{21}^{t} & p_{22}^{t} & \dots & p_{2 m}^{t} \\ \begin{matrix} ⋮ \end{matrix} & \begin{matrix} ⋮ \end{matrix} & \begin{matrix}  \end{matrix} & \begin{matrix} ⋮ \end{matrix} \\ p_{n 1}^{t} & p_{n 2}^{t} & \dots & p_{n m}^{t} \end{matrix}]

(7)

In the above equation,

P^{t}

is the association probability matrix when

t

;

P_{n}^{t}

is the association probability matrix of the agent

n

with measurement data when

t

; and

p_{n m}^{t}

is the association probability of the agent

n

with point

m

when

t

.

First, the clutters for each agent are found based on the ring wave gate. Then, in the association probability matrix, the association probability of each agent with its clutter is set to 0, and the association probability of each agent with other points is output by the policy network. Finally, the agent needs to follow two rules when choosing the action:

Each point has only a unique source target.
A target can generate at most one point.

Based on these two rules, the point with the highest association probability is selected for each agent. If multiple agents select the same points, the points are assigned according to the following priority.

The agent has only one point that may originate from the target.
The association probability between agent and point is maximum.

According to this way, all agents can be associated to points.

Through the Compete mechanism, it is possible to match the most appropriate action to each agent and find the point that originates from the target.

5. Reward Function Definition

This paper defines the reward function that can calculate the true score of the action selected by agent. First, this function predicts the order of least squares by Bayesian network based on the state of the agent at the current moment. Then, it uses least squares to predict the position of target at the next moment. Finally, it is combined with the Bayesian recursive function to calculate the reward for the action selected by the agent. As shown in Figure 1, the reward is directly input into the loss function of LSTM network, which can feed back the evaluation of policy network and help the agent find the optimal policy that matches the real motion of target.

In the real environment, the motion of the target is complex and variable. A complete track may contain many kinds of motion models such as Constant Velocity (CV), Coordinate Turn (CT), and Constant Acceleration (CA). If the least squares method of fixed order is used, it cannot accurately describe the motion trend of the whole track. Therefore, it is necessary to predict the order of least squares that can fit the motion trend of target at these

N

moments based on the current state. Referring to the Bayesian network model in reference [26], this paper designs a Bayesian network with

M

classification function. The value of

M

indicates that the order of least squares is divided into

M

classes. If the value of

M

is too small, it is difficult to fit all possible motion trends. If the value of

M

is too large, the fitting results are prone to large deviations. The input of network is the state of agent at the current moment. The output of network is the probability of each class. The class with the highest probability is chosen as the order of least squares method.

Then, according to the predicted order

g

, this function uses least squares [27] to fit the state of agent and predict the position at the next moment. The fitted state is

\tilde{s_{n}^{t}} = [\begin{matrix} \tilde{X_{n}^{t}} \\ \tilde{Y_{n}^{t}} \end{matrix}]

. The predicted position is

\tilde{z_{n}^{t}} = [\begin{matrix} \tilde{x_{n}^{t}} \\ \tilde{y_{n}^{t}} \end{matrix}]

.

\tilde{X_{n}^{t}} = M_{t} W_{x}

(8)

\tilde{Y_{n}^{t}} = M_{t} W_{y}

(9)

\tilde{x_{n}^{t}} = M_{t + 1} W_{x}

(10)

\tilde{y_{n}^{t}} = M_{t + 1} W_{y}

(11)

In the above equation,

W_{x}

and

W_{y}

are the fitted coefficient matrices.

W_{x} = {(M_{t}^{T} M_{t})}^{- 1} M_{t}^{T} X_{n}^{t}

(12)

W_{y} = {(M_{t}^{T} M_{t})}^{- 1} M_{t}^{T} Y_{n}^{t}

(13)

M_{t}

is the fitted information when

t

.

M_{t} = [T_{t}^{0}, T_{t}^{1}, \dots, T_{t}^{g}]

(14)

T_{t}^{g} = [t - 1^{g}, t - 2^{g}, \dots, t - N^{g}]

(15)

Finally, referring to the Bayesian recursive function in reference [7], the reward function

r_{n}^{t}

is defined as

r_{n}^{t} = \frac{P_D r_{n}^{t - 1} q_{n}^{t} (z_{i}^{t})}{K_{t} (Z_{}^{t}) + P_D r_{n}^{t - 1} q_{n}^{t} (z_{i}^{t})}

(16)

q_{n}^{t} (z_{i}^{t}) = N (z_{i}^{t}; \tilde{z_{n}^{t}}, S_{n}^{t})

(17)

S_{n}^{t} = cov (z_{i}^{t}, \tilde{z_{n}^{t}}) + R

(18)

In the above equation,

z_{i}^{t}

is the point

i

selected from the measurement data

Z_{}^{t}

when

t

.

K_{t} (Z_{}^{t})

is the clutter intensity when

t

, namely

K_{t} (Z_{}^{t}) = \frac{n u m_{Z_{}^{t}}}{T S}

.

n u m_{Z_{}^{t}}

is the number of points in the measurement data

Z_{}^{t}

.

T S

is the area of sensor detection region.

R

is the covariance matrix of measurement, which is also the detection error of sensor.

6. Policy Network Learning

As shown in Figure 1, the policy network of this network architecture uses the LSTM network [28]. The input layer of network is the state of the agent at the current moment and a point at the next moment. Since the output of policy network is the association probability between state and point, and the range of association probability is

[0, 1]

, a Sigmoid layer is added after the output layer of LSTM network.

6.1. Learning Style

In the real environment, there are factors such as high maneuverability of target and dense clutter interference. It is difficult to learn the motion trend of multiple targets simultaneously, and the accuracy cannot be guaranteed. Usually, the multi-target data association process can be regarded as the multiple single-target data association processes. Therefore, based on the learning outcomes of a single target, supplemented by an adaptive adjustment mechanism, the precise association of multiple targets can be achieved.

As shown in Figure 5, the single-target learning process consists of two parts, namely training and testing. The whole process has modules such as initialization, ring wave gate screening, action selection, and stepping. After single-target learning is completed, multi-target testing is conducted based on the learning experience. The multi-target testing process is basically the same as the single-target learning process, except that Compete mechanism and adaptive adjustment mechanism are added.

In Figure 5, Random Choice module and Argmax Choice module are action selection processes. In the Random Choice module, the state of agent and a randomly selected point are input into the policy network to obtain the association probability. In the Argmax Choice module, the state of agent and all points are input into the policy network to obtain all association probabilities, and the point with the highest probability is selected for testing. The Random Choice module enables the learning of strategies in the continuous trial-and-error process, while the Argmax Choice module can be used as the means of testing the results of learning. Done is the marker that determines whether the learning process is ended or not. There is no clear “terminal” for the data association process. When there is no new measurement data at the next moment, the learning process is considered to be ended, namely Done is “True”. The Discount Sum Rewards module is the process of calculating the cumulative return of track discount. The Normalization module is to normalize the cumulative return of track discount. The Save Data module is used to save the data. The Loss module is the cross-entropy loss function. The Optimizer module uses the Adam optimization algorithm with the learning rate of 0.01.

6.2. Adaptive Adjustment Mechanism

In the real environment, there are many ways of target movement, and it is not possible for an agent to learn all of them. When testing a new strategy, the agent still needs to adapt to the environment, and it is difficult to accurately select the points originating from the target. Therefore, this paper designs a mechanism that can automatically adjust the motion trend during the test, which improves the weak migration ability of RL technique, as shown in Figure 6. When the test process ends, if the measurements are still present at the next moment, it indicates that there may be some incorrect association. This mechanism can find the wrongly selected actions and correct them.

As shown in Figure 6, this process can be performed in four steps.

Find the minimum reward $r_m i n$ and the corresponding state $s_m i n$ from the reward set $r e w a r d$ .
Calculate the reward $r_{i}$ for each valid action $a_{i} (i = 1, 2, \dots, m)$ according to Equation (16).
Judgment: If the maximum reward $r_m a x > r_m i n$ , the action is selected and the test continues. Conversely, remove $r_m i n$ from $r e w a r d$ and continue the calculation.
Iterate through all rewards in $r e w a r d$ . If all rewards do not need to be adjusted, the association results are considered to have been associated without problems, and the test is over.

During multi-target data association testing, this mechanism can automatically adjust the track to make the association results more accurate. It also eliminates the process of re-adapting to the environment, making the algorithm more practical.

7. Simulation Experiment and Result Analysis

7.1. Simulation Environment Settings

Suppose there are flying targets in motion in an air-to-air radar detection area, and the sampling interval

T_s a m p l e

of system is 1, and the “window”

N

is 5. The detection probability

P_D

of target is 1. The initial positions are randomly distributed at (−500 m, 500 m), and the initial velocity is randomly selected between 30 m/s and 100 m/s. The maximum speed of target is

v_m a x = 150 m / s

, and the minimum speed of target is

v_m i n = 10 m / s

. The number of clutter follows Poisson distribution, and the mean of Poisson distribution is

λ

. The acceleration in the CA model is randomly chosen between −10 m/s² and 10 m/s². The turning angular acceleration in the CT model is randomly chosen between 0 and 0.8.

According to reference [1], the motion process of target satisfies the state transition equation.

X_{k} = F X_{k - 1} + Γ v_{k - 1}

(19)

In the above equation,

F

is the state transition matrix, and

Γ

is the process noise distribution matrix.

v_{k - 1}

is additive white noise, and its covariance matrix is

Q_{k - 1}

. Combined with the parameters related to the real motion of target in the experiment, set

Q_{k - 1} = diag ([5^{2}, 5^{2}])

.

Corresponding to the CV model, CA model, and CT model, according to reference [1], the expressions of

F

and

Γ

are as follows:

F_{C V} = [\begin{matrix} 1 & T & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T \\ 0 & 0 & 0 & 1 \end{matrix}] Γ_{C V} = [\begin{matrix} \frac{T^{2}}{2} & 0 \\ T & 0 \\ 0 & \frac{T^{2}}{2} \\ 0 & T \end{matrix}]

F_{C A} = [\begin{matrix} 1 & T & \frac{T^{2}}{2} & 0 & 0 & 0 \\ 0 & 1 & T & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & T & \frac{T^{2}}{2} \\ 0 & 0 & 0 & 0 & 1 & T \\ 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}] Γ_{C A} = [\begin{matrix} \frac{T^{2}}{2} & 0 \\ T & 0 \\ 1 & 0 \\ 0 & \frac{T^{2}}{2} \\ 0 & T \\ 0 & 1 \end{matrix}]

F_{C T} = [\begin{matrix} 1 & \frac{\sin ω T}{ω} & 0 & \frac{\cos ω T - 1}{ω} \\ 0 & \cos ω T & 0 & - \sin ω T \\ 0 & \frac{1 - \cos ω T}{ω} & 1 & \frac{\sin ω T}{ω} \\ 0 & \sin ω T & 0 & \cos ω T \end{matrix}] Γ_{C T} = [\begin{matrix} \frac{T^{2}}{2} & 0 \\ T & 0 \\ 0 & \frac{T^{2}}{2} \\ 0 & T \end{matrix}]

According to reference [1], the measurement satisfies the following equation.

Z_{k} = H X_{k} + W_{k}

(20)

In the above equation,

H

is the measurement matrix.

W_{k}

is white Gaussian noise, and its covariance matrix is

R_{k} = diag ([\begin{matrix} σ_{v}^{2} & σ_{v}^{2} \end{matrix}])

. The measurement error

σ_{v}^{}

is determined by sensor performance.

Corresponding to the CV model, CA model, and CT model, according to reference [1], the expressions of

H

are as follows:

H_{C V} = [\begin{matrix} 1 & \begin{matrix} 0 & 0 & 0 \end{matrix} \\ 0 & \begin{matrix} 0 & 1 & 0 \end{matrix} \end{matrix}] H_{C A} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{matrix}] H_{C T} = [\begin{matrix} 1 & \begin{matrix} 0 & 0 & 0 \end{matrix} \\ 0 & \begin{matrix} 0 & 1 & 0 \end{matrix} \end{matrix}]

7.2. Result Analysis

The whole experiment is divided into the single-target data association learning process and the multi-target data association testing process. The movement time of target is 30 s in the single-target data association learning process. The number of data in the training set is 5 groups. The measurement error is

σ_{v}^{} = 0, 10, 20, 30, 40

, and each measurement error corresponds to one group of data. Each group has 10000 data, and the clutter parameter

λ

is randomly selected from 0 to 90. The motion model of each track consists of any combination of the CV model, CA model, and CT model. The number of data in the testing set also is 5 groups. The measurement error is

σ_{v}^{} = 0, 10, 20, 30, 40

, and each measurement error corresponds to one group of data. Each group has 100 data. The clutter parameter is

λ = 0, 10, 20, \dots, 90

, and each clutter parameter has 10 data. The motion model of each track consists of any combination of CV model, CA model, and CT model.

After the test, the association accuracy

P

can be calculated by comparing the test results with the true points originating from the target, namely

P = \frac{N_{c o r r e c t}}{N_{a l l}}

(

N_{c o r r e c t}

represents the number of true points originating from the target in the association result of agent, and

N_{a l l}

represents the number of real points originating from the target, here

N_{a l l} = 30

). After the test, the average association accuracy

\bar{P}

can be calculated, namely

\bar{P} = \frac{1}{N_{d a t a}} \sum_{i = 1}^{N_{d a t a}} P_{i}

(

N_{d a t a}

represents the amount of data in the test set, here

N_{d a t a} = 10

). The experimental results of test are shown in Algorithm 1.

As shown in the Table 1, the average association accuracy is inversely proportional to measurement error and clutter parameters in terms of the general trend, and the values of

\bar{P}

are affected by both. However, the value of

\bar{P}

exceeds 80% in all cases, and there are only a few cases where the value of

\bar{P}

is below 90%. Therefore, the test results of single-target learning process are relatively good and can basically assist in implementing the multi-target data association process.

The movement time of the target is 30 s in the multi-target data association testing process. The number of data in the testing set is 5 groups. The measurement error is

σ_{v}^{} = 0, 10, 20, 30, 40

, and each measurement error corresponds to one group of data. Each group has 10 data, and the clutter parameter is

λ = 0, 10, 20, \dots, 90

. The motion model of each track consists of any combination of CV model, CA model, CT1 model, CT2 model, and CT3 model. Figure 7 is the real track where target 5 appears suddenly at the fifth sampling interval. The state transition matrix and process noise distribution matrix of CV model and CA model are unchanged, namely

F_{C V}

,

Γ_{C V}

,

F_{C A}

, and

Γ_{C A}

. However,

F_{C T}

and

Γ_{C T}

of the CT model are also basically unchanged; only

ω

in

F_{C T}

changes, and the corresponding relationship is

CT 1 model : ω_{1} = 0.3 CT 2 model : ω_{2} = - 0.3 CT 3 model : ω_{3} = 0.5

Here, set the abbreviation of the algorithm in this paper as TDA. Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 show the measurement and association result of TDA at

σ_{v}^{} = 0, λ = 90

,

σ_{v}^{} = 10, λ = 80

,

σ_{v}^{} = 20, λ = 70

,

σ_{v}^{} = 30, λ = 60

, and

σ_{v}^{} = 40, λ = 50

, respectively. In the figure of measurement, the blue dots are measurements, and Target 1–Target 5 are the real points originating from the target. In the figure of association as result of TDA, the red dots are the real points originating from the target, and the green lines are the association track.

As shown in the following figure, when two targets have similar motion trends, close spatial distances, and close locations of real points for

N

consecutive moments, there may be the “wrong selection”, and the agent may choose to associate the clutter if the real measurement at the next moment is close to the spatial location of the clutter. However, from the association results, these problems do not affect the association process in the subsequent moments and have the minimal impact on the whole process. There are two reasons to guarantee the performance of TDA. First, the fundamental method of TDA to achieve multi-target data association is to select the point that best matches the target motion trend from the next moment of measurement based on the motion trend of

N

consecutive moments. An incorrectly associated point has less impact on this track, much less on the whole association process. Second, if the association process is interrupted due to the wrong association of some points, TDA can correct the error in time through the Adaptive Adjustment Mechanism to ensure the accuracy of association results. So, TDA can basically accurately associate to the points originating from the targets, and there are no problems of association interruptions, missed targets, and incorrect associations. It can effectively solve the data association problem of multiple targets.

In order to evaluate the performance of the algorithm more conveniently, several metrics are proposed in this paper. Assuming that the number of real points originating from the target is

n u m

, the number of all points in the measurement data is

n u m_z_a l l

, and the number of all real points originating from the target in the association result of agent is

n u m_a l l

, the average overall association accuracy

P_a l l

is

P_a l l = \frac{n u m_a l l}{n u m}

(21)

The average association accuracy

P_z t r u e

is

P_z t r u e = \frac{n u m_a l l}{n u m_z_a l l}

(22)

The average accuracy of association results for each agent is

P_e q u a l

. Assuming that the association result of an agent corresponds to the track of a real target, if the number of real points originating from the target in the association result is

n u m_i

, and the number of real points originating from the target is

n u m_z i

, then

P_e q u a l_{i} = \frac{n u m_i}{n u m_z i}

(23)

Set the number of targets in the association result to

n u m_o b s

.

Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 show the performance of TDA when

σ_{v}^{} = 0, 10, 20, 30, 40

and

λ = 0, 10, 20, \dots, 90

. Monte Carlo simulation is 100 times. Table 2 and Table 3 show the change of

n u m_a l l

and

P_a l l

, respectively. As can be seen from the figure, when

σ_{v}^{}

is larger,

P_a l l

is more affected by clutter. As

λ

gets bigger,

P_a l l

gets smaller. However, the lowest of

P_a l l

remains at 0.95, which shows the validity of association results.

Table 4 and Table 5 show the change of

n u m_z a l l

and

P_z t r u e

, respectively. As shown in the figure, when

σ_{v}^{}

is larger,

P_z t r u e

is more affected by clutter. As

λ

gets bigger,

P_z t r u e

gets smaller. However, the lowest of

P_z t r u e

is 0.96. So, the

P_z t r u e

is high and stable.

Table 6 and Table 7 show the change of

P_e q u a l

and

n u m_o b s

, respectively. As can be seen from the table,

P_e q u a l

is stable at about 0.9, which is slightly affected by clutter and has good association performance. However, combined with

P_a l l

and

P_z t r u e

,

P_e q u a l

decreases with the increase in

σ_{v}^{}

. This indicates that there may be a little crossover, aliasing, and other phenomena in the association tracks when the target is moving in parallel or hedging. However, the decrease in

P_e q u a l

is not large, which proves that the association performance is still very good. The

n u m_o b s

is stable. It is only at

σ_{v}^{} = 40, λ = 80

that a large error occurs, and

n u m_o b s

is only 6. This indicates that there are few interruptions in the association track.

The performance of this algorithm can be verified by combining the data in Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7. First, the Adaptive Adjustment Mechanism plays the positive role to ensure the continuity of association track. Second, TDA can basically learn the real motion trend of the target, which ensures the accuracy of association results. Third, the agent detection mechanism can detect new targets that appear suddenly in the environment in time. Even if the association process is interrupted, it can still find the target and restart the association in the subsequent association process. In summary, although the performance of TDA decreases when

σ_{v}^{}

and

λ

increase, the results show that the performance of TDA decreases insignificantly, and the validity of association results can still be guaranteed. It is also proved that TDA can effectively handle the problems of data association for strongly maneuvering targets, even when multiple targets doing special maneuvers are also unaffected. Moreover, TDA does not require any impractical prerequisites and meets the requirements for solving practical problems.

In this paper, the TDA algorithm is compared with the JPDAF algorithm, MHTF algorithm, Fuzzy C-Means Clustering Filter (FCMF) algorithm [29], and Generalized Multi-Bernoulli Filter (GLMBF) algorithm [30]. Considering that most targets are maneuvering in the environment, the IMM model is added to the original algorithm, namely IMM-JPDAF algorithm, IMM-MHTF algorithm, IMM-FCMF algorithm, and IMM-GLMB algorithm. In the following, the four algorithms are described by their abbreviations, namely JPDA, MHT, FCM, and GLMB. Given the shortcomings of the four algorithms, it is necessary to assume that the initial position, motion model, and starting time of all target are known during the experiment. However, the TDA algorithm does not have this priori information.

Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13 show the performance comparison of five algorithms when

σ_{v}^{} = 30, λ = 0, 10, 20, \dots, 90

. Table 8 and Table 9 show the performance comparison of

n u m_a l l

and

P_a l l

, respectively. As can be seen from the table, the

P_a l l

of TDA, MHT, and FCM is relatively stable, and the

P_a l l

of TDA is the highest.

Table 10 and Table 11 show the performance comparison of

n u m_z a l l

and

P_z t r u e

, respectively. As shown in the table, the

P_z t r u e

of TDA and FCM is high and stable. FCM is based on the principle of “one target corresponds to one measurement”, and the number of targets is assumed to be known, so

P_z t r u e

is higher. MHT adopts the principle of “target: measurement = 1: N”. JPDA is greatly affected by clutter.

n u m_z a l l

increases as clutter density increases, and

P_z t r u e

decreases for both MHT and JPDA. GLMB is slightly affected by clutter, and its performance is stable. However, there are so many cases of incorrect association in the association process that make

P_z t r u e

of GLMB very low.

Table 12 and Table 13 show the performance comparison of

P_e q u a l

and

n u m_o b s

, respectively. As can be seen from the table, JPDA and FCM assume that the number of targets is known;

n u m_o b s

of TDA is the most accurate, followed by MHT. Compared with the

n u m_o b s

of other algorithms, the difference of GLMB is larger, and there are clearly too many interruptions of association. Combined with

P_a l l

and

P_z t r u e

, the performance difference between FCM and TDA is not significant. However, according to

P_e q u a l

, the association performance of TDA is better. Obviously, when the targets do parallel motion or hedging motion, there are track crossings and track blending in the association results of FCM. The

P_e q u a l

of JPDA is determined by

n u m_a l l

,

n u m_z a l l

,

n u m_o b s

, and other factors. The

n u m_o b s

is known, and the

n u m_z a l l

is significantly higher than other algorithms, which makes the

P_e q u a l

small. So the

P_e q u a l

are all 0.

In summary, when

σ_{v}^{} = 30

, the association performance of TDA is the best, followed by FCM, MHT, JPDA, and GLMB. When solving practical problems, the algorithm complexity is high, and the algorithm performance is not necessarily good. The complexity of GLMB is obviously higher than other algorithms, but its association results are not satisfactory. JPDA and MHT are affected by clutter, and their performance is unstable. FCM is difficult to deal with strong maneuvering targets effectively. Especially, when multiple targets perform special maneuvers, the performance of FCM algorithm will be affected. Moreover, FCM, MHT, JPDA, and GLMB require many assumptions, such as target motion model, the number of targets, and other information that is impossible to know when solving the practical problem. Compared with FCM, MHT, JPDA, and GLMB, TDA does not require assuming any unpredictable information as the priori information. It can also utilize the learning network to process the data quickly and efficiently and reduce the complexity of algorithm. Moreover, TDA provides solutions to various problems that may exist in the process of multi-target data association and ensures the accuracy of association results. Based on RL technology, TDA creates a new multi-target data association architecture, which breaks all the limitations of traditional data association algorithms and significantly improves the association accuracy. It has certain practical application value.

8. Discussion

In order to solve the problems that may occur in the unknown system model, such as strong clutter interference, sensor detection error, target strong maneuvering, and new target suddenly appearing, this paper proposes a multi-target data association algorithm based on RL technology. This algorithm abandons the traditional data association framework and combines with RL technology to build a new multi-target data association network architecture. It can not only combine the dynamic exploration ability of RL and the long-term memory function of LSTM network to predict the association probability between measurement and target but also use the Bayesian network and multi-order least squares curve fitting method to predict the target position. After that, the output value is fed into the Bayesian recursive function to calculate the reward. In addition, some mechanisms to solve the practical problems are established for the characteristics of target and measurement data. Finally, it realizes the accurate association between the point and the track. The performance of this algorithm is verified by comparing with four classical and advanced algorithms in different environments. Moreover, quite complex scenarios are designed in the test process of multi-target data association. Both multiple targets in the form of formation for crossover, parallel, and other complex movement modes, as well as strong maneuver models such as turning and acceleration during the target movement were added. The special situation that new targets may suddenly appear during the association process is also fully taken into account. Such complex scenarios cover almost all possible situations in real scenarios. The experimental results demonstrate that this algorithm can still maintain a good association performance in the face of large-scale target scenarios.

This paper breaks through the shackles of traditional data association and innovatively puts forward a new idea to solve the problem of multi-target data association. When the real scenario has dense clutter distribution, strong target mobility, and large sensor errors, it is still possible to achieve accurate association between points and tracks of multiple targets, which reflects the engineering application value of this algorithm and has high theoretical research significance. Likewise, this algorithm has certain limitations. When there is stealthy target movement or a large amount of electronic interference in the environment, the points originating from the target in the measurement data acquired by the sensor may be missing, and the association is easily broken. This algorithm faces great challenges, and its performance cannot be guaranteed. During the simulation experiments, as

σ_{v}^{}

and

λ

become larger, mechanisms such as Agent detection and Ring Wave Gate Screening consume more and more time. Moreover, the frequent false association phenomenon causes the number triggering the Adaptive Adjustment Mechanism to increase and take up more and more time. The average running time of a single simulation experiment has been tested to increase by a factor of about 10, and the computational complexity is quite high. Therefore, the focus of subsequent research is to break the limitations of this algorithm and improve the performance and practicality of algorithm.

Author Contributions

Conceptualization, W.X.; Methodology, Y.C.; Software, X.G.; Validation, X.G.; Formal Analysis, X.G.; Investigation, X.G.; Resources, X.G.; Data Curation, X.G.; Writing—Original Draft Preparation, W.X. and X.G.; Writing—Review and Editing, X.G. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Under Grant. The grant number is 62001499.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Huang, X.; Wang, Y. Kalman Filter Principle and Application-MATLAB Simulation; Publishing House of Electronics Industry: Beijing, China, 2015. [Google Scholar]
Lexa, M.; Coraluppi, S.; Carthel, C.; Willett, P. Distributed MHT and ML-PMHT Approaches to Multi-Sensor Passive Sonar Tracking. In Proceedings of the 2020 IEEE Aerospace Conference, Big Sky, MT, USA, 7–14 March 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
Li, Q.; Song, L.; Zhang, Y. Multiple extended target tracking by truncated JPDA in a clutter environment. IET Signal Process. 2021, 15, 207–219. [Google Scholar] [CrossRef]
Fan, E.; Xie, W.; Pei, J.; Hu, K.; Li, X.; Podpečan, V. Improved Joint Probabilistic Data Association (JPDA) Filter Using Motion Feature for Multiple Maneuvering Targets in Uncertain Tracking Situations. Information 2018, 9, 322. [Google Scholar] [CrossRef]
He, S.; Shin, H.S.; Tsourdos, A. Distributed Multiple Model Joint Probabilistic Data Association with Gibbs Sampling-Aided Implementation. Inf. Fusion 2020, 64, 20–31. [Google Scholar] [CrossRef]
Ma, M.; Wang, D.; Sun, H.; Zhang, T. Radiation intensity Gaussian mixture PHD filter for close target tracking. Signal Process. 2021, 188, 108196. [Google Scholar] [CrossRef]
Qin, Z.; Liang, Y.; Li, K.; Zhou, J. Measurement-driven sequential random sample consensus GM-PHD filter for ballistic target tracking. Mech. Syst. Signal Process. 2021, 155, 107407. [Google Scholar] [CrossRef]
Li, T.; Prieto, J.; Fan, H.; Corchado, J.M. A Robust Multi-Sensor PHD Filter Based on Multi-Sensor Measurement Clustering. IEEE Commun. Lett. 2018, 22, 2064–2067. [Google Scholar] [CrossRef]
Streit, R.; Angle, R.B.; Efe, M. Multi-Bernoulli Mixture and Multiple Hypothesis Tracking Filters. In Analytic Combinatorics for Multiple Object Tracking; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
Díaz, Z.; Segovia, M.J.; Fernández, J.; del Pozo, E. Machine Learning and Statistical Techniques. An Application to the Prediction of Insolvency in Spanish Non-life Insurance Companies. Int. J. Digit. Account. Res. 2020, 5, 1–45. [Google Scholar] [CrossRef]
Tran, D.A.; Tsujimura, M.; Ha, N.T.; Van Binh, D.; Dang, T.D.; Doan, Q.V.; Bui, D.T.; Ngoc, T.A.; Thuc, P.T.B.; Pham, T.D. Evaluating the predictive power of different machine learning algorithms for groundwater salinity prediction of multi-layer coastal aquifers in the Mekong Delta, Vietnam. Ecol. Indic. 2021, 127, 107790. [Google Scholar] [CrossRef]
Aswad, F.M.; Kareem, A.N.; Khudhur, A.M.; Khalaf, B.A.; Mostafa, S.A. Tree-based machine learning algorithms in the Internet of Things environment for multivariate flood status prediction. J. Intell. Syst. 2021, 31, 1–14. [Google Scholar] [CrossRef]
Zhang, X.; Li, P.; Zhu, Y.; Li, C.; Yao, C.; Wang, L.; Dong, X.; Li, S. Coherent beam combination based on Q-learning algorithm. Opt. Commun. 2021, 490, 126930. [Google Scholar] [CrossRef]
Li, H.; Zhang, X.; Bai, J.; Sun, H. Quadric Lyapunov Algorithm for Stochastic Networks Optimization with Q-learning Perspective. J. Phys. Conf. Ser. 2021, 1885, 042070. [Google Scholar] [CrossRef]
Zhang, Y.; Ma, R.; Zhao, D.; Huangfu, Y.; Liu, W. A Novel Energy Management Strategy based on Dual Reward Function Q-learning for Fuel Cell Hybrid Electric Vehicle. IEEE Trans. Ind. Electron. 2021, 69, 1537–1547. [Google Scholar] [CrossRef]
Li, M.; Wang, Z.; Li, K.; Liao, X.; Hone, K.; Liu, X. Task Allocation on Layered Multi-Agent Systems: When Evolutionary Many-Objective Optimization Meets Deep Q-Learning. IEEE Trans. Evol. Comput. 2021, 25, 842–855. [Google Scholar] [CrossRef]
Zhao, B.; Ren, G.; Dong, X.; Zhang, H. Distributed Q-Learning Based Joint Relay Selection and Access Control Scheme for IoT-Oriented Satellite Terrestrial Relay Networks. IEEE Commun. Lett. 2021, 25, 1901–1905. [Google Scholar] [CrossRef]
Zhang, Q.; Lin, M.; Yang, L.T.; Chen, Z.; Li, P. Energy- Efficient Scheduling for Real-Time Systems Based on Deep Q-Learning Model. IEEE Trans. Sustain. Comput. 2017, 4, 132–141. [Google Scholar] [CrossRef]
Huang, R.; Yu, T.; Ding, Z.; Zhang, S. Policy Gradient. In Deep Reinforcement Learning: Fundamentals, Research and Applications; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.a3. [Google Scholar] [CrossRef]
Liu, Y.; Hu, Y.; Gao, Y.; Chen, Y.; Fan, C. Value Function Transfer for Deep Multi-Agent Reinforcement Learning Based on N-Step Returns. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019. [Google Scholar]
Tirinzoni, A.; Sessa, A.; Pirotta, M.; Restelli, M. Importance Weighted Transfer of Samples in Reinforcement Learning. arXiv 2018, arXiv:1805.100886. [Google Scholar]
Gamrian, S.; Goldberg, Y. Transfer Learning for Related Reinforcement Learning Tasks via Image-to-Image Translation. arXiv 2018, arXiv:1806.07377. [Google Scholar]
Khan, A.; Jiang, F.; Liu, S.; Omara, I. Playing a FPS Doom Video Game with Deep Visual Reinforcement Learning. Autom. Control Comput. Sci. 2019, 53, 214–222. [Google Scholar] [CrossRef]
Cao, J.; Liang, M.; Li, Y.; Chen, J.; Li, H.; Liu, R.W.; Liu, J. PCA-Based Hierarchical Clustering of AIS Trajectories with Automatic Extraction of Clusters. In Proceedings of the 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), Shanghai, China, 9–12 March 2018; pp. 448–452. [Google Scholar]
Li, B.; Yang, Y. Complexity of concept classes induced by discrete Markov networks and Bayesian networks. Pattern Recognit. 2018, 82, 31–37. [Google Scholar] [CrossRef]
Wang, C.; Wang, H.P.; Xiong, W.; He, Y. Data association algorithm based on least square fitting. Acta Aeronaut. Et Astronaut. Sin. 2016, 37, 1603–1613. [Google Scholar]
Jithesh, V.; Sagayaraj, M.J.; Srinivasa, K.G. LSTM recurrent neural networks for high resolution range profile based radar target classification. In Proceedings of the 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 9–10 February 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
Zhang, H.; Li, H.; Chen, N.; Chen, S.; Liu, J. Novel fuzzy clustering algorithm with variable multi-pixel fitting spatial information for image segmentation. Pattern Recognit. 2022, 121, 108201. [Google Scholar] [CrossRef]
Do, C.T.; Nguyen, T.T.D.; Moratuwage, D.; Shim, C.; Chung, Y.D. Multi-object tracking with an adaptive generalize d lab ele d multi-Bernoulli filter. Signal Process. Off. Publ. Eur. Assoc. Signal Process. 2022, 196, 108532. [Google Scholar]

Figure 1. Network architecture.

Figure 2. Cyclic Frechet Distance Clustering method.

Figure 3. Ring Wave Gate Screening (target 1).

Figure 4. Ring Wave Gate Screening (target 2).

Figure 5. Single-Target Learning Process.

Figure 6. Adaptive Adjustment Mechanism.

Figure 7. Real Track.