Article

AK-MADDPG-Based Antijamming Strategy Design Method for Frequency Agile Radar

Zhidong Zhu, Xiaoying Deng, Jian Dong, Cheng Feng and Xiongjun Fu
1 Beijing Institute of Technology, Beijing 100081, China
2 Tangshan Research Institute of BIT, Tangshan 063007, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(11), 3445; https://doi.org/10.3390/s24113445
Submission received: 29 March 2024 / Revised: 9 May 2024 / Accepted: 24 May 2024 / Published: 27 May 2024

Abstract

Frequency agility refers to the rapid variation of the carrier frequency of adjacent pulses, which is an effective radar active antijamming method against frequency spot jamming. Variation patterns of traditional pseudo-random frequency hopping methods are susceptible to analysis and decryption, rendering them ineffective against increasingly sophisticated jamming strategies. Although existing reinforcement learning-based methods can adaptively optimize frequency hopping strategies, they are limited in adapting to the diversity and dynamics of jamming strategies, resulting in poor performance in the face of complex unknown jamming strategies. This paper proposes an AK-MADDPG (Adaptive K-th order history-based Multi-Agent Deep Deterministic Policy Gradient) method for designing frequency hopping strategies in frequency agile radar. Signal pulses within a coherent processing interval are treated as agents, learning to optimize their hopping strategies in the case of unknown jamming strategies. Agents dynamically adjust their carrier frequencies to evade jamming and collaborate with others to enhance antijamming efficacy. This approach exploits cooperative relationships among the pulses, providing additional information for optimized frequency hopping strategies. In addition, an adaptive K-th order history method has been introduced into the algorithm to capture long-term dependencies in sequential data. Simulation results demonstrate the superior performance of the proposed method.

1. Introduction

Compared to fixed frequency radar, frequency agile (FA) radar exhibits higher resistance to jamming by swiftly adjusting frequency points to avoid jamming bands or dilute jamming power spectral density [1,2]. However, with the continuous advancement of jamming technologies, jammers can analyze the modulation patterns of FA radar, adaptively adjusting jamming strategies to counter frequency agile signals, challenging the antijamming performance of traditional FA radar [3,4,5]. Therefore, it is crucial to conduct research on intelligent antijamming for FA radar systems.
Reinforcement learning (RL) theory has been widely applied to the research of intelligent antijamming decision-making problems in radar. It is a branch of machine learning aimed at enabling agents to take appropriate actions to obtain high rewards [6,7]. The development and maturity of RL-related theories have significantly enhanced radar situational awareness, autonomous learning, and decision-making capabilities [8,9,10,11]. Introducing RL into the radar antijamming field and optimizing radar antijamming strategies through collecting interaction information between radar and jammer can improve the intelligence of radar antijamming decision-making capability [12,13,14].
Traditional RL struggles to handle problems with excessively large state spaces or continuous action spaces. Deep reinforcement learning (DRL) combines deep neural networks with RL, using neural networks to approximate value functions or policy functions to address the shortcomings of traditional RL [15]. For example, Deep Deterministic Policy Gradient (DDPG) is a DRL algorithm based on actor-critic networks that perform well in policy optimization problems [16,17]. The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm extends the DDPG algorithm to the multi-agent domain, introducing other agents’ actions as additional information to obtain Q-value functions [18].
Further improvements and optimizations are still required for the use of RL and DRL to address the intelligent antijamming problem of FA radar. Ref. [19] employed array beamforming data and jamming status characteristics after pulse compression sensing as model inputs, implementing intelligent updates of the antijamming knowledge base using Q-learning and SARSA algorithms to determine the optimal antijamming strategy based on the knowledge base. Ref. [20] selected 12 antijamming measures using an improved DDPG algorithm, with jamming state changes before and after the implementation of antijamming measures as feedback. Ref. [21] employed the DDPG-MADDPG (Deep Deterministic Policy Gradient and the Multi-Agent Deep Deterministic Policy Gradient) method for adaptive selection of multiple domain antijamming measures, including composite jamming types, using the antijamming improvement factor as feedback. Refs. [19,20,21] all require online recognition of jamming and acquisition of jamming state information to intelligently select antijamming measures. However, the complexity and diversity of jamming signals make the design of online recognition algorithms another challenge. Additionally, mitigating the impact of jamming online recognition on the real-time nature of radar decision-making is also a problem that needs to be addressed.
Ref. [22] utilized the past radar transmission behavior as state input, designing an antijamming strategy for FA radar based on DQN. Ref. [23] proposed a robust antijamming strategy design method by parameterizing the jamming strategy and incorporating jamming strategy perturbations into Wasserstein robust reinforcement learning. The methods presented in [22,23] focus on devising antijamming strategies specifically tailored to known jamming tactics, yet they do not delve into their applicability in scenarios involving unknown jamming tactics.
In complex electronic warfare scenarios, it is often challenging to accurately obtain the jamming status of jammers through real-time online identification. For radar systems, the jamming strategies of jammers are also unknown and dynamically changing: jammers choose different jamming strategies in different scenarios or at different stages of an engagement. Therefore, it is necessary to explore the feasibility and effectiveness of intelligent decision-making in complex unknown jamming strategy scenarios. Ref. [24] used the radar's past transmission behavior and the signal-to-interference ratio of echo signals as state inputs, optimizing the cognitive radar antijamming frequency hopping strategy with two algorithms: Q-learning and Deep Q-Network (DQN). In that study, the jammer model is unknown, and thus so is the jamming strategy faced by the FA radar; however, the jamming strategy in the simulation is fixed and single, without considering complex jamming strategies. Ref. [25] employed radar detection probability as the reward function and used proximal policy optimization (PPO) to solve the optimization problem of FA radar antijamming strategies. It also proposed a unified antijamming strategy learning method based on the policy distillation technique to combat multiple jamming strategies. The paper considers the correlation between target returns, which affects radar detection performance, and designs the agent's state using the K-th order history method to mitigate partial observability, thereby improving the optimization performance of the algorithm. Ref. [26], inspired by the successful application of single-agent reinforcement learning in cognitive systems, applied multi-agent deep reinforcement learning (MDRL) to the antijamming decision-making model of cognitive radar. By incorporating the cognitive radar and an intelligent jammer into the MDRL framework, built on the decision network of DDPG, the study explores the competition between cognitive radar and intelligent jammer and overcomes the non-stationarity of the environment. This demonstrates the effective role of MDRL theory in optimizing the antijamming decision-making of cognitive radar in "cooperation-competition" scenarios. Given the potential cooperative relationship among FA radar signal pulses, and the fact that optimizing FA radar strategies under unknown jamming strategies is a partially observable problem, the above research provides valuable references and insights into how FA radar can cope with complex and changing unknown jamming strategies.
To improve the antijamming performance of FA radar in complex unknown jamming strategy scenarios, this paper proposes a frequency hopping strategy optimization method based on AK-MADDPG by modeling and simulating the jamming environment and FA radar. The main contributions of this paper are as follows:
  • Constructed simulation models for FA radar and jammers. FA radar can control the transmission frequency of signal pulses emitted within a coherent processing time, while jammers do not intercept radar signals when transmitting signals and employ various jamming strategies. For FA radar, the jamming strategies of jammers are unknown.
  • Constructed a Markov decision process for the interaction between radar and jammer and designed various elements of multi-agent reinforcement learning. To address the insufficient research on FA radar hopping strategy optimization based on RL in complex unknown jamming strategies scenarios, this paper proposes an AK-MADDPG method for hopping strategy optimization. This method introduces multiple agents to explore the cooperative relationship among radar signal pulses and uses an adaptive K-th order history method to exploit the long-term dependencies of input data.
  • Conducted simulation verification. Under the complex jamming conditions of three mixed jamming strategies, the proposed method exhibits better antijamming performance than the traditional random hopping strategy, the classical single-agent DDPG-based method, and the MADDPG-based method.

2. Materials

When studying the game between the FA radar and the frequency spot jammer, understanding the specific principles of both is essential. The following briefly outlines the FA radar signal and jamming echo signal model, along with their element design within a multi-agent reinforcement learning framework.

2.1. The Principle of Antijamming for FA Radar

For FA radars, frequency agility can be categorized into inter-pulse frequency agility and inter-group frequency agility. The alteration of the radar carrier frequency makes interception and prediction of the carrier frequency by the jammer challenging, thereby enhancing the radar's antijamming capability. Inter-pulse frequency agility refers to the inclusion of multiple signal pulses within a coherent processing interval (CPI), with the carrier frequency of each pulse being subject to arbitrary changes. Compared to inter-group frequency agility, inter-pulse frequency agility endows the radar with greater resilience against jamming.
The FA radar system, operating with inter-pulse frequency hopping, is capable of effectively countering frequency spot jamming and possesses the ability to reduce the probability of radar signals being intercepted.
Upon detecting our radar signals, the jammer promptly emits high-power noise jamming signals to suppress genuine target echoes, rendering the radar unable to detect real targets. This type of jamming can even overload the radar system by introducing an excessive number of high-power false target signals, thereby disrupting normal radar operation. The principle by which FA radar counters frequency spot noise suppression jamming is illustrated in Figure 1. The carrier frequency used for jamming differs from that of the echo signal, so radio frequency filtering can significantly suppress the energy of the frequency spot noise jamming, allowing the FA radar to effectively counteract it.
Due to current hardware limitations, spot jammers cannot scan the entire ultra-wide frequency band simultaneously when searching for radar signals; instead, they must divide it into segments and scan them one by one. If the radar carrier frequency changes between pulses during this process, the jammer is unable to intercept all radar signals. Furthermore, the jammer categorizes the detected radar signals based on parameters such as carrier frequency band, pulse repetition frequency, and waveform to determine whether they belong to the same group of radar signals. Consequently, when the radar employs frequency hopping, it severely disrupts the jammer's classification of radar signals, greatly reducing the probability of the radar signals being intercepted.

2.2. The Frequency Agile Radar Signal and Jamming Echo Signal Model

Figure 2 illustrates the schematic diagram of inter-pulse frequency agile radar with pseudo-random frequency hopping. Pseudo-random frequency hopping is an extremely flexible frequency hopping method: when some frequency bands are covered by jammers, the radar can transmit in other frequency bands and filter out the jamming while retaining the target echoes.
Assume that the FA radar transmits $M$ linear frequency modulation (LFM) pulses within a CPI, with pulse bandwidth $B$, frequency modulation rate $\gamma = B/T_p$, and pulse width $T_p$. The carrier frequency of the $m$-th pulse is $f_m = f_0 + d_m \Delta f$, with $d_m \in \{0, 1, \ldots, N-1\}$, where $N$ is the maximum number of hopping points, $m \in \{0, 1, \ldots, M-1\}$, and $T_m$ is the transmission time of the $m$-th pulse. The $m$-th signal pulse can be represented as
$$s_m(t) = \mathrm{rect}\left(\frac{t - T_m}{T_p}\right)\exp\left(j\pi\gamma(t - T_m)^2\right)\exp\left(j2\pi f_m(t - T_m)\right)$$
In the equation, $t$ represents the fast time and $\mathrm{rect}(\cdot)$ denotes the rectangular pulse, $\mathrm{rect}(x) = \begin{cases} 1, & |x| \le 0.5 \\ 0, & |x| > 0.5 \end{cases}$.
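For illustration, the following minimal NumPy sketch generates the complex envelope of one such pulse. The carrier is kept at baseband relative to $f_0$ (only the hop offset $d_m \Delta f$ is modulated), and all parameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def fa_pulse_baseband(d_m, delta_f=20e6, B=10e6, Tp=10e-6, fs=200e6):
    """Complex envelope of one FA-radar LFM pulse, referenced to the base carrier f0.

    d_m selects the hop point; the offset d_m * delta_f is kept at baseband so the
    sampling rate stays modest. All numeric parameters are illustrative assumptions.
    """
    gamma = B / Tp                                      # frequency modulation rate gamma = B / Tp
    t = np.arange(int(Tp * fs)) / fs                    # fast time within the pulse, i.e. (t - T_m)
    lfm = np.exp(1j * np.pi * gamma * t ** 2)           # linear frequency modulation term
    hop = np.exp(1j * 2 * np.pi * d_m * delta_f * t)    # agile carrier offset term
    return lfm * hop

pulse = fa_pulse_baseband(d_m=2)   # third hop point, i.e. f_m = f_0 + 2 * delta_f
```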
Assume the FA radar is subject to frequency spot noise AM jamming. The echo signal with jamming received by the radar can be expressed as
$$s_R(t) = A_m\,\mathrm{rect}\left(\frac{t - \tau_m}{T_p}\right)\exp\left(j\pi\gamma(t - \tau_m)^2\right)\exp\left(j2\pi f_m(t - \tau_m)\right) + n(t) + J(t)$$
where $A_m$ and $\tau_m$ denote the echo signal amplitude and echo delay of the $m$-th pulse, respectively, $n(t)$ represents receiver noise, which follows a Gaussian distribution, and $J(t)$ is the frequency spot noise AM jamming:
$$J(t) = [U_0 + U_n(t)]\cos(2\pi f_j t + \phi)$$
where $U_n(t)$ represents the modulating noise, modeled as Gaussian band-limited white noise with zero mean and variance $\delta^2$. The phase $\phi$ follows a uniform distribution and is independent of $U_n(t)$. $U_0$ denotes the carrier voltage of the jammer, and $f_j$ is the center frequency of the jamming signal, which approximates the center frequency of the radar transmission targeted by the frequency spot noise AM jamming.
Assume that the radar cross section (RCS) of the target is $\sigma$, the channel gain from the frequency agile radar to the target and jammer is $h_s$, and the noise power of the environmental clutter is $P_n$. Let $f_m$ denote the carrier frequency of the $m$-th pulse within one CPI of the FA radar. The signal-to-interference plus noise ratio (SINR) after pulse compression for the $m$-th pulse can be expressed as
$$\mathrm{SINR}_m = \frac{P_s h_s^2 \sigma D}{P_n + P_j h_s I(f_j = f_m)}$$
where $P_s$ represents the transmission power of the frequency agile radar and $P_j$ denotes the signal power of the frequency spot noise AM jamming. Assuming that filtering completely eliminates the jamming when $f_j \neq f_m$, the indicator $I(f_j = f_m)$ equals 1 if $f_j = f_m$ and 0 otherwise. $D$ stands for the time–bandwidth product of the LFM signal.
After pulse compression, followed by coherent processing, in an ideal scenario, the energy accumulation of M echo pulses is M 2 times that of a single pulse. Since noise samples are independent and zero-mean, the total power of noise is the sum of the powers of individual noise samples. Similarly, the frequency spot noise AM jamming samples entering the receiver, originating from jammers, follow the same statistical distribution, and the total jamming power is the sum of the powers of individual jamming samples. Frequency agility between pulses leads to irregular phase variations in echo signals, preventing direct coherent accumulation. This necessitates phase compensation algorithms and may result in coherent degradation. Consequently, compared to the ideal coherent accumulation scenario, there is a certain loss in SINR after the coherent accumulation of FA radar pulse signals. Therefore, the SINR after coherent accumulation is given by the following:
$$\mathrm{SINR} = \frac{M^2 P_s h_s^2 \sigma D}{L M P_n + \sum_{m=1}^{M} P_j h_s I\left(f_{j_m} = f_m\right)}$$
where $f_{j_m}$ denotes the jamming frequency at the time of the $m$-th pulse and $L$ represents the loss factor, which measures the extent of coherent degradation.
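The sketch below evaluates this coherent-accumulation SINR for one CPI, using the numerical values of Table 1; the placement of $L$ in the denominator follows the expression above, and the chosen value L = 1.2 is an assumption.

```python
import numpy as np

def sinr_cpi(d, j, P_s=1000.0, P_j=80000.0, P_n=10.0, h_s=0.1, sigma=1.0, D=50.0, L=1.2):
    """SINR after coherent accumulation of the M pulses in one CPI.

    d[m] is the radar hop index of pulse m and j[m] the jammer frequency index at
    that time; a pulse is jammed when they coincide. Numeric defaults follow
    Table 1 except L, which is an assumed coherent-degradation loss factor.
    """
    d, j = np.asarray(d), np.asarray(j)
    M = len(d)
    signal = M ** 2 * P_s * h_s ** 2 * sigma * D        # ideal M^2 coherent gain
    jam = np.sum(P_j * h_s * (j == d))                  # jamming power of the hit pulses
    return signal / (L * M * P_n + jam)

# Example: four pulses, the jammer only hits pulse 0.
print(10 * np.log10(sinr_cpi(d=[0, 1, 1, 1], j=[0, 2, 2, 2])), "dB")
```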

2.3. The Design of Multi-Agent Reinforcement Learning Elements

As shown in Figure 3, the FA radar and the frequency spot jammer can be regarded as the multi-agent system and the environment of multi-agent reinforcement learning, respectively. At time step $t$, the radar is in state $s_t$ and takes action $a_t$. The jammer executes its jamming action, causing the environment to transition to the next state $s_{t+1}$, while the radar receives a reward $r_t$. This process continues iteratively until the end of the game round.
It is worth noting that in unknown jamming strategy environments, the radar’s input does not include the jammer’s state. The specific design of the reinforcement learning elements is as follows.

2.3.1. Action

Assuming the FA radar transmits $M$ signal pulses within one CPI, the pulses can be modeled as a multi-agent system comprising $M$ agents. For each signal pulse, the carrier frequency is selected from a given set of $N$ frequency points.
The FA radar's action within the $t$-th CPI can be represented by a vector $a_t$ of length $1 \times M$:
$$a_t = [a_0, \ldots, a_{M-1}]$$
where each element represents the action taken by the corresponding agent, i.e., the carrier frequency of the pulse. Each element takes a value from 0 to $N-1$, corresponding to pulse carrier frequencies from $f_0$ to $f_0 + (N-1)\Delta f$.
The jammer's actions can also be represented by a $1 \times M$ vector $a_t^j$, encoded according to the jammer's situation: $K$ represents the jammer in the intermittent observation state, $\Phi$ represents the jammer emitting barrage jamming, and $k$ represents the jammer emitting frequency spot jamming, where $k \in \{0, 1, \ldots, N-1\}$ corresponds to the center frequency $f_c + k\Delta f$ of the jamming pulse.
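A minimal sketch of these action encodings is given below; the tuple-based jammer encoding and the helper names are assumptions made for illustration, since the paper only specifies that the three jammer situations must be distinguishable.

```python
import numpy as np

N_FREQ = 3      # available hop points per pulse (Table 1)
M_PULSES = 4    # pulses per CPI (Table 1)

def random_radar_action(rng):
    """Radar action a_t: one carrier-frequency index in {0, ..., N-1} per pulse/agent."""
    return rng.integers(0, N_FREQ, size=M_PULSES)

def jammer_action_element(mode, k=None):
    """One element of the jammer action vector: intermittent observation,
    barrage jamming, or spot jamming centred on f_c + k * delta_f."""
    assert mode in ("observe", "barrage", "spot")
    return (mode, k if mode == "spot" else None)

rng = np.random.default_rng(seed=0)
a_t = random_radar_action(rng)                   # e.g. array([0, 2, 2, 0])
a_t_j = [jammer_action_element("spot", k=1)] * M_PULSES
```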

2.3.2. State

In multi-agent reinforcement learning theory, the other agents can be regarded as part of the environment for the current agent. Meanwhile, in complex unknown jamming scenarios, the radar cannot effectively perceive jamming actions. Since the radar cannot acquire the jammer's strategy, an agent's observation of the environment only includes the actions taken by the other agents. It should be noted that, because jamming actions may depend on the radar's historical actions, agents cannot make correct decisions based solely on the previous observation; they need to make decisions based on historical states. In reinforcement learning theory, the historical state $H_T^i$ of the $i$-th agent at time $T$ is expressed as follows:
$$H_T^i = \{a_0[i], \ldots, a_{T-1}[i], o_1^i, \ldots, o_T^i\}$$
Here, $o^i$ denotes the observation of the $i$-th agent, i.e., the actions taken by all agents except the current $i$-th agent:
$$o^i = \{a[0], \ldots, a[j], \ldots, a[M-1]\}, \quad j \neq i$$
In unknown environments, if an agent relies solely on observing the actions of other agents to make decisions, then $H_T^i$ contains all the information the agent requires. As time progresses, the size of $H_T^i$ also increases, making it impractical for the agent to use it directly as a state. To address this issue, the K-th order history method is employed to approximate the historical state. The K-th order history method is a technique commonly used to handle problems with long-term dependencies; in essence, it approximates $H_T^i$ using the past $K$ observations and actions. Therefore, the state $S_T^i$ of the $i$-th agent at time $T$ can be represented as
$$S_T^i = \{o_T^i, a_{T-1}[i], o_{T-1}^i, \ldots, a_{T-K}[i]\}$$
The state obtained by the K-th order history method is only an approximation of the historical state, so there may be a significant loss of historical information in some cases. In unknown environments, the lack of prior information means the selection of the $K$ value cannot rely on prior knowledge, and a poor choice of $K$ can lead to suboptimal performance in multi-agent reinforcement learning. This paper proposes an adaptive K-th order history method, which adaptively adjusts $S_T^i$ based on actual feedback; the specific implementation is explained later in this paper.
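As a concrete illustration, the sketch below assembles a fixed-K history state for one agent. Zero-padding before $K$ steps have elapsed, and keeping the oldest observation $o_{T-K}^i$ as well (for a uniform vector layout), are assumptions of this sketch rather than details given in the paper.

```python
from collections import deque

class KthOrderHistory:
    """Fixed-K history state S_T^i for agent i: the current observation o_T^i
    followed by the past K (own action, observation) pairs, newest first."""

    def __init__(self, K, obs_dim):
        self.K = K
        self.obs_dim = obs_dim
        self.pairs = deque(maxlen=K)       # stores (a_{T-k}[i], o_{T-k}^i), newest first

    def push(self, action_i, obs_i):
        """Record the agent's own action and its observation at the completed step."""
        self.pairs.appendleft((float(action_i), list(obs_i)))

    def state(self, current_obs):
        flat = list(current_obs)           # o_T^i: actions of the other agents
        for a, o in self.pairs:            # a_{T-1}[i], o_{T-1}^i, ..., a_{T-K}[i], o_{T-K}^i
            flat.append(a)
            flat.extend(o)
        target_len = self.obs_dim + self.K * (1 + self.obs_dim)
        return flat + [0.0] * (target_len - len(flat))   # zero-pad early in the episode
```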

2.3.3. Reward

This paper employs the SINR of radar echoes as a metric to characterize the reward obtained by the radar. SINR is an important performance metric in radar systems, indicating the ratio of the target echo signal power to the jamming and noise power received at the receiver. The magnitude of SINR directly impacts the performance and effectiveness of radar systems. A higher SINR implies stronger received signals relative to noise and jamming, thereby enhancing detection capabilities, tracking accuracy, and reducing false alarm rates. Conversely, a lower SINR makes it challenging to distinguish targets from noise and jamming, thereby affecting the detection capabilities and performance of radar systems. The reward for the n-th agent can be expressed as
$$r_n = C_R \log(\mathrm{SINR})$$
In this context, $C_R$ is the reward coefficient. To promote smoothness in the reward variation, the logarithm of the SINR is used in the computation. From the above equation, it can be observed that each agent receives the same reward, which depends not only on its own state but also on the states of the other agents.
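A one-line sketch of this shared reward is shown below; the value of the reward coefficient is an illustrative assumption.

```python
import numpy as np

C_R = 1.0   # reward coefficient (illustrative value, not specified in Table 2)

def shared_reward(sinr):
    """Shared per-agent reward r_n = C_R * log(SINR); the logarithm smooths the
    reward variation, and every pulse/agent in the CPI receives the same value."""
    return C_R * np.log(sinr)
```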

3. Methods

In reinforcement learning, the policy followed by an agent when taking actions is denoted as $\pi$. The policy represents the mapping from states to actions, $\pi: S \rightarrow A$. Assume the initial state of the agent is $s_0$. The agent transitions to state $s_1$ after executing action $a_0$ according to policy $\pi$ and receives the reward $R(s_0, a_0)$. This continues iteratively, and based on the Markov state transition chain, the total discounted reward, denoted as $Q^\pi(s_0)$, can be obtained:
$$Q^\pi(s_0) = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$$
Here, γ is the discount factor, used to measure the importance the decision-maker places on future rewards. The discount factor takes values between 0 and 1, where a value closer to 0 indicates more emphasis on immediate rewards, while a value closer to 1 indicates more emphasis on future rewards.
We can evaluate the quality of a policy using the $Q^\pi$ function, and we can also assess the value of actions using the state–action value function $Q^\pi(s, a)$. It is defined as the expected reward obtained by the agent when executing action $a$ in state $s$ under policy $\pi$:
$$Q^\pi(s, a) = R(s) + \gamma Q^\pi(s', a')$$
where $s'$ represents the next state and $a'$ represents the action to be taken in the next state under policy $\pi$.
The ultimate goal of reinforcement learning is to find an optimal policy $\pi^*$ such that its $Q^{\pi^*}$ function is maximal among the $Q^\pi$ functions of all policies:
$$\pi^*(a|s) = \begin{cases} 1 & \text{if } a = \operatorname*{argmax}_{a \in A} Q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
Once the optimal policy $\pi^*$ is obtained, the agent can select the optimal action for the current state based on $Q^{\pi^*}(s, a)$.
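For a tabular problem, extracting this deterministic greedy policy from a Q-table is a few lines of NumPy; the toy Q-values below are purely illustrative.

```python
import numpy as np

def greedy_policy(Q):
    """Deterministic optimal policy pi*(a|s): probability 1 on the argmax action
    of each row of a tabular Q(s, a) array, 0 elsewhere."""
    policy = np.zeros_like(Q)
    policy[np.arange(Q.shape[0]), np.argmax(Q, axis=1)] = 1.0
    return policy

Q = np.array([[0.2, 0.9, 0.1],     # toy Q-values: 2 states x 3 actions
              [0.5, 0.4, 0.8]])
print(greedy_policy(Q))            # [[0. 1. 0.], [0. 0. 1.]]
```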

3.1. The DDPG Algorithm

The DDPG algorithm is a method capable of effectively solving or approximating the optimal policy. Its algorithmic structure is illustrated in Figure 4.
As depicted in the diagram, the DDPG algorithm consists of four neural network modules for the agent: the policy network (Actor network), the value network (Critic network), the target policy network (Actor–Target network), and the target value network (Critic–Target network).
The role of the policy network is to output an action $a$ based on the current state $s$, which can be represented by the function $\mu_\theta(s)$. Meanwhile, the value network evaluates the actions output by the policy network, corresponding to the state–action value function $Q^\pi(s, a)$ mentioned earlier. Initially, the value network, acting as a judge, does not know whether the actions output by the policy network are good enough; it needs continuous parameter updates to provide accurate scores. Using the next-step value $Q_{\omega'}(s', a')$ approximated by the target value network, the actual reward $r$, and the Q-value produced by the value network, the Critic's loss function is constructed as a mean squared error to be minimized:
$$Loss_{critic} = \mathrm{MSE}\left(Q_\omega(s, a),\ r + \gamma Q_{\omega'}(s', a')\right)$$
This optimization process continuously minimizes the loss function. The optimization principle for the policy network is similar. Since the goal of the policy network is to find an action $a$ that maximizes the output value $Q$ of the value network, the policy network is optimized by maximizing the Q-value output by the value network, which corresponds to minimizing the following loss function:
$$Loss_{actor} = -Q_\omega(s, \mu_\theta(s))$$
The target networks are updated periodically through soft updates: the parameters of the value network and the policy network are copied at regular intervals and weighted into the target network parameters by a certain coefficient. The purpose of the target networks is to reduce training instability and improve the algorithm's performance. The DDPG algorithm solves for the optimal policy by simultaneously optimizing the policy network and the value network.
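A compact PyTorch sketch of one DDPG update step is given below. It assumes `actor`, `critic` and their target copies are modules with the usual signatures, that `batch` holds tensors `(s, a, r, s_next)`, and it omits the optimizer steps between the loss computation and the soft update; the soft-update coefficient `tau` is an assumed value.

```python
import torch
import torch.nn.functional as F

def ddpg_losses_and_soft_update(actor, critic, actor_tgt, critic_tgt, batch,
                                gamma=0.95, tau=0.01):
    """Compute the DDPG critic/actor losses and softly update the target networks."""
    s, a, r, s_next = batch

    with torch.no_grad():
        a_next = actor_tgt(s_next)                         # mu_theta'(s')
        y = r + gamma * critic_tgt(s_next, a_next)         # r + gamma * Q_omega'(s', a')
    critic_loss = F.mse_loss(critic(s, a), y)              # Loss_critic

    actor_loss = -critic(s, actor(s)).mean()               # Loss_actor = -Q_omega(s, mu_theta(s))

    # Soft update: theta_target <- tau * theta + (1 - tau) * theta_target
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

    return critic_loss, actor_loss
```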

3.2. The Process of the AK-MADDPG Algorithm

For FA radar, effective antijamming often requires cooperation among pulses within a single CPI, indicating a cooperative relationship among pulses. Traditional single-agent systems have struggled to exploit these interdependencies effectively. Moreover, multi-agent reinforcement learning has demonstrated outstanding performance in “cooperative-competitive” scenarios, making it a viable approach for optimizing frequency hopping strategies in FA radar.
The DDPG algorithm has found wide application in the single-agent domain, and its extension to the multi-agent domain yields the MADDPG algorithm. In the MADDPG algorithm, each agent in the multi-agent system has neural networks similar to those in the DDPG algorithm, with each agent possessing its own policy and value networks. Unlike the DDPG algorithm, the MADDPG algorithm introduces actions from other agents as additional information to obtain Q-value functions, where the input to the value network includes not only the current state of the agent but also the states of other agents. This additional input helps agents explore the environment while learning the actions of other agents, prompting them to collaborate to cope with complex unknown external environments.
Given that the actions of jammers may be related to the radar’s historical actions, agents in multi-agent reinforcement learning may struggle to make correct decisions based solely on the previous observation. Therefore, this study improves the MADDPG algorithm by using an adaptive K-th order history method to enhance the input states, enabling them to contain the information necessary for radar decision-making. The improved AK-MADDPG algorithm’s schematic diagram is illustrated in Figure 5.
The K-th order history method approximates the historical state using the past $K$ states closest to the current time. However, the information crucial for the agent's decision may not be contained within exactly those states, potentially resulting in suboptimal reinforcement learning outcomes. To address this issue, we propose an adaptive K-th order history method. Its principle is to add $K$ extra actions, related to the K-th order history, into the agent's action space; the values of these actions determine which past states are aggregated when approximating the history. For instance, as illustrated in Figure 6, at time T with K = 3, the values of these actions are [0, 2, 4]. Consequently, the observations and actions at times T, T−2, and T−4 are combined, rather than simply the three observations and actions adjacent to the current time. Through the reinforcement learning process, the agent gradually learns to select the optimal $K$ past states to approximate the historical state. As learning progresses, $S_T^i$ therefore incorporates more of the information essential to the agents. It functions akin to a self-attention mechanism, allowing the agent to focus on important information while remaining resource-efficient and simple to implement.
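The following sketch shows how the learned offsets could be used to assemble the adaptive history state. Treating offset 0 as the most recent completed step and clipping indices at the start of an episode are assumptions of this illustration, not details specified by the paper.

```python
import numpy as np

def adaptive_history_state(current_obs, past_obs, past_act, offsets):
    """Adaptive K-th order history: the K learned `offsets` (e.g. [0, 2, 4] for K = 3)
    select WHICH past (action, observation) pairs are aggregated with the current
    observation, instead of always taking the K most recent ones.

    past_obs / past_act hold the agent's completed steps, oldest first."""
    parts = list(current_obs)                         # o_T^i
    last = len(past_obs) - 1                          # index of the most recent completed step
    for k in offsets:
        idx = max(last - int(k), 0)                   # clip for early time steps
        parts.append(float(past_act[idx]))            # own action at the selected step
        parts.extend(past_obs[idx])                   # observation at the selected step
    return np.asarray(parts, dtype=np.float32)

# Example: aggregate the selected past steps with the current observation.
state = adaptive_history_state(current_obs=[1, 2, 0],
                               past_obs=[[0, 1, 2], [2, 2, 1], [1, 0, 0], [0, 2, 2], [1, 1, 2]],
                               past_act=[0, 2, 1, 1, 0],
                               offsets=[0, 2, 4])
```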
One of the primary advantages of the adaptive K-th order history method is that it enhances the agents' understanding of the long-term evolution of the environment. By considering historical states, agents can better capture long-term dependencies in actions and the environment, thereby improving decision stability and reliability. The AK-MADDPG algorithm proceeds as shown in Algorithm 1; a Python skeleton of the training loop is sketched after the pseudocode.
Algorithm 1 The AK-MADDPG algorithm procedure
1: Initialize the network parameters for each agent i: θ_i^Q, θ_i^π
2: for training epoch from 1 to N do
3:  Randomly initialize the state s for each agent
4:  for each time step T from 0 to the end of the training epoch do
5:    Select action a_i based on the current policy π and state s for each agent i
6:    Execute the joint action a_T = [a_0, …, a_{M−1}] and receive the rewards r
7:    The environment transitions to the next state s′
8:    Store the data (s, a_i, r, s′) in the buffer D_i
9:    Update the state: s ← s′
10:     for each agent i from 1 to M do
11:      if the experience data reaches a trainable amount then
12:       Sample a batch of data from the experience buffer D_i
13:       Compute the losses according to the loss functions
14:       Update the network parameters θ_i^Q, θ_i^π
15:      end if
16:     end for
17:   end for
18: end for
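The sketch below is a high-level Python skeleton of Algorithm 1. The `env` and per-agent `Agent` interfaces (`reset`, `step`, `act`, `buffer`, `update`) are assumed wrappers around the radar/jammer simulation, the per-agent replay buffers D_i, and the AK-MADDPG networks described above; the warm-up threshold and batch size are illustrative.

```python
def train(env, agents, n_epochs=3000, warmup=256, batch_size=64):
    """High-level AK-MADDPG training loop following Algorithm 1 (assumed interfaces)."""
    for epoch in range(n_epochs):
        states = env.reset()                                        # randomly initialized states
        done = False
        while not done:
            actions = [ag.act(s) for ag, s in zip(agents, states)]  # a_T = [a_0, ..., a_{M-1}]
            next_states, rewards, done = env.step(actions)          # jammer reacts, SINR-based reward
            for i, ag in enumerate(agents):
                ag.buffer.store(states[i], actions[i], rewards[i], next_states[i])
            states = next_states                                    # s <- s'
            for ag in agents:
                if len(ag.buffer) >= warmup:                        # trainable amount reached
                    batch = ag.buffer.sample(batch_size)
                    ag.update(batch, all_agents=agents)             # critic also sees other agents' actions
```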

4. Discussion

This section validates the optimization effectiveness of the frequency hopping strategy based on AK-MADDPG through the simulation experiments.
To validate the effectiveness of the proposed method, we assume that the jammer can adopt three different jamming strategies. As depicted in Figure 7, the distinctions among the three jamming strategies lie primarily in the duration of intermittent observation and the timing of emitting jamming signals. It should be noted that these three strategies are basic strategies; the jammer can realize more complex behavior by switching between them over a long game time. The three basic jamming strategies are described below, followed by a minimal simulation sketch.
  • Jamming Strategy 1: As illustrated in the figure, the intermittent observation time of Jamming Strategy 1 equals the duration of a signal pulse. After intercepting the pulse, the jammer will emit a frequency spot jamming signal.
  • Jamming Strategy 2: As shown in the figure, the intermittent observation time of Jamming Strategy 2 equals the duration of two radar pulses, slightly longer than that of Jamming Strategy 1. Unlike Jamming Strategy 1, to avoid being deceived by the radar, the jammer skips the first of the two intercepted pulses, and the center frequency of its frequency spot jamming corresponds to the carrier frequency of the second pulse.
  • Jamming Strategy 3: As depicted in the figure, the intermittent observation time of the jammer equals the duration of one CPI, which means the jammer will continuously intercept pulses for some time and use them as a basis for emitting jamming signals thereafter. To avoid being deceived by the radar, the same jammer will skip the first intercepted pulse and then compare the frequencies of the remaining pulses. If the frequencies of the remaining pulses are consistent, the jammer will emit frequency spot jamming with the same center frequency as the intercepted pulse’s carrier frequency. If the remaining pulses correspond to two different frequencies, the jammer will alternately emit frequency spot jamming corresponding to these two different frequencies. If the remaining pulses correspond to more than two different frequencies, the jammer will emit barrage jamming. It is assumed that the duration of jamming lasts for 4 CPIs.
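The sketch below mirrors the verbal descriptions of the three strategies as a simple decision rule. The encoding of the jammer action (the `('spot', k)` / `('barrage', None)` / `('observe', None)` tuples) and the alternation rule for two observed frequencies are assumptions made for illustration.

```python
def jammer_action(strategy, intercepted, step, cpi_len=4):
    """Return one jammer action given the radar frequency indices intercepted so far.

    `intercepted` is the chronological list of observed hop indices, `step` is the
    current time index (used only to alternate between two spot frequencies)."""
    if strategy == 1:                                        # jam the last intercepted pulse
        return ("spot", intercepted[-1]) if intercepted else ("observe", None)
    if strategy == 2:                                        # observe two pulses, skip the first
        return ("spot", intercepted[1]) if len(intercepted) >= 2 else ("observe", None)
    if strategy == 3:                                        # observe a whole CPI, then decide
        if len(intercepted) < cpi_len:
            return ("observe", None)
        freqs = sorted(set(intercepted[1:cpi_len]))          # skip the first (possibly deceptive) pulse
        if len(freqs) == 1:
            return ("spot", freqs[0])
        if len(freqs) == 2:
            return ("spot", freqs[step % 2])                 # alternate between the two frequencies
        return ("barrage", None)
    raise ValueError("unknown jamming strategy")
```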
The radar simulation parameters are set as shown in Table 1:
We employ observations and actions from the past K = 2 pulses to estimate the agent’s historical observations and behaviors.
The algorithm parameters are configured as shown in Table 2:
In the simulation, it is assumed that the FA radar uses the first pulse within each CPI as a deceptive pulse to induce the jammer to jam on the wrong frequency band. During the coherent processing, the FA radar will exclude the deceptive pulse and process only the remaining pulses. In the cooperative scenario, the reward in multi-agent reinforcement learning is the sum of rewards for all agents. However, since the deceptive pulse does not affect the coherent processing outcome of the FA radar, the reward for this agent is ignored.
The reward curves for combating the three different jamming strategies using the method proposed in this paper and three other methods are shown in Figure 8. The x-axis represents the number of training epochs, while the y-axis represents the total reward obtained during a single training epoch. It can be observed from the figure that all three DRL-based frequency hopping strategy design methods outperform the random hopping method, demonstrating their effectiveness in learning effective antijamming strategies. Because jamming strategies 1 and 2 are relatively simple, even in the absence of prior knowledge about the jamming strategy, the optimization results of the three DRL-based antijamming strategies can approach the theoretically optimal strategy. As shown in Figure 8c, the performance of the proposed method is significantly better than the other three methods, particularly when dealing with the more complex Jamming Strategy 3. Additionally, the performance of the MADDPG algorithm surpasses that of the DDPG algorithm, highlighting the advantage of multi-agent systems in handling complex scenarios. It is noteworthy that during the first 256 epochs, the performance of the three DRL-based methods is similar to that of the random hopping method because during this period, the algorithm only stores data in the experience buffer without updating the network parameters.
The antijamming strategies learned by the proposed method against the three jamming strategies are shown in Figure 9. The x-axis represents time, while the y-axis represents the carrier frequency of the pulses. As shown in Figure 9a, most of the frequencies of pulse 1 are $f_0$; since pulse 1 is intercepted by the jammer, the other pulses avoid $f_0$ and mostly concentrate around $f_1$. Similarly, in Figure 9b, most of the frequencies of pulse 2 are $f_2$, and the other pulses avoid the frequency of pulse 2, mostly concentrating around $f_1$; as pulse 1 serves as a deceptive pulse and contributes nothing to coherent accumulation, its carrier frequencies are evenly distributed among the three frequency points. In Figure 9c, to counteract Jamming Strategy 3, the carrier frequencies of pulses 2, 3, and 4 tend to be consistent, alternating between $f_0$ and $f_1$.
Building upon the three fundamental jamming strategies, we construct more intricate jamming strategy scenarios through combinations. We assume that the jammer initially employs Jamming Strategy 1, switching to Jamming Strategy 2 upon detecting deceptive pulses, and ultimately adopting Jamming Strategy 3 upon identifying the radar signal as frequency agile. To streamline the simulation, Jamming Strategy 1 and Jamming Strategy 2 are maintained for 20 gaming rounds each, followed by exclusive utilization of Jamming Strategy 3 thereafter. To verify the convergence of the algorithm as the number of pulses increases, the number of pulses within one CPI and available frequency points for the radar are increased. The number of pulses is increased to 8, and the available frequency points are increased to 7. Additionally, to allow for longer convergence time for the algorithms, we extend the training rounds to 90,000. The simulation results are presented in the following Figure 10.
Figure 10 displays the reward curves under the scenario of complex unknown jamming strategies, based on the method proposed in this paper and four other methods. The x-axis represents the number of training epochs, while the y-axis represents the total reward obtained within a single training epoch. From Figure 10, it can be observed that the performance of the random frequency hopping method remains the poorest. The performance of multi-agent reinforcement learning methods surpasses that of single-agent reinforcement learning methods. Moreover, the performance of the method proposed in this paper is significantly better than that of the methods based on MADDPG with K-th order history and MADDPG alone. This indicates the clear advantage of the AK-MADDPG algorithm proposed in this paper when facing scenarios with complex unknown jamming strategies.

5. Conclusions

In the context of a single jamming strategy scenario, reinforcement learning-based radar antijamming decision-making methods often deviate from the practical adversarial gaming environment of complex and unknown jamming strategies, thereby limiting their application in practical electronic warfare. To address this issue, this paper proposes a radar frequency agile antijamming strategy optimization method based on multi-agent reinforcement learning, delving into the optimization problem of antijamming strategies for frequency agile radars in complex and unknown jamming strategy scenarios.
In complex and unknown jamming strategy scenarios, transforming the original single-agent problem into a multi-agent problem yields better results. This is because multi-agent reinforcement learning introduces additional information during training, enabling better learning and optimization of behavior in complex and unknown environments. Additionally, this paper improves the MADDPG algorithm by introducing an adaptive K-th order history method to exploit long-term dependencies in the input data sequences. Simulation results demonstrate the effectiveness and superiority of the proposed method.
Although the proposed method outperforms other methods when facing complex and unknown jamming strategies, it still falls short of achieving the theoretically optimal antijamming effect, indicating that there is still room for improvement in the algorithm. Moreover, to simplify the analysis, the number of pulses and frequency points in the simulation is set relatively low, which may not fully reflect the actual optimization of frequency agile radar hopping strategies. These aspects will be further refined and improved in future research.

Author Contributions

Conceptualization, Z.Z. and C.F.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z.; formal analysis, Z.Z.; investigation, Z.Z.; resources, Z.Z., C.F. and J.D.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, Z.Z., X.D. and X.F.; supervision, X.D., X.F. and J.D.; project administration, X.D. and X.F.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by 111 Project of China, grant number B14010.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, S.; Cao, Y.; Yeo, T.; Wu, W.; Liu, Y. Adaptive Clutter Suppression in Randomized Stepped-Frequency Radar. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 1317–1333.
  2. Quan, Y.H.; Wu, Y.J.; Li, Y.C.; Sun, G.C.; Xing, M.D. Range–Doppler reconstruction for frequency agile and PRF-jittering radar. IET Radar Sonar Navig. 2018, 12, 348–352.
  3. Liu, Y.; Li, P.; Wang, J.; Xie, D.; Jiang, D. Research on Jamming to Coherent FA Radar Based on Intermittent Sampling Repeater. J. Phys. Conf. Ser. 2021, 2026, 12005.
  4. Li, H.; Han, Z.; Pu, W.; Liu, L.; Li, K.; Jiu, B. Counterfactual Regret Minimization for Anti-Jamming Game of Frequency Agile Radar. In Proceedings of the 2022 IEEE 12th Sensor Array and Multichannel Signal Processing Workshop (SAM), Trondheim, Norway, 20–23 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 111–115.
  5. Xia, L.; Wang, L.; Xie, Z.; Gao, X. GA-Dueling DQN Jamming Decision-Making Method for Intra-Pulse Frequency Agile Radar. Sensors 2024, 24, 1325.
  6. Chen, E.; Zhang-Wei, H.; Pajarinen, J.; Agrawal, P. Redeeming Intrinsic Rewards via Constrained Optimization; Cornell University Library: Ithaca, NY, USA, 2022.
  7. Ardon, L. Reinforcement Learning to Solve NP-hard Problems: An Application to the CVRP; Cornell University Library: Ithaca, NY, USA, 2022.
  8. Ma, O.; Chiriyath, A.R.; Herschfelt, A.; Bliss, D.W. Cooperative Radar and Communications Coexistence Using Reinforcement Learning. In Proceedings of the 2018 52nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 28–31 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 947–951.
  9. Selvi, E.; Buehrer, R.M.; Martone, A.; Sherbondy, K. Reinforcement Learning for Adaptable Bandwidth Tracking Radars. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 3904–3921.
  10. Thornton, C.E.; Buehrer, R.M.; Martone, A.F.; Sherbondy, K.D. Experimental Analysis of Reinforcement Learning Techniques for Spectrum Sharing Radar. In Proceedings of the 2020 IEEE International Radar Conference (RADAR), Washington, DC, USA, 28–30 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 67–72.
  11. Ahmed, A.M.; Ahmad, A.A.; Fortunati, S.; Sezgin, A.; Greco, M.S.; Gini, F. A Reinforcement Learning Based Approach for Multitarget Detection in Massive MIMO Radar. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 2622–2636.
  12. Ailiya; Yi, W.; Varshney, P.K. Adaptation of Frequency Hopping Interval for Radar Anti-Jamming Based on Reinforcement Learning. IEEE Trans. Veh. Technol. 2022, 71, 12434–12449.
  13. Aziz, M.M.; Maud, A.R.M.; Habib, A. Reinforcement Learning Based Techniques for Radar Anti-Jamming. In Proceedings of the 2021 International Bhurban Conference on Applied Sciences and Technologies (IBCAST), Islamabad, Pakistan, 12–16 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1021–1025.
  14. Yi, W.; Yuan, Y. Reinforcement Learning-Based Joint Adaptive Frequency Hopping and Pulse-Width Allocation for Radar Anti-Jamming. In Proceedings of the 2020 IEEE Radar Conference (RadarConf20), Florence, Italy, 21–25 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
  15. Zhu, J.; Wu, F.; Zhao, J. An Overview of the Action Space for Deep Reinforcement Learning. In Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 22–24 December 2021; Association for Computing Machinery: Sanya, China, 2021.
  16. Wu, J.; Wang, R.; Li, R.; Zhang, H.; Hu, X. Multi-Critic DDPG Method and Double Experience Replay. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 165–171.
  17. Zhang, M.; Zhang, Y.; Gao, Z.; He, X. An Improved DDPG and Its Application Based on the Double-Layer BP Neural Network. IEEE Access 2020, 8, 177734–177744.
  18. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments; Cornell University Library: Ithaca, NY, USA, 2017.
  19. Hao, W.; Feng, W. Application of reinforcement learning algorithms in anti-jamming of intelligent radar. Modern Radar 2020, 42, 40–44.
  20. Jiang, W.; Wang, Y.; Li, Y.; Lin, Y.; Shen, W. An Intelligent Anti-jamming Decision-making Method Based on Deep Reinforcement Learning for Cognitive Radar. In Proceedings of the 2023 26th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Rio de Janeiro, Brazil, 24–26 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1662–1666.
  21. Wei, J.; Wei, Y.; Yu, L.; Xu, R. Radar Anti-Jamming Decision-Making Method Based on DDPG-MADDPG Algorithm. Remote Sens. 2023, 15, 4046.
  22. Li, K.; Jiu, B.; Liu, H. Deep Q-Network based Anti-Jamming Strategy Design for Frequency Agile Radar. In Proceedings of the 2019 International Radar Conference (RADAR), Toulon, France, 23–27 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
  23. Li, K.; Jiu, B.; Liu, H.; Pu, W. Robust Antijamming Strategy Design for Frequency-Agile Radar against Main Lobe Jamming. Remote Sens. 2021, 13, 3043.
  24. Kang, L.; Bo, J.; Hongwei, L.; Siyuan, L. Reinforcement Learning Based Anti-Jamming Frequency Hopping Strategies Design for Cognitive Radar. In Proceedings of the 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Qingdao, China, 14–16 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
  25. Li, K.; Jiu, B.; Wang, P.; Liu, H.; Shi, Y. Radar active antagonism through deep reinforcement learning: A way to address the challenge of mainlobe jamming. Signal Process. 2021, 186, 108130.
  26. Jiang, W.; Ren, Y.; Wang, Y. Improving anti-jamming decision-making strategies for cognitive radar via multi-agent deep reinforcement learning. Digit. Signal Process. 2023, 135, 103952.
Figure 1. Schematic diagram of inter-pulse FA radar countering frequency spot jamming.
Figure 2. Schematic diagram of inter-pulse frequency agile radar frequency hopping.
Figure 3. The multi-agent reinforcement learning framework for FA radar and jammer.
Figure 4. Neural network structure of the agent.
Figure 5. AK-MADDPG algorithm framework.
Figure 6. Adaptive K-th order history diagram.
Figure 7. Schematic diagram of the three different jamming strategies.
Figure 8. The figures illustrate the reward curves for combating three different jamming strategies using four methods. (a), (b), and (c) depict the reward curves for countering jamming strategies 1, 2, and 3, respectively.
Figure 9. The figures depict the antijamming strategies learned by the method proposed in this paper against three different jamming strategies. (a), (b), and (c) represent the antijamming strategies learned when countering jamming strategies 1, 2, and 3, respectively.
Figure 10. The reward curves under the scenario of mixed jamming strategies.
Table 1. Radar simulation parameter setting.
Parameter: Value
radar power $P_s$: 1000 W
jammer power $P_j$: 80,000 W
noise power $P_n$: 10 W
time–bandwidth product $D$: 50
initial carrier frequency $f_0$: 3 GHz
target RCS $\sigma$: 1 m²
number of pulses in one CPI: 4
number of available frequency points for the radar: 3
channel gain $h_s$: 0.1
Table 2. The algorithm parameter settings.
Parameter: Value
number of training epochs $e_{num}$: 3000
number of network updates per epoch $N_{update}$: 100
learning rate of the Actor network: 1 × 10⁻³
learning rate of the Critic network: 1 × 10⁻³
discount factor $\gamma$: 0.95
