A Novel Intelligent Anti-Jamming Algorithm Based on Deep Reinforcement Learning Assisted by Meta-Learning for Wireless Communication Systems

Chen, Qingchuan; Niu, Yingtao; Wan, Boyu; Xiang, Peng

doi:10.3390/app132312642

¹ School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210007, China

² The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210007, China

³ Fundamentals Department, Air Force Engineering University of PLA, Xi’an 710051, China

⁴ College of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(23), 12642; https://doi.org/10.3390/app132312642

Submission received: 11 October 2023 / Revised: 21 November 2023 / Accepted: 22 November 2023 / Published: 24 November 2023

Abstract: In the field of intelligent anti-jamming, deep reinforcement learning algorithms are regarded as a key technical means. However, their learning process requires a stable environment to be effective. Moreover, because of their inherent limitations, deep reinforcement learning algorithms demonstrate excellent learning capabilities only on specific tasks with constant parameters; when the parameters change, they must resample and relearn before converging again. In a changing jamming environment, their stability and convergence speed may therefore be challenged, which affects their robustness and generalization capabilities. Exploiting the similarity characteristics of communication anti-jamming tasks, this paper designs a new Meta-PPO deep reinforcement learning algorithm that combines Proximal Policy Optimization (PPO) with the meta-learning idea of Model-Agnostic Meta-Learning (MAML). The proposed algorithm grafts the MAML meta-learning principle onto PPO-based schemes, enabling a communication system to use the experience acquired from previous anti-jamming tasks to speed up its decision-making when faced with incoming jamming attacks with similar features. The proposed algorithm is verified through computer simulations, and the results show that Meta-PPO outperforms traditional DQN- and PPO-based algorithms in terms of robustness and generalization ability, and can thus be used to enhance the anti-jamming capabilities of wireless communication systems.

Keywords: anti-jamming communication; deep reinforcement learning; meta-learning

1. Introduction

Wireless communication technologies have been widely applied in many fields over the past two decades and have become one of the cornerstones of modern military information systems [1,2]. Improving anti-jamming capabilities in communications can enhance the reliability and security of military communications, especially in complex electronic warfare environments. For unmanned aerial vehicles (UAVs) and remote control systems, enhanced communication stability is crucial. Autonomous vehicles and intelligent transportation systems rely on reliable communication to prevent accidents, and improving anti-jamming performance can enhance the reliability of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. Enhancing anti-jamming abilities can maintain higher operational stability of networks in the face of natural disasters or human attacks. At personal and enterprise levels, improved communication anti-jamming capabilities can reduce the risk of private information leakage. In remote or complex environments, more reliable communication can provide better connectivity and service quality. In summary, enhancing communication anti-jamming capabilities benefits not only specific industries but also has a profound impact on the security and efficiency of the entire wireless communication field.

However, because of their open channels and application environments, wireless communication systems are vulnerable to all kinds of jamming, especially malicious jamming from electronic warfare attacks. Such jamming and attacks have become one of the primary factors affecting the reliability and efficiency of wireless communication systems [3]. As a result, communication anti-jamming has become a hot topic in wireless communication research.

Traditional anti-jamming techniques used in wireless communication systems are mostly based on spread spectrum technologies, combined with error-correcting coding and adaptive power/frequency control. These methods can effectively counter conventional jamming such as single-tone, multi-tone, and partial-band jamming [4]. However, with the development of machine learning, research on intelligent jamming has gradually deepened. In its confrontation with the communication party, an intelligent jammer accumulates “jamming knowledge” through experience or exploratory learning and makes jamming decisions that dynamically adapt to local changes in the electromagnetic environment. To increase its own benefit, it forms a strategic closed loop of “perception-learning-prediction-decision-feedback.” Intelligent jamming covers not only new jamming styles but also the comprehensive, flexible, and efficient use of basic jamming styles: according to the actual needs of the confrontation, and relying on its reasoning, prediction, and decision-making capabilities, the jammer flexibly combines aiming and blocking jamming methods to block or weaken the effective communication of the communication party. Traditional anti-jamming measures can therefore no longer cope with intelligent jamming with highly dynamic characteristics, and effective countermeasures against the various forms of intelligent jamming empowered by artificial intelligence are urgently needed.

In [5], a deep reinforcement learning-based routing algorithm was proposed to tackle the anti-jamming communication challenges in the heterogeneous Internet of Satellite (IoS), which operates in a highly dynamic communication environment with potential intelligent jamming. The proposed algorithm can obtain a usable subset of routes for the traffic in IoS with superior anti-jamming performance and can lower the routing costs. In addition, the proposed anti-jamming strategy can converge to a Stackelberg equilibrium.

In [6], a deep reinforcement learning-based maritime anti-jamming algorithm was proposed to address the issue of wireless communication interruptions in a maritime communications environment due to tracking jamming. The algorithm reduced the probability of communication systems being found and tracked by tracking jammers and lowered the bit error rate of maritime wireless transmission, and the proposed algorithm also helped save the energy consumption of the ship and drone platforms.

As can be seen from the existing literature, deep reinforcement learning techniques can help wireless communication systems learn from their environment and adjust their strategies according to the jamming they encounter in various application scenarios, thus making optimal anti-jamming decisions to ensure their signal transmission performance [7,8,9,10,11,12,13,14,15].

However, existing deep reinforcement learning anti-jamming algorithms still face two challenges. First, their robustness needs further improvement. Even when the jamming pattern stays the same, a minor adjustment to the jamming parameters can degrade or even invalidate the converged strategy, forcing the algorithm to restart its learning process, which can be very time-consuming. Second, the existing algorithms still lack generalization ability, and their application is limited to combating specific jamming patterns. Most of them cannot exploit their experience from previous anti-jamming tasks to accelerate learning on new anti-jamming tasks.

Meta-learning is a branch of General AI that aims to enable machines to learn like humans, to possess cognitive and logical analytical abilities, and to update themselves so as to adapt to new learning tasks. In 2017, Chelsea Finn from the BAIR lab first proposed the Model-Agnostic Meta-Learning (MAML) algorithm in [16]. This algorithm can quickly and accurately transfer a well-trained deep learning model to a new environment.

Unlike transfer learning, MAML-based algorithms can complete their decision-making tasks using a meta-learner (meta-layer) to guide their learner (base layer) to reach the optimized decisions. The meta-layer can let the algorithms combine their previous learning experiences from all of their past tasks, providing the necessary initial model values for their learning process for new tasks.

MAML is applicable to all optimized learners based on stochastic gradient descent. Hence, it is possible to integrate an MAML module into all deep learning models, which can improve their generalization ability and greatly enlarge their application scenarios while maintaining the algorithm performance [17].

In [18], the authors addressed the issue that deep neural networks usually perform well in data-rich situations but poorly when data are scarce or when rapid adaptation to task changes is needed. They introduced a meta-learner called SNAIL (Simple Neural Attentive Learner), which offers an effective solution for deep neural networks in such settings and achieved excellent performance in several multi-task scenarios.

In [19], a facial recognition algorithm (Meta Face Recognition (MFR) Algorithm) was proposed by using the meta-learning technique. This algorithm addressed the poor generalization ability of the facial recognition systems used in real-life applications and can effectively apply models well-trained with webface data to real-life surveillance scenarios to improve their performances.

The existing works show that MAML-based algorithms require fewer training samples and generalize better. In particular, they can enable intelligent anti-jamming decisions to fully exploit the experience from previous anti-jamming tasks and thus reach optimal decisions faster when faced with jamming patterns in which only a few parameters have changed [20,21,22,23,24,25].

These advantages make MAML-based algorithms particularly suitable for combating malicious intelligent jamming and attacks with high dynamic characteristics. Also, they can overcome the limitations of existing deep reinforcement learning anti-jamming algorithms in terms of algorithm robustness and generalization abilities.

In this paper, an intelligent anti-jamming algorithm that integrates the principles of deep reinforcement learning and meta-learning is proposed and verified by computer simulations. First, a base learner derives optimal strategies from a series of known, similar anti-jamming tasks. Then, a meta-learner abstracts the general characteristics and rules of all the previous tasks and acquires general knowledge about the jamming and attacks. This provides a more universal initial strategy from which the intelligent anti-jamming algorithm can start learning new tasks. Starting from this initial strategy, the proposed algorithm can swiftly adapt to new and more dynamic jamming scenarios.

The main contributions of this paper are as follows:

(1) We propose the Meta-PPO intelligent anti-jamming algorithm to address the problem that the communication party cannot use its known anti-jamming experience, which leads to slow strategy convergence, when faced with malicious jamming that dynamically and randomly modifies its jamming parameters. When the external jammer randomly changes its jamming parameters, the algorithm uses the MAML idea to update the initial model parameters in real time. This enables the communication party to use its historical anti-jamming experience and converge more quickly to the optimal anti-jamming strategy.

(2) To validate the effectiveness of the proposed Meta-PPO algorithm, we simulated wireless communication interaction scenarios using PyTorch (version 2.0). In these scenarios, the communication party employed Meta-PPO and other algorithms for decision making, and their normalized throughput was compared. Simulation results show that the proposed algorithm consistently maintains high throughput for data transmission, especially in a jamming environment where the jamming parameters change frequently. The proposed algorithm also shows better generalization ability and robustness than conventional reinforcement learning algorithms, resulting in superior anti-jamming capabilities. In future research, we plan to integrate actual hardware and conduct experiments with the Meta-PPO algorithm in specific wireless communication environments.

The organization of this paper is as follows: The components of the system and the formulation of the problem are presented in Section 2. The Meta-PPO-based intelligent anti-jamming algorithm for wireless communication is detailed in Section 3. The simulation outcomes and analysis of the algorithm are discussed in Section 4. Finally, the paper is concluded in Section 5.

2. System Model and Problem Formulation

2.1. System Model

Figure 1 depicts the model of a communication system utilized in this paper. This model includes a set of wireless communication transceivers and a jamming device. The jamming signals from this device are capable of effectively enveloping the receiver. This paper makes the following assumptions:

a: Communication receiver

The wireless communication transceivers operate in a frequency band of bandwidth $B$ with dynamic spectrum access and power control abilities. The available channels are denoted as $f_{n} \in \{f_{1}, f_{2}, \ldots, f_{M}\}$; the $M$ channels do not overlap with each other, and each of them has a bandwidth of $Δ B = B / M$. The available power levels are represented as $p_{n} \in \{p_{1}, \ldots, p_{D}\}$. The transmission channel gain of the communication system is denoted as $g_{n} = g_{p} |h_{s}|$, where $g_{p}$ represents the path loss at a given distance and $|h_{s}|$ is a Rayleigh-distributed random variable. The total time duration for communication is denoted as $T_{max}$. Equation (1) gives the signal-to-jamming-plus-noise ratio (SINR) $H_{T}$ in time slot $T$:

$$H_{T} = \frac{P_{r, T}}{p_{k, T}^{'} + P_{n, T}} \tag{1}$$

The SINR must satisfy $H_{T} \geq H_{T}^{th}$ throughout time slot $T$ for the data frame transmitted in that slot to be decoded successfully, where $H_{T}^{th}$ represents the SINR threshold.

The normalized throughput of the communication system’s data transmission is defined as $c_{T} = δ(H_{T} \geq H_{T}^{th})$, where $δ(\cdot)$ is the indicator function.
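As a minimal illustration of Equation (1) and the throughput definition, the following Python sketch computes the SINR for one time slot and applies the indicator function. The function names, the `jammed` flag (which counts the jamming term only when the jammer hits the channel in use), and the numeric values are assumptions made for this sketch and are not taken from the paper.

```python
def sinr(p_rx, jam_power, noise_power, jammed=True):
    """Equation (1): received power over jamming-plus-noise power.

    Assumption of this sketch: the jamming term is dropped when the
    jammer does not hit the channel currently in use.
    """
    interference = jam_power if jammed else 0.0
    return p_rx / (interference + noise_power)


def normalized_throughput(sinr_value, sinr_threshold):
    """c_T = delta(H_T >= H_T^th): 1 if the frame can be decoded, else 0."""
    return 1 if sinr_value >= sinr_threshold else 0


# Illustrative values only (linear scale, arbitrary units).
h_jammed = sinr(p_rx=1.0, jam_power=3.0, noise_power=1.0, jammed=True)
h_clear = sinr(p_rx=1.0, jam_power=3.0, noise_power=1.0, jammed=False)
c_jammed = normalized_throughput(h_jammed, sinr_threshold=0.5)
c_clear = normalized_throughput(h_clear, sinr_threshold=0.5)
```

With these example numbers the jammed slot falls below the threshold and the clear slot exceeds it, so only the clear slot contributes throughput.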

b: Communication transmitter

The transmitter can transmit one subframe in each time slot, with each subframe containing the same amount of information. A packet, comprising $l$ subframes, includes a Cyclic Redundancy Check (CRC) field. The transmitter is located far from the jammer, is minimally affected by jamming, and can reliably receive control information transmitted by the receiver through a protocol-reinforced low-capacity control channel, enabling cooperative anti-jamming.

c: Malicious jammer

The jammer preemptively acquires the communication frequency and time slot synchronization information of both the transmitter and receiver. The bandwidth of the pulse jamming signal can completely cover the transmission channel, and the duration of a single pulse is equal to the length of a time slot.

The jammer can indiscriminately attack all nodes in the communication system and can dynamically adjust its jamming power and targeted channels between communication time slots. Its available power levels are denoted as $p_{k}^{'} \in \{p_{1}^{'}, \ldots, p_{E}^{'}\}$ and its selectable channels as $f_{k}^{'} \in \{f_{1}, f_{2}, \ldots, f_{M}\}$, where $k \in \{1, 2, \ldots, K_{T}\}$ indexes the blocked channels and $K_{T} \leq M$ is the number of channels blocked by the jammer during time slot $T$. $P_{n, T}$ represents the background noise during time slot $T$.

2.2. Problem Modeling

In this study, we model the deep reinforcement learning problem of the agent as a Markov Decision Process (MDP), represented by the tuple $\langle S, A, F, R \rangle$, where $S$ denotes the set of environmental states, $A$ represents the set of actions taken by the agent, $F$ is the state transition probability function, which gives the probability distribution of the next state given a specific state and action, and $R$ is the reward function of the agent.

The base learner uses the classic PPO (Proximal Policy Optimization) deep reinforcement learning algorithm.

The state space $S$, the action space $A$, and the reward function $R$ are defined as follows:

a: State Space S:

The system state at time slot $T$ is defined as $(f_{n, T}, f_{k, T}^{'}, p_{n, T}, p_{k, T}^{'})$, where $f_{n, T}$ represents the communication channel chosen by the communication system, $f_{k, T}^{'}$ represents the channel number chosen to be jammed by the jammer, $p_{n, T}$ is the transmission power chosen by the communication system, and $p_{k, T}^{'}$ is the jamming power chosen by the jammer, all in time slot $T$.

Equation (2) gives the received power $P_{r, T}$ of the communication system in time slot $T$:

$$P_{r, T} = p_{n, T} \cdot g_{n} \tag{2}$$

b: Action Space A:

The action taken in time slot $T$ is defined as $(f_{n, T + 1}, p_{n, T + 1})$, where $f_{n, T + 1}$ represents the transmission channel and $p_{n, T + 1}$ represents the transmission power chosen by the communication system for time slot $T + 1$.
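Because both the channel set and the power set are finite, the action space can be enumerated as the Cartesian product of the two. The sketch below illustrates this under the assumption that channels are indexed from 1 to $M$ and that a discrete action index maps to one (channel, power) pair; the helper name and the example power levels are hypothetical.

```python
from itertools import product


def build_action_space(num_channels, power_levels):
    """List every (channel, power) pair the transmitter may choose."""
    channels = range(1, num_channels + 1)
    return list(product(channels, power_levels))


# Hypothetical setting: M = 3 channels and D = 2 power levels.
actions = build_action_space(num_channels=3, power_levels=[0.1, 0.2])
num_actions = len(actions)  # M * D discrete actions
```

A policy network with `num_actions` outputs can then select an index, which is translated back into a channel and a power level through `actions[index]`.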

c: Reward Function R:

When the communication system successfully transmits its data once, it will receive a reward; otherwise, if it fails in data transmission, it will receive a penalty.

The communication system needs to consider the channel switching cost $C_{f}$ when making decisions. The reward function of the communication system represents the immediate reward obtained by executing action $a_{T}$ in environmental state $S_{T}$:

$$R_{T}(S_{T}, a_{T}) = c_{T} - \frac{η P_{r, T}}{p_{max}} \tag{3}$$

where $c_{T}$ is the normalized throughput of the communication system and the power discount factor $η$ is a constant in $[0, 1]$. The higher the transmission power, and hence $P_{r, T}$, the greater the reward discount $η P_{r, T} / p_{max}$ will be.
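The trade-off expressed by the reward can be made concrete with a short sketch that evaluates the normalized throughput minus the power discount term $η P_{r, T} / p_{max}$. The function name and the numeric values are assumptions for illustration, and the channel switching cost $C_{f}$ is deliberately left out because its exact form is not given in this section.

```python
def reward(throughput, p_rx, p_max, eta):
    """Immediate reward: throughput minus the power discount eta * P_r / p_max."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return throughput - eta * p_rx / p_max


# Two successful transmissions that differ only in the power spent.
low_power_success = reward(throughput=1, p_rx=1.0, p_max=4.0, eta=0.5)
high_power_success = reward(throughput=1, p_rx=4.0, p_max=4.0, eta=0.5)
# A failed transmission at full power is penalized.
high_power_failure = reward(throughput=0, p_rx=4.0, p_max=4.0, eta=0.5)
```

Under these example values, succeeding with less power yields the larger reward, which is the behavior the discount term is meant to encourage.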

3. Meta-PPO Anti-Jamming Intelligent Decision Algorithm

Meta-learning gives agents the ability to learn how to learn. Its focus is on how to introduce prior knowledge into the model and optimize external memory during training, so that new tasks can be learned faster and more accurately. Unlike other deep learning algorithms, MAML does not aim to find the optimal parameters for a specific task; instead, it seeks initial parameters $η$ by training on a series of task-related meta-tasks, such that the model can quickly reach the optimum when faced with new tasks. $η$ is sensitive to the learning-domain distribution of new tasks, which allows certain features inside the trained model to be transferred more easily among tasks, so that optimal network parameters can be obtained after a few update steps. The gradient descent process of MAML is shown in Figure 2, where $η$ represents the initial parameters obtained after MAML pre-training; $L_{1}, L_{2}, L_{3}$ represent the loss functions of the new tasks; $\nabla$ represents the gradient operator; and $η_{1}$, $η_{2}$, $η_{3}$ indicate the optimal updating directions under the respective new tasks.

The proposed intelligent communication anti-jamming algorithm is based on meta deep reinforcement learning combined with the meta-learner and base learner defined in the meta-learning-based MAML algorithm. It takes anti-jamming tasks with different processes but belonging to similar types as independent base learners. Then, the base learners transmit the knowledge they have learned to the meta-learner for collection and summarization, and through which the initial network parameters of the model can be obtained with fast convergence, strong robustness, and better generalization abilities. Although the base learners are independent from each other, they perform intelligent anti-jamming tasks of similar types. Therefore, they are based on the same model. The proposed intelligent communication anti-jamming algorithm is presented in Figure 3.

As shown in Figure 3, the Meta-PPO algorithm consists of a base learner and a meta-learner. The base learner learns from multiple similar anti-jamming tasks, where it gains the network parameters learned from different tasks, which represent the knowledge of each task’s characteristics. The meta-learner is responsible for integrating all the task characteristics and guiding the base learner, allowing the base learner to adapt to new tasks faster, thus solving new problems with better performance. The meta-learner represents the common knowledge of all tasks. Meta-testing uses the initial network parameters obtained by the meta-learner to learn the anti-jamming model and evaluates the parameters with new tasks not included in the meta-training set.

The modules of the Meta-PPO algorithm are described in detail as follows:

3.1. Base Learner

In the proposed Meta-PPO algorithm, the base learner learns from communication anti-jamming tasks, which are similar but are independent from each other. All tasks are finished on the same communication system, facing similar jamming environments. The only difference is that the jamming parameters used by the jammer for each task are different. Therefore, the intelligent anti-jamming strategies used by the algorithm for each task are similar. The basic functions of the base learner performed in each task are as follows:

(1) As shown in Figure 4, for the current task, using the PPO algorithm to find the pattern of the jamming signals from the jammer and obtaining the optimal communication strategy under the current communication environment.

(2) Obtaining the experience from the meta-learner that is helpful for completing the current task, including the initial model, initial parameters, etc.

(3) Feeding the learned model and parameters back to the meta-learner after the current task learning is completed.

The proposed algorithm adopted the idea used in MAML meta-learning. During the base learner’s learning process, the methods reported in [26,27,28,29,30] were adopted and the concepts used in the PPO deep reinforcement learning algorithm such as the experience replay, valuation neural network, and target neural network were also adopted.

During the algorithm learning phase, the base learner uses the $ε$-greedy strategy for learning updates. Under this strategy, the action with the highest reward is selected with a probability of $1 - ε$, and an action is selected at random with a probability of $ε$, as shown in Equation (4):

$$π(a | s) = \begin{cases} \arg\max_{a} Q(s, a), & 0 < x \leq 1 - ε \\ \text{random selection}, & 1 - ε < x < 1 \end{cases} \tag{4}$$

where $x$ is a random number in $(0, 1)$.
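The selection rule of Equation (4) can be sketched in a few lines of Python. The function name, the list of action values, and the optional `rng` argument are assumptions of this sketch; the uniform random draw plays the role of the variable that is compared with $1 - ε$.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the highest-valued action with probability 1 - epsilon, else a random one."""
    x = rng.random()  # uniform draw in [0, 1)
    if x <= 1.0 - epsilon:
        # Greedy branch: index of the largest action value.
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Exploration branch: uniform choice over all actions.
    return rng.randrange(len(q_values))


# Illustrative action values for three actions.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

With `epsilon=0.0` the rule is purely greedy; larger values trade some immediate reward for exploration of other channel and power choices.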

The goal of adopting meta-learning in the proposed algorithm is to let its learning process quickly adapt to new tasks. To achieve this, the Meta-PPO algorithm uses a few samples from multiple similar tasks as the input data for the learning process so as to reduce the expected loss of the algorithm over multiple tasks. This sets the direction of the parameter updates, so that a set of model parameters adapted to a new task can be acquired more quickly. The optimization objective of the proposed algorithm can therefore be expressed as Equation (5):

$$\min_{θ} E_{τ \sim p(τ)} [L_{τ}(U_{τ}^{k}(θ))] \tag{5}$$

In the equation, $p(τ)$ represents the task distribution, $θ$ represents the parameters of the model, $U_{τ}^{k}(θ)$ denotes the model parameters after $k$ updates using a small amount of data collected from task $τ$, and $L_{τ}(U_{τ}^{k}(θ))$ represents the loss of the updated model on task $τ$.

The parameter updating process in the meta-training phase of the proposed algorithm can be divided into two parts, an inner loop and an outer loop, according to Equation (5). In the inner loop, the Meta-PPO algorithm uses a small amount of data from a randomly chosen task $τ$ as the learning data to update the model parameters, reducing the model’s loss on task $τ$. In this loop, the model parameter updating process is the same as in the PPO algorithm proposed in [26,27,28,29,30]. The neural network of the algorithm learns from several batches of data on the randomly chosen tasks. In each round, the agent performs actions according to the policy of the Actor network, interacts with the environment to collect experience, and saves the collected experience to an experience pool. When the number of collected experiences reaches the model update threshold, the model parameters are updated.

First, during the updating process, the temporal difference error $δ_{T}$ is calculated according to Equation (6):

$$δ_{T} = R_{T} + γ V_{μ}(S_{T + 1}) - V_{μ}(S_{T}) \tag{6}$$

In the equation, $δ_{T}$ represents the single-step temporal difference error at time $T$, $R_{T}$ represents the immediate reward at time $T$, $γ$ is the discount factor, and $μ$ represents the parameters of the Critic network. $V_{μ}(S_{T + 1})$ is the estimated state value of the next state $S_{T + 1}$ at time $T + 1$, and $V_{μ}(S_{T})$ is the estimated state value of the current state $S_{T}$; both are output by the Critic network.

After calculating the temporal difference error for each time step, the target for updating the Critic network is calculated according to Equation (7):

$$y_{T} = A_{T} + V_{μ}(S_{T}) \tag{7}$$

In the equation, $A_{T}$ represents the estimated advantage function at time $T$.

The mean squared error is used as the loss function for updating the Critic network, as expressed in Equation (8). The Critic network is then updated through backpropagation.

$$L(μ) = \frac{1}{\text{batch\_size}} \sum_{T} (y_{T} - V_{μ}(S_{T}))^{2} \tag{8}$$

In the equation, $y_{T}$ represents the update target of the Critic network and $V_{μ}(S_{T})$ represents the estimated state value of state $S_{T}$ at time $T$.
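The Critic update of Equations (6) to (8) can be traced with plain Python lists. This is only a sketch: the function names and numbers are hypothetical, and the advantage is taken to be the one-step temporal difference error ($A_{T} = δ_{T}$), which is an assumption here because the text does not state which advantage estimator is used.

```python
def td_errors(rewards, values, next_values, gamma):
    """Equation (6): delta_T = R_T + gamma * V(S_{T+1}) - V(S_T)."""
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]


def critic_targets(advantages, values):
    """Equation (7): y_T = A_T + V(S_T)."""
    return [a + v for a, v in zip(advantages, values)]


def critic_loss(targets, values):
    """Equation (8): mean squared error between targets and value estimates."""
    return sum((y - v) ** 2 for y, v in zip(targets, values)) / len(values)


rewards = [1.0, 0.0]
values = [0.5, 1.0]        # V(S_T) output by the Critic
next_values = [1.0, 0.0]   # V(S_{T+1}) output by the Critic
deltas = td_errors(rewards, values, next_values, gamma=0.5)
targets = critic_targets(deltas, values)  # assumption: A_T = delta_T
loss = critic_loss(targets, values)
```

In an actual training loop the targets would be held fixed while the Critic parameters are adjusted by backpropagation to reduce this loss.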

The loss function of the Actor network is given by the clipped objective in Equation (9). Because this objective increases as the policy improves, the Actor network parameters are updated by maximizing it, or equivalently by minimizing its negative.

$$L(θ) = \frac{1}{\text{batch\_size}} \sum_{T} \min(r_{T}(θ) A_{T}, \mathrm{clip}(r_{T}(θ), 1 - ε, 1 + ε) A_{T}) \tag{9}$$

In the equation, $r_{T}(θ) = \frac{π_{θ}(a_{T} | S_{T})}{π_{θ_{1}}(a_{T} | S_{T})}$ represents the ratio of the new policy to the old policy, $θ$ represents the parameters of the new policy network, $θ_{1}$ represents the parameters of the old policy network, $ε$ is the clipping range, and $A_{T}$ represents the estimated advantage function at time $T$.
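To show how the clipping in Equation (9) limits the policy change, the following sketch evaluates the clipped objective for a small batch. The function name and the example probabilities are assumptions; practical implementations usually work with log-probabilities and automatic differentiation, which are omitted here for brevity.

```python
def clipped_surrogate(new_probs, old_probs, advantages, clip_eps):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old  # r_T(theta): new policy over old policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 2.0 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = clipped_surrogate([0.8], [0.4], [1.0], clip_eps=0.2)
# Ratio 0.5 with a negative advantage: the clipped (more pessimistic) term is kept.
kept_penalty = clipped_surrogate([0.2], [0.4], [-1.0], clip_eps=0.2)
```

The first case shows that increasing the probability of a good action beyond the clipping range brings no additional objective value, which removes the incentive for overly large policy steps.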

In the inner loop, after updating the model parameters $k$ times, the process enters the outer loop. In the outer loop, the Meta-PPO algorithm calculates the update gradient according to Equations (10) and (11) and updates the model parameters again to minimize the expected loss over the task distribution, aiming to find a set of initial parameters that can quickly adapt to new tasks.

$$μ \leftarrow μ + ε(\tilde{μ} - μ) \tag{10}$$

In the equation, $μ$ represents the parameters of the Critic network before learning, $\tilde{μ}$ represents the parameters after learning on task $τ$, and $ε$ is the update step size.

$$θ \leftarrow θ + ε(\tilde{θ} - θ) \tag{11}$$

In the equation, $θ$ represents the parameters of the Actor network before learning, $\tilde{θ}$ represents the parameters after learning on task $τ$, and $ε$ is the update step size.
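The outer-loop rule in Equations (10) and (11) moves the stored initial parameters a fraction of the way toward the parameters obtained after inner-loop learning. A minimal sketch follows, assuming the network parameters are flattened into plain lists; the function name and values are hypothetical. The interpolation form resembles the first-order Reptile meta-update, which is an observation of this sketch rather than a statement from the paper.

```python
def outer_update(params, adapted, step):
    """Move each parameter a fraction `step` of the way toward its adapted value."""
    return [p + step * (a - p) for p, a in zip(params, adapted)]


# Hypothetical flattened Actor parameters before and after learning on one task.
theta = [0.0, 2.0]
theta_tilde = [1.0, 0.0]
theta = outer_update(theta, theta_tilde, step=0.5)

# The same rule applies to the Critic parameters.
mu = outer_update([1.0], [3.0], step=0.25)
```

A step size of 1 would simply adopt the task-specific parameters, while a small step size averages over many tasks, which is what lets the initial parameters stay close to all of them.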

For the communication environment assumed in our research, the decision model of the base learner based on the proposed Meta-PPO algorithm is shown in Figure 5. When a communication process starts, the Actor network of the PPO algorithm receives the communication environment values perceived by the upper layer as its input and outputs action commands that control the decisions made by the communication system. During the interaction with the external environment, the generated data $(S_{t}, A_{t}, R_{t}, S_{t + 1})$ are stored in the experience pool for subsequent model learning.

3.2. Meta-Learner

As shown in Figure 6, in the proposed Meta-PPO algorithm, the meta-learner is responsible for collecting and summarizing the experience learned from all tasks. After each learning process of the base learner, the meta-learner integrates that experience and updates its parameters accordingly. In the end, the meta-learner, having integrated the learning experience from all tasks, delivers the initial model network values to the base learner, allowing the base learner to achieve good accuracy after a few iterations when dealing with new tasks.

As shown in Figure 3, the memory module of the proposed algorithm includes two parts: an initial parameter value and the loss function’s gradient value. The initial parameter value represents the initial parameter of a certain task. During the base learner’s learning process, the meta-learner extracts the initial parameter value of the most relevant task from the memory module and feeds it to the base learner as the initial parameter value.

After the base learners complete their learning processes, they feed the loss function’s gradient values back to the meta-learner. The meta-learner collects the gradient values from all the tasks, uses them to update the initial parameter values stored in the memory module in a timely manner, and then distributes the updated initial parameter values to all the base learners. In this way, the latest initial parameter values from the most relevant task experience can be provided promptly for each task.

The intelligent anti-jamming model of the Meta-PPO algorithm is shown in Figure 7. The specific learning process of the meta-learner is as follows:

Assume there is a model

f_{θ} ()

affected by

θ

and the distribution of tasks

p (T)

. First, initialize the parameters

θ

with some random values. Next, draw a batch of tasks

T_{i}

(

T_{i} ~ p (T)

) from the distribution of tasks.

Then, for each task, draw

k

trajectories and construct the training and test sets:

D_{i}^{t r a i n} ~ T_{i}

. Execute gradient descent through Equation (12) and find the optimal parameters

θ_{i}^{'}

to minimize the loss on the training set

D_{i}^{t r a i n}

:

θ_{i}^{'} = θ - α \nabla_{θ} L_{T_{i}} (f_{θ})

(12)

Before drawing the next batch of tasks, perform a meta-update. Minimize the loss by calculating the gradient of the loss evaluated at the optimal parameters $θ_{i}'$ through Equation (13) to update the randomly initialized parameter $θ$:

$$θ \leftarrow θ - β \nabla_{θ} \sum_{T_{i} \sim p(T)} L_{T_{i}}(f_{θ_{i}'}) \qquad (13)$$
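To make the interplay of Equations (12) and (13) concrete, the following minimal Python sketch applies both updates to a toy problem. The scalar parameter, the quadratic task loss $0.5 (θ - c_{i})^{2}$, the task centers and the value of $β$ are illustrative assumptions and are not taken from the simulations in this paper; only the structure of the two updates follows the equations above.

```python
def task_loss_grad(theta, center):
    """Gradient of the toy task loss L(theta) = 0.5 * (theta - center)**2."""
    return theta - center


def inner_update(theta, center, alpha):
    """Equation (12): one gradient step on a single task."""
    return theta - alpha * task_loss_grad(theta, center)


def meta_update(theta, centers, alpha, beta):
    """Equation (13): one step on the summed post-adaptation losses."""
    meta_grad = 0.0
    for center in centers:
        theta_i = inner_update(theta, center, alpha)
        # Chain rule: dL(theta_i)/dtheta = L'(theta_i) * d(theta_i)/dtheta,
        # and d(theta_i)/dtheta = 1 - alpha for this quadratic toy loss.
        meta_grad += task_loss_grad(theta_i, center) * (1.0 - alpha)
    return theta - beta * meta_grad


centers = [1.0, 2.0, 3.0]  # hypothetical tasks, one optimum per task
theta = 0.0                # initial parameter before meta-training
for _ in range(200):
    theta = meta_update(theta, centers, alpha=0.1, beta=0.1)
```

In this toy setting the meta-learned initialization settles at the point from which one inner step reduces the summed loss of all tasks the most, which here is the mean of the task optima.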

The references cited in this article are presented in Table 1, which lists the network environments and core algorithms of all the referenced literature.

3.3. Meta-PPO Intelligent Anti-Jamming Algorithm

The Meta-PPO intelligent anti-jamming algorithm designed in this paper is shown in Algorithm 1. After iterating $n$ times, the meta-learner will send the final initial model parameter $θ$ to the basic learner, and then learn and test new tasks that are not in the training set $D_{i}^{train}$.

Algorithm 1: Meta-PPO

1: initialize:
   Number of initial tasks $T_{i}$;
   Number of trajectories to be extracted in each task $k$;
   Number of training rounds $n$;
   Inner loop gradient update hyperparameter $α$;
   Outer loop gradient update hyperparameter $β$;
   Randomly initialized Actor network parameters $θ$;
   Critic network parameters $μ$.
2: for $t = 1, 2, \ldots, T$ do
3:    Extract $k$ trajectories to form a training set $D_{i}^{train}$ and use the PPO neural network algorithm for training
4:    Calculate the temporal difference error $δ_{t}$ according to Equation (7)
5:    Update the Critic network parameters by minimizing the Critic network loss function according to Equation (9) to obtain the updated parameters $\tilde{μ}$
6:    Update the Actor network parameters by minimizing the Actor network loss function according to Equation (10) to obtain the updated parameters $\tilde{θ}$
7:    Calculate the gradient as the difference between the model parameters before and after training. Then use this gradient to update the Actor network’s parameters $θ$, as specified in Equation (12).
8:    $t = t + 1$;
9: end for
10: When the training round is over, the meta-learner will send the initial model parameters $θ$ and $μ$ obtained in the current round and the random initial model parameters for the next round to the base learner and test the new tasks outside the training set.
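Step 7 of Algorithm 1 can be illustrated with a small Python sketch. Using the difference between the parameters before and after training as a gradient resembles a first-order, Reptile-style meta-update; whether the authors' implementation matches this sketch is an assumption. The function `adapt` is a hypothetical stand-in for PPO training on one task (a few gradient steps on a quadratic toy loss), and the task targets and the value of $β$ are made up for illustration.

```python
def adapt(theta, target, alpha, steps):
    """Hypothetical stand-in for PPO training on one task: gradient steps
    on the toy loss 0.5 * sum((theta - target)**2). Not the paper's PPO."""
    adapted = list(theta)
    for _ in range(steps):
        adapted = [p - alpha * (p - t) for p, t in zip(adapted, target)]
    return adapted


def meta_step(theta, task_targets, alpha, beta, steps=10):
    """Step 7: treat (before - after) as a gradient, averaged over tasks."""
    pseudo_grad = [0.0] * len(theta)
    for target in task_targets:
        adapted = adapt(theta, target, alpha, steps)
        for j, (before, after) in enumerate(zip(theta, adapted)):
            pseudo_grad[j] += (before - after) / len(task_targets)
    # Move the shared initialization against the averaged pseudo-gradient.
    return [p - beta * g for p, g in zip(theta, pseudo_grad)]


tasks = [[1.0, 0.0], [3.0, 4.0]]  # hypothetical per-task optima
theta = [0.0, 0.0]                # shared initial parameters
for _ in range(100):
    theta = meta_step(theta, tasks, alpha=0.1, beta=0.5)
```

With these toy tasks the shared initialization drifts toward a point between the per-task optima, which is the behavior the meta-learner relies on when it hands an initialization to the basic learner.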

4. Simulations and Analyses

In our simulation analyses, we mainly focus on how the communication system, when faced with a random sweep jamming attack whose jamming parameters change, can use the proposed Meta-PPO algorithm to draw on previously learned experience and quickly select the best channel.

As shown in Figure 8, it is assumed that the jammer launches random sweep jamming attacks. Within each jamming cycle, the jammer randomly chooses its transmission power, then sweeps all channels within frequency band $B$, and randomly selects up to three channels to launch jamming attacks.
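The jamming behavior described above can be sketched as a small generator of jamming cycles. This is a minimal sketch under stated assumptions: the power set and the number of channels are taken from Table 2, while the uniform sampling, the lower bound of one jammed channel per cycle, and the names `jamming_cycle`, `JAM_POWERS_DBM`, `NUM_CHANNELS` and `MAX_JAMMED` are illustrative choices and not part of the original simulation code.

```python
import random

JAM_POWERS_DBM = (10, 15, 20)  # jamming power set from Table 2
NUM_CHANNELS = 5               # number of channels from Table 2
MAX_JAMMED = 3                 # "up to three channels" per jamming cycle


def jamming_cycle(rng):
    """Return (power_dbm, jammed_channels) for one jamming cycle."""
    power = rng.choice(JAM_POWERS_DBM)
    # Assumption: at least one channel is jammed in every cycle.
    count = rng.randint(1, MAX_JAMMED)
    channels = sorted(rng.sample(range(NUM_CHANNELS), count))
    return power, channels


rng = random.Random(0)  # fixed seed so that runs are repeatable
cycles = [jamming_cycle(rng) for _ in range(1000)]
```

Each element of `cycles` could serve as the jammer state for one time slot block of a simulated environment, against which a channel-selection policy is evaluated.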

The proposed algorithm is then verified through computer simulations based on a neural network built with PyTorch. Table 2 shows the parameters used in our simulations. In the simulations, the proposed Meta-PPO algorithm was trained for 1000 episodes, with 10 parameter-update iterations in each episode.

Figure 9 shows how the normalized network throughput changes with the number of episodes when the proposed Meta-PPO algorithm is used, compared with the PPO [31], DQN [32], Double DQN [33], and Dueling DQN [34] algorithms. Normalized throughput is a widely accepted criterion for measuring the performance of an agent’s strategy [35]. The better the strategy the agent takes, the higher the normalized network throughput.

As can be seen from Figure 9, for the proposed Meta-PPO algorithm, the normalized network throughput rises rapidly in the initial stage, before the 50th episode, and then reaches its maximum value after it. This indicates that in the early stages of the learning process, the agent can quickly reach an effective strategy to deal with the jamming attacks. However, between the 100th and 150th episodes, the normalized network throughput drops slightly, suggesting that the Meta-PPO algorithm undergoes a fine-tuning or policy oscillation stage. After the 200th episode, the normalized network throughput curve becomes flat, showing that there is little change in the subsequent learning process. This indicates that the proposed Meta-PPO algorithm has found an efficient strategy and further optimization after this point brings few gains in the normalized network throughput.

In contrast, the PPO, DQN, Double DQN, and Dueling DQN algorithms take more than 300 episodes to reach the best strategy. This shows that the proposed Meta-PPO algorithm can quickly achieve similar or even better anti-jamming strategies for wireless communication systems using a shorter period of learning time compared with traditional deep reinforcement learning-based algorithms. The Meta-PPO algorithm saves 60% of the time in strategy convergence compared to the PPO algorithm, and nearly 80% of the time compared to the DQN algorithm.

This simulation result shows the better performance of our proposed Meta-PPO algorithm in terms of rapid adaptation ability for anti-jamming tasks.

The clipping ratio $r(θ)$ of a new and an old policy is an important hyperparameter for Meta-PPO algorithms that is used to limit the size of policy updates, ensuring that the difference between these two policies is not too large. If $r(θ)$ is too small, then beneficial policy updates might be hindered, while if $r(θ)$ is too large, then overly aggressive and potentially harmful updates might happen. Therefore, this hyperparameter must be chosen properly to optimize the algorithm performance.
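The limiting effect described above can be illustrated with the clipped surrogate objective of the standard PPO formulation, in which the probability ratio between the new and old policies is clipped to the interval $[1 - ε, 1 + ε]$. The sketch below assumes that the clipping ratio value of 0.2 used in this paper plays the role of $ε$; the function name `clipped_surrogate` and the sample ratios and advantages are hypothetical.

```python
def clipped_surrogate(ratio, advantage, clip=0.2):
    """Standard PPO clipped objective for one sample:
    min(ratio * A, clamp(ratio, 1 - clip, 1 + clip) * A)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from pushing the ratio above 1 + clip is capped.
capped_gain = clipped_surrogate(1.5, 2.0)    # about 2.4 instead of 3.0
# Negative advantage: the more pessimistic of the two terms is kept.
capped_loss = clipped_surrogate(0.5, -1.0)   # about -0.8 instead of -0.5
# Inside the interval [0.8, 1.2] nothing is clipped.
unclipped = clipped_surrogate(1.1, 2.0)      # about 2.2
```

A smaller `clip` narrows the interval and makes each update more conservative, while a larger one allows bigger policy changes per update, which is the trade-off discussed in the text.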

Figure 10 shows the normalized network throughput under different clipping ratios while other parameters remain unchanged for the simulation analyses. As can be seen, when the clipping ratio $r(θ)$ of the new and old policies is modified, the speed at which the Meta-PPO algorithm learns the optimal policy remains unchanged. The optimal policy appears after about 100 episodes. When the clipping ratio is chosen to be $r(θ) = 0.2$, the algorithm approaches its optimal upper limit. At this point, the Meta-PPO algorithm allows the policy to make necessary updates without over-updating. Therefore, the best clipping ratio choice is $r(θ) = 0.2$.

The MAML gradient learning update rate, denoted as $α$, is another crucial hyperparameter in the Meta-PPO algorithm, determining the update rate of parameters in the inner loop. The purpose of the inner loop in the MAML algorithm is to learn a specific task rapidly, while the outer loop aims for generalization across multiple tasks.

As shown in Figure 11, when other parameters remain unchanged and only the MAML gradient learning update rate $α$ is modified, the speed at which the Meta-PPO algorithm learns the optimal policy remains unchanged, converging around 100 episodes. The highest normalized throughput is achieved when $α = 0.1$. At this point, in the inner loop, the model parameter updates are moderate, allowing the model to adapt to the specific tasks and environments of the basic learner without over-fitting. This prevents a reduction in the generalization capability in the outer loop.

Figure 12a illustrates the curve of the action loss for the Meta-PPO algorithm varying with the number of episodes. Typically, the loss function represents the negative of the strategy’s expected return, which means that it takes negative values. The optimization process aims to maximize this return, which is equivalent to minimizing the loss function. As seen from Figure 12a, during the strategy optimization process of the proposed algorithm, the strategy loss steadily rises from a notably large negative value and then starts to converge to a stable value after the 100th episode. This suggests that the strategy has been refined and has reached an optimal point.

Figure 12b presents the curve of the Meta-PPO algorithm’s strategy entropy varying with the number of episodes. Entropy is widely regarded as a measure of the randomness or uncertainty of a strategy. Increasing the entropy during optimization encourages exploration over a wider action space, which benefits the optimization because the strategy is optimized over a wide range of actions rather than only a few concentrated ones. As seen from Figure 12b, during the algorithm’s policy optimization, the entropy rises from 0 to approximately 2.71 after the 100th episode and subsequently oscillates between 2.7 and 2.71. Notably, even when the policy loss stabilizes around the 100th episode, the policy entropy does not fall to zero or near-zero values. This indicates that the Meta-PPO learning process starts from a deterministic strategy, diversifies over time, and then stabilizes while retaining a degree of exploration. This suggests that the identified optimal solution is not trapped in a local optimum.
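As a reference point for reading the entropy values, the Shannon entropy of a discrete policy over $N$ actions ranges from 0 for a deterministic policy to $\ln N$ for a uniform one. The sketch below assumes natural logarithms and an action space of 15 actions, formed by the 5 channels and 3 transmission power levels listed in Table 2; this action-space size is an assumption made here for illustration and is not stated explicitly in this section.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Assumption: 5 channels x 3 transmission power levels (values from Table 2).
NUM_ACTIONS = 5 * 3

uniform = [1.0 / NUM_ACTIONS] * NUM_ACTIONS
deterministic = [1.0] + [0.0] * (NUM_ACTIONS - 1)

max_entropy = entropy(uniform)        # equals ln(15), roughly 2.708
min_entropy = entropy(deterministic)  # zero: a single action is always chosen
```

Under these assumptions the upper bound is about 2.708 nats, which gives a scale against which the plateau in Figure 12b can be compared.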

Figure 12c shows the curve of the proposed Meta-PPO algorithm’s value loss varying with the number of episodes. In the reinforcement learning process, the value function can be used to predict the expected return of a given state. In our proposed Meta-PPO algorithm, a Critic network is employed to estimate the value function, and the value loss represents the disparity between the Critic network’s predictions and the actual returns. A greater value loss indicates a greater deviation between the Critic network’s predictions and the actual obtained rewards. If this loss keeps increasing, it may indicate that the Critic network is not learning effectively or that the learning rate is not suitable. As can be seen from Figure 12c, the Meta-PPO algorithm starts with a high value loss because the neural network used for value function estimation is randomly initialized. As the episode count approaches 100, there is a noticeable drop in the value loss, reaching $2 \times 10^{11}$. This sharp decline is then followed by small variations. These changes suggest that the value function estimator is gradually becoming more accurate in predicting actual returns.
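The notion of value loss discussed above can be sketched in a few lines of Python. The sketch assumes a mean squared error between the Critic predictions and the discounted returns, which is a common choice but may differ in detail from Equation (9); the reward sequence is made up, and only the discount rate of 0.95 is taken from Table 2.

```python
def discounted_returns(rewards, gamma=0.95):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def value_loss(predictions, returns):
    """Mean squared error between Critic predictions and observed returns."""
    squared = [(v - g) ** 2 for v, g in zip(predictions, returns)]
    return sum(squared) / len(returns)


rewards = [1.0, 0.0, 1.0]              # hypothetical per-step rewards
returns = discounted_returns(rewards)  # roughly [1.9025, 0.95, 1.0]
untrained = value_loss([0.0, 0.0, 0.0], returns)  # Critic that predicts zero
trained = value_loss(returns, returns)            # 0.0 for a perfect Critic
```

A randomly initialized Critic corresponds to the `untrained` case, and the drop in Figure 12c corresponds to the predictions moving toward the observed returns.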

In summary, the Meta-PPO algorithm demonstrates a rapid and efficient learning trend in the initial learning phase, quickly mastering the optimal strategy, and then undergoing fine-tuning and stabilization in subsequent learning. Meanwhile, although DQN can eventually find an effective strategy, it requires a longer time to achieve similar performance. Under the current system model, the Meta-PPO algorithm achieves the maximum normalized throughput when using the clipping ratio $r(θ) = 0.2$ and the MAML gradient learning update rate $α = 0.1$ as parameters. This curve change reflects the effectiveness and robustness of the two algorithms and further emphasizes the superior performance of Meta-PPO.

5. Conclusions

The proposed Meta-PPO algorithm uses the PPO policy optimization method to allow the basic learner to learn a series of anti-jamming tasks with known jamming parameters. The meta-learner then summarizes and generalizes the common rules of this series of similar anti-jamming tasks, so that when the communication system encounters new jamming similar to historical anti-jamming tasks, it can make decisions more quickly using the experience provided by the meta-learner.

Simulation results show that, compared with other deep reinforcement learning algorithms that use random initial networks, Meta-PPO can still maintain good learning performance under constantly changing jamming conditions by adopting the initial strategy provided by the meta-learner. It also demonstrates stronger robustness and generalization, further enhancing its anti-jamming capability.

In the experimental simulation of this paper, the sensitivity of communication parameters still needs further study. At the same time, the idea of combining meta-learning with the DQN algorithm itself has a lot of room for optimization, such as optimizing the computational complexity of second-order gradient descent algorithms, and measuring the specific similarity of meta-training tasks. We plan to conduct research on these issues in the next step and perform actual simulations on hardware to continuously enhance the generalization and robustness of intelligent anti-jamming decision making.

Author Contributions

Conceptualization, Q.C., Y.N. and B.W.; methodology, Q.C., Y.N. and B.W.; software, Q.C. and B.W.; validation, Q.C., Y.N. and B.W.; formal analysis, Q.C. and Y.N.; investigation, Q.C. and Y.N.; resources, Q.C. and Y.N.; data curation, Q.C. and Y.N.; writing—original draft preparation, Q.C., Y.N., B.W. and P.X.; writing—review and editing, Q.C., Y.N. and P.X.; visualization, Q.C. and Y.N.; supervision, Q.C. and Y.N.; project administration, Q.C. and Y.N.; funding acquisition, Q.C. and Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 62371461, 61971439 and U22B2002.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to institutional data privacy requirements.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Bout, E.; Loscri, V.; Gallais, A. How Machine Learning Changes the Nature of Cyberattacks on IoT Networks: A Survey. IEEE Commun. Surv. Tutor. 2021, 24, 248–279.
2. Qin, Z.; Zhou, X.; Zhang, L.; Gao, Y.; Liang, Y.-C.; Li, G.Y. 20 Years of Evolution from Cognitive to Intelligent Communications. IEEE Trans. Cogn. Commun. Netw. 2019, 6, 6–20.
3. Wang, J.; Jiang, C.; Zhang, H.; Ren, Y.; Chen, K.-C.; Hanzo, L. Thirty Years of Machine Learning: The Road to Pareto-Optimal Wireless Networks. IEEE Commun. Surv. Tutor. 2020, 22, 1472–1514.
4. Luong, N.C.; Hoang, D.T.; Gong, S.; Niyato, D.; Wang, P.; Liang, Y.-C.; Kim, D.I. Applications of Deep Reinforcement Learning in Communications and Networking: A Survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174.
5. Han, C.; Huo, L.; Tong, X.; Wang, H.; Liu, X. Spatial Anti-Jamming Scheme for Internet of Satellites Based on the Deep Reinforcement Learning and Stackelberg Game. IEEE Trans. Veh. Technol. 2020, 69, 5331–5342.
6. Liu, K.; Li, P.; Liu, C.; Xiao, L.; Jia, L. UAV-Aided Anti-Jamming Maritime Communications: A Deep Reinforcement Learning Approach. In Proceedings of the 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP), Changsha, China, 20–22 October 2021; pp. 1–6.
7. Xiao, L.; Ding, Y.; Huang, J.; Liu, S.; Tang, Y.; Dai, H. UAV Anti-Jamming Video Transmissions with QoE Guarantee: A Reinforcement Learning-Based Approach. IEEE Trans. Commun. 2021, 69, 5933–5947.
8. Lu, X.; Xiao, L.; Niu, G.; Ji, X.; Wang, Q. Safe Exploration in Wireless Security: A Safe Reinforcement Learning Algorithm with Hierarchical Structure. IEEE Trans. Inf. Forensics Secur. 2022, 17, 732–743.
9. Zhou, Q.; Li, Y.; Niu, Y. Intelligent Anti-Jamming Communication for Wireless Sensor Networks: A Multi-Agent Reinforcement Learning Approach. IEEE Open J. Commun. Soc. 2021, 2, 775–784.
10. Lu, X.; Xiao, L.; Dai, C.; Dai, H. UAV-Aided Cellular Communications with Deep Reinforcement Learning Against Jamming. IEEE Wirel. Commun. 2020, 27, 48–53.
11. Zhang, G.; Li, Y.; Niu, Y.; Zhou, Q. Anti-jamming path selection method in a wireless communication network based on Dyna-Q. Electronics 2022, 11, 2397.
12. Wang, W.; Lv, Z.; Lu, X.; Zhang, Y.; Xiao, L. Distributed reinforcement learning based framework for energy-efficient UAV relay against jamming. Intell. Converg. Netw. 2021, 2, 150–162.
13. Lu, X.; Jie, J.; Lin, Z.; Xiao, L.; Li, J.; Zhang, Y. Reinforcement learning based energy efficient robot relay for unmanned aerial vehicles against smart jamming. Sci. China Inf. Sci. 2022, 65, 112304.
14. Zheng, S.; Chen, S.; Yang, X. DeepReceiver: A deep learning-based intelligent receiver for wireless communications in the physical layer. IEEE Trans. Cogn. Commun. Netw. 2020, 7, 5–20.
15. Xiao, L.; Jiang, D.; Chen, Y.; Su, W.; Tang, Y. Reinforcement-learning-based relay mobility and power allocation for underwater sensor networks against jamming. IEEE J. Ocean. Eng. 2019, 45, 1148–1156.
16. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
17. Antoniou, A.; Edwards, H.; Storkey, A. How to train your MAML. arXiv 2018, arXiv:1810.09502.
18. Mishra, N.; Rohaninejad, M.; Chen, X.; Abbeel, P. A simple neural attentive meta-learner. arXiv 2017, arXiv:1707.03141.
19. Guo, J.; Zhu, X.; Zhao, C.; Cao, D.; Lei, Z.; Li, S.Z. Learning meta face recognition in unseen domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6163–6172.
20. Zintgraf, L.; Shiarli, K.; Kurin, V.; Hofmann, K.; Whiteson, S. Fast context adaptation via meta-learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 7693–7702.
21. Schweighofer, N.; Doya, K. Meta-learning in reinforcement learning. Neural Netw. 2003, 16, 5–9.
22. Fakoor, R.; Chaudhari, P.; Soatto, S.; Smola, A.J. Meta-q-learning. arXiv 2019, arXiv:1910.00125.
23. Peng, S. Overview of Meta-Reinforcement Learning Research. In Proceedings of the 2020 2nd International Conference on Information Technology and Computer Application (ITCA), Guangzhou, China, 18–20 December 2020.
24. Zhang, H.; Kan, Z. Temporal Logic Guided Meta Q-Learning of Multiple Tasks. IEEE Robot. Autom. Lett. 2022, 7, 8194–8201.
25. Zhou, A.; Jang, E.; Kappler, D.; Herzog, A.; Khansari, M.; Wohlhart, P.; Bai, Y.; Kalakrishnan, M.; Levine, S.; Finn, C. Watch, try, learn: Meta-learning from demonstrations and reward. arXiv 2019, arXiv:1906.03352.
26. Sun, Y.; Yuan, X.; Liu, W.; Sun, C. Model-based reinforcement learning via proximal policy optimization. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; pp. 4736–4740.
27. Xiao, Z.; Xie, N.; Yang, G.; Du, Z. Fast-PPO: Proximal policy optimization with optimal baseline method. In Proceedings of the 2020 IEEE International Conference on Progress in Informatics and Computing (PIC), Shanghai, China, 18–20 December 2020; pp. 22–29.
28. Gu, Y.; Cheng, Y.; Chen, C.P.; Wang, X. Proximal policy optimization with policy feedback. IEEE Trans. Syst. Man Cybern. Syst. 2021, 52, 4600–4610.
29. Zhao, G.; Xu, J.; Liu, A.; Yu, J. Research on proximal policy optimization algorithm based on N-step update. In Proceedings of the 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), Beijing, China, 14–16 May 2021; pp. 854–857.
30. Ratcliffe, D.S.; Hofmann, K.; Devlin, S. Win or learn fast proximal policy optimization. In Proceedings of the 2019 IEEE Conference on Games (CoG), London, UK, 20–23 August 2019; pp. 1–4.
31. Vaca-Rubio, C.J.; Manchón, C.N.; Adeogun, R.; Popovski, P. Proximal Policy Optimization for Integrated Sensing and Communication in mmWave Systems. arXiv 2023, arXiv:2306.15429.
32. Thanh, P.D.; Giang, H.T.H.; Hong, I.-P. Anti-Jamming RIS Communications Using DQN-Based Algorithm. IEEE Access 2022, 10, 28422–28433.
33. Nguyen, P.K.H.; Nguyen, V.H.; Do, V.L. A Deep Double-Q Learning-based Scheme for Anti-Jamming Communications. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 24–28 August 2020; pp. 1566–1570.
34. Chen, J.; Gao, Y.; Zhou, Y.; Liu, Z.; Zhang, M. Machine Learning enabled Wireless Communication Network System. In Proceedings of the International Wireless Communications and Mobile Computing (IWCMC), Dubrovnik, Croatia, 30 May–3 June 2022; pp. 1285–1289.
35. Pirayesh, H.; Zeng, H. Jamming Attacks and Anti-Jamming Strategies in Wireless Networks: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2022, 24, 767–809.

Figure 1. The wireless communication system model used in our research.

Figure 2. MAML gradient descent process.

Figure 3. Meta-PPO algorithm framework.

Figure 4. Decision model of the basic learner based on the PPO algorithm.

Figure 5. Decision model of the basic learner based on the Meta-PPO algorithm.

Figure 6. Overall decision-making process based on the Meta-PPO algorithm.

Figure 7. Intelligent anti-jamming model based on Meta-PPO algorithm.

Figure 8. Random sweep frequency jamming’s channel–time slot diagram.

Figure 9. Comparison of anti-jamming performance of different algorithms.

Figure 10. Comparison of normalized throughput at different clipping ratios.

Figure 11. Comparison of different MAML gradient learning update rates.

Figure 12. Meta-PPO algorithm’s convergence curves of different parameters. (a) Meta-PPO policy action losses curve; (b) Meta-PPO policy entropy curve; (c) Meta-PPO value-losses curve.

Table 1. Summary of references.

Number	Network Background	Core Algorithm
[1]	Internet of Things (IoT)	Machine Learning
[2]	Cognitive Radio (CR)	Machine Learning
[3]	Wireless Network	Supervised Learning; Unsupervised Learning; Reinforcement Learning; Deep Learning
[4]	Internet of Things (IoT); Unmanned Aerial Vehicles (UAVs)	Deep Reinforcement Learning
[5]	Internet of Satellites (IoS)	Stackelberg Game; Reinforcement Learning
[6]	Maritime Communication	Deep Reinforcement Learning
[7]	Unmanned Aerial Vehicles (UAVs)	Reinforcement Learning
[8]	Unmanned Aerial Vehicles (UAVs)	Deep Reinforcement Learning
[9]	Wireless Sensor Network	Stochastic Game Framework
[10]	Cellular System	Deep Reinforcement Learning; Transfer Learning
[11]	Wireless Network	Dyna-Q-Learning
[12]	Unmanned Aerial Vehicles (UAVs)	Distributed Reinforcement Learning
[13]	Unmanned Aerial Vehicles (UAVs)	Reinforcement Learning
[14]	Wireless Network	Deep Learning-Based Intelligent Receiver
[15]	Underwater Sensor Networks (UWSNs)	Reinforcement Learning
[16]	Few-shot Image Classification	Meta-Learning
[17]	Few-shot Learning	Model Agnostic Meta-Learning (MAML)
[19]	Face Recognition System	Meta-Learning
[21]	Markov Decision Task; Non-linear Control Task	Meta-Reinforcement Learning
[22]	Continuous-control Benchmark	Meta Q-Learning
[23]	Artificial Intelligence	Meta-Reinforcement Learning
[24]	Artificial Intelligence	Meta Q-Learning of Multi-task (MQMT)
[25]	Vision-based Control Task	Meta-imitation Learning
[26]	Pendulum Benchmark Problem	Proximal Policy Optimization (PPO)
[28]	Atari Games and Control Tasks	PPO with Policy Feedback (PPO-PF)
[30]	Matrix Games and Grid-based Games	Fast-PPO
[31]	Sensing and Communication in mmWave System	Proximal Policy Optimization (PPO)
[32]	Wireless Network	Deep Reinforcement Learning
[33]	Wireless Network	Double-Q Learning
[34]	Wireless Network	Machine Learning
[35]	Wireless Network	Machine Learning

Table 2. Simulation Parameters.

Parameters	Numerical Values	Parameters	Numerical Values
PPO reward discount rate $γ$	0.95	Policy update clipping ratio in PPO	0.2
Discount factor for advantage estimation	1	Weight of the value-function loss in PPO	0.5
Size of the hidden layer of the policy network	100	Weight of the entropy regularization term in PPO	0.01
Batch size for each task $T_{i}$	5	Number of channels $N_{0}$	5
MAML gradient update learning rate $α$	0.1	Total channel bandwidth $B$	10 MHz
Total number of batches learned	1000	Channel switching loss factor $C_{f}$	10⁻⁴
Number of jamming channels $K_{T}$	3	Number of tasks per batch	10
Jamming power $p_{k}^{'}$	{10 dBm, 15 dBm, 20 dBm}	Transmission power $p_{n}$	{5 dBm, 10 dBm, 15 dBm}

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
