Article

Anti-Jamming Path Selection Method in a Wireless Communication Network Based on Dyna-Q

Guoliang Zhang, Yonggui Li, Yingtao Niu and Quan Zhou
1 The Sixty-Third Research Institute, National University of Defense Technology, Nanjing 210007, China
2 College of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(15), 2397; https://doi.org/10.3390/electronics11152397
Submission received: 16 June 2022 / Revised: 29 July 2022 / Accepted: 29 July 2022 / Published: 31 July 2022
(This article belongs to the Section Networks)

Abstract

To efficiently establish the optimal transmission path for a wireless communication network in a malicious jamming environment, this paper proposes an anti-jamming path selection algorithm based on Dyna-Q. Based on previous observations of the environment, the algorithm selects the optimal sequential node by searching the Q table, reducing the packet loss rate. Because the algorithm reuses previous experience to accelerate the updating of the Q table, the Q table converges to the optimal value quickly, which benefits the selection of subsequent nodes. Simulation results show that the proposed algorithm converges faster than model-free reinforcement learning algorithms.

1. Introduction

With the explosive growth of wireless communication equipment [1], wireless communication is increasingly evolving toward networked operation [2]. Due to the openness of the electromagnetic propagation space [3] and the limited transmitting power of wireless communication nodes, the reliability and effectiveness of wireless transmission are susceptible to malicious electromagnetic jamming [4]. In a wireless communication network, the coverage of a single node is limited, so the distance between the source node and the destination node may be too great for direct communication, and intermediate nodes are required to forward traffic and establish a multi-hop path [5]. When malicious jamming exists, conventional routing algorithms have difficulty finding the optimal transmission path efficiently. Therefore, efficiently establishing the optimal transmission path [6] under malicious jamming is an urgent problem for wireless communication networks.

1.1. Related Works

To realize reliable information transmission in wireless communication networks, Reference [7] proposed a multi-hop routing algorithm for wireless sensor networks (WSNs) based on K-means clustering and a genetic algorithm, which effectively balanced network energy consumption and extended the network life cycle but did not reduce the overall energy consumption. Reference [8] proposed the classical "watchdog" algorithm for ad hoc networks, which identifies selfish nodes and avoids them by using the "pathrater" protocol; however, the validity of the protocol depends heavily on the reliability of the "watchdog" itself. Reference [9] proposed a routing algorithm based on the DSR protocol for ad hoc networks in which malicious nodes are assessed by comparing the trust degree of each node, but the selection of trusted sequential nodes was not optimized further. For the multi-hop number problem in clustering routing, Reference [10] derived the optimal hop count that minimizes the total energy consumption of straight-line data transmission from node to base station and obtained the ideal minimum-energy path; however, only energy consumption was considered, not the throughput of the destination node. Reference [11] proposed the MDCE routing algorithm for delay-tolerant networks, which mitigates their high load rate and packet loss rate but considers only the local topology of the network, not the global topology. Reference [12] proposed a multi-channel anti-jamming routing protocol based on LEACH, which can switch channels in a timely manner according to the interference and effectively avoid malicious jamming, but it is only applicable to low-density networks. Reference [13] used a spectrally efficient in-band full-duplex (IBFD) scheme to improve system security with minimum interference, reducing interference by adding orthogonality between the transmitted and received signals of a relay; however, external interference was not considered. Reference [14] studied an Internet-of-Things (IoT) system with relay selection employing an emerging multiple access scheme, non-orthogonal multiple access (NOMA), but that work addresses wiretapping rather than active jamming. Reference [15] proposed an anti-jamming path selection method based on the multi-armed bandit (MAB) model under full-frequency blocking jamming, but the method needs to exhaust all available paths, resulting in high computational complexity. To sum up, few existing studies consider electromagnetic jamming, and existing anti-jamming routing methods [16,17,18] suffer from high complexity, poor convergence, and poor scalability.
The anti-jamming decision problem of intelligent communication is usually modeled as a Markov decision process (MDP) [19]. In practical scenarios, it is difficult to obtain the state transition probability function [20], so reinforcement learning can be adopted: it takes environmental feedback as input, interacts with the environment continuously, and learns the optimal action strategy through trial and error. Because of the complex time variation of the electromagnetic environment, model-free reinforcement learning is a common approach for intelligent anti-jamming communication. However, model-free algorithms suffer from the "curse of dimensionality" and poor convergence. Our motivation is that the transmission distance of a single communication node is limited, and existing algorithms have difficulty finding the optimal path quickly; this led us to design the proposed scheme. In this study, model-based and model-free [21] methods are combined for multi-hop anti-jamming path selection, and a model-based reinforcement learning (Dyna-Q) anti-jamming path selection algorithm framework is designed [22]. The proposed algorithm targets a multi-node wireless communication network: when data are sent from the source node to the destination node [23] in an environment of broadband blocking jamming, the Dyna-Q algorithm is adopted to select relay nodes, thus forming the optimal communication path. The main contribution of this paper is an anti-jamming path selection algorithm based on Dyna-Q for wireless communication networks in a broadband jamming environment.

1.2. Contribution and Structure

The contributions of this paper are as follows:
  • This paper proposes an anti-jamming path selection algorithm based on Dyna-Q. This is the first use of Dyna-Q in anti-jamming communication.
  • The algorithm enables the source node to find the optimal path to the destination node, with the advantages of fast convergence and improved communication efficiency.
The rest of this paper is organized as follows: Section 2 presents the system model and problem formulation. In Section 3, we introduce the anti-jamming path selection algorithm based on Dyna-Q. The simulation results and analysis are discussed in Section 4, and concluding remarks are given in Section 5.

2. System Model and Problem Formulation

2.1. System Model

To facilitate the research, the following assumptions are made in this paper:
  • In the wireless communication network, there are $N$ communication nodes, and the wireless communication links do not interfere with one another. The communication node set is defined as $C = \{C_1, C_2, \ldots, C_N\}$, where $C_i$ $(i \in \{1, 2, \ldots, N\})$ is the $i$th node; in particular, $C_1$ is the source node and $C_N$ is the destination node, and "node $i$" and "node $C_i$" both refer to the $i$th node. If the energy of all nodes were large enough, the current node could select any of the $N-1$ other nodes as the next hop. The location set of all communication nodes is denoted as $\Lambda = \{L_1, L_2, \ldots, L_N\}$, and the adjacent node set of the current node can be determined from this location set [24]. In practice, all nodes have limited energy, so the current node can only select neighboring nodes. Within the area, there is a jammer $J$ that continuously jams the surrounding area at constant power $P_{J0}$, as shown in Figure 1.
  • When each node transmits information, the channel bandwidth is $B$. Each packet transmitted from the source node $C_1$ to the destination node $C_N$ has a length of $D$ bits, and each communication node receives the entire packet before forwarding it to the subsequent node.
  • Location information and link status are shared between nodes through a common control link; however, the specific position and jamming range of the jammer are unknown to the communication nodes.
  • The jammer executes precise jamming that tracks the communication signal in frequency and aligns with it in time, making the jamming effective in both the time domain and the frequency domain; this effect is equivalent to full-frequency blocking jamming. Additive white Gaussian noise exists in all channels, with power spectral density $n_0$.
  • It is assumed that when an unjammed communication node is selected, namely, when $SJNR \geq SJNR_{th1}$, the current node uses QPSK modulation and transmits at the full rate $v$. If a partially jammed node is selected, namely, if $SJNR_{th2} < SJNR < SJNR_{th1}$, the node uses BPSK modulation and transmits at the half rate $v/2$. If a seriously jammed node is selected, namely, if $SJNR \leq SJNR_{th2}$, the transmission rate is 0, and the nodes cannot communicate with each other. Here, $SJNR_{th1}$ and $SJNR_{th2}$ are SJNR threshold values, with $SJNR_{th1} > SJNR_{th2}$. A minimal code sketch of this rate-selection rule is given after this list.
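The following is a minimal sketch, in Python, of the rate-selection rule above. The function name and the default threshold values are illustrative assumptions; the paper does not specify concrete SJNR thresholds.

```python
# Minimal sketch of the SJNR-based rate selection rule; the default
# thresholds th1 and th2 are illustrative placeholders, not values
# taken from the paper.
def transmission_rate(sjnr: float, v: float,
                      th1: float = 10.0, th2: float = 3.0) -> float:
    """Map a node's SJNR to its usable transmission rate."""
    if sjnr >= th1:      # unjammed node: QPSK at full rate
        return v
    if sjnr > th2:       # partially jammed node: BPSK at half rate
        return v / 2
    return 0.0           # seriously jammed node: no communication
```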
Under broadband blocking jamming, our goal is to find a path with the minimum hop count that can transfer all data from the source node to the destination node.

2.2. Problem Formulation

The optimal path selection problem is modeled as a Markov game problem in the full-frequency jamming environment. We define the state space as the set of currently communicating nodes:
$$S = \{C_1, C_2, \ldots, C_N\} \quad (1)$$
We define the adjacent node set of the current node $i$ as:
$$Z = \{Z_{i1}, Z_{i2}, \ldots, Z_{iM}\}, \quad M < N \quad (2)$$
$$\max\{d_{Z_{i1},i}, d_{Z_{i2},i}, \ldots, d_{Z_{iM},i}\} < d_{ij} \quad (3)$$
where $Z_{i1}, Z_{i2}, \ldots, Z_{iM}$ are the adjacent nodes of the current node $i$, and $d_{Z_{i1},i}, d_{Z_{i2},i}, \ldots, d_{Z_{iM},i}$ are their distances to node $i$. The distance from node $i$ to node $j$ is $d_{ij}$, where node $j$ is any of the remaining communication nodes other than the adjacent nodes. If two or more adjacent nodes lie in the same direction from the current node, that is, on one straight line, only the nearest of them is selectable. As shown in Figure 2, the current node has four adjacent nodes, but only three can be selected.
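As an illustration, the adjacent node set can be computed from the shared location set. The sketch below simply takes the $M$ nearest nodes, which is one reasonable reading of conditions (2) and (3); the helper name and the (x, y) tuple representation of locations are assumptions.

```python
import math

# Illustrative computation of the adjacent node set Z of node i from
# the location set; locations are assumed to be (x, y) coordinate tuples.
def adjacent_nodes(i: int, locations: list, m: int = 4) -> list:
    """Return the indices of the m nodes nearest to node i."""
    xi, yi = locations[i]
    dist = {j: math.hypot(x - xi, y - yi)
            for j, (x, y) in enumerate(locations) if j != i}
    return sorted(dist, key=dist.get)[:m]  # the m nearest neighbors
```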
The action of the current communication node is the selection of the subsequent node to forward to, and the action space $A$ consists of all possible such selections. Therefore, the action space $A$ is defined as:
$$A = \{Z_{i1}, Z_{i2}, \ldots, Z_{iM}\} \quad (4)$$
We define the single-hop reward function $r$ as:
$$r = \begin{cases} 1 & \text{reaching } C_N \\ -1 & \text{the transmission rate is } 0 \\ -0.5 & \text{forwarding via a BPSK (half-rate) node} \\ 0 & \text{forwarding via a QPSK (full-rate) node} \end{cases} \quad (5)$$
The return of a path is defined as $R$, the sum of the single-hop rewards along the path:
$$R = \sum_{hop=1}^{m} r_{hop} \quad (6)$$
where $m$ is the total hop count, and $r_{hop}$ is the single-hop reward of the current hop.
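A small sketch of the single-hop reward of Equation (5) follows; it assumes the sign convention reconstructed above (penalties for jammed and half-rate nodes), and the function name is illustrative.

```python
# Sketch of the single-hop reward r of Equation (5), under the
# reconstructed sign convention.
def single_hop_reward(reached_destination: bool, rate: float, v: float) -> float:
    if reached_destination:
        return 1.0       # destination node C_N reached
    if rate == 0:
        return -1.0      # seriously jammed node: transmission rate 0
    if rate == v / 2:
        return -0.5      # partially jammed node: BPSK at half rate
    return 0.0           # unjammed node: QPSK at full rate
```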

3. An Anti-Jamming Algorithm Based on Dyna-Q

In this work, an anti-jamming path selection algorithm framework based on model-based reinforcement learning (Dyna-Q) is designed. As shown in Figure 3, the continuous interaction between communication nodes and the environment generates real experience; part of this experience directly improves the value function and policy, while the other part is used to improve the model. The former is called direct reinforcement learning, and the latter is called model learning.
At the beginning, the $Q(s, a)$ table is initialized, the initial state is the source node, and an action is selected. Initially, each hop yields an instant reward of 0; when the destination node $C_N$ is reached, a corresponding reward of 1 is recorded in the Q table. According to the principle of Q-learning, the return is then used to update the current Q value via Equation (7). When the change in the Q values falls below a threshold, the Q table has converged.
Communication nodes gain "experience" in actual communication. Each historical experience can be represented by the quad $(s_t, a_t, s_{t+1}, r)$, where $s_t$ is the communication node occupied at time $t$, $a_t$ is the action taken at time $t$ (the forwarding node chosen), $s_{t+1}$ is the communication node selected at time $t+1$, and $r$ is the reward obtained. All quads form a historical experience set, denoted as $model(s, a)$. The Q function is expressed as $Q(s, a)$.
According to the literature [25], Q-learning is a type of reinforcement learning in which the agent constantly interacts with the environment and finds the optimal communication node according to the environment's feedback. Equation (7) is derived from the Bellman equation, and Q is updated as follows:
$$Q(s_t, a_t) = (1 - \alpha) Q(s_t, a_t) + \alpha \left[ r + \gamma \max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1}) \right] \quad (7)$$
where $\alpha$ is the learning rate, and $\gamma$ is the discount factor.
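The update of Equation (7) can be written directly in code. The sketch below assumes the Q table is stored as a Python dictionary keyed by (state, action) pairs, with missing entries read as 0; the default $\alpha$ and $\gamma$ follow Table 1.

```python
# Q update of Equation (7); Q is assumed to be a dict keyed by
# (state, action) pairs, with unseen entries defaulting to 0.
def q_update(Q: dict, s, a, r: float, s_next, actions,
             alpha: float = 0.9, gamma: float = 0.6) -> None:
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
```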
The Q table is built up through this constant interaction, and the agent chooses the best action according to it. The Q table is the tabular representation of the Q function and is stored as an array in the simulation; running Q-learning produces the Q table.
When a new communication experience is obtained, a new quad $(s_t, a_t, s_{t+1}, r)$ is generated: based on the current state $s_t$, the agent takes action $a_t$, receives an immediate reward $r$, and moves to the next state $s_{t+1}$. If this quad does not already exist in $model(s, a)$, it is stored there to update $model(s, a)$; when no new communication experience is generated, that is, no new quad appears, $model(s, a)$ is not updated.
$model(s, a)$ is used for the simulated (planning) process: a quad is randomly selected from $model(s, a)$, and according to the selected quad $(s_t, a_t, s_{t+1}, r)$, the agent looks up the corresponding $\max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})$ in the Q table. Then, $s_t$, $a_t$, $r$, and $\max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})$ are substituted into Equation (7), and the Q table is updated. This process is repeated $n$ times.
After updating the Q table, the agent adopts the $\varepsilon$-greedy policy to select action $a_t$:
$$a_t = \begin{cases} \arg\max_{a \in A} Q(s_t, a) & \text{with probability } 1 - \varepsilon \\ \text{a random } a \in A & \text{with probability } \varepsilon \end{cases} \quad (8)$$
Finally, $a_t$ is output.
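In code, the $\varepsilon$-greedy rule of Equation (8) can be sketched as follows, reusing the dictionary-based Q table assumed above.

```python
import random

# Epsilon-greedy action selection of Equation (8): explore with
# probability epsilon, otherwise exploit the current Q table.
def epsilon_greedy(Q: dict, s, actions, epsilon: float):
    if random.random() < epsilon:
        return random.choice(list(actions))   # explore: random action
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit
```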
The anti-jamming communication algorithm based on Dyna-Q can be summarized as Algorithm 1:
Algorithm 1. Anti-jamming algorithm for a wireless communication network
Initialize $Q(s, a) = 0$ and $model(s, a) = \emptyset$ for all $s \in S$, $a \in A$; set the parameters $\alpha$ and $\gamma$.
For $t = 1, 2, \ldots, T$ do
(1) In the current environment state $s_t$, the transmitter performs the last decision action $a_t$ (or the initial action) to select the next communication node;
(2) After taking action $a_t$, the reward $r$ and the next state $s_{t+1}$ are obtained;
(3) The Q function is updated according to Equation (7);
(4) The latest quad $(s_t, a_t, s_{t+1}, r)$ is used to update $model(s, a)$;
(5) $n$ planning cycles are performed:
   ① A random quad $(s_t, a_t, s_{t+1}, r)$ is selected from $model(s, a)$;
   ② The Q function is updated according to Equation (7);
(6) The receiver determines the transmitter's action $a_{t+1}$ for the next moment according to Equation (8) and sends it back to the transmitter.
end for
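The sketch below ties the pieces of Algorithm 1 together, reusing q_update and epsilon_greedy from above. The env object, with reset/done/actions/step methods, is a hypothetical interface standing in for the network simulation; the schedule $\varepsilon = 1/t$ follows the setting used in Section 4.1.

```python
import random

# Compact sketch of Algorithm 1 (Dyna-Q). The `env` interface is a
# hypothetical stand-in: reset() returns the source node, done(s) tests
# for the destination, actions(s) returns the adjacent-node set of s,
# and step(s, a) returns the next node and the reward of Equation (5).
def dyna_q(env, episodes: int = 100, n: int = 50,
           alpha: float = 0.9, gamma: float = 0.6) -> dict:
    Q, model = {}, {}        # Q(s, a) table and model(s, a)
    t = 0                    # cumulative hop count across all experiments
    for _ in range(episodes):
        s = env.reset()
        while not env.done(s):
            t += 1
            a = epsilon_greedy(Q, s, env.actions(s), epsilon=1.0 / t)
            s_next, r = env.step(s, a)                      # real experience
            q_update(Q, s, a, r, s_next, env.actions(s_next),
                     alpha, gamma)                          # direct RL
            model[(s, a)] = (s_next, r)                     # model learning
            for _ in range(n):                              # n-step planning
                (ps, pa), (pn, pr) = random.choice(list(model.items()))
                q_update(Q, ps, pa, pr, pn, env.actions(pn), alpha, gamma)
            s = s_next
    return Q
```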

4. Simulation Results and Analysis

4.1. Parameter Settings

We set the parameters for the simulation as shown in Table 1.
The communication network contains $N = 64$ communication nodes distributed over a square site of 25 km × 25 km; the nodes are randomly generated according to a Poisson distribution. The number of elements in the adjacent node set of the current node is assumed to be $M = 4$, the learning factor is $\alpha = 0.9$, and the discount factor is $\gamma = 0.6$. In addition, to achieve a smooth transition from exploration to exploitation, we set $\varepsilon = 1/t$, where $t$ is the cumulative hop count over the whole experiment. The jammer is assumed to lie midway between the source node and the destination node and forms a circular jamming range. Because the communication nodes are at different spatial positions relative to the jammer, their jamming conditions differ. A schematic diagram of the communication nodes is shown in Figure 4.
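For reference, node placement in the simulation can be reproduced along the following lines. Conditioned on a fixed number of points, a homogeneous Poisson point process over a square area reduces to uniform sampling, which the sketch below uses; the function name and seeding are illustrative.

```python
import random

# Illustrative placement of N = 64 nodes over the 25 km x 25 km site:
# a homogeneous Poisson point process conditioned on N points is
# equivalent to sampling N uniform positions.
def generate_nodes(n: int = 64, side_km: float = 25.0, seed=None) -> list:
    rng = random.Random(seed)
    return [(rng.uniform(0, side_km), rng.uniform(0, side_km))
            for _ in range(n)]
```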
In the following simulation, we compared the performance of the proposed algorithm with that of the following methods:
  • Optimal anti-jamming path selection based on the MAB model: The UCB1 algorithm is used to select the arm. In the first $K$ rounds, each available communication node is tried once, and the results of each trial are independent and identically distributed. In subsequent rounds, the selected node is determined by the average reward of each communication node over the previous $K$ rounds and the number of times each node has been selected.
  • Classic Q-learning: The communication node constantly interacts with the environment and selects the next-hop node according to the feedback rewards.
Successful transmission means that the source node transmits the information to the destination node and the destination node successfully receives it. When a transmission ends, the destination node is treated as the terminal state, so the agent-environment interaction naturally divides into a series of subsequences, each running from the source node, through jammed and unjammed intermediate nodes, to the terminal state. We call each subsequence an experiment (also known as an episode). When the information has been transmitted to the destination node, the agent returns to the source node to start a new experiment. $model(s, a)$ is used for the simulated learning process, and a cycle of $n$ such updates constitutes n-step planning. Planning takes an environment model as input and produces or improves a policy for interacting with the environment.

4.2. Analysis of Simulation

Figure 5 shows the influence of the jammer's position on the optimal path return. The X-axis and Y-axis represent the position coordinates of the jammer, and the color at each point represents the mean of the maximum return obtained by the minimum-hop path in each experiment when the jammer is at that position. The source node is assumed to be at (1,1) and the destination node at (25,25). As can be seen in Figure 5, when the jamming power and transmitting power remain unchanged, the closer the jammer is to the line connecting the source node and the destination node, the smaller the return of the optimal path. When the jammer approaches the source node or the destination node, the first hop after the source node and the last hop before the destination node both fall within the serious jamming range. In this case, the signal-to-noise ratio of all available nodes is below the minimum transmission threshold, so a transmission path may not be found.
Figure 6 compares the hops required by the different algorithms in each experiment. The X-axis is the experiment index, and the Y-axis is the number of hops in each experiment. The simulations demonstrate the superiority of the proposed algorithm over conventional Q-learning and the MAB algorithm proposed in 2019. The Dyna-Q algorithm proposed in this paper finds the minimum-hop path at about the 15th experiment, the classical Q-learning algorithm in about 21 experiments, and the MAB algorithm in 32 experiments. As can be seen in the figure, the proposed Dyna-Q algorithm converges to the optimal result faster than the classical Q-learning and MAB algorithms. The reason is that the communication nodes in the Dyna-Q algorithm perform planning, while those in the classical Q-learning and MAB algorithms do not. When $n$ in the n-step Dyna-Q algorithm is 0, it reduces to classical Q-learning: because no planning is performed, each experiment provides only one opportunity to learn the strategy, namely, when the information reaches the destination node. A planning agent likewise has only one learning opportunity in the first experiment, but $model(s, a)$ is established during it, so from the second experiment onward, multiple backup updates to the Q table can be made without waiting for the end of the experiment. Therefore, a planning agent finds the minimum-hop path faster. The number of experiments required to find the minimum-hop path represents the time spent: the fewer the experiments, the less time and the faster the algorithm.
Figure 7 shows the returns obtained by the above three algorithms in each experiment. It can be seen that the n-step Dyna-Q algorithm obtains the highest returns sooner than the other two algorithms, that is, it converges faster.
As shown in Figure 8, in the n-step Dyna-Q algorithm, the more planning steps, the faster the convergence and the better the effect. Compared with 15-step, 5-step, and 0-step planning, 50-step planning performs best; the 0-step Dyna-Q algorithm is simply direct reinforcement learning. The 50-step plan converges to the optimal value by the third experiment, the 15-step plan after about 30 experiments, the 5-step plan after about 50 experiments, and the 0-step plan after even more.
Figure 9 compares the algorithms with different planning steps in terms of minimum hops. Again, 50-step planning achieves the fastest convergence, followed by 15-step, 5-step, and 0-step planning. It can be concluded that in the n-step Dyna-Q algorithm, more planning steps yield faster convergence and better performance.

5. Conclusions

This paper combines model-based and model-free methods to design a model-based reinforcement learning (Dyna-Q) anti-jamming path selection algorithm framework. For a multi-node wireless communication network in which data are sent from the source node to the destination node under broadband blocking jamming, the anti-jamming decision is modeled as a Markov decision process (MDP), and the Dyna-Q algorithm is used to select relay nodes so as to form the optimal communication path. The simulation results show that the shorter the distance between the jammer and the source or destination node, the more adverse the impact of the jammer on the whole network. Compared with Q-learning, the Dyna-Q algorithm finds the optimal path faster. At present, the proposed algorithm is mainly of theoretical value: in a real wireless communication network, nodes interfere with one another, and throughput can be further degraded by traffic load, attenuation loss, fading, and the signal-to-noise ratio. Nevertheless, the algorithm proposed in this paper has good convergence and is beneficial for making fast and correct decisions in a complex electromagnetic environment.

Author Contributions

Methodology, Y.L. and Y.N.; writing—original draft, G.Z.; software, G.Z. and Q.Z.; supervision, Y.N. and Y.L.; writing—review and editing, G.Z.; validation, G.Z. and Q.Z.; funding acquisition, Y.N.; project administration, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC grant U19B2014).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, G.Y.; Xu, Z.; Xiong, C.; Yang, C.; Zhang, S.; Chen, Y.; Xu, S. Energy-efficient wireless communications: Tutorial, survey, and open issues. IEEE Wirel. Commun. 2011, 18, 28–35.
  2. Klaus, W.; Puttnam, B.J.; Luis, R.S.; Sakaguchi, J.; Mendinueta, J.D.; Awaji, Y.; Wada, N. Advanced space division multiplexing technologies for optical networks [Invited]. J. Opt. Commun. Netw. 2017, 9, C1–C11.
  3. Lin, J.; Tian, B.; Wu, J.; He, J. Spectrum Resource Trading and Radio Management Data Sharing Based on Blockchain. In Proceedings of the 2020 IEEE 3rd International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2020; pp. 83–87.
  4. Zou, Y.; Zhu, J.; Wang, X.; Hanzo, L. A Survey on Wireless Security: Technical Challenges, Recent Advances, and Future Trends. Proc. IEEE 2016, 104, 1727–1765.
  5. Sun, Z. Hop Number and Relay Nodes Optimization in Clustering Routing Algorithms. Minicomput. Syst. 2019, 40, 1299–1305.
  6. Haoyu, Z.; Yan, H.; Xinxing, X.; Nengling, T. Optimal Configuration of Integrated Energy System Considering Energy Transmission Paths. In Proceedings of the 2021 4th International Conference on Energy, Electrical and Power Engineering (CEEPE), Chongqing, China, 23–25 April 2021; pp. 1119–1124.
  7. Miao, J. Multi-hop routing algorithm based on genetic algorithm and K-means clustering algorithm used in WSN. Mod. Electron. Technol. 2021, 44, 42–48.
  8. Marti, S.; Giuli, T.J.; Lai, K.; Baker, M. Mitigating Routing Misbehavior in Mobile Ad Hoc Networks. 2000. Available online: https://www.cs.cmu.edu/~srini/15-829/readings/marti-giuli-lai-baker-mitigating-routing-misbehavior.pdf (accessed on 15 June 2022).
  9. Lu, B.; Pooch, U.W. Cooperative security-enforcement routing in mobile ad hoc networks. In Proceedings of the 4th International Workshop on Mobile and Wireless Communications Network, Stockholm, Sweden, 9–11 September 2002; pp. 157–161.
  10. Bhatnagar, M.R.; Mallik, R.K.; Tirkkonen, O. Performance Evaluation of Best-Path Selection in a Multihop Decode-and-Forward Cooperative System. IEEE Trans. Veh. Technol. 2016, 65, 2722–2728.
  11. Yang, W. Analysis and improvement of MDCE routing algorithm. Comput. Appl. Softw. 2021, 38, 105–110.
  12. Lin, M. Research on Anti-Jamming Routing Algorithm for ISM Band Multi-Channel Wireless Sensor Networks; Nanjing University of Technology: Nanjing, China, 2021.
  13. Khan, R.; Jayakody, D.N.K. Full Duplex Component—Forward Cooperative Communication for a Secure Wireless Communication System. Electronics 2020, 9, 2102.
  14. Do, D.-T.; Van Nguyen, M.-S.; Hoang, T.-A.; Voznak, M. NOMA-assisted multiple access scheme for IoT deployment: Relay selection model and secrecy performance improvement. Sensors 2019, 19, 736.
  15. Wang, Y.T. Research on Anti-Jamming Method of Wireless Communication Based on Reinforcement Learning; Army Engineering University: Nanjing, China, 2019.
  16. Abunada, A.H.; Osman, A.Y.; Khandakar, A.; Chowdhury, M.E.H.; Khattab, T.; Touati, F. Design and Implementation of a RF Based Anti-Drone System. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies, Doha, Qatar, 2–5 February 2020; pp. 42–45.
  17. Lu, Z.; Song, J.; Zheng, C.; Xu, W.; Wang, X. Generalized State Space Average-value Model of MAB Based Power Electrical Transformer. In Proceedings of the 2021 5th International Conference on Power and Energy Engineering (ICPEE), Xiamen, China, 2–4 December 2021; pp. 46–52.
  18. Chen, M.; Poor, H.V.; Saad, W.; Cui, S. Convergence Time Minimization of Federated Learning over Wireless Networks. In Proceedings of the ICC 2020—2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6.
  19. Almeida, P.J.; Lieb, J. Complete j-MDP Convolutional Codes. IEEE Trans. Inf. Theory 2020, 12, 7348–7359.
  20. Xie, G.; Sun, L.; Wen, T.; Hei, X.; Qian, F. Adaptive Transition Probability Matrix-Based Parallel IMM Algorithm. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 2980–2989.
  21. Sutton, R.S.; Barto, A.G. Reinforcement Learning; Electronic Industry Press: Beijing, China, 2021.
  22. Su, W.; Tao, J.; Pei, Y.; You, X.; Xiao, L.; Cheng, E. Reinforcement Learning Based Efficient Underwater Image Communication. IEEE Commun. Lett. 2021, 25, 883–886.
  23. Zhou, Y. Fundamentals of Information Theory; Beijing University of Aeronautics and Astronautics Press: Beijing, China, 2019.
  24. Sun, J. A Rapid Deployment Method for Wireless Emergency Communication Relay Nodes in Complex Environment. Astrom. Tech. 2022, 42, 45–50.
  25. Vlassis, N. A Concise Introduction to Multiagent Systems and Distributed Artificial Intelligence; Morgan and Claypool Publishers: Williston, VT, USA, 2007.
Figure 1. System model.
Figure 2. Schematic diagram of current node selection.
Figure 3. Structure diagram of Dyna-Q.
Figure 4. Schematic diagram of communication nodes.
Figure 5. Influence of different jammer positions on the optimal path return.
Figure 6. Comparison of hops required by different algorithms in each experiment.
Figure 7. Rewards obtained by different algorithms in each experiment.
Figure 8. Rewards per experiment for different numbers of planning steps.
Figure 9. Minimum hops required for different numbers of planning steps.
Table 1. Settings of model-related parameters.

Parameter                 Value
Communication nodes N     64
Adjacent nodes M          4
Learning factor α         0.9
Discount factor γ         0.6
Greedy index ε            1/t