1. Introduction
In wireless communication networks, the wireless channel serves as the transmission path between the transmitter and receiver. Although no tangible connection exists between the two, it is important to understand the channel’s properties in real-life situations. However, wireless channels are vulnerable to malicious jamming attacks [
1], posing a nonnegligible challenge to reliable communication. To address these challenges, spread spectrum technology [
2], including direct-sequence spread spectrum (DSSS), frequency-hopping (FH), and time-hopping (TH), has emerged as an effective anti-jamming measure. The control competition over the electromagnetic spectrum between communication terminals requires a decision-making scheme that can assist or predict such struggles. Game theory [
3] has become a valuable mathematical tool for tackling anti-jamming issues. It enables the system to select the optimal policies in confrontational or conflict situations. However, with the increasing complexity and variability of the jamming environment in wireless communication channels, coupled with advancements in artificial intelligence software and hardware, jamming technology has become more intelligent and dynamic. Consequently, traditional anti-jamming techniques struggle to combat these challenges effectively. Integrating machine learning with anti-jamming technology offers a promising solution: by developing intelligent anti-jamming systems, we can cater to the specific demands of real-world scenarios. The real-world scenario considered in this paper is an actual confrontation scenario in which the reliability and security of communication must be guaranteed and no jamming-free control channel can be assumed. Therefore, the development of efficient and flexible intelligent anti-jamming schemes has become a significant challenge [
4,
5].
The adversarial dynamics between jammers and communicators can be effectively analyzed as a game, which has led to numerous research efforts in this domain. Jia et al. [
6] investigated anti-jamming communication in wireless networks from the perspective of Stackelberg games. Jia et al. [
7] explored the framework and challenges of game theory in learning anti-jamming, providing an analysis of various anti-jamming game models. Reinforcement Learning (RL) has also gained significant traction as an effective method for addressing anti-jamming in communication. A common approach involves leveraging the Q-Learning algorithm proposed in [
8,
9] to obtain the optimal anti-jamming policy by querying the Q-value table. However, the traditional Q-Learning algorithm faces challenges in solving high-dimensional spectrum state problems due to the complexity of the spectrum environment. To overcome this issue, Liu et al. [
10] introduced Deep Reinforcement Learning (DRL) techniques into communication anti-jamming. A Recurrent Convolutional Neural Network (RCNN) is employed to process the infinite spectrum environment states and extract relevant features. Compared with traditional RL methods, the optimized RCNN significantly improves convergence speed, offering a more effective solution.
To combat unknown jamming threats and develop optimal anti-jamming policies, intelligent frequency decision anti-jamming technology has been garnering significant attention. In the pursuit of enhancing communication’s anti-jamming capabilities within dynamic adversarial environments, Chang et al. [
11] introduced an improved anti-jamming method that leverages feature engineering and DRL. This method exhibits superior performance and reduces computational complexity. Liu et al. [
12] proposed a sequential DRL algorithm without prior information to tackle the anti-jamming communication issues in a dynamic, intelligent jamming environment. This algorithm enables the rapid and effective selection of anti-jamming channels. Additionally, Li et al. [
13] presented an anti-jamming algorithm utilizing Parallel Policy Networks based on Deep Q-Networks (DQN). This algorithm adjusts power levels and accesses idle channels simultaneously, aiming to resist intense jamming attacks. Furthermore, it explores dynamic spectrum anti-jamming access. Han et al. [
14] investigated the application of DRL: with the help of a trained channel decision network, secondary users are guided to flexibly access the spectrum environment in the presence of jamming. In addition, Li et al. [
15] delved into the frequency selection problem in a jamming environment with a vast number of available frequencies. They proposed a hierarchical DRL algorithm that effectively addresses various jamming scenarios, even without prior knowledge of the jamming mode and channel model. Collectively, these studies present various approaches that utilize DRL to enhance intelligent anti-jamming techniques. They demonstrate promising policies that adapt to dynamic jamming environments, optimize channel selection, and successfully resist jamming attacks without prior knowledge of the jamming type or channel model.
While the intelligent frequency decision anti-jamming algorithms discussed in [
11,
12,
13,
14,
15] have shown promise in combating jamming, they share a common assumption: the existence of an unjammed control channel for direct information transmission. However, in unknown and dynamic environments where information is incomplete, maintaining a reliable control channel is often impractical. It becomes difficult for communication terminals to exchange policy information without a predetermined frequency change sequence. Consequently, achieving FH synchronization autonomously becomes a significant challenge in the absence of a control channel.
In consideration of the limitations associated with the ideal control channel in existing intelligent frequency decision anti-jamming techniques, this paper focuses on studying an intelligent frequency decision scheme that does not rely on an additional control channel. The research objectives of this study are as follows:
How to design a communication system without communication negotiation?
How to design a game model to ensure that the two terminals involved in the communication can achieve convergence and converge to the optimal outcome?
How to design an algorithm that allows the two terminals to learn synchronously and find an equilibrium?
The rest of the paper is organized as follows. In
Section 2, we provide a review of the related work in the field, discussing the existing research on intelligent frequency decision communication. In
Section 3, a framework for intelligent frequency decision communication is introduced, which does not rely on a control channel. In
Section 4, an OPG model is constructed and analyzed. The NE existence of the model is proved. In
Section 5, a two-agent frequency decision learning algorithm based on the best-response policy is proposed. The algorithm is designed to converge to the NE. In
Section 6, the simulation results are presented and thoroughly analyzed. In
Section 7, conclusions are drawn. All frequently used abbreviations in our work are listed in the Abbreviations section.
2. Related Work
The problem of anti-jamming, based on game theory, has been extensively researched. Game theory provides a useful framework for modeling, implementing jamming countermeasures, and determining optimal anti-jamming policies. Xu et al. [
16] took a bird’s eye view of the issue of multi-user anti-jamming spectrum access and developed a game model to prove the existence of an NE. Jia et al. [
17] presented a dynamic game approach to dealing with the problem of anti-jamming channel selection. They propose a distributed algorithm that converges to the NE of the game in a dynamic environment. However, applying game theory methods to the intricate and ever-changing electromagnetic environment often requires access to upgrading precise jamming parameters and environmental information, which can be challenging to obtain in real-world scenarios.
With the rapid development of artificial intelligence technology, the practical value of RL is becoming more evident. In a dynamically jammed wireless channel, knowledge of the jamming parameters is no longer necessary. Instead, the agent interacts directly with the environment to learn how to counter jamming and gain an advantageous position in real-world combat scenarios. In [
18], Xiao et al. examined the problem of anti-jamming power control for secondary users in large-scale cooperative cognitive radio networks. They utilized RL methods, such as Q-learning [
19], to achieve optimal anti-jamming power. However, as the spectrum environment grows more complex, the curse of dimensionality may bring extra difficulties for the RL algorithm. DRL combines the powerful perception capabilities of deep learning in vision and other areas with the decision-making abilities of RL, enabling end-to-end learning and partially alleviating the issue of dimensionality. Despite the increasing complexity of dynamic jamming in wireless channels, DRL has been successfully applied in the field of communication to handle intelligent anti-jamming decision-making in high-dimensional, complex, and dynamic environments, yielding significant results. Specifically, the literature [
11,
12,
13] demonstrated how the agent can effectively achieve intelligent anti-jamming communication by utilizing the DRL algorithm to perceive environmental spectrum information.
However, as the number of wireless communication devices continues to grow, it becomes increasingly challenging for agents to develop proficient policies, understand tactics, and effectively collaborate in multi-user environments. Solely relying on a single agent is insufficient to solve the coordination problem among multiple agents. Therefore, many existing intelligent anti-jamming technologies have been extended to include research on multi-agent systems. In [
20], Yao et al. examined the problem of anti-jamming defense in multi-user scenarios. They utilized a Markov Game framework for modeling and analysis and proposed a cooperative multi-agent anti-jamming algorithm (CMAA) to obtain optimal anti-jamming policies. In [
21], a distributed multi-agent reinforcement learning (MARL) anti-jamming algorithm was proposed to address the challenge of relying on a reliable control channel for information exchange between users in multi-agent cooperative learning algorithms. While some other works [
20,
21] have conducted research on the cooperative anti-jamming problem in multi-user scenarios, they all rely on the control channel to facilitate information interaction among users. However, it is often arduous to achieve an ideal control channel in real-world environments, and the decisions made by one agent can greatly impact other agents. Consequently, studying an intelligent frequency decision communication method that does not impose strong requirements on a control channel is of great theoretical and practical value.
In response to the above challenges, an OPG model is introduced, and it is deeply analyzed to prove the existence of NE in the game. To verify the model’s capability to converge to an NE, a two-agent DRL algorithm is proposed based on the best-response policy. The proposed method eliminates the need for information exchange among users, allowing them to achieve FH synchronization via self-learning.
3. Intelligent Frequency Decision Communication Framework without Control Channel Assistance
As illustrated in
Figure 1, an intelligent frequency decision communication system without a control channel is introduced. The system comprises a pair of intelligent nodes and one or more jamming nodes. In this scenario, the communication terminals have not prearranged an FH sequence and are all functioning within a jamming environment. Each node in the system is equipped with a transmitter, receiver, agent, and sensing device. Notably, node B utilizes a multi-channel receiver, enabling it to simultaneously receive signals across multiple channels and thereby increasing the chances of capturing the transmission frequency. The data transmission process between node A and node B is illustrated in
Figure 2. Node A serves as the primary node and is responsible for initiating communication, while node B acts as a secondary node. During a round of data transmission, the primary node A selects a communication frequency based on the sensing information to initiate communication, while the secondary node B selects a group based on the sensing information and waits for reception. If node B scans and identifies node A’s transmission frequency within its selected group of frequencies, a frequency match is achieved. If node B successfully receives the data information, it is expected to respond rapidly by sending confirmation information back to node A on the same frequency. If node A also successfully receives the confirmation information, a normal round of communication is completed.
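To make the handshake concrete, the following minimal Python sketch simulates one round of the process described above. The channel count, the 3 × 3 group layout, and the function name one_round are illustrative assumptions, not the paper’s implementation.

```python
# Hypothetical layout: 9 channels split into 3 groups of 3 (illustrative only).
GROUPS = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}

def one_round(freq_a: int, group_b: int, jammed: set) -> bool:
    """Return True if a full uplink + downlink round succeeds.

    freq_a  : frequency chosen by primary node A for the uplink
    group_b : group index on which node B's multi-channel receiver listens
    jammed  : channels hit by the jammer during this time slot
    """
    # Uplink: B captures A's signal only if A's frequency falls in B's group
    # and the chosen channel is not jammed.
    if freq_a not in GROUPS[group_b] or freq_a in jammed:
        return False
    # Downlink: B acknowledges on the same frequency, which must also escape jamming.
    return freq_a not in jammed

print(one_round(freq_a=5, group_b=1, jammed={2}))  # True: matched and unjammed
print(one_round(freq_a=5, group_b=0, jammed={2}))  # False: frequency not in B's group
```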
To facilitate modeling the transmission process, continuous time is divided into a series of equal-length time slots. The process within the whole simulation time can be regarded as an iterative process, and the two communication terminals conduct one round of communication in each time slot. As shown in
Figure 3, each time slot $t$ is further divided into uplink and downlink sub-slots of equal duration. Let $N$ denote the number of available wireless transmission channels, each with bandwidth $B$. The set of communication frequencies is $F = \{f_1, f_2, \dots, f_N\}$, and the entire communication frequency band is divided into distinct groups $G = \{g_1, g_2, \dots, g_M\}$. Each group $g_m$ contains several frequencies, e.g., $g_1 = \{f_1, f_2, f_3\}$. As mentioned earlier, during a single round of data transmission, node A selects one frequency for transmission; we denote the frequency selected by node A as $f_A \in F$. Node B, by contrast, utilizes a multi-channel receiver and can select only one group for reception at a time; we denote the group selected by node B as $g_B \in G$. Simultaneously, the jammer selects a jamming frequency $f_J \in F$, with jamming power $P_J$, to disrupt data transmission. Assuming that in time slot $t$ node A transmits on $f_A(t)$ and the jammer occupies $f_J(t)$, the received Signal-to-Interference-plus-Noise Ratio (SINR) of the receiver is expressed as follows:
$$\mathrm{SINR}(t) = \frac{P_s h_s}{P_n + P_J \,\mathbb{1}\{f_A(t) = f_J(t)\}},$$
where $P_s$ is the transmit power, $h_s$ is the link gain from transmitter to receiver, $P_n$ is the background noise power, and the indicator function $\mathbb{1}\{\cdot\}$ is equal to one if the condition inside the braces is true and zero otherwise.
In addition, $\beta_{th}$ represents the minimum detectable SINR threshold. If the output SINR of the receiver is greater than or equal to this threshold, data reception is successful; otherwise, data reception fails. The reward function $r(t)$ is defined as follows:
$$r(t) = \begin{cases} 1, & \mathrm{SINR}(t) \ge \beta_{th}, \\ -1, & \text{otherwise.} \end{cases}$$
Due to the dynamic properties of jamming, the agent is unable to determine the current jamming state directly. To record the current jamming state, we assume that the spectrum vector detected by the sensing device in time slot $t$ is as follows:
$$o(t) = \left[ o_1(t), o_2(t), \dots, o_N(t) \right],$$
In the above Equation (3), $o_n(t)$ denotes the power sensed on channel $n$ in time slot $t$.
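As a concrete reading of Equations (1) and (2), the sketch below computes the received SINR with the indicator term and maps it to the ±1 reward; the linear power values, the unit link gain, and the extra group-scanning condition are illustrative assumptions.

```python
import math

GROUPS = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}  # illustrative grouping

def sinr_db(p_tx_w, gain, p_noise_w, p_jam_w, freq_tx, freq_jam):
    """SINR of Equation (1): the jamming term is active only when the jammer
    hits the transmission frequency (the indicator function)."""
    interference = p_jam_w if freq_tx == freq_jam else 0.0
    return 10.0 * math.log10(p_tx_w * gain / (p_noise_w + interference))

def reward(freq_tx, group_rx, freq_jam, p_tx_w=0.1, gain=1.0,
           p_noise_w=1e-11, p_jam_w=1.0, sinr_threshold_db=6.0):
    """Reward of Equation (2): +1 if the receiver scans the right group and the
    SINR clears the 6 dB threshold, -1 otherwise."""
    in_group = freq_tx in GROUPS[group_rx]
    good = sinr_db(p_tx_w, gain, p_noise_w, p_jam_w, freq_tx, freq_jam) >= sinr_threshold_db
    return 1 if (in_group and good) else -1

print(reward(freq_tx=5, group_rx=1, freq_jam=7))  # 1: matched group, no co-channel jamming
print(reward(freq_tx=5, group_rx=1, freq_jam=5))  # -1: jammer hits the transmission channel
```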
4. Potential Game Model of Frequency Decision without Information Interaction
Stochastic games (SGs) are a combination of Markov Decision Processes (MDPs) and game theory. They provide a framework to describe dynamic game processes where multiple decision-makers interact and make decisions repeatedly in various states. MDPs are primarily utilized to address decision problems in uncertain environments, while game theory offers tools to analyze the interactions among decision-makers. In our intelligent frequency decision communication system, the decision is jointly determined by the two intelligent nodes, and the process can be vividly illustrated and simulated by using an SG model [
22]. This model captures the frequency decision problem for both the receiver and the transmitter, taking into account the uncertainties and interactions involved in the communication process.
Definition 1. The decision process of the two intelligent nodes is formulated as an SG, which is defined by a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$.
$\mathcal{N}$ denotes the set of decision-makers participating in the game. In this paper, specifically when $|\mathcal{N}| = 2$, the decision-makers involved in the game are node A and node B.
$\mathcal{S}$ denotes the state space. The global state space $\mathcal{S} = \mathcal{S}^A \times \mathcal{S}^B$ consists of the two state spaces of node A and node B. At time $t$, the global state formed by the corresponding states of A and B is denoted as $s_t = (s_t^A, s_t^B)$.
$\mathcal{A} = \mathcal{A}^A \times \mathcal{A}^B$ denotes the joint policy space, and $a_t = (a_t^A, a_t^B)$ denotes the joint anti-jamming policy of node A and node B, where $\mathcal{A}^A$ represents the set of policies available to node A and $\mathcal{A}^B$ represents the set of policies available to node B.
$R$ denotes the reward function.
$P$ denotes the state transition probability function of the SG.
$\gamma$ is the discount factor.
To analyze the evolution of the jamming state, we consider the most recent $L$ time slots and define the state of a single agent as the spectrum waterfall $s_t = \left[ o(t-L+1), \dots, o(t) \right]$, where $L$ represents the length of the historical window [
10]. In
Figure 4, the spectrum waterfall of node A in the non-convergent state, denoted as $s_t^A$, is a two-dimensional matrix of size $N \times L$ consisting of channels and time slots. Not only is the distribution of the signal in the time and frequency domains shown by this figure, but also the intensity of the signal is shown by the color depth. At time $t$, the agent, based on the current state $s_t$, selects a policy from the action set $\mathcal{A}$ and receives a reward $r_t$ from the environment. Subsequently, the agent transitions to the next state $s_{t+1}$ according to the state transition probability $P(s_{t+1} \mid s_t, a_t)$.
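One possible way to maintain this sliding-window state in code is sketched below; the dimensions (9 channels, 10 history slots) and the class name WaterfallState are assumptions for illustration.

```python
from collections import deque
import numpy as np

N_CHANNELS = 9    # assumed number of channels
HISTORY_L = 10    # assumed history length (100 ms of 10 ms slots)

class WaterfallState:
    """Keeps the last L sensing vectors o(t) and exposes them as an N x L matrix."""

    def __init__(self, n_channels=N_CHANNELS, history=HISTORY_L):
        self.window = deque([np.zeros(n_channels)] * history, maxlen=history)

    def push(self, spectrum_vector):
        """Append the newest sensing vector and return the current waterfall."""
        self.window.append(np.asarray(spectrum_vector, dtype=float))
        # Columns ordered oldest -> newest; shape (N_CHANNELS, HISTORY_L).
        return np.stack(self.window, axis=1)

state = WaterfallState()
waterfall = state.push(np.random.rand(N_CHANNELS))
print(waterfall.shape)  # (9, 10)
```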
In potential games, the change in the utility function of each decision-maker resulting from a policy is proportionally mapped to the global potential function. In this paper, for the intelligent frequency decision communication scenario without a control channel, the utility of each node is designed to make it suitable for a potential game [
23]. This ensures that the independent decisions of the communication terminals can ultimately converge to the optimal joint anti-jamming policy. In this context, the utility function is formulated as an indicator function whose value depends only on the reward function: the reward is 1 when data are successfully received and −1 when they are not. The utility function $u_i$ of decision-maker $i$ is defined as follows:
$$u_i(a_i, a_{-i}) = \begin{cases} 1, & \text{data are successfully received}, \\ -1, & \text{otherwise}, \end{cases}$$
In the above Equation (4), $a_{-i}$ represents the combination of policies of the other decision-makers, excluding decision-maker $i$. Each decision-maker can adjust its own policy by analyzing the policies $a_{-i}$ of the other decision-makers, aiming to maximize the value of its utility function $u_i$:
$$a_i^{*} = \arg\max_{a_i \in \mathcal{A}^i} u_i(a_i, a_{-i}).$$
The decision-maker’s policy will be iteratively adjusted, and the value of the utility function will demonstrate a monotonic change as the policy is coordinated. Via a finite number of iterations, it will ultimately converge to a stable state known as NE.
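A minimal sketch of this best-response adjustment is given below: each node repeatedly switches to the action that maximizes its utility given the other node’s current action, and the loop stops at a profile that no unilateral change can improve. The toy utility table (6 frequencies, 2 groups) is ours, not the paper’s.

```python
import numpy as np

# Toy utility table u[a_A, g_B]: +1 when A's frequency lies in B's group, else -1.
groups = {0: [0, 1, 2], 1: [3, 4, 5]}
u = np.array([[1 if f in groups[g] else -1 for g in groups] for f in range(6)])

a_A, g_B = 5, 0           # arbitrary starting policies
for _ in range(10):       # alternating best responses
    new_a_A = int(np.argmax(u[:, g_B]))       # A's best response to B's group
    new_g_B = int(np.argmax(u[new_a_A, :]))   # B's best response to A's frequency
    if (new_a_A, new_g_B) == (a_A, g_B):
        break                                  # no profitable unilateral deviation: an NE
    a_A, g_B = new_a_A, new_g_B

print(a_A, g_B, u[a_A, g_B])  # a matched frequency/group pair with utility +1
```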
Definition 2. As previously defined, the policy of node A is denoted as $a_A$ and the policy of node B is denoted as $a_B$.
In the non-cooperative game model, a policy profile $(a_A^{*}, a_B^{*})$ is an NE if neither player can improve its utility or payoff by unilaterally changing its policy:
$$u_A(a_A^{*}, a_B^{*}) \ge u_A(a_A, a_B^{*}), \quad \forall a_A \in \mathcal{A}^A,$$
$$u_B(a_A^{*}, a_B^{*}) \ge u_B(a_A^{*}, a_B), \quad \forall a_B \in \mathcal{A}^B.$$
The policy profile $(a_A^{*}, a_B^{*})$ is adopted to represent the NE of the game. In the intelligent frequency decision communication system, the utility functions of node A and node B are represented as $u_A(a_A, a_B)$ and $u_B(a_A, a_B)$, respectively. The magnitude of each utility value is influenced by both policy $a_A$ and policy $a_B$. As the existence of an NE cannot be guaranteed in general, potential games are introduced to analyze the NE problem.
Definition 3. If there exists an ordinal potential function $\Phi$ such that, for every decision-maker $i$ and for all $a_i, a_i' \in \mathcal{A}^i$ and $a_{-i}$:
$$u_i(a_i', a_{-i}) - u_i(a_i, a_{-i}) > 0 \iff \Phi(a_i', a_{-i}) - \Phi(a_i, a_{-i}) > 0,$$
then the game can be classified as an OPG. A potential function is constructed to prove that the problem of intelligent frequency decision communication without a control channel is an OPG [
24] problem. This ensures that it can converge to an NE within a finite number of iterations.
Theorem 1. The intelligent frequency decision communication problem without a control channel is an OPG. An ordinal potential function $\Phi$ is defined as the sum of the utility values of all nodes:
$$\Phi(a_A, a_B) = u_A(a_A, a_B) + u_B(a_A, a_B).$$
To establish that the aforementioned problem is an OPG, it is crucial to demonstrate that whenever a player updates its policy from $a_i$ to $a_i'$ and thereby increases its utility value, the overall potential function $\Phi$ increases as well.
A proof of Theorem 1 is provided in
Appendix A. By constructing the ordinal potential function $\Phi$, it is proven that the intelligent frequency decision communication problem without a control channel is an OPG. Theorem 1 guarantees the existence of at least one NE solution. Moreover, the equilibrium that maximizes the ordinal potential function is also a Pareto optimal solution [
25,
26]. Pareto optimality [
27] refers to a combination of policies that maximizes the utility of all players involved, thereby constituting the global optimal solution to the problem.
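The ordinal potential condition can also be checked numerically. The sketch below builds a small common-reward utility table (an illustrative stand-in for the game above), uses the sum of utilities as the potential of Theorem 1, and verifies that every unilateral utility change has the same sign as the corresponding change in the potential.

```python
import numpy as np
from itertools import product

# Illustrative common-reward table: u_A = u_B = r(a_A, g_B) for 6 frequencies, 2 groups.
groups = {0: [0, 1, 2], 1: [3, 4, 5]}
r = np.array([[1 if f in groups[g] else -1 for g in groups] for f in range(6)])
u_A, u_B = r, r
phi = u_A + u_B                      # potential of Theorem 1: sum of utilities

def same_sign(x, y):
    return (x > 0) == (y > 0) and (x < 0) == (y < 0)

ok = True
for a, g in product(range(6), range(2)):
    for a2 in range(6):              # unilateral deviation by node A
        ok &= same_sign(u_A[a2, g] - u_A[a, g], phi[a2, g] - phi[a, g])
    for g2 in range(2):              # unilateral deviation by node B
        ok &= same_sign(u_B[a, g2] - u_B[a, g], phi[a, g2] - phi[a, g])

print("ordinal potential condition holds:", ok)  # True for this common-reward game
```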
5. Two-Agent Frequency Decision Learning Algorithm
5.1. Q-Learning and DQN Algorithm
Q-learning is a classical RL algorithm. Traditional Q-learning stores the Q-value of each state–action pair in a Q-value table. The goal of RL is to maximize the long-term future reward. The environment gives the agent a reward $r_t$ after it takes an action $a_t$ in state $s_t$. The cumulative discounted return is defined as
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}.$$
In an MDP, the agent takes an action $a$ according to the policy $\pi$ at the state $s$, and the expected return is defined as the state–action value function:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right].$$
The optimal state–action value function selects, among the state–action value functions generated by all policies, the one with the largest value:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).$$
In practice, however, Q-learning in tabular form is difficult to implement: in practical applications, the number of states and actions is often very large, so maintaining the Q-value table requires substantial storage space and computing resources. To solve this problem, neural networks are used instead of Q-value tables. DQN uses deep neural networks to approximate the optimal action-value function and successfully handles problems with huge state and action spaces. The update procedure for the Q-function can be defined as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right].$$
Here, the Q-function fits the long-term payoff evaluation of a state–action pair. $s_t$ and $a_t$ denote the agent’s current state and action, respectively; $Q(s_t, a_t)$ denotes the Q-value corresponding to the action performed in the current state $s_t$; $\max_{a'} Q(s_{t+1}, a')$ is the maximum of all Q-values in the next state $s_{t+1}$; $\alpha$ is the learning rate of the neural network; and $\gamma$ is the discount factor.
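For reference, a compact tabular version of this update rule is shown below; the environment is stubbed with random transitions and ±1 rewards, so the numbers are purely illustrative.

```python
import numpy as np

n_states, n_actions = 4, 3
alpha, gamma = 0.5, 0.8           # learning rate and discount factor (values from Section 5.4)
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

rng = np.random.default_rng(0)
s = 0
for _ in range(100):
    a = rng.integers(n_actions)
    r = rng.choice([1, -1])        # stand-in for the +/-1 reward of Equation (2)
    s_next = rng.integers(n_states)
    q_update(s, a, r, s_next)
    s = s_next
```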
In DQN, the idea of function approximation is used to find a set of optimal parameters and thus determine the best-fitting function. The whole training process can be viewed as a process in which the Q-value constantly approaches the target Q-value. The Q-value represents the current estimate, and the target Q-value is the score that the agent obtains by interacting with the environment. During training, the target Q-value is used to update the original Q-value. The loss function can be defined as follows:
$$L(\theta) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta) \right)^2 \right],$$
In the above Equation (14), $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$ is the target Q-value, $\theta$ denotes the parameters of the prediction network, and $\theta^{-}$ denotes the parameters of the target network. After the weight parameters of the prediction network are updated, the target network copies them to update its own weight parameters. In DQN, experience replay is used to train the agent: the agent first explores the environment and stores the resulting experience; once enough experience has accumulated, a batch of experience is randomly sampled for training.
In practice, agents often become stuck in local optima, resulting in poor learning. To solve this problem, the $\epsilon$-greedy policy is often used to make the agent fully explore the environment and find the optimal decision. Each time the agent selects an action, it selects the action with the largest Q-value with probability $1 - \epsilon$ and a random action with probability $\epsilon$. The value of $\epsilon$ decreases as the agent explores. The greedy policy is defined as follows:
$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a), & \text{with probability } 1 - \epsilon, \\ \text{a random action}, & \text{with probability } \epsilon. \end{cases}$$
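A short sketch of ε-greedy selection with a decaying ε, as used later in the training loop; the decay schedule shown is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Pick the argmax action with probability 1 - epsilon, a random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995   # assumed decay schedule
for step in range(3):
    action = epsilon_greedy(np.array([0.1, 0.7, 0.2]), epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)  # epsilon shrinks as exploration proceeds
```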
5.2. Two-Agent Frequency Decision Learning Algorithm Flow
Multi-agent learning algorithms, such as MARL, provide a convenient approach to verify the convergence of a game model toward an NE. Therefore, this paper proposes a two-agent RL algorithm based on best response (Algorithm 1). Our novel method aims to investigate the optimal anti-jamming policy and the equilibrium of the dynamic game between the two terminals of communication. The fundamental concept of this algorithm is to address the challenge that the two terminals of communication cannot acquire the global state. Each node serves as an individual agent that interacts with the jamming environment over multiple iterations. The $\epsilon$-greedy policy is used to adjust the policy according to the perceived environmental state information, and the corresponding channel is selected to increase the utility value, finally realizing the intelligent frequency decision. The policy is chosen randomly during the initial time slots, after which the $\epsilon$-greedy policy is adopted.
During the training process of the anti-jamming algorithm, each node $i \in \{A, B\}$ is equipped with two neural networks: an online Q-network with parameters $\theta_i$ is utilized for action reasoning, while a target Q-network with parameters $\theta_i^{-}$ is employed for parameter update learning [
28]. To enhance learning efficiency, an experience replay technique is employed, which allows for the reuse of previous experiences stored in the experience replay pool. This approach effectively breaks the correlation among experience data and maximizes data utilization. During each iteration, a batch of experience values is randomly sampled from the experience replay pool, and these experiences are used to update the parameters of the neural network. The Q-value update is performed as follows:
$$y_t^i = r_t^i + \gamma \max_{a'} Q\!\left(s_{t+1}^i, a'; \theta_i^{-}\right).$$
The loss function of the online Q-network is minimized as follows:
$$L_i(\theta_i) = \mathbb{E}\left[ \left( y_t^i - Q\!\left(s_t^i, a_t^i; \theta_i\right) \right)^2 \right],$$
In the above Equation (17), $\theta_i$ represents the weight parameter of the online Q-network; the target value $y_t^i$, produced by the target Q-network together with the observed reward $r_t^i$, is compared with the online network’s prediction; and $\theta_i^{-}$ represents the weight parameter of the target Q-network.
Stochastic Gradient Descent (SGD) is utilized to train the parameters of the online Q-network. The gradient of the loss function $L_i(\theta_i)$ with respect to $\theta_i$ is calculated as follows:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\left[ \left( y_t^i - Q\!\left(s_t^i, a_t^i; \theta_i\right) \right) \nabla_{\theta_i} Q\!\left(s_t^i, a_t^i; \theta_i\right) \right].$$
A coefficient $\tau$ is adopted as the update rate to adjust the parameters of the target Q-network. After enough iterations, the weight parameters of the target Q-network are updated by applying a soft update between the current target Q-network parameters and the online Q-network parameters:
$$\theta_i^{-} \leftarrow \tau\, \theta_i + (1 - \tau)\, \theta_i^{-}.$$
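The target computation, loss, gradient step, and soft update can be written compactly in TensorFlow, the framework named in Section 5.4; the tiny fully connected network, input size, and batch below are placeholders rather than the paper’s architecture.

```python
import numpy as np
import tensorflow as tf

n_actions, gamma, tau = 3, 0.8, 0.01

def build_q_net():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(n_actions),
    ])

online_q, target_q = build_q_net(), build_q_net()
target_q.set_weights(online_q.get_weights())
optimizer = tf.keras.optimizers.Adam()

def train_step(states, actions, rewards, next_states):
    # Target value y = r + gamma * max_a' Q(s', a'; theta^-).
    y = rewards + gamma * tf.reduce_max(target_q(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = online_q(states)
        q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_sa))   # squared error, as in Equation (17)
    grads = tape.gradient(loss, online_q.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_q.trainable_variables))
    # Soft update: theta^- <- tau * theta + (1 - tau) * theta^-.
    for w_t, w_o in zip(target_q.weights, online_q.weights):
        w_t.assign(tau * w_o + (1.0 - tau) * w_t)
    return float(loss)

batch = 4
train_step(tf.random.normal((batch, 8)),
           tf.constant(np.random.randint(0, n_actions, batch)),
           tf.random.uniform((batch,)),
           tf.random.normal((batch, 8)))
```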
After the training process, the parameters $\theta$ of the neural network are saved. In a complex and dynamic jamming environment, both terminals involved in the communication only need to load their neural network parameters locally after the training process. This allows them to respond to the observed local environment state without further extensive training.
Algorithm 1: Two-Agent DRL Algorithm Based on Best Response (Training Phase)
Input: Experience replay pools $D_A$, $D_B$; network parameters $\theta_A$, $\theta_B$, $\theta_A^{-}$, $\theta_B^{-}$; initial states $s_0^A$, $s_0^B$.
Output: Parameters of the trained networks $\theta_A^{*}$, $\theta_B^{*}$.
1: for $t = 1$ to $T$ do
2:   if $t$ lies within the initial random-exploration time slots then
3:     Nodes randomly select an anti-jamming policy based on the current state of the game;
4:   else
5:     The anti-jamming policy is selected according to the $\epsilon$-greedy policy;
6:   Node A performs the action $a_t^A$, node B performs the action $a_t^B$; the utility value is obtained according to Equation (4), and the game transfers to the next state $s_{t+1}$;
7:   The experience values $(s_t^i, a_t^i, u_t^i, s_{t+1}^i)$ are put into the experience replay pool $D_i$ of node $i$, respectively;
8:   for $i \in \{A, B\}$ do
9:     A batch of experience values is randomly sampled from the experience replay pool $D_i$;
10:    Gradient descent is performed on the loss function to update $\theta_i$ according to Equation (19);
11:    The target network parameters $\theta_i^{-}$ are updated at intervals;
12:  end for
13: end for
14: until reaching an NE
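To convey the flow of Algorithm 1 without the full DQN machinery, the toy script below lets two independent tabular ε-greedy learners (standing in for the two Q-networks) play the frequency-matching game against a sweeping jammer. The state definition, table sizes, and schedules are simplified assumptions, not the paper’s implementation; the final line simply reports how many jammer positions end up with a jam-free, synchronized greedy choice.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
groups = {0: [0, 1, 2], 1: [3, 4, 5]}
alpha, gamma = 0.5, 0.8
Q_A = np.zeros((N, N))   # node A: state (jammer position) x chosen frequency
Q_B = np.zeros((N, 2))   # node B: state (jammer position) x chosen group

def utility(freq, group, jam):
    """Common +/-1 utility: success iff A's frequency is in B's group and unjammed."""
    return 1 if (freq in groups[group] and freq != jam) else -1

epsilon = 1.0
for t in range(20000):
    s = t % N                       # sweeping jammer position doubles as the state
    a_A = int(rng.integers(N)) if rng.random() < epsilon else int(np.argmax(Q_A[s]))
    a_B = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q_B[s]))
    u = utility(a_A, a_B, jam=s)    # shared utility, as in Equation (4)
    s_next = (t + 1) % N
    # Independent Q-learning updates: no information is exchanged between the nodes.
    Q_A[s, a_A] += alpha * (u + gamma * Q_A[s_next].max() - Q_A[s, a_A])
    Q_B[s, a_B] += alpha * (u + gamma * Q_B[s_next].max() - Q_B[s, a_B])
    epsilon = max(0.02, epsilon * 0.9995)

greedy = [(int(np.argmax(Q_A[s])), int(np.argmax(Q_B[s]))) for s in range(N)]
hits = sum(utility(a, g, jam=s) == 1 for s, (a, g) in enumerate(greedy))
print(f"jam-free synchronized jammer positions: {hits}/{N}")
```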
5.3. Computational Complexity Analysis
Considering the memory limitation and computational power scheduling, the algorithmic complexity of the network model is thoroughly analyzed. The online Q-network and the target Q-network are set to have the same structure, consisting of one convolutional layer and two fully connected layers. In the convolutional layer, 16 convolution kernels of size 10 × 4 are utilized and the stride is set to 2. The first fully connected layer consists of 512 neurons, while the number of neurons in the second fully connected layer matches the size of the action space for a single agent. The overall computational complexity for one forward propagation can be calculated as follows [
29]:
$$O\!\left( \sum_{l=1}^{L_c} M_l^{2} K_l^{2} C_{l-1} C_l + \sum_{v=1}^{L_f} n_{v-1} n_v \right),$$
where $L_c$ represents the number of convolutional layers in the neural network; $l$ indexes the $l$-th convolutional layer; the time complexity of a single convolutional layer is determined by the dimensions $M_l$ of the output feature map, the dimensions $K_l$ of the convolution kernel, the number of input feature map channels $C_{l-1}$, and the number of output feature map channels $C_l$. Similarly, $L_f$ represents the number of fully connected layers in the neural network, $v$ indexes the $v$-th fully connected layer, and the time complexity of each fully connected layer is determined by the number of input neurons $n_{v-1}$ and the number of output neurons $n_v$.
When updating the network parameters, a batch of samples is randomly selected for each node from its experience replay pool. During the gradient descent process using the loss function, each sampled experience requires one forward propagation and one backward propagation. As a result, each round of the game involves a computational complexity proportional to the batch size multiplied by the single-pass complexity given above.
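The complexity expression can be evaluated directly for the architecture described above. In the sketch below, the input size (a 10-slot by 9-channel waterfall with one input feature map) and the action-space size of 6 are assumptions used only to make the arithmetic concrete.

```python
def conv_macs(h_in, w_in, k_h, k_w, c_in, c_out, stride):
    """Multiply-accumulate count of one convolutional layer (no padding assumed)."""
    h_out = (h_in - k_h) // stride + 1
    w_out = (w_in - k_w) // stride + 1
    return h_out * w_out * k_h * k_w * c_in * c_out, (h_out, w_out, c_out)

def fc_macs(n_in, n_out):
    return n_in * n_out

# Assumed input: 10 time slots x 9 channels, 1 input feature map; 16 kernels of 10x4, stride 2.
macs_conv, (h, w, c) = conv_macs(10, 9, 10, 4, 1, 16, stride=2)
flat = h * w * c
macs_fc1 = fc_macs(flat, 512)
macs_fc2 = fc_macs(512, 6)          # 6 = assumed single-agent action-space size
total = macs_conv + macs_fc1 + macs_fc2
print(macs_conv, macs_fc1, macs_fc2, total)
```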
5.4. Description of Simulation Parameters
In this paper, the simulation experiments utilize the TensorFlow 2.8.0 deep learning framework to build the network model. Python 3.9 is used on Windows 11 for our simulations. The simulation is run on the PyCharm platform with a GeForce RTX 4060 GPU, and the CPU used in the experiments is an Intel Core i5-13500HX. Adam is chosen as the optimizer during the training of the neural network. During the simulation, the number of nodes is 2, and the nodes and jammers are deployed in the same environment and may transmit on any channel. The channel bandwidth $B$ is 1 MHz. The transmit power of the two nodes is 100 dBm, the length of each time slot is 10 ms, the history duration of the spectrum waterfall is 100 ms, the transmit power of the jammer is 1000 dBm, the SINR threshold $\beta_{th}$ is 6 dB, and the background noise power is −80 dBm. The learning rate $\alpha$ of the algorithm is set to 0.5, the discount factor $\gamma$ is set to 0.8, the greedy factor $\epsilon$ is set to 1.0, and the soft update coefficient $\tau$ is set to 0.01. When updating the network parameters, each node is trained by randomly sampling a batch of 128 samples from the experience replay pool, which has a capacity of 5000. Different simulation conditions are established by adjusting the simulation time, the element combination of a single group in the group set $G$, the number of groups, and the number of available channels $N$. The remaining parameters of the simulation experiment are shown in
Table 1.
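For convenience, the hyperparameters listed above can be gathered into a single configuration object; the following sketch simply mirrors the stated values, with field names of our own choosing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimConfig:
    # Values as stated in Section 5.4; field names are illustrative.
    num_nodes: int = 2
    bandwidth_mhz: float = 1.0
    slot_ms: float = 10.0
    history_ms: float = 100.0
    sinr_threshold_db: float = 6.0
    noise_dbm: float = -80.0
    learning_rate: float = 0.5
    discount: float = 0.8
    epsilon_init: float = 1.0
    soft_update_tau: float = 0.01
    batch_size: int = 128
    replay_capacity: int = 5000

cfg = SimConfig()
print(cfg.batch_size, cfg.replay_capacity)
```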
6. Simulation Results and Analysis
To verify the effectiveness of the proposed algorithm, simulation experiments are conducted in three typical jamming environments: sweeping jamming, comb jamming, and dynamic jamming:
- (1)
Sweeping jamming
In this scenario, the jammer performs sweeping jamming at a specific sweep rate. The frequency sweep rate is 1 MHz/ms.
- (2)
Comb jamming
In this scenario, the jammer simultaneously implements jamming at multiple frequencies. The specific frequencies 2 MHz, 6 MHz, and 9 MHz for jamming are selected.
- (3)
Dynamic jamming
In this scenario, the jamming environment is dynamic. It periodically alternates between comb jamming and sweep jamming. Once selected, a jamming mode will remain unchanged within a certain duration of time, which is 10 ms. If switching to comb jamming, the specific frequencies 1 MHz, 5 MHz, and 9 MHz for jamming are selected.
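The three jamming patterns can be generated with a few lines of code; the mapping of the 1 MHz/ms sweep onto 1 MHz-wide channel indices and the 10 ms switching period for the dynamic mode follow our simplified reading of the description above.

```python
def jammed_channels(t_ms: int, mode: str, n_channels: int = 9) -> set:
    """Channels (1-indexed, in MHz) hit by the jammer at time t_ms."""
    if mode == "sweep":                       # linear sweep, 1 MHz (one channel) per ms
        return {t_ms % n_channels + 1}
    if mode == "comb":                        # fixed comb at 2, 6, and 9 MHz
        return {2, 6, 9}
    if mode == "dynamic":                     # alternate every 10 ms between the two modes
        phase = (t_ms // 10) % 2
        return {1, 5, 9} if phase == 0 else {t_ms % n_channels + 1}
    raise ValueError(mode)

print([jammed_channels(t, "sweep") for t in range(4)])   # [{1}, {2}, {3}, {4}]
print(jammed_channels(15, "dynamic"))                    # sweep phase at t = 15 ms -> {7}
```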
Firstly,
Figure 5 and
Figure 6 illustrate the spectrum waterfall in the initial and convergent states, respectively, for a channel number of 9 and a group number of 3, under different jamming environments. The figures reveal that the sweeping jamming technique executes a periodic linear scan of each channel, resulting in narrowband jamming with the jamming frequency exhibiting a linear variation. During the uplink time slot, node A transmits the signal, while node B acts as the receiver. Conversely, during the downlink time slot, node B becomes the transmitter, and node A takes on the role of the receiver.
It can be observed that two figures are shown for each jamming environment: the left figure represents the spectrum waterfall of node A, while the right figure represents the spectrum waterfall of node B. The horizontal axis of each node’s spectrum waterfall represents the time slot, while the vertical axis represents the channel. The figures provide a clear visualization of the available idle channels at each node, as well as the changes in the user’s uplink signal, downlink signal, and jamming signal over time. The user signal is represented by a pair of yellow and green rectangular color blocks: a yellow block indicates that the node is transmitting a signal during that time slot, while a green block indicates that the node is receiving a signal. If a user’s color block overlaps with a jamming block, the user signal is jammed.
Figure 5 illustrates the initial state, where fewer pairs of yellow-green blocks are observed, suggesting that the two communicating nodes have not yet synchronized or have been disrupted by jamming during the communication process. In contrast,
Figure 6 displays the convergence state, where all the yellow and green blocks appear in pairs without overlapping with other color blocks. This indicates that both node A and node B have achieved frequency decision synchronization in various dynamic jamming environments. Although the two terminals of communication are not aware of the specific changes in the jamming environment, they can effectively avoid jamming via the convergence decision after learning, which demonstrates the effectiveness of the algorithm.
Secondly, in order to evaluate the influence of the number of available channels on the proposed algorithm, the number of channels was increased to 12, 15, and 18, respectively, while keeping the number of groups unchanged, and comparative experiments were carried out. By conducting these experiments, we aimed to evaluate the performance of the algorithm in scenarios with an increased number of channels. The results of the experiments are presented in
Figure 7,
Figure 8,
Figure 9 and
Figure 10. From the figures, it can be observed that in all cases, the convergence state is achieved after a certain number of iterations (approximately 4000, 10,000, and 12,000 iterations for 12, 15, and 18 channels, respectively). As the number of channels increases, the decision-making complexity for the two terminals of communication also increases. The environment becomes more intricate, and the probability of achieving synchronization decreases, resulting in a slower convergence speed. Nevertheless, despite these challenges, it is worth noting that the two terminals of communication are still able to converge to the optimal decision even with the increased number of channels. This result demonstrates the effectiveness and applicability of the proposed algorithm in scenarios with a higher number of channels. In conclusion, our experiments indicate that the algorithm proposed in this paper can effectively handle scenarios with different numbers of channels, maintaining its superior performance and showing its potential in real-world applications.
Finally, the performance of the proposed algorithm in three different jamming environments is compared with that of methods that assume the presence of a control channel:
- (1)
PPQN-AJ [
13]: An anti-jamming algorithm based on a DQN parallel policy network is proposed, which adaptively selects power and channel.
- (2)
ADRLA [
10]: The proposed method takes the spectrum environment as the input state and uses DQN to continuously try different actions and sense the spectrum environment in order to learn the optimal anti-jamming policy.
In
Table 2, the normalized throughput performance of different methods under different jamming environments is shown. In the comb-jamming environment, the throughput of the communication system is relatively high. This is because the pattern of comb jamming is relatively stable, and if the uplink is not jammed, then the downlink is necessarily not jammed either. Meanwhile,
Table 2 also provides the number of iterations required to reach the convergence state. It can be observed that there is no significant difference between the proposed algorithm and the method with control channels in terms of performance. The convergence speed of the proposed algorithm is relatively slow. This is mainly because, in the proposed method, both terminals of communication cannot exchange information and need to realize tacit communication via continuous learning and mutual cooperation. Nevertheless, the strength of the proposed method lies in abandoning the conventional approach of relying on control channels for interaction to achieve synchronization and instead realizing a more independent decision-making process.
Experimental results show the efficiency and feasibility of this method. The proposed algorithm exhibits strong robustness and adaptability in typical jamming environments, including sweeping jamming, comb jamming, and dynamic jamming. It effectively mitigates the influence of various jamming signals, ensuring reliable and stable communication.
7. Discussion and Conclusions
In this paper, we focus on the problem of intelligent frequency decisions in the absence of control channels. Our proposed framework avoids the need for a control channel by enabling continuous iteration and adjustment within the communication system. By transforming the frequency decision problem between the transmitter and receiver, we introduce an SG model. Additionally, we design the utility of each node to meet the requirements of OPG, ensuring the attainment of global optimal decision and equilibrium in the dynamic game. To solve the equilibrium policy of a two-agent game, a novel DRL algorithm based on the best response is proposed. Via iterative updates of the policies of both terminals of communication, the algorithm can eventually converge to an NE, even in scenarios where the jamming parameters are unknown. The purpose of introducing RL into the algorithm is to search for the optimal decision. Due to the dynamically changing environment of the two communication terminals, directly determining the optimal decisions that both terminals can reach is challenging. RL is well-suited for exploring unknown environments. Hence, the DRL method is employed to enable the two communication terminals to eventually converge to the same policy consistently via trial and error. Simulation experiments are conducted to verify the anti-jamming performance of the proposed scheme in various jamming scenarios. The results confirm the effectiveness of the scheme.
In future research, there will be ongoing efforts to study and develop a more efficient and flexible intelligent frequency decision communication scheme. This will enable both terminals of communication to rapidly select the optimal joint anti-jamming policy. At the same time, more complex situations will be considered, namely how to achieve frequency decision synchronization in frequency decision networks with three or more participating nodes.