Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism

Mao, Zhongyang; Zhang, Zhilin; Lu, Faping; Pan, Yaozong; Zhang, Tianqi; Kang, Jiafang; Zhao, Zhiyong; You, Yang

doi:10.3390/electronics13183686

Open AccessArticle

Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism

by

Zhongyang Mao

¹,

Zhilin Zhang

^1,*

,

Faping Lu

¹,

Yaozong Pan

¹

,

Tianqi Zhang

²,

Jiafang Kang

¹,

Zhiyong Zhao

¹ and

Yang You

³

¹

Shandong Provincial Key Laboratory, Naval Aviation University, Yantai 264001, China

²

PLA 92522 Unit, Yantai 264001, China

³

PLA 91001 Unit, Beijing 100000, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(18), 3686; https://doi.org/10.3390/electronics13183686

Submission received: 4 August 2024 / Revised: 31 August 2024 / Accepted: 9 September 2024 / Published: 17 September 2024

(This article belongs to the Special Issue Parallel, Distributed, Edge Computing in UAV Communication)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

As humans continue to exploit the ocean, the number of UAV nodes at sea and the demand for their services are increasing. Given the dynamic nature of marine environments, traditional resource allocation methods lead to inefficient service transmission and ping-pong effects. This study enhances the alignment between network resources and node services by introducing an attention mechanism and double deep Q-learning (DDQN) algorithm that optimizes the service-access strategy, curbs action outputs, and improves service-node compatibility, thereby constituting a novel method for UAV network resource allocation in marine environments. A selective suppression module minimizes the variability in action outputs, effectively mitigating the ping-pong effect, and an attention-aware module is designed to strengthen node-service compatibility, thereby significantly enhancing service transmission efficiency. Simulation results indicate that the proposed method boosts the number of completed services compared with the DDQN, soft actor–critic (SAC), and deep deterministic policy gradient (DDPG) algorithms and increases the total value of completed services.

Keywords:

attention mechanism; maritime wireless network; network access selection; UAV network; resource allocation optimization; service option

1. Introduction

With the continuous development of communication technology, human activity areas have gradually expanded from land to the ocean. Hence, maritime, wide-area, all-weather communication support is urgently required to provide uninterrupted network services for space-based, air-based, ground-based, and naval information services [1]. This is also an important topic in current network communication technology research. For example, B5G/6G integrates air, land, and sea network architectures into its core structure [2]. Owing to the extensive coverage and intricate structure of maritime wireless networks, coupled with diverse demands for network resources, service attributes, and quality of service (QoS), unmanned aerial vehicle networks [3] have emerged as a vital component of maritime wireless networks, leveraging their high scalability, high mobility, and broad coverage. Simultaneously, limited communication resources and environmental uncertainty significantly increase the difficulty of human production and activities. Hence, methods for orchestrating and allocating resources with different network dynamics are urgently required [4] to maximize the use of limited communication resources [5].

The current research on wireless network resource allocation focuses on the following three aspects: power allocation [6,7,8,9], spectrum allocation [10,11,12], and access selection [13,14]. Radio access technology (RAT) is a key technology that can achieve seamless connections for users under wide-area coverage, and it is the first step in resource orchestration and allocation. Current access selection algorithms are mainly divided into two categories, one of which relies on traditional operations and research methods for technical support, and the other approach relies on machine learning methods to determine optimal access strategies. The first category [15,16,17,18,19,20] relies on traditional operations research methods as technical support to provide accurate access result calculations. For example, Zhu et al. [16] proposed an innovative access selection mechanism based on remns to improve the quality of experience issues for 5G Internet of Things users under any motion state. Using a comprehensive preference assessment framework based on the analytic hierarchy process (AHP) and entropy weight method, they measured the preference degree of each Internet of Things (IoT) service for network attributes to reduce network congestion and total latency, improve user service quality, and enhance transmission efficiency under the constraint of limited network capacity.

For the other category [21,22,23,24,25,26], traditional operational research methods are typically used to solve low-dimensional discrete problems. However, scenarios involving air, ground, and sea are complex, and they contain many constraints as well as a large solution space, rendering them difficult for traditional methods to handle. When facing complex high-dimensional problems, machine learning methods are required to provide the means to approximate functions. For instance, Bo [13] proposed a joint relay selection and access control algorithm based on distributed Q-learning (DQL) to address the issues of low transmission rates and network congestion in multiuser multirelay uplink satellite–terrestrial networks. By learning the optimal relay transmission slots, the proposed algorithm was significantly better than traditional algorithms in terms of throughput, access delay, and rate. Please refer to Table 1 for a detailed comparison.

Although the algorithms mentioned in previous studies have addressed the problem of wireless network resource allocation on land from different perspectives, maritime scenarios are characterized by fast-moving nodes [27], a vast area of wireless networks, and a variety of business demands with significant differences. Hence, migrating existing land and air wireless network resource allocation schemes directly to a maritime scenario is difficult. Therefore, this study focuses on the adaptability issue of resource allocation in maritime UAV networks and proposes a maritime wireless network resource allocation method based on an attention mechanism. This method is divided into three modules as follows: a deep reinforcement learning algorithm based on the double DQN (DDQN), an action suppression module based on action-value constraints, and an attention–perception module based on the attention mechanism. These address the issue of the agent’s actions easily, thereby leading to the mitigation of the ping-pong effect; an enhancement in the matching of the network, nodes, and services; and a reduction in the number of switches. The contributions of this study are as follows:

(1) A multidimensional resource management model and objective optimization function for maritime wireless networks were constructed. To improve the business transmission efficiency of nodes within the network, a non-convex objective optimization function for node access to the network and the transmission rate was provided by integrating the network throughput model into the traditional Shannon communication model. Moreover, the joint optimization of the nodes’ business access method and network access selection strategy was performed.

(2) An action suppression module that controls the output of the actions was designed. An action-value constraint function was introduced to regulate the output of the actions, address the issue of the ping-pong effect when an agent outputs actions, and effectively reduce the number of link switches.

(3) An attention mechanism from the field of computer vision was proposed to introduce an attention-aware method for resource allocation. The observation information of the nodes was divided into self-state information and environmental state information, with targeted processing of different feature information, and a hybrid attention weight was formed to enhance the state perception ability of the nodes. This led to the optimization of node business access methods; promoted matching among networks, nodes, and services; and provided a new approach to resource allocation adaptability methods.

2. System Model

Maritime UAV networks deeply integrate the multidimensional resources of air, space, and sea to support the real-time and diverse needs of maritime users. Their architectural compositions include the ocean, airspace, and a small part of the shore base. Network nodes can be divided into low-speed nodes (e.g., ships, light unmanned aerial vehicles, and small unmanned aerial vehicles) and high-speed nodes (e.g., medium-sized unmanned aerial vehicles, large unmanned aerial vehicles, and airplanes), based on their mobility rates, as shown in Figure 1. It is worth noting that the aforementioned network nodes can act as user nodes as well as temporary base stations. Despite the significant differences among entities when abstracted as network nodes, their main effective differences are their mobility rates. Therefore, in the proposed model, they were constructed as low-speed/high-speed user nodes and low-speed/high-speed temporary base stations.

For ease of analysis and generality, we assumed that high-speed mobile nodes perform continuous random motion within the sea area, each of which is capable of executing up to N_max parallel services, with their service list information being

B = \{b_{1, 1}^{1}, \dots, b_{p, q}^{i}\}, p \leq N_{\max}, i \in I

, and j maritime mobile base stations move continuously and randomly throughout the sea area at a slower speed, with the network parameter set of the base stations being

P = \{P_{1}^{1}, P_{2}^{1}, P_{3}^{1}, P_{4}^{1}, \dots, P_{1}^{j}, P_{2}^{j}, P_{3}^{j}, P_{4}^{j}\}, j \in J

. In the scenario considered in this study, the heterogeneous wireless communication network has a common cloud service resource pool that is responsible for collecting the services generated by each maritime base station and distributing them to various nodes. Each base station can provide users with access to multiple networks with different communication ranges, transmission powers, bandwidths, and other parameters with upper limits. User nodes continuously interact with surrounding base stations or other mobile nodes for business purposes during their maritime movements, enabling mobile nodes to execute services without interruption.

Given the intimate connection between network throughput and transmission rate, establishing a relationship between the two enables the optimization of multidimensional resource allocation at nodes. Thus, based on the traditional Shannon channel model, we introduce the TCP throughput model for wireless networks to formulate the relationship between transmission rate and network parameters, thereby developing a multidimensional resource allocation model for wireless network scenarios. The specific construction is detailed as follows:

The TCP throughput model [28] for wireless networks chosen by a node is

C_{s} = \frac{M_{s}}{(\bar{D_{s}} + J_{s}) \sqrt{L_{s}}}

(1)

where M_S represents the maximum packet segment size,

\bar{D_{s}}

represents average packet delay, J_S represents the packet jitter, and L_s represents packet loss rate. In addition, according to Shannon’s formula, the relationship between the transmission rate of node i and the bandwidth of network s is given by

Γ_{i, s} = \frac{C_{\max}}{I_{s}} \log (1 + S N R)

(2)

where the maximum throughput of the node is the full occupation of the base station bandwidth

C_{s} \leq C_{\max}

and the number of node connections I_S. The signal-to-noise ratio of a node can be expressed as follows:

S N R = \frac{P_{r} \cdot g_{i, s} (t)}{σ_{s}}

(3)

where

g_{i, s} (t)

is the channel gain and

σ_{s}

is the channel noise. The received power of the node P_r is represented as

P_{r} = \frac{P_{t} G_{s, t} G_{s, r} λ_{s}^{2}}{{(4 π d_{s})}^{2}}

(4)

where

λ

is the signal frequency of the wireless network. By substituting Equation (1) into Equation (2), we obtain the wireless network transmission rate formula, as follows:

Γ_{i, s} = \frac{M_{s}}{I_{s} (\bar{D_{s}} + J_{s}) \sqrt{L_{s}}} \log (1 + \frac{P_{t} G_{s, t} G_{s, r} λ_{s}^{2} g_{i, s} (t)}{{(4 π d_{s})}^{2} σ_{s}})

(5)

Therefore, we aim to maximize the transmission rates of the wireless networks. Among them, the maximum packet segment size M_S and other parameters

\frac{P_{t} G_{s, t} G_{s, r}}{{(4 π d_{s})}^{2}}

are fixed attributes of network s in this study. Hence,

\max Γ_{i, s}

can be simplified to

\begin{matrix} \max Γ_{i, s} = & \max \{\frac{M_{s}}{I_{s} (\bar{D_{s}} + J_{s}) \sqrt{L_{s}}} \log (1 + \frac{P_{t} G_{s, t} G_{s, r} λ_{s}^{2} g_{i, s} (t)}{{(4 π d_{s})}^{2} σ_{s}})\} \\ = \frac{M_{s}}{\min [I_{s} (\bar{D_{s}} + J_{s}) \sqrt{L_{s}}]} \log (1 + \frac{P_{t} G_{s, t} G_{s, r}}{{(4 π d_{s})}^{2}} \cdot \max [\frac{λ_{s}^{2} g_{i, s} (t)}{σ_{s}}]) \end{matrix}

.

Simultaneously, we assumed that the total value of the business completed by the node is also as large as possible. Here, the value of the j-th business completed by node i is

V_{j}^{i}

, the number of completed businesses is

N_{j}

, the total value of completed businesses is

V_{\max}^{i} = \sum_{j = 1}^{N_{j}} V_{j}^{i}

, the duration of the j_t-th business completed by node i is

t_{j_{t}}^{i}

, the average value of the j_t-th business completed by node i is

{\bar{V}}_{j_{t}}^{i}

, and the number of j_t related to the time t is a function of the time t. This can be simplified to maximize the transmission business value

\max \sum_{j = 1}^{N_{j}} V_{j}^{i}

:

\max \sum_{j = 1}^{N_{j}} V_{j}^{i} = \max \sum_{t = 1}^{t_{\max}} \sum_{j = 1}^{N_{j, t}} {\bar{V}}_{j_{t}}^{i} t_{j_{t}}^{i} \to \sum_{j = 1}^{N_{j, t}} \min t_{j_{t}}^{i}

(6)

The data volume

K_{m}^{i}

of the m-th business of node i is represented as

K_{m}^{i} = \sum_{t = 1}^{t_{j}^{i}} Γ_{i, s, t}

, and

Γ_{i, s, t, j}

is the transmission rate when node i is connected to network s at time t. By substituting this into Equation (6), we obtain

\min t_{j_{t}}^{i} = \min \{t | \sum_{t = 1}^{t_{j}^{i}} Γ_{i, s, t} \geq K_{m}^{i}\}

(7)

In summary, the joint optimization constraint function for maximizing the transmission rate and total business value is obtained as follows:

\{\begin{matrix} \begin{matrix} \begin{matrix} \frac{M_{s}}{\min [I_{s} (\bar{D_{s}} + J_{s}) \sqrt{L_{s}}]} \log (1 + \frac{P_{t} G_{s, t} G_{s, r}}{{(4 π d_{s})}^{2}} \cdot \max [\frac{λ_{s}^{2} g_{i, s} (t)}{σ_{s}}]) \\ \min \{t | \sum_{t = 1}^{t_{j}^{i}} Γ_{i, s, t} \geq K_{m}^{i}\} \end{matrix} \\ \begin{matrix} s . t . & C 1 : C_{i} \leq C_{i \max} \end{matrix} \\ \begin{matrix} C 2 : \sum_{i}^{u_{i, s, k} = 1} C_{i, s} \leq C_{s \max} \end{matrix} \\ \begin{matrix} C 3 : N_{s} \leq N_{\max} \end{matrix} \end{matrix} \\ \begin{matrix} C 4 : \sum_{i}^{u_{i, s, k} = 1} Γ_{i, s} \leq Γ_{s} \end{matrix} \end{matrix}

(8)

where the set of nodes I and the set of available networks S represent the collection of all individuals participating in business interactions in the current sea area.

C_{i \max}

represents the maximum throughput obtained by node i from network k for base station j. Constraint C1 indicates that the bandwidth used by node i must not exceed the maximum bandwidth of the node. Constraint C2 indicates that the bandwidth allocated by the network must not exceed the maximum bandwidth of the network. Constraint C3 indicates that the number of businesses obtained by node i must not exceed the maximum number of businesses that the business list can carry. Constraint C4 indicates that the total rate allocated by network s to the connected nodes is less than or equal to the maximum rate of the base station.

For maritime communication, the common air–sea path loss model is based on the classic logarithmic path loss model formula, and the propagation environment of the ocean and sea wave conditions are corrected using correction factors. The channel noise reference is [29], and the channel gain model [30] is as follows:

g_{i, s} (t) = L (h_{t}, h_{r}, h_{e}, d_{m, k, t}) + 10 n \log_{10} (\frac{d_{t}}{d_{m, k, t}}) + χ_{σ}^{m, k, t} + ς F_{t}

(9)

where d_m,k,t represents the reference distance from node m to base station k. The calculation formula for the reference distance is as follows:

d_{m, k, t} = \sqrt{{(x_{m, t} - x_{k, t})}^{2} + {(y_{m, t} - y_{k, t})}^{2} + {(z_{m, t} - z_{k, t})}^{2}}

(10)

Owing to the considerable distance between the nodes and base stations,

L (h_{t}, h_{r}, h_{e}, d_{m, k})

employs a three-ray path loss model. The formula [21] is as follows:

L (h_{t}, h_{r}, h_{e}, d_{t}) = - 10 \log_{10} ({(\frac{λ}{4 π d_{t}})}^{2} {(2 (1 + Δ))}^{2})

(11)

where h_t and h_r represent the effective heights of the transmitter (Tx) and receiver (Rx) antennas, respectively, whereas h_e denotes the effective height of the evaporation duct. The formula for

Δ

is as follows:

Δ = 2 \sin (\frac{2 π h_{t} h_{r}}{λ d_{t}}) \sin (\frac{2 π (h_{e} - h_{t}) (h_{e} - h_{r})}{λ d_{t}})

(12)

where n is the path loss exponent, which is typically set to 1.1, owing to the surface waveguide effect of the sea surface [30]. In addition,

χ_{σ}^{m, k}

represents shadowing, and the higher the sea level, the greater the shadowing. To better adapt to the rapid movement of nodes, an adjustment parameter F is introduced. This term is set to −1 when far from the shore base and 1 otherwise.

Issues with the network access strategy and efficiency of business completion can be transformed into an optimization problem of access network parameters; that is, the better the network parameters of the strategy network, the more businesses will be completed. If the business access strategy is further optimized, the efficiency of business completion can be increased. The interactive factors among access network parameters are affected by the natural environment and random human disturbances, rendering it difficult to describe the dynamic change relationship directly with a mathematical expression and further form an access strategy. Deep reinforcement learning methods integrate the advantages of neural networks with learning. They not only fit the policy function through data but also explore and adapt to a dynamic environment with a lower training cost and a trial-and-error learning approach. Therefore, this study uses a deep reinforcement learning method to determine the optimal values of the network parameters in the constraint formula.

3. Resource Allocation Method Based on an Attention Mechanism

Considering the characteristics of maritime wireless networks, the following considerations were made to improve the efficiency of resource allocation methods:

(1) When the difference in network parameters between better and optimal links is insignificant, it is entirely reasonable and acceptable to lose some transmission benefits in exchange for the stability of the communication link connection.

(2) The resource allocation algorithm based on deep reinforcement learning, which requires a large amount of environmental information, must accurately explore the internal logic between the environment and nodes to provide action outputs that efficiently and accurately match the current state of the intelligent agent.

(3) The requirements of different services and the corresponding sensitive network parameters are all different, which means that the network parameters corresponding to the service requirements should be prioritized without necessarily seeking the optimal network parameters.

Considering the aforementioned considerations, this section proposes a resource allocation method based on the attention mechanism, known as the attention double DQN (A-DDQN), from the following two perspectives: action output and environmental feature extraction. This method is composed of three modular components, namely, the DDQN algorithm module, the attention–perception module, and the selection suppression module, all of which are designed to enhance the efficiency of network transmission.

3.1. Basic Principle

A flowchart of the entire method is shown in Figure 2. The method is divided into three modules. The first module is the deep reinforcement learning module, which introduces the DDQN algorithm [31] and has high learning efficiency and smoother convergence. The second module is the attention–perception module. The environmental output state values are first transmitted to the attention–perception module to obtain the selection weights for the current resource pool business types. It executes the business selection action of

a_{t}^{j}

and merges the intelligent agent’s business list information with the environmental state values into the Q-network and then outputs the action. The third module is the selection suppression module. The actions and their action value outputs from the Q-network are first checked using the selection suppression module. If the conditions are met, they are output directly; otherwise, the

a_{t}^{i}

output from the previous moment replaces the current network output, outputting the network access selection action. The structure and function of each module are described in sequence below.

3.1.1. Action Suppression Module

Given that frequent switching of intelligent agents’ access choices can lead to transmission instability and information security issues, it is necessary to ensure the stability of the agent’s action output over a period of time. Therefore, this module introduces a method for setting the optimal action-value difference threshold at adjacent moments. For the action value outputted by the Q-network at time t, denoted by Q_t, and the action value outputted at the previous moment, denoted by Q_t₋₁, a sensitivity value Q_s is set as follows:

|Q_{t} - Q_{t - 1}| \geq Q_{s}

(13)

As the output of the algorithm gradually stabilizes, the sensitivity value Q_s decayed over time to adapt to the output of the algorithm. Therefore, the decay rate is set to the sensitivity value Q_s, and the time decay coefficient is denoted as μ. By combining these two points, an action-value constraint function is obtained. This function enables the selection suppression unit to switch the link for the intelligent agent. If this condition is not satisfied, then the current connection action is maintained, as follows:

\{\begin{matrix} |Q_{t} - Q_{t - 1}| \geq Q_{s} \\ ε_{s, t} = ε_{s, t} \cdot {(μ)}^{t} \\ Q_{s} = Q_{s} \cdot ε_{s, t} \\ Q_{s} \in [Q_{s \min}, Q_{s} \cdot ε] \end{matrix}

(14)

3.1.2. Attention Perception Module

In resource allocation machine learning algorithms, where neural networks have the advantage of autonomously filtering useful information from an indifferent input of environmental data, the challenge lies in enabling neural networks to locate and process this useful information swiftly to enhance learning efficiency. This represents a breakthrough in improving the performance of these algorithms. Given that heterogeneous network business requirements often feature local optimizations of network parameters sufficient for global optimization, the proposed method aims to select information more pertinent to current business needs from extensive environmental data. This enables the neural network to focus selectively on relevant environmental information, thereby enhancing the performance and generalizability of resource allocation algorithms. Consequently, attention mechanisms from computer vision were integrated to develop an attention-aware module tailored for wireless network resource allocation. The approach is detailed as follows:

(1) The attention–perception module first splits the “perception field” of the agent to distinguish between the self-state and environmental state, to perform targeted processing on different states in the future. The self-state includes the agent’s current connection network parameters and other state parameters, whereas the environmental state includes the network parameters.

(2) For the self-state information, a linear perceptron is used to map each state parameter according to its dimensions to obtain a sub-weight, calculate the correlations between the attributes of a note and environmental parameters, and normalize these correlations into attention weights. A flowchart is shown in Figure 3. The sub-weight of an agent at time t is represented as

\begin{matrix} s_{t}^{s e l f} \end{matrix} \begin{matrix} = softmax [(\begin{matrix} s_{1}^{s} \\ s_{2}^{s} \\ \dots \\ s_{p}^{s} \end{matrix}) \cdot (\begin{matrix} Γ_{1} & Γ_{2} & \dots & Γ_{p} \end{matrix})] \\ = softmax (\begin{matrix} Γ_{1} (s_{1}^{s}) & Γ_{2} (s_{2}^{s}) & \dots & Γ_{p} (s_{p}^{s}) \end{matrix}) \end{matrix}

(15)

Then, all sub-weights are weighted to obtain the self-attention weight for the selection of business types under self-state awareness, as follows:

Γ_{p}^{t} (s_{p}^{s}) = (\begin{matrix} w_{1}^{1} & w_{1}^{2} & \dots & w_{1}^{p} \\ w_{2}^{1} & w_{2}^{2} & \dots & w_{2}^{p} \\ \dots & \dots & \dots & \dots \\ w_{l}^{1} & w_{l}^{2} & \dots & w_{l}^{p} \end{matrix}) \to (\begin{matrix} w_{1}^{1} \\ w_{2}^{1} \\ \dots \\ w_{l}^{1} \end{matrix}) \otimes (\begin{matrix} w_{1}^{2} \\ w_{2}^{2} \\ \dots \\ w_{l}^{2} \end{matrix}) \otimes \dots \otimes (\begin{matrix} w_{1}^{p} \\ w_{2}^{p} \\ \dots \\ w_{l}^{p} \end{matrix}) \to (\begin{matrix} w_{1}^{s, t} \\ w_{2}^{s, t} \\ \dots \\ w_{l}^{s, t} \end{matrix})

(16)

(3) For the environmental state information, a simple linear perceptron cannot be used for processing because it contains a variety of spatial and temporal information. For instance, maritime scene location information that affects network transmission can be considered spatial information. Here, a multilayer perceptron is employed to calculate the similarity of parameters, extract implicit information from the state through multiple fully connected layers, and output the business type selection spatiotemporal attention weights under the environmental state. This is represented as follows:

s_{t}^{e n v} \otimes F (\cdot) = (\begin{matrix} F (s_{1}^{e}, k) \\ F (s_{2}^{e}, k) \\ \dots \\ F (s_{n - p}^{e}, k) \end{matrix}) = (\begin{matrix} v_{a}^{T} \tanh (W_{s_{1}^{e}} + U_{k}) \\ v_{a}^{T} \tanh (W_{s_{2}^{e}} + U_{k}) \\ \dots \\ v_{a}^{T} \tanh (W_{s_{n - p}^{e}} + U_{k}) \end{matrix}) \to (\begin{matrix} w_{1}^{e, t} \\ w_{2}^{e, t} \\ \dots \\ w_{l}^{e, t} \end{matrix})

(17)

(4) After obtaining the self-attention and spatiotemporal attention weights, the different impact levels of the two types of attention on the selection of business types are considered. Using self-attention as the baseline and adjusting for errors in spatiotemporal attention, the weights are multiplied and combined proportionally. Finally, this results in rescaled weights of the business-type features. These are called hybrid attention weights and are represented as follows:

w^{s, t} \otimes w^{e, t} = (\begin{matrix} w_{1}^{s, t} \\ w_{2}^{s, t} \\ \dots \\ w_{l}^{s, t} \end{matrix}) \otimes (\begin{matrix} w_{1}^{e, t} \\ w_{2}^{e, t} \\ \dots \\ w_{l}^{e, t} \end{matrix}) = (\begin{matrix} w_{1}^{t, j} \\ w_{2}^{t, j} \\ \dots \\ w_{l}^{t, j} \end{matrix})

(18)

The structure of the attention–perception method is shown in Figure 4.

3.1.3. Deep Reinforcement Learning Module

Given that intelligent agents operate in highly dynamic environments with limited communication and computing resources, they necessitate deep reinforcement learning algorithms with high learning efficiency and low computational costs. Consequently, the proposed method incorporates the DDQN algorithm to develop a deep reinforcement learning module. Compared with other intricate deep reinforcement learning algorithms such as deep deterministic policy gradient (DDPG) and soft actor–critic (SAC), this approach maintains a simpler implementation and lower computational costs, resulting in higher learning efficiency and smoother convergence.

The process of agent action selection involves maximizing the reward function. In this context, the agent’s primary task is to maximize the value of a completed business; hence, the main reward is the business completion reward. Given the delayed nature of business completion rewards, to prevent sparse rewards, an auxiliary function is introduced to guide the agent’s behavior. A reward function based on dynamic potential energy is established, which disintegrates the primary task into subtasks to accelerate the algorithm’s convergence speed and enhance performance. The reward function is expressed as follows:

\{\begin{matrix} r (s, a, a') = \sum_{i \leq N_{\max}} r_{i}^{b} + \bar{r} (p, ϕ, a, a') \\ \bar{r} (p, ϕ, a, a') = r (p, ϕ, a, a') + γ ϕ (a') - ϕ (a) \end{matrix} .

(19)

where

\sum_{i \leq N_{\max}} r_{i}^{b}

is the mainline reward, which is rewarded based on the business completed by the intelligent agent at the current time;

\bar{r} (p, ϕ, a, a')

is an auxiliary reward function that rewards or punishes based on the network parameters of the agent’s previous access to the network.

3.2. Algorithmic Process

The pseudocode of the A-DDQN algorithm is presented in the following Algorithm 1:

Algorithm 1. Attention Double DQN Algorithm

Initialize the experience replay buffer D with capacity N; initialize the maximum size of the experience replay pool to N_r;

Initialize the action-value function Q with random weights θ; initialize the training batch size to N_b;

Initialize the target action-value function

\hat{Q}

with weights θ⁻ = θ; initialize the target network update frequency to N⁻.

Initialize the greediness probability ε and the action-value sensitivity Q_s.

For episode = 1, M do

Initialize action sequence ε = {x₁} and initial state sequence ϕ₁ = ϕ(s₁).

For t = 1, T do

If the probability of greediness ε is less than the threshold value

Choose a random action

a_{t}^{i}

with probability ε;

Else

Select an action

a_{t}^{i}

using the Q-network model with a probability of less than Q_s.

The greediness probability ε decays, the action-value sensitivity Q_s decays, and the state ε_s decays with a time decay factor μ.

Given the state

S_{t}

, choose the action

a_{t}^{j}

, and update the Q-network with the state input

(s^{t + 1}, r^{t + 1}) \to (s^{t + 1}, r^{t + 1}, b_{m}^{a_{t}^{j}})

Execute the action a_{t}^{i}

and a_{t}^{j}

to obtain the reward value r_t+1 and state s_t+₁.

Store the quadruple (s_{t}, a_{t}^{i}

, r_t+1, s_t₊₁) into the experience replay pool D.

Sample a quadruple (s_{t}, a_{t}^{i}

, r_t, s_t₊₁) from the experience replay pool D with a minimum batch size.

Calculate the target value of N_b for each training batch:

Define a^{\max} (s'; θ) = \arg \max_{a'} Q (s', a'; θ)

.

Calculate y_{i} = \{\begin{matrix} r_{j} & If the training round stops at j + 1 \\ r_{j} + γ \hat{Q} (s', a^{\max} (s', θ); θ^{-}) & o t h e r \end{matrix}

.

Calculate the gradient descent {‖y_{j} - Q (s, a; θ)‖}^{2}

.

Update the target network parameters . θ^{-} \leftarrow θ

every N^{-}

steps.

End For

3.3. Algorithmic Complexity

When the input size of the algorithm is n, the algorithm complexity of the action suppression module is O (1), the algorithm complexity of the attention–perception method is O (1), and the algorithm complexity of the intelligent agent neural network part is O (TNN), where

T_{N N} = \sum_{i = 1}^{L} n_{i} \cdot n_{i + 1} .

(20)

Therefore, the overall time complexity of the algorithm is O (TNN), which is not significantly different from the complexity of other deep learning algorithms. Furthermore, the neural network used by the intelligent agent consists of four fully connected layers, with a node count of 300/200/60/24 in each layer, resulting in a computational complexity of approximately 73,440 for the entire algorithm.

3.4. Algorithmic Convergence

The convergence of the DDQN algorithm is based on the convergence of Q-learning, and the overestimation of the Q value is reduced through the dual Q-learning method, further enhancing the stability and convergence of the algorithm. The proposed method does not change the basic properties of the Q-learning algorithm; therefore, its convergence is consistent with the Q-learning algorithm. The convergence proof can be found in reference [32].

4. Simulation Results and Analysis

To evaluate the performance of the proposed method in a maritime UAV network scenario, the proposed algorithm was simulated along with the other control algorithms. This section first introduces the simulation scenario and its parameter settings and then analyzes and discusses the simulation results.

4.1. Simulation Parameter Setup

4.1.1. Intelligent Agent Parameter Settings

In the simulation scenario, in addition to intelligent agents, other greedy nodes with tasks that are consistent with those of the agents exist. The order in which all nodes take business from the resource pool is sorted in real time, according to a uniform distribution. The schematic of the simulation scene is shown in Figure 5. The algorithm performance test scenario was random, and the randomness was reflected in the continuous random high-speed movement of nodes, continuous random low-speed movement of base stations, random generation of business attribute values in the resource pool, and random interference of network parameters. Traditional methods and greedy algorithms that do not require training are directly applied to the test scenarios. Owing to the strong randomness of the scenario, the following performance indicators were used as average values per 20 rounds. For the basic parameters of the scenario, refer to [33], for maritime channel parameters, refer to [30], and for link parameters, refer to [34]. Please refer to Table 2 for the detailed parameter settings.

Each base station was set up to provide four types of networks to user nodes with the basic attributes of the links, as listed in Table 3.

4.1.2. Network–User Interaction Setup

The various links of each base station in the scenario are regarded as heterogeneous wireless communication networks. User QoS is mainly influenced by the following four network parameter indicators: packet delay, packet jitter, bandwidth, and packet loss rate. In this study, we assume that base stations can meet users’ communication network switching and access requests with negligible switching time and cost. Under the joint action of the four basic network parameters and random environmental factors (weather, Doppler shift, etc.), the business transmission rate and bandwidth of the users vary to varying degrees. This closed-loop mutual influence constitutes the interaction between the network and the user, as shown in the interaction diagram in Figure 6.

4.1.3. Transmission Business Setup

Base stations generate businesses with different demands according to a Poisson distribution and place them in the resource pool for nodes to access. The characteristics of these business types are presented in Table 4. Nodes randomly access businesses from the resource pool of the base station and place them in their business lists for execution, with the maximum number of parallel executions set at 10. We assume that the maximum number is exceeded or the node’s bandwidth is insufficient. In this case, they will not access more businesses until there is a vacancy in the business list and the bandwidth is free. Considering business timeliness, a business execution failure mechanism is introduced. That is, if the running time of the current business exceeds 1.2 times the duration, it is considered a business execution failure. The base station does not accept failed businesses, which are directly removed from the business list and penalized.

Businesses are distinguished by priority levels, with Level 1 being the highest priority and Level 3 being the lowest. High-priority businesses offer greater rewards for completion and penalties for non-completion. The proportion of businesses from high to low priority is [0.35, 0.5, 0.15]. Table 5 compares the reward and penalty situations for different types of businesses.

The value of a business is defined based on its priority level and type. The completion value of a business is determined by multiplying its instantaneous value by its duration, as listed in Table 6.

4.1.4. Control Algorithms and Hyperparameters

To verify the performance of the proposed method, it was compared with those of the DDPG [35], SAC [36], and DDQN [31] deep reinforcement learning algorithms; the traditional D-AHP method [37]; and a greedy algorithm. In addition, ablation comparison experiments were conducted to verify the performance of the two modules. The SAC_ls, DDPG_ls, and DDQN_ls algorithms have a selection suppression unit added to their original versions to control the number of switches within 10% of the total duration. The D-AHP algorithm is an improved version of the traditional AHP method and is adapted for dynamic maritime wireless network environments. The greedy algorithm refers to an access selection strategy based on the closest distance to the business selection in random order. The hyperparameters of the proposed algorithm are listed in Table 7.

4.2. Analysis of Simulation Results

This section compares and analyzes the performances of the proposed method and control algorithms in terms of the following five aspects: the total number of switches, the quantity and value of business completion, average network parameters, the quantity of business priority completion, and the quantity of business type completion.

4.2.1. Total Number of Switches

Figure 7 shows a comparison of the total number of switches per algorithm round. The statistical standard for the number of switches is defined as follows. If the access selection strategy is different at times t and t+1, then the number of switches is incremented by one, accumulating the total number of switches in a complete round. As shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7, the inclusion of suppression units yields positive outcomes. The algorithm introduced exhibits a 69.23% reduction in switching compared with the original DDQN algorithm. Although this is slightly higher than other deep learning algorithms subject to switch frequency constraints, the switch count constitutes only 10% of the total action time, which is deemed reasonable and acceptable. When compared with the standard algorithm, the algorithm with selection suppression units maintains a switch count of less than 10% of the total action time. This robustly demonstrates the universal applicability of suppression units in effectively managing frequent algorithm switches. To fit a real situation, the subsequent comparison algorithms of the proposed algorithm all target deep learning algorithms with the switch count constrained to within 10% of the total action time, namely, the DDPG_ls, SAC_ls, and A-DDQN_ls algorithms.

4.2.2. Business Completion Quantity and Value

Figure 8a presents the total number of businesses completed per round using each algorithm. The proposed algorithm increased the total completion amount by approximately 6.4% and 11.73% compared with the DDQN_ls and DDPG_ls algorithms, respectively. Additionally, the results of the ablation study indicated that the attention–perception module of the A-DDQN algorithm increased the number of business completions by 22.7%. This demonstrates that the attention-aware module of the proposed method can effectively enhance the number of completed tasks even under constrained action-switching conditions. This is because the attention-aware module better aligns with node status and business requirements, ensuring a smooth node business queue. Combining Figure 7 and Figure 8a shows that after constraining the number of switches, the SAC algorithm’s business completion quantity decreased by 11.37% and that of the DDQN algorithm decreased by 13.53%. Although the selection suppression unit can effectively suppress the number of switches, it sacrifices some optimal actions at the cost of reducing the business transmission efficiency.

Figure 8b presents the total number of unfinished businesses per round for each algorithm. Notably, the proposed algorithm reduced the total amount of overdue business by 24.56% compared with the DDQN_ls algorithm and by 4.4% compared with the SAC_ls algorithm. This occurs because all three algorithms experience a business transmission efficiency below the normal transmission standards during certain periods. However, the proposed method matches node status with the traffic, and when transmission efficiency is low, it matches with low-duration traffic, thereby mitigating some losses.

Figure 8c illustrates the total value of the completed business. As the time for the agent to execute a business is limited, the total value of the business completed per round has an upper limit. The total value of the business completed by the proposed algorithm increased by 13.72% and 6.57% compared with those of the other two algorithms. This demonstrates that the attention–perception module plays a role in adapting to network parameters and selecting business types. The module fully utilizes its connected network to maximize transmission efficiency. In other words, when selecting non-highest-priority businesses, it adopts a strategy of choosing short-duration and low-value businesses under low network parameters and long-duration and high-value businesses under high network parameters.

4.2.3. Average Network Parameters

Figure 9 presents a comparison chart of the average network parameters per round for the intelligent agents. The average network parameter refers to the mean value of each network parameter per round and reflects the quality of the algorithm’s access strategy within that round. In this scenario, the packet delay, packet loss, and transmission rate are considered to have better transmission efficiency ranges of (0, 70], (10, 45], and (1, 2], respectively. Figure 10 shows that the three network parameters of the proposed algorithm are within the better transmission efficiency range for 80% of the rounds. Compared with the other comparison methods, the proposed method has an increase of approximately 8%–15%, which means that the intelligent agent can complete more business in resource-constrained environments and better adapt to strong dynamic environments.

4.2.4. Completion Quantity of Business Priorities

Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 present a comparative analysis of the completion rates across various business priorities. The proposed algorithm outperforms the DDQNls and SACls algorithms, achieving an improvement of 6.1% and 11.73%, respectively. This demonstrates that our approach prioritizes the completion of high-value business tasks. As shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8, in contrast to the other algorithms, our proposed method facilitates a positive cycle; it completes more tasks, updates the list more frequently, and achieves a greater total value of completed tasks while eliminating low-value tasks.

4.2.5. Quantity Completed by Business Type

Figure 11a presents a comparison chart of the completion quantity for Type I businesses. The proposed algorithm increased the completion quantity by 8.7% and 14.83% compared with the DDQN_ls and SAC_ls algorithms, respectively. As shown in Figure 9, the strategy for the intelligent agent to select a business is to choose as many conversational and interactive businesses as possible; it will select a small number of high-bandwidth stream businesses for execution only when the network parameters are sufficiently good, which is in line with the basic strategy of maximizing the value of business completion. In contrast to other algorithms that can only find a single optimal path, the SAC algorithm learns various solutions to problems through maximum entropy, which is more robust against dynamic environments. However, it does not distinguish the adaptability between different businesses and nodes. Algorithms based on traditional methods have low transmission efficiency and slow business list updates. Therefore, when they access businesses from the resource pool, most of the businesses in the pool are those left unchosen by other nodes in the past. Thus, more interactive and stream businesses are selected. The proposed algorithm perceives and adapts the compatibility between nodes and businesses through the attention–perception module. That is, it perceives its own and environmental states in real time and adjusts the action strategy to determine the optimal solution for each state.

In summary, although the proposed algorithm did not have the highest number of optimal performance metrics, its overall performance was the best. The proposed algorithm is applicable to maritime wireless network application scenarios in which business completion is the core performance indicator. While suppressing the ping-pong effect of access selection, it enhances the total quantity and value of business completion. A comparison of the performance parameters of the algorithms is presented in Table 8.

4.2.6. Limitations of the Method

The proposed method primarily targets the highly dynamic, resource-constrained environment of maritime wireless communications, addressing issues such as frequent agent switching and low service completion rates. Consequently, this method relies on or is sensitive to certain environmental assumptions, such as the following:

(1) With the aim of simulation verification, to ensure the intelligent agent can acquire a greater number of services within a restricted timeframe, the average service duration and the proportion of services with a longer duration are set at relatively lower values as environment-sensitive parameters.

(2) With an emphasis on nodes completing as many operations as possible, the incentives for completing low-duration operations and the penalties for noncompletion are set relatively high.

5. Conclusions

This study addresses wireless resource allocation within the framework of offshore wireless networks by jointly optimizing service-access methods and network-entry strategies to provide new insights into wireless network resource management. To address the issues of unreasonable node switching frequencies and low service completion rates, this study proposes a resource allocation method based on an attention mechanism. The proposed method employs an action-value constraint function to mitigate the ping-pong effect in agent network access decisions and incorporates an attention mechanism to generate mixed attention weights that optimize service-access strategies and enhance the alignment among nodes, networks, and services. Simulation results demonstrate that the proposed method significantly outperforms the benchmark method, further validating the effectiveness of attention mechanisms in multidimensional resource allocation for wireless networks. These findings not only extend the application of existing attention mechanisms but also open new avenues for wireless network resource management. In the future, we will focus on the following: 1. Expanding the channel model to include land and air scenarios. 2. Collecting and validating real-world wireless communication network data at sea. 3. Extending the method to multiagent systems to further enhance the algorithm’s generalization ability.

Author Contributions

Conceptualization, Z.M. and Z.Z. (Zhilin Zhang); methodology, Z.Z. (Zhilin Zhang); software, Z.Z. (Zhilin Zhang); validation, Y.P. and Z.Z. (Zhilin Zhang); investigation, Y.Y.; resources, Z.M.; data curation, T.Z.; writing—original draft preparation, Z.Z. (Zhilin Zhang); writing—review and editing, F.L.; visualization, Z.Z. (Zhilin Zhang); supervision, Z.Z. (Zhiyong Zhao); project administration, J.K.; funding acquisition, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by Shandong Provincial Natural Science Foundation (ZR2023MD045) and the Key Basic Research Projects of the Foundation Strengthening Program (Secret).

Data Availability Statement

Data are unavailable because of privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest. The funding sponsors had no role in the design of this study; in the collection, analysis, or interpretation of data; in the writing of this manuscript; or in the decision to publish the results.

Abbreviations

AHP	analytic hierarchy process
DDPG	deep deterministic policy gradient
DQL	distributed Q-learning
IoT	Internet of Things
QoS	quality of service
RAT	radio access technology
SAC	Soft actor–critic
MDP	Markov Decision Process

References

Huo, Y.; Dong, X.; Beatty, S. Cellular Communications in Ocean Waves for Maritime Internet of Things. IEEE Internet Things J. 2020, 7, 9965–9979. [Google Scholar] [CrossRef]
Yin, H.; Huang, Y.; Han, L.; Jin, J. Reflection on 6G Communication Perception Computing Fusion Network. Sci. China (Inf. Sci.) 2023, 53, 1838. [Google Scholar] [CrossRef]
Khalil, H.; Rahman, S.U.; Ullah, I.; Khan, I.; Alghadhban, A.J.; Al-Adhaileh, M.H.; Ali, G.; ElAffendi, M. A UAV-Swarm-Communication Model Using a Machine-Learning Approach for Search-and-Rescue Applications. Drones 2022, 6, 372. [Google Scholar] [CrossRef]
Alqurashi, F.S.; Trichili, A.; Saeed, N.; Ooi, B.S.; Alouini, M.-S. Maritime Communications: A Survey on Enabling Technologies, Opportunities, and Challenges. IEEE Internet Things J. 2023, 10, 3525–3547. [Google Scholar] [CrossRef]
Wei, T.; Feng, W.; Chen, Y.; Wang, C.-X.; Ge, N.; Lu, J. Hybrid Satellite-Terrestrial Communication Networks for the Maritime Internet of Things: Key Technologies, Opportunities, and Challenges. IEEE Internet Things J. 2021, 8, 8910–8934. [Google Scholar] [CrossRef]
Peng, X.; Xu, H.; Qi, Z.; Wang, D.; Zhang, Y.; Rao, N.; Gu, W. Dynamic Multi-target Jamming Channel Allocation and Power Decision-Making in Wireless Communication Networks: A Multi-agent Deep Reinforcement Learning Approach. China Commun. 2024. [Google Scholar] [CrossRef]
Li, F.; Bao, J.; Wang, J.; Liu, D.; Chen, W.; Lin, R. Antijamming Resource-Allocation Method in the EH-CIoT Network Through LWDDPG Algorithm. Sensors 2024, 24, 5273. [Google Scholar] [CrossRef]
Wang, Y.; Liu, F.; Li, Z.; Chen, S.; Zhao, X. An Approach to Maximize the Admitted Device-to-Device Pairs in MU-MIMO Cellular Networks. Electronics 2024, 13, 1198. [Google Scholar] [CrossRef]
Liu, Y.; Li, Y.; Li, L.; He, M. NOMA Resource Allocation Method Based on Prioritized Dueling DQN-DDPG Network. Symmetry 2023, 15, 1170. [Google Scholar] [CrossRef]
He, Y.; Sheng, B.; Yin, H.; Yan, D.; Zhang, Y. Multi-objective Deep Reinforcement Learning Based Time-Frequency Resource Allocation for Multi-beam Satellite Communications. China Commun. 2022, 19, 77–91. [Google Scholar] [CrossRef]
Li, Y.; Aghvami, A.H. Radio Resource Management for Cellular-Connected UAV: A Learning Approach. IEEE Trans. Commun. 2023, 71, 2784–2800. [Google Scholar] [CrossRef]
Wang, H.; Liu, J.; Liu, B.; Xu, Y. Marine Mammal Conflict Avoidance Method Design and Spectrum Allocation Strategy. Electronics 2024, 13, 1994. [Google Scholar] [CrossRef]
Wang, L.; Guo, J.; Zhu, J.; Jia, X.; Gao, H.; Tian, Y. Cross-Layer Wireless Resource Allocation Method Based on Environment-Awareness in High-Speed Mobile Networks. Electronics 2024, 13, 499. [Google Scholar] [CrossRef]
Sun, M.; Jin, Y.; Wang, S.; Mei, E. Joint Deep Reinforcement Learning and Unsupervised Learning for Channel Selection and Power Control in D2D Networks. Entropy 2022, 24, 1722. [Google Scholar] [CrossRef]
Ma, M.; Zhu, A.; Guo, S.; Wang, X.; Liu, B.; Su, X. Heterogeneous Network Selection Algorithm for Novel 5G Services Based on Evolutionary Game. IET Commun. 2020, 14, 320–330. [Google Scholar] [CrossRef]
Zhu, A.; Ma, M.; Guo, S.; Yang, Y. Adaptive Access Selection Algorithm for Multi-service in 5G Heterogeneous Internet of Things. IEEE Trans. Netw. Sci. Eng. 2022, 9, 1630–1644. [Google Scholar] [CrossRef]
Zhou, O.; Wang, J.; Liu, F.; Wang, J. Energy-Efficient Clustered Cell-Free Networking with Access Point Selection. IEEE Open J. Commun. Soc. 2024, 5, 1551–1565. [Google Scholar] [CrossRef]
González, C.C.; Pupo, E.F.; Atzori, L.; Murroni, M. Dynamic Radio Access Selection and Slice Allocation for Differentiated Traffic Management on Future Mobile Networks. IEEE Trans. Netw. Serv. Manag. 2022, 19, 1965–1981. [Google Scholar] [CrossRef]
Roy, A.; Chaporkar, P.; Karandikar, A.; Jha, P. Online Radio Access Technology Selection Algorithms in a 5G Multi-RAT Network. IEEE Trans. Mob. Comput. 2023, 22, 1110–1128. [Google Scholar] [CrossRef]
Passas, V.; Miliotis, V.; Makris, N.; Korakis, T. Pricing Based Distributed Traffic Allocation for 5G Heterogeneous Networks. IEEE Trans. Veh. Technol. 2020, 69, 12111–12123. [Google Scholar] [CrossRef]
Zhao, B.; Ren, G.; Dong, X.; Zhang, H. Distributed Q-Learning Based Joint Relay Selection and Access Control Scheme for IoT-Oriented Satellite Terrestrial Relay Networks. IEEE Commun. Lett. 2021, 25, 1901–1905. [Google Scholar] [CrossRef]
Cui, Q.; Zhang, Z.; Yanpeng, S.; Ni, W.; Zeng, M.; Zhou, M. Dynamic Multichannel Access Based on Deep Reinforcement Learning in Distributed Wireless Networks. IEEE Syst. J. 2022, 16, 5831–5834. [Google Scholar] [CrossRef]
Zheng, J.; Luan, T.H.; Hui, Y.; Yin, Z.; Cheng, N.; Gao, L.; Cai, L.X. Digital Twin Empowered Heterogeneous Network Selection in Vehicular Networks with Knowledge Transfer. IEEE Trans. Veh. Technol. 2022, 71, 12154–12168. [Google Scholar] [CrossRef]
Zhou, H.; Wang, X.; Umehira, M.; Chen, X.; Wu, C.; Ji, Y. Wireless Access Control in Edge-Aided Disaster Response: A Deep Reinforcement Learning-Based Approach. IEEE Access 2021, 9, 46600–46611. [Google Scholar] [CrossRef]
Xiang, H.; Peng, M.; Sun, Y.; Yan, S. Mode Selection and Resource Allocation in Sliced Fog Radio Access Networks: A Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2020, 69, 4271–4284. [Google Scholar] [CrossRef]
Liang, H.; Zhang, W. Stochastic-Stackelberg-Game-Based Edge Service Selection for Massive IoT Networks. IEEE Internet Things J. 2023, 10, 22080–22095. [Google Scholar] [CrossRef]
Li, Y.; Aghvami, A.H.; Dong, D. Path Planning for Cellular-Connected UAV: A DRL Solution with Quantum-Inspired Experience Replay. IEEE Trans. Wirel. Commun. 2022, 21, 7897–7912. [Google Scholar] [CrossRef]
Mathis, M.; Semke, J.; Mahdavi, J.; Ott, T. The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm. SIGCOMM Comput. Commun. Rev. 1997, 27, 67–82. [Google Scholar] [CrossRef]
ITU. Radio Noise. ITU Radiocommunication Sector, Recommendation ITU-R, 2016. Online, pp. 372–313. Available online: https://www.itu.int/rec/R-REC-P.372-13-201609-I/en (accessed on 31 August 2024).
Wang, J.; Zhou, H.; Li, Y.; Sun, Q.; Wu, Y.; Jin, S.; Quek, T.Q.S.; Xu, C. Wireless Channel Models for Maritime Communications. IEEE Access 2018, 6, 68070–68088. [Google Scholar] [CrossRef]
Van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, Phoenix, AZ, USA, 12–17 February 2016; Volume 30, pp. 2094–2100. [Google Scholar] [CrossRef]
Jaakkola, T.; Jordan, M.I.; Singh, S.P. Convergence of Stochastic Iterative Dynamic Programming Algorithms. Adv. Neural Inf. Process. Syst. 1994, 6, 703–710. [Google Scholar] [CrossRef]
Xia, T.; Wang, M.M.; Zhang, J.; Wang, L. Maritime Internet of Things: Challenges and Solutions. IEEE Wirel. Commun. 2020, 27, 188–196. [Google Scholar] [CrossRef]
Bekkadal, F. Innovative Maritime Communications Technologies. In Proceedings of the 18th International Conference on Microwaves, Radar and Wireless Communications, Vilnius, Lithuania, 14–16 June 2010; pp. 1–6. [Google Scholar]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic Algorithms and Applications. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. Available online: https://proceedings.mlr.press/v80/haarnoja18b.html (accessed on 31 August 2024).
Mao, Z.Y.; Zhang, Z.L.; Liu, X.G.; Kang, J.F. Network Selection Algorithm for Maritime Mobile Nodes Based on Dynamic AHP. Syst. Eng. Electron. 2022, 44, 2011–2018. [Google Scholar]

Figure 1. Maritime wireless network architecture.

Figure 2. Block diagram of the resource allocation method based on an attention mechanism.

Figure 3. Self-attention generation.

Figure 4. Perception module.

Figure 5. Maritime heterogeneous wireless network scenario. The red arrows represent the random motion of nodes, brown represents the shore, and light blue represents the ocean.

Figure 6. Network–user interaction diagram.

Figure 7. Comparison of switching times.

Figure 8. Comparison of algorithm business situations.

Figure 9. Comparison of average network parameters in rounds.

Figure 10. Comparison of business priority completion status.

Figure 11. Completion status of business types.

Table 1. Existing typical wireless access technologies.

Traditional Operations Research Methods		Machine Learning Methods
Technical Approach	Application Scenarios	Technical Approach	Application Scenarios
Evolutionary game and AHP [15]	5G, LTE	Distributed Q-learning [21]	Satellite ground relay network
AHP and entropy weighting method [16]	5G Internet of Things (IoT)	Deep Q-learning [22]	Large-scale machine communication scenarios in the IoT
Clustering [17]	Ultra-dense wireless communication system	Transfer learning [23]	Internet of Vehicles
MADM and AHP [18]	5G HetNet	Deep Q-learning [24]	Disaster response network
Discrete-time MDP and Lagrangian method [19]	5G NR-WiFi HetNet	Distributed Q-learning [25]	FRAN
Paris Metro pricing scheme [20]	4G/5G HetNet	Bayesian deep Q-learning [26]	Industrial IoT

Table 2. Simulation parameter settings.

Simulation Parameter Name	Simulation Parameter
Number of intelligent agents	1
Number of high-speed nodes	13
Number of base station nodes	6
High-speed node speed (m/s)	[15, 150]
Base station node speed (m/s)	[1, 15]
Node turning angle	[−0.25π, 0.25π]
Run time (s)	600
Model training times	2000

Table 3. Comparison of maritime base station communication link parameters.

Category	Coverage (km)	Transmission Power	Bandwidth	Outage Probability
Link 1	80	200	8	0.3
Link 2	50	120	15	0.25
Link 3	25	30	10	0.2
Link 4	300	105	20	0.35

Table 4. Comparison of service business characteristics.

Business Type	Business	Average Bandwidth/MHz	Mean Duration/s	Proportion/%
Ordinary business (conversation type)	Real-time location, maritime management, planned route information, etc.	3	8	33
Maritime safety business (interactive type)	Maritime distress and rescue information, emergency support information, etc.	12	16	34
Remote sensing business (stream type)	Hydrological information, sensor raw signals, maps, etc.	50	24	33

Table 5. Comparison of service business rewards/penalties.

Business Completion Reward/Incompletion Penalty	Level 1	Level 2	Level 3
General business (conversation type)	9/−6	5/−3	3/−2
Maritime safety business (interactive type)	10/−7	6/−4	4/−3
Remote sensing business (stream type)	12/−8	8/−5	6/−4

Table 6. Comparison of business instantaneous value.

Business Instantaneous Value	Level 1	Level 2	Level 3
General business (conversation type)	4	3	2
Maritime safety business (interactive type)	5	4	3
Remote sensing business (stream type)	6	5	4

Table 7. A-DDQN algorithm parameters.

Parameter	Numerical Value
Learning rate	1 × 10⁻⁴
Number of network layers	4
Number of neurons	300/200/60/24
Sampling time	1
Discount factor	0.95
Batch sample size	600
Experience buffer length	1.2 × 10⁵
Smoothing factor	1 × 10⁻⁵
Target network update frequency	5

Table 8. Comparison of algorithm performance.

	Total Business Value	Quantity Completed	Quantity Unfinished	Level 1	Level 2	Type 1	Type 2	Switching Frequency
Greedy	6780	185	136	57	87	143	34	449
D-AHP [27]	7370	224	123	70	108	202	15	448
DDQN	18533	377	45	349	23	356	14	278
SAC	20133	404	40	378	22	387	13	557
DDPG_ls [25]	13210	290	83	279	25	275	12	37
DDQN_ls [22]	15593	326	61	325	18	325	15	71
SAC_ls [26]	17367	358	57	326	23	337	13	66
A-DDQN	19750	400	43	373	30	387	18	79

Bold font represents the method proposed in this article.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mao, Z.; Zhang, Z.; Lu, F.; Pan, Y.; Zhang, T.; Kang, J.; Zhao, Z.; You, Y. Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism. Electronics 2024, 13, 3686. https://doi.org/10.3390/electronics13183686

AMA Style

Mao Z, Zhang Z, Lu F, Pan Y, Zhang T, Kang J, Zhao Z, You Y. Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism. Electronics. 2024; 13(18):3686. https://doi.org/10.3390/electronics13183686

Chicago/Turabian Style

Mao, Zhongyang, Zhilin Zhang, Faping Lu, Yaozong Pan, Tianqi Zhang, Jiafang Kang, Zhiyong Zhao, and Yang You. 2024. "Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism" Electronics 13, no. 18: 3686. https://doi.org/10.3390/electronics13183686

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Sea-Based UAV Network Resource Allocation Method Based on an Attention Mechanism

Abstract

1. Introduction

2. System Model

3. Resource Allocation Method Based on an Attention Mechanism

3.1. Basic Principle

3.1.1. Action Suppression Module

3.1.2. Attention Perception Module

3.1.3. Deep Reinforcement Learning Module

3.2. Algorithmic Process

3.3. Algorithmic Complexity

3.4. Algorithmic Convergence

4. Simulation Results and Analysis

4.1. Simulation Parameter Setup

4.1.1. Intelligent Agent Parameter Settings

4.1.2. Network–User Interaction Setup

4.1.3. Transmission Business Setup

4.1.4. Control Algorithms and Hyperparameters

4.2. Analysis of Simulation Results

4.2.1. Total Number of Switches

4.2.2. Business Completion Quantity and Value

4.2.3. Average Network Parameters

4.2.4. Completion Quantity of Business Priorities

4.2.5. Quantity Completed by Business Type

4.2.6. Limitations of the Method

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI