UAV Onboard STAR-RIS Service Enhancement Mechanism Based on Deep Reinforcement Learning

Yan, Junjie; Xu, Yichen; Yuan, Haohao; Xue, Chunhua

doi:10.3390/s25061943

¹ School of Electronic Engineering, Guangxi University of Science and Technology, Liuzhou 545006, China

² Guangxi Key Laboratory of Multidimensional Information Fusion for Intelligent Vehicles, Liuzhou 545006, China

* Author to whom correspondence should be addressed.

Sensors 2025, 25(6), 1943; https://doi.org/10.3390/s25061943

Submission received: 19 February 2025 / Revised: 11 March 2025 / Accepted: 18 March 2025 / Published: 20 March 2025

(This article belongs to the Section Communications)

Abstract:

UAVs and reconfigurable intelligent surfaces (RISs) have emerged as promising solutions to enhance communication coverage and performance. However, existing studies primarily focus on optimizing the amplitude and phase shift of a STAR-RIS without considering the impact of varying UAV hovering angles on signal reflection and transmission. In this paper, we propose a novel STAR-RIS-assisted UAV service enhancement mechanism that dynamically adjusts reflection/transmission regions based on the real-time user distribution, significantly improving the channel quality for both edge and occluded users. This work is the first to jointly optimize the phase and amplitude of the STAR-RIS, the UAV flight trajectory, and the hovering angle, addressing the critical challenge of co-channel interference caused by dynamically partitioned service areas. The complex optimization problem is decomposed into subproblems, where the UAV flight trajectory is optimized using the Chained Lin–Kernighan (CLK) algorithm and the STAR-RIS parameters and UAV hovering angle are optimized using the TD3 algorithm. The experimental results show that the proposed mechanism effectively reduces the system service time and user transmission time, outperforming traditional methods.

Keywords:

STAR-RIS; UAV-enhanced edge services; resource allocation; deep reinforcement learning

1. Introduction

The advent of sixth-generation (6G) communication technology is turning applications such as autonomous driving, telemedicine, and the Internet of Things (IoT) from theoretical constructs into practical reality. These applications impose stringent performance requirements, including ultra-low latency, ultra-high reliability, and high bandwidth [1]. However, traditional terrestrial communication systems are vulnerable to non-line-of-sight (NLoS) issues in complex environments. This is primarily due to the fixed location of the base station, which results in channel fading and the deterioration of the channel quality. Consequently, it is essential to explore new communication architectures and optimization strategies to mitigate NLoS impacts and improve system reliability and efficiency.

UAVs are seen as a promising solution to overcome communication coverage and capacity limitations in remote and challenging environments. When used as aerial nodes, UAVs offer flexible positioning at optimal locations, increasing the likelihood of line-of-sight (LoS) communication and improving the channel gain [2]. In addition, UAVs can act as communication relays, extending network coverage, increasing the system capacity, and improving communication reliability, flexibility, and robustness [3,4]. Thus, the potential for UAVs to improve system performance in complex communication scenarios is clear. However, UAV communication systems still face several challenges, including energy constraints, dynamic channel conditions, and interference in dense networks [5].

Recently, reconfigurable intelligent surfaces (RISs) have become one of the key enabling technologies for 6G as an effective solution to overcome the above limitations [6]. RISs have been demonstrated to have the capacity to exercise intelligent control over the wireless propagation environment by dynamically adjusting the reflection and transmission characteristics [7]. This dynamic adjustment has been shown to optimize the signal path and reduce multipath interference and channel fading and thus demonstrates great potential in improving the performance of wireless networks [8]. Furthermore, with the flexibility of UAVs and the programmability of RISs, the formation of an aerial RIS (ARIS) to dynamically optimize propagation paths and channel conditions has become one of the current research hotspots [9,10,11].

However, a significant limitation of a traditional RIS is that it only supports signal reflection, requiring the user to be on the same side as the transmitter to receive the signal, which restricts its flexibility and practical deployment in various wireless communication applications. To address this limitation, researchers have proposed the simultaneously transmitting and reflecting RIS (STAR-RIS) [12]. Specifically, the amplitude adjustment determines how the energy of the incident signal is split, so that a proportion of the energy is reflected and a proportion is transmitted, thus optimizing the coverage in different areas. The phase shift adjustment ensures that the signal is constructively superimposed in the target area, thereby increasing the channel gain and reducing interference.

Furthermore, STAR-RIS-assisted UAV systems are highly dynamic and complex. Deep reinforcement learning (DRL), which enables agents to make sequential decisions based on environmental observations and interactions, thereby learning and adapting to dynamic conditions, has become a vital tool for addressing such intricate challenges. Classical DRL algorithms, such as the Deep Q-Network (DQN) and the Deep Deterministic Policy Gradient (DDPG), have achieved success in UAV path planning and resource allocation. However, DDPG often suffers from Q value overestimation, leading to suboptimal policies [13]. The Twin Delayed Deep Deterministic Policy Gradient (TD3) addresses this limitation by introducing clipped double-Q learning, delayed policy updates, and target noise regularization, thereby enhancing stability and convergence [14]. Recent studies have also highlighted TD3’s efficacy in communication networks [15].

Despite significant advances in the aforementioned domains, critical research gaps persist. Firstly, while prior studies have optimized STAR-RIS phase and amplitude configurations, they have largely overlooked the impact of UAV hovering angles on signal reflection/transmission partitioning. This oversight impedes the dynamic adaptation of STAR-RIS region boundaries to user density variations. Secondly, existing frameworks typically assume pre-defined user groupings or static service areas, neglecting the need for real-time adaptation to dynamic user distributions. Finally, despite the effects of coupling the UAV trajectory, STAR-RIS parameters, and hovering angles on system performance, their joint optimization remains unexplored.

In this paper, we consider a communication scenario assisted by an aerial STAR-RIS. In this scenario, the aerial STAR-RIS dynamically adjusts the hovering angle according to the user’s geographical location and communication needs to divide up the reflection and transmission areas. The following specific contributions are made by this paper:

We propose a UAV-mounted STAR-RIS-assisted communication service enhancement mechanism aimed at improving the channel quality for both edge users and occluded users. This mechanism demonstrates its efficacy by enhancing the efficiency and reliability of the communication system through the joint optimization of the STAR-RIS phase and amplitude, as well as the UAV’s flight trajectory and hovering angle. Additionally, the complexity of the optimization problem is addressed by formulating it as the minimization of the UAV’s total service time.
The joint optimization problem involving a STAR-RIS and UAVs in a complex, dynamic environment includes integer variables and non-convex constraints, making it a mixed-integer nonlinear programming problem that is challenging to solve using traditional methods. To address this complexity, this paper decouples the original optimization problem into two subproblems, total flight time optimization and total transmission time optimization, thereby obtaining a suboptimal solution.
The total flight time optimization problem employs the Chained Lin–Kernighan (CLK) algorithm to determine a trajectory that minimizes the flight time, following the delineation of the UAV’s service area using the DBSCAN algorithm. For the total transmission time optimization problem, the phase, amplitude, and UAV hovering angle of the STAR-RIS are optimized using the TD3 algorithm to minimize the transmission time.

The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 presents the system model and formulates the joint optimization of the STAR-RIS phase and amplitude, the UAV flight trajectory, and the hovering angle. Section 4 presents and analyzes the solution method. A performance evaluation is presented in Section 5, and conclusions are drawn in Section 6.

2. Related Work

In this section, a summary of the relevant work is provided, with a focus on three aspects: UAV-based communication networks, RIS-assisted UAV networks, and STAR-RIS-assisted wireless communication.

2.1. UAV-Based Communication Networks

Recent years have seen UAVs being used in various communication scenarios and playing different roles. In [16], the authors proposed a dual-role UAV application model, as an edge server to assist user devices in processing computing tasks, and as a relay to further offload computing tasks to access points. In [17], the application of UAVs in terrestrial sensor networks was explored, achieving the integration of information transmission and charging functions. In [18], a DQN algorithm based on deep reinforcement learning was proposed for UAV-assisted IoT data collection systems, which effectively reduced the system packet loss rate through the online optimization of UAV flight control and data scheduling strategies. In [19], the application prospects of UAV communications in the millimeter wave band are systematically outlined, with performance evaluation metrics and analysis methods elaborated and key technologies and potential solutions proposed for different application scenarios. In addressing the need for effective disaster communications, a UAV deployment scheme was proposed in [20] to ensure the maximum coverage of ground nodes. However, none of these studies fully considered the impact of RISs on the wireless channel.

2.2. RIS-Assisted UAV Networks

The utilization of RISs in UAV communication networks has been the subject of extensive research in the literature [21,22,23,24,25,26]. Research has focused on multiple metrics, including the enhancement of data transmission rates, the improvement of communication reliability, and the augmentation of communication security. In [21], a classic RIS-assisted UAV communication framework was explored, aiming to maximize the average achievable rate. This optimization was achieved by jointly adjusting the UAV trajectory and the RIS passive beamforming design. In [22], a multi-UAV combined RIS-assisted mobile edge computing (MEC) system was addressed. Using a multi-agent deep reinforcement learning framework, the study focused on minimizing system latency and ensuring fairness among user devices by optimizing the computing offloading strategy and UAV trajectory. In [23], secure computation in a UAV-RIS-assisted multi-layer MEC system was addressed, and a full-duplex active eavesdropper was introduced. In [24], UAV-assisted communication with 3D trajectories was investigated, and a D3QN-based optimization algorithm was proposed to jointly optimize bandwidth allocation, RIS phase shifts, and 3D coordinates to maximize energy efficiency. In [25], a cooperative ARIS-assisted MEC system with two UAVs was considered. To enhance the system energy efficiency, a DDQN algorithm was used to optimize UAV trajectories, ARIS phase shifts, computation offloading strategies, and resource allocation. In [26], the authors presented a multi-antenna covert communication system enhanced by UAV-RIS, with which they creatively optimized beamforming, RIS phase shifts, and the UAV trajectory to maximize the worst-case covert transmission rate, even under imperfect CSI conditions.

2.3. STAR-RIS-Assisted Wireless Networks

To address the limitation of traditional RISs, where users and base stations must be located on the same side, increasing attention is being directed towards the STAR-RIS. The resource allocation schemes for STAR-RIS-aided multi-carrier systems in both orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) scenarios are discussed in [27]. In [28], an optimization scheme for vehicular communication systems based on the STAR-RIS is proposed to improve data rates through the joint optimization of factors such as spectrum allocation and the amplitude and phase shift of STAR-RIS elements. However, the strong coupling characteristics of non-ideal CSI and non-ideal serial interference cancellation in practical communication scenarios make it impossible to accurately measure the safety performance of STAR-RIS-assisted NOMA transmission systems. To address this challenge, a fusion algorithm is proposed in [29] to maximize the system’s minimum safe transmission rate. In [30], the integration of the STAR-RIS with UAV NOMA communication networks was explored. This work aimed to maximize the user throughput by jointly optimizing UAV trajectories, the passive beamforming of the STAR-RIS, and time and power allocation, addressing the practical needs of disaster emergency communications. In [31], an MEC service enhancement scheme deploying the STAR-RIS on a UAV as a mobile relay is proposed. This approach combines the rapid deployment and mobility of UAVs with the advantages of the STAR-RIS to enhance spectral efficiency, expand coverage, and reduce the system’s energy consumption. A collaborative optimization scheme for emergency communication networks integrating the STAR-RIS and UAVs is also proposed in [32]. Experimental results demonstrate that this approach significantly enhances the average video streaming capability of users in disaster scenarios.

However, the majority of existing studies on STAR-RIS-aided UAV networks assume that the UAV service area is decentralized and that the UAV needs only to design an optimal flight path for the discrete service area. Moreover, the existing literature assumes that reflective and transmissive users are pre-determined, focusing optimization on the amplitude and phase shift of the STAR-RIS. However, these studies do not consider the impact of varying UAV hovering angles on STAR-RIS signal reflection and transmission. When the UAV hovering angle changes, the reflection and transmission service area of the STAR-RIS adjusts dynamically. This adjustment increases the number of users in the reflecting or transmitting area, which can cause significant co-channel interference and degrade system performance. Therefore, further exploration is required into the division of reflection and transmission regions when deploying the STAR-RIS on UAVs.

3. System Model and Problem Formulation

As shown in Figure 1, this paper considers a UAV service enhancement system supported by a STAR-RIS. The system consists of several user devices,

k, k \in K = {1, \dots, K}

, a ground base station (BS), and a UAV equipped with a STAR-RIS. The coordinates of the BS and of user k are denoted by

Z_{b} = [x_{b}, y_{b}, H_{b}]

and

Z_{k} = [x_{k}, y_{k}, 0]

respectively. The UAV u provides services to the users at different time intervals within a time period, T. For analysis purposes, the UAV flight period is divided into N equidistant time intervals, i.e.,

N = {1, \dots, n, \dots, N}

, with a step size of

Δ t

, so that the total time period is

T = N Δ t

. The STAR-RIS, consisting of M reflection/transmission units, can establish a good channel between the user and the base station. The phase of each reflection/transmission unit of the STAR-RIS can be controlled by the UAV, so the phase shift matrices of the STAR-RIS for reflection and transmission in time slot n can be expressed as follows [12]:

Θ_{rfl} [n] = diag (\sqrt{β_{rfl}^{1} [n]} e^{j θ_{rfl}^{1} [n]}, \dots, \sqrt{β_{rfl}^{M} [n]} e^{j θ_{rfl}^{M} [n]})

(1)

Θ_{rfr} [n] = diag (\sqrt{β_{rfr}^{1} [n]} e^{j θ_{rfr}^{1} [n]}, \dots, \sqrt{β_{rfr}^{M} [n]} e^{j θ_{rfr}^{M} [n]})

(2)

where

β_{rfl}^{m} [n], β_{rfr}^{m} [n] \in [0, 1], m \in M = {1, \dots, M}

represent the amplitudes of the reflection and transmission coefficients of the m-th element in time slot n, which satisfy β_{rfl}^{m} [n] + β_{rfr}^{m} [n] = 1, and

θ_{rfl}^{m} [n], θ_{rfr}^{m} [n] \in [0, 2 π)

represent the reflection and transmission phase shifts of the m-th element.
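The structure of Equations (1) and (2) can be illustrated with a minimal Python sketch. It only builds the diagonals of the two matrices from a reflection amplitude and two phase shifts per element, and derives the transmission amplitude from the constraint that the two amplitudes sum to 1. The function name `star_ris_diagonals` and its argument names are hypothetical and are not taken from the paper.

```python
import cmath
import math


def star_ris_diagonals(beta_rfl, theta_rfl, theta_rfr):
    """Return the diagonals of Theta_rfl[n] and Theta_rfr[n] for one time slot.

    beta_rfl  : reflection amplitudes, one value in [0, 1] per element
    theta_rfl : reflection phase shifts in [0, 2*pi), one per element
    theta_rfr : transmission phase shifts in [0, 2*pi), one per element
    The transmission amplitude is 1 - beta_rfl (energy conservation).
    """
    diag_rfl, diag_rfr = [], []
    for b, t_rfl, t_rfr in zip(beta_rfl, theta_rfl, theta_rfr):
        if not 0.0 <= b <= 1.0:
            raise ValueError("reflection amplitude must lie in [0, 1]")
        diag_rfl.append(math.sqrt(b) * cmath.exp(1j * t_rfl))
        diag_rfr.append(math.sqrt(1.0 - b) * cmath.exp(1j * t_rfr))
    return diag_rfl, diag_rfr


# Example with M = 3 elements (values chosen arbitrarily for illustration).
rfl, rfr = star_ris_diagonals([0.2, 0.5, 1.0], [0.0, math.pi / 2, math.pi], [0.0, 1.0, 2.0])
```

For every element, the squared magnitudes of the two returned coefficients add up to 1, which mirrors the amplitude constraint stated above.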

3.1. Mobile Model

Without loss of generality, all communication nodes are placed in a three-dimensional Cartesian coordinate system. To ensure safety and maintain reliable LoS connectivity between the UAV, BS, and users [33], it is assumed that the UAV provides services at a fixed altitude,

H_{u}

, and based on the systems proposed in [25,26,31,32], the UAV and STAR-RIS are treated as a rigidly connected entity, ensuring their horizontal coordinates and altitude remain identical. In the time slot n, the horizontal coordinates of the UAV are denoted as

Z_{u} [n] = [x_{u}^{n}, y_{u}^{n}]

, which satisfies

x_{u}^{n} \in [0, x_{u}^{max}], y_{u}^{n} \in [0, y_{u}^{max}]

. Furthermore, following the shortest path planned by the CLK algorithm (Equations (19)–(21)), the UAV flies at a constant speed of

v_{u}

between hover points, ensuring the minimization of the total flight time.

As illustrated in Figure 2, the STAR-RIS is vertically deployed beneath the UAV with its reflective surface oriented along the UAV’s heading direction, while the transmissive surface faces the opposite direction. It is assumed that the UAV can adjust the hover angle

ω_{u}^{n}

in the horizontal direction while hovering. The hovering angle determines how the STAR-RIS reflection/transmission service area is divided, which in turn changes the number of users within the reflection or transmission area. As the number of users in the reflection or transmission area increases, the co-channel interference between users increases significantly, degrading system performance. Therefore, optimizing

ω_{u}^{n}

is of paramount importance. In this paper, the orientation in which the surface normal vector of the STAR-RIS points toward the base station while the UAV hovers is taken as the reference angle. At this time, the surface normal vector can be expressed as

υ_{0} = \frac{(x_{b} - x_{u}^{n}, y_{b} - y_{u}^{n})}{\sqrt{{(x_{b} - x_{u}^{n})}^{2} + {(y_{b} - y_{u}^{n})}^{2}}}

(3)

When the UAV changes the hovering angle, the rotated normal vector can be obtained by applying the rotation matrix in the 2D plane to the reference normal vector:

υ = [\begin{matrix} cos ω_{u}^{n} & - sin ω_{u}^{n} \\ sin ω_{u}^{n} & cos ω_{u}^{n} \end{matrix}] \cdot υ_{0}

(4)

The adjustment range of the hovering angle is defined as follows:

(- \frac{π}{2}, \frac{π}{2})

.
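Equations (3) and (4) can be sketched in a few lines of Python. The sketch computes the reference normal pointing from the UAV to the BS and rotates it by the hovering angle. The helper `split_users` is an assumption added for illustration: it treats users on the side the rotated normal points to (the BS side at zero rotation) as reflection users and the rest as transmission users, which is one plausible reading of how the hovering angle partitions the service area. All function names are hypothetical.

```python
import math


def reference_normal(uav_xy, bs_xy):
    """Unit vector from the UAV toward the BS in the horizontal plane (Eq. (3)).

    Undefined when the UAV is directly above the BS (zero horizontal distance).
    """
    dx, dy = bs_xy[0] - uav_xy[0], bs_xy[1] - uav_xy[1]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm)


def rotated_normal(v0, omega):
    """Apply the 2D rotation matrix of Eq. (4) to the reference normal."""
    c, s = math.cos(omega), math.sin(omega)
    return (c * v0[0] - s * v0[1], s * v0[0] + c * v0[1])


def split_users(users_xy, uav_xy, normal):
    """Assumed partition rule: sign of the projection onto the surface normal."""
    reflect, transmit = [], []
    for k, (x, y) in enumerate(users_xy):
        side = (x - uav_xy[0]) * normal[0] + (y - uav_xy[1]) * normal[1]
        (reflect if side >= 0 else transmit).append(k)
    return reflect, transmit


v0 = reference_normal((0.0, 0.0), (10.0, 0.0))
v = rotated_normal(v0, math.pi / 4)  # hovering angle inside (-pi/2, pi/2)
groups = split_users([(5.0, 1.0), (-5.0, 1.0)], (0.0, 0.0), v0)
```

Under this assumed rule, changing the hovering angle moves users across the boundary between the two groups, which is the effect the optimization of the hovering angle is meant to exploit.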

3.2. Service User Model

In the proposed system, the UAV is required to assist the BS in serving users with transmission requirements. Specifically, the UAV assists users who are unable to establish an effective link with the base station. This includes users outside the base station’s coverage area and users within the coverage area whose link quality is insufficient to support reliable communication. Whether user k requires UAV assistance is denoted by

C_{k, u}

. The following equation specifically expresses this:

C_{k, u} = \begin{cases} 0, & ∥ Z_{b} - Z_{k} ∥_{2} < ρ_{b} \text{ and } off_{k, b} = 1, \\ 1, & \text{otherwise} \end{cases}

(5)

where

off_{k, b}

indicates whether an effective link can be established between the user with a transmission requirement and the base station. A value of

off_{k, b} = 1

indicates the establishment of an effective link between the user with a transmission requirement and the base station; conversely, a value of

off_{k, b} = 0

indicates the opposite.
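The service indicator can be sketched as a small Python function that follows the textual definition: a user needs UAV assistance unless it lies inside the coverage radius of the BS and can establish an effective direct link. The function and parameter names are hypothetical; `coverage_radius` plays the role of the threshold in Equation (5).

```python
import math


def needs_uav(user_xyz, bs_xyz, coverage_radius, has_effective_link):
    """Return 1 if the user must be served through the UAV, else 0.

    has_effective_link corresponds to off_{k,b} = 1 in the text.
    """
    in_coverage = math.dist(user_xyz, bs_xyz) < coverage_radius
    return 0 if (in_coverage and has_effective_link) else 1


bs = (0.0, 0.0, 20.0)
served_directly = needs_uav((30.0, 0.0, 0.0), bs, 100.0, True)
blocked_user = needs_uav((30.0, 0.0, 0.0), bs, 100.0, False)
edge_user = needs_uav((500.0, 0.0, 0.0), bs, 100.0, True)
```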

3.3. Communication Model

As previously stated, the primary focus of this paper is the UAV service enhancement system assisted by a STAR-RIS. Consequently, two links are considered in the proposed system: the link from the transmitting users to the aerial STAR-RIS (ASTAR-RIS) and the link from the ASTAR-RIS to the BS. It is assumed that only LoS channels exist between the BS and the users transmitting via the UAV, so the channel model retains only the LoS component of Rician fading. Consequently, the wireless channel gain

G_{r, b} [n]

from the ASTAR-RIS to the BS in the time slot n can be expressed as

G_{r, b} [n] = \sqrt{φ d_{r, b}^{- ξ_{r, b}} [n]} \sqrt{\frac{Υ}{1 + Υ}} g_{r, b}^{LoS} [n] \in C^{M \times 1}

(6)

where

ξ_{r, b}

represents the path loss exponent from the ASTAR-RIS to the base station,

d_{r, b} [n] = \sqrt{{(x_{u}^{n} - x_{b})}^{2} + {(y_{u}^{n} - y_{b})}^{2} + {(H_{u} - H_{b})}^{2}}

represents the distance from the ASTAR-RIS to the BS, and

g_{r, b}^{LoS} [n]

represents the deterministic LoS component between the ASTAR-RIS and the BS in time slot n. It can be expressed as [25]

g_{r, b}^{LoS} [n] = {[1, e^{- j \frac{2 π}{ν} \hat{d} cos ϕ_{r, b}^{n}}, \dots, e^{- j \frac{2 π}{ν} \hat{d} (M - 1) cos ϕ_{r, b}^{n}}]}^{T}

(7)

where

ν

represents the carrier wavelength,

\hat{d}

represents the spacing between elements, and

cos ϕ_{r, b}^{n} = \frac{(x_{u}^{n} - x_{b}) \cdot υ_{x} + (y_{u}^{n} - y_{b}) \cdot υ_{y}}{d_{r, b} [n]}

represents the cosine of the signal’s exit angle.

Due to the characteristics of the STAR-RIS, the user-to-ASTAR-RIS link is divided into two parts. For the time slot n, the wireless channel gain

G_{k, r} [n]

of the reflection/transmission path from the transmitting user k to the ASTAR-RIS can be expressed as

G_{k, r} [n] = \sqrt{φ d_{k, r}^{- ξ_{k, r}} [n]} \sqrt{\frac{Υ}{1 + Υ}} g_{k, r}^{LoS} [n] \in C^{M \times 1}

(8)

where

ξ_{k, r}

represents the path loss exponent from the user to the ASTAR-RIS,

d_{k, r} [n]

represents the distance from the user to the ASTAR-RIS, and

g_{k, r}^{LoS} [n]

, which is the deterministic LoS component between the user k and the ASTAR-RIS, is expressed as

g_{k, r}^{LoS} [n] = {[1, e^{- j \frac{2 π}{ν} \hat{d} cos ϕ_{k, r}^{n}}, \dots, e^{- j \frac{2 π}{ν} \hat{d} (M - 1) cos ϕ_{k, r}^{n}}]}^{T}

(9)

where

cos ϕ_{k, r}^{n} = \frac{(x_{k} - x_{u}^{n}) \cdot υ_{x} + (y_{k} - y_{u}^{n}) \cdot υ_{y}}{d_{k, r} [n]}

is the cosine of the angle of incidence of the signal. Therefore, the channel gain from the user to the BS in the reflected path can be expressed as

G_{k, b}^{rfl} [n] = G_{k, r}^{T} [n] Θ_{rfl} [n] G_{r, b} [n]

(10)

Similarly, the channel gain in the transmission path from the user to the BS can be expressed as

G_{k, b}^{rfr} [n] = G_{k, r}^{T} [n] Θ_{rfr} [n] G_{r, b} [n]

(11)

In summary, the achievable user-to-BS transmission rate can be expressed as

R_{k, b}^{δ} [n] = B_{k, b} \log_{2} (1 + \frac{P_{k} | G_{k, b}^{δ} [n] |^{2}}{\sum_{q = 1, q \neq k}^{\hat{j}_{δ}} P_{q} | G_{q, b}^{δ} [n] |^{2} + σ^{2}}), δ \in {rfl, rfr}

(12)

where

B_{k, b}

represents the bandwidth of the subcarrier assigned to the user by the base station.
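The chain from Equations (6) to (12) can be traced numerically with the following Python sketch, which uses only the standard library. It builds a LoS steering vector, scales it by the path loss and fading factors, forms the cascaded gain as the element-wise sum over the STAR-RIS coefficients, and evaluates the rate with co-channel interference from the other users on the same side. All function and parameter names are hypothetical; `phi` and `rician_k` simply stand in for the two scalar factors under the square roots, which this sketch treats as given constants.

```python
import cmath
import math


def steering_vector(num_elements, spacing, wavelength, cos_angle):
    """Deterministic LoS component for a uniform linear array (Eqs. (7), (9))."""
    phase_step = 2 * math.pi / wavelength * spacing * cos_angle
    return [cmath.exp(-1j * phase_step * m) for m in range(num_elements)]


def los_channel(distance, path_loss_exp, phi, rician_k, steering):
    """Scale the steering vector as in Eqs. (6) and (8)."""
    scale = math.sqrt(phi * distance ** (-path_loss_exp)) * math.sqrt(rician_k / (1 + rician_k))
    return [scale * g for g in steering]


def cascaded_gain(g_kr, theta_diag, g_rb):
    """Scalar gain G_kr^T * diag(theta) * G_rb (Eqs. (10), (11))."""
    return sum(a * t * b for a, t, b in zip(g_kr, theta_diag, g_rb))


def achievable_rate(bandwidth, p_k, gain_k, interferers, noise_power):
    """Eq. (12); interferers is a list of (power, gain) pairs on the same side."""
    interference = sum(p_q * abs(g_q) ** 2 for p_q, g_q in interferers)
    sinr = p_k * abs(gain_k) ** 2 / (interference + noise_power)
    return bandwidth * math.log2(1 + sinr)


# Two-element toy example: phases chosen so that both paths add coherently.
gain = cascaded_gain([1j, 1], [-1j, 1], [1, 1])
rate_alone = achievable_rate(1.0, 1.0, gain, [], 1.0)
rate_shared = achievable_rate(1.0, 1.0, gain, [(1.0, 1.0)], 1.0)
```

Adding an interferer to the same side lowers the rate of user k, which is why moving users between the reflection and transmission regions matters for the transmission time.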

3.4. Time Model

As previously stated, in this scenario, the UAV is required to assist the base station in serving users with transmission needs and formulate corresponding trajectories. Consequently, the total service time of the UAV is divided into two components: the flight time of the UAV and the time spent serving users with transmission needs.

Assuming that the flight path length of the UAV in the time slot n is

L_{u} [n]

, the flight time of the UAV in this time slot can be expressed as

D_{fly} [n] = \frac{L_{u} [n]}{v_{u}}

(13)

In instances where an obstruction hinders communication or the user’s location lies beyond the base station’s coverage radius, the quality of the wireless connection will be inadequate to satisfy the user’s demand for data transmission. Consequently, the data must be uploaded to the base station via the reflective or transmissive link of the STAR-RIS. This paper proposes the variable

γ_{k} [n]

, which serves to differentiate between users employing reflective and transmissive channels. Specifically, when

γ_{k} [n] = 1

, the user establishes a connection with the base station via the reflective link; when

γ_{k} [n] = 0

, the user establishes a connection via the transmissive link.

Assuming that the size of the task that the user k needs to upload is

S_{k}

, the transmission time

D_{k, b} [n]

can be expressed as

D_{k, b} [n] = γ_{k} [n] \frac{S_{k}}{R_{k, b}^{rfl} [n]} + (1 - γ_{k} [n]) \frac{S_{k}}{R_{k, b}^{rfr} [n]}

(14)

In summary, within time slot n, the total transmission time is the maximum of the transmission times of the users served by the UAV, which can be expressed as

D_{trans} [n] = \max_{k \in {1, \dots, K}} C_{k, u} D_{k, b} [n]

(15)
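Equations (13) to (15) reduce to a few arithmetic steps, sketched below in Python. Because the assignment variable is binary, the sketch selects the rate of the chosen link instead of evaluating both fractions, which gives the same value as Equation (14) and avoids dividing by the rate of the unused link. The dictionary keys and function names are hypothetical.

```python
def flight_time(path_length, speed):
    """Eq. (13): flight time for one time slot."""
    return path_length / speed


def user_transmission_time(task_bits, gamma, rate_rfl, rate_rfr):
    """Eq. (14) for a binary gamma: 1 = reflective link, 0 = transmissive link."""
    if gamma not in (0, 1):
        raise ValueError("gamma must be 0 or 1")
    rate = rate_rfl if gamma == 1 else rate_rfr
    return task_bits / rate


def slot_transmission_time(users):
    """Eq. (15): the slowest UAV-served user determines the slot transmission time."""
    times = [
        user_transmission_time(u["task_bits"], u["gamma"], u["rate_rfl"], u["rate_rfr"])
        for u in users
        if u["needs_uav"] == 1  # users with C_{k,u} = 0 contribute zero
    ]
    return max(times, default=0.0)


users = [
    {"task_bits": 8e6, "gamma": 1, "rate_rfl": 4e6, "rate_rfr": 1e6, "needs_uav": 1},
    {"task_bits": 1e6, "gamma": 0, "rate_rfl": 1e6, "rate_rfr": 2e6, "needs_uav": 1},
    {"task_bits": 9e9, "gamma": 1, "rate_rfl": 1e3, "rate_rfr": 1e3, "needs_uav": 0},
]
service_time = slot_transmission_time(users) + flight_time(100.0, 20.0)
```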

3.5. Problem Formulation

In this section, we formulate the optimization problem that addresses the users’ demand for high link quality. To this end, we jointly optimize the STAR-RIS phase

Θ = {Θ_{rfl} [n], Θ_{rfr} [n], n \in N}

, amplitude

β = {β_{rfl} [n], β_{rfr} [n], n \in N}

, UAV trajectory

Z = {Z_{u} [n], n \in N}

, hovering angle

ω = {ω_{u}^{n}, n \in N}

, and user assignment decision

γ = {γ_{k} [n], k \in K, n \in N}

, with the aim of minimizing the total UAV service time. The optimization problem can be modeled as follows:

P : \min_{θ, β, Z, ω, γ} \sum_{n = 1}^{N} (D_{trans} [n] + D_{fly} [n])

\begin{array}{lll} \text{s.t.} & C1 : & ∥\frac{Z_{u} [n + 1] - Z_{u} [n]}{Δ t}∥ = v_{u}, \forall n \in N \\ & C2 : & 0 \leq x_{u}^{n} \leq x_{u}^{max}, \forall n \\ & C3 : & 0 \leq y_{u}^{n} \leq y_{u}^{max}, \forall n \\ & C4 : & γ_{k} [n] \in \{0, 1\}, \forall k, n \\ & C5 : & β_{rfl}^{m} [n] + β_{rfr}^{m} [n] = 1, \forall m, n \\ & C6 : & θ_{rfl}^{m} [n], θ_{rfr}^{m} [n] \in [0, 2 π), \forall m, n \\ & C7 : & ω_{u}^{n} \in (- \frac{π}{2}, \frac{π}{2}), \forall n \end{array}

(16)

where C1 fixes the UAV’s flight speed between consecutive time slots at v_{u}, while C2 and C3 delineate the boundaries of the UAV’s operational range. These constraints guarantee that the UAV operates within the prescribed velocity and physical limits while ensuring safety. C4 is a binary constraint on the user assignment variable, ensuring that each transmitting user utilizes only a single link (reflective or transmissive) per time slot. Constraints C5 and C6 pertain to the amplitude coefficients and phase shifts of the STAR-RIS, stipulating that the reflection and transmission coefficients of each element sum to 1 and that each phase shift falls within the permissible range. Constraint C7 restricts the UAV hovering angle. Problem P is a mixed-integer nonlinear programming (MINLP) problem due to the involvement of high-dimensional mixed variables and non-convex constraints, posing significant challenges for direct optimization. First, the coupling between discrete trajectory variables and continuous RIS parameters creates a highly non-convex solution space. Second, the simultaneous optimization of all the variables demands prohibitive computational resources.

Therefore, we decompose the original problem into two subproblems, each of which is solved separately. The first subproblem encompasses UAV service area division and UAV trajectory planning. As illustrated in Figure 3, we first determine the distribution of the service areas using the DBSCAN algorithm and then determine the UAV trajectory using the CLK algorithm. The second subproblem pertains to the optimization of the amplitude and phase of the STAR-RIS and the hovering angle of the UAV. This subproblem is addressed by employing the TD3 algorithm.

4. Proposed Optimization Algorithm

In this section, we provide a comprehensive exposition of the algorithm proposed to address the two aforementioned subproblems.

4.1. UAV Time of Flight Optimization Algorithm

The objective of subproblem P1 is to minimize the flight time of the UAV. Therefore, P1 can be expressed as

P1 : \min_{Z} \sum_{n = 1}^{N} D_{fly} [n]

\begin{array}{lll} \text{s.t.} & C1 : & ∥\frac{Z_{u} [n + 1] - Z_{u} [n]}{Δ t}∥ = v_{u}, \forall n \in N \\ & C2 : & 0 \leq x_{u}^{n} \leq x_{u}^{max}, \forall n \\ & C3 : & 0 \leq y_{u}^{n} \leq y_{u}^{max}, \forall n \end{array}

(17)

Given that the trajectory of the UAV is a dynamically changing process and the coverage area of the UAV is limited, subproblem P1 as a whole is an NP-hard problem, and it is challenging to obtain a solution in polynomial time. To address this challenge, this paper proposes a two-pronged approach. Firstly, it employs the DBSCAN method to segment the non-uniformly distributed users into multiple relatively independent sets. Secondly, it employs the CLK algorithm to optimize the UAV trajectory, thereby identifying the shortest path for the UAV to serve all user sets.

4.1.1. DBSCAN-Based Service Area Classification

DBSCAN is a density-based clustering algorithm that functions independently of a pre-specified number of clusters and is effective in delineating clusters of users from independent users. The algorithm defines the neighborhood of a point by two main parameters, the minimum number of samples, MinPts, and the radius

ε

, and divides the dataset according to the density of the points [34].

The algorithm process is as follows.

Initially, the parameters MinPts and

ε

of DBSCAN are established in accordance with the specified scenario. In this paper,

ε

is set to the radius of the service range of the UAV, and MinPts is set to 2 to guarantee that a minimum of two users are incorporated into the neighborhood of the core point. Subsequently, the density of each transmitting user in the scenario, i.e., the number of users within the specified range

ε

, is calculated. If this density is greater than or equal to MinPts, the user is designated as a core point. Clustering starts from a core point: all users within its neighborhood are assigned to the same cluster, and the cluster is expanded through further core points until no more users can be added. Isolated users, i.e., noise points, are treated separately.

Pursuant to the aforementioned steps, it is possible to divide the transmission users into a number of relatively independent sets. Each set can be represented by

Ω^{o}

,

o \in 𝒪 = {1, \dots, O}

, and the users in each set,

{Ω_{1}^{o}, Ω_{2}^{o}, \dots, Ω_{\hat{k}}^{o}}

, have similar geographic locations and network environments. In this context, \hat{k} denotes the number of users within the current set.
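The clustering steps described above can be sketched as a compact, standard-library implementation of DBSCAN in Python. This is an illustrative textbook version and not the authors' code; the function name `dbscan` and the toy user coordinates are assumptions. A point counts itself as part of its own neighborhood, so MinPts = 2 means a core point has at least one other user within the radius.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks isolated users."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = neighbors(j)
            if len(nbrs_j) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(nbrs_j)
    return labels


users_xy = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (50, 50)]
labels = dbscan(users_xy, eps=2.0, min_pts=2)
```

In practice, a library implementation such as the `DBSCAN` class in scikit-learn (with its `eps` and `min_samples` parameters) would normally be preferred over a hand-written loop; the version here only makes the core-point and expansion logic explicit.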

4.1.2. CLK-Based UAV Path Optimization

According to the aforementioned division of the service area of the UAV, a series of hovering points can be obtained. In the strategy of this paper, it is necessary to plan the flight trajectory of the UAV in order to determine the shortest path without violating the constraints, so that the UAV can visit all the hovering points and reach the end point in the shortest time. This problem can be transformed into a no-return traveling salesman problem (NTSP), which is solved in this paper by using the CLK algorithm. The CLK algorithm is a path optimization algorithm that has been enhanced to combine a chain restart mechanism with a local search strategy. This combination of techniques has been shown to be advantageous in the context of solving the large-scale TSP and its variants [35]. The CLK algorithm was selected due to its ability to produce solutions of a quality that is almost optimal, in addition to its capacity to converge more rapidly than stochastic algorithms (e.g., the ant colony algorithm) [36].

It is established that the initial and terminal points of a UAV are designated as

Z_{s t a r t}

and

Z_{g o a l}

, respectively, with the hovering point designated as

{Z_{u}^{1}, Z_{u}^{2}, \dots, Z_{u}^{o}}

. It is noteworthy that each hovering point,

Z_{u}^{o}

, along with

Z_{s t a r t}

and

Z_{g o a l}

, is regarded as a node. The hovering point of the UAV corresponds to the center of the current cluster, which can be denoted as

Z_{u}^{o} = \frac{1}{| Ω^{o} |} \sum_{k \in Ω^{o}} Z_{k}

(18)

The edge

e_{o, o^{'}}

between two nodes represents the path from node

Z_{u}^{o}

to node

Z_{u}^{o^{'}}

. The weight of the edge

d (e_{o, o^{'}})

is the Euclidean distance between the two nodes, which can be expressed as

d (e_{o, o^{'}}) = | | Z_{u}^{o} - Z_{u}^{o^{'}} | |

(19)
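As a small worked example of Equations (18) and (19), the sketch below computes a hovering point as the centroid of a cluster and the edge weight as the Euclidean distance between two hovering points. The helper names and the sample coordinates are hypothetical.

```python
import math

def hover_point(cluster_positions):
    """Eq. (18): centroid of the user coordinates in one cluster."""
    n = len(cluster_positions)
    dims = len(cluster_positions[0])
    return tuple(sum(p[d] for p in cluster_positions) / n for d in range(dims))

def edge_weight(z_a, z_b):
    """Eq. (19): Euclidean distance between two nodes."""
    return math.dist(z_a, z_b)

# Two hypothetical clusters of user positions (metres).
clusters = [
    [(0.0, 0.0), (4.0, 0.0)],
    [(10.0, 8.0), (10.0, 12.0), (13.0, 10.0)],
]
hover_points = [hover_point(c) for c in clusters]
weight = edge_weight(hover_points[0], hover_points[1])
```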

In contrast to the TSP, the NTSP does not necessitate a return to the initial point. Consequently, the introduction of a virtual node, devoid of specific coordinates, becomes imperative. The cost of traversing from this virtual node to the other nodes is either negligible or a substantial value [37]. This guarantees that in the ultimate solution, the virtual node manifests solely as the starting or ending point of the path. The node transfer cost matrix for the NTSP problem can thus be obtained as shown in Table 1:

The subsequent step involves the formulation of a specific solution strategy. Initially, an arbitrary path is generated. Within the LK algorithm, the switching operation corresponds to the $ϖ$-opt operation, that is, the selection of edges from the current path to be exchanged to form a new path. Assuming the current path is

π = (Z_{v i r t u a l}, Z_{s t a r t}, Z_{u}^{1}, \dots, Z_{u}^{o}, Z_{g o a l}, Z_{v i r t u a l})

, the switching operation

χ_{π \to π^{'}}^{ϖ}

can be defined as the selection of

ϖ

pairs of nodes and the reconnection of these nodes to form a new path,

π^{'}

. To quantify the efficacy of the switching operation, it is essential to calculate the benefit of switching, defined as the change in the path length induced by the switching operation. Specifically, the total length of the path

π

is the sum of the weights of the edges between the nodes, which can be expressed as follows:

L (π) = \sum_{(Z_{u}^{o}, Z_{u}^{o^{'}}) \in P} d (e_{o, o^{'}})

(20)

The switching gain, therefore, is defined as the difference in length between the new path and the original path after the switching operation.

U_{π \to π^{'}}^{ϖ} = L (π) - L (π^{'})

(21)

A positive value indicates that the path following the switching operation is preferable to the original path, and the algorithm retains the new path

π^{'}

. Conversely, a negative value signifies that the original path remains unchanged.
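The acceptance rule in Equations (20) and (21) can be illustrated with the simplest switching operation, a 2-opt segment reversal on an open path. This is a sketch under assumptions: the function names are hypothetical, and a full LK move may exchange more than two edges.

```python
import math

def path_length(path):
    """Eq. (20): sum of Euclidean edge weights along an open path."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def two_opt_move(path, i, j):
    """Reverse path[i..j]; both endpoints stay fixed when 1 <= i <= j <= len(path) - 2."""
    return path[:i] + path[i:j + 1][::-1] + path[j + 1:]

def switching_gain(path, new_path):
    """Eq. (21): positive when the new path is shorter than the original."""
    return path_length(path) - path_length(new_path)

path = [(0, 0), (2, 0), (1, 0), (3, 0)]      # visits the two middle nodes out of order
candidate = two_opt_move(path, 1, 2)         # reverse the middle segment
gain = switching_gain(path, candidate)       # 5 - 3 = 2
if gain > 0:
    path = candidate                         # keep the shorter path
```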

The CLK algorithm is a refinement of the LK algorithm that incorporates a restart chain mechanism. A critical aspect of the CLK algorithm involves path perturbation, which aims to disrupt the local optimal structure of the current path by employing perturbation strategies. This process generates new initial solutions, facilitating the continuation of the LK algorithm’s optimization process. This perturbation process typically utilizes random

ϖ - opt

operations to rearrange the paths. The CLK algorithm generates a series of locally optimal solutions through the successive application of the LK algorithm and the perturbation operations. These solutions are linked to a chain by perturbation and optimization until the path remains static, at which point the final UAV flight trajectory is obtained. The complete algorithm flow is shown in Algorithm 1.

Algorithm 1 UAV time of flight optimization algorithm

1: Input: $Z_{k}, ε, MinPts, Z_{start}, Z_{goal}$
2: Mark all users as unvisited.
3: repeat
4: For each user, find all neighboring users within its $ε$-neighborhood.
5: if the number of neighboring users $\geq MinPts$ then
6: Mark the user as a core point.
7: Group users adjacent to the core point into the same cluster.
8: end if
9: until there are no unvisited users.
10: Obtain user clusters $Ω^{o}$.
11: Calculate hover points using formula (18), and calculate edge weights using formula (19).
12: Randomly generate an initial path $π$ and take it as the current optimal path $π^{*}$.
13: for iter = 1 to $\max iter$ do
14: Apply the LK algorithm to optimize the current path $π^{*}$ and obtain the local optimal path $π^{*}$.
15: if $U_{π} > 0$ then
16: Update the path $π \leftarrow π^{*}$.
17: end if
18: Perturb the current solution $π^{*} \leftarrow$ new initial solution.
19: end for
20: Output: $Z$
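The chained structure of the path search (local optimization, perturbation, re-optimization) can be sketched as below. This is a simplified stand-in, not the CLK algorithm itself: a first-improvement 2-opt search replaces the full LK move, a random swap of two interior nodes replaces the kick, and the start and goal are pinned as path endpoints instead of using a virtual node. All function names and the sample coordinates are hypothetical.

```python
import math
import random

def path_length(path):
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def local_search(path):
    """First-improvement 2-opt on the interior nodes (stand-in for LK)."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(path) - 2):
            for j in range(i + 1, len(path) - 1):
                cand = path[:i] + path[i:j + 1][::-1] + path[j + 1:]
                if path_length(path) - path_length(cand) > 1e-12:
                    path, improved = cand, True
    return path

def perturb(path, rng):
    """Kick: swap two random interior nodes to leave the current local optimum."""
    p = path[:]
    if len(p) > 3:
        i, j = rng.sample(range(1, len(p) - 1), 2)
        p[i], p[j] = p[j], p[i]
    return p

def chained_search(start, goal, hover_points, max_iter=50, seed=0):
    rng = random.Random(seed)
    interior = hover_points[:]
    rng.shuffle(interior)                       # random initial visiting order
    best = local_search([start] + interior + [goal])
    for _ in range(max_iter):
        cand = local_search(perturb(best, rng))
        if path_length(best) - path_length(cand) > 1e-12:
            best = cand                         # keep only strict improvements
    return best

route = chained_search((0, 0), (50, 0), [(30, 0), (10, 0), (40, 0), (20, 0)])
```

For hovering points that lie on a line between the start and the goal, the search returns them in order of increasing distance from the start, which is the shortest open path.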

4.2. Time of Transmission Optimization Algorithm

Subproblem P2 aims to optimize user transmission times by calibrating the phases and amplitudes of the STAR-RIS, the UAV’s hovering angle, and the user assignment strategy, all while maximizing the UAV’s trajectory efficiency. Thus, the P2 problem can be formally stated as

P2: \min_{θ, β, Z, ω, γ} \sum_{n = 1}^{N} (D_{trans} [n] + D_{fly} [n])

\begin{matrix} s.t. & C1: & γ_{k} [n] \in {0, 1}, \forall k, n \\ & C2: & β_{rfl}^{m} [n] + β_{rfr}^{m} [n] = 1, \forall m, n \\ & C3: & θ_{rfl}^{m} [n], θ_{rfr}^{m} [n] \in [0, 2 π), \forall m, n \\ & C4: & ω_{u}^{n} \in (- \frac{π}{2}, \frac{π}{2}), \forall n \end{matrix}

(22)
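A feasibility check for constraints C1 to C4 in one time slot might look like the sketch below. It assumes that C2 couples the reflection and refraction amplitudes of each STAR-RIS element, so that they sum to one; the function name, argument layout, and tolerance are hypothetical.

```python
import math

def is_feasible(gamma, beta_rfl, beta_rfr, theta_rfl, theta_rfr, omega, tol=1e-9):
    """Check C1-C4 of problem P2 for a single time slot."""
    c1 = all(g in (0, 1) for g in gamma)                          # binary assignment
    c2 = all(abs(a + b - 1.0) <= tol for a, b in zip(beta_rfl, beta_rfr))
    c3 = all(0.0 <= t < 2 * math.pi for t in list(theta_rfl) + list(theta_rfr))
    c4 = -math.pi / 2 < omega < math.pi / 2                       # open interval
    return c1 and c2 and c3 and c4

ok = is_feasible(
    gamma=[0, 1],
    beta_rfl=[0.3, 0.5], beta_rfr=[0.7, 0.5],
    theta_rfl=[0.0, 3.0], theta_rfr=[1.0, 6.0],
    omega=0.2,
)
```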

However, the phase and amplitude configurations of the STAR-RIS, as well as the hovering angle of the UAV, are subject to dynamic and continuous change. Consequently, due to the non-convexity and high-dimensional complexity of the problem, obtaining its direct solution using conventional methods is usually infeasible. To address this challenge, the TD3 algorithm is employed in this paper as a solution. As illustrated in Figure 4, the TD3 algorithm is a deep reinforcement learning method specifically designed for solving problems in continuous action space, and it has become a popular choice for solving high-dimensional nonlinear optimization problems due to its excellent performance [38].

The TD3 algorithm operates on a problem formulated as a Markov decision process (MDP). The MDP can be defined as a four-tuple, <

S, A, R, P

>, which contains the state space, the action space, the reward, and the state transition probability. In each time slot, n, the agent acquires the current state

s^{n}

from the environment, executes an action,

a^{n}

, interacts with the environment, and consequently arrives at the next state

s^{n + 1}

and receives a reward,

r^{n}

. The following are the specific designs for the state space, action space, and rewards for the agent:

State Space: The set of the agent’s states is denoted by S, the agent’s state in time slot n is represented by $s^{n}$ , and $s^{n} \in S$ is composed of the current UAV’s coordinates, the STAR-RIS phase and amplitude, and the UAV’s hovering angle, which is defined as

$s^{n} = {Z_{u} [n], Θ_{rfl} [n], Θ_{rfr} [n], β_{rfl} [n], β_{rfr} [n], ω_{u}^{n}}$

(23)
Action Space: The set of actions of the agent is denoted by A, the agent’s action in time slot n is indicated by $a^{n}$ , and $a^{n} \in A$ includes the amplitude factor ${β_{rfl}^{m} [n], β_{rfr}^{m} [n]}$ and phase shift factor ${θ_{rfl}^{m} [n], θ_{rfr}^{m} [n]}$ of the STAR-RIS, as well as the hovering angle ${ω_{u}^{n}}$ of the UAV. These can all be defined as increments of the current value, with the specific representation being $β_{δ} [n + 1] = β_{δ} [n] ⊙ Δ β_{δ} [n]$ , $θ_{δ} [n + 1] = θ_{δ} [n] ⊙ Δ θ_{δ} [n], δ \in {rfl, rfr}$ , and $ω_{u}^{n + 1} = ω_{u}^{n} ⊙ Δ ω_{u}^{n}$ , where ⊙ denotes the Hadamard product and $Δ$ denotes the increment.
Rewards: The objective of this study is to minimize the transmission time of users in the cluster through the implementation of an optimization strategy. To this end, negative values are employed as rewards, which encourages the agent to prioritize the reduction of the transmission time. The limitations concerning the maximum speed and movement range of the UAV must also be taken into consideration. Consequently, the reward function can be delineated as follows:

$R^{n} = - D_{trans} [n] + C_{0}$

(24)

where $C_{0}$ indicates a constant that is imposed as a penalty when a constraint is violated.
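The action update and the reward can be sketched as follows. The update follows the written form x[n + 1] = x[n] ⊙ Δx[n] literally as an element-wise product; the projection back onto the feasible set (clipping the amplitude to [0, 1], deriving the refraction amplitude from the reflection amplitude, wrapping the phase to [0, 2π), and clipping the hovering angle) is an assumption made here, as are the function names and the penalty value.

```python
import math

def step_configuration(beta_rfl, theta, omega, d_beta, d_theta, d_omega):
    """Apply x[n+1] = x[n] * dx[n] element-wise, then project onto the feasible set."""
    new_beta_rfl = [min(1.0, max(0.0, b * d)) for b, d in zip(beta_rfl, d_beta)]
    new_beta_rfr = [1.0 - b for b in new_beta_rfl]          # amplitudes sum to one
    new_theta = [(t * d) % (2 * math.pi) for t, d in zip(theta, d_theta)]
    limit = math.pi / 2 - 1e-6                              # stay inside the open interval
    new_omega = min(limit, max(-limit, omega * d_omega))
    return new_beta_rfl, new_beta_rfr, new_theta, new_omega

def reward(d_trans, violated, penalty=-10.0):
    """Eq. (24): negative transmission time, plus the constant C_0 on a violation."""
    return -d_trans + (penalty if violated else 0.0)

b_rfl, b_rfr, th, om = step_configuration(
    beta_rfl=[0.5, 0.9], theta=[1.0, 4.0], omega=0.2,
    d_beta=[1.2, 1.5], d_theta=[2.0, 2.0], d_omega=1.5,
)
```

The second element shows the projection at work: 0.9 × 1.5 exceeds one, so the reflection amplitude is clipped to 1 and the refraction amplitude becomes 0.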

TD3 represents an extension of the DDPG algorithm, likewise grounded in the Actor–Critic structure. Its algorithm framework encompasses an Actor policy network,

μ_{κ}

, with the parameters

κ

, and two Critic value networks,

Q_{ζ_{1}}

and

Q_{ζ_{2}}

, with the parameters

ζ_{1}

and

ζ_{2}

, as well as their corresponding target networks, namely the target Actor policy network

μ_{κ^{'}}

and the target Critic value networks

Q_{{ζ^{'}}_{1}}

and

Q_{{ζ^{'}}_{2}}

. The target networks enhance the stability of the learning process, and the smaller of the two Q values output by the target value networks is selected when computing the update target. This approach is intended to alleviate the overestimation of the Q function. The following paragraphs describe the specific implementation of the TD3 algorithm.

At each time step, the agent calculates the output action based on the current state

s^{n}

through the policy network

μ_{κ}

:

a^{n} = μ_{κ} (s^{n}) + ρ, ρ \sim N (0, σ)

(25)

where

ρ

is Gaussian noise. The incorporation of noise is a strategy employed to circumvent entrapment in a local optimum and enhance the efficiency of exploration. The agent uses this noisy action to interact with the environment, thereby acquiring a reward,

R^{n}

, and the next state

s^{n + 1}

. It then retrieves a state transition sample,

(s^{n}, a^{n}, R^{n}, s^{n + 1})

, and stores it in the experience replay buffer

B

as a training dataset for the online network.
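A minimal sketch of the exploration step in Equation (25) and of the experience replay buffer is given below. The class and function names are hypothetical, the second argument of `gauss` is treated as a standard deviation, and clipping the noisy action to a bounded range is an assumption that the text does not state.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, R, s_next) transitions."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)   # oldest samples are dropped first

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

def noisy_action(policy, state, sigma, rng=random, low=-1.0, high=1.0):
    """Eq. (25): policy output plus zero-mean Gaussian noise, clipped to [low, high]."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in policy(state)]

rng = random.Random(0)
policy = lambda s: [0.0, 0.5]                # placeholder for the Actor network
buffer = ReplayBuffer(capacity=3)
for n in range(5):
    a = noisy_action(policy, [float(n)], sigma=0.1, rng=rng)
    buffer.store([float(n)], a, -1.0 * n, [float(n + 1)])
batch = buffer.sample(2, rng)
```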

In the event that the number of state transition samples in the experience replay buffer exceeds the preset capacity, a mini-batch containing J state transition samples,

(s^{j}, a^{j}, R^{j}, s^{j + 1})

, is randomly sampled from it and used to train the online network. In contrast to the DDPG algorithm, the TD3 algorithm introduces a regularization strategy by adding truncated Gaussian noise to the actions output by the target Actor network.

a^{' j + 1} = μ_{κ^{'}} (s^{j + 1}) + ρ^{'}, ρ^{'} \sim clip (N (0, σ^{2}), - c, c)

(26)

This Gaussian noise serves to smooth the output Q values of the two target Critic networks, thereby reducing the occurrence of overfitting. The target Q value used in the temporal difference (TD) error is then calculated using the Bellman equation based on the state–action value function:

Q_{tar} = R^{j} + η min (Q_{{ζ^{'}}_{1}} (s^{j + 1}, a^{' j + 1}), Q_{{ζ^{'}}_{2}} (s^{j + 1}, a^{' j + 1}))

(27)

where

η

denotes the discount factor, and

{ζ_{1}}^{'}

and

{ζ_{2}}^{'}

represent the parameters of the two target Critic networks. The parameters of the Critic network are updated by calculating the TD error to reduce the difference between the estimated Q value and the target Q value and by establishing two mean square error (MSE) loss functions for gradient descent:

L (ζ_{l}) = \frac{1}{J} \sum_{j} {(Q_{ζ_{l}} (s^{j}, a^{j}) - Q_{tar})}^{2}, l = 1, 2

(28)

ζ_{l} \leftarrow ζ_{l} - λ_{Critic} \nabla_{ζ_{l}} L (ζ_{l}), l = 1, 2

(29)

where

Q_{ζ_{l}} (s^{j}, a^{j})

represents the estimated Q value output by the two Critic value networks and

λ_{C r i t i c}

is the learning rate.
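The Critic update in Equations (26) to (29) can be traced with the short sketch below. It is illustrative only: the target policy and target Critics are passed in as plain callables, the Critic used for the gradient step is a linear model Q = w · x chosen so that the gradient can be written by hand, and all names and numbers are hypothetical.

```python
import random

def smoothed_target_action(target_policy, s_next, sigma, c, rng=random):
    """Eq. (26): target action plus Gaussian noise clipped to [-c, c]."""
    return [a + max(-c, min(c, rng.gauss(0.0, sigma))) for a in target_policy(s_next)]

def td_target(r, s_next, a_next, q1_target, q2_target, eta):
    """Eq. (27): reward plus the discounted minimum of the two target Critics."""
    return r + eta * min(q1_target(s_next, a_next), q2_target(s_next, a_next))

def critic_loss(q_values, targets):
    """Eq. (28): mean squared error between Q estimates and TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(targets)

def critic_step(w, features, targets, lr):
    """Eq. (29) for a linear Critic Q = w . x: one gradient descent step on the MSE loss."""
    batch = len(targets)
    q = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    grad = [2.0 / batch * sum((qj - yj) * x[k] for qj, yj, x in zip(q, targets, features))
            for k in range(len(w))]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

rng = random.Random(1)
a_next = smoothed_target_action(lambda s: [0.2, -0.4], None, sigma=10.0, c=0.05, rng=rng)
y = td_target(1.0, None, a_next, lambda s, a: 4.0, lambda s, a: 3.0, eta=0.5)  # 1 + 0.5 * 3
w_new = critic_step([0.0], [[1.0], [1.0]], [2.0, 2.0], lr=0.25)
```

Taking the minimum of the two target Critics means the more pessimistic estimate (3.0 here) enters the target, which is how the twin-Critic design limits overestimation.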

In the context of reinforcement learning, the objective of the agent is to maximize the cumulative reward. The Q value can be utilized to measure the long-term cumulative reward of taking a specific action in the current state. During the training process, the Actor network is updated by maximizing the cumulative expected return, which is contingent upon the evaluation of the Critic network. If the Critic network demonstrates significant instability, the updating of the Actor network will be substantially impacted, resulting in system oscillation.

Therefore, the Actor network is updated with a delay, and its loss function and parameter update formula are as follows:

J (κ) = - \frac{1}{J} \sum_{j} Q_{ζ_{1}} (s^{j}, μ_{κ} (s^{j}))

(30)

κ \leftarrow κ - λ_{Actor} \nabla_{κ} J (κ)

(31)

Finally, we use a soft update strategy to synchronize the parameters of the Actor network and Critic network with the corresponding target networks:

ζ_{l}^{'} \leftarrow τ ζ_{l} + (1 - τ) ζ_{l}^{'}, l = 1, 2

(32)

κ^{'} \leftarrow τ κ + (1 - τ) κ^{'}

(33)

where

τ

is the soft update coefficient and $κ^{'}$ represents the parameters of the target Actor policy network. The complete algorithm flow is shown in Algorithm 2.

Algorithm 2 TD3-based training algorithm

1: Input: $〈 Z_{i}, Z_{k}, 〈 λ_{Actor}, λ_{Critic} 〉, η, J 〉$
2: Initialize networks, including the UAV’s Actor network $μ_{κ}$ , Critic networks $Q_{ζ_{1}}$ and $Q_{ζ_{2}}$ , and their corresponding target networks, and initialize the replay buffer $B$ .
3: for episode = 1 to M do
4: Initialize UAV’s current state $s_{t}^{i}$ .
5: for $n = 1$ to $N_{do}$ do
6: Select action $a^{n} = μ_{κ} (s^{n}) + ρ$ based on $s^{n}$ .
7: UAV acts according to the action $a^{n}$ : set speed, angle, STAR-RIS phase, and amplitude.
8: Obtain $R^{n}$ and observe $s^{n + 1}$ at the next time step.
9: Store the sample $(s^{n}, a^{n}, R^{n}, s^{n + 1})$ in $B$ .
10: if the number of samples in $B$ exceeds the preset threshold then
11: Randomly sample a mini-batch of J samples from $B$ .
12: Update the parameters $ζ_{1}$ and $ζ_{2}$ of Critic networks using formula (29).
13: if n mod $A_{update}$ = 0 then
14: Update the Actor network parameters using formula (31).
15: Perform soft updates on the target network parameters using formulas (32) and (33).
16: end if
17: end if
18: end for
19: end for
20: Output: $Θ, β, ω, γ$
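Two pieces of the training loop lend themselves to a compact sketch: the delayed Actor update and the soft update of the target parameters. The version below treats network parameters as flat lists of floats and follows the form of Equation (33), in which the target moves a fraction τ toward the online parameters; the function names and the delay value are hypothetical.

```python
def soft_update(target, online, tau):
    """Target <- tau * online + (1 - tau) * target, element-wise."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]

def actor_update_due(n, actor_delay):
    """Delayed policy update: True once every `actor_delay` Critic updates."""
    return n % actor_delay == 0

target_params = [0.0, 2.0]
online_params = [1.0, 4.0]
update_steps = []
for n in range(1, 7):
    # ... Critic update would happen here on every step ...
    if actor_update_due(n, actor_delay=2):
        update_steps.append(n)
        target_params = soft_update(target_params, online_params, tau=0.5)
```

With a small τ the target networks change slowly, which keeps the TD target from shifting abruptly between consecutive Critic updates.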

4.3. Computational Complexity

Combining the computational complexities of the DBSCAN clustering algorithm, the CLK path optimization algorithm, and the TD3 algorithm, the overall time complexity of the UAV path optimization algorithm can be expressed as follows: First, the DBSCAN algorithm is used for the density-based clustering of users, with a time complexity of

O (N log N)

, where N is the number of users, indicating the computational cost of the user clustering process. Next, the CLK algorithm is employed to optimize the UAV’s flight path, utilizing a local search strategy to adjust the path, with a time complexity of

O (M^{2})

, where M is the number of path nodes, reflecting the computational cost of the path optimization process. Finally, the TD3 algorithm is used to optimize the UAV’s trajectory and the phase and amplitude configurations of the STAR-RIS, with a time complexity of

O (T_{e p} \cdot T_{s t} \cdot J \cdot ((| S | + | A |) \cdot L_{B} + L_{A} \cdot L_{B}^{2}))

, where

T_{e p}

is the number of training episodes,

T_{s t}

is the number of steps per episode, J is the mini-batch size,

| S |

and

| A |

are the dimensions of the state space and action space of the agent, respectively, and

L_{A}

and

L_{B}

are the number of layers and the size of each layer in the network, representing the computational cost of the reinforcement learning model training process. By integrating the complexities of the DBSCAN, CLK, and TD3 algorithms, the overall time complexity of the algorithm is found to be

O (T_{e p} \cdot T_{s t} \cdot J \cdot ((| S | + | A |) \cdot L_{B} + L_{A} \cdot L_{B}^{2}) + N log N + M^{2})

.
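To get a feel for which term dominates, the expression above can be evaluated for concrete sizes. The sketch below is only an order-of-magnitude aid, since big-O notation hides constant factors; the function names and every numeric value are hypothetical and are not taken from the paper's parameter table.

```python
import math

def td3_cost(t_ep, t_st, batch, state_dim, action_dim, layers, width):
    """T_ep * T_st * J * ((|S| + |A|) * L_B + L_A * L_B^2)."""
    return t_ep * t_st * batch * ((state_dim + action_dim) * width + layers * width ** 2)

def total_cost(n_users, n_nodes, **td3):
    """TD3 term plus the DBSCAN (N log N) and CLK (M^2) terms."""
    return td3_cost(**td3) + n_users * math.log2(n_users) + n_nodes ** 2

small = total_cost(8, 3, t_ep=1, t_st=1, batch=1,
                   state_dim=2, action_dim=1, layers=2, width=4)
large = total_cost(40, 10, t_ep=500, t_st=100, batch=64,
                   state_dim=100, action_dim=81, layers=3, width=256)
```

For any realistic network size the TD3 training term is many orders of magnitude larger than the clustering and path planning terms.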

5. Numerical Results

In this section, the proposed method is compared with several benchmark methods to evaluate its feasibility and effectiveness. The subsequent sections describe the dataset, the experimental setup, and the benchmark methods of the simulation experiment. Ultimately, the experimental results are analyzed.

5.1. Simulation Setting

In this study, a TD3 model was constructed based on the PyTorch 2.4.1 framework to implement the deep reinforcement learning algorithm. The server was configured with an Intel Silver (Santa Clara, CA, USA) 4210R processor (2.4 GHz), dual NVIDIA GeForce (Santa Clara, CA, USA) RTX 3080 Ti GPUs, 128 GB of memory, and a 4 TB solid-state drive.

A coverage area with a scene size of 300 m × 300 m was designed, and the position of the BS,

Z_{b}

, was set to

(150, 150, 10)

, with a coverage radius of

ρ_{b}

= 200 m. Since areas closer to the base station usually have better channel conditions, the users of the UAV service were mainly randomly distributed in the peripheral areas away from the base station and in areas outside the coverage of the base station. The task data size of the users was uniformly distributed within [1 MB, 1.5 MB]. The initial and terminal positions of the UAV flight were designated as

(0, 0, 40)

and (300, 300, 40), respectively, and the UAV moved at a constant speed of

v_{u}

= 20 m/s. The remaining channel parameters and TD3 training parameters are enumerated in Table 2.
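The stated settings give a simple lower bound on the flight time: a UAV that flew straight from the initial to the terminal position at 20 m/s would need about 21.2 s, so any route through the hovering points takes at least that long. The sketch below computes this bound and draws task sizes from the stated range; the helper names and the random seed are hypothetical.

```python
import math
import random

Z_START = (0.0, 0.0, 40.0)        # initial UAV position (m)
Z_GOAL = (300.0, 300.0, 40.0)     # terminal UAV position (m)
V_U = 20.0                        # constant UAV speed (m/s)

def flight_time(route, speed=V_U):
    """Total flight time along a sequence of waypoints at constant speed."""
    dist = sum(math.dist(a, b) for a, b in zip(route, route[1:]))
    return dist / speed

def sample_task_sizes(num_users, rng):
    """Task data sizes drawn uniformly from [1 MB, 1.5 MB]."""
    return [rng.uniform(1.0, 1.5) for _ in range(num_users)]

direct = flight_time([Z_START, Z_GOAL])            # 300 * sqrt(2) / 20, about 21.2 s
sizes = sample_task_sizes(40, random.Random(0))
```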

5.2. Simulation Results

Figure 5 presents the optimized user clustering, the UAV trajectory, and the division of the reflection and transmission areas of each cluster (service area) for a given number of users, UAV service radius, and number of STAR-RIS units.

With the proposed algorithm, the UAV can determine its hovering points and trajectory from the service radius and the user distribution. As illustrated in the figure, the UAV adaptively adjusts the hovering angle according to the conditions in each cluster, which enables it to cover all users in the cluster and minimize the transmission time. In particular, when a single scattered user is present, the UAV travels to the user’s location and provides service through the reflective link.

As illustrated in Figure 6, the learning curve of the TD3 algorithm trained on each cluster demonstrated a clear upward trend with an increase in the number of training rounds. Despite initial fluctuations, the final reward stabilized, indicating the TD3 algorithm’s capacity to adapt effectively to a complex communication environment assisted by a STAR-RIS, thereby identifying an optimal strategy for the system. It is noteworthy that although the majority of clusters converged after approximately 200 rounds, cluster 4 exhibited a slower convergence rate and greater variability during training. This is attributable to the higher user density in cluster 4, necessitating additional iterations for the agent to converge in such a complex environment.

To assess the effectiveness of the proposed UAV-mounted STAR-RIS system in enhancing services, a comparative analysis was conducted with three baseline schemes:

Scheme 1: Referring to [39], we considered a service enhancement scheme with a UAV equipped with an RIS and optimized it by combining maximum likelihood estimation and maximum correlation estimation. In the figure, we use “RIS” to represent this scheme.
Scheme 2: Referring to [31], we optimized the amplitude and phase of the STAR-RIS based on the proximal policy optimization (PPO) algorithm and did not consider the hovering angle of the UAV. In the figure, we use “SRP” to represent this scheme.
Scheme 3: Referring to [40], we optimized the amplitude and phase of the STAR-RIS without considering the hovering angle using the DDPG algorithm. In the figure, we use “SRD” to represent this scheme.

Similarly, for convenience, we use “SRT” to represent the scheme proposed in this paper.

Figure 7 illustrates the variations in the total system service time and transmission time as a function of the number of users under different mechanisms, with a fixed service radius and the number of STAR-RIS units fixed at 40. As the number of users increased, the total service time increased. As shown in Figure 7a, a comparison of the four mechanisms revealed that the mechanism proposed in this paper demonstrated a clear advantage when the number of users was high. Specifically, as shown in Figure 7b, when the number of users was set at 40, the transmission times of the proposed mechanism were 19.43%, 11.19%, and 15.82% lower than those of the RIS, SRP, and SRD mechanisms, respectively. This finding indicates that when the number of users in the cluster increases and inter-link interference between users increases, dynamically adjusting the STAR-RIS direction to allocate reflective and transmissive users can effectively improve system performance. Furthermore, as the number of users approached 30, a substantial increase in the total system service time was observed, primarily attributable to the increase in the flight time of the UAV surpassing the reduction in the transmission time of users. This further underscores the substantial impact of UAV flight path planning on the total system service time.

Figure 8 illustrates the changes in the total service time and transmission time with varying UAV service radii. As shown in Figure 8a, when the number of users was fixed at 40 and the number of STAR-RIS units was also set at 40, an increase in the service radius resulted in a gradual decrease in the number of UAV hovering points and a continuous decrease in the flight path length. As the flight time constituted a substantial fraction of the total service time, the increase in the user transmission time caused by the larger user population within each cluster was outweighed by the reduction in flight time, so the total service time decreased overall. A comparison of the four mechanisms revealed that as the service radius increased, the transmission time of the proposed mechanism consistently remained low. Specifically, as shown in Figure 8b, when the UAV service radius was set to 50 m, the transmission time of the proposed mechanism was reduced by approximately 25.24% compared to the RIS mechanism, by around 11.19% compared to the SRP mechanism, and by about 15.82% compared to the SRD mechanism.

Figure 9 illustrates the system service time and transmission time with a constant service radius and 40 users as the number of STAR-RIS units varies. Figure 9a demonstrates a decline in the total service time as the number of STAR-RIS units rose. Specifically, when the number of STAR-RIS units was increased from 20 to 120, the total system service time was reduced by approximately 28.0% on average. This suggests that increasing the number of units on the metasurface can effectively enhance the transmission rate of users. As shown in Figure 9b, among the four mechanisms, the RIS mechanism had the longest transmission time, which was due to the significant co-channel interference between users. The proposed mechanism in this paper consistently achieved the lowest transmission times, with reductions of 28.03%, 12.79%, and 20.15%, respectively, compared to the average transmission times of the RIS, SRP, and SRD mechanisms.

6. Conclusions

In this paper, we proposed a UAV service enhancement mechanism based on onboard STAR-RIS assistance, with the objective of improving the channel quality for both edge and occluded users. The proposed approach involves the joint optimization of the STAR-RIS phase and amplitude, the UAV trajectory, and the UAV hovering angle, thereby converting the complex optimization problem into a minimization of the total service time of the UAV. Initially, the UAV service area is segmented using the DBSCAN algorithm, and subsequently, the optimal trajectory is derived using the CLK algorithm to minimize the flight duration. Finally, the TD3 algorithm is employed to optimize the STAR-RIS phase and amplitude and the UAV hovering angle to minimize the total transmission time. The experimental results demonstrate that the proposed mechanism offers substantial advantages over traditional methodologies in terms of reducing the system service time and enhancing the user experience.

Author Contributions

Conceptualization, Y.X. and J.Y.; methodology, Y.X.; software, Y.X.; validation, J.Y. and C.X.; formal analysis, J.Y.; investigation, J.Y. and Y.X.; resources, H.Y.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, J.Y.; visualization, C.X.; supervision, H.Y.; project administration, J.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by The Natural Science Foundation of Guangxi (No. 2022GXNSFBA035642), in part by the Guangxi Key Research and Development Project under Grant No. Gui Ke AB23075164, in part by the Science and Technology Research Program of the Guangxi Science and Technology Planning Project (No. Gui Ke AD21220161), in part by the National Natural Science Foundation of China (Grant 62061003, Grant 62466004), and in part by The Basic Ability Enhancement Program for Young and Middle-aged Teachers of Guangxi (No. 2022KY0325, No. 2021KY0364).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available upon reasonable request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kadir, E.A.; Shubair, R.; Abdul Rahim, S.K.; Himdi, M.; Kamarudin, M.R.; Rosa, S.L. B5G and 6G: Next Generation Wireless Communications Technologies, Demand and Challenges. In Proceedings of the 2021 International Congress of Advanced Technology and Engineering (ICOTEN), Taiz, Yemen, 4–5 July 2021; pp. 1–6.
2. Xia, X.; Fattah, S.M.M.; Babar, M.A. A survey on UAV-enabled edge computing: Resource management perspective. ACM Comput. Surv. 2023, 56, 1–36.
3. Pan, C.; Ren, H.; Deng, Y.; Elkashlan, M.; Nallanathan, A. Joint Blocklength and Location Optimization for URLLC-Enabled UAV Relay Systems. IEEE Commun. Lett. 2019, 23, 498–501.
4. Mozaffari, M.; Saad, W.; Bennis, M.; Nam, Y.H.; Debbah, M. A Tutorial on UAVs for Wireless Networks: Applications, Challenges, and Open Problems. IEEE Commun. Surv. Tutor. 2019, 21, 2334–2360.
5. Ahmed, S.; Kamal, A.E. Sky’s the Limit: Navigating 6G with ASTAR-RIS for UAVs Optimal Path Planning. In Proceedings of the 2023 IEEE Symposium on Computers and Communications (ISCC), Gammarth, Tunisia, 9–12 July 2023; pp. 582–587.
6. Saad, W.; Bennis, M.; Chen, M. A vision of 6G wireless systems: Applications, trends, technologies, and open research problems. IEEE Netw. 2019, 34, 134–142.
7. Huang, C.; Zappone, A.; Alexandropoulos, G.C.; Debbah, M.; Yuen, C. Reconfigurable intelligent surfaces for energy efficiency in wireless communication. IEEE Trans. Wirel. Commun. 2019, 18, 4157–4170.
8. Guo, H.; Liang, Y.C.; Chen, J.; Larsson, E.G. Weighted sum-rate maximization for reconfigurable intelligent surface aided wireless networks. IEEE Trans. Wirel. Commun. 2020, 19, 3064–3076.
9. Wang, C.; Chen, X.; An, J.; Xiong, Z.; Xing, C.; Zhao, N.; Niyato, D. Covert communication assisted by UAV-IRS. IEEE Trans. Commun. 2022, 71, 357–369.
10. Shang, B.; Shafin, R.; Liu, L. UAV swarm-enabled aerial reconfigurable intelligent surface (SARIS). IEEE Wirel. Commun. 2021, 28, 156–163.
11. Li, M.; Tao, X.; Li, N.; Wu, H. Energy-efficient covert communication with the aid of aerial reconfigurable intelligent surface. IEEE Commun. Lett. 2022, 26, 2101–2105.
12. Xu, J.; Liu, Y.; Mu, X.; Dobre, O.A. STAR-RISs: Simultaneous transmitting and reflecting reconfigurable intelligent surfaces. IEEE Commun. Lett. 2021, 25, 3134–3138.
13. Sun, M.; Xu, X.; Qin, X.; Zhang, P. AoI-Energy-Aware UAV-Assisted Data Collection for IoT Networks: A Deep Reinforcement Learning Method. IEEE Internet Things J. 2021, 8, 17275–17289.
14. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
15. Liu, X.; Xu, J.; Zheng, K.; Zhang, G.; Liu, J.; Shiratori, N. Throughput Maximization With an AoI Constraint in Energy Harvesting D2D-Enabled Cellular Networks: An MSRA-TD3 Approach. IEEE Trans. Wirel. Commun. 2025, 24, 1448–1466.
16. Hu, X.; Wong, K.K.; Yang, K.; Zheng, Z. UAV-assisted relaying and edge computing: Scheduling and trajectory optimization. IEEE Trans. Wirel. Commun. 2019, 18, 4738–4752.
17. Jiang, R.; Xiong, K.; Yang, H.C.; Fan, P.; Zhong, Z.; Letaief, K.B. On the coverage of UAV-assisted SWIPT networks with nonlinear EH model. IEEE Trans. Wirel. Commun. 2021, 21, 4464–4481.
18. Li, K.; Ni, W.; Tovar, E.; Guizani, M. Joint flight cruise control and data collection in UAV-aided Internet of Things: An onboard deep reinforcement learning approach. IEEE Internet Things J. 2020, 8, 9787–9799.
19. Xiao, Z.; Zhu, L.; Liu, Y.; Yi, P.; Zhang, R.; Xia, X.G.; Schober, R. A survey on millimeter-wave beamforming enabled UAV communications and networking. IEEE Commun. Surv. Tutor. 2021, 24, 557–610.
20. Lin, N.; Liu, Y.; Zhao, L.; Wu, D.O.; Wang, Y. An adaptive UAV deployment scheme for emergency networking. IEEE Trans. Wirel. Commun. 2021, 21, 2383–2398.
21. Li, S.; Duo, B.; Yuan, X.; Liang, Y.C.; Di Renzo, M. Reconfigurable intelligent surface assisted UAV communication: Joint trajectory design and passive beamforming. IEEE Wirel. Commun. Lett. 2020, 9, 716–720.
22. Wang, S.; Song, X.; Song, T.; Yang, Y. Fairness-aware computation offloading with trajectory optimization and phase-shift design in RIS-assisted multi-UAV MEC network. IEEE Internet Things J. 2024, 11, 20547–20561.
23. Zhou, Y.; Ma, Z.; Liu, G.; Zhang, Z.; Yeoh, P.L.; Vucetic, B.; Li, Y. Secure multi-layer MEC systems with UAV-enabled reconfigurable intelligent surface against full-duplex eavesdropper. IEEE Trans. Commun. 2023, 72, 1565–1577.
24. Yao, Y.; Lv, K.; Huang, S.; Xiang, W. 3D Deployment and Energy Efficiency Optimization Based on DRL for RIS-assisted Air-to-Ground Communications Networks. IEEE Trans. Veh. Technol. 2024, 73, 14988–15003.
25. Duo, B.; He, M.; Wu, Q.; Zhang, Z. Joint dual-UAV trajectory and RIS design for ARIS-assisted aerial computing in IoT. IEEE Internet Things J. 2023, 10, 19584–19594.
26. Lin, S.; Xu, Y.; Wang, H.; Ding, G. Multi-Antenna Covert Communication Assisted by UAV-RIS With Imperfect CSI. IEEE Trans. Wirel. Commun. 2024, 23, 13841–13855.
27. Wu, C.; Mu, X.; Liu, Y.; Gu, X.; Wang, X. Resource Allocation in STAR-RIS-Aided Networks: OMA and NOMA. IEEE Trans. Wirel. Commun. 2022, 21, 7653–7667.
28. Aung, P.S.; Nguyen, L.X.; Tun, Y.K.; Han, Z.; Hong, C.S. Deep reinforcement learning based joint spectrum allocation and configuration design for STAR-RIS-assisted V2X communications. IEEE Internet Things J. 2023, 11, 11298–11311.
29. Li, M.; Wang, Y.; Wang, S.; Zhang, H. Performance optimization of physical layer security in STAR-RIS aided NOMA system. J. Commun. 2024, 45, 214–225.
30. Lei, J.; Zhang, T.; Mu, X.; Liu, Y. NOMA for STAR-RIS Assisted UAV Networks. IEEE Trans. Commun. 2024, 72, 1732–1745.
31. Aung, P.S.; Nguyen, L.X.; Tun, Y.K.; Han, Z.; Hong, C.S. Aerial STAR-RIS empowered MEC: A DRL approach for energy minimization. IEEE Wirel. Commun. Lett. 2024, 13, 1409–1413.
32. Khan, N.; Ahmad, A.; Alwarafy, A.; Shah, M.A.; Lakas, A.; Azeem, M.M. Efficient Resource Allocation and UAV Deployment in STAR-RIS and UAV-Relay Assisted Public Safety Networks for Video Transmission. IEEE Open J. Commun. Soc. 2025, 1, 1–17.
33. Li, L.; Guan, W.; Zhao, C.; Su, Y.; Huo, J. Trajectory Planning, Phase Shift Design, and IoT Devices Association in Flying-RIS-Assisted Mobile Edge Computing. IEEE Internet Things J. 2024, 11, 147–157.
34. Wang, D.; Hu, P.; Du, J.; Zhou, P.; Deng, T.; Hu, M. Routing and scheduling for hybrid truck-drone collaborative parcel delivery with independent and truck-carried drones. IEEE Internet Things J. 2019, 6, 10483–10495.
35. Yuan, B.; He, R.; Ai, B.; Chen, R.; Zhang, H.; Liu, B. Service Time Optimization for UAV Aerial Base Station Deployment. IEEE Internet Things J. 2024, 11, 38000–38011.
36. Applegate, D.; Cook, W.; Rohe, A. Chained Lin-Kernighan for large traveling salesman problems. INFORMS J. Comput. 2003, 15, 82–92.
37. Hsu, Y.H.; Gau, R.H. Reinforcement learning-based collision avoidance and optimal trajectory planning in UAV communication networks. IEEE Trans. Mob. Comput. 2020, 21, 306–320.
38. Zheng, C.; Pan, K.; Dong, J.; Chen, L.; Guo, Q.; Wu, S.; Luo, H.; Zhang, X. Multi-Agent Collaborative Optimization of UAV Trajectory and Latency-Aware DAG Task Offloading in UAV-Assisted MEC. IEEE Access 2024, 12, 42521–42534.
39. Varshney, N.; De, S. AoA-based low complexity beamforming for aerial RIS assisted communications at mmWaves. IEEE Commun. Lett. 2023, 27, 1545–1549.
40. Guo, Y.; Fang, F.; Cai, D.; Ding, Z. Energy-efficient design for a NOMA assisted STAR-RIS network with deep reinforcement learning. IEEE Trans. Veh. Technol. 2022, 72, 5424–5428.

Figure 1. STAR-RIS-assisted UAV edge service enhancement system.

Figure 2. Relationship between the UAV hovering angle and the reflection/transmission service areas.

Figure 3. The total optimization process.

Figure 4. Optimization process based on TD3 algorithm.

Figure 5. Service scenario optimization results.

Figure 6. Service scenario optimization results.

Figure 7. Service and transmission time versus the number of users: (a) total service time under the different mechanisms; (b) transmission time under the different mechanisms.

Figure 8. Service and transmission time versus the UAV service radius: (a) total service time under the different mechanisms; (b) transmission time under the different mechanisms.

Figure 9. Service and transmission time versus the number of STAR-RIS units: (a) total service time under the different mechanisms; (b) transmission time under the different mechanisms.

Table 1. Node transfer cost matrix.

From \ To	$Z_{virtual}$	$Z_{start}$	$Z_{goal}$	$Z_{u}^{o}$	$Z_{u}^{o'}$
$Z_{virtual}$	∞	0	∞	∞	∞
$Z_{start}$	∞	∞	$d(e_{s,g})$	$d(e_{s,o})$	$d(e_{s,o'})$
$Z_{goal}$	0	$d(e_{g,s})$	∞	$d(e_{g,o})$	$d(e_{g,o'})$
$Z_{u}^{o}$	∞	∞	$d(e_{o,g})$	∞	$d(e_{o,o'})$
$Z_{u}^{o'}$	∞	∞	$d(e_{o',g})$	$d(e_{o',o})$	∞
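
The structure of Table 1 can be illustrated with a short Python sketch. It assumes that rows are the departure node and columns the arrival node, and that $d(\cdot)$ is the Euclidean distance between node positions; both are readings of the table rather than statements taken from it. The function names, the coordinates, and the brute-force search are hypothetical stand-ins (the brute force replaces the CLK solver only to keep the example self-contained and is practical for a handful of nodes at most).

```python
import math
from itertools import permutations

INF = math.inf
VIRTUAL, START, GOAL = 0, 1, 2  # matrix indices; hover points follow from index 3


def build_cost_matrix(start, goal, hover_points):
    """Build a Table 1 style cost matrix (row = from, column = to)."""
    nodes = [start, goal] + list(hover_points)  # real nodes, matrix index = list index + 1
    n = len(nodes) + 1                          # +1 for the virtual node
    cost = [[INF] * n for _ in range(n)]
    cost[VIRTUAL][START] = 0.0                  # the tour can only leave the virtual node via start
    cost[GOAL][VIRTUAL] = 0.0                   # and can only re-enter it from goal
    for i in range(1, n):
        for j in range(1, n):
            if i == j:
                continue                        # no self-loops
            if j == START and i != GOAL:
                continue                        # only the goal row has a finite entry into start
            cost[i][j] = math.dist(nodes[i - 1], nodes[j - 1])
    return cost


def cheapest_tour(cost):
    """Exhaustive search over closed tours that start at the virtual node."""
    n = len(cost)
    best_len, best = INF, None
    for perm in permutations(range(1, n)):
        order = (VIRTUAL,) + perm
        length = sum(cost[order[k]][order[(k + 1) % n]] for k in range(n))
        if length < best_len:
            best_len, best = length, order
    return best, best_len


# Hypothetical layout: start, goal, and two hover points on a line.
cost = build_cost_matrix((0, 0), (10, 0), [(3, 0), (6, 0)])
tour, length = cheapest_tour(cost)
print(tour, length)
```

Because the only finite exit from the virtual node leads to the start and the only finite entry comes from the goal, any finite-cost closed tour corresponds to an open path from start to goal that visits every hover point once.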

Table 2. List of simulation parameters.

Parameter	Value
UAV altitude, $H_{u}$	40 m
Time slot length, $Δ t$	0.5 s
Number of STAR-RIS units, $M$	40
Carrier frequency (wavelength $ν$ ≈ 0.4 m)	750 MHz
Element separation gap, $\hat{d}$	$ν / 2$
AWGN power, $σ^{2}$	−174 dBm/Hz
Path loss at 1 m, $φ$	−30 dBm
Path loss exponent, $ξ_{r, b}, ξ_{k, r}$	2.2
Bandwidth, $B_{k, b}$	10 MHz
Rician factor, $Υ$	10 dB
User transmission power, $P_{k}$	0.1 W
Replay buffer size, $B$	20,000
Batch size, $J$	256
Learning rate, $λ_{Actor}, λ_{Critic}$	0.002
Soft update factor, $τ$	0.05
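
Several quantities used in a link-budget calculation follow directly from the entries of Table 2. The Python sketch below derives them under two assumptions that are not stated in the table: the 750 MHz entry is read as the carrier frequency (so the wavelength is the speed of light divided by it), and the −174 dBm/Hz entry is read as a noise power spectral density that is integrated over the 10 MHz bandwidth. All variable and function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def watt_to_dbm(power_w):
    """Convert a power in watts to dBm."""
    return 10.0 * math.log10(power_w) + 30.0


def db_to_linear(value_db):
    """Convert a ratio in dB to linear scale."""
    return 10.0 ** (value_db / 10.0)


# Values copied from Table 2.
carrier_hz = 750e6           # assumed to be the carrier frequency
bandwidth_hz = 10e6
noise_psd_dbm_per_hz = -174.0
user_tx_power_w = 0.1
rician_factor_db = 10.0

# Derived quantities.
wavelength_m = SPEED_OF_LIGHT / carrier_hz   # about 0.4 m
element_gap_m = wavelength_m / 2.0           # half-wavelength spacing, about 0.2 m
noise_power_dbm = noise_psd_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)
user_tx_power_dbm = watt_to_dbm(user_tx_power_w)
rician_factor_linear = db_to_linear(rician_factor_db)

print(f"wavelength      = {wavelength_m:.3f} m")
print(f"element gap     = {element_gap_m:.3f} m")
print(f"noise power     = {noise_power_dbm:.1f} dBm")
print(f"user Tx power   = {user_tx_power_dbm:.1f} dBm")
print(f"Rician K factor = {rician_factor_linear:.1f} (linear)")
```

Under these assumptions the in-band noise power is about −104 dBm and the user transmit power is 20 dBm, which gives a quick sanity check on signal-to-noise ratios computed from the table.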

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
