Article

A Novel DRL-Transformer Framework for Maximizing the Sum Rate in Reconfigurable Intelligent Surface-Assisted THz Communication Systems

by Pardis Sadatian Moghaddam 1, Sarvenaz Sadat Khatami 2, Francisco Hernando-Gallego 3 and Diego Martín 4,*

1 Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
2 Department of Data Science Engineering, University of Houston, Houston, TX 77204, USA
3 Department of Applied Mathematics, Escuela de Ingeniería Informática de Segovia, Universidad de Valladolid, 40005 Segovia, Spain
4 Department of Computer Science, Escuela de Ingeniería Informática de Segovia, Universidad de Valladolid, 40005 Segovia, Spain
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9435; https://doi.org/10.3390/app15179435
Submission received: 24 July 2025 / Revised: 16 August 2025 / Accepted: 22 August 2025 / Published: 28 August 2025

Abstract

Terahertz (THz) communication is a key technology for sixth-generation (6G) networks, offering ultra-high data rates, low latency, and massive connectivity. However, the THz band faces significant propagation challenges, including high path loss, molecular absorption, and susceptibility to blockage. Reconfigurable intelligent surfaces (RISs) have emerged as an effective solution to overcome these limitations by reconfiguring the wireless environment through passive beam steering. In this work, we propose a novel framework, namely the optimized deep reinforcement learning transformer (ODRL-Transformer), to maximize the sum rate in RIS-assisted THz systems. The framework integrates a Transformer encoder for extracting temporal and contextual features from sequential channel observations, a DRL agent for adaptive beamforming and phase shift control, and a hybrid biogeography-based optimization (HBBO) algorithm for tuning the hyperparameters of both modules. This design enables efficient long-term decision-making and improved convergence. Extensive simulations of dynamic THz channel models demonstrate that ODRL-Transformer outperforms other optimization baselines in terms of the sum rate, convergence speed, stability, and generalization. The proposed model achieved an error rate of 0.03, strong robustness, and fast convergence, highlighting its potential for intelligent resource allocation in next-generation RIS-assisted THz networks.

1. Introduction

Sixth-generation (6G) wireless communication is poised to become the cornerstone of future intelligent infrastructures, promising to deliver unprecedented performance in terms of data rate, latency, connectivity density, and reliability [1,2,3,4,5]. Unlike its predecessors, 6G is expected to support visionary applications such as immersive extended reality (XR) [6,7], backscatter communication (BC) [8,9,10,11], holographic telepresence [12], ultra-high-speed vehicular communication [13,14,15,16], and massive-scale machine-type communication [17,18,19]. Achieving these ambitious targets necessitates a paradigm shift in the use of spectrum resources, especially toward higher-frequency bands that offer significantly larger bandwidths. Among these, the terahertz (THz) band, typically ranging from 0.1 to 10 THz, has been identified as a key enabler for 6G due to its ability to provide ultra-broad bandwidths on the order of several hundred gigahertz [20,21,22]. Taking advantage of this vast spectral space, and unlike conventional sub-6 GHz and millimeter-wave (mmWave) systems, THz communication has the potential to achieve data rates exceeding 100 Gbps, thereby addressing the extreme throughput demands of emerging 6G services. Moreover, the short wavelengths associated with THz signals enable highly directional transmissions and ultra-dense spatial reuse, aligning well with the beam-centric and energy-efficient vision of the upcoming 6G and Internet of Things (IoT) networks [23,24,25].
Despite its immense potential, THz communication also brings about fundamental challenges that threaten its practical deployment in 6G networks. Due to its extremely high carrier frequency, THz waves suffer from severe free-space path loss, high penetration loss through common building materials, and strong molecular absorption, especially from atmospheric water vapor. These phenomena drastically limit the effective transmission range and degrade the signal quality in non-line-of-sight (NLoS) or dynamic environments [26,27,28,29]. Moreover, THz propagation is highly sensitive to blockages caused by small obstacles such as walls, furniture, or even the human body, which can result in sudden and deep signal drops. The highly directional nature of THz beams, while beneficial for spatial reuse, also introduces alignment challenges between transmitters and receivers, particularly in mobile or rapidly changing scenarios [30]. Together, these limitations create a harsh propagation environment that demands novel techniques for improving link reliability, extending coverage, and enabling adaptive communication strategies [31,32,33,34].
To address these limitations and unlock the full potential of THz communication in 6G networks, reconfigurable intelligent surfaces (RISs) have emerged as a promising enabling technology [35,36,37]. An RIS consists of a large number of low-cost, passive reflecting elements whose electromagnetic response can be dynamically controlled to alter the propagation environment. By adjusting the phase shifts of each element, an RIS can redirect, focus, or scatter incident waves toward desired directions without requiring active RF chains or complex hardware [38,39,40]. This passive beamforming capability makes RISs highly attractive for THz systems, where power consumption, hardware complexity, and propagation impairments are critical concerns [41,42,43,44]. In the context of 6G, RISs can be strategically deployed to create virtual line-of-sight (LoS) paths, bypass obstacles, compensate for severe path loss, and enhance spatial coverage, particularly in dense urban or indoor environments, where direct THz links are frequently blocked [45,46]. As a result, integrating RISs into THz systems represents a paradigm shift from adapting to the wireless environment to actively reconfiguring it, aligning perfectly with the programmable and energy-efficient vision of future 6G networks [47,48].

1.1. Related Works

In recent years, the integration of RISs with THz communication has received growing attention, driven by their combined ability to overcome severe propagation losses and enable energy-efficient, high-capacity links in future 6G wireless systems. Su et al. [49] addressed the beam split effect in RIS-THz systems and proposed a sub-connected RIS architecture with joint phase-delay precoding to enhance wideband array gain. Le and Alouini [50] analyzed RIS-THz performance over α–μ fading and derived closed-form expressions for outage probability, ergodic capacity, and energy efficiency, while also proposing a low-complexity power and RIS activation scheme. To extend THz coverage under high path loss, Huang et al. [51] proposed a DRL-based hybrid beamforming scheme for multi-hop RIS-THz networks, achieving significant range gains. Karacora et al. [35] introduced a robust mixed-criticality coding scheme that leverages both direct and RIS-assisted paths to reduce queuing delays while preserving throughput in blocked THz environments.
In the context of secure communications, Yuan et al. [52] investigated secrecy performance in an RIS-assisted THz non-terrestrial network, optimizing phase shifts to maximize the ergodic secrecy rate under imperfect CSI and pointing errors. Pan et al. [53] tackled near-field localization and channel estimation for RIS-THz systems, proposing a Fresnel-based NF-JCEL algorithm that enhances angular and distance resolution in XL-RIS deployments. Du et al. [54] proposed a swarm intelligence-based RIS phase optimization approach and characterized the SNDR and SNR for RIS-THz links modeled with a fluctuating two-ray distribution. Yildirim et al. [55] presented a multi-RIS hybrid beamforming design based on particle swarm and deep learning methods, enhancing the achievable rate while reducing hardware complexity in THz MIMO systems. Focusing on dense RIS configurations, Wan et al. [56] analyzed holographic RIS-based THz MIMO systems, offering a closed-loop channel estimation scheme using angular delay compressive sensing to reduce pilot overhead. Mehrabian and Wong [57] proposed a transformer-based unsupervised learning framework for joint spectrum, precoding, and phase shift design in multi-user RIS-THz MIMO, outperforming traditional baselines in terms of sum rate performance.
In mixed RF-THz systems, Yin et al. [58] studied active RIS-aided relaying with amplify-and-forward and decode-and-forward protocols, deriving performance metrics and confirming that an active RIS improves the diversity order and reliability. Velez et al. [59] assessed RIS integration in sub-THz and mmWave bands for post-5G and post-6G scenarios using OFDM and joint precoding-phase shift design, demonstrating enhanced throughput in both indoor and outdoor environments. Pan et al. [60] proposed a computing power network architecture based on RIS-UAV-assisted THz NOMA for the IoT, using an SAC-based resource allocation algorithm to reduce energy and latency. Zhang et al. [61] introduced fixed true time delays (FTTDs) and a dynamic switch-phase shifter RIS design to combat beam squint in wideband RIS-THz systems. Finally, Do-Duy et al. [62] maximized the sum rate in RIS-aided THz-NOMA using joint AP power and RIS phase shift optimization, achieving notable throughput gains over conventional schemes.

1.2. Research Gaps and Motivations

In RIS-aided THz 6G systems, the ability to maximize the achievable sum rate becomes paramount not only to exploit the ultra-wideband capacity of the THz spectrum but also to ensure that performance gains scale efficiently with user density and mobility. At the same time, due to stringent power constraints at both the base station (BS) and the RIS, energy-efficient operation is critical to maintain sustainable and low-latency communications. The dual requirements of high throughput and power-aware design present a complex and highly dynamic optimization problem, especially in large-scale or rapidly changing environments. Moreover, emerging 6G applications, including ultra-reliable low-latency communications (URLLC), XR, and large-scale machine-type communications, demand both high throughput and efficient power usage. This creates a fundamental trade-off in the design of RIS-assisted THz systems, where the objective is not only to enhance signal quality through intelligent reflection but also to ensure that both the BS and RIS remain within their power budgets.
Despite this importance, the joint optimization of RIS phase shifts and BS beamforming under sum rate maximization and power constraints has been largely overlooked in the THz 6G literature. Among limited contributions, only [62] explicitly addressed this problem, proposing a sum rate maximization framework for RIS-aided THz-NOMA systems. However, their approach relies on traditional model-based optimization techniques such as alternating optimization, semidefinite relaxation (SDR), and the alternating direction method of multipliers (ADMM). While mathematically tractable, these methods face well-known limitations when applied to real-time and large-scale deployments. Specifically, they lack adaptability in non-stationary channel environments, exhibit poor scalability as the number of RIS elements and users increases, and often incur high computational complexity due to iterative solvers that are not well suited for edge inference.
These shortcomings motivate the transition toward learning-based approaches, particularly deep reinforcement learning (DRL), which offers an adaptive, data-driven framework for sequential decision making in complex wireless environments. DRL agents can dynamically optimize control actions such as beamforming and RIS configuration based on real-time observations without requiring explicit channel models. However, DRL methods, especially those that rely on feedforward neural networks or RNNs, often struggle to capture the long-term temporal dependencies inherent in THz environments with high user mobility and frequent blockage events. To overcome this, Transformer encoders have recently emerged as powerful sequence modeling tools, capable of learning long-range dependencies via self-attention mechanisms. In the context of THz-RIS systems, a transformer can serve as a temporal feature extractor that refines the state representation fed to the DRL agent, thereby improving its decision making in rapidly varying channel conditions.
Nonetheless, both the DRL and Transformer architectures are notoriously sensitive to hyperparameter tuning (e.g., learning rates, entropy coefficients, and attention dimensions). Poorly selected parameters can lead to convergence issues, suboptimal policies, or unstable training. This necessitates the use of a higher-level optimization mechanism to automate hyperparameter selection. In this work, we leverage the hybrid biogeography-based optimization (HBBO) algorithm as a meta-optimizer that adaptively tunes the learning parameters of both the DRL and Transformer components based on observed performance feedback. In other words, this work proposes a unified framework that synergistically integrates a DRL agent, a Transformer encoder, and HBBO optimization. The proposed architecture jointly learns an efficient policy for RIS and BS configuration, captures long-term channel dynamics via temporal attention, and self-tunes critical learning hyperparameters for optimal performance. This combination addresses the limitations of prior model-based and standalone learning methods in 6G THz networks.

1.3. Paper Contributions

The key contributions of this work are summarized as follows:
  • We formulate a joint RIS phase shift and transmit power control problem for RIS-assisted THz communication in 6G networks, aiming to maximize the achievable sum rate under strict power constraints at both the base station and RIS.
  • To address the dynamic and non-stationary nature of THz environments, we develop a DRL framework that enables adaptive control without requiring explicit channel modeling.
  • To enhance the agent’s temporal awareness and improve decision making over sequential channel variations, we integrate a Transformer encoder into the DRL agent, enabling effective long-term dependency modeling.
  • We introduce a novel HBBO algorithm as a meta-optimizer to automatically tune the critical hyperparameters of both the DRL agent and the Transformer encoder, ensuring stable and efficient learning.
  • Extensive simulations under realistic THz channel conditions demonstrate that our proposed DRL+Transformer+HBBO framework significantly outperforms conventional optimization baselines and DRL-only counterparts in terms of the sum rate, energy efficiency, and THz coverage region.

1.4. Paper Organization

The structure of the paper is as follows. Section 2 introduces the system model and formulates the sum rate maximization problem under power constraints in RIS-assisted THz networks. Section 3 details the proposed DRL-Transformer framework along with the HBBO-based hyperparameter tuning methodology. Section 4 presents the simulation results and performance evaluation. Section 5 discusses the findings in comparison with existing baselines. Finally, Section 6 concludes the paper and outlines directions for future work.

2. System Model and Problem Formulation

We consider an RIS-assisted terahertz downlink communication system that comprises a multi-antenna base station (BS), an RIS with N passive reflecting elements, and a single-antenna user terminal. Due to the extremely high carrier frequency (0.1–1 THz), THz waves suffer from severe free-space path loss, strong molecular absorption, and a pronounced susceptibility to shadowing. Consequently, the effective coverage radius of a standalone BS is typically limited to only a few meters. In practical indoor or dense urban scenarios, even without a macroscopic object directly in the line of sight (LoS), ordinary building materials (e.g., painted drywall, glass, or human bodies) can introduce attenuation on the order of 40–60 dB, rendering the direct BS–user link virtually unusable. For analytical tractability, and to capture this harsh propagation environment, we therefore model the direct THz channel as completely blocked and focus on an indirect BS–RIS–user cascade, as depicted in Figure 1.
The RIS acts as an electronically controllable reflector that reroutes the THz signal around obstacles, effectively reconfiguring the radio environment to create an artificial non-line-of-sight (NLoS) path. The BS–RIS channel $\mathbf{H}_{\mathrm{BR}}$ and the RIS–user channel $\mathbf{h}_{\mathrm{RU}}$ are both modeled as quasi-static, frequency-flat Rayleigh fading matrices or vectors whose elements capture small-scale scattering effects, while large-scale path loss is incorporated into their average power. This modeling choice reflects the rich scattering observed in short-range THz links and allows closed-form performance analysis in later sections.
The BS is equipped with $M$ antennas and applies an analog beamforming vector $\mathbf{w} \in \mathbb{C}^{M \times 1}$. The RIS is modeled as a passive array of $N$ reflecting elements, each inducing a controllable phase shift. The RIS applies a diagonal phase shift matrix denoted by $\boldsymbol{\Theta} = \mathrm{diag}(e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_N})$, where $\theta_n \in [0, 2\pi)$ is the phase shift of the $n$th element. Let $\mathbf{H}_{\mathrm{BR}} \in \mathbb{C}^{N \times M}$ represent the BS-to-RIS channel, and let $\mathbf{h}_{\mathrm{RU}}^{H} \in \mathbb{C}^{1 \times N}$ denote the RIS-to-user channel. Although a direct BS–user channel $\mathbf{h}_{\mathrm{BU}}^{H} \in \mathbb{C}^{1 \times M}$ may exist, it is considered negligible due to the aforementioned high-frequency limitations and blockage effects commonly encountered in THz propagation environments. Accordingly, the signal received by the user is primarily dominated by the RIS-assisted reflection path and can be expressed as
$y = \mathbf{h}_{\mathrm{RU}}^{H} \boldsymbol{\Theta} \mathbf{H}_{\mathrm{BR}} \mathbf{w}\, s + n, \qquad (1)$
where $s \in \mathbb{C}$ is the transmitted data symbol with unit power, i.e., $\mathbb{E}[|s|^2] = 1$, and $n \sim \mathcal{CN}(0, \sigma^2)$ denotes the additive white Gaussian noise at the user.
The RIS applies a diagonal phase shift matrix $\boldsymbol{\Theta} = \mathrm{diag}(e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_N}) \in \mathbb{C}^{N \times N}$, where each $\theta_n \in [0, 2\pi)$ represents the programmable phase shift of the $n$th reflecting element. This configuration allows the RIS to passively steer the incoming signal from the BS toward the intended user direction, thereby enhancing signal quality and bypassing severe channel impairments. Based on this model, the received signal-to-noise ratio (SNR) at the user terminal is given by
$\gamma = \dfrac{\left| \mathbf{h}_{\mathrm{RU}}^{H} \boldsymbol{\Theta} \mathbf{H}_{\mathrm{BR}} \mathbf{w} \right|^{2}}{\sigma^{2}}, \qquad (2)$
which quantifies the quality of the end-to-end link through the RIS.
To maximize the received SNR and ensure robust and energy-efficient signal delivery, our objective is to jointly optimize the BS beamforming vector $\mathbf{w}$ and the RIS phase shift matrix $\boldsymbol{\Theta}$. This leads to the following non-convex optimization problem:
$\max_{\boldsymbol{\Theta}, \mathbf{w}} \; \left| \mathbf{h}_{\mathrm{RU}}^{H} \boldsymbol{\Theta} \mathbf{H}_{\mathrm{BR}} \mathbf{w} \right|^{2} \quad \text{s.t.} \quad \|\mathbf{w}\|^{2} \le P_{\max}, \quad \boldsymbol{\Theta} = \mathrm{diag}(e^{j\theta_1}, \ldots, e^{j\theta_N}), \quad \theta_n \in [0, 2\pi) \ \forall n, \qquad (3)$
where $P_{\max}$ denotes the transmit power constraint at the BS. The problem in Equation (3) is non-convex and challenging to solve optimally due to the coupling between the beamforming and phase shift variables and the unit-modulus constraint imposed by the passive RIS structure. In the following sections, we propose a learning-based optimization framework that solves this problem efficiently under dynamic THz channel conditions.
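For concreteness, the following minimal Python sketch evaluates the received SNR of Equation (2) and the corresponding achievable rate for a given beamformer and RIS configuration. The channel realizations, noise power, and matched-filter beamformer are illustrative assumptions, not the optimized solution of Equation (3).

```python
# Received SNR (Eq. 2) and achievable rate for the RIS-assisted link of Eq. (1).
# Channel realizations, noise power, and the matched-filter beamformer are
# illustrative assumptions, not the optimized solution of Eq. (3).
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 32                      # BS antennas, RIS elements
sigma2 = 1e-8                     # noise power (assumed)
P_max = 1.0                       # BS transmit power budget

# Quasi-static Rayleigh fading channels (large-scale loss folded into variance)
H_BR = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
h_RU = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

def received_snr(theta, w):
    """SNR of Eq. (2) for RIS phases theta (N,) and beamformer w (M,)."""
    Theta = np.diag(np.exp(1j * theta))           # unit-modulus phase-shift matrix
    g = h_RU.conj() @ Theta @ H_BR                # effective cascaded channel (1 x M)
    return np.abs(g @ w) ** 2 / sigma2

# Heuristic feasible point: random RIS phases plus matched-filter beamforming.
theta0 = rng.uniform(0.0, 2 * np.pi, N)
g0 = h_RU.conj() @ np.diag(np.exp(1j * theta0)) @ H_BR
w0 = np.sqrt(P_max) * g0.conj() / np.linalg.norm(g0)   # satisfies ||w||^2 <= P_max

print("achievable rate [bps/Hz]:", np.log2(1 + received_snr(theta0, w0)))
```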

3. Materials and Methods

This section details the core components of the proposed optimization framework. Section 3.1 describes the transformer encoder architecture, which enables temporal and contextual feature extraction from sequential input states. Section 3.2 outlines the RL foundation, focusing on how the agent interacts with the environment to learn optimal control policies. Section 3.3 introduces the proposed HBBO algorithm, which adaptively tunes the hyperparameters of the learning system using a segmented, nonlinear migration strategy. Finally, Section 3.4 presents the integrated ODRL-Transformer architecture, which combines the transformer-based feature extractor with the RL agent and employs HBBO for hyperparameter optimization, resulting in a scalable and high-performance solution for RIS-assisted THz communication environments.

3.1. Transformer Encoder

The transformer encoder has become a fundamental architecture type for modeling sequential dependencies in various domains due to its parallel processing capability, long-range dependency modeling, and attention-driven feature extraction. Unlike traditional recurrent networks that rely on sequential processing, the transformer encoder processes entire sequences simultaneously, leading to significantly faster training and improved scalability [63]. Its self-attention mechanism enables the model to capture global contextual information by weighing the relative importance of different elements in the sequence. This results in richer and more informative representations, which are especially beneficial in scenarios involving complex and dynamic temporal patterns.
Figure 2 illustrates the architecture of the transformer encoder, which comprises a stack of identical layers, each consisting of two main components: a multi-head self-attention mechanism and a position-wise feedforward neural network. The input sequence is first embedded and enriched with positional encoding to retain order information and then passed through the encoder layers. Within each layer, the multi-head attention mechanism computes attention scores across the sequence, allowing the model to attend to different contextual patterns simultaneously. This is followed by a feedforward network that projects the attention output into a higher-level representation. Each sub-layer is wrapped with residual connections and layer normalization to facilitate stable and efficient learning [64].
To preserve the positional structure of the input sequence, sinusoidal positional encoding is added to the input embeddings, as defined in Equations (4) and (5) [65]:
$PE_{(\mathrm{pos},\, 2i)} = \sin\!\left( \dfrac{\mathrm{pos}}{10{,}000^{2i/d}} \right), \qquad (4)$
$PE_{(\mathrm{pos},\, 2i+1)} = \cos\!\left( \dfrac{\mathrm{pos}}{10{,}000^{2i/d}} \right), \qquad (5)$
where pos is the position index, i is the dimension index, and d is the embedding size. The attention mechanism begins by projecting the input sequence Z into three linear spaces—queries Q, keys K, and values V—as shown in Equations (6)–(8):
$Q = Z W^{Q}, \qquad (6)$
$K = Z W^{K}, \qquad (7)$
$V = Z W^{V}, \qquad (8)$
where $W^{Q}$, $W^{K}$, and $W^{V}$ are the learnable weight matrices responsible for transforming the input into distinct subspaces. The scaled dot-product attention is computed as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \dfrac{Q K^{\top}}{\sqrt{d_K}} \right) V, \qquad (9)$
where $d_K$ is the dimensionality of each attention head. To enhance representational power, multiple attention heads are executed in parallel and concatenated, as described in Equations (10) and (11):
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O}, \qquad (10)$
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}). \qquad (11)$
This design enables the model to jointly capture information from different subspaces at different positions. The feedforward network is applied to each position independently, as formalized in Equation (12):
$\mathrm{FFNN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2, \qquad (12)$
where $x$ is the input vector corresponding to a single token or position in the sequence, $W_1$ and $W_2$ are the weight matrices, and $b_1$ and $b_2$ are the bias vectors.
Finally, layer normalization and residual connections are applied to stabilize training and facilitate deeper stacking, as shown in Equations (13) and (14):
$\hat{Z} = \mathrm{LayerNorm}\!\left( Z + \mathrm{MultiHead}(Q, K, V) \right), \qquad (13)$
$Z_{\mathrm{out}} = \mathrm{LayerNorm}\!\left( \hat{Z} + \mathrm{FFNN}(\hat{Z}) \right). \qquad (14)$
These operations help maintain gradient flow and improve convergence during training [63,64,65].
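As an illustration of Equations (4)–(14), the following PyTorch sketch assembles one encoder layer with sinusoidal positional encoding, multi-head self-attention, a position-wise feedforward network, and residual connections with layer normalization. The dimensions, dropout rate, and toy input are illustrative assumptions and do not correspond to the tuned configuration reported later.

```python
# One transformer encoder layer implementing Eqs. (4)-(14) in PyTorch.
# d_model, n_heads, d_ff, and the sequence length are illustrative assumptions.
import torch
import torch.nn as nn

def sinusoidal_encoding(seq_len, d_model):
    """Positional encoding of Eqs. (4) and (5)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / (10_000.0 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)            # even dimensions, Eq. (4)
    pe[:, 1::2] = torch.cos(angle)            # odd dimensions, Eq. (5)
    return pe

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, n_heads=8, d_ff=256, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)      # Eqs. (6)-(11)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))        # Eq. (12)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, z):
        a, _ = self.attn(z, z, z)                     # multi-head self-attention
        z_hat = self.norm1(z + a)                     # Eq. (13): residual + LayerNorm
        return self.norm2(z_hat + self.ffn(z_hat))    # Eq. (14)

# Toy usage: a batch of 4 sequences, 16 time steps, 64 features per step.
x = torch.randn(4, 16, 64) + sinusoidal_encoding(16, 64)
print(EncoderLayer()(x).shape)                        # torch.Size([4, 16, 64])
```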

3.2. Reinforcement Learning

RL is a foundational paradigm in machine learning (ML) in which an agent learns to make sequential decisions through interaction with a dynamic environment. Unlike supervised learning, which relies on labeled data, RL employs feedback in the form of scalar rewards to guide the learning process [66]. The agent explores various actions in a given state, receives rewards, and updates its policy to maximize the cumulative reward over time. This trial-and-error approach enables RL to operate effectively in environments where the optimal strategy is not known a priori. The standard RL framework is often modeled as a Markov decision process (MDP), formally defined by the tuple ( S , A , P , R , γ ) , representing states, actions, transition probabilities, rewards, and the discount factor, respectively. The strength of RL lies in its ability to derive optimal policies through interaction without requiring an explicit model of the environment. This makes it highly suitable for complex, uncertain, and nonlinear tasks such as wireless resource allocation, autonomous vehicles, and robotic control systems, where accurate environmental modeling may be infeasible [67].
RL inherently accommodates sequential dependencies and delayed feedback, which are prevalent in real-world decision-making problems. Figure 3 illustrates the canonical RL architecture, where the agent and environment form a closed feedback loop. At each time step, the agent perceives the environment’s current state and selects an action based on its internal policy. This action results in a new environmental state and a reward signal. The agent then uses this experience to refine its policy via a mechanism known as policy updating. As this iterative process unfolds, the agent improves its behavior to maximize long-term rewards. This feedback-driven mechanism is central to how RL adapts and learns in dynamic conditions [68].
In RL, the ultimate goal of the agent is to maximize the total amount of reward it accumulates over time. Since many environments are stochastic and decisions can have long-term consequences, the reward is not evaluated in isolation but instead as a discounted sum of future rewards. This leads to the definition of the return, which represents the total expected reward from a given time step onward. The return captures both immediate and delayed outcomes, allowing the agent to consider the full trajectory of its decisions. Mathematically, this is formalized in Equation (15) [69]:
$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad (15)$
where $G_t$ is the cumulative return at time step $t$, $r_{t+k+1}$ is the reward received $k+1$ steps after time $t$, and $\gamma$ is the discount factor that controls the influence of future rewards, with $0 < \gamma \le 1$.
To evaluate how good a specific state is under a given policy, RL introduces the state-value function as shown in Equation (16). This function estimates the expected return starting from a particular state and following a specific policy thereafter [69]:
$V^{\pi}(s) = \mathbb{E}\left[ G_t \mid s_t = s \right], \qquad (16)$
where $V^{\pi}(s)$ is the expected return from state $s$ under policy $\pi$, and $\mathbb{E}$ denotes the expectation over all possible future trajectories.
To extend this evaluation to include specific actions, RL defines the action-value function as shown in Equation (17). This function quantifies the expected return after taking a given action in a given state and continuing with the policy [68]:
$Q^{\pi}(s, a) = \mathbb{E}\left[ G_t \mid s_t = s,\; a_t = a \right], \qquad (17)$
where $Q^{\pi}(s, a)$ is the expected return for taking action $a$ in state $s$ and then following policy $\pi$. One of the most widely used algorithms in RL is Q-learning, whose update rule is expressed in Equation (18). It uses a bootstrapped estimate to iteratively update the value of a state-action pair based on the observed reward and the maximum value of the next state [67]:
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \qquad (18)$
where $Q(s, a)$ is the current estimate of the action-value function, $\alpha$ is the learning rate, which controls how much new information overrides old estimates, $r$ is the immediate reward received after action $a$, $s'$ is the next state, and $\max_{a'} Q(s', a')$ is the maximum estimated reward achievable from state $s'$.
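The update rule of Equation (18) can be illustrated with a short tabular Q-learning sketch; the five-state toy chain environment and the ε-greedy hyperparameters below are assumptions made purely for demonstration.

```python
# Tabular Q-learning illustrating the update rule of Eq. (18) on a toy
# 5-state chain; the environment and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.2

def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1 at the rightmost state."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, 1.0 if nxt == n_states - 1 else 0.0

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Eq. (18): move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))    # learned action values favor moving right
```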

3.3. Proposed HBBO

BBO is an evolutionary algorithm inspired by the natural distribution of species across different habitats [70]. It models the migration and mutation of species in a set of candidate solutions, known as habitats, to iteratively improve the quality of the solutions. Each habitat corresponds to a potential solution in the search space and is characterized by a set of features known as suitability index variables (SIVs). The fitness of a habitat reflects its suitability, analogous to how a real-world ecosystem’s quality is shaped by environmental factors [71]. High-quality habitats are more likely to share their features with others, while low-quality habitats are more likely to accept new features in the hope of improvement.
The main operators in BBO are migration and mutation, which drive exploration and exploitation during optimization. Migration is controlled through immigration and emigration rates, determining how likely a habitat is to receive or contribute features. Mutation introduces random alterations to the habitat to maintain diversity and avoid local optima. A typical BBO generation proceeds by ranking all habitats based on fitness, calculating the migration rates, performing migration by probabilistically modifying SIVs, and applying mutation to selected habitats. This process is repeated over successive generations until the convergence criteria are met. The migration rates are defined based on the relative rank of each habitat. The emigration rate $\mu_i$, indicating the tendency of a habitat to share its features with others, and the immigration rate $\lambda_i$, indicating its likelihood to accept new features, are computed as in Equations (19) and (20):
$\mu_i = E \times \dfrac{i}{N}, \qquad (19)$
$\lambda_i = I \times \left( 1 - \dfrac{i}{N} \right), \qquad (20)$
where $i$ represents the rank of the habitat in terms of suitability, $N$ is the total number of habitats, and $E$ and $I$ denote the maximum emigration and immigration rates, respectively.
The migration operation involves exchanging SIVs between habitats. Specifically, the SIVs of a host habitat $H_j$ are updated by incorporating SIVs from a guest habitat $H_i$, as expressed in Equation (21) [72,73]:
$H_j(\mathrm{SIVs}) \leftarrow H_j(\mathrm{SIVs}) + H_i(\mathrm{SIVs}), \qquad (21)$
To preserve diversity in the population and escape local optima, BBO applies a mutation operator to the selected habitats. The mutation rate for each habitat is inversely proportional to its probability of being selected, encouraging exploration of poorly performing solutions. This rate is calculated as follows [71]:
$m_i = m_{\max} \times \left( 1 - \dfrac{p_i}{p_{\max}} \right), \qquad (22)$
where $m_i$ denotes the mutation rate, $m_{\max}$ is a user-defined maximum mutation rate, $p_i$ is the species-count probability of habitat $i$, and $p_{\max}$ is the largest value of $p_i$.
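The following NumPy sketch runs one conventional BBO generation using the linear rates of Equations (19) and (20), the migration step of Equation (21), and the mutation rate of Equation (22). The sphere fitness function, the species-count probability proxy, and the replacement-style migration (rather than the literal additive form of Equation (21)) are simplifying assumptions.

```python
# One conventional BBO generation: linear migration rates (Eqs. 19-20),
# probabilistic SIV migration (cf. Eq. 21), and rank-based mutation (Eq. 22).
# Fitness function, bounds, and the species-count proxy are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_habitats, n_sivs = 10, 4
E = I = 1.0                                   # maximum emigration/immigration rates
m_max = 0.05                                  # user-chosen maximum mutation rate
pop = rng.uniform(-5, 5, (n_habitats, n_sivs))

fitness = -np.sum(pop ** 2, axis=1)           # toy objective: higher is better
rank = np.argsort(np.argsort(fitness)) + 1    # rank 1 = worst habitat, N = best

mu = E * rank / n_habitats                    # Eq. (19): better habitats emigrate more
lam = I * (1 - rank / n_habitats)             # Eq. (20): worse habitats immigrate more
p = rank / rank.sum()                         # crude proxy for species-count probability
m = m_max * (1 - p / p.max())                 # Eq. (22): weak habitats mutate more

for j in range(n_habitats):
    for k in range(n_sivs):
        if rng.random() < lam[j]:             # immigration decision for SIV k
            i = rng.choice(n_habitats, p=mu / mu.sum())   # pick an emigrating habitat
            pop[j, k] = pop[i, k]             # replacement-style migration (simplified Eq. 21)
        if rng.random() < m[j]:               # mutation keeps diversity
            pop[j, k] = rng.uniform(-5, 5)
```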
Despite the simplicity and interpretability of conventional BBO, its performance is often hindered by the use of fixed, linear migration models across all habitats and generations. In many standard BBO implementations, emigration and immigration rates are determined by linear relationships with the habitat fitness rank. While intuitive, this linear formulation fails to reflect the complex, nonlinear dynamics observed in natural ecosystems, where migration behaviors are often irregular, phase-dependent, and sensitive to localized conditions [74]. Consequently, linear models lack the flexibility to promote adaptive knowledge transfer across heterogeneous solution regions. Moreover, applying a single uniform migration function across all habitats and iterations limits the algorithm’s ability to fine-tune exploration versus exploitation trade-offs. Such globally fixed dynamics overlook the evolving structure of the habitat population over time. In early iterations, an aggressive information-sharing mechanism may be useful for rapid convergence, whereas later stages benefit from more localized or restricted migration to preserve elite solutions. The inability to adapt the migration structure based on the population distribution and evolution stage often results in premature convergence or inefficient global searches, particularly in multimodal or noisy optimization landscapes.
To address these issues, we propose a novel hybrid migration model which dynamically adjusts the migration functions based on the habitat quality. Instead of applying a single formula across the population, our method segments the habitats into four distinct regions based on their rank and defines tailored nonlinear migration strategies for each segment. As shown in Equations (23)–(26), this adaptive scheme assigns a distinct pair of emigration and immigration functions to each interval of habitat ranks:
$\mu_i = \dfrac{4E}{7} \left( \dfrac{i}{N} \right)^{4}, \qquad \lambda_i = \dfrac{4I}{7} \left( 1 - \dfrac{i}{N} \right)^{5}, \qquad \text{for } i < \dfrac{3N}{11} \qquad (23)$
$\mu_i = \dfrac{4E}{7} \left( \exp\!\left( \dfrac{i}{N} \right) - 1 \right), \qquad \lambda_i = \dfrac{4I}{7} \left( \cos\!\left( \dfrac{i\pi}{N} + \varphi \right) + 1 \right), \qquad \text{for } \dfrac{3N}{11} \le i \le \dfrac{6N}{11} \qquad (24)$
$\mu_i = \dfrac{4E}{7} \ln\!\left( \dfrac{i}{N} + 1 \right), \qquad \lambda_i = \dfrac{4I}{7} \exp\!\left( -\dfrac{i}{N} \right), \qquad \text{for } \dfrac{6N}{11} \le i \le \dfrac{8N}{11} \qquad (25)$
$\mu_i = \dfrac{4E}{7} \left( \tanh\!\left( \dfrac{i\pi}{N} - \dfrac{\pi}{4} \right) + 1 \right), \qquad \lambda_i = \dfrac{4I}{7} \left( 1 - \tanh\!\left( \dfrac{i\pi}{N} - \dfrac{\pi}{4} \right) \right), \qquad \text{for } i > \dfrac{8N}{11} \qquad (26)$
In the low-HSI range ($i < 3N/11$) described in Equation (23), the emigration function is designed to grow extremely slowly for weak habitats, ensuring that low-quality solutions contribute minimally to outward migration. This controlled behavior prevents the dispersion of poor traits across the population, preserving the integrity of better-performing habitats. In contrast, the immigration rate for this range is structured to deliver a relatively high inflow of individuals from stronger habitats, accelerating the improvement of underperforming solutions. This asymmetric flow pattern promotes exploration where it is most needed while avoiding destabilization of the global search dynamics. In the lower–middle HSI range ($3N/11 \le i \le 6N/11$) of Equation (24), the emigration rate increases smoothly as the habitat quality improves, allowing these average-quality habitats to play a more active role in disseminating useful traits. At the same time, the immigration rate introduces moderate and dynamically varying inflow, enhancing diversity without overwhelming the population. This balance helps maintain steady progress by blending exploration with controlled exploitation.
For the upper–middle HSI range ($6N/11 \le i \le 8N/11$) in Equation (25), the emigration rate follows a sublinear growth pattern, enabling these higher-quality habitats to share beneficial traits without excessive outward flow. The immigration rate decreases progressively with the rank, reducing the risk of overwriting improved solutions with lower-quality information. This combination helps protect and refine promising solutions during the mid-to-late stages of optimization. Finally, in the high-HSI range ($i > 8N/11$) of Equation (26), the emigration function is steep at first but saturates as the HSI increases, allowing elite habitats to contribute early in the process before stabilizing to safeguard their refined traits. Immigration into this range is strongly suppressed, ensuring that the top-performing solutions remain insulated from lower-quality genetic material. Together, these nonlinear migration strategies across the four rank intervals enable the algorithm to maintain an effective exploration–exploitation balance, enhance diversity where necessary, and protect elite solutions to achieve faster and more stable convergence.
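A compact sketch of the rank-segmented migration model is given below. It follows one plausible reading of Equations (23)–(26), so the exact nonlinear forms, the sign of the decaying terms, and the phase offset φ should be treated as assumptions rather than the definitive HBBO formulation.

```python
# Rank-segmented hybrid migration rates, following one plausible reading of
# Eqs. (23)-(26); phi and the exact nonlinear/decay terms are assumptions.
import numpy as np

def hybrid_rates(i, N, E=1.0, I=1.0, phi=0.0):
    """Return (mu_i, lambda_i) for habitat rank i (1 = worst, N = best)."""
    x = i / N
    if i < 3 * N / 11:                        # low-HSI segment, Eq. (23)
        return (4 * E / 7) * x ** 4, (4 * I / 7) * (1 - x) ** 5
    if i <= 6 * N / 11:                       # lower-middle segment, Eq. (24)
        return (4 * E / 7) * (np.exp(x) - 1), (4 * I / 7) * (np.cos(np.pi * x + phi) + 1)
    if i <= 8 * N / 11:                       # upper-middle segment, Eq. (25)
        return (4 * E / 7) * np.log(x + 1), (4 * I / 7) * np.exp(-x)
    t = np.tanh(np.pi * x - np.pi / 4)        # high-HSI segment, Eq. (26)
    return (4 * E / 7) * (t + 1), (4 * I / 7) * (1 - t)

N = 22
for i in (1, 6, 14, 21):                      # one rank per segment
    mu, lam = hybrid_rates(i, N)
    print(f"rank {i:2d}: mu = {mu:.3f}, lambda = {lam:.3f}")
```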
This novel hybrid migration formulation offers several key advantages. First, it enables context-aware migration, where each habitat is governed by a rate function tailored to its role within the evolving population structure. Second, the hybrid strategy captures the nonlinear and phase-dependent nature of real-world migration, more accurately modeling the complexity of inter-habitat dynamics. Third, by decoupling the migration mechanism across fitness tiers, the proposed model allows intelligent and selective information exchange, enhancing both the convergence speed and solution diversity. As a result, habitats adopt smarter migration behaviors; elite ones protect their structure while guiding others, and poor-performing ones actively absorb knowledge in a controlled, non-chaotic manner. The proposed migration mechanism improves upon traditional BBO by dynamically assigning different nonlinear rules at each phase of optimization. This design strengthens the global search capability in early generations and preserves diversity and elite convergence in later stages.

3.4. Proposed ODRL-Transformer Model

Figure 4 illustrates the proposed ODRL-Transformer framework, designed to maximize the sum rate in RIS-assisted THz communication systems. The architecture integrates a transformer encoder as the core sequence modeling component within the DRL agent while leveraging a novel HBBO algorithm to fine-tune critical hyperparameters across both the DRL and transformer modules. The system operates in a closed-loop fashion, where the agent observes the environment’s state, processes it through the transformer encoder, outputs an action policy, and receives reward feedback from the environment to refine future behavior. At the core of the agent lies the transformer encoder, which replaces the conventional DL modules typically used in RL pipelines. The transformer encoder processes sequences of observed states to extract rich temporal and contextual representations, enabling the agent to model long-term dependencies and capture complex interactions among channel dynamics, interference patterns, and RIS configurations. The self-attention mechanism allows the agent to selectively weigh past observations based on relevance, enabling superior adaptability in dynamically changing THz channels.
The transformer encoder is composed of stacked layers containing multi-head self-attention blocks and feedforward networks, each followed by residual connections and layer normalization. This design enables the agent to efficiently learn high-level abstractions from time-correlated input data while maintaining gradient stability during training. The inclusion of positional encodings ensures that sequential ordering is preserved. The DRL agent uses the contextual embeddings from the transformer encoder to estimate a policy $\pi$, which determines the optimal action at each time step. This decision-making process jointly controls active beamforming and RIS phase shifts, forming a continuous control space with high dimensionality. The agent continuously interacts with the environment, receives reward signals reflecting the achieved sum rate, and updates its policy through reinforcement learning principles, thus learning to maximize performance through trial and feedback. At each time step $t$, the transformer encoder receives a sequence of past observations $\mathbf{S}_T = [\mathbf{s}_{t-T+1}, \ldots, \mathbf{s}_t]$, where each vector $\mathbf{s}_\tau$ captures the real and imaginary parts of the current cascaded channel, beamforming vector, RIS phase shifts, and received SNR. This sequential input enables the transformer to learn temporal dependencies and extract high-level contextual features, which are then fed into the DRL agent to guide beamforming and RIS configuration decisions [75].
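To make this data flow concrete, the following PyTorch sketch shows how a transformer encoder can summarize the last T state vectors and feed a policy head that outputs continuous beamforming and phase-shift actions. The layer sizes, the tanh action squashing, and the state/action layout are illustrative assumptions, not the exact ODRL-Transformer configuration.

```python
# Transformer-based actor: the encoder digests the last T observations and a
# policy head outputs continuous beamforming / RIS-phase actions. Dimensions,
# tanh squashing, and the state layout are illustrative assumptions.
import torch
import torch.nn as nn

class TransformerActor(nn.Module):
    def __init__(self, state_dim, action_dim, d_model=64, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.policy_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                         nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, state_seq):                    # state_seq: (batch, T, state_dim)
        h = self.encoder(self.embed(state_seq))      # contextual embeddings
        return self.policy_head(h[:, -1])            # act on the latest embedding

# Example with M = 8 BS antennas, N = 32 RIS elements, T = 16 past observations.
# Assumed state: Re/Im cascaded channel (2*M*N) + beamformer (2*M) + phases (N) + SNR.
state_dim = 2 * 8 * 32 + 2 * 8 + 32 + 1
action_dim = 32 + 2 * 8                              # RIS phases + Re/Im beamformer
actor = TransformerActor(state_dim, action_dim)
print(actor(torch.randn(4, 16, state_dim)).shape)    # torch.Size([4, 48])
```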
Crucially, this architecture includes an HBBO block responsible for optimizing the hyperparameters governing both the DRL agent and the transformer encoder. These parameters include but are not limited to the learning rate, discount factor, batch size, number of attention heads, hidden layer sizes, and dropout rates. HBBO dynamically adjusts these parameters across generations using a hybrid migration model tailored to maintaining exploration while preserving elite configurations, ensuring balanced searches across the solution space. Hyperparameter optimization is critical in deep RL architectures due to the sensitivity of the learning stability and generalization performance to parameter choices. Poorly tuned hyperparameters often result in slow convergence, unstable training, or suboptimal policies. In particular, transformer-based models are known for their large parameter space and sensitivity to architectural settings, requiring careful calibration to avoid overfitting or underutilization of their modeling capacity. The use of HBBO offers distinct advantages in this setting. Unlike traditional grid searches or Bayesian optimization, HBBO exploits population-based intelligence and dynamic migration functions to explore complex, multimodal search spaces more effectively. Its adaptive migration strategy, described in the previous section, enables it to respond differently to elite and weak parameter sets, supporting both the convergence speed and diversity preservation. This makes it highly suitable for tuning the deep and hierarchical ODRL-Transformer structure, where interactions between components are nonlinear and high-dimensional.

4. Results

This section presents a comprehensive evaluation of the proposed ODRL-Transformer framework in an RIS-assisted THz communication environment. To enhance readability and provide a clear understanding of the experimental process, the section is organized into three subsections. First, the simulation set-up is described in detail, including the system configuration, channel parameters, and key assumptions. Next, the optimization parameters of the DRL, transformer encoder, and HBBO algorithm are reported, along with their impact on the training performance. Finally, we present and analyze the simulation results, comparing the proposed model against several state-of-the-art benchmarks in terms of sum rate performance, convergence speed, and robustness.

4.1. Simulation Set-Up

All simulations and model training were implemented in Python 3.10 using a combination of open-source libraries. The core components of the deep reinforcement learning (DRL) pipeline were developed using PyTorch 2.1, which offers efficient tensor operations and GPU acceleration for deep model training. Additional utilities such as NumPy (v1.24) were used for matrix manipulation, while Matplotlib (v3.7) and Seaborn (v0.12) supported result visualization. The Transformer architecture was constructed using native PyTorch modules, and the optimization process via HBBO was custom-implemented to allow fine-grained control over the migration strategies. The entire framework was executed on a system equipped with an Intel Core i9-12900K CPU, 64 GB of RAM, and an NVIDIA RTX 3090 GPU (24 GB VRAM) running on Ubuntu 22.04 LTS. This configuration enabled efficient training and evaluation across multiple Monte Carlo runs and parameter settings. To ensure generalizability and avoid overfitting, the data were partitioned into training, validation, and testing sets using an 80–10–10 split.
We evaluated the performance of the proposed ODRL-Transformer framework using a realistic THz-band communication scenario. All channels were modeled using Rayleigh fading to reflect rich scattering conditions typically observed in complex indoor environments. The path loss exponent was set to 2.7, representing quasi-LoS THz propagation. The RIS-assisted system was configured such that the distances between the base station, RIS, and user satisfied $d_{\mathrm{BR}} = d_{\mathrm{RU}} = 20$ m, resulting in a total reflected path length of 40 m. For comparison, the direct BS-to-UE distance was set to $d_{\mathrm{BU}} = 35$ m, allowing evaluation of both the reflected and direct link contributions in sparse and blockage-prone environments. The THz carrier frequency was fixed at 0.3 THz, and the transmission power was set to $P_s = 10$ dBm. The thermal noise power at the UE receiver was assumed to be $\sigma^2 = -80$ dBm. The RIS comprised $N = 32$ reflecting elements unless otherwise stated, with each capable of continuous phase adjustment. All simulations were averaged over 100 independent channel realizations to ensure statistical reliability.
To evaluate the effectiveness of the proposed ODRL-Transformer framework, we conducted a comprehensive performance comparison against several baseline and variant models. These included the DRL-Transformer, Transformer-HBBO, DRL-HBBO, standalone Transformer, standard DRL, as well as BERT- and GAN-based architectures. Each model was implemented under consistent training conditions and simulation environments to ensure fair benchmarking. The goal was to assess the individual and combined contributions of the transformer encoder, DRL agent, and HBBO optimizer by systematically isolating and recombining their components. The ODRL-Transformer model integrated all three elements (Transformer, DRL, and HBBO), forming the most complete and adaptive learning pipeline. To understand the impact of each component, we examined the DRL-Transformer model (without HBBO), which helps isolate the benefit of reinforcement learning guided by transformer-based sequence modeling. Similarly, the Transformer-HBBO architecture removes the DRL agent and uses supervised training to analyze the optimization impact of HBBO on transformer configurations. The DRL-HBBO model assesses the effect of hyperparameter tuning on policy learning without transformer-driven temporal abstraction. These variants allowed us to rigorously quantify the performance gains associated with each architectural enhancement.
We also compared these variants against standalone Transformer and standard DRL models to establish baseline performance. The Transformer model was trained using a supervised learning approach lacking decision-based feedback and thus served as a lower-bound reference in environments with delayed rewards. The DRL baseline used conventional deep networks without sequence modeling, providing insight into the advantages introduced by the attention-based encoder in our proposed method. To further contextualize our results, we included bidirectional encoder representations from transformers (BERT)- and generative adversarial network (GAN)-based models. BERT, a pretrained attention-based model, was selected due to its proven sequence modeling capability, while the GAN was included as a generative approach capable of learning the underlying data distribution. Finally, to evaluate the standalone effectiveness of HBBO, we applied it independently to the DRL and Transformer architectures and their combinations. These experiments demonstrate HBBO's ability to improve the learning stability, convergence rate, and final policy quality across architectures. This modular evaluation confirms that HBBO not only benefits the joint ODRL-Transformer pipeline but also serves as a robust meta-optimizer across different learning strategies. Together, this broad evaluation design enabled us to validate the superiority and generalizability of the proposed ODRL-Transformer framework in RIS-assisted THz environments.

4.2. Optimization Parameters

To evaluate the prediction accuracy and robustness of the proposed and baseline models, we adopted a diverse set of quantitative metrics. These included traditional regression-based indicators such as the coefficient of determination ($R^2$), mean absolute percentage error (MAPE), and root mean squared error (RMSE), as well as variance analysis for stability assessment, a statistical t-test for significance validation, and convergence behavior and execution time evaluation for computational efficiency. Together, these metrics provide a well-rounded evaluation of both the predictive accuracy and practical applicability. The coefficient of determination, shown in Equation (27), quantifies the correlation between the predicted and actual values:
$R^2 = \left( \dfrac{\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sigma_y \sigma_{\hat{y}}} \right)^{2}, \qquad (27)$
where $\bar{y}$ is the mean of the observed values, $\bar{\hat{y}}$ is the mean of the predicted values, $\sigma_y$ is the standard deviation of the actual values, and $\sigma_{\hat{y}}$ is the standard deviation of the predicted values. An $R^2$ score close to one indicates strong linear agreement between the predictions and ground truth, while lower values signal weak predictive correlation. The MAPE, given in Equation (28), measures the relative prediction error as a percentage:
$\mathrm{MAPE} = \dfrac{1}{N} \sum_{i=1}^{N} \left| \dfrac{y_i - \hat{y}_i}{y_i} \right| \times 100. \qquad (28)$
The MAPE reflects how large the errors are relative to the actual values. Lower values denote higher accuracy, with an MAPE below 10% generally considered excellent. However, the MAPE becomes unstable when the actual values $y_i$ are close to zero. The RMSE, as presented in Equation (29), calculates the standard deviation of the prediction errors:
$\mathrm{RMSE} = \sqrt{ \dfrac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }. \qquad (29)$
The RMSE penalizes larger errors more than the MAPE, making it useful for capturing the magnitude of extreme deviations. Lower RMSE values indicate a tighter clustering of predicted values around the ground truth.
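For reference, the following NumPy sketch computes the three metrics of Equations (27)–(29) on a small pair of illustrative prediction and target vectors (assumed values, not results from the paper).

```python
# Evaluation metrics of Eqs. (27)-(29); the prediction/target vectors are assumed.
import numpy as np

def r2(y, y_hat):
    """Squared Pearson correlation, Eq. (27)."""
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return (cov / (y.std() * y_hat.std())) ** 2

def mape(y, y_hat):
    """Mean absolute percentage error, Eq. (28); unstable when y is near zero."""
    return np.mean(np.abs((y - y_hat) / y)) * 100

def rmse(y, y_hat):
    """Root mean squared error, Eq. (29)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

y = np.array([10.2, 11.5, 12.8, 14.1, 15.3])
y_hat = np.array([10.0, 11.7, 12.5, 14.4, 15.1])
print(f"R2={r2(y, y_hat):.3f}  MAPE={mape(y, y_hat):.2f}%  RMSE={rmse(y, y_hat):.3f}")
```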
In addition to accuracy metrics, we analyzed the variance in the model outputs over multiple runs to assess stability and consistency. Lower variance suggests that the model produces reliable results regardless of initialization or environmental randomness. To statistically validate the observed differences between models, we conducted a two-tailed paired t-test, which evaluated whether the performance gains of the proposed method over the baselines were statistically significant. A p-value less than 0.01 was considered indicative of meaningful improvement rather than random fluctuations. We also monitored the convergence behavior by tracking the average reward or loss across training episodes. A model that converged faster with stable gradients demonstrated efficient learning, while oscillations or delayed convergence suggested instability or suboptimal parameter configurations. Finally, the execution time was recorded for each model to evaluate its computational feasibility in real-time or resource-constrained scenarios. We measured both the training time per episode and total runtime, providing insight into the trade-offs between accuracy and scalability.
Hyperparameter tuning plays a critical role in the training and performance of DL models, especially in complex architectures involving reinforcement learning and attention mechanisms. Proper calibration of parameters such as the learning rate, dropout rate, and network depth can significantly impact the convergence behavior, generalization ability, and overall stability. In multi-component frameworks like ODRL-Transformer, where model performance depends on the interplay between temporal encoding, policy learning, and exploration dynamics, tuning becomes even more essential to balance the learning speed and stability. As shown in Table 1, the hyperparameters for each algorithm were carefully selected through different optimization strategies.
For the proposed ODRL-Transformer architecture, the HBBO algorithm was employed to jointly optimize both the model-level and training-level parameters, including the number of attention heads, dropout rate, and discount factor, as well as migration-specific parameters such as the mutation rate and population size. In contrast, the baseline models (e.g., Transformer, DRL, BERT, and GAN) were calibrated using a grid search over predefined parameter ranges. This ensured that each model was evaluated under its most favorable configuration while isolating the impact of HBBO-driven tuning in the proposed framework.
The results indicate that ODRL-Transformer adopted a relatively conservative learning rate (0.001) and dropout rate (0.2), reflecting a trade-off between exploration and regularization. The transformer backbone used six encoder layers with eight attention heads. Compared with the vanilla Transformer baseline, which used a higher number of attention heads (12) and a larger batch size (128), the HBBO-tuned configuration achieved more efficient learning with fewer computational demands. The Transformer model employed 12 attention heads and a moderate dropout rate of 0.2 to maintain regularization, with training performed using the SGD optimizer and a relatively high batch size of 128 for stability. BERT was configured with eight self-attention heads per layer and a time series input window with a length of 64 to capture temporal patterns using GELU activation and SGD optimization. The DRL agent utilized a discount factor of 0.97 and an ε-greedy value of 0.41 to balance exploration and exploitation while also adopting the Adam optimizer for smoother gradient updates. Lastly, the GAN model used a small hidden layer (32 neurons), ReLU activation, and a momentum term of 0.06, with a convergence threshold of 0.068 to determine termination.
The initial hyperparameters of the HBBO algorithm, including the population size, maximum mutation rate, migration coefficients, and habitat partition thresholds, were determined through a systematic trial-and-error process in which one parameter was varied at a time while the others were held constant, allowing us to isolate its impact on performance. For example, the population size was incrementally adjusted across multiple runs under identical conditions, with the resulting sum rate, convergence time, and stability recorded to identify the best trade-off between quality and efficiency. The same procedure was applied sequentially to the mutation rate, migration coefficients, and habitat thresholds until arriving at the configuration reported in Table 1. The calibrated configurations suggest that HBBO not only improves the prediction accuracy and convergence but also balances architectural complexity with runtime efficiency.
The variation in the number of parameters across the compared models in Table 1 stems from differences in their architectural complexity and the components they integrate. Models incorporating a transformer encoder, such as DRL-Transformer and Transformer-HBBO, required additional parameters for their multi-head attention layers, embedding projections, and normalization operations, leading to higher overall parameter counts compared with their non-transformer counterparts. BERT had an even larger number of parameters due to its deep multi-layer bidirectional attention structure and large hidden dimensions. GAN-based models also contain both generator and discriminator networks, effectively doubling certain parameter groups. In contrast, standard DRL models employ simpler feedforward or recurrent layers without attention mechanisms, resulting in fewer trainable parameters. These structural distinctions, rather than differences in training configuration, are the primary reason for the parameter variations reported in Table 1.

4.3. Simulation Results

Figure 5 illustrates the impact of the transmit SNR $\bar{\gamma}_T$ on the system sum rate in an RIS-assisted THz communication scenario for different numbers of RIS elements $N \in \{8, 16, 32, 64\}$. As expected in the THz bands, where high path loss and limited scattering significantly degrade direct link quality, the deployment of an RIS played a critical role in enhancing the received signal power. The results clearly show that increasing the number of RIS elements led to substantial improvements in the achievable sum rate. For instance, at $\bar{\gamma}_T = 10$ dB, the sum rate improved from approximately 7 bps/Hz for $N = 8$ to over 11 bps/Hz for $N = 64$. This performance gain stems from the enhanced beamforming precision and controllable reflectivity achieved with more RIS elements, which is especially valuable in the highly directional and frequency-selective THz environment. The baseline without an RIS exhibited significantly lower rates, emphasizing the essential role of intelligent reflecting surfaces in mitigating channel impairments at THz frequencies. Moreover, the performance gap between the RIS-aided and baseline cases widened as $\bar{\gamma}_T$ increased, indicating that RIS-enabled gain became more prominent at higher transmit powers, which is a behavior aligned with the distance-sensitive propagation characteristics of THz waves.
Figure 6 compares the sum rate performance of the proposed ODRL-Transformer framework with several baseline and variant models across varying transmit SNR levels $\bar{\gamma}_T$. In the context of RIS-assisted THz communication systems, where severe path loss, high atmospheric absorption, and limited scattering prevail, the integration of intelligent learning-based strategies becomes essential to ensure efficient beamforming and resource control. The proposed ODRL-Transformer model, which jointly integrates a transformer encoder, a DRL agent, and the HBBO optimizer, consistently outperformed all other configurations. This result highlights the strength of combining temporal sequence modeling, policy-based decision making, and adaptive hyperparameter tuning to handle the dynamic and highly directional nature of THz links. At $\bar{\gamma}_T = 20$ dB, ODRL-Transformer achieved a sum rate close to 15 bps/Hz for $N = 32$, significantly exceeding the baseline models. In particular, DRL-Transformer and Transformer-HBBO showed competitive performance, validating the contribution of their respective modules. DRL-Transformer benefits from policy-based learning while retaining sequence modeling, whereas Transformer-HBBO showcases the utility of HBBO in optimizing Transformer configurations for high-frequency environments. The performance of DRL-HBBO, while better than that of standard DRL, remained inferior to its full hybrid counterparts, suggesting that HBBO alone is insufficient without temporal modeling. The traditional Transformer and DRL baselines exhibited moderate gains but were unable to match the adaptability and long-term optimization capacity required for THz channels. The BERT and GAN models performed the worst, underscoring the limitations of purely supervised or generative models in reinforcement-based THz system optimization. These results underscore the necessity of hybrid model-aware optimization strategies in RIS-assisted THz scenarios. The superior performance of ODRL-Transformer confirms its ability to capture long-term dependencies, adapt to non-stationary THz environments, and robustly tune system parameters in the presence of extreme propagation impairments.
Figure 7 illustrates the variation in the sum rate as a function of the BS-to-user distance d_BU for different configurations of RIS elements in a THz communication scenario. Due to the inherent high-frequency characteristics of the THz band, such as severe path loss and molecular absorption, signal attenuation increased significantly with the distance. This makes intelligent beamforming and surface reconfiguration via RIS particularly crucial in maintaining communication performance over longer ranges. The proposed RIS-assisted system demonstrated clear performance superiority over the non-RIS baseline. Across all distance values, the deployment of RISs substantially enhanced the sum rate, effectively mitigating the THz-specific propagation impairments. In particular, larger RIS sizes (e.g., N = 64) yielded higher spectral efficiency, leveraging greater reflection diversity and passive beamforming gains. For instance, at d_BU = 10 m, the system with N = 64 achieved a sum rate exceeding 22 bps/Hz, more than doubling the performance of the baseline without RISs. As the user moved further from the BS, all configurations experienced a performance decline, consistent with the high sensitivity of the THz channel to the distance. However, systems with larger RIS configurations exhibited slower degradation, demonstrating their enhanced robustness. Even at d_BU = 70 m, the configuration with N = 64 maintained a significant advantage over the non-RIS set-up, indicating the RIS’s effectiveness in extending the communication range in challenging THz environments. These results validate the role of RISs in compensating for distance-related losses and highlight the importance of optimizing the number of RIS elements to balance performance with deployment cost in practical THz systems.
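The distance sensitivity discussed above follows from the combination of free-space spreading loss and molecular absorption. The short sketch below illustrates this with a Friis spreading term multiplied by a Beer–Lambert absorption factor; the 0.3 THz carrier and the absorption coefficient k_abs are placeholder values chosen for illustration, not the parameters used in the paper's channel model.

```python
import numpy as np

C = 3.0e8  # speed of light (m/s)

def thz_path_gain(d_m, f_hz=0.3e12, k_abs=0.005):
    """Illustrative THz path gain: free-space spreading times molecular absorption.

    k_abs is an assumed absorption coefficient in 1/m (Beer-Lambert term); real
    values depend strongly on carrier frequency and humidity.
    """
    spreading = (C / (4.0 * np.pi * f_hz * d_m)) ** 2   # Friis free-space term
    absorption = np.exp(-k_abs * d_m)                    # molecular absorption loss
    return spreading * absorption

for d in (10, 30, 50, 70):
    print(f"d_BU = {d:2d} m -> path gain = {thz_path_gain(d):.3e}")
```

Because the absorption term decays exponentially with distance while the spreading term decays only quadratically, the achievable rate drops quickly beyond a few tens of meters unless the RIS aperture, and hence its passive beamforming gain, is increased, which is the trend observed in Figure 7.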
Table 2 presents the quantitative comparison of all models in terms of prediction accuracy for our RIS-assisted THz communication task. As shown, the proposed ODRL-Transformer framework significantly outperformed all competing models across all three evaluation metrics. It achieved an RMSE of 0.03, R² score of 0.97, and MAPE of just 0.52%, indicating extremely high prediction accuracy, a minimal error magnitude, and strong correlation with the actual values. The next best-performing model, DRL-Transformer, recorded higher error rates (RMSE = 2.09, MAPE = 2.19%) and lower correlation (R² = 0.91), underscoring the critical contribution of HBBO-driven hyperparameter tuning in enhancing learning efficiency and output quality. Similarly, Transformer-HBBO and DRL-HBBO, which only partially integrate HBBO, showed moderate gains over their non-optimized versions but fell short of full ODRL integration. The models without DRL or Transformer components performed noticeably worse. The GAN approach in particular exhibited the highest RMSE (9.12) and MAPE (14.26%) and the lowest R² (0.81), reflecting poor generalization and high deviation from the ground truth. This contrast highlights the importance of combining sequential modeling (via Transformer), decision making (via DRL), and meta-optimization (via HBBO) to address the challenges of dynamic, high-dimensional environments like RIS-assisted THz systems.
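For completeness, the three metrics reported in Table 2 can be computed as follows. This is a minimal sketch using the standard definitions of RMSE, the coefficient of determination R², and MAPE expressed in percent; the toy arrays are placeholders, not the paper's data.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def mape_percent(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Toy example (placeholder values only).
y_true = np.array([10.2, 11.5, 12.8, 14.1])
y_pred = np.array([10.1, 11.6, 12.7, 14.3])
print(rmse(y_true, y_pred), r2_score(y_true, y_pred), mape_percent(y_true, y_pred))
```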
Figure 8 visualizes the R² scores from Table 2 in a more intuitive format to highlight performance differences in terms of prediction accuracy. As illustrated in Figure 8, the proposed ODRL-Transformer framework achieved the highest R² score, clearly outperforming all baseline models in terms of correlation between the predicted and actual values. This strong alignment reflects the model’s ability to learn complex temporal and contextual dependencies in RIS-assisted THz environments. The visual comparison reinforces the effectiveness of combining transformer-based sequence modeling, DRL-based policy learning, and HBBO-based hyperparameter tuning into a unified framework. The substantial performance gap, especially against the standalone DRL, BERT, and GAN models, demonstrates the importance of jointly optimizing both the architecture and training dynamics to address the nonlinearities and high dimensionality inherent in our problem domain.
Figure 9 visually presents the RMSE and MAPE values of all models. As shown in Figure 9, the ODRL-Transformer approach exhibited the lowest RMSE and MAPE among all compared models, indicating minimal deviation from the ground truth in absolute and relative terms. The gap is particularly prominent when compared with the traditional DRL, Transformer, BERT, and GAN baselines, which demonstrated substantially greater errors. This visualization reinforces the numerical results from Table 2 and further highlights the effectiveness of combining DRL, transformer encoding, and HBBO-driven hyperparameter tuning. The clarity of this bar chart allows for immediate visual recognition of the model’s superiority in terms of both precision and generalization.
Table 3 presents the results of the statistical t tests conducted to determine whether the performance improvements of the proposed ODRL-Transformer architecture over the baseline models were statistically significant. The comparisons were based on the prediction performance across multiple runs, using a 99% confidence level ( α = 0.01 ) to validate the robustness of the observed gains. The results demonstrate that in all comparisons, the p values were significantly below the 0.01 threshold, confirming that the improvements achieved by the ODRL-Transformer approach were not attributable to random variation. Specifically, the differences between ODRL-Transformer and its direct variants (DRL-Transformer, Transformer-HBBO, and DRL-HBBO) were all statistically significant, underscoring the added value of combining all three components: transformer encoding, DRL, and HBBO-based optimization. Additionally, comparisons with the standard Transformer, DRL, BERT, and GAN baselines also yielded highly significant p values, further reinforcing the superiority of the proposed model across diverse learning strategies. These statistical outcomes validate the consistency and generalizability of the ODRL-Transformer framework’s performance advantage. By formally confirming that the performance improvements were statistically meaningful, the t test results support the conclusion that the proposed model delivers reliable and repeatable gains in RIS-assisted THz environments.
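As a reproducibility aid, the significance test reported in Table 3 can be run with a few lines of code. The sketch below assumes that per-run prediction errors (e.g., RMSE values over repeated trainings) are available for each pair of models and applies a two-tailed paired t test at α = 0.01 via scipy.stats.ttest_rel; the arrays are synthetic placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.stats import ttest_rel

alpha = 0.01  # 99% confidence level, as used in Table 3

# Synthetic per-run RMSE values for two models (placeholders only).
rng = np.random.default_rng(0)
rmse_odrl = rng.normal(loc=0.05, scale=0.01, size=30)
rmse_baseline = rng.normal(loc=2.10, scale=0.30, size=30)

t_stat, p_value = ttest_rel(rmse_odrl, rmse_baseline)  # two-tailed paired t test
verdict = "significant" if p_value < alpha else "not significant"
print(f"t = {t_stat:.3f}, p = {p_value:.2e} -> {verdict} at alpha = {alpha}")
```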
Figure 10 provides a visual comparison of the training dynamics for all evaluated models by plotting the RMSE values across epochs. The ODRL-Transformer framework showed the fastest and most stable convergence, achieving a near-zero RMSE within the first 150 epochs. In contrast, models like GAN, BERT, and standard DRL converged more slowly and exhibited higher error floors, indicating a limited learning capacity and weaker adaptation to the data structure. The hybrid variants (such as DRL-Transformer, Transformer-HBBO, and DRL-HBBO) performed moderately well but still lagged behind the proposed model in terms of both convergence rate and final accuracy. The observed convergence behavior highlights the strength of combining temporal attention (via the transformer encoder), adaptive decision making (via DRL), and hyperparameter optimization (via HBBO). ODRL-Transformer not only learned faster but also achieved a more stable and consistent RMSE trajectory, avoiding the oscillations or plateaus seen in other methods. This indicates enhanced learning efficiency, which is particularly crucial for real-time or resource-constrained THz communication scenarios, where rapid adaptation is needed.

5. Discussion

This section provides an in-depth interpretation of the experimental results presented earlier, aiming to contextualize the performance of the proposed ODRL-Transformer framework within both theoretical and practical dimensions. While the previous sections focused on quantitative comparisons across accuracy metrics such as the RMSE and MAPE, the discussion now shifts toward broader insights related to runtime efficiency, stability, scalability, and applicability in real-world THz communication systems. We analyze the runtime behavior of each model to assess their computational feasibility for time-sensitive scenarios, particularly in RIS-assisted THz environments, where rapid decision making is critical. In addition, we examine the performance variance across multiple training runs to evaluate each model’s robustness and reliability under stochastic conditions. Finally, we briefly discuss the potential suitability of the proposed ODRL-Transformer architecture for real-world deployment, reflecting on its practicality and adaptability beyond controlled simulation settings.
Table 4 presents a detailed analysis of the computational complexity of all evaluated models by reporting their runtimes to reach progressively stricter RMSE thresholds (12, 8, 4, and 2). This runtime-based evaluation offers insight into each model’s convergence efficiency and suitability for time-sensitive applications. The stopping criterion for each algorithm was defined as the moment it achieved a given RMSE value for the first time during training. The results show that the ODRL-Transformer framework achieved an RMSE <12 in just 25 s. It was also the only model capable of reaching an RMSE <2 within a practical time window (153 s), demonstrating an exceptional convergence capability and computational efficiency. In contrast, models such as GAN, BERT, and DRL not only converged more slowly but also failed to achieve RMSE levels below four within a reasonable time frame, indicating limitations in precision and scalability. These observations confirm the superiority of ODRL-Transformer in minimizing error rapidly while maintaining high accuracy.
Moreover, the Transformer-HBBO and DRL-HBBO variants, although improved over their non-optimized counterparts, still required significantly longer runtimes to meet the same RMSE thresholds. For instance, DRL-Transformer reached an RMSE <4 in 394 s, while ODRL-Transformer achieved this in less than a quarter of the time. This highlights the advantage of joint optimization using HBBO, which dynamically adjusts learning rates, discount factors, and architecture settings to accelerate convergence without compromising performance. From a computational complexity standpoint, these results imply that the proposed architecture not only achieved high predictive accuracy but also did so with superior time efficiency. The reduced training time directly translates to lower computational overhead and energy consumption, which are critical concerns in THz-enabled RIS applications where both latency and resource usage must be minimized.
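The stopping criterion behind Table 4 is straightforward to instrument: record the wall-clock time at which the training RMSE first falls below each threshold. The sketch below is a hedged illustration of that bookkeeping; train_one_epoch and eval_rmse are hypothetical callables standing in for a given model's training and evaluation routines, and the thresholds follow Table 4.

```python
import time

def time_to_rmse_thresholds(train_one_epoch, eval_rmse,
                            thresholds=(12.0, 8.0, 4.0, 2.0), max_epochs=300):
    """Record the first wall-clock time (in seconds) at which the RMSE drops
    below each threshold; thresholds never reached within max_epochs are omitted."""
    remaining = sorted(thresholds, reverse=True)   # check the loosest threshold first
    reached = {}
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()
        current_rmse = eval_rmse()
        while remaining and current_rmse < remaining[0]:
            reached[remaining.pop(0)] = time.perf_counter() - start
        if not remaining:
            break
    return reached  # mapping from RMSE threshold to elapsed seconds
```

Models that never dip below a given threshold simply leave the corresponding entry missing, which corresponds to the blank cells in Table 4.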
Table 5 presents the total execution times of all evaluated models when trained for a fixed budget of 300 epochs, allowing a direct comparison of computational costs under identical termination conditions. This complements the convergence-time analysis in Table 4, which measured the runtimes to reach predefined RMSE thresholds. Unlike Table 4, where runtime differences were influenced by each model’s convergence rate, Table 5 isolates the inherent computational overhead of each architecture, including the cost introduced by the HBBO meta-optimizer. From Table 5, it is clear that the HBBO-based models (Transformer-HBBO and DRL-HBBO) required moderately higher execution times than their non-HBBO counterparts (Transformer and DRL) due to the additional optimization layer in the early training stages. However, this overhead remained within a feasible range (approximately 26–38% relative to the corresponding baseline models).
When combined with the results in Table 4, these findings show that although HBBO introduced extra computation initially, it substantially accelerated convergence toward lower RMSE thresholds. This trade-off is particularly advantageous in dynamic 6G environments; despite a slight initial overhead, the model benefits from faster adaptation and less frequent retraining. As a result, the proposed ODRL-Transformer framework remains well suited for near real-time deployment.
To assess the consistency and robustness of each learning model, we conducted 30 independent training runs and computed the variance of the resulting RMSE values. Table 6 summarizes these variances, offering insight into how sensitive each model is to initialization randomness and stochastic training factors. A lower variance reflects higher stability and more predictable behavior. The proposed ODRL-Transformer framework exhibited an exceptionally low variance of 0.00004, indicating near-identical performance across all 30 runs. This result underscores the robustness of the model and its strong convergence reliability. In comparison, models like DRL-Transformer and Transformer-HBBO yielded variances of 1.39 and 2.01, respectively, which although improved over conventional baselines still revealed moderate variability. Legacy architectures such as Transformer (6.41), DRL (7.85), BERT (8.21), and GAN (9.26) demonstrated considerably higher variances, suggesting unstable convergence behavior and inconsistent performance across different training sessions. These fluctuations may lead to unpredictable decisions or poor performance in mission-critical wireless applications.
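The stability analysis in Table 6 amounts to repeating training with different random seeds and measuring the spread of the final error. A minimal sketch of that procedure is given below; train_and_evaluate is a hypothetical callable that trains one model with the supplied seed and returns its final RMSE.

```python
import numpy as np

def rmse_variance_over_runs(train_and_evaluate, n_runs=30):
    """Variance of the final RMSE over n_runs independent trainings,
    in the spirit of the 30-run stability check reported in Table 6."""
    final_rmses = np.array([train_and_evaluate(seed) for seed in range(n_runs)])
    return float(np.var(final_rmses))
```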
The consistently superior performance of the ODRL-Transformer architecture across accuracy, convergence speed, variance, and statistical significance evaluations collectively suggests that this architecture is a strong candidate for real-world deployment. Its ability to achieve high prediction accuracy in significantly fewer epochs compared with all the baselines implies practical efficiency in terms of training time and energy usage. Furthermore, its stability across 30 runs (variance = 0.00004) highlights its resilience to initialization randomness and its robustness in diverse environments. From a systems perspective, the lightweight nature of the transformer encoder (used solely as a feature extractor) reduced the computational overhead compared with full sequence-to-sequence Transformer set-ups. When integrated with the DRL policy and optimized via HBBO, this design yields a streamlined decision-making pipeline that adapts well to dynamic wireless conditions without compromising the response time. The fact that ODRL-Transformer outperformed even the fine-tuned GAN- and BERT-based models, which are typically more resource-intensive, further underscores its efficiency–performance trade-off advantages. Moreover, the model’s modularity and generality, separating environment perception (Transformer), decision making (DRL), and optimization (HBBO), make it flexible enough to be adapted to other domains such as UAV-assisted networking, smart manufacturing, or vehicular edge computing. Although real-world deployment may require domain-specific tuning or additional safety mechanisms, the current results strongly suggest that the proposed model is not only effective but also practically viable in intelligent, resource-constrained wireless environments.
It is also important to note that while the current study evaluated a single-user scenario for clarity and tractability, the proposed framework can be extended to multi-user environments. In such cases, the state and action spaces expand proportionally with the number of users, resulting in higher model complexity and increased training times. Specifically, the cost of the transformer encoder’s self-attention scales quadratically with the input sequence length, which grows with the number of users, while the DRL agent faces a more complex policy optimization problem. Nonetheless, the modular design of the ODRL-Transformer approach enables the incorporation of user-specific representations and the use of scalable learning techniques, such as user clustering or hierarchical policy decomposition, to support efficient adaptation in multi-user scenarios.

6. Conclusions

In this work, we proposed the ODRL-Transformer framework, a hybrid learning architecture that integrates a transformer encoder for temporal feature extraction, a DRL agent for adaptive decision making, and an HBBO meta-optimizer for dynamic hyperparameter tuning. The design targets accurate prediction, fast convergence, stability, and computational efficiency in RIS-assisted THz communication systems under dynamic conditions. Comprehensive experiments against multiple strong baselines demonstrated the superiority of the proposed model across all key metrics. ODRL-Transformer achieved an RMSE of 0.03, R2 value of 0.97, and MAPE of 0.52%, significantly outperforming the next-best DRL-Transformer baseline. It also converged rapidly, reaching an RMSE threshold of two in 153 s, and exhibited exceptional stability, with a variance of 0.00004 over 30 independent runs. These results confirm that the synergy between transformer-based temporal modeling, DRL-driven policy optimization, and HBBO’s adaptive parameter control delivers consistent and statistically significant performance gains. The framework’s modular design ensures adaptability to the complex, nonlinear, and high-dimensional nature of RIS-assisted THz environments. While the current study focused on this specific scenario, the underlying methodology is generalizable to other domains such as autonomous control, adaptive edge computing, and smart environments. The architecture can be extended with future enhancements, including online learning, cooperative decision making, and hardware-aware deployment. One promising extension is to adapt the model for rapidly time-varying or frequency-selective THz channels by modifying the channel modeling and state representation. Specifically, this could involve incorporating multi-tap channel impulse responses to capture frequency selectivity and extending the state vector with temporal features such as Doppler estimates or short-term channel evolution patterns to improve robustness under user mobility. Another promising research direction is to investigate the robustness of the proposed ODRL-Transformer framework under imperfect CSI conditions. In practical THz systems, perfect and instantaneous channel knowledge is rarely available due to estimation delays, measurement noise, and hardware limitations. Future extensions could incorporate noisy, outdated, or quantized CSI during training to simulate realistic environments and enable the agent to learn robust policies.

Author Contributions

Conceptualization, P.S.M., S.S.K., F.H.-G. and D.M.; methodology, S.S.K. and P.S.M.; software, F.H.-G.; validation, P.S.M., S.S.K., F.H.-G. and D.M.; formal analysis, P.S.M.; investigation, P.S.M., S.S.K. and F.H.-G.; resources, D.M.; data curation, S.S.K. and F.H.-G.; writing—original draft preparation, P.S.M., S.S.K., F.H.-G. and D.M.; visualization, P.S.M.; supervision, D.M.; project administration, D.M.; funding acquisition, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The system model for reliable RIS-aided THz communication in 6G networks.
Figure 2. The transformer encoder architecture.
Figure 3. The RL architecture.
Figure 4. The proposed ODRL-Transformer model.
Figure 5. Sum rate versus the transmit SNR for different RIS deployment scenarios under the proposed algorithm.
Figure 6. Sum rate versus the transmit SNR for different algorithms and N = 32.
Figure 7. Sum rate versus the distance between the base station and the user for different RIS deployment scenarios under the proposed algorithm.
Figure 8. Bar chart visualization of the R² scores for all models.
Figure 9. Bar chart comparison of RMSE and MAPE values across all models.
Figure 10. Convergence analysis of all models in terms of RMSE over training epochs.
Table 1. Parameter settings of proposed algorithms.

| Model | Parameter | Value |
|---|---|---|
| ODRL-Transformer | Learning rate | 0.001 |
| | Batch size | 64 |
| | Feedforward hidden size | 2048 |
| | Weight decay | 0.02 |
| | Dropout rate | 0.2 |
| | Number of attention heads | 8 |
| | Number of encoder layers | 6 |
| | Optimizer | HBBO |
| | Discount factor (γ) | 0.92 |
| | ϵ-greedy value | 0.43 |
| | Probability range for migration | [0, 1] |
| | Mutation rate | 0.05 |
| | Iterations | 300 |
| | Population size | 150 |
| Transformer | Learning rate | 0.003 |
| | Batch size | 128 |
| | Feedforward hidden size | 2048 |
| | Weight decay | 0.02 |
| | Dropout rate | 0.2 |
| | Number of attention heads | 12 |
| | Number of encoder layers | 6 |
| | Activation function | GELU |
| | Optimizer | SGD |
| BERT | Learning rate | 0.004 |
| | Batch size | 64 |
| | Dropout rate | 0.1 |
| | Number of self-attention heads per layer | 8 |
| | Number of transformer encoder layers | 8 |
| | Length of input time series window | 64 |
| | Activation function | GELU |
| | Optimizer | SGD |
| DRL | Learning rate | 0.002 |
| | Discount factor (γ) | 0.97 |
| | ϵ-greedy value | 0.41 |
| | Batch size | 64 |
| | Optimizer | Adam |
| GAN | Learning rate | 0.003 |
| | Number of neurons in hidden layers | 32 |
| | Batch size | 128 |
| | Momentum term | 0.06 |
| | Activation function | ReLU |
| | Convergence threshold | 0.068 |
| | Optimizer | Adam |
Table 2. Performance comparison of ODRL-Transformer and baseline models.

| Model | RMSE | R² | MAPE |
|---|---|---|---|
| ODRL-Transformer | 0.03 | 0.97 | 0.52% |
| DRL-Transformer | 2.09 | 0.91 | 2.19% |
| Transformer-HBBO | 2.61 | 0.89 | 3.29% |
| DRL-HBBO | 2.79 | 0.88 | 3.41% |
| Transformer | 6.32 | 0.84 | 9.10% |
| DRL | 7.46 | 0.83 | 10.12% |
| BERT | 7.93 | 0.82 | 11.02% |
| GAN | 9.12 | 0.81 | 14.26% |
Table 3. Statistical significance analysis of model performance using two-tailed paired t tests.

| Comparison | p Value | Result | α |
|---|---|---|---|
| ODRL-Transformer vs. DRL-Transformer | 0.0005 | Significant | 0.01 |
| ODRL-Transformer vs. Transformer-HBBO | 0.00009 | Significant | 0.01 |
| ODRL-Transformer vs. DRL-HBBO | 0.00008 | Significant | 0.01 |
| ODRL-Transformer vs. Transformer | 0.00005 | Significant | 0.01 |
| ODRL-Transformer vs. DRL | 0.00004 | Significant | 0.01 |
| ODRL-Transformer vs. BERT | 0.00003 | Significant | 0.01 |
| ODRL-Transformer vs. GAN | 0.00001 | Significant | 0.01 |
Table 4. Runtime (in seconds) required by each model to reach specific RMSE thresholds.

| Model | RMSE < 12 | RMSE < 8 | RMSE < 4 | RMSE < 2 |
|---|---|---|---|---|
| ODRL-Transformer | 25 | 59 | 87 | 153 |
| DRL-Transformer | 121 | 193 | 394 | — |
| Transformer-HBBO | 186 | 249 | 519 | — |
| DRL-HBBO | 201 | 280 | 593 | — |
| Transformer | 286 | 405 | — | — |
| DRL | 309 | 486 | — | — |
| BERT | 329 | 516 | — | — |
| GAN | 390 | — | — | — |
Table 5. Total runtime for each model under a fixed termination condition of 300 training epochs.

| Model | Runtime (s) |
|---|---|
| ODRL-Transformer | 791 |
| DRL-Transformer | 732 |
| Transformer-HBBO | 593 |
| DRL-HBBO | 561 |
| Transformer | 469 |
| DRL | 406 |
| BERT | 516 |
| GAN | 429 |
Table 6. Variance in prediction results over 30 independent executions for each model.

| Model | Variance |
|---|---|
| ODRL-Transformer | 0.00004 |
| DRL-Transformer | 1.39654 |
| Transformer-HBBO | 2.01856 |
| DRL-HBBO | 2.28563 |
| Transformer | 6.41236 |
| DRL | 7.85369 |
| BERT | 8.21459 |
| GAN | 9.25806 |
