1. Introduction
Millimeter-wave (mm-wave) communication problems such as high path loss and line-of-sight limitations can be addressed with sensing technology. Sensing provides the deep reinforcement learning (DRL) agent with rich environmental feedback, which helps it adjust beam patterns and better forecast channel fluctuations. This collaboration not only enhances the performance of hybrid precoding systems but also strengthens the dependability and resilience of mm-wave communication in complex, dynamic scenarios, a critical feature for beyond-fifth-generation (B5G) networks. Hybrid beamforming is a promising technique for mm-wave massive multiple-input multiple-output (mMIMO) systems, providing excellent spectral efficiency, high data rates, and a favorable trade-off between hardware complexity and transmission performance. While the reliance on low-frequency bands limits the capacity to provide low latency, ultra-high speeds, higher service quality, and seamless wireless coverage, the rapid expansion in mobile users and devices has increased the burden on conventional spectrum resources (300 kHz to 3 GHz) [1,2]. However, in large-scale dense deployments, higher-frequency waves can deliver multi-gigabit rates [3]. The newest commercial mobile communication technology, 5G, was created to meet mobile consumers' wireless transmission needs and offer rudimentary support for massive communication systems, such as the Internet of Things (IoT) and enterprise networks. Exploring higher-frequency spectrum is essential to resolving the tension between the many access requirements of terminal devices in 5G networks and the constrained availability of spectrum below 6 GHz [4,5,6].
Antenna array deployment is essential for drastically lowering the number of radio frequency (RF) chains required in mm-wave mMIMO systems. This feature is particularly helpful in real-world wireless communication scenarios because there are frequently more antennas than RF chains, which makes precise channel estimation extremely difficult. In recent years, a variety of beamforming-based channel estimation techniques have been proposed to address these issues. In particular, a low-complexity two-way channel estimation method was described in [7,8,9]. This technique first uses beam training to locate the antennas associated with dominant beams between the users and the base station (BS); the estimation process then considers only the channel elements related to these selected antennas. However, the number of pilot symbols needed to cover every potential beam grows with the number of BS antennas, which is very large. Some traditional compressive sensing-based algorithms, such as the orthogonal matching pursuit algorithm employed in [10], can estimate the beamforming channel with a lower pilot overhead by exploiting the sparsity of beamforming channels [11,12,13]. Regretfully, none of the aforementioned beamforming channel estimation algorithms [11,12,13,14] attain sufficient estimation accuracy in low signal-to-noise ratio (SNR) regions. Furthermore, these methods are computationally complex, especially when dealing with high-sparsity beamforming channels. The antenna array at the BS has also been employed [7,15,16,17,18] in mm-wave mMIMO systems; this allows signals to be concentrated on various antennas and transforms the spatial channel into a beamspace channel. Low-complexity beamspace channel estimation is possible with the approximate message passing (AMP) technique, a powerful iterative algorithm for sparse signal recovery [19,20]. The best shrinkage parameters for the AMP approach are hard to find, and empirical shrinkage values are usually used instead; this restriction limits the practical channel estimation performance of the method.

DRL is used in mMIMO systems to solve computationally demanding problems, including resource allocation (RA), beamforming, and channel estimation, that arise from the enormous number of users and antennas, by providing more intelligent beam training. DL enables these systems to process high-dimensional data efficiently, adjust to changing conditions, and maximize performance measures such as latency, energy efficiency (EE), and spectral efficiency (SE). The development of B5G wireless communication systems depends heavily on DL, since it also enables data-driven methods to model and mitigate uncertainties, improve user experience, and lower computational complexity [20]. When DL is applied to mm-wave systems, the distinct features of the mm-wave frequency spectrum make RA a crucial issue in mm-wave communication systems. Despite the directional nature of mm-wave signals, significant path loss, and interference, using mm-wave technology can lead to greater data rates and network capacity. Utilizing time, frequency, power, and other available resources to their fullest potential can be achieved through RA [6,21,22,23]. As the need for high-speed wireless communication services grows, this may help to sustain it. Moreover, power resource optimization, a crucial aspect of battery-powered devices, can be achieved by applying RA: RA helps lessen the environmental impact of wireless communication and prolongs device battery life by consuming less power. Nonetheless, to meet the varying bandwidth and latency needs of various applications, such as gaming and video streaming, the quality-of-service demands must be met. To this end, applying DL to mm-wave systems requires a substantial amount of data and computational resources, and careful DL algorithm design and optimization are required to achieve optimal EE [24,25]. Through intelligent and adaptive RA based on real-time feedback from the system, DL can be utilized to enhance RA in mm-wave systems. DL makes intelligent and adaptive signal processing possible, which can enhance the performance of mm-wave systems [26]. Furthermore, the performance, reliability, and efficiency of mm-wave systems can be greatly enhanced by using DL for RA. It is necessary to carefully evaluate the system requirements, channel conditions, and available computational resources when designing and optimizing DL algorithms for RA [27,28,29,30].

Effective channel estimation and beamforming are severely limited by the drawbacks of traditional mm-wave systems, such as the difficulty of using a large number of antennas with a small number of RF chains. As the number of antennas in these systems increases, so does the complexity of precisely estimating the channel, resulting in enormous processing demands. Furthermore, to cover the large beam space, classic beamforming approaches frequently call for many pilot symbols, which raises pilot overhead and lowers system efficiency. It becomes harder to control the necessary RF chains and the related beam training overhead in mm-wave systems as the number of antennas increases. These limitations highlight the need for creative solutions, such as the proposed angular-based hybrid precoding (AB-HP), which optimizes beam training procedures and reduces RF-chain utilization by exploiting angular-based information. This method paves the way for more effective and scalable mm-wave systems in B5G networks by directly addressing the problems of high computational complexity, decreased beamforming accuracy in low-SNR regions, and pilot signal power.
1.1. Related Works
The optimization of RA, including beamforming and power, in AB-HP systems is an interesting but difficult subject that merits further research. Through mm-wave massive MIMO and hybrid beamforming optimization, dynamic beam selection and power allocation (PA), SE enhancement, and reduced complexity, DRL improves B5G performance. In a DL-based channel estimation network, a convolutional neural network (CNN) is used to extract channel response characteristics, while a long short-term memory (LSTM) network is used to boost performance. This approach uses machine learning techniques to teach neural networks to carry out particular tasks [31,32,33,34]. To accommodate the time-varying nature of wireless channels, the suggested DL-based channel estimation network employs an LSTM network to capture temporal dependencies and a CNN to compress channel response data into feature vectors. In dynamic contexts, this combination increases the accuracy of channel estimates. Compared to massive MIMO systems, mm-wave technology requires fewer transmit antennas and RF chains in high-speed mobile scenarios, where the method is very successful [35,36]. To improve the hybrid analog/digital beamforming architecture, baseband precoding must be transmitted to use fewer RF chains; baseband decoding can then be applied to the received signal in the uplink to produce low-cost hybrid beamforming [37,38]. Moreover, digital baseband (DB) beamforming can be implemented with microprocessing, while analog RF-chain beamforming can be implemented with phase shifters to produce a high data rate [39,40,41,42]. By using mMIMO baseband precoding instead of dedicating an unconnected RF chain to every antenna, the high cost and high power consumption of combined signal components can be reduced. The combiner hardware in [43] featured a fully connected phase shifter network, and every antenna for the analog signal was connected to the feeding RF chains through phase shifter on/off switches. The authors of [44] propose a DL-PA and hybrid precoding technique for multi-user mMIMO systems. To cut down on the number of RF chains and the channel estimation overhead, an angular-based hybrid precoding (AB-HP) technique was employed. Ref. [45] proposed DL-driven non-orthogonal hybrid beamforming for the single-user case, which provides high SE and a good balance between transmission performance and hardware complexity. Ref. [46] proposed a DRL approach to user pairing, assigning an orthogonal sub-channel to transmit data to the cognitive BS without causing interference, in order to obtain the optimal transmission power and bandwidth allocation. In contrast, Ref. [47] suggested a ground-breaking experience-driven PA method for mm-wave high-speed railway systems with hybrid beamforming that learns power decisions from experience rather than from a precise mathematical model.
The proposed method in [47] aims to achieve higher SE by selecting the best DL training approach to reduce inter-channel interference, which minimizes hardware complexity. With the hybrid beamforming approach, the number of required RF chains can be greatly reduced by combining analog and digital processing, which lowers power consumption and hardware complexity. In addition, the suggested approach makes use of DL to teach the system to choose the optimal beamforming weights for every channel, thereby lowering inter-channel interference and enhancing SE [48]. Because the DL method can adapt to changing channel conditions and learn complicated patterns in the wireless channel, it is especially successful in reaching this goal. All things considered, the suggested technique offers a novel way to increase the spectral efficiency of digital transceivers with many antennas. Through the use of DL and affordable digital-analog hybrid beamforming algorithms, this approach can improve wireless channel performance while reducing hardware complexity [49,50,51,52]. Increasing the number of antennas per RF-chain transceiver can make it more practical to implement fully connected beamforming. In related work, the authors of [53] proposed a new beam training algorithm via DRL in B5G to achieve higher EE or SE performance with a lower training overhead. Ref. [54] proposed the development of DL techniques to optimize EE/SE and the spectrum sensing process. In addition, Refs. [28,55] proposed joint energy-spectral efficiency optimization for a dynamic adaptation approach based on machine learning (ML) to achieve EE for connected users. The distributed environment, subject to sharing limitations, was optimized to maximize the EE-SE trade-off by [56] using multi-objective optimization. Furthermore, they employed Pareto-optimal solutions to optimize the framework by combining methods applied by different UEs and operators simultaneously. The authors of [40,57] investigated the EE-SE trade-off and enabled HetNets to improve UE association, bandwidth allocation, and large antenna arrays by taking advantage of the backhaul capacity limitation to optimize both EE and SE; this allowed them to guarantee high data rates for users. Subject to the total transmit power allocation constraint, the authors of [58] formulate a multi-objective optimization to maximize the EE-SE trade-off and determine the feasible DL rate. Because SE is unrelated to transmit power in a massive MIMO system, reducing the number of antennas can boost EE [59]. Key elements such as transmission power and antenna count can be efficiently optimized for both EE and SE by using alternating optimization strategies [60]. Therefore, our proposed approach has the potential to significantly improve the efficiency and reliability of hybrid precoding in multi-user mMIMO systems. Current techniques for optimizing SE and EE in multi-user mMIMO systems, such as traditional hybrid beamforming, try to minimize the number of RF chains by combining analog and digital processing. These methods still have a number of drawbacks, though, such as lengthy beam training processes and high pilot power needs for channel estimation, which raise overhead and computational complexity and ultimately restrict their scalability and usefulness for sixth-generation wireless networks. DRL-based hybrid precoding techniques have been introduced to address these issues, yet many fail to effectively balance the beam training overhead with RF chain utilization. The proposed AB-HP architecture with RBT-DRL overcomes these limitations by incorporating angular-domain feedback to dynamically optimize EE and SE while significantly reducing training overhead and RF chain usage. Unlike prior approaches that focus solely on reducing RF chains or improving training efficiency, AB-HP with RBT-DRL achieves joint optimization by leveraging DRL's adaptability to changing channel conditions. This integration enhances real-time flexibility, minimizes pilot overhead, and ensures efficient RF chain distribution, offering a robust solution for mm-wave multi-user mMIMO systems in B5G networks.
The novelty of the proposed AB-HP framework lies in its unique integration of angular-domain information and RBT-DRL to simultaneously optimize EE and SE in mm-wave multi-user mMIMO systems for B5G networks. While DRL-based hybrid precoding has been explored in prior studies, the proposed framework distinguishes itself by dynamically leveraging angular information to minimize RF chain utilization and channel estimation complexity, a feature not comprehensively addressed in existing work. Furthermore, including the RBT-DRL mechanism enables dual optimization of EE and SE while drastically reducing the beam training overhead, thereby improving system flexibility and adaptability to dynamic network conditions. This dual optimization approach, combined with the innovative use of angular-based feedback in DRL for mm-wave multi-user mMIMO hybrid precoding, dynamically optimizes EE/SE through AB-HP and maximum-reward beam training, significantly reducing the beam training overhead and RF chain usage while adapting to real-time network dynamics. Existing studies on mm-wave systems have thoroughly examined the difficulties in beamforming, channel estimation, and RF chain utilization. Although the number of RF chains has been reduced with the introduction of hybrid beamforming, this method still requires a high pilot signal power and incurs high beam training complexity. Additionally, even though RBT-DRL has been used to enhance beamforming and RA in mm-wave systems, many solutions ignore the critical problem of efficient beam training and RF chain usage. Recent developments that optimize beamforming and reduce the number of RF chains, such as the AB-HP technique and DRL approaches, have made headway in overcoming these constraints. However, only a few studies have fully combined these methods to address channel estimation difficulty, RF chain use, and beam training overhead, which is a key gap that the proposed AB-HP framework aims to fill. RBT-DRL techniques have demonstrated their effectiveness in numerous fields, including the development of low-complexity PA techniques that achieve near-optimal system capacity, making them suitable for real-time applications in hybrid-precoding-based multi-user MIMO systems that deploy AB-HP to reduce the number of RF chains and the channel estimation overhead. The authors of [28] recommend examining state-of-the-art techniques such as dimension reduction and feature engineering [5] that can reduce the state vector and action space while maintaining deep neural network (DNN) performance. Through this, we aim to accomplish efficient and effective beam training for mm-wave systems while also drastically reducing the DNN's training time.
1.2. Motivation and Contributions
This study aims to optimize a beam training technique that takes into account the present channel conditions and minimizes the error between the estimated and true channel parameters, enabling dual optimization of EE and SE while drastically lowering the beam training overhead. By minimizing beam training overhead and dynamically adapting to network conditions, RBT-DRL-based beam training optimizes EE and SE while enhancing system performance. Enhancing system performance depends on the transmit power restrictions of the relay station and the BS, which are necessary to exhibit a unique AB-HP design that maximizes the EE-SE trade-off; this also responds to downlink pilot contamination. By leveraging EE results to bound SE performance and vice versa, a unified metric was utilized to enhance the EE-SE trade-off through multi-objective optimization. Following the authors of [48], we develop methods for beam tracking, relay selection, joint user scheduling, and codebook optimization for millimeter-wave networks. Using DRL to reduce system latency, these techniques are based on the design of an online controller that schedules users dynamically and establishes their links. The speed and scalability of B5G wireless networks can be enhanced by mm-wave systems that dynamically adapt to changing channel conditions and user demands. By leveraging DRL's ability to handle complex optimization tasks in real time, DRL-integrated user scheduling and link configuration can significantly improve the performance of mm-wave networks.
The contributions of this article can be summarized as follows:
To emphasize utilizing an AB-HP technique tailored for large multi-user mMIMO systems. The hybrid precoding method, grounded in angular-basis principles, mitigates the need for extensive RF chains and measurement overhead. Addressing the formidable challenges facing the realization of B5G wireless networks, specifically the demanding criteria for high bandwidth and EE, our approach aligns with recent literature directives. The literature advocates strategies to minimize interference and optimize transmission power and bandwidth allocation through the allocation of orthogonal sub-channels.
To provide a comprehensive analysis of the prediction services that can be expected from B5G wireless networks, including their fundamental ideas and wide range of uses. To achieve accurate and high-resolution channel estimation, this includes a special emphasis on DL applications made for orthogonal frequency division multiplexing (OFDM) systems.
To highlight difficulties that need to be solved to achieve improved computational efficiency using DL. Therefore, to enable optimal performance, attention must be directed to improving the accuracy of channel predictions.
To exhibit a beam training algorithm leveraging the AB-HP constructed through a DNN. This algorithm operates within the wireless environment, adapting to changes in channels, and concurrently optimizing EE/SE. The objective is to enhance overall performance while reducing the overhead associated with beam training.
To maximize the long-term predicted reward through interactions with the environment. The reward is designed to represent the balance between the achievable EE or SE and the beam training overhead, which will be optimized through the applied maximum-reward beam training using deep reinforcement learning (RBT-DRL) to minimize the overhead related to beam training while still achieving good EE or SE performance.
3. Leveraging DL for Mm-Wave in B5G
For sensing-related technologies, DL for mm-wave communication in B5G is essential because it allows adaptive beamforming, accurate channel estimation, and real-time environmental awareness, all of which improve system efficiency and dependability under complex and dynamic network settings. DL addresses problems such as dynamic channel fluctuations, beam alignment, and high path loss, so it is crucial to use it for mm-wave communication in B5G. Through adaptive beamforming, efficient channel prediction, and intelligent resource management, DL enhances the performance of mm-wave communication. The use of DL, a powerful machine learning technique well known for its effectiveness in fields including speech recognition, image recognition, and natural language processing, has recently drawn attention from researchers looking to address these issues in mm-wave communication systems [57,58,59,60]. Channel prediction and estimation is one of the primary uses of DL in mm-wave communication. Real-time channel state prediction is possible with DL-based techniques since they can learn the channel characteristics from vast volumes of data. This can boost signal quality and decrease interference, which will enhance the functionality of mm-wave communication systems. Beamforming is another area where DL is used in mm-wave communication. To handle the dual optimization of EE and SE in the face of intrinsic difficulties, including high path loss, beam alignment overhead, and environmental variability, DL for mm-wave communication in B5G is essential. Because DL is so good at finding patterns in complicated data, network parameters may be optimized in real time and adaptively to improve both EE and SE at the same time. DL-based solutions can work around mm-wave restrictions, lessen the computational load, and enhance resource efficiency.
One method for enhancing signal quality and lowering interference is beamforming, which focuses signal transmission in a particular direction. Beamforming techniques based on deep learning can enhance the efficiency of mm-wave communication networks by determining the ideal beamforming weights from user data. DL has demonstrated significant potential in a number of wireless communication technologies, such as mMIMO. Channel estimation, i.e., predicting the channel matrix between the BS and the users, is one of the primary issues in mMIMO; estimating the channel effectively in mMIMO systems can be computationally complex and requires a large number of pilots [61]. By learning the channel properties directly from the received signals, neural networks can be used in DL-based channel estimation to circumvent these difficulties. This can lower the overhead related to pilot training and transmission while increasing the accuracy of the channel estimation. Beamforming is a further use of DL in mMIMO systems [8,19,62]; DL-based beamforming can reduce interference and increase signal quality. It can also enhance the performance of mMIMO systems by learning the ideal beamforming weights from channel data [55,62].
The combination of DL and mMIMO has the potential to significantly improve the performance of wireless communication systems, including 5G and beyond. By leveraging the power of DL, it is possible to overcome some of the challenges associated with traditional wireless communication techniques and enable high-speed, reliable wireless communication systems. Although massive MIMO is a promising technique for future communication networks to maximize SE, the system's performance is negatively impacted by its extremely high spatial complexity. We apply DL to a massive MIMO system with a focus on the difficulties of direction-of-arrival (DOA) estimation and channel estimation, and the viability of this framework is confirmed using in-depth simulations [25].
3.1. DL in Artificial Intelligence-Driven B5G Mm-Wave Networks
DL is a powerful technique in artificial intelligence (AI) that has been widely applied in various fields, including wireless communication networks. In the context of B5G mm-wave networks, which operate in high-frequency bands, DL can play an important role in optimizing system performance and improving user experience.
In [26], a deep initial access (IA) framework with a DNN is described for enabling dependable and prompt initial access in AI-driven B5G mm-wave networks and beyond, as shown in Figure 2. The authors of [54] proposed that, by using a small portion of the available beams, deep IA can also reduce the beam sweep time. This reduced sweep time demonstrates the superiority of deep IA over a conventional IA algorithm; deep IA also functions as adaptive learning and efficient RA, and enhances the overall reliability and speed of communication. IA should only use a portion of the available beams to cut down on sweep time. Received signal strengths, which are obtained from the subset of beams, are mapped to the beams, and deep IA then determines which beam is best oriented towards the receiver side [52]. In particular, in the prediction accuracy of line-of-sight and non-line-of-sight (NLoS) scenarios, deep IA significantly outperforms conventional IA beam selection and reduces IA time [57], despite difficulties such as increased sensitivity to obstructions and signal attenuation, especially in NLoS situations. Through its ability to help the system learn and adapt to complex and dynamic situations, deep IA is much faster and more reliable and can help address these difficulties by delivering higher reliability, especially in the case of correct prediction [60,61,62,63]. DL can also be used in the development of beamforming and beam-tracking algorithms. Beamforming allows signals to be focused on users or places, which can help to lower interference and boost signal strength. By learning from vast amounts of data and generating predictions based on the circumstances at hand, DL can assist in automating this process. Future wireless communication systems should see even more creative uses of DL. The proposed approach leverages RBT-DRL-based beam training to minimize beam training overhead and adapt dynamically to network conditions, significantly improving both SE and EE in mm-wave systems, as shown in Table 4. When compared to existing works such as [33], which focus on reducing beam sweep time by using a subset of available beams, our method offers a more dynamic and robust solution through deep reinforcement learning. Additionally, research such as [46] emphasizes beam allocation and power optimization to reduce hardware costs and power usage, while our solution directly addresses the challenges of beam training and adaptation in a hybrid precoding structure. Compared to [48], which improves system EE using DNNs and DRL, the proposed RBT-DRL approach integrates channel estimation and beamforming, offering a more holistic solution for real-world applications with mobility and interference. Moreover, recent works such as [57] discuss DRL-based beam training for mm-wave channels considering user mobility, which aligns closely with our work; however, our method provides enhanced performance by incorporating low-resolution ADC considerations and intelligent resource management. These comparisons validate the novelty and effectiveness of the proposed method in comparison to existing solutions.
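As an illustration of the deep IA mapping described above, the sketch below shows a small DNN that maps the received signal strengths measured on a subset of probing beams to the index of the best beam in the full codebook; the beam counts, layer sizes, and training details are illustrative assumptions and not the configuration of [26] or [54].

```python
# Minimal sketch (assumed dimensions): a DNN that maps received signal strengths
# (RSS) from a small subset of probing beams to the best beam in the codebook.
import torch
import torch.nn as nn

N_PROBE_BEAMS = 8       # beams actually swept during initial access (assumption)
N_CODEBOOK_BEAMS = 64   # total beams in the analog codebook (assumption)

class DeepIABeamPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_PROBE_BEAMS, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_CODEBOOK_BEAMS),   # logits over all candidate beams
        )

    def forward(self, rss_subset):
        return self.net(rss_subset)

# Training would minimize cross-entropy against the exhaustively searched best beam.
model = DeepIABeamPredictor()
rss = torch.rand(32, N_PROBE_BEAMS)        # one batch of measured RSS vectors
best_beam = model(rss).argmax(dim=-1)      # predicted beam index per user
```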
3.2. Enhancing Channel Prediction Accuracy
Addressing the sparsity and fast channel fluctuations of mm-wave transmissions requires improving the accuracy of channel prediction using DRL in mm-wave communication for B5G. In dynamic B5G network situations, precise predictions allow for more effective beam alignment, lower signaling overhead, and reliable performance. The channel modeling in this article is validated by using novel approaches intended to handle real-world complexity, including mobility, interference, and environmental unpredictability. Beam training overhead is reduced, and the model dynamically adjusts to network conditions by utilizing DL for OFDM systems. Beam alignment, excessive path loss, and channel fluctuations are all successfully addressed by the RBT-DRL-based beam training approach, which guarantees reliable system performance and high-resolution channel estimation. The considered effects, including limited scattering, beam misalignment, and low-resolution ADCs, validate the channel modeling, which remains analytically tractable due to its intrinsic simplifications. Using hybrid precoding in conjunction with RBT-DRL for beam training is specifically designed to handle real-world challenges, including interference, mobility, and environmental variability. By including features that dynamically adjust to changing network conditions, the chosen channel models are designed to handle real-world complications such as mobility, interference, and environmental unpredictability. In dynamic situations such as channel variations and significant path loss, the RBT-DRL-based beam training framework optimizes channel estimates while lowering beam training overhead. Additionally, spatial-domain quantization signals are used to increase channel accuracy and overcome low-resolution ADC issues, as shown in Figure 3. By lowering the number of RF chains while preserving strong beamforming capabilities, the hybrid analog-digital (HAD) architecture improves channel modeling even further. Mean square error (MSE) optimization is used to validate these models, guaranteeing that the predicted channels closely resemble actual conditions. Integrating DRL with AB-HP to tackle specific issues, such as bandwidth allocation and high-resolution channel estimation, shows how DRL can successfully increase channel prediction accuracy. Using DNNs to reduce the error between estimated and actual parameters, the model automatically adjusts to channel circumstances by dynamically improving EE and SE. Furthermore, the model uses spatial-domain quantization signals and the HAD architecture for effective beamforming and channel prediction, which reduces the errors caused by low-resolution ADCs. The proposed approach provides a reliable solution for mm-wave situations since it closely matches realistic channel behaviors, as confirmed by extensive validation using MSE optimization.
It also enhances the predictability of the inferred channels by improving prediction accuracy, and it streamlines and expedites the channel estimation process by avoiding the costly neural network training procedure through a DL-based beam training strategy. Training a DNN using historical beam measurements involves maximizing the expected long-term reward. In the DNN, the rewards are obtained using a configurable reward mechanism that considers the power or data rate needs of various users. A DNN is taught to maximize the predicted reward over time by applying DRL approaches. Using DRL approaches allows the system to perform better in a variety of applications and adjust to changing needs. Decision-making is made more effective, and the system's overall efficiency is increased, through the application of ML algorithms and prior data. To enhance the balance between prediction accuracy and channel estimation overhead, a straightforward training approach can be utilized. Instead of relying on traditional pilot-based channel estimation, an ML-based predictor can be employed; this predictor can detect the channel aging pattern using an autoregressive model. Supervised DL-based hybrid beamforming can then maximize the sum rate under a specified transmit power limit. To reduce the computational complexity of the multi-user mMIMO systems considered, which aim to reduce the number of RF chains and the channel estimation overhead, the method suggests a learning approach that most closely models the hybrid beamformer [63,64]. The BS deploys the antenna arrays, and the channel estimation is studied for an mm-wave mMIMO system with a mixed-resolution ADC architecture. An approach based on compressed sensing can recover the channel: the sparsity of the mm-wave channel can be used to formulate the beamspace channel estimation as a sparse signal-recovery problem.
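As a concrete example of the autoregressive channel-aging predictor mentioned above, the sketch below fits AR coefficients to a history of channel samples by least squares and predicts the next sample; the model order and the synthetic channel are assumptions for illustration only.

```python
# Minimal sketch, assuming an AR(p) model of channel aging: past channel samples
# are combined linearly to predict the next coefficient in place of a fresh
# pilot-based estimate.
import numpy as np

def fit_ar_predictor(h_history, p=4):
    """Fit AR(p) coefficients from a 1-D history of (complex) channel samples."""
    H = np.stack([h_history[i:len(h_history) - p + i] for i in range(p)], axis=1)
    y = h_history[p:]
    coeffs, *_ = np.linalg.lstsq(H, y, rcond=None)
    return coeffs

def predict_next(h_recent, coeffs):
    """One-step-ahead prediction from the most recent channel samples."""
    return np.dot(h_recent[-len(coeffs):], coeffs)

# Example with a synthetic, slowly varying channel tap
t = np.arange(200)
h = np.exp(1j * 0.05 * t) + 0.05 * (np.random.randn(200) + 1j * np.random.randn(200))
coeffs = fit_ar_predictor(h[:150], p=4)
h_hat = predict_next(h[:150], coeffs)      # predicted sample for t = 150
```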
4. Optimizing EE- and SE-Based Beam Training
This section examines the effectiveness of using a CSI predictor realized by an autoregressive predictor and a DL-based time division duplex (TDD)-scheme-based predictor involving pattern extraction realized by a convolutional neural network. These predictors can offer large performance gains, achievable throughput, and a significant improvement in channel predictability. The mMIMO system will be essential to the mm-wave technology of B5G wireless cellular networks. Path loss in mm-wave transmission will be mitigated by the mMIMO system's high beamforming gain. Under interference and non-linear distortions, DL is successfully used in joint channel estimation and signal detection. Deep CNNs can take advantage of channel correlations to improve the accuracy of channel estimates while reducing the computational cost in mm-wave mMIMO systems. By leveraging the inherent structure of the channel, a deep CNN can efficiently extract and process spatial and temporal features from channel data. This enables the network to generate more accurate channel estimates with less computational overhead, making it a promising approach for optimizing the performance of mm-wave mMIMO systems. With low-resolution analog-to-digital converters (ADCs), supervised learning-based successive interference cancellation has been developed for mMIMO detection. An important step in obtaining the channel state data needed for mMIMO transmission is channel estimation. In contrast, an excessive amount of overhead is needed to serve an ever-increasing number of devices due to the exponential increase in devices and the variety of new applications; the maximum number of devices that may be handled in one cell is on the order of 10^5. The integration of DRL with the AB-HP framework enhances the connection between beam selection and EE/SE outcomes by dynamically optimizing RF chain consumption while minimizing beam training overhead. RBT-DRL intelligently selects beams based on ambient angular information, reducing computational complexity and pilot power consumption while adapting to real-time network fluctuations. By incorporating DNN-based channel estimation, the system ensures accurate channel characteristics, addressing the challenges of high path loss and obstruction in mm-wave communication. Unlike traditional hybrid precoding systems that focus solely on reducing RF chains or improving training efficiency, RBT-DRL enables dual optimization by dynamically adjusting the number of activated RF chains to balance EE and SE, as shown in (6) and (10). This adaptive beam selection approach enhances capacity in multi-user mMIMO systems while maintaining resilience under varying network conditions. The methodology demonstrates how RBT-DRL outperforms existing solutions in efficiency, reliability, and scalability by optimizing beamforming, reducing training overhead, and improving EE/SE trade-offs.
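To make the deep-CNN channel estimation idea above concrete, the sketch below refines a coarse beamspace channel estimate with a small convolutional network; the architecture, grid dimensions, and residual design are illustrative assumptions rather than the network used in this work.

```python
# Minimal sketch (assumed architecture): a CNN that refines a coarse least-squares
# beamspace channel estimate by exploiting spatial correlation across the grid.
import torch
import torch.nn as nn

class CNNChannelEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        # input: 2 channels (real/imag) on a 64-beam x 16-subcarrier grid (assumption)
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),   # refined real/imag estimate
        )

    def forward(self, coarse_estimate):
        # residual connection: learn a correction to the coarse estimate
        return coarse_estimate + self.net(coarse_estimate)

model = CNNChannelEstimator()
coarse = torch.randn(8, 2, 64, 16)          # batch of coarse estimates
refined = model(coarse)                      # trained with MSE against the true channel
```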
4.1. RBT-DRL-Based Beam Training with EE and SE
To reduce the error between the estimated and true channel parameters and to optimize a beam training strategy that takes the current channel circumstances into account, RBT-DRL-based beam training is essential: it maximizes EE and SE by reducing beam training overhead and dynamically adjusting to network conditions, improving system performance. Mm-wave communication has been proposed as a promising solution to meet the increasing demand for higher data rates in wireless communication systems. However, mm-wave communication faces significant challenges due to high path loss and blockage caused by obstacles in the transmission path. To lessen the effects of these difficulties, channel estimation is essential in mm-wave communication. To adaptively improve EE and SE while reducing computational and measurement overhead, the proposed beam training approach makes use of a DNN. When compared to traditional methods, this approach guarantees that our technique achieves higher performance and adaptability. DNNs have been demonstrated to reduce the error between the estimated and true channel parameters, and the DNN is trained using a sizable dataset of input-output pairs. However, DNN-based techniques need a large amount of training data and processing power for training and inference, and further study in this field is expected to increase the accuracy and efficiency of these methods. DNN-based channel estimation shows promise as a workable solution for mm-wave communication systems. To optimize either EE or SE, a beam training strategy that takes the current channel circumstances into account and modifies the number of activated RF chains in the mm-wave system may enhance overall performance and adjust to changing channel circumstances. The mm-wave system's efficiency and dependability in difficult conditions can be improved by adding RBT-DRL into beam training.
4.2. A Low-Complexity AB-HP
To further mitigate the remaining interference at the DB precoder in a wireless cellular network's downlink transmission, channel estimation is performed by the user transmitting a pilot signal to the BS; the pilot signal power helps in determining the quality of the channel. It is hard to recover the true channel because low-resolution ADCs do not provide detailed or accurate information, and the corresponding channel estimation is obstructed by their inaccurately quantized output. A fully connected DNN is then used to construct the objective AB-HP, investigating how to jointly optimize energy efficiency and SE in response to changes in the wireless channel, so as to reduce the overall beam training overhead while achieving high energy or spectral efficiency performance. One approach to achieving this goal is to use DRL to select potential beams for beam training, as suggested in [28]; depending on which performance metric is used, either the EE or SE values are evaluated during beam training. However, in mm-wave systems, the state vector in [28] and the action space in [27] both scale with the number of antenna elements, which can significantly lengthen the DNN's training time. Fortunately, at the BS, correlation between the received signals of neighboring antennas is introduced by the limited number of scattering clusters and the constrained physical spacing between antennas. The almost undistorted quantization signals can then provide additional channel information in the spatial domain to improve the channel accuracy. In the underlying channel model, the channel estimate, the power of each scattering cluster, and the receive- and transmit-antenna array response vectors are defined so as to minimize the MSE across all training samples. The DNN appropriately sets the activation functions and updates the weight matrix in a data-driven way by minimizing the MSE over the training set, where the number of training samples, the true channel, and the channel approximated by the RBT-DRL for each training sample enter the loss. The significantly distorted signals quantized by the low-resolution ADCs hurt the estimation performance, which is ignored when the channel estimation is performed uniformly over all antennas. Conventional spatial channels can be converted into beamspace channels by use of an antenna array; in effect, the antenna array implements a spatial discrete Fourier transform (DFT) matrix U. Conversely, the DNN is constructed by selectively employing trustworthy observations that correspond to the high-resolution ADC antennas, based on the predictions from the high-resolution channels. The HAD architecture allows the antennas to be coupled to significantly fewer RF chains through analog-domain phase shifters and further enhances performance through digital processing. These studies indicate a channel estimation approach based on recurrent neural networks (RNNs), which may estimate channels through the non-linear mapping of RBT-DRL, to address the shortcomings of conventional channel estimation.
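For clarity, a plausible form of the data-driven MSE objective described above is given below, assuming standard notation: N training samples, with the true channel and the RBT-DRL/DNN estimate for the n-th sample.

```latex
% Plausible MSE training objective for the DNN channel estimator (notation assumed):
% N training samples, h_n the true channel, \hat{h}_n the RBT-DRL/DNN estimate.
\mathrm{MSE} \;=\; \frac{1}{N}\sum_{n=1}^{N}
  \bigl\lVert \mathbf{h}_n - \hat{\mathbf{h}}_n \bigr\rVert_2^{2}
```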
After developing the analog RF beamformer and the digital baseband precoder, the capacity maximization is reformulated in terms of the effective channel and the noise power. DL-based AB-HP with MSE optimization outperforms traditional methods as the number of UEs increases, significantly reducing the runtime while also minimizing the number of required RF chains and lowering the channel estimation overhead. Some works focus on AB-HP to enhance system performance in high-speed communications. In [8], the PA problem in high-speed orthogonal frequency division multiple access (OFDMA) systems was examined with the goal of minimizing the total transmit power consumption. To minimize the average transmit power, it was suggested in [11] to optimize PA with antenna selection. In order to maximize EE in high-speed networks with buffer limits, a PA method was introduced by the authors of [12]. The power control issue in mm-wave high-speed systems had never before been addressed, according to the authors of [13]. The AB-HP technique is executed in two phases. In phase 1, the suggested AB-HP algorithm uses offline supervised learning to forecast allocated powers for real-time online applications; this is achieved by using the best-allocated power values, determined by the computationally intensive particle swarm optimization (PSO)-PA, as output labels in the offline supervised learning process. The PSO-PA technique is used to determine and store the associated optimally allocated powers in the dataset. The entire available dataset is split into training and validation sets using an 80-20% split, and phase 2 of the learning process is completed online in the AB-HP using a new test dataset. The suggested AB-HP algorithm is implemented using an open-source DL framework. DRL is a subset of ML that can learn through experimentation while interacting with the environment [40,66]; it does not require labeled datasets. To choose a group of beams based on past experiences, the multi-armed bandit (MAB), a simple DRL algorithm, is utilized [67]. MAB can do very little to modify the beam selection in response to changes in the environment because it does not utilize the current state. On the other hand, based on the predictions from the high-resolution channels, the DNN is constructed by carefully selecting the reliable observations that correspond to the high-resolution ADC antennas [68]. The use of DNNs in DRL allows for the creation of more intelligent beam training algorithms by extending the capabilities of conventional methods. Based on the targeted user's uplink received power, DRL is employed in [64] to jointly assign the best BS and beam resources to that user. As per reference [66], DRL can instantly assimilate information from its environment to choose the optimal beam pair for data transmission. A DL approach for a fully connected hybrid architecture in a multi-user scenario is proposed in [29], demonstrating the ongoing research on using ML for hybrid precoder design. For a fully connected hybrid structure, [30] proposes unsupervised DL precoding, and the use of a CNN is suggested to create a partially connected hybrid design [31,65]. Still, compared to the focus on fully connected hybrid architectures, there is not much study of partially connected ones. Employing DRL to select appropriate beam candidates for beam training can lower the algorithm complexity. By balancing exploration and exploitation without retraining, the RBT-DRL approach efficiently optimizes EE and SE in mm-wave multi-user mMIMO systems, outperforming CNN-based hybrid precoding and LSTM-driven RA approaches in terms of convergence, computational complexity, and adaptability [52,69]. RBT-DRL produces better EE-SE trade-offs with less complexity than CNNs and LSTMs, which have limits in high-dimensional spaces and real-time adaptation. In addition, DL-based and hybrid precoding strategies, such as those that use ASM and PMA, improve EE and SE even further, demonstrating the potential of cutting-edge AI technologies for mm-wave networks.
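The two-phase offline/online procedure described above can be sketched as follows; the feature dimensions, network size, and synthetic PSO-PA labels are placeholders for illustration, not the actual dataset or solver used in this work.

```python
# Minimal sketch of the two-phase idea: phase 1 trains a DNN offline on
# PSO-optimized power allocations used as labels; phase 2 runs the trained
# network online to predict allocations in real time. All data are synthetic.
import torch
import torch.nn as nn

features = torch.randn(10_000, 32)       # e.g. angular-domain channel features (assumed)
pso_powers = torch.rand(10_000, 4)       # labels from the offline PSO-PA solver (assumed)

# 80-20% split into training and validation sets
n_train = int(0.8 * len(features))
train_x, val_x = features[:n_train], features[n_train:]
train_y, val_y = pso_powers[:n_train], pso_powers[n_train:]

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4), nn.Softplus())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(20):                   # phase 1: offline supervised learning
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()

with torch.no_grad():
    val_err = loss_fn(model(val_x), val_y)          # held-out 20% validation error
    online_powers = model(torch.randn(1, 32))       # phase 2: real-time online prediction
```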
The training time of the DNN may rise when considering an mm-wave system, since both the size of the action space in [66] and the state vector in [52] grow with the number of antenna elements. The beam training technique utilizes DRL's general architecture. The process of tracking beams is modeled as a DRL process in the training context. The DRL algorithm develops an EE-SE maximization beam training strategy to improve performance while lowering beam training overhead. Initially, depending on previous channel measurements, the DRL block chooses one of many candidate beam training approaches; the chosen beam training method is used as the first action in the EE-SE maximization strategy. The predicted number of RF chains needed to obtain the highest EE or SE is based on the outcomes of the beam training. The BS, equipped with N_t antennas, communicates N_s data streams to the UE with N_r antennas. The BS and the UE are assumed to be equipped with N_RF^t and N_RF^r RF chains, respectively, such that N_s ≤ N_RF^t ≤ N_t and N_s ≤ N_RF^r ≤ N_r. Depending on the condition of the system at the time, the DRL block can switch between DRL-EE and DRL-SE configurations; this includes factors such as the UE's downlink queue state and battery level. For instance, the RBT-DRL-SE mode is activated to communicate more data using spatial multiplexing when the UE's packet queue is backlogged. In contrast, RBT-DRL-EE is activated to save energy for the UE if its battery level is low, for example below 50%. Upon request from the BS, the UE will send a report of its parameters.
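A minimal sketch of the DRL-EE/DRL-SE mode-switching rule described above is shown below; the 50% battery threshold follows the text, while the queue threshold and the default mode are assumptions.

```python
# Minimal sketch of the reward-mode selection used by the DRL block; thresholds
# other than the 50% battery level are assumptions.
BATTERY_THRESHOLD = 0.5          # switch to energy saving below 50% battery
QUEUE_BACKLOG_THRESHOLD = 100    # packets; assumed value

def select_drl_mode(battery_level, queue_length):
    """Choose which reward configuration the DRL block should use."""
    if battery_level < BATTERY_THRESHOLD:
        return "RBT-DRL-EE"      # prioritize energy efficiency to save UE battery
    if queue_length > QUEUE_BACKLOG_THRESHOLD:
        return "RBT-DRL-SE"      # backlogged queue: push more data via spatial multiplexing
    return "RBT-DRL-SE"          # default when neither constraint binds (assumption)

mode = select_drl_mode(battery_level=0.35, queue_length=20)   # -> "RBT-DRL-EE"
```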
The EE counts how many bits are transmitted per unit of energy, that is, the ratio of the achievable rate to the total consumed power. The total power consumption accounts for the transmit power scaled by the amplifier efficiency, the power required per RF chain, the power of the phase shifters, a fixed power consumption term, and the power consumed by each transmit and receive antenna. To create a virtual environment for training RBT-DRL agents and for operating in highly reliable systems, DRL-based scheduling is proposed.
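One plausible way to write the EE metric and the total power model implied by the terms above is the following; all symbols are assumed for illustration and are not taken verbatim from the original equations.

```latex
% Plausible EE metric and power model matching the terms listed above (symbols assumed):
% B bandwidth, \eta amplifier efficiency, P_RF per RF chain, P_PS per phase shifter,
% P_fix fixed consumption, P_t / P_r per transmit / receive antenna.
\mathrm{EE} \;=\; \frac{B \cdot \mathrm{SE}}{P_{\mathrm{total}}} \quad [\text{bit/Joule}],
\qquad
P_{\mathrm{total}} \;=\; \frac{P_{\mathrm{tx}}}{\eta}
  + N_{\mathrm{RF}} P_{\mathrm{RF}}
  + N_{\mathrm{PS}} P_{\mathrm{PS}}
  + P_{\mathrm{fix}}
  + N_{t} P_{t} + N_{r} P_{r}
```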
In relation to the distributional perspective on DRL [16,71], the agent seeks to determine the best transmission scheduling. The random return is the discounted sum of rewards, whose expectation is the action-value. Improving robustness to hyper-parameter variation and environmental noise through better sample complexity is shown in [17,40]. The system state of the IoT system at each time step, as specified by the user, forms the input layer. The random return is obtained by adhering to the current policy, performing an action from the current state, and treating the outcome as a random variable because of the stochastic nature of the environment. The action space is necessary to adjust all policies and to determine their improvement, which leads to an analogous distributional Bellman equation, in which the next state-action pair is random under the current policy and the return variable follows the same probability law. As a result, the distributional Bellman operator for policy evaluation can be defined accordingly. The RBT-DRL network takes control of the generation of realistic data drawn from a distribution and generates real-like data. The discriminator is then trained to distinguish between the actual data and the data arriving from the refined DRL-generated traffic distribution by training a function. The action space A, which is discrete in nature, is the set of action values, which consistently take increasing integer values within a fixed range.
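The distributional Bellman relation referred to above can be written in its standard form as follows, with notation assumed: Z denotes the random return, R the immediate reward, δ the discount factor, and (S', A') the random next state-action pair under policy π.

```latex
% Standard distributional Bellman relation (notation assumed):
Z^{\pi}(s,a) \;\overset{D}{=}\; R(s,a) \;+\; \delta\, Z^{\pi}(S', A'),
\qquad
Q^{\pi}(s,a) \;=\; \mathbb{E}\bigl[\,Z^{\pi}(s,a)\,\bigr]
```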
4.3. Reward
To maximize the long-term predicted benefit during interaction with the environment, the agent is trained to learn a policy [22]. The goal of this work is to achieve good EE or SE performance for a mobile UE while minimizing the beam training overhead. Because it directs the agent to optimize long-term gains by rewarding desirable behaviors and penalizing ineffective ones, the reward function is essential to DRL. A compromise is made between the obtainable EE or SE and the beam training overhead; this balance, which is optimized during the RBT-DRL training process, is represented by the reward. In return for real-time transmission, a slight and acceptable performance degradation brought on by a shorter beam training period is allowed.
The more beams that are evaluated for data transmission, the longer the beam training period will be. The number of beam measurements needed is represented by the non-negative penalty U_i associated with the i-th beam training method [49,69]. The reward, which is regulated by a trade-off factor α, balances the achievable SE or EE against the penalty for beam training overhead U_i. The reward function for EE may be found in (9), and for SE it has a similar structure. States (such as CSI and the RF-chain configuration), actions (beam training methods), and rewards are all part of the DRL environment. To guarantee convergence, hyperparameters such as the learning rate and the trade-off factor are carefully adjusted. Different system sizes and channel conditions are used to assess the method's scalability and real-world performance. To optimize long-term rewards, the agent is trained by gradient optimization to stabilize learning. The training process uses a target network with hyperparameters such as the exploration rate, learning rate, and trade-off factor. The reward function is explicitly given, along with formulae designed to balance performance indicators, to guarantee reproducibility. The penalty values were obtained from simulations. Thus, in RBT-DRL-EE, the reward function is as given in (9), where α, also known as the trade-off factor, regulates the ratio of the achievable EE to the necessary beam training overhead. Training dynamics and convergence speed during DRL are influenced by α, which scales the reward function update together with the learning rate. A lower learning rate causes slower convergence but more stable learning, whereas a higher rate speeds up learning but can also contribute to instability. By controlling the relative value of long-term versus immediate benefits, the discount factor δ influences whether the agent prioritizes lowering overhead over enhancing EE or SE: optimizing EE or SE is favored by a greater δ, whilst cutting beam training time is given priority by a lower value. The design of the reward function is crucial to maintaining equilibrium between the training overhead and the achievable EE/SE. This function allows the agent to learn the best trade-off by incorporating penalties U_i to discourage excessive training time. The agent's capacity to adjust to different channel conditions and system sizes, as well as the framework's practical performance, are directly impacted by the sensitivity of these hyperparameters. EE_i is the EE obtained using the i-th beam training method; the combined effect of the channel state, the number of RF chains employed, and the chosen beam training technique is reflected in the EE or SE. The vector u_t for EE collects the EE values attained in the previous time steps, with the final element representing the EE assessed by the pre-assessment at the current time step t; for SE, an analogous vector contains the corresponding SE elements.
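One plausible instantiation of the EE reward in (9), consistent with the description above, is sketched below; the exact weighting used in this work may differ, so the form shown is an assumption.

```latex
% Assumed form of the EE reward for the i-th beam training method:
% EE_i is the achievable EE, U_i the beam measurement penalty, \alpha the trade-off factor.
r_i \;=\; \alpha\, \mathrm{EE}_i \;-\; (1-\alpha)\, U_i
```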
The agent learns a policy to maximize the long-term predicted reward through interactions with the environment [22]. Prioritized experience replay preserves all states for effective learning but replays high-impact states more often according to their temporal difference (TD) error. Based on the available processing capability, the DRL agent dynamically modifies its computational complexity to allow for real-time decision-making without overtaxing the system. Furthermore, resource-constrained DNN training remains within the maximum power limit thanks to an adjustable learning rate. Our goal in this work is to minimize the overhead related to beam training while still achieving good EE or SE performance for a mobile UE; in other words, to optimize the trade-off between the overhead of beam training and the achievable EE or SE. Equations (9) and (10) offer the option to weigh the performance metric's significance while choosing a beam training method, thanks to the incentive system shown in Algorithm 1. The agent can be trained to attain varying performance levels for distinct applications by adjusting the value of α. A bigger trade-off factor is desirable for applications that need high transmission rates, such as high-definition video streaming, where a higher data rate matters more than beam training latency.
Algorithm 1: RBT-DRL-Based Beam Training for Achieving EE and SE

Input: noise power, power budget, other power-related factors, and available processing power.
1. Determine the candidate beam training techniques and initialize the RBT-DRL agent with its action space and state space.
2. For each time step:
   a. Update the beamforming based on the current state.
   b. Evaluate the allocated power as shown in constraint (3).
   c. Adjust the estimated channel in the RBT-DRL agent's knowledge.
   d. Optimize memory usage by deleting unnecessary states and storing only the most important ones.
   e. Update the DNN based on the actions taken.
   f. Optimize beam training choices by adjusting battery-aware memory allocation, storing states according to the battery level in RBT-DRL.
   g. Calculate the reward for each beam training method based on the beam training overhead while still achieving good EE or SE, as in (9) and (10).
3. End for.
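A minimal Python sketch of the training loop in Algorithm 1 is given below; the environment and agent interfaces and the reward weighting are illustrative placeholders rather than the actual implementation.

```python
# Minimal sketch of the RBT-DRL training loop in Algorithm 1. `env` and `agent`
# are assumed placeholders: env.reset()/env.step() expose the CSI, RF-chain and
# battery state, and agent implements a DQN-style interface with prioritized replay.
def rbt_drl_training(env, agent, episodes=500, alpha=0.5, battery_threshold=0.5):
    for _ in range(episodes):
        state = env.reset()                       # initial CSI / RF-chain / battery state
        done = False
        while not done:
            action = agent.select_action(state)   # pick a beam training method
            next_state, ee, se, overhead, done = env.step(action)
            # switch between EE and SE objectives according to the battery level
            metric = ee if env.battery_level < battery_threshold else se
            # trade off achievable EE/SE against the beam training penalty (cf. (9), (10))
            reward = alpha * metric - (1 - alpha) * overhead
            agent.store(state, action, reward, next_state)   # prioritized experience replay
            agent.update()                        # one DNN gradient step
            state = next_state
    return agent
```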
5. Discussion
The performance of the proposed experience-driven power distribution approach in the mm-wave AB-HP is assessed in this section through a series of simulated experiments.
Table 5 outlines the key simulation parameters used to assess the power distribution approach in the mm-wave AB-HP. These parameters, including bandwidth, transmit power, and beam adjustment, are essential for evaluating the effectiveness of DL-based techniques. We evaluate the effectiveness of DL-based techniques and conventional beamspace channel estimation algorithms in a mixed-ADC architecture. In particular, for the RBT-DRL, it is necessary to ascertain beforehand how many beams to track during beam training using the value of the trade-off factor α. The most straightforward beam training objective is to reduce the overall beam training overhead while preserving sufficient performance.
There is an impact from different actor learning rates, while the critic's learning rate is predetermined. The learning rate has a significant impact on the stability and speed of the SE during RBT-DRL training, since it represents the learning step governing the convergence of the SE. Missing the global optimum during the training process is more likely if the learning rate is high; conversely, a low learning rate would likely slow down the convergence rate. We can also observe from Figure 4 that the SE realized by RBT-DRL converges to a higher and more stable value sooner once the beam training agents are trained to be stable. The average SE attained by the various beam training procedures for varying trade-off factor α is displayed in Figure 5. Similar patterns are observed for DRL, RBT-DRL, and the reward technique, whose efficacy increases with increasing α: the more expensive beam training approaches are activated more frequently by RBT-DRL, DRL, and the reward technique to raise the SE as α rises. The range of achievable SE values is 15.9–18.7 bit/s/Hz, and the DRL model can be adjusted to achieve, in that order, 91.7%, 93.4%, and 95.9% of the maximum SE within this range. A typical number of RF chains is needed for RBT-DRL; more RF chains are employed to produce a larger SE when there is no power constraint, particularly for α ≥ 0.5. RBT-DRL activates fewer RF chains than DRL when the SE is weighted less in the reward function in (9).
From
Figure 6, the lowest EE is around 49 Mbit/Joule, which is attained without any beam training, while the maximum attainable EE is approximately 73.5 Mbit/Joule, which is reached by exhaustive beam training. Achieving a greater EE becomes more important as α increases, while the impact of the beam training overhead decreases. Therefore, a higher EE with more trained beams is obtained as α grows.
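As a reminder of how the EE values above relate to SE and power, EE is conventionally computed as the delivered rate divided by the total consumed power (the latter given by Equation (4)). The sketch below only illustrates this standard relationship; the bandwidth and power figures are hypothetical and are not the values of Table 5.

```python
def energy_efficiency(se_bit_per_s_per_hz, bandwidth_hz, total_power_w):
    """EE in bit/Joule = achievable rate (bit/s) / total consumed power (W)."""
    return se_bit_per_s_per_hz * bandwidth_hz / total_power_w

# Hypothetical illustration only: 17 bit/s/Hz over 100 MHz at 25 W total power
# gives 68 Mbit/Joule, in the same range as the values reported for Figure 6.
print(energy_efficiency(17.0, 100e6, 25.0) / 1e6, "Mbit/Joule")
```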
The three distinct beam training techniques—RBT-DRL, DRL, and conventional beam training—are depicted in
Figure 7. The SE of all three approaches rises in tandem with SNR, reflecting the anticipated increase in data throughput that comes with improved signal quality. The final SE attained and the pace of the rise, however, differ between the approaches. Throughout the whole SNR range, RBT-DRL consistently shows the highest SE, indicating its superior capacity to exploit increased SNR for enhanced SE. DRL likewise exhibits a larger SE than regular beam training, although the difference is not as pronounced as it is for RBT-DRL. This suggests that, compared to the traditional beam training method, both RBT-DRL and DRL, which make use of reinforcement learning, are more successful at adapting to shifting channel conditions and optimizing beamforming for SE. DRL and RBT-DRL can dynamically modify their SE settings according to system factors such as the battery level and the UE queue status. Because of this flexibility, they can adjust beamforming tactics in real time, maximizing SE when necessary and possibly sacrificing some SE when battery levels are low. The adaptive RL-based methods perform better than the standard beam training method, especially when SNR conditions change, because the latter relies on a more static approach. The reward functions for RBT-DRL-SE are designed to balance overhead and performance, which ultimately explains the observed variations in SE among the three approaches. Equations (4)–(10) explain how the SE is determined.
With varying values of α, the DRL method can reach 93.1%, 95.6%, and 98.0% of the highest possible EE. Furthermore, RBT-DRL can achieve equivalent performance with the same or even fewer beam measurements by alternating between several beam training techniques, which can outperform continuously using a fixed beam training approach, since DRL can learn from the beam training history and choose the beam training strategy that maximizes the long-term reward. At α = 0.1, RBT-DRL resorts to zero beam training, accepting a considerable performance deterioration, since optimizing the long-term benefit is then comparable to limiting the beam training overhead. When α = 0.8, increasing the EE dominates the long-term benefit; therefore, it makes sense to perform an exhaustive beam search more frequently to achieve a greater EE.
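A small worked example makes the behavior at α = 0.1 and α = 0.8 explicit, reusing the illustrative weighted reward from the sketch after Algorithm 1. The normalized metric and overhead values below are hypothetical; the point is only the sign of the trade-off, not the actual simulation numbers.

```python
def reward(metric_norm, overhead_norm, alpha):
    # Same illustrative weighted form as in the Algorithm 1 sketch above.
    return alpha * metric_norm - (1.0 - alpha) * overhead_norm

# With alpha = 0.1, skipping beam training (lower metric, zero overhead) scores
# higher than exhaustive training; with alpha = 0.8 the ordering reverses.
print(reward(0.6, 0.0, 0.1), reward(1.0, 0.8, 0.1))  # 0.06 vs. -0.62
print(reward(0.6, 0.0, 0.8), reward(1.0, 0.8, 0.8))  # 0.48 vs.  0.64
```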
From
Figure 8, which shows the average EE attained by the various beam training policies at varying SNRs, the performance of the three models (RBT-DRL, DRL, and conventional beam training) is quite similar. The EE increases with SNR up to a saturation value and then starts to decrease. As we can see, the SE only increases by 39% whereas the overall power consumption in Equation (4) rises by 45% from SNR = 14 dB to 20 dB; thus, the high power consumption limits the EE performance at SNR = 20 dB. From
Figure 9, by increasing the task information, the appropriate incentive schemes facilitate faster learning, with RBT-DRL and DRL requiring around 20% less reward over the whole SNR range. By combining dynamic PA with context awareness and exploiting real-time network conditions to optimize power consumption, RBT-DRL improves the handling of power consumption problems. To ensure energy-efficient operation without sacrificing SE, the reward mechanism in RBT-DRL can be modified to penalize excessive power usage. Furthermore, hybrid precoding can further lower power consumption by selectively activating antennas based on user density and demand. To reduce redundant power consumption, DRL can facilitate coordinated, distributed operation among BSs to offset the higher power requirements of mobile and IoT environments. Lastly, the RBT-DRL framework can be verified and improved by benchmarking against low-power solutions, such as lightweight DRL algorithms, to guarantee practical implementation in power-limited scenarios. The EE increases with SNR in
Figure 8 up to a certain point, beyond which it tends to plateau. This pattern emphasizes the trade-off between boosting signal quality through SNR and the corresponding power usage. Compared with the RBT-DRL-based technique in [41], the RBT-DRL variants attain a higher EE across the SNR values, indicating that the DRL-based beam training in Figure 8 may be more successful in maximizing EE than the adaptive approaches in [41]. The saturation of EE at higher SNRs highlights the fact that merely raising transmit power above a particular threshold does not always yield a corresponding rise in EE, since power consumption grows faster than the achievable signal quality gains. As shown in [42], in comparison with current state-of-the-art (SOTA) algorithms, that approach focuses on mathematical convergence to a local solution with lower complexity, reaching a compromise between computational complexity and SE. In the benchmark, when the beam training overhead is reduced, RBT-DRL consistently achieves a higher EE at different SNR levels, peaking at about 90 Mbit/Joule. By utilizing DRL and adaptive beam training under dynamic network conditions, RBT-DRL provides notable advantages over SOTA hybrid precoding techniques such as the compressive sensing-based approaches in [42], conventional precoding, and heuristic optimization methods. RBT-DRL is a more adaptable and energy-efficient solution for ultra-dense, high-mobility networks because of its capacity to dynamically optimize RF chain utilization and precoding based on environmental feedback.
The long-term expected reward obtained by RBT-DRL or DRL while interacting with the environment is substantially larger than that of conventional beam training, and it does not change with varying SNR. To obtain greater rewards with fewer training beams, one can examine the distribution of action choices across all beam training policies. The SE of the different techniques is displayed versus the number N in Figure 10. Keep in mind that the experiment uses the SE achieved by optimal fully digital beamforming as an upper bound. Figure 10 makes it evident that the SE attained by RBT-DRL grows as N rises. DRL and beam training also improve performance considerably, whereas the random-power and maximal-power schemes only slightly improve the SE. Furthermore, the SE obtained by the suggested RBT-DRL approach exceeds that of previous methods as N increases in the mm-wave system. In addition, the maximal-power scheme consistently performs worse than RBT-DRL and beam training. This shows how jointly optimizing PA and beamforming can raise the SE of mm-wave systems.
From
Figure 11, examining the role of RBT-DRL in interacting with the channel provides important insight into channel stability: if the agent can successfully adjust to changing channel conditions, it can improve the SE performance. During the simulation, stability is provided by the sampled mm-wave channel realizations, as shown in Figure 11. When measured channel realizations are available, the synthetic realizations can be replaced with actual channel data to obtain a more realistic depiction; over the simulated realizations, beam training shows a slight decrease in SE relative to the highest possible RBT-DRL reward. The goal of this technique is to attain maximum efficiency by refining the beamforming strategy. Conventional beam training offers less adaptive beamforming options, and DRL approaches, while attempting to optimize beamforming, may not be as specialized as RBT-DRL.
The EE performance of the three methods, RBT-DRL, DRL, and beam training, is plotted against time steps in Figure 12. RBT-DRL consistently produces larger EE values than DRL and beam training. However, RBT-DRL also exhibits notable EE fluctuations, most likely as a result of changing environmental factors or system characteristics. Although the DRL method does not reach the EE levels of RBT-DRL, it shows greater stability than conventional beam training. The conventional beam training approach sustains comparatively steady performance even though it attains the lowest average EE, whereas the learning-based methods dynamically strike a compromise between energy conservation and system efficiency. The beam training outcomes establish the estimated number of RF chains needed to maximize EE. When the UE is equipped with
Mrx antennas, it receives My data streams from the BS, which has Mtx antennas. To attain the optimal EE, the beam training procedure determines the number of RF chains required at the transmitter and the receiver. Based on system parameters, such as the state of the downlink queue or the battery level of the UE, the DRL block allows dynamic switching to the energy-efficient DRL-EE configuration. Using spatial multiplexing, the RBT-DRL-EE mode prioritizes energy savings when the UE’s packet queue is backlogged and its battery level falls below 50%. RBT-DRL’s dynamic behavior thus ensures a balanced trade-off in energy consumption.
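A minimal sketch of this mode-switching logic is given below. The mode names mirror the DRL-EE/DRL-SE configurations described in the text; the 50% battery threshold is the one stated above, while the queue-backlog threshold and function names are illustrative assumptions.

```python
from enum import Enum

class Mode(Enum):
    DRL_SE = "maximize spectral efficiency"
    DRL_EE = "prioritize energy efficiency"

def select_mode(battery_level, queue_backlog, backlog_threshold=10):
    """Switch to the energy-saving mode when the UE battery is below 50%
    and the downlink packet queue is backlogged, as described in the text.
    The backlog threshold (in packets) is an illustrative assumption."""
    if battery_level < 0.5 and queue_backlog > backlog_threshold:
        return Mode.DRL_EE
    return Mode.DRL_SE

# Example: low battery with a backlogged queue triggers the EE-oriented mode.
print(select_mode(battery_level=0.35, queue_backlog=25))  # Mode.DRL_EE
```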
Figure 13 shows the SE versus time steps for the three strategies: RBT-DRL, DRL, and beam training. Though it varies significantly depending on the dynamic channel and system conditions, beam training consistently yields a larger SE than the other two methods. DRL exhibits relative stability and moderate SE values, showing environmental adaptability without achieving the peak SE attained by beam training. In contrast, RBT-DRL prioritizes EE over spectral performance, maintaining the lowest SE. Based on the results of the beam training procedure, the expected number of RF chains needed to achieve the ideal EE is calculated. Using
Mtx antennas, the BS transmits Ms data streams to a UE that has Mrx antennas. It is assumed that both the UE and the BS are equipped with a limited number of RF chains, denoted N_RF_rx and N_RF_tx, with the number of data streams bounded by the number of RF chains and the number of RF chains bounded by the number of antennas at each end, i.e., Ms ≤ N_RF_rx ≤ Mrx and Ms ≤ N_RF_tx ≤ Mtx. DRL-EE and DRL-SE configurations are dynamically switched by the DRL block based on the system conditions. The reward function for EE takes into account the achievable efficiency, the beam training overhead, and the RF chain usage. To ensure convergence and scalability across different system sizes and channel conditions, the DRL agent undergoes a rigorous training procedure that includes hyperparameter tuning (learning rate, trade-off factor) to optimize the long-term reward. While balancing computational complexity and energy cost, this dynamic behavior facilitates real-time decision-making.
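Since the exact expression of the EE reward is not reproduced here, the following sketch only illustrates one plausible way to combine the three ingredients named above (achievable efficiency, beam training overhead, and RF chain usage); the weighting structure, the penalty coefficient, and all numeric values are assumptions, not the paper’s Equation (10).

```python
def ee_reward(ee, ee_max, overhead, overhead_max, n_rf_used, n_rf_max,
              alpha=0.8, rf_penalty=0.1):
    """Illustrative EE-oriented reward: weigh normalized EE against normalized
    beam training overhead via the trade-off factor alpha, and penalize RF
    chain usage. The form and coefficients are assumptions for illustration."""
    ee_term = alpha * ee / ee_max
    overhead_term = (1.0 - alpha) * overhead / overhead_max
    rf_term = rf_penalty * n_rf_used / n_rf_max
    return ee_term - overhead_term - rf_term

# Example: high EE with moderate training overhead and half of the RF chains active.
print(ee_reward(ee=70e6, ee_max=73.5e6, overhead=0.2, overhead_max=1.0,
                n_rf_used=4, n_rf_max=8))
```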