Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems

Salh, Adeb; Alhartomi, Mohammed A.; Hussain, Ghasan Ali; Jing, Chang Jing; Shah, Nor Shahida M.; Alzahrani, Saeed; Alsulami, Ruwaybih; Alharbi, Saad; Hakimi, Ahmad; Almehmadi, Fares S.

doi:10.3390/jsan14010020

Open AccessArticle

Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems

by

Adeb Salh

¹

,

Mohammed A. Alhartomi

^2,4,*

,

Ghasan Ali Hussain

³

,

Chang Jing Jing

^1,*,

Nor Shahida M. Shah

⁵

,

Saeed Alzahrani

²

,

Ruwaybih Alsulami

⁶

,

Saad Alharbi

⁷,

Ahmad Hakimi

¹ and

Fares S. Almehmadi

²

¹

Faculty of Information and Communication Technology, University Tunku Abdul Rahman (UTAR), Kampar 31900, Perak, Malaysia

²

Department of Electrical Engineering, University of Tabuk, Tabuk 47512, Saudi Arabia

³

Department of Electrical Engineering, Faculty of Engineering, University of Kufa, Kufa 540011, Iraq

⁴

Innovation and Entrepreneurship Center, University of Tabuk, Tabuk 71491, Saudi Arabia

⁵

Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia, Parit Raja 86400, Johor, Malaysia

⁶

Department of Electrical Engineering, Umm Al-Qura University Makkah, Mecca 24382, Saudi Arabia

⁷

King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia

^*

Authors to whom correspondence should be addressed.

J. Sens. Actuator Netw. 2025, 14(1), 20; https://doi.org/10.3390/jsan14010020

Submission received: 21 January 2025 / Revised: 8 February 2025 / Accepted: 10 February 2025 / Published: 12 February 2025

(This article belongs to the Section Communications and Networking)

Download

Browse Figures

Versions Notes

Abstract

:

High route loss and line-of-sight requirements are two of the fundamental challenges of millimeter-wave (mm-wave) communications that are mitigated by incorporating sensor technology. Sensing gives the deep reinforcement learning (DRL) agent comprehensive environmental feedback, which helps it better predict channel fluctuations and modify beam patterns accordingly. For multi-user massive multiple-input multiple-output (mMIMO) systems, hybrid precoding requires sophisticated real-time low-complexity power allocation (PA) approaches to achieve near-optimal capacity. This study presents a unique angular-based hybrid precoding (AB-HP) framework that minimizes radio frequency (RF) chain and channel estimation while optimizing energy efficiency (EE) and spectral efficiency (SE). DRL is essential for mm-wave technology to make adaptive and intelligent decision-making possible, which effectively transforms wireless communication systems. DRL optimizes RF chain usage to maintain excellent SE while drastically lowering hardware complexity and energy consumption in an AB-HP architecture by dynamically learning optimal precoding methods using environmental angular information. This article proposes enabling dual optimization of EE and SE while drastically lowering beam training overhead by incorporating maximum reward beam training driven (RBT) in the DRL. The proposed RBT-DRL improves system performance and flexibility by dynamically modifying the number of active RF chains in dynamic network situations. The simulation results show that RBT-DRL-driven beam training guarantees good EE performance for mobile users while increasing SE in mm-wave structures. Even though total power consumption rises by 45%, the SE improves by 39%, increasing from 14 dB to 20 dB, suggesting that this strategy could successfully achieve a balance between performance and EE in upcoming B5G networks.

Keywords:

mm-wave; power allocation; energy efficiency; spectrum efficiency; deep reinforcement learning

1. Introduction

Millimeter-wave (mm-wave)-communications-specific problems including high route loss and line-of-sight limitations can be solved with the use of sensing technology. Sensing provides the deep reinforcement learning (DRL) agent with extensive environmental feedback, which helps it adjust beam patterns and better forecast channel fluctuations. This collaboration not only enhances the performance of hybrid precoding systems but also strengthens mm-wave communication’s dependability and resilience in complex, dynamic scenarios, a critical feature for networks in the beyond fifth generation (B5G). Hybrid beamforming is a potential technique for mm-wave massive multiple-input multiple-output (mMIMO) systems that provide excellent spectrum economy, high data rate transmission, and a favorable trade-off between hardware complexity and transmission performance. While the reliance on low-frequency bands limits the capacity to provide low latency, ultra-high speeds, higher service quality, and seamless wireless coverage, the rapid expansion in mobile users and devices has increased the burden on mm-wave spectrum resources (300 kHz to 3 GHz) [1,2]. However, in large-scale dense deployments, higher frequency waves can deliver multi-gigabit rates [3]. The newest commercial mobile communication technology, 5G, was created to meet mobile consumers’ wireless transmission needs and offer rudimentary support for massive communication systems, like the Internet of Things (IoT) and enterprise networks. Exploring higher frequency spectrum is essential to finding a solution to the issue of the many access requirements of terminal devices in 5G networks and the constrained availability of spectrum below 6 GHz [4,5,6].

Antenna array deployment is essential for drastically lowering the number of necessary radio frequency (RF) chains in mm-wave mMIMO systems. This feature is particularly helpful in real-world wireless communication situations because there are frequently more antennas than RF chains, which makes precise channel estimates extremely difficult. In recent years, a variety of beamforming-based channel estimate techniques have been put out to solve these issues. In particular, a low-complexity two-way channel estimation method was described in [7,8,9]. This technique uses beam training to locate antennas associated with dominant beams first, between the users and the base station (BS). The estimation process then considers only the channel elements related to these selected antennas. However, the quantity of pilot symbols needed to cover every potential beam is directly correlated with the quantity of BS antennas, which is a very high amount. Also, some traditional compressive sensing-based algorithms, like the orthogonal matching pursuit algorithm employed in [10], could estimate the beamforming channel with a lower pilot overhead by taking advantage of the sparsity of beamforming channels [11,12,13]. Regretfully, none of the aforementioned beamforming channel estimation algorithms [11,12,13,14] can attain enough estimate accuracy in low signal-to-noise ratio (SNR) regions. Furthermore, these methods are computationally complex, especially when dealing with high-sparsity beamforming channels. The antenna array at the BS was also employed [7], [15,16,17,18] in the mm-wave mMIMO system. This allows signals to be concentrated on various antennas and transforms the spatial channel into a beamspace channel. Low-computational-complexity beamspace channel estimation is possible with the approximate message passing (AMP) technique, a potent iterative algorithm for sparse signal recovery [19,20]. The best shrinkage parameters for the AMP approach are hard to find and empirical shrinkage values are usually utilized instead. This restriction limits the useful channel estimation performance of the method. DRL is used in mMIMO systems to solve computationally complicated and demanding problems including resource allocation (RA), beamforming, and channel estimation that arise from the enormous number of users and antennas by providing more intelligent beam training. DL enables these systems to process high-dimensional data efficiently, adjust to changing conditions, and maximize performance measures like latency, energy economy, and spectrum efficiency (SE). The development of B5G wireless communication systems depends heavily on DL since it also makes data-driven methods to model and mitigate uncertainties, improve user experience, and make lower computational complexity possible [20]. With DL applied to mm-wave systems, the distinct features of the mm-wave frequency spectrum make RA a crucial issue in mm-wave communication systems. Because of the directional nature of mm-wave signals, significant path loss, and interference, using mm-wave technology can lead to greater data rates and network capacity. Utilizing time, frequency, and power as well as other available resources to their fullest potential can be achieved by using the RA [6,21,22,23]. As the need for high-speed wireless communication services grows, this may help to sustain it. Moreover, power resource optimization, a crucial aspect of battery-powered devices, can be achieved by applying RA. RA helps lessen the environmental impact of wireless communication and prolongs device battery life by consuming less power. Nonetheless, to meet the varying bandwidth and latency needs of various applications, like gaming and video streaming, the quality-of-service demands must be improved. To this end, mm-wave implementation in DL necessitates a substantial amount of data and computational resources, and careful DL algorithm design and optimization are required to achieve optimal EE [24,25]. Through intelligent and adaptive RA based on real-time feedback from the system, DL can be utilized to enhance RA in mm-wave systems. DL makes intelligent and adaptive signal processing possible, which can enhance the performance of mm-wave systems [26]. Furthermore, the performance, reliability, and efficiency of mm-wave systems could be greatly enhanced by the service of DL for RA. It is necessary to carefully evaluate the system requirements, channel conditions, and available computational resources when designing and optimizing DL algorithms for RA [27,28,29,30]. Effective channel estimation and beamforming are severely limited by the drawbacks of traditional mm-wave systems, such as the difficulty of using a high number of antennas with a small number of RF chains. As the number of antennas in these systems increases, so does the complexity of precisely calculating the channel, resulting in enormous processing demands. Furthermore, to cover the large beam space, classic beamforming approaches frequently call for many pilot symbols, which raises pilot overhead and lowers system efficiency. It becomes harder to control the necessary RF chains and the related beam training overhead in mm-wave systems as the number of antennas increases. These limitations highlight the need for creative solutions, such as the proposed angular-based hybrid precoding (AB-HP), which optimizes beam training procedures and reduces RF-chain utilization by utilizing angular-based information. This method paves the door for more effective and scalable mm-wave systems in B5G networks by directly addressing the problems of high computational complexity, decreased beamforming accuracy in low SNR zones, and power of the signal pilot.

1.1. Related Works

The optimization of RA, including beamforming and power, in special AB-HP systems is an interesting but difficult subject that merits further research. Through mm-wave massive MIMO and hybrid beamforming optimization, dynamic beam selection and power allocation (PA), SE enhancement, and reduced complexity, DRL improves B5G performance. A convolutional neural network (CNN) is utilized to extract channel response characteristics in a DL-based channel estimate network, while a long short-term memory (LSTM) network is used to boost performance. This approach uses machine learning techniques to teach neural networks to carry out particular tasks [31,32,33,34]. To accommodate the time-varying nature of wireless channels, the suggested DL-based channel estimation network employs an LSTM network to capture temporal dependencies and a CNN to compress channel response data into feature vectors. In dynamic contexts, this combination increases the accuracy of channel estimates. Compared to massive MIMO systems, mm-wave technology requires fewer transmit antennas and RF chains in high-speed mobile situations, where the method is very successful [35,36]. To improve hybrid analog/digital beamforming architecture, baseband precoding must be transmitted to use fewer RF chains. Baseband decoding can then be applied to the received signal in the uplink to produce low-cost hybrid beamforming [37,38]. Moreover, the digital baseband (DB) beamforming could be used by using microprocessing, while the analog RF chains beamforming might be implemented to produce a high data rate by using phase shift [39,40,41,42]. By using mMIMO baseband precoding instead of dedicating an unconnected RF chain for every antenna, the high cost and high power consumption of combined signal components could be improved. The combiner hardware in [43] featured a completely connected phase shifter, and every aerial for the analog signal was connected to the feeding RF chains dependent on the phase shifter on/off switch. The authors of [44] propose multi-user mMIMO systems by suggesting a DL-PA and hybrid precoding technique. To cut down on the amount of RF chains and channel estimation overhead, an angular-based hybrid precoding (AB-HP) technique has been employed. Ref. [45] proposed DL driven with non-orthogonal hybrid beamforming for the single-user, which provides a high SE, and a good balance between transmission performance and hardware complexity. Ref. [46] proposed a DRL to adopt user-pairing by assigning an orthogonal sub-channel to transmit data to the cognitive BS without causing interference to obtain the optimal transmission power and bandwidth allocation. However, Ref. [47] suggested a ground-breaking experience-driven PA method in the mm-wave for high-speed railway systems with hybrid beamforming that is capable of learning power decisions from experience rather than the precise mathematical model.

The proposed method in [47] aims to achieve higher SE by selecting the best DL training approach to reduce inter-channel interference which minimizes hardware complexity. With the hybrid beamforming approach, the number of RF chains required can be greatly reduced by combining analog and digital processing, which lowers power consumption and hardware complexity. In addition, the suggested approach makes use of DL to teach the system to choose the optimal beamforming weights for every channel, therefore lowering inter-channel interference and enhancing SE [48]. Considering that the DL method can adapt to changing channel circumstances and understand complicated patterns in the wireless channel, it is especially successful in reaching this goal. All things considered, the suggested technique offers a novel way to increase the spectral efficiency of digital transceivers with a lot of antennas. Through the use of DL and affordable digital-analog hybrid beamforming algorithms, this approach can improve wireless channel performance while reducing hardware complexity [49,50,51,52]. Increasing the number of antennas per RF chain transceiver can make it more practical to implement fully connected beamforming. In related work, the author of [53] proposed a new beam training algorithm via DRL with lower training overhead in B5G to achieve higher performance for EE or SE with a lower training overhead. However, Ref. [54] proposed the development of DL techniques to optimize EE/SE and optimize its spectrum sensing process. In addition, Refs. [28,55] proposed joint energy-spectral efficiency optimization for a dynamic adaptation approach based on machine learning (ML) to achieve EE for connected users. The distributed environment, subject to sharing limitations, was optimized to maximize the EE-SE trade-off by [56] using multi-objective optimization. Furthermore, they employed Pareto-optimal solutions to optimize the framework by combining methods applied by different UEs and operators simultaneously. The writers of [40,57] investigated the trade-off between EE-SE and made it possible for HetNets to enhance UE, bandwidth allocation, and a large number of antennas, by taking advantage of the backhaul capacity limitation to optimize both EE and SE. This allowed them to guarantee high data rates for users. According to the total-transmit-PA limitation, the author of [58] formulates multi-objective optimization to maximize the EE-SE trade-off and determine the feasible DL rate. Because SE is unrelated to transmitting power in a huge MIMO system, reducing the number of antennas can boost EE [59]. Key elements like transmission power and antenna count can be efficiently optimized in both EE and SE by using alternative optimization strategies [60]. Therefore, our proposed approach has the potential to significantly improve the efficiency and reliability of hybrid precoding technique multi-user mMIMO systems. Current techniques, like traditional hybrid beamforming, for optimizing SE and EE in multi-user mMIMO systems try to minimize the number of RF chains by combining analog and digital processing. These methods still have a number of drawbacks, though, such as lengthy beam training processes and high pilot power needs for channel estimation, which raises overhead and computational complexity and ultimately restricts their scalability and usefulness for use in sixth-generation wireless networks. DRL-based hybrid precoding techniques have been introduced to address these issues, yet many fail to effectively balance the beam training overhead with RF chain utilization. The proposed AB-HP architecture with RBT-DRL overcomes these limitations by incorporating angular domain feedback to dynamically optimize EE and SE while significantly reducing training overhead and RF chain usage. Unlike prior approaches that focus solely on reducing RF chains or improving training efficiency, AB-HP with RBT-DRL achieves joint optimization by leveraging DRL’s adaptability to changing channel conditions. This innovative integration enhances real-time flexibility, minimizes pilot overhead, and ensures efficient RF chain distribution, offering a robust solution for mm-wave multi-user mMIMO systems in B5G networks.

The novelty of the proposed AB-HP framework lies in its unique integration of angular domain information and RBT-DRL to simultaneously optimize EE and SE in mm-wave multi-user mMIMO systems for B5G networks. While DRL-based hybrid precoding has been explored in prior studies, the proposed framework distinguishes itself by dynamically leveraging angular information to minimize RF chain utilization and channel estimation complexity, a feature not comprehensively addressed in existing work. Furthermore, including the RBT-DRL mechanism enables dual optimization of EE and SE while drastically reducing the beam training overhead, thereby improving system flexibility and adaptability to dynamic network conditions. This dual optimization approach, combined with the innovative use of angular-based feedback in DRL for mm-wave multi-user mMIMO hybrid precoding, lies in its ability to dynamically optimize EE/SE through AB-HP and maximum RBT, significantly reducing the beam training overhead and RF chain usage while adapting to real-time network dynamics. Existing studies on mm-wave systems have thoroughly examined the difficulties in beamforming, channel estimation, and RF chain utilization. Although the number of RF chains has been reduced with the introduction of hybrid beamforming, this method still necessitates a high pilot power signal and has high beam training complexity. Additionally, even though RBT-DRL has been used to enhance beamforming and RA in mm-wave systems, many solutions ignore the critical problem of effective beam training and RF chain usage. Recent developments that optimize beamforming and reduce the RF chain, such as the AB-HP technique and DRL approaches, have made headway in overcoming these constraints. However, only a few studies have completely combined these methods to address channel estimation difficulty, RF chain use, and beam training overhead, which is a key gap that the proposed AB-HP framework aims to fill. The RBT-DRL techniques have demonstrated their effectiveness in numerous fields, including developing low-complexity PA techniques that achieve near-optimal system capacity, making them suitable for real-time applications in HP-based multi-user MIMO systems based on deploying AB-HP to reduce the number of RF chains and the channel estimation overhead size. The authors of [28] recommend looking at state-of-the-art techniques like dimension reduction and feature engineering [5] that can reduce the state vector and action space while maintaining the deep neural network (DNN) performance. Through this, we want to accomplish efficient and effective beam training for mm-wave systems while also drastically cutting down on the DNN’s training period.

1.2. Motivation and Contributions

This study aims to optimize a beam training technique that takes into account the present channel conditions and minimizes the error between the estimated and true channel parameters by enabling dual optimization of EE and SE while drastically lowering beam training. By minimizing beam training overhead and dynamically adapting to network conditions, RBT-DRL-based beam training optimizes EE and SE while enhancing system performance. Enhancing system performance depends on the relay station’s and the BS’s transmit power restrictions necessary to exhibit a unique AB-HP that maximizes EE-SE. This action was taken in response to DL pilot contamination. By leveraging EE results to reduce SE enactment and vice versa, a unified matrix was utilized to enhance the EE-SE trade-off through multi-object optimization. We develop methods for beam tracking, relay selection, joint user scheduling, and codebook optimization for millimeter-wave networks, according to the authors of [48]. Using DRL to reduce system latency, these techniques are based on the design of an online controller that schedules users dynamically and establishes their links. The speed and scalability of B5G wireless networks can be enhanced by mm-wave systems that dynamically adapt to changing channel conditions and user demands. By leveraging DRL’s ability to handle complex optimization tasks in real-time, DRL-integrated user scheduling and link configuration can significantly improve the performance of mm-wave networks.

The contributions of this article can be summarized as follows:

To emphasize utilizing an AB-HP and hybrid precoding technique tailored for large multi-user mMIMO systems. The hybrid precoding method, grounded in angular basis principles, mitigates the need for extensive RF chains and measurement overhead. Addressing the formidable challenges facing the realization of B5GG wireless networks, specifically the demanding criteria for high bandwidth and EE, our approach aligns with recent literature directives. The literature advocates strategies to minimize interference and optimize transmission power and bandwidth allocation through the allocation of orthogonal sub-channels.
To provide a comprehensive analysis of the prediction services that can be expected from B5G wireless networks, including their fundamental ideas and wide range of uses. To achieve accurate and high-resolution channel estimation, this includes a special emphasis on DL applications made for orthogonal frequency division multiplexing (OFDM) systems.
To highlight difficulties that need to be solved to achieve improved computational efficiency using DL. Therefore, to enable optimal performance, attention must be directed to improving the accuracy of channel predictions.
To exhibit a beam training algorithm leveraging the AB-HP constructed through a DNN. This algorithm operates within the wireless environment, adapting to changes in channels, and concurrently optimizing EE/SE. The objective is to enhance overall performance while reducing the overhead associated with beam training.
To maximize the long-term predicted reward through interactions with the environment. The reward is designed to represent this balance, which will be optimized throughout the applied maximum-reward beam training using deep reinforcement learning (RBT-DRL) to minimize the overhead related to beam training while still achieving good EE or SE performance.

2. Wireless B5G Difficulties with DL-Based Mm-Wave

To achieve directional beamforming and allow the high path loss to be offset by the beamforming gain, massive antenna arrays can be employed at the BS and UE sides. Massive antenna arrays and beam management are necessary for the best BS-UE beam pair selection to handle significant route loss in mm-wave communication. To satisfy the demands of B5G systems, where dense antenna integration is supported by mm-wave’s short wavelength to improve network performance, DL in the wireless physical layer can facilitate effective signal processing.

2.1. DL-Based mMIMO

DL-based massive MIMO in the mm-wave is essential for managing high-dimensional data and dynamic channel circumstances during future B5G training. It facilitates effective learning of beamforming, channel estimation, and RA techniques, guaranteeing flexibility and optimization for the high-capacity, and low-latency [61]. One important mMIMO communication technology is beamforming, which focuses the sent signal’s energy on the target recipient. Through its ability to comprehend the intricate links between channel circumstances, antenna array layout, and performance measures like throughput, latency, and EE, DL can be used to optimize the beamforming process [62]. RA in mMIMO systems, the allocation of resources like frequency, power, and time can be optimized with the application of DL algorithms. Increasing reliability, reducing power consumption, and optimizing network capacity across the board, can help to enhance system performance overall [10]. Reducing loss improved based on achieving perfect CSI, which has a major influence on the benefits that could be realized using mMIMO, and so it makes sense for the high-accuracy channel estimation challenge as shown in Figure 1. We investigate frameworks that completely utilize geographical data to integrate DL approaches for channel estimations and direction of arrivals in mMIMO systems. For super-performance mMIMO channel estimation, the DNN is a good option for performing accurate CSI reconstruction [35].

2.2. DL-Based Mm-Wave

A recently emerging field of study called “DL-based mm-wave” looks at applying DL methods to enhance mm-wave communication systemure performance. Signal processing and RA have special challenges because of the high frequency of mm-wave signals and the directed character of the antenna arrays utilized in mm-wave systems. DL can facilitate RA and intelligent, adaptive signal processing, which can assist in overcoming these issues [36].

For mm-wave communication, DL is applicable for channel modeling with the application of DL so that channel models that consider the intricate relationships between the environment, the antenna array, and signal propagation may be created with accuracy and efficiency. These channel models can be utilized to enhance system performance, create more effective RA rules, and optimize the beamforming procedure [40]. By understanding the intricate correlations between the antenna array configuration, channel situations, and performance metrics like throughput, latency, and EE, DL can be used to optimize the beamforming process [37,38,39]. For RA in mm-wave systems, DL can be applied to optimize the distribution of resources, including time, power, and frequency. The system’s performance may be enhanced by increasing network capacity, enhancing reliability, and lowering power consumption [10]. The performance, reliability, and efficiency of wireless communication networks might all be greatly increased by the application of DL in mm-wave communication. However, a thorough evaluation of the system requirements, channel conditions, and available computational resources is necessary to design and optimize DL for mm-wave communication [13,41]. DL is emerging as a key enabler of mm-wave technology, which operates between 30 and 300 GHz and offers significantly higher bandwidth for B5G networks, improving capacity, speed, and reliability. However, optimizing mm-wave technology requires addressing challenges in processing power, channel conditions, and system requirements. All of this was achieved while using very little power (less than 1 watt) to transmit at 73 GHz [62,63,64,65,66] as shown in Table 1. The potential of DL and beamforming training to enable dual optimization of EE and SE in the face of the difficulties of mm-wave communications makes them crucial for wireless B5G systems. B5G systems may effectively lower the RF chain, reduce channel-estimate complexity, and improve both EE and SE by combining DL with beamforming training. This opens the door to more dependable and effective communication. Table 2 lists several issues with wireless communication, including transmission delays, the increasing demand for smaller devices, caching optimization, and restricted capacity, particularly for applications like UAVs and D2D communication. mm-wave technologies, which provide greater bandwidth, reduced latency, and the potential for smaller antennas, are used to overcome these problems.

Table 3 shows how to use geographical data in channel estimate and direction of arrival using DL-enabled frameworks, emphasizing the importance of DNN in accurate CSI reconstruction.

In this article, we provide the key aspects of mm-wave technology for wireless communication.

❖: The properties of mm-wave signals, which have a short wavelength and high attenuation, limit their range and ability to penetrate obstacles. However, these properties can be harnessed for directional communication, enabling beamforming and beam-steering techniques that can enhance the signal quality and reduce interference.
❖: The potential applications of mm-wave technology, including high-speed data transmission, remote sensing, and vehicular communication. The high bandwidths of mm-wave signals make them well-suited for applications that require large amounts of data, such as video streaming, while the directional nature of mm-wave signals makes them ideal for vehicular communication and remote sensing applications [67].
❖: The challenges and opportunities associated with mm-wave technology. The high frequency and directional nature of mm-wave signals require new antenna designs, signal processing techniques, and RA strategies. DL techniques can be used to address these challenges and optimize the performance of mm-wave systems [13,68,69,70,71].

3. Leveraging DL of Mm-Wave in B5G

For sensing-related technologies, DL for mm-wave communication in B5G is essential because it allows adaptive beamforming, accurate channel estimates, and real-time environmental awareness, all of which improve system efficiency and dependability under complex and dynamic network settings. DL addresses problems like dynamic channel fluctuations, beam alignment, and high route loss, and so it is crucial to use it for mm-wave communication in B5G. Through adaptive beamforming, efficient channel prediction, and intelligent resource management, DL enhances the performance of mm-wave communication. The use of DL, a potent machine learning technique well-known for its effectiveness in fields including speech recognition, image identification, and natural language processing, has recently drawn attention from researchers looking to address these issues in mm-wave communication systems [57,58,59,60]. Channel prediction and estimation is one of the primary uses of DL in mm-wave communication. Real-time channel state prediction is possible with DL-based techniques since they can learn the channel characteristics from vast volumes of data. This can boost signal quality and decrease interference, which will enhance the functionality of mm-wave communication systems. Beamforming is another area where DL is used in mm-wave communication. To handle the dual optimization of EE and SE in the face of intrinsic difficulties including high path loss, beam alignment overhead, and environmental variability, DL for mm-wave communication in B5G is essential. Because DL is so good at finding patterns in complicated data, network parameters may be optimized in real-time and adaptively to improve both EE and SE at the same time. By demonstrating its significance, the DL-based solutions can operate around mm-wave restrictions, lessen the computational load, and enhance resource efficiency.

One method for enhancing signal quality and lowering interference is beamforming, which focuses signal transmission in a particular direction. Beamforming techniques based on deep learning can enhance the efficiency of mm-wave communication networks by determining the ideal beamforming weights based on user data DL and have demonstrated significant potential in a number of wireless communication technologies, such as mMIMO. Channel estimate, or predicting the channel matrix between the BS and the users, is one of the primary issues in mMIMO. To effectively estimate the channel in mMIMO systems, this method can be computationally complex and necessitates a high number of pilots [61]. By learning the channel properties directly from the received signals, neural networks can be used in DL-based channel estimation to circumvent these difficulties. This can lower the overhead related to pilot training and transmission while increasing the accuracy of the channel estimation. Beamforming is a further use of DL in mMIMO systems [8,19,62], and DL-based beamforming can reduce interference and increase signal quality. It can also enhance the performance of mMIMO systems by learning the ideal beamforming weights from channel data [55,62].

The combination of DL and mMIMO has the potential to significantly improve the performance of wireless communication systems, including 5G and beyond 5G. By leveraging the power of DL, it is possible to overcome some of the challenges associated with traditional wireless communication techniques and enable high-speed and reliable wireless communication systems. Although enormous MIMO is a potential technique for future communication networks to maximize SE, the system’s performance is negatively impacted by its extremely high spatial complexity. We apply DL to a huge MIMO system with a focus on the difficulties of the direction of arrival (DOA) estimation and channel estimation, and the viability of this innovative framework is confirmed using in-depth simulations [25].

3.1. DL in Artificial Intelligence-Driven B5G Mm-Wave Networks

DL is a powerful technique in artificial intelligence (AI) that has been widely applied in various fields, including wireless communication networks. In the context of B5G mm-wave networks, which operate in high-frequency bands, DL can play an important role in optimizing system performance and improving user experience.

In [26], a deep IA framework with a DNN is described for enabling dependable and prompt primary access for AI-driven networks outside of B5G mm-wave as shown in Figure 2. The authors in [54] proposed that by using a small portion of the available beams, deep initial access (IA) might also reduce the beam sweep time. This sweep time demonstrates the superiority of deep AI over a conventional AI algorithm that also functions as adaptive learning and an efficient RA, and enhances the overall reliability and speed of communication. IA should only use a portion of the available beams to cut down on sweep time. Received signal strengths, which are obtained from the subset of beams, are mapped to the beam, and deep IA then determines which beam is best oriented towards the receiver side [52]. In particular, in the prediction accuracy of line-of-sight and non-line-of-sight (NLoS) scenarios, deep AI significantly outperforms the conventional or standard IA’s beam and reduces IA time [57], with difficulties such as increased sensitivity to obstructions and signal attenuation, especially in situations when there is NLOS. Through its ability to help the system learn and adapt to complex and dynamic situations, deep AI is a lot faster and more reliable and can help address these difficulties by delivering higher reliability, especially in the case of correct prediction [60,61,62,63]. DL can also be used in the development of beamforming and beam-tracking algorithms. Beamforming allows signals to be focused on users or places, which can help to lower interference and boost signal strength. By learning from vast amounts of data and generating predictions based on the circumstances at hand, DL can assist in automating this process. Future wireless communication systems should see even more creative uses of DL. The proposed approach leverages RBT-DRL-based beam training to minimize beam training overhead and adapt dynamically to network conditions, significantly improving both SE and EE in mm-wave systems as shown in Table 4. When compared to existing works such as [33], which focus on reducing beam sweep time by using a subset of available beams, our method offers a more dynamic and robust solution through deep reinforcement learning. Additionally, research like [46] emphasizes beam allocation and power optimization to reduce hardware costs and power usage, while our solution directly addresses the challenges of beam training and adaptation in a HP structure. Compared to [48], which improves system EE using DNN and DRL, the proposed RBT-DRL approach integrates channel estimation and beamforming, offering a more holistic solution for real-world applications with mobility and interference. Moreover, recent works like [57] discuss DRL-based beam training for mm-wave channels considering user mobility, which aligns closely with our work; however, our method provides enhanced performance by incorporating low-resolution ADC considerations and intelligent resource management. These comparisons validate the novelty and effectiveness of the proposed method in comparison to existing solutions.

3.2. Enhancing Channel Prediction Accuracy

Addressing the sparsity and fast channel fluctuations of mm-wave transmissions requires improving the accuracy of channel prediction using DRL in mm-wave communication for B5G. In dynamic B5G network situations, precise forecasts allow for more effective beam alignment, lower signaling overhead, and providing reliable performance. By using novel approaches intended to handle real-world complexity including mobility, interference, and environmental unpredictability, the channel modeling in this article is validated. Beam training overhead is reduced, and the model dynamically adjusts to network conditions by utilizing DL for OFDM systems. Beam alignment, excessive path loss, and channel fluctuations are all successfully addressed by the RBT-DRL-based beam training approach, which guarantees reliable system performance and high-resolution channel estimation. Their alignment, including limited scattering, beam misalignment, and the effect of low-resolution ADCs, validates the channel modeling, which is analytically tractable due to its intrinsic simplification. Using HP in conjunction with RBT-DRL for beam training is specifically made to handle real-world challenges including interference, mobility, and environmental variability. By including features that dynamically adjust to changing network conditions, the chosen channel models are created to handle real-world complications like mobility, interference, and environmental unpredictability. In dynamic situations like channel variations and significant path loss, the RBT-DRL-based beam training framework optimizes channel estimates while lowering beam training overhead. Additionally, spatial domain quantization signals are used to increase channel accuracy and overcome low-resolution ADC issues as shown in Figure 3. By lowering the number of RF chains while preserving strong beamforming capabilities, the HAD architecture improves channel modeling even further. MSE optimization is used to validate these models, guaranteeing that the predicted channels closely resemble actual conditions. By integrating DRL with AB-HP to tackle specific issues, such as bandwidth allocation and high-resolution channel estimation, shows how DRL may successfully increase channel prediction accuracy. Using DNNs to reduce the error between estimated and actual parameters, the model automatically adjusts to channel circumstances by dynamically improving EE and SE. Furthermore, the model uses spatial domain quantization signals and HAD architecture for effective beamforming and channel prediction, which reduces the errors brought on by low-resolution ADCs. The proposed approach provides a reliable solution for mm-wave situations since it closely matches realistic channel behaviors, as confirmed by extensive validation using MSE optimization.

It also enhances the predictability of the inferred channels by improving prediction accuracy and it streamlines and expedites the channel estimation process by avoiding the costly neural network training procedure using a DL-based beam training strategy. Training a DNN using historical beam measurements involves maximizing the expected long-term reward. In DNN, the rewards are obtained using a configurable reward mechanism that considers the power or data rate needs of various users. A DNN is taught to maximize the predicted reward over time by applying DRL approaches. Using DRL approaches allows the system to perform better in a variety of applications and adjust to changing needs. Decision-making is made more effective and the system’s overall efficiency is increased through the application of ML algorithms and prior data. To enhance the balance between prediction accuracy and channel estimation overhead, a straightforward training approach can be utilized. Instead of relying on traditional pilot-based channel estimation, an M-based predictor can be employed. This predictor can detect the channel aging pattern using an autoregressive model. Enhancing the supervised DL-based hybrid beam’s highest sum rate under a specified transmit power limit HB. To reduce computing complexity for multi-user mMIMO systems presented, which are used to reduce the number of RF chains and enhance channel estimation overhead, the method has suggested the use of a learning approach that is the closest to modeling the hybrid beamformer [63,64]. The BS deployed the antenna arrays and studied the channel estimation for an mm-wave huge mMIMO system with a mixed-resolution c architecture. The approach based on compressed sensing can recover the channel. The sparsity of the mm-wave channel may be used to formulate the beamspace channel estimation as a sparse signal-recovery problem.

4. Optimizing EE- and SE-Based Beam Training

This section examines the effectiveness of using a CSI predictor realized by an autoregressive predictor and a DL-based time division duplex (TDD)-scheme-based predictor involving pattern extraction realized by a convolutional neural network. These predictors can offer great performance gains, achievable throughput, and a significant improvement in channel predictability. The mMIMO system will be essential to the B5G wireless cellular networks’ mm-wave technology. Path loss in mm-wave transmission will be fought by the mMIMO system’s high beamforming gain. With interference and non-linear distortions, the DL is successfully used in joint channel estimation and signal detection. Deep CNNs can take advantage of channel correlations to improve the accuracy of channel estimates while reducing the computational cost in mm-wave mMIMO systems. By leveraging the inherent structure of the channel, a deep CNN can efficiently extract and process spatial and temporal features from channel data. This enables the network to generate more accurate channel estimates with less computational overhead, making it a promising approach for optimizing the performance of mm-wave mMIMO systems. With a low-resolution analog to digital converter (ADC), the supervised learning-based sequential interference cancelation is created for mMIMO detection ADCs. An important step in obtaining the channel state data needed for mMIMO transmission is channel estimation. In contrast, an excessive amount of overhead is needed to serve an ever-increasing number of devices due to the exponential increase in devices and the variety of new applications. The maximum number of devices that may be handled in one cell is in the order of 10⁵. The integration of DRL with the AB-HP framework enhances the connection between beam selection and EE/SE outcomes by dynamically optimizing RF chain consumption while minimizing beam training overhead. RBT-DRL intelligently selects beams based on ambient angular information, reducing computing complexity and pilot power consumption while adapting to real-time network fluctuations. By incorporating DNN-based channel estimation, the system ensures accurate channel characteristics, addressing the challenges of high path loss and obstruction in mm-wave communication. Unlike traditional HP systems that focus solely on reducing RF chains or improving training efficiency, RBT-DRL enables dual optimization by dynamically adjusting the number of activated RF chains to balance EE and SE as shown in (6) and (10). This adaptive beam selection approach enhances capacity in multi-user mMIMO systems while maintaining resilience under varying network conditions. The methodology demonstrates how RBT-DRL outperforms existing solutions in efficiency, reliability, and scalability by optimizing beamforming, reducing training overhead, and improving EE/SE trade-offs.

4.1. RBT-DRL L-Based Beam Training with EE and SE

To reduce the error between the estimated and true channel parameters and optimize a beam training strategy that takes the current channel circumstances, the RBT-DRL-based beam training is essential because it maximizes EE and SE by reducing beam training overhead and dynamically adjusting to network conditions, improving system performance. Mm-wave communication has been proposed as a promising solution to meet the increasing demand for higher data rates in wireless communication systems. However, mm-wave communication faces significant challenges due to high path loss and blockage caused by obstacles in the transmission path. To lessen the effects of these difficulties, channel estimation is essential in mm-wave communication. To adaptively improve EE and SE while reducing computational and measurement overhead, the proposed beam training approach makes use of DNN. When compared to traditional methods, this creative approach guarantees that our technology achieves higher performance and adaptability. DNNs have been demonstrated to reduce the error between the estimated and true channel parameters, and the DNN is trained using a sizable dataset of input-output pairs. But to train and infer, DNN-based techniques need a lot of training data and processing power. Further study in this field is expected to increase the accuracy and efficiency of these methods. The DNN-based channel estimation shows promise as a workable solution for mm-wave communication systems. To optimize either EE or SE, a beam training strategy that takes the current channel circumstances into account and modifies the amount of activated RF chains in the mm-wave system may enhance overall performance and adjust to changing channel circumstances. The mm-wave system’s efficiency and dependability in difficult conditions can be improved by adding RBT-DRL into beam training.

4.2. A Low-Complexity AB-HP

To further mitigate the remaining interference at the DB precoder in a wireless cellular network’s downlink transmission, the channel estimation is performed by the user transmitting a pilot signal of

\sqrt{p}

power to the BS. The power of the pilot signal, denoted by

p

, helps in determining the quality of the channel. It is hard to figure out the right channel because the ADCs do not give very detailed information and are not very accurate. The corresponding channel estimation is obstructed by the inaccurately quantized output of the ADCs with low resolution. After that, a fully connected DNN is used to construct the objective AB-HP in DNN to investigate how to jointly optimize energy harvesting and SE in response to changes in the wireless channel to reduce overall beam training overhead while achieving high levels of energy or spectral efficiency performance. One approach to achieve this goal is to use DRL to select potential beams for beam training, as suggested in [28]. Depending on whether performance metric is used, the EE or SE values are

V_{n} \in R^{M + 1}

for beam training. However, in mm-wave systems, the state vector in [28] and the action space in [27] both scale with the number of antenna elements, which can significantly lengthen the DNN’s training time. Thankfully, at the BS, the correlation between the received signals from each antenna is introduced by the restricted number of scattering and the constrained physical space between the antenna

φ

. Then, the almost undistorted quantization signals can provide additional channel information in the spatial domain for

β

to improve the channel accuracy.

\hat{h} = \sqrt{\frac{p_{l}}{N_{t r}}} \sum_{n = 1}^{M_{t r}} \{\begin{matrix} F_{u, θ}^{r x} (θ_{l, n, z o A}, φ_{l, n, A o A}) \\ F_{u, φ}^{r x} (θ_{l, n, z o A}, φ_{l, n, A o A}) \end{matrix} . [\binom{e^{j φ_{l, n} \sqrt{K_{L, n}}}}{\sqrt{K_{L, n}} e^{j φ_{l, n}} e^{φ φ_{l, n}}}] [\begin{matrix} F_{u, θ}^{t x} (θ_{l, n, z o D}, φ_{l, n, A o D}) \\ F_{u, φ}^{t x} (θ_{l, n, z o D}, φ_{l, n, A o D}) \end{matrix}],

(1)

where

\hat{h}

is the estimation of

h

,

p_{l}

is the power of the

l

-th cluster, the vectors [

F_{u, θ}^{r x}

,

F_{u, φ}^{t x}

]T and [

F_{u, θ}^{t x}

,

F_{u, φ}^{t x}

] represent the receive antenna and transmit antenna, respectively, to minimize the MSE across all training samples. The DNN appropriately sets the activation functions and updates the weight matrix in a data-driven way, which is given by

M S E {= \frac{1}{N_{t r}} \sum_{n = 1}^{M_{t r}} ‖h_{n} - {\hat{h}}_{n}‖}^{2},

(2)

where

N_{t r}

represents the number of training samples,

h_{n}

and

{\hat{h}}_{n}

denote the true channel and the channel approached by the RBT-DRL, respectively, for the

n

th training sample. The significantly distorted signals quantized by the low-resolution ADCs hurt the estimate performance, which is ignored in the channel estimation for all antennas. Conventional spatial channels can be converted into beam spatial channels by use of an antenna array. Actually, the spatial discrete Fourier transform (DFT) matrix U of size

N \times N

is simulated by the antenna array. Conversely, the DNN is created by selectively employing trustworthy observations that match the high-resolution ADC antennas based on the predictions from the high-resolution channels. The hybrid analog-digital (HAD) architecture allows antennas to be coupled to significantly fewer RF chains through analog domain phase shifters and further enhances performance through digital processing. These studies indicate a channel estimation approach based on RNN, which may estimate channels through the non-linear mapping of RBT-DRL, to address the shortcomings of the conventional channel estimation.

After developing the analog RF beamformer

F

and the digital BB precoder

B_{n}

, the capacity maximization is reformulated as follows:

\underset{p}{SE = \max} \sum_{n = 1}^{M} \log_{2} (1 + \frac{p_{n} {|h_{n} F B_{n}|}^{2}}{\sum_{n = 1}^{N} {p_{n} |h_{n} F B_{n}|}^{2} + ρ^{2}}) s . t . E ‖S‖ = \sum_{n = 1}^{M} p_{n} h_{n} F F^{H} B_{n} B_{n}^{H} \leq p_{T}

(3)

p_{n} \geq 0, \forall n

where

ρ

noise power, DL-based-AB-HP with mean square error (MSE) optimization outperforms traditional methods as the number of UEs increases, significantly reducing the runtime while also minimizing the number of required RF chains and lowering the channel estimation overhead. Some works focus on AP-HP to enhance system performance in high-speed communications. In [8], the PA problem in high-speed orthogonal frequency division multiple access (OFDMA) systems to achieve minimum total transmit power consumption was examined. To minimize the average transmit power, it was suggested in [11] to optimize PA with antenna selection. In order to maximize EE on high-speed networks with buffer limits, the PA method was introduced by the authors of [12]. The power control issue in mm-wave high-speed systems has never before been addressed, according to the authors of [13]. The AB-HP technique is executed in two phases: in phase 1, the suggested AB-HP algorithm uses offline supervised learning to forecast allocated powers in real-time online applications. This is achieved by using the best-allocated power values, determined by the computationally intensive particle swarm optimization (PSO)-PA, as output labels in the offline supervised learning process. The PSO-PA technique is used to determine and store the associated optimally allocated powers in the dataset. The entire available dataset is split into training and validation sets using an 80–20% split, and phase 2 of the learning process is finished in the online AB-HP using a brand-new test dataset. The suggested AB-HP algorithm is implemented using an open-source DL framework. DRL is a subset of ML that can learn through experimentation while interacting with the environment [40,66]. Tagged datasets are not a requirement for it. To choose a group of beams based on past experiences, multi-armed bandit (MAB), a simple DRL algorithm, is utilized [67]. There is very little that MAB can do to modify the beam selection in response to changes in the environment because it does not utilize the way things are right now. On the other hand, based on the predictions from the high-resolution channels, the DNN is constructed by carefully selecting the reliable observations that correspond to the high-resolution ADC antennas [68]. The use of DNN in DRL allows for the creation of more intelligent beam training algorithms by extending the capabilities of conventional methods. Based on the targeted user’s uplink received power, DRL is employed in [64] to jointly assign the best BS and beam resources to that user. As per reference [66], DRL can instantly assimilate information from its environment to choose the optimal beam pair for data transmission. A DL approach for a fully connected hybrid architecture in a multiuser situation is proposed by [29], demonstrating the ongoing research on using ML for hybrid precoder design. For a fully connected hybrid structure, [30] proposes unsupervised DL precoding and it suggests using CNN to create a partially connected hybrid design [31,65]. Still, compared to the focus on fully linked hybrid architectures, there is not much study on partially connected ones. Consider employing DRL to select appropriate beam candidates for beam training to lower algorithm complexity. By balancing exploration and exploitation without retraining, the RBT-DRL approach efficiently optimizes EE and SE in mm-wave multi-user mMIMO systems, outperforming CNN-based HP and LSTM-driven RA approaches in terms of convergence, computational complexity, and adaptability [52,69]. RBT-DRL produces better EE-SE trade-offs with less complexity than CNNs and LSTMs, which have limits in high-dimensional areas and real-time adaptation. In addition, DL-based and HP strategies, such as those that use ASM and PMA, improve EE and SE even more, demonstrating the potential of cutting-edge AI technologies for mm-wave networks.

The training time of the DNN may rise when considering a mm-wave system, since both the size of the action space in [66] and the state vector in [52] grow with the number of antenna elements. The beam training technique utilizes DRL’s general architecture. The process of tracking beams is modeled as a DRL process in the training context. The DRL algorithm develops an EE-SE maximization beam training strategy to improve performance while lowering beam training overhead. Initially, depending on previous channel measurements, the DRL block chooses one of many candidate beam training approaches. As the first action in the EE-SE maximization strategy, the chosen beam training method is used. The predicted number of RF chains needed to obtain the highest EE or SE is based on the outcomes of the beam training. The BS with

M_{t x}

antennas ccommunicates

M_{s}

data streams to the UE with

M_{r x}

antennas. The BS and the UE are assumed to be equipped with

{M^{t x}}_{r f}

and

{M^{r x}}_{r f}

RF chains, respectively, such that

M_{s}

≤

{M^{t x}}_{r f}

≤

M_{t x}

and

M_{s}

≤

{M^{r x}}_{r f}

≤

M_{r x}

. Depending on the condition of the system at the time, the DRL block can switch between DRL-EE and DRL-SE configurations. This includes factors like the UE’s downlink queue state and battery level. The mode of RBT-DRL-SE, for instance, is activated to communicate more data using spatial multiplexing when the UE’s packet queue is backlogged. In contrast, RBT-DRL-EE is activated to save energy for the UE if its battery level is low, such as below 50%. Upon request from the BS, the UE will send a report of its parameters. The EE counts how many bits are transmitted for each unit of energy, which is given by

M (n) = \frac{B_{n}}{p_{n}} r (n) . Bit / Joule

(4)

To create a virtual environment for training RBT-DRL agents and operating in highly reliable systems, DRL-scheduling is proposed.

p_{n} = \frac{Ƿ_{n}}{ζ} + M_{r f} (n) (p_{r f} + M_{t r} p_{p s}) + M_{r f} (n) (p_{r f} + M_{r x} p_{p s}) + p_{c o n s t},

(5)

where

ζ

represent amplifier efficiency with

ζ \in [0,1]

,

p_{r f}

represent the power required per RF chain,

p_{p s}

represents phase shifter, and

p_{c o n s t} = M_{t r} p_{t x} + M_{r x} p_{r x} + 2 p_{c o m m o n}

represent accounts for the fixed power consumption, where

p_{t x}

and

p_{r x}

are the power for each transmit and receive antenna.

E E_{n} = \frac{r_{n}}{\frac{Ƿ_{n}}{ζ} + M_{r f} (n) (p_{r f} + M_{t r} p_{p s}) + M_{r f} (n) (p_{r f} + M_{r x} p_{p s}) + p_{c o n s t}} .

(6)

With relation to the distributional perspective on DRL [16,71], the agent seeks to determine the best transmission scheduling. The random return is

ƴ

, whose expectation is the value

Q^{π}

. Improving sample complexity to hyper-parameters variation increases the robustness, and environmental noise is shown in [17,40]. The system state specified by the user is denoted as

s_{n} = {\{S_{1, n}, \dots \dots \dots, S_{N, n}\}}^{T} \in r^{M}

of the IoT system at the time step

n

for the input layer. The random return is achieved by adhering to a current policy

π

by performing an action

a

from the state

s

indicated by the random variable

y_{q}^{π} (s, a)

due to the startling predictability of the environment. Action space:

a_{n} = {\{A_{1, n}, \dots \dots \dots, A_{N, n}\}}^{T}

is necessary to adjust all policies and to determine their improvement, thus they arrive at

Q^{π} (s, a) = E [y_{q}^{π} (s, a)]

and analogous distributional Bellman equation, that is

y_{q}^{π} (s, a) ≜ r (s, a) + ψ y_{q}^{π} (s^{'}, a^{'}),

(7)

where

s^{'}

and

a^{'}

are random nature of the next state-action pair after developing a policy, and

A : ≜ B

pointed to a haphazard variable

B

probability law and

A

are similar. As a result, when the distributional Bellman operator’s policy assessment behaves,

T

can be defined as

T y_{q}^{π} (s, a) : ≜ R (s, a) + ψ y_{q}^{π} (s^{'}, \arg \max_{a^{'} \in A} E [y_{q}^{π} (s^{'}, a^{'})]), s^{'} ~ P (. |s, a), a^{'} ~ π (. |s^{'}) .

(8)

The RBT-DRL network takes control of the generation of the real-like data

Z

obtained from a distribution

P_{g} (Z)

and generates a real data

X = (Z, {φ^{*}}_{d} (φ_{r})) ~ P_{g} (Z)

. The discriminator

φ_{D}

is at that time training to recognize between the actual data

P (X)

and the data arriving from the refined DRL-distribution traffic data

P_{g} (Z)

by training a function. Action space A, which is discrete in nature, is the collection of values of actions that consistently accept values from rising integers in the range

a_{n}

.

4.3. Reward

To maximize the long-term predicted benefit during an interaction with the environment, the agent is trained to learn a policy [22]. The goal of this work is to achieve good EE or SE performance for a mobile UE while minimizing the beam training overhead. Because it directs the agent to optimize long-term gains by rewarding desirable behaviors and penalizing ineffective ones, the reward function is essential to DRL. A compromise was made between the obtainable EE or SE and the beam training overhead. This balance, which will be optimized during the RBT-DRL training process, is represented by the reward. In return for transmission in real-time, a slight and acceptable performance degradation brought on by a shorter beam training period is allowed.

The more beams that are evaluated for data transmission, the longer the beam training period will be. The amount of beam measurements needed is represented by the non-negative number Ui, which is the penalty for the

n - t h

beam training method [49,69]. This reward, which is regulated by a trade-off factor

δ

, balances the possible SE or EE with the penalty for beam training overhead

V_{n}

. The reward function for EE may be found in (9), and for SE, it has a similar structure. States (such as CSI and RF chain), actions (beam training), and rewards are all part of the DRL environment. To guarantee convergence, hyperparameters like the learning rate and the trade-off factor are carefully adjusted. Different system sizes, and channel conditions, are used to assess the method’s scalability and real-world performance. To optimize long-term rewards, the agent is trained by gradient optimization to stabilize learning. The training process uses a target network with hyperparameters such as exploration rate, learning rate, and trade-off factor. The reward function is explicitly given, along with formulae designed to balance performance indicators, to guarantee reproducibility. The simulations yielded the penalty values. Thus, in the RBT-DRL-EE, the reward function is provided by

ꓣ_{n}^{e e} = α δ ꓰ_{n} - α (1 - δ) V_{n} - λ (\frac{M_{n}}{M_{m a x}} + \frac{p_{n}}{p_{m a x}}), 0 \leq δ \leq 1, n = 1,2, . .

(9)

where

δ

, also known as the trade-off factor, regulates the ratio of the achievable EE to the necessary beam training overhead. Training dynamics and convergence speed during DRL are influenced by α, which scales the reward function update according to the learning rate. The learning rate causes slower convergence but more stable learning, whereas a higher rate speeds up learning but can also contribute to instability. By controlling the relative value of long-term vs. immediate benefits, the discount factor δ influences the agent’s decision to prioritize lowering overhead over enhancing EE or SE. Optimizing EE or SE is favored by a greater δ, whilst cutting beam training time is given priority by a lower value. The design of the reward function is crucial to maintaining equilibrium between the training overhead and the achievable EE/SE. This function allows the agent to learn the best trade-off by incorporating penalties V_n to discourage excessive training time. The agent’s capacity to adjust to different channel conditions and system sizes, as well as the framework’s actual performance, are directly impacted by the sensitivity of these hyperparameters.

ꓰ_{n}

is the EE obtained using the

n - t h

beam training method. The combined effect of the channel state, the quantity of RF chains employed, and the chosen beam training technique can be reflected in the EE or SE.

V_{n} = {[ԑ_{n - M}, ԑ_{n - M + 1}, \dots \dots, ԑ_{n - 1}, {\bar{ԑ}}_{n}]}^{M}

is the vector ut for EE, where the first

M

entries represent the EE attained in the previous

M

time steps, and the final item,

{\hat{ԑ}}_{n}

, represents the EE assessed using the pre-assessment at the current time step

n

. Concerning SE, the vector

V_{n}

is denoted by

V_{n} = {[r_{n - M}, r_{n - M + 1}, \dots \dots, r_{n - 1}, {\bar{r}}_{n}]}^{M}

, containing the appropriate

M + 1

elements of SE.

ꓣ_{n}^{s e} = α δ ꓰ_{n} - α (1 - δ) r_{n} - λ (\frac{M_{n}}{M_{m a x}} + \frac{p_{n}}{p_{m a x}}), 0 \leq δ \leq 1, n = 1,2, \dots

(10)

The agent learns a policy to maximize the long-term predicted reward through interactions with the environment [22]. Prioritized experience replay preserves all states for effective learning, but only high-impact states according to their temporal difference (TD) error. Based on available processing capability, the DRL agent dynamically modifies computational complexity to allow for real-time decision-making without overtaxing the system. Furthermore, resource-constrained DNN training remains within the maximum power limit

p_{m a x}

thanks to an adjustable learning rate. Our goal in this work is to minimize the overhead related to beam training while still achieving good EE or SE performance for a mobile UE. In another way, to optimize the trade-off between the overhead of beam training and the achievable EE or SE. The Equations (9) or (10) offer the option to weigh the performance metric’s significance while choosing a beam training method, thanks to the incentive system as shown in Algorithm 1. The agent can be trained to attain varying performance levels for distinct applications by adjusting the value of

δ

. A bigger trade-off factor is desirable for applications that need high transmission rates, like high-definition video streaming, where a higher data rate matters more than beam training latency.

Algorithm 1: RBT-DRL Based Beam Training for Achieving EE and SE

$Input Set h_{n}, B_{n}$ , $M_{t x}, M_{r x}$ , ${M^{t x}}_{r f}$ and ${M^{r x}}_{r f}$
$Establish the ζ$ , noise power $ρ$ , power budget $p_{T}$ , and other power-related factors $p_{r f}, p_{p s}$ , $p_{c o n s t}$ .
$Device constraint for memory M_{m a x}$ and processing power $p_{m a x}$
$Set RBT - DRL ’ s initial memory buffer to M_{m a x}$
$for each time step n$ , complete the following;
Determine the best beam training technique, and develop an RBT-DRL agent.
$Provide definitions for the reward ꓣ_{n}^{e e}$ , for $ꓣ_{n}^{s e}$ action space $a_{n}$ , and state space $s_{n}$ ;
$for each time step n$ , update beamforming $F, B_{n}$ based on the current state $s_{n}$ .
Evaluate allocation power as shown in constraints (3).
$Check the quality of the channel as shown in (2) based on available processing power p_{m a x}$ and adjust the estimated channel ${\hat{h}}_{n}$ in the RBT-DRL agent’s knowledge.
Optimize memory usage by deleting unnecessary states and storing just the most important ones.
$Compute cumulative discounted reward at a time step and transit to the next state s_{n}$ in DNN based on the actions taken.
Optimize beam training choices by adjusting battery-aware memory allocation.
Store states based on activating the battery level in RBT-DRL.
Calculate the reward for each beam training based on the overhead related to beam training while still achieving good EE or SE (9) and (10)
end for
$Train the DNN model using the previously saved samples (s_{n}$ , $a_{n}$ )

5. Discussion

The performance of the proposed experience-driven power distribution approach in the mm-wave AB-HP is assessed in this section through a series of simulated experiments. Table 5 outlines the key simulation parameters used to assess the power distribution approach in the mm-wave AB-HP. These parameters, including bandwidth, transmit power, and beam adjustment are essential for evaluating the effectiveness of DL-based techniques. We evaluate the effectiveness of DL-based techniques and conventional beamspace-channel estimation algorithms in mixed-ADC architecture. In particular, for the RBT-DRL it is necessary to ascertain beforehand how many beams to track during time beam training using the value of

V_{n}

. The most straightforward beam training technique is to reduce the overall beam training overhead while preserving sufficient performance.

There is an impact of the different actor learning rates, while the critic’s learning rate is predetermined. The learning rate has a significant impact on the stability and speed of the SE during the RBT-DRL algorithm training. This is so that the convergence of SE can be implemented, and the learning rate represents the learning step. Missing the global optimum during the training process is more likely if the learning rate is high. Conversely, a poor learning rate would likely cause the convergence rate to slow down. We can also observe from Figure 4 that the SE realized by RBT-DRL can converge to a higher and more stable value sooner when the RBT-DRL for beam training

V_{n}

delivers after the agents are trained to be stable. The average SE attained by various beam training procedures for varying factor

δ

is displayed in Figure 5. In the DRL, RBT-DRL, and the reward technique, whose efficacy increases with increasing

δ

, similar patterns are observed, wherein the more expensive beam training approaches will be activated more frequently by the RBT-DRL, DRL, and reward to raise the SE as it rises. The range of achievable SE values is 15.9–18.7 bit/s/Hz. The DRL model can be adjusted to achieve, in that order, 91.7%, 93.4%, and 95.9% of the maximum SE within this range. The typical quantity of RF chains is needed for RET-DRL. More RF chains are employed to produce larger SE when there is no power constraint, particularly for

δ

≥ 0.5. RBT-DRL activates less RF chains than DRL when the SE is less weighted in the reward function in (9).

From Figure 6, the lowest EE is around 49 Mbit/Joule, which is attained without any beam training, while the maximum attainable EE is approximately 73.5 Mbit/Joule, which is reached by rigorous beam training. Achieving a greater EE becomes more crucial as α increases, while the impact of the beam training overhead decreases. Therefore, in training, a higher EE with more beams is obtained as

δ

grows.

The three distinct beam training techniques—RBT-DRL, DRL, and conventional beam training—are depicted in Figure 7. All three approaches’ SE rises in tandem with SNR, indicating the anticipated increase in data throughput that comes with improved signal quality. The final SE attained and the pace of the rise, however, differs between the approaches. Throughout the whole SNR range, RBT-DRL continuously shows the highest SE, indicating its superior capacity to use increased SNR for enhanced SE. Although the difference is not as noticeable as it is with RBT-DRL, DRL likewise exhibits a larger SE when compared to regular beam training. This suggests that when compared to the traditional beam training method, both RBT-DRL and DRL, which make use of reinforcement learning, are more successful at adjusting to shifting channel circumstances and optimizing beamforming for SE. The DRL and RBT-DRL can dynamically modify their SE settings according to system factors such as battery level and UE queue status. Because of their flexibility, they can adjust beamforming tactics in real time, maximizing SE when necessary and maybe sacrificing some SE when battery levels are low. In contrast, the adaptive RL-based methods perform better than the standard beam training method, especially when SNR conditions alter, because the former probably uses a more static approach. The reward functions for RBT-DRL-SE are meant to balance the overhead and performance, which ultimately affects the observed variations in SE among the three approaches. Equations (4)–(10) explain how the SE is determined.

With varying amounts of

δ

, the DRL method can reach 93.1%, 95.6%, and 98.0% of the highest possible EE. Furthermore, RBT-DRL can provide equivalent or even fewer beam measurements by alternating between several beam training techniques, which can perform better than continuously using a fixed beam training approach. However, DRL can draw lessons from beam training’s past and choose the beam training strategy that will maximize long-term rewards. At

δ

= 0.1, RBT-DRL achieves zero beam training at a considerable performance deterioration since optimizing the long-term benefit is comparable to limiting the beam training overhead. When

δ

= 0.8, increasing the EE maximizes the long-term benefit; therefore, it makes sense to do an exhaustive beam search more frequently to achieve a greater EE.

From Figure 8, the performance of RBT-DRL, DRL, and three beam training method models is demonstrated to be quite similar in Figure 8, which shows the average EE attained by various beam training policies at varying SNRs. The EE increases to saturation value after that starts to decrease. As we can see, the SE only increases by 39% whereas the overall power consumption in Equation (4) rises by 45% from SNR = 14 dB to 20 dB. Thus, the high power consumption limits the EE performance at SNR = 20 dB. From Figure 9, by increasing task information, the appropriate incentive schemes in Figure 9 facilitate faster learning, with RBT-DRL and DRL requiring around 20% less reward over the whole SNR range. By combining dynamic PA with context awareness and utilizing real-time network conditions to optimize power consumption, the RBT-DRL enhances the solution to handle power consumption problems. To ensure energy-efficient operation without sacrificing SE, the reward mechanism in RBT-DRL can be changed to penalize excessive power usage. Furthermore, hybrid precoding can further lower power consumption by selectively activating antennas based on user density and demand. To reduce redundant power consumption, DRL can facilitate coordinated, dispersed operations among BSs to balance off higher power requirements in mobile and IoT environments. Lastly, the RBT-DRL framework can be verified and improved by benchmarking against low-power solutions, such as lightweight DRL algorithms, to guarantee useful implementation in situations with limited power. The EE increases with SNR in Figure 8 until a certain point, at which point it tends to plateau or even slightly rise. This pattern emphasizes the trade-off between boosting signal quality for SNR and the corresponding power usage. To compare RBT-DRL-based techniques, in particular in [41], the RBT-DRL variations in particular attain higher EE in terms of SNR values when compared to [41], indicating that DRL-based beam training in Figure 8 may be more successful in maximizing EE than adaptive approaches in [41]. The saturation of EE at higher SNRs highlights the fact that merely raising transmit power above a particular threshold does not always result in a corresponding rise in EE since power consumption rises faster than possible signal quality gains. As shown in [42], in comparison to current state-of-the-art (SOTA) algorithms, this approach focuses on mathematical convergence to a local solution that is approximately less complex and in reaching a compromise between computational complexity and SE. From the benchmark when the beam training overhead is reduced, the RBT-DRL continuously achieves a higher EE at different SNR levels, peaking at about 90 Mb/s/Hz. By utilizing DRL and adaptive beam training for dynamic network conditions, RBT-DRL provides notable advantages over SOTA hybrid precoding techniques such as compressive sensing-based approaches as shown in [42], conventional precoding, and heuristic optimization methods. RBT-DRL is a more adaptable and energy-efficient solution for ultra-dense, high-mobility networks because of its capacity to dynamically optimize RF chain utilization and precoding based on environmental feedback.

The needed long-term anticipated advantage for either RBT-DRL or DRL during an interaction with the environment is substantially larger than that for both RBT-DRL and beam training, and it does not change with varying SNR. To acquire greater rewards with fewer training beams, one can examine the distribution of action choices for all beam training policies. The SE of several techniques is displayed versus the number of N in Figure 10. Keep in mind that the experiment uses the highest bound of SE achieved by the best full-digital beamforming. Figure 11 makes it evident that the SE attained by all RBT-DRL grows as the number of Ns rises. On the other hand, DRL and beam training greatly improve performance, whilst the random power and maximal power schemes only slightly improve SE. Furthermore, the SE obtained by the suggested RBT-DRL approach has a SE greater than previous ways as the number of Ns increases in the mm-wave. Furthermore, when employing RBT-DRL and beam training, the maximal power scheme performs worse consistently. This shows how optimizing PA and beamforming together can raise the SE of mm-wave systems.

From Figure 11, the three primary methods for preserving the stability of the channel estimation are interactions, examining the role of RBT-DRL provides important insights into channel stability, and if it can successfully adjust to changing channel conditions, it may improve SE performance. As a result, stability is provided during simulation by the mm-wave-sampling realization in both native and mutant forms, as shown in Figure 11. If you have the channel realizations available, you can replace the range function with actual channel data or change it to acquire a more realistic depiction, as the beam training bonding has shown a slight decrease in the number of possible channel realizations for the SE for the highest possible RBT-DRL award. The goal of this technique is to attain maximum efficiency by refining the beamforming strategy. Beam training offers less adaptive beamforming options, and DRL approaches, while trying to maximize beamforming, may not be as specialized as RBT-DRL.

The EE performance for each of the three methods, RBT-DRL, DRL, and beam training, is plotted against time steps in Figure 11. RBT-DRL consistently produces larger EE values than DRL and beam training, as seen in Figure 12. However, RBT-DRL exhibits notable EE fluctuations, most likely as a result of altered ambient factors or system characteristics. Although the DRL method does not achieve the highest EE levels of beam training, it does show greater stability than beam training. The RBT-DRL approach shows resilience in sustaining steady performance even if it has the lowest average EE. This is probably because it can dynamically strike a compromise between energy conservation and system efficiency. The beam testing outcomes establish the estimated number of RF chains needed to maximize EE. When the UE is equipped with M_rx antennas, it receives M_y data streams from the BS, which has M_tx antennas. To attain optimal EE, the beam training procedure maximizes the quantity of RF chains

{M^{t x}}_{r f}

required at the transmitter and receiver. Based on system parameters, like the state of the downlink queue or the battery level of the UE, the DRL block allows dynamic switching of energy-efficient DRL-EE. Using spatial multiplexing, the RBT-DRL-EE mode prioritizes energy savings when the UE’s packet queue is backlogged and its battery level falls below 50%. RBT-DRL’s dynamic behavior guarantees a trade-off in energy consumption.

Figure 13 shows SE vs. time steps for the three strategies RBT-DRL, DRL, and beam training. Though it varies significantly depending on the dynamic channel and system conditions, beam training consistently yields larger SE than the other two methods. DRL exhibits relative stability and modest SE values, showing environmental adaptability without achieving the peak SE attained by beam training. In contrast, RBT-DRL prioritizes EE over spectral performance, maintaining the lowest SE. Based on the results of the beam training procedure, the expected number of RF chains needed to achieve the ideal EE is calculated. Using M_tx antennas, the BS transmits M_s data streams to a UE that has

M_{r x}

antennas. It is assumed that both the UE and the BS have RF chains that are

{M^{t x}}_{r f}

and

{M^{r x}}_{r f}

and

M_{s}

≤

{M^{t x}}_{r f}

≤

M_{t x}

and

M_{s}

≤

{M^{r x}}_{r f}

≤

M_{r x}

. DRL-EE and DRL-SE configurations are dynamically switched by the DRL block based on system conditions. The reward function for EE is defined by the equation

ꓣ_{n}^{e e}

, which takes into account achievable efficiency, beam training overhead, and RF chain usage. To ensure convergence and scalability across different system sizes and channel conditions, the DRL agent undergoes a rigorous training procedure that includes hyperparameter tweaking (learning rate, trade-off factor) to optimize long-term rewards. While balancing computing complexity and energy cost, this dynamic behavior facilitates real-time decision-making.

6. Conclusions

In this survey, we present possible research directions for the future. Inspired by state-of-the-art developments in 5G communication technology and DL, we provide an overview of the most recent work on utilizing DL for wireless physical layers. Furthermore, we outline many cutting-edge and successful DL-based communication frameworks. This review delivers high-quality services by investigating DL for the wireless physical layer in light of the aforementioned factors. Our research focuses on B5G systems to accommodate new needs in future communication situations. Through the trade-off between EE and SE, this article ensures high data rates for users and allows HetNets to optimize UE, spectrum allocation, and a large number of antennas by taking advantage of the backhaul capacity limitation to maximize both EE and SE. The suggested AB-HP approach aims to reduce channel-estimate overhead and RF chains by reducing the beam training overhead. Using the maximal power technique consistently results in inferior performance for the proposed maximum RBT-DRL. Based on a Pareto-optimal solution to the optimization problem, this shows that maximizing beamforming and PA simultaneously can increase the spectral efficiency of mm-wave systems. Future work will focus on addressing latency constraints, ensuring fair user allocation, and improving compatibility and computational complexity with diverse network architectures to further optimize performance in dynamic and heterogeneous environments.

Author Contributions

Conceptualization, A.S. and N.S.M.S.; methodology, A.S. and M.A.A.; software, A.S., M.A.A. and G.A.H.; validation, C.J.J. and M.A.A.; formal analysis, A.S. and A.H.; investigation, C.J.J.; resources, A.S. and C.J.J.; writing—review and editing, S.A. (Saeed Alzahrani) and R.A.; visualization, S.A. (Saad Alharbi); supervision, A.H. and F.S.A. updated the comments. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to Universiti Tunku Abdul Rahman Research Fund (UTARRF) by supporting the Fund through Vote no. (6557/2A02) and TIER1 grant vot Q444 from Universiti Tun Hussein Onn Malaysia.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dai, L.; Wang, B.; Yuan, Y.; Han, C.; Wang, Z. Non-orthogonal multiple access for 5G: Solutions, challenges, opportunities, and future research trends. IEEE Commun. Mag. 2015, 53, 74–81. [Google Scholar] [CrossRef]
Alkhateeb, A. Deep MIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications. arXiv 2019, arXiv:1902.06435. [Google Scholar]
Castanheira, D.; Lopes, P.; Silva, A.; Gameiro, A. Hybrid beamforming designs for massive MIMO millimeter-wave heterogeneous systems. IEEE Access 2017, 5, 21806–21817. [Google Scholar] [CrossRef]
Xue, Q.; Ji, C.; Ma, S.; Guo, J.; Xu, Y.; Chen, Q.; Zhang, W. A Survey of Beam Management for mmWave and THz Communications Towards 6G. arXiv 2023, arXiv:2308.02135. [Google Scholar] [CrossRef]
Salh, A.; Audah, L.; Shah, N.S.; Alhammadi, A.; Abdullah, Q.; Kim, Y.H.; Al-Gailani, S.A.; Hamzah, S.; Esmail, B.A.; Almohammedi, A.A. A survey on deep learning for ultra-reliable and low-latency communications challenges on 6G wireless systems. IEEE Access 2021, 9, 55098–55131. [Google Scholar] [CrossRef]
Molisch, A.F.; Kung, S.Y. Variable-phase-shift-based RF-baseband codesign for MIMO antenna selection. IEEE Trans. Signal Process. 2005, 53, 4091–4103. [Google Scholar]
Mir, T.; Siddiqi, M.Z.; Mir, U.; MacKenzie, R.; Hao, M. Machine learning inspired hybrid precoding for wideband millimeter-wave massive MIMO systems. IEEE Access 2019, 7, 62852–62864. [Google Scholar] [CrossRef]
Zhou, P.; Fang, X.; Wang, X.; Long, Y.; He, R.; Han, X. Deep learning-based beam management and interference coordination in dense mmWave networks. IEEE Trans. Veh. Technol. 2018, 68, 592–603. [Google Scholar] [CrossRef]
Ye, H.; Li, Y.; Juang, B. Power of deep learning for channel estimation and signal detection in OFDM systems. IEEE Wirel. Commun. Lett. 2018, 7, 114–117. [Google Scholar] [CrossRef]
Salh, A.; Audah, L.; Shah, N.S.; Hamzah, S.A. A Study on the Achievable Data Rate in Massive MIMO System. In AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2017; Volume 1883, pp. 1–11. [Google Scholar]
Almutairi, A.H. A Deep Learning Framework for Blockage Mitigation in mmWave Wireless. Ph.D. Thesis, Portland State University, Portland, OR, USA, 2024; pp. 1–58. [Google Scholar]
Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
Sanguinetti, L.; Zappone, A.; Debbah, M. Deep learning power allocation in massive MIMO. In Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 28–31 October 2018; pp. 1257–1261. [Google Scholar]
Wang, T.; We, C.; Jin, S.; Li, G.Y. Deep learning-based CSI feedback approach for time-varying massive MIMO channels. IEEE Wirel. Commun. Lett. 2019, 8, 416–419. [Google Scholar] [CrossRef]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
Zamanipour, M.A. Survey on deep-learning based techniques for modeling and estimation of massive MIMO channels. arXiv 2019, arXiv:1910.03390. [Google Scholar]
Salh, A.; Ngah, R.; Hussain, G.A.; Alhartomi, M.; Boubkar, S.; Shah, N.S.; Alsulami, R.; Alzahrani, S. Energy-efficient power allocation with hybrid beamforming for millimetre-wave 5G massive MIMO system. Wirel. Pers. Commun. 2020, 115, 43–59. [Google Scholar] [CrossRef]
Chien, T.V.; Canh, T.N.; Bjornson, E.; Larsson, E.G. Power control in cellular massive MIMO with varying user activity: A deep learning solution. IEEE Trans. Wirel. Commun. 2020, 19, 5732–5748. [Google Scholar] [CrossRef]
Bai, S.; Rauf, M.; Khan, A.M.; Kumar, S.; Kumar, H.; Mirza, E.A. Applications of deep learning for millimeter wave. J. Independ. Stud. Res. Comput. 2021, 19, 38–47. [Google Scholar] [CrossRef]
Li, Z.; Han, S.; Molisch, A.F. Hybrid beamforming design for millimeter-wave multi-user massive MIMO downlink. In Proceedings of the IEEE International Conference on Communications (ICC), Kuala Lumpur, Malaysia, 22–27 May 2016; pp. 1–6. [Google Scholar]
Zhang, Y.; Mu, Y.; Liu, Y.; Zhang, T.; Qian, Y. Deep learning-based beamspace channel estimation in mmWave massive MIMO systems. IEEE Wirel. Commun. Lett. 2020, 9, 2212–2215. [Google Scholar] [CrossRef]
Payami, S.; Sellathurai, M.; Nikitopoulos, K. Low-complexity hybrid beamforming for massive MIMO systems in frequency-selective channels. IEEE Access 2019, 7, 36195–36206. [Google Scholar] [CrossRef]
Liang, F.; Shen, C.; Wu, F. An iterative BP-CNN architecture for channel decoding. IEEE J. Sel. Top. Signal Process. 2018, 12, 144–159. [Google Scholar] [CrossRef]
Elbir, A.M. CNN-based precoder and combiner design in mmwave MIMO systems. IEEE Commun. Lett. 2019, 23, 1240–1243. [Google Scholar] [CrossRef]
Chen, L.; Sefat, S.M.; Kim, K.L. An optimal algorithm for mmWave 5G wireless networks based on neural network. Heliyon 2023, 9, 15156–15174. [Google Scholar] [CrossRef]
Elsayed, M.; Erol-Kantarci, M. Radio resource and beam management in 5G mmWave: A survey of machine learning-based approaches. IEEE Commun. Surv. Tutor. 2020, 22, 2232–2271. [Google Scholar]
Sim, M.S.; Lim, Y.G.; Park, S.H.; Dai, L.; Chae, C.B. Deep learning-based mmWave beam selection for 5G NR/6G with sub-6 GHz channel information: Algorithms and prototype validation. IEEE Access 2020, 8, 51634–51646. [Google Scholar] [CrossRef]
Alkhateeb, A.; Alex, S.; Varkey, P.; Li, Y.; Qu, Q.; Tujkovic, D. Deep learning coordinated beamforming for highly-mobile millimeter wave systems. IEEE Access 2018, 6, 37328–37348. [Google Scholar] [CrossRef]
Huang, H.; Song, Y.; Yang, J.; Gui, G.; Adachi, F. Deep-learning-based millimeter-wave massive MIMO for hybrid precoding. IEEE Trans. Veh. Technol. 2019, 68, 3027–3032. [Google Scholar] [CrossRef]
Arab, H.; Ghaffari, I.; Chioukh, L.; Tatu, S.; Dufour, S. Machine learning based object classification and identification scheme using an embedded millimeter-wave radar sensor. Sensors 2021, 21, 4291. [Google Scholar] [CrossRef] [PubMed]
Zhang, R.; Tan, W.; Nie, W.; Wu, X.; Liu, T. Deep Learning-Based Channel Estimation for mmWave Massive MIMO Systems in Mixed-ADC Architecture. Sensors 2022, 22, 3938. [Google Scholar] [CrossRef] [PubMed]
Chen, J.; Tao, J.; Luo, S.; Li, S.; Zhang, C.; Xiang, W. A deep learning driven hybrid beamforming method for millimeter wave MIMO system. Digit. Commun. Netw. 2022, 8, 221–230. [Google Scholar] [CrossRef]
Gao, J.; Zhong, C.; Li, G.Y.; Soriaga, J.B.; Behboodi, A. Deep Learning-based Channel Estimation for Wideband Hybrid MmWave Massive MIMO. IEEE Trans. Commun. 2023, 71, 231–238. [Google Scholar] [CrossRef]
Peken, T.; Adiga, S.; Tandon, R.; Bose, T. Deep learning for SVD and hybrid beamforming. IEEE Trans. Wirel. Commun. 2020, 19, 6621–6642. [Google Scholar] [CrossRef]
Jiang, X.; Kaltenberger, F. Channel reciprocity calibration in TDD hybrid beamforming massive MIMO systems. IEEE J. Sel. Top. Signal Process. 2018, 12, 422–431. [Google Scholar] [CrossRef]
Akdeniz, M.R.; Liu, Y.; Samimi, M.K.; Sun, S.; Rangan, S.; Rappaport, T.S.; Erkip, E. Millimeter wave channel modeling and cellular capacity evaluation. IEEE J. Sel. Areas Commun. 2014, 32, 1164–1179. [Google Scholar] [CrossRef]
Maimaiti, S.; Chuai, G.; Gao, W.; Zhang, J. Beam Allocation and Power Optimization for Energy-Efficiency in Multiuser mmWave Massive MIMO System. Sensors 2021, 21, 2550. [Google Scholar] [CrossRef] [PubMed]
Giang, H.T.; Hoan, T.N.; Thanh, P.D.; Koo, I. Hybrid NOMA/OMA-based dynamic power allocation scheme using deep reinforcement learning in 5G networks. Appl. Sci. 2020, 10, 4236. [Google Scholar] [CrossRef]
Zhang, Y.; Heath, R.W. Reinforcement learning-based joint user scheduling and link configuration in millimeter-wave networks. IEEE Trans. Wirel. Commun. 2022, 22, 3038–3054. [Google Scholar] [CrossRef]
Narengerile, N.; Thompson, J.; Patras, P.; Ratnarajah, T. Deep reinforcement learning-based beam training with energy and spectral efficiency maximisation for millimetre-wave channels. EURASIP J. Wirel. Commun. Netw. 2022, 2022, 110–122. [Google Scholar] [CrossRef]
Pradhan, C.; Li, A.; Zhuo, L.; Li, Y.; Vucetic, B. Hybrid-precoding for mmWave multi-user communications in the presence of beam-misalignment. IEEE Trans. Wirel. Commun. 2020, 19, 6083–6099. [Google Scholar] [CrossRef]
Shi, Q.; Hong, M. Spectral Efficiency Optimization for Millimeter Wave Multiuser MIMO Systems. IEEE J. Sel. Top. Signal Process. 2018, 12, 455–468. [Google Scholar] [CrossRef]
Li, H.; Li, M.; Liu, Q. Hybrid beamforming with dynamic subarrays and low-resolution PSs for mmWave MU-MISO systems. IEEE Trans. Commun. 2019, 68, 602–614. [Google Scholar] [CrossRef]
Nachmani, E.; Be’ery, Y.; Burshtein, D. Learning to decode linear codes using deep learning. In Proceedings of the 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 27–30 September 2016; pp. 341–347. [Google Scholar]
Gruber, T.; Cammerer, S.; Hoydis, J.; ten Brink, S. On deep learning-based channel decoding. In Proceedings of the 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, 22–24 March 2017; pp. 1–6. [Google Scholar]
Cousik, T.S.; Shah, V.K.; Erpek, T.; Sagduyu, Y.E.; Reed, J.H. Deep learning for fast and reliable initial access in AI-driven 6G mmWave networks. IEEE Trans. Netw. Sci. Eng. 2022, 11, 5668–5680. [Google Scholar] [CrossRef]
Khan, M.I.; Reggiani, L.; Alam, M.M.; Le Moullec, Y.; Sharma, N.; Yaacoub, E.; Magarini, M. Q-learning based joint energy-spectral efficiency optimization in multi-hop device-to-device communication. Sensors 2020, 20, 6692. [Google Scholar] [CrossRef]
Hao, Y.; Ni, Q.; Li, H.; Hou, S. Energy and spectral efficiency tradeoff with user association and power coordination in massive MIMO enabled HetNets. IEEE Commun. Lett. 2016, 20, 2091–2094. [Google Scholar] [CrossRef]
Zhang, T.; Liu, S.; Xiang, W.; Xu, L.; Qin, K.; Yan, X. A real-time channel prediction model based on neural networks for dedicated short-range communications. Sensors 2019, 19, 3541. [Google Scholar] [CrossRef] [PubMed]
Narengerile, N.; Thompson, J.; Patras, P.; Ratnarajah, T. Deep reinforcement learning-based beam training for spatially consistent millimeter wave channels. In Proceedings of the 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Helsinki, Finland, 13–16 September 2021; pp. 579–584. [Google Scholar]
Liu, F.; Zhang, L.; Yang, X.; Li, T.; Du, R. DL-Based Energy-Efficient Hybrid Precoding for mmWave Massive MIMO Systems. IEEE Trans. Veh. Technol. 2022, 72, 6103–6112. [Google Scholar] [CrossRef]
Bai, L.; Wang, C.X.; Huang, J.; Xu, Q.; Yang, Y.; Goussetis, G.; Sun, J.; Zhang, W. Predicting wireless mmWave massive MIMO channel characteristics using machine learning algorithms. Wirel. Commun. Mob. Comput. 2018, 2018, 1–20. [Google Scholar] [CrossRef]
Cammerer, S.; Gruber, T.; Hoydis, J.; ten Brink, S. Scaling deep learning-based decoding of polar codes via partitioning. In Proceedings of the 2017 IEEE Global Communications Conference (GLOBECOM), Singapore, 4–8 December 2017; pp. 1–6. [Google Scholar]
Li, J.; Zhang, Q.; Xin, X.; Tao, Y.; Tian, Q.; Tian, F.; Chen, D.; Shen, Y.; Cao, G.; Gao, Z.; et al. Deep learning-based massive MIMO CSI feedback. IEEE Wirel. Commun. Lett. 2018, 7, 1–13. [Google Scholar]
Wang, J.; Wu, D. Millimeter-wave multimedia communications: Challenges, methodology, and applications. IEEE Commun. Mag. 2015, 53, 232–238. [Google Scholar]
Park, G.; Chandrasegar, V.K.; Koh, J. Accuracy Enhancement of Hand Gesture Recognition Using CNN. IEEE Access 2023, 11, 26496–26501. [Google Scholar] [CrossRef]
Lee, G.; Choi, M. Link adaptation for high-quality uncompressed video streaming in 60-GHz wireless networks. IEEE Trans. Multimed. 2016, 18, 627–642. [Google Scholar]
Salh, A.; Shah, N.S.; Audah, L.; Mohd Shah, S.; Hamzah, S.A. Distributed Power Control for 5G Millimeter Wave Dense Small Cell. J. Adv. Res. Dyn. Control. Syst. 2020, 12, 656–663. [Google Scholar]
Park, S.; Alkhateeb, A.; Heath, R.W. Dynamic Subarrays for Hybrid Precoding in Wideband mmWave MIMO Systems. IEEE Trans. Wirel. Commun. 2017, 16, 2907–2920. [Google Scholar] [CrossRef]
Salh, A.; Abdullah, Q.; Hussain, G.; Ngah, R.; Audah, L.; Shah, N.S.; Hamzah, S. A New Technique for Improving Energy Efficiency in 5G Mm-Wave Hybrid Precoding Systems. In Proceedings of the International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 25–26 October 2022; pp. 1–7. [Google Scholar]
Shehata, M.; Mokh, A.; Crussière, M.; Hélard, M.; Pajusco, P. On the Equivalence of Hybrid Beamforming to Full Digital Zero Forcing in mmWave MIMO. In Proceedings of the 26th International Conference on Telecommunication (ICT 2019), Hanoi, Vietnam, 8–10 April 2019; pp. 25–33. [Google Scholar]
Liao, Y.; Hua, Y.; Dai, X.; Yao, H.; Yang, X. ChanEstNet: A Deep Learning Based Channel Estimation for High-Speed Scenarios. In Proceedings of the IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
Wang, R.; Klaine, P.V.; Onireti, O.; Sun, Y.; Imran, M.A.; Zhang, L. Deep Learning Enabled Beam Tracking for Non-Line of Sight Millimeter Wave Communications. IEEE Open J. Commun. Soc. 2021, 2, 1710–1720. [Google Scholar] [CrossRef]
Moon, S.; Kim, H.; You, Y.H.; Kim, C.H.; Hwang, I. Online Learning-Based Beam and Blockage Prediction for Indoor Millimeter-Wave Communications. ICT Express 2022, 8, 1–6. [Google Scholar] [CrossRef]
Lin, T.; Cong, J.; Zhu, Y.; Zhang, J.; Ben Letaief, K. Hybrid Beamforming for Millimeter Wave Systems Using the MMSE Criterion. IEEE Trans. Commun. 2019, 67, 3693–3708. [Google Scholar] [CrossRef]
Salh, A.; Audah, L.; Abdullah, Q.; Aydoğdu, Ö.; Alhartomi, M.A.; Alsamhi, S.H.; Almalki, F.A.; Shah, N.S. Low Computational Complexity for Optimizing Energy Efficiency in mm-Wave Hybrid Precoding System for 5G. IEEE Access 2021, 10, 4714–4727. [Google Scholar] [CrossRef]
Xu, F.; Yu, X.; Li, M.; Wen, B. Energy-Efficient Power Allocation Scheme for Hybrid Precoding mmWave-NOMA System with Multi-User Pairing. In Proceedings of the IEEE International Conference on Communications Workshops (ICC Workshops), Shanghai, China, 20–24 May 2019; pp. 1–5. [Google Scholar]
Hei, Y.; Liu, C.; Li, W.; Ma, L.; Lan, M. CNN-based hybrid precoding for mmwave MIMO systems with adaptive switching module and phase modulation array. IEEE Trans. Wirel. Commun. 2022, 21, 10489–10501. [Google Scholar] [CrossRef]
Xu, J.; Ai, B. Experience-Driven Power Allocation Using Multi-Agent Deep Reinforcement Learning for Millimeter-Wave High-Speed Railway Systems. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5490–5500. [Google Scholar] [CrossRef]
He, H.; Jiang, H. Deep Learning Based Energy Efficiency Optimization for Distributed Cooperative Spectrum Sensing. IEEE Wirel. Commun. 2019, 26, 32–39. [Google Scholar] [CrossRef]

Figure 1. Channel estimation in OFDM systems using a DNN [61].

Figure 2. A representation of the DNN training in mm-wave sets for training [50].

Figure 3. Massive MIMO system, the BS receiver employs 1-bit ADC converters with DL [32].

Figure 4. Related SE to a number of iterations.

Figure 5. SE versus factor.

Figure 6. EE versus factor.

Figure 7. Related of SE to SNR.

Figure 8. Related of EE to SNR.

Figure 9. Reward versus SNR.

Figure 10. SE versus a number of antennas.

Figure 11. SE versus a sample of an mm-wave channel.

Figure 12. EE versus time step for memory and processing optimization.

Figure 13. SE versus time step for memory and processing optimization.

Table 1. mm-wave scenarios applications.

Ref.	Frequency (GHz)	Scenario	Application
[63]	60, 70	Indoor	Mix-Media
[64]	60	Wireless LAN	Access to uplink channels
[65]	28, 38, 71, 81, 86	Urban areas	Access and backhaul
[66]	30–60	Wireless PAN	Transmission of video in ultra-high definition
[67]	60	Wireless PAN	Quick device-to-device data transfer

Table 2. Solutions based on mm-wave application for existing challenges [62,63,64,65,66].

Current Issues	Mm-Wave Solution
Lack of bandwidth or the need for additional bandwidth, particularly for D2D or sophisticated UAV applications, is the issue [62,63].	For applications like UAV or D2D communications, in particular, the high B.W. need requirements can be accommodated by the ample vacant mm-wave bandwidth.
Challenges related to delays in the transmission and communication of data over wireless networks [64].	The mm-wave clocks in at less than 10 milliseconds, demonstrating very little latency. This feature allows for improved performance and resolves latency problems that are frequently associated with wireless data transmission.
Growing need for downsizing [65].	The wavelengths of mm-wave transmissions range from 10 mm to 1 mm, making them ideal for creating extremely small antenna designs for data transmission and wireless power.
Challenges related to optimizing caching in wireless communication networks [66].	Applications based on mm-waves can improve end-user experience, increase the capacity of front-end links with limited bandwidth, and reduce backhaul burden in the core network due to the ample bandwidth available around 250 GHz.

Table 3. Solutions based on mm-wave application for existing challenges [62,63,64,65,66].

Research Challenges	Future Perspectives
Processing data in high dimensions under dynamic channel conditions [62,63,64].	Create scalable DL models for high-dimensional channel estimation in real-time.
Accurate capture of CSI for effective beamforming [64].	For accurate CSI reconstruction and beamforming optimization, use sophisticated DNN frameworks.
Intricate relationships between antenna arrays, performance measures, and channel circumstances [65].	Develop adaptive beamforming techniques for different situations by utilizing DL.
Managing higher mm-wave path loss while maintaining reliable signal processing.	Combine DL with spatial data to enhance signal tracking and channel estimation.
Computational complexity in DL model training and activation in real-time [66].	Optimize DL models for edge processing-appropriate, lightweight implementations.

Table 4. Comparison of existing research on DL mm-wave.

Ref.	Year	AI		Application on Communication	Performance of AI	Open Challenges	Future Direction	Remarks
Ref.	Year	DL	ML	Application on Communication	Performance of AI	Open Challenges	Future Direction	Remarks
[33]	2020	/	x	/	/	x	/	This paper focuses on how Deep IA reduces the beam sweep time compared to utilizing only a subset of the available beams.
[46]	2021	/		/	/	/	x	To reduce an MU-mm-wave mMIMO system’s hardware cost and downlink power usage, this article investigates beam allocation and power optimization.
[48]	2019	x	/	/	/	/	x	This paper focuses on developing ML by integrating graph neural networks and DRL to improve the overall system EE.
[49]	2022		/	/	/	x	/	This work focuses on a novel method that can learn to forecast both short- and long-term variations: the NN-based channel prediction model.
[54]	2022	/	x	/	/	/	/	This paper intends to address the linear ISI using a former RNN with memory that is combined with two cascading sub-networks, i.e., the DL detector.
[57]	2019	/	x	/	/	/	/	This paper focuses on a novel beam training algorithm via DRL for mm-wave channels considering user mobility effects to improve EE and SE.
This article	2025	/	/	/	/	/	/	This article on AI performance evaluation and improving the hybrid precoding in MU-mMIMO systems efficiency by adjusting to changes in wireless channels and optimizing EE and SE.

Table 5. Simulation parameters.

Parameter	Value
Bandwidth	80 MHz
$Transmit power P$	20 dBm
Adjusting beam	3
CPU clock speed mobile edge server	${1.2 \times 10}^{12}$ Hz
Batch size	64
$Time slot duration τ$	0.05
Noise figure	0.1
Velocity of UEs	0.5 m/s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Salh, A.; Alhartomi, M.A.; Hussain, G.A.; Jing, C.J.; Shah, N.S.M.; Alzahrani, S.; Alsulami, R.; Alharbi, S.; Hakimi, A.; Almehmadi, F.S. Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems. J. Sens. Actuator Netw. 2025, 14, 20. https://doi.org/10.3390/jsan14010020

AMA Style

Salh A, Alhartomi MA, Hussain GA, Jing CJ, Shah NSM, Alzahrani S, Alsulami R, Alharbi S, Hakimi A, Almehmadi FS. Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems. Journal of Sensor and Actuator Networks. 2025; 14(1):20. https://doi.org/10.3390/jsan14010020

Chicago/Turabian Style

Salh, Adeb, Mohammed A. Alhartomi, Ghasan Ali Hussain, Chang Jing Jing, Nor Shahida M. Shah, Saeed Alzahrani, Ruwaybih Alsulami, Saad Alharbi, Ahmad Hakimi, and Fares S. Almehmadi. 2025. "Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems" Journal of Sensor and Actuator Networks 14, no. 1: 20. https://doi.org/10.3390/jsan14010020

APA Style

Salh, A., Alhartomi, M. A., Hussain, G. A., Jing, C. J., Shah, N. S. M., Alzahrani, S., Alsulami, R., Alharbi, S., Hakimi, A., & Almehmadi, F. S. (2025). Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems. Journal of Sensor and Actuator Networks, 14(1), 20. https://doi.org/10.3390/jsan14010020

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Reinforcement Learning-Driven Hybrid Precoding for Efficient Mm-Wave Multi-User MIMO Systems

Abstract

1. Introduction

1.1. Related Works

1.2. Motivation and Contributions

2. Wireless B5G Difficulties with DL-Based Mm-Wave

2.1. DL-Based mMIMO

2.2. DL-Based Mm-Wave

3. Leveraging DL of Mm-Wave in B5G

3.1. DL in Artificial Intelligence-Driven B5G Mm-Wave Networks

3.2. Enhancing Channel Prediction Accuracy

4. Optimizing EE- and SE-Based Beam Training

4.1. RBT-DRL L-Based Beam Training with EE and SE

4.2. A Low-Complexity AB-HP

4.3. Reward

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI