
Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

Télécom Paris, 91120 Paris, France
* Author to whom correspondence should be addressed.
Entropy 2024, 26(8), 626; https://doi.org/10.3390/e26080626
Submission received: 2 April 2024 / Revised: 14 June 2024 / Accepted: 10 July 2024 / Published: 25 July 2024

Abstract

This paper proposes methods for Machine Learning (ML)-based Beam Alignment (BA) that use low-complexity ML models and achieve a small pilot overhead. We assume a single-user massive mmWave MIMO Uplink with a fully analog architecture. Assuming large codebooks of possible beam patterns at the UE and the BS, this data-driven, model-based approach partially and blindly sounds a small subset of beams from these codebooks. The proposed BA is blind (no CSI), is based on Received Signal Energies (RSEs), and circumvents the need to exhaustively sound all possible beams. The sub-sampled subset of beams is then used to train several ML models: low-rank Matrix Factorization (MF), non-negative MF (NMF), and a shallow Multi-Layer Perceptron (MLP). We provide an extensive mathematical description of these models and of the algorithms for each of them. Our extensive numerical results show that, by sounding only 10% of the beams from the UE and BS codebooks, the proposed ML tools accurately predict the non-sounded beams across multiple transmitted power regimes. This observation holds as the codebook sizes at the UE and BS vary from 128 × 128 to 1024 × 1024.

1. Introduction

Driven by the explosive growth of large-scale connectivity and higher data rate systems, wireless data traffic is expected to increase exponentially, growing to 5 zettabytes per month and reaching a 100 Gbps data rate by 2030 [1]. Thus, the latency of the 6th Generation (6G) is predicted to reach 0.1 ms, i.e., 10% of the 5G latency, in order to support new emerging technical needs, including holographic images, Internet of Things applications, and autonomous driving.
Beam Alignment is frequently referred to in the literature as beam sounding or beam training. It is a fundamental problem in millimeter-wave Multiple Input, Multiple Output (MIMO) systems, defined as the exchange of information between the user equipment (UE) and the base station (BS) in order to accurately select the optimal beam-steering direction. The process of aligning the beams is related to several technical problems, such as beamforming, beam sweeping, beam tracking, and beam selection. The whole framework that unites these operations between UE and BS is often called Beam Management. To fulfill the BA task, beam patterns stored in large codebooks are used at both UE and BS. In fact, pencil beams with directional gain are increasingly being used in several applications in order to alleviate the severe path-loss attenuation and to increase capacity and data throughput. Moreover, massive MIMO systems provide large gains in spectral and energy efficiency compared with conventional MIMO systems. Using mmWave technology, these systems mainly offer better communication quality by increasing the system bandwidth and reducing the effects of noise and interference. Given the diversification of future 5G and 6G applications and intelligent systems, scientists predict the continuous generation of massive datasets to be processed over large bandwidths, which makes the mmWave bands the golden spectrum candidates. However, the physical limitations of the mmWave channel are crucial: scattering, attenuation, low coherence time related to the Doppler effect, penetration loss, environmental constraints, and complex channel modeling in realistic urban scenarios. The major problem we address in this paper is the inevitably high signaling/training overhead. The main trade-off is therefore to find the most accurate and least complex ML algorithm that identifies the optimal beam pair based on sounded instantaneous Received Signal Energies while using the minimum possible amount of training samples.
Contributions: In this work, we propose ML-based BA methods for a single-user massive mmWave MIMO Uplink with a wide-band channel. We assume a single radio frequency chain at UE and large codebooks of possible analog beams at BS (the BS codebook) and UE (the UE codebook). We define a beam pair as one beam from the BS codebook and one from the UE codebook. By approximating the SNR with the Received Signal Energy (RSE), we bypass the need for CSI, i.e., a blind approach. We sub-sample the large codebooks into smaller sub-sampled BS and UE codebooks, and sound the beam pairs from the sub-sampled codebooks to generate the training set—a novelty of the approach. Using the RSE of the sounded beam pairs (sub-sampled codebooks), we propose to train the following ML methods to predict the RSE of the beam pairs that were not sounded: Matrix Factorization (MF), non-negative Matrix Factorization (NMF), and a feed-forward (shallow) Multi-Layer Perceptron (MLP).
  • We formulate the MF and NMF problems. We propose to use Block Coordinate Descent (BCD) and Block Gradient Descent (BGD) methods to solve each problem. We derive in depth all the update equations for these methods. We show that the BCD method converges to a stationary point for both the MF and NMF problems. Our extensive numerical results show that, by sub-sampling 10% of the BS/UE codebooks, the remaining RSE values can be predicted extremely well (with a training/test error on the order of $10^{-6}$) for every antenna configuration.
  • We develop at length the equations of a general MLP model, the resulting loss function, and the corresponding optimization problem. In addition, we derive the back-propagation equations for the MLP in question. Using extensive numerical results, we observe that sounding 10% of the original codebooks is sufficient to predict the RSE of the beam pairs that were not sounded, with negligible training/test error.
  • We numerically compare the training/test losses of all the proposed models for a varying cardinality of codebooks and transmit powers. These results suggest that the BCD method for MF/NMF outperforms the MLP in terms of training and test error. Meanwhile, BCD for MF/NMF has a large computational complexity and the MLP exhibits medium complexity.
  • Interestingly, by sounding  10 %  of the BS/UE codebooks, the proposed ML models can predict the unknown RSE (beam pairs not sounded) with a negligible test error. Thus, the proposed methods achieve a  90 %  reduction in pilot signaling overhead, compared with the SotA benchmark, without any noticeable loss in performance.
Notations: Matrices and vectors are written in boldface upper-case and lower-case letters, respectively. We use $\mathrm{Tr}[\mathbf{A}]$, $\mathbf{A}^T$, $\mathbf{A}^{-1}$, $\mathbf{A}^H$, $|\mathbf{A}|$, and $\|\mathbf{A}\|_F$ for the trace, transpose, inverse, conjugate transpose, determinant, and Frobenius norm of a matrix $\mathbf{A}$, and $\mathbf{I}_n$ for the $n \times n$ identity matrix. $[\mathbf{A}]_{i,j}$ denotes the $(i,j)$th entry of a matrix $\mathbf{A}$. We denote the Hadamard product by $\circ$, while $[\mathbf{a}]_+ := \max(\mathbf{a}, 0)$ denotes the Euclidean projection of $\mathbf{a}$ onto $\mathbb{R}^D_+$, applied element by element on $\mathbf{a}$. We denote by $|x|$ the absolute value of $x$ and by $[\mathbf{x}]_t$ the entry $t$ of a vector $\mathbf{x}$.
Methods/Experiment: The proposed approach is data driven and model based. The dataset is generated following the Saleh-Valenzuela wide-band mmWave system model. It consists of the Received Signal Energies of each and every beam pair in the massive MIMO Uplink setup, stored in separate .csv files. The model-based solution to the empirical risk minimization includes deriving a closed-form solution to the formulated non-convex optimization problem, stating the theoretical guarantees of convergence, and empirically illustrating the success of the proposed partial and blind Beam Alignment procedure using different algorithms. All simulations are executed on Infres GPU servers and the Comelec laboratory PC at Télécom Paris, with the following characteristics: Intel(R) Core(TM) i5-8365U CPU @ 1.60 GHz, 16 GB RAM, x64 processor, and a 64-bit operating system under the license of Windows 10 Enterprise LTSC 2018, version 1809. The manufacturer is Dell, located in Paris, France. All Python packages used in this work (numpy, scipy, keras, pytorch, matplotlib, etc.) correspond to the Python 3.9 release. The experimental protocol is based on offline grid-search cross-validation, which requires GPU processing for the selection of optimal hyperparameters, and online training/prediction for Matrix Factorization, non-negative Matrix Factorization, and the Multi-Layer Perceptron. The comparison is conducted following a Quality of Service-based approach, simulating a variety of MIMO configurations and architectural setups, investigating the impact of varying the Received Signal Energy regime, and empirically identifying intersections and differences in the impact of the transmit power on model behavior, loss values, the optimal signaling overhead ratio, and the optimal hyperparameters.
  • Problem Statement: The main challenge addressed in this study is the high signaling overhead in Beam Alignment for mmWave MIMO systems, which hampers the efficient selection of optimal beam-steering directions.
  • Research Questions and Hypotheses: This study investigates whether machine learning methods can effectively reduce the signaling overhead required for accurate beam-pair prediction in mmWave MIMO systems.
  • Objectives and Aims: The primary objective is to develop and evaluate ML-based BA methods that minimize the training overhead while maintaining high accuracy in predicting the RSE for unsounded beam pairs.
  • Significance and Rationale: The study proposes a novel approach to BA using ML techniques, which can lead to a substantial reduction in pilot signaling overhead and enhance the efficiency of future wireless communication systems.

2. Literature Survey

In conventional standards, Exhaustive BA, also called Brute-Force BA, is the de facto approach for the alignment process. It is based on sounding all available beams in both the UE and BS codebooks in order to exhaustively select the optimal beam pair. One obvious drawback is that the resulting signaling overhead scales as the product of the UE and BS codebook sizes. At 60 GHz, Exhaustive BA has been adopted in several mmWave WLAN and WPAN communication technologies, e.g., IEEE 802.15.3c [2] and IEEE 802.11ad [3]. It is conventionally applied in small MIMO configurations with small codebook sizes (e.g., codebooks of size 8 × 8 for LTE) and guarantees optimal performance. For cellular networks [4], V2X communications, Unmanned Aerial Vehicles, and High-Speed Train applications, the infeasibility of brute-force BA pushes researchers to reduce the large signaling overhead that arises from using massive antenna systems. State-of-the-art methods can be divided into two categories: classic BA and learning-based BA. Traditional techniques tend to use increasingly structured Beam Alignment designs, such as hierarchical multi-level codebooks [5] (training beamforming vectors are constructed with different beam widths at different levels); overlapped beam patterns [6], where the main idea is to augment the amount of information carried by each channel measurement, reducing the required channel estimation time; beam coding [7], where a unique code signature is assigned to each beam angle; and subspace estimation/decomposition-based BA [8]. Compressed sensing-based algorithms [9] are also used in this context, taking advantage of channel sparsity. We therefore note two common traits of the classic methods: they generally rely on CSI exchange and on Exhaustive BA. In contrast, Machine Learning (ML)-based BA has lately emerged and continues to produce promising results. For instance, statistical models such as the Kolmogorov model-based BA in [10], with sub-sampled codebooks, reduce the signaling overhead: 15% of Exhaustive BA provides accurate predictions of the optimal beams at UE and BS in a partial BA procedure, similar to our approach. Deep learning through shallow neural networks is increasingly used by wireless communication scientists, with two major paradigms. First, ML methods related to Supervised Learning (SL): Support Vector Machines and Multi-Layer Perceptrons for joint analog beam selection in [11], convolutional neural networks for beam management in sub-6 GHz bands in [12] and for calibrated beam training in [13], recurrent neural networks such as Long Short-Term Memory networks for beam tracking in [14,15,16], auto-encoders for beam management in [17], and several other neural architectures. Second, Reinforcement Learning (RL) in [18,19,20], generally used to solve Multi-Armed Bandit and Markov decision process formulations. In addition, neural architectures have the ability to extract features from the hidden interactions between BS and UE, providing fast and accurate estimations across different MIMO setups and channel realizations, especially when applied to massive datasets where more and more training samples are embedded. This work is an extension of [21]: in this paper, we extend the channel model to the wide-band case, add multiple RF chains at BS in a fully analog low-complexity architecture, and investigate more ML tools for partial and blind BA.
This paper is one of the first attempts to apply MF/NMF models and shallow Multi-Layer Perceptrons to blind and partial Beam Alignment for massive mmWave SU-MIMO. Our work in [22] follows the same approach and objectives, where we quantize the output of each RF chain.

3. System Model

In this section, we describe the mmWave MIMO point-to-point system model. We consider an Uplink transmission from a multiple-antenna user equipment (UE) using a single radio frequency chain to a multiple-antenna base station (BS) using multiple radio frequency chains. The proposed ML methods are performed at the BS, which has higher computational resources than the UE. Figure 1a,b provide a diagram representation of the proposed architecture. UE and BS are respectively equipped with Uniform Linear Arrays of $N_T$ and $N_R$ antennas. We propose a low-cost/complexity fully analog architecture where UE has one radio frequency chain and BS has $N_{rf}$ radio frequency chains. UE selects its analog beamformer $\mathbf{f}_u \in \mathbb{C}^{N_T}$ from a codebook of feasible beam choices $\{\mathbf{f}_u\}_{u \in \mathcal{T}}$, where $\mathcal{T}$ is the corresponding index set. Moreover, BS selects its analog combiner $\mathbf{W}_i \in \mathbb{C}^{N_R \times N_{rf}}$ from a codebook $\{\mathbf{W}_i\}_{i \in \mathcal{R}}$, with $\mathcal{R}$ the index set of the codebook. We denote by $C_T$ the number of possible beamforming vectors at UE, i.e., the size/cardinality of the UE codebook, $|\mathcal{T}| = C_T$, and by $C_R$ the size/cardinality of the BS codebook, $|\mathcal{R}| = C_R$. Both beamforming and combining are fully performed in the analog domain using phase shifters at UE and BS; thus, they satisfy the following constant modulus constraints, $\forall r \in \{1, \ldots, N_R\}, t \in \{1, \ldots, N_{rf}\}$:
$$\mathbf{W}_i \in \mathbb{C}^{N_R \times N_{rf}}, \quad |[\mathbf{W}_i]_{r,t}| = \frac{1}{\sqrt{N_{rf} N_R}}$$
$$\mathbf{f}_u \in \mathbb{C}^{N_T}, \quad |[\mathbf{f}_u]_t| = \frac{1}{\sqrt{N_T}}, \quad \forall t \in \{1, \ldots, N_T\}$$
For our proposed approach, BS is responsible for collecting the Received Signal Energies (RSEs) in order to learn their patterns and features, accurately predict the optimal beam indexes from the corresponding codebooks, and send them to UE. We adopt the wide-band channel model $\mathbf{G} \in \mathbb{C}^{N_R \times N_T}$ given by
$$\mathbf{G}(k) = \frac{1}{\sqrt{N_c}} \sum_{l=1}^{N_c} \mathbf{H}_l e^{-j 2 \pi l k / N_c}, \quad k \in \{1, \ldots, N_c\}$$
where $N_c$ represents the number of sub-carriers over the whole bandwidth in an OFDM scenario, $k$ is the sub-carrier index, and $\mathbf{H}_l \in \mathbb{C}^{N_R \times N_T}$ is the narrow-band channel model representing the time-domain channel impulse response with $L$-tapped delays, given by $\mathbf{H}_l = \sqrt{\frac{N_T N_R}{L}} \sum_{i=1}^{L} \rho_i \mathbf{a}_R(\theta_i^{(R)}) \mathbf{a}_T^H(\theta_i^{(T)})$, where $L$ is the number of paths (rank) of the channel; $\theta_i^{(R)}$ and $\theta_i^{(T)}$ are the angle of arrival (AoA) at BS and the angle of departure (AoD) from UE corresponding to the $i$th path (both assumed uniform over $[-\pi/2, \pi/2]$); $\rho_i$ is the complex gain of the $i$th path, with $\rho_i \sim \mathcal{CN}(0,1), \forall i$; and, last but not least, $\mathbf{a}_R(\theta_i^{(R)}) \in \mathbb{C}^{N_R}$ and $\mathbf{a}_T(\theta_i^{(T)}) \in \mathbb{C}^{N_T}$ are the array response vectors at BS and UE, respectively. We further assume that the channel is completely unknown to both UE and BS. Henceforth, we shall denote by the beam pair $(u,i)$ the combination of the UE beamformer indexed $u$ in the UE codebook $\mathcal{T}$ and the combiner indexed $i$ in the BS codebook $\mathcal{R}$. The signal at BS resulting from applying the beam pair $(u,i)$, $\mathbf{y}_{u,i} \in \mathbb{C}^{N_{rf}}$, is expressed as
$$\mathbf{y}_{u,i} = \mathbf{W}_i^H \mathbf{G}(k) \mathbf{f}_u s_u + \mathbf{n}_i, \quad \forall (u,i) \in \mathcal{T} \times \mathcal{R},$$
where $s_u = \sqrt{P_u}$ is the transmitted pilot symbol associated with $\mathbf{f}_u$ (having power $P_u$) and $\mathbf{n}_i = \mathbf{W}_i^H \mathbf{n}$ is the effective additive white Gaussian noise (AWGN) with unit variance ($\sigma^2 = 1$). We define the received Signal-to-Noise Ratio (SNR) of the beam pair $(u,i)$ as $\mathrm{SNR}_{u,i} = P_u \|\mathbf{W}_i^H \mathbf{G}(k) \mathbf{f}_u\|_2^2, \forall (u,i) \in \mathcal{T} \times \mathcal{R}$. We assume a fully blind approach; i.e., neither BS nor UE has any knowledge of $\mathbf{G}$. Thus, computing the above SNR expression is not feasible, since BS is assumed not to know $\mathbf{G}$. Instead, in this work, we approximate the SNR of the beam pair $(u,i)$ by the corresponding instantaneous Received Signal Energy (RSE), expressed as $\mathrm{RSE}_{u,i} = \|\mathbf{y}_{u,i}\|_2^2, \forall (u,i) \in \mathcal{T} \times \mathcal{R}$. In other words, we assume that $\mathrm{RSE}_{u,i} \approx \mathrm{SNR}_{u,i}$ for each beam pair $(u,i) \in \mathcal{T} \times \mathcal{R}$.
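To make the model concrete, the following minimal numpy sketch generates a Saleh-Valenzuela-style channel, DFT codebooks, and the RSE of one beam pair for a single sub-carrier. This is our illustration, not the authors' code; all sizes, seeds, and helper names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_T, N_R, N_rf, L, P_u, sigma2 = 128, 128, 4, 2, 1.0, 1.0

def ula(theta, n):
    # ULA array response vector (half-wavelength spacing assumed)
    return np.exp(1j * np.pi * np.arange(n) * np.sin(theta)) / np.sqrt(n)

rho = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)
aoa = rng.uniform(-np.pi / 2, np.pi / 2, L)      # theta^(R), AoA at BS
aod = rng.uniform(-np.pi / 2, np.pi / 2, L)      # theta^(T), AoD at UE
G = np.sqrt(N_T * N_R / L) * sum(
    rho[l] * np.outer(ula(aoa[l], N_R), ula(aod[l], N_T).conj())
    for l in range(L))                           # channel of one sub-carrier

F = np.fft.fft(np.eye(N_T)) / np.sqrt(N_T)       # UE DFT codebook, columns f_u
Wi = np.fft.fft(np.eye(N_R))[:, :N_rf] / np.sqrt(N_rf * N_R)  # one combiner W_i

u = 7                                            # an example UE beam index
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(N_R)
                               + 1j * rng.standard_normal(N_R))
y = Wi.conj().T @ (np.sqrt(P_u) * G @ F[:, u] + noise)   # received signal
rse = np.linalg.norm(y) ** 2                     # RSE_{u,i} = ||y_{u,i}||^2
```

Note that both codebooks here satisfy the constant modulus constraints in (1) and (2) by construction.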
Benchmark: Exhaustive BA: The de facto method for Beam Alignment is Exhaustive BA. It is accomplished by exhaustively sounding, jointly, the beams of both the UE and BS codebooks, recording all entries of the RSE matrix, and exhaustively searching $\mathbf{S}$ for the indexes of the beam pair that maximize the RSE at BS, i.e., $(u^\star, i^\star) = \arg\max_{(u,i) \in \mathcal{T} \times \mathcal{R}} \mathrm{RSE}_{u,i}$. The RSE matrix is recorded $N_{rf}$ entries at a time, one batch per pilot symbol, since $N_{rf}$ samples are simultaneously received at BS for every pilot transmission (see Figure 2). Consequently, the pilot signaling overhead of Exhaustive BA is $\Omega = |\mathcal{T} \times \mathcal{R}| / N_{rf} = C_T C_R / N_{rf}$, which implies that the overhead of this benchmark scales poorly with the BS and UE codebook sizes.
Proposed partial Beam Alignment using sub-sampled codebooks: Recall the designation of the beam pair $(u,i)$ as the beamforming vector of index $u$ in the UE codebook and the combining vector of index $i$ in the BS codebook. First, we select (at random) the indexes of the sub-sampled codebooks of beams at UE and BS, $\mathcal{R}_S$ and $\mathcal{T}_S$, such that $\mathcal{R}_S \subset \mathcal{R}$ and $\mathcal{T}_S \subset \mathcal{T}$, with $|\mathcal{R}_S| \ll |\mathcal{R}|$ and $|\mathcal{T}_S| \ll |\mathcal{T}|$. The idea behind this approach is to only sound beam pairs from the sub-sampled codebooks $\mathcal{R}_S$ and $\mathcal{T}_S$. We thus define the training set $K$ as the sub-sampled codebook indexes at UE and BS, i.e., $K := \{(u,i) \,|\, (u,i) \in \mathcal{T}_S \times \mathcal{R}_S\}$. Then, the RSE of the sounded beam pairs (training set) is given to several ML methods, and the learned ML model is used to predict the RSE of the non-sounded beam pairs.
We formalize this proposed method below. We express both the received signal $\mathbf{y}_{u,i}$ and the RSE of the beam pair $(u,i)$ resulting from the sounded beam pairs (i.e., the training set) as follows:
$$\mathbf{y}_{u,i} = \mathbf{W}_i^H \mathbf{G}(k) \mathbf{f}_u s_u + \mathbf{n}_i, \quad \forall (u,i) \in \mathcal{T}_S \times \mathcal{R}_S$$
$$\mathrm{RSE}_{u,i} = \|\mathbf{y}_{u,i}\|_2^2, \quad \forall (u,i) \in \mathcal{T}_S \times \mathcal{R}_S.$$
The dataset is formulated as the following incomplete RSE matrix, $\mathbf{S} \in \mathbb{R}^{C_T \times C_R} \ (:= \mathbb{R}^{|\mathcal{T}| \times |\mathcal{R}|})$:
[ S ] u , i : = RSE u , i , i f ( u , i ) T S × R S U n k n o w n   R S E , i f ( u , i ) T S × R S
where $[\mathbf{S}]_{u,i}$ denotes the element $(u,i)$ of $\mathbf{S}$, $\forall (u,i) \in \mathcal{T} \times \mathcal{R}$. Evidently, the value of the RSE is undefined for the beam pairs that were not sounded, designated as unknown RSE matrix coefficients. These are the missing entries, which are predicted using one of the following proposed ML methods: (i) low-rank MF/NMF and (ii) a shallow (feed-forward) MLP, where we utilize the sounded RSE entries as the training set $K$. The training set $K$ is fed into one of the above ML models, which predicts the RSE of the non-sounded coefficients in $\mathbf{S}$, denoted as 'Unknown' in (5) (see Figure 3). Finally, the pilot signaling overhead of the above sub-sampled codebook method is $\Omega = |\mathcal{T}_S \times \mathcal{R}_S| / N_{rf} = |K| / N_{rf}$. We split the RSE dataset into a training set $K$ and a test set $L$ such that $K \cap L = \emptyset$. In this paper, $\mathrm{RSE}_{u,i}$ denotes the true value (label) of the RSE for the beam pair $(u,i)$ in the training set $K$, and $\widehat{\mathrm{RSE}}_{u,i}$ denotes the true value (label) of the RSE for the beam pair $(u,i)$ in the test set $L$.
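A short sketch of how the incomplete matrix $\mathbf{S}$ in (5) and the training set $K$ can be formed (our illustration; `rse_of_beam_pair` is a hypothetical sounding routine, and NaN stands in for the "Unknown RSE" entries):

```python
import numpy as np

C_T, C_R, eta = 128, 128, 0.1
rng = np.random.default_rng(1)
# eta = (|T_S| * |R_S|) / (C_T * C_R); sample each codebook at sqrt(eta)
T_S = rng.choice(C_T, size=int(np.sqrt(eta) * C_T), replace=False)
R_S = rng.choice(C_R, size=int(np.sqrt(eta) * C_R), replace=False)

S = np.full((C_T, C_R), np.nan)                  # NaN marks "Unknown RSE"
for u in T_S:
    for i in R_S:
        S[u, i] = rse_of_beam_pair(u, i)         # hypothetical sounding call
K = [(u, i) for u in T_S for i in R_S]           # training set
L_set = [(u, i) for u in range(C_T) for i in range(C_R)
         if np.isnan(S[u, i])]                   # test set (non-sounded pairs)
```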
Signaling overhead ratio: It is defined as $\eta := \frac{\text{overhead of learning-based BA}}{\text{overhead of Exhaustive BA}} = \frac{|\mathcal{T}_S| \times |\mathcal{R}_S|}{|\mathcal{T}| \times |\mathcal{R}|} = \frac{|K|}{C_T C_R}$, where $\mathcal{T}_S$ and $\mathcal{R}_S$ are, respectively, the UE and BS sub-sampled codebooks used in our proposed partial beam sounding, $\mathcal{T}$ and $\mathcal{R}$ refer to the original codebooks, and $0 < \eta \leq 1$ measures the signaling overhead of the proposed MF/NMF and MLP methods relative to that of Exhaustive BA. Evidently, a small value of $\eta$ is desired to reduce the signaling overhead of our proposed method. However, a low $\eta$ implies that the size of the training set is small. As a result, the proposed ML method will not be able to extract enough data patterns from the (too) small number of training samples, resulting in a larger prediction error. As one of the contributions of this work, we (empirically) find as small a value of $\eta$ as possible while still achieving extremely small training and prediction errors.
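As a concrete illustration (our arithmetic, using the largest configuration considered later): for $C_T = C_R = 1024$ and $\eta = 0.1$, only $|K| = 0.1 \times 1024^2 \approx 104{,}858$ beam pairs are sounded instead of $C_T C_R = 1{,}048{,}576$, so the pilot overhead $\Omega = |K| / N_{rf}$ is one tenth of the Exhaustive BA overhead for any number of RF chains $N_{rf}$.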
Conjecture: Note that, from the equations of the narrow-band channel model $\mathbf{H}_l$ and the wide-band channel model $\mathbf{G}(k)$, it is simple to verify that $\mathrm{rank}(\mathbf{H}_l) \leq L$ and $\mathrm{rank}(\mathbf{G}(k)) \leq L N_c$. Assume that $P_u \to \infty$. Then, we can approximate the RSE matrix as
$$[\mathbf{S}]_{u,i} = \|\mathbf{y}_{u,i}\|_2^2 = \left\| \mathbf{W}_i^H \mathbf{G}(k) \mathbf{f}_u \sqrt{P_u} + \mathbf{n}_i \right\|_2^2 \xrightarrow{P_u \to \infty} P_u \left\| \mathbf{W}_i^H \mathbf{G}(k) \mathbf{f}_u \right\|_2^2, \quad \forall (u,i) \in \mathcal{T} \times \mathcal{R}$$
If $P_u \to \infty$, then it can be shown that the RSE matrix $\mathbf{S}$ satisfies $\mathrm{rank}(\mathbf{S}) \leq L N_c$. This implies that if $P_u \to \infty$, then $\mathbf{S} \in \mathbb{R}^{C_T \times C_R}$ is a low-rank matrix, i.e., $\mathrm{rank}(\mathbf{S}) \leq L N_c \ll \min(C_T, C_R)$.
While a proof of this necessary condition eludes the authors, we empirically observed that if $P_u$ is large, then the number of non-zero singular values of $\mathbf{S}$, $\{\sigma_i(\mathbf{S})\}_{i=1}^{\mathrm{rank}(\mathbf{S})}$, satisfies the above upper bound, i.e., $|\{\sigma_i(\mathbf{S})\}_{i=1}^{\mathrm{rank}(\mathbf{S})}| \leq L N_c$.
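A quick numerical way to check this observation (our sketch; `S_full` stands for a fully sounded RSE matrix, and the tolerance is an arbitrary choice):

```python
import numpy as np

L_paths, N_c = 2, 64                          # paths and sub-carriers
sv = np.linalg.svd(S_full, compute_uv=False)  # singular values of the RSE matrix
num_eff = int(np.sum(sv > 1e-9 * sv[0]))      # numerically non-zero count
print(num_eff, "<=", L_paths * N_c)           # expected to hold for large P_u
```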
Remark 1. 
Recall the expression for the effective rate $r$, $r = \left(1 - \frac{\Omega}{T}\right) \log(1 + \mathrm{RSE}_{u,i})$, where $\Omega$ is the pilot signaling overhead and $T$ is the number of symbols per block. Thus, the problem of maximizing $r$ can be written as the following series of equivalent problems:
$$(u^\star, i^\star) := \arg\max_{(u,i) \in \mathcal{T} \times \mathcal{R}} r \equiv \arg\max_{(u,i) \in \mathcal{T} \times \mathcal{R}} \log(1 + \mathrm{RSE}_{u,i}) \equiv \arg\max_{(u,i) \in \mathcal{T} \times \mathcal{R}} \mathrm{RSE}_{u,i},$$
where the last equivalence follows from the fact that $\log(x)$ is a strictly monotonically increasing function of $x$. This result implies that finding the optimal beam pair $(u^\star, i^\star)$ that maximizes $r$ is equivalent to finding the beam pair that maximizes the RSE.
Remark 2. 
The information (number of entries) needed to represent the RSE matrix $\mathbf{S} \in \mathbb{R}^{C_T \times C_R}$ is measured as $\mathrm{rank}(\mathbf{S}) (1 + C_T + C_R)$. This result follows from performing the SVD on $\mathbf{S}$ and counting the resulting number of entries. Thus, if $\mathbf{S}$ is severely rank deficient, i.e., extremely compressible, then methods such as MF/NMF will exhibit extremely small training and test errors. Conversely, if $\mathbf{S}$ is full rank, i.e., not compressible, then the training and test errors of MF/NMF will be quite large.
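As a worked example (our arithmetic, under the conjecture of Section 3): with $L = 2$ paths and $N_c = 64$ sub-carriers, $\mathrm{rank}(\mathbf{S}) \leq L N_c = 128$, so a $1024 \times 1024$ RSE matrix is described by at most $128 \times (1 + 1024 + 1024) = 262{,}272$ numbers instead of $1{,}048{,}576$ entries, i.e., it is compressible by roughly a factor of four.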

4. Matrix Factorization and Non-Negative Matrix Factorization

4.1. MF and NMF Problem Formulation

The intuition behind low-rank MF is to model the RSE of the sounded beam pairs (i.e., the entries of $\mathbf{S}$ known over $\mathcal{T}_S \times \mathcal{R}_S$) as an inner product between two $D$-dimensional latent vectors/factors, $\boldsymbol{\theta}_u$ and $\boldsymbol{\psi}_i$, as illustrated in Figure 4. Specifically, the RSE of the beam pair $(u,i)$, denoted $[\mathbf{S}]_{u,i}$, is modeled as $[\mathbf{S}]_{u,i} := \boldsymbol{\theta}_u^T \boldsymbol{\psi}_i$, $\boldsymbol{\theta}_u \in \mathbb{R}^D$, $\boldsymbol{\psi}_i \in \mathbb{R}^D$, $\forall (u,i) \in K \ (:= \mathcal{T}_S \times \mathcal{R}_S)$, where $D$ is the size/dimension/complexity of the Matrix Factorization latent factors and $\boldsymbol{\theta}_u, \boldsymbol{\psi}_i \in \mathbb{R}^D$ are the MF model parameters (to be optimized). Due to the low-rank MF model, $D$ is assumed to be much smaller than the dimensions of $\mathbf{S}$, i.e., $D \ll \min(C_T, C_R)$. The RSE of the beam pair $(u,i)$ is known from sounding the sub-sampled codebooks (i.e., the label). The loss function $\ell_{u,i}$ describes the distance between the true value $\mathrm{RSE}_{u,i}$ and the predicted value $\boldsymbol{\theta}_u^T \boldsymbol{\psi}_i$, which corresponds to the MF output/prediction: $\ell_{u,i} := (\mathrm{RSE}_{u,i} - \boldsymbol{\theta}_u^T \boldsymbol{\psi}_i)^2$, $\forall (u,i) \in K$. The Empirical Risk (also known as the training error) is defined as the average of the individual losses $\ell_{u,i}$. We define the regularized Empirical Risk function as the above empirical risk plus the following regularization terms:
$$f(\{\boldsymbol{\theta}_u, \boldsymbol{\psi}_i\}_{(u,i) \in K}) = \frac{1}{|K|} \sum_{(u,i) \in K} \left( [\mathbf{S}]_{u,i} - \boldsymbol{\theta}_u^T \boldsymbol{\psi}_i \right)^2 + \sum_i \lambda_i \|\boldsymbol{\psi}_i\|_2^2 + \sum_u \mu_u \|\boldsymbol{\theta}_u\|_2^2$$
where $\{\lambda_i \geq 0, \mu_u \geq 0 \,|\, (u,i) \in K\}$ is the set of regularization hyperparameters used to balance the MF/NMF model, preventing overfitting or underfitting. The Empirical Risk Minimization corresponding to the MF model is given by
( P 1 ) : = { θ ^ u , ψ ^ i } a r g m i n { θ u , ψ i } ( u , i ) K f ( θ u , ψ i ) s . t . θ u R D , ψ i R D
For the Matrix Factorization variant  N M F , the optimization problem is given by
( P 2 ) : = { θ ^ u , ψ ^ i } a r g m i n { θ u , ψ i } ( u , i ) K f ( θ u , ψ i ) s . t . θ u R + D , ψ i R + D
where $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}$ denotes the optimal latent vectors for MF and NMF. The test loss (also known as the test error) is obtained by applying the loss to the unknown data samples (non-sounded beams) using the optimal MF/NMF parameters $\hat{\boldsymbol{\theta}}_u$ and $\hat{\boldsymbol{\psi}}_i$: $\frac{1}{|L|} \sum_{(u,i) \in L} \left( \widehat{\mathrm{RSE}}_{u,i} - \hat{\boldsymbol{\theta}}_u^T \hat{\boldsymbol{\psi}}_i \right)^2$, where $L$ is the test set of our learning model.

4.2. Solutions for MF

We solve the MF problem (P1) using the following methods: (i) Block Coordinate Descent (BCD), often called Alternating Least Squares (ALS); (ii) BCD with Stochastic Gradient Descent; and (iii) Block Gradient Descent (BGD), which merges the BCD and Gradient Descent (GD) principles.
BCD for MF (BCD MF): BCD proceeds by splitting the optimization problem (P1) into sub-problems, each solved assuming all other blocks are known/fixed. We will show that each sub-problem is strongly convex in its block and that the BCD algorithm converges to a stationary point. Applying BCD to the MF problem results in two sub-problems, (S1) and (S2), which are solved iteratively. At iteration $k$, the sub-problem (S1) is defined by fixing the block $\{\boldsymbol{\psi}_i^{(k)}\}_i$ and updating/solving the block $\{\boldsymbol{\theta}_u\}_u$ only, as follows:
$$(S1): \quad \boldsymbol{\theta}_u^{(k+1)} = \arg\min_{\boldsymbol{\theta}_u \in \mathbb{R}^D} f(\{\boldsymbol{\theta}_u, \boldsymbol{\psi}_i^{(k)}\}) = \arg\min_{\boldsymbol{\theta}_u \in \mathbb{R}^D} \sum_{(u,i) \in K} \left[ \left( [\mathbf{S}]_{u,i} - \boldsymbol{\theta}_u^T \boldsymbol{\psi}_i^{(k)} \right)^2 + \mu_u \|\boldsymbol{\theta}_u\|_2^2 + \lambda_i \|\boldsymbol{\psi}_i^{(k)}\|_2^2 \right]$$
Moreover, the sub-problem (S2) is defined by fixing the block $\{\boldsymbol{\theta}_u^{(k+1)}\}_u$ in (P1) and updating/solving the block $\{\boldsymbol{\psi}_i\}_i$ only, as follows:
$$(S2): \quad \boldsymbol{\psi}_i^{(k+1)} = \arg\min_{\boldsymbol{\psi}_i \in \mathbb{R}^D} f(\{\boldsymbol{\theta}_u^{(k+1)}, \boldsymbol{\psi}_i\}) = \arg\min_{\boldsymbol{\psi}_i \in \mathbb{R}^D} \sum_{(u,i) \in K} \left[ \left( [\mathbf{S}]_{u,i} - \boldsymbol{\theta}_u^{(k+1)T} \boldsymbol{\psi}_i \right)^2 + \mu_u \|\boldsymbol{\theta}_u^{(k+1)}\|_2^2 + \lambda_i \|\boldsymbol{\psi}_i\|_2^2 \right]$$
We rewrite (S1) as a series of equivalent problems, as follows:
( S 1 ) : = a r g m i n θ u R d ( u , i ) K [ [ S ] u , i 2 2 [ S ] u , i θ u T ψ i ( k ) + θ u T ψ i ( k ) θ i ( k ) T θ u + μ u θ u 2 2 ] a r g m i n θ u R d u [ 2 θ u T i ( [ S ] u , i ψ i ( k ) ) + θ u T i ( ψ i ( k ) ψ i ( k ) T ) θ u + μ u θ u 2 2 ] a r g m i n θ u R d u U i [ 2 θ u T ( r u ( k ) ) + θ u T ( Q u ( k ) ) θ u + μ u θ u 2 2 ] = u U i h u ( θ u ) , θ u ( k + 1 ) = a r g m i n θ u R d [ 2 θ u T r u ( k ) + θ u T ( Q u ( k ) + μ u I D ) θ u ] = f 1 ( θ u ) , u U i ,
where $U_i$ is the set of row indexes $u$ corresponding to the known entries of the RSE matrix in column $i$, $\mathbf{Q}_u^{(k)} = \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T}$, and $\mathbf{r}_u^{(k)} = \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)}$. We derive the closed-form solution of sub-problem (S1) by finding the global minimum of $f_1(\boldsymbol{\theta}_u)$, as follows:
$$\nabla f_1(\boldsymbol{\theta}_u) = 0 \iff -2 \mathbf{r}_u^{(k)} + 2 \left( \mathbf{Q}_u^{(k)} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u = 0 \iff \boldsymbol{\theta}_u = \left( \mathbf{Q}_u^{(k)} + \mu_u \mathbf{I}_D \right)^{-1} \mathbf{r}_u^{(k)}$$
Similarly, we rewrite the sub-problem (S2) as a series of equivalent problems, stating only the last one:
$$(S2): \quad \boldsymbol{\psi}_i^{(k+1)} = \arg\min_{\boldsymbol{\psi}_i \in \mathbb{R}^D} \left[ -2 \mathbf{t}_i^{(k+1)T} \boldsymbol{\psi}_i + \boldsymbol{\psi}_i^T \left( \mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i \right] =: \arg\min_{\boldsymbol{\psi}_i \in \mathbb{R}^D} f_2(\boldsymbol{\psi}_i), \quad \forall i \in I_u,$$
where $I_u$ is the set of column indexes $i$ corresponding to the known entries of the RSE matrix in row $u$, $\mathbf{t}_i^{(k+1)} = \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k+1)}$, and $\mathbf{P}_i^{(k+1)} = \sum_u \boldsymbol{\theta}_u^{(k+1)} \boldsymbol{\theta}_u^{(k+1)T}$. Next, we derive the closed-form solution of sub-problem (S2) by finding the global minimum of $f_2(\boldsymbol{\psi}_i)$, as follows:
$$\nabla f_2(\boldsymbol{\psi}_i) = 0 \iff -2 \mathbf{t}_i^{(k+1)} + 2 \left( \mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i = 0 \iff \boldsymbol{\psi}_i = \left( \mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D \right)^{-1} \mathbf{t}_i^{(k+1)},$$
$$\boldsymbol{\psi}_i^{(k+1)} = \left( \sum_u \boldsymbol{\theta}_u^{(k+1)} \boldsymbol{\theta}_u^{(k+1)T} + \lambda_i \mathbf{I}_D \right)^{-1} \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k+1)} \right)$$
Thus, BCD updates to solve MF are given as follows:
$$\boldsymbol{\theta}_u^{(k+1)} = \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right)^{-1} \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} \right)$$
$$\boldsymbol{\psi}_i^{(k+1)} = \left( \sum_u \boldsymbol{\theta}_u^{(k+1)} \boldsymbol{\theta}_u^{(k+1)T} + \lambda_i \mathbf{I}_D \right)^{-1} \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k+1)} \right)$$
$$\forall (u,i) \in K, \quad k = 0, 1, \ldots, I_M$$
where $(k)$ is the BCD iteration index, $(u,i)$ are the codebook indexes at UE and BS, and $[\mathbf{S}]_{u,i}$ denotes the RSE of the beam pair $(u,i)$. The solution $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}_{(u,i) \in K}$ is reached once the gap between consecutive iterations falls below a predefined $\epsilon$ or a maximum number of iterations, $I_M$, is reached. We have the following result.
Corollary 1. 
The sequence of updates $\{\boldsymbol{\theta}_u^{(k)}, \boldsymbol{\psi}_i^{(k)} \,|\, (u,i) \in K\}_k$ generated by the BCD in (8) is non-increasing (in $k$) and converges to a stationary point as $k \to \infty$.
Proof. 
See Appendix A. □
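To make the update equations concrete, the following minimal numpy sketch implements the BCD (ALS) iterations in (8) on the masked matrix $\mathbf{S}$ built in Section 3 (our illustration, not the authors' code; `S`, `C_T`, and `C_R` are reused from the earlier sketch, the hyperparameter values are arbitrary, and `np.linalg.solve` replaces the explicit matrix inverse for numerical stability):

```python
import numpy as np

D, mu, lam, I_M = 4, 1e-4, 1e-4, 50
rng = np.random.default_rng(2)
theta = rng.standard_normal((C_T, D))         # rows theta_u
psi = rng.standard_normal((C_R, D))           # rows psi_i
mask = ~np.isnan(S)                           # True on sounded entries

for k in range(I_M):
    for u in range(C_T):                      # sub-problem (S1), closed form
        idx = np.flatnonzero(mask[u])
        if idx.size:
            P = psi[idx]                      # psi_i for sounded i in I_u
            theta[u] = np.linalg.solve(P.T @ P + mu * np.eye(D),
                                       P.T @ S[u, idx])
    for i in range(C_R):                      # sub-problem (S2), closed form
        idx = np.flatnonzero(mask[:, i])
        if idx.size:
            T_blk = theta[idx]                # theta_u for sounded u in U_i
            psi[i] = np.linalg.solve(T_blk.T @ T_blk + lam * np.eye(D),
                                     T_blk.T @ S[idx, i])
```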
Block Stochastic Gradient Descent (BSGD) for MF (SGD MF): SGD MF proceeds by taking $T$ plain SGD steps (mini-batch size $= 1$) for each BCD block. We first choose at random a single training sample $(u,i) \in K$. The BSGD update for the sub-problem (S1) is performed by applying SGD to $f_1(\boldsymbol{\theta}_u) = \sum_{u \in U_i} h_u(\boldsymbol{\theta}_u)$, i.e., choosing at random a single index $u \in U_i$ and computing the stochastic gradient $\widehat{\nabla f_1(\boldsymbol{\theta}_u)} = \nabla h_u(\boldsymbol{\theta}_u)$, where $u$ is a random index from $U_i$ and $\widehat{\nabla f_1(\cdot)}$ is the plain SGD gradient of $f_1(\cdot)$. The corresponding update is given as
$$\boldsymbol{\theta}_u^{(k+1)} = \boldsymbol{\theta}_u^{(k)} - \alpha_k \widehat{\nabla f_1(\boldsymbol{\theta}_u^{(k)})} = \boldsymbol{\theta}_u^{(k)} - \alpha_k \nabla h_u(\boldsymbol{\theta}_u^{(k)}) = \boldsymbol{\theta}_u^{(k)} + 2 \alpha_k \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} - \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u^{(k)} \right), \quad u \in U_i, \ k = 1, \ldots, T$$
where $u$ is a single index chosen at random from $U_i$, $\mathbf{Q}_u^{(k)} = \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T}$, $\mathbf{r}_u^{(k)} = \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)}$, $(k)$ is the SGD iteration index, and $\widehat{\nabla f_1(\boldsymbol{\theta}_u)}$ is the plain SGD gradient over one random sample $u \in U_i$. Similarly, the update for the sub-problem (S2) is done by taking $T$ plain SGD steps of $f_2(\boldsymbol{\psi}) = \sum_{i \in I_u} h_i(\boldsymbol{\psi}_i)$, i.e., the stochastic gradient $\widehat{\nabla f_2(\boldsymbol{\psi}_i)} = \nabla h_i(\boldsymbol{\psi}_i)$, where $i$ is a single random index from $I_u$. Thus, the SGD MF update for the sub-problem (S2) is expressed as
$$\boldsymbol{\psi}_i^{(k+1)} = \boldsymbol{\psi}_i^{(k)} - \alpha_k \widehat{\nabla f_2(\boldsymbol{\psi}_i^{(k)})} = \boldsymbol{\psi}_i^{(k)} - \alpha_k \nabla h_i(\boldsymbol{\psi}_i^{(k)}) = \boldsymbol{\psi}_i^{(k)} + 2 \alpha_k \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)} - \left( \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i^{(k)} \right), \quad i \in I_u, \ k = 1, \ldots, T$$
where $i$ is a single index chosen randomly from $I_u$, $\mathbf{t}_i^{(k)} = \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)}$, $\mathbf{P}_i^{(k)} = \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T}$, and $\widehat{\nabla f_2(\boldsymbol{\psi}_i)}$ is the plain SGD gradient computed with one sample $i \in I_u$ chosen at random. We write the SGD MF updates as
$$\boldsymbol{\theta}_u^{(k+1)} = \boldsymbol{\theta}_u^{(k)} + 2 \alpha_k \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} - \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u^{(k)} \right), \quad u \in U_i$$
$$\boldsymbol{\psi}_i^{(k+1)} = \boldsymbol{\psi}_i^{(k)} + 2 \alpha_k \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)} - \left( \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i^{(k)} \right), \quad i \in I_u$$
$$k = 0, 1, \ldots, T,$$
where $u$ is a random index chosen from $U_i$, $i$ is a random index from $I_u$, and $0 \leq \alpha_k \leq 1$ is the SGD step size.
BGD for MF (BGD MF): Rather than computing a closed-form solution for each block, BGD takes $T$ full-batch gradient steps per block. We skip the details here due to space limitations. The BGD updates for the MF problem are expressed as
$$\boldsymbol{\theta}_u^{(k+1)} = \boldsymbol{\theta}_u^{(k)} + 2 \alpha_k \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} - \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u^{(k)} \right)$$
$$\boldsymbol{\psi}_i^{(k+1)} = \boldsymbol{\psi}_i^{(k)} + 2 \alpha_k \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)} - \left( \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i^{(k)} \right)$$
$$\forall (u,i) \in K, \quad k = 0, 1, \ldots, T,$$
where $(u,i)$ are the codebook indexes at UE and BS, $k$ is the GD iteration index, and $\alpha_k$ is the BGD step size ($0 < \alpha_k < 1$).
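For comparison, one full BGD iteration of (10) can be sketched as follows (our vectorized reading of the updates, reusing `theta`, `psi`, `mask`, `mu`, `lam`, and `D` from the BCD sketch above; the constant step size is an arbitrary choice):

```python
alpha = 1e-3                                  # constant step size (assumption)
for u in range(C_T):
    idx = np.flatnonzero(mask[u])
    P = psi[idx]
    theta[u] += 2 * alpha * (P.T @ S[u, idx]
                             - (P.T @ P + mu * np.eye(D)) @ theta[u])
for i in range(C_R):
    idx = np.flatnonzero(mask[:, i])
    T_blk = theta[idx]
    psi[i] += 2 * alpha * (T_blk.T @ S[idx, i]
                           - (T_blk.T @ T_blk + lam * np.eye(D)) @ psi[i])
```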

4.3. Solutions for NMF

Our proposed NMF follows the exact steps of MF, with the main difference that the latent vectors are constrained to be non-negative, $\boldsymbol{\theta}_u \in \mathbb{R}^D_+$, $\boldsymbol{\psi}_i \in \mathbb{R}^D_+$, $\forall (u,i) \in K$. Likewise, we solve the NMF problem (P2) using BCD, SGD, and BGD.
BCD for NMF (BCD NMF): The derivations of BCD for NMF (11) are identical to those of BCD for MF (8), followed by the corresponding projection operation. The BCD updates for NMF are given by
$$\boldsymbol{\theta}_u^{(k+1)} = \left[ \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right)^{-1} \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} \right) \right]_+$$
$$\boldsymbol{\psi}_i^{(k+1)} = \left[ \left( \sum_u \boldsymbol{\theta}_u^{(k+1)} \boldsymbol{\theta}_u^{(k+1)T} + \lambda_i \mathbf{I}_D \right)^{-1} \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k+1)} \right) \right]_+$$
$$\forall (u,i) \in K, \quad k = 0, 1, \ldots, I_M$$
where $(k)$ is the BCD iteration index and $[\mathbf{a}]_+ := \max(\mathbf{a}, 0)$ is applied element by element on $\mathbf{a}$, i.e., the Euclidean projection of $\mathbf{a}$ onto $\mathbb{R}^D_+$. Since the projection is Euclidean (a non-expansive operator), the corollary stated in the previous subsection applies to BCD for NMF as well.
Block Stochastic Gradient Descent (BSGD) for NMF (SGD NMF): The SGD NMF derivations are exactly the same as those of SGD MF, followed by a projection $[\cdot]_+$. We thus express the SGD NMF updates as
$$\boldsymbol{\theta}_u^{(k+1)} = \left[ \boldsymbol{\theta}_u^{(k)} + 2 \alpha_k \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} - \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u^{(k)} \right) \right]_+, \quad u \in U_i$$
$$\boldsymbol{\psi}_i^{(k+1)} = \left[ \boldsymbol{\psi}_i^{(k)} + 2 \alpha_k \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)} - \left( \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i^{(k)} \right) \right]_+, \quad i \in I_u$$
$$k = 0, 1, \ldots, T,$$
where $u$ is a random index chosen from $U_i$, $i$ is a random index from $I_u$, $[\mathbf{a}]_+ := \max(\mathbf{a}, 0)$, and $\alpha_k$ is the SGD step size ($0 < \alpha_k < 1$).
BGD for NMF (BGD NMF): The solution and derivations for BGD NMF are the same as those for BGD MF, followed by a projection $[\cdot]_+$, i.e.,
$$\boldsymbol{\theta}_u^{(k+1)} = \left[ \boldsymbol{\theta}_u^{(k)} + 2 \alpha_k \left( \sum_i [\mathbf{S}]_{u,i} \boldsymbol{\psi}_i^{(k)} - \left( \sum_i \boldsymbol{\psi}_i^{(k)} \boldsymbol{\psi}_i^{(k)T} + \mu_u \mathbf{I}_D \right) \boldsymbol{\theta}_u^{(k)} \right) \right]_+$$
$$\boldsymbol{\psi}_i^{(k+1)} = \left[ \boldsymbol{\psi}_i^{(k)} + 2 \alpha_k \left( \sum_u [\mathbf{S}]_{u,i} \boldsymbol{\theta}_u^{(k)} - \left( \sum_u \boldsymbol{\theta}_u^{(k)} \boldsymbol{\theta}_u^{(k)T} + \lambda_i \mathbf{I}_D \right) \boldsymbol{\psi}_i^{(k)} \right) \right]_+$$
$$\forall (u,i) \in K, \quad k = 0, 1, \ldots, T,$$
where $[\mathbf{a}]_+ := \max(\mathbf{a}, 0)$, $(k)$ is the GD iteration index, and $\alpha_k$ is the GD step size ($0 < \alpha_k < 1$). We use a constant step size $\alpha_k = \alpha$ for all these methods.
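In code, the NMF variants only add the projection $[\cdot]_+$ after each MF update, e.g. (our sketch, reusing the factors from the MF sketches above):

```python
theta[u] = np.maximum(theta[u], 0.0)   # [a]_+, element-wise projection
psi[i] = np.maximum(psi[i], 0.0)       # keeps the latent factors non-negative
```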

4.4. Prediction for MF and NMF

For both MF and NMF, the predicted RSE of a beam pair $(u,i)$ whose beams were not sounded is expressed as
$$\left\{ \widehat{\mathrm{RSE}}_{u,i} := \hat{\boldsymbol{\theta}}_u^T \hat{\boldsymbol{\psi}}_i \,\middle|\, (u,i) \in L \right\}$$
where $L$ is the test set and $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}$ are the optimal solutions to MF (or NMF). Afterwards, we search for the optimal beam pair at UE and BS as the one with the highest RSE value over both the training and test sets, as follows:
$$(u^\star, i^\star) = \arg\max_{(u,i) \in L \cup K} \hat{\boldsymbol{\theta}}_u^T \hat{\boldsymbol{\psi}}_i.$$
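A sketch of the prediction and search steps (14)-(15), reusing the factors and mask from the sketches above:

```python
S_hat = theta @ psi.T                  # predicted RSE for every beam pair
S_hat[mask] = S[mask]                  # keep the sounded (true) RSE values
u_star, i_star = np.unravel_index(np.argmax(S_hat), S_hat.shape)
```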

4.5. Proposed BA Algorithm Using MF/NMF

Because the updates are given in closed form, we can quantify the computational complexity of all of the above methods. As seen from the updates for BCD MF and BCD NMF, we have to invert two $D \times D$ matrices (for the sub-problems S1 and S2). Thus, the per-iteration computational complexity of BCD MF and BCD NMF is approximately $C_{BCD\text{-}MF} = C_{BCD\text{-}NMF} = O(2D^3)$. Moreover, for BGD MF and BGD NMF, one has to compute two full-batch gradients over all training samples in $K$ (for the sub-problems S1 and S2). Consequently, the per-iteration complexity of BGD MF and BGD NMF is approximately $C_{BGD\text{-}MF} = C_{BGD\text{-}NMF} = O(2|K|)$. Finally, for SGD MF and SGD NMF, since we use a mini-batch size of 1 (for the sub-problems S1 and S2), the resulting per-iteration computational complexity is approximately $C_{SGD\text{-}MF} = C_{SGD\text{-}NMF} = O(2)$. To solve the MF and NMF problems, we employ BCD, BGD, or SGD. All details are given in Algorithm 1.
Algorithm 1 Proposed MF/NMF-Based BA Method.
  • Input: $\{\mathbf{f}_u\}_{u \in \mathcal{T}}$, $\{\mathbf{W}_i\}_{i \in \mathcal{R}}$, $\eta$, $P_u$
  • Randomly generate sub-sampled codebooks $\mathcal{T}_S, \mathcal{R}_S$ satisfying $(|\mathcal{T}_S| \cdot |\mathcal{R}_S|)/(|\mathcal{T}| \times |\mathcal{R}|) = \eta$
  • Sound the beam pairs of the training set, $K := \mathcal{T}_S \times \mathcal{R}_S$
  • Record the corresponding RSE values and generate the matrix $\mathbf{S}$, as in (5)
  • Select model: MF or NMF
  • IF MF model selected: solve (P1) with BCD for MF, in (8), or with BGD for MF, in (10), or with SGD for MF, in (9). At the end of training, return the optimal latent vectors $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}_{(u,i) \in K}$
  • IF NMF model selected: solve (P2) with BCD for NMF, in (11), or with BGD for NMF, in (13), or with SGD for NMF, in (12). At the end of training, return the optimal latent vectors $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}_{(u,i) \in K}$
  • Use the optimal latent vectors $\{\hat{\boldsymbol{\theta}}_u, \hat{\boldsymbol{\psi}}_i\}_{(u,i) \in K}$ to predict the unknown RSE of the test set $L$, as in (14)
  • Search the training and test sets for the beam pair with the largest RSE, as in (15)
  • Output: $\mathbf{f}_{u^\star}$, $\mathbf{W}_{i^\star}$
While, for MF BCD and NMF BCD, the only hyperparameter is the model size $D$, MF BGD and NMF BGD additionally require the GD step size $\alpha_k$ as a hyperparameter.

4.6. Numerical Simulations

This section describes our numerical setup. The numbers of antennas at UE and BS are taken from $\{128, 256, 512, 1024\}$. We set $N_T = C_T$ and $N_R = C_R$. The overhead ratio regime is $\eta \in \{0.7, 0.5, 0.3, 0.1\}$. The number of OFDM sub-carriers is $N_c = 64$ and the number of channel paths is $L = 2$. We vary the transmitted power $P_u \in \{1, 10^{-1}, 10^{-2}\}$ W. We use DFT codebooks at UE and BS. The optimal hyperparameters are chosen to minimize the test loss. The model dimension is $D \in \{2, 3, 4, 5, 6\}$, the learning rate is $\alpha_k \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$, and the regularization factors are $\{\lambda, \mu\} \in \{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}\}$. For each MIMO configuration and each $P_u$ regime, we randomly generate and store the resulting RSE matrices.
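The grid-search cross-validation over these values can be sketched as follows (our illustration; `train_and_eval` is a hypothetical routine that trains one MF/NMF model and returns its test NMSE):

```python
from itertools import product

Ds = [2, 3, 4, 5, 6]
alphas = [10.0 ** -e for e in range(1, 7)]        # 1e-1 ... 1e-6
regs = [10.0 ** -e for e in range(2, 8)]          # 1e-2 ... 1e-7
best = min(product(Ds, alphas, regs),
           key=lambda p: train_and_eval(S, p[0], p[1], p[2]))
D_opt, alpha_opt, reg_opt = best
```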
We investigate six models in total (BCD MF, BCD NMF, BGD MF, BGD NMF, SGD MF, SGD NMF) under three transmitted power regimes: high ($P_u = 1$ W), medium ($P_u = 10^{-1}$ W), and low ($P_u = 10^{-2}$ W), with fixed $\sigma^2 = 1$. Table 1 summarizes all the proposed system parameters. We use the training Normalized MSE (NMSE) to evaluate the training error, expressed as $\mathrm{Train\ NMSE} = \frac{1}{|K|} \sum_{(u,i) \in K} \left( \frac{\hat{\boldsymbol{\theta}}_u^T \hat{\boldsymbol{\psi}}_i - \mathrm{RSE}_{u,i}}{\mathrm{RSE}_{u,i}} \right)^2$. We also define $\mathrm{Test\ NMSE} = \frac{1}{|L|} \sum_{(u,i) \in L} \left( \frac{\widehat{\mathrm{RSE}}_{u,i} - \hat{\boldsymbol{\theta}}_u^T \hat{\boldsymbol{\psi}}_i}{\widehat{\mathrm{RSE}}_{u,i}} \right)^2$. The range of training errors and the overall behavior of the BCD-based models are distinct from those of the GD-based models for both MF and NMF; for instance, the BGD-based models' errors are around $10^{-7}$, while the BCD-based models' errors are around $10^{-4}$. Thus, GD is more accurate. However, BCD converges faster, and its cost function drops to low values from the very first iterations. In addition, for MF and NMF, the training NMSE decreases as the overhead ratio $\eta$ increases, as seen in Figure 5. The low and medium $P_u$ regimes are characterized by noisy links between UE and BS and represent a more challenging experimental environment. BCD-based models tend to reach low error values faster, while BGD-based models are more accurate. (For instance, BSGD generally improves the quality of prediction compared with BGD.)
Regarding the MF/NMF simulation figures, Figure 5a shows the decrease of the train/test NMSE as a function of the overhead ratio (more training samples result in fewer errors); Figure 5b,c track the immediate drop in loss values from the very first iterations for the BCD-based models; and Figure 5d,e present the progressive convergence of the cost function over the iterations for the BGD-based models. In summary, Table 2 outlines the optimal (minimum) signaling overhead ratio required for all the proposed system configurations, the optimal model (with the smallest total cost function), the corresponding combination of optimal hyperparameters, and the corresponding train/test error values. When the signal is heavily affected by noise, it is harder to keep the same error range as in the high $P_u$ regime. In fact, the MF models keep the same (minimum) signaling overhead (0.1) regardless of the transmitted power regime, predicting accurately with just 10% of the beams sounded. Thus, the proposed MF/NMF methods reduce the pilot signaling overhead by 90% compared with Exhaustive BA, with negligible training and test errors.

5. Multi-Layer Perceptron

5.1. MLP Problem Formulation

We consider a feed-forward MLP with $J$ layers, modeled as a composition of $J$ non-linear functions/layers. Let $z_0 \in \mathbb{R}$ be the MLP input and $z_J \in \mathbb{R}$ be the MLP output; see Figure 6. We denote by $\{\mathbf{z}_2, \ldots, \mathbf{z}_{J-1}\}$ the hidden layers. For simplicity, we assume that all layers have the same width, denoted $D$, i.e., $\{\mathbf{z}_2 \in \mathbb{R}^D, \ldots, \mathbf{z}_{J-1} \in \mathbb{R}^D\}$; see Figure 6. The equation describing layer 1 is $\mathbf{z}_1 = \sigma_1(\boldsymbol{\phi}_1 z_0) = \sigma_1(\boldsymbol{\phi}_1 \cdot 1)$, where $\mathbf{z}_1 \in \mathbb{R}^D$ is the output of layer 1, $\boldsymbol{\phi}_1 \in \mathbb{R}^D$ is the corresponding weight vector, and $\sigma_1(\cdot): \mathbb{R} \to \mathbb{R}^D$ is the non-linear activation function of layer 1. We use one-hot encoding for the MLP input $z_0 \in \mathbb{R}$, i.e., $z_0 = 1$ for all training samples $(u,i) \in K$. We express the output of the hidden layers, $\{\mathbf{z}_j \in \mathbb{R}^D\}_{j=2}^{J-1}$, as $\mathbf{z}_j = \sigma_j(\boldsymbol{\Phi}_j \mathbf{z}_{j-1})$, $\forall j \in \{2, \ldots, J-1\}$, where $\mathbf{z}_{j-1} \in \mathbb{R}^D$ is the input of layer $j$ and $\mathbf{z}_j \in \mathbb{R}^D$ is its output, $\boldsymbol{\Phi}_j \in \mathbb{R}^{D \times D}$ is the weight matrix of layer $j$, and $\sigma_j(\cdot): \mathbb{R}^D \to \mathbb{R}^D$ is the element-by-element non-linear activation function of layer $j$, $\forall j \in \{2, \ldots, J-1\}$. Finally, the relation for the last layer $j = J$ is $z_J = \sigma_J(\boldsymbol{\phi}_J \mathbf{z}_{J-1})$, where $z_J \in \mathbb{R}$ is the output of layer $J$, $\boldsymbol{\phi}_J \in \mathbb{R}^{1 \times D}$ is its weight vector, and $\sigma_J(\cdot): \mathbb{R}^D \to \mathbb{R}$ is the non-linear activation function of layer $J$. We express the output of the MLP, $z_J \in \mathbb{R}$, as a function of all layers:
z J : = σ J ( ϕ J σ 2 ( Φ 2 ( σ 1 ( ϕ 1 ) ) ) )
The output of the MLP is made to fit/approximate the RSE values of all training samples: $z_J := \mathrm{RSE}_{u,i}$, $\forall (u,i) \in K$. We define the MSE loss $l_{u,i}$ for the sample $(u,i)$ in the training set $K$ as the distance between the MLP output $z_J$ and the known RSE label for the beam pair $(u,i)$, $\mathrm{RSE}_{u,i}$, i.e.,
l u , i : = ( z J R S E u , i ) 2 = ( σ J ( ϕ J σ 2 ( Φ 2 ( σ 1 ( ϕ 1 ) ) ) ) M L P   o u t p u t R S E u , i R S E   v a l u e ) 2 , ( u , i ) K
Then, the empirical risk is defined as the average of the individual losses $l_{u,i}$ across the training set $K$: $(1/|K|) \sum_{(u,i) \in K} l_{u,i}$. The empirical risk minimization for the MLP is given in (P3):
( P 3 ) : = { ( ϕ 1 * , Φ 2 * , , ϕ J * ) a r g m i n ϕ 1 , Φ 2 , , Φ J 1 , ϕ J 1 | K | ( u , i ) K l u , i ( ϕ 1 , Φ 2 , , Φ J 1 , ϕ J ) s . t . ϕ 1 R D , Φ 2 R D × D , , Φ J 1 R D × D , ϕ J R 1 × D

5.2. MLP Learning

We propose to learn the optimal MLP weights via back-propagation (BP). We choose an arbitrary mini-batch of samples $B \subseteq K$ and define the mini-batch loss as
l B : = 1 | B | u , i B ( σ J ( ϕ J σ 2 ( Φ 2 ( σ 1 ( ϕ 1 ) ) ) ) R S E u , i ) 2 , ( u , i ) B
We express the partial derivative of the mini-batch loss $l_B$ with respect to each layer's weights $\boldsymbol{\Phi}_j$, $j \in \{1, \ldots, J\}$, as
$$\frac{\partial l_B}{\partial \boldsymbol{\Phi}_j} = \frac{1}{|B|} \sum_{(u,i) \in B} \boldsymbol{\delta}_j \mathbf{z}_{j-1}^T, \quad \forall j \in \{1, \ldots, J\},$$
where
$$\boldsymbol{\delta}_j \triangleq \begin{cases} \left( \boldsymbol{\Phi}_{j+1}^T \boldsymbol{\delta}_{j+1} \right) \circ \sigma_j', & j < J \\ 2 (z_J - \mathrm{RSE}_{u,i}) \, \sigma_J', & j = J \end{cases}, \quad \forall (u,i) \in B, \qquad \sigma_j' \triangleq \frac{\partial \sigma(\mathbf{u})}{\partial \mathbf{u}} = \left[ \frac{\partial \sigma(u_1)}{\partial u_1}, \ldots, \frac{\partial \sigma(u_{d_j})}{\partial u_{d_j}} \right]^T,$$
$\forall j = 1, \ldots, J$, where $\circ$ denotes the Hadamard product. We express the BP weight update for the mini-batch loss $l_B$, for all layers $j \in \{1, \ldots, J\}$, as
$$\boldsymbol{\Phi}_j^{(k+1)} = \boldsymbol{\Phi}_j^{(k)} - \beta_j^{(k)} \left. \frac{\partial l_B}{\partial \boldsymbol{\Phi}_j} \right|_{\boldsymbol{\Phi}_j^{(k)}}, \quad \forall j \in \{1, \ldots, J\}, \ k = 1, \ldots, T$$
where $(k)$ is the BP iteration index, $\boldsymbol{\Phi}_j^{(k)}$ is the value of $\boldsymbol{\Phi}_j$ at iteration $k$, $\beta_j^{(k)}$ is the BP step size (learning rate) of layer $j$ at iteration $k$, and $\left. \frac{\partial l_B}{\partial \boldsymbol{\Phi}_j} \right|_{\boldsymbol{\Phi}_j^{(k)}}$ is the partial derivative given in (18) evaluated at $\boldsymbol{\Phi}_j^{(k)}$.
Back-propagation algorithm with mini-batch
  • Choose the mini-batch $B$ as a random subset of the training set $K$.
  • Compute the loss function $l_B$ over all samples in the mini-batch $(u,i) \in B$, as in (17).
  • Compute the partial derivative $\frac{\partial l_B}{\partial \boldsymbol{\Phi}_j}$ of the mini-batch loss $l_B$ with respect to $\boldsymbol{\Phi}_j$, as in (18).
  • Update the weights of each layer as in (19).
We assume that the BP learning rate is the same for all layers,  β j ( k ) = β k , j { 1 , , J } .
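A manual back-propagation step following (17)-(19), continuing the forward-pass sketch above (our code under the same input-encoding assumption; a deep-learning framework such as pytorch, which the experiments use, would obtain the same gradients via autograd):

```python
def bp_step(W, batch, S, C_T, C_R, beta=1e-3):
    # One mini-batch update of all layer weights, Eq. (19).
    grads = [np.zeros_like(w) for w in W]
    for (u, i) in batch:
        x = one_hot_pair(u, i, C_T, C_R)
        zs, pre, z = [x], [], x
        for Wj in W[:-1]:                        # forward pass, caching z_j
            a = Wj @ z
            pre.append(a)
            z = np.maximum(a, 0.0)
            zs.append(z)
        out = float(W[-1] @ z)                   # MLP output z_J
        delta = np.array([2.0 * (out - S[u, i])])        # delta_J, Eq. (18)
        grads[-1] += np.outer(delta, zs[-1])
        for j in range(len(W) - 2, -1, -1):      # back-propagate delta_j
            delta = (W[j + 1].T @ delta) * (pre[j] > 0)  # Hadamard w/ ReLU'
            grads[j] += np.outer(delta, zs[j])
    for Wj, g in zip(W, grads):
        Wj -= beta * g / len(batch)              # gradient step, Eq. (19)
```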

5.3. Prediction Using MLP

The MLP prediction for the sample $(u,i)$ in the test set $L$, using the optimal weights $\boldsymbol{\phi}_1^*, \boldsymbol{\Phi}_2^*, \ldots, \boldsymbol{\phi}_J^*$, is as follows:
$$\hat{z}_J = \sigma_J\left( \boldsymbol{\phi}_J^* \cdots \sigma_2\left( \boldsymbol{\Phi}_2^* \sigma_1(\boldsymbol{\phi}_1^*) \right) \right), \quad \forall (u,i) \in L$$
Therefore, the test MSE is defined as
$$\frac{1}{|L|} \sum_{(u,i) \in L} \left( \widehat{\mathrm{RSE}}_{u,i} - \sigma_J\left( \boldsymbol{\phi}_J^* \cdots \sigma_2\left( \boldsymbol{\Phi}_2^* \sigma_1(\boldsymbol{\phi}_1^*) \right) \right) \right)^2$$
We then select the optimal indexes $u^\star$ and $i^\star$ corresponding to the highest RSE value, as follows:
$$(u^\star, i^\star) = \arg\max_{(u,i) \in L \cup K} \left( \left\{ \mathrm{RSE}_{u,i} \,\middle|\, (u,i) \in K \right\} \cup \left\{ \widehat{\mathrm{RSE}}_{u,i} \,\middle|\, (u,i) \in L \right\} \right)$$

5.4. Proposed BA Algorithm Using  M L P

The Multi-Layer Perceptron-based Beam Alignment is specified in Algorithm 2.
Algorithm 2 Proposed MLP-Based BA Method.
  • Input: $\{\mathbf{f}_u\}_{u \in \mathcal{T}}$, $\{\mathbf{W}_i\}_{i \in \mathcal{R}}$, $\eta$, $P_u$
  • Randomly generate sub-sampled codebooks $\mathcal{T}_S, \mathcal{R}_S$ satisfying $(|\mathcal{T}_S| \cdot |\mathcal{R}_S|)/(|\mathcal{T}| \times |\mathcal{R}|) = \eta$
  • Sound the beam pairs of the training set, $K := \mathcal{T}_S \times \mathcal{R}_S$
  • Record the corresponding RSE values and generate the RSE matrix $\mathbf{S}$, as in (5)
  • Train the MLP weights (using the back-propagation algorithm); return the optimal weights $\{\boldsymbol{\phi}_1^*, \boldsymbol{\Phi}_2^*, \ldots, \boldsymbol{\phi}_J^*\}$
  • Use the optimal parameters $\{\boldsymbol{\phi}_1^*, \boldsymbol{\Phi}_2^*, \ldots, \boldsymbol{\phi}_J^*\}$ to predict the unknown RSE of the test set $L$, as in (21)
  • Search the training and test sets for the optimal beam pair $(u^\star, i^\star)$ with the largest RSE, as in (22)
  • Output: $\mathbf{f}_{u^\star}$, $\mathbf{W}_{i^\star}$
The number of neurons per layer $D$, the number of layers $J$, the mini-batch size $|B|$, and the BP learning rate $\beta^{(k)}$ are hyperparameters. They are tuned using grid-search cross-validation.

5.5. Numerical Simulations

We define the training and test cost functions as follows:
$$\mathrm{Train\ NMSE} = \frac{1}{|K|} \sum_{(u,i) \in K} \left( \frac{\mathrm{RSE}_{u,i} - \sigma_J\left( \boldsymbol{\phi}_J \cdots \sigma_2\left( \boldsymbol{\Phi}_2 \sigma_1(\boldsymbol{\phi}_1) \right) \right)}{\mathrm{RSE}_{u,i}} \right)^2$$
$$\mathrm{Test\ NMSE} = \frac{1}{|L|} \sum_{(u,i) \in L} \left( \frac{\widehat{\mathrm{RSE}}_{u,i} - \sigma_J\left( \boldsymbol{\phi}_J^* \cdots \sigma_2\left( \boldsymbol{\Phi}_2^* \sigma_1(\boldsymbol{\phi}_1^*) \right) \right)}{\widehat{\mathrm{RSE}}_{u,i}} \right)^2$$
We use the same system configurations as for MF/NMF, summarized in Table 1. Moreover, we choose the learning rate $\beta_k \in \{0.1, 0.01, 0.001, 0.0001\}$, the batch size $|B| \in \{2, 4, 8, 16, 32, 64, 128\}$, and the number of hidden layers $J \in \{1, 2, 3\}$. For each layer, the number of neurons is $D \in \{8, 16, 32, 64, 128\}$. We use Rectified Linear Units (ReLU) as the activation function for all layers.
As for MF/NMF, training performance is observed by tracking the evolution of the NMSE cost function over the training samples of the set $K$ as a function of the iterations. The range of considerably low error values and the overall learning behavior of the MLP architecture illustrate that our shallow neural network successfully solves the non-linear regression problems related to our BA process. For massive setups, the MLP reaches an error of around $10^{-6}$ in the high $P_u$ regime. However, this cost value increases as the amount of noise and interference grows. Note that the training NMSE also decreases when we increase the size of the dataset matrix $\mathbf{S}$, which provides more samples for the MLP to improve feature extraction and prediction quality. Regarding the unknown beams, the test error values in the numerical result tables are close to the training cost (with no overfitting or underfitting in the corresponding learning curves). Moreover, the test loss is impacted by the transmitted power regime in the same way as the training process. As with GD-based MF/NMF, the MLP learning curves in Figure 7 show the same shape, with a continuous monotonic decrease in the training and test cost over the iterations: convergence is progressive, and at the last epoch, the training and test NMSE values land at considerably low error values, proving that the MLP accurately fits our problem and provides a concrete solution for ML-based BA. From a QoS perspective, Table 3 summarizes the smallest (optimal) signaling overhead required for successful beam sounding based on reliable prediction quality. As with MF/NMF, for all the proposed transmitted powers, the MLP requires 10% of the total beam pairs to complete the RSE matrix.

6. Results and Discussion

6.1. Train/Test Prediction Performance Comparison

For the six MF-based models, we select the best one (minimum test error) to represent the MF family of methods in this section and compare it with the MLP. Analyzing the QoS (Table 1 and Table 2), we notice that increasing the transmitted power improves the quality of prediction by reducing the overall loss. For MF/NMF, the degradation is large: the loss jumps from around $10^{-8}$ for massive configurations (256, 512, and 1024) to $10^{-4}$ for smaller setups. For the MLP, we observe an increase in the overall loss as $P_u$ decreases; still, the MLP appears to be the most robust architecture with respect to changes in the transmitted power. Additionally, we empirically notice that changing the $P_u$ values does not affect the optimal hyperparameters selected by cross-validation. Furthermore, when we track the evolution of the training/test cost as a function of the iterations, we observe balanced models with no signs of overfitting or underfitting. On the other hand, when the transmitted power decreases, MF/NMF are the most impacted models in terms of train/test error, while the MLP error remains robust.
From a QoS perspective, concerning the evolution of the optimal (minimum) required signaling overhead and the impact of the $P_u$ regime on it, Table 1 and Table 2 show that all the proposed models require just 10% of the total number of beam pairs at UE and BS, for all antenna configurations from $128 \times 128$ to $1024 \times 1024$ and for all the proposed $P_u$ values. This shows that the transmitted power impacts the quality of prediction but not the number of beam pairs required for training. In fact, a low $P_u$ degrades the signal quality and consequently the amount of useful information that can be extracted from the datasets. Finally, the only cases where the $P_u$ regime impacts the optimal overhead ratio are the smallest configurations, for instance, the $16 \times 16$ setup, where it is natural for all learning models to demand more data to learn from (more hidden interactions between UE and BS as features to extract). These are precisely the experimental situations where Exhaustive BA is technically feasible.

6.2. Similarities and Differences between Models

All models required just 10% of the beams for training in all the proposed massive setups. Moreover, all the proposed models are shallow architectures with few hidden layers, in line with the low-complexity constraints. Even for the largest configurations, the optimal model dimensions picked by cross-validation correspond to small networks, with no need for dense architectures. Furthermore, all models succeeded at the matrix completion task, and they all show a monotonic decrease in loss values as the MIMO setup grows. Additionally, the MF-based models are the most accurate, reaching loss values in the range of $10^{-8}$ for massive setups in the high $P_u$ regime, and their cross-validation involves a smaller grid search since there are fewer hyperparameters to tune. However, they are the slowest models when applied to high-dimensional MIMO setups. On the other hand, the MLP shows a good balance between run time (complexity) and loss values (prediction quality), reaching losses of around $10^{-4}$ to $10^{-5}$ for massive configurations. In addition, the MLP is the most robust model with respect to changes in the $P_u$ values. Figure 8 illustrates, for the $512 \times 512$ setup, the train/test NMSE for each model and the corresponding transmitted power: in Figure 8a, for $P_u = 1$ W, MF achieves its best performance, slightly better than the MLP, with a difference between the achieved cost values of around $10^{-1}$. In Figure 8b, when $P_u = 10^{-1}$ W, MF still achieves the best performance, marginally better than the MLP, with an NMSE difference of around $10^{-1}$. In Figure 8c, when $P_u = 10^{-2}$ W, MF is noticeably impacted (overall loss around $10^{-3}$), while the MLP provides the best prediction performance: this suggests that when $P_u$ is small, the MLP is more robust than MF/NMF, which performs best in the high $P_u$ regime. Almost the same remarks hold for Figure 9, where we simulate the $128 \times 128$ configuration: in Figure 9a, MF reaches considerably better performance than the MLP, by about $10^{-4}$. In Figure 9b, the MLP keeps the same error range, which again demonstrates the robustness of the model, while MF is severely impacted ($10^{-3}$) but still achieves the best performance. In Figure 9c, when $P_u$ is weak, MF shows the worst performance of all simulations; the MLP, on the other hand, is only slightly impacted, with an overall loss of $10^{-1}$, and reaches the best prediction quality. In Figure 10, we investigate the largest configuration, $1024 \times 1024$. The conclusions drawn for Figure 8 and Figure 9 hold here as well in terms of the best model (MF for $P_u = 1$ W and $P_u = 10^{-1}$ W, and MLP for $P_u = 10^{-2}$ W). In addition, to investigate the overall impact of varying the transmitted power, we track the $\log(\mathrm{NMSE})$ values while switching from one $P_u$ regime to another. In Figure 10a, for the MLP, the gap between the low and medium regimes is $\log(\mathrm{NMSE})_{medium} - \log(\mathrm{NMSE})_{low} \approx -16 - (-12) = -4$, while the gap between the medium and high regimes is almost negligible ($\log(\mathrm{NMSE})_{high} - \log(\mathrm{NMSE})_{medium} \approx -0.5$). Finally, in Figure 10b, the MF gaps are around $\log(\mathrm{NMSE})_{medium} - \log(\mathrm{NMSE})_{low} \approx -17 - (-9) = -8$ and $\log(\mathrm{NMSE})_{high} - \log(\mathrm{NMSE})_{medium} \approx -22 - (-17) = -5$: at each change of $P_u$, MF is considerably impacted. To sum up, the choice of the optimal model strongly depends on the available complexity and the given transmitted power $P_u$.
In fact, MF, whether through BCD or BGD optimization, is the best model when the transmitted power is high ($P_u = 1$ W). In this case, BCD MF converges faster but has higher complexity than BGD. SGD for MF/NMF is the slowest to converge but has negligible complexity. On the other hand, if we aim to prioritize run time, the MLP exhibits the fastest predictions with good prediction error. Finally, it is wise to opt for the MLP if the system is to operate under various transmitted power regimes, where the MLP offers good prediction quality for every $P_u$ value and the available complexity budget is medium.

7. Conclusions

In this paper, we proposed a blind Machine Learning-based Beam Alignment using Matrix Factorization, non-negative Matrix Factorization, and a Multi-Layer Perceptron. We assumed an Uplink massive mmWave MIMO system using a single RF chain at the  U E  and multiple RF chains at the  B S , through a fully analog architecture. The proposed approach consists of sounding the  R S E  of sub-sampled codebooks at the  U E  and  B S ; the  R S E  of the non-sounded beams is then predicted using the  M F ,  N M F , and  M L P  models. Our results show that, by sounding just  10 %  of the total beam-pair samples, we can predict the unknown  R S E  values with high accuracy, which massively reduces the large signaling overhead of Exhaustive  B A . Our future work will investigate the scalability of our approach to a multi-user scenario. Robustness and  M L -interpretability are other research directions toward industrial deployment.

Author Contributions

Conceptualization, A.K., H.G. and G.R.-B.O.; Methodology, A.K. and H.G.; Software, A.K.; Validation, G.R.-B.O.; Formal analysis, H.G.; Writing—original draft, A.K.; Writing—review & editing, H.G. and G.R.-B.O.; Supervision, H.G. and G.R.-B.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Télécom Paris, l’Institut Polytechnique de Paris, France.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Datasets are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ALS: Alternating Least Squares
AoD: Angle of Departure
AoA: Angle of Arrival
AWGN: Additive White Gaussian Noise
BA: Beam Alignment
BS: Base Station
BCE: Binary Cross Entropy
BCD: Block Coordinate Descent
BGD: Block Gradient Descent
BSGD: Block Stochastic Gradient Descent
CSI: Channel State Information
DFT: Discrete Fourier Transform
GD: Gradient Descent
LoS: Line of Sight
MF: Matrix Factorization
MIMO: Multiple Input Multiple Output
ML: Machine Learning
MLP: Multi-Layer Perceptron
MSE: Mean Squared Error
NMF: Non-Negative Matrix Factorization
NLoS: Non Line of Sight
NMSE: Normalized Mean Squared Error
OFDM: Orthogonal Frequency Division Multiplexing
QoS: Quality of Service
ReLU: Rectified Linear Unit
RSE: Received Signal Energies
SNR: Signal-to-Noise Ratio
UE: User Equipment

Appendix A. Proof: BCD Convergence

We show that the following two conditions, which are sufficient for the convergence of BCD, are satisfied:
(i)
The loss function is strongly convex per block; i.e., we show that sub-problems (S1) and (S2) each admit a unique solution.
(ii)
The constraints of the MF problem, $\boldsymbol{\theta}_u \in \mathbb{R}^{D}$ and $\boldsymbol{\psi}_i \in \mathbb{R}^{D}$, are separable and individually convex.
Recall that sub-problem S1 is written as
$$(\mathrm{S1}):\quad \boldsymbol{\theta}_u^{(k+1)} = \arg\min_{\boldsymbol{\theta}_u \in \mathbb{R}^{D}} \left[ -2\,\boldsymbol{\theta}_u^{T}\mathbf{r}_u^{(k)} + \boldsymbol{\theta}_u^{T}\left(\mathbf{Q}_u^{(k)} + \mu_u \mathbf{I}_D\right)\boldsymbol{\theta}_u \right] =: f_1(\boldsymbol{\theta}_u), \quad \forall u.$$
Next, we prove that the equivalent form in (S1) is strongly convex; i.e., we show that $f_1(\boldsymbol{\theta}_u)$ is strongly convex in $\boldsymbol{\theta}_u$. To that end, we derive the corresponding Hessian:
$$\nabla^2 f_1(\boldsymbol{\theta}_u) = 2\left(\mathbf{Q}_u^{(k)} + \mu_u \mathbf{I}_D\right), \quad \forall u.$$
In this Hessian expression, $\mathbf{Q}_u^{(k)} \succeq 0$ is a Positive Semi-Definite (PSD) matrix by definition, $\mu_u \mathbf{I}_D \succ 0$ is a Positive Definite (PD) matrix, and therefore $\mathbf{Q}_u^{(k)} + \mu_u \mathbf{I}_D \succ 0$ is a PD matrix. Thus, the Hessian is PD, $\nabla^2 f_1(\boldsymbol{\theta}_u) \succ 0$, so $f_1(\boldsymbol{\theta}_u)$ is strongly convex in $\boldsymbol{\theta}_u$ and the solution to sub-problem (S1) is unique. Recall that sub-problem (S2) is expressed as
$$(\mathrm{S2}):\quad \boldsymbol{\psi}_i^{(k+1)} = \arg\min_{\boldsymbol{\psi}_i \in \mathbb{R}^{D}} \left[ -2\,\mathbf{t}_i^{(k+1)T}\boldsymbol{\psi}_i + \boldsymbol{\psi}_i^{T}\left(\mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D\right)\boldsymbol{\psi}_i \right] =: f_2(\boldsymbol{\psi}_i), \quad \forall i.$$
Next, we prove that the equivalent form in (S2) is strongly convex; i.e., we show that $f_2(\boldsymbol{\psi}_i)$ is strongly convex in $\boldsymbol{\psi}_i$. To that end, we derive the corresponding Hessian:
$$\nabla^2 f_2(\boldsymbol{\psi}_i) = 2\left(\mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D\right), \quad \forall i.$$
In this Hessian expression, $\mathbf{P}_i^{(k+1)} \succeq 0$ is a PSD matrix by definition, $\lambda_i \mathbf{I}_D \succ 0$ is a PD matrix, and therefore $\mathbf{P}_i^{(k+1)} + \lambda_i \mathbf{I}_D \succ 0$ is a PD matrix. Thus, the Hessian is PD, $\nabla^2 f_2(\boldsymbol{\psi}_i) \succ 0$, so $f_2(\boldsymbol{\psi}_i)$ is strongly convex in $\boldsymbol{\psi}_i$ and the solution to sub-problem (S2) is unique.
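Since each sub-problem is an unconstrained, strongly convex quadratic, the per-block BCD updates admit closed-form, ridge-regression-like solutions. The sketch below is a minimal illustration under these assumptions; the variable names are ours, and the construction of $\mathbf{r}_u$, $\mathbf{Q}_u$, $\mathbf{t}_i$, $\mathbf{P}_i$ from the sounded RSE entries follows the MF formulation in the main text.

```python
import numpy as np

def bcd_update_theta(Q_u: np.ndarray, r_u: np.ndarray, mu_u: float) -> np.ndarray:
    """Solve (S1): minimize -2*theta^T r_u + theta^T (Q_u + mu_u I) theta.
    Setting the gradient to zero gives (Q_u + mu_u I) theta = r_u."""
    D = Q_u.shape[0]
    return np.linalg.solve(Q_u + mu_u * np.eye(D), r_u)

def bcd_update_psi(P_i: np.ndarray, t_i: np.ndarray, lam_i: float) -> np.ndarray:
    """Solve (S2): minimize -2*t_i^T psi + psi^T (P_i + lam_i I) psi."""
    D = P_i.shape[0]
    return np.linalg.solve(P_i + lam_i * np.eye(D), t_i)

# Because Q_u + mu_u*I and P_i + lam_i*I are positive definite (see the proof
# above), each linear system is well posed and the per-block minimizer is unique.
```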

Figure 1. Proposed  B A  diagram representation: (a) fully analog MIMO architecture using a single RF chain at the  U E  and multiple RF chains at the  B S ; (b) simplified illustration of the Beam Alignment problem.
Figure 2. Exhaustive Beam Alignment: $|T| = |R| = 4$, $N_{rf} = 2$ RF chains at the  B S . Two beam pairs are recorded for each pilot symbol transmission until the matrix is complete. Signaling overhead: $\Omega = \frac{4 \times 4}{2}$.
Figure 3. Proposed partial Beam Alignment using sub-sampled codebooks: $|T| = |R| = 4$, $N_{rf} = 2$ RF chains. Two beam pairs are recorded for each pilot symbol transmission until all sounded beams are recorded; the missing entries are the predicted ones. Signaling overhead: $\Omega = \frac{3 \times 3}{2}$.
Figure 4. Toy example: Matrix Factorization with $|T| = 5$, $|R| = 7$, $D = 3$.  M F  yields two rectangular matrices to be optimized:  M F  uses the  R S E  of known beams (yellow) to predict/complete unknown beams (gray). The product of the latent factors $\theta_2^T$ and $\psi_5$ gives the unknown value of $RSE_{2,5}$.
Figure 5. M F / N M F  train/test performance and learning curves: (a) 512 × 512 train/test loss as a function of the overhead ratio; (b) learning curve: 256 × 256 with overhead 0.1, BCD MF; (c) learning curve: 1024 × 1024 with overhead 0.1, BCD NMF; (d) learning curve: 512 × 512 with overhead 0.1, BGD MF; (e) learning curve: 128 × 128 with overhead 0.1, BSGD.
Figure 6. Multi-Layer Perceptron architecture (toy example with $J = 4$).
Figure 7. M L P  learning curves: (a) 256 × 256 with overhead 0.1; (b) 512 × 512 with overhead 0.1; (c) 128 × 128 with overhead 0.3.
Figure 8. Train/test  N M S E  as a function of $P_u$ for all proposed models for 512 × 512 using the optimal overhead ratio: (a) $P_u = 1$ W; (b) $P_u = 10^{-1}$ W; (c) $P_u = 10^{-2}$ W.
Figure 9. Train/test  N M S E  as a function of $P_u$ for all proposed models for 128 × 128 using the optimal overhead ratio: (a) $P_u = 1$ W; (b) $P_u = 10^{-1}$ W; (c) $P_u = 10^{-2}$ W.
Figure 10. $\log(NMSE)$ as a function of $P_u$ for 1024 × 1024 using the optimal overhead ratio: (a)  M L P  train/test $\log(NMSE)$; (b)  M F  train/test $\log(NMSE)$.
Table 1. System parameters and hyperparameters.
System configuration for all proposed models:
| System parameter | Numerical value |
| number of antennas $N_T$ at  U E  | 128, 256, 512, 1024 |
| number of antennas $N_R$ at  B S  | 128, 256, 512, 1024 |
| codebook cardinality $|T|$ at  U E  | 128, 256, 512, 1024 |
| codebook cardinality $|R|$ at  B S  | 128, 256, 512, 1024 |
| overhead ratio $\eta$ regime | 0.7, 0.5, 0.3, 0.1 |
| number of OFDM sub-carriers $N_c$ | 64 |
| number of channel paths $L$ | 2 (NLoS) |
| transmitted power $P_u$ (W) | 1, $10^{-1}$, $10^{-2}$ |
| MF/NMF dimension $D_{MF}$ | 2, 3, 4, 5, 6 |
| MF/NMF learning rate $\alpha_k$ | $10^{-1}$, $10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$, $10^{-6}$ |
| MF/NMF regularization factors $\lambda$, $\mu$ | $10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$, $10^{-6}$, $10^{-7}$ |
| MLP number of layers $J$ | 1, 2, 3 |
| MLP number of neurons per layer $D_{MLP}$ | 8, 16, 32, 64, 128 |
| MLP batch size $B$ | 2, 4, 8, 16, 32, 64, 128 |
| MLP learning rate $\beta_k$ | $10^{-1}$, $10^{-2}$, $10^{-3}$, $10^{-4}$ |
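The cross-validation described in the text reduces to a grid search over the ranges listed in Table 1. The following sketch shows how such a grid can be enumerated for MF/NMF; it is illustrative only, and the train_and_score callback (returning a validation NMSE) is an assumed placeholder, not the authors' code.

```python
from itertools import product

# MF/NMF hyperparameter grids, taken from Table 1.
D_MF = [2, 3, 4, 5, 6]
ALPHA = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
REG = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]  # candidate values for lambda and mu

def grid_search(train_and_score):
    """Return (best_nmse, D, alpha, lam, mu) over the full Table 1 grid.
    train_and_score(D, alpha, lam, mu) -> validation NMSE (user-supplied)."""
    best = None
    for D, alpha, lam, mu in product(D_MF, ALPHA, REG, REG):
        score = train_and_score(D, alpha, lam, mu)
        if best is None or score < best[0]:
            best = (score, D, alpha, lam, mu)
    return best
```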
Table 2. Q o S  minimum overhead required for  M F / N M F  for all proposed  P u  regimes.
(a) MF/NMF | QoS minimum overhead required for $P_u = 1$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | BGD NMF {D = 2, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 8.407746 × 10^{-6} | 9.147875 × 10^{-6} |
| 256 by 256 | BGD MF {D = 3, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 4.102708 × 10^{-6} | 7.344720 × 10^{-6} |
| 512 by 512 | BGD MF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 8.374633 × 10^{-7} | 9.417057 × 10^{-7} |
| 1024 by 1024 | SGD NMF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.01} | 0.1 | 1.219227 × 10^{-7} | 1.616363 × 10^{-7} |
(b) MF/NMF | QoS minimum overhead required for $P_u = 10^{-1}$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | SGD NMF {D = 2, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 0.000191 | 0.000276 |
| 256 by 256 | SGD NMF {D = 3, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 4.648861 × 10^{-5} | 5.775554 × 10^{-5} |
| 512 by 512 | BGD NMF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 1.052556 × 10^{-5} | 1.170430 × 10^{-5} |
| 1024 by 1024 | BGD NMF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.001} | 0.1 | 1.600790 × 10^{-6} | 1.695907 × 10^{-6} |
(c) MF/NMF | QoS minimum overhead required for $P_u = 10^{-2}$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | SGD MF {D = 2, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 1 × 10^{-6}} | 0.1 | 0.115517 | 0.118776 |
| 256 by 256 | BGD MF {D = 3, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 0.0001} | 0.1 | 0.016475 | 0.016679 |
| 512 by 512 | SGD NMF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 1 × 10^{-6}} | 0.1 | 0.003371 | 0.003449 |
| 1024 by 1024 | BGD MF {D = 4, ($\lambda$, $\mu$) = (0.0001, 0.0001), $\alpha_k$ = 1 × 10^{-5}} | 0.1 | 0.001681 | 0.001948 |
Table 3. Q o S  minimum overhead required for  M L P  for all the proposed  P u  regimes.
(a) MLP | QoS minimum overhead required for $P_u = 1$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | {(J = 3, D = 8), B = 4, $\beta_k$ = 0.0001} | 0.1 | 0.001144 | 0.002639 |
| 256 by 256 | {(J = 3, D = 16), B = 16, $\beta_k$ = 0.001} | 0.1 | 3.941522 × 10^{-5} | 3.948157 × 10^{-6} |
| 512 by 512 | {(J = 3, D = 64), B = 32, $\beta_k$ = 0.0001} | 0.1 | 3.305507 × 10^{-5} | 3.335168 × 10^{-5} |
| 1024 by 1024 | {(J = 3, D = 64), B = 64, $\beta_k$ = 0.0001} | 0.1 | 9.810028 × 10^{-6} | 9.857067 × 10^{-6} |
(b) MLP | QoS minimum overhead required for $P_u = 10^{-1}$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | {(J = 3, D = 8), B = 4, $\beta_k$ = 0.0001} | 0.1 | 0.007569 | 0.007662 |
| 256 by 256 | {(J = 3, D = 16), B = 16, $\beta_k$ = 0.001} | 0.1 | 0.000139 | 0.000288 |
| 512 by 512 | {(J = 3, D = 64), B = 32, $\beta_k$ = 0.0001} | 0.1 | 5.419598 × 10^{-5} | 5.756302 × 10^{-5} |
| 1024 by 1024 | {(J = 3, D = 64), B = 64, $\beta_k$ = 0.0001} | 0.1 | 1.184073 × 10^{-5} | 1.72301 × 10^{-5} |
(c) MLP | QoS minimum overhead required for $P_u = 10^{-2}$ W
| MIMO setup | Optimal hyperparameters | Min overhead | Train NMSE | Test NMSE |
| 128 by 128 | {(J = 3, D = 8), B = 4, $\beta_k$ = 0.0001} | 0.1 | 0.049559 | 0.071185 |
| 256 by 256 | {(J = 3, D = 16), B = 16, $\beta_k$ = 0.001} | 0.1 | 0.017011 | 0.017634 |
| 512 by 512 | {(J = 3, D = 64), B = 32, $\beta_k$ = 0.0001} | 0.1 | 0.000141 | 0.000666 |
| 1024 by 1024 | {(J = 3, D = 64), B = 64, $\beta_k$ = 0.0001} | 0.1 | 1.700140 × 10^{-4} | 1.702889 × 10^{-4} |