Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

Ktari, Aymen; Ghauch, Hadi; Rekaya-Ben Othman, Ghaya

doi:10.3390/e26080626

Open AccessArticle

Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

by

Aymen Ktari

^*

,

Hadi Ghauch

and

Ghaya Rekaya-Ben Othman

Télécom Paris, 91120 Paris, France

^*

Author to whom correspondence should be addressed.

Entropy 2024, 26(8), 626; https://doi.org/10.3390/e26080626

Submission received: 2 April 2024 / Revised: 14 June 2024 / Accepted: 10 July 2024 / Published: 25 July 2024

(This article belongs to the Topic AI and Computational Methods for Modelling, Simulations and Optimizing of Advanced Systems: Innovations in Complexity)

Download

Browse Figures

Versions Notes

Abstract

:

This paper proposes methods for Machine Learning (ML)-based Beam Alignment (BA), using low-complexity ML models, and achieves a small pilot overhead. We assume a single-user massive mmWave MIMO, Uplink, using a fully analog architecture. Assuming large-dimension codebooks of possible beam patterns at

U E

and

B S

, this data-driven and model-based approach aims to partially and blindly sound a small subset of beams from these codebooks. The proposed BA is blind (no CSI), based on Received Signal Energies (RSEs), and circumvents the need for exhaustively sounding all possible beams. A sub-sampled subset of beams is then used to train several ML models such as low-rank Matrix Factorization (MF), non-negative MF (NMF), and shallow Multi-Layer Perceptron (MLP). We provide an extensive mathematical description of these models and the algorithms for each of them. Our extensive numerical results show that, by sounding only

10 %

of the beams from the

U E

and

B S

codebooks, the proposed ML tools are able to accurately predict the non-sounded beams through multiple transmitted power regimes. This observation holds as the codebook sizes at

U E

and

B S

vary from

128 \times 128

to

1024 \times 1024

.

Keywords:

mmWave MIMO; massive antennas; ML-based Beam Alignment; blind BA; Matrix Factorization; Multi-Layer Perceptron; non-linear regression

1. Introduction

Driven by the explosive growth trend of large-scale connectivity and higher data rate systems, wireless data traffic is expected to exponentially increase, growing to 5 zettabytes per month and reaching a 100 Gps data rate by 2030 [1] Thus, the latency in the 6th Generation is predicted to reach 0.1 ms, representing

10 %

of

5 G

latency, in order to support new emerging technical needs, including holographic images, Internet of Things applications, and autonomous driving.

Beam Alignment is frequently defined in the literature as beam sounding, i.e., beam training. It illustrates a fundamental problem in millimeter-wave Multiple Input, Multiple Output systems, defined as the exchange of information between the user equipment

U E

and the base station

B S

in order to accurately select the optimal beam-steering direction. The process of aligning the beams is related to several technical problems, such as beam forming, beam sweeping, beam tracking, and beam selection. The whole framework that unites these operations between

U E

and

B S

is often denoted as the Beam Management. To fulfill the BA task, beam patterns stored in large codebooks are used at both

U E

and

B S

. In fact, pencil beams with directional gain are increasingly being used in several applications in order to alleviate the severe path-loss attenuation and increase capacity and data throughput. On the other hand, massive MIMO systems provide large gain in spectral and energy efficiencies compared with conventional MIMO systems. Using mmWave technology, these systems mainly offer a better communication quality by increasing the system bandwidth and reducing the effects of noise and interference. Due to the diversification of future

5 G

and towards

6 G

applications and intelligent systems, scientists predict the continuous generation of massive datasets for deep processing through large bandwidths, which introduces mmWave bands as the golden spectrum band candidates. However, the limitations of mmWave communication physical properties of the channel are crucial: scattering, attenuation, low coherence time related to the Doppler effect, penetration loss, environmental constraints, and complex channel modeling in realistic urban scenarios. The major problem we aim to encounter in this paper is the inevitable high signaling/training overhead. For this reason, the main trade-off is to browse the most accurate and the least complex

M L

algorithm that optimizes finding the optimal beam pair based on sounded instantaneous Received Signal Energies and using the minimum (possible) amount of training samples.

Contributions: In this current work, we propose ML-based BA methods, for a single user massive mmWave MIMO, Uplink, with a wide-band channel. We assume a single radio frequency chain with large codebooks of possible analog beams at BS (also known as BS codebook) and UE (also known as UE codebook). We define a beam pair as one beam from the BS and UE codebook. By approximating the SNR with the Receive Signal Energy (RSE), we bypass the need for CSI, i.e., a blind approach. We sub-sample large codebooks into smaller sub-sampled BS and UE codebooks, and sound the beam pairs from the sub-sampled codebooks to generate the training set—a novelty of the approach. Using the RSE of the sounded beam pairs (sub-sampled codebooks), we propose to train the following ML methods to predict the RSE of the beam pairs that were not sounded: Matrix Factorization (MF), non-negative Matrix Factorization (NMF), and feed-forward (shallow) Multi-Layer Perceptron (MLP).

We formulate the MF and NMF problems. We propose to use Block Coordinate Descent (BCD) and Block Gradient Descent (BGD) methods to solve each problem. We derive in depth all the update equations for these methods. We show that the BCD method converges to a stationary point from both MF and NMF problems. Our extensive numerical results show that, sub-sampling $10 %$ of the BS/UE codebooks, the remaining RSE values can be predicted extremely well (with a training/test error $\approx 10^{- 6}$ ) for every antenna configuration.
We develop at length the equations of a general MLP model, the resulting loss function, and the corresponding optimization problem. In addition, we derive the equations of back-propagation for the MLP in question. Using extensive numerical results, we observe that sounding $10 %$ of original codebooks is sufficient to predict the RSE of the beam pairs that were not sounded, with negligible training/test error.
We numerically compare the training/test losses of all the proposed models for a varying cardinality of codebooks and transmit powers. These results suggest that the BCD method for MF/NMF outperforms the MLP in terms of training and test error. Meanwhile, BCD for MF/NMF has a large computational complexity and the MLP exhibits medium complexity.
Interestingly, by sounding $10 %$ of the BS/UE codebooks, the proposed ML models can predict the unknown RSE (beam pairs not sounded) with a negligible test error. Thus, the proposed methods achieve a $90 %$ reduction in pilot signaling overhead, compared with the SotA benchmark, without any noticeable loss in performance.

Notations: Matrices and vectors are respectively written in boldface upper-case and lower-case letters. We use

T r [A], A^{T}, A^{- 1}, A^{H} {, | A |, | | A | |}_{F}

for the trace, transpose, inverse, conjugate transpose, determinant, and Frobenius norm of a matrix

A

and the

n \times n

identity matrix.

{[A]}_{i, j}

is used to denote the (i, j)th entry of a matrix

A

. We denote the Hadamard product by ∘, while

{[a]}_{+} : = max (a, 0)

illustrates a Euclidean projection of

a

on

R_{+}^{D}

and is applied element by element on

a

. We denote

| x |

the absolute value of x and

{[x]}_{t}

as the entry t of a vector

x

.

Methods/Experiment: The proposed approach is data driven and model based. The dataset is generated following the Saleh Valenzuela wide-band mmWave system model. It is based on Received Signal Energies for each and every beam pair in the massive MIMO Uplink setup stored in separate .csv files. The model-based solution to the empirical risk minimization includes deriving a closed-form solution to the formulated non-convex optimization problem, stating the theoretical guarantees of convergence and empirically illustrating the success of the proposed partial and blind Beam Alignment procedure using different algorithms. All simulations are executed on Infres GPU servers and the Comelec laboratory PC at Télécom Paris, having the following characteristics: Intel(R) Core(TM) i5-8365U CPU @ 1.60 GHz, 16 Go (RAM), x64 processor, and 64-bit operating system under the license of Windows 10 Enterprise LTSC 2018, version 1809. The manufacturer is Dell and is located in Paris, France. All python packages used in this work (numpy, scipy, keras, pytorch, matplolib..) are related to python 3.9 release. In fact, the experimental protocol is based on offline grid-search cross-validation, which requires GPU processing for the selection of optimal hyperparameters and online training/prediction for Matrix Factorization, non-negative Matrix Factorization, and Multi-Layer Perceptron. The comparison is conducted following a Quality of Service-based approach, simulating a variety of MIMO configurations and architectural setups, investigating the impact of varying the Received Signal Energy regime and empirically stating intersections and differences in the impact of the transmit power on model behaviors, loss values, optimal signaling overhead ratio, and optimal hyperparameters.

Problem Statement: The main challenge addressed in this study is the high signaling overhead in Beam Alignment for mmWave MIMO systems, which hampers the efficient selection of optimal beam-steering directions.
Research Questions and Hypotheses: This study investigates whether machine learning methods can effectively reduce the signaling overhead required for accurate beam-pair prediction in mmWave MIMO systems.
Objectives and Aims: The primary objective is to develop and evaluate ML-based BA methods that minimize the training overhead while maintaining high accuracy in predicting the RSE for unsounded beam pairs.
Significance and Rationale: The study proposes a novel approach to BA using ML techniques, which can lead to a substantial reduction in pilot signaling overhead and enhance the efficiency of future wireless communication systems.

2. Literature Survey

In conventional standards, Exhaustive BA, also called Brute Force BA, is the de facto approach for the alignment process. It is based on sounding all available beams at both

U E

and

B S

codebooks in order to exhaustively select the optimal beam pair. One obvious drawback is the fact that the resulting signaling overhead scales as the product of the

U E

and

B S

codebook sizes. At 60 GHz, the Exhaustive BA has been adopted in several mmWave

W L A N

or

W P A N

communication technologies, e.g., IEEE 802.15.3c [2] and IEEE 802.11ad [3]. It is conventionally applied in small MIMO configurations using small codebook sizes (e.g., codebooks of size

8 \times 8

for

L T E

) and guarantees optimal performance. For cellular networks [4], V2X communications, Unmanned Aerial Vehicles, or High-Speed Train applications, the infeasibility of brute-force-based BA pushes scientists to reduce the large signaling overhead from using massive antennas systems. State-of-the-art methods can be divided into two categories: classic BA and learning-based BA. Traditional techniques tend to use a more and more structured Beam Alignment design such as hierarchical multi-level codebooks [5] (training beamforming vectors are constructed with different beam widths at different levels) and an overlapped beam pattern [6], where the main idea is to augment the amount of information carried by each channel measurement, reducing the required channel estimation time and beam coding [7], where we assign a unique code signature to each beam angle in addition to subspace estimation/decomposition-based

B A

[8]. Compressed sensing-based algorithms [9] are also used in this context, taking advantage of channel sparsity. Therefore, we state two intersections in classic methods: they generally rely on

C S I

exchange and Exhaustive BA. In contrast, lately, Machine Learning (

M L

)-based BA has emerged and is continuously leading to some promising results. For instance, statistical models such as Kolmogorov model-based BA in [10] with sub-sampled codebooks reduce the signaling overhead:

15 %

of Exhaustive

B A

provides accurate predictions for optimal beams at

U E

and

B S

in a partial

B A

procedure, similar to our approach. Deep learning through shallow neural networks is increasingly used by Wireless Communication scientists, where we distinguish two major paradigms: first, the ML methods related to Supervised Learning (

S L

) via a Support Vector Machine and Multi-Layer Perceptrons for joint analog beam selection in [11], convolutional neural networks for beam management in sub-6 GHz in [12] and for calibrated beam training in [13], recurrent neural networks such as Long Short-Term Memory network for beam tracking in [14,15,16], auto-encoders for beam management in [17], and several other neural architectures, and second, Reinforcement Learning (

R L

) in [18,19,20], generally used to resolve the problems of Multi-Armed Bandit and Markov decision process. In addition, neural architectures have the ability to extract features from the hidden interactions between

B S

and

U E

, providing fast and accurate estimations through different MIMO setups and channel realizations, especially when applied to massive datasets where more and more data/train samples are embedded. This work is an extension of [21]. In this paper, we extend the channel model to wide-band and we add multiple RF-chains at

B S

in a fully analog low-complexity architecture, where we investigate more

M L

tools for partial and blind

B A

. This paper is one of the first attempts to apply

M F / N M F

models and shallow Multi-Layer Perceptrons to a blind and partial Beam Alignment for massive mmWave SU-MIMO. Our work in [22] is related to the same approach and objectives, where we quantize the output of each RF-chain.

3. System Model

In this section, we illustrate the mmWave MIMO point-to-point system model. We consider an Uplink transmission from multiple-antenna user equipment

U E

using a single radio frequency chain and a multiple-antenna base station

B S

using multiple radio frequency chains. The proposed

M L

methods are performed at the

B S

, which has higher computational resources than

U E

. Figure 1a,b provide a diagram representation of the proposed architecture.

U E

and

B S

are respectively equipped with Uniform Linear Arrays of

N_{T}

and

N_{R}

antenna. We propose a low-cost/complexity fully analog architecture where

U E

has one radio frequency chain and

B S

has

N_{r f}

radio frequency chains.

U E

selects its analog beamformer

f_{u} \in C^{N_{T}}

from a codebook of feasible beam choices,

u \in T

, where

T

is the corresponding index set. Moreover,

B S

selects its analog combiner

W_{i} \in C^{N_{R} \times N_{r f}}

from a codebook

i \in R

with

R

as the index set of the codebook. We denote with

C_{T}

the number of possible beamforming vectors at

U E

, i.e., the size/cardinality of the

U E

codebook,

| T | = C_{T}

and

C_{R}

, and the size/cardinality of the

B S

codebook,

| R | = C_{R}

. Both beamforming and combining are fully performed in the analog domain using phase shifters at

U E

and

B S

; thus, they satisfy the following constant modulus constraints,

\forall r \in {1, \dots, N_{R}}, \forall t \in {1, \dots, N_{r f}}

:

W_{i} \in C^{N_{R} \times N_{r f}}, | {[W_{i}]}_{r, t} | = \frac{1}{N_{r f} N_{R}}

f_{u} \in C^{N_{T}}, | {[f_{u}]}_{t} | = \frac{1}{N_{T}}, \forall t \in {1, \dots, N_{T}}

For our proposed approach,

B S

is responsible for receiving signal energies, denoted as

R S E

, in order to learn their patterns and features for the purpose of accurately predicting the optimal beam indexes from their corresponding codebooks and send them to

U E

. We adopt the wide-band channel model

G \in C^{N_{R} \times N_{T}}

given by

\begin{matrix} G (k) = \sqrt{\frac{1}{N_{c}}} \sum_{l = 1}^{N_{c}} H_{l} e^{- j 2 π l k / N_{c}}, \forall k \in {1, \dots, N_{C}} \end{matrix}

(1)

where

N c

represents the number of sub-carriers over the whole bandwidth through an

O F D M

scenario, k is the index of the sub-carrier k, and

H_{l} \in C^{N_{R} \times N_{T}}

is the narrow band channel model representing the time domain channel impulse response with L-tapped delays given by

H_{l} = \sqrt{\frac{N_{T} N_{R}}{L}} \sum_{i = 1}^{L} ρ_{i} a_{R} (θ_{i}^{(R)}) a_{T}^{H} (θ_{i}^{(T)})

, where L is number of paths (rank) of the channel;

θ_{i}^{(R)}

and

θ_{i}^{(T)}

are the angles of arrival at

B S

and the angles of departure from

U E

, noting AoA/AoD to correspond to the

i^{t h}

path (and both assumed to be uniform over

[- π / 2, π / 2]

);

ρ_{i}

is the complex gain of the

i^{t h}

path such that

ρ_{i} \sim CN (0, 1), \forall i

; and last but not least,

a_{R} (θ_{i}^{(R)}) \in C^{N_{R}}

and

a_{T} (θ_{i}^{(T)}) \in C^{N_{T}}

are the array response vectors at both

U E

and

B S

, respectively. We further assume that the channel is completely unknown to both

U E

and

B S

. Henceforth, in this paper, we shall denote the beam pair

(u, i)

as the combination of the

U E

beamformer indexed u from the

U E

codebook

T

and combiner indexed i in the

B S

codebook

R

. The signal at

B S

resulting from applying the beam pair

(u, i)

,

y_{u, i} \in C^{N_{r f}}

is expressed as

\begin{matrix} y_{u, i} & = {W_{i}}^{H} G (k) f_{u} s_{u} + n_{i}, \forall (u, i) \in T \times R, \end{matrix}

(2)

where

s_{u} = 1 \sqrt{P_{u}}

is the transmitted pilot symbol associated with

f_{u}

(having power

\sqrt{P_{u}}

) and

n_{i} = W_{i}^{H} n

is the effective additive white Gaussian noise

A W G N

with unit variance (

σ^{2} = 1

). We define the received Signal-to-Noise Ratio (

S N R

) for the beam pair

(u, i)

as

{SNR}_{u, i} = P_{u} | | W_{i}^{H} G (k) f_{u} {| |}_{2}^{2}, \forall (u, i) \in T \times R

. We assume a fully blind approach; i.e., neither

B S

nor

U E

has any knowledge of

G

. Thus, computing the above

S N R

expression is not feasible due to the fact that BS is assumed not to know

G

. Thus, in this work, we will approximate the

S N R

of the beam pair

(u, i)

using the corresponding instantaneous Received Signal Energies (

R S E s

) expressed as

{RSE}_{u, i} = | | y_{u, i} {| |}_{2}^{2}, \forall (u, i) \in T \times R

. In other words, we will assume that

R S E_{u, i} \approx {SNR}_{u, i}

for each beam pair

(u, i) \in T \times R

.

Benchmark: Exhaustive $B A$ : The de facto method for Beam Alignment is Exhaustive

B A

. It is accomplished by exhaustively sounding, jointly, the beams of both

U E

and

B S

codebooks, recording all entries of

R S E

, and exhaustively searching

S

for the indexes of the beam pair that maximize

R S E

at

B S

, i.e,

(u^{★}, i^{★}) = \underset{(u, i) \in T \times R}{a r g m a x} R S E_{u, i}

. Thus, the

R S E

matrix is computed/recorded

N_{r f}

-entries, with each of pilot symbol, since

N_{r f}

samples are simultaneously received at the

B S

for every pilot transmission (see Figure 2). Consequently, the pilot signaling overhead of the Exhaustive

B A

is

Ω = | T \times R | / N_{r f} = C_{T} C_{R} / N_{r f}

, which implies that the overhead of this benchmark scales poorly with the

B S

and

U E

codebooks.

Proposed partial Beam Alignment using sub-sampled codebooks: Recall the designation of the beam pair

(u, i)

as the beamforming vector of the index u in the

U E

codebook of beams and the combining vector of the index i in the

B S

codebook of beams. First, we select (at random) the indexes of the sub-sampled codebooks of beams at

U E

and

B S

,

R_{S}

and

T_{S}

, such that

R_{S} \subset R

and

T_{S} \subset T

, and

| R_{S} | ≪ | R |

| T_{S} | ≪ | T |

. The idea behind this approach is to only sound beam pairs from the sub-sampled codebook of beams,

R_{S}

and

T_{S}

. We thus define the training set,

K

, as the sub-sampled codebook indexes at

U E

and

B S

, i.e.,

K : = {(u, i) | (u, i) \in T_{S} \times R_{S}}

. Then, the

R S E

of the sounded beam pairs (training set) is given to several ML methods, and the learned ML model is used to predict the

R S E

of non-sounded beam pairs.

We formalize this proposed method below. We express both the received signal

y_{(u, i)}

and

R S E

for the beam pair

(u, i)

resulting from the sounded beam pairs (i.e., training set), as follows:

\begin{matrix} y_{u, i} = {W_{i}}^{H} G (k) f_{u} s_{u} + n_{i}, \forall (u, i) \in T_{S} \times R_{S} \end{matrix}

(3)

\begin{matrix} {RSE}_{u, i} = {∥ y_{u, i} ∥}_{2}^{2}, \forall (u, i) \in T_{S} \times R_{S} . \end{matrix}

(4)

The dataset is formulated using the following incomplete

R S E

matrix,

S \in R^{C_{T} \times C_{R}} (: = R^{| T | \times | R |})

:

\begin{matrix} {[S]}_{u, i} : = \{\begin{matrix} {RSE}_{u, i} & , i f (u, i) \in T_{S} \times R_{S} \\ U n k n o w n R S E & , i f (u, i) \notin T_{S} \times R_{S} \end{matrix} \end{matrix}

(5)

where

{[S]}_{u, i}

denotes the element

(u, i)

of

S

,

\forall (u, i) \in T \times R

. Evidently, the value of

R S E

is undefined for the beam pairs that were not sounded, designated as unknown-RSE matrix coefficient. Those are the missing entries, which are predicted using one of the following proposed

M L

methods: (i) low-rank

M F / N M F

and (ii) shallow (feed-forward)

M L P

, where we utilize the sounded

R S E

entries as the training set,

K

. Then the training set,

K

, is fed into one of the above ML models, which will predict the

R S E

of non-sounded coefficients in

S

, denoted as ‘Unknown’, in (5) (see Figure 3). Finally, the pilot signaling overhead for the above-proposed sub-sampled codebook method is

Ω = | T_{S} \times R_{S} | / N_{r f} = | K | / N_{r f}

. We split the RSE dataset into a training set

K

and a test set

L

such that

K \cap L = {}

. In this paper,

R S E_{u, i}

denotes the true value (label) of the RSE for the beam pair

(u, i)

in the training set

K

, and

\hat{R S E_{u, i}}

denotes the true value (label) of the RSE for the beam pair

(u, i)

in the test set

L

.

Signaling overhead ratio: It is defined as

η : = \frac{o v e r h e a d o f l e a r n i n g - b a s e d B A}{o v e r h e a d o f E x h a u s t i v e B A} = \frac{| T_{S} | \times | R_{S} |}{| T | \times | R |} = \frac{| K |}{C_{T} C_{R}}

, where

T_{S}

and

R_{S}

are, respectively, the sizes of the

U E

and

B S

sub-sampled codebooks used in our proposed partial beam sounding, while

T

and

R

refer to the original size of the codebooks, and

0 < η \leq 1

measures the signaling overhead of all the proposed

M F

,

M L P

, and

A E

methods compared with that of Exhaustive

B A

. Evidently, a small value for

η

is desired to reduce the signaling overhead of our proposed method. However, a low

η

implies that the size of the training set is small. As a result, the proposed

M L

method will not be able to extract enough data patterns due to the (too) small number of training samples, resulting in a larger prediction error. As one of the contributions of this work, we will (empirically) find as small a value for

η

as possible while still having extremely small training and prediction error.

Conjecture: Note that, from the equations of the narrow-band channel model

H

and the wide-band channel model

G (k)

, it is simple to verify that

r a n k (H) \leq L

and

r a n k (G (k)) \leq L N_{C}

. Assume that

P_{u} \to \infty

. Thus, we can approximate the RSE matrix as

\begin{matrix} {[S]}_{u, i} = ∥ y_{u, i} ∥_{2}^{2} = ∥ {W_{i}}^{H} G (k) f_{u} \sqrt{P_{u}} + n_{i} ∥_{2}^{2} \overset{P_{u} \to \infty}{\approx} P_{u} {∥ {W_{i}}^{H} G (k) f_{u} ∥}_{2}^{2}, \forall (u, i) \in T \times R \end{matrix}

(6)

If

P_{u} \to \infty

, then it can be shown that the RSE matrix

S

is such that

r a n k (S) \leq L N_{C}

. This implies that if

P_{u} \to \infty

, then

S \in R^{C_{R} \times C_{T}}

is a low-rank matrix, i.e.,

r a n k (S) \leq L N_{C} ≪ min (C_{T}, C_{R})

.

While the proof for this necessary condition eludes the authors, we empirically observed that if

P_{u}

is large, then the number of non-zero singular values of

S

,

{σ_{i} (S)}_{i = 1}^{r a n k (S)}

, satisfies the above upper bound, i.e.,

| {σ_{i} (S)}_{i = 1}^{r a n k (S)} | \leq L N_{C}

.

Remark 1.

Recall the expression for the effective rate, r,

r = (1 - \frac{Ω}{T}) log (1 + R S E_{u, i})

, where Ω is the pilot signaling overhead and

T

is the number of symbols per block. Thus, the problem of maximizing r is written as the following series of equivalent problems:

(u^{★}, i^{★}) : = arg {max}_{\forall (u, i) \in T \times R} r \Leftrightarrow arg {max}_{\forall (u, i) \in T \times R} log (1 + R S E_{u, i}) \Leftrightarrow arg {max}_{\forall (u, i) \in T \times R}

R S E_{u, i}

, where the last ⇔ is due to the fact that the

log (x)

is a strictly monotonically increasing function in x. This result implies finding the optimal beam pair

(u^{★}, i^{★})

that maximizes r is equivalent to finding the best beam pair that maximizes the

R S E

.

Remark 2.

The information (number of entries) needed to represent the

R S E

matrix

S \in C^{C_{R} \times C_{T}}

is measured as

r a n k (S) (1 + C_{T} + C_{R})

. This result is evident from performing the

S V D

on

S

and counting the resulting number of entries. Thus, if

S

is severely rank deficient, i.e., extremely compressible, then methods such as

M F / N M F

will exhibit extremely small training and test error. Conversely, if

S

is full rank, i.e., not compressible, then the training and test of

M F / N M F

will be quite large.

4. Matrix Factorization and Non-Negative Matrix Factorization

4.1. MF and NMF Problem Formulation

The intuition behind low-rank

M F

is to model the

R S E

of the sounded beam pairs (i.e., entries of

S

that are known as

T_{S} \times R_{S}

) as an inner product between two D-dimensional latent vectors/factors,

θ_{u}, ψ_{i}

, as illustrated in Figure 4. Specifically, the

R S E

of the beam pair

(u, i)

, denoted as

{[S]}_{u, i}

, is modeled as

{[S]}_{u, i} : = θ_{u}^{T} ψ_{i}, θ_{u} \in R^{D}, ψ_{i} \in R^{D}

,

\forall (u, i) \in K (: = T_{S} \times R_{S})

, where D is the size/dimension/complexity of the Matrix Factorization model latent factors and

θ_{u} \in R^{D}, ψ_{i} \in R^{D}

are the

M F

model parameters (to be optimized). In addition, due to the low-rank

M F

model, D is assumed to be much smaller than the dimensions of

S

, i.e.,

D ≪ (C_{T}, C_{R})

. The

R S E

of the beam pair

(u, i)

is known from sounding the sub-sampled codebooks (i.e., label). The general formulation of our loss function

ℓ_{u, i}

describes the distance between the true value

R S E_{u, i}

and the predicted value

θ_{u}^{T} ψ_{i}

, which corresponds to the

M F

output/prediction:

ℓ_{u, i} : = {(R S E_{u, i} - θ_{u}^{T} ψ_{i})}^{2}, \forall (u, i) \in K (: = T_{S} \times R_{S})

. The Empirical Risk (also known as training error) is defined as the average across all the individual loss function

ℓ_{u, i}

. We define the regularized Empirical Risk function as the above empirical risk in addition to the following regularization terms:

\begin{matrix} \sum_{(u, i) \in K} [\frac{1}{| K |} {({[S]}_{u, i} - θ_{u}^{T} ψ_{i})}^{2} + λ_{i} ∥ ψ_{i} ∥_{2}^{2} + μ_{u} {∥ θ_{u} ∥}_{2}^{2}] = f ({(θ_{u}, ψ_{i})}_{(u, i) \in K}) \end{matrix}

(7)

where

{λ_{i} \geq 0, μ_{u} \geq 0 | \forall (u, i) \in K}

is the set of regularization hyperparameters used to balance the

M F / N M F

model, preventing any overfitting or underfitting. The Empirical Risk Minimization corresponding to the

M F

model is given by

\begin{matrix} (P 1) : = {{\hat{θ}}_{u}, {\hat{ψ}}_{i}} \{\begin{matrix} \underset{{θ_{u}, ψ_{i}}_{(u, i) \in K}}{a r g m i n} f (θ_{u}, ψ_{i}) \\ s . t . θ_{u} \in R^{D}, ψ_{i} \in R^{D} \end{matrix} \end{matrix}

For the Matrix Factorization variant

N M F

, the optimization problem is given by

\begin{matrix} (P 2) : = {{\hat{θ}}_{u}, {\hat{ψ}}_{i}} \{\begin{matrix} \underset{{θ_{u}, ψ_{i}}_{(u, i) \in K}}{a r g m i n} f (θ_{u}, ψ_{i}) \\ s . t . θ_{u} \in R_{+}^{D}, ψ_{i} \in R_{+}^{D} \end{matrix} \end{matrix}

where

{{\hat{θ}}_{u}, {\hat{ψ}}_{i}}

denotes the optimal latent vectors for MF and NMF. The test loss (also knows as test error) is given by applying the general loss on the unknown data samples (non-sounded beams) using optimal

M F / N M F

parameters

{\hat{θ}}_{u}

and

{\hat{ψ}}_{i}

:

= \frac{1}{| L |} \sum_{(u, i) \in L} {(\hat{R S E_{u, i}} - {\hat{θ}}_{u}^{T} {\hat{ψ}}_{i})}^{2}

, where

L

is the test set of our learning model.

4.2. Solutions for MF

We resolve the

M F

problem

(P 1)

using the following methods: (i) Block Coordinate Descent (BCD) often denoted as Alternating Least Squares (ALSs), (ii) BCD with Stochastic Gradient Descent, and (iii) Block Gradient Descent (BGD), which merges BCD and Gradient Descent (GD) definitions.

BCD for MF (BCD MF): BCD proceeds by splitting the optimizing problem

(P 1)

into sub-problems, supposing that all other blocks are known/fixed. We will show that each sub-problem is strongly convex in each block, and the BCD algorithm converges to a stationary point. The application of BCD to the

M F

problem results in two sub-problems, S1 and S2, which are solved iteratively. At iteration k, the sub-problem

(S 1)

is defined by fixing the block

{ψ_{i}^{(k)}}_{\forall i}

and the update/solve block

{θ_{u}}_{\forall u}

only, as follows:

\begin{matrix} (S 1) : θ_{u}^{(k + 1)} & = a r g m i n_{θ_{u} \in R^{d}} f ({θ_{u}, ψ_{i}^{(k)}}) \\ = \sum_{(u, i) \in K} [{({[S]}_{u, i} - θ_{u}^{T} ψ_{i}^{(k)})}^{2} + μ_{u} ∥ θ_{u} ∥_{2}^{2} + λ_{i} ∥ ψ_{i}^{(k)} ∥_{2}^{2}] \end{matrix}

Moreover, the sub-problem

(S 2)

is defined by fixing the block

{θ_{u}^{(k + 1)}}_{\forall u}

in

(P_{1})

and the update/solve block

{ψ_{i}}_{\forall i}

, only, as follows:

\begin{matrix} (S 2) : ψ_{i}^{(k + 1)} & = a r g m i n_{θ_{i} \in R^{d}} f ({θ_{u}^{(k + 1)}, ψ_{i}}) \\ = \sum_{(u, i) \in K} [{({[S]}_{u, i} - θ_{u}^{(k + 1)} ψ_{i})}^{2} + μ_{u} ∥ θ_{u}^{(k + 1)} ∥_{2}^{2} + λ_{i} ∥ ψ_{i} ∥_{2}^{2}] \end{matrix}

We will rewrite

S 1

into as series of equivalent problems as follows:

\begin{matrix} (S 1) & : = a r g m i n_{θ_{u} \in R^{d}} \sum_{(u, i) \in K} [{[S]}_{u, i}^{2} - 2 {[S]}_{u, i} θ_{u}^{T} ψ_{i}^{(k)} + θ_{u}^{T} ψ_{i}^{(k)} θ_{i}^{{(k)}^{T}} θ_{u} + μ_{u} ∥ θ_{u} ∥_{2}^{2}] \\ \Leftrightarrow a r g m i n_{θ_{u} \in R^{d}} \sum_{u} [- 2 θ_{u}^{T} \sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)}) + θ_{u}^{T} \sum_{i} (ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) θ_{u} + μ_{u} ∥ θ_{u} ∥_{2}^{2}] \\ \Leftrightarrow a r g m i n_{θ_{u} \in R^{d}} \sum_{u \in U_{i}} [- 2 θ_{u}^{T} (r_{u}^{(k)}) + θ_{u}^{T} (Q_{u}^{(k)}) θ_{u} + μ_{u} ∥ θ_{u} ∥_{2}^{2}] = \sum_{u \in U_{i}} h_{u} (θ_{u}), \\ θ_{u}^{(k + 1)} & = a r g m i n_{θ_{u} \in R^{d}} [- 2 θ_{u}^{T} r_{u}^{(k)} + θ_{u}^{T} (Q_{u}^{(k)} + μ_{u} I_{D}) θ_{u}] = f_{1} (θ_{u}), \forall u \in U_{i}, \end{matrix}

where

U_{i}

is the set of row indexes u in the RSE matrix corresponding to the column i in the known entries of the RSE matrix,

Q_{u}^{(k)} = \sum_{i} (ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}})

and

r_{u}^{(k)} = \sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})

. We derive the closed-form solution for the sub-problem S1 by finding the global min of

f_{1} (θ_{u})

, as follows:

\begin{matrix} \nabla f_{1} (θ_{u}) = 0 & \Leftrightarrow - 2 r_{u}^{(k)} + 2 (Q_{u}^{(k)} + μ_{u} I_{D}) θ_{u} = 0 \Leftrightarrow θ_{u} = {(Q_{u}^{(k)} + μ_{u} I_{D})}^{- 1} r_{u}^{(k)} \end{matrix}

Similarly, we rewrite the sub-problem (S2) into the following series of equivalent problems by stating the last one:

\begin{matrix} (S 2) : ψ_{i}^{(k + 1)} = & a r g m i n_{ψ_{i} \in R^{d}} [- 2 t_{i}^{{(k + 1)}^{T}} ψ_{i} + {ψ_{i}}^{T} (P_{i}^{(k + 1)} + λ_{i} I) ψ_{i}] = f_{2} (ψ_{i}), \forall i \in I_{u}, \end{matrix}

where

I_{u}

is the set of column indexes i in the RSE matrix corresponding to the row u in the known entries of the RSE matrix,

t_{i}^{(k + 1)} = \sum_{u} ({[S]}_{u, i} θ_{u}^{{(k + 1)}^{T}})

and

P_{i}^{(k + 1)} = \sum_{u} (θ_{u}^{(k + 1)} θ_{u}^{{(k + 1)}^{T}})

. Next, we derive a closed-form solution for the sub-problem S2 by finding the global min of

f_{2} (ψ_{i})

, as follows:

\begin{matrix} \nabla f_{2} (ψ_{i}) = 0 \Leftrightarrow - 2 t_{i}^{(k + 1)} + 2 (P_{i}^{(k + 1)} + λ_{i} I_{D}) ψ_{i} = 0 \Leftrightarrow ψ_{i} = {(P_{i}^{(k + 1)} + λ_{i} I_{D})}^{- 1} t_{i}^{(k + 1)} \\ \leftrightarrow ψ_{i}^{(k + 1)} = {((\sum_{u} (θ_{u}^{(k + 1)} θ_{u}^{{(k + 1)}^{T}})) + λ_{i} I_{D})}^{- 1} (\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k + 1)}^{T}})) \end{matrix}

Thus, BCD updates to solve MF are given as follows:

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = (\sum_{i} ψ_{i}^{(k)} {(ψ_{i}^{(k)})}^{T}) + μ_{u} {I)}^{- 1} (\sum_{i} {[S]}_{u, i} ψ_{i}^{(k)}) \\ ψ_{i}^{(k + 1)} & = {((\sum_{u} θ_{u}^{(k + 1)} {(θ_{u}^{(k + 1)})}^{T}) + λ_{i} I)}^{- 1} (\sum_{u} {[S]}_{u, i} θ_{u}^{(k + 1)}) \end{matrix} \end{matrix}

\forall (u, i) \in K, k = 0, 1, \dots, I_{M}

(8)

where ^(k) represents the index of the BCD iterations, (u,i) are the codebook indexes at

U E

and

B S

, and

{[S]}_{u, i}

denotes the

R S E

of the (u,i) beam couple. The solution

{{\hat{θ}}_{u}, {\hat{ψ}}_{i}}_{(u, i) \in K}

is reached after the interval/gap between consecutive iterations reaches a predefined

ϵ

or a max number of iterations,

I_{M}

. We have the following result.

Corollary 1.

The sequence of updates

{θ_{u}^{(k)}, ψ_{i}^{(k)} | \forall (u, i) \in K}_{k}

generated by BCD, in (8), is non-increasing (in k) and converges to a stationary point as

k \to \infty

.

Proof.

See Appendix A. □

Block Stochastic Gradient Descent (BSGD) for MF (SGD MF): SGD MF proceeds by taking T plain SGD steps (mini-batch size

= 1

). BGD proceeds by taking T SGD steps for each block BCD. We first choose at random a single training sample

(u, i) \in K

. The BSGD update for the sub-problem (S1) is done by performing SGD for

f_{1} (θ_{u}) = \sum_{u \in U_{i}} h_{u} (θ_{u})

, i.e., choosing at random a single index

u \in U_{i}

and computing the plain SGD

\hat{\nabla f_{1} (θ_{u})} = \hat{\nabla} (\sum_{u \in U_{i}} h_{u} (θ_{u})) = h_{u} (θ_{u})

, where u is a random index from

U_{i}

, and

\hat{\nabla f_{1} (θ_{u})}

is the plain SGD on

f_{1} ()

. The corresponding update is given as

\begin{matrix} θ_{u}^{(k + 1)} & = θ_{u}^{(k)} - α_{k} \hat{\nabla f_{1} (θ_{u}^{(k)})}, = θ_{u}^{(k)} - α_{k} \nabla h_{u} (θ_{u}^{(k)}) u \in U_{i} \\ = θ_{u}^{(k)} + 2 α_{k} ((\sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})) - ((\sum_{i} ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) + μ_{u} I_{D}) θ_{u}^{(k)}), u \in U_{i}, k = 1 . . T \end{matrix}

where u is a single index chosen at random from

U_{i}

,

Q_{u}^{(k)} = \sum_{i} (ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}})

,

r_{u}^{(k)} = \sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})

, ^(k) is the iteration index for SGD, and

\hat{\nabla f_{1} (θ_{u})}

is the plain SGD over one random sample

u \in U_{i}

. Similarly, the update for the sub-problem (S2) is done by taking T plain SGD steps of

f_{2} (ψ) = \sum_{i \in I_{u}} h_{i} (ψ_{i})

, i.e., the SGD,

\hat{\nabla f_{2} (ψ_{i})} = \hat{\nabla} (\sum_{i \in I_{u}} h_{i} (ψ_{i})) = h_{i} (ψ_{i})

, where i is single random index from

I_{u}

. Thus, the SGD MF update for the sub-problem (S2) is expressed as

\begin{matrix} ψ_{i}^{(k + 1)} & = ψ_{i}^{(k)} - α_{k} \hat{\nabla f_{2} (ψ_{i}^{(k)})} = ψ_{i}^{(k)} - α_{k} \nabla h_{2} (ψ_{i}^{(k)}), i \in I_{u} \\ = ψ_{i}^{(k)} + 2 α_{k} ((\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})) - (\sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})) + λ_{i} I_{D}) θ_{u}^{(k)}), i \in I_{u}, \forall k = 1 . . T \end{matrix}

where i is a single index chosen randomly from

I_{u}

,

t_{i}^{(k)} = \sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})

,

P_{i}^{(k)} = \sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})

, and

\hat{\nabla f_{2} (ψ_{i})}

is the plain SGD gradient computed with one sample

i \in I_{u}

, chosen at random. We write the SGD MF updates as

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = θ_{u}^{(k)} + 2 α_{k} ((\sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})) - ((\sum_{i} ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) + μ_{u} I_{D}) θ_{u}^{(k)}), u \in U_{i} \\ ψ_{i}^{(k + 1)} & = ψ_{i}^{(k)} + 2 α_{k} ((\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})) - (\sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})) + λ_{i} I_{D}) θ_{u}^{(k)}), i \in I_{u} \end{matrix} \end{matrix}

\forall k = 0, 1, \dots, T,

(9)

where u is a random index chosen from

U_{i}

, and i a random index from

I_{u}

.

0 \leq α_{k} \leq 1

is the step size for SGD.

BGD for MF (BGD MF): Rather than having a closed-form solution for each optimization block, BGD proceeds by taking T gradient steps for each block T gradient step. We skip the details here for space limitations. Thus, the BGD updates for the

M F

problem are expressed as

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = θ_{u}^{(k)} + 2 α_{k} ((\sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})) - ((\sum_{i} ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) + μ_{u} I_{D}) θ_{u}^{(k)}) \\ ψ_{i}^{(k + 1)} & = ψ_{i}^{(k)} + 2 α_{k} ((\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})) - (\sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})) + λ_{i} I_{D}) θ_{u}^{(k)}) \end{matrix} \end{matrix}

\forall (u, i) \in K, k = 0, 1, \dots, T,

(10)

where (u,i) are the codebook indexes at

U E

and

B S

, k is the GD iteration index, and

α^{(k)}

is the BGD step size (

0 < α^{(k)} < 1

).

4.3. Solutions for NMF

Our proposed

N M F

follows the exact steps as in

M F

, with the main difference of constraining the latent vectors being non-negative

θ_{u} \in R_{+}^{D}, ψ_{i} \in R_{+}^{D}, \forall (u, i) \in K

. Likewise, we solve the

N M F

problem,

(P 2)

, using BCD, SGD, and BGD.

BCD for NMF (BCD NMF): The derivations of BCD for

N M F

(11) are identical to those of BCD for

M F

(8), followed by the corresponding projection operation. The updates of BCD for

N M F

derivations are given by

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = {[(\sum_{i} ψ_{i}^{(k)} {(ψ_{i}^{(k)})}^{T}) + μ_{u} {I)}^{- 1} (\sum_{i} {[S]}_{u, i} ψ_{i}^{(k)})]}_{+} \\ ψ_{i}^{(k + 1)} & = {[{((\sum_{u} θ_{u}^{(k + 1)} {(θ_{u}^{(k + 1)})}^{T}) + λ_{i} I)}^{- 1} (\sum_{u} {[S]}_{u, i} θ_{u}^{(k + 1)})]}_{+} \end{matrix} \end{matrix}

\forall (u, i) \in K, k = 0, 1, \dots, I_{M}

(11)

where ^(k) is the BCD iteration index, and

{[a]}_{+} : = max (a, 0)

is applied element by element on

a

, i.e., a Euclidean projection of

a

on

R_{+}^{D}

. Since the projection is Euclidean (non-expansive operator), the corollary stated in the previous subsection applies to the BCD for

N M F

too.

Block Stochastic Gradient Descent (BSGD) for NMF (SGD NMF): The SGD NMF derivations are exactly the same as that of SGD MF, followed by a projection

{[]}_{+}

. We thus express the SGD NMF updates as

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = {[θ_{u}^{(k)} + 2 α_{k} ((\sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})) - ((\sum_{i} ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) + μ_{u} I_{D}) θ_{u}^{(k)})]}_{+}, u \in U_{i} \\ ψ_{i}^{(k + 1)} & = {[ψ_{i}^{(k)} + 2 α_{k} ((\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})) - (\sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})) + λ_{i} I_{D}) θ_{u}^{(k)})]}_{+}, i \in I_{u} \end{matrix} \end{matrix}

\forall k = 0, 1, \dots, T,

(12)

where u is a random index chosen from

U_{i}

, i is a random index from

I_{u}

,

{[a]}_{+} : = max (a, 0)

, and

α^{(k)}

is the SGD step size (

0 < α^{(k)} < 1

).

BGD for NMF (BGD NMF): The solution and derivations for BGD NMF are the same as those for BGD MF, followed by a projection

{[]}_{+}

, i.e,

\begin{matrix} \{\begin{matrix} θ_{u}^{(k + 1)} & = {[θ_{u}^{(k)} + 2 α_{k} ((\sum_{i} ({[S]}_{u, i} ψ_{i}^{(k)})) - ((\sum_{i} ψ_{i}^{(k)} ψ_{i}^{{(k)}^{T}}) + μ_{u} I_{D}) θ_{u}^{(k)})]}_{+} \\ ψ_{i}^{(k + 1)} & = {[ψ_{i}^{(k)} + 2 α_{k} ((\sum_{u} ({[S]}_{u, i} θ_{u}^{{(k)}^{T}})) - (\sum_{u} (θ_{u}^{(k)} θ_{u}^{{(k)}^{T}})) + λ_{i} I_{D}) θ_{u}^{(k)})]}_{+} \end{matrix} \end{matrix}

\forall (u, i) \in K, k = 0, 1, \dots, T,

(13)

where

{[a]}_{+} : = max (a, 0)

, ^(k) is the GD iteration index and

α^{(k)}

is the GD step size (

0 < α^{(k)} < 1

). We use a constant step size

α_{k} = α

for all these methods.

4.4. Prediction for MF and NMF

For both

M F

and

N M F

, the predicted

R S E

of the beam-pair

(u, i)

, for beam indexes that were not sounded, is expressed as

\begin{matrix} {{\hat{R S E}}_{u, i} : = {({\hat{θ}}_{u})}^{T} {\hat{ψ}}_{i} | \forall (u, i) \in L} \end{matrix}

(14)

where

L

is the test set and

{{{\hat{θ}}_{u})}^{T}, {\hat{ψ}}_{i}}

are optimal solutions to MF (or NMF). Afterwards, we search for the optimal beam pair at

U E

and

B S

as the one with the highest

R S E

value over both training and test sets, as follows:

\begin{matrix} (u^{★}, i^{★}) = a r g m a x_{(u, i) \in L \cup K} {({\hat{θ}}_{u})}^{T} {\hat{ψ}}_{i} . \end{matrix}

(15)

4.5. Proposed BA Algorithm Using MF/NMF

Due to the fact that the updates given in a closed-form solution, we can quantify the computational complexity of all of the above methods. As seen from the updates for BCD MF and BCD NMF, we have to invert two

D \times D

matrices (for the sum problems S1 and S2). Thus, the (per-iteration) computational complexity of BCD MF and BCD NMF is approximated as

C_{B C D M F} = C_{B C D N M F} = O (2 D^{3})

. Moreover, for BGD MF and BGD NMF, one has to compute two full-batch gradients over all training samples in

K

(for the sub-problems S1 and S2). Consequently, the complexity, per-iteration, for BGD MF and BGD NMF is approximated as

C_{B G D M F} = C_{B G D N M F} = O (2 | K |)

. Finally, for SGD MF and SGD NMF, since we use a mini-batch size

= 1

(for the sub-problems S1 and S2), the resulting per-iteration computational complexity is approximated as

C_{S G D M F} = C_{S G D N M F} = O (2)

. Solving the

M F

and

N M F

problem, we employ methods such as BCD, BGD, or SGD. All details are shown in Algorithm 1.

Algorithm 1 Proposed MF/NMF-Based BA Method.

Input: ${f_{u}}_{\forall u \in T}$ , ${W_{i}}_{\forall i \in R}$ , $η$ , $P u$

-

Generate randomly sub-sampled codebooks,

T_{S}, R_{S}

, satisfying

(| T_{S} | . | R_{S} |) / (| T | \times | R |) = η

-

Sound beam pairs from training set,

K : = T_{S} \times R_{S}

.

-

Record corresponding

R S E

in and generate mat.

S

, in (5)

-

Select model: MF or NMF

-

IF MF model selected

solve $(P 1)$ with BCD for MF, in (8) or solve $(P 1)$ with BGD for MF, in (10) or solve $(P 1)$ with SGD for MF, in (9). At the end of training, return optimal latent vectors, ${{\hat{θ}}_{u}, {\hat{ψ}}_{i}}_{(u, i) \in K}$

-

IF NMF model selected

solve $(P 2)$ with BCD for NMF, in (11) or solve $(P 2)$ with BGD for NMF, in (13) or solve $(P 2)$ with SGD for NMF, in (12). At the end of training, return ideal latent vectors, ${{\hat{θ}}_{u}, {\hat{ψ}}_{i}}_{(u, i) \in K}$

-

Use ideal latent vectors

{{\hat{θ}}_{u}, {\hat{ψ}}_{i}}_{(u, i) \in K}

, to predict unknown

R S E

of test set,

L

, in (14)

-

Search training and test sets, for beam pair w/ largest

R S E

, (15)

Output: $f_{u^{★}}$ , $W_{i^{★}}$

While, for MF BCD and NMF BCD, the only hyperparameter is the model size D, MF BGD and NMF BGD require, in addition to D,

α^{k}

, the GD step size as hyperparameters.

4.6. Numerical Simulations

This section illustrates our numerical setup. The number of antennas at

U E

and

B S \in {128

, 256, 512,

1024}

. We set up

N_{T} = C_{T}

and

N_{R} = C_{R}

. The overhead ratio regime

η \in {

0.7

,

0.5

,

0.3

,

0.1}. The number of

O F D M

sub-carriers

N_{c} = 64

and the number of channel paths L is 2. We vary the transmitted power,

P_{u} \in {1, 10^{- 1}, 10^{- 2}}

. We use

D F T

codebooks at

U E

and

B S

. The optimal hyperparameters are chosen to minimize test loss. The model dimension

D \in {2, 3, 4, 5, 6}

, the learning rate

α_{k} \in {10^{- 1}, 10^{- 2}, 10^{- 3}, 10^{- 4}, 10^{- 5}, 10^{- 6}}

, and the regularization factors

{λ, μ} \in {10^{- 2}, 10^{- 3}, 10^{- 4}, 10^{- 5}, 10^{- 6}, 10^{- 7}}

. For each MIMO configuration and for each

P_{u}

regime, we randomly generate and store the resulting

R S E

matrices.

We propose to investigating six models in total (BCD MF, BCD NMF, BGD MF, BGD NMF, SGD MF, SGD NMF) with respect to three transmitted power regimes: high

P u = 1 W

, medium

P u = 10^{- 1}

W, and low

P u = 10^{- 2}

W with fixed

σ^{2} = 1

. In Table 1, we provide a summary for all proposed system parameters. We use the training Normalized MSE (

N M S E

) to evaluate the training error, expressed as

T r a i n N M S E = \frac{1}{| K |} (\sum_{(u, i) \in K} {(\frac{\hat{θ_{u}^{T}} \hat{ψ_{i}} - R S E_{u, i}}{R S E_{u, i}})}^{2})

. We also define

T e s t N M S E = \frac{1}{| L |} (\sum_{(u, i) \in L} {(\frac{\hat{R S E_{u, i}} - \hat{θ_{u}^{T}} \hat{ψ_{i}}}{\hat{R S E_{u, i}}})}^{2})

. The range of training error and the overall behavior of

B C D

-based models are different and distinctive from

G D

models in both

M F

and

N M F

; for instance,

B G D

-based models’ error range are around

\times 10^{- 7}

, while

B C D

-based models are around

\times 10^{- 4}

. Thus,

G D

is more accurate. However,

B C D

converges faster and the cost function drops to low values from the very first iterations. In addition, for

M F

and

N M F

, the train

N M S E

decreases with the increase in the overhead ratio

η

, as seen in Figure 5. Low and medium

P_{u}

regimes are characterized by noisy links between

U E

and

B S

and represent a more challenging experimental environment.

B C D

-based models tend to be faster in reaching low error values, while

B G D

-based models are more accurate. (For instance,

B S G D

generally ameliorates the quality of prediction compared with

B G D

).

Regarding

M F / N M F

simulation figures, Figure 5a states the decrease of train/test

N M S E

in function of the overhead ratio (more training samples result in fewer errors); Figure 5b,c track the instant drop in loss values from the very first iterations for

B C D

-based models; and Figure 5d,e present the progressive convergence of cost function among the iterations when we use

B G D

-based models. In summary, Table 2 outlines the optimal (minimum) signaling overhead ratio required for the all proposed system configurations, the optimal model (holding the smallest total cost function), the related combination of optimal hyperparameters, and the corresponding train/test error values. When the signal is affected with much noise, it is harder to keep the same range of error when compared with high a

P_{u}

regime. In fact,

M F

models keep the same (minimum) signaling overhead (

0.1

) regardless of the transmitted power regime, being able to accurately predict with just

10 %

of sounded beams. Thus, the proposed

M F / N M F

methods are able to reduce the pilot signaling overhead by

90 %

compared with Exhaustive

B A

with negligible training and test errors.

5. Multi-Layer Perceptron

5.1. MLP Problem Formulation

We consider a feed-forward

M L P

, with J layers, modeled as a composition of J non-linear functions/layers. Let

z_{0} \in R

be the

M L P

input, and

z_{J} \in R

be the

M L P

output; see Figure 6. We denote with

{z_{2}, \dots, z_{J - 1}}

all the hidden layers. We assume for simplicity that the width of all the layers is the same, denoted as D, i.e.,

{z_{2} \in R^{D}, \dots, z_{J - 1} \in R^{D}}

; see Figure 6. The equation describing layer 1 is given by

z_{1} = σ_{1} (ϕ_{1} z_{0}) = σ_{1} (ϕ_{1} 1)

, where

z_{1} \in R^{D}

is the output of layer 1,

ϕ_{1} \in R^{D}

is the resulting weight vector, and

σ_{1} () : R ⟶ R^{D}

is the non-linear activation function for layer 1. We use one hot encoding for the MLP input

z_{0} \in R

, i.e.,

z_{0} = 1

for all training samples,

\forall (u, i) \in K

. We express the output of the hidden layers,

{z_{j} \in R^{D}}_{j = 2}^{J - 1}

, as

z_{j} = σ_{j} (Φ_{j} z_{j - 1}), \forall j \in {2, \dots, J - 1}

, where

z_{j - 1} \in R^{D}

is the input of the layer j and

z_{j} \in R^{D}

is its output

, \forall j \in {2, \dots, J - 1}

;

Φ_{j} \in R^{D \times D}

is the weight matrix for the layer j

, \forall j \in {2, \dots, J - 1}

; and

σ_{j - 1} () : R^{D} ⟶ R^{D}

is the element-by-element non-linear activation function for the layer j,

\forall j \in {2, \dots, J - 1}

. Finally, the relation for the last layer

j = J

is expressed as

z_{J} = σ_{J} (ϕ_{J} z_{J - 1})

, where

z_{J} \in R

is the output for layer J,

ϕ_{J} \in R^{1 \times D}

is its weight vector, and

σ_{J} () : R^{D} ⟶ R

is the non-linear activation function for the layer J. We express the output of the MLP

z_{J} \in R

(as a function of all layers) as

\begin{matrix} z_{J} : = σ_{J} (ϕ_{J} \dots σ_{2} (Φ_{2} (σ_{1} (ϕ_{1})))) \end{matrix}

(16)

The output of

M L P

is made to fit/approximate all the

R S E

values at all training samples;

z_{J} : = R S E_{u, i}

,

\forall (u, i) \in K

. We define the MSE loss

l_{u, i}

for the sample

(u, i)

in the training set

K

as the distance between the MLP output

z_{J}

and the known RSE label for the beam pair

(u, i)

,

{RSE}_{u, i}

, i.e,

\begin{matrix} l_{u, i} : = {(z_{J} - R S E_{u, i})}^{2} = {(\underset{M L P o u t p u t}{\underset{︸}{σ_{J} (ϕ_{J} \dots σ_{2} (Φ_{2} (σ_{1} (ϕ_{1}))))}} - \underset{R S E v a l u e}{\underset{︸}{R S E_{u, i}}})}^{2}, \forall (u, i) \in K \end{matrix}

Then, the empirical risk is defined as the average of the individual loss

l_{u, i}

across the training set

K

,

(1 / | K |) \sum_{(u, i) \in K} l_{u, i}

. The empirical risk minimization for the MLP is given in

(P 3)

.

\begin{matrix} (P 3) : = {({ϕ_{1}}^{*}, {Φ_{2}}^{*}, \dots, {ϕ_{J}}^{*}) \{\begin{matrix} \underset{ϕ_{1}, Φ_{2}, \dots, Φ_{J - 1}, ϕ_{J}}{a r g m i n} \frac{1}{| K |} \sum_{(u, i) \in K} l_{u, i} (ϕ_{1}, Φ_{2}, \dots, Φ_{J - 1}, ϕ_{J}) \\ s . t . ϕ_{1} \in R^{D}, Φ_{2} \in R^{D \times D}, \dots, Φ_{J - 1} \in R^{D \times D}, ϕ_{J} \in R^{1 \times D} \end{matrix} \end{matrix}

5.2. MLP Learning

We propose to learn the optimal

M L P

weights via back-propagation (BP). We choose an arbitrary mini-batch of samples of size

B \subseteq K

and define the mini-batch loss as

\begin{matrix} l_{B} : = \frac{1}{| B |} \sum_{u, i \in B} {(σ_{J} (ϕ_{J} \dots σ_{2} (Φ_{2} (σ_{1} (ϕ_{1})))) - R S E_{u, i})}^{2}, \forall (u, i) \in B \end{matrix}

(17)

We express the partial derivative of the loss corresponding to the mini-batch

l_{B}

with respect to each layer

Φ_{j}, j {1, \dots, J}

as

\begin{matrix} \frac{\partial l_{B}}{\partial Φ_{j}} = \frac{1}{| B |} \sum_{(u, i) \in B} (δ_{j} {z_{j - 1}}^{T}), \forall j \in {1, . . J}, \end{matrix}

(18)

where

\begin{matrix} δ_{j} \overset{Δ}{=} \{\begin{matrix} ({Φ_{j + 1}}^{T} δ_{j + 1}) \circ σ_{j}^{'}, j < J \\ 2 (z_{J} - R S E_{u, i}) \circ σ_{j}^{'}, j = J, (u, i) \in B \end{matrix}, σ_{j}^{'} \overset{Δ}{=} \frac{\partial σ (u)}{\partial u} = {[\frac{\partial σ (u_{1})}{\partial u_{1}}, \dots, \frac{\partial σ (u_{d_{j}})}{\partial u_{d_{j}}}]}^{T}, \end{matrix}

j = 1, \dots, J

and ∘ denotes the Hadamard product. We express the BP weight update of the mini-batch loss

l_{B}

, for all layers

\forall j \in {1, \dots, J}

, as

\begin{matrix} Φ_{j}^{(k + 1)} = Φ_{j}^{(k)} - {β_{j}}^{(k)} \frac{\partial l_{B}}{\partial Φ_{j}} |_{Φ_{j}^{(k)}}, \forall j \in {1, \dots, J}, k = {1, \dots, T} \end{matrix}

(19)

where ^(k) is the BP iteration index,

Φ_{j}^{(k)}

is the value of

Φ_{j}

at iteration k,

{β_{j}}^{(k)}

is the BP step size (learning rate) for the layer j at iteration k, and

\frac{\partial l_{B}}{\partial Φ_{j}} |_{Φ_{j}^{(k)}}

is the partial derivative given in (18) evaluated at

Φ_{j}^{(k)}

.

Back-propagation algorithm with mini-batch

Choose the mini-batch

B

as a random subset of the training set

K

.

Compute the loss function $l_{B}$ for all samples in the mini-batch $(u, i) \in B$ in (17).
Compute the partial derivative $\frac{\partial l_{B}}{\partial Φ_{j}}$ of the mini-batch loss $l_{B}$ with respect to $Φ_{j}$ in (18).
Update the weights of each layer as in (19).

We assume that the BP learning rate is the same for all layers,

β_{j}^{(k)} = β^{k}, \forall j \in {1, \dots, J}

.

5.3. Prediction Using MLP

The

M L P

prediction for the sample (u,i) in the test set

L

, using optimal weights

{ϕ_{1}}^{*}

,

{Φ_{2}}^{*}, \dots, {ϕ_{J}}^{*}

is as follows:

\begin{matrix} \hat{z_{J}} = σ_{J} ({ϕ_{J}}^{*} \dots σ_{2} ({Φ_{2}}^{*} (σ_{1} ({ϕ_{1}}^{*})))), \forall (u, i) \in L \end{matrix}

(20)

Therefore, the test

M S E

is defined as

\begin{matrix} \frac{1}{| L |} \sum_{(u, i) \in L} {(\hat{R S E_{u, i}} - σ_{J} ({ϕ_{J}}^{*} \dots σ_{2} ({Φ_{2}}^{*} (σ_{1} ({ϕ_{1}}^{*})))))}^{2} \end{matrix}

(21)

We then select the optimal indexes

u^{★}

and

i^{★}

related to the highest

R S E_{u, i}

value, as follows:

\begin{matrix} (u^{★}, i^{★}) = a r g m a x_{(u, i) \in L \cup K} {R S E_{u, i} | \forall (u, i) \in K} \cup {\hat{R S E_{u, i}} | \forall (u, i) \in L} \end{matrix}

(22)

5.4. Proposed BA Algorithm Using $M L P$

The Multi-Layer Perceptron-based Beam Alignment is specified in Algorithm 2.

Algorithm 2 Proposed MLP-Based BA Method.

Input: ${f_{u}}_{\forall u \in T}$ , ${W_{i}}_{\forall i \in R}$ , $η$ , $P u$

-

Generate randomly sub-sampled codebooks,

T_{S}, R_{S}

, satisfying

(| T_{S} | . | R_{S} |) / (| T | \times | R |) = η

-

Sound beam pairs from training set,

K : = T_{S} \times R_{S}

.

-

Record corresponding

R S E

and generate

R S E

mat.

S

, in (5)

-

Train

M L P

weights (using back-propagation algorithm)

return optimal weights, ${{ϕ_{1}}^{*}, {Φ_{2}}^{*}, \dots, {ϕ_{J}}^{*}}$

-

Use optimal parameters

{{ϕ_{1}}^{*}, {Φ_{2}}^{*}, \dots, {ϕ_{J}}^{*}}

, to predict unknown

R S E

of test set,

L

, in (21)

-

Search training and test sets, for optimal beam pair

(u^{★}, i^{★})

, holding the largest

R S E

, (22)

Output: $f_{u^{★}}$ , $W_{i^{★}}$

We assume that the number of neurons per layer D, the number of layers J, the mini-batch size

B = | B |

, and the BP learning rate

β^{(k)}

are hyperparameters. They are tuned using a grid search cross-validation.

5.5. Numerical Simulations

We define the training and test cost functions as follows:

\begin{matrix} T r a i n N M S E = \frac{1}{| K |} (\sum_{(u, i) \in K} {(\frac{R S E_{u, i} - σ_{J} (ϕ_{J} \dots σ_{2} (Φ_{2} (σ_{1} (ϕ_{1}))))}{R S E_{u, i}})}^{2}) \end{matrix}

(23)

\begin{matrix} T e s t N M S E = \frac{1}{| L |} (\sum_{(u, i) \in L} {(\frac{\hat{R S E_{u, i}} - σ_{J} ({ϕ_{J}}^{*} \dots σ_{2} ({Φ_{2}}^{*} (σ_{1} ({ϕ_{1}}^{*}))))}{\hat{R S E_{u, i}}})}^{2}) \end{matrix}

(24)

Therefore, we used the same system configurations as for

M F / N M F

, resumed in Table 1. Moreover, we choose the learning rate

β_{k} \in {0.1

,

0.01

,

0.001

,

0.0001}

, while the batch size

B \in {2

, 4, 8, 16, 32, 64,

128}

, the number of hidden layers

J \in {1, 2, 3}

. For each layer, the number of neurons

D \in {8

, 16, 32, 64,

128}

. We use the Rectified Linear Units as our activation function for all layers.

Similar to

M F / N M F

, train performance is observed when we track the evolution of the cost function

N M S E

, applied to the training samples of the set

K

, in a function of iterations. The range of considerably low-error values and the overall learning behavior of the

M L P

architecture illustrates that our shallow neural network successfully resolves the non-linear regression problems related to our BA process. For massive setups,

M L P

reaches around

10^{- 6}

error in a high

P_{u}

regime. However, this cost value increases as long as the amount of noise and interference augments. Note that the train

N M S E

also decreases when we increase the size of the dataset matrix

S

, which provides more samples for

M L P

to improve the feature extraction and the prediction quality. Regarding the unknown beams, test error values in the numerical result tables are close to the train cost (with no overfitting or underfitting in the corresponding learning curves). Moreover, the test loss is impacted by the transmitted power regime the same way as the training process. Identical to

G D

-based

M F / N M F

, the

M L P

learning curves in Figure 7 plot the same shape of curve with a continuous monotonic decrease in the train and test cost among the iterations: the convergence is progressive among the iterations, and at the last epoch, training and test

N M S E

values land at considerably low error values and prove that

M L P

accurately fits to our problem and provides a concrete solution for

M L

-based

B A

. From a

Q o S

perspective, Table 3 resumes the smallest (optimal) signaling overhead required for a successful beam sounding based on reliable prediction quality. Similar to

M F / N M F

, for all the proposed transmitted power,

M L P

requires

10 %

of the total beam pairs to fulfill the

R S E

matrix.

6. Results and Discussion

6.1. Train/Test Prediction Performance Comparison

For the six

M F

-based models, we select the best one (minimum test error) to represent the

M F

family of methods in this section and compare it with

M L P

. When we analyze

Q o S

(Table 1 and Table 2), we notice that the transmitted power regime impacts the quality of prediction by reducing the overall loss. For

M F / N M F

, we observe that the loss damage is large. We jump from around

10^{- 8}

for massive configurations (256, 512, and 1024) to

10^{- 4}

for smaller setups. For

M L P

, we spot the increase in the overall loss when we decrease

P_{u}

. Thus,

M L P

seems to be the most robust architecture with respect to changing the transmitted power. Additionally, we empirically notice that the change in the

P_{u}

values does not impact the optimal hyperparameters selected from cross-validation. Furthermore, when we track the evolution of the training/test cost in the function of iterations, we observe balanced models with no signs of overfitting or underfitting. On the other hand, when the transmitted power decreases,

M F / N M F

tend to be the most impacted models in terms of train/test error, while the

M L P

error is robust.

On the other hand, from a

Q o S

perspective, concerning the evolution of the optimal (minimum) required signaling overhead and what impact can the

P_{u}

regime have on the optimal required values, in reference to Table 1 and Table 2, all the proposed models required just

10 %

of the total number of beam pairs at

U E

and

B S

for all antenna configurations from

128 \times 128

to

1024 \times 1024

for all the proposed

P u

values. This proves that the transmitted power impacts the quality of prediction but not the number of beam pairs required for training. In fact, low

P_{u}

leads to damaging the signal quality and subsequently damages the quantity of useful information to be extracted from the datasets. Finally, the only cases where the

P_{u}

regime impacts the optimal overhead ratio is among the smallest configurations, for instance, the

16 \times 16

setup where it seems normal for all learning models to demand more data to learn from (more hidden interactions between

U E

and

B S

as features to extract). These are the experimental situations where Exhaustive

B A

is technically feasible.

6.2. Similarities and Differences between Models

All models required just

10 %

of the beams for training for all the proposed massive setups. Moreover, all the proposed models are shallow neural architectures with few hidden layers for low-complexity constraints. Even among the largest configurations, the optimal dimensions of models picked from the cross-validation illustrate small networks with no need to require dense architectures. Furthermore, all models succeeded with the matrix completion task, and they all illustrate a monotonic decrease in loss values as long as we increase the MIMO setup. Additionally,

M F

-based models are the most accurate reaching loss values in the range

10^{- 8}

for massive setups in a high

P_{u}

regime, and their cross-validation illustrates smaller grid search where there are fewer hyperparameters to tune. However, they are the slowest models when applied to high-dimensional MIMO setups. On the other hand,

M L P

illustrates a good balance between run time (complexity) and loss values (prediction quality). It reaches around

10^{- 4}

and

10^{- 5}

loss for massive configurations. In addition, the

M L P

is the most robust model facing the changes in the

P_{u}

values. In Figure 8, for

512 \times 512

, the figure illustrates the train/test

N M S E

in the function of each model and the corresponding transmitted power: in Figure 8a, for

P_{u} = 1 W

,

M F

achieves its best performance, slightly better than

M L P

with the difference between achieved cost values at around

10^{- 1}

. In Figure 8b, when

P_{u} = 10^{- 1} W

,

M F

still gets the best performance, marginally better than

M L P

with an

N M S E

value difference of around

10^{- 1}

. In Figure 8c, when

P_{u} = 10^{- 2}

,

M F

noticeably gets impacted (overall loss around

10^{- 3}

) while

M L P

provides the best prediction performance: this suggests that when

P_{u}

is small,

M L P

is more robust than

M F / N M F

, which performs best in high

P_{u}

regime. Similarly, almost same remarks hold for Figure 9 when we simulate the

128 \times 128

configuration: in Figure 9a,

M F

reaches considerably better performance compared with

M L P

with

10^{- 4}

. In Figure 9b,

M L P

kept the same range of error, which states again the robustness of the model while

M F

got severely impacted (

10^{- 3}

) but sill holds the best performance. In Figure 9c, when

P_{u}

is weak,

M F

illustrates the worst performance in all simulations. On the other hand,

M L P

got slightly impacted with an overall loss of

10^{- 1}

and reaches the best quality of prediction. In Figure 10, we investigate the highest configuration

1024 \times 1024

. Similar conclusions for Figure 8 and Figure 9 hold for this figure in terms of best model (

M F

for

P_{u} = 1 W

,

P_{u} = 10^{- 1}

and

M L P

for

P_{u} = 10^{- 2}

). In addition, we aim to investigate the overall impact of varying the transmitted power. Thus, we track the

l o g (N M S E)

values while switching from one

P_{u}

regime to another: In Figure 10, in Figure 10a, for

M L P

, the curve gap from low/medium is

l o g {(N M S E)}_{m e d i u m} - l o g {(N M S E)}_{l o w} \approx - 16 - (- 12) \approx - 4

. The gap in the medium/high regimes is almost negligible (

l o g {(N M S E)}_{h i g h} - l o g {(N M S E)}_{m e d i u m} \approx - 16 - (- 16) \approx 0.5

). Finally, in Figure 10b, the

M F

gap is around

l o g {(N M S E)}_{m e d i u m} - l o g {(N M S E)}_{l o w} \approx - 17 - (- 9) \approx - 8

and

l o g {(N M S E)}_{h i g h} - l o g {(N M S E)}_{m e d i u m} \approx - 22 - (- 17) \approx - 5

: at each change of

P_{u}

,

M F

is considerably impacted. To sum up, the choice of the optimal model strongly depends on the available complexity and the given transmitted power

P_{u}

. In fact,

M F

, whether through

B C D

or

B G D

optimization, is the best model when the transmitted power is high (

P_{u} = 1 W

). In this case,

B C D M F

converges faster but has higher complexity than

B G D

. However,

S G D

for

M F / N M F

are the slowest models to converge but show negligible complexity. On the other hand, if we aim to prioritize run time,

M L P

exhibits the fastest predictions with good prediction error. Finally, it is wise to opt for

M L P

if the system is to operate under various transmitted power regimes where

M L P

offers good prediction quality for every

P_{u}

value and the available complexity is medium.

7. Conclusions

In this paper, we proposed a blind Machine Learning-based Beam Alignment using Matrix Factorization, non-negative Matrix Factorization, and Multi-Layer Perceptron. We assumed an Uplink massive mmWave MIMO system using single RF-chains at

U E

and multiple RF-chains at

B S

though a fully analog architecture. The proposed approach consists in sounding the

R S E

of sub-sampled codebooks at

U E

and

B S

. The

R S E

of the non-sounded beams is predicted using

M F

,

N M F

, and

M L P

models. Our results show that, by sounding just

10 %

of the total beam pair samples, we may predict with high accuracy the unknown

R S E

values, which massively reduce the large signaling overhead of Exhaustive

B A

. Our future work investigates the scalability of our approach to a multi-user scenario. Robustness and

M L

-interpretability are other research directions for modeling industrial deployment.

Author Contributions

Conceptualization, A.K., H.G. and G.R.-B.O.; Methodology, A.K. and H.G.; Software, A.K.; Validation, G.R.-B.O.; Formal analysis, H.G.; Writing—original draft, A.K.; Writing—review & editing, H.G. and G.R.-B.O.; Supervision, H.G. and G.R.-B.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Télécom Paris, l’Institut Polytechnique de Paris, France.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Datasets are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ALS	Alternating Least Squares
AoD	Angle of Departure
AoA	Angle of Arrival
AWGN	Additive White Gaussian Noise
BA	Beam Alignment
BS	Base Station
BCE	Binary Cross Entropy
BCD	Block Coordinate Descent
BGD	Block Gradient Descent
BSGD	Block Stochastic Gradient Descent
CSI	Channel State Information
DFT	Discrete Fourier Transform
GD	Gradient Descent
LoS	Line of Sight
MF	Matrix Factorization
MIMO	Multiple Input Multiple Output
ML	Machine Learning
MLP	Multi-Layer Perceptron
MSE	Mean Squared Error
NMF	Non-Negative Matrix Factorization
NLoS	Non Line of Sight
NMSE	Normalized Mean Squared Error
OFDM	Orthogonal Frequency Division Multiplexing
QoS	Quality of Service
ReLu	Rectified Linear Unit
RSE	Received Signal Energies
SNR	Signal-to-Noise Ratio
UE	User Equipment

Appendix A. Proof: BCD Convergence

We will show that the two (below) necessary conditions for convergence of BCD are satisfied:

(i): The loss function is strongly convex, per block; i.e., we need to show that sub-problem S1 and S2 have a unique solution.
(ii): The constraints of the MF prob $θ_{u} \in R^{d}, ψ_{i} \in R^{d}$ , are separable and individually convex.

Recall that sub-problem S1 is written as

\begin{matrix} (S 1) : {θ_{u}}^{(k + 1)} & = a r g m i n_{θ_{u} \in R^{d}} [- 2 {θ_{u}}^{T} {r_{u}}^{(k)} + {θ_{u}}^{T} ({Q_{u}}^{(k)} + μ_{u} I_{D}) θ_{u}] = f_{1} (u), \forall u, \end{matrix}

Next, we will prove that the equivalent form in (S1), is a strongly convex function; i.e., it shows that

f_{1} (θ_{u})

is strongly in

θ_{u}

. To that end, we derive the corresponding Hessian:

\begin{matrix} \nabla^{2} f_{1} (θ_{u}) & : = 2 (Q_{u}^{(k)} + μ_{u} I_{D}), \forall u, \end{matrix}

For this Hessian expression,

Q_{u}^{(k)} ⪰ 0

is a Positive Semi Definite (PSD) matrix (by def),

μ_{u} I ≻ 0

is a Positive Definite (PD) matrix, and

(Q_{u}^{(k)} + μ_{u} I_{D}) ≻ 0

is a PD matrix. Thus, the Hessian is a PD matrix

\nabla^{2} f_{1} (θ_{u}) ≻ 0

, and

f_{1} (θ_{u})

is strongly in

θ_{u}

, and the solution to the sub-problem (S1) is unique. Recall that the sub-problem (S2) is expressed as

\begin{matrix} (S 2) : {ψ_{i}}^{(k + 1)} = a r g m i n_{ψ_{i} \in R^{d}} [- 2 t_{i}^{{(k + 1)}^{T}} ψ_{i} + {ψ_{i}}^{T} (P_{u}^{(k + 1)} + λ_{i} I) ψ_{i}] = f_{2} (ψ_{i}), \forall i, \end{matrix}

Next, we will prove that the equivalent form is a strongly convex function; i.e., it shows that

f_{2} (ψ_{i})

is strongly in

ψ_{i}

. To that end, we derive the corresponding Hessian:

\begin{matrix} \nabla^{2} f_{2} (ψ_{i}) : = 2 (P_{i}^{(k + 1)} + {λ_{i}}^{(i)} I_{D}), \forall i, \end{matrix}

For this Hessian expression,

P_{i}^{(k + 1)} ⪰ 0

is a PSD matrix (by def),

λ_{i}^{(i)} I ≻ 0

is a PD matrix, and

(P_{i}^{(k + 1)} + λ_{i}^{(i)} I_{D}) ≻ 0

is a PD matrix. Thus, the Hessian is a PD matrix

\nabla^{2} f_{2} (ψ_{i}) ≻ 0

, and

f_{2} (ψ_{i})

is strongly convex in

ψ_{i}

. Thus, the solution to the sub-problem (S2) is unique.

References

Wang, Y.; Wei, Z.; Feng, Z. Beam Training and Tracking in MmWave Communication: A Survey. arXiv 2022, arXiv:2205.10169. [Google Scholar]
IEEE Std 802.15.3c-2009; IEEE Standard for Information Technology—Local and Metropolitan Area Networks—Specific requirements—Part 15.3: Amendment 2: Millimeter-Wave-Based Alternative Physical Layer Extension. IEEE: Piscataway, NJ, USA, 2009.
IEEE Std 802.11ad-2012; IEEE Standard for Information technology—Telecommunications and Information Exchange between Systems—Local and Metropolitan Area Networks—Specific Requirements-Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 3: Enhancements for Very High Throughput in the 60 GHz Band. IEEE: Piscataway, NJ, USA, 2012.
3GPP. TS 38.211 V16.7.1 NR; Physical Channels and Modulation; ETSI Technical Specification 138 211 V16.10.0; Released: 07/2022. Available online: https://www.etsi.org/deliver/etsi_ts/138200_138299/138211/16.10.00_60/ts_138211v161000p.pdf (accessed on 9 July 2024).
Noh, S.; Zoltowski, M.D.; Love, D.J. Multi-Resolution Codebook and Adaptive Beamforming Sequence Design for Millimeter Wave Beam Alignment. IEEE Trans. Wirel. Commun. 2017, 16, 5689–5701. [Google Scholar] [CrossRef]
Kokshoorn, M.; Chen, H.; Wang, P.; Li, Y.; Vucetic, B. Millimeter Wave MIMO Channel Estimation Using Overlapped Beam Patterns and Rate Adaptation. IEEE Trans. Signal Process. 2016, 65, 601–616. [Google Scholar] [CrossRef]
Tsang, Y.M.; Poon, A.S.Y.; Addepalli, S. Coding the Beams: Improving Beamforming Training in mmWave Communication System. In Proceedings of the 2011 IEEE Global Telecommunications Conference—GLOBECOM 2011, Houston, TX, USA, 5–9 December 2011; pp. 1–6. [Google Scholar] [CrossRef]
Buzzi, S.; D’Andrea, C. Subspace Tracking and Least Squares Approaches to Channel Estimation in Millimeter Wave Multiuser MIMO. IEEE Trans. Commun. 2019, 67, 6766–6780. [Google Scholar] [CrossRef]
Khordad, E.; Collings, I.B.; Hanly, S.V.; Caire, G. Compressive Sensing Based Beam Alignment Schemes for Time-Varying Millimeter-Wave Channels. IEEE Trans. Wirel. Commun. 2023, 22, 1604–1617. [Google Scholar] [CrossRef]
Ghauch, H.; Skoglund, M.; Shokri-Ghadikolaei, H.; Fischione, C.; Sayed, A.H. Learning Kolmogorov Models for Binary Random Variables. In Proceedings of the International Conference on Machine Learning Workshop, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
Yetis, C.M.; Björnson, E.; Giselsson, P. Joint Analog Beam Selection and Digital Beamforming in Millimeter Wave Cell-Free Massive MIMO Systems. arXiv 2021, arXiv:2103.11199. [Google Scholar] [CrossRef]
Dreifuerst, R.M.; Heath, R.W.; Yazdan, A. Massive MIMO Beam Management in Sub-6 GHz 5G NR. arXiv 2022, arXiv:2204.06064. [Google Scholar]
Ma, K.; He, D.; Sun, H.; Wang, Z.; Chen, S. Deep Learning Assisted Calibrated Beam Training for Millimeter-Wave Communication Systems. arXiv 2021, arXiv:2101.05206. [Google Scholar] [CrossRef]
Nguyen, K.N.; Ali, A.; Mo, J.; Ng, B.L.; Va, V.; Zhang, J.C. Beam Management with Orientation and RSRP using Deep Learning for Beyond 5G Systems. arXiv 2022, arXiv:2202.02247. [Google Scholar]
Aldalbahi, A.; Shahabi, F.; Jasim, M. BRNN-LSTM for Initial Access in Millimeter Wave Communications. Electronics 2021, 10, 1505. [Google Scholar] [CrossRef]
Dehkordi, S.K.; Kobayashi, M.; Caire, G. Adaptive Beam Tracking based on Recurrent Neural Networks for mmWave Channels. arXiv 2021. [Google Scholar] [CrossRef]
Hussain, M.; Michelusi, N. Learning and Adaptation for Millimeter-Wave Beam Tracking and Training: A Dual Timescale Variational Framework. arXiv 2021. [Google Scholar] [CrossRef]
Dreifuerst, R.M.; Daulton, S.; Qian, Y.; Varkey, P.; Balandat, M.; Kasturia, S.; Tomar, A.; Yazdan, A.; Ponnampalam, V.; Heath, R.W. Optimizing Coverage and Capacity in Cellular Networks using Machine Learning. arXiv 2021, arXiv:2010.13710. [Google Scholar]
Narengerile, N.; Thompson, J.; Patras, P.; Ratnarajah, T. Deep Reinforcement Learning-Based Beam Training for Spatially Consistent Millimeter Wave Channels. In Proceedings of the 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Helsinki, Finland, 13–16 September 2021; pp. 579–584. [Google Scholar] [CrossRef]
Wang, L.; Ai, B.; Niu, Y.; Gao, M.; Zhong, Z. Adaptive Beam Alignment Based on Deep Reinforcement Learning for High Speed Railways. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference: (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
Ktari, A.; Ghauch, H.; Rekaya, G. Matrix Factorization for Blind Beam Alignment in Massive mmWave MIMO. In Proceedings of the 2022 IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022; pp. 2637–2642. [Google Scholar] [CrossRef]
Ktari, A.; Ghauch, H.; Rekaya, G. Cascaded binary classifiers for blind Beam Alignment in mmWave MIMO using one-bit quantization. In Proceedings of the International Conference on Communications (ICC), WS02 ICC23 Workshop, DDINS, Rome, Italy, 28 May–1 June 2023. [Google Scholar]

Figure 1. Proposed

B A

diagram representation: (a) fully analog MIMO architecture using a single RF chain at

U E

and multiple RF chains at

B S

; (b) simplified illustration of Beam Alignment problem.

Figure 1. Proposed

B A

diagram representation: (a) fully analog MIMO architecture using a single RF chain at

U E

and multiple RF chains at

B S

; (b) simplified illustration of Beam Alignment problem.

Figure 2. Exhaustive Beam Alignment:

| T | = | R | = 4

,

N_{r f} = 2

RF-Chains at

B S

. Record 2 beam pairs for each pilot symbol transmission until the matrix is complete. Signaling overhead,

Ω = \frac{4 \times 4}{2}

.

Figure 2. Exhaustive Beam Alignment:

| T | = | R | = 4

,

N_{r f} = 2

RF-Chains at

B S

. Record 2 beam pairs for each pilot symbol transmission until the matrix is complete. Signaling overhead,

Ω = \frac{4 \times 4}{2}

.

Figure 3. Proposed partial Beam Alignment using sub-sampled codebooks:

| T | = | R | = 4

,

N_{r f} = 2

RF-Chains: record 2 beam pairs for each pilot symbol transmission until sounded beams are recorded. The missing entries represent the predicted entries. Signaling overhead,

Ω = \frac{3 \times 3}{2}

.

Figure 3. Proposed partial Beam Alignment using sub-sampled codebooks:

| T | = | R | = 4

,

N_{r f} = 2

RF-Chains: record 2 beam pairs for each pilot symbol transmission until sounded beams are recorded. The missing entries represent the predicted entries. Signaling overhead,

Ω = \frac{3 \times 3}{2}

.

Figure 4. Toy Example: Matrix Factorization with

| T | = 5, | R | = 7, D = 3

.

M F

results into two rectangular matrices to be optimized:

M F

uses the

R S E

of known beams (yellow) to predict/complete unknown beams (gray). The product of the latent factors

θ_{2}^{T}

and

ψ_{5}

gives the unknown value of

R S E_{2, 5}

.

Figure 4. Toy Example: Matrix Factorization with

| T | = 5, | R | = 7, D = 3

.

M F

results into two rectangular matrices to be optimized:

M F

uses the

R S E

of known beams (yellow) to predict/complete unknown beams (gray). The product of the latent factors

θ_{2}^{T}

and

ψ_{5}

gives the unknown value of

R S E_{2, 5}

.

Figure 5.

M F / N M F

train/test performance and learning curves: (a) 512 × 512 train/test loss in function of the overhead ratio; (b) learning curve: 256 × 256 with overhead 0.1

B C D M F

; (c) learning curve: 1024 × 1024 with overhead 0.1

B C D N M F

; (d) learning curve: 512 × 512 with overhead 0.1

B G D M F

; (e) learning curve: 128 × 128 with overhead 0.1

B C D S G D

.

Figure 5.

M F / N M F

train/test performance and learning curves: (a) 512 × 512 train/test loss in function of the overhead ratio; (b) learning curve: 256 × 256 with overhead 0.1

B C D M F

; (c) learning curve: 1024 × 1024 with overhead 0.1

B C D N M F

; (d) learning curve: 512 × 512 with overhead 0.1

B G D M F

; (e) learning curve: 128 × 128 with overhead 0.1

B C D S G D

.

Figure 6. Multi-Layer Perceptron architecture (toy example with

J = 4

).

Figure 6. Multi-Layer Perceptron architecture (toy example with

J = 4

).

Figure 7.

M L P

Learning curves: (a) learning curve: 256 × 256 with overhead 0.1 MLP; (b) learning curve: 512 × 512 with overhead 0.1 MLP; and (c) learning curve: 128 × 128 with overhead 0.3 MLP.

Figure 7.

M L P

Learning curves: (a) learning curve: 256 × 256 with overhead 0.1 MLP; (b) learning curve: 512 × 512 with overhead 0.1 MLP; and (c) learning curve: 128 × 128 with overhead 0.3 MLP.

Figure 8. Train/test

N M S E

in function of

P_{u}

for all proposed models for

512 \times 512

using optimal overhead ratio; (a) 512 × 512 train/test

N M S E

for

P_{u} = 1

W; (b) 512 × 512 train/test

N M S E

for

P_{u} = 10^{- 1}

W; (c) 512 × 512 train/test

N M S E

for

P_{u} = 10^{- 2}

W.

Figure 8. Train/test

N M S E

in function of

P_{u}

for all proposed models for

512 \times 512

using optimal overhead ratio; (a) 512 × 512 train/test

N M S E

for

P_{u} = 1

W; (b) 512 × 512 train/test

N M S E

for

P_{u} = 10^{- 1}

W; (c) 512 × 512 train/test

N M S E

for

P_{u} = 10^{- 2}

W.

Figure 9. Train/test

N M S E

in function of

P_{u}

for all proposed models for

128 \times 128

using optimal overhead ratio: (a) 128 × 128 train/test

N M S E

for

P_{u} = 1

W; (b) 128 × 128 train/test

N M S E

for

P_{u} = 10^{- 1} W

; (c) 128 × 128 train/test

N M S E

for

P_{u} = 10^{- 2}

W.

Figure 9. Train/test

N M S E

in function of

P_{u}

for all proposed models for

128 \times 128

using optimal overhead ratio: (a) 128 × 128 train/test

N M S E

for

P_{u} = 1

W; (b) 128 × 128 train/test

N M S E

for

P_{u} = 10^{- 1} W

; (c) 128 × 128 train/test

N M S E

for

P_{u} = 10^{- 2}

W.

Figure 10. Log(

N M S E

) in function of

P_{u}

for

1024 \times 1024

using optimal overhead ratio: (a)

M L P

train/test

l o g (N M S E)

in function of

P u

using optimal overhead ratio; (b)

M F

train/test

l o g (N M S E)

in function of

P_{u}

using optimal overhead ratio.

Figure 10. Log(

N M S E

) in function of

P_{u}

for

1024 \times 1024

using optimal overhead ratio: (a)

M L P

train/test

l o g (N M S E)

in function of

P u

using optimal overhead ratio; (b)

M F

train/test

l o g (N M S E)

in function of

P_{u}

using optimal overhead ratio.

Table 1. System parameters and hyperparameters.

System Configuration for All Proposed Models
System parameter	Numerical value
number of antennas $N_{T}$ at $U E$	$128, 256, 512, 1024$
number of antennas $N_{R}$ at $B S$	$128, 256, 512, 1024$
codebook cardinality $\| T \|$ at $U E$	$128, 256, 512, 1024$
codebook cardinality $\| R \|$ at $B S$	$128, 256, 512, 1024$
overhead ratio $η$ regime	$0.7, 0.5, 0.3, 0.1$
number of $O F M D$ sub-carriers $N_{c}$	64
number of channel paths L	2 (NLoS)
transmitted power $P_{u}$ (W)	$1, 10^{- 1}, 10^{- 2}$
$M F / N M F$ dimension $D_{M F}$	$2, 3, 4, 5, 6$
$M F / N M F$ learning rate $α_{k}$	$10^{- 1}, 10^{- 2}, 10^{- 3}, 10^{- 4}, 10^{- 5}, 10^{- 6}$
$M F / N M F$ regularization factors $λ, μ$	$10^{- 2}, 10^{- 3}, 10^{- 4}, 10^{- 5}, 10^{- 6}, 10^{- 7}$
$M L P$ number of layers J	$1, 2, 3$
$M L P$ number of neurons per layer $D_{M L P}$	$8, 16, 32, 64, 128$
$M L P$ batch size B	$2, 4, 8, 16, 32, 64, 128$
$M L P$ learning rate $β_{k}$	$10^{- 1}, 10^{- 2}, 10^{- 3}, 10^{- 4}$

Table 2.

Q o S

minimum overhead required for

M F / N M F

for all proposed

P u

regimes.

Table 2.

Q o S

minimum overhead required for

M F / N M F

for all proposed

P u

regimes.

a \| $MF / NMF$ \| $QoS$ Minimum Overhead Required for $P_{u} = 1$ W
MIMO setup	Optimal hyperparameters	Min Overhead	Train NMSE	Test NMSE
128 by 128	BGD NMF{D = 2, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	8.407746 $\times 10^{- 6}$	9.147875 $\times 10^{- 6}$
256 by 256	BGD MF{D = 3, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	4.102708 $\times 10^{- 6}$	7.344720 $\times 10^{- 6}$
512 by 512	BGD MF{D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	8.374633 $\times 10^{- 7}$	9.417057 $\times 10^{- 7}$
1024 by 1024	SGD NMF{D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.01}	0.1	1.219227 $\times 10^{- 7}$	1.616363 $\times 10^{- 7}$
b \| $MF / NMF$ \| $QoS$ Minimum Overhead Required for $P_{u} = 10^{- 1}$ W
MIMO setup	Optimal hyperparameters	Min Overhead	Train NMSE	Test NMSE
128 by 128	SGD NMF {D = 2, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	0.000191	0.000276
256 by 256	SGD NMF {D = 3, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	4.648861 $\times 10^{- 5}$	5.775554 $\times 10^{- 5}$
512 by 512	BGD NMF{D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	1.052556 $\times 10^{- 5}$	1.170430 $\times 10^{- 5}$
1024 by 1024	BGD NMF {D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.001}	0.1	1.600790 $\times 10^{- 6}$	1.695907 $\times 10^{- 6}$
c \| $MF / NMF$ \| $QoS$ Minimum Overhead Required for $P_{u} = 10^{- 2}$ W
MIMO setup	Optimal hyperparameters	Min overhead	Train NMSE	Test NMSE
128 by 128	SGD MF {D = 2, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 1 $\times 10^{- 6}$ }	0.1	0.115517	0.118776
256 by 256	BGD MF {D = 3, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 0.0001}	0.1	0.016475	0.016679
512 by 512	SGD NMF{D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 1 $\times 10^{- 6}$ }	0.1	0.003371	0.003449
1024 by 1024	BGD MF {D = 4, ( $λ$ , $μ$ ) = (0.0001, 0.0001), $α_{k}$ = 1 $\times 10^{- 5}$ }	0.1	0.001681	0.001948

Table 3.

Q o S

minimum overhead required for

M L P

for all the proposed

P_{u}

regimes.

Table 3.

Q o S

minimum overhead required for

M L P

for all the proposed

P_{u}

regimes.

a \| $MLP$ \| $QoS$ Minimum Overhead Required for $P_{u} = 1$ W
MIMO setup	Optimal hyperparameters	Min overhead	Train NMSE	Test NMSE
128 by 128	{(J = 3, D = 8), B = 4, $β_{k}$ = 0.0001}	0.1	0.001144	0.002639
256 by 256	{(J = 3, D = 16), B = 16, $β_{k}$ = 0.001}	0.1	3.941522 $\times 10^{- 5}$	3.948157 $\times 10^{- 6}$
512 by 512	{(J = 3, D = 64), B = 32, $β_{k}$ = 0.0001}	0.1	3.305507 $\times 10^{- 5}$	3.335168 $\times 10^{- 5}$
1024 by 1024	{(J = 3, D = 64), B = 64, $β_{k}$ = 0.0001}	0.1	9.810028 $\times 10^{- 6}$	9.857067 $\times 10^{- 6}$
b \| $MLP$ \| $QoS$ Minimum Overhead Required for $P_{u} = 10^{- 1}$ W
MIMO setup	Optimal hyperparameters	Min overhead	Train NMSE	Test NMSE
128 by 128	{(J = 3, D = 8), B = 4, $β_{k}$ = 0.0001}	0.1	0.007569	0.007662
256 by 256	{(J = 3, D = 16), B = 16, $β_{k}$ = 0.001}	0.1	0.000139	0.000288
512 by 512	{(J = 3, D = 64), B = 32, $β_{k}$ = 0.0001}	0.1	5.419598 $\times 10^{- 5}$	5.756302 $\times 10^{- 5}$
1024 by 1024	{(J = 3, D = 64), B = 64, $β_{k}$ = 0.0001}	0.1	1.184073 $\times 10^{- 5}$	1.72301 $\times 10^{- 5}$
c \| $MLP$ \| $QoS$ Minimum Overhead Required for $P_{u} = 10^{- 2}$ W
MIMO setup	Optimal hyperparameters	Min overhead	Train NMSE	Test NMSE
128 by 128	{(J = 3, D = 8), B = 4, $β_{k}$ = 0.0001}	0.1	0.049559	0.071185
256 by 256	{(J = 3, D = 16), B = 16, $β_{k}$ = 0.001}	0.1	0.017011	0.017634
512 by 512	{(J = 3, D = 64), B = 32, $β_{k}$ = 0.0001}	0.1	0.000141	0.000666
1024 by 1024	{(J = 3, D = 64), B = 64, $β_{k}$ = 0.0001}	0.1	1.700140 $\times 10^{- 4}$	1.702889 $\times 10^{- 4}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ktari, A.; Ghauch, H.; Rekaya-Ben Othman, G. Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO. Entropy 2024, 26, 626. https://doi.org/10.3390/e26080626

AMA Style

Ktari A, Ghauch H, Rekaya-Ben Othman G. Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO. Entropy. 2024; 26(8):626. https://doi.org/10.3390/e26080626

Chicago/Turabian Style

Ktari, Aymen, Hadi Ghauch, and Ghaya Rekaya-Ben Othman. 2024. "Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO" Entropy 26, no. 8: 626. https://doi.org/10.3390/e26080626

APA Style

Ktari, A., Ghauch, H., & Rekaya-Ben Othman, G. (2024). Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO. Entropy, 26(8), 626. https://doi.org/10.3390/e26080626

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

Abstract

1. Introduction

2. Literature Survey

3. System Model

4. Matrix Factorization and Non-Negative Matrix Factorization

4.1. MF and NMF Problem Formulation

4.2. Solutions for MF

4.3. Solutions for NMF

4.4. Prediction for MF and NMF

4.5. Proposed BA Algorithm Using MF/NMF

4.6. Numerical Simulations

5. Multi-Layer Perceptron

5.1. MLP Problem Formulation

5.2. MLP Learning

5.3. Prediction Using MLP

5.4. Proposed BA Algorithm Using $M L P$

5.5. Numerical Simulations

6. Results and Discussion

6.1. Train/Test Prediction Performance Comparison

6.2. Similarities and Differences between Models

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Proof: BCD Convergence

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Machine Learning Techniques for Blind Beam Alignment in mmWave Massive MIMO

Abstract

1. Introduction

2. Literature Survey

3. System Model

4. Matrix Factorization and Non-Negative Matrix Factorization

4.1. MF and NMF Problem Formulation

4.2. Solutions for MF

4.3. Solutions for NMF

4.4. Prediction for MF and NMF

4.5. Proposed BA Algorithm Using MF/NMF

4.6. Numerical Simulations

5. Multi-Layer Perceptron

5.1. MLP Problem Formulation

5.2. MLP Learning

5.3. Prediction Using MLP

5.4. Proposed BA Algorithm Using M L P

5.5. Numerical Simulations

6. Results and Discussion

6.1. Train/Test Prediction Performance Comparison

6.2. Similarities and Differences between Models

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Proof: BCD Convergence

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

5.4. Proposed BA Algorithm Using $M L P$