Article

Speeding up Training of Linear Predictors for Multi-Antenna Frequency-Selective Channels via Meta-Learning

Department of Engineering, King’s College London, London WC2R 2LS, UK
* Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1363; https://doi.org/10.3390/e24101363
Submission received: 3 August 2022 / Revised: 19 September 2022 / Accepted: 23 September 2022 / Published: 26 September 2022
(This article belongs to the Special Issue Wireless Networks: Information Theoretic Perspectives II)

Abstract

An efficient data-driven prediction strategy for multi-antenna frequency-selective channels must operate based on a small number of pilot symbols. This paper proposes novel channel-prediction algorithms that address this goal by integrating transfer and meta-learning with a reduced-rank parametrization of the channel. The proposed methods optimize linear predictors by utilizing data from previous frames, which are generally characterized by distinct propagation characteristics, in order to enable fast training on the time slots of the current frame. The proposed predictors rely on a novel long short-term decomposition (LSTD) of the linear prediction model that leverages the disaggregation of the channel into long-term space-time signatures and fading amplitudes. We first develop predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Then, we introduce transfer and meta-learning algorithms for LSTD-based prediction models that build on equilibrium propagation (EP) and alternating least squares (ALS). Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.

1. Introduction

The capacity to accurately predict channel state information (CSI) is a key enabler of proactive resource allocation strategies, which are central to many visions for efficient and low-latency communications in 6G and beyond (see, e.g., [1]). The problem of channel prediction is relatively straightforward in the presence of known channel statistics. In fact, under the common assumption that multi-antenna frequency-selective channels follow stationary complex Gaussian processes, optimal channel predictors can be obtained via linear minimum mean squared error (LMMSE) estimators such as the Wiener filter [2]. However, in practice, the channel statistics are not known, and predictors need to be optimized based on training data obtained through the transmission of pilot signals [3,4,5,6,7,8]. The problem addressed by this paper concerns the design of data-efficient channel predictors for multi-antenna frequency-selective channels.

1.1. Context and Prior Art

A classical approach to tackle this problem is to optimize finite impulse response (FIR) filters [9], or recursive linear filters via autoregressive (AR) models [3,4,5] such as Kalman filtering (KF) [6,7,8], by estimating channel statistics from the available pilot data. Although recursive linear filters generally outperform FIR filters when an accurate model of the state-transition dynamics is available [10,11], FIR filters are typically advantageous in the presence of limited amounts of pilot data [12]. More recently, deep learning-based nonlinear predictors have also been proposed to adapt to channel statistics through the training of neural networks, namely recurrent neural networks [13,14,15,16], convolutional neural networks [17,18], and multi-layer perceptrons [19].
As reported in [14,15,16,19], deep learning-based predictors tend to require larger amounts of training (pilot) data, and fail to outperform well-designed linear filters in the low-data regime. Solutions addressing this issue include [20], which applies reinforcement learning to determine whether or not to predict channels at the current time, and the use of hypernetworks to adapt the parameters of a KF according to the current channel dynamics [12].
Most prior work, with the notable exception of [12], focuses on the optimization of channel predictors under the assumption of a stationary spatio-temporal correlation function across the time interval of interest. This conventional approach fails to leverage common structure that may exist across multiple frames, with each frame being characterized by distinct spatio-temporal correlations (see Figure 1). Reference [12] allowed for varying Doppler spectra across frames through a deep learning-based hypernetwork that adapts the parameters of a generative model.
This paper takes a different approach that allows us to move beyond the single-antenna setting studied in [12]. As described in the next subsection, the key ingredients of the proposed methods are transfer learning and meta-learning. Transfer learning [21] and meta-learning [22] aim at using knowledge from distinct tasks in order to reduce the data requirements on a new task of interest. Given a large enough resemblance between the different tasks, both transfer learning and meta-learning have shown remarkable performance in reducing the sample complexity of general machine learning problems [23]. Transfer learning applies to a specific target task, whereas meta-learning caters to adaptation to any new task (see, e.g., [24]).
Previous applications of transfer learning to communication systems include beamforming for the multi-user, multiple-input, single-output (MISO) downlink [25] and for intelligent reflecting surface (IRS)-assisted MISO downlink [26], as well as downlink channel prediction [27,28] (see also [25,27]). Meta-learning has been applied to communication systems for demodulation [29,30,31,32], decoding [33], end-to-end design of encoding and decoding with and without a channel model [34,35], MIMO detection [36], beamforming for multiuser MISO downlink systems [37], layered division multiplexing for ultra-reliable communications [38], UAV trajectory design [39], and resource allocation [40].

1.2. Contributions

This paper proposes novel, data-efficient channel prediction algorithms that reduce pilot requirements by integrating transfer and meta-learning with a novel long short-term decomposition (LSTD) of the linear predictors. Unlike the prior articles reviewed above, the proposed methods apply to multi-antenna frequency-selective channels whose statistics change across frames (see Figure 1). Specific contributions are as follows.
  • We develop efficient predictors for single-antenna frequency-flat channels based on transfer/meta-learned quadratic regularization. Transfer and meta-learning are used to leverage data from multiple frames in order to extract shared useful knowledge that can be used for prediction on the current frame (see Figure 2).
  • Targeting multi-antenna frequency-selective channels, we introduce the LSTD-based model class of linear predictors that builds on the well-known disaggregation of standard channel models into long-term space-time signatures and fading amplitudes [5,41,42,43,44]. Accordingly, the channel is described by multipath features, such as angle of arrivals, delays, and path loss, that change slowly across the frame, as well as by fast-varying fading amplitudes. Transfer learning and meta-learning algorithms for LSTD-based prediction models are proposed that build on equilibrium propagation (EP) and alternating least squares (ALS).
  • Numerical results under the 3GPP 5G standard channel model demonstrate the impact of transfer and meta-learning on reducing the number of pilots for channel prediction, as well as the merits of the proposed LSTD parametrization.
Part of this paper was presented in [45], which only covered meta-learning for the case of single-antenna frequency-flat channels. As compared to [45], this journal version includes both transfer and meta-learning, and it addresses the general scenario of multi-antenna frequency-selective channels by introducing and developing the LSTD model class of linear predictors.

1.3. Organization

The rest of the paper is organized as follows. In Section 2, we detail system and channel models, and describe conventional, transfer, and meta-learning concepts. In Section 3, we develop solutions for single-antenna frequency-flat channels. In Section 4, multi-antenna frequency-selective channels are considered, and we propose LSTD-based linear prediction schemes. Numerical results are presented in Section 5, and conclusions are presented in Section 6.
Notation: In this paper, $(\cdot)^T$ denotes transposition; $(\cdot)^H$ Hermitian transposition; $\|\cdot\|_F$ the Frobenius norm; $|\cdot|$ the absolute value; $\|\cdot\|$ the Euclidean norm; $\mathrm{vec}(\cdot)$ the vectorization operator that stacks the columns of a matrix into a column vector; $[\cdot]_i$ the $i$-th element of a vector; and $\mathbf{I}_S$ the $S \times S$ identity matrix for some integer $S$.

2. System Model

2.1. System Model

As shown in Figure 1, we study a frame-based transmission system, with each frame containing multiple time slots. Each frame carries data from a possibly different user to the same receiver, e.g., a base station. The receiver has $N_R$ antennas, and the transmitters have $N_T$ antennas. The channel $\mathbf{h}_{l,f}$ in slot $l = 1, 2, \ldots$ of frame $f = 1, 2, \ldots$ is a vector with $S = N_R N_T W$ entries, with $W$ being the delay spread measured in number of transmission symbols. Within each frame $f$, the multi-path channels $\mathbf{h}_{l,f} \in \mathbb{C}^{N_R N_T W \times 1}$ are characterized by fixed, frame-dependent, average path powers, path delays, Doppler spectra, and angles of arrival and departure [46]. For instance, in one frame $f$, we may have a slow-moving user in line-of-sight condition subject to time-invariant fading, whereas in another frame, the channel may be subject to significant scattering and fast temporal variations with a large Doppler frequency. In both cases, the frame is assumed to be short enough that average path powers, path delays, Doppler spectra, and angles of arrival and departure do not change within the frame [41,42].
As also seen in Figure 1, for each frame $f$, we are interested in addressing the lag-$\delta$ channel prediction problem, in which the channel $\mathbf{h}_{l+\delta,f}$ is predicted based on the $N$ past channels

$$\mathbf{H}_{l,f}^N = [\mathbf{h}_{l,f}, \ldots, \mathbf{h}_{l-N+1,f}] \in \mathbb{C}^{S \times N}. \qquad (1)$$

We adopt linear prediction with regressor $\mathbf{V}_f \in \mathbb{C}^{SN \times S}$, so that the prediction is given as

$$\hat{\mathbf{h}}_{l+\delta,f} = \mathbf{V}_f^H \mathrm{vec}(\mathbf{H}_{l,f}^N). \qquad (2)$$
The focus on linear prediction is justified by the optimality of linear estimation for Gaussian stationary processes [47], which provide standard models for fading channels in rich scattering environments.
Assuming no prior knowledge of the channel model, we adopt a data-driven approach to the design of the predictor (2). Accordingly, to train the linear predictor (2), for any frame $f$, the receiver is assumed to have available the training set

$$\mathcal{Z}_f^{\mathrm{tr}} = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L^{\mathrm{tr}}} \triangleq \{(\mathrm{vec}(\mathbf{H}_{l,f}^N), \mathbf{h}_{l+\delta,f})\}_{l=N}^{L^{\mathrm{tr}}+N-1} \qquad (3)$$

encompassing $L^{\mathrm{tr}}$ input–output examples. Dataset $\mathcal{Z}_f^{\mathrm{tr}}$ can be constructed from $L^{\mathrm{tr}}+N+\delta-1$ channels $\{\mathbf{h}_{1,f}, \ldots, \mathbf{h}_{L^{\mathrm{tr}}+N+\delta-1,f}\}$ by using the lag-$\delta$ channel $\mathbf{h}_{l+\delta,f}$ as the label for the covariate vector $\mathrm{vec}(\mathbf{H}_{l,f}^N)$. In practice, the channel vectors $\mathbf{h}_{l,f}$ are estimated by using pilot symbols, and estimation noise can be easily incorporated in the model (see Section 2.5). Throughout, we implicitly assume that the channels $\mathbf{h}_{l,f}$ correspond to estimates available at the receiver.
From dataset $\mathcal{Z}_f^{\mathrm{tr}}$ in (3), we write the corresponding $L^{\mathrm{tr}} \times SN$ input matrix $\mathbf{X}_f^{\mathrm{tr}} = [\mathbf{x}_{1,f}, \ldots, \mathbf{x}_{L^{\mathrm{tr}},f}]^H$ and the $L^{\mathrm{tr}} \times S$ target matrix $\mathbf{Y}_f^{\mathrm{tr}} = [\mathbf{y}_{1,f}, \ldots, \mathbf{y}_{L^{\mathrm{tr}},f}]^H$, so that the dataset can be expressed as the pair $\mathcal{Z}_f^{\mathrm{tr}} = (\mathbf{X}_f^{\mathrm{tr}}, \mathbf{Y}_f^{\mathrm{tr}})$.
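As an illustration, the following NumPy sketch builds the pair $(\mathbf{X}_f^{\mathrm{tr}}, \mathbf{Y}_f^{\mathrm{tr}})$ in (3) from a sequence of (estimated) channel vectors; the function name and array conventions are ours and not part of the paper's implementation.

```python
import numpy as np

def build_training_set(h, N, delta, L_tr):
    """Construct (X_f^tr, Y_f^tr) from the dataset definition (3).

    h: (num_slots, S) complex array, with h[l-1] holding the channel of slot l;
       num_slots must be at least L_tr + N + delta - 1.
    Returns X of shape (L_tr, S*N) and Y of shape (L_tr, S), whose i-th rows
    are the Hermitian transposes of vec(H_{l,f}^N) and h_{l+delta,f}.
    """
    S = h.shape[1]
    X = np.empty((L_tr, S * N), dtype=complex)
    Y = np.empty((L_tr, S), dtype=complex)
    for i, l in enumerate(range(N, L_tr + N)):  # slots l = N, ..., L_tr + N - 1
        H = h[l - N:l][::-1].T                  # H_{l,f}^N = [h_l, ..., h_{l-N+1}]
        X[i] = H.flatten(order='F').conj()      # row i: vec(H_{l,f}^N)^H
        Y[i] = h[l + delta - 1].conj()          # row i: h_{l+delta,f}^H
    return X, Y
```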

2.2. Channel Model

We adopt the standard spatial channel model [46]. Accordingly, a channel vector $\mathbf{h}_{l,f}$ for slot $l$ in frame $f$ is obtained by sampling the continuous-time multipath vector channel impulse response

$$\mathbf{h}_{l,f}(\tau) = \sum_{d=1}^{D} \sqrt{\Omega_{d,f}}\, \mathbf{a}_{d,f}\, g(\tau - \tau_{d,f}) \exp(j 2\pi \gamma_{d,f} t_l), \qquad (4)$$

which is the sum of contributions from $D$ paths. In (4), the waveform $g(\tau)$ is given by the convolution of the transmitted waveform and the matched filter at the receiver. Furthermore, the contribution of the $d$-th path depends on the average power $\Omega_{d,f}$, the path delay $\tau_{d,f}$, the $N_T N_R \times 1$ spatial vector $\mathbf{a}_{d,f}$, the Doppler frequency $\gamma_{d,f}$, and the starting wall-clock time $t_l$ of the $l$-th slot. The average power $\Omega_{d,f}$, path delay $\tau_{d,f}$, spatial vector $\mathbf{a}_{d,f}$, and Doppler frequency $\gamma_{d,f}$ are constant within one frame because they depend on large-scale geometric features of the propagation environment. However, they may change over frames following Clause 7.6.3.2 (Procedure B) in [46]. The number of paths is assumed, without loss of generality, to be the same for all frames $f$, because one can set $\Omega_{d,f} = 0$ for frames with a smaller number of paths.
In [46], the spatial vector $\mathbf{a}_{d,f}$ has a structure that depends on the field patterns and steering vectors of the transmit and receive antennas, as well as on the polarization of the antennas. Mathematically, the entry of the spatial vector $\mathbf{a}_{d,f}$ corresponding to receive antenna element $n_R$ and transmit antenna element $n_T$ can be modeled as [46]

$$[\mathbf{a}_{d,f}]_{n_R + (n_T-1)N_R} = \mathbf{F}_{rx,n_R}(\theta_{d,f,ZOA}, \phi_{d,f,AOA})^T \mathbf{M}_{d,f}\, \mathbf{F}_{tx,n_T}(\theta_{d,f,ZOD}, \phi_{d,f,AOD}) \exp\left(j 2\pi \frac{l_{d,f,n_R,n_T}}{\lambda_0}\right), \qquad (5)$$

where $\mathbf{F}_{rx,n_R}(\cdot,\cdot)$ and $\mathbf{F}_{tx,n_T}(\cdot,\cdot)$ are the $2 \times 1$ field patterns; $\theta_{d,f,ZOA}$, $\phi_{d,f,AOA}$, $\theta_{d,f,ZOD}$, and $\phi_{d,f,AOD}$ are the zenith angle of arrival (ZOA), azimuth angle of arrival (AOA), zenith angle of departure (ZOD), and azimuth angle of departure (AOD) (in degrees); $\lambda_0$ is the wavelength (in m) of the carrier frequency; $l_{d,f,n_R,n_T}$ is the length of the path (in m) between the two antennas; and $\mathbf{M}_{d,f}$ is the polarization coupling matrix defined as

$$\mathbf{M}_{d,f} = \begin{bmatrix} \exp(j\Phi_{d,f}^{\theta\theta}) & \sqrt{1/\kappa_{d,f}}\exp(j\Phi_{d,f}^{\theta\phi}) \\ \sqrt{1/\kappa_{d,f}}\exp(j\Phi_{d,f}^{\phi\theta}) & \exp(j\Phi_{d,f}^{\phi\phi}) \end{bmatrix}, \qquad (6)$$

with random initial phases $\Phi_{d,f}^{(\cdot,\cdot)} \sim \mathcal{U}(-\pi, \pi)$ and log-normal distributed cross-polarization power ratio (XPR) $\kappa_{d,f} > 0$ [46].
In order to obtain the $S \times 1$ vector $\mathbf{h}_{l,f}$, we sample the continuous-time channel $\mathbf{h}_{l,f}(\tau)$ in (4) at the Nyquist rate $1/T$ to obtain the $W$ discrete-time $N_R N_T \times 1$ channel impulse responses

$$\mathbf{h}_{l,f}[w] = \mathbf{h}_{l,f}((w-1)T) \qquad (7)$$

for $w = 1, \ldots, W$. Following [41], the channel vector $\mathbf{h}_{l,f} \in \mathbb{C}^{N_R N_T W \times 1}$ is obtained by concatenating the $W$ channel vectors $\mathbf{h}_{l,f}[w]$ for $w = 1, \ldots, W$ as

$$\mathbf{h}_{l,f} = [\mathbf{h}_{l,f}[1]^T, \ldots, \mathbf{h}_{l,f}[W]^T]^T. \qquad (8)$$

2.3. Conventional Learning

The optimization of the linear predictor $\mathbf{V}_f$ in (2) can be formulated as a supervised learning problem, as will be detailed in Section 3. In conventional learning, the predictor $\mathbf{V}_f$ is designed separately in each frame $f$ based on the corresponding dataset $\mathcal{Z}_f^{\mathrm{tr}}$. In order for this predictor $\mathbf{V}_f$ to generalize well to slots in the same frame $f$ outside the training set, it is necessary to have a sufficiently large number of training slots $L^{\mathrm{tr}}$ [48,49].

2.4. Transfer Learning and Meta-Learning

In conventional learning, the number of required training slots $L^{\mathrm{tr}}$ can be reduced by selecting hyperparameters of the learning problem that reflect prior knowledge about the prediction problem at hand. In the next sections, we will explore solutions that optimize such hyperparameters based on data received from multiple previous frames. To this end, as illustrated in Figure 2, we assume the availability of channel data collected from $F$ frames received in the past. In each frame, the channel follows the model described in Section 2.2. Accordingly, the data from each previous frame consist of $L+N+\delta-1$ channels $\{\mathbf{h}_{1,f}, \ldots, \mathbf{h}_{L+N+\delta-1,f}\}$ for some integer $L$.
By using these channels, the dataset

$$\mathcal{Z}_f = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L} \triangleq \{(\mathrm{vec}(\mathbf{H}_{l,f}^N), \mathbf{h}_{l+\delta,f})\}_{l=N}^{L+N-1} \qquad (9)$$

can be obtained as explained in Section 2.1, where $L$ is typically larger than $L^{\mathrm{tr}}$, although this will not be assumed in the analysis. Correspondingly, we also define the $L \times SN$ input matrix $\mathbf{X}_f$ and the $L \times S$ target matrix $\mathbf{Y}_f$. We will propose methods that leverage the historical knowledge available from the datasets $\mathcal{Z}_f$ for $f = 1, \ldots, F$ via transfer learning and meta-learning, with the goal of reducing the number of pilots, $L^{\mathrm{tr}}$, needed for channel prediction in a new frame (i.e., frame $F+1$ in Figure 2).

2.5. Incorporating Estimation Noise

Until now, we have assumed that the channel vectors $\mathbf{h}_{l,f}$ are available noiselessly to the predictor. In practice, channel information needs to be estimated via pilots. To elaborate on this point, let us assume the received signal model

$$\mathbf{y}_{l,f}^p[i] = \mathbf{h}_{l,f}\, x_{l,f}^p[i] + \mathbf{n}_{l,f}[i], \qquad (10)$$

where $x_{l,f}^p[i]$ stands for the $i$-th transmitted pilot symbol in block $l$ of frame $f$; $\mathbf{y}_{l,f}^p[i]$ for the corresponding received signal; and $\mathbf{h}_{l,f}$ for the channel, with additive white complex Gaussian noise $\mathbf{n}_{l,f}[i] \sim \mathcal{CN}(\mathbf{0}, N_0 \mathbf{I}_S)$. Given an average energy constraint $\mathbb{E}[|x_{l,f}^p[i]|^2] = E_x$ for the training symbols, the average signal-to-noise ratio (SNR) is given as $E_x/N_0$. From (10), we can estimate the channel as

$$\check{\mathbf{h}}_{l,f} = \frac{\mathbf{y}_{l,f}^p[i]}{x_{l,f}^p[i]} = \mathbf{h}_{l,f} + \frac{\mathbf{n}_{l,f}[i]}{x_{l,f}^p[i]} = \mathbf{h}_{l,f} + \boldsymbol{\xi}, \qquad (11)$$

which suffers from channel estimation noise $\boldsymbol{\xi} \sim \mathcal{CN}(\mathbf{0}, \mathrm{SNR}^{-1}\mathbf{I}_S)$. If $P$ training symbols are available in each block, the variance of the channel estimation noise can be reduced via averaging to $\mathrm{SNR}^{-1}/P$. The channels $\check{\mathbf{h}}_{l,f}$ can be used as training data in the schemes described in the previous subsections. More efficient channel estimation methods, including sparse Bayesian learning [50] and approximate message passing approaches [51], may further reduce the channel estimation noise.
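As a minimal sketch of the least-squares estimate (11) with averaging over $P$ pilot symbols (the function and variable names are ours):

```python
import numpy as np

def estimate_channel(y_pilots, x_pilots):
    """Per-pilot least-squares estimates as in (11), averaged over P pilots.

    y_pilots: (P, S) received pilot signals; x_pilots: (P,) pilot symbols.
    Averaging reduces the estimation-noise variance from SNR^-1 to SNR^-1 / P.
    """
    return (y_pilots / x_pilots[:, None]).mean(axis=0)
```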

3. Single-Antenna Frequency-Flat Channels

In this section, we propose transfer learning and meta-learning methods for single-antenna flat-fading channels, for which $S = 1$. Throughout this section, we write the prediction matrix $\mathbf{V}_f \in \mathbb{C}^{SN \times S}$ in (2) as the vector $\mathbf{v}_f \in \mathbb{C}^{N \times 1}$, and the target data $\mathbf{Y}_f^{\mathrm{tr}} \in \mathbb{C}^{L^{\mathrm{tr}} \times S}$ as the vector $\mathbf{y}_f^{\mathrm{tr}} \in \mathbb{C}^{L^{\mathrm{tr}} \times 1}$. Correspondingly, we rewrite the linear predictor (2) as

$$\hat{h}_{l+\delta,f} = \mathbf{v}_f^H \mathrm{vec}(\mathbf{H}_{l,f}^N). \qquad (12)$$

3.1. Conventional Learning

Assuming the standard quadratic loss, we formulate the supervised learning problem as the ridge regression optimization

$$\mathbf{v}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{v}}) = \arg\min_{\mathbf{v}_f \in \mathbb{C}^{N\times 1}} \|\mathbf{X}_f^{\mathrm{tr}}\mathbf{v}_f - \mathbf{y}_f^{\mathrm{tr}}\|^2 + \lambda \|\mathbf{v}_f - \bar{\mathbf{v}}\|^2, \qquad (13)$$

with hyperparameters $(\lambda, \bar{\mathbf{v}})$ given by the scalar $\lambda > 0$ and by the $N \times 1$ bias vector $\bar{\mathbf{v}}$. The bias vector $\bar{\mathbf{v}}$ can be thought of as defining the prior mean of the predictor $\mathbf{v}_f$, whereas $\lambda > 0$ specifies the precision (i.e., the inverse of the variance) of this prior knowledge. The solution of problem (13) can be obtained explicitly as

$$\mathbf{v}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{v}}) = (\mathbf{A}_f^{\mathrm{tr}})^{-1}\left((\mathbf{X}_f^{\mathrm{tr}})^H \mathbf{y}_f^{\mathrm{tr}} + \lambda \bar{\mathbf{v}}\right), \quad \text{with } \mathbf{A}_f^{\mathrm{tr}} = (\mathbf{X}_f^{\mathrm{tr}})^H \mathbf{X}_f^{\mathrm{tr}} + \lambda \mathbf{I}. \qquad (14)$$
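For concreteness, the closed form (14) amounts to a few lines of NumPy; this is a sketch under the row conventions of Section 2.1, with names of our choosing.

```python
import numpy as np

def ridge_predictor(X_tr, y_tr, v_bar, lam):
    """Biased ridge regression: closed-form solution (14) of problem (13).

    X_tr: (L_tr, N) input matrix; y_tr: (L_tr,) target vector;
    v_bar: (N,) bias vector; lam: regularization scalar lambda > 0.
    """
    A = X_tr.conj().T @ X_tr + lam * np.eye(X_tr.shape[1])   # A_f^tr in (14)
    return np.linalg.solve(A, X_tr.conj().T @ y_tr + lam * v_bar)
```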

3.2. Transfer Learning

Transfer learning uses the datasets $\mathcal{Z}_f$ in (9) from the previous $F$ frames, i.e., with $f = 1, \ldots, F$, to optimize the hyperparameter vector $\bar{\mathbf{v}}$ in (13) as

$$\bar{\mathbf{v}}^{\mathrm{trans}} = \arg\min_{\mathbf{v} \in \mathbb{C}^{N\times 1}} \sum_{f=1}^{F} \|\mathbf{X}_f \mathbf{v} - \mathbf{y}_f\|^2. \qquad (15)$$

The rationale for this choice is that the vector $\bar{\mathbf{v}}^{\mathrm{trans}}$ provides a useful prior mean to be used in the ridge regression problem (13), because it corresponds to an optimized predictor for the previous frames. Having optimized the bias vector $\bar{\mathbf{v}}^{\mathrm{trans}}$, we train a channel predictor $\mathbf{v}$ via ridge regression (13) by using the training data $\mathcal{Z}_{f_{\mathrm{new}}}$ for a new frame $f_{\mathrm{new}}$ with $L^{\mathrm{new}}$ training samples, obtaining

$$\mathbf{v}_{f_{\mathrm{new}}}^* = \mathbf{v}^*(\mathcal{Z}_{f_{\mathrm{new}}}|\bar{\mathbf{v}}^{\mathrm{trans}}). \qquad (16)$$
Note that, at deployment time, this approach has the same computational complexity as conventional learning, because the bias vector is treated as a constant.
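A sketch of the two-step procedure (15)-(16) is given below, reusing the hypothetical ridge_predictor routine sketched in Section 3.1:

```python
import numpy as np

def transfer_bias(X_list, y_list):
    """Bias vector (15): a single least-squares predictor fit jointly
    on the data pooled from all F previous frames."""
    X, y = np.vstack(X_list), np.concatenate(y_list)
    v_bar, *_ = np.linalg.lstsq(X, y, rcond=None)
    return v_bar

# Adaptation on a new frame as in (16):
# v_new = ridge_predictor(X_new, y_new, transfer_bias(X_list, y_list), lam=1.0)
```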

3.3. Meta-Learning

Unlike transfer learning, which utilizes all the available datasets $\{\mathcal{Z}_f\}_{f=1}^F$ from the previous frames at once as in (15), meta-learning allows for the separate adaptation of the predictor in each frame. To this end, for each frame $f$, we split the $L$ data points into $L^{\mathrm{tr}}$ training pairs $\{(\mathbf{x}_{i,f}, y_{i,f})\}_{i=1}^{L^{\mathrm{tr}}} \triangleq \{(\mathbf{x}_{i,f}^{\mathrm{tr}}, y_{i,f}^{\mathrm{tr}})\}_{i=1}^{L^{\mathrm{tr}}} = \mathcal{Z}_f^{\mathrm{tr}}$ and $L^{\mathrm{te}} = L - L^{\mathrm{tr}}$ test pairs $\{(\mathbf{x}_{i,f}, y_{i,f})\}_{i=L^{\mathrm{tr}}+1}^{L} \triangleq \{(\mathbf{x}_{i,f}^{\mathrm{te}}, y_{i,f}^{\mathrm{te}})\}_{i=1}^{L^{\mathrm{te}}} = \mathcal{Z}_f^{\mathrm{te}}$, resulting in two separate datasets, $\mathcal{Z}_f^{\mathrm{tr}}$ and $\mathcal{Z}_f^{\mathrm{te}}$. We correspondingly define the $L^{\mathrm{tr}} \times N$ input matrix $\mathbf{X}_f^{\mathrm{tr}}$ and the $L^{\mathrm{tr}} \times 1$ target vector $\mathbf{y}_f^{\mathrm{tr}}$, as well as the $L^{\mathrm{te}} \times N$ input matrix $\mathbf{X}_f^{\mathrm{te}}$ and the $L^{\mathrm{te}} \times 1$ target vector $\mathbf{y}_f^{\mathrm{te}}$.
The hyperparameter vector $\bar{\mathbf{v}}$ is then optimized by minimizing the sum loss of the predictors $\mathbf{v}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{v}})$ in (13), which are adapted separately for each frame $f = 1, \ldots, F$ given the bias vector $\bar{\mathbf{v}}$. Accordingly, estimating the loss in each frame $f$ via the test set $\mathcal{Z}_f^{\mathrm{te}}$ yields the meta-learning problem

$$\bar{\mathbf{v}}^{\mathrm{meta}} = \arg\min_{\bar{\mathbf{v}} \in \mathbb{C}^{N\times 1}} \sum_{f=1}^{F}\sum_{i=1}^{L^{\mathrm{te}}} \left|\mathbf{v}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{v}})^H \mathbf{x}_{i,f}^{\mathrm{te}} - y_{i,f}^{\mathrm{te}}\right|^2. \qquad (17)$$

As studied in [52], the minimization in (17) is a least squares problem that can be solved in closed form as

$$\bar{\mathbf{v}}^{\mathrm{meta}} = \arg\min_{\bar{\mathbf{v}} \in \mathbb{C}^{N\times 1}} \sum_{f=1}^{F} \|\tilde{\mathbf{X}}_f^{\mathrm{te}}\bar{\mathbf{v}} - \tilde{\mathbf{y}}_f^{\mathrm{te}}\|^2 = (\tilde{\mathbf{X}}^H\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^H\tilde{\mathbf{y}}, \qquad (18)$$

where the $L^{\mathrm{te}} \times N$ matrix $\tilde{\mathbf{X}}_f^{\mathrm{te}}$ contains as rows the Hermitian transposes of the $N \times 1$ pre-conditioned input vectors $\{\lambda(\mathbf{A}_f^{\mathrm{tr}})^{-1}\mathbf{x}_{i,f}^{\mathrm{te}}\}_{i=1}^{L^{\mathrm{te}}}$, with $\mathbf{A}_f^{\mathrm{tr}} = (\mathbf{X}_f^{\mathrm{tr}})^H\mathbf{X}_f^{\mathrm{tr}} + \lambda\mathbf{I}$; the $L^{\mathrm{te}} \times 1$ vector $\tilde{\mathbf{y}}_f^{\mathrm{te}}$ contains vertically the complex conjugates of the transformed outputs $\{y_{i,f}^{\mathrm{te}} - (\mathbf{y}_f^{\mathrm{tr}})^H\mathbf{X}_f^{\mathrm{tr}}(\mathbf{A}_f^{\mathrm{tr}})^{-1}\mathbf{x}_{i,f}^{\mathrm{te}}\}_{i=1}^{L^{\mathrm{te}}}$; the $FL^{\mathrm{te}} \times N$ matrix $\tilde{\mathbf{X}} = [(\tilde{\mathbf{X}}_1^{\mathrm{te}})^T, \ldots, (\tilde{\mathbf{X}}_F^{\mathrm{te}})^T]^T$ stacks vertically the matrices $\{\tilde{\mathbf{X}}_f^{\mathrm{te}}\}_{f=1}^F$; and the $FL^{\mathrm{te}} \times 1$ vector $\tilde{\mathbf{y}} = [(\tilde{\mathbf{y}}_1^{\mathrm{te}})^T, \ldots, (\tilde{\mathbf{y}}_F^{\mathrm{te}})^T]^T$ stacks vertically the vectors $\{\tilde{\mathbf{y}}_f^{\mathrm{te}}\}_{f=1}^F$. Unlike the standard meta-learning algorithms used by most papers on communications [25,29,30,32,33,34], the proposed meta-learning procedure adopts linear models, significantly reducing the computational complexity of meta-learning [52].
After meta-learning, similar to transfer learning, based on the meta-learned hyperparameter $\bar{\mathbf{v}}^{\mathrm{meta}}$, we train a channel predictor via ridge regression (13), obtaining

$$\mathbf{v}_{f_{\mathrm{new}}}^* = \mathbf{v}^*(\mathcal{Z}_{f_{\mathrm{new}}}|\bar{\mathbf{v}}^{\mathrm{meta}}). \qquad (19)$$
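The closed form (18) can be assembled frame by frame, as in the following sketch (our naming; the conjugation conventions follow the definitions below (18)):

```python
import numpy as np

def meta_bias(frames, lam):
    """Closed-form meta-learned bias vector (18).

    frames: list over f = 1, ..., F of tuples (X_tr, y_tr, X_te, y_te), with
    X_tr: (L_tr, N), y_tr: (L_tr,), X_te: (L_te, N), y_te: (L_te,).
    """
    rows, targets = [], []
    for X_tr, y_tr, X_te, y_te in frames:
        A_inv = np.linalg.inv(X_tr.conj().T @ X_tr
                              + lam * np.eye(X_tr.shape[1]))   # (A_f^tr)^-1
        rows.append(lam * X_te @ A_inv)          # rows of X_tilde_f^te
        # transformed outputs: subtract the prediction of the bias-free
        # ridge solution (A_f^tr)^-1 (X_f^tr)^H y_f^tr on the test inputs
        targets.append(y_te - X_te @ (A_inv @ (X_tr.conj().T @ y_tr)))
    v_bar, *_ = np.linalg.lstsq(np.vstack(rows),
                                np.concatenate(targets), rcond=None)
    return v_bar
```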

4. Multi-Antenna Frequency-Selective Channels

In this section, we study the more general scenario with any number of antennas and with frequency-selective channels, resulting in S > 1 . As we will discuss, a naïve extension of the techniques presented in the previous sections is undesirable, because this would not leverage the structure of the channel model (4). For this reason, in the following, we will introduce novel hybrid model- and data-driven solutions that build on the channel model (4).

4.1. Naïve Extension

We start by briefly presenting the direct extension of the approaches studied in the previous section to any $S > 1$. Unlike the previous section, we adopt the general matrix notation introduced in Section 2. First, conventional learning obtains the predictor by solving problem (13), which is generalized to any $S > 1$ as the minimization

$$\mathbf{V}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{V}}) = \arg\min_{\mathbf{V}_f \in \mathbb{C}^{SN\times S}} \|\mathbf{X}_f^{\mathrm{tr}}\mathbf{V}_f - \mathbf{Y}_f^{\mathrm{tr}}\|_F^2 + \lambda\|\mathbf{V}_f - \bar{\mathbf{V}}\|_F^2 \qquad (20)$$

over the linear prediction matrix $\mathbf{V}_f$ in (2). Similarly, transfer learning computes the bias matrix $\bar{\mathbf{V}}^{\mathrm{trans}}$ by solving the following generalization of problem (15),

$$\bar{\mathbf{V}}^{\mathrm{trans}} = \arg\min_{\mathbf{V} \in \mathbb{C}^{SN\times S}} \sum_{f=1}^{F} \|\mathbf{X}_f\mathbf{V} - \mathbf{Y}_f\|_F^2, \qquad (21)$$

followed by the evaluation of the predictor $\mathbf{V}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{V}}^{\mathrm{trans}})$ using (20); whereas meta-learning addresses the following generalization of minimization (17),

$$\bar{\mathbf{V}}^{\mathrm{meta}} = \arg\min_{\bar{\mathbf{V}} \in \mathbb{C}^{SN\times S}} \sum_{f=1}^{F}\sum_{i=1}^{L^{\mathrm{te}}} \left\|\mathbf{V}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{V}})^H\mathbf{x}_{i,f}^{\mathrm{te}} - \mathbf{y}_{i,f}^{\mathrm{te}}\right\|^2, \qquad (22)$$

over the bias matrix $\bar{\mathbf{V}} \in \mathbb{C}^{SN\times S}$, which is used to compute the predictor $\mathbf{V}^*(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{V}}^{\mathrm{meta}})$ in (20).
The issue with the naïve extensions (21) and (22) is that the dimensions of the predictor $\mathbf{V}$ and of the hyperparameter matrix $\bar{\mathbf{V}}$ can become extremely large when $S$ grows. This, in turn, may lead to overfitting in the hyperparameter space [53] when the number of frames, $F$, is limited. This form of overfitting may prevent transfer learning and meta-learning from effectively reducing the sample complexity for problem (20), because the optimized hyperparameter matrix $\bar{\mathbf{V}}$ would be excessively dependent on the data received in the $F$ previous frames. To solve this problem, we propose next to utilize the structure of the channel model (4) in order to reduce the dimension of the channel parametrization.

4.2. LSTD Channel Model

The channel model (4) implies that the channel vector $\mathbf{h}_{l,f}$ in (7) and (8) can be written as the product of a frame-dependent $N_R N_T W \times D$ matrix $\mathbf{T}_f$ and a slot-dependent $D \times 1$ vector $\boldsymbol{\beta}_{l,f}$, as in [41],

$$\mathbf{h}_{l,f} = \mathbf{T}_f\boldsymbol{\beta}_{l,f}, \qquad (23)$$

where $\mathbf{T}_f$ collects the space-time signatures of the $D$ paths as

$$\mathbf{T}_f = \left[\Omega_{1,f}^{1/2}\, \mathbf{g}(\tau_{1,f}) \otimes \mathrm{vec}(\mathbf{a}_{1,f}), \ldots, \Omega_{D,f}^{1/2}\, \mathbf{g}(\tau_{D,f}) \otimes \mathrm{vec}(\mathbf{a}_{D,f})\right], \qquad (24)$$

with $\mathbf{g}(\tau_{d,f}) = [g(-\tau_{d,f}), \ldots, g((W-1)T - \tau_{d,f})]^T$ being the $W \times 1$ vector that collects the Nyquist-rate samples of the delayed waveform $g(\tau - \tau_{d,f})$, and with the $D \times 1$ fading amplitude vector defined as $\boldsymbol{\beta}_{l,f} = [\exp(j\omega_{1,f}t_l), \ldots, \exp(j\omega_{D,f}t_l)]^T$, where $\omega_{d,f} = 2\pi\gamma_{d,f}$ (cf. (4)).
The frame-dependent matrix $\mathbf{T}_f$ is typically rank-deficient, because the paths are generally not all resolvable [54,55]. To account for this structural property of the channel, as in [41], we introduce an $N_R N_T W \times K$ full-rank matrix $\mathbf{B}_f$ with orthonormal columns, such that $\mathrm{span}\{\mathbf{T}_f\} = \mathrm{span}\{\mathbf{B}_f\}$, and redefine (23) as

$$\mathbf{h}_{l,f} = \mathbf{B}_f\mathbf{d}_{l,f}. \qquad (25)$$

As an example, the matrix $\mathbf{B}_f$ can be obtained from the singular value decomposition of matrix $\mathbf{T}_f$, i.e., $\mathbf{T}_f = \mathbf{B}_f\boldsymbol{\Lambda}_f^{1/2}\mathbf{U}_f^H$, by introducing the $K \times 1$ vector $\mathbf{d}_{l,f} = \boldsymbol{\Lambda}_f^{1/2}\mathbf{U}_f^H\boldsymbol{\beta}_{l,f}$ [41]. For future reference, we also rewrite (25) as

$$\mathbf{h}_{l,f} = \sum_{k=1}^{K} \mathbf{b}_f^k d_{l,f}^k, \qquad (26)$$

where $d_{l,f}^k$ is the $k$-th element of the vector $\mathbf{d}_{l,f}$ and $\mathbf{b}_f^k$ is the $k$-th column of the matrix $\mathbf{B}_f$.
We will refer to the matrix $\mathbf{B}_f$ in (26) as the long-term space-time feature matrix, or feature matrix for short, whereas the vector $\mathbf{d}_{l,f}$ will be referred to as the short-term amplitude vector. The parametrization (25)-(26) is particularly efficient when the feature matrix $\mathbf{B}_f$ can be accurately estimated from the available data. For conventional learning, this requires observing a sufficiently large number of slots per frame, i.e., a large $L^{\mathrm{new}}$ [41], as well as a channel that varies sufficiently quickly across each frame. In contrast, as we will explore, transfer and meta-learning can potentially leverage data from multiple frames in order to enhance the estimation of the feature matrix.
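As an example, the following sketch extracts the feature matrix $\mathbf{B}_f$ of (25), and the map from $\boldsymbol{\beta}_{l,f}$ to $\mathbf{d}_{l,f}$, from a given signature matrix $\mathbf{T}_f$ via the SVD; the rank tolerance and naming are our assumptions.

```python
import numpy as np

def lstd_features(T_f, tol=1e-10):
    """Long-term feature matrix B_f of (25) from the SVD T_f = B Lambda^(1/2) U^H.

    T_f: (S, D) space-time signature matrix (24), with S = N_R * N_T * W.
    Returns B_f (S, K) with orthonormal columns and the (K, D) matrix mapping
    beta_{l,f} to the short-term amplitudes d_{l,f} = Lambda^(1/2) U^H beta_{l,f}.
    """
    B, s, Uh = np.linalg.svd(T_f, full_matrices=False)
    K = int(np.sum(s > tol * s[0]))          # effective rank of T_f
    B_f = B[:, :K]                           # span{B_f} = span{T_f}
    to_d = np.diag(s[:K]) @ Uh[:K]           # d_{l,f} = Lambda^(1/2) U^H beta_{l,f}
    return B_f, to_d

# Then T_f @ beta equals B_f @ (to_d @ beta) up to numerical precision.
```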

4.3. LSTD-Based Prediction Model

Given the LSTD channel model (25) and (26), in this subsection we redefine the problem of predicting the channel $\mathbf{h}_{l+\delta,f} = \mathbf{B}_f\mathbf{d}_{l+\delta,f}$ as the problem of estimating the feature matrix $\mathbf{B}_f$ and predicting the amplitude vector $\mathbf{d}_{l+\delta,f}$ based on the available data. This leads to a reduced-rank parametrization of the linear predictor (2).
To start, we write the predicted channel $\hat{\mathbf{h}}_{l+\delta,f}$ as

$$\hat{\mathbf{h}}_{l+\delta,f} = \hat{\mathbf{B}}_f\hat{\mathbf{d}}_{l+\delta,f}, \qquad (27)$$

where $\hat{\mathbf{B}}_f$ and $\hat{\mathbf{d}}_{l+\delta,f}$ are the estimated feature matrix and the predicted amplitude vector, respectively. To define the corresponding predictor, we first observe that the input matrix $\mathbf{H}_{l,f}^N$ in (1) can be expressed by using (25) as

$$\mathbf{H}_{l,f}^N = \mathbf{B}_f[\mathbf{d}_{l,f}, \ldots, \mathbf{d}_{l-N+1,f}]. \qquad (28)$$

Assume now that we have an estimated feature matrix $\hat{\mathbf{B}}_f$. If this estimate is sufficiently accurate, the $N$ past amplitudes $[\mathbf{d}_{l,f}, \ldots, \mathbf{d}_{l-N+1,f}] \in \mathbb{C}^{K\times N}$ can in turn be estimated from $\mathbf{H}_{l,f}^N$ as

$$[\hat{\mathbf{d}}_{l,f}, \ldots, \hat{\mathbf{d}}_{l-N+1,f}] = \hat{\mathbf{B}}_f^H\mathbf{H}_{l,f}^N. \qquad (29)$$

Consider now the prediction of the $k$-th amplitude $d_{l+\delta,f}^k$. Generalizing (12), we adopt the linear predictor

$$\hat{d}_{l+\delta,f}^k = (\mathbf{v}_f^k)^H\mathrm{vec}([\hat{d}_{l,f}^k, \ldots, \hat{d}_{l-N+1,f}^k]), \qquad (30)$$

where $\mathbf{v}_f^k$ is an $N \times 1$ prediction vector, and

$$[\hat{d}_{l,f}^k, \ldots, \hat{d}_{l-N+1,f}^k] = (\hat{\mathbf{b}}_f^k)^H\mathbf{H}_{l,f}^N \in \mathbb{C}^{1\times N} \qquad (31)$$

is the $k$-th row of the matrix (29), which represents the past $N$ scalar fading amplitudes that correspond to the $k$-th feature $\mathbf{b}_f^k$. Plugging the prediction (30) into (27) yields the predicted channel $\hat{\mathbf{h}}_{l+\delta,f}$ (cf. (26))

$$\hat{\mathbf{h}}_{l+\delta,f} = \sum_{k=1}^{K}\hat{\mathbf{b}}_f^k\hat{d}_{l+\delta,f}^k. \qquad (32)$$

As detailed in Appendix A, by inserting (30) and (31) into (32), we can express the LSTD-based prediction (32) in the form (2) as

$$\hat{\mathbf{h}}_{l+\delta,f} = (\mathbf{V}_f^{(K)})^H\mathrm{vec}(\mathbf{H}_{l,f}^N), \qquad (33)$$

where the LSTD-based predictor matrix $\mathbf{V}_f^{(K)} \in \mathbb{C}^{SN\times S}$ is given as

$$\mathbf{V}_f^{(K)} = \sum_{k=1}^{K}\mathbf{v}_f^k \otimes \left(\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\right), \qquad (34)$$

where $\otimes$ is the Kronecker product. The overall LSTD-based channel prediction scheme is illustrated in Figure 3.
The LSTD prediction model (33) reduces the number of learnable parameters from $S^2N$ (for $\mathbf{V}_f$) to $(S+N)K$ (for $\mathbf{V}_f^{(K)}$). This complexity reduction comes at a minimal cost in terms of bias as long as the total number of features $K$ is accurately chosen (a detailed discussion can be found in Section 4.7) and the correlations across the amplitudes $d_{l,f}^k$ for different features $k = 1, \ldots, K$ are negligible.
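To make the parametrization concrete, the following sketch assembles $\mathbf{V}_f^{(K)}$ from the feature-wise pairs $\{(\mathbf{v}_f^k, \hat{\mathbf{b}}_f^k)\}$ and applies the prediction (33); the naming is ours.

```python
import numpy as np

def lstd_predictor(v_list, b_list):
    """Assemble V_f^(K) = sum_k v_f^k kron (b_f^k (b_f^k)^H), as in (34).

    v_list: K prediction vectors of shape (N,); b_list: K unit-norm feature
    vectors of shape (S,). Returns V of shape (S*N, S).
    """
    return sum(np.kron(v[:, None], np.outer(b, b.conj()))
               for v, b in zip(v_list, b_list))

def lstd_predict(V, H):
    """Prediction (33): V^H vec(H_{l,f}^N), with H of shape (S, N)."""
    return V.conj().T @ H.flatten(order='F')
```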

4.4. Conventional Learning for LSTD-Based Prediction

In conventional learning, the goal is to optimize the LSTD-based predictor $\mathbf{V}_f^{(K)}$ by optimizing the feature matrix $\hat{\mathbf{B}}_f$ and the feature-wise predictors $\{\mathbf{v}_f^k\}_{k=1}^K$ based on the available training dataset $\mathcal{Z}_f^{\mathrm{tr}}$. Substituting $\mathbf{V}_f$ with $\mathbf{V}_f^{(K)}$ defined in (34) into the naïve extension of conventional learning in (20) yields the problem

$$\mathbf{V}^{(K),*}(\mathcal{Z}_f^{\mathrm{tr}}|\bar{\mathbf{V}}^{(K)}) = \arg\min_{\substack{\hat{\mathbf{B}}_f, \mathbf{v}_f^1, \ldots, \mathbf{v}_f^K: \\ \mathbf{V}_f^{(K)} = \sum_{k=1}^K \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \|\mathbf{X}_f^{\mathrm{tr}}\mathbf{V}_f^{(K)} - \mathbf{Y}_f^{\mathrm{tr}}\|_F^2 + \lambda\|\mathbf{V}_f^{(K)} - \bar{\mathbf{V}}^{(K)}\|_F^2, \quad \text{subject to } \hat{\mathbf{B}}_f^H\hat{\mathbf{B}}_f = \mathbf{I}_K, \qquad (35)$$

over the optimization variables $(\hat{\mathbf{B}}_f, \{\mathbf{v}_f^k\}_{k=1}^K)$. In (35), the hyperparameters $(\lambda, \bar{\mathbf{V}}^{(K)})$ are given by the scalar $\lambda > 0$ and by the $SN \times S$ LSTD-based bias matrix $\bar{\mathbf{V}}^{(K)}$ defined as (cf. (34))

$$\bar{\mathbf{V}}^{(K)} = \sum_{k=1}^{K}\bar{\mathbf{v}}^k \otimes \left(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H\right). \qquad (36)$$

Because the Euclidean norm regularization $\|\mathbf{V}_f^{(K)} - \bar{\mathbf{V}}^{(K)}\|_F^2$ in (35) mixes long-term and short-term dependencies due to (34) and (36), we propose the following modification of problem (35):

$$\mathbf{V}^{(K),*}(\mathcal{Z}_f^{\mathrm{tr}}|\{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K) = \arg\min_{\substack{\hat{\mathbf{B}}_f, \mathbf{v}_f^1, \ldots, \mathbf{v}_f^K: \\ \mathbf{V}_f^{(K)} = \sum_{k=1}^K \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \left\{\|\mathbf{X}_f^{\mathrm{tr}}\mathbf{V}_f^{(K)} - \mathbf{Y}_f^{\mathrm{tr}}\|_F^2 + \lambda_2\sum_{k=1}^K\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2 - \lambda_1\sum_{k=1}^K\mathrm{tr}\left((\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k\right)\right\}, \quad \text{subject to } \hat{\mathbf{B}}_f^H\hat{\mathbf{B}}_f = \mathbf{I}_K, \qquad (37)$$

with hyperparameters $(\lambda_1, \lambda_2, \bar{\mathbf{b}}^1, \ldots, \bar{\mathbf{b}}^K, \bar{\mathbf{v}}^1, \ldots, \bar{\mathbf{v}}^K)$ given by the scalars $\lambda_1, \lambda_2 > 0$, by the $S \times 1$ long-term bias vectors $\bar{\mathbf{b}}^1, \ldots, \bar{\mathbf{b}}^K$, and by the $N \times 1$ short-term bias vectors $\bar{\mathbf{v}}^1, \ldots, \bar{\mathbf{v}}^K$. For each feature $k$, the considered regularization minimizes the Euclidean distance between the short-term prediction vector $\mathbf{v}_f^k$ and the short-term bias vector $\bar{\mathbf{v}}^k$ as in Section 3, while maximizing the alignment between the long-term feature vector $\hat{\mathbf{b}}_f^k$ and the long-term bias vector $\bar{\mathbf{b}}^k$ in a manner akin to the kernel alignment method of [56].
To address problem (37), inspired by [57,58], we propose a sequential approach, in which the pair $(\mathbf{v}_f^k, \hat{\mathbf{b}}_f^k)$ consisting of the $k$-th predictor $\mathbf{v}_f^k$ and the $k$-th feature vector $\hat{\mathbf{b}}_f^k$ is optimized in the order $k = 1, 2, \ldots, K$. Specifically, at each step $k$, we consider the problem

$$\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*} = \arg\min_{\substack{\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k: \\ (\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \left\{\|\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{tr}})^k\|_F^2 - \lambda_1\mathrm{tr}\left((\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k\right) + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2\right\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k = 1, \qquad (38)$$

where the $L^{\mathrm{tr}} \times S$ $k$-th residual target matrix $(\mathbf{Y}_f^{\mathrm{tr}})^k$ is defined as [57,58]

$$(\mathbf{Y}_f^{\mathrm{tr}})^k = \begin{cases} \mathbf{Y}_f^{\mathrm{tr}}, & \text{for } k = 1, \\ \mathbf{Y}_f^{\mathrm{tr}} - \sum_{k'=1}^{k-1}\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^{k',*}, & \text{for } k > 1, \end{cases} \qquad (39)$$

given the $k$-th predictor

$$(\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes \left(\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\right) \qquad (40)$$

and the $k$-th optimized predictor

$$(\mathbf{V}_f^{(K)})^{k,*} = \mathbf{v}_f^{k,*} \otimes \left(\hat{\mathbf{b}}_f^{k,*}(\hat{\mathbf{b}}_f^{k,*})^H\right). \qquad (41)$$
Because (38) is a nonconvex problem, we use alternating least squares (ALS) [59] to obtain a solution $\{\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*}\}$ by iterating between the following two steps: (i) for a fixed $\hat{\mathbf{b}}_f^k$, update $\mathbf{v}_f^k$ as

$$\mathbf{v}_f^k \leftarrow \arg\min_{\substack{\mathbf{v}_f^k: \\ (\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \|\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{tr}})^k\|_F^2 + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2; \qquad (42)$$

and (ii) for a fixed $\mathbf{v}_f^k$, update $\hat{\mathbf{b}}_f^k$ as

$$\hat{\mathbf{b}}_f^k \leftarrow \arg\min_{\substack{\hat{\mathbf{b}}_f^k: \\ (\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \left\{\|\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{tr}})^k\|_F^2 - \lambda_1\mathrm{tr}\left((\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k\right)\right\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k = 1, \qquad (43)$$

until convergence. Closed-form solutions for (42) and (43) can be found in Appendix B, and the overall LSTD-based conventional learning scheme is summarized in Algorithm 1.
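The following sketch implements the ALS loop for a single feature $k$ by using the closed forms of Appendix B; it operates on sample-wise inputs $\mathbf{H}_{i,f}^N$ and residual targets, and all names and the iteration budget are our assumptions.

```python
import numpy as np

def als_feature(H_list, y_list, v_bar, b_bar, lam1, lam2, iters=20):
    """ALS iterations (42)-(43) for feature k, via the closed forms (A3)-(A5).

    H_list: training inputs H_{i,f}^N, each of shape (S, N);
    y_list: residual targets (y_{i,f}^tr)^k, each of shape (S,);
    v_bar: (N,) short-term bias; b_bar: (S,) long-term bias (unit norm).
    """
    N = H_list[0].shape[1]
    b = b_bar / np.linalg.norm(b_bar)        # initialize at the bias (Appendix B)
    for _ in range(iters):
        # (42): with b fixed, the fit reduces to a scalar ridge regression
        # on the projected sequences (b^H H_i) with targets b^H y_i
        Z = np.array([(b.conj() @ H).conj() for H in H_list])  # (L_tr, N)
        t = np.array([(b.conj() @ y).conj() for y in y_list])  # (L_tr,)
        v = np.linalg.solve(Z.conj().T @ Z + lam2 * np.eye(N),
                            Z.conj().T @ t + lam2 * v_bar)
        # (43): with v fixed, b is the unit-norm eigenvector of a Hermitian
        # matrix associated with the smallest eigenvalue (cf. (A4))
        W = np.array([H @ v.conj() for H in H_list])           # rows w_i = H_i v*
        Y = np.array(y_list)
        C = W.T @ Y.conj()                                     # sum_i w_i y_i^H
        M = (W.T @ W.conj()) - C - C.conj().T \
            - lam1 * np.outer(b_bar, b_bar.conj())
        b = np.linalg.eigh(M)[1][:, 0]                         # smallest eigenvalue
    return b, v
```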
Algorithm 1: LSTD-based conventional learning for channel prediction for $S \geq 1$

4.5. Transfer Learning for LSTD-Based Prediction

Similar to conventional learning, transfer learning for LSTD-based prediction can be addressed from the naïve extension (21) by utilizing the LSTD parametrization $\mathbf{V}^{(K)}$ in (34) in lieu of the unconstrained predictor $\mathbf{V}$, obtaining the bias matrix $\bar{\mathbf{V}}^{(K),\mathrm{trans}}$ as

$$\bar{\mathbf{V}}^{(K),\mathrm{trans}} = \arg\min_{\substack{\hat{\mathbf{B}}, \mathbf{v}^1, \ldots, \mathbf{v}^K: \\ \mathbf{V}^{(K)} = \sum_{k=1}^K \mathbf{v}^k \otimes (\hat{\mathbf{b}}^k(\hat{\mathbf{b}}^k)^H)}} \sum_{f=1}^{F}\|\mathbf{X}_f\mathbf{V}^{(K)} - \mathbf{Y}_f\|_F^2, \quad \text{subject to } \hat{\mathbf{B}}^H\hat{\mathbf{B}} = \mathbf{I}_K, \qquad (44)$$

which can also be solved via the ALS-based sequential approach detailed in Section 4.4. This produces the sequences $\bar{\mathbf{b}}^{1,\mathrm{trans}}, \ldots, \bar{\mathbf{b}}^{K,\mathrm{trans}}$ and $\bar{\mathbf{v}}^{1,\mathrm{trans}}, \ldots, \bar{\mathbf{v}}^{K,\mathrm{trans}}$ as (cf. (38))

$$\bar{\mathbf{b}}^{k,\mathrm{trans}}, \bar{\mathbf{v}}^{k,\mathrm{trans}} = \arg\min_{\substack{\hat{\mathbf{b}}^k, \mathbf{v}^k: \\ (\mathbf{V}^{(K)})^k = \mathbf{v}^k \otimes (\hat{\mathbf{b}}^k(\hat{\mathbf{b}}^k)^H)}} \sum_{f=1}^{F}\|\mathbf{X}_f(\mathbf{V}^{(K)})^k - (\mathbf{Y}_f)^k\|_F^2, \quad \text{subject to } (\hat{\mathbf{b}}^k)^H\hat{\mathbf{b}}^k = 1, \qquad (45)$$

where the residual target matrix $(\mathbf{Y}_f)^k$ is defined as (cf. (39))

$$(\mathbf{Y}_f)^k = \begin{cases} \mathbf{Y}_f, & \text{for } k = 1, \\ \mathbf{Y}_f - \sum_{k'=1}^{k-1}\mathbf{X}_f(\mathbf{V}^{(K)})^{k',\mathrm{trans}}, & \text{for } k > 1, \end{cases} \qquad (46)$$

with the $k$-th optimized predictor

$$(\mathbf{V}^{(K)})^{k,\mathrm{trans}} = \bar{\mathbf{v}}^{k,\mathrm{trans}} \otimes \left(\bar{\mathbf{b}}^{k,\mathrm{trans}}(\bar{\mathbf{b}}^{k,\mathrm{trans}})^H\right). \qquad (47)$$
Details on transfer learning can be found in Appendix C, and the overall transfer learning scheme for LSTD prediction is summarized in Algorithm 2. After transfer learning, similar to Section 3.2, based on the optimized hyperparameters $\bar{\mathbf{b}}^{1,\mathrm{trans}}, \ldots, \bar{\mathbf{b}}^{K,\mathrm{trans}}$ and $\bar{\mathbf{v}}^{1,\mathrm{trans}}, \ldots, \bar{\mathbf{v}}^{K,\mathrm{trans}}$, the LSTD-based channel predictor for a new frame $f_{\mathrm{new}}$ can be obtained via (37) as

$$\mathbf{V}_{f_{\mathrm{new}}}^{(K),*} = \mathbf{V}^{(K),*}\left(\mathcal{Z}_{f_{\mathrm{new}}}^{\mathrm{tr}}\,\middle|\,\{\bar{\mathbf{b}}^{k,\mathrm{trans}}, \bar{\mathbf{v}}^{k,\mathrm{trans}}\}_{k=1}^K\right), \qquad (48)$$

which can also be solved in the sequential manner of (38).
Algorithm 2: LSTD-based transfer learning for channel prediction for $S \geq 1$

4.6. Meta-Learning for LSTD-Based Prediction

Plugging (37) into the naïve extension (22), we can formulate the meta-learning problem for LSTD-based prediction as

$$\min_{\{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K} \sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}\mathbf{V}^{(K),*}(\mathcal{Z}_f^{\mathrm{tr}}|\{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k\}_{k=1}^K) - \mathbf{Y}_f^{\mathrm{te}}\right\|_F^2. \qquad (49)$$

Similar to the sequential approach (38) described in Section 4.4, we propose a hierarchical sequential approach for meta-learning by using (38) in the order $k = 1, \ldots, K$, obtaining the problem

$$\bar{\mathbf{b}}^{k,\mathrm{meta}}, \bar{\mathbf{v}}^{k,\mathrm{meta}} = \arg\min_{\substack{\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k: \\ (\mathbf{V}_f^{(K)})^{k,*} = \mathbf{v}_f^{k,*} \otimes (\hat{\mathbf{b}}_f^{k,*}(\hat{\mathbf{b}}_f^{k,*})^H)}} \sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k,*} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2, \qquad (50)$$

with the residual target matrix $(\mathbf{Y}_f^{\mathrm{te}})^k$ defined as (cf. (39))

$$(\mathbf{Y}_f^{\mathrm{te}})^k = \begin{cases} \mathbf{Y}_f^{\mathrm{te}}, & \text{for } k = 1, \\ \mathbf{Y}_f^{\mathrm{te}} - \sum_{k'=1}^{k-1}\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k',*}, & \text{for } k > 1. \end{cases} \qquad (51)$$
The bilevel non-convex optimization problem (50) is addressed through gradient-based updates, with gradients computed via equilibrium propagation (EP) [60,61]. EP uses finite differences to approximate the gradient of the bilevel optimization (50), where the difference is computed between two gradients obtained at the two stationary points $(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*})$ and $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$ of the original problem (38) and of a modified version of (38) that adds the prediction loss on the test set $\mathcal{Z}_f^{\mathrm{te}}$, respectively. Specifically, EP leverages the asymptotic equalities [60]
$$\nabla_{\bar{\mathbf{b}}^k}\sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k,*} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2 = \lim_{\alpha\to 0}\frac{2\lambda_1}{\alpha}\sum_{f=1}^{F}\left(\hat{\mathbf{b}}_f^{k,*}(\hat{\mathbf{b}}_f^{k,*})^H - \hat{\mathbf{b}}_f^{k,\alpha}(\hat{\mathbf{b}}_f^{k,\alpha})^H\right)\bar{\mathbf{b}}^k \qquad (52)$$

and

$$\nabla_{\bar{\mathbf{v}}^k}\sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k,*} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2 = \lim_{\alpha\to 0}\frac{2\lambda_2}{\alpha}\sum_{f=1}^{F}\left(\mathbf{v}_f^{k,*} - \mathbf{v}_f^{k,\alpha}\right), \qquad (53)$$

with an additional real-valued hyperparameter $\alpha \in \mathbb{R}$, which is generally chosen to be a small non-zero value [60,61]. In (52) and (53), the vectors $\hat{\mathbf{b}}_f^{k,\alpha}$ and $\mathbf{v}_f^{k,\alpha}$ are defined as (cf. (38))

$$\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha} = \arg\min_{\substack{\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k: \\ (\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \left\{\|\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{tr}})^k\|_F^2 + \alpha\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{te}})^k\|_F^2 - \lambda_1\mathrm{tr}\left((\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k\right) + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2\right\}, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k = 1. \qquad (54)$$
Derivations for the gradients (52) and (53) can be found in Appendix D.
To reduce the computational complexity of the gradient-based updates, we adopt stochastic gradient descent with the Adam optimizer, as done in [61], in order to update $\bar{\mathbf{b}}^k$ and $\bar{\mathbf{v}}^k$ based on (52) and (53). The overall LSTD-based meta-learning scheme is detailed in Algorithm 3.
After meta-learning, as in Section 3.3, based on the optimized $\bar{\mathbf{b}}^{1,\mathrm{meta}}, \ldots, \bar{\mathbf{b}}^{K,\mathrm{meta}}$ and $\bar{\mathbf{v}}^{1,\mathrm{meta}}, \ldots, \bar{\mathbf{v}}^{K,\mathrm{meta}}$, the LSTD-based channel predictor for a new frame $f_{\mathrm{new}}$ can be obtained via (37) as

$$\mathbf{V}_{f_{\mathrm{new}}}^{(K),*} = \mathbf{V}^{(K),*}\left(\mathcal{Z}_{f_{\mathrm{new}}}^{\mathrm{tr}}\,\middle|\,\{\bar{\mathbf{b}}^{k,\mathrm{meta}}, \bar{\mathbf{v}}^{k,\mathrm{meta}}\}_{k=1}^K\right), \qquad (55)$$

which can be solved in a sequential way, as in (38).
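To make the finite-difference structure of (52) and (53) concrete, the following sketch computes the EP gradient estimates from the per-frame stationary points of the unperturbed problem (38) and of the $\alpha$-perturbed problem (54), both of which can be obtained, e.g., with an ALS routine as in Section 4.4; the names are ours.

```python
import numpy as np

def ep_meta_gradients(stationary, perturbed, b_bar, lam1, lam2, alpha):
    """Finite-difference EP estimates of the meta-gradients (52) and (53).

    stationary: per-frame pairs (b_star, v_star) solving (38);
    perturbed: per-frame pairs (b_alpha, v_alpha) solving (54);
    alpha: small non-zero perturbation strength.
    """
    g_b = (2 * lam1 / alpha) * sum(
        (np.outer(b_s, b_s.conj()) - np.outer(b_a, b_a.conj())) @ b_bar
        for (b_s, _), (b_a, _) in zip(stationary, perturbed))
    g_v = (2 * lam2 / alpha) * sum(
        v_s - v_a for (_, v_s), (_, v_a) in zip(stationary, perturbed))
    return g_b, g_v   # fed, e.g., to an Adam update of b_bar and v_bar
```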
The computational complexity order of the considered schemes is summarized in Table 1 and Table 2. At deployment time, as seen in Table 1, all schemes require the same computational complexity as conventional learning. In contrast, in the offline meta-learning or transfer learning phase, the computational overhead depends on the dimension of the channel vector $S = N_R N_T W$. LSTD-based schemes can reduce the computational overhead as compared to naïve solutions when the channel vector is large, i.e., $S \gg 1$, and the rank $K$ is sufficiently small. This is quantified in Table 2, where $I_{\mathrm{ALS}}$ is the number of iterations for ALS and $I_{\mathrm{EP}}$ is the number of iterations for EP.
Algorithm 3: LSTD-based meta-learning for channel prediction for $S \geq 1$

4.7. Rank-Estimation for LSTD-Based Prediction

The number of total features $K$ for LSTD-based prediction depends on the rank of the unknown space-time signature matrix $\mathbf{T}_f$, as discussed in Section 4.2. This rank can be estimated by using the channels available from previous frames if we assume that the number of total features does not change over multiple frames. This can be achieved via a standard method, Akaike's information theoretic criterion (AIC) (Equation (16) in [62]), which is applicable to all the proposed LSTD-based techniques. However, as AIC-based rank estimation generally tends to overestimate the rank [62,63], we propose a potentially more effective estimator for meta-learning, which utilizes a validation dataset.
To this end, we first split the available $F$ frames into $F^{\mathrm{tr}}$ meta-training frames $f = 1, \ldots, F^{\mathrm{tr}}$ and $F^{\mathrm{val}}$ meta-validation frames $f = F^{\mathrm{tr}}+1, \ldots, F$. Then, we compute the sum-loss (cf. (49))

$$\sum_{f=F^{\mathrm{tr}}+1}^{F^{\mathrm{tr}}+F^{\mathrm{val}}}\left\|\mathbf{X}_f^{\mathrm{te}}\mathbf{V}^{(k),*}\left(\mathcal{Z}_f^{\mathrm{tr}}\,\middle|\,\{\bar{\mathbf{b}}^{k',\mathrm{meta}}, \bar{\mathbf{v}}^{k',\mathrm{meta}}\}_{k'=1}^{k}\right) - \mathbf{Y}_f^{\mathrm{te}}\right\|_F^2, \qquad (56)$$

where the hyperparameters $\{\bar{\mathbf{b}}^{k',\mathrm{meta}}, \bar{\mathbf{v}}^{k',\mathrm{meta}}\}_{k'=1}^{k}$ are computed by using the $F^{\mathrm{tr}}$ meta-training frames, as explained in the previous section. The rank-estimation procedure sequentially evaluates the meta-validation loss (56) in order to minimize it over the selection of $k$, as sketched below. In this regard, it is worth noting that an increase in the total number of features $k$ always decreases the meta-training loss in (49), whereas this is not necessarily true for the meta-validation and meta-test losses.
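A schematic sketch of this selection loop is given below; meta_train and val_loss are placeholders standing in for the meta-learning procedure of Section 4.6 and for the loss (56), respectively.

```python
import numpy as np

def select_rank(K_max, meta_train, val_loss):
    """Pick the number of features k minimizing the meta-validation loss (56).

    meta_train(k): returns hyperparameters {b_bar, v_bar} for features 1..k,
                   meta-learned on the F_tr meta-training frames;
    val_loss(hyper): evaluates (56) on the F_val meta-validation frames.
    """
    losses = [val_loss(meta_train(k)) for k in range(1, K_max + 1)]
    return int(np.argmin(losses)) + 1
```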

5. Experiments

In this section, we present experimental results for the prediction of multi-antenna and/or frequency-selective channels. Numerical examples for single-antenna frequency-flat channels, for both offline and online learning scenarios, can be found in the conference version of this paper [45]. For all the experiments, we compute the normalized mean squared error (NMSE) $\|\hat{\mathbf{h}}_{l+\delta,f} - \mathbf{h}_{l+\delta,f}\|^2/\|\mathbf{h}_{l+\delta,f}\|^2$, averaged over 100 samples for 200 new frames. To avoid discrepancies between the evaluation measures used during the training and testing phases, we also adopt the NMSE as the training loss function by normalizing the training dataset for the new frame $f_{\mathrm{new}}$ as (cf. (3))

$$\mathcal{Z}_{f_{\mathrm{new}}}^{\mathrm{tr}} = \{(\mathbf{x}_{i,f_{\mathrm{new}}}, \mathbf{y}_{i,f_{\mathrm{new}}})\}_{i=1}^{L^{\mathrm{tr}}} \triangleq \left\{\left(\frac{\mathrm{vec}(\mathbf{H}_{l,f_{\mathrm{new}}}^N)}{\|\mathbf{h}_{l+\delta,f_{\mathrm{new}}}\|},\; \frac{\mathbf{h}_{l+\delta,f_{\mathrm{new}}}}{\|\mathbf{h}_{l+\delta,f_{\mathrm{new}}}\|}\right)\right\}_{l=N}^{L^{\mathrm{tr}}+N-1}, \qquad (57)$$

and similarly redefine the datasets from previous frames $f = 1, \ldots, F$ for transfer and meta-training as (cf. (9))

$$\mathcal{Z}_f = \{(\mathbf{x}_{i,f}, \mathbf{y}_{i,f})\}_{i=1}^{L} \triangleq \left\{\left(\frac{\mathrm{vec}(\mathbf{H}_{l,f}^N)}{\|\mathbf{h}_{l+\delta,f}\|},\; \frac{\mathbf{h}_{l+\delta,f}}{\|\mathbf{h}_{l+\delta,f}\|}\right)\right\}_{l=N}^{L+N-1}. \qquad (58)$$
As summarized in Table 3, we consider a window size $N = 5$ with lag size $\delta = 3$. All of the experimental results follow the 3GPP 5G standard SCM channel model [46], with variations of the long-term features over frames following Clause 7.6.3.2 (Procedure B) [46], under the UMi–Street Canyon environment, as discussed in Section 2.2. The normalized Doppler frequency $\rho = \gamma_{d,f}/\gamma_{\mathrm{SRS}} \in [0,1]$ within each frame $f$, defined as the ratio between the Doppler frequency $\gamma_{d,f}$ in (4) and the frequency $\gamma_{\mathrm{SRS}}$ of the pilot symbols, or sounding reference signals (SRS) [46], is randomly selected in one of the two following ways: (i) for slow-varying environments, it is uniformly drawn in the interval $[0.005, 0.05]$; and (ii) for fast-varying environments, it is uniformly distributed in the interval $[0.1, 1]$. In the following, we study the impact of (i) the number of antennas $N_R N_T$, (ii) the number of channel taps $W$, (iii) the number of training samples $L^{\mathrm{new}}$, and (iv) the number of previous frames $F$, for various prediction schemes: (a) conventional learning, (b) transfer learning, and (c) meta-learning, where each scheme is implemented by using either the naïve or the LSTD parametrization. We set $\lambda = 0$, $\lambda_1 = 0$, and $\lambda_2 = 0$ for conventional learning [9,45], whereas $\lambda = 1$, $\lambda_1 = 1$, and $\lambda_2 = 1$ for transfer and meta-learning.

5.1. Multi-Antenna Frequency-Flat Channels

We begin by considering multi-antenna frequency-flat channels and evaluating the NMSE as a function of the total number of antennas $N_R N_T$ under a fast-varying environment (Figure 4) or a slow-varying environment (Figure 5). We set $K = 1$ in the LSTD model. The specific antenna configurations are described in Appendix E. Both transfer and meta-learning are seen to provide significant advantages as compared to conventional learning, as long as one chooses the type of parametrization (naïve or LSTD) as a function of the type of variability in the channel, with meta-learning generally outperforming transfer learning. In particular, as seen in Figure 4, for fast-varying environments, meta-learning with the LSTD parametrization performs best, significantly reducing the NMSE with respect to both conventional and transfer learning. This is because meta-learning with LSTD can account for the need to adapt to fast-varying channel conditions, while also leveraging the reduced-rank structure of the channel. In contrast, as shown in Figure 5, for slow-varying channels, the naïve parametrization tends to be preferable because, as explained in Section 4.2, the long-term and short-term features of the channel become indistinguishable when the channel variability is too low. It is also interesting to observe that increasing the number of antennas is generally useful for prediction, as the predictor can build on a larger vector of correlated covariates. This is, however, not the case for conventional learning in slow-varying environments, for which the features tend to be too correlated, resulting in overfitting. As a final note, although absolute NMSE values close to 1 may be insufficient for applications such as precoding, they can still provide useful information for applications such as proactive resource allocation [40,64].

5.2. Rank Estimation

In the previous experiments, we considered channels with unit rank, for which one can assume without loss of optimality a number of features in the LSTD parametrization equal to $K = 1$. In order to implement predictors for multi-antenna frequency-selective channels, one instead needs to first address the problem of estimating the number of features. Here, we evaluate the performance of the approach proposed in Section 4.7 for rank estimation. To this end, we set the number of antennas to $N_R = 8$ and $N_T = 8$, and consider the 19-cluster channel model with delay spread ratio 2. Figure 6 shows the NMSE evaluated on the meta-training, meta-validation, and meta-test datasets as a function of the total number of features $K$. The meta-training set contains 20 frames, the meta-test set 200 frames, and the meta-validation set 20 frames. The meta-training loss is monotonically decreasing in $K$, because a richer parametrization enables a closer fit of the training data. In contrast, both the meta-test and meta-validation losses are minimized at an intermediate value of $K$. The main point of the figure is that the meta-validation loss, while computed from only 20 frames, provides useful information for choosing a value of $K$ that approximately minimizes the meta-test loss. In contrast, although $K = 3$ can be seen to be a proper estimate of the channel rank for the considered set-up, AIC-based rank estimation gives the highly overestimated value $K = 200$, which deteriorates the prediction performance, as can be seen in Figure 6. Throughout the following experiments, we follow the proposed procedure to select $K$ for meta-learning, whereas for all the other schemes, we adopt AIC-based rank estimation to determine $K$.

5.3. Single-Antenna Frequency-Selective Channels

Before considering multi-antenna frequency-selective channels, we first consider the impact of the level of frequency selectivity on the prediction of single-antenna frequency-selective channels. To this end, starting from 45 ns, we increase the delay spread by a multiplicative factor, referred to as the delay spread ratio, and correspondingly increase the number of taps by the same factor. The number of taps $W$ is chosen as the smallest number of taps that captures more than 90% of the average channel power, following the ITU-R report [65]. Figure 7 shows that the dependence on the delay spread of the channel is qualitatively similar to the dependence on the number of antennas in Figure 4 and Figure 5, with the top of Figure 7 representing the performance under a fast-varying environment and the bottom depicting the NMSE for a slow-varying environment. Accordingly, as discussed in the previous subsection, meta-learning outperforms both transfer and conventional learning, as long as the parametrization is correctly selected: naïve for slow-varying channels, and LSTD for fast-varying environments.

5.4. Multi-Antenna Frequency-Selective Channel Case

We now consider the prediction performance for multi-antenna frequency-selective channels as a function of the number of training samples $L^{\mathrm{new}}$ in Figure 8 and Figure 9, as well as versus the number of frames $F$ in Figure 10. For meta-learning, we set $L^{\mathrm{tr}} = L^{\mathrm{new}}$ in order to avoid discrepancies between meta-training and meta-testing [29]. Figure 8 and Figure 9 show that meta-learning and transfer learning, which utilize $F = 500$ previous frames, can significantly outperform conventional learning in terms of the number of required pilots $L^{\mathrm{new}}$. This key observation motivates the use of transfer and meta-learning in the presence of limited training data. Furthermore, confirming the analysis in Section 3.3 and Section 4.6, meta-learning can outperform all other schemes as long as one selects the naïve parametrization for slowly varying environments and the LSTD parametrization for fast-varying environments. For sufficiently large $L^{\mathrm{new}}$, transfer learning can, however, improve over meta-learning in fast-varying environments, as seen in Figure 8. This stems from the split of the data into training and testing sets applied by meta-learning, which can lead to a performance loss as $L^{\mathrm{new}}$ increases.
Lastly, we investigate the effect of the number of previous frames $F$ on transfer and meta-learning. As a general result, as demonstrated by Figure 10, an increase in the number $F$ of previous frames results in better performance for both transfer and meta-learning. Furthermore, in a slow-varying environment with a small value of $F$, transfer learning can outperform meta-learning due to the limited need for adaptation, whereas meta-learning with the correctly selected type of parametrization outperforms transfer learning otherwise.

6. Conclusions

In this paper, we have introduced data-driven channel prediction strategies for multi-antenna frequency-selective channels that aim at reducing the number of pilots by integrating transfer and meta-learning with a novel parametrization of linear predictors. The methods leverage the underlying structure of wireless channels, which can be expressed in terms of a long short-term decomposition (LSTD) into long-term space-time features and fading amplitudes. To enable transfer and meta-learning under an LSTD-based model, we have proposed an optimization strategy based on equilibrium propagation (EP) and alternating least squares (ALS). Numerical experiments have shown that the proposed LSTD-based transfer and meta-learning methods far outperform conventional prediction methods, especially in the few-pilots regime. For instance, under a standard 3GPP SCM channel model, assuming four transmit antennas and two receive antennas, and using only one pilot, meta-learning with LSTD can reduce the normalized prediction MSE by 3 dB as compared to standard learning techniques. Future work may consider the use of deep neural networks in lieu of linear prediction filters, although related results for multi-antenna frequency-flat channels have not reported any significant advantage to date [14,15,16,19].

Author Contributions

Conceptualization, software, formal analysis, writing, S.P.; conceptualization, supervision, writing, project administration, funding acquisition, O.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S.P. and O.S. was partly supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 725731).

Data Availability Statement

Code is available at https://github.com/kclip/channel-prediction-meta-learning (accessed on 22 September 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation for Long-Short-Term-Decomposition (LSTD)-Based Predictor V f ( K )

Because the $k$-th amplitudes $\hat{d}_{l-N+1,f}^k, \ldots, \hat{d}_{l+\delta,f}^k$ are all scalars, we can write $\mathrm{vec}([\hat{d}_{l,f}^k, \ldots, \hat{d}_{l-N+1,f}^k]) = [\hat{d}_{l,f}^k, \ldots, \hat{d}_{l-N+1,f}^k]^T$ and $\hat{d}_{l+\delta,f}^k = (\hat{d}_{l+\delta,f}^k)^T$. By using these equalities, we can plug (30) and (31) into (32) to obtain the expression of the predicted channel $\hat{\mathbf{h}}_{l+\delta,f}$ as

$$\hat{\mathbf{h}}_{l+\delta,f} = \sum_{k=1}^{K}\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\mathbf{H}_{l,f}^N(\mathbf{v}_f^k)^* = \underbrace{\sum_{k=1}^{K}\left((\mathbf{v}_f^k)^H \otimes \left(\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\right)\right)}_{(\mathbf{V}_f^{(K)})^H \text{ from (33)}}\mathrm{vec}(\mathbf{H}_{l,f}^N), \qquad (A1)$$

from which we can easily obtain the LSTD-based predictor matrix $\mathbf{V}_f^{(K)} = \sum_{k=1}^{K}\mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)$.

Appendix B. Details on Conventional Learning for LSTD-Based Prediction

Recalling from (3) that $\mathbf{X}_f^{\mathrm{tr}} = [\mathrm{vec}(\mathbf{H}_{1,f}^N), \ldots, \mathrm{vec}(\mathbf{H}_{L^{\mathrm{tr}},f}^N)]^H$, we can rewrite the first ALS step (42) in the form of a standard ridge regression problem as

$$\mathbf{v}_f^k \leftarrow \arg\min_{\substack{\mathbf{v}_f^k: \\ (\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)}} \|\mathbf{X}_f^{\mathrm{tr}}(\mathbf{V}_f^{(K)})^k - (\mathbf{Y}_f^{\mathrm{tr}})^k\|_F^2 + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2 = \arg\min_{\mathbf{v}_f^k}\sum_{i=1}^{L^{\mathrm{tr}}}\left\|\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\mathbf{H}_{i,f}^N(\mathbf{v}_f^k)^* - (\mathbf{y}_{i,f}^{\mathrm{tr}})^k\right\|^2 + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2, \qquad (A2)$$

with $(\mathbf{y}_{i,f}^{\mathrm{tr}})^k$ being the Hermitian transpose of the $i$-th row of the $k$-th residual target matrix $(\mathbf{Y}_f^{\mathrm{tr}})^k$ defined in (39). This problem can be solved in closed form, similarly to (14), as

$$(\mathbf{v}_f^k)^* = \left(\sum_{i=1}^{L^{\mathrm{tr}}}(\mathbf{H}_{i,f}^N)^H\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\mathbf{H}_{i,f}^N + \lambda_2\mathbf{I}\right)^{-1}\left(\sum_{i=1}^{L^{\mathrm{tr}}}(\mathbf{H}_{i,f}^N)^H\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H(\mathbf{y}_{i,f}^{\mathrm{tr}})^k + \lambda_2(\bar{\mathbf{v}}^k)^*\right). \qquad (A3)$$

The initialization of the parameter $\hat{\mathbf{b}}_f^k$ is set to the available hyperparameter $\bar{\mathbf{b}}^k$.
Similarly, the other ALS step (43) can be rewritten as

$$\hat{\mathbf{b}}_f^k \leftarrow \arg\min_{\hat{\mathbf{b}}_f^k}\sum_{i=1}^{L^{\mathrm{tr}}}\left\|\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H\mathbf{H}_{i,f}^N(\mathbf{v}_f^k)^* - (\mathbf{y}_{i,f}^{\mathrm{tr}})^k\right\|^2 - \lambda_1\mathrm{tr}\left((\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k\right), \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k = 1 \\ = \arg\min_{\hat{\mathbf{b}}_f^k}\,(\hat{\mathbf{b}}_f^k)^H\left((\check{\mathbf{X}}_{k,f})^H\check{\mathbf{X}}_{k,f} - (\check{\mathbf{Y}}_{k,f})^H\check{\mathbf{X}}_{k,f} - (\check{\mathbf{X}}_{k,f})^H\check{\mathbf{Y}}_{k,f}\right)\hat{\mathbf{b}}_f^k, \quad \text{subject to } (\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k = 1, \qquad (A4)$$

whose solution is given by the eigenvector of $\left((\check{\mathbf{X}}_{k,f})^H\check{\mathbf{X}}_{k,f} - (\check{\mathbf{Y}}_{k,f})^H\check{\mathbf{X}}_{k,f} - (\check{\mathbf{X}}_{k,f})^H\check{\mathbf{Y}}_{k,f}\right)$ corresponding to the smallest eigenvalue, with the matrices $\check{\mathbf{X}}_{k,f}$ and $\check{\mathbf{Y}}_{k,f}$ defined as

$$\check{\mathbf{X}}_{k,f} = \begin{bmatrix}(\check{\mathbf{X}}_f^{\mathrm{tr}})^k \\ \sqrt{\lambda_1}(\bar{\mathbf{b}}^k)^H\end{bmatrix}, \quad \check{\mathbf{Y}}_{k,f} = \begin{bmatrix}(\mathbf{Y}_f^{\mathrm{tr}})^k \\ \sqrt{\lambda_1}(\bar{\mathbf{b}}^k)^H\end{bmatrix}, \qquad (A5)$$

where we denote $(\check{\mathbf{X}}_f^{\mathrm{tr}})^k = [\check{\mathbf{x}}_{1,f}^k, \ldots, \check{\mathbf{x}}_{L^{\mathrm{tr}},f}^k]^H$ with $\check{\mathbf{x}}_{i,f}^k = \mathbf{H}_{i,f}^N(\mathbf{v}_f^k)^*$. Note that we arbitrarily started ALS with the update of the vector $\mathbf{v}_f^k$, as this ordering did not show any meaningful impact on the final results, as also reported in [66].

Appendix C. Details on Transfer Learning for LSTD-Based Prediction

The solution of (45) can be directly obtained with the tools in Appendix B by setting $\lambda_1 = \lambda_2 = 0$ and by substituting $\mathbf{X}_f^{\mathrm{tr}}$ and $\mathbf{Y}_f^{\mathrm{tr}}$ with the vertically stacked matrices $[\mathbf{X}_1^T, \ldots, \mathbf{X}_F^T]^T$ and $[\mathbf{Y}_1^T, \ldots, \mathbf{Y}_F^T]^T$, respectively.

Appendix D. Details on Meta-Learning for LSTD-Based Prediction

Before deriving (52) and (53), for ease of presentation, let us define the inner loss function $\mathcal{L}_f^{\mathrm{inner}}$, the outer loss function $\mathcal{L}_f^{\mathrm{outer}}$, and the total loss function $\mathcal{L}_f^{\mathrm{total}}$ as

$$\mathcal{L}_f^{\mathrm{inner}}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k) = \left\|\mathbf{X}_f^{\mathrm{tr}}\frac{(\mathbf{V}_f^{(K)})^k}{(\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k} - (\mathbf{Y}_f^{\mathrm{tr}})^k\right\|_F^2 - \lambda_1\frac{(\hat{\mathbf{b}}_f^k)^H(\bar{\mathbf{b}}^k(\bar{\mathbf{b}}^k)^H)\hat{\mathbf{b}}_f^k}{(\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k} + \lambda_2\|\mathbf{v}_f^k - \bar{\mathbf{v}}^k\|^2, \qquad (A6)$$

$$\mathcal{L}_f^{\mathrm{outer}}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k) = \left\|\mathbf{X}_f^{\mathrm{te}}\frac{(\mathbf{V}_f^{(K)})^k}{(\hat{\mathbf{b}}_f^k)^H\hat{\mathbf{b}}_f^k} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2, \qquad (A7)$$

and

$$\mathcal{L}_f^{\mathrm{total}}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) = \mathcal{L}_f^{\mathrm{inner}}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k) + \alpha\mathcal{L}_f^{\mathrm{outer}}(\hat{\mathbf{b}}_f^k, \mathbf{v}_f^k), \qquad (A8)$$

respectively. Because (A6) is invariant to the scaling of $\hat{\mathbf{b}}_f^k$ (recall that $(\mathbf{V}_f^{(K)})^k = \mathbf{v}_f^k \otimes (\hat{\mathbf{b}}_f^k(\hat{\mathbf{b}}_f^k)^H)$), it can be considered as an unconstrained version of (38), i.e., $(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*})$ in (38) minimizes (A6). Analogously, (A8) can be considered as an unconstrained expression of (54), as $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$ in (54) minimizes (A8).
Assuming that the conditions of the implicit function theorem [67] are satisfied at $(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha})$, from the finite-difference method of [60], we can write the gradients for meta-learning as

$$\nabla_{\bar{\mathbf{b}}^k}\sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k,*} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2 = \lim_{\alpha\to 0}\sum_{f=1}^{F}\frac{1}{\alpha}\left(\frac{\partial\mathcal{L}_f^{\mathrm{total}}}{\partial\bar{\mathbf{b}}^k}(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha}|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) - \frac{\partial\mathcal{L}_f^{\mathrm{total}}}{\partial\bar{\mathbf{b}}^k}(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*}|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, 0)\right) = \lim_{\alpha\to 0}\frac{2\lambda_1}{\alpha}\sum_{f=1}^{F}\left(\hat{\mathbf{b}}_f^{k,*}(\hat{\mathbf{b}}_f^{k,*})^H - \hat{\mathbf{b}}_f^{k,\alpha}(\hat{\mathbf{b}}_f^{k,\alpha})^H\right)\bar{\mathbf{b}}^k \qquad (A9)$$

and

$$\nabla_{\bar{\mathbf{v}}^k}\sum_{f=1}^{F}\left\|\mathbf{X}_f^{\mathrm{te}}(\mathbf{V}_f^{(K)})^{k,*} - (\mathbf{Y}_f^{\mathrm{te}})^k\right\|_F^2 = \lim_{\alpha\to 0}\sum_{f=1}^{F}\frac{1}{\alpha}\left(\frac{\partial\mathcal{L}_f^{\mathrm{total}}}{\partial\bar{\mathbf{v}}^k}(\hat{\mathbf{b}}_f^{k,\alpha}, \mathbf{v}_f^{k,\alpha}|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, \alpha) - \frac{\partial\mathcal{L}_f^{\mathrm{total}}}{\partial\bar{\mathbf{v}}^k}(\hat{\mathbf{b}}_f^{k,*}, \mathbf{v}_f^{k,*}|\bar{\mathbf{b}}^k, \bar{\mathbf{v}}^k, 0)\right) = \lim_{\alpha\to 0}\frac{2\lambda_2}{\alpha}\sum_{f=1}^{F}\left(\mathbf{v}_f^{k,*} - \mathbf{v}_f^{k,\alpha}\right), \qquad (A10)$$

which concludes the derivations of (52) and (53). A generalized proof, along with useful properties of EP, can be found in [60]. For initialization, the hyperparameter $\bar{\mathbf{b}}^k$ is set to the one-hot vector with a one at position $k$, whereas the vectors $\bar{\mathbf{v}}^k$ are chosen as all-zero vectors [52].

Appendix E. Details on the Antenna Configuration in Section 5.1

The following table contains the specifications of the antenna configurations used in Section 5.1. We denote by $(N_R^{\mathrm{hor}}, N_R^{\mathrm{ver}}, N_R^{\mathrm{pol}}, N_T^{\mathrm{hor}}, N_T^{\mathrm{ver}}, N_T^{\mathrm{pol}})$ the tuple consisting of the number of horizontal receive antennas $N_R^{\mathrm{hor}}$, the number of vertical receive antennas $N_R^{\mathrm{ver}}$, the number of polarizations of the receive antennas $N_R^{\mathrm{pol}}$, the number of horizontal transmit antennas $N_T^{\mathrm{hor}}$, the number of vertical transmit antennas $N_T^{\mathrm{ver}}$, and the number of polarizations of the transmit antennas $N_T^{\mathrm{pol}}$. Note that $N_R N_T = N_R^{\mathrm{hor}} N_R^{\mathrm{ver}} N_R^{\mathrm{pol}} N_T^{\mathrm{hor}} N_T^{\mathrm{ver}} N_T^{\mathrm{pol}}$.
Table A1. Antenna configurations for Section 5.1.

Number of Total Antennas ($N_R N_T$) | Antenna Configuration $(N_R^{\mathrm{hor}}, N_R^{\mathrm{ver}}, N_R^{\mathrm{pol}}, N_T^{\mathrm{hor}}, N_T^{\mathrm{ver}}, N_T^{\mathrm{pol}})$
1 | (1, 1, 1, 1, 1, 1)
2 | (1, 1, 1, 2, 1, 1)
4 | (1, 1, 1, 2, 2, 1)
8 | (2, 1, 1, 2, 2, 1)
16 | (2, 1, 1, 2, 2, 2)
32 | (2, 1, 1, 4, 2, 2)
64 | (2, 1, 1, 4, 4, 2)
128 | (2, 2, 1, 4, 4, 2)
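As a quick consistency check, the snippet below verifies that each configuration in Table A1 factorizes the total antenna count, i.e., that $N_R N_T$ equals the product of the six tuple entries; the dictionary is a direct transcription of the table.

```python
from math import prod

# (N_R_hor, N_R_ver, N_R_pol, N_T_hor, N_T_ver, N_T_pol) per total count
configs = {
    1:   (1, 1, 1, 1, 1, 1),
    2:   (1, 1, 1, 2, 1, 1),
    4:   (1, 1, 1, 2, 2, 1),
    8:   (2, 1, 1, 2, 2, 1),
    16:  (2, 1, 1, 2, 2, 2),
    32:  (2, 1, 1, 4, 2, 2),
    64:  (2, 1, 1, 4, 4, 2),
    128: (2, 2, 1, 4, 4, 2),
}
for total, cfg in configs.items():
    assert prod(cfg) == total, (total, cfg)  # N_R * N_T factorization
```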

References

  1. Tang, F.; Kawamoto, Y.; Kato, N.; Liu, J. Future intelligent and secure vehicular network toward 6G: Machine-learning approaches. Proc. IEEE 2019, 108, 292–307.
  2. Hoeher, P.; Kaiser, S.; Robertson, P. Two-dimensional pilot-symbol-aided channel estimation by Wiener filtering. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany, 21–24 April 1997.
  3. Baddour, K.E.; Beaulieu, N.C. Autoregressive modeling for fading channel simulation. IEEE Trans. Wirel. Commun. 2005, 4, 1650–1662.
  4. Duel-Hallen, A.; Hu, S.; Hallen, H. Long-range prediction of fading signals. IEEE Signal Process. Mag. 2000, 17, 62–75.
  5. Liu, L.; Feng, H.; Yang, T.; Hu, B. MIMO-OFDM wireless channel prediction by exploiting spatial-temporal correlation. IEEE Trans. Wirel. Commun. 2013, 13, 310–319.
  6. Min, C.; Chang, N.; Cha, J.; Kang, J. MIMO-OFDM downlink channel prediction for IEEE 802.16e systems using Kalman filter. In Proceedings of the 2007 IEEE Wireless Communications and Networking Conference (WCNC), Hong Kong, China, 11–15 March 2007; pp. 942–946.
  7. Komninakis, C.; Fragouli, C.; Sayed, A.H.; Wesel, R.D. Multi-input multi-output fading channel tracking and equalization using Kalman estimation. IEEE Trans. Signal Process. 2002, 50, 1065–1076.
  8. Kashyap, S.; Mollén, C.; Björnson, E.; Larsson, E.G. Performance analysis of (TDD) massive MIMO with Kalman channel prediction. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 3554–3558.
  9. Simmons, N.; Gomes, S.B.F.; Yacoub, M.D.; Simeone, O.; Cotton, S.L.; Simmons, D.E. AI-based channel prediction in D2D links: An empirical validation. IEEE Access 2022, 10, 65459–65472.
  10. Simon, D.; Shmaliy, Y.S. Unified forms for Kalman and finite impulse response filtering and smoothing. Automatica 2013, 49, 1892–1899.
  11. Shmaliy, Y.S. Linear optimal FIR estimation of discrete time-invariant state-space models. IEEE Trans. Signal Process. 2010, 58, 3086–3096.
  12. Pratik, K.; Amjad, R.A.; Behboodi, A.; Soriaga, J.B.; Welling, M. Neural augmentation of Kalman filter with hypernetwork for channel tracking. arXiv 2021, arXiv:2109.12561.
  13. Liu, W.; Yang, L.L.; Hanzo, L. Recurrent neural network based narrowband channel prediction. In Proceedings of the 2006 IEEE 63rd Vehicular Technology Conference, Melbourne, VIC, Australia, 7–10 May 2006; Volume 5, pp. 2173–2177.
  14. Jiang, W.; Schotten, H.D. A comparison of wireless channel predictors: Artificial intelligence versus Kalman filter. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6.
  15. Jiang, W.; Strufe, M.; Schotten, H.D. Long-range MIMO channel prediction using recurrent neural networks. In Proceedings of the 2020 IEEE 17th Annual Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 10–13 January 2020.
  16. Kibugi, J.; Ribeiro, L.N.; Haardt, M. Machine learning prediction of time-varying Rayleigh channels. arXiv 2021, arXiv:2103.06131.
  17. Yuan, J.; Ngo, H.Q.; Matthaiou, M. Machine learning-based channel prediction in massive MIMO with channel aging. IEEE Trans. Wirel. Commun. 2020, 19, 2960–2973.
  18. Zhang, Y.; Alkhateeb, A.; Madadi, P.; Jeon, J.; Cho, J.; Zhang, C. Predicting future CSI feedback for highly-mobile massive MIMO systems. arXiv 2022, arXiv:2202.02492.
  19. Kim, H.; Kim, S.; Lee, H.; Jang, C.; Choi, Y.; Choi, J. Massive MIMO channel prediction: Kalman filtering vs. machine learning. IEEE Trans. Commun. 2020, 69, 518–528.
  20. Bogale, T.E.; Wang, X.; Le, L.B. Adaptive channel prediction, beamforming and scheduling design for 5G V2I network: Analytical and machine learning approaches. IEEE Trans. Veh. Technol. 2020, 69, 5055–5067.
  21. Torrey, L.; Shavlik, J. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010.
  22. Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media: Berlin, Germany, 2012.
  23. Raghu, A.; Raghu, M.; Bengio, S.; Vinyals, O. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv 2019, arXiv:1909.09157.
  24. Jose, S.T.; Park, S.; Simeone, O. Information-theoretic analysis of epistemic uncertainty in Bayesian meta-learning. In Proceedings of the AISTATS, Virtual Event, 28–30 March 2022.
  25. Yuan, Y.; Zheng, G.; Wong, K.K.; Ottersten, B.; Luo, Z.Q. Transfer learning and meta learning-based fast downlink beamforming adaptation. IEEE Trans. Wirel. Commun. 2020, 20, 1742–1755.
  26. Ge, Y.; Fan, J. Beamforming optimization for intelligent reflecting surface assisted MISO: A deep transfer learning approach. IEEE Trans. Veh. Technol. 2021, 70, 3902–3907.
  27. Yang, Y.; Gao, F.; Zhong, Z.; Ai, B.; Alkhateeb, A. Deep transfer learning-based downlink channel prediction for FDD massive MIMO systems. IEEE Trans. Commun. 2020, 68, 7485–7497.
  28. Parera, C.; Redondi, A.E.; Cesana, M.; Liao, Q.; Malanchini, I. Transfer learning for channel quality prediction. In Proceedings of the 2019 IEEE International Symposium on Measurements & Networking (M&N), Catania, Italy, 8–10 July 2019; pp. 1–6.
  29. Park, S.; Jang, H.; Simeone, O.; Kang, J. Learning to demodulate from few pilots via offline and online meta-learning. IEEE Trans. Signal Process. 2020, 69, 226–239.
  30. Cohen, K.M.; Park, S.; Simeone, O.; Shamai, S. Learning to learn to demodulate with uncertainty quantification via Bayesian meta-learning. arXiv 2021, arXiv:2108.00785.
  31. Mao, H.; Lu, H.; Lu, Y.; Zhu, D. RoemNet: Robust meta learning based channel estimation in OFDM systems. In Proceedings of the ICC 2019—2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019.
  32. Raviv, T.; Park, S.; Simeone, O.; Eldar, Y.C.; Shlezinger, N. Online meta-learning for hybrid model-based deep receivers. arXiv 2022, arXiv:2203.14359.
  33. Jiang, Y.; Kim, H.; Asnani, H.; Kannan, S. MIND: Model independent neural decoder. In Proceedings of the 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, 2–5 July 2019.
  34. Park, S.; Simeone, O.; Kang, J. Meta-learning to communicate: Fast end-to-end training for fading channels. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
  35. Park, S.; Simeone, O.; Kang, J. End-to-end fast training of communication links without a channel model via online meta-learning. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020.
  36. Goutay, M.; Aoudia, F.A.; Hoydis, J. Deep hypernetwork-based MIMO detection. In Proceedings of the 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Atlanta, GA, USA, 26–29 May 2020.
  37. Zhang, J.; Yuan, Y.; Zheng, G.; Krikidis, I.; Wong, K.K. Embedding model based fast meta learning for downlink beamforming adaptation. IEEE Trans. Wirel. Commun. 2021, 21, 149–162.
  38. Karasik, R.; Simeone, O.; Jang, H.; Shamai, S. Learning to broadcast for ultra-reliable communication with differential quality of service via the conditional value at risk. arXiv 2021, arXiv:2112.02007.
  39. Hu, Y.; Chen, M.; Saad, W.; Poor, H.V.; Cui, S. Distributed multi-agent meta learning for trajectory design in wireless drone networks. IEEE J. Sel. Areas Commun. 2021, 39, 3177–3192.
  40. Nikoloska, I.; Simeone, O. Black-box and modular meta-learning for power control via random edge graph neural networks. arXiv 2021, arXiv:2108.13178.
  41. Simeone, O.; Spagnolini, U. Lower bound on training-based channel estimation error for frequency-selective block-fading Rayleigh MIMO channels. IEEE Trans. Signal Process. 2004, 52, 3265–3277.
  42. Cicerone, M.; Simeone, O.; Spagnolini, U. Channel estimation for MIMO-OFDM systems by modal analysis/filtering. IEEE Trans. Commun. 2006, 54, 2062–2074.
  43. Pedersen, K.I.; Andersen, J.B.; Kermoal, J.P.; Mogensen, P. A stochastic multiple-input-multiple-output radio channel model for evaluation of space-time coding algorithms. In Proceedings of the IEEE 52nd Vehicular Technology Conference (VTC2000-Fall), Boston, MA, USA, 24–28 September 2000; Volume 2, pp. 893–897.
  44. Abdi, A.; Kaveh, M. A space-time correlation model for multielement antenna systems in mobile fading channels. IEEE J. Sel. Areas Commun. 2002, 20, 550–560.
  45. Park, S.; Simeone, O. Predicting flat-fading channels via meta-learned closed-form linear filters and equilibrium propagation. arXiv 2021, arXiv:2110.00414.
  46. 3GPP. Study on Channel Model for Frequencies from 0.5 to 100 GHz (3GPP TR 38.901, Version 16.1.0, Release 16). 2020. Available online: https://www.etsi.org/deliver/etsi_tr/138900_138999/138901/16.01.00_60/tr_138901v160100p.pdf (accessed on 16 September 2022).
  47. Gallager, R.G. Principles of Digital Communication; Cambridge University Press: Cambridge, UK, 2008.
  48. Simeone, O. A brief introduction to machine learning for engineers. Found. Trends® Signal Process. 2018, 12, 200–431.
  49. Simeone, O. Machine Learning for Engineers; Cambridge University Press: Cambridge, UK, 2022.
  50. Prasad, R.; Murthy, C.R.; Rao, B.D. Joint approximately sparse channel estimation and data detection in OFDM systems using sparse Bayesian learning. IEEE Trans. Signal Process. 2014, 62, 3591–3603.
  51. Huang, C.; Liu, L.; Yuen, C.; Sun, S. Iterative channel estimation using LSE and sparse message passing for mmWave MIMO systems. IEEE Trans. Signal Process. 2018, 67, 245–259.
  52. Denevi, G.; Ciliberto, C.; Stamos, D.; Pontil, M. Learning to learn around a common mean. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018.
  53. Yin, M.; Tucker, G.; Zhou, M.; Levine, S.; Finn, C. Meta-learning without memorization. arXiv 2019, arXiv:1912.03820.
  54. Swindlehurst, A.L. Time delay and spatial signature estimation using known asynchronous signals. IEEE Trans. Signal Process. 1998, 46, 449–462.
  55. Nicoli, M.; Simeone, O.; Spagnolini, U. Multislot estimation of fast-varying space-time communication channels. IEEE Trans. Signal Process. 2003, 51, 1184–1195.
  56. Cortes, C.; Mohri, M.; Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res. 2012, 13, 795–828.
  57. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
  58. Wong, A.S.Y.; Wong, K.W.; Leung, C.S. A unified sequential method for PCA. In Proceedings of the 6th IEEE International Conference on Electronics, Circuits and Systems (ICECS'99), Paphos, Cyprus, 5–8 September 1999; Volume 1, pp. 583–586.
  59. Sidiropoulos, N.D.; De Lathauwer, L.; Fu, X.; Huang, K.; Papalexakis, E.E.; Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process. 2017, 65, 3551–3582.
  60. Scellier, B.; Bengio, Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Front. Comput. Neurosci. 2017, 11, 24.
  61. Zucchet, N.; Schug, S.; von Oswald, J.; Zhao, D.; Sacramento, J. A contrastive rule for meta-learning. arXiv 2021, arXiv:2104.01677.
  62. Wax, M.; Kailath, T. Detection of signals by information theoretic criteria. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 387–392.
  63. Liavas, A.P.; Regalia, P.A.; Delmas, J.P. Blind channel approximation: Effective channel order determination. IEEE Trans. Signal Process. 1999, 47, 3336–3344.
  64. Agrawal, A.; Andrews, J.G.; Cioffi, J.M.; Meng, T. Iterative power control for imperfect successive interference cancellation. IEEE Trans. Wirel. Commun. 2005, 4, 878–884.
  65. Recommendation ITU-R P.1407. Multipath Propagation and Parameterization of Its Characteristics. Available online: https://www.itu.int/dms_pubrec/itu-r/rec/p/R-REC-P.1407-0-199907-S!!PDF-E.pdf (accessed on 22 September 2022).
  66. Xiao, C.; Yang, C.; Li, M. Efficient alternating least squares algorithms for low multilinear rank approximation of tensors. J. Sci. Comput. 2021, 87, 1–25.
  67. Lorraine, J.; Vicol, P.; Duvenaud, D. Optimizing millions of hyperparameters by implicit differentiation. In Proceedings of the AISTATS, Virtual Event, 26–28 August 2020.
Figure 1. Illustration of the frame-based transmission system under study. At any frame f, based on the window of the previous N channels $h^{N}_{l,f}$, we investigate the problem of optimizing the $\delta$-lag prediction $\hat{h}_{l+\delta,f}$.
Figure 2. Illustration of the considered transfer and meta-learning methods. With access to pilots from previously received frames, transfer learning and meta-learning aim at obtaining the hyperparameters $\bar{V}$ to be used for channel prediction in a new frame.
Figure 3. Illustration of the considered LSTD-based prediction model: (i) estimation of the amplitudes $d_{l,f}$ via the estimated long-term feature matrix $\hat{B}_f$; (ii) feature-wise short-term prediction of the fading amplitude $\hat{d}^{k}_{l+\delta,f}$ based on the feature-wise predictor $v^{k}_f$ for $k = 1, \ldots, K$; (iii) reconstruction of the predicted channel $\hat{h}_{l+\delta,f}$ from the feature matrix $\hat{B}_f$ and the predicted fading amplitudes $\hat{d}_{l+\delta,f}$.
Figure 4. Multi-antenna frequency-flat channel prediction performance as a function of the total number of antennas, $N_R N_T$, under a single-clustered, single-tap (W = 1) 3GPP SCM channel model for a fast-varying environment with number of training samples $L^{\mathrm{new}} = 1$ (K = 1).
Figure 5. Multi-antenna frequency-flat channel prediction performance as a function of the total number of antennas, $N_R N_T$, under a single-clustered, single-tap (W = 1) 3GPP SCM channel model for a slow-varying environment with number of training samples $L^{\mathrm{new}} = 1$ (K = 1).
Figure 6. Multi-antenna frequency-selective channel prediction performance as a function of the number of features K, under a 19-clustered, multi-tap (W = 4), multi-antenna ($N_T = 8$, $N_R = 8$) 3GPP SCM channel model for a fast-varying environment with number of training samples $L^{\mathrm{new}} = 1$. Results are evaluated with $F^{\mathrm{tr}} = 20$ previous frames for meta-training, $F^{\mathrm{val}} = 20$ for meta-validation, and $F^{\mathrm{te}} = 200$ for meta-testing.
Figure 7. Single-antenna frequency-selective channel prediction performance as a function of the delay spread ratio, under a 19-clustered, multi-tap, single-antenna ($N_T = 1$, $N_R = 1$) 3GPP SCM channel model for a fast-varying environment (top) and a slow-varying environment (bottom) with number of training samples $L^{\mathrm{new}} = 1$ (K = 1).
Figure 8. Multi-antenna frequency-selective channel prediction performance as a function of the number of training samples $L^{\mathrm{new}}$, under a 19-clustered, two-tap (W = 2), multi-antenna ($N_T = 4$, $N_R = 2$) 3GPP SCM channel model for a fast-varying environment, with total number of features K = 2 unless determined as described in Section 5.2.
Figure 9. Multi-antenna frequency-selective channel prediction performance as a function of the number of training samples $L^{\mathrm{new}}$, under a 19-clustered, two-tap (W = 2), multi-antenna ($N_T = 4$, $N_R = 2$) 3GPP SCM channel model for a slow-varying environment.
Figure 10. Multi-antenna frequency-selective channel prediction performance as a function of the number of available previous frames F, under a 19-clustered, two-tap (W = 2), multi-antenna ($N_T = 4$, $N_R = 2$) 3GPP SCM channel model for a fast-varying environment (top) and a slow-varying environment (bottom) with number of training samples $L^{\mathrm{new}} = 1$.
Table 1. Computational complexity analysis at deployment (meta-testing).

Learning Type | $O(\cdot)$ for Naïve Approach | $O(\cdot)$ for LSTD-Based Approach
Conventional learning | $O(S^3 N^3 + S^2 N^2 L^{\mathrm{tr}})$ | $O(K I_{\mathrm{ALS}} (L^{\mathrm{tr}} (S N^2 + S^2) + N^3 + S^3))$
Transfer learning | $O(S^3 N^3 + S^2 N^2 L^{\mathrm{tr}})$ | $O(K I_{\mathrm{ALS}} (L^{\mathrm{tr}} (S N^2 + S^2) + N^3 + S^3))$
Meta-learning | $O(S^3 N^3 + S^2 N^2 L^{\mathrm{tr}})$ | $O(K I_{\mathrm{ALS}} (L^{\mathrm{tr}} (S N^2 + S^2) + N^3 + S^3))$
Table 2. Computational complexity analysis during meta-training.

Learning Type | $O(\cdot)$ for Naïve Approach | $O(\cdot)$ for LSTD-Based Approach
Conventional learning | N/A | N/A
Transfer learning | $O(S^3 N^3 + S^2 N^2 F L)$ | $O(K I_{\mathrm{ALS}} (F L (S N^2 + S^2) + N^3 + S^3))$
Meta-learning | $O(F (S^3 N^3 + S^2 N^2 L + S^2 N L^{\mathrm{te}} (L^{\mathrm{tr}} + S N)))$ | $O(K F I_{\mathrm{EP}} I_{\mathrm{ALS}} (L (S N^2 + S^2) + N^3 + S^3))$
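To make the scaling in Tables 1 and 2 concrete, one can evaluate the deployment-time expressions of Table 1 numerically. The sketch below drops the big-O constants and uses arbitrary illustrative values for $S$, $N$, $L^{\mathrm{tr}}$, $K$, and $I_{\mathrm{ALS}}$; it is a back-of-the-envelope comparison, not a profiling result.

```python
def naive_deploy_cost(S, N, L_tr):
    # O(S^3 N^3 + S^2 N^2 L^tr): direct linear prediction over S*N inputs.
    return S**3 * N**3 + S**2 * N**2 * L_tr

def lstd_deploy_cost(S, N, L_tr, K, I_als):
    # O(K I_ALS (L^tr (S N^2 + S^2) + N^3 + S^3)): K per-feature ALS loops.
    return K * I_als * (L_tr * (S * N**2 + S**2) + N**3 + S**3)

# Illustrative values (big-O constants ignored):
S, N, L_tr, K, I_als = 16, 5, 4, 2, 10
print("naive:", naive_deploy_cost(S, N, L_tr))
print("LSTD: ", lstd_deploy_cost(S, N, L_tr, K, I_als))
```

The dominant term of the naïve approach grows as $S^3 N^3$, while the LSTD-based approach keeps the cubic terms decoupled as $N^3 + S^3$, which is the source of its advantage for large channel dimensions.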
Table 3. Experimental setting.

Window size ($N$) | 5
Lag size ($\delta$) | 3
Number of previous frames ($F$) | 500
Number of slots ($L + N + \delta - 1$) | 107
Frequency of the pilot signals ($\omega_{\mathrm{SRS}}/2\pi$) | 200
Normalized Doppler frequency for slow-varying environment | $\rho \sim \mathrm{Unif}[0.005, 0.05]$
Normalized Doppler frequency for fast-varying environment | $\rho \sim \mathrm{Unif}[0.1, 1]$
SNR for channel estimation | 20 dB
Number of pilots for channel estimation | 100
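For reference, the setting of Table 3 can be collected into a single configuration object, which is convenient when reproducing the experiments. The field names below are illustrative rather than taken from the paper's code, and the slot-count comment assumes the reconstruction $L + N + \delta - 1 = 107$.

```python
# Experimental setting of Table 3 as a configuration dictionary
# (field names are illustrative, not from the paper's code).
experiment_config = {
    "window_size_N": 5,
    "lag_size_delta": 3,
    "num_previous_frames_F": 500,
    "num_slots": 107,                          # L + N + delta - 1
    "pilot_frequency": 200,                    # omega_SRS / (2 * pi)
    "doppler_slow": ("uniform", 0.005, 0.05),  # normalized Doppler rho
    "doppler_fast": ("uniform", 0.1, 1.0),
    "snr_channel_estimation_db": 20,
    "num_pilots_channel_estimation": 100,
}
```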