Machine Learning-Based Imputation Approach with Dynamic Feature Extraction for Wireless RAN Performance Data Preprocessing

Dahj, Jean Nestor M.; Ogudo, Kingsley A.

doi:10.3390/sym15061161


by Jean Nestor M. Dahj * and Kingsley A. Ogudo

Department of Electrical & Electronics Engineering, Faculty of Engineering and the Built Environment, University of Johannesburg, Johannesburg 0524, South Africa

* Author to whom correspondence should be addressed.

Symmetry 2023, 15(6), 1161; https://doi.org/10.3390/sym15061161

Submission received: 5 May 2023 / Revised: 21 May 2023 / Accepted: 24 May 2023 / Published: 27 May 2023


Abstract:

Machine learning (ML) in wireless mobile communication is becoming more and more customary, with application trends leaning toward performance improvement and network automation. The radio access network (RAN), critical for service access, frequently generates performance data that mobile network operators (MNOs) and researchers leverage for planning, self-optimization, and intelligent network operations. However, missing values in the RAN performance data, as in any valuable data, impact analysis. Poor handling of such missing data in the RAN can distort the relationships between different metrics, leading to inaccurate and unreliable conclusions and predictions. Therefore, there is a need for imputation methods that preserve the overall structure of the RAN data to an optimal level. In this study, we present an imputation approach for handling RAN performance missing data based on machine learning algorithms. The method customizes the feature-extraction mechanism by using dynamic correlation analysis. We apply the method to actual RAN performance indicator data to evaluate its performance. We finally compare and evaluate the proposed approach with statistical imputation techniques such as the mean, median, and mode. The results show that machine learning-based imputation, as approached in this experimental study, preserves some relationships between KPIs compared to non-ML techniques. Random Forest regressor gave the best performance in imputing the data.

Keywords:

machine learning (ML); data imputation; radio access network (RAN); data preprocessing; telecommunications; mobile network operators (MNOs)

1. Introduction

The service requirements, virtualization, and cloudification of 5G networks [1] and the projected planning of 6G requirements [2] increasingly push the mobile telecommunication arena toward advanced technologies. Self-automation of network operation is becoming common, justifying the demand for AI/ML integration in network processes. Radio access, in particular, has always been a point of interest for MNOs because it plays an essential role in network access and accounts for most network problems. Radio access nodes generate a set of measurements and performance indicators that MNOs use to monitor and assess network quality. These measurements include indicators such as block error rate, channel quality indicator, signal-to-noise ratio, throughput, and many more, which provide insight into network quality and service assurance. We refer to those metrics as RAN performance data. Because these data form the nucleus of radio optimization, researchers are leveraging them for RAN automation. For example, AI/ML-based resource management applications focusing on efficient spectrum management, uplink power control, and channel assignment are entirely based on RAN performance data [3]. RAN performance data are also fundamental in mobility (handover) management and monitoring [4].

The origins of radio key performance indicators (KPIs) are diverse, with input from industry standards, regulatory bodies, and radio equipment vendors. The most common practical collection sources include operation support system (OSS) servers through performance counters and trace servers through traces. However, like most regularly collected data, radio performance data suffer from missing values, often due to transmission issues, source server problems, or environmental factors such as power outages. While the most straightforward solution would be to ignore missing values completely, doing so could discard valuable information and render most modeling techniques ineffective [5]. Hence, there is a need to impute the missing data correctly so as to retain feature relationships while preserving the structure of the data.

The handling of missing values in wireless mobile radio can have a significant impact on analysis and optimization operations [6]. While many imputation techniques have been introduced in several industries, common statistical imputation techniques such as mean, median, and mode imputation are still widely used in wireless communication data, partly because of their ease of implementation and lower computational power requirements [7]. In such cases, the mean, median, or mode values are used to replace the missing values of the metric of interest. Statistical imputation assumes that values are missing at random; when they are not, the technique introduces bias and distorts the original structure of the data [8]. The experiment conducted in [9] also shows that single-value statistical imputation (mean and median) introduces artificial variability, which can affect models’ generalization when ML algorithms are trained on the data.

Network data are sensitive to bias and variance. Poor radio conditions, for example, often lead to low modulation scheme selection, which leads to a reduced number of bits per symbol, directly affecting the user throughput. Certain conditions are imposed symmetrically on the downlink and uplink channels to maintain the balance in the resource allocation procedures. Using a single-value imputation method disrupts the meaningfulness of such data structures. A wrongly imputed channel quality indicator (CQI) in a self-optimizing 5G network can lead to wrong resource allocation for a specific cell and user. A correctly imputed received signal power, on the other hand, can ensure that the power allocated to the user equipment (UE) aligns with the symmetric power constraints that limit the maximum power allowed on radio channels. Therefore, although they are fast to implement, statistical single-value imputation methods are not suitable for radio network performance data. Regression-based imputation techniques have been introduced by several research works as a way to circumvent the bias, variance, and feature relationship distortion issues often experienced in mean, median, and mode imputation [10,11]. The multivariate imputation method [12] is one example of a machine learning-based technique that leverages regression algorithms and has been tried on network data. However, most regression-based imputation techniques do not consider the interrelationships between features, which is critical in RAN performance data. Given four features $x_{1}$, $x_{2}$, $x_{3}$, and $y$ of a dataset $X$, the multivariate method imputes missing values of $y$ by regressing it on $x_{1}$, $x_{2}$, and $x_{3}$. While this method provides good results in estimating the data structure, the relationship between variables, especially in network data, can be affected. For example, regressing a cell downlink control channel element (CCE) assignment rate on the average physical uplink shared channel (PUSCH) may lead to erroneous estimates. Hence, regression-based imputation must be used with caution when dealing with wireless data, where traffic stream flow direction, radio conditions, resource utilization, and mobility may each affect the outcome of the process differently.

In this research study, we present a method that utilizes machine learning algorithms for RAN performance data imputation effectively. However, the big differentiator of the proposed method compared to most techniques used in data imputation is the dynamic selection of features for the regression task. The feature-extraction process uses the Pearson correlation coefficient [13] to study linear relationships between key performance indicators (KPIs). Pearson correlation coefficient is a robust method for feature relationship analysis and has been applied in many research works [14,15,16]. Using the threshold defined, the features that meet the set requirements are used for the imputation operation. Only features with no missing values fulfilling the set correlation requirements are used to impute a feature with missing values. ML-based imputation can be computationally expensive, especially in big data environments [17]. However, the need to have accurate data for effective decisions, in our case, surpasses the worry of computing power. To address this, we use a pipeline approach that processes data in memory. The feature extraction and the data imputation tasks are performed in memory in a structured process. We also apply statistical imputation techniques (the mean, mode, and median) for performance comparison between the ML-based and non-ML-based methods.

The contribution of our study is threefold. Firstly, the study practically highlights the problems of statistical imputation on RAN performance metrics for network and users’ analytical exercises. The experiment’s outcome shows how these methods fail to capture the KPI relationships and can negatively impact network decisions. Secondly, the study optimizes machine learning imputation techniques by introducing dynamic feature extraction to select the best candidate features for regression tasks on the features with missing values. Thirdly, to our knowledge, our research study is the first to comprehensively address the missing data issue in RAN performance data directly obtained from a live network OSS using ML-based methods.

2. Literature Review and Background

The importance of RAN performance data and machine learning applications in the Telco environment is studied in detail in [4]. The 5G system, for example, with its complex requirements with respect to accommodating multiple industry use cases [18,19] depends primarily on radio capabilities [20]. Technologies and concepts such as massive MIMO, small cells, beamforming, and advanced access techniques such as OFDM and DSS in 5G are all used to improve the capabilities of the radio spectrum with respect to catering for high-speed connectivity and low-latency applications [21]. The road to self-optimizing and self-organizing wireless mobile networks depends on accurate radio access performance measurement to tune network parameters accordingly. Ref. [22] proposed a self-optimized approach that intelligently associates users with the best candidate cells by measuring radio conditions. The study uses reinforcement learning on radio metrics to tune bias factors while optimizing users’ throughput. Other notable research on ML-based self-decision radio performance tuning includes [23,24,25] for power control, tuning, and management, and Refs. [26,27,28] for spectrum management and resource allocation. The accurate reading of RAN measurement values is essential for such use cases. However, missing measurement values can be unavoidable in RAN data.

Missing data impact and data imputation techniques have been studied in several other domains, such as medicine [29,30,31], sport [32], and environmental data [7,9]. In most cases, researchers and scientists have resorted to ML/AI techniques to impute the data. Ref. [33] uses an ensemble technique to impute clinical datasets and reports better performance than statistical methods. However, the study does not show the difference in data structure retention between the selected techniques, and feature correlation is not taken into consideration. In [12], a K-Nearest Neighbor (KNN)-based missing data imputation method is presented. The study uses classification accuracy to measure the technique’s effectiveness and highlights the gap between mean imputation and KNN imputation. However, the method may not apply to all datasets, and variable correlation is not studied prior to missing value replacement.

Although AI/ML is becoming the solution of choice to address complex network problems, its adoption in the Telco industry has been slower compared to other sectors. The complexity of mobile network architectures, the various heterogeneous data sources, and constant changes in specifications and releases to accommodate user and service needs make it challenging, exciting, and fun for researchers and MNOs. In the Telco arena, very few research works have addressed data imputation problems. Most works focus on the actual application of the technologies. Moreover, complex algorithms are now being developed to make mobile networks more intelligent.

3. Related Work

A notable research study in radio data imputation is the work of Chaudhry et al. [34]. The study combines multivariate and univariate approaches to impute variables in the LTE spectrum data collected from spectrum measurement devices. The KPI with the missing value is the cell throughput, an important indicator of cell performance. Multivariate imputation by chained equations (MICE) [12] is utilized with predictive mean matching (PMM) [35]. The performance is compared against the Kalman filtering imputation method, attributing an optimal performance to the MICE-PMM technique. Although popular in handling missing values, MICE has shown bias problems since the algorithm assumes that the values are missing at random, which does not always hold in the network environment. The accuracy of MICE also depends on the regression equations used to estimate the missing value. If the estimators fail to capture relationships between features, MICE can produce unreliable imputed values, which is problematic in the RAN environment, where KPIs often show strong correlations. Like most imputation methods, Ref. [34] ignores feature correlation. Another notable research study in network data imputation is the work of Lin J. et al. [36]. Although not directly linked to mobile network data, the study proposes a regularized dynamic factor analysis (DFA) capable of handling missing values in wireless sensor network data. The study uses dimension reduction to reduce the number of features while preserving data information.

4. Machine Learning Models’ Background and Study Problem Formulation

Machine learning is evolving, with advanced and specialized techniques being conceived by researchers and scientists for complex tasks. The ML field is popularly segmented into three main classes depending on the learning or application task: supervised, unsupervised, and reinforcement learning. In supervised ML, the input or training data contain the output labels. The objective of supervised ML algorithms, therefore, is to learn the patterns and relationships in the data to create a function that accurately predicts the output of new unlabeled data. Depending on the output label type, the supervised ML class is grouped into classification and regression tasks. In cases where the training data have no output labels, unsupervised ML is used to uncover patterns and relationships in the data. Its ability to identify structures in data with no prior outcome knowledge makes it the ideal ML field for clustering and dimensionality reduction tasks. Reinforcement learning (RL) is an ML class that introduces the concept of self-decision through action, reward, and penalty mechanisms [37]. Based on the trial and error approach, RL’s objective is to identify the most effective set of rules, also known as policy, that maximizes the cumulative reward over a period of time. Besides the main ML classes, hybrid algorithms such as semi-supervised and transfer learning can also be leveraged for intelligence tasks. Semi-supervised ML combines both labeled and unlabeled data to train a model. The model identifies patterns and structures in the unlabeled data and then combines the output with the labeled data for precise and effective predictions. In transfer learning, the knowledge acquired in one task is used in another related task—no need to train the model from scratch. Visual and natural language processing (NLP) are two areas that benefit more from transfer learning techniques. Each ML category consists of algorithms that perform desired tasks. 
How algorithms are selected for implementation depends on the problem to be addressed and other considerations, such as data characteristics and performance targets. A non-exhaustive summary table of widely used ML algorithms is shown in Table 1. For a more comprehensive understanding of supervised and unsupervised machine learning algorithms, including their applications, strengths, and weaknesses, please refer to the extensive information available in [38]. Ref. [39] offers an in-depth exploration of RL concepts and algorithms.

Our study experiment uses five supervised machine learning models to impute the reference signal received power (RSRP) KPI based on a multitude of RAN metrics collected from the OSS. The algorithms used in the study include Linear Regression, KNN, Random Forest, Support Vector Regression (SVR), and XGBoost. The RAN imputation problem is treated as a regression problem such that the relationship between radio KPIs and a particular KPI with missing values is modeled. Given RAN performance data $X$ of $m$ samples (network cells), $n$ non-missing value features (KPIs), and a missing value feature or KPI $y$, we determine a function $f$ such that

$$\forall\, 1 \leq i \leq m, \quad y(i) = f(x_{1}(i), x_{2}(i), \dots, x_{n}(i)) + \varepsilon(i)$$

where $\varepsilon(i)$ represents the error and $y(i)$ the value to be predicted. However, we must carefully study the correlation between variables before selecting the features for the function.

Supervised machine learning algorithms can help estimate function $f$ accurately. We give an overview of the algorithms used in this experiment.

4.1. Multiple Linear Regression

Linear Regression is a statistical method used to analyze the relationship between a response feature and one or more predictor features [40]. Given the problem definition above, the algorithm aims to determine regression coefficients $\beta_{0}, \beta_{1}, \beta_{2}, \dots, \beta_{n}$ that best estimate the predicted value as expressed in (1).

$$y(i) = \beta_{0} + \beta_{1} x_{1}(i) + \beta_{2} x_{2}(i) + \dots + \beta_{n} x_{n}(i) + \varepsilon(i) \tag{1}$$

The coefficients are chosen such that the sum of squared errors (SSE) between the actual values $y$ and the predicted values $\hat{y}$ is minimized using (2).

$$\mathrm{SSE} = \sum_{i = 1}^{m} (y(i) - \hat{y}(i))^{2} \tag{2}$$

In the case of non-linearity, the function $f$ becomes a nonlinear function. However, the goal remains to determine the nonlinear coefficients that will best fit the data.
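As an illustration of (1) and (2), the following minimal Python sketch fits the one-predictor special case of Linear Regression in closed form and evaluates the SSE. The KPI names and sample values are hypothetical and are not taken from the study data; the study itself uses multiple predictors.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares for y ~ b0 + b1 * x (single predictor)."""
    m = len(x)
    x_bar = sum(x) / m
    y_bar = sum(y) / m
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(y, y_hat):
    """Sum of squared errors between actual and predicted values, as in (2)."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))


# Hypothetical samples: SINR (dB) as predictor, RSRP (dBm) as response.
sinr = [0.0, 5.0, 10.0, 15.0]
rsrp = [-110.0, -104.0, -101.0, -95.0]

b0, b1 = fit_simple_ols(sinr, rsrp)
fitted = [b0 + b1 * v for v in sinr]
fit_error = sse(rsrp, fitted)
```

Any other choice of coefficients on the same samples yields a larger SSE, which is the sense in which the fitted coefficients are optimal.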

4.2. Support Vector Regression (SVR)

SVR belongs to the Support Vector Machine (SVM) algorithm family [41] and is used for continuous value prediction or estimation. The algorithm uses a nonlinear kernel function to map input features to a higher-dimensional variable space. After mapping features to the variable space, a Linear Regression algorithm is applied to predict new values. Based on the problem statement given, the goal is to determine the hyperplane in the variable or feature space that keeps the deviation between $y$ and $\hat{y}$ within the tolerance $\varepsilon$ while keeping the weight vector as small as possible.

The SVR optimization function is given by (3).

$$J(w) = \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i = 1}^{m} \max(0, |y(i) - \hat{y}(i)| - \varepsilon) \tag{3}$$

subject to:

$$\hat{y}(i) = w^{T} \varphi(x(i)) + b \tag{4}$$

where $w$ is the weight vector, $C$ is the regularization element, $w^{T} \varphi(x(i))$ represents the attribute vector in the hyperplane, and $b$ represents the bias element. Several research works have used SVM types of algorithms for predictive purposes [42,43].
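To make the objective in (3) concrete, the sketch below evaluates it for a given weight vector and bias. It assumes a linear kernel, so the feature map is the identity; this is a simplification chosen for illustration, and the numeric values are hypothetical. It only evaluates the objective and does not solve the optimization problem.

```python
def epsilon_insensitive(y, y_hat, eps):
    """Loss that ignores absolute errors smaller than the tolerance eps."""
    return max(0.0, abs(y - y_hat) - eps)


def svr_objective(w, b, X, y, C, eps):
    """0.5 * ||w||^2 + C * sum of epsilon-insensitive losses (linear kernel)."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    total_loss = 0.0
    for xi, yi in zip(X, y):
        y_hat = sum(wj * xj for wj, xj in zip(w, xi)) + b  # prediction as in (4)
        total_loss += epsilon_insensitive(yi, y_hat, eps)
    return regularizer + C * total_loss


# Hypothetical two-feature samples.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [3.0, 8.0]
objective = svr_objective(w=[1.0, 1.0], b=0.0, X=X, y=y, C=2.0, eps=0.5)
```

In this example the first sample is predicted exactly and contributes no loss, while the second is off by 1 and contributes only the 0.5 that exceeds the tolerance.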

4.3. Random Forest Regression (RFR)

RFR is a tree-based ensemble algorithm popular for regression problems (predicting continuous values). The Random Forest algorithm [44] has generally shown good performance in regression and classification tasks as it uses an ensemble approach. In the case of our problem, the Random Forest regressor is used as follows. Given the number of Decision Trees $N$ created from $X$ using bagging and randomization techniques, $N$ tree-based algorithms are trained on random subsets of $X$. Data points are run through each tree in the Random Forest to predict the new values. The average of all the tree predictions is used as the final prediction. The notion is expressed using (5).

$$\hat{y}(i) = \frac{1}{N} \sum_{t = 1}^{N} \hat{y}_{t}(i) \tag{5}$$

where $\hat{y}_{t}(i)$ is the local prediction of tree $t$ for observation $i$. The algorithm can be optimized by tuning hyperparameters, such as changing $N$, changing the sample size required to split nodes, and performing an effective feature reselection.
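The averaging in (5) can be sketched with a deliberately reduced ensemble: depth-1 regression trees (stumps) on a single feature, each trained on a bootstrap sample. This is an assumption made to keep the example short; a full Random Forest grows deeper trees and also randomizes the features considered at each split. The data values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Least-squares depth-1 tree on one feature: (threshold, left_mean, right_mean)."""
    mean = sum(ys) / len(ys)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    for thr in sorted(set(xs))[:-1]:  # last value excluded so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best[0]:
            best = (err, thr, lm, rm)
    return best[1:]


def fit_bagged_stumps(xs, ys, n_trees, seed=0):
    """Train n_trees stumps, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    m = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Average the per-tree predictions, as in (5)."""
    preds = [lm if x <= thr else rm for thr, lm, rm in trees]
    return sum(preds) / len(preds)


# Hypothetical samples: SINR (dB) as feature, RSRP (dBm) as target.
sinr = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
rsrp = [-110.0, -111.0, -109.0, -90.0, -91.0, -89.0]
forest = fit_bagged_stumps(sinr, rsrp, n_trees=25)
```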

4.4. XGBoost Regression

XGBoost is also based on Decision Tree algorithms combined with gradient-boosting techniques to improve performance while minimizing the load on the system [45]. Since its inception in 2016, many research works have leveraged the algorithm to address regression and classification tasks [46,47]. The fundamental principle of the algorithm is to use Decision Trees as weak learners to predict values while optimizing a loss function that evaluates the difference between $y$ and $\hat{y}$.

Let us initialize a prediction score $s_{0}$ to a constant value $v$. For each iteration $k$, the loss function’s gradient and Hessian are computed with reference to $s_{k - 1}$, the previous prediction score. These values are used to build a Decision Tree that approximates the gradient. The prediction score $s_{k}$ is then updated with the new tree prediction. The update process is repeated until the accuracy on a given validation set stops improving. The prediction score can be represented using (6).

$$\hat{y}(i) = \sum_{t = 1}^{N} w(t)\, \hat{y}_{t}(i) \tag{6}$$

where $N$ represents the number of trees, $w(t)$ represents the weight of tree $t$, and $\hat{y}_{t}(i)$ is the estimation or prediction of the $t$-th tree for the $i$-th observation in the dataset.
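The iterative update described above can be sketched with plain gradient boosting under a squared-error loss, where the negative gradient is simply the residual and the Hessian is constant. This is not XGBoost itself: regularization, second-order split gains, and early stopping are omitted, and the weak learners are single-feature stumps. The constant tree weight $w(t)$ is taken to be a hypothetical learning rate, and the data are made up.

```python
def fit_stump(xs, targets):
    """Least-squares depth-1 tree on one feature: (threshold, left_mean, right_mean)."""
    mean = sum(targets) / len(targets)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    for thr in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if err < best[0]:
            best = (err, thr, lm, rm)
    return best[1:]


def stump_predict(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Return (s0, learning_rate, stumps) after n_rounds of residual fitting."""
    s0 = sum(ys) / len(ys)  # constant initial score
    scores = [s0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Squared loss: the negative gradient at the previous score is the residual.
        residuals = [y - s for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, x)
                  for s, x in zip(scores, xs)]
    return s0, learning_rate, stumps


def boosted_predict(model, x):
    """Initial score plus the weighted sum of tree predictions, in the spirit of (6)."""
    s0, lr, stumps = model
    return s0 + sum(lr * stump_predict(st, x) for st in stumps)


def mse(model, xs, ys):
    return sum((y - boosted_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical samples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [-110.0, -108.0, -105.0, -99.0, -95.0, -90.0]
mse_start = mse(boost(xs, ys, n_rounds=0), xs, ys)
mse_end = mse(boost(xs, ys, n_rounds=30), xs, ys)
```

With a learning rate between 0 and 1, each round cannot increase the training error, so `mse_end` is lower than `mse_start` on this data.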

4.5. K-Nearest Neighbor (KNN) Regression

Unlike most regression techniques that depend on parameters such as regression coefficients, weights, bias, etc., KNN does not depend on parameters; hence, it is considered non-parametric. The algorithm uses the closest points to a data point to predict the new value. Given a number of nearest neighbors $K$ and an input $x(i)$, the $K$ data points $x$ in $X$ can be determined such that the distance $d(x, x(i))$ is minimal. The predicted value is then computed by averaging the results of the $K$ nearest neighbors. The Euclidean distance, given in (7), can be used to calculate the distance between the points.

$$d(x, x(i)) = \sqrt{\sum_{j = 1}^{n} (x_{j} - x_{j}(i))^{2}} \tag{7}$$

where $n$ represents the number of independent variables, and $i$ is the index of the training observation. Since $K$ can influence the model performance significantly, tuning it properly can improve the KNN regression performance. Ref. [48] reviews the KNN algorithm in more detail, giving an overview of its variants.
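A minimal sketch of KNN regression with the Euclidean distance in (7) is given below. The function names and the sample values are hypothetical, and the neighbors are weighted equally, which is one of several possible variants.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors, as in (7)."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))


def knn_predict(X_train, y_train, query, k):
    """Average the targets of the k training points closest to the query."""
    ranked = sorted(zip(X_train, y_train),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical training samples with two features each.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
y_train = [-100.0, -102.0, -98.0, -80.0]
prediction = knn_predict(X_train, y_train, query=[0.2, 0.2], k=3)
```

Here the distant fourth sample is excluded, so the prediction is the mean of the three nearby targets.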

5. Study Methodology

The methodology employed in this experiment is shown in Figure 1. A three-step data-processing pipeline is used to address the imputation problem in a sensitive mobile radio access environment with symmetric constraints. It includes a dynamic feature-extraction operation, a data-modeling step, and the imputation task using ML algorithms. Collected cell and user data’s RAN KPIs are fed to the feature extractor, which identifies the columns with missing values up to a certain percentage of the total value. The KPIs are correlated to extract features that can be used for each missing data column.

The extracted features are finally used to extract subsets of the data that can be used to train the regressors. We scale the data and train the different selected machine learning algorithms. Based on the evaluation criteria defined in the pipeline, the best-performing model is automatically selected to impute the final missing values. In the next subsections, we explore each step, which in a standard ML process, would be part of the preprocessing step.

5.1. Feature Extraction

Feature extraction is a crucial part of the preprocessing step and does the heavy lifting of preparing data for modeling and imputation. Given the RAN KPI dataframe $X$ of dimension $(m, n)$, where $m$ is the number of rows or samples and $n$ the number of columns or features, we compute the Pearson correlation between variables using (8).

$$\mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma(x)\, \sigma(y)} \tag{8}$$

where $\mathrm{Cov}(x, y)$ is the covariance between $x$ and $y$, and $\sigma$ is the standard deviation of the column calculated using (9), where $\bar{x}$ is the mean value of column $x$ with $x \in X$, and $x(i)$ is the $i$-th element of $x$. The covariance is computed using (10).

$$\sigma(x) = \sqrt{\frac{\sum_{i = 1}^{m} (x(i) - \bar{x})^{2}}{m}} \tag{9}$$

$$\mathrm{Cov}(x, y) = \frac{\sum_{i = 1}^{m} (x(i) - \bar{x}) (y(i) - \bar{y})}{m} \tag{10}$$

The correlation matrix of the dataframe, $\mathrm{corr}(X)$, is of shape $(n, n)$. A correlation matrix model is shown in Figure 2a. We create a correlation dataset $F$ from $\mathrm{corr}(X)$ of size $(n, p)$, with $p$ being the number of columns containing the missing values to be imputed. An illustration of the creation of $F$ is shown in Figure 2b. The correlation of a column with itself is $\mathrm{corr}(x_{i}, x_{i}) = 1$, and $\mathrm{corr}(x_{i}, x_{j}) = \mathrm{corr}(x_{j}, x_{i})$. Figure 3 shows the in-memory created table $F$, assuming that $x_{1}$ and $x_{2}$ are the two columns containing missing values. The table is created in memory for fast processing.
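The correlation in (8) to (10) and the construction of the table $F$ can be sketched in plain Python as follows. The sketch assumes that missing entries are represented by `None` and are dropped pairwise before the correlation is computed; that handling, the KPI names, and the values are assumptions made for illustration and are not taken from the study, which relies on Apache Spark.

```python
import math


def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    m = len(pairs)
    x_bar = sum(a for a, _ in pairs) / m
    y_bar = sum(b for _, b in pairs) / m
    cov = sum((a - x_bar) * (b - y_bar) for a, b in pairs) / m
    sd_x = math.sqrt(sum((a - x_bar) ** 2 for a, _ in pairs) / m)
    sd_y = math.sqrt(sum((b - y_bar) ** 2 for _, b in pairs) / m)
    return cov / (sd_x * sd_y)


def correlation_table(data, missing_cols):
    """Table F: one row per KPI, one column per KPI that has missing values."""
    return {kpi: {col: pearson(data[kpi], data[col]) for col in missing_cols}
            for kpi in data}


# Hypothetical KPI columns; RSRP has one missing value.
data = {
    "rsrp": [-100.0, None, -90.0, -85.0, -80.0],
    "sinr": [2.0, 5.0, 8.0, 11.0, 14.0],
    "prb_util": [80.0, 60.0, 55.0, 30.0, 20.0],
}
F = correlation_table(data, missing_cols=["rsrp"])
```

Each row of `F` holds the correlation of one KPI with the column to be imputed, including the self-correlation of 1 that is discarded later.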

Let us consider $x$, a column of $F$ with an acceptable threshold of missing values.

Given the input dataframe $F$ of size $(n, 2)$, we want to find a set of KPIs $S$ that strongly correlate with the missing value column $x$. We define a cutoff $\beta$, which represents the $\beta$-th percentile of $x$, located at rank $r$. $\beta$ is chosen such that we have a solid positive correlation threshold between $x$ and other KPIs. The correlation of 1 is discarded from the computation as it represents a feature correlation with itself.

We define $\mathrm{sort}(F[x])$ as $F$ sorted by $x$ in ascending order. Then $r$ is calculated using (11).

$$r = \beta (n + 1) \tag{11}$$

If $\lfloor r \rfloor = \lceil r \rceil$, that is, if $r$ is an integer, then the percentile is $p(\beta) = x(r)$, where $x(r)$ is the $r$-th observation of $x$.

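Otherwise, let $k = \lfloor r \rfloor$, with $\lfloor r \rfloor$ being $r$ rounded down to the nearest integer, and interpolate linearly between the two neighboring observations:

$$p(\beta) = (k + 1 - r)\, x(k) + (r - k)\, x(k + 1)$$

where $x(k)$ and $x(k + 1)$ are the $k$-th and the $(k + 1)$-th observations of $x$.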

We define an array of KPIs $S$ such that:

$$\forall\, 1 \leq i \leq n,\ \forall j, k \in K,\ (j \neq k \land x(i) \geq p(\beta)) \Rightarrow (j \in S) \land (k \in S)$$

$S$ contains all distinct KPIs $(K)$ of $\mathrm{sort}(F[x])$ where the value of column $x$ is greater than or equal to the value of the $\beta$-th percentile of $x$. $S$ is the set of extracted features for the imputation task.
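The percentile cutoff and the selection of $S$ can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: $\beta$ is expressed as a fraction between 0 and 1, the rank from (11) is clamped to the valid range, non-integer ranks use standard linear interpolation between the two neighboring sorted values, and all KPI names and correlation values are hypothetical.

```python
import math


def percentile(values, beta):
    """beta-th percentile (beta as a fraction) using the rank r = beta * (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    r = beta * (n + 1)
    r = min(max(r, 1.0), float(n))  # assumption: clamp the rank to [1, n]
    k = math.floor(r)
    if k == r:  # integer rank: take the r-th observation (1-based)
        return xs[k - 1]
    # Linear interpolation between the k-th and (k + 1)-th observations.
    return (k + 1 - r) * xs[k - 1] + (r - k) * xs[k]


def select_features(corr_with_target, target, beta):
    """KPIs whose correlation with the target is at or above the cutoff."""
    # The self-correlation (always 1) is discarded before computing the cutoff.
    others = {kpi: c for kpi, c in corr_with_target.items() if kpi != target}
    cutoff = percentile(list(others.values()), beta)
    return sorted(kpi for kpi, c in others.items() if c >= cutoff)


# Hypothetical column of F for the KPI to be imputed.
corr_with_rsrp = {
    "rsrp": 1.0, "sinr": 0.82, "cqi": 0.75, "dl_thp": 0.40,
    "ul_thp": 0.30, "ho_succ": 0.05, "prb_util": -0.10, "bler": -0.60,
}
S = select_features(corr_with_rsrp, target="rsrp", beta=0.75)
```

With seven candidate KPIs and a cutoff at the 75th percentile, the rank is 6, so the cutoff equals the sixth smallest correlation (0.75) and only `cqi` and `sinr` are retained.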

5.2. Data Modeling

The data-modeling step is used to scale the data and train the regressors for imputation. Considering $X$ to be the initial dataframe or matrix of shape $(m, n)$ and the extracted features $S$ to contain the strongly correlated features and the feature of interest $x$, we create a new dataframe or matrix $\underline{X}$ with $\underline{X}_{ij} = X_{ij}$ where $j \in S$. $\underline{X}$ is $X$ sliced to the features specified in $S$.

1. Define two dataframes $M$ and $N$ such that, for the feature with missing values $x$, $M = \{\underline{X}_{ij} \mid \underline{X}_{ij} \neq \emptyset\}$ and $N = \{\underline{X}_{ij} \mid \underline{X}_{ij} = \emptyset\}$, with $\underline{X} = M \cup N$. $M$ is the subset of $\underline{X}$ containing no missing values, and $N$ is the subset of $\underline{X}$ containing missing values only for $x$.
2. Use $M$ as input data to train different regression algorithms following the standard procedures of machine learning tasks (splitting, scaling, fitting, and predicting), explained in the following steps.
3. Split the input data $M$ into the training set $R$ and the testing set $E$ such that $R = \{x \in M \mid 1 \leq i \leq \alpha |M|\}$ and $E = \{x \in M \mid \alpha |M| < i \leq |M|\}$, where $|M|$ represents the number of samples in $M$ and $\alpha$ the split percentage. For a standard 80% split, $\alpha = 0.8$.
4. Standardize the variables of $R$ using the formula $R^{(S)} = \frac{R - \mu}{\sigma}$, where $R$ is the original feature value, and $\mu$ and $\sigma$ represent the mean and standard deviation values of the feature, respectively. $R^{(S)}$ is the scaled values of $R$, or just scaled $R$. By scaling the data, we ensure that all variables are treated equally by the algorithms during training, and the optimization process is not affected or skewed by the original value difference.
5. Transform all variables of $E$ using the same scaling parameters as $R$ such that $E^{(T)} = \frac{E - \mu}{\sigma}$, where $E$ is the original variable and the pair $(\mu, \sigma)$ is as computed in step 4. $E^{(T)}$ represents the transformed $E$. By using the pair $(\mu, \sigma)$, we ensure that $E$ is standardized using the same reference as $R$, enabling unbiased comparisons of model performance and improving its reliability on unseen data.
6. Train the models by fitting the selected regression algorithms using $R^{(S)}$.
7. Use the models trained in step 6 to predict the new values using $E^{(T)}$.
8. Assess the performance of each regressor.
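The splitting and scaling steps above can be sketched as follows, with rows represented as plain lists of numeric features. The helper names and the sample rows are hypothetical. The point of the sketch is that the mean and standard deviation are computed on the training set only and then reused, unchanged, for the testing set.

```python
import math


def split(rows, alpha=0.8):
    """Ordered split: the first alpha * |M| rows train, the remainder test."""
    cut = int(alpha * len(rows))
    return rows[:cut], rows[cut:]


def fit_scaler(rows):
    """Per-column mean and (population) standard deviation of the training rows."""
    m, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n_cols)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / m)
            for j in range(n_cols)]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with previously fitted parameters (0.0 for constant columns)."""
    return [[(v - mu) / sd if sd else 0.0 for v, mu, sd in zip(r, means, stds)]
            for r in rows]


# Hypothetical complete rows (dataset M) with two features each.
M = [[float(i), float(2 * i + 1)] for i in range(10)]
R, E = split(M, alpha=0.8)
mu, sigma = fit_scaler(R)          # parameters come from the training set only
R_scaled = transform(R, mu, sigma)
E_transformed = transform(E, mu, sigma)  # same (mu, sigma) reused for the test set
```

The regressors would then be fitted on `R_scaled` and evaluated on `E_transformed`.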

The RAN KPI dataset only contains numerical metrics. Therefore, no encoding operation was used in the preprocessing task.

Because the missing value problem is a regression problem, we use three metrics to analyze the performance of the models.

Given $y$ and $\hat{y}$, the true and predicted values, respectively, we can compute the mean squared error ($\mathrm{MSE}$), the coefficient of determination ($R^{2}$), and the mean absolute error ($\mathrm{MAE}$) of the different models using (12)–(14).

$$\mathrm{MSE} = \frac{\sum (y - \hat{y})^{2}}{n} \tag{12}$$

$$R^{2} = 1 - \frac{\sum (y - \hat{y})^{2}}{\sum (y - \bar{y})^{2}} \tag{13}$$

$$\mathrm{MAE} = \frac{1}{n} \sum |y - \hat{y}| \tag{14}$$

where $n$ is the number of samples or records in the testing or evaluation dataset of $M$. We use the three metrics because each gives us insight into a specific aspect of the problem. Although all performance assessment metrics are important, $R^{2}$ is prioritized for RAN KPIs because the proportion of variance in the data to be imputed is crucial.
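The three metrics in (12) to (14) can be computed directly, as in the short sketch below. The function names and the sample values are hypothetical.

```python
def mse(y, y_hat):
    """Mean squared error, as in (12)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination, as in (13)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def mae(y, y_hat):
    """Mean absolute error, as in (14)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


# Hypothetical true and predicted values on a test set.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
scores = {"MSE": mse(y_true, y_pred), "R2": r2(y_true, y_pred), "MAE": mae(y_true, y_pred)}
```

A model that always predicts the mean of the true values obtains an $R^{2}$ of 0, which gives a baseline for interpreting the score.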

5.3. Data Imputation

The best-performing model is automatically selected to replace the missing values of $x$ in dataset $N$. This is done by predicting the values of the missing data. An analysis of the different imputation techniques is carried out to illustrate the impact of each method on the radio analysis. The missing values of $x$ in $X$ are then replaced by the values of $x$ in $N$ using the common identifier, in this case, the Cell ID.

Let $k$ be the common identifier, $x$ the feature vector imputed, $X$ the original RAN performance data, and $N$ the dataset of predicted imputed data of $x$. Let

$$X[k] = \{k(i) \mid |\{j \mid X(j)[k] = k(i)\}| = 1\} \tag{15}$$

$$N[k] = \{k(i) \mid |\{j \mid N(j)[k] = k(i)\}| = 1\} \tag{16}$$

where $X[k]$ and $N[k]$ are sets of unique values in common column $k$ of $X$ and $N$, respectively. They represent the unique cells in the two datasets.

$\forall k(i) \in X[k]$, $X(i) = \{X(j)\}$ and $N(i) = \{N(j)\}$ such that $X(j)$ and $N(j)$ are rows in $X$ and $N$, respectively, where column $k = k(i)$.

$\forall X(j) \in X(i)$, if $X(j)[x] = \emptyset$, then $X(j)[x] = N(j)[x]$. Otherwise, $X(j)[x]$ is left unchanged.

Note that $i$ and $j$ are indices of row numbers. The index $i$ represents the index of the row currently used in the loop, and $j$ represents the indices of all rows where column $k$ is equal to $k(i)$.

$X$ is now imputed optimally on column $x$ and can be used for further network analytics tasks such as reinforcement learning for optimization, clustering, and identification of cells of interest.
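The replacement step described above can be sketched with pandas. This is an illustrative sketch rather than the study's implementation: the helper `fill_from_predictions`, the column names, and the sample values are hypothetical, and the sketch assumes that each Cell ID appears at most once in the dataset of predicted values.

```python
import numpy as np
import pandas as pd

def fill_from_predictions(X, N, key, col):
    """Fill missing values of `col` in X with the values of `col` in N,
    matching rows on the common identifier `key`."""
    # Assumption: one predicted value per identifier in N.
    lookup = N.drop_duplicates(subset=key).set_index(key)[col]
    filled = X.copy()
    # Only missing entries are replaced; observed values are kept as they are.
    filled[col] = filled[col].fillna(filled[key].map(lookup))
    return filled

# Hypothetical data: X is the original table, N holds the predicted values.
X = pd.DataFrame({"cell_id": [1, 2, 3],
                  "average_rsrp": [-95.2, np.nan, -101.7]})
N = pd.DataFrame({"cell_id": [2], "average_rsrp": [-98.4]})

X_imputed = fill_from_predictions(X, N, key="cell_id", col="average_rsrp")
print(X_imputed)
```

Working on a copy leaves the original table untouched, which makes it easy to compare the data before and after imputation.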

5.4. Complexity Analysis

Although the best model selection is based on performance assessment, it is also essential to understand the complexity of each model in terms of resource consumption. This additional task can help optimize the trade-off between the algorithm’s accuracy and efficiency, considering the computational resources (memory, training time, storage, and processing power). In an environment with symmetric constraints such as the RAN, where the balance must be kept between uplink and downlink resource allocation, a model that takes longer to train on the data imputation task may delay network decisions, especially for real-time network actions. Therefore, a statistical complexity measurement method is introduced to substantiate the chosen algorithm. Two parameters are used in this study: training execution time and memory utilization. The complexity measurement is conducted in steps 6 and 7 of the data-modeling process (model training and prediction). For each algorithm, we record the times $t_{0}$ and $t_{t}$ at the beginning and end of the training execution. We then calculate the difference $\Delta t = t_{t} - t_{0}$ as the training time. We profile the memory allocation over the execution time and extract the peak (highest) memory usage during the period.
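The measurement of $\Delta t$ and peak memory can be sketched with the Python standard library. The profiling tool used in the study is not stated, so `tracemalloc` is an assumption here; it only tracks allocations made through Python's memory allocator and may under-report memory used by native code. The helper name `profile_training` and the workload are hypothetical.

```python
import time
import tracemalloc

def profile_training(train_fn):
    """Run train_fn and return (result, elapsed seconds, peak memory in MB)."""
    tracemalloc.start()
    t0 = time.perf_counter()                  # start of the training execution
    result = train_fn()
    tt = time.perf_counter()                  # end of the training execution
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, tt - t0, peak_bytes / 1e6

# Hypothetical workload standing in for a model's fit() call.
result, delta_t, peak_mb = profile_training(lambda: [0] * 100_000)
print(f"training time: {delta_t:.4f} s, peak memory: {peak_mb:.2f} MB")
```

In practice, `train_fn` would wrap the training call of each candidate model so that all models are measured in the same way.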

6. Experiments and Results

The experiment is based on live 5G RAN network traffic, with the system setup shown in Table 2. An experimental Linux server is directly connected to the operation and support system (OSS) servers through the northbound interface for data collection. This physical setup facilitates interaction with base stations (gNodeB), which can help fine-tune configuration parameters based on environment self-learning capabilities in future research studies. The experiment focuses on new radio (NR) cells only, as 5G is becoming the center of network intelligence. The data are collected over three months. The study leverages Apache Spark for in-memory processing with big data analysis capability.

We use the 5G RAN KPIs, as shown in Table 3. The data are obtained from different vendors’ OSS in the same network. KPIs are standardized to obtain the global network view, extracting data for all network cells. Twenty-four radio KPIs are used for the experiment. Python is used for processing steps, and the Scikit-learn library [49] is leveraged for modeling purposes. Table 3 shows that two columns have 2.196% missing values. The table also provides the correlation coefficients between the two columns with missing values and the rest of the KPIs. There is a strong correlation between the average_sinr and the average_rsrp. Since both features have the same missing values, we exclude them from each other’s imputation process. Table 4 is obtained by applying the feature-extraction process described in the methodology to the correlation matrix dataset, using the average_rsrp as an example.

Table 4 shows the variables with the highest correlation to the average RSRP KPI using the 85th percentile of the RSRP data. The PUSCH MCS and the PUSCH QAM256 modulation ratio (the highest modulation scheme in 5G, providing a higher data rate) moderately correlate with the RSRP. The CQI and the PUSCH RSRP correlate positively but weakly with the average RSRP. The rest of the KPIs show even weaker correlations with the metric of interest. Uplink performance is directly linked to radio conditions in a wireless mobile network, which is supported to some extent by the correlation data. Therefore, we use the four KPIs to train regressors to help impute the RSRP. The first four observations of the RSRP dataset used for machine learning imputation are shown in Table 5.
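The percentile-based selection can be illustrated with the correlation coefficients listed in Table 3. The sketch below assumes that the threshold is the 85th percentile of the signed correlation coefficients with average_rsrp (with average_sinr excluded, as stated above); under that assumption, it yields the four KPIs of Table 4. The variable names are hypothetical.

```python
import numpy as np

# Correlation of each KPI with average_rsrp, copied from Table 3.
corr_with_rsrp = {
    "pdsch_qam256_mod_ratio": 0.247, "pusch_qam256_mod_ratio": 0.464,
    "active_users": 0.176, "rrc_users_connected_max": 0.132,
    "Time_advance_coverage": -0.294, "cqi": 0.381,
    "pdsch_mcs_mean": 0.278, "pusch_mcs_mean": 0.494,
    "site_availability": 0.006, "Throughput_DL": 0.235,
    "user_throughput_DL": 0.050, "ibler_DL": -0.040,
    "prb_DL": 0.129, "qos_flow_setup_success_rate": 0.029,
    "rrc_setup_success_rate": 0.016, "service_drop_rate": -0.002,
    "ul_throughput": 0.309, "ul_avg_pusch_rsrp": 0.363,
    "ibler_UL": -0.165, "prb_UL": 0.040,
    "user_throughput_UL": 0.285, "ul_traffic_ratio_on_edge": -0.544,
}

# Assumed rule: keep KPIs at or above the 85th percentile of the coefficients.
threshold = np.percentile(list(corr_with_rsrp.values()), 85)
selected = sorted(
    (name for name, r in corr_with_rsrp.items() if r >= threshold),
    key=corr_with_rsrp.get,
    reverse=True,
)
print(round(float(threshold), 4), selected)
```

Because the threshold is derived from the data rather than fixed, the number of retained KPIs adapts to the correlation structure of the target being imputed.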

It is essential to understand the difference between the RSRP and the PUSCH RSRP. While both measure signal strength for evaluating wireless connection quality, they apply to different directions. The RSRP is a signal power measurement on the downlink (from the cell or base station to the user) and is essential for maintaining the connection with the network. The PUSCH RSRP is the power level measurement of the uplink signals sent from the users to the cell on the physical uplink shared channel.

The PUSCH_MCS indicates the modulation and coding schemes used for data transmission in the uplink channel. The modulation and coding chosen by the RAN depend on the mobile user’s radio conditions, interference, and overall channel quality. The QAM256 ratio, often referred to as the 256-QAM ratio, represents the rate of usage of the 256-QAM modulation scheme, which is crucial in enhancing the users’ radio experience. It provides a higher data rate, enhances capacity, and improves spectral efficiency. In an ideal network scenario, MNOs should aim for a high usage of 256-QAM. However, the signal-to-interference-plus-noise ratio (SINR) directly impacts the MCS. Hence, 256-QAM requires a consistently good signal (higher SINR), which can be challenging to achieve in areas with physical obstacles such as buildings and vegetation (trees), or with electrical devices creating interference.

The CQI is the actual measurement of the radio channel quality, which is, by definition, a crucial component in selecting the modulation and coding scheme. The user equipment computes the CQI and reports it to the base station (or cell) for resource management [50,51,52].

6.1. Imputing the Data: Task 1

As mentioned in Section 4, the imputation task is modeled as a regression problem; therefore, different error measurements are employed to assess the performance and select the best model for the task. Table 6 presents the performance of the different imputation approaches on the RSRP and the underlying hyperparameters. We can see that the traditional statistical imputation methods, including the mean, median, and mode, perform poorly in comparison with the ML-based techniques. The mean, median, and mode have negative $R^2$ scores, failing to explain the variance in the dependent variable across the dataset. With MAE > 4, their imputed values deviate further from the true values than those of the other techniques. An offset of 10 dB, 4 dB, or 3 dB in RSRP corresponds to a factor of about 10, 2.5, or 2 in received signal power, respectively, which can be critical in radio resource allocation [53]. The mode imputation displays the worst performance and cannot be used for the task. The Random Forest regression algorithm performs best (lowest MSE and MAE, highest $R^2$).
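The decibel figures quoted above follow from the standard logarithmic relation: a value of $\Delta$ on the decibel scale corresponds to $10^{\Delta/10}$ on the linear power scale. A minimal sketch (the function name `db_to_linear` is hypothetical):

```python
def db_to_linear(delta_db):
    """Convert a decibel value to the corresponding linear power value or ratio."""
    return 10 ** (delta_db / 10)

for delta in (10, 4, 3):
    print(f"{delta} dB -> {db_to_linear(delta):.2f}")
```

Because the scale is logarithmic, an MAE of a few decibels already represents a multiplicative error in received power rather than a small additive one.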

Using the pipeline, Random Forest is automatically selected as the best model for imputing our radio data. For model interpretability and feature reselection, we use the feature weight and gain (the feature importance) to understand the model’s behavior, as briefly described in the following subsection in the context of our study.
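Automatic model selection of this kind can be sketched with Scikit-learn. The example below is illustrative only: it uses synthetic data in place of the RAN KPIs, default hyperparameters rather than those of Table 6, and omits XGBoost to avoid an extra dependency. The selection criterion (highest $R^2$ on a held-out split) is an assumption based on the priority given to $R^2$ in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in for four selected KPIs and an RSRP-like target.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
target = (-100 + 3 * features[:, 0] + 2 * features[:, 1] ** 2
          + rng.normal(scale=0.5, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(random_state=0),
}

# Fit every candidate and keep the one with the highest held-out R2.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```

The selected model would then be used to predict the rows in which the target KPI is missing.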

6.2. Feature Importance and Selection

Feature importance plays a crucial role in understanding how variables (KPIs in our study) impact the model, thus explaining, to a certain degree, the components that have driven the selected model’s results. As shown in the experiment, only the four most strongly correlated KPIs are used for imputation. However, using more features in an ML process can increase the regressor’s complexity and make the model less efficient [55]. Therefore, determining feature importance helps us eliminate KPIs that contribute little to the model’s decision and keep only the most impactful ones (feature selection). We then retrain the algorithms using the reselected features. A comparison is drawn between the imputation tasks using the original and reselected features to make the final decision. As a model-debugging step, we also ensure that the model’s output aligns with the domain knowledge. The Random Forest implementation in the Scikit-learn library uses the mean decrease in impurity based on the Gini impurity index, shown in (17) [56].

$$\Gamma = \sum_{i = 1}^{n} P(i) (1 - P(i)) \tag{17}$$

where $\Gamma$ is the Gini impurity, $P(i)$ is the probability of a sample belonging to class $i$ for a classification problem, and $n$ here denotes the number of classes. For a regression problem, the impurity is measured with the MSE, given in (12), instead. So, given a feature $j$ used for the imputation task and a selected tree $t$ of the forest, the decrease in impurity $\Gamma(j, t)$ is calculated. Over all trees, the algorithm accumulates the decrease for each feature as $A(j) = \sum \Gamma(j, t)$ for $1 \leq t \leq T$, where $A(j)$ is the accumulated decrease in $\Gamma$ and $T$ is the total number of trees in the forest. $A(j)$ is then normalized by dividing it by $T$, $N(j) = \frac{A(j)}{T}$. The KPIs are then ranked in descending order based on $N(j)$. The output (KPI importance) is shown in Figure 3.
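The averaging of per-tree impurity decreases can be checked against Scikit-learn's `feature_importances_` attribute. The sketch below uses synthetic data, so the resulting ranking is illustrative and does not reproduce Figure 3; the KPI names are reused only as labels, and the coefficients of the synthetic target are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

kpi_names = ["pusch_mcs_mean", "pusch_qam256_mod_ratio",
             "cqi", "ul_avg_pusch_rsrp"]

# Synthetic data: the second column is made the dominant driver on purpose.
rng = np.random.default_rng(0)
features = rng.normal(size=(400, 4))
target = (4 * features[:, 1] + 1 * features[:, 3]
          + rng.normal(scale=0.1, size=400))

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(features, target)

# Per-tree impurity-based importances, averaged over the T trees.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# Rank the KPIs in descending order of importance.
ranking = sorted(zip(kpi_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Each tree's importances are already normalized to sum to one, so the forest-level values can be compared directly across features.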

The PUSCH QAM256 modulation ratio is the feature with the most significant impact in imputing the average RSRP. Domain knowledge is crucial in evaluating the model output at this stage. For example, radio channel conditions (RSRP, SINR, CQI) are fundamental in determining the MCS. Good radio conditions imply a higher modulation scheme, leading to more bits per symbol (higher throughput) [57]. Therefore, the model’s feature importance aligns with the general mobile communication radio resource management procedures. The PUSCH QAM256 modulation ratio is followed by the uplink PUSCH RSRP, measured at the cell level, the PUSCH MCS, and the CQI. We can see that the model’s feature importance does not contradict the domain knowledge, a crucial model debugging step.

6.3. Imputing the Data: Task 2

The feature-reduction method is finally employed to optimize the models. All selected imputation methods are retrained using the two most impactful features, as presented in Figure 3. The output of the second training task is shown in Table 7. The SVR algorithm has provided the best performance using the new feature subset. However, comparing Table 7 to Table 6, we can see that the feature reduction or reselection technique has not improved the imputation task performance. Hence, all the initially selected metrics remain essential in modeling the RSRP missing values.

6.4. Algorithms’ Complexity Analysis

The complexity analysis of the five trained models in terms of execution time and memory usage is presented in Figure 4. The figure shows the memory usage of each model over the execution period (the black line plot). We can identify spikes in memory usage and the peak memory utilization (red line intersection point) for each model.

SVR registered the highest memory utilization spike, reaching 211 MB, followed by XGBoost (178 MB) and Random Forest (152 MB). Linear Regression showed the lowest peak memory utilization, 145 MB. The XGBoost algorithm had the longest training execution time, 3.6 s, while Linear Regression and KNN had the lowest training time, 2.2 s. The Random Forest execution time is 3.2 s, a 45% increase compared to the lowest execution time and 11% less than the longest execution time. Based on the resource availability (refer to Table 2), Random Forest remains the model of choice as the difference in memory utilization and execution time is insignificant, and we have enough computational power to train all the models.
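The relative differences quoted above can be verified from the reported training times. In this short sketch, only the times stated in the text are included (the SVR training time is not given), and the variable names are hypothetical.

```python
# Training times in seconds, as reported in the text.
train_time_s = {"Linear Regression": 2.2, "KNN": 2.2,
                "Random Forest": 3.2, "XGBoost": 3.6}

fastest = min(train_time_s.values())
slowest = max(train_time_s.values())
rf_time = train_time_s["Random Forest"]

increase_vs_fastest = (rf_time - fastest) / fastest * 100
decrease_vs_slowest = (slowest - rf_time) / slowest * 100
print(f"{increase_vs_fastest:.1f}% slower than the fastest model")
print(f"{decrease_vs_slowest:.1f}% faster than the slowest model")
```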

6.5. Imputing the Actual Missing Values on the RAN Data

The Random Forest model is used as the final model to impute the 2.196% missing values on the RSRP. Compared to a constant statistical imputation method, the trained Random Forest model explains up to 32% of the variance in the RSRP, the best result among the evaluated methods. Figure 5 shows the distribution of the imputed missing values.

We can see that only 31 out of 110 missing values match the mean and median values, and 13 out of 110 match the mode value. To observe the influence of the various imputation techniques on the actual missing values rows, we use Table 8, Table 9 and Table 10. The objective is to use domain knowledge to analyze the imputed values against the other metrics. Do the imputed values make sense?

Table 8 shows the top 10 best imputed RSRP values (> −90 dBm per the model output). Let us look at Cell ID 8, which has the highest PUSCH QAM256 modulation ratio. Its CQI is also high, signaling possibly good channel quality (the CQI ranges between 0 and 15, where 15 indicates the best channel quality). The PUSCH RSRP of the cell is also high (−88.8 dBm), meaning strong uplink signals are sent by users on the PUSCH channel of the cell. We should expect a good average RSRP in cases of good radio conditions, which is more or less what the imputed value reports. However, we can also have cases of higher RSRP and lower PUSCH RSRP when users connect far away from the cell. The MNO’s goal should be to optimize the connection quality of both streams.

Cell ID 110 in Table 7 reports a PUSCH QAM256 modulation ratio of 0.002, which is very low. The cell cannot use QAM256. Its CQI is below 7 (4.293), and the PUSCH RSRP is −119 dBm, indicating weak average uplink signal strength—a case of a potentially poor radio condition. In this case, we can expect a low average RSRP, shown by the ML-imputed value. The distribution of the ML-imputed values highlights the relationship between radio quality-related metrics more accurately than the other methods.

Table 8 shows that the uplink PUSCH RSRP and the ML-imputed average RSRP correlate more strongly than is the case with the statistical imputation techniques. Using Table 8, Table 9 and Table 10, we can conclude that our ML method with correlation analysis provides a better imputation result.

The RAN data can now be used for any analytical modeling and decisions.

7. Conclusions

Missing data in wireless RAN measurement data can affect analytics processes and lead to unreliable decisions. Handling them will always require effective techniques to maintain the structure of the data. This paper presented a machine learning-based approach to handling missing values in wireless radio access performance data. The study’s approach strongly emphasized dynamic correlation analysis for feature extraction or selection. Five ML algorithms were trained on the RAN KPIs using the RSRP as the dependent variable. The results were compared with the traditional statistical imputation methods, including the mean, median, and mode. The study showed that the statistical methods failed to capture the variation in the dependent variable. ML-based techniques provided better performance than the statistical ones. The Random Forest regression model outperformed all other selected models. Because of its performance, it was used as the ML model for imputing the RSRP missing values. Domain-level analysis was used to interpret the imputed values and to rate the method used. Based on this study and the need to move towards self-organizing wireless mobile networks, we do not encourage the use of statistical imputation techniques on radio performance measurements. Instead, we recommend using ML/AI methods to handle missing values.

Author Contributions

Conceptualization, J.N.M.D. and K.A.O.; methodology, J.N.M.D.; software, J.N.M.D.; validation, K.A.O.; formal analysis, J.N.M.D.; investigation, J.N.M.D. and K.A.O.; resources, J.N.M.D. and K.A.O.; data curation, J.N.M.D.; writing—original draft preparation, J.N.M.D.; writing—review and editing, K.A.O.; visualization, J.N.M.D.; supervision, K.A.O.; project administration, K.A.O.; funding acquisition, K.A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Johannesburg, URC-GRANT-2022, grant number KAO2019, and the APC was funded by the University of Johannesburg Library, APC Fund.

Data Availability Statement

Given the proprietary nature of the operator’s information, the supporting data cannot be openly accessible at the moment. However, for research purposes, the data of this study can be obtained from the corresponding author on request and subject to Non-Disclosure Agreement.

Acknowledgments

This work was partly supported by University of Johannesburg’s University Research Committee (URC) grant, the Department of Electrical and Electronic Engineering, and the South African National Research Funds (NRF).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Li, X.; Samaka, M.; Chan, H.A.; Bhamare, D.; Gupta, L.; Guo, C.; Jain, R. Network Slicing for 5G: Challenges and Opportunities. IEEE Internet Comput. 2017, 21, 20–27.
2. Anoh, K.; See, C.; Dama, Y.; Abd-Alhameed, R.; Keates, S. 6G Wireless Communication Systems: Applications, Opportunities and Challenges. Future Internet 2022, 14, 379.
3. Lin, M.; Zhao, Y. Artificial intelligence-empowered resource management for future wireless communications: A survey. China Commun. 2020, 17, 58–77.
4. Sun, Y.; Peng, M.; Zhou, Y.; Huang, Y.; Mao, S. Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues. IEEE Commun. Surv. Tutor. 2019, 21, 3072–3108.
5. Soley-Bori, M. Dealing with Missing Data: Key Assumptions and Methods for Applied Analysis; Boston University: Boston, MA, USA, 2013; Volume 4, pp. 1–19.
6. Li, Y.; Dogan, A.; Liu, C. Ensemble Generative Adversarial Imputation Network with Selective Multi-Generator (ESM-GAIN) for Missing Data Imputation. In Proceedings of the 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), Mexico City, Mexico, 20–24 August 2022; pp. 807–812.
7. Kabir, G.; Tesfamariam, S.; Hemsing, J.; Sadiq, R. Handling incomplete and missing data in water network database using imputation methods. Sustain. Resilient Infrastruct. 2019, 5, 365–377.
8. Chhabra, G.; Vashisht, V.; Ranjan, J. A Comparison of Multiple Imputation Methods for Data with Missing Values. Indian J. Sci. Technol. 2017, 10, 1–7.
9. Zhang, Y.; Thorburn, P.J. Handling missing data in near real-time environmental monitoring: A system and a review of selected methods. Futur. Gener. Comput. Syst. 2021, 128, 63–72.
10. Kim, M.; Baek, S.; Ligaray, M.; Pyo, J.; Park, M.; Cho, K.H. Comparative studies of different imputation methods for recovering streamflow observation. Water 2015, 7, 6847–6860.
11. Ratolojanahary, R.; Ngouna, R.H.; Medjaher, K.; Junca-Bourié, J.; Dauriac, F.; Sebilo, M. Model selection to improve multiple imputation for handling high rate missingness in a water quality dataset. Expert Syst. Appl. 2019, 131, 299–307.
12. Van Buuren, S.; Oudshoorn, K.G. Mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67.
13. Liu, H.; Chen, C.; Li, Y.; Duan, Z.; Li, Y. Characteristic and correlation analysis of metro loads. In Smart Metro Station Systems: Data Science and Engineering; Elsevier: Amsterdam, The Netherlands, 2022; pp. 237–267.
14. Zhi, X.; Yuexin, S.; Jin, M.; Lujie, Z.; Zijian, D. Research on the Pearson correlation coefficient evaluation method of analog signal in the process of unit peak load regulation. In Proceedings of the 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), Yangzhou, China, 20–22 October 2017; pp. 522–527.
15. Sharma, P.; Petit, J.; Liu, H. Pearson Correlation Analysis to Detect Misbehavior in VANET. In Proceedings of the 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), Chicago, IL, USA, 27–30 August 2018.
16. Coscia, M.; Mendez-Bermudez, A. Pearson correlations on complex networks. J. Complex Netw. 2021, 9, 1–14.
17. Kuhn, M.; Johnson, K. Handling Missing Data. In Feature Engineering and Selection: A Practical Approach for Predictive Models; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2020.
18. Agiwal, M.; Roy, A.; Saxena, N. Next Generation 5G Wireless Networks: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2016, 18, 1617–1655.
19. Chin, W.H.; Fan, Z.; Haines, R. Emerging technologies and research challenges for 5G wireless networks. IEEE Wirel. Commun. 2014, 21, 106–112.
20. Peng, M.; Li, Y.; Zhao, Z.; Wang, C. System architecture and key technologies for 5G heterogeneous cloud radio access networks. IEEE Netw. 2015, 29, 6–14.
21. Zhang, J.; Bjornson, E.; Matthaiou, M.; Ng, D.W.K.; Yang, H.; Love, D.J. Prospective Multiple Antenna Technologies for beyond 5G. IEEE J. Sel. Areas Commun. 2020, 38, 1637–1660.
22. Jaber, M.; Imran, M.A.; Tafazolli, R.; Tukmanov, A. A Distributed SON-Based User-Centric Backhaul Provisioning Scheme. IEEE Access 2016, 4, 2314–2330.
23. Simsek, M.; Bennis, M.; Güvenç, I. Learning Based Frequency- and Time-Domain Inter-Cell Interference Coordination in HetNets. IEEE Trans. Veh. Technol. 2014, 64, 4589–4602.
24. Xu, L.; Nallanathan, A. Energy-Efficient Chance-Constrained Resource Allocation for Multicast Cognitive OFDM Network. IEEE J. Sel. Areas Commun. 2016, 34, 1298–1306.
25. Vu, H.V.; Le-Ngoc, T. Underlaid FD D2D Communications in Massive MIMO Systems via Joint Beamforming and Power Allocation. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021; pp. 1–6.
26. Fan, C.; Li, B.; Zhao, C.; Guo, W.; Liang, Y.-C. Learning-Based Spectrum Sharing and Spatial Reuse in mm-Wave Ultradense Networks. IEEE Trans. Veh. Technol. 2017, 67, 4954–4968.
27. Chen, M.; Saad, W.; Yin, C. Virtual Reality Over Wireless Networks: Quality-of-Service Model and Learning-Based Resource Management. IEEE Trans. Commun. 2018, 66, 5621–5635.
28. Zhu, K.; Zhang, Z.; Sun, F.; Shen, B. Workflow Makespan Minimization for Partially Connected Edge Network: A Deep Reinforcement Learning-Based Approach. IEEE Open J. Commun. Soc. 2022, 3, 518–529.
29. Dong, W.; Fong, D.Y.T.; Yoon, J.-S.; Wan, E.Y.F.; Bedford, L.E.; Tang, E.H.M.; Lam, C.L.K. Generative adversarial networks for imputing missing data for big data clinical research. BMC Med. Res. Methodol. 2021, 21, 78.
30. Wu, X.; Akbarzadeh Khorshidi, H.; Aickelin, U.; Edib, Z.; Peate, M. Imputation techniques on missing values in breast cancer treatment and fertility data. Health Inf. Sci. Syst. 2019, 7, 19.
31. Sankepally, S.R.; Kosaraju, N.; Rao, K.M. Data Imputation Techniques: An Empirical Study using Chronic Kidney Disease and Life Expectancy Datasets. In Proceedings of the 2022 International Conference on Innovative Trends in Information Technology, Kottayam, India, 12–13 February 2022; pp. 1–7.
32. Epp-Stobbe, A.; Tsai, M.C.; Klimstra, M. Comparison of Imputation Methods for Missing Rate of Perceived Exertion Data in Rugby. Mach. Learn. Knowl. Extr. 2022, 4, 827–838.
33. Chhabra, G.; Vashisht, V.; Ranjan, J. A Classifier Ensemble Machine Learning Approach to Improve Efficiency for Missing Value Imputation. In Proceedings of the 2018 International Conference on Computing, Power and Communication Technologies, Greater Noida, India, 28–29 September 2018; pp. 23–27.
34. Chaudhry, A.; Li, W.; Basri, A.; Patenaude, F. On improving imputation accuracy of LTE spectrum measurements data. In Proceedings of the 2018 Wireless Telecommunications Symposium (WTS), Phoenix, AZ, USA, 17–20 April 2018; pp. 1–7.
35. Little, R.J.A. Missing-data Adjustments in Large Surveys. J. Bus. Econ. Stat. 1988, 6, 287–296.
36. Lin, J.-Q.; Wu, H.-C.; Chan, S.-C. A New Regularized Recursive Dynamic Factor Analysis with Variable Forgetting Factor and Subspace Dimension for Wireless Sensor Networks with Missing Data. IEEE Trans. Instrum. Meas. 2021, 70, 9509713.
37. Arshad, K.; Ali, R.F.; Muneer, A.; Aziz, I.A.; Naseer, S.; Khan, N.S.; Taib, S.M. Deep Reinforcement Learning for Anomaly Detection: A Systematic Review. IEEE Access 2022, 10, 124017–124035.
38. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics (SSS); Springer: Stanford, CA, USA, 2009.
39. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: London, UK, 2017.
40. Hirose, H.; Soejima, Y.; Hirose, K. NNRMLR: A Combined Method of Nearest Neighbor Regression and Multiple Linear Regression. In Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, Fukuoka, Japan, 20–22 September 2012; pp. 351–356.
41. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
42. Gupta, G.; Rathee, N. Performance comparison of Support Vector Regression and Relevance Vector Regression for facial expression recognition. In Proceedings of the 2015 International Conference on Soft Computing Techniques and Implementations (ICSCTI), Faridabad, India, 8–10 October 2015; pp. 1–6.
43. Lei, H.; Guoxing, Y.; Chao, H. A sparse algorithm for adaptive pruning least square support vector regression machine based on global representative point ranking. J. Syst. Eng. Electron. 2021, 32, 151–162.
44. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
45. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 6–10 August 2016.
46. Wen, H.-T.; Wu, H.-Y.; Liao, K.-C. Using XGBoost Regression to Analyze the Importance of Input Features Applied to an Artificial Intelligence Model for the Biomass Gasification System. Inventions 2022, 7, 126.
47. Obiora, C.N.; Ali, A.; Hasan, A.N. Implementing Extreme Gradient Boosting (XGBoost) Algorithm in Predicting Solar Irradiance. In Proceedings of the 2021 IEEE PES/IAS PowerAfrica, Nairobi, Kenya, 23–27 August 2021.
48. Taunk, K.; De, S.; Verma, S.; Swetapadma, A. A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. In Proceedings of the 2019 International Conference on Intelligent Computing and Control Systems (ICCS), Madurai, India, 15–17 May 2019.
49. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
50. 3GPP. 5G NR: Multiplexing and Channel Coding, 3GPP TS 38.212 version 15.2.0; Release 15; ETSI: Valbonne, France, 2018.
51. 3GPP. 5G NR: Physical Channels and Modulation, 3GPP TS 38.211 version 16.2.0; Release 16; ETSI: Valbonne, France, 2020.
52. 3GPP. 5G NR: Physical Layer Measurements, 3GPP TS 38.215 version 16.2.0; Release 16; ETSI: Valbonne, France, 2020.
53. Selvam, P.D.; Vishvaksenan, K.S. Antenna Selection and Power Allocation in Massive MIMO. Radioengineering 2019, 27, 340–346.
54. Scikit-Learn. User Guide. Available online: https://scikit-learn.org/ (accessed on 15 March 2023).
55. Lin, L.; Wang, D.; Zhao, S.; Chen, L.; Huang, N. Power Quality Disturbance Feature Selection and Pattern Recognition Based on Image Enhancement Techniques. IEEE Access 2019, 7, 67889–67904.
56. Nembrini, S.; König, I.R.; Wright, M.N. The revival of the Gini importance? Bioinformatics 2018, 34, 3711–3718.
57. Afroz, F.; Subramanian, R.; Heidary, R.; Sandrasegaran, K.; Ahmed, S. SINR, RSRP, RSSI and RSRQ Measurements in Long Term Evolution Networks. Int. J. Wirel. Mob. Netw. 2015, 7, 113–123.

Figure 1. Simplified pipeline model of the study.

Figure 2. A simple model of the correlation matrix of X; (a) Correlation matrix with all the features; (b) Created dataframe or matrix F from the correlation matrix.

Figure 3. Random Forest Regressor’s feature importance.

Figure 4. Memory and execution time analysis of the trained algorithms. (a) Linear Regression resource utilization, (b) SVR resource utilization, (c) Random Forest resource utilization, (d) XGBoost resource utilization, and (e) KNN resource utilization.

Figure 5. Distribution of imputed average RSRP.

Table 1. A non-exhaustive summary of ML algorithms.

ML Category	Tasks	Algorithms
Supervised Learning	Classification	Decision Trees, Support Vector Machines (SVM), Random Forest, Neural Networks, K-Nearest Neighbors (KNN), XGBoost
	Regression	Linear Regression, Support Vector Regression (SVR), Random Forest, Neural Networks, K-Nearest Neighbors (KNN), XGBoost
Unsupervised Learning	Clustering	K-means, Hierarchical Clustering, DBSCAN, Mean Shift Clustering, Spectral Clustering, Gaussian Mixture Models (GMM), Self-Organizing Maps (SOM)
	Dimensionality reduction	Principal Component Analysis (PCA) Linear Discriminant Analysis (LDA), t-SNE, Autoencoders
Reinforcement Learning	Exploration Strategies	Thompson Sampling, Upper Confidence Bound (UCB), Epsilon-Greedy
	Value Estimation	Q-Learning, Deep Q-Networks, State-Action-Reward-State-Action (SARSA)
	Policy Evaluation	Monte Carlo, TD(λ) and Eligibility Traces, Temporal Difference Learning
Semi-Supervised Learning	Classification	Graph-Based Methods, Generative Models, Self-Training, Manifold Regularization

Table 2. Experiment system setup parameters.

Area/Section	Parameter	Description
Network platform	RAN	OSS server
	Technology focus	5G
Data collection parameters	Data Format	KPI counters and user traces
	Raw data aggregation time	Hourly
	Raw data aggregation level	Cell
	Data period	3 months
Data collection method	Connectivity	Direct connect to OSS platforms (NBI)
Hardware	Physical server (VM can be used as well)	64 GB RAM 2 TB Storage i9 Processor–10 cores, 64-bit OS
Software	Operating system	Linux-Ubuntu
	Data-processing language	Python
	Data-processing framework	Apache Spark

The NBI (northbound interface) is the RAN and the OSS communication interface. The NBI can be used for many network tasks. In our experiment, we use it to interact with RAN nodes to collect performance metrics.

Table 3. Selected RAN performance data KPIs.

Features	% Missing Values	Average_rsrp_corr	Average_sinr_corr	Description
site identifier	0.00	N/A	N/A	Site identifier in the network
pdsch_qam256_mod_ratio	0.00	0.247	0.357	Usage ratio of 256-QAM (quadrature amplitude modulation) on the physical downlink shared channel
pusch_qam256_mod_ratio	0.00	0.464	0.261	Usage ratio of 256-QAM (quadrature amplitude modulation) on the physical uplink shared channel
active_users	0.00	0.176	0.125	The average number of actively connected users
rrc_users_connected_max	0.00	0.132	0.070	The maximum number of RRC connections
Time_advance_coverage	0.00	−0.294	−0.273	Average timing advance in meters
cqi	0.00	0.381	0.401	Channel quality indicator
pdsch_mcs_mean	0.00	0.278	0.359	The physical downlink shared channel’s mean modulation coding scheme
pusch_mcs_mean	0.00	0.494	0.312	The physical uplink shared channel’s mean modulation coding scheme
site_availability	0.00	0.006	−0.011	Availability of the site
Throughput_DL	0.00	0.235	0.235	Site throughput in Mbps (data rate) on the downlink
user_throughput_DL	0.00	0.050	0.149	User throughput in Mbps (data rate) on the downlink
ibler_DL	0.00	−0.040	−0.223	Instantaneous block error rate (%) on the downlink
prb_DL	0.00	0.129	0.060	Physical resource block utilization on the downlink
qos_flow_setup_success_rate	0.00	0.029	−0.007	Quality of service (QoS) flow setup success rate
rrc_setup_success_rate	0.00	0.016	−0.018	The setup success rate of radio resource control
service_drop_rate	0.00	−0.002	−0.069	Data service drop rate (%)
ul_throughput	0.00	0.309	0.196	Cell uplink throughput in Mbps (data rate)
ul_avg_pusch_rsrp	0.00	0.363	0.186	Uplink mean PUSCH RSRP (dBm)
ibler_UL	0.00	−0.165	−0.182	Site instantaneous block error rate (%) on the uplink
prb_UL	0.00	0.040	0.049	Physical resource block utilization on the uplink
user_throughput_UL	0.00	0.285	0.181	User uplink throughput in Mbps (data rate)
ul_traffic_ratio_on_edge	0.00	−0.544	−0.261	Edge traffic ratio on the uplink
average_rsrp	2.196	1.000	0.682	Average reference signal received power (dBm)
average_sinr	2.196	0.682	1.000	Average signal-to-interference-plus-noise ratio (dB)

RSRP = reference signal received power; RRC = radio resource control.
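
The two numeric quantities reported in Table 3, the share of missing entries per KPI and the correlation of each KPI with a target, can be sketched in plain Python as follows. The toy values and the helper names `pct_missing` and `pearson_complete` are hypothetical. The text does not say how rows with a missing target were treated when the correlations were computed, so dropping incomplete pairs is an assumption made here.

```python
import math

def pct_missing(values):
    """Percentage of entries that are missing (represented as None)."""
    return 100.0 * sum(v is None for v in values) / len(values)

def pearson_complete(x, y):
    """Pearson correlation using only rows where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)

# Toy hourly samples (illustrative values only, not the study's data).
cqi = [7.6, 9.6, 9.3, 10.5, 8.1]
average_rsrp = [-99.0, -98.2, None, -94.3, -97.5]

missing = pct_missing(average_rsrp)         # one of five entries is missing
corr = pearson_complete(cqi, average_rsrp)  # computed on the four complete rows
```

On a real KPI export the same two steps would be run once per feature column against each target (`average_rsrp` and `average_sinr`).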

Table 4. RSRP correlation for the 85th percentile.

Variables	ave_rsrp
pusch_mcs_mean	0.494
pusch_qam256_mod_ratio	0.464
cqi	0.381
ul_avg_pusch_rsrp	0.363
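
The percentile-based selection behind Table 4 can be illustrated with the RSRP correlations listed in Table 3. This is a minimal sketch, not the authors' code: it assumes the 85th-percentile threshold is applied to the signed coefficients (so the strongly negative `ul_traffic_ratio_on_edge` is not selected), that the percentile uses linear interpolation between ranks, and that the second target, `average_sinr`, is excluded from the candidate set. The helper `percentile` is a hypothetical name. Under these assumptions the selection coincides with the four features of Table 4.

```python
# Signed correlations with average_rsrp, copied from Table 3
# (site identifier and the second target, average_sinr, are left out).
rsrp_corr = {
    "pdsch_qam256_mod_ratio": 0.247,
    "pusch_qam256_mod_ratio": 0.464,
    "active_users": 0.176,
    "rrc_users_connected_max": 0.132,
    "Time_advance_coverage": -0.294,
    "cqi": 0.381,
    "pdsch_mcs_mean": 0.278,
    "pusch_mcs_mean": 0.494,
    "site_availability": 0.006,
    "Throughput_DL": 0.235,
    "user_throughput_DL": 0.050,
    "ibler_DL": -0.040,
    "prb_DL": 0.129,
    "qos_flow_setup_success_rate": 0.029,
    "rrc_setup_success_rate": 0.016,
    "service_drop_rate": -0.002,
    "ul_throughput": 0.309,
    "ul_avg_pusch_rsrp": 0.363,
    "ibler_UL": -0.165,
    "prb_UL": 0.040,
    "user_throughput_UL": 0.285,
    "ul_traffic_ratio_on_edge": -0.544,
}

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

# Threshold is recomputed from the data, which is what makes the selection dynamic.
threshold = percentile(rsrp_corr.values(), 85)

# Keep features at or above the threshold, strongest correlation first.
selected = sorted(
    (name for name, r in rsrp_corr.items() if r >= threshold),
    key=rsrp_corr.get,
    reverse=True,
)
```

Because the threshold is derived from the correlation distribution rather than fixed in advance, the same code would select a different feature set for the SINR target or for another dataset.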

Table 5. First four samples of the RSRP dataset for modeling after extracting the correct features.

pusch_mcs_mean	pusch_qam256_mod_ratio	cqi	ul_avg_pusch_rsrp	ave_rsrp
6.873	0.357	7.636	−126.659	−98.9703
9.356	2.85	9.576	−123.261	−98.1912
8.646	1.973	9.341	−124.392	−95.5518
10.533	3.802	9.437	−122.878	−94.2947

Table 6. Imputation process evaluation metrics and considered hyperparameters.

Imputation Model	MSE	R²	MAE	Hyperparameter Considerations
Mean	33.792	−0.001	4.447
Median	33.734	0	4.439
Mode	146.193	−3.334	10.92
Linear Regression	22.768	0.325	3.439
Support Vector Regression	21.42	0.365	3.273	{Kernel: rbf, cost ($C$): 1, gamma ($γ$): scale}
Random Forest Regression	20.735	0.385	3.291	{Estimators ($n$): 100, depth: 10, criterion: MSE, n_jobs: −1}
XGBoost Regression	22.077	0.346	3.463	{Estimators ($n$): 100, depth: 10, learning rate ($α$): 0.1, loss function: MSE}
KNN Regression	23.876	0.292	3.657	{Neighbors ($n$): 5, weights: uniform, algorithm: auto, metric: Minkowski, n_jobs: −1}

For more details about the meaning and selection of each hyperparameter, refer to [49,54].
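
The three evaluation metrics in Table 6 can be written out directly, which also shows why constant imputations score an R² at or below zero: a constant equal to the mean of the evaluated values explains none of their variance, and any other constant does worse. The sketch below is an illustration in plain Python; the function names are hypothetical, and the four RSRP values from Table 5 are used only as a small stand-in for held-out ground truth.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Stand-in ground truth: the four ave_rsrp values shown in Table 5 (dBm).
y_true = [-98.9703, -98.1912, -95.5518, -94.2947]

# Constant imputation with the mean of the evaluated values: R² is zero.
mean_value = sum(y_true) / len(y_true)
r2_mean = r2(y_true, [mean_value] * len(y_true))

# Constant imputation with some other value (here −103 dBm): R² is negative.
r2_other = r2(y_true, [-103.0] * len(y_true))
```

In the experiments the constant is estimated on data other than the evaluated rows, which is consistent with the slightly negative R² reported for the mean and the strongly negative R² reported for the mode.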

Table 7. Evaluation errors of the imputation methods after feature reduction.

Imputation Model	MSE	R²	MAE
Mean	34.942	−0.0005	4.449
Median	35.035	−0.003	4.446
Mode	139.610	−2.997	10.695
Linear Regression	46.062	−0.318	5.053
Support Vector Regression	24.432	0.300	3.482
Random Forest Regression	24.681	0.293	3.560
XGBoost Regression	27.107	0.224	3.686
KNN Regression	26.645	0.237	3.820

Table 8. Illustration of rows imputed by our approach with values greater than −90.00 dBm.

cellID	pusch_mcs_mean	pusch_qam256_mod_ratio	cqi	ul_avg_pusch_rsrp	ave_rsrp_Imp_mean	ave_rsrp_Imp_median	ave_rsrp_Imp_mode	ave_rsrp_regression_imputed
1	7.660	11.615	8.348	−97.362	−92.641	−92.438	−103.000	−83.506
2	20.020	21.665	12.015	−114.418	−92.641	−92.438	−103.000	−85.357
3	19.075	22.359	11.834	−113.431	−92.641	−92.438	−103.000	−85.682
4	23.507	42.305	12.198	−108.205	−92.641	−92.438	−103.000	−85.954
5	4.970	6.672	2.724	−64.190	−92.641	−92.438	−103.000	−86.245
6	15.151	29.983	11.122	−115.717	−92.641	−92.438	−103.000	−86.306
7	8.562	10.732	11.995	−117.059	−92.641	−92.438	−103.000	−86.331
8	16.959	43.834	12.108	−88.833	−92.641	−92.438	−103.000	−86.843
9	22.524	15.333	11.931	−110.624	−92.641	−92.438	−103.000	−87.365
10	13.913	24.011	10.549	−107.021	−92.641	−92.438	−103.000	−88.066

The ave_rsrp_Imp_mean, ave_rsrp_Imp_median, and ave_rsrp_Imp_mode show the single imputation value using the mean, median, and mode, respectively. The ave_rsrp_regression_imputed shows the results of our approach.
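
The contrast between the constant columns and the regression column can be reproduced in miniature. The sketch below is illustrative only: it fits a one-feature least-squares line on the four complete rows of Table 5 and fills two rows whose `cqi` values are taken from Table 8 (cellID 1 and 5). The study itself uses four features and several regressors trained on the full dataset, so the numbers produced here are not expected to match Table 8. All variable names are hypothetical.

```python
import statistics

# Complete rows (cqi, ave_rsrp) taken from Table 5.
known = [
    (7.636, -98.9703),
    (9.576, -98.1912),
    (9.341, -95.5518),
    (10.533, -94.2947),
]
# cqi of two rows whose ave_rsrp is missing (cellID 1 and 5 in Table 8).
missing_cqi = [8.348, 2.724]

observed = [y for _, y in known]

# Single-value imputation: every missing row receives the same constant.
mean_fill = [statistics.mean(observed)] * len(missing_cqi)
median_fill = [statistics.median(observed)] * len(missing_cqi)

# Regression imputation (one-feature least squares, an assumption made
# for brevity): ave_rsrp is modeled as intercept + slope * cqi.
mean_x = statistics.mean([x for x, _ in known])
mean_y = statistics.mean(observed)
slope = sum((x - mean_x) * (y - mean_y) for x, y in known) / sum(
    (x - mean_x) ** 2 for x, _ in known
)
intercept = mean_y - slope * mean_x
regression_fill = [intercept + slope * c for c in missing_cqi]
```

The constant fills are identical for both rows, whereas the regression fill follows the row's own `cqi`, which is the behavior the last column of the table illustrates.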

Table 9. Illustration of rows imputed by our approach with values less than −105.00 dBm.

cellID	pusch_mcs_mean	pusch_qam256_mod_ratio	cqi	ul_avg_pusch_rsrp	ave_rsrp_Imp_mean	ave_rsrp_Imp_median	ave_rsrp_Imp_mode	ave_rsrp_regression_imputed
110	7.196	0.002	4.293	−119.803	−92.641	−92.438	−103.000	−110.813
109	14.372	0.000	8.146	−122.969	−92.641	−92.438	−103.000	−110.037
108	0.008	0.000	0.528	−129.995	−92.641	−92.438	−103.000	−109.912
107	0.464	0.005	1.254	−129.917	−92.641	−92.438	−103.000	−109.839
106	3.239	0.013	4.273	−124.270	−92.641	−92.438	−103.000	−109.249
105	0.776	0.000	1.233	−124.551	−92.641	−92.438	−103.000	−109.237
104	1.681	0.019	2.092	−129.746	−92.641	−92.438	−103.000	−108.788
103	0.161	0.000	0.305	−114.400	−92.641	−92.438	−103.000	−108.555
102	0.055	0.038	2.799	−129.899	−92.641	−92.438	−103.000	−108.317
101	3.041	0.000	6.756	−127.629	−92.641	−92.438	−103.000	−108.306

Table 10. Illustration of rows imputed by our approach, ordered by UL PUSCH RSRP.

cellID	pusch_mcs_mean	pusch_qam256_mod_ratio	cqi	ul_avg_pusch_rsrp	ave_rsrp_Imp_mean	ave_rsrp_Imp_median	ave_rsrp_Imp_mode	ave_rsrp_regression_imputed
34	2.948	0.116	3.441	−96.942	−92.641	−92.438	−103	−94.336
82	8.955	0.407	6.116	−97.195	−92.641	−92.438	−103	−101.174
1	7.66	11.615	8.348	−97.362	−92.641	−92.438	−103	−83.506
40	8.425	1.531	7.863	−98.187	−92.641	−92.438	−103	−95.006
42	6.261	1.699	5.762	−99.494	−92.641	−92.438	−103	−95.207
79	0.001	0	0.15	−103.901	−92.641	−92.438	−103	−100.644
83	6.329	0.004	2.342	−105.162	−92.641	−92.438	−103	−101.463
30	3.715	0.075	2.385	−106.646	−92.641	−92.438	−103	−93.83
10	13.913	24.011	10.549	−107.021	−92.641	−92.438	−103	−88.066
65	2.258	0.373	2.14	−107.225	−92.641	−92.438	−103	−98.381

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
