1. Introduction
In recent years, the number of smart devices and applications has increased exponentially, so it is predicted that by 2025 the number of devices connected to the network will reach 30 billion [1]. This growth means that users’ expectations regarding security, real-time operation, and privacy protection should be considered in addition to properly executing tasks or monitoring systems. The wide range of applications requires diverse network characteristics, from physical and transmission technologies to routing and transport protocols capable of supporting different service sets.
Network analytics lets network operators explore various practical models to troubleshoot configuration issues, enhance network efficiency, cut operational costs, identify potential security threats, and plan the development of the network. For instance, Nokia has proposed the Network Data Analytics Function (NWDAF) [2], a network analytics engine capable of analyzing parameters in different circumstances for performance optimization and capacity planning, credential misuse detection, and cloud security. Quality-of-Service (QoS) metrics are usually seen as performance indicators of the network status, and network operators employ different methods to enhance the user experience by improving them. For instance, resource allocation and power control in 5G cellular networks can guarantee a robust, low-delay connection for Device-to-Device (D2D) communication [3]. One of the most tangible QoS metrics directly related to the user experience and perceived quality is the End-to-End (E2E) delay. The E2E delay is the time needed to transfer a packet from one endpoint to another, i.e., the time between the instant the transmission starts at the source node and the instant the packet is ultimately received at the destination node.
From a network management viewpoint, it is essential to identify the network E2E delay profile so that its suitability for supporting different delay-constrained services can be assessed over time. For instance, in 5G networks, Ultra-Reliable Low Latency Communication (URLLC) applications demand low latency networks [4], while other kinds of applications, such as opportunistic sensing, do not have such requirements. Due to the different requirements of each service in terms of throughput, reliability, and time sensitivity, it is crucial to know the probabilistic features of the E2E delay. Using probabilistic models to determine the E2E delay distribution is important to support different delay management strategies [5].
Different methodologies have been proposed to model the E2E delay. Queuing Theory-based (QT) models were proposed to compute the mean E2E delay by considering the stochastic properties of the queue’s arrival and departure random processes [6,7,8]. Although QT models are very popular in the literature for modeling and predicting delay, they are strongly dependent on the statistics of the queue’s arrival and departure random processes, which can easily change over time due to the variety of applications and generated traffic patterns. Additionally, QT models only allow the computation of expected values (e.g., expected queue delay or expected delay), not providing any insight into the distribution of the delay. Network Calculus-based (NC) models were also proposed in the literature [9,10,11]. NC models allow the computation of a delay bound for a flow traversing the network [12]. Although delay bounds are useful for time-sensitive scenarios, such as the parameterization of timeout values, they do not provide as detailed a description as the one obtained from estimating the delay distribution.
In a nutshell, QT models support long-term management based on the arrival and service random processes and only allow the computation of expected values (e.g., expected queue delay or expected delay); they do not provide any insight into the distribution of the delay. NC models allow unforeseen events and problems to be handled by determining worst-case bounds and preparing for these situations. However, similar to QT models, NC models do not characterize the distribution of the delay. In contrast, the model proposed in our work focuses on a more detailed estimation of the E2E delay: instead of estimating a bound or an expected value of the delay, the goal is to estimate the distribution of the delay.
When characterizing the E2E delay through various distribution models, the critical challenge is determining which distributions represent the experimental data collected over time. Due to the network’s heterogeneous nature, mainly caused by the diversity of radio management policies and Core network technologies available in 5G networks [13], the characterization of the E2E delay involves the parameterization of different delay patterns that might change significantly over time. Consequently, the E2E delay often does not follow a single known Probability Density Function (PDF) but a mixture of them. This diversity motivates a modeling approach based on probabilistic mixture models that combine two or more distributions to increase the model’s accuracy. To this end, the research question addressed in this work is centered on how the distribution of the E2E delay can be accurately estimated in a short amount of time. The scientific hypothesis explored in this work aims at evaluating the feasibility of estimating the E2E delay distribution through a Gaussian Mixture Model (GMM) [14]. The main question to be answered is how the multiple parameters of the GMM, such as the number of GMM components and the number of samples adopted in the estimation process, influence the estimation accuracy and its computation time. An open problem in GMM is the selection of the methodology that should be used to obtain its optimal parameters for a given number of components. In Section 2, we provide a literature review of different methodologies capable of computing the GMM parameters, and their pros and cons are also identified.
The innovative aspects of this paper include the following contributions:
The identification of a methodology to estimate the distribution of the E2E delay based on 5G data obtained over time. The GMM is adopted to estimate the PDF of the E2E delay of 5G networks, considering both standalone and non-standalone operation and different network subsystems such as the Radio Access Network (RAN) or the Core network;
The characterization of the influence of the number of GMM components and the number of data samples on the estimation accuracy;
The evaluation of the GMM’s computation time as a function of the number of model components as well as the number of samples used as input;
An assessment of GMM’s accuracy versus its computation time, which allows the characterization of the tradeoff between both features.
The rest of this paper is organized as follows: Section 2 introduces the literature review on the parameterization of GMM. The estimation methodology is presented in Section 3. Section 4 introduces the 5G dataset and the different scenarios considered in the experiments evaluated in this work. Section 5 evaluates the estimation performance for the different experiment scenarios. Finally, Section 6 concludes the paper.
Regarding the notation adopted in this work, P[X] denotes the probability of X. Vectors are represented in upper case, upright boldface type.
2. Literature Review
A GMM is defined as a parametric probability density function that consists of a linear combination of multiple Gaussian distributions [15]. Different approaches to estimate the distributions’ parameters and weight values based on observed data include Maximum Likelihood Estimation (MLE), Expectation-Maximization (EM) [16], Minimum Message Length (MML), Moment Matching (MM), and Penalized Maximum Likelihood Expectation-Maximization (PML-EM) [14]. The MLE approach maximizes the likelihood function between a known distribution and the observed data. The MML method is an information-theoretic measure for statistical comparison. The MM method finds the unknown parameters by equating the expected values of the powers of the random variables of the population distribution model to the corresponding sample moments [17]. MM can be employed as an alternative to MLE in most complex problems due to its simple, easy, and fast computation. PML-EM is an approach to estimate the parameters in cases where the likelihood is relatively flat, which makes MLE determination difficult. EM is an iterative algorithm that maximizes the likelihood expectation between the data and a mixture of distributions. However, the EM convergence rate is influenced by the randomly chosen initialization values, and it is hard to define the number of distributions adopted in the mixture model and how they affect the accuracy of the approximation. EM’s dependency on the initialization values is one of the main causes of slow convergence, as indicated in [18].
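To make the EM iteration concrete, the following is a minimal sketch of EM for a one-dimensional GMM, assuming NumPy; the component count, tolerance, and random initialization strategy are illustrative choices and not the exact settings used later in this work.

```python
import numpy as np

def fit_gmm_em(x, K=3, max_iter=1000, tol=1e-6, seed=0):
    """Fit a one-dimensional GMM to the samples in x with a basic EM loop (illustrative)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Random initialization: uniform weights, means drawn from the data, common variance.
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, K, replace=False)
    var = np.full(K, np.var(x))
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each sample.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        ll = np.sum(np.log(dens.sum(axis=1)))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances from the responsibilities.
        nk = resp.sum(axis=0)
        w, mu = nk / n, (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Stop when the log-likelihood improvement falls below the threshold.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, var
```

The random initialization in this sketch is exactly the sensitivity discussed above: different draws of the initial means can change both the number of iterations and the quality of the final fit.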
The GMM has received significant attention in the literature, particularly to support the estimation of QoS network parameters. The work in [19] investigates how to estimate link-delay distributions based on end-to-end multicast measurements, adopting an MLE-based GMM model. In [20], a known conditional distribution and an unknown finite Gaussian mixture are proposed to approximate the weighting of the GMM components, showing that higher accuracy is achieved when compared to the EM algorithm. The EM algorithm also has drawbacks. Generally, the EM algorithm faces three main issues. First, the choice of the initial values may change the results as well as the convergence state and rate. Second, it is hard to define the number of mixture model distributions and how they affect the accuracy of the approximation. Finally, convergence is slow in some cases and may take a long time to reach an accurate solution.
The majority of the research works addressing the improvement of the EM algorithm emphasize that EM’s dependency on the initialization values is its main drawback. Several research works explore different approaches to solve the random initialization of the different GMM variables. The work in [18] proposed a robust two-phase EM clustering algorithm for GMM and formulated a new method to solve the initialization problems in EM. In a further step, that work also proposes a scheme to automatically obtain an optimal number of clusters. The proposed approach is evaluated using experimental examples, demonstrating that it outperforms conventional EM algorithms. The work in [21] proposes an innovative approach to estimate the means of the distributions of a mixture model based on local maxima, and the numerical results presented in the paper reveal that the proposed algorithm achieves higher performance than the naive EM algorithm.
The number of mixture components is one of the most critical issues in GMM. The work in [22] proposed an improved EM algorithm to select the number of components of the GMM and simultaneously estimate the weights of these components and the unknown parameters. The evaluation results illustrate a better performance in estimating the distribution parameters and consistency in determining the number of components. Moreover, ref. [23] suggested a new method to estimate the GMM components and to identify the cause of jitter. The EM algorithm is adopted to determine the best-matching parameters to fit the observations. The authors also consider the Bayesian Information Criterion (BIC) to determine the number of GMM distributions and eliminate the initial value selection problem of the EM algorithm.
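As an illustration of this kind of component selection, the sketch below fits GMMs with different numbers of components and keeps the one with the lowest BIC; it uses scikit-learn's GaussianMixture purely as a stand-in for the EM implementations cited above, and the candidate range of components and the synthetic delay trace are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_components_by_bic(samples, k_candidates=range(1, 11), seed=0):
    """Fit one GMM per candidate number of components and keep the lowest-BIC model."""
    X = np.asarray(samples, dtype=float).reshape(-1, 1)  # scikit-learn expects a 2-D array
    fits = [GaussianMixture(n_components=k, random_state=seed).fit(X) for k in k_candidates]
    return min(fits, key=lambda g: g.bic(X))

# Synthetic bimodal trace used as a placeholder for a measured E2E delay dataset.
rng = np.random.default_rng(0)
delays = np.concatenate([rng.normal(5.0, 0.5, 5000), rng.normal(9.0, 1.0, 5000)])
best = select_components_by_bic(delays)
print(best.n_components, best.means_.ravel())
```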
As a widely used approach to obtain the MLE solution, the convergence speed of the EM algorithm can be a critical issue, and a wide range of investigations has focused on it. For instance, in [24] the authors adopted the Anderson acceleration technique. The evaluation results with different simulations and numerical examples show that the number of iterations needed to reach convergence is reduced by applying Anderson acceleration to the EM algorithm. Moreover, the work in [24] proposed a new method to address the EM algorithm issue with multimodal likelihood functions. In these cases, the EM algorithm may get trapped in a local maximum, resulting in long convergence cycles to reach the optimal solution. The method proposed in [25] attempts to optimize the initial random values to avoid local maximum points, targeting the maximization step. Simulation results confirm that optimizing the initial values improves the convergence rate and avoids local traps. The GMM has been adopted in several scenarios. For instance, ref. [26] suggests a new clutter elimination method based on GMM and EM estimation, which attempts to perform fast clutter estimation and elimination with a small amount of data. The work in [27] provides a comprehensive analysis of actual latency values collected among various data center locations. The work suggests a new GMM approximation of the round-trip time distribution based on a simple box approximation algorithm. The GMM approximation can then be used to simulate and emulate the deployment of applications and services in the cloud.
The works cited so far aim at regenerating the collected data based on all their information. In contrast, in the methodology followed in this work we determine the GMM parameters that best fit the sampled data. The accuracy of the estimation model is characterized as a function of the number of GMM components and a variable number of input samples. Our goal is to compare the accuracy of the proposed estimation methodology against all the empirical data contained in the dataset. Additionally, the GMM estimation time is also studied to assess the feasibility of using the proposed methodology in real-time applications and services.
5. Performance Results
This section presents the simulation results and evaluates the performance of the method described in Section 3. The MSE is used to quantify how close the PDF estimated with the GMM is to the empirical PDF of the dataset and is defined as follows:

$$\mathrm{MSE} = \frac{1}{N}\sum_{k=1}^{N}\left(\hat{f}(k) - f(k)\right)^{2},$$

where $N$ represents the number of discrete points of the PDF, $\hat{f}(k)$ represents the GMM PDF value at the discrete point $k$, and $f(k)$ represents the value of the PDF of the empirical data at the same point.
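A short sketch of how this MSE can be computed numerically is shown below, assuming NumPy and a fitted scikit-learn GaussianMixture object; the empirical PDF is approximated by a normalized histogram over N discrete points, and the grid size and the gmm_vs_empirical_mse helper name are illustrative choices rather than the exact implementation used for the results.

```python
import numpy as np

def gmm_vs_empirical_mse(all_delays, gmm, n_points=200):
    """MSE between the GMM PDF and the empirical PDF evaluated on N discrete points."""
    # Empirical PDF: normalized histogram of the full dataset over N bins.
    hist, edges = np.histogram(all_delays, bins=n_points, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # GMM PDF evaluated at the same discrete points (score_samples returns log-densities).
    gmm_pdf = np.exp(gmm.score_samples(centers.reshape(-1, 1)))
    return np.mean((gmm_pdf - hist) ** 2)
```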
The four selected scenarios in Table 1 have been employed to determine the model’s accuracy as a function of the number of GMM components and the number of samples. The convergence threshold of the EM algorithm was set to the same value in all experiments.
As mentioned before, the number of iterations to reach convergence is one of the essential metrics for evaluating the performance of the EM algorithm. When the randomly chosen initial values are more accurate, the time to reach a certain level of convergence decreases. Table 2 summarizes the number of iterations of the EM algorithm and the MSE achieved for the different numbers of components in each scenario. The EM algorithm is computed for datasets with sample rate 1. Based on the numerical results, when the number of components increases, the number of iterations required to reach the convergence threshold increases because more parameters need to be estimated. It is worth noting that the limiting value of 25,000 iterations was never reached for the considered numbers of components. On the other hand, the MSE decreases with the number of components because a higher number of components leads to a higher number of degrees of freedom to model the data.
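In terms of a concrete EM implementation, the convergence threshold and the 25,000-iteration cap correspond to a tolerance and a maximum-iteration setting; the sketch below illustrates this mapping with scikit-learn's GaussianMixture, where the tolerance value and the synthetic delay trace are placeholders rather than the exact configuration evaluated here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data standing in for a measured E2E delay trace.
rng = np.random.default_rng(0)
delay_samples = np.concatenate([rng.normal(4.0, 0.3, 5000), rng.normal(7.0, 0.8, 5000)])

# Illustrative configuration: tol is a placeholder, max_iter mirrors the 25,000-iteration cap.
gmm = GaussianMixture(n_components=8, tol=1e-6, max_iter=25000, random_state=0)
gmm.fit(delay_samples.reshape(-1, 1))
print(gmm.n_iter_, gmm.converged_)  # iterations used and whether the threshold was reached
```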
In what follows, we compare the PDF plots of the estimated GMM PDF for three and eight components with the empirical dataset PDF. The impact of changing the sample rate (up to the full dataset, i.e., sample rate 1) and the number of components on the GMM’s accuracy (MSE) and EM computation time is characterized by running the estimation methodology a thousand times and averaging the results, as sketched below. The average MSE and computation time are plotted for each scenario.
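A hedged sketch of how such a sweep over sample rates and component counts could be organized is given below; it reuses the gmm_vs_empirical_mse helper from the earlier sketch, and the sample rates, component counts, and number of repetitions are illustrative values, not the ones used to produce the tables in this section.

```python
import time
import numpy as np
from sklearn.mixture import GaussianMixture

def sweep(all_delays, sample_rates=(0.1, 0.5, 1.0), components=(3, 5, 8), runs=100):
    """Average MSE and EM fitting time per (sample rate, K) pair (illustrative settings)."""
    rng = np.random.default_rng(0)
    results = {}
    for rate in sample_rates:
        n = int(rate * len(all_delays))
        for k in components:
            mses, times = [], []
            for _ in range(runs):
                # Draw a random subset of the trace according to the sample rate.
                subset = rng.choice(all_delays, n, replace=False).reshape(-1, 1)
                start = time.perf_counter()
                gmm = GaussianMixture(n_components=k, max_iter=25000).fit(subset)
                times.append(time.perf_counter() - start)
                # MSE against the empirical PDF of the full dataset (helper from the earlier sketch).
                mses.append(gmm_vs_empirical_mse(all_delays, gmm))
            results[(rate, k)] = (np.mean(mses), np.mean(times))
    return results
```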
Scenario 1 considers the Core E2E delay data for an NSA 5G testbed in the download stream and for a packet size of 1024 bytes. This dataset contains 10,000 samples, and the PDFs obtained for the GMM with three and eight components are represented in Figure 7. As can be seen in the figure, the adoption of eight components increases the model’s accuracy when compared to three components.
The MSE values for different GMM component numbers and sample rates are summarized in
Table 3 to characterize their impact on GMM estimation accuracy. For the same number of GMM components, the MSE decreases as the number of samples increases, which means that a lower error is achieved when a higher number of samples is used. In addition, the MSE value decreases as the number of GMM components increases for a fixed sample rate. This is due to the increase in the number of Gaussian distributions considered in the GMM model.
The computation time is presented in
Table 4. As a general trend, the computation time increases with the number of samples for a specific number of GMM components due to the longer sample vector processed by the EM algorithm. For a fixed number of samples, the computation time increases with the number of components due to the increased number of parameters that need to be computed.
Figure 8 illustrates the logarithmic plot of the MSE and computation time per number of components for different numbers of E2E delay samples in Scenario 1. Each curve represents a specific number of samples, T. Based on the results, to keep the MSE below a given target, it is beneficial to decrease T and increase the number of GMM components, K, as the computation time is more affected by the number of samples than by the number of GMM components.
Scenario 2 collected Core E2E delay data for a thousand seconds in an NSA 5G testbed in the upload stream with a packet size of 256 bytes. The dataset contains 10,000 E2E delay samples. The PDFs obtained for three and eight components are represented in Figure 9. Based on the results, the adoption of eight components increases the model’s accuracy compared to three components.
The MSE values for different GMM component numbers and sample rates are summarized in
Table 5 to characterize their impact on GMM estimation accuracy. For the same number of GMM components, the MSE value decreases as the number of samples increases, which means that a lower error occurs when the sample rate increases. In addition, the MSE value decreases with the number of GMM components for a fixed sample rate, because the additional parameters allow the experimental data to be represented more accurately.
The computation time for Scenario 2 is presented in
Table 6. The computation time increases with the number of samples for a specific number of components.
Figure 10 illustrates the logarithmic plot of the MSE and computation time per number of components for different numbers of samples in Scenario 2. Each curve represents a specific number of samples. Based on the results, to keep the MSE below a given target, it is beneficial to decrease the number of samples (T) and increase the number of GMM components (K), as the computation time is more affected by the number of samples than by the number of GMM components.
In Scenario 3, we consider RAN E2E delay samples in an SA 5G testbed for a download stream obtained with a packet size of 1024 bytes. The dataset contains 100,000 data samples, and the PDFs obtained for three and eight GMM components are represented in
Figure 11.
The MSE values for different GMM component numbers and sample rates are summarized in
Table 7. For the same number of GMM components, the MSE value decreases as the number of samples increases. In addition, the MSE value decreases with the number of GMM components, as previously observed.
The computation time is presented in
Table 8.
Figure 12 illustrates the logarithmic plot of the MSE and computation time per component number for different sample numbers (sample rates) in Scenario 3.
In Scenario 4, we consider the RAN E2E delay data in an SA 5G testbed for the upload stream and a packet size of 128 bytes. The PDFs obtained for three and eight GMM components are represented in Figure 13. As can be observed, the PDF of the E2E delay is not similar to the ones obtained for Scenarios 1–3.
The MSE values for the different numbers of GMM components are summarized in
Table 9.
The computation times for Scenario 4 are presented in
Table 10. Once again, we observe the same trends as the ones obtained for Scenarios 1–3.
Figure 14 illustrates the logarithmic plot of the MSE and computation time per number of components for different sample rates in Scenario 4. As can be seen, the performance of the estimation methodology depends not only on the 5G topology but also on the number of GMM components and the number of input samples.
As confirmed by the results presented in this section, the GMM can be effectively used to estimate the E2E delay distribution in a short amount of time, validating the initial hypothesis. The proposed methodology focuses on a more detailed estimation of the E2E delay: instead of estimating a bound as in NC methods, or computing an expected value of the delay as in QT models, we estimate the distribution of the E2E delay. However, the parameters adopted in the GMM model strongly influence the accuracy and computation time of the estimation. As a final remark on the results described in this section, we conclude the following:
By increasing the number of GMM components, the number of parameters to estimate also increases, and so does the computation time. The computation time increases approximately exponentially with the number of components, although the estimation accuracy improves at a smaller scale;
By increasing the number of components in all scenarios, the average number of EM iterations needed to reach the convergence threshold also increases. Because the convergence criterion is sensitive to the difference between successive estimates, EM takes more iterations in the scenarios with larger deviations in the E2E delay samples;
The number of samples has a major impact on the estimation computation time. Although the estimation accuracy is reduced when a smaller number of samples is used, the computation time can be significantly reduced, and the loss of accuracy can be partially compensated by increasing the number of GMM components;
Although a linear relation between the MSE and the number of GMM components is not found, they always exhibit an inverse trend;
Although there is no linear relation between computation time and the number of GMM components, they always exhibit a direct trend.