1. Introduction
In communication networks, the problem of choosing a cumulative distribution function (CDF) for the intervals between request arrivals, denoted here as A(t), first arose in the 19th century during the construction of telegraph networks. However, the mathematical theory that makes the required calculations possible emerged during the creation of telephone networks. It was called teletraffic engineering. Agner Krarup Erlang, a Danish mathematician, statistician, and engineer, is considered its founder [1, 2].
In teletraffic theory, a request is a call attempt made by a subscriber of the telephone network. A number of operations must be carried out on this request to provide the subscriber with the requested service. In a modern multiservice network, the request is usually a packet that must be processed for the correct exchange of information between users. The name of the related mathematical theory has also changed: it is now more often called queuing theory [3, 4], because a request can stay in buffer memory (in other words, in a queue) while waiting for service.
Measurements of traffic parameters in telephone networks showed that the CDF A(t) in most cases obeys the exponential law [5] with the parameter λ:
$$A(t) = 1 - e^{-\lambda t}, \quad t \ge 0. \qquad (1)$$
The parameter λ is the reciprocal of the average interval between the moments of arrival of requests, A(1), which is the first moment of A(t). Queuing theory calls λ the intensity of the arriving request flow. An exponential distribution of the form (1) indicates that the request flow is a Poisson process [3, 4]. For a Poisson flow, the probability of arrival of exactly k requests over a time interval of duration t, denoted as P_k(t), is determined by the following formula:
$$P_k(t) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}, \quad k = 0, 1, 2, \ldots \qquad (2)$$
Note that as the product λt increases, the function P_k(t) becomes close to symmetric.
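The two laws above can be illustrated with a short Python sketch. It assumes the standard forms of the exponential CDF and the Poisson probability; the function names and the intensity value are hypothetical and chosen only for illustration. The skewness of the Poisson count, equal to one over the square root of λt, is used here as one possible measure of how close the distribution is to symmetric.

```python
import math


def exponential_cdf(t: float, lam: float) -> float:
    """CDF of exponentially distributed inter-arrival intervals."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0


def poisson_pmf(k: int, lam: float, t: float) -> float:
    """Probability of exactly k arrivals of a Poisson flow within duration t."""
    x = lam * t
    return x ** k * math.exp(-x) / math.factorial(k)


def poisson_skewness(lam: float, t: float) -> float:
    """Skewness of the Poisson count; it tends to zero as lam * t grows."""
    return 1.0 / math.sqrt(lam * t)


if __name__ == "__main__":
    lam = 2.0  # hypothetical intensity, requests per unit of time
    for t in (0.5, 5.0, 50.0):
        print(f"lam*t = {lam * t:5.1f}  skewness = {poisson_skewness(lam, t):.3f}")
```

The printed skewness decreases as λt grows, which is consistent with the remark that the function becomes close to symmetric.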
With the formation of multiservice networks, in which all types of information (speech, data, and video) are transmitted as sequences of packets, the Poisson flow hypothesis became unacceptable. CDFs A(t) that differ from (1) began to be used. Typically, these distributions have a coefficient of variation, CA, greater than one, where the coefficient of variation is the ratio of the standard deviation to the mean of a random variable. For a Poisson flow, this coefficient is equal to one. Moreover, self-similar processes began to be used to describe the function A(t) [6].
Modern telecommunication equipment allows statistical data to be collected, from which a histogram a(t) can be built. Such a histogram yields highly reliable assessments that reveal the nature of the arriving request flow. For this purpose, a method for analyzing the function a(t) must be developed. The solution to this problem is the main topic of this paper.
2. Mathematical Model of a Multiservice Node
A multiservice node can be viewed as a black box [7], as shown in Figure 1. This black box is used as a model for a queuing system. In addition to the function A(t), four more processes are defined for such a system.
The process B(t) is the CDF of the holding time of requests (packets) in a queuing system that is used as a model of a multiservice node. Using the process C(t), various control algorithms for a queuing system can be specified. The process D(t) is the CDF of the interval between the moments at which successfully processed requests depart from the queuing system. Some requests cannot be processed for a number of reasons, and they are lost. The process E(t) allows this phenomenon to be described.
Below, mainly the function A(t) is reviewed, with reference to its properties that make it possible to choose the distribution B(t), calculate the main parameters of the D(t) and E(t) processes, as well as to choose the ideal control algorithms represented as the process C(t).
2.1. Two Models of the Arrival Process
It is expedient to consider the request flow as a sequence defined on the “Time” axis, which is usually denoted by the letter t. On this axis, it is possible to single out the time moments t_i at which requests arrive at the queuing system input. The values t_i correspond, for example, to the time moments when packets arrive at a multiservice node. Such a request flow is random. There are also deterministic request flows, for which the values of t_i are given by a predetermined schedule. The analysis of a deterministic request flow is not related to the study of random variables.
Figure 2 shows five moments of time, t_1, ..., t_5, at which requests arrive at the queuing system input. The size of the i-th time interval between the moments of arrival of neighboring requests is therefore the difference t_{i+1} − t_i for all i.
The model shown in Figure 2 contains all the necessary information about the arriving request flow under the condition that only one request may arrive at any moment of time t_i. Otherwise, it is necessary to set the number of requests arriving at each time moment t_i. For the model under consideration, this number is equal to one.
Let us suppose that, at two of the time moments, two requests arrive at the queuing system input at once. The second request flow model, shown in Figure 3, should then be used. It adds the “Number of Requests” ordinate axis, which allows the number of requests arriving at each time moment to be specified. The second request flow model is rarely used in practice.
Let us return to Figure 2. Statistical information on the sizes of the intervals between arrivals allows the CDF A(t) to be determined. Methods for solving this problem are described below.
Let us suppose further that the selected function A(t) has a Laplace–Stieltjes transform [8], which is denoted as α(s). Then, the average interval between the arrivals of neighboring requests, A(1), is determined via one of the following two formulas:
$$A^{(1)} = \int_0^{\infty} t \, dA(t), \qquad A^{(1)} = -\left. \frac{d\alpha(s)}{ds} \right|_{s=0}. \qquad (3)$$
The values of λ and A(1), as stated above, are related to each other by a simple relationship:
$$\lambda = \frac{1}{A^{(1)}}. \qquad (4)$$
The function A(t) can be built from the measurement results, that is, from the histogram a(t). The distribution A(t) is then a stepwise function, for which a Laplace–Stieltjes transform always exists. When the Laplace–Stieltjes transform is used, the calculation of A(1) and other characteristics of the random variable is simplified because differentiation replaces integration.
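The equivalence of the two ways of obtaining the mean can be checked numerically. The sketch below is only an illustration under stated assumptions: the stepwise distribution is taken to place an increment a_i at the point i·τ (i = 1, ..., n), the increments and the period are hypothetical, and the derivative of the transform at zero is approximated by a central difference rather than taken analytically.

```python
import math


def lst(s, increments, tau):
    """Laplace-Stieltjes transform of a stepwise CDF with jump a_i at i * tau."""
    return sum(a * math.exp(-s * i * tau)
               for i, a in enumerate(increments, start=1))


def mean_direct(increments, tau):
    """Mean as the direct sum of (point) * (increment)."""
    return sum(a * i * tau for i, a in enumerate(increments, start=1))


def mean_from_lst(increments, tau, h=1e-6):
    """Mean as minus the derivative of the transform at s = 0 (central difference)."""
    return -(lst(h, increments, tau) - lst(-h, increments, tau)) / (2.0 * h)


if __name__ == "__main__":
    a = [0.1, 0.4, 0.3, 0.2]  # hypothetical increments, summing to one
    tau = 0.5                 # hypothetical sampling period
    print(mean_direct(a, tau), mean_from_lst(a, tau))
```

Both functions return the same mean up to the discretization error of the finite difference.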
2.2. Handling of the Function a(t)
The histogram a(t) built from the measurement results can be represented by its Laplace–Stieltjes transform α(s):
$$\alpha(s) = \sum_{i=1}^{n} a_i e^{-s i \tau}. \qquad (5)$$
In this formula, τ is the period with which the increments of the measured function, equal to a_i, are sampled. The value n defines the point on the abscissa axis for the last value. The following condition always holds: a_1 + a_2 + ... + a_n = 1. Let us note that most of the approximations of the function A(t) in queuing theory are based on distributions defined on [0, ∞). Moreover, for a significant part of the studied models. Typically, all such approximations lead to significant errors in the study of queuing systems.
It is necessary to select the value of τ before taking measurements. It is appropriate to define it as the greatest common divisor of the entire set of recorded time intervals between packet arrivals [9]. In this case, the CDF A(t) represents the law of packet arrival more accurately.
The top of Figure 4 shows an example of a histogram a(t). Below it are two transformed histograms, for which the period τ is doubled and quadrupled, respectively. The subscript “U” (the first letter of the word “upper”) in their designations indicates that the ordinate values of the transformed histograms are calculated by summing the original values within each enlarged time interval and assigning the sum to the nearest larger value of t/τ. This method of transforming the a(t) histogram, which is proposed, for example, in [10], is hereinafter referred to as option I. In the general case, the histograms describe asymmetric distributions of the form A(t).
Figure 5 shows the histogram a(t) and two results of its transformation, as determined by option II. The subscript “L” (the first letter of the word “lower”) in their designations indicates that the ordinate values of the transformed histograms are calculated by summing the original values within each enlarged time interval and assigning the sum to the nearest smaller value of t/τ. The two options for the transformation of the function a(t) make it possible to obtain the lower and upper limits for the numerical characteristics of the investigated random variable. Usually, such assessments are enough to solve the problem. However, if necessary, other methods of transformation of the function a(t) can be used.
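The two transformation options can be sketched in Python as follows. This is a minimal illustration, not the procedure of [10]: it assumes that the increment a_i sits at the point i·τ, that enlarging the period by an integer factor moves each increment to the nearest larger (option I) or nearest smaller (option II) multiple of the new period, and that the histogram values are hypothetical.

```python
import math
from collections import defaultdict


def moments(points):
    """points: {abscissa: probability}. Returns (mean, coefficient of variation)."""
    mean = sum(t * p for t, p in points.items())
    second = sum(t * t * p for t, p in points.items())
    return mean, math.sqrt(max(second - mean * mean, 0.0)) / mean


def original(increments, tau):
    """Histogram with increment a_i located at the point i * tau."""
    return {i * tau: a for i, a in enumerate(increments, start=1)}


def rebin(increments, tau, factor, upper=True):
    """Enlarge the period to factor * tau (option I if upper, else option II)."""
    out = defaultdict(float)
    for i, a in enumerate(increments, start=1):
        k = -(-i // factor) if upper else i // factor  # integer ceil or floor
        out[k * factor * tau] += a
    return dict(out)


if __name__ == "__main__":
    a = [0.1, 0.4, 0.3, 0.2]  # hypothetical increments
    print("original ", moments(original(a, 1.0)))
    print("option I ", moments(rebin(a, 1.0, 2, upper=True)))
    print("option II", moments(rebin(a, 1.0, 2, upper=False)))
```

For this particular example, option I raises the mean and lowers the coefficient of variation, while option II does the opposite.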
As a result of the transformation of the function a(t) in both options, the numerical characteristics of the measured random variable vary. This can be seen from Table 1, which shows the average value A(1) and the coefficient of variation CA of the studied random variable.
Judging by the numerical values given in Table 1, the resulting errors are very significant. This allows two conclusions to be drawn on the reasonable choice of scales when processing measurement results:
option I, in which the scale is “enlarged”, provides an upper estimate for the mathematical expectation and a lower estimate for the coefficient of variation;
option II, in which the scale is “enlarged”, provides a lower estimate for the mathematical expectation and an upper estimate for the coefficient of variation.
This means that, when processing the measurement results, it is not advisable to enlarge the scale along the abscissa, that is, it is recommended to use the minimum value of τ. In this case, the accuracy of the obtained results will be maximum (this aspect of processing the measurement results is described in more detail below). The complexity of assessing all numerical characteristics of random variables at a minimum value of τ does not increase substantially due to the possibility of automatic input of statistical data and calculations on a personal computer or another computing device.
It should also be noted that, for a queuing system with a stepwise function A(t) and a constant packet processing time, a method is known for calculating the moments of the waiting and delay durations [11]. These moments make it possible to assess the quality of service characteristics of packet multiservice networks and to compare them with the standardized indicators [12].
2.3. Choice of Function A(t) Based on a Reasonable Hypothesis
Let us suppose that approximate assessments of A(1) and CA are known, but there are no measurement results that allow the choice of the function A(t). It is appropriate to distinguish two classes of distributions A(t). The first one is set on a limited time interval. Such CDFs are denoted below as A_l(t); the subscript “l” is the first letter of the word “limited”. The subscript “u” (from the word “unlimited”) is used to denote the CDFs A_u(t) defined on the interval [0, ∞).
Estimates of A(1) and CA allow the choice of two-parameter distributions: A_l(t) and A_u(t). A suitable example of a function from the A_l(t) family is the beta distribution [13, 14]. Of practical value is the case in which the coefficient of variation CA exceeds one. For this condition, it is appropriate to choose the Weibull distribution from the functions of the A_u(t) family [14]. However, other types of the A_u(t) function lead to results close to those of the Weibull distribution.
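As an illustration of choosing a two-parameter distribution from estimates of the mean and the coefficient of variation, the following sketch fits a Weibull distribution by the method of moments. It relies on the standard expressions for the Weibull mean and variance through the gamma function; the bisection bounds and the function names are assumptions made for this example only.

```python
import math


def weibull_cv(shape: float) -> float:
    """Coefficient of variation of a Weibull distribution (independent of scale)."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / (g1 * g1) - 1.0)


def fit_weibull(mean: float, cv: float, lo: float = 0.05, hi: float = 50.0):
    """Return (shape, scale) matching the given mean and coefficient of variation.

    The coefficient of variation decreases as the shape grows, so bisection
    on the shape is sufficient; the scale then follows from the mean.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weibull_cv(mid) > cv:
            lo = mid  # variation too large: the shape must increase
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return shape, scale


if __name__ == "__main__":
    shape, scale = fit_weibull(mean=1.0, cv=10.0)
    print(f"shape = {shape:.4f}, scale = {scale:.6f}")
```

A shape parameter below one corresponds to a coefficient of variation above one, which is the case of practical interest mentioned above; a shape equal to one gives the exponential law.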
A particular interest in ensuring the standardized quality of service indicators for multiservice traffic is associated with the operation of queuing systems under conditions of increased load, ρ. In the accepted notation, the value of ρ is the ratio of λ to the service intensity μ:
$$\rho = \frac{\lambda}{\mu}. \qquad (6)$$
Approximate formulas for assessing the average request delay time S(1) in a queuing system at a high load are given, for example, in [15, 16] for a model with arbitrary CDFs A(t) and B(t). For A_l(t) and A_u(t), the results given in [15, 16] can be represented in the following form [17], under the condition of a constant request holding time:
The choice of a model with constant holding time is justified by the fact that all packets in the switching nodes of the multiservice network are processed in the same way. This allows the introduction of a reasonable assumption about the constancy of the request processing time [18].
Clearly, the average packet delay time for distributions of the form A_l(t) will be greater than that of the model characterized by the function A_u(t). A significant difference between A_l(t) and A_u(t) is also observed for the probability of packet loss, π. Figure 6 shows the dependence of the probability of packet loss on the load for two distributions of the form A_l(t) and A_u(t). The coefficient of variation, CA, for both types of distribution is chosen to be equal to ten.
More significant packet losses for the functions A_l(t) are associated with the fact that the duration of the idle period in the corresponding queuing model is limited by the upper bound of the interval on which A_l(t) is defined. There is no such limitation for the functions A_u(t). To assess the upper limit of π, which corresponds to the model with the function A_l(t), an approximate formula was proposed in [17] for the case in which the packet processing time is constant and the capacity of the queue is limited to r:
Formulas (7) and (8) make it possible to assess the characteristics of the quality of service with an error of no more than 20%. This statement applies to a wide class of distributions included in the A_u(t) family. It is also true for the beta distribution in the A_l(t) family. Among other distributions of the form A_l(t), the correctness of Formulas (7) and (8) was checked by analyzing the histograms obtained from traffic measurements in an operating multiservice network. It turned out that both relationships are quite acceptable for the analysis of models with an arbitrary distribution included in the A_l(t) family.
2.4. Reducing the Number of Samples in the Histogram a(t)
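As a result of measurements of the request flow, a histogram with a very large number of readings can be generated; this number is determined by the value of n. In this case, the clarity of the function A(t) is sometimes lost. The coefficient of variation, CA, will be used for a number of subsequent assessments. For Function (5), it is calculated as follows:
$$C_A = \frac{1}{A^{(1)}} \sqrt{\sum_{i=1}^{n} a_i (i\tau)^2 - \left(A^{(1)}\right)^2}, \qquad A^{(1)} = \sum_{i=1}^{n} a_i \, i\tau, \qquad (9)$$
where a_i is the increment of the histogram at the point iτ.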
To construct the initial a(t) histogram, all recorded values of the measured random variable are used. An example of such a histogram is shown in the upper part of Figure 7. It is assumed that 25 values of the measured random variable are obtained. In this case, the three rightmost readings are considered rare phenomena that insignificantly affect the nature of the distribution A(t). In some cases, there is a desire to discard them, that is, to reduce the set of values. A histogram g(t) is then generated (option (b)). The third solution, shown at the bottom of the same figure, is to replace the rare values with one increment observed x times and located at the point with index m. In the example used, m = 9. Then, another histogram, h(t), represented by option (c), is generated.
First, let us consider the option of transforming the initial histogram in which the results in the tail region are neglected. In this case, the set used to construct the histogram g(t) and the corresponding distribution function will include 22 elements.
The last reading of the histogram h(t) is denoted as x. In the example under consideration, x = 3. It is necessary to choose the value of m that defines the point to which the value of x is transferred. The criterion for choosing m is the proximity of the values of A(1) and CA for both histograms. The value of m is estimated numerically using the following approach:
I. The characteristics A(1) and CA of the random variable for the a(t), g(t), and h(t) histograms are denoted as A(1)(a), CA(a), A(1)(g), CA(g), A(1)(h), and CA(h), respectively.
II. Two values of m are found by numerically solving the equations A(1)(a) = A(1)(h) and CA(a) = CA(h), one value from each equation.
III. The single value of m equal to the larger of these two values is selected.
In fact, a non-integer value of m may be chosen. However, in some cases such an approach will not lead to a noticeable increase in the accuracy of the main investigated characteristics of a random variable. It is also possible to use two integer values of m. The first is obtained by discarding the digits after the decimal point in the resulting value; the second is the next larger integer.
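Clauses I to III can be sketched in Python as follows. The sketch is an illustration under assumptions: the histogram is given as observation counts per point index (in units of τ), the counts are hypothetical and do not reproduce Figure 7, the mean equation is solved in closed form, and the equation for the coefficient of variation is solved by a simple grid search rather than by any specific numerical method from the paper.

```python
import math


def moments(counts):
    """counts: {point index: number of observations}. Returns (mean, CV) in units of tau."""
    n = sum(counts.values())
    m1 = sum(i * c for i, c in counts.items()) / n
    m2 = sum(i * i * c for i, c in counts.items()) / n
    return m1, math.sqrt(m2 - m1 * m1) / m1


def merged(head, x, m):
    """Histogram h: the head plus one increment observed x times at point m."""
    h = dict(head)
    h[m] = h.get(m, 0) + x
    return h


def m_from_mean(tail):
    """Value of m that keeps the mean unchanged (count-weighted tail position)."""
    return sum(i * c for i, c in tail.items()) / sum(tail.values())


def m_from_cv(head, tail, step=0.01):
    """Value of m whose histogram h has a CV closest to that of the original."""
    target = moments({**head, **tail})[1]
    x = sum(tail.values())
    lo, hi = max(head), 2 * max(tail)
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return min(grid, key=lambda m: abs(moments(merged(head, x, m))[1] - target))


if __name__ == "__main__":
    head = {1: 3, 2: 7, 3: 6, 4: 4, 5: 2}  # hypothetical main part, 22 values
    tail = {8: 1, 9: 1, 11: 1}             # hypothetical rare readings, x = 3
    m1, m2 = m_from_mean(tail), m_from_cv(head, tail)
    m = max(m1, m2)                        # clause III
    print(m1, m2, math.floor(m), math.floor(m) + 1)
```

The last line also prints the two neighboring integer values of m obtained by truncation and by adding one.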
2.5. Three Numerical Examples
The first numerical example is related to the estimation of the error that arises when the last three readings are discarded. For this purpose, one should use formulas that allow the calculation of the errors in the investigated characteristics of a random variable. For the value A(1), the relative error was about 26%, and for the CA values it was about 25%. Typically, these error values are not considered acceptable. For this reason, the use of histograms of the form g(t) does not seem reasonable.
The second numerical example is related to the choice of the m value. For this purpose, it is first required to solve two equations: A(1)(a) = A(1)(h) and CA(a) = CA(h), clause II of the previous section of the paper. This will allow the choice of the m value.
For the histogram h(t), it is appropriate to choose the value m = 9. Then, the values A(1) and CA for the histograms a(t) and h(t) will compose the following pairs: 3.08 and 3.08, 0.86 and 0.84. This means that the mean values coincide, and the relative error in calculating the coefficient of variation is about 2.3%. It is possible to choose another value of m, for which the relative error in calculating the coefficient of variation will be zero. Then there will be differences in the mean values. If an error of 2.3% in estimating the coefficient of variation seems acceptable, then the choice of this value of m should be considered justified.
A more general methodological approach is based on the use of both values of m found in clause II, that is, the operation specified in clause III is not required. The two values of m allow the original histogram to be converted into two new functions.
It should be noted that, in order to obtain guaranteed upper values of the studied characteristics, it is better not to neglect the operation provided for in clause III. In other words, it is advisable to use only one value of m for further analysis, namely the larger of the two.
The third numerical example is related to the influence of the location of the last reading on the abscissa axis. The size of this reading, which is the last increment on the histogram h(t), is also significant. Let us suppose that the last reading is shifted to the right along the abscissa axis by different distances. The last increment for the histogram h(t) is 0.04. Table 2 shows the growth of the characteristics A(1) and CA as the last reading of the measured random variable is shifted. The location of the last reading is indicated by the letter “m”. For the second row in Table 2, the condition m = n is true.
The data given in the table indicate that even small increments located at a noticeable distance (along the abscissa axis) from the main part of the observed values significantly affect the characteristics of the studied random variable. For this reason, ignoring the tail of the distribution function is fraught with dramatic distortions of the measurement results.
The value of the last increment in this example may seem significant. For this reason, it makes sense to repeat the calculations by artificially increasing the number of observations of the process under investigation before the point of the last reading. If all the initial ordinate values (except for the last one) are increased ten times in the h(t) histogram, then the values of A(1) and CA will become different. They are shown in Table 3. As in the previous table, the point where the last reading is located is denoted by the letter m.
Both tables illustrate the fact that the last reading remains an important quantity for assessing the CDF A(t). It cannot be neglected. The correctness of replacing several readings located at a distance from their main mass by a single value of the histogram located at the point m requires a detailed discussion. Another very important conclusion that can be drawn from the data of both tables is that, for small values of the last reading, A(1) increases only slightly, whereas CA increases faster.
Another important issue is the definition of the range of values of the studied distribution that can be considered a tail. If the CDF A(t) is obtained as a result of measurements, it is not difficult to determine the value of the function at the point where this range begins. In mathematical statistics, the assignment of a part of the distribution of a random variable to the “tail” is subjective. Let us take the following definition as an initial hypothesis: the tail is the part of the distribution that includes all values of t, along the abscissa axis, for which A(t) exceeds a chosen threshold. The threshold is usually 0.90, 0.95, or 0.99. If the distribution under investigation is given analytically, the point at which the tail begins is usually found numerically by solving the equation that sets A(t) equal to the threshold. In rare cases, this point can be found explicitly.
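The threshold-based definition of the tail can be illustrated as follows. The sketch assumes that the tail begins at the smallest point where the CDF reaches the chosen threshold; the histogram values are hypothetical, and the exponential law is used as an example of the rare case in which the equation is solved explicitly.

```python
import math
from itertools import accumulate


def tail_start_histogram(increments, tau, threshold):
    """Smallest point i * tau at which the stepwise CDF reaches the threshold."""
    for i, cdf in enumerate(accumulate(increments), start=1):
        if cdf >= threshold - 1e-12:  # tolerance for floating-point summation
            return i * tau
    return len(increments) * tau


def tail_start_exponential(lam, threshold):
    """Explicit solution of 1 - exp(-lam * t) = threshold."""
    return -math.log(1.0 - threshold) / lam


if __name__ == "__main__":
    a = [0.1, 0.4, 0.3, 0.15, 0.05]  # hypothetical increments
    for q in (0.90, 0.95, 0.99):
        print(q, tail_start_histogram(a, 1.0, q), tail_start_exponential(1.0, q))
```

For a measured histogram the search is a single pass over the cumulative sums, which matches the remark that the value is easy to obtain from measurements.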
2.6. Accuracy Criteria for Solving the Problem
The closeness of the mean values and the coefficients of variation for two distributions does not mean that the output of the model, which is a queuing system, will yield results with acceptable accuracy [19]. Practical interest is associated with the quality of service indicators standardized for multiservice networks [12]. Among them, it is appropriate to single out the average request delay time in the queuing system, S(1), and the p-quantile of the same random variable, t_p. The closeness of these values for different distributions becomes a criterion for the accuracy of solving the problem.
To investigate the resulting errors, it is appropriate to choose a single-server model of the G/M/1 type in Kendall’s classification [20]. This model is convenient because it allows all the necessary characteristics of a random variable, such as the request delay time in the queuing system, to be found analytically. The model assumes that the holding time of requests is an exponentially distributed random variable. The intensity of servicing requests is equal to μ. The distribution law A(t) can be arbitrary.
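First, it is necessary to solve the equation given, for example, in [21]. Taking into account Expression (5), it can be represented in the following form:
$$\sigma = \sum_{i=1}^{n} a_i e^{-\mu (1 - \sigma) i \tau}. \qquad (10)$$
The value of σ is the only root of Equation (10) within the range from zero to one. After finding the value of σ, the values of S(1) and t_p are calculated using the following formulas [21]:
$$S^{(1)} = \frac{1}{\mu (1 - \sigma)}, \qquad t_p = \frac{1}{\mu (1 - \sigma)} \ln \frac{1}{1 - p}. \qquad (11)$$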
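The calculation for the G/M/1 model can be sketched in Python. The sketch assumes the standard G/M/1 results: the root σ satisfies σ = α(μ(1 − σ)), where α(s) is the Laplace–Stieltjes transform of a stepwise A(t) with increment a_i at the point i·τ, and the delay time is exponentially distributed with the rate μ(1 − σ). The fixed-point iteration, the function names, and the numerical values are choices made for this illustration.

```python
import math


def lst(s, increments, tau):
    """Laplace-Stieltjes transform of a stepwise CDF with jump a_i at i * tau."""
    return sum(a * math.exp(-s * i * tau)
               for i, a in enumerate(increments, start=1))


def solve_sigma(increments, tau, mu, tol=1e-12, max_iter=100_000):
    """Root of sigma = alpha(mu * (1 - sigma)) in (0, 1), iterating upward from 0."""
    mean = sum(a * i * tau for i, a in enumerate(increments, start=1))
    if 1.0 / (mu * mean) >= 1.0:
        raise ValueError("the load must be below one")
    sigma = 0.0
    for _ in range(max_iter):
        new = lst(mu * (1.0 - sigma), increments, tau)
        if abs(new - sigma) < tol:
            return new
        sigma = new
    raise RuntimeError("fixed-point iteration did not converge")


def mean_delay(sigma, mu):
    """Average request delay time in the G/M/1 system."""
    return 1.0 / (mu * (1.0 - sigma))


def delay_quantile(sigma, mu, p):
    """p-quantile of the exponentially distributed delay time."""
    return -math.log(1.0 - p) / (mu * (1.0 - sigma))


if __name__ == "__main__":
    a, tau, mu = [0.1, 0.4, 0.3, 0.2], 1.0, 1.0  # hypothetical data
    sigma = solve_sigma(a, tau, mu)
    print(sigma, mean_delay(sigma, mu), delay_quantile(sigma, mu, 0.95))
```

Because both characteristics share the factor 1/(μ(1 − σ)), their ratio depends only on p, which is why their relative errors coincide when two histograms are compared. Convergence of the iteration slows down as the load approaches one.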
When different histograms are used, different values of the parameter σ are obtained. This determines the resulting errors. Let us suppose that the analysis of two histograms yields the values σ_1 and σ_2. Using Formulas (11), it is easy to show that the relative errors in assessing the quantities S(1) and t_p are equal. This makes it possible to denote the relative error with a single letter, without indices. Therefore, the following relationship holds:
The values of σ_1 and σ_2 depend on the load of the queuing system, ρ. The greater the load, the closer the values of σ_1 and σ_2 are to one. This means, based on Expression (12), that the error can noticeably increase at a large load unless the values of σ_1 and σ_2 become very close to each other. To analyze the operation of queuing systems under conditions of a dramatically increasing load, it is appropriate to investigate the dependence of the relative error on the load ρ. For this purpose, the value of the service intensity, μ, should be changed within limits that allow this dependence to be investigated in the high-load range of interest.
When solving problems of designing networks and their individual components, a level of acceptable (permissible) load is selected, for which the investigation of the dependence of the error on the load is largely irrelevant. An important issue is the solution of the so-called inverse problem. For the G/M/1 model, the assessment of the values of S(1) and t_p using Formulas (11) is the simplest example of a direct problem. The inverse problem is to assess the value of μ under given limitations on the permissible levels of S(1) and t_p. For the G/M/1 model, the inverse problem is solved in an elementary way, but the result is two values of the required service intensity:
Since both limitations should be met, the value of μ is chosen as the larger of the two values. It follows from Relationships (13) that both values are inversely proportional to the normalized value that can be denoted by the letter “z”. The rest of the variables included in Relationships (13) represent a certain constant. Let us introduce the relative error in assessing z. After a series of simple transformations, the following assessment can be obtained for the relative error in the calculation of the service intensity μ:
In the area of generally accepted levels of this error (units of percent), the relative error doubles when the inverse problem is solved. This conclusion should be taken into account when carrying out calculations related to the design of those hardware and software facilities whose models can be represented by a queuing system.
2.7. Two Additional Tasks
The results obtained make it possible to solve two additional tasks of great practical significance. The first task is related to the choice of the approximating function f(t), which is a composition of two or more known distribution laws of random variables. The second task is to sample the function a(t), when it is a continuous function, so that it can be represented by Expression (5). The analysis of a number of models can then be simplified.
The measurement results show that, very often, the histograms a(t) have several extrema. For this reason, the measurement results are difficult to represent using one of the known distribution laws of random variables. It should be noted that such an approximation is useful for clarity but ineffective for the subsequent analysis of models due to the possibility of the occurrence of significant errors.
Analysis of a large number of histograms showed that a composition of two distributions given on different intervals of the abscissa axis is used most often. Figure 8 shows the original histogram a(t) with two extrema, as well as two functions that approximate it. In Figure 8, these functions are an Erlang distribution of a certain order and the Simpson distribution [22], respectively.
The approximating function f(t) can be represented as follows [22]:
The illustration above and Relationships (15) allow a better understanding of the nature of the CDF A(t). In some cases, the physical nature of the process of request arrival in the queuing system becomes clear.
For some types of the functions A(t), it is impossible to obtain relationships for analyzing the parameters of the request delay time. Then, A(t) can be replaced by a stepwise function. The results of solving this task are given in [11]. They allow the sampling period, d, with which readings of the values of a(t) are required to be taken, to be determined.
Let us assume that the admissible relative error, which determines the accuracy of the estimation of the delay time parameters, is given. If the coefficient of variation, CA, does not exceed one, then the numerical value of d is determined from the following inequality:
In some cases, the coefficient of variation, CA, significantly exceeds one. Therefore, it is required to apply another formula:
Expression (17) allows more accurate values of the sampling period, d, to be obtained for the functions A(t) with a very high coefficient of variation. These functions often reflect the actual processes of packet arrival at multiservice nodes. Moreover, high CA values are typical for multiservice node congestion modes. Such phenomena are often observed in emergency situations that generate a dramatic increase in traffic [17].
2.8. Areas for Further Research
According to the authors of this paper, further research should be carried out in four main fields. These fields are indirectly related to each other since they pursue one goal. It consists in optimizing a number of basic processes related to servicing multiservice traffic.
The first field of further research is related to the analysis of the stability of histograms, which are obtained following the result of measuring traffic parameters. Packet traffic is subject to dramatic fluctuations. For this reason, it is necessary to assess the results obtained in terms of their stability.
The second field of further research is aimed at solving forecasting problems. Changes in the form of the histograms and their parameters may serve as a good base for predicting changes in the CDF A(t). The results of such a forecast are important, in turn, for the design of telecommunication networks, as well as for their management in the event of congestion.
The third field concerns the analysis of the functions A(t) for which the intensity of the arriving request flow is not constant. In other words, it is advisable to obtain results similar to those obtained in this paper, but with the function λ(t) introduced instead of a point assessment of the intensity of the request flow.
The fourth field is related to the study of continuous functions that are represented by histograms. This approach is relevant for the investigation of the symmetric beta distribution [22]. It is recommended for the analysis of traffic related to the Internet of Things [13].