1. Introduction
Monitoring the Quality of Experience (QoE) is nowadays crucial for the main entities involved in the application service chain: (i) mobile network operators (e.g., Free, Orange and SFR in France), (ii) service providers (e.g., Google, Microsoft) and (iii) end-users (e.g., customers or companies). Estimating the QoE in terms of the user’s Mean Opinion Score (MOS) can help operators and providers identify performance anomalies and resolve them in order to retain their customers [
1].
In fact, measurement techniques at the end-user side for assessing the QoE have attracted the attention of many research works guided by operators, service providers or the academic community, because they provide an easy and cheap way to collect data at a large scale [
2]. In cellular networks, for a given base station (e.g.,
eNB), an operator can measure all physical parameters related to all user equipment associated with it. However, with mobility, some users may dissociate from one eNB and associate with another (if one exists). In this case, the operator is not able to keep measuring (monitoring) the perceived QoE for that user during the complete service run time. In addition, a user may be completely disconnected due to a lack of coverage. Moreover, the trend towards end-to-end encryption (like HTTPS) has significantly reduced the visibility of network operators into the application parameters (buffer information, etc.) and the traffic of their customers, making the monitoring process more challenging and cumbersome [
3].
Therefore, our objective is to focus on measurements on the end-user terminals to collect radio indicators, such as the Reference Signal Received Quality (RSRQ) and the Reference Signal Received Power (RSRP) for cellular networks, in addition to some application metrics like the buffering time before playing a video on demand. We aim at estimating the QoE of some popular Internet services using different kinds of user terminals, in a large region covered by many radio units and through several mobility test modes. We would like to understand the causes of the observed problems and, as a consequence, to come up with helpful recommendations to mitigate the observed poor performance.
To achieve this study, we proceed as follows. First, we survey some important related crowdsourcing-based studies and highlight the main goals and applied use cases of the collected traces. Second, we collect our own crowdsourcing dataset composed of more than
traces from one of the major French mobile operators in the Île-de-France region, using two different terminals. Third, we clean, process and annotate the collected dataset by implementing a known QoE model per considered service. In particular, we calculate the user’s Mean Opinion Score (MOS) per service. Then, we use the radio indicators to address three interesting use cases: (i) a statistical study of the data in heterogeneous environments, (ii) anomaly root cause analysis for the considered services, and (iii) a discussion of the utility of estimating the MOS based only on the radio indicators (e.g., bitrate), as done in many previous works in the literature. We provide publicly in [
9] the dataset with all the Python codes [
10] for regenerating the analysis or reusing them on other datasets.
The remainder of the paper is organized as follows: In
Section 2, we introduce the related works including key existing mobile crowdsourcing works and the main applied use cases with the collected datasets. In
Section 3, we introduce our cellular measurement campaign and the collected dataset. The pre-processing of the collected dataset is explained in
Section 4. In
Section 5, we describe three use cases of our proposal. In particular, we achieve first in
Section 5.1 a statistical study in heterogeneous environments. Then, in
Section 5.2, we evaluate the root cause of bad performance of key Internet services. Third, the impact of radio indicators on the video QoE is evaluated in
Section 5.3. Finally,
Section 6 concludes this work and presents future work. It is worth mentioning that the list of acronyms used in this work is presented in the abbreviations section at the end.
2. Related Works
The research trend in the context of mobile crowdsourcing aims to address practical challenges (e.g., traffic prediction, traffic classification, traffic routing, congestion control, resource management, QoE management, network security, etc.) and to meet the needs of the system actors (users, operators, providers). As a result, numerous mobile crowdsourcing campaigns have been conducted in order to collect real datasets and to permit the study of a particular challenge for mobile operators, service providers, or device vendors. In this kind of campaign, different elements have to be taken into account, such as the measurement tools, the devices used, the considered services, the access technologies and the mobility test modes [
11,
12,
13,
14]. In fact, collecting datasets requires a lot of time and resources, in addition to the mobilization of volunteers and/or testers to carry out the tests. Mobility increases the complexity of crowdsourcing campaigns, as the geo-localization of the users or connected devices is an important factor and directly impacts the perceived quality [
14,
15,
16].
In [
14], a mobile dataset was collected by driving a car along a distinct route in the Sydney metropolitan area, considering the cellular networks of 2008. The goal is to study the impact of mobility in a vehicular network context.
In [
15], the authors use three mobility test modes (static, car, high-speed rail) for the
network between two large metropolitan areas in China: Beijing and Tianjin. The objective is to evaluate the network performance, mainly the handover process, in high mobility (300 km/h or above).
In [
13], the physical indicators for
and non-
technologies are considered. The collected data concern two network providers in three countries (U.S., Germany, Netherlands). Indeed, many mobility patterns are tested including sitting/standing, walking, biking (fast), car, bus, trains and planes. The goal of this study consists of statistical analysis of the impact of mobility speed on
performance.
In [
17], the authors publish a mobile dataset for
and
technologies, collected around Covenant University in Nigeria. In this study, the indoor pattern is evaluated for two months, between June and July 2020. All measurements were taken from 7:30 a.m. to 11:00 p.m. The goal is to investigate the performance of local operators’ networks. This is one of the first studies on cellular technologies in Africa.
The authors in [
2] used a
tool to collect a mobile dataset at the Institute of Telecommunications, TU Wien, Vienna, Austria. The authors chose a static indoor pattern to analyze, on the one hand, the short-term fluctuations of the measured key performance indicators and, on the other hand, time-of-day effects on
networks. Another goal is to train a traffic throughput predictor (by machine learning) in a dense urban environment.
The work [
11] presents a large dataset, belonging to the company
, which contains more than
million crowdsourced network measurements, collected in Germany between January and July 2019. Compared to the other presented datasets,
,
performance and
are evaluated. According to the authors, the measured values differ between individual measurements and the mean value for an area. This is why it can be helpful to study in depth the individual measurements and not just take into account the performance of a global area.
In [
12], the authors use the “G-NetTrack Pro” tool in different mobility patterns to provide a
dataset for addressing two use cases. The first one is
algorithm evaluation: they compared different chunk-selection optimization algorithms. The second one is handover analysis.
Using the same tool, in [
18], the authors produce a
dataset, collected from a major Irish mobile operator, and a synthetic
large scale multi-cell NS-3 simulation framework. Their goal is to study
deployment. They consider video streaming and file downloading services and aim at understanding the dynamic reasoning for adaptive clients in
multi-cell wireless scenarios.
One important use case of crowdsourcing for mobile network operators is the estimation of Key Performance Indicators (KPIs) and relevant Key QoE Indicators (KQIs) to quantify the end user’s perceived quality. It is also crucial for operators to easily produce coverage maps with performance indicators, both to demonstrate that the coverage commitments on which their licenses are conditional have been met and to limit customer churn due to quality dissatisfaction. In fact, with privacy constraints and the adoption of end-to-end encryption, operators do not always have access to these indicators via crowdsourcing. Instead, they appeal to machine-learning models to predict multiple QoE-relevant metrics (KQIs) directly from the analysis of the encrypted traffic, as done in [
3]. In this context, we can cite the work [
4], where the authors provide an estimation system of YouTube video bitrate using a decision tree classifier. In [
5], the authors test five ML methods for YouTube video QoE estimation by using a dataset of 1060 video sessions. They found that up to
QoE classification accuracy is achieved with the Random Forest (RF) method, using only features extracted from the encrypted traffic. In [
6], the authors introduce another methodology, called eMIMIC, that estimates average bitrate and re-buffering ratio for encrypted video traffic. Three datasets of cellular traces (
,
) are used. The results indicate that the re-buffering rate is estimated with an accuracy of up to
, in addition to average bitrate estimation with error under 100 kbps for up to
. Another approach, which estimates KQIs from physical-layer parameters, has also attracted some research attention. The authors in [
7] have built a QoE model for videos delivered over a radio network (e.g., Long Term Evolution (
)) using HTTP (Hypertext Transfer Protocol) Adaptive Streaming (HAS). Their objective is to compare QoE prediction based on the HAS profile (video presentation) with prediction based on radio parameters (physical layer). They concluded that the HAS profile is sufficient, and better than the radio parameters, to estimate the user’s QoE in the context of
technology. Based on the same technology, the authors of [
8] introduce a no-reference video streaming QoE estimator by testing different machine learning techniques. The Gradient Tree Boosting (GTB) method is selected to compute the video QoE using 11 radio parameters. This model achieves
of correlation and
of MSE. At the end, the authors concluded that the radio parameters related to the transmission rate of the streaming session are the most important features in predicting QoE for the GTB algorithm.
In fact, many related works refer to mobile datasets that are produced by either operators without publishing any detail or by research groups but with a reduced number of parameters [
13]. In order to have a fully detailed dataset of our region and to study the performance of the most popular Internet services (e.g., video streaming, downloading and web browsing) over a popular French mobile access network, we decided to build our own dataset. In addition, we would like to study the impact of the radio parameters on the QoE when streaming videos at a fixed HD quality rather than using adaptive streaming, as done, for example, in [
7] in order to evaluate the root cause of poor performance as discussed in detail in
Section 5.
3. Campaign Test and Data Description
3.1. Campaign Description
Before evaluating the network and service performance, we first present in this section the data collection procedure. The latter is composed of three parts: (i) the test campaign description, (ii) the collection of the data traces, and (iii) the raw data presentation. Then, we detail the data pre-processing phase.
To begin, we illustrate in
Figure 1 an overview of the achieved campaign.
In this campaign, we used the “5Gmark” tool for a variety of reasons, including its simplicity and efficiency in evaluating several services under many mobility test modes. In particular, “5Gmark” measures the cellular connection through three modes: “Speed Test”, “Full Test”, and “Drive Test”. In practice, the “Speed Test” is one test cycle that assesses the connection quality by measuring only the latency (in milliseconds), a download data test (for 10 s) and an upstream data transfer (uploading for 10 s). The “Full Test” is also one test cycle that integrates, in addition to the “Speed Test” data, the measurement of two additional services: YouTube streaming (displaying a YouTube HD video for 30 s) and web browsing (testing the connection to 6 websites for 10 s each). The “Drive Test” represents a set of “Speed Tests” or “Full Tests” that runs automatically with a test counter (5, 10, 20, etc.) and an interval between tests in minutes (0 by default). Note that the server is selected for each service according to the user position, regardless of the operator.
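For illustration, the structure of these test cycles can be sketched as a simple configuration. This is a minimal sketch assuming a “5Gmark”-like tool; the field names are ours and do not reflect the tool’s actual API, only the durations stated above.

```python
# Minimal sketch of a "Full Test" cycle (durations from the text;
# field names are illustrative, not the tool's actual API).
FULL_TEST = [
    {"service": "ping",     "duration_s": None},  # latency only, in ms
    {"service": "download", "duration_s": 10},
    {"service": "upload",   "duration_s": 10},
    {"service": "youtube",  "duration_s": 30},    # HD video playback
    {"service": "web",      "duration_s": 60},    # 6 sites x 10 s each
]

def drive_test(cycles, interval_min=0):
    """A 'Drive Test' is just a repeated sequence of full tests."""
    return [{"cycle": i, "interval_min": interval_min, "tests": FULL_TEST}
            for i in range(cycles)]
```

A “Drive Test” with a counter of 5 thus simply replays the full-test sequence five times.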
To study the impact of the user terminal, we consider two Android smartphones, the Xiaomi Redmi Note 9 Pro and the Samsung A10, which have different characteristics. During our tests, three different access technologies, namely 2G, 3G and 4G, are evaluated. The collection and analysis methodology also applies to newer generations. In this evaluation, a wide variety of parameters is collected, including application metrics, radio indicators and context information. These parameters are gathered using an active data collection procedure that covers five services: latency test (ping), download data test, upstream data transfer, web browsing and video streaming, as illustrated in
Figure 1.
The measurements were carried out over six months, from March to September (except August) 2021, in the Île-de-France region, using two main categories of mobility modes. The first mode is the car mode, with a maximum speed of 130 km/h. This mode includes travel by car on some highways around Paris as well as bus lines in the Paris center. The second mode is the train mode, with a maximum speed of 120 km/h. This mode includes the regional train in Paris (RER) and the subway (metro).
3.2. Collection Procedure
The strategy for collecting the data is as follows: eight testers participated in the data collection. They are teachers or students of the ESME Sudria School (France). All of them used one of the two considered Android smartphones (Xiaomi Redmi Note 9 Pro or Samsung A10) with identical SIM cards (from one single operator). The participants collected the majority of traces during morning and evening hours on working days along the home-work trajectory, in addition to some random mobility during the weekend. The “Drive Test” mode is programmed to execute a set of full tests, each one composed of a combination of the five applications.
Figure 2 presents the structure of one complete “Test cycle” (full test).
We will focus later on the analysis of the three most popular services: (i) file downloading, (ii) video streaming, and (iii) web browsing as they generate most of the Internet traffic. Below we present a synthetic description of each selected service:
File transfers are carried out in a single thread, representative of customer usage, so as not to artificially maximize throughput. The downloaded files have about twenty different extensions.
The video streaming service is carried out on the YouTube content platform. The video is viewed in 720p resolution for 30 s with no change in quality.
There are six web browsing tests. Each test tries to request and view pages from international and national web servers for 10 s. The six sites are selected randomly from a predefined list of 30 popular sites; Figure 1 shows an example of six selected websites.
3.3. Raw Dataset Description
During the campaign test, we collect two raw datasets that contain
traces and 2742 test cycles (sessions). The first dataset will be used to deeply analyze the services and to study the problem root cause (
Section 5.1 and
Section 5.2). The second dataset is used to explore the possibility of predicting video metrics and user’s QoE from the radio indicators (
Section 5.3). In fact, the traces consist of more than 100 parameters, subdivided into four categories in terms of data type: (i) categorical data, (ii) numerical data, (iii) temporal data (measurement instant), and (iv) spatial data (GPS coordinates, geographical area of the trace, etc.).
Figure 3 shows the distribution of the raw data measurements on the map in the east of the Île-de-France region (France).
Figure 3 is obtained from the “5Gmark” user dashboard at the end of the crowdsourcing phase. The color within the plot represents the service status (good, medium or bad) according only to the bitrate value. A detailed analysis based on all of the parameters is presented in the next section. Indeed, as the data are collected mainly using two transport modes (train and car), we clearly observe in this figure that the measurements are localized on roads, highways and rails crossed by trains/metros/buses. Looking at the simple service status reported in the figure, it is obvious that most of the high-bitrate connections are located in city centers like “Ivry-sur-Seine”, “Meaux” and “Paris”, and that most of the bad service statuses are obtained on highways and rails, like the rail line (SNCF Train line P) in the middle, which contains a lot of red points.
3.4. Dataset Pre-Processing and Feature Selection
Once the initial raw data of traces are obtained, we pass to the data cleaning step that implements the following two operations named “
feature selection” and “
data preparation”. Concerning the feature selection operation, we have applied a correlation study between the main mobile physical parameters and the service status.
Figure 4 shows the correlation results for the video streaming service. We notice that RSRP and RSSNR are the two features most correlated with the service status (i.e., the bitrate), with 0.23 and 0.15, respectively. The RSRQ, LTE_asu and LTE_dbm come next with 0.13. We have selected RSRQ as the third important feature, as this is consistent with what network operators generally do for radio provisioning [
19].
Concerning the “data preparation”, we have decided to simply discard entire rows and/or columns containing missing values once the features are selected, as we have enough measurements and do not want to consider any modified data.
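These two cleaning operations can be sketched as follows. This is a minimal pure-Python sketch, not the actual analysis scripts of our repository [10]; the field names (“rsrp”, “rsrq”, “rssnr”, “bitrate”) are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(traces, radio_keys=("rsrp", "rsrq", "rssnr"), target="bitrate"):
    # Data preparation: discard incomplete rows instead of imputing values.
    clean = [t for t in traces
             if all(t.get(k) is not None for k in (*radio_keys, target))]
    # Feature selection: correlate each radio indicator with the bitrate.
    scores = {k: pearson([t[k] for t in clean], [t[target] for t in clean])
              for k in radio_keys}
    return clean, sorted(scores.items(), key=lambda kv: -kv[1])
```

The same logic is a one-liner with pandas (`df.dropna()` followed by `df.corr()`); the pure-Python form above only makes the two steps explicit.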
After that, and to analyze the mobile experience from an end user’s perspective, we calculate the user’s Mean Opinion Score (MOS) for each considered service using an appropriate QoE model from the literature, and we annotate the dataset with this new feature. In particular, we derive the MOS (i) from the bandwidth rate as in [
1,
20] for the downloading service, (ii) from the bitrate for HD video streaming service as in [
21,
22], and (iii) from the application buffer information as in [
23,
24] for the web browsing service. After calculating the user’s MOS values, we represent the MOS on a 5-level scale and a 3-level scale, as done in the literature [
25].
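The annotation step can be sketched as follows. The mappings below are simple placeholders, not the exact per-service QoE models cited above; the indicator used per service (bitrate for download/video, buffer information for web) follows the text.

```python
# Hedged sketch of the MOS annotation: placeholder linear models, then
# the 5-level and 3-level scales used in the paper.
def annotate_mos(service, value):
    if service in ("download", "video"):     # value: bitrate (placeholder units)
        mos = 1.0 + min(value, 20.0) / 5.0   # saturates at MOS 5
    elif service == "web":                   # value: buffer/load time in s
        mos = max(1.0, 5.0 - value / 2.5)
    else:
        raise ValueError(service)
    return mos

def mos_levels(mos):
    """Map a continuous MOS to the 5-level and 3-level scales."""
    five = min(5, max(1, round(mos)))
    three = {1: "bad", 2: "bad", 3: "medium", 4: "good", 5: "good"}[five]
    return five, three
```

With real models substituted for the placeholders, the same two functions produce the MOS annotation columns of the final dataset.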
Table 1 reports the final clean dataset with the considered features, ready for the in-depth analysis. A detailed description and the Python script files for reproducing the analysis can be found in our GitLab repository [
9].
After filtering incomplete entries,
trace measurements remain. Each trace is composed of 22 parameters corresponding to one of the four data types named categorical, numeric, temporal and spatial as presented in
Table 1. Among these features, we find meta information like the measurement instant, the geo-coordinates, and the location; the dataset also includes device information (such as the device model). In addition, key physical parameters are included, such as the Reference Signal Received Quality (RSRQ), the Reference Signal Received Power (RSRP), and the bitrate. Finally, some application parameters, such as the initial buffering time, the rebuffering duration, and the rebuffering frequency, are collected.
4. Data Analysis
In this section, we divide our analysis according to the services, network access technologies, mobility patterns and device types.
4.1. Services
To begin,
Table 2 reports the number of examples and the rate of the selected services (downloading, video, web).
Indeed, we notice that the traces concerning the web and video services are more numerous than the traces of the download service. This is due to the configuration of the “Full Test” mode itself, presented in Figure 2: the traces are collected for just 10 s for the download service, while they are collected for 30 and 60 s for the video and web services, respectively. This observation also explains why the bitrate of the downloading traces is almost always lower than the bitrate of the video service (a short transfer is penalized by the initial ramp-up of the connection), as shown in
Table 3.
We found, as in the literature, that the video service is the most demanding service in terms of bandwidth, with 33 MB/s on average. The web service is the least demanding in terms of bandwidth. This is justified by the fact that viewing web pages, which are small in size compared to HD video segments, does not require much bandwidth.
Regarding the downloading service, we notice that it has the shortest launch time compared to the video and web services. This is logical, as the user requires the complete download before viewing the file.
In addition, we have noticed that the video bitrate peak exceeds 237 MB/s. This peak is surely the result of the most advanced access technology configuration deployed by the French operator. Note that we did not collect 5G traces, but the study remains applicable to them.
4.2. Technologies
To study the influence of the access technologies used in this study,
Table 4 shows the number of measurements collected during our test campaign, as well as the percentage of measurements made using each access technology (2G, 3G and 4G).
As the phones used are not 5G compatible, the greatest number of measurements is of the 4G type. This implies that 4G technology was still widely used in France in 2021, due to the modest NSA deployment at the time. Moreover, a general lesson to be drawn is that 4G deployment in the Paris region is good, even if it cannot satisfy the requirements of new-generation applications (e.g., cloud Virtual Reality).
Older technologies are still used to ensure continuity of service in some complex urban sectors. Finally, in all the evaluated locations, only a small fraction of the measurements was carried out using a smartphone with 2G access. This is in line with the outcome of recent measurement works [
11,
26] done in France and Germany.
4.3. Mobility Aspects
Figure 5 shows a histogram of the number of traces collected versus speed, using a step of 10 km per hour.
We see that, during our test campaign, we collected measurements at different speeds, reaching more than 140 km/h. From the histogram in Figure 5, we notice that the greatest number of examples is located at speeds below 10 km/h. This is justified by the reduced speed during traffic jams (especially in the Paris center) for cars and buses, as well as by the frequent stops made by metros (subways). From 60 km/h up to 130 km/h, the number of traces is roughly distributed according to a normal distribution with an average speed of 85 km/h.
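The binning behind the histogram of Figure 5 can be sketched in a few lines. This is a pure-Python stand-in for `numpy.histogram` with fixed 10 km/h bins; the example speeds are illustrative.

```python
# Count traces in 10 km/h bins, as in the speed histogram of Figure 5.
def speed_histogram(speeds_kmh, bin_width=10, max_speed=150):
    bins = [0] * (max_speed // bin_width)
    for s in speeds_kmh:
        if 0 <= s < max_speed:
            bins[int(s) // bin_width] += 1  # e.g., 85 km/h -> bin index 8
    return bins
```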
4.4. Device Types
The impact of the characteristics of the user terminal is detected in our dataset. In fact, we have displayed the statistics in terms of bitrate for the two phones used (Redmi Note 9 Pro and Samsung A10).
Table 5 presents a summary of these results.
From
Table 5, we can clearly see that using different terminals leads to different performance. Indeed, we observe that the average bitrate using the “Redmi” phone is 24 MB/s against 15 MB/s for the Samsung A10, an increase of 60%, justified by the better hardware characteristics of the “Redmi”.
In addition, we also note that the maximum bitrate reaches 237 MB/s and 113 MB/s for the Redmi and the Samsung A10, respectively. Thus, the maximum bitrate measured with the Redmi is about twice that of the Samsung. This can be justified by the compatibility of the Redmi with a more advanced access technology configuration, while the Samsung A10 supports only a basic access technology. In addition, we will see later that the antenna gain of the Redmi is noticeably better than that of the A10.
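The per-device comparison of Table 5 amounts to a simple group-by over the traces. The sketch below uses pure Python and illustrative numbers, not the campaign’s values.

```python
# Group traces by device model and report mean and max bitrate,
# as in the Table 5 comparison (illustrative data only).
def device_stats(traces):
    groups = {}
    for t in traces:
        groups.setdefault(t["device"], []).append(t["bitrate"])
    return {d: {"mean": sum(v) / len(v), "max": max(v)}
            for d, v in groups.items()}
```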
5. Use Cases
Mobile data collection has received significant attention in recent years as described in the related works (
Section 2). This is because it is important for several use-cases as discussed in the survey [
27] including traffic prediction, enhancing routing, traffic classification, resource management, root cause analysis and QoS/QoE management.
We are interested here in three main use cases. The first one is the analysis of measurements in heterogeneous environments. By heterogeneous environments, we mean the use of different user terminals, the consideration of three application services, and the application of two mobility test modes.
The second use case presents the root cause analysis (RCA) of the poor quality identified in some sectors for a given service. The idea is to examine the performance of the connections that appear “poor” from the system’s point of view (i.e., with a failure status) or from the user’s point of view (when the MOS score is 1 or 2). Notice here that, even when the user application succeeds in connecting to and obtaining a response from a server (i.e., system-view status “OK”), the service quality can still be poor for the user (e.g., a low bitrate).
The third and last use case consists of studying the impact of the radio parameters on the video metrics and the user QoE using the test cycles (sessions) dataset. The objective is to explore the possibility of predicting the overall video quality with ML techniques using radio parameters including RSRP, RSRQ and RSSNR.
5.1. Use Case 1: Data Statistical Study in Heterogeneous Environments
In heterogeneous environments, one of the challenges concerning the mobile network sector is the management of the handover mechanism and of the best station coverage. The use of radio quality indicators, such as the RSRP and the RSRQ in Long Term Evolution (LTE) systems, is very helpful in this context [
28]. Therefore, the first use case of our dataset is the study of the possible relations between RSRP and RSRQ, on the one hand, and their impact on the quality of the services, on the other hand, from both the system and user points of view. The objective is to come up with recommendations on the best signal indicator ranges per service and per user terminal [
19].
To that end, we rely on the system-view quality, which represents the service status on two levels: a success status (the service is supposed to work fine) and a failure status. This status is given automatically by the evaluation tool and is calculated based on the bitrate for the three tested services: file downloading, video streaming and web browsing.
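The system-view status can be sketched as a simple threshold rule on the bitrate. The per-service thresholds and the “OK”/“FAILED” encoding below are assumptions for illustration, not the tool’s actual values.

```python
# Sketch of the system-view status: flag a trace as success or failure
# from the measured bitrate. Thresholds are assumed, for illustration.
THRESHOLDS_MBPS = {"download": 1.0, "video": 2.5, "web": 0.5}  # assumed

def system_status(service, bitrate_mbps):
    if bitrate_mbps is None:          # no response obtained from the server
        return "FAILED"
    return "OK" if bitrate_mbps >= THRESHOLDS_MBPS[service] else "FAILED"
```

Note that a trace can be “OK” from this system view while still yielding a low MOS from the user’s point of view.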
Table 6 shows the statistics about physical parameters per service for the two devices used.
As in [
19], most of the results are expected in the case of video streaming and web browsing. However, for a small number of traces, we notice that, despite a very good received signal strength on the radio side and regardless of the user terminal, the user is not able to download the content. Therefore, we conclude that the poor performance here is caused by the servers and not by the radio provisioning. Most likely, the servers were overloaded at the moment these few traces were measured. Caching the popular contents per region at the edge may help in such a scenario.
Next, we focus on the video streaming and web browsing services. In fact, we took the average values of the RSRP and RSRQ classes to obtain an overview of the recommended signal strength levels for these two services, as illustrated in
Figure 6. Note that the measurements of the two used devices (Redmi and Samsung) are taken into account.
By comparing our results in
Figure 6 against the recommendations published in [
19], we notice that the results are similar in the two considered services (video and web). Indeed, the same RSRP and RSRQ parameter thresholds are found in the cases of the categories: “Excellent”, “Good” and “No Signal”. In addition, we further refined the study of the “Poor to Fair” category by finding the thresholds that allow this category to be subdivided into two levels: “Acceptable” and “Very poor”, where the quality of video service is acceptable between
(dBm) and
(dBm) for the RSRP indicator, and web service quality is also acceptable between
(dBm) and
(dBm). Concerning the RSRQ indicator, the acceptability threshold is
(dBm) for the two services.
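Operationally, this category study amounts to binning the RSRP into quality classes. The sketch below uses placeholder cut-off values, since the recommended thresholds are service-specific and given in Figure 6; only the category names follow the text.

```python
# Bin RSRP (dBm) into the quality categories of Figure 6.
# The cut-off values below are illustrative placeholders.
RSRP_EDGES = [(-80, "Excellent"), (-90, "Good"),
              (-100, "Acceptable"), (-115, "Very poor")]  # assumed dBm edges

def rsrp_category(rsrp_dbm):
    if rsrp_dbm is None:
        return "No Signal"
    for edge, label in RSRP_EDGES:
        if rsrp_dbm >= edge:
            return label
    return "No Signal"
```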
According to these results, we have concluded that the RSRP values are interesting for studying each service separately in our collected dataset. Furthermore, we observe that the RSRQ does not show significant variations. For this latter reason, we focus on the RSRP to study the impact of the user terminal type in a heterogeneous environment. To do this, we present in
Figure 7 the RSRP indicator thresholds by service (web or video) and by user terminal. In our case, we measured with two devices: Xiaomi Note Pro 9 (noted Redmi) and Samsung A10 (noted Samsung).
From
Figure 7, we find that the acceptable RSRP threshold is different for each user terminal. In fact, with the Redmi device, the acceptable threshold is
(dBm) for both web and video services, while it equals
(dBm) for the Samsung device. Consequently, we notice that the Redmi performs better than the Samsung, which suggests that the gain of the receiving antenna is higher in the Redmi device.
As both devices are used in the same place, with the same direction and distance of the base station and for similar technology (frequency band), and according to the Friis Equations (4)–(7) in [
29], we conclude that the antenna of Redmi has a better reception gain than the one of Samsung.
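This reasoning can be illustrated with a minimal free-space Friis link-budget sketch in Python. The transmit power, gains, distance and frequency below are illustrative values, not measurements from our campaign:

```python
import math

def friis_received_power_dbm(p_tx_dbm: float, g_tx_dbi: float, g_rx_dbi: float,
                             distance_m: float, freq_hz: float) -> float:
    """Free-space Friis link budget in dB form: with distance and frequency
    fixed, the received power grows directly with the receive antenna gain,
    which is the argument used to compare the two terminals."""
    wavelength = 299_792_458.0 / freq_hz
    fspl_db = 20.0 * math.log10(4.0 * math.pi * distance_m / wavelength)
    return p_tx_dbm + g_tx_dbi + g_rx_dbi - fspl_db

# Same transmitter, distance and band: a 3 dBi higher receive gain yields
# exactly 3 dB more received power.
a = friis_received_power_dbm(43.0, 15.0, 0.0, 500.0, 2.6e9)
b = friis_received_power_dbm(43.0, 15.0, 3.0, 500.0, 2.6e9)
print(round(b - a, 6))  # -> 3.0
```

Since all link-budget terms except the receive gain cancel out under identical measurement conditions, the observed RSRP offset between the two devices maps directly to their antenna gain difference.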
5.2. Use Case 2: Root Cause Analysis for Service Problems
Understanding a service quality problem often requires a definition of the “poor service quality events” that occur in a test cycle (
Figure 2). We define four events that reflect problems during any test cycle. The first one is called a “radio provisioning problem”; it happens when none of the services is functional. The second and third are the “Web problem” and the “Download problem”; they appear in a test cycle when the web or download service does not work while, in the same cycle, the video service, which is more demanding in bitrate and more sensitive to delay, keeps working. We pay attention here to the application buffer, as it may help video streaming to keep working even during a temporary disconnection or interference. The fourth event, where both the web and download services fail while the video service keeps working, is called the “servers problem” because the problem most likely comes from the web and download servers.
Table 7 lists these problems.
To dive into the analysis, these problem events are examined in two different modes, counting their occurrences based on (i) the trace cycles or (ii) the geographic area sectors.
5.2.1. Trace Cycles-Based Analysis
The first mode is based on the test cycles: among the 2742 analyzed test cycles, we find out how many exhibit these problem events and look for the root cause. To that end, we use the Pandas Python library to slice the data frames and select the desired trace cycles for both the system and user views as defined previously [
10].
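As a hedged illustration of this slicing step, the selection could look as follows. The column names and values are hypothetical stand-ins, not the actual dataset schema:

```python
import pandas as pd

# Hypothetical trace-cycle table: one row per test cycle, one boolean
# status column per service (column names are illustrative only).
cycles = pd.DataFrame({
    "web_ok":      [True, False, True, False, True],
    "download_ok": [True, False, False, True, True],
    "video_ok":    [True, False, True, True, True],
})

# Radio provisioning problem: no service works at all.
radio_problem = cycles[~cycles.web_ok & ~cycles.download_ok & ~cycles.video_ok]

# Download problem: download fails while the more demanding video works.
download_problem = cycles[~cycles.download_ok & cycles.video_ok]

# Web problem: web fails while video works.
web_problem = cycles[~cycles.web_ok & cycles.video_ok]

print(len(radio_problem), len(download_problem), len(web_problem))
```

The same boolean-mask slicing can be applied to the system-view and user-view status columns to produce the per-event counts.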
Table 8 presents the results for the two points of view.
From the results, we observe that the problems occur rarely, affecting only
(58 traces) of the dataset. In these cases, the Web and download problems are detected slightly more often than the “radio provisioning” and “servers” problems. In particular, in 36 test cycles, the download service does not work well while the video service does. Conversely, from the user’s point of view, 23 test cycles have the video service working well while the web browsing service works less well. This is due to the model used to define the status (system view), which is based on the physical bitrate, whereas the user MOS of the web service is computed from the launch time [
23,
24].
Now, let us analyze the root causes of these problems. When we explore the six traces of the provisioning problem, we note that the radio indicators (RSRP, RSRQ, etc.) are very poor during the whole test cycle, which explains why none of the services is available. Concerning the web and download problems, we notice that the radio indicators are good, and we therefore assume that the servers are not responding well to their load.
5.2.2. Geographical-Based Analysis
The second mode of analyzing the measurement dataset consists of subdividing the global geographic region into
small zones according to the GPS coordinates. The goal is to find the number and size of small zones where the problem events (
Figure 8) occur and to relate the findings to the geographical placement of the base stations and to the mobility mode (e.g., highway or dense urban sector). This may give us an idea of the handover moments, the existence of near or far stations, possible radio problems due to fast fading, etc.
Initially, the measurements were collected over the Île-de-France geographical region, a rectangle 96 km wide (longitude) and 68 km high (latitude). To find the geographic zones that help us to understand the root cause of the problem events, we divide the region into 1024 zones (
) as illustrated in
Figure 8.
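A minimal sketch of this subdivision, assuming a 32 × 32 grid over the 96 km × 68 km rectangle with coordinates expressed in km from one corner of the measurement area (the coordinate convention is an assumption for illustration):

```python
GRID = 32  # 32 x 32 = 1024 zones

def zone_index(x_km: float, y_km: float,
               width_km: float = 96.0, height_km: float = 68.0) -> tuple:
    """Return (column, row) of the zone containing a point given in km
    from the south-west corner of the measurement rectangle."""
    col = min(int(x_km / width_km * GRID), GRID - 1)
    row = min(int(y_km / height_km * GRID), GRID - 1)
    return col, row

# Each zone is 96/32 = 3 km wide and 68/32 = 2.125 km high.
print(zone_index(0.0, 0.0))    # south-west corner -> (0, 0)
print(zone_index(95.9, 67.9))  # north-east corner -> (31, 31)
```

Each GPS trace point can then be bucketed into a zone, and the per-zone counts of problem events aggregated as in the first mode.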
Using the same methodology as in the previous mode, based on Python capabilities, we compute the statistics of the service problem events per geographic zone. The results are given in
Table 9 below.
As a result, we found that, of the 1024 (32 × 32) geographical parts, 771 are empty, i.e., they contain no measurements. For the remaining parts, we see in
Table 9 the number of those where problems happen. Thus, we confirm the conclusions of the first mode, which indicate that the problems occur rarely with approximately
of the whole considered geographic area (Île-de-France region). Moreover, the download service does not work well while the video service does in 8 and 13 geographic parts according to the system and user points of view, respectively.
Concerning the other problems, we can visualize in
Figure 8 some locations on the map where these events occurred. We clearly see that the radio provisioning problems occur in the geographical parts where the red color (bad quality) dominates. Furthermore, we observe that the web problems occur where the green color dominates, due to the good quality of the video service there. Finally, concerning the servers problems, the dominant colors are orange and red because of the bad quality of the download and web services. This can be explained by the fact that, when moving from good to bad cellular coverage, the video application can mitigate a temporary disconnection thanks to its application buffer, whereas the web service lacks such a buffer.
5.3. Use Case 3: Impact of Radio Parameters on the Video Metrics and User QoE
As already discussed in
Section 2, we are interested in studying the impact of the radio parameters on fixed HD video streaming quality over the
mobile network. In particular, we would like to evaluate the impact of the three radio signal references (RSRP, RSRQ and RSSNR) on both the user MOS and the video KQIs (i.e., bitrate and launch time). In other words, based on these three physical parameters, can we predict the video streaming performance (MOS and KQIs)? To answer this question, we train several machine learning models that take these three parameters as input features and give the targeted video metrics (MOS, bitrate, and launch time) as output. It is straightforward to train and predict the bitrate and the launch time with ML regressors, as we have everything in our dataset. However, we do not directly measure the user’s MOS in our campaign, so we need to build this feature in the dataset for every video session before building and training any regressor. To that end, we calculate the MOS based on key models from the literature. Two approaches exist in the state of the art to compute the MOS value. The first (
) is based on the bitrate video values like in [
21,
22], and the second (
) is based on the buffer information like in [
23,
24]. We will denote by (
) the user’s MOS score, i.e., the minimum of the MOS values calculated by the two approaches for each video session (we have 746 completed video sessions). We consider here the worst case of perceived quality between the buffer-based MOS and the bitrate-based one. In the remainder of this part, we explain the implementation of the various steps of our prediction evaluation using the dataset [
30].
Thus, to begin, let
,
,
, and
denote the video bitrate, the initial buffering time, the rebuffering duration, and the rebuffering frequency, respectively. Based on the study [
22], we derive the video MOS from the video bitrate as follows (
1):
The idea is to obtain a continuous MOS score for HD video resolution over the interval from 0 to 9 Mbps, as shown in
Table 10. The MOS is considered excellent (value of 5) when the bitrate is larger than 9 Mbps.
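The exact form of Equation (1) is given in [22]; purely as an illustration of the stated constraints (a continuous score over 0–9 Mbps that saturates at 5), a hypothetical linear mapping could look like this. The linear shape is our assumption, not the published model:

```python
def mos_from_bitrate(bitrate_mbps: float) -> float:
    """Hypothetical continuous bitrate-to-MOS mapping for HD video.
    It only illustrates the constraints stated in the text: a continuous
    score over [0, 9] Mbps that saturates at the excellent value 5
    above 9 Mbps. The real Equation (1) from [22] may differ."""
    if bitrate_mbps >= 9.0:
        return 5.0
    # Linear interpolation between MOS 1 (no throughput) and MOS 5
    # (9 Mbps) -- an assumption, not the published model.
    return 1.0 + 4.0 * max(bitrate_mbps, 0.0) / 9.0

print(mos_from_bitrate(9.5))  # -> 5.0
print(mos_from_bitrate(4.5))  # -> 3.0
```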
Concerning the second approach, we implement the provided relationship between QoE and buffer information as in [
24]. The used formula is given below in Equation (
2):
where
,
, and
are the respective levels of
,
and
as defined in [
23,
24], where the authors use 1, 2, and 3 to encode the “low”, “medium”, and “high” levels, respectively.
Therefore, we obtain the overall user’s MOS score using the equation below, which implements the
function as mentioned above:
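The per-session combination of the two scores can be sketched with Pandas as follows (the column names are illustrative, not the actual dataset schema):

```python
import pandas as pd

# Hedged sketch: build the user's MOS column as the per-session minimum
# of the two computed scores (worst-case perceived quality).
sessions = pd.DataFrame({
    "mos_bitrate": [4.2, 3.0, 5.0],
    "mos_buffer":  [3.1, 3.8, 4.6],
})
sessions["mos_user"] = sessions[["mos_bitrate", "mos_buffer"]].min(axis=1)
print(sessions["mos_user"].tolist())  # -> [3.1, 3.0, 4.6]
```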
Once the MOS scores are calculated, our dataset is ready for the training and prediction phases using four ML regressors. We aim at predicting, on the one hand, the calculated MOS scores and, on the other hand, the video KQIs (bitrate and launch time). We consider Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), and Gradient Tree Boosting (GTB), and we test many hyper-parameter configurations, described in
Table 11.
During the hyper-parameter tuning step, the results are validated with a 5-fold cross-validation method using the Python “scikit-learn” package, where the data are divided into for training and for testing. The next operation is the final prediction step, in which the best configuration of each ML method is used to implement the regression model that predicts the three calculated MOS scores and the two considered video KQIs (bitrate and launch time).
Concerning the three MOS scores prediction, the results are given in
Table 12. We report in this table the Mean Absolute Error (
) [
31] and Mean Absolute Percentage Error (
) [
32] and Pearson correlation rate (
r) [
8].
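These three metrics can be computed with standard library calls; the score vectors below are illustrative values, not our results:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from scipy.stats import pearsonr

# Illustrative vectors of reference and predicted MOS scores.
y_true = np.array([3.0, 4.0, 4.5, 2.5, 5.0])
y_pred = np.array([3.2, 3.9, 4.4, 2.8, 4.7])

mae = mean_absolute_error(y_true, y_pred)            # mean |error|
mape = mean_absolute_percentage_error(y_true, y_pred)  # mean |error| / |true|
r, _ = pearsonr(y_true, y_pred)                      # linear correlation
print(mae, mape, r)
```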
From
Table 12, we notice that all the considered models, except the DT method, performed reasonably well on the task of MOS score prediction and showed high degrees of accuracy with at least
in the case of buffer-based MOS,
in the case of bitrate-based MOS and
in the case of buffer-based MOS. According to the results, we see also that the
, with 11
and
, is the best one in the prediction of buffer-based MOS and bitrate-based MOS with mean error of
and
. This confirms the efficiency of the
in the context of QoE prediction as achieved in [
25]. Furthermore, and concerning the user’s MOS prediction, we find that the two ensemble methods,
and
, achieve the best prediction performance result with mean error rate equal to
, respectively, and
. In fact, the correlation performance of
, with
, is close to the values reported by other researchers in the literature. The results of predicting video KQI are given in
Table 13.
According to the results, the behavior of the ML models is not the same. Indeed, we observe that the ensemble methods (RF and GTB) give better results than the classical methods (DT and KNN). This is justified by the strength of ensemble ML methods compared to classical ones: a classical ML method (DT or KNN) is built on the complete dataset, using all the characteristics/variables of interest, while ensemble methods (RF or GTB) select specific observations/lines and characteristics/variables to build multiple predictors whose results are then averaged. The ensemble ML methods are more suitable when the targets, bitrate and launch time, span large intervals of values. As the range of values of these targets is large, the MAE and MAPE metrics do not necessarily lead to a good interpretation. Thus, we replace these metrics by the logarithm of the relative error (denoted by
) between the estimated and the reference values as reported in
Table 14.
From
Table 14, we observe that the logarithm of the relative error is between 0 and 1, which represents credible values. We confirm that using the radio parameters can give acceptable prediction results in the case of the launch time, with a
of
that represents an average error rate of 246 ms, and a correlation rate of
. However, the results are less accurate and not sufficient for the bitrate KQI, with a
of
that represents an average error rate (MAE) of more than 9400 kbits/s.
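One plausible reading of this metric, an assumption on our part since the paper's exact definition may differ, is the mean absolute base-10 logarithm of the prediction-to-reference ratio:

```python
import numpy as np

def log_relative_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Hedged sketch: mean absolute base-10 logarithm of the ratio between
    predicted and reference values. A value of 0 means a perfect
    prediction; 1 means the prediction is off by a factor of 10."""
    return float(np.mean(np.abs(np.log10(y_pred / y_true))))

# A factor-of-10 error gives 1, a perfect prediction gives 0.
print(log_relative_error(np.array([100.0]), np.array([1000.0])))  # -> 1.0
print(log_relative_error(np.array([250.0]), np.array([250.0])))   # -> 0.0
```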
6. Conclusions and Perspectives
The crowdsourcing approach offers a new, inexpensive paradigm for assessing the quality of experience (QoE) perceived by end users. Analyzing traces is very useful for enhancing the QoS and identifying the root cause of the poor performance that may occur in some small zones. It is also crucial for operators to easily produce coverage maps, for instance, to demonstrate that the coverage commitments on which their license is conditional have been met, in addition to limiting customer churn due to quality dissatisfaction.
We have collected a dataset for three popular Internet services using two different
/
user terminals. The measurements were carried out over 6 months in 2021 for one popular French operator in a large region of France (a rectangle of 96 km × 68 km), which is later divided on the map into 1024 small zones. The QoE in terms of the user’s Mean Opinion Score (MOS) has been computed from known models in the literature for every service, with the aim of analyzing the causes of the poor performance found in some zones. Several problem events are defined and matched against the traces. Our analysis is applied to both plain-text and encrypted traffic within different mobility modes. We concluded that the radio provisioning is not the only possible cause of poor performance, as one might intuitively think, especially with mobility. The capacity of the application servers, their location with respect to the users, and the user terminal characteristics can also explain problems. We have also noticed that older mobile technologies are still used to enlarge the coverage in less dense sectors (where the density of population per km
is low). We have also demonstrated that the key radio parameters can be used in a simple way to give an acceptable prediction of the HD video quality metrics, mainly the launch time, the bitrate and the MOS. It is worth mentioning that our study can also be applied to
new radio. The crowdsourcing campaign, the collection and preparation of the datasets, and the performance evaluation methodology applied to the key Internet services remain the same. The main practical difference is replacing the dataset features with the new radio indicators. In fact,
will lead to significant improvements in network throughput, outage and power consumption thanks to several key technologies, including Downlink/Uplink Decoupling, massive MIMO (beamforming), and the introduction of millimeter-wave bands [
33,
34]. Due mainly to beamforming, the
new radio (NR) uses Synchronization Signals (SS) and Channel State Information (CSI) instead of the Cell-Specific Reference signals (CRS) which are transmitted by neighboring cells [
35]. In fact, in
/
systems, the CRS is shared by all User Equipment (UE) in the same cell and this is why a CRS transmission cannot be beamformed to a specific UE. Instead, in the
NR, a UE-specifically configured and dedicated measurement signal named the Channel State Information Reference Signal (CSI-RS) is used; it has been available since Release 10. Later, configuring multiple CSI-RSs for one UE simultaneously was enabled with a larger number of antenna ports (e.g., Release 13). This permits measuring the characteristics of the radio channel so that the correct modulation, code rate and beamforming can be used. The base station (e.g., gNB) sends CSI reference signals, and the UE reports channel status information such as CSI-RSRP, CSI-RSRQ and CSI-SINR for mobility procedures [
35]. Therefore, our study can be applied to
networks once the correct CSI Reference signals are collected instead of those for
/
.
In future work, we would like to explore other efficient ensemble ML methods, as well as deep learning techniques, to achieve real-time estimation of video QoE.