1. Introduction
Video streaming has become a major usage scenario for Internet users, accounting for over 60% of downstream traffic on the Internet [1]. The growing popularity of new applications and video formats, such as 4K and 360-degree videos, mandates that network resources be apportioned among different users in an optimal and fair manner in order to deliver a satisfactory Quality of Experience (QoE). Many factors impact the QoE of a video streaming service, for example, the peak signal-to-noise ratio (PSNR) of the received video [2] or the structural similarity of the image (SSIM) [3]. In particular, the stall time during streaming is a critical performance objective, especially for services that require a low response time and rely heavily on customer experience, e.g., online video streaming and autonomous vehicle networks [4]. Further, the streaming device impacts the bitrate and, in turn, affects the QoE parameters (see [5] and the citations therein). The online optimization of stall time and QoE in a dynamic network environment is a very challenging problem that can be analyzed as an optimization problem [6] or a learning problem [7]. Traditional optimization-based approaches often rely on precise models to characterize the network system and the underlying optimization problems. For instance, the authors in [8] construct QoE-aware utility functions using a two-term power series model, while the authors in [9] leverage both the “bSoft probe” and the “Demokritos probe” to model the QoE measurement by analyzing the weight factors and exponents of all video-streaming service parameters, as well as quantifying the “Decodable Frame Rate” of three different types of frames. However, these model-based approaches cannot solve online QoE optimization with incomplete or little knowledge about future system dynamics.
Recently, Reinforcement Learning (RL) has proven to be an effective strategy for solving many online network optimization problems that may not yield a straightforward analytical structure, such as Wireless Sensor Network (WSN) routing [10], spectrum sharing in vehicular networks [11], data caching [7], and network service placement [12]. In particular, deep RL employs neural networks to estimate a decision-making policy, which self-improves based on collected experiential data to maximize the rewards. Compared with traditional model-based decision-making strategies, deep RL has a number of benefits: (i) it does not require a complete mathematical model or analytical formulation, which may not be available in many complex practical problems; (ii) the use of deep neural networks as function approximators makes RL algorithms easily extensible to problems with large state spaces; and (iii) it is capable of achieving fast convergence in online decision making and dynamic environments that evolve over time.
The goal of this paper is to develop a new family of multi-agent reinforcement learning algorithms to apportion download bandwidth on the fly among different users and to optimize QoE and fairness objectives in video streaming. We note that existing RL algorithms often focus on maximizing the sum of future (discounted) rewards across all agents and fail to address inter-agent utility optimization, which aims at balancing the discounted reward received by each individual agent. Such inter-agent utility optimizations are widely considered in video streaming problems, e.g., to optimize the fairness of network resource allocation and to maximize a nonlinear QoE function of each individual agent’s performance metrics. More precisely, in a dynamic setting, the problem being solved must be modeled as an MDP, where agents take actions based on some policy $\pi$ and the observed system states, causing the system to transition to a new state and yielding a reward $r_k$ for each agent $k$. The transition probability to the new state depends only on the previous state and the action taken in that state. RL algorithms aim to find an optimal policy $\pi$ that maximizes the sum of (discounted) rewards over all users. However, when QoE and fairness objectives are concerned, a nonlinear function $f$, such as the fairness utility [13] or the sigmoid QoE function reported in [14], must be applied to the rewards received by different agents, resulting in the optimization of a new objective $f(\bar{R}_1, \ldots, \bar{R}_K)$, where $\bar{R}_k$ is the average discounted reward of agent $k$ over a finite time horizon $T$. It is easy to see that such nonlinear functions potentially violate the memoryless property of the MDP that is required for RL, since the optimization objective now depends on all past rewards/states. In this paper, we develop a new family of multi-agent reinforcement learning algorithms to optimize such nonlinear objectives for QoE and fairness maximization in video streaming.
We propose Multi-agent Policy Gradient for Finite Time Horizon (MAPG-finite) for the optimization of nonlinear objective functions of the cumulative rewards of multiple agents over a finite time horizon. We employ MAPG-finite in online video streaming with the goal of maximizing QoE and fairness objectives by adjusting the download bandwidth distribution. To this end, we quantify the stall time for online video streaming with multiple agents under a shared network link and dynamic video switching by the agents. At the end of the time horizon, a nonlinear function $f$ of the agents’ individual cumulative rewards is calculated. The choice of $f$ is able to capture different notions of fairness, e.g., the well-known $\alpha$-fairness utility [13], which incorporates proportional fairness and max–min fairness as special cases, and the sigmoid-like QoE function reported in [14], which indicates that users with a mediocre waiting time tend to be more sensitive than the rest; it thus balances the performance received by different agents in online video streaming. We leverage the RL algorithm proposed in [15] to develop a model-free multi-agent reinforcement learning algorithm that copes with the inter-agent fairness reward for multiple users. In particular, this RL algorithm modifies the traditional policy gradient to find a proper ascent direction for the nonlinear objective function using random sampling. We prove that the proposed algorithm converges to at least a local optimum of the target optimization problem. The proposed multi-agent algorithm is model-free and is shown to efficiently solve the QoE and fairness optimization in online video streaming.
To demonstrate the challenge associated with optimizing nonlinear objective functions, consider the example shown in Figure 1. Two users, A and B, share a download link for video streaming. Due to the bandwidth constraint, the link is only able to stream one high-definition (HD) video and one low-definition (LD) video at a time. In each time slot $t$, the two users’ QoE, denoted by $q_A(t)$ and $q_B(t)$, are measured by a simple policy that multiplies the quality of the content (from one to five stars) by the resolution of the video (1 for LD and 2 for HD). The service provider is interested in optimizing a logarithmic utility of the users’ aggregate QoE, i.e., $\log(q_A(1)+q_A(2)) + \log(q_B(1)+q_B(2))$ (which corresponds to the notion of maximum proportional fairness [13,16]), over the two time slots $t = 1, 2$. It is easy to see that, in Case 1, given the QoE received by users A and B in time slot 1 (shown in Figure 1), streaming HD to user A and LD to user B in the next time slot yields a lower total utility than the opposite assignment. However, in Case 2, choosing user A to receive HD and user B LD in time slot 2 gives the highest utility. Thus, the optimal decision in time slot 2 depends on the rewards received in all past time slots (while we have only shown the rewards of time slot 1 for simplicity in this example). In general, the dependence of the utility objective on all past rewards implies a violation of the Markovian property, as actions should only be affected by the currently observed system state in an MDP. This mandates a new family of RL algorithms that are able to cope with nonlinear objective functions, which is the motivation of this paper.
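To make the history dependence concrete, the following minimal sketch compares the two slot-2 assignments under the logarithmic utility; the slot-1 QoE values and the per-slot HD/LD QoE values are hypothetical, chosen only for illustration and not taken from Figure 1.

```python
import math

def log_utility(q_a_total, q_b_total):
    # Proportional-fairness utility of the two users' aggregate QoE.
    return math.log(q_a_total) + math.log(q_b_total)

def best_slot2_assignment(q_a_slot1, q_b_slot1, q_hd=10, q_ld=3):
    # q_hd / q_ld are hypothetical per-slot QoE values for an HD / LD stream.
    u_a_hd = log_utility(q_a_slot1 + q_hd, q_b_slot1 + q_ld)  # HD to A, LD to B
    u_b_hd = log_utility(q_a_slot1 + q_ld, q_b_slot1 + q_hd)  # LD to A, HD to B
    return ("HD to A" if u_a_hd > u_b_hd else "HD to B", u_a_hd, u_b_hd)

# The preferred slot-2 action flips depending on the slot-1 rewards,
# even though the instantaneous system state is identical.
print(best_slot2_assignment(q_a_slot1=2, q_b_slot1=8))   # -> HD to A
print(best_slot2_assignment(q_a_slot1=8, q_b_slot1=2))   # -> HD to B
```

Because the preferred action depends on the accumulated past rewards rather than only on the current state, a standard per-slot reward formulation cannot capture this objective.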
We note that, in video streaming optimization, the action space for learning-based algorithms can still be prohibitively large, since the bandwidth assignment decisions are continuous (or at least fine-grained if we sample the continuous decision space), whereas the action space suitable for RL algorithms should be small as the output layer sizes of neural networks are limited. The action space for bandwidth assignment increases undesirably with both the growing number of users and the increasing amount of network resources in the video streaming system.
To overcome this challenge, we propose an adaptive bandwidth adjustment process. It leverages two separate reinforcement learning modules running in parallel, each tasked to select a target user to increase or decrease their current bandwidth by one unit. This effectively reduces the action space of each RL module to exactly the number of users in the system, while both learning modules can be trained in parallel with the same set of data through backpropagation. This technique significantly reduces the action spaces in MAPG-finite and makes the optimization problem tractable.
To evaluate the proposed algorithm, we develop a modularized testbed for the event-driven simulation of video streaming with multi-agent bandwidth assignment. In particular, a bandwidth assigner is developed to observe the agent states, obtain an optimized distribution from the activated action executor, and then adjust the bandwidth of each agent on the fly. We implement this distribution-generating solution along with the model-free multi-agent deep policy gradient algorithm, and compare its performance with static and dynamic baseline strategies, including “Even” (which guarantees balanced bandwidths for all users), “Adaptive” (which assigns more bandwidth to users consuming higher bitrates), and SARSA (a standard single-agent RL-driven policy that fails to consider inter-agent utility optimization). By simulating various network environments, as well as both constant and adaptive bitrate policies, we validate that the proposed MAPG-finite outperforms all other tested algorithms. With Constant Bitrate (CBR) streaming, MAPG-finite improves both the achieved QoE and the fairness compared with the baseline strategies, and further QoE improvements are obtained with Adaptive Bitrate (ABR) streaming.
In summary, the key contributions of this work are:
We model the bandwidth assignment problem for optimizing QoE and fairness objectives in multi-user online video streaming. The stall time is quantified for general cases under system dynamics.
Due to the nature of the inter-agent fairness problem, we propose a multi-agent learning algorithm that is proven to converge and leverages two reinforcement learning modules running in parallel to effectively reduce the action space size.
The proposed algorithm is implemented and evaluated on our testbed, which is able to simulate various configurations, including different reward functions, network conditions, and user behavior settings.
The numerical results show that MAPG-finite outperforms a number of baselines, including “Even”, “Adaptive”, and single-agent learning policies. With CBR, MAPG-finite improves both the achieved QoE and the logarithmic fairness; with ABR, it further improves the achieved QoE.
2. Related Work
Multi-Agent Reinforcement Learning: In the past, the Multi-agent Reinforcement Learning (MARL) technique [17] has been discussed for scenarios where all of the agents make decisions individually to achieve a global optimum. Existing works include coordinated reinforcement learning [18], which coordinates both the action selections and the parameter updates between users; sparse cooperative Q-learning [19], which allows agents to jointly solve a problem when the global coordination requirements are available; ref. [20], which uses the max-plus algorithm as the elimination algorithm of the coordination graph; ref. [21], which compares multiple known structural abstractions to improve scalability; and ref. [22], which automatically expands an agent’s state space when convergence is lacking. Apart from the standard Q-learning [23] and policy gradient [24] algorithms, there is a rich literature on meta-heuristic algorithms for reinforcement learning. In [25], an ant-colony optimization method for swarm reinforcement learning is provided, which improves empirically over Q-learning-based methods by using parallel learning inspired by ant swarms. Building on biologically inspired algorithms, ref. [26] provides a genetic algorithm to search for parameters for deep reinforcement learning. In addition, the authors of [27] study a modification of ant colony optimization by considering $\epsilon$-greedy policies combined with Levy flight for random exploration in order to search for possible global optima. Further, the authors of [28] consider a multi-period optimization using an ant-colony-optimization-inspired relaxation-induced neighborhood search algorithm for performing a search in large neighborhoods.
Recently, along with the development of neural networks and deep learning, deep MARL [29] has been proposed to resolve real-world problems with larger state spaces. With various aspects of deep MARL researched, such as investigating the representational power of network architectures [30], applying deep MARL to discrete–continuous hybrid action spaces [31], and enhancing the experience selection mechanism [32], real-world applications can be solved, including wireless sensor network (WSN) routing [10], spectrum sharing in vehicular networks [11], online ride-sourcing (driver–passenger pairing) services [33,34], video game playing [35], and linguistic problems [29]. Compared with existing work, our proposed solution in this paper focuses on optimizing inter-agent fairness objectives in reinforcement learning.
Video Streaming Optimization: In order to improve the performance of data streaming, various techniques have been proposed. The most widely discussed method is Adaptive Bitrate (ABR) streaming [36], which dynamically adjusts the streaming bitrate to reduce the stall time. Representative algorithms include BBA [37], Bola [38], FastMPC [39], LBP [40], FastScan [36], and Pensieve [41]. In addition to ABR and the bandwidth allocation considered in this paper, caching is also a popular technique to reduce the stall time and further improve QoE. Inspired by the LRU cache replacement policy, ref. [42] analyzes an alternative, gLRU, designed for video streaming applications, and DeepChunk [7] proposes a Q-learning-based cache replacement policy to jointly optimize the hit ratio and stall time. Within an edge network environment, the placement of computations also affects the streaming performance; thus, the work in [12] breaks the hierarchical service placement problem into sub-trees and further solves it using Q-learning.
For video streaming services still using Constant Bitrate (CBR) systems, ref. [43] proposed QUVE, which estimates the future network quality and controls video encoding accordingly. The study in [44] considers maximizing QoE by optimizing the cached content in edge servers. This is different from our setup, where we consider caching chunks at client devices. Similar to us, ref. [45] also provides a bandwidth allocation strategy to maximize QoE. However, they use model-predictive control, whereas we pose it as a learning problem and use reinforcement learning. The study in [46] considers a multi-user encoding strategy where the encoding scheme for each user varies depending on their network condition. However, the study uses a Markovian model and does not take possible future network conditions into account. In [47], a future-dependent adaptive strategy is considered, where the TCP throughput and the success probability of a chunk download are estimated. Similar to us, the authors of [48] consider a reinforcement learning protocol to maximize the QoE for multiple clients. However, they use the average client QoE at time $t$ as the reward for time $t$ and use deterministic policies learned from Q-learning [23]. We show that our formulation outperforms standard Q-learning algorithms by considering stochastic policies and rewards that are a function of the QoE of the clients.
Our work, by using a model-free deep RL policy [15], aims to maximize the overall quality of experience of multiple agents. To measure the QoE, ref. [14] considers the web page loading time as a factor, ref. [49] tracks graphics settings, and ref. [50] focuses on mobile networks, where the signal-to-noise ratio, load, and handovers matter. More mapping methodologies can be found in the survey [51].
3. System Model
We consider a server with a total bandwidth of $B$ that streams videos to the users in a set $\mathcal{K}$, in which all of the users consume videos continuously. We consider a streaming session to be divided into nonidentical logical slots. In each time slot, every user keeps requesting/playing chunks from the same video. Once any user $k \in \mathcal{K}$ requests a new video in the current time slot $l$, the slot counter increments; thus, the new slot $l+1$ starts for all users in $\mathcal{K}$, even if the video does not change for the other users $k' \neq k$. Using the logical time slot setting described above, at time slot $l$, a user $k \in \mathcal{K}$ consumes a downloading rate $d_k(l)$ to fetch video $v_k(l)$, which is coded with bitrate $r_k(l)$. The downloading speed is limited by the bandwidth allocated to user $k$, and may be updated for all users when the time slot increments in the system. The video server continuously sends video chunks to the user at downloading speed $d_k(l)$, and the user plays the video with bitrate $r_k(l)$, which is defined by the property of the video. For Adaptive Bitrate (ABR) videos, we update the bitrate of the chunks only when a new slot starts; thus, all of the chunks sent in slot $l$ are of bitrate $r_k(l)$. For Constant Bitrate (CBR) videos, $r_k(l)$ may remain constant across time slots $l, l+1, \ldots$ if multiple slots span the same video $v_k(l)$. A list of the key variables used in this paper is given in Table 1.
In each slot, we reset the clock to zero. We use $s_m$ to denote the time when the server starts to send the $m$-th chunk of the video in time slot $l$, $p_m$ to denote the time when the user starts to play video chunk $m$, and $f_m$ to denote the moment that chunk $m$ finishes playing. For the analysis, we consider the size of each chunk to be normalized to 1 unit.
With our formulation, there are two classes of users in a time slot $l$. The first class consists of users who requested a new video and thereby triggered the increment of the time slot to $l$. Since such a user has requested a new video, it can purge the already downloaded chunks of the previous video. Users in this class may observe a new downloading rate $d_k(l)$ and video bitrate $r_k(l)$. The second class of users are those who did not request a new video, but to whom a new streaming rate is assigned because some other user requested a new video and triggered the slot change. For these users, the video bitrate either remains the same as in the previous slot, $r_k(l) = r_k(l-1)$, when CBR is activated, or is adjusted solely by the ABR streaming policy when ABR is activated, whereas the downloading rate $d_k(l)$ is updated by the bandwidth distribution policy. Note that a resource allocation scheme may still allocate bandwidth to the user such that $d_k(l) = d_k(l-1)$; however, this may not always be true.
Next, we calculate the stall time in a slot $l$ for both classes of users.
3.1. Class 1: User Requests a New Video
We first calculate the stall durations for a user $k$ that has requested a video change. As shown in Figure 2, user $k$ starts to fetch a video from the beginning of slot $l$. We assume that chunk $m$ is played within time slot $l$; if not, the calculation of the stall duration for those chunks is covered in Case 2. With the given downloading speed and bitrate, we can observe the relationships among $s_m$, $p_m$, and $f_m$.
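As a concrete illustration, assume unit-size chunks that are downloaded back-to-back at rate $d$ and played in order at bitrate $r$ (so a chunk takes $1/d$ time units to download and $1/r$ time units to play); under this assumption, the relationships take the form

$$ s_m = \frac{m-1}{d}, \qquad p_m = \max\Big\{ s_m + \frac{1}{d},\; f_{m-1} \Big\}, \qquad f_m = p_m + \frac{1}{r}, \qquad f_0 = 0, $$

i.e., a chunk starts playing once it has been fully downloaded and the previous chunk has finished playing.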
Since we limit the analysis to user $k$ in slot $l$, we drop the subscript $k$ and the argument $l$ from $d$, $r$, and the chunk times for brevity. Let $T$ denote the time elapsed in time slot $l$, and let $ST(T)$ denote the stall time in slot $l$ up to elapsed time $T$ for user $k$. Clearly, when the video downloading speed is equal to or higher than the video bitrate ($d \geq r$), the user only has to wait for the first chunk to arrive and then experiences a stall-free video playback. Otherwise, for $d < r$, three conditions need to be considered for $T$. If $T$ is smaller than or equal to $p_1$, no video chunk has been played and the stall time is exactly the elapsed time $T$ in the time slot. Otherwise, if $T$ lands in an interval in which a video chunk $m$ is being played, the stall time at $T$ is equal to the stall time at $p_m$, and if $T$ lands in an interval where the user is waiting for chunk $m+1$ to be downloaded, the stall time includes an additional wait after the $m$-th chunk has finished playing. Hence, the stall time until the end of slot $l$, $ST(T)$, can be defined as a recursive conditional function, given in Equation (4).
Under the third condition of Equation (4), the stall time accumulated before the $m$-th chunk is downloaded is the key to obtaining the stall time at $T$. That stall time fits the second condition of Equation (4), which gives Equation (5). According to Equations (1)–(3), the difference between consecutive chunk times can be calculated; thus, Equation (5) can be written in a recursive form and further solved in closed form. Finally, substituting the result into Equation (4) yields the stall time for an elapsed slot time $T$.
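For intuition, the following minimal sketch computes this stall time numerically under the simplifying assumptions stated in the code comments; the function and variable names are ours and do not correspond to the paper's closed-form expressions.

```python
def class1_stall_time(T, d, r, max_chunks=100000):
    """Stall time up to elapsed time T within a slot, for a user that starts a
    new video at the beginning of the slot.

    A sketch under simplifying assumptions (not the exact closed form):
    unit-size chunks, downloaded back-to-back at rate d (download time 1/d per
    chunk) and played in order at bitrate r (play time 1/r per chunk).
    """
    dl, pl = 1.0 / d, 1.0 / r          # per-chunk download / play durations
    f_prev = 0.0                        # finish-play time of the previous chunk
    played = 0.0                        # total playback time accumulated by T
    for m in range(1, max_chunks + 1):
        ready = m * dl                  # chunk m fully downloaded (s_m + 1/d)
        p_m = max(ready, f_prev)        # chunk m starts playing
        f_m = p_m + pl                  # chunk m finishes playing
        if p_m >= T:
            break
        played += min(f_m, T) - p_m     # playback that happened before T
        f_prev = f_m
    return T - played                   # any time not spent playing is a stall


# Example: a 300 kb/s stream downloaded at 200 kb/s stalls for roughly one
# third of the slot (1 - d/r), including the initial start-up delay.
print(class1_stall_time(T=100.0, d=200.0, r=300.0))
```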
Note that, if some other user requests a new video, triggering an increment of the time slot from $l$ to $l+1$, the stall duration analysis falls under the second class of users. We discuss the stall duration for the second class of users in the next subsection.
3.2. Class 2: Users Continuing with the Old Video
We now discuss the stall time model for users who continue their video from time slot $l-1$ to time slot $l$ (Figure 3).
Assume that, in the previous time slot $l-1$, the total slot duration is $\tau_{l-1}$. At the moment $\tau_{l-1}$, a chunk, denoted by chunk 0, is still being downloaded. Because chunks are downloaded continuously, by evaluating $\tau_{l-1}$ and the download speed in slot $l-1$, we can calculate the length (i.e., the ratio) $\beta$ of chunk 0 that has not yet been downloaded by the end of slot $l-1$. Note that, since the length of each chunk is normalized, we have $0 \leq \beta \leq 1$. Downloaded at speed $d$, the leftover chunk 0 of length $\beta$ needs time $\beta/d$ to be ready for the user to play. Following the continuous downloading rule, in time slot $l$, chunk 0 finishes downloading at time $\beta/d$; then, similar to Equation (1), the remaining chunk sending and downloading times can be recursively obtained.
We denote the last chunk being played in time slot $l-1$ as chunk $-n$, which is the video chunk ahead of chunk 0 by $n$, and we denote its play finish time, as calculated in the previous time slot, by $\hat{f}_{-n}$. If $\hat{f}_{-n} \leq \tau_{l-1}$, we know that all chunks before chunk 0 have finished playing in slot $l-1$. Otherwise, chunk $-n$ is being played halfway at the moment of the time slot transition. In the latter case, user $k$ continues playing video chunk $-n$ at the beginning of time slot $l$. Then, in the new time slot $l$, because the video bitrate is unchanged, chunk $-n$ finishes playing at $f_{-n} = \hat{f}_{-n} - \tau_{l-1}$. Since, at the beginning of slot $l$, chunk 0 is still being downloaded, we know that chunks $-n+1, \ldots, -1$ are all ready to be played. Therefore, we can derive the play finish times of chunks $-n+1, \ldots, -1$ in slot $l$ accordingly.
As the download finish times and play finish times are now defined, the leftover chunk 0 starts playing at $p_0 = \max\{\beta/d,\, f_{-1}\}$ and finishes at $f_0 = p_0 + 1/r$.
With all of the leftover chunk issues handled, we finally obtain the chunk time equations for time slot $l$.
With all of the time equations obtained, we can now calculate the stall time using a procedure similar to that of the previous subsection. Since the chunks that were already downloaded in the previous slot are played stall-free, if the slot ends before they are exhausted, the stall time is zero. From chunk 0 onwards, a stall may appear in the gap where chunk $m-1$ has finished playing while chunk $m$ is not yet downloaded (which can occur when $d < r$). If $T$ happens to fall in such a gap, the stall time $ST(T)$ is the accumulated waiting time of the chunks played so far plus the additional time between $T$ and the moment chunk $m-1$ finished playing. Otherwise, if $T$ falls in an interval during which a chunk $m$ is being played, then the stall time $ST(T)$ is the accumulated stall time of the chunks up to $m$. This yields the Class 2 counterpart of Equation (4), given in Equation (15).
3.3. Quality of Experience
The goal of this work is to maximize the inter-agent QoE utility for all users. In this paper, we consider the fairness utility functions in [13] and optimize the inter-agent fairness with two existing evaluations: the sigmoid-like QoE function and the logarithmic fairness function. By analyzing real-world user rating statistics, a sigmoid-like relationship between the web page loading time and the user QoE was reported in [14]. Inspired by that, we adopt a similar nonlinear, sigmoid-like QoE curve to map the streaming stall time ratio (Equation (16)), and verify that (i) reducing the stall time for users who already have a very low stall time or (ii) increasing the stall time for users who already suffer from a high stall time does not impact the QoE values, while (iii) users with a mediocre QoE are the most sensitive to stall time changes. We also consider a logarithmic utility function that achieves the well-known proportional fairness [13] among the users (Equation (17)). It is easy to see that, for a unit decrease in stall time, this utility function provides (i) a larger QoE increment for users experiencing a higher stall time and (ii) a smaller increment for users already enjoying good performance with a low stall time.
In both Equations (16) and (17), $x$ represents the stall time ratio for playing a video. It is defined in Equation (18), where $\mathcal{L}_v$ denotes the set of time slots in which video $v$ has been played, and $ST_k(l)$ denotes the stall time of user $k$ in time slot $l$ with slot length $\tau_l$.
We note that (16) is only one representative QoE function, and other functions may be used. Suppose that, in $L$ time slots, $\mathcal{V}_k$ is the set of videos played by user $k$. Then, the total QoE obtained by user $k$ over the $L$ time slots is given by Equation (19), and, for all users, the total QoE is given by Equation (20).
Substituting (19) and (18) into (20), we obtain Equation (21). Note that the QoE function defined in Equation (19) assigns a higher quality of experience to a lower stall time. The QoE metric remains constant for small stall times: if the stall times are lower than a certain value and are not noticeable, the QoE does not vary, as captured by the sigmoid-like function of Equation (19). In addition, the QoE decreases rapidly with increasing stall times and remains zero once the stall times exceed a certain value, which ruins the viewing experience.
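Since the exact parameterization of the sigmoid-like curve is application-specific, the following sketch shows one plausible instantiation of the stall time ratio and the two utilities; the constants and the logistic form are illustrative assumptions, not the calibrated values used in the evaluation.

```python
import math

def stall_ratio(stall_times, slot_lengths):
    # Stall time ratio x for one video: total stall over total slot time.
    return sum(stall_times) / sum(slot_lengths)

def sigmoid_qoe(x, midpoint=0.3, steepness=20.0):
    # Sigmoid-like QoE: flat near x = 0, a steep drop around the midpoint,
    # and essentially zero for large stall ratios.
    return 1.0 / (1.0 + math.exp(steepness * (x - midpoint)))

def log_fairness_utility(x, eps=1e-6):
    # Logarithmic (proportional-fairness style) utility of the non-stalled
    # fraction: a unit decrease in stall helps heavily stalled users the most.
    return math.log(max(1.0 - x, eps))

x = stall_ratio(stall_times=[1.2, 0.0, 2.3], slot_lengths=[10.0, 8.0, 12.0])
print(x, sigmoid_qoe(x), log_fairness_utility(x))
```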
4. Problem Formulation
In this section, we propose a slice assignment system to distribute the download link bandwidth to the users. Let $\mathbf{w}(l) = (w_1(l), \ldots, w_K(l))$ be a vector in $[0,1]^K$, where $K = |\mathcal{K}|$ is the number of users, such that $\sum_{k \in \mathcal{K}} w_k(l) \leq 1$. Each element $w_k(l)$ denotes the portion of the total bandwidth assigned to user $k$. By this definition, user $k$'s downloading bandwidth under policy $\mathbf{w}(l)$ can be calculated as $d_k(l) = w_k(l)\, B$.
The Multi-Agent Video Streaming (MA-Stream) optimization problem is defined in Equations (22)–(24), which we now discuss.
Equation (22) denotes the sum of the quality of experience over each user $k \in \mathcal{K}$ and each video played in the $L$ time slots. The control variable is the policy $\mathbf{w}(l)$ (in Equation (24)), which directly controls the bandwidth allocation. This gives the constraint in Equation (23), where the sum of the bandwidths $d_k(l)$ allocated to the users can be at most the total bandwidth $B$ of the system for all slots $l = 1, \ldots, L$. Moreover, the QoE for any video is a nonlinear function of the cumulative stall durations over the chunks of the video played.
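In the notation introduced above, the program can be sketched as

$$ \max_{\{\mathbf{w}(l)\}_{l=1}^{L}} \; \sum_{k \in \mathcal{K}} \sum_{v \in \mathcal{V}_k} \mathrm{QoE}_k(v) \quad \text{s.t.} \quad \sum_{k \in \mathcal{K}} d_k(l) \leq B, \qquad d_k(l) = w_k(l)\, B \geq 0, \qquad \forall\, l = 1, \ldots, L, $$

where $\mathrm{QoE}_k(v)$ denotes the QoE of user $k$ for video $v$, computed from the stall times as in Section 3.3.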
We utilize the deep reinforcement learning technique to optimize the bandwidth distribution $\mathbf{w}(l)$. In the following subsections, we define the state, action, and objective for the decision making.
4.1. State
At time slot $l$, the observed state is defined by a $4K$-dimensional vector $\mathbf{S}(l) = (\mathbf{r}(l), \mathbf{d}(l), \mathbf{x}(l), \mathbf{c}(l))$, where $\mathbf{r}(l)$ denotes the video bitrates, $\mathbf{d}(l)$ represents the currently assigned download speeds, $\mathbf{x}(l)$ tracks the accumulated stall times of the currently playing videos until slot $l$, and $\mathbf{c}(l)$ counts the numbers of chunks that are downloaded but not yet played, for each user $k \in \mathcal{K}$. For brevity, we use the vector notation $\mathbf{r}(l) = (r_k(l))_{k \in \mathcal{K}}$, and similarly for the other quantities; we expand the corresponding vector when necessary. By considering the bitrate and download speed variables, the learning model should be able to estimate the downloaded and played video chunk information in the current time slot $l$, while the accumulated stall times and buffered chunk counts provide the objective-related history information.
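A minimal sketch of how such a state vector could be assembled per slot; the field names and ordering are our choice, not a prescribed layout.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class UserSlotInfo:
    bitrate: float            # r_k(l): bitrate of the video currently played
    download_speed: float     # d_k(l): bandwidth currently assigned to the user
    accumulated_stall: float  # x_k(l): stall time of the current video so far
    buffered_chunks: float    # c_k(l): chunks downloaded but not yet played

def build_state(users):
    """Stack the per-user features into one flat observation vector
    (dimension 4 * number of users), the input to the policy network."""
    return np.array(
        [[u.bitrate, u.download_speed, u.accumulated_stall, u.buffered_chunks]
         for u in users],
        dtype=np.float32,
    ).flatten()

state = build_state([UserSlotInfo(300.0, 250.0, 1.5, 3.0),
                     UserSlotInfo(800.0, 900.0, 0.0, 6.0)])
print(state.shape)  # (8,)
```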
4.2. Action and State Transition
At the beginning of time slot $l$, in order to find the optimal download speed distribution, multiple decisions are needed to adjust the observed speed distribution. We utilize two decision processes to obtain the optimal distribution $\mathbf{w}(l)$ while maintaining the constraint shown in Equation (23). One process is a decreasing process that decides which user's download speed is decreased by one unit of rate, and the other is an increasing process that decides which user obtains the released unit of download speed.
The download speed distribution is iteratively adjusted towards a final distribution by recursively running the decreasing and increasing decision processes. No distribution is assigned to the system until the final decision is concluded, and the system does not transit into the next time slot before then. Assume that, at time slot $l$, with the observed state $\mathbf{S}(l)$, actions are taken by the decreasing and increasing processes; the intermediate state can then be derived by decreasing one unit of download speed for the user selected by the decreasing process and increasing it by one unit for the user selected by the increasing process. This intermediate state is then used in the decision making of both processes, and new actions are taken to push the distribution towards its final state. Finally, when both the increasing and decreasing processes select the same user, the adjustment terminates and the resulting distribution $\mathbf{w}(l)$ is obtained.
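A sketch of this iterative adjustment loop follows; the two policy functions are placeholders for the trained increasing/decreasing RL modules, and the unit size and termination rule follow the description above.

```python
import numpy as np

def adjust_bandwidth(weights, decrease_policy, increase_policy,
                     unit=0.05, max_iters=100):
    """Iteratively move one `unit` of bandwidth share from the user chosen by
    the decreasing module to the user chosen by the increasing module.
    Terminates when both modules pick the same user (a no-op adjustment).
    `decrease_policy` / `increase_policy` map a weight vector to a user index.
    """
    w = np.array(weights, dtype=float)
    for _ in range(max_iters):
        k_dec = decrease_policy(w)
        k_inc = increase_policy(w)
        if k_dec == k_inc:          # both modules agree: distribution is final
            break
        w[k_dec] -= unit            # one unit released by the chosen user...
        w[k_inc] += unit            # ...and granted to the other chosen user
        w = np.clip(w, 0.0, 1.0)
    return w / w.sum()              # keep the shares a valid distribution

# Toy example: equalize bandwidth by always taking from the largest share
# and giving to the smallest; the loop stops once the shares are balanced.
final = adjust_bandwidth([0.7, 0.3],
                         decrease_policy=lambda w: int(np.argmax(w)),
                         increase_policy=lambda w: int(np.argmin(w)))
print(final)  # -> [0.5 0.5]
```

Note that each RL module only has to choose one user, so its output layer has exactly as many entries as there are users, which is the action-space reduction described in Section 1.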
According to $\mathbf{w}(l)$, the system distributes the bandwidth to each user for time slot $l$. The next time slot is triggered when a user switches its playing video. We assume that the new content requests of the users follow independent Poisson arrival processes with arrival rate $\lambda_k$ for user $k$, so the mean slot duration can be derived as $1/\sum_{k} \lambda_k$, and the probability that user $k$ triggers the state transition is $\lambda_k / \sum_{k'} \lambda_{k'}$. For time slot $l+1$, the initial system state carries over the video bitrates from slot $l$ if CBR is activated as the bitrate policy, together with the downloading speeds calculated in the previous time slot.
The accumulated stall times $x_k(l)$ can be calculated using Equations (4) and (15). Let $v_k(l)$ be the video played by user $k$ in time slot $l$, and let $l_0 \leq l$ be the time slot in which user $k$ started playing video $v_k(l)$. Letting $\tau_{l'}$ denote the length of time slot $l'$, the accumulated stall is obtained by summing the per-slot stall times of video $v_k(l)$ from slot $l_0$ to slot $l$.
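With $ST_k(l')$ denoting the stall time of user $k$ in slot $l'$ (of length $\tau_{l'}$), computed via Equation (4) or (15), one way to write this accumulation is

$$ x_k(l) = \sum_{l' = l_0}^{l} ST_k(l'). $$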
The number of remaining (buffered) chunks $c_k(l)$ can easily be tracked during the downloading/playing procedures and observed whenever the information is needed. Both the downloading and playing processes can be monitored. For the downloading process, let the pair $(m^{d}, \rho^{d})$ describe the progress, where $m^{d}$ denotes the index of the chunk being downloaded at the beginning of time slot $l$, and $\rho^{d}$ denotes the ratio (percentage) of that chunk that has been completed. A similar pair $(m^{p}, \rho^{p})$ describes the playing process. Using both processes, the number of remaining chunks can be calculated as the difference between the downloading and playing progress, i.e., $c_k(l) = (m^{d} + \rho^{d}) - (m^{p} + \rho^{p})$, in any time slot $l$.
4.3. Feedback
As pointed out in Equation (22), the goal of the controller is to maximize the average QoE. For our RL algorithm to learn an optimal policy that maximizes this objective, every slot provides feedback on the value of the objective calculated from the average stall times of all users. In Section 4.2, we mention that, when the decreasing and increasing processes take their decisions, the state transition only happens in a logical domain instead of the real time domain. During these intermediate state transitions, no real stall time calculations exist, and we assign zero rewards to the actions taken in intermediate states before converging to the final distribution. When the final distribution $\mathbf{w}(l)$ is achieved, the slot duration $\tau_l$ can be obtained and, hence, the stall times can be calculated for all users. We can then obtain the rewards from the calculated stall times using Equation (21).
The complete scheme is presented in Algorithm 1.
Algorithm 1 Proposed MA-Stream Algorithm
1: Input: Set of users $\mathcal{K}$, maximum bandwidth $B$
2: for slot $l = 1, \ldots, L$ do
3: Observe state $\mathbf{S}(l)$ as described in Section 4.1
4: Compute bandwidth allocations $d_k(l)$ for all $k \in \mathcal{K}$ using the RL engine
5: while no user switches video do
6: Continue streaming with $d_k(l)$ for all $k \in \mathcal{K}$
7: Store the stall duration $ST_k(l)$ for slot $l$ for all $k \in \mathcal{K}$
8: end while
9: end for
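Putting the pieces together, the following high-level sketch mirrors the loop of Algorithm 1; the environment and learner below are toy stubs standing in for the streaming simulator and the MAPG-finite learner described above, and their method names are ours.

```python
import random

class StubEnvironment:
    """Toy stand-in for the streaming simulator: two users, a fake stall
    model (not the stall equations of Section 3), and per-slot streaming."""
    def __init__(self, num_users=2):
        self.num_users = num_users
        self.weights = [1.0 / num_users] * num_users
    def observe_state(self):
        return list(self.weights)                       # toy state
    def apply_bandwidth(self, weights):
        self.weights = weights
    def run_slot(self):
        # Fake per-user stalls: less assigned bandwidth share -> more stall.
        return [max(0.0, 0.5 - w) * random.uniform(5, 10) for w in self.weights]

class StubRLEngine:
    """Placeholder for MAPG-finite: here it simply keeps an even split."""
    def bandwidth_distribution(self, state):
        return [1.0 / len(state)] * len(state)
    def feedback(self, stalls):
        pass                                            # learning step omitted

def run_ma_stream(env, rl_engine, num_slots):
    stall_log = []
    for _ in range(num_slots):                          # Algorithm 1, line 2
        state = env.observe_state()                     # line 3
        env.apply_bandwidth(rl_engine.bandwidth_distribution(state))  # line 4
        stalls = env.run_slot()                         # lines 5-8
        stall_log.append(stalls)
        rl_engine.feedback(stalls)                      # reward per Equation (21)
    return stall_log

print(run_ma_stream(StubEnvironment(), StubRLEngine(), num_slots=3))
```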