Article

Dynamic Routing Policies for Multi-Skill Call Centers Using Deep Q Network

Qin Zhang
School of Economics and Management, Tongji University, Shanghai 200092, China
Mathematics 2023, 11(22), 4662; https://doi.org/10.3390/math11224662
Submission received: 24 September 2023 / Revised: 10 November 2023 / Accepted: 14 November 2023 / Published: 16 November 2023

Abstract

When the call center queuing system becomes complex, the static routing policy is no longer optimal. This paper considers the problem of dynamic routing for call centers with multiple skill types and agent groups. A state-dependent routing policy based on the Deep Q Network (DQN) is proposed, and a reinforcement learning algorithm is applied to optimize the routing. A simulation algorithm is designed so that customers and agents can interact with the external environment and learn the optimal strategy. The performance measures considered in this paper are the service level and the abandonment rate. Experiments show that the DQN-based dynamic routing policy performs better than the common static policy Global First Come First Serve (FCFS) and the dynamic policies Priorities with Idle Agent Thresholds and Weight-Based Routing in various examples. Moreover, training the DQN-based routing policy is much faster than routing optimization based on simulation and a genetic algorithm.
MSC:
60K25; 68M20; 90B22

1. Introduction

In the increasingly complex market environment for service products, traditional single-skill call centers are gradually being replaced by multi-skill call centers. Multi-skill call centers are typical parallel-server systems: there are multiple classes of customers and multiple agent groups. Matching incoming calls to agents appropriately is therefore a practical problem in optimizing the service system. Multi-skill routing rules are complicated by the fact that each customer requires an agent with specific skills and different agent groups share some skills.
Some rudimentary routing policies can be called static because they do not take the real-time state of the service system into account, for example, assigning the longest-waiting call to the longest-idle agent with the right skill, or assigning customers according to Faster Servers First (FSF). Another intuitive static policy is to prioritize the matches between customer types and agent groups: when a call arrives or an agent becomes idle, they are matched according to preset priorities. Due to their simplicity, many analyses of call center queuing systems are based on static routing policies [1,2,3,4]. In particular, Tezcan and Dai proved that a cμ-type greedy routing policy is asymptotically optimal for N-systems in a many-server heavy-traffic regime [5].
When the queuing system becomes complex, static routing policies are no longer optimal. Some state-dependent threshold policies have been proposed, for example by Ormeci [6], who considered a call center with one shared and two dedicated agent groups and proved the existence of a monotonic threshold policy for the shared station, where the thresholds depend on the number of customers in all three stations. Chan et al. reviewed common routing policies, most of which are based on routing rules found in industry, and introduced a dynamic routing policy named Weight-Based Routing (WR) [7].
The optimization of dynamic routing is usually based on a Markov decision process (MDP), and approximate dynamic programming (ADP) is the most popular solution approach [8,9,10,11]. These studies often consider call centers with few call types and specific structures such as the N-design or a hierarchical structure, and they make simplifying assumptions such as no waiting room and infinitely patient customers. Poisson arrivals and exponential service times are assumed in most of the literature. In real call centers, however, the situation is often more complicated: there are many more call types, and the arrival and service processes follow different distributions.
With the development of deep learning technology, learning-based heuristic methods have shown promise in numerous research fields [12,13]. Many deep reinforcement learning (DRL) methods have emerged for sequential decision-making problems that are difficult to solve with traditional operations research methods. For instance, Tan explored a deep reinforcement learning architecture for path planning experiments with mobile robots [14], and Kool et al. applied DRL with an encoder–decoder framework with multi-head attention layers to tackle Vehicle Routing Problems (VRPs) [15]. Fuller et al. implemented reinforcement learning in a call center queuing simulation model [16]. To handle continuous state spaces, Mnih et al. proposed a DRL algorithm that combines a deep neural network with Q-learning to approximate the value function for high-dimensional inputs, called the Deep Q Network (DQN) [17]. Although the DQN was proposed in 2015, it is still widely used for high-dimensional and continuous state space problems, and it shows excellent characteristics in terms of global transparency and global optimization [18]. A double-DQN-based 5G network job scheduling algorithm provides higher convergence between wireless nodes in the 5G network and consumes less energy; compared with the standard deep learning methods DeMADRL and BiDRL, it achieves the best results [19]. Kopacz et al. modeled cooperative–competitive social group dynamics with multi-agent environments and used three methods to solve the multi-agent optimization problem; they found that decentralized training with the DQN outperforms both monotonic value factorization (QMIX) and multi-agent variational exploration (MAVEN) on the analyzed cooperative–competitive environments for both agent types [20]. Mukhtar et al. developed the Deep Graph Convolution Q-Network (DGCQ) by combining the DQN and the Graph Convolutional Network (GCN) to achieve a signal-free corridor; their model is trained on a synthetic traffic network and, when evaluated on real-world traffic networks, outperforms other state-of-the-art models [21].
In summary, a multi-skill call center is a complex queuing system, and it is difficult to build an accurate model of its dynamic routing problem because future customer arrivals, the remaining service times of customers in the system, and the patience of customers in the queue are hard to predict. Traditional methods often assume that customer arrivals and service times obey particular distributions, which keeps the modeled process far from reality. At the same time, traditional methods usually obtain good results only for specific call center structures, which limits their practicality. To address these issues and optimize the entire queuing system, this paper uses the DQN, for the first time, as a tool to build a dynamic routing policy for multi-skill call centers. The DQN does not rely on accurate modeling; it learns continuously through interaction with the environment and can be applied to call centers with different structures. Therefore, compared with previous methods, the DQN-based dynamic routing strategy is more practical and shows better optimization ability in more complex and uncertain settings.
The contribution of this article is twofold. In theory, the article studies dynamic routing policies for multi-skill call centers from the perspective of the Deep Q Network; by combining queuing theory and deep reinforcement learning, it remedies the limited applicability of traditional methods in more complex and uncertain settings. In practice, this study can guide each customer and agent to make informed decisions according to the current queue situation, so as to achieve a higher service level and a lower customer abandonment ratio.
The remainder of the article is organized as follows. Section 2 elaborates on the problem statement, and Section 3 presents the reinforcement learning framework design. Section 4 and Section 5 discuss the simulator design and the numerical experiment results. In Section 6, we offer our conclusions.

2. Problem Statement

Consider a call center where agents handle incoming calls that require different skills. Aiming at the highest overall service level and the lowest customer abandonment rate, a suitable dynamic routing strategy needs to be designed to optimize the multi-skill call center system and achieve a reasonable match between incoming calls and agents. A routing policy must make the following decisions. (a) When a customer of a certain type arrives, which agent group should that customer be routed to if there are idle agents who can serve them? (b) When an agent finishes a service and there are customers in the queue who can be served, which type of customer should the agent choose, or should the agent remain idle? The performance measures are described in Section 2.1, and Section 2.2 states our assumptions about other details of the problem.

2.1. Performance Measures

There is a well-known performance measure in the call center field called the 80/20 rule. It means that 80 percent of customers need to be served within 20 s. This rule can be abstracted into the service level (SL), defined as the fraction of calls whose delay fell below a prespecified acceptable waiting time (AWT), also called the “service-level” target [22].
Another important performance measure is the abandonment ratio (AR), defined as the fraction of customers who wait longer than they have patience for and abandon the service. This is a direct indication of how many customers the system has lost. For example, Bodur and Luedtke studied server scheduling in multi-skill multi-class service systems, and their objective function involved minimizing the scheduling cost and expected abandonment cost [23].
Chan et al. described the above two performance measures in detail [7]. The SL under routing policy π and AWT τ is
$$SL(\pi, \tau) = \frac{E[X(\pi, \tau)]}{E[N - N_A(\pi, \tau)]}, \qquad (1)$$
where X(π, τ) is the number of customers whose wait does not exceed τ, N is the total number of arrivals during the period considered, and N_A(π, τ) is the number of customers who abandoned within time τ. The AR under routing policy π is
$$AR(\pi) = \frac{E[Z(\pi)]}{E[N]}, \qquad (2)$$
where Z(π) is the number of customers who abandoned during the period considered.
In this paper, a dynamic routing policy is designed to optimize the service level and abandonment rate of the queueing system.
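To make Equations (1) and (2) concrete, the following is a minimal Python sketch of how the two measures could be computed from simulated call records. The CallRecord structure and the interpretation of X(π, τ) as calls answered within the AWT are illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CallRecord:          # illustrative summary of one arrival
    wait: float            # waiting time before service or abandonment (seconds)
    abandoned: bool        # True if the caller hung up before being served

def service_level(calls: List[CallRecord], awt: float = 20.0) -> float:
    """Equation (1): calls answered within the AWT, divided by arrivals
    minus the customers who abandoned within the AWT."""
    answered_in_time = sum(1 for c in calls if not c.abandoned and c.wait <= awt)
    abandoned_in_awt = sum(1 for c in calls if c.abandoned and c.wait <= awt)
    denom = len(calls) - abandoned_in_awt
    return answered_in_time / denom if denom > 0 else 0.0

def abandonment_ratio(calls: List[CallRecord]) -> float:
    """Equation (2): fraction of arrivals that abandoned."""
    return sum(c.abandoned for c in calls) / len(calls) if calls else 0.0
```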

2.2. Other Assumptions

The customers are classified into K categories, numbered 1 to K according to the type of skill they need. The agents are classified into I groups, numbered 1 to I according to the skills they possess. There are n_i agents in agent group i, and we assume that all agents in the same group are homogeneous. A multi-skill call center has a flexible queuing structure that can be described by a bipartite graph G = (K ∪ I, E) [24,25,26], where E is the set of edges connecting K and I. An agent group i can serve a customer of type k if and only if (k, i) ∈ E, where k ∈ K and i ∈ I. In addition, we define A_k ≜ {i ∈ I | (k, i) ∈ E}, the set of agent groups that can serve customers of type k, and C_i ≜ {k ∈ K | (k, i) ∈ E}, the set of customer types that can be served by agent group i. If a customer cannot be routed to an agent group upon arrival, the customer waits in an infinite waiting queue; if the waiting time exceeds the customer's patience, the customer abandons the service. It is usually assumed that customers arrive according to a Poisson process and that service times and patience times are exponentially distributed. However, in order to verify the robustness and practicability of our proposed method, we do not assume particular distributions.
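As an illustration of the routing-graph notation, the sketch below builds E, A_k, and C_i for a hypothetical fully connected X-design; the variable names and the example edge set are assumptions for illustration only.

```python
from collections import defaultdict

# Illustrative X-design: 2 customer types, 2 agent groups, every pair connected.
K = [1, 2]                             # customer types
I = [1, 2]                             # agent groups
E = [(1, 1), (1, 2), (2, 1), (2, 2)]   # edges (k, i): agent group i can serve type k

A = defaultdict(set)   # A[k]: agent groups able to serve customer type k
C = defaultdict(set)   # C[i]: customer types that agent group i can serve
for k, i in E:
    A[k].add(i)
    C[i].add(k)

print(A[1])   # {1, 2}
print(C[2])   # {1, 2}
```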

3. Reinforcement Framework Design

A multi-skill call center is a complex queuing system, and it is difficult to construct an accurate model when analyzing its dynamic routing policy. Reinforcement learning, however, does not rely on precise modeling; instead, it learns from the state transitions and rewards observed while interacting with the environment (e.g., a simulation). Reinforcement learning tasks are usually described as a Markov decision process, which requires a state space, an action space, and a reward function. This section describes these components and finally introduces the model-free reinforcement learning algorithm used here, the Deep Q Network (DQN).

3.1. States

Intuitively, call waiting times and the idle times of each agent group matter when making routing decisions. Chan et al. combined the two with an affine function to construct a weight-based routing policy [7]. However, they only considered the longest-waiting customer of each type and the longest-idle agent of each group. Consider an extreme case with two queues A and B, where queue A has exactly one customer who has waited 100 min and queue B has 100 customers who have each waited 90 min. In their routing policy, queue A receives a higher weight on call waiting times than queue B, which is clearly unreasonable. To address this defect, we define the state at time t as S_t = (c_{t,1}, ..., c_{t,K}, a_{t,1}, ..., a_{t,I}), where c_{t,k} is an n-dimensional vector (w_1, ..., w_n) containing the waiting times of the first n customers of type k, and a_{t,i} is an m-dimensional vector (i_1, ..., i_m) containing the idle times of the first m agents in agent group i.
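The state S_t can be assembled as a fixed-length vector by truncating or zero-padding the lists of waiting and idle times, as in the sketch below; the padding convention and function names are our assumptions under the stated definition.

```python
import numpy as np

def state_vector(queues, idle_agents, now, n=10, m=10):
    """Build S_t = (c_{t,1}, ..., c_{t,K}, a_{t,1}, ..., a_{t,I}).

    queues[k]      : arrival times of waiting customers of type k
    idle_agents[i] : times at which agents of group i became idle
    The n longest waits / m longest idle times are kept; shorter lists are zero-padded.
    """
    parts = []
    for arrivals in queues:                       # c_{t,k}
        waits = sorted((now - a for a in arrivals), reverse=True)[:n]
        parts.append(np.pad(waits, (0, n - len(waits))))
    for idle_since in idle_agents:                # a_{t,i}
        idles = sorted((now - s for s in idle_since), reverse=True)[:m]
        parts.append(np.pad(idles, (0, m - len(idles))))
    return np.concatenate(parts).astype(np.float32)
```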

3.2. Actions

When a call of type k arrives, the decision maker can either assign it to any agent group that can serve the call or keep the customer waiting. Hence, the customer's action set at time t can be defined as A_{t,k} = A_k ∪ {b}, where action b stands for keeping the customer waiting. When an agent of group i finishes a service, the decision maker can either assign the agent a new call that they can handle or keep the agent idle. Hence, the agent's action set at time t can be defined as A_{t,i} = C_i ∪ {b}, where action b stands for keeping the agent idle.
However, in order to build a single neural network for learning, we need to unify the action spaces of customers and agents. Note that, in essence, an edge (k, i) ∈ E of the bipartite graph G = (K ∪ I, E) is selected, whether an arriving customer of type k selects an agent of group i for service or an idle agent of group i selects a customer of type k from the queue. Therefore, the unified action space can be defined as A = E ∪ {b}.
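Continuing the illustrative graph sketch from Section 2.2, the unified action space A = E ∪ {b} can be enumerated once and then masked per decision maker; the masking helper below is one possible implementation and is an assumption, not the paper's code.

```python
# Unified action space A = E ∪ {b}: one index per edge plus one "do nothing" action.
ACTIONS = E + ["b"]                 # reuses E from the earlier sketch; "b" = keep waiting / keep idle
N_ACTIONS = len(ACTIONS)

def action_mask(decision_maker, kind):
    """Boolean mask over ACTIONS for a customer of type k (kind='customer')
    or an agent of group i (kind='agent'); invalid actions are masked out."""
    mask = [False] * N_ACTIONS
    for idx, a in enumerate(ACTIONS):
        if a == "b":
            mask[idx] = True                          # waiting/idling is always allowed
        elif kind == "customer" and a[0] == decision_maker:
            mask[idx] = True                          # edges (k, ·) for customer type k
        elif kind == "agent" and a[1] == decision_maker:
            mask[idx] = True                          # edges (·, i) for agent group i
    return mask
```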

3.3. Rewards

The design of the reward function is a very important step in reinforcement learning and affects both the convergence speed and the quality of the learned policy. The objective is to optimize the service level, abandonment rate, or agent group occupancy rate of the queueing system, so the reward function must reflect these goals. Tezcan and Dai considered a cost function consisting of holding and reneging costs over a finite time interval [5]. Following this idea, we define the reward functions as follows:
$$R(SL)_t = \sum_{k \in K} \sum_{i \in Q_k^t} \max\!\left(\frac{h_k w_i}{\tau_k},\, h_k\right), \qquad (3)$$
$$R(AB)_t = \sum_{k \in K} e_k Z_k^t, \qquad (4)$$
$$R(OR)_t = \sum_{j \in I} d_j \left(O_t^j - \eta_j\right), \qquad (5)$$
where Q_k^t denotes the set of customers of type k in the queue at time t, w_i denotes the waiting time of customer i, Z_k^t denotes the number of customers of type k who abandoned during period t, O_t^j denotes the occupancy rate of agent group j during period t, and η_j denotes the expected occupancy rate of agent group j. The parameters h_k < 0, e_k < 0, and d_j < 0 are the holding, abandonment, and occupancy penalties, respectively. We take R(SL)_t as the reward function when the optimization goal is to maximize the service level, R(AB)_t when the goal is to minimize the abandonment rate, and R(OR)_t when the goal is to minimize the occupancy rate. The combination of the three reward functions is discussed in Section 5. In particular, it can be seen from Equation (3) that when a customer's waiting time is less than the AWT, the penalty is proportional to the waiting time, and when it is greater than the AWT, the penalty is a fixed value.
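A minimal Python sketch of the three reward terms follows; the indexing scheme and the deviation form used for the occupancy term in Equation (5) are assumptions for illustration.

```python
def reward_sl(queues, now, h, tau):
    """Equation (3): holding penalty, proportional to waiting time up to the AWT."""
    r = 0.0
    for k, arrivals in enumerate(queues):
        for arrival_time in arrivals:
            w = now - arrival_time
            r += max(h[k] * w / tau[k], h[k])          # h[k] < 0
    return r

def reward_ab(abandoned_counts, e):
    """Equation (4): abandonment penalty for the current period."""
    return sum(e[k] * z for k, z in enumerate(abandoned_counts))     # e[k] < 0

def reward_or(occupancy, d, eta):
    """Equation (5), as used here: penalize occupancy above the target eta_j."""
    return sum(d[j] * (o - eta[j]) for j, o in enumerate(occupancy))  # d[j] < 0
```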

3.4. Deep Q Network

After the state space, action space, and reward function of the problem are specified, we need an appropriate algorithm to obtain the state value function v_π(s) = E_π[G_t | S_t = s] or the action value function q_π(s, a) = E_π[G_t | S_t = s, A_t = a], where G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{τ=0}^{∞} γ^τ R_{t+τ+1} is the long-term discounted reward and γ ∈ [0, 1] is the discount factor. Q-learning is a classic model-free method for this kind of problem and has the following steps [27]:
  • Initialize an empty table Q(s, a) that maps state–action pairs to value estimates.
  • Interact with the environment to obtain a tuple (s, a, r, s′) (state, action, reward, new state). In this step, we need to balance exploration and exploitation when deciding which action to take.
  • Update the Q(s, a) values using the Bellman approximation.
  • Check the convergence conditions. If they are not met, repeat from step 2 (a minimal sketch of this loop is given below).
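A tabular sketch of the Q-learning loop above, with a generic env_step callback standing in for the simulator; all names are illustrative.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # step 1: an (initially empty) table mapping (state, action) to values

def q_learning_step(Q, s, actions, env_step, alpha=0.1, gamma=0.99, eps=0.1):
    """One pass of steps 2-3: epsilon-greedy action choice, then a Bellman update.
    env_step(s, a) must return (reward, next_state)."""
    if random.random() < eps:                          # exploration
        a = random.choice(actions)
    else:                                              # exploitation
        a = max(actions, key=lambda x: Q[(s, x)])
    r, s2 = env_step(s, a)
    target = r + gamma * max(Q[(s2, x)] for x in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])          # Bellman approximation
    return s2
```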
When the number of states and actions is very large (as in the call center dynamic routing problem), it is not possible to update every state–action pair one by one. The function approximation method approximates the entire state value function (or action value function) with a parameterized model and updates the whole function in each learning step. In this way, value estimates for states (or state–action pairs) that have never been visited can also be updated. Linear functions and artificial neural networks are the two most common approximation functions. The latter combines deep learning with reinforcement learning and is called the Deep Q Network. The affine function proposed by Chan et al. is a kind of linear approximation [7], although they used a genetic algorithm to optimize it; their paper notes that such linear approximations have the drawback that policies with similar near-optimal costs may have very different weight parameters. The Deep Q Network instead uses an artificial neural network q(s, a; w), s ∈ S, a ∈ A, with strong nonlinear expressive power, in place of the action value function. The structure of the DQN model includes three linear layers connected by ReLU activations. To address training instability, researchers adopted two improvements to deep Q-learning: experience replay and a target network [28,29]. Algorithm 1 presents the application of the improved DQN algorithm to the dynamic routing problem of multi-skill call centers.
Algorithm 1 Deep Q-learning algorithm with experience replay and a target network
Data: w // weights of the evaluation network q(·, ·; w)
Data: w_target // weights of the target network q(·, ·; w_target)
Data: R(s, a), s ∈ S, a ∈ A // expected reward for action a in state s
Data: D // replay buffer
Data: T // training time for one episode
Data: M // max episodes
Data: N // a threshold value
Data: γ // discount factor
Data: α // learning rate of the evaluation network
Data: α_target // learning rate of the target network
Initialize w and w_target, empty replay buffer D, n ← 0
for m = 1, ..., M do
  for t = 1, ..., T do
    for each customer in queue, each newly arrived customer, and each idle agent do
      s ← current state                              // observe the current state
      if the decision maker is a customer then       // mark it with a Boolean variable
        δ ← 1
      else                                           // the decision maker is an agent
        δ ← 0
      p ← a random number in [0, 1]                  // generate a random number
      a ← chooseAction(s, δ, p)                      // select a from the action set via the evaluation network and execute it
      r ← R(s, a)                                    // observe the reward
      s′ ← new state                                 // observe the new state
      Store transition (s, a, r, s′) in the buffer D
      Sample a random minibatch of transitions (s_i, a_i, r_i, s′_i), i ∈ B, from D
      u_i ← r_i + γ max_a q(s′_i, a; w_target), i ∈ B            // estimate of the return (learning target)
      w ← w + α (1/|B|) Σ_{i∈B} [u_i − q(s_i, a_i; w)] ∇_w q(s_i, a_i; w)   // update w
      s ← s′                                         // update the current state
      n ← n + 1
      if n ≥ N then                                  // update w_target
        w_target ← (1 − α_target) w_target + α_target w
    end for
  end for
end for
Function chooseAction(s, δ, p)
Input: s // a state of the agent
Input: δ // Boolean variable, 1 represents a customer and 0 represents an agent
Data: ε // current value of ε
Data: ε_decay // update value for ε
Data: p // a threshold value
Data: A_k // set of all available actions for a customer of type k
Data: A_i // set of all available actions for an agent of group i
if δ = 1 then                                        // the decision maker is a customer
  k ← customer type
  if ε < p then                                      // greedy
    a ← argmax_{a ∈ A_k} q(s, a; w)
  else                                               // random
    j ← a random integer in [1, |A_k|]
    a ← a_j                                          // a_j is the j-th element of A_k
else                                                 // the decision maker is an agent
  i ← agent type
  if ε < p then                                      // greedy
    a ← argmax_{a ∈ A_i} q(s, a; w)
  else                                               // random
    j ← a random integer in [1, |A_i|]
    a ← a_j
ε ← ε × ε_decay
return a
In Algorithm 1, we use the target network to estimate the return as a learning target. In the process of weight updating, only the weight of the evaluation network is updated. After a certain number of updates, the weight of the evaluation network is assigned to the target network. We store the new transition ( s , a , r , s ) in a buffer of a fixed size so that it pushes the oldest experience out of it. We then take samples from the buffer to update the optimal value function. This is called a replay buffer.
For each customer in the queue, each newly arrived customer, and each idle agent, an action is chosen through the function chooseAction, which follows the epsilon-greedy method. Under this policy, the customer or the agent chooses the action with the highest Q-value with probability 1 − ε and otherwise selects a random action. As discussed above, the action space is E ∪ {b}, and q(s, a; w), s ∈ S, a ∈ A, gives the Q-value of any edge (k, i) ∈ E and of action b. For a customer of type k, the set of actions they can choose is a subset of A, defined as A_k = {(k, i) | (k, i) ∈ E, i ∈ A_k} ∪ {b}, so the action chosen under the greedy strategy is argmax_{a ∈ A_k} q(s, a; w). The same holds for an agent of group i. The value of ε is reduced multiplicatively according to the decay parameter ε_decay.
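The following PyTorch sketch illustrates the pieces described in this section: a three-layer ReLU network, epsilon-greedy action selection restricted to the valid actions of the current decision maker, and a minibatch TD update with a soft target-network update. The hidden width, optimizer handling, and tensor layout are assumptions; this is a sketch, not the author's released implementation.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Three linear layers connected by ReLU, as described for the DQN model."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def choose_action(qnet, state, valid_actions, eps):
    """Epsilon-greedy over the unified action space; valid_actions are indices
    of the edges (k, i) available to the current decision maker plus action b."""
    if random.random() < eps:
        return random.choice(valid_actions)
    with torch.no_grad():
        q = qnet(torch.as_tensor(state).unsqueeze(0)).squeeze(0)
    return max(valid_actions, key=lambda a: q[a].item())

def td_update(qnet, target_net, optimizer, batch, gamma=0.99, alpha_target=0.01):
    """Minibatch TD update of the evaluation network and soft target update."""
    s, a, r, s2 = batch                                  # tensors sampled from the replay buffer
    q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # soft update: w_target <- (1 - alpha_target) * w_target + alpha_target * w
    for tp, p in zip(target_net.parameters(), qnet.parameters()):
        tp.data.mul_(1 - alpha_target).add_(alpha_target * p.data)
    return loss.item()
```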

4. Simulator Design

Simulation is an important tool for studying the routing problem of multi-skill call centers. The efficiency of reinforcement learning algorithms also depends on the simulation environment. This section presents an event-based simulator design. In the simulator, all times (waiting times, agent idle times, delay times, etc.) are counted in seconds, for all policies. Throughout this article, we use Python 3.9 for coding (Algorithm 2).
Algorithm 2 Multi-skill call center simulator
Data: T // total simulation time
Data: queue // waiting queue for customers
Data: events // events occurring at the current moment
Initialize queue and events, and generate customer arrival information
for t = 1, ..., T do
  events ← all events happening at time t            // there are three types of events
  for event ∈ events do
    if event is the arrival of a new customer then
      customer ← the newly arrived customer
      a ← chooseAgent(customer)                      // a is either an assignment to a certain agent or keep waiting
      if a = keep waiting then
        push the customer into queue
        generate the customer's patience time
      else
        generate the customer's finish time
    else if event is an agent finishing service then
      agent ← the idle agent
      a ← chooseCustomer(agent)
      if a = keep idling then
        pass
      else
        customer ← the chosen customer
        generate the customer's finish time
        pop customer from queue
    else                                             // event is a customer abandoning the service
      customer ← the abandoning customer
      pop customer from queue
    update system performance measures
  end for
  clear events
end for
In each time step, we record all events that occur at the current moment. There are three types of events: the arrival of a new customer, the completion of a service by an agent, and the abandonment by a waiting customer in the queue. Newly arriving customers are either assigned to agents with the requisite skills or queued for later service. Once a customer enters service, the service completion time is drawn from a random generator; likewise, when a customer joins the queue, an associated patience time is generated. An agent who finishes a service can either select an available customer in the queue to serve or remain idle. Finally, customers who give up waiting are removed from the queue. At the end of each step, we update the performance measures of the system and clear the event list. chooseAgent(customer) and chooseCustomer(agent) are predefined functions that control the simulation flow; they depend on the routing policy in use.
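A condensed event-driven skeleton of Algorithm 2 is sketched below; the customer attributes (patience, service_time) and the heap-based event list are illustrative assumptions, and the routing callbacks chooseAgent/chooseCustomer are left as stubs supplied by the policy.

```python
import heapq
from itertools import count

def simulate(total_time, arrivals, choose_agent, choose_customer):
    """Minimal event-driven skeleton mirroring Algorithm 2 (routing callbacks are stubs).
    arrivals: iterable of (time, customer); customers carry .patience and .service_time."""
    seq = count()                                        # tie-breaker so the heap never compares customers
    events = [(t, next(seq), "arrival", c) for t, c in arrivals]
    heapq.heapify(events)
    queue = []                                           # waiting customers
    while events and events[0][0] <= total_time:
        t, _, kind, obj = heapq.heappop(events)
        if kind == "arrival":
            agent = choose_agent(obj, t)                 # None means "keep waiting"
            if agent is None:
                queue.append(obj)
                heapq.heappush(events, (t + obj.patience, next(seq), "abandon", obj))
            else:
                heapq.heappush(events, (t + obj.service_time, next(seq), "finish", agent))
        elif kind == "finish":
            customer = choose_customer(obj, queue, t)    # None means "keep idling"
            if customer is not None:
                queue.remove(customer)
                heapq.heappush(events, (t + customer.service_time, next(seq), "finish", obj))
        elif kind == "abandon" and obj in queue:         // ignored if already served
            queue.remove(obj)
        # update performance measures here
```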

5. Numerical Experiments

We first experimented with the dynamic routing policy in the classic small-scale call centers, such as the X-design and W-design call centers. Their structure is shown in Figure 1 and Figure 2. Then for a large-scale call center, we considered the most commonly used “long chain” structure and the “single pooling” structure as shown in Figure 3 and Figure 4. The rectangles in Figure 1, Figure 2, Figure 3 and Figure 4 represent multiple customer types and the circles represent multiple agent groups.
For each call center structure, we set a group of corresponding parameters to simulate the real situation. All parameter settings used in the simulations are shown in Table 1. Most of our parameter settings follow Chan et al.'s study [7]. The parameters of the DQN training model were set according to experience and the results of multiple simulation runs.
In the X-design model, we take Poisson arrivals with rates λ = (λ_1, λ_2) = (18, 1.8). In the W-design model, we consider the arrival process of each call type to be Poisson–Gamma: a stationary Poisson arrival process each day, whose random rate (for the entire day) has a gamma distribution with means (3000, 1000, 200) and standard deviations (244.9, 223.6, 40). Long chain is an important structure used to improve the flexibility of systems such as supply chains and multi-skill call centers [30]; each customer type k ∈ K can be served by two agent groups, k and k + 1 mod |K|. Single pooling is another flexible architecture proposed by Legros et al. [4], who showed that it performs better than chaining in various cases of asymmetry; for single pooling, each customer type k ∈ K can be served by agent groups k and |K|.
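For reference, the random quantities in Table 1 could be generated with NumPy as in the sketch below; the time-unit conventions and the moment matching used for the gamma and lognormal parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_arrival_times(rate, horizon):
    """Stationary Poisson process: cumulative sums of exponential inter-arrival times."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return times
        times.append(t)

def poisson_gamma_rate(mean, sd):
    """Random daily arrival rate for the W-design, gamma-distributed with the given mean and sd."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return rng.gamma(shape, scale)

def lognormal_service_time(mean, sd):
    """Lognormal service time parameterized by its mean and standard deviation (moment matching)."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    return rng.lognormal(np.log(mean) - sigma2 / 2.0, np.sqrt(sigma2))

# Exponential service and patience times, e.g. for the X-design (rates mu_11 and nu_1 in Table 1).
service_time = rng.exponential(1.0 / 0.198)
patience_time = rng.exponential(1.0 / 0.12)
```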
When using the DQN policy in the simulations, we refer to Chan et al.'s research [7] and construct three different reward functions based on Equations (3)–(5), shown in Equations (6)–(8). Based on the simulation results under the four call center structures, the best form of the reward function is then selected. To evaluate the final results, we also construct an evaluation function, shown in Equation (9).
$$R_S^t = \sum_{k \in K} \sum_{i \in Q_k^t} \max\!\left(\frac{h_k w_i}{\tau_k},\, h_k\right), \qquad (6)$$
$$R_{SA}^t = \sum_{k \in K} \sum_{i \in Q_k^t} \max\!\left(\frac{h_k w_i}{\tau_k},\, h_k\right) + \sum_{k \in K} e_k Z_k^t, \qquad (7)$$
$$R_{SAO}^t = \sum_{k \in K} \sum_{i \in Q_k^t} \max\!\left(\frac{h_k w_i}{\tau_k},\, h_k\right) + \sum_{k \in K} e_k Z_k^t + \sum_{j \in I} d_j \left(O_t^j - \eta_j\right), \qquad (8)$$
$$F = \sum_{k \in K} \left[\max\!\big(100(\overline{SL} - c_k),\, 0\big)^2 + \big(100\, A_k\big)^2\right] + \sum_{j \in I} \max\!\big(100(O_j - \overline{OR}),\, 0\big)^2. \qquad (9)$$
R_S^t is a reward function that only considers the service level, R_SA^t considers both the service level and the abandonment rate, and R_SAO^t considers the service level, abandonment rate, and agent group occupancy. F evaluates the overall balance of the results: the lower the value of F, the more balanced the results. Here, c_k and A_k denote the service level and abandonment rate of customer type k over the whole simulation, and O_j denotes the occupancy rate of agent group j over the whole simulation. $\overline{SL}$ represents the expected service level for each customer type, and $\overline{OR}$ represents the expected occupancy rate for each agent group (equal to η_j). Under the three reward functions, the resulting evaluations are denoted F_S, F_SA, and F_SAO, respectively.
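A small sketch of the evaluation function in Equation (9); plugging in the DQN row of Table 3 reproduces the reported value up to rounding.

```python
def evaluation_F(sl_per_type, ar_per_type, occ_per_group, sl_bar=0.8, or_bar=0.95):
    """Equation (9): penalize service levels below SL_bar, all abandonment,
    and occupancy above OR_bar (all in percentage points, then squared)."""
    f = sum(max(100 * (sl_bar - c), 0.0) ** 2 + (100 * a) ** 2
            for c, a in zip(sl_per_type, ar_per_type))
    f += sum(max(100 * (o - or_bar), 0.0) ** 2 for o in occ_per_group)
    return f

# DQN row of Table 3: prints ~329.4 (the table reports 329.46)
print(evaluation_F([0.8811, 0.6772], [0.0142, 0.1329], [0.8482, 0.8932]))
```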
In our training model, the maximum number of episodes is 20 for the X-design and W-design models and 50 for the long chain and single pooling models, and the state dimension parameters are n = m = 10. Therefore, the state space has 40 dimensions in the X-design model, 60 dimensions in the W-design model, and 120 dimensions in the long chain and single pooling models. The structure of the DQN model includes three linear layers connected by ReLU activations. All experiments were performed on a 2.60-GHz Intel(R) Core(TM) i7-10750H CPU (HP, Shanghai, China).
Under the four call center structures, using the DQN policy, the results of the three kinds of reward functions are shown in Table 2.
SL and AR represent the final overall service level and abandonment rate, calculated using Equations (1) and (2), respectively. According to the results, under the X-design and long chain structures, selecting R_SAO^t as the reward function gives the highest service level, the lowest abandonment rate, and the best evaluation value. Under the single pooling structure, R_SAO^t achieves the lowest abandonment rate and the best evaluation value, while its service level is the second highest. Under the W-design structure, the three reward functions perform identically. Overall, we consider R_SAO^t to be the best form of the reward function, and it is used throughout the rest of the article.
To highlight the advantages of the DQN policy, we compare it with three other classic and influential routing policies: the static policy Global FCFS (G) and the dynamic policies Priorities with Idle Agent Thresholds (PT) and Weight-Based Routing (WR). Detailed definitions of these policies and their routing optimization can be found in Chan et al. (2014) [7].

5.1. Experiments with the X-Design

Based on the parameter settings of the X-design instance mentioned above, the results obtained using four different policies including DQN are shown in Table 3.
It can be found that compared with other policies, the DQN is in a leading position in terms of the overall service level, abandonment rate, and agent group occupancy rate. However, the value of the evaluation function of the DQN is not good, which means that the results of the DQN are less balanced than those of other policies.
At this point, we can adjust the penalty factor of each customer type to achieve a specific service level or abandonment rate target for that type. For example, in the X-design model, if we set the holding penalty of the two customer types to be the same ( h_k = −0.1 ) and take the service level as the performance measure, the service levels of the two types are shown in Figure 5: the service level c_1 is 88.11%, while c_2 is 67.72%. This is because the arrival rate of the first customer type is 10 times that of the second, so when the holding penalties are equal, the system sacrifices some of the second type's service level to satisfy the majority of arriving first-type customers. If we instead set the holding penalty of the second type to 10 times that of the first and keep the other parameters unchanged, the result is shown in Figure 6: the service levels of the two types become close ( c_1 is 78.70% and c_2 is 78.30%), but the overall service level decreases.
In fact, we can flexibly allocate the holding penalty to meet various demands, whether we want the overall service level to be the highest or each service level to be relatively good. For other strategies, although there is no reward function like that in the DQN, the goal can also be changed according to the actual demand to obtain the desired result. In this paper, since the reward function in the DQN is set as the weighted sum of the service rate, abandonment rate, and occupancy rate, the objectives corresponding to other policies are also the comprehensive embodiment of the three indexes.
Figure 7 shows the loss curve of the DQN algorithm when the optimization goal is to maximize the service level in the X-design structure. The horizontal axis is the number of training steps, and the vertical axis is the loss. One episode contains 36,000 steps. It can be seen that 20 episodes can well ensure the convergence of the algorithm. Figure 8 shows the curve of the average reward changing with the number of training steps.

5.2. Experiments with Other Call Center Structures

Based on the parameter settings of the W-design instance, single pool instance, and long chain instance mentioned above, the results obtained using four different policies are shown in Table 4, Table 5 and Table 6.
Based on Table 3, Table 4, Table 5 and Table 6, it can be found that in a small-scale call center structure such as the X-design or W-design, the DQN is superior in terms of the overall service rate, abandonment rate, and agent group occupancy rate compared with other policies. However, the results of the DQN are not more balanced than those of other policies. For large-scale call center structures such as single pool and long chain, the advantages of the DQN are more prominent. In addition to service rates, abandonment rates, and occupancy rates, the results of the DQN are more balanced than those of other policies. Overall, the DQN policy shows superiority under all these kinds of different call center structures, so the applicability of the DQN is very strong.
Under the three structures, after the adoption of the DQN policy, the change curve of the service rate of each customer type is shown in Figure 9, Figure 10 and Figure 11.
Through these figures, we can see that under different call center structures, the service level of each customer type fluctuates greatly at the beginning and gradually stabilizes. In the instances we set, the service levels of all customer types in the long chain structure eventually stabilize above 85%, whereas they stabilize between 50% and 85% in the W-design structure. There is also a significant difference among the service levels of the customer types in the single pool structure, with a gap of more than 50 percentage points between the highest and lowest values.
In addition, the routing optimization based on the genetic algorithm (WR) takes about 30 min for the X-design and W-design call centers and about 3 h for the long chain and single pooling models, even though the fitness evaluation for the long chain and single pooling models is parallelized over 6 cores. By contrast, training the DQN-based policy on a single thread takes only about 10 min for the X-design and W-design models and about 45 min for the long chain and single pooling models.

5.3. Robustness of the DQN Policy

To verify the robustness of the proposed dynamic routing policy, we consider different distributions of the arrival and service processes, different parameters of the distribution functions, changes in the number of agents in each group, and other factors that may affect the performance of the whole system. Here, we take the long chain structure as an example and verify the robustness of the DQN by adjusting the arrival and service rates. The results are shown in Table 7 and Table 8.
It can be seen that under different arrival rates and service rates in the long chain structure, the DQN is a relatively better routing policy.

6. Conclusions

This paper proposes a dynamic routing policy for multi-skill call centers based on the Deep Q Network. Each customer and agent makes decisions according to the waiting times of customers in the current queues and the idle times of the agent groups. Unlike the linear function used in Weight-Based Routing, the Deep Q Network (a nonlinear function) is used to decide which action should be chosen. The optimal decision is learned by interacting with the external environment, and a multi-skill call center simulator is designed for training the reinforcement learning model. The DQN-based dynamic routing policy performs better than the common static policy Global FCFS and the dynamic policies Priorities with Idle Agent Thresholds and Weight-Based Routing in various examples. Moreover, training the DQN-based routing policy is much faster than routing optimization based on simulation and a genetic algorithm. In future research, combining dynamic routing policies with scheduling optimization and structural design for multi-skill call centers is an interesting direction, and designing appropriate state spaces and reward functions will be key to solving these problems. The application of multi-agent reinforcement learning to multi-skill call centers is another future research direction.

Funding

This research received no external funding.

Data Availability Statement

The data and code presented in this study are openly available at https://github.com/18780047285/DQN.git, accessed on 10 November 2023.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Shumsky, R.A. Approximation and analysis of a call center with flexible and specialized servers. OR Spectrum. 2004, 26, 307–330. [Google Scholar] [CrossRef]
  2. Wallace, R.B.; Whitt, W. A Staffing Algorithm for Call Centers with Skill-Based Routing. Manuf. Serv. Oper. Manag. 2005, 7, 276–294. [Google Scholar] [CrossRef]
  3. Cezik, M.T.; L’Ecuyer, P. Staffing Multiskill Call Centers via Linear Programming and Simulation. Manag. Sci. 2008, 54, 310–323. [Google Scholar] [CrossRef]
  4. Legros, B.; Jouini, O.; Dallery, Y. A flexible architecture for call centers with skill-based routing. Int. J. Prod. Econ. 2015, 159, 192–207. [Google Scholar] [CrossRef]
  5. Tezcan, T.; Dai, J.G. Dynamic Control of N-Systems with Many Servers: Asymptotic Optimality of a Static Priority Policy in Heavy Traffic. Oper. Res. 2010, 58, 94–110. [Google Scholar] [CrossRef]
  6. Ormeci, E.L. Dynamic Admission Control in a Call Center with One Shared and Two Dedicated Service Facilities. IEEE Trans. Autom. Control 2004, 49, 1157–1161. [Google Scholar] [CrossRef]
  7. Chan, W.; Koole, G.; L’Ecuyer, P. Dynamic Call Center Routing Policies Using Call Waiting and Agent Idle Times. Manuf. Serv. Oper. Manag. 2014, 16, 544–560. [Google Scholar] [CrossRef]
  8. Baubaid, A.; Boland, N.; Savelsbergh, M. The Dynamic Freight Routing Problem for Less-Than-Truckload Carriers. Transp. Sci. 2022, 57, 717–740. [Google Scholar] [CrossRef]
  9. Bae, J.W.; Kim, K.-K.K. Gaussian Process Approximate Dynamic Programming for Energy-Optimal Supervisory Control of Parallel Hybrid Electric Vehicles. IEEE Trans. Veh. Technol. 2022, 71, 8367–8380. [Google Scholar] [CrossRef]
  10. Anuar, W.K.; Lee, L.S.; Seow, H.-V.; Pickl, S. A Multi-Depot Dynamic Vehicle Routing Problem with Stochastic Road Capacity: An MDP Model and Dynamic Policy for Post-Decision State Rollout Algorithm in Reinforcement Learning. Mathematics 2022, 10, 2699. [Google Scholar] [CrossRef]
  11. Legros, B. Waiting time based routing policies to parallel queues with percentiles objectives. Oper. Res. Lett. 2018, 46, 356–361. [Google Scholar] [CrossRef]
  12. Mao, Y.; Miller, R.A.; Bodenreider, O.; Nguyen, V.; Fung, K.W. Two complementary AI approaches for predicting UMLS semantic group assignment: Heuristic reasoning and deep learning. J. Am. Med. Inform. Assoc. 2023. [Google Scholar] [CrossRef] [PubMed]
  13. Zhang, Z.; Li, M.; Lin, X.; Wang, Y.; He, F. Multistep speed prediction on traffic networks: A deep learning approach considering spatio-temporal dependencies. Transp. Res. Part C Emerg. Technol. 2019, 105, 297–322. [Google Scholar] [CrossRef]
  14. Tan, J. A Method to Plan the Path of a Robot Utilizing Deep Reinforcement Learning and Multi-Sensory Information Fusion. Appl. Artif. Intell. 2023, 37. [Google Scholar] [CrossRef]
  15. Kool, W.; van Hoof, H.; Welling, M. Attention, learn to solve routing problems! arXiv 2018, arXiv:1803.08475. [Google Scholar]
  16. Fuller, D.B.; de Arruda, E.F.; Ferreira Filho, V.J.M. Learning-agent-based simulation for queue network systems. J. Oper. Res. Soc. 2020, 71, 1723–1739. [Google Scholar] [CrossRef]
  17. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  18. Waschneck, B.; Reichstaller, A.; Belzner, L.; Altenmüller, T.; Bauernhansl, T.; Knapp, A.; Kyek, A. Optimization of global production scheduling with deep reinforcement learning. Procedia CIRP 2018, 72, 1264–1269. [Google Scholar] [CrossRef]
  19. Dong, Y.; Alwakeel, A.M.; Alwakeel, M.M.; Alharbi, L.A.; Althubiti, S.A. A Heuristic Deep Q Learning for Offloading in Edge Devices in 5 g Networks. J. Grid Comput. 2023, 21, 1–15. [Google Scholar] [CrossRef]
  20. Kopacz, A.; Csató, L.; Chira, C. Evaluating cooperative-competitive dynamics with deep Q-learning. Neurocomputing 2023, 550. [Google Scholar] [CrossRef]
  21. Mukhtar, H.; Afzal, A.; Alahmari, S.; Yonbawi, S. CCGN: Centralized collaborative graphical transformer multi-agent reinforcement learning for multi-intersection signal free-corridor. Neural Netw. 2023, 166, 396–409. [Google Scholar] [CrossRef] [PubMed]
  22. Cao, P.; He, S.; Huang, J.; Liu, Y. To Pool or Not to Pool: Queueing Design for Large-Scale Service Systems. Oper. Res. 2020, 69, 1866–1885. [Google Scholar] [CrossRef]
  23. Bodur, M.; Luedtke, J.R. Mixed-Integer Rounding Enhanced Benders Decomposition for Multiclass Service-System Staffing and Scheduling with Arrival Rate Uncertainty. Manag. Sci. 2016, 63, 2073–2091. [Google Scholar] [CrossRef]
  24. Tsitsiklis, J.N.; Xu, K. Flexible Queueing Architectures. Oper. Res. 2017, 65, 1398–1413. [Google Scholar] [CrossRef]
  25. Chen, X.; Zhang, J.; Zhou, Y.; Tsitsiklis, J.N.; Xu, K.; Li, Y.; Shu, J.; Song, M.; Zheng, H.; Saghafian, S.; et al. Optimal Sparse Designs for Process Flexibility via Probabilistic Expanders. Oper. Res. 2014, 63, 1159–1176. [Google Scholar] [CrossRef]
  26. Stolyar, A.L.; Yudovina, E. Systems with large flexible server pools: Instability of “natural” load balancing. Ann. Appl. Probab. 2013, 23, 2099–2138. [Google Scholar] [CrossRef]
  27. Sutton, R.S.; Barto, A.G. Reinforcement learning: An introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  28. Kumar, R.; Sharma, N.V.K.; Chaurasiya, V.K. Adaptive traffic light control using deep reinforcement learning technique. Multimedia Tools Appl. 2023, 1–22. [Google Scholar] [CrossRef]
  29. Lim, B.; Vu, M. Distributed Multi-Agent Deep Q-Learning for Load Balancing User Association in Dense Networks. IEEE Wirel. Commun. Lett. 2023, 12, 1120–1124. [Google Scholar] [CrossRef]
  30. Chou, M.C.; Chua, G.A.; Teo, C.-P.; Zheng, H. Design for Process Flexibility: Efficiency of the Long Chain and Sparse Structure. Oper. Res. 2010, 58, 43–58. [Google Scholar] [CrossRef]
Figure 1. X-design.
Figure 2. W-design.
Figure 3. Long chain structure.
Figure 4. Single pooling structure.
Figure 5. Service level curve ( c_1, c_2 use the same holding penalty).
Figure 6. Service level curve ( c_2 uses a higher holding penalty).
Figure 7. Loss curve for X-design.
Figure 8. Mean reward curve for X-design.
Figure 9. Service level curve under W-design.
Figure 10. Service level curve under single pool.
Figure 11. Service level curve under long chain.
Table 1. Simulation parameter settings.

| Parameters | Meaning | Value |
|---|---|---|
| **X-design** | | |
| λ^X = (λ_1, λ_2) | Poisson arrival rates | (18, 1.8) |
| μ^X = (μ_11, μ_12, μ_21, μ_22) | Rates of exponential service times | (0.198, 0.18, 0.162, 0.18) |
| ν^X = (ν_1, ν_2) | Rates of exponential patience times | (0.12, 0.24) |
| n^X = (n_1, n_2) | Staffing numbers | (90, 14) |
| τ_k^X, k = 1, 2 | Acceptable waiting times | 20 |
| **W-design** | | |
| λ^W | Means of Poisson–Gamma arrival rates | (3000, 1000, 200) |
|  | Standard deviations of Poisson–Gamma arrival rates | (244.9, 223.6, 40) |
| φ^W = (φ_11, φ_21, φ_22, φ_32) | Means of lognormal service times | (8, 10, 9, 15) |
| ω^W = (ω_11, ω_21, ω_22, ω_32) | Standard deviations of lognormal service times | (8, 10, 11, 12) |
| ν^W = (ν_1, ν_2, ν_3) | Rates of exponential patience times | (0.2, 0.11, 0.1) |
| n^W = (n_1, n_2) | Staffing numbers | (48, 12) |
| τ_k^W, k = 1, 2, 3 | Acceptable waiting times | (20, 20, 20) |
| **Long chain** | | |
| λ^LC | Poisson arrival rates | (10, 8, 6, 13, 7, 6) |
| μ^LC = μ_kk, k = 1, …, K | Rates of exponential service times | 0.3 |
| μ^LC = μ_kj, k = 1, …, K, j = k + 1 mod K |  | 0.15 |
| ν^LC | Rates of exponential patience times | (0.12, 0.15, 0.18, 0.18, 0.15, 0.12) |
| n^LC | Staffing numbers | (50, 40, 30, 50, 40, 30) |
| τ_k^LC, k = 1, …, K | Acceptable waiting times | 20 |
| **Single pooling** | | |
| λ^SP | Poisson arrival rates | (14, 10, 11, 13, 11, 6) |
| μ^SP = μ_kk, k = 1, …, K | Rates of exponential service times | 0.3 |
| μ^SP = μ_k,K, k = 1, …, K |  | 0.15 |
| ν^SP | Rates of exponential patience times | (0.12, 0.15, 0.18, 0.18, 0.15, 0.12) |
| n^SP | Staffing numbers | (50, 40, 30, 50, 40, 30) |
| τ_k^SP, k = 1, …, K | Acceptable waiting times | 20 |
| **Training model** | | |
| h_k, k ∈ K | Penalty parameters of the holding | −0.1 |
| e_k, k ∈ K | Penalty parameters of the abandonment | −1 |
| d_j, j ∈ I | Penalty parameters of the occupancy | −1 |
| SL (bar) | Expected service rate for each customer type | 0.8 |
| OR (bar) / η_j, j ∈ I | Expected occupancy rate for each agent group | 0.95 |
| γ | Discount factor | 0.99 |
| bz | Batch size | 32 |
| rs | Replay size | 10,000 |
| α = α_target | Learning rate | 10^−4 |
| (ε, ε_decay) | Initial values | (1, 0.9) |
| T | Training steps for one episode | 360,000 |
| N | Threshold value for updating the target network | 1000 |
Table 2. DQN results under three kinds of reward functions.

| Structure | F_S | SL | AR | F_SA | SL | AR | F_SAO | SL | AR |
|---|---|---|---|---|---|---|---|---|---|
| X-design | 531.21 | 85.32% | 2.68% | 531.21 | 85.32% | 2.68% | 329.46 | 86.27% | 2.51% |
| W-design | 432.92 | 76.95% | 3.88% | 432.92 | 76.95% | 3.88% | 432.92 | 76.95% | 3.88% |
| Single pooling | 3220.72 | 77.55% | 3.51% | 3846.51 | 76.16% | 3.36% | 2672.95 | 76.68% | 3.09% |
| Long chain | 168.94 | 86.40% | 2.21% | 13.05 | 93.32% | 1.13% | 2.27 | 96.96% | 0.55% |
Table 3. Results with the X-design.

| Policy | F | (c_1, c_2, A_1, A_2) | SL | AR | (O_1, O_2) |
|---|---|---|---|---|---|
| G | 221.73 | (74.50%, 74.64%, 2.43%, 4.63%) | 74.51% | 2.63% | (106.27%, 97.88%) |
| PT | 265.58 | (74.53%, 74.62%, 2.38%, 7.14%) | 74.54% | 2.81% | (105.97%, 100.45%) |
| WR | 207.04 | (73.47%, 75.47%, 2.49%, 5.91%) | 73.66% | 2.82% | (105.14%, 93.18%) |
| DQN | 329.46 | (88.11%, 67.72%, 1.42%, 13.29%) | 86.27% | 2.51% | (84.82%, 89.32%) |
Table 4. Results with the W-design.

| Policy | F | (c_1, c_2, c_3, A_1, A_2, A_3) | SL | AR | (O_1, O_2) |
|---|---|---|---|---|---|
| G | 509.38 | (70.50%, 82.81%, 62.36%, 4.90%, 1.70%, 4.97%) | 72.99% | 4.17% | (102.50%, 92.71%) |
| PT | 315.42 | (73.26%, 75.43%, 70.72%, 4.77%, 5.42%, 5.52%) | 73.65% | 4.95% | (103.96%, 83.69%) |
| WR | 1277.57 | (65.74%, 59.31%, 59.64%, 6.18%, 9.78%, 4.79%) | 63.97% | 6.95% | (103.65%, 77.24%) |
| DQN | 432.92 | (76.30%, 82.47%, 61.40%, 3.80%, 3.45%, 6.86%) | 76.95% | 3.88% | (83.44%, 83.94%) |
Table 5. Results with the single pool.

| Policy | F | (c_1, c_2, c_3, c_4, c_5, c_6) | (A_1, A_2, A_3, A_4, A_5, A_6) | SL | AR | (O_1, O_2, O_3, O_4, O_5, O_6) |
|---|---|---|---|---|---|---|
| G | 6167.19 | (87.14%, 95.34%, 44.50%, 95.15%, 86.67%, 14.28%) | (1.16%, 0.54%, 8.79%, 0.86%, 1.73%, 10.91%) | 76.45% | 3.28% | (98.89%, 91.86%, 107.43%, 93.46%, 97.61%, 109.48%) |
| PT | 4829.59 | (83.78%, 89.66%, 23.69%, 90.30%, 72.85%, 47.75%) | (1.60%, 1.17%, 14.04%, 1.32%, 3.33%, 6.07%) | 71.03% | 4.28% | (97.68%, 89.61%, 105.50%, 93.31%, 95.36%, 109.10%) |
| WR | 5497.66 | (93.21%, 97.61%, 11.51%, 91.86%, 75.96%, 69.69%) | (0.69%, 0.31%, 20.74%, 1.50%, 2.98%, 3.48%) | 75.01% | 4.89% | (96.01%, 88.05%, 105.74%, 93.24%, 95.00%, 105.65%) |
| DQN | 2672.95 | (79.24%, 98.08%, 42.92%, 96.08%, 80.61%, 45.80%) | (1.93%, 0.37%, 9.60%, 0.63%, 2.26%, 5.11%) | 76.68% | 3.09% | (81.88%, 74.42%, 88.12%, 75.90%, 81.30%, 89.96%) |
Table 6. Results with the long chain.

| Policy | F | (c_1, c_2, c_3, c_4, c_5, c_6) | (A_1, A_2, A_3, A_4, A_5, A_6) | SL | AR | (O_1, O_2, O_3, O_4, O_5, O_6) |
|---|---|---|---|---|---|---|
| G | 1921.77 | (74.16%, 67.13%, 68.69%, 62.87%, 61.89%, 72.77%) | (2.57%, 3.83%, 3.76%, 5.01%, 3.70%, 2.57%) | 67.59% | 3.70% | (106.75%, 106.62%, 106.59%, 107.06%, 107.20%, 106.74%) |
| PT | 5.38 | (100%, 98.40%, 98.48%, 86.48%, 100%, 100%) | (0.00%, 0.17%, 0.28%, 2.30%, 0.00%, 0.00%) | 96.08% | 0.66% | (76.03%, 73.63%, 74.60%, 93.78%, 64.20%, 72.82%) |
| WR | 4.68 | (100%, 98.11%, 95.44%, 86.02%, 100%, 100%) | (0.00%, 0.17%, 0.63%, 2.06%, 0.00%, 0.00%) | 95.48% | 0.65% | (73.59%, 72.68%, 74.70%, 93.30%, 64.13%, 70.78%) |
| DQN | 2.27 | (93.72%, 97.40%, 95.80%, 99.35%, 95.98%, 98.76%) | (0.88%, 0.46%, 0.86%, 0.32%, 0.63%, 0.19%) | 96.96% | 0.55% | (87.08%, 84.56%, 82.53%, 83.39%, 78.75%, 53.52%) |
Table 7. Results under different arrival rates.

| Test Case with Different λ | G (F / SL / AR) | PT (F / SL / AR) | WR (F / SL / AR) | DQN (F / SL / AR) |
|---|---|---|---|---|
| (10, 10, 10, 6, 13, 5) | 9413.03 / 43.03% / 6.92% | 4042.35 / 74.71% / 4.08% | 2998.68 / 80.21% / 3.50% | 1859.46 / 81.35% / 3.12% |
| (7, 10, 13, 8, 10, 6) | 17,180.87 / 28.89% / 9.33% | 7662.63 / 73.37% / 8.60% | 7459.27 / 75.67% / 7.46% | 5869.36 / 76.58% / 6.39% |
| (14, 6, 5, 6, 8, 15) | 16,506.76 / 35.12% / 9.06% | 11,331.92 / 60.73% / 17.89% | 10,888.00 / 61.49% / 17.24% | 7139.85 / 62.11% / 6.91% |
Table 8. Results under different service rates.

| Test Case with Different (μ_kk, μ_kj) | G (F / SL / AR) | PT (F / SL / AR) | WR (F / SL / AR) | DQN (F / SL / AR) |
|---|---|---|---|---|
| (0.25, 0.15) | 21,088.87 / 23.06% / 9.55% | 1817.51 / 78.84% / 3.29% | 1519.03 / 82.06% / 3.21% | 1069.85 / 83.92% / 2.68% |
| (0.3, 0.05) | 18,140.39 / 30.03% / 9.20% | 4.64 / 95.04% / 0.67% | 4.80 / 95.69% / 0.67% | 2.04 / 97.86% / 0.43% |
| (0.25, 0.1) | 24,926.14 / 15.01% / 10.30% | 2693.25 / 71.99% / 4.31% | 1429.77 / 80.90% / 3.07% | 853.29 / 82.47% / 2.18% |