1. Introduction
In the increasingly complex market environment for service products, traditional single-skill call centers are gradually being replaced by multi-skill call centers. A multi-skill call center is a typical parallel-server system, with multiple classes of customers and multiple types of agent groups. Matching each incoming call to an agent reasonably is a practical problem in optimizing such a service system. The complexity of multi-skill routing rules stems from the fact that customers need agents with specific skills and that different agent groups share skills.
Some rudimentary routing policies can be called static if they do not take the real-time state of the service system into account: for example, assigning the longest-waiting call to the longest-idle agent that has the right skill to serve it, or assigning customers to the fastest servers first (FSF). Another intuitive static policy is to preset priorities for the matches between customer types and agent groups; when a call arrives or an agent becomes idle, they are matched according to these priorities. Due to its simplicity, many analyses of call center queuing systems are based on static routing policies [1,2,3,4]. In particular, Tezcan and Dai proved that a cμ-type greedy routing policy is asymptotically optimal for N-systems in a many-server heavy-traffic regime [5].
When the queuing system becomes complex, static routing policies are no longer optimal. State-dependent threshold policies have been proposed, as exemplified by Ormeci [6], who considered a call center with one shared and two dedicated agent groups and proved the existence of a monotonic threshold policy for the shared station, where the thresholds depend on the number of customers in all three stations. Chan et al. reviewed common routing policies, most of which are based on routing rules found in industry, and introduced a dynamic routing policy named weight-based routing (WR) [7].
The optimization of dynamic routing is usually based on a Markov decision process (MDP), and approximate dynamic programming (ADP) is the most popular solution approach [8,9,10,11]. This line of work often considers call centers with few call types and specific structures, such as the "N-design" or a hierarchical structure, and makes simplifying assumptions such as no waiting room and infinitely patient customers. Poisson arrivals and exponential service times are assumed in most of the literature. In real call centers, however, the situation is often more complicated: there are many more call types, and the arrival and service processes follow different distributions.
With the development of deep learning, learning-based heuristic methods have shown promise in numerous research fields [12,13]. Many deep reinforcement learning (DRL) methods have emerged for sequential decision-making problems that are difficult to solve with traditional operations research methods. For instance, Tan explores a deep reinforcement learning architecture for path planning experiments with mobile robots [14], and Kool et al. apply DRL with an encoder–decoder framework with multi-head attention layers to tackle vehicle routing problems (VRPs) [15]. Fuller et al. implement reinforcement learning in a call center queuing simulation model [16]. To handle high-dimensional, continuous state spaces, Mnih et al. proposed a DRL algorithm that combines a deep neural network with Q-learning for function approximation, called the Deep Q Network (DQN) [17]. Although the DQN was proposed in 2015, it is still widely used for high-dimensional, continuous state space problems and shows excellent characteristics in global transparency and global optimization [18]. A double-DQN-based 5G network job scheduling algorithm provides higher convergence between wireless nodes in the 5G network while consuming less energy, and it outperforms the standard deep learning methods DeMADRL and BiDRL [19]. Kopacz et al. modeled cooperative–competitive social group dynamics in multi-agent environments and compared three solution methods, finding that decentralized training with the DQN outperforms both monotonic value factorization (QMIX) and multi-agent variational exploration (MAVEN) on the analyzed cooperative–competitive environments for both agent types [20]. Mukhtar et al. developed the Deep Graph Convolution Q-Network (DGCQ), combining the DQN with a graph convolutional network (GCN), to achieve a signal-free corridor; their model, trained on a synthetic traffic network and evaluated on real-world traffic networks, outperforms other state-of-the-art models [21].
In summary, the multi-skill call center is a complex queuing system. When analyzing its dynamic routing policy, it is difficult to establish an accurate model, because future customer arrivals, the completion times of services in progress, and the patience of waiting customers are all hard to predict. Traditional methods often assume that inter-arrival and service times obey particular distributions, so the simulation process is not close to reality. At the same time, traditional methods usually obtain good results only for specific call center structures, which limits their practicality. To address these problems and optimize the entire queuing system, this paper uses the DQN to build a dynamic routing policy for multi-skill call centers for the first time. The DQN does not rely on accurate modeling; it learns continually through interaction with the environment and can be applied to call centers with different structures. Compared with previous methods, the DQN-based dynamic routing policy is therefore more practical and shows better optimization ability in more complex and uncertain simulation settings.
The contribution of the article is twofold. In theory, the article studies dynamic routing policies for multi-skill call centers from the perspective of the Deep Q Network; by combining queuing theory with deep reinforcement learning, it remedies the limited applicability of traditional methods in more complex and uncertain simulation processes. In practice, this study can guide each customer and agent to make sound decisions according to the current queue state, so as to achieve a higher service level and a lower customer abandonment ratio.
The remainder of the article is organized as follows: Section 2 elaborates on the problem statement. Section 3 provides the reinforcement learning framework design. Section 4 and Section 5 discuss the simulator design and the numerical experiment results. In Section 6, we offer our conclusions.
3. Reinforcement Framework Design
A multi-skill call center is a complex queuing system, and it is difficult to construct an accurate model when analyzing its dynamic routing policy. Reinforcement learning, however, does not rely on precise modeling; instead, it learns the state transition and reward functions through interaction with the environment (e.g., a simulation). Reinforcement learning tasks are usually described as a Markov decision process, which requires a state space, an action space, and a reward function. This section elaborates on these components and finally introduces the model-free reinforcement learning algorithm, the Deep Q Network (DQN).
3.1. States
Intuitively, the call waiting times and the agent idle times of each group are of concern when making routing decisions. Chan et al. combined the two using an affine function to construct a weight-based routing policy [7]. However, they only considered the longest-waiting customer of each type and the longest-idle agent of each group. Consider an extreme case with two queues A and B, where queue A contains exactly one customer who has waited 100 min, and queue B contains 100 customers who have each waited 90 min. Under their routing policy, queue A receives a higher weight on call waiting times than queue B, which is clearly unreasonable. To remedy this defect, we define the state variables at time t as the pair (W_t, I_t), where, for each customer type i, W_t contains an n-dimensional vector (w_1, …, w_n) representing the waiting times of the first n customers of type i, and, for each agent group j, I_t contains an m-dimensional vector representing the idle times of the first m agents in group j.
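As a rough illustration of this state encoding (the function name, the zero-padding convention, and the truncation length k are our own choices, not the paper's), one could assemble the state vector as follows:

```python
def build_state(waits_by_type, idles_by_group, k):
    """Concatenate, per call type, the k longest waiting times and,
    per agent group, the k longest idle times, zero-padded to length k."""
    def top_k(values):
        vals = sorted(values, reverse=True)[:k]
        return vals + [0.0] * (k - len(vals))  # pad short lists with zeros
    state = []
    for waits in waits_by_type:   # one list of waiting times per call type
        state += top_k(waits)
    for idles in idles_by_group:  # one list of idle times per agent group
        state += top_k(idles)
    return state

# Two call types, two agent groups, k = 3 entries each -> a 12-dimensional state.
s = build_state([[100.0], [90.0, 90.0, 90.0, 85.0]], [[5.0, 2.0], []], 3)
```

With this encoding, the extreme case above no longer favors queue A: queue B contributes several large waiting times to the state instead of only its single longest one.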
3.2. Actions
When a call of type i arrives, the decision maker can either assign it to any agent group that can serve the call or keep the customer waiting. Hence, the customer's action set at time t can be defined as A_i ∪ {0}, where A_i is the set of agent groups that can serve type i and action 0 stands for keeping the customer waiting. When an agent of group j finishes a service, the decision maker can either assign the agent a new call that he can handle or keep him idle. Hence, the agent's action set at time t can be defined as B_j ∪ {0}, where B_j is the set of call types that group j can serve and action 0 stands for keeping the agent idle.
However, in order to build a single neural network for learning, we need to unify the action spaces of customers and agents. Note that, in essence, both decisions select an edge in the bipartite graph of call types and agent groups: an arriving customer of type i selects an agent of group j for service, or an idle agent of group j selects a queued customer of type i. So, the action space can be defined as the set of feasible (i, j) edges together with the waiting/idling action 0.
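To make the unified action space concrete, the sketch below enumerates the feasible (call type, agent group) edges of a made-up two-type, two-group skill graph (the X-shaped skill matrix and the function names are our own illustration):

```python
# Unified action space: one action per feasible (call type, agent group) edge
# of the bipartite skill graph, plus action 0 for "wait"/"stay idle".
skills = {  # call type -> agent groups that can serve it (made-up X-design)
    1: [1, 2],
    2: [1, 2],
}
actions = [0] + sorted((i, j) for i, groups in skills.items() for j in groups)

def feasible_actions(decision_maker, kind):
    """Mask the unified action list for one decision maker.
    kind='customer': a call of type i may pick any (i, j) edge or wait.
    kind='agent': an idle agent of group j may pick any (i, j) edge or stay idle."""
    if kind == "customer":
        return [a for a in actions if a == 0 or a[0] == decision_maker]
    return [a for a in actions if a == 0 or a[1] == decision_maker]
```

Both a waiting customer and an idle agent then choose from the same global action list, restricted by the appropriate mask, which is what allows a single network to serve both kinds of decision makers.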
3.3. Rewards
The setting of the reward function is a very important step in reinforcement learning, as it affects the convergence speed and the final performance of the algorithm. The objective is to optimize the service level, the abandonment rate, or the agent group occupancy rate of the queueing system, so the reward function must reflect these optimization goals. Tezcan and Dai considered a cost function consisting of the holding and reneging costs over a finite time interval [5]. Referring to this definition, we define the reward functions over a time period (t, t + Δt) as follows:
r1(t) = −c1 Σ_i Σ_{c ∈ Q_i(t)} min(w_c, AWT),(3)
r2(t) = r1(t) − c2 N(t, t + Δt),(4)
r3(t) = r2(t) − c3 Σ_j |o_j(t, t + Δt) − ô_j|,(5)
where Q_i(t) denotes the set of customers of type i in queue at time t, w_c denotes the waiting time of customer c, N(t, t + Δt) denotes the number of customers who have abandoned the system in the time period (t, t + Δt), o_j(t, t + Δt) denotes the occupancy rate of agent group j in the time period (t, t + Δt), and ô_j denotes the expected occupancy rate for agent group j. c1, c2, and c3 are the penalty parameters for holding, abandonment, and occupancy, respectively. We take r1 as the reward function when the optimization goal is to maximize the service level, r2 when the goal is to minimize the abandonment rate, and r3 when the occupancy rate is also to be optimized. The combination of the three reward functions is discussed in Section 5, Numerical Experiments. In particular, it can be seen from Equation (3) that when a customer's waiting time is less than the AWT, the penalty is proportional to the waiting time, and when it is greater than the AWT, the penalty is a fixed value.
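As an illustration of the penalty structure described above (the parameter values, the AWT, and the sample queue data are made up for this example; it is a sketch, not the paper's implementation):

```python
def holding_penalty(queues, awt, c1):
    """Per-customer penalty proportional to waiting time, capped at the AWT."""
    return -c1 * sum(min(w, awt) for queue in queues for w in queue)

def abandonment_penalty(n_abandoned, c2):
    """Fixed penalty per customer who abandoned during the period."""
    return -c2 * n_abandoned

def occupancy_penalty(occupancy, target, c3):
    """Penalty for each agent group's deviation from its target occupancy."""
    return -c3 * sum(abs(o, ) if False else abs(o - t) for o, t in zip(occupancy, target))

# r1 targets the service level only; r2 adds abandonment; r3 adds occupancy.
r1 = holding_penalty([[10.0, 30.0], [25.0]], awt=20.0, c1=1.0)  # waits capped at 20
r2 = r1 + abandonment_penalty(2, c2=5.0)
r3 = r2 + occupancy_penalty([0.9, 0.7], [0.85, 0.85], c3=10.0)
```

The cap at the AWT mirrors the remark above: a customer contributes a growing penalty only until the acceptable waiting time, after which the contribution is fixed.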
3.4. Deep Q Network
After the state space, action space, and reward function of the problem are given, we need to construct an appropriate algorithm to obtain the state value function V(s) or the action value function Q(s, a). These are defined through the long-term discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ ∈ [0, 1) is the discount factor. Q-learning is a classic model-free method for solving this kind of problem and has the following steps [27]:
1. Initialize an empty table mapping all states to action values.
2. Interact with the environment to obtain a tuple (s, a, r, s′) (state, action, reward, new state). In this step, we need to trade off exploration against exploitation when deciding which action to take.
3. Update the values using the Bellman approximation.
4. Check the convergence conditions; if they are not met, repeat from step 2.
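The four steps above can be sketched on a toy two-state problem (the environment and all parameter values are our own illustration, not the call center model):

```python
import random

random.seed(0)

# Toy deterministic MDP: action 1 moves to state 1, which pays reward 1;
# action 0 moves to state 0, which pays nothing.
def step(state, action):
    nxt = 1 if action == 1 else 0
    return nxt, (1.0 if nxt == 1 else 0.0)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # step 1: empty value table
alpha, gamma, eps = 0.5, 0.9, 0.2
s = 0
for _ in range(2000):
    # step 2: interact with the environment, trading exploration
    # off against exploitation via an epsilon-greedy choice
    if random.random() < eps:
        a = random.choice((0, 1))
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, r = step(s, a)
    # step 3: Bellman update toward r + gamma * max_a' Q(s', a')
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
    s = s2  # step 4: repeat until the values converge
```

After training, action 1 carries the higher value in both states, as expected for this chain.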
When the number of states and actions is very large (as in the call center dynamic routing problem), it is not possible to update every state–action pair one by one. Function approximation methods approximate the entire state value function (or action value function) with a parameterized model and update the whole function in each learning step; in this way, the value estimates of states (or state–action pairs) that have never been visited can also be updated. Linear functions and artificial neural networks are the two most common approximators. The latter combines deep learning with reinforcement learning and is called the Deep Q Network. The affine function proposed by Chan et al. is a kind of linear approximation [7], although they used a genetic algorithm to solve it; their paper notes that such linear approximations have the drawback that policies with similar near-optimal costs may have very different weight parameters. The Deep Q Network instead uses an artificial neural network Q(s, a; θ) to replace the action value function, giving it strong nonlinear expressive ability. The structure of our DQN model consists of three linear layers connected by ReLU activations. To address instability and difficulty in training, researchers adopted two improvements to deep Q-learning: experience replay and a target network [28,29]. Algorithm 1 presents the application of the improved DQN algorithm to the dynamic routing problem of multi-skill call centers.
Algorithm 1 Deep Q-learning algorithm with experience replay and a target network
Data: θ // weights of the evaluation network
Data: θ⁻ // weights of the target network
Data: Q(s, a; θ) // expected reward for action a at state s
Data: D // replay buffer
Data: T // training time for one episode
Data: M // max episodes
Data: ε // a threshold value for exploration
Data: γ // discount factor
Data: α // learning rate of the evaluation network
Data: α⁻ // learning rate of the target network
1 Initialize θ and θ⁻, empty replay buffer D, and ε
2 for episode = 1, …, M do
3  for t = 1, …, T do
4   for each customer in queue, each newly arrived customer, and each idle agent do
5    s ← current state // observe the current state
6    if the decision maker is a customer then // mark it with a Boolean variable b
7     b ← 1
8    else // the decision maker is an agent
9     b ← 0
10    x ← a random number in [0, 1] // generate a random number
11    a ← chooseAction(s, b) // select a from the action set according to the evaluation network and execute it
12    r ← reward // observe the reward
13    s′ ← new state // observe the new state
14    Store transition (s, a, r, s′) in the buffer D
15    Sample a random minibatch of transitions (s_k, a_k, r_k, s′_k) from D
16    y_k ← r_k + γ max_{a′} Q(s′_k, a′; θ⁻) // estimate of the return
17    θ ← θ − α ∇_θ (y_k − Q(s_k, a_k; θ))² // update the evaluation network
18    s ← s′ // update the current state
19    n ← n + 1 // count the updates
20    if n reaches the synchronization period then // update the target network
21     θ⁻ ← θ
22   end for
23  end for
24 end for
Function chooseAction(s, b)
Input: s // a state of the decision maker
Input: b // Boolean variable, 1 represents customer and 0 represents agent
Data: ε // current value of the exploration threshold
Data: Δε // value of the update for ε
Data: ε_min // a threshold value (lower bound for ε)
Data: A_i ∪ {0} // set of all available actions for a customer of type i
Data: B_j ∪ {0} // set of all available actions for an agent of group j
1 if b = 1 then // the decision maker is a customer
2  i ← customer type
3  if x > ε then // the choice of a is greedy
4   a ← argmax_{a ∈ A_i ∪ {0}} Q(s, a; θ)
5  else // the choice of a is random
6   a ← a random element of A_i ∪ {0}
7 else // the decision maker is an agent
8  j ← agent type
9  if x > ε then // the choice of a is greedy
10   a ← argmax_{a ∈ B_j ∪ {0}} Q(s, a; θ)
11  else // the choice of a is random
12   a ← a random element of B_j ∪ {0}
13 ε ← max(ε − Δε, ε_min) // decay the exploration threshold linearly
14 return a
In Algorithm 1, we use the target network to estimate the return as the learning target. During weight updating, only the weights of the evaluation network are updated; after a certain number of updates, the weights of the evaluation network are copied to the target network. We store each new transition in a buffer of fixed size, so that adding a new transition pushes the oldest experience out, and we then take samples from the buffer to update the value function. This is called a replay buffer.
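The replay buffer and the evaluation/target bookkeeping just described can be sketched as follows (the buffer capacity, the synchronization period, and the scalar stand-in for the network weights are our own illustrative choices):

```python
import collections
import random

random.seed(0)

buffer = collections.deque(maxlen=1000)  # fixed size: appending evicts the oldest

def store(transition):
    buffer.append(transition)  # transition = (s, a, r, s_next)

def sample(batch_size):
    return random.sample(list(buffer), batch_size)  # uniform random minibatch

# Only the evaluation weights are trained; every `sync_every` updates they
# are copied into the target weights used to compute the learning targets.
eval_w = {"w": 0.0}
target_w = {"w": 0.0}
sync_every = 100
updates = 0

def train_step(delta):
    global updates
    eval_w["w"] += delta           # gradient step on the evaluation network only
    updates += 1
    if updates % sync_every == 0:
        target_w.update(eval_w)    # hard copy into the target network

for i in range(1100):              # overfill the buffer: the first 100 drop out
    store((i, 0, 0.0, i + 1))
for _ in range(250):               # 250 updates -> last sync at update 200
    train_step(0.01)
```

Keeping the target weights frozen between synchronizations is what stabilizes the regression targets in line 16 of Algorithm 1.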
For each customer in the queue, each newly arrived customer, and each idle agent, an action is chosen through the function chooseAction. This function follows the epsilon-greedy method: the customer or the agent chooses the action with the highest Q-value with probability 1 − ε and otherwise selects a random action. As discussed above, the action space is the set of feasible (i, j) edges plus the waiting/idling action 0, and Q(s, a; θ) gives the Q-value of any state s and action a. For a customer of type i, the set of actions he can choose is the subset A_i ∪ {0} of the action space, so the greedy choice of action is a = argmax_{a ∈ A_i ∪ {0}} Q(s, a; θ). The same can be deduced for an agent of group j. The exploration rate ε decreases linearly according to a decay parameter Δε.
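A sketch of this masked epsilon-greedy rule with a linear decay schedule (the schedule parameters and the toy Q-table are our own illustration):

```python
import random

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10000):
    """Linearly anneal the exploration rate from eps_start down to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def choose_action(q_of, feasible, eps, rng=random):
    """Epsilon-greedy choice restricted to this decision maker's feasible actions."""
    if rng.random() < eps:
        return rng.choice(feasible)   # explore: any feasible action
    return max(feasible, key=q_of)    # exploit: best feasible Q-value

# q_of can be any callable scoring an action; here a made-up Q-table for a
# customer of type 1, whose feasible set is {wait} plus two agent groups.
q = {0: -1.0, (1, 1): 0.4, (1, 2): 0.9}
a = choose_action(q.get, [0, (1, 1), (1, 2)], eps=0.0)  # purely greedy when eps = 0
```

Restricting the argmax to the feasible subset is what lets one network score the full unified action space while each decision maker only ever executes legal actions.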
5. Numerical Experiments
We first experimented with the dynamic routing policy in classic small-scale call centers, namely the X-design and W-design call centers; their structures are shown in Figure 1 and Figure 2. For a large-scale call center, we then considered the commonly used "long chain" structure and the "single pooling" structure, shown in Figure 3 and Figure 4. In Figure 1, Figure 2, Figure 3 and Figure 4, the rectangles represent customer types and the circles represent agent groups.
For each call center structure, we set a group of corresponding parameters to simulate the real situation; all parameter settings used in the simulation are shown in Table 1. Most of our parameter settings follow Chan et al.'s study [7]. The parameters of the DQN training model were set according to experience and the results of multiple simulation runs.
In the X-design model, we take Poisson arrivals, with the rates given in Table 1. In the W-design model, we consider the arrival process of each call type to be Poisson–Gamma: a stationary Poisson arrival process for each day whose random rate (for the entire day) has a gamma distribution, with the means and standard deviations given in Table 1. Long chain is an important structure used to improve the flexibility of systems such as supply chains and multi-skill call centers [30]; in a long chain, each customer type i can be served by two agent groups, i and (i mod m) + 1, where m is the number of agent groups. Single pooling is another flexible architecture, proposed by Legros et al. [4], who showed that single pooling performs better than chaining in various cases of asymmetry; in single pooling, each customer type can be served by its dedicated agent group and one shared agent group.
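The two large-scale skill structures can be generated programmatically. The sketch below builds both as maps from call types to the agent groups that can serve them; the exact form of single pooling (a dedicated group per type plus one fully shared group, labelled 0 here) is our reading of the cited design, not a quotation of it:

```python
def long_chain(n):
    """Long chain: call type i is served by groups i and (i mod n) + 1."""
    return {i: sorted({i, (i % n) + 1}) for i in range(1, n + 1)}

def single_pooling(n):
    """Single pooling (assumed form): each type keeps its dedicated group
    and additionally shares one pooled group, labelled 0, serving all types."""
    return {i: [0, i] for i in range(1, n + 1)}

chain = long_chain(4)      # type 4 wraps around to group 1
pool = single_pooling(4)   # every type can also reach the shared group 0
```

Either map can feed the bipartite action-space construction of Section 3.2, which is what makes the DQN policy applicable across structures.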
When using the DQN policy for simulation, we refer to Chan et al.'s research [7] and construct three different reward functions based on Equations (3)–(5), as shown in Equations (6)–(8). Based on the simulation results, the optimal form of the reward function is selected for each of the four call center structures. At the same time, in order to evaluate the final results, we construct an evaluation function, shown in Equation (9). The first reward function, R1, considers only the service level; R2 considers both the service level and the abandonment rate; and R3 considers the service level, the abandonment rate, and the agent group occupancy. The evaluation function E measures the overall balance of the results: the lower the value of E, the more balanced the results. SL_i and AB_i denote the service level and abandonment rate of customer type i throughout the simulation, and o_j denotes the occupancy rate of agent group j throughout the simulation. SL* represents the expected service level for each customer type, and ô represents the expected occupancy rate for each agent group. Under the three reward functions, the evaluation results are denoted E1, E2, and E3, respectively.
In our training model, the max episodes are 20 for the X-design and W-design models and 50 for the long chain and single pooling models, and the state dimension parameters are n = m = 10. Therefore, the state space has 40 dimensions in the X-design model, 60 dimensions in the W-design model, and 120 dimensions in the long chain and single pooling models. The structure of the DQN model includes three linear layers connected by ReLU activations. All experiments were performed on a 2.60-GHz Intel(R) Core(TM) i7-10750H CPU made by HP, sourced from Shanghai, China.
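As described above, the Q-network consists of three linear layers joined by ReLU activations. A minimal pure-Python forward pass over a 40-dimensional X-design state (the hidden width of 64, the 5-action output, and the random initialization are our own illustrative choices):

```python
import random

random.seed(1)

def linear(x, w, b):
    """Dense layer y = Wx + b with weights stored as nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init(n_out, n_in):
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

w1, b1 = init(64, 40)  # layer 1: 40-dim state -> 64 hidden units
w2, b2 = init(64, 64)  # layer 2: 64 -> 64
w3, b3 = init(5, 64)   # layer 3: 64 -> one Q-value per action

def q_values(state):
    h = relu(linear(state, w1, b1))
    h = relu(linear(h, w2, b2))
    return linear(h, w3, b3)  # no activation on the output layer

q = q_values([0.5] * 40)  # Q-value estimates for 5 candidate actions
```

In practice a deep learning framework would replace these hand-rolled layers, but the data flow, state vector in, one Q-value per unified action out, is the same.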
Under the four call center structures, the results obtained with the DQN policy under the three reward functions are shown in Table 2. SL and AB represent the final overall service level and the final overall abandonment rate, calculated using Equations (1) and (2), respectively. According to the results, under the X-design and long chain structures, when the third reward function (the one that also penalizes abandonment and occupancy deviation) is selected, the service level is the highest, the abandonment rate is the lowest, and the result evaluation is the best. Under the single pooling structure, the third reward function achieves the lowest abandonment rate and the best result evaluation, but only the second-highest service level. Under the W-design structure, the three reward functions perform the same. Overall, we believe the third reward function is the best form, and it is used throughout the rest of the article.
To highlight the advantages of the DQN policy, we compare it with three other classic and influential routing policies: the static policy Global FCFS (G) and the dynamic policies Priorities with Idle Agent Thresholds (PT) and Weight-Based Routing (WR). The detailed definitions of these policies and their routing optimization can be found in Chan et al. (2014) [7].
5.1. Experiments with the X-Design
Based on the parameter settings of the X-design instance mentioned above, the results obtained using the four policies, including the DQN, are shown in Table 3.
It can be seen that, compared with the other policies, the DQN leads in terms of the overall service level, abandonment rate, and agent group occupancy rate. However, the value of its evaluation function is worse, which means that the results of the DQN are less balanced than those of the other policies.
At this point, we can adjust the penalty factor of each customer type to achieve a specific service level or abandonment rate goal for that type. For example, in the X-design model, if we set the holding penalties of the two customer types to be the same, then, when the performance measure is the service level, the service levels of the two types are as shown in Figure 5: the service level of the first type is 88.11%, while that of the second type is 67.72%. This is because the arrival rate of the first customer type is 10 times that of the second, so when the holding penalties are equal, the system appropriately sacrifices the service level of the second type to satisfy the majority of arrivals, which belong to the first type. If we instead set the holding penalty of the second customer type to 10 times that of the first and keep the other parameters unchanged, the result is as shown in Figure 6: the service levels of the two types become close, 78.70% for the first type and 78.30% for the second, but the overall service level decreases.
In fact, we can allocate the holding penalties flexibly to meet various demands, whether we want the highest overall service level or relatively balanced individual service levels. The other policies have no reward function like the DQN's, but their objectives can likewise be adjusted according to the actual demand. In this paper, since the DQN's reward function is set as a weighted sum of the service level, abandonment rate, and occupancy rate, the objectives of the other policies are also a comprehensive combination of these three indexes.
Figure 7 shows the loss curve of the DQN algorithm when the optimization goal is to maximize the service level in the X-design structure. The horizontal axis is the number of training steps, and the vertical axis is the loss; one episode contains 36,000 steps. It can be seen that 20 episodes are sufficient to ensure the convergence of the algorithm.
Figure 8 shows how the average reward changes with the number of training steps.
5.2. Experiments with Other Call Center Structures
Based on the parameter settings of the W-design, single pooling, and long chain instances mentioned above, the results obtained using the four policies are shown in Table 4, Table 5 and Table 6.
Based on Table 3, Table 4, Table 5 and Table 6, it can be found that in small-scale call center structures such as the X-design and W-design, the DQN is superior to the other policies in terms of the overall service level, abandonment rate, and agent group occupancy rate, although its results are less balanced than those of the other policies. For large-scale structures such as single pooling and long chain, the advantages of the DQN are more prominent: in addition to the service level, abandonment rate, and occupancy rate, its results are also more balanced than those of the other policies. Overall, the DQN policy shows superiority under all these different call center structures, so its applicability is very strong.
Under the three structures, after adopting the DQN policy, the evolution of the service level of each customer type is shown in Figure 9, Figure 10 and Figure 11.
From these figures, we can see that under the different call center structures, the service level of each customer type fluctuates greatly at the beginning and gradually stabilizes. In the instances we set, the service levels of all customer types in the long chain structure finally stabilize above 85%, while in the W-design structure they stabilize between 50% and 85%. In the single pooling structure, there are significant differences among the customer types, with a gap of more than 0.5 between the highest and lowest values.
In addition, the routing optimization based on the genetic algorithm (WR) takes about 30 min for the X-design and W-design call centers and about 3 h for the long chain and single pooling models, even though, for the latter two, the fitness values are computed in parallel on 6 cores. In contrast, the training time of the DQN-based algorithm on a single thread is only about 10 min for the X-design and W-design models and about 45 min for the long chain and single pooling models.
5.3. Robustness of the DQN Policy
In order to verify the robustness of the proposed dynamic routing policy, we consider different distributions of the arrival and service rates, different parameters within those distributions, changes in the sizes of the agent groups, and other factors that may affect the performance of the whole system. Here, we take the long chain structure as an example and verify the robustness of the DQN by adjusting the arrival rate and service rate; the results are shown in Table 7 and Table 8.
It can be seen that, under different arrival rates and service rates in the long chain structure, the DQN remains a comparatively better routing policy.