1. Introduction
Packet classification techniques play a significant role in providing advanced network services, such as packet filtering, quality of service (QoS), security monitoring, and virtual private networks [1]. The packet classifier runs on the Internet router to classify received packets into different flows based on predefined rules called packet filters. Different network services may use different header fields to classify packets. For example, in IPv4, five header fields are typically used in packet filters: the source/destination Internet protocol (IP) address, the source/destination port, and the protocol type.
A filter, F, with d fields is called a d-dimensional filter, expressed as F = (f[1], f[2], ..., f[d]), in which the content of the ith field f[i] can be a variable-length prefix, an exact value, a range, or a wildcard, indicating that all values are valid for that field. For packet
P and filter F, if all selected packet header fields correspond to the values in their associated fields in F, we say that packet P matches filter F. For example, given a two-dimensional (2D) filter F = (140.113.*.*, *), any packet whose source IP address (SA) matches the prefix 140.113.xxx.xxx, with any destination IP address (DA), matches F. Therefore, packet p1 = (140.113.1.1, 8.8.8.8) matches F, but p2 = (140.114.1.1, 8.8.8.8) does not. Each filter has an associated action specifying how to treat the packets that match it. When a packet matches multiple filters with different actions, a filter conflict occurs, resulting in ambiguity in packet classification. For example,
Table 1 presents a 2D filter database for firewall applications. Assuming that the IP address length is four bits, incoming packet p1 = (0001, 1000) will match filter a and be accepted. Another incoming packet, p2 = (0001, 0000), matches filters a and b. Because a and b specify different actions, the classifier cannot decide whether the packet should be accepted or rejected, resulting in a conflict. The conflicting actions of a and b may cause security vulnerabilities, QoS failures, or routing errors [2,3], depending on the application. Three possible solutions can be applied to the conflict problem [3].
1. Select the first matching filter in the filter database.
2. Assign each filter a priority and select the matching filter with the highest priority.
3. Assign each field a priority and select the matching filter with the most specific match in the highest-priority field.
However, none of the above methods can fully solve the conflict problem.
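The field-matching rule used throughout (a prefix, an exact value, or a wildcard) can be made concrete with a short C++ sketch; the type and function names here are our own illustration, not code from this work. A field with prefix length 0 acts as a wildcard, and a packet matches a filter only if every field matches.

```cpp
#include <cassert>
#include <cstdint>

// A 2D filter field: a value/prefix-length pair. prefixLen == 0 acts as a
// wildcard (*); prefixLen == 32 requires an exact IPv4 address match.
struct PrefixField {
    uint32_t value;     // field value, e.g. an IPv4 address as a 32-bit integer
    int      prefixLen; // number of significant leading bits (0..32)
};

struct Filter2D {
    PrefixField sa; // source address (SA) field
    PrefixField da; // destination address (DA) field
};

// True when `addr` falls inside the prefix `f` (wildcards match everything).
bool fieldMatches(uint32_t addr, const PrefixField& f) {
    if (f.prefixLen == 0) return true;                  // wildcard field
    uint32_t mask = ~uint32_t(0) << (32 - f.prefixLen); // e.g. /16 -> 0xFFFF0000
    return (addr & mask) == (f.value & mask);
}

// A packet matches a filter when every selected header field matches.
bool packetMatches(uint32_t srcIP, uint32_t dstIP, const Filter2D& flt) {
    return fieldMatches(srcIP, flt.sa) && fieldMatches(dstIP, flt.da);
}

// Helper to build an IPv4 address from dotted-quad octets.
uint32_t ip(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}
```

With the filter F = (140.113.*.*, *) from the example above, p1 = (140.113.1.1, 8.8.8.8) matches and p2 = (140.114.1.1, 8.8.8.8) does not.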
Figure 1 depicts a 2D representation based on the contents of Table 1, where the overlapped areas indicate conflicts between filters. Let a ≻ b indicate that when a packet matches filters a and b simultaneously, the action associated with filter a is selected; in other words, filter a has a higher priority than filter b. With a ≻ b, if we further set b ≻ c, c ≻ d, and d ≻ a, transitivity yields a ≻ a, which is a contradiction; no conflict-free priority sequence exists. Therefore, resolve filters [3] have been developed to solve the filter-conflict problem. The idea is to add a new filter covering the overlapped area of two conflicting filters and assign it a higher priority to resolve the conflict. In Figure 1, resolve filter e is generated for the overlapped area of filters a and b. Similarly, the other three overlapped areas require associated resolve filters. For every resolve filter generated, it is necessary to ensure that it does not conflict with the other filters. In addition, some applications must update packet filters frequently [4]. For example, within one millisecond, several access-control or QoS filters may be updated, or several thousand filters may be changed, due to the dynamic creation or deletion of virtual routers [5]. Therefore, conflict detection must be executed for every newly added or updated filter to prevent conflicts and ensure the correctness of packet classification. Consequently, the efficiency of conflict detection influences network performance. It has been demonstrated that determining the minimum number of resolve filters in a filter database is an NP-hard problem [3,6].
Diversified network services will continue to appear as the Internet develops, and transmission speeds and data volumes will keep increasing. Accordingly, the number of filters in router filter databases will grow significantly. In addition, due to the deployment of IPv6 and the development of software-defined networks [7], filters must match more header content, and the number of fields examined in conflict detection is much greater than in the traditional network architecture. These new network protocols and services may increase conflict-detection complexity. Currently, most conflict-detection algorithms can only handle 2D conflict detection, whereas practical operation requires additional dimensions beyond the source/destination IP address. Although some 2D conflict-detection algorithms can be extended to more dimensions, they require a large memory space. Therefore, designing an efficient conflict-detection algorithm remains a challenge.
In packet classification applications, the classification process for each packet is independent. Following the development trend of CPUs, researchers have investigated packet classification on multicore CPUs [8,9,10,11,12]. However, from the viewpoint of parallelism, the number of cores is not sufficiently high, and the scale of performance improvement is limited. Compared with a CPU, a general-purpose GPU has a large number of cores and offers superior parallel computing capability: workloads with reduced data and control dependence are well suited to parallelism on multicore and many-core systems. Many researchers have therefore used GPUs to solve computation-intensive problems in related domains [13,14,15,16,17,18,19,20,21,22,23,24,25]. Similarly, each filter is independent during conflict detection, and matching field contents does not require complex computation. Thus, conflict detection can be executed in parallel to improve performance significantly.
In this study, we propose two parallel algorithms that solve five-dimensional (5D) filter conflict problems. First, we formally define the conflict-detection problem. To develop efficient conflict-detection algorithms, we divide a 5D filter into two parts based on the field format and derive the conditions that two filters must satisfy if they conflict. Based on these conditions, we develop a filtering mechanism that reduces the number of comparisons each filter requires during conflict detection. With this filtering mechanism, however, each filter requires a different number of comparisons, which causes workload imbalance among GPU threads and inefficient parallel processing. Therefore, we also propose a workload-balancing scheme to further improve parallel computing performance.
The remainder of this paper is organized as follows.
Section 2 reviews existing conflict-detection algorithms and briefly introduces the GPU architecture. In
Section 3, we propose a simple 5D conflict-detection algorithm. In
Section 4, we present our proposed general parallel conflict-detection algorithm (the GPCDA) and enhanced parallel conflict-detection algorithm (the EPCDA).
Section 5 describes the experimental setup, results, and analyses. Finally,
Section 6 concludes this study.
4. Conflict-Detection Algorithms Using GPU
In this section, we propose two new 5D conflict-detection algorithms implemented on GPUs. First, we introduce the implementation of the detection algorithm on a GPU and describe its operational process. To avoid duplicating the detection process, the number of matches required by each filter varies, so the workload of each filter differs. The main design concept of the EPCDA is the proper assignment of filters to threads to balance the workload of each thread and improve parallel computing performance.
4.1. General Parallel Conflict-Detection Algorithm
During conflict detection, the matching process for each filter is independent. Following the practical execution example in Section 3.2, at the programming stage we can assign filters to threads for parallel execution.
Figure 4 illustrates the proposed GPCDA architecture and its operating process. In Step 1, we copy the filter database from the host to the device. In Step 2, we sequentially assign the filters stored in the filter database to pre-allocated threads. Since the memory space required by the filter database is not large, we can store it in the unified L1/texture cache (a read-only cache) to minimize memory access latency. In Step 3, each thread executes the detection algorithm independently and stores its conflict-detection results in pre-allocated shared memory. In Step 4, each block collects the detection results reported by its threads and stores them in device memory. In Step 5, all detection results are copied from the device back to the host. We define the total execution time for conflict detection as the total time required to execute Steps 1 to 5.
Suppose that i threads per block and j blocks are allocated; then i × j = K threads are available for parallel computing. If n filters must execute conflict detection, each thread executes ⌈n/K⌉ filter conflict detections. In the GPCDA, the filter dispatch order follows the successive locations in the filter database, so threads can be assigned sequentially. Assuming that the n filters are stored in a filter database and denoted as list T = {F0, F1, ..., Fn−1}, the first K filters, F0–FK−1, are initially assigned to threads 0 to K − 1, and the next K filters, FK–F2K−1, are again assigned to threads 0 to K − 1. Thus, each assignment round of K filters starts at thread 0 and ends at thread K − 1.
Figure 5 illustrates how the filters in T are dispatched to K threads using the GPCDA. Because each filter must be matched against all other filters in T, each thread executes the detection algorithm twice when n = 2K, and the total number of matches per thread is 2(2K − 1).
We observed that the GPCDA might produce duplicated detection results when the above filters perform conflict detection. Assuming that filters f and g conflict, both f and g will detect each other during conflict detection [33]. Therefore, we changed the matching policy: filter Fs in T only needs to be compared with the filters before it, that is, filters F0–Fs−1. This prevents the duplicated-detection problem and reduces the number of comparisons required by each filter, which in turn reduces the time the GPCDA needs to execute conflict detection. However, it creates a new problem. If we use the total number of filter comparisons each thread executes to represent the workload of that thread, the workload of each thread becomes unbalanced (Figure 5). For example, when n = 2K, thread 0 needs to execute K comparisons, whereas thread K − 1 needs to perform 3K − 2 comparisons. When the workloads between threads are unbalanced, threads with a smaller workload may finish their tasks earlier and become idle. Moreover, threads in different blocks cannot access the same shared memory and therefore cannot assist each other. Thus, the final detection results cannot be returned to the host until the thread that finishes the last comparison reports its result. If the workloads among threads are balanced, thread idling and the total execution time of conflict detection can be reduced. The even distribution of the workload to threads is a key factor for improving the performance of parallelism.
4.2. Workload Scheduling Problem
To achieve optimized performance, we defined a workload scheduling problem regarding how to assign filters to threads evenly according to their workload.
Description of the workload scheduling problem: Assume that we perform conflict detection for n filters whose workloads are w0, w1, ..., wn−1, and that these filters are assigned to m threads with equal computing capability. The problem is to schedule the workloads so as to minimize the execution time for conflict detection.
We can convert the above problem into the well-known deterministic scheduling problem as follows: n independent tasks J1, J2, ..., Jn, which require execution times t1, t2, ..., tn, respectively, are assigned to m processors with equal computing capability, and tasks are not interrupted during their execution. The goal is to obtain a schedule with the shortest finish time. In our workload scheduling, the conflict detection of each of the n filters is treated as an independent task; once a filter starts conflict detection, it cannot be interrupted or reassigned to another thread, and the m threads with equal capability correspond to the m processors with identical capability. Unfortunately, the deterministic scheduling problem has been proven to be NP-complete [37,38], indicating that an optimum workload schedule cannot be found within a limited time. Therefore, we sought a near-optimum workload scheduling algorithm. The longest processing time (LPT) algorithm [39] has been analyzed and proven to be a near-optimum algorithm close to the optimum result.
The concept of the LPT algorithm is to arrange the tasks in order of decreasing execution time, after which its dispatch rule assigns tasks to the processors. The LPT algorithm guarantees that its makespan does not exceed 4/3 − 1/(3m) times the optimum; that is, in the worst case, the LPT algorithm performs within a factor of about 1.33 of the optimum. Therefore, we propose an EPCDA based on the LPT concept. In the EPCDA, a filter list is assigned to each thread to achieve workload balancing.
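The LPT idea can be sketched in a few lines of C++ (a generic list-scheduling routine, not the EPCDA itself): sort the tasks by decreasing execution time, then always hand the next task to the currently least-loaded processor.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Longest Processing Time (LPT) list scheduling: sort tasks by descending
// execution time, then repeatedly assign the next task to the least-loaded
// processor. The resulting makespan is at most (4/3 - 1/(3m)) times optimal.
long lptMakespan(std::vector<long> times, int m) {
    std::sort(times.begin(), times.end(), std::greater<long>());
    // min-heap of current processor loads
    std::priority_queue<long, std::vector<long>, std::greater<long>> loads;
    for (int i = 0; i < m; ++i) loads.push(0);
    for (long t : times) {
        long least = loads.top(); loads.pop();
        loads.push(least + t); // give the task to the least-loaded processor
    }
    long makespan = 0; // the maximum load is popped last from the min-heap
    while (!loads.empty()) { makespan = loads.top(); loads.pop(); }
    return makespan;
}
```

For the task times {3, 3, 2, 2, 2} on two processors, LPT yields a makespan of 7 versus the optimum of 6, exactly the 4/3 − 1/(3m) worst case for m = 2.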
4.3. Enhanced Parallel Conflict-Detection Algorithm
When a filter executes conflict detection, its number of comparisons represents the workload of that filter; therefore, we prearranged the order of each filter based on the number of comparisons required for that filter. In the GPCDA, each filter Fs in list T must be compared with all preceding filters in list T. We changed this comparison policy in the EPCDA: each filter Fs only needs to be compared with the filters located behind it. For example, filter F0 is compared with the n − 1 filters behind it, and F1 with the n − 2 filters behind it. Consequently, the number of comparisons required for each filter decreases with its sequential order in the filter list. In this way, the conflict-detection process is not duplicated, and the concept is the same as that of the LPT algorithm, in which tasks are executed in sorted order of execution time. Furthermore, we presorted the filters in T based on the SA prefix length. When the EPCDA performs conflict detection, each comparison involves a large number of logical comparisons. The presorted order ensures that the SA prefix length of each filter being compared does not exceed its own, which simplifies the logical comparison process and reduces the number of memory accesses. We construct T′ as the list obtained after sorting T.
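The presorting step can be sketched as a simple descending sort on the SA prefix length; the Filter record below is a hypothetical stand-in for the 5D filter structure, kept to the one field that matters here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical filter record: only the SA prefix length matters for the sort.
struct Filter {
    int id;
    int saPrefixLen; // source-address prefix length (0..32)
};

// Build T' by sorting T in descending order of SA prefix length, so that
// every filter located behind Fs has an SA prefix no longer than Fs's own.
std::vector<Filter> presortBySaPrefix(std::vector<Filter> t) {
    std::stable_sort(t.begin(), t.end(),
        [](const Filter& a, const Filter& b) {
            return a.saPrefixLen > b.saPrefixLen;
        });
    return t;
}
```

A stable sort keeps the original database order among filters with equal prefix lengths, which keeps the dispatch deterministic.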
In the EPCDA, the filter dispatch order is the same as that of the method proposed in [39].
Figure 6 illustrates the EPCDA dispatch method. Initially, K filters F′0–F′K−1 are dispatched to threads 0 through K − 1. Because the filters in T′ have been sorted based on the workload, the dispatch order of the next K filters is the opposite of that of the former K filters, to balance the workload of each thread. Thus, F′K–F′2K−1 are dispatched sequentially from thread K − 1 through thread 0. Whenever the EPCDA dispatches 2K filters, the former K filters are dispatched in the order from thread 0 to thread K − 1, whereas the latter K filters are dispatched from thread K − 1 to thread 0. Such a dispatch order ensures that for every 2K filters, the workload of each thread is nearly balanced. Even when the number of dispatched filters is less than 2K, the difference in the workloads of the threads remains minimal. Algorithm 1 shows the algorithm each thread uses to process filter conflict detection in the EPCDA, and Table 3 lists the notations used in Algorithm 1.
Algorithm 1: Parallel function of EPCDA
Input: filter database filter[]
Output: conflict results
1  threadID = blockIdx.x * blockDim.x + threadIdx.x;
2  threadSize = blockDim.x * gridDim.x;
3  base = threadSize * 2;
4  start = (base − 1) − threadID;
5  i = threadID;            // dispatch direction: forward
6  do
7      for j = i + 1 to filter.size() − 1 do
8          detection(filter[j], filter[i]);
9      end
10     i += base;
11 while i < filter.size();
12 i = start;               // dispatch direction: backward
13 do
14     for j = i + 1 to filter.size() − 1 do
15         detection(filter[j], filter[i]);
16     end
17     i += base;
18 while i < filter.size();
From the dispatch in the EPCDA, we observed that the numbers of comparisons required by each thread for performing conflict detection were very close. When n equals 2K, each thread must execute 2K − 1 comparisons. The more balanced the workload, the smaller the maximum number of total comparisons performed by the threads. In other words, the total time required to execute the detection algorithm is reduced. This finding demonstrates that the EPCDA can improve parallelism performance.
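The balance claim (each thread performs 2K − 1 comparisons in total when n = 2K) can likewise be verified with a short C++ sketch of the zigzag dispatch; the names are our own illustration.

```cpp
#include <cassert>
#include <vector>

// EPCDA dispatch: within every batch of 2K filters, the first K go to
// threads 0..K-1 and the next K go back from thread K-1 down to thread 0.
// The filter at sorted position s is compared with the n-1-s filters behind
// it. Return the per-thread comparison totals.
std::vector<long> epcdaWorkloads(int n, int K) {
    std::vector<long> load(K, 0);
    for (int s = 0; s < n; ++s) {
        int pos = s % (2 * K);
        int thread = (pos < K) ? pos : (2 * K - 1 - pos); // zigzag order
        load[thread] += (n - 1 - s); // cost of the filter at position s
    }
    return load;
}
```

For n = 2K the thread at index t receives positions t and 2K − 1 − t, whose costs sum to exactly 2K − 1 for every t.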
5. Results and Discussion
In this section, we evaluate the execution performance of conflict detection using our proposed GPCDA and EPCDA and a single CPU (Host). The filter databases required for the experiments were obtained from ClassBench [40], which provides 12 parameter files derived from three types of practical applications: access control lists, firewalls, and IP chains. For each parameter file, we generated six filter databases with sizes of 5K, 10K, 15K, 20K, 30K, and 100K. The performance indicator is the average time, in microseconds, required for each filter to execute conflict detection. (nThread, nBlock) denotes the allocation of nThread × nBlock threads for parallel execution. The algorithms for the simulation experiments were implemented in C++, and the GPCDA and the EPCDA additionally used the CUDA toolkit version 7.5 for the parallel programs. The test environment was an Intel Core i5-4570 3.2 GHz PC with 12 GB of memory. The GPU was an NVIDIA GeForce GTX 970 (compute capability 5.2) [41], which provides 13 SMs for parallel computing; each SM comprises 128 SPs, 96 KB of shared memory, and a 48 KB L1 cache, and the card has 4 GB of device memory.
5.1. Comparing Speed Performance
In this subsection, we present a comparison of the speed performance of the Host, the GPCDA, and the EPCDA in different filter databases. The GPCDA and the EPCDA used two different total numbers of thread allocations, (256, 4) and (512, 8), for parallel computing.
Figure 7 shows the time required for each filter to execute conflict detection in the 12 databases when the database size is 30K, where the time for the GPCDA and the EPCDA is the total execution time defined in Section 4.1. With the acceleration of parallel computing on the GPU, the GPCDA and the EPCDA outperform the Host in conflict-detection speed. Under the (256, 4) allocation, the GPCDA and the EPCDA are faster than the Host alone by 1.3 to 4.5 and 3.9 to 9.3 times, respectively. Under the (512, 8) allocation, more threads are available for parallel computing, so the speed improves significantly: the GPCDA and the EPCDA are 4.2 to 12.4 and 10.9 to 32.4 times faster, respectively, than the Host alone. Because the dispatch method in the EPCDA gives every thread a balanced workload, it improves parallel computing performance, and with the same number of threads the EPCDA outperforms the GPCDA. The average time required for conflict detection under the EPCDA (256, 4) allocation is almost the same as under the GPCDA (512, 8) allocation, which demonstrates that workload balancing affects performance significantly.
Intuitively, when the number of allocated threads is doubled, performance should also double. However, the experimental results (Figure 7) show that performance did not scale as expected: in the experiment, the number of allocated threads was quadrupled, but performance improved by only 2 to 3 times. This is because when more threads execute parallel computations, the degree of resource contention increases, degrading performance.
Table 4 lists the average time required for each filter to execute conflict detection using algorithms in different databases when the filter database size is 100 K. Here, the GPCDA and the EPCDA outperformed the Host alone. In the EPCDA, the maximum time for a filter to perform conflict detection was lower than 22 µs, suggesting that the EPCDA still maintained good performance for a large database. In addition, the EPCDA can operate in applications with frequent filter updates; in such cases, fast conflict detection is required to achieve high throughputs.
5.2. Comparison of Performance for Different Workload Dispatch Methods
In this subsection, we evaluate the effect of different workload dispatch methods on performance. We added two workload dispatch methods for comparison with the GPCDA and the EPCDA. The first comparison uses the worst workload dispatch method. Suppose that K threads execute the computation; we divide the n filters into K equal parts and dispatch the filters of each part sequentially to one thread. For example, thread 0 performs conflict detection for filters 0 to n/K − 1, thread 1 for filters n/K to 2n/K − 1, and so on, until thread K − 1 performs conflict detection for filters (K − 1)n/K to n − 1. With this worst dispatch method, there is a large difference in the total number of comparisons between threads 0 and K − 1, leading to the most significant workload imbalance. The second comparison method uses the dispatch method of the shortest processing time (SPT) algorithm in the EPCDA (denoted EPCDAR). The dispatch method of the SPT algorithm is the opposite of that of the LPT algorithm: each filter Fs is compared with filters F0–Fs−1.
Here, we examine the difference in performing conflict detection in the EPCDA using the LPT and SPT algorithms. First, if the number of filters is a multiple of K, each thread requires the same total number of comparisons regardless of whether the LPT or the SPT algorithm is used. However, if the number of filters is not a multiple of K, a difference exists. This can be explained using Figure 8. When four dispatched threads performed conflict detection of filters F0–F8, the filters dispatched by the LPT algorithm had a significantly more balanced workload than those dispatched by the SPT algorithm. Under the LPT dispatch, the total number of comparisons for each thread was nine. For example, F0 and F7 required eight and one comparisons, respectively, in thread 0 (F8 did not need an additional comparison), and F2 and F5 required six and three comparisons, respectively, in thread 2. However, when filters were dispatched by the SPT algorithm, the workload of thread 0 reached 15 (0, 7, and 8 comparisons), higher than that of the other threads. Workload imbalance increased the total execution time for conflict detection.
Second, consider the number of memory accesses. Under the LPT dispatch, while processing its first conflict detection, each thread writes the compared filters into the cache after the initial cache misses. Assuming that the cache is large enough to store all compared filters, subsequent conflict-detection processes can proceed immediately because the filters to be compared are already in the cache. For example, during the conflict detection of F0, thread 0 must load filters F1–F8 into its local cache because of cache misses; consequently, no additional memory accesses are needed while processing F7 and F8. In contrast, under the SPT algorithm, most comparisons still incur cache misses after the first conflict detection has been processed, so the thread must wait until the filter to be compared is written to the cache, resulting in a significant number of memory accesses and an increase in the total time required for conflict detection. Under a highly unbalanced workload, parallel performance was significantly reduced (Figure 9), and the average execution time for conflict detection per filter exceeded that of the GPCDA. Although the EPCDAR performed better than the GPCDA, the EPCDA, which dispatches tasks using the LPT algorithm, performed best.
5.3. Analysis of Parallelism Efficiency
In this subsection, we discuss the efficiency of parallelism for a fixed total number of threads using different (nThread, nBlock) allocations for parallel computing. Here, we compare the efficiency of the (256, 8) and (512, 4) allocations.
Table 5 and Table 6 list the average times required for a filter to execute conflict detection using the Host, the GPCDA, and the EPCDA, for the databases with the lowest and highest numbers of detected conflicts among the three filter database types. The average times for the GPCDA and the EPCDA were several times lower than for the Host alone, under both the smallest and the largest numbers of conflicts.
Regardless of whether the GPCDA or the EPCDA was used, the time required for each filter to execute conflict detection under the (256, 8) allocation was always shorter than under the (512, 4) allocation. This indicates that, for the same total number of threads, the more threads per block, the worse the parallel performance. Analysis with the NVIDIA Visual Profiler [42] showed that the two major factors affecting parallelism efficiency are “stalled for memory dependency” and “stalled for synchronization”. “Memory dependency” stalls are mainly attributed to the data dependency of two consecutive instructions, whereas “stalled for synchronization” is mainly attributed to the __syncthreads() instruction in CUDA. Recall that the hardware executes the threads of a block in warps, and threads in the same block access data in the same shared memory. When threads in the same warp execute, CUDA invokes the __syncthreads() instruction to ensure data consistency during the computation. If some threads of a warp complete their tasks earlier, they must wait for the other threads to complete theirs before the batch of threads proceeds to the next dispatched task. Therefore, when a block is dispatched with a larger number of threads, “stalled for synchronization” occurs more easily and degrades parallel efficiency.
5.4. Limitations of Parallel Computation
Based on the results discussed thus far, when the number of allocated threads and blocks increases, performance improves, so the GPCDA and the EPCDA can handle a large filter database. In this subsection, we discuss the limitations of parallel computation for a large filter database in the GPCDA and the EPCDA. When the number of filters was large, even increasing the number of allocated threads several times did not increase performance accordingly (Figure 10); the reason has been discussed above. We observed that with 13,312 threads under the (1024, 13) allocation, the GPCDA and the EPCDA exhibited the best performance, whereas the performance with 16,384 threads under the (1024, 16) allocation decreased. The GTX 970 supports at most 13 SMs for parallel computing, and each block allows at most 1024 threads; if more blocks are allocated, some SMs must execute at least two blocks, and the execution time increases due to hardware scheduling. With the 8192-thread (1024, 8) allocation, the EPCDA performed almost the same as with the 13,312-thread (1024, 13) allocation. Thus, when the workload approaches a balance, the same performance can be achieved with less hardware. This confirms that the EPCDA can execute conflict detection on a GPU with fewer cores, reducing hardware costs.
6. Conclusions
In this study, we applied GPUs to conflict-detection algorithms and developed the GPCDA and the EPCDA, which accelerate conflict detection through parallel computing on GPUs. Based on the simulation experiments, we found that the GPCDA and the EPCDA performed conflict detection up to 33.7 times faster than the CPU, regardless of the type and size of the filter database; this was particularly evident when the number of threads allocated for parallel computing was large. We also observed that, for the same total number of threads, different allocations of blocks and threads significantly influenced the parallel efficiency. We analyzed the factors affecting parallel efficiency, which may guide GPU utilization in other applications.
If the workload between threads is unbalanced, different threads require different computation times; threads with a smaller workload may become idle, whereas threads with a larger workload take longer, increasing the total parallel computing time. Therefore, workload balance is an essential factor influencing parallel performance. Unfortunately, the literature shows that dispatching tasks with different execution times to multiple cores of the same computing capability so as to achieve an optimally balanced workload is an NP-complete problem. Thus, we developed a near-optimum workload-balancing mechanism in the EPCDA, such that the workload among threads approaches a balance and maximizes parallel computing performance. In the simulation experiments, we observed different performance for different workload dispatch methods. Using the EPCDA, even with fewer threads for parallel computing, we achieved the same performance as the GPCDA with more threads, thereby reducing hardware cost.
Finally, when the number of filters was large, the EPCDA still maintained good performance, with a filter database size of 100K. In the EPCDA, each filter spent 22 µs, at most, to process conflict detection; this was 9.4 to 33.7 times faster than using only a CPU. Therefore, the EPCDA is suitable for applications with frequent filter-database updates. The limitations and future research directions of this study are as follows. First, the detection speed can be increased by improving the GPU memory access efficiency, such as by minimizing non-coalesced memory accesses and bank conflicts. Second, in this study, we focused on analyzing the critical procedure in 5D conflict detection, and balancing the workload of GPU threads. The proposed algorithms were not designed for a specific GPU architecture/model. For a specific GPU architecture/model, our proposed algorithms can be modified to achieve better performance by taking advantage of hardware features. Third, this study assumed that conflict detection is executed on a single-GPU platform. For a heterogeneous multi-GPU platform, the workload balancing problem becomes much more complicated and the communication cost between GPUs should be taken into account when designing a conflict-detection algorithm.