Article

Collective Communication Performance Evaluation for Distributed Deep Learning Training

1 Supercomputing Technology Research Center, Electronics and Telecommunications Research Institute, Daejeon 34054, Republic of Korea
2 Department of Computer Engineering, Korea Aerospace University, Goyang 10540, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(12), 5100; https://doi.org/10.3390/app14125100
Submission received: 16 May 2024 / Revised: 8 June 2024 / Accepted: 9 June 2024 / Published: 12 June 2024

Abstract:
In distributed deep learning, improper use of a collective communication library can degrade training performance by increasing communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL exhibit varying performance depending on the server environment and communication architecture. In this study, we investigate three key aspects to evaluate the performance of collective communication libraries for distributed deep learning in an intra-node environment. First, we compare and analyze collective communication library performance within common distributed deep learning architectures, namely the parameter server and ring all-reduce methods. Second, we evaluate the performance of these libraries in different environments, including various container platforms and bare metal setups, considering the scalability and flexibility advantages offered by cloud virtualization. Finally, to ensure practicality, we assess the libraries’ performance both in a Linux shell and within the PyTorch framework. In the cross-docker virtualization environment, NCCL shows up to 213% higher latency than in single docker, GLOO’s gather operation shows 36% lower latency in cross docker than in single docker, and in all-reduce operations the other libraries (MPI and GLOO) take up to 345% longer than NCCL. These findings can inform the selection of an appropriate collective communication library when designing effective distributed deep learning environments.

1. Introduction

Deep learning performance depends on two main factors: the number of parameters in the model and the size of the training dataset, both of which continue to grow rapidly [1]. For instance, GPT-3, a massive deep learning model with around 175 billion parameters and a 650 GB training dataset, required 2048 CPUs and 2048 high-performance GPUs to train [2]. Training such large-scale deep learning models requires specialized GPU clusters. However, when setting up a GPU cluster, latency issues arise from the communication between cluster components. When deep learning tasks are executed on a large GPU cluster, the time spent on collective communication can be more than ten times the time spent on actual computation [3]. This communication bottleneck affects the overall training process. Furthermore, different communication libraries employ different methods for collective communication, so choosing a communication library suited to the specific environment is crucial for optimizing distributed deep learning training on a cluster.
PyTorch supports several libraries for collective communication, including the Message Passing Interface (MPI) [4], “GLOO” [5], and the NVIDIA Collective Communication Library (NCCL) [6]. MPI is a standard interface used for distributed memory computing. It consists of a message-based communication protocol and a library set. MPI is versatile, working across various platforms and finding applications in cluster computing, supercomputers, and cloud environments. It excels at handling large datasets and parallel computations. GLOO, on the other hand, is a collective communication library developed by Facebook that is specifically tailored for distributed deep learning. It focuses on efficient communication in large-scale deep learning models while maintaining scalability based on the cluster and model size. Unlike MPI, GLOO is not a general-purpose library; its specialization lies in deep learning tasks. NCCL, or the NVIDIA Collective Communications Library, was developed by NVIDIA to enhance efficient communication between GPUs. NCCL is compatible with CUDA-based applications and provides swift communication on NVIDIA GPUs. It enables direct communication between GPUs, ensuring rapid data exchange. Moreover, NCCL optimizes data movement between GPUs by employing various communication patterns and algorithms. This enhancement translates into faster training speeds and improved performance for distributed deep learning models. Consequently, NCCL proves invaluable for training large-scale GPU clusters. Given these distinctions, it is essential to analyze these three collective communication libraries in the context of each type of distributed deep learning environment.
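For reference, the following is a minimal sketch of how a backend is selected in PyTorch through the torch.distributed package; the backend string ("mpi", "gloo", or "nccl") is the only change required to switch libraries. The launcher-provided rendezvous settings (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are assumed to exist for the GLOO and NCCL backends, and the snippet is illustrative rather than part of our experimental code.

```python
import torch
import torch.distributed as dist

def init_backend(backend: str = "nccl") -> None:
    """Initialize a torch.distributed process group with the chosen backend.

    backend: "mpi", "gloo", or "nccl". The MPI backend requires a PyTorch
    build linked against an MPI installation and obtains its rank from the
    MPI launcher; GLOO and NCCL use the default env:// rendezvous, so
    MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE must be set externally.
    """
    dist.init_process_group(backend=backend)
    if backend == "nccl":
        # NCCL operates on GPU buffers only; pin each rank to one device.
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

if __name__ == "__main__":
    init_backend("gloo")  # e.g., CPU-based collective communication
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()
```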
Latency in collective communication primarily involves two key factors. First, it involves the delay incurred when invoking functions of collective communication libraries for communication between nodes. This latency depends on factors like communication protocols, network conditions, the distance between nodes, and more. Second, there is latency in copying data (such as parameters and gradients) needed for collective communication between the system memory and GPU memory [7]. This latency is influenced by factors like the amount of data and the speed of memory transfers. To understand these fundamental latency components, we conducted a performance analysis in the Linux shell environment. Additionally, since distributed deep learning typically involves deep learning frameworks, it is essential to examine collective communication latency in these environments. Therefore, in this paper, we conducted additional experiments in the PyTorch environment, which is a widely used deep learning framework.
Cloud-based GPU cluster servers, like those offered by Amazon Web Services and Google Cloud, are gaining popularity in the business world. They are chosen for their scalability and ease of management, with market spending on cloud virtualization reaching USD 63.7 billion in the first quarter of 2023 [8]. These cloud services provide flexibility and cost benefits, making it easier to set up and manage distributed deep learning training. However, when working in a cloud environment, it is crucial to consider the performance of collective communication in distributed deep learning. Efficient communication greatly affects the performance and scalability of such training. While previous research has primarily focused on assessing collective communication performance in bare metal [9] environments, our study takes into account both bare metal and containerized environments [10], including docker [11] and singularity [12] containers. This comprehensive approach ensures a thorough evaluation of distributed deep learning training performance in the cloud.
Distributed deep learning can be divided into two main categories: data parallelism [13] and model parallelism [14]. Data parallelism involves multiple nodes with identical models processing a single dataset, while model parallelism distributes one model across multiple nodes for training. In this study, we focus on data parallelism, which is commonly used in distributed deep learning, and examine the communication methods between nodes. Two representative communication architectures for data parallelism are the parameter server [15] and ring all-reduce [16]. The parameter server follows a centralized approach, where a central server manages model parameters while client nodes process data to compute gradients and update the server. In contrast, ring all-reduce is a decentralized approach where each node owns part of the model, and gradients are exchanged and combined with neighboring nodes to update the model. These data parallelism methods are widely employed in distributed deep learning research and play a significant role in designing efficient deep learning workloads. For instance, when optimizing heterogeneous server environments, a combination of parameter server and all-reduce methods is often considered for decision-making [17]. Additionally, in the fully sharded data parallel (FSDP) approach [18], which involves dividing parameters and gradients into smaller portions and assigning them to different nodes for parallel processing, collective communication is also utilized. Analyzing these approaches contributes to enhancing the structures and efficiency of distributed deep learning and the deep learning workload design.
In this study, we conducted an in-depth examination of intra-node communication methods in the context of data parallelism. Our primary objective was to assess the performance of collective communication in distributed deep learning environments. We conducted a comprehensive analysis across various scenarios, encompassing both bare metal and containerized environments, utilizing both the Linux shell and the PyTorch framework. The ultimate objective of our research was to identify effective strategies for intra-node communication in data parallelism and propose the most suitable communication approaches for distributed deep learning environments.
Here is a summary of the contributions:
  • Virtualization environment (container vs. bare metal): We assessed the collective communication libraries’ performance in diverse server environments within a multi-GPU cluster. Our investigation revealed discernible variations in collective communication performance between bare metal and container setups, prompting a detailed analysis of the performance changes exhibited by each library. Furthermore, we evaluated different types of containers within the well-established docker and singularity container environments, tailored for high-performance computing (HPC) applications. Our findings indicated a reduction in collective communication time within the single-container (multi-GPU) environment for the MPI library, while an increase in latency was observed in cross-container communication using NCCL.
  • Execution environment (Linux shell and PyTorch): In this paper, we conducted a comprehensive performance evaluation of collective communication within the Linux shell and PyTorch environments. Within the Linux shell, we investigated the complexity of inter-node communication within collective communication routines and thoroughly examined the memory copying process between the CPU and GPU within a single node. In the PyTorch environment, our analysis focused on measuring and evaluating the actual latency of collective communication occurring during distributed deep learning, particularly utilizing backend functionalities. Additionally, we configured the DDP architecture during the deep learning workload phase to measure the time taken by collective communication libraries. Based on this, the paper proposes the optimal collective communication library for real distributed deep learning systems.
  • Comparison of data-distributed deep learning architectures (parameter server vs. all-reduce): In this study, we scrutinized the performance of a collective communication library within the framework of data-distributed parallel deep learning. Our focus was on evaluating two key architectural paradigms: the parameter server employing a central server–client structure and the decentralized ring all-reduce method. Our experimental findings revealed a significant advantage of the NCCL communication library over MPI and GLOO when used in the all-reduce method.
The remainder of this paper is structured as follows:
Section 2 provides background information on collective communication architecture and the primary communication methods employed by the library. Section 3 discusses the collective communication architecture and presents experiments conducted in both the Linux shell and PyTorch environments. Section 4 provides details about the experimental environment utilized in this study. Section 5 presents the findings derived from experiments conducted within the Linux shell environment, whereas Section 6 presents the results obtained from experiments conducted in the PyTorch environment. Section 7 discusses the comprehensive experimental results and the overall experimental findings. Section 8 reviews previous related studies. Finally, Section 9 summarizes our conclusions and outlines future research directions.

2. Background

2.1. Data Parallel Distributed Deep Learning Training Architectures: Parameter Server vs. Ring All-Reduce

Distributed data parallel (DDP) is a technique designed for the scalable and efficient processing of deep learning training. Essentially, DDP distributes the gradients and parameters of a deep learning model across multiple GPU nodes, allowing training to proceed concurrently on all of them. Figure 1 illustrates the parameter server architecture. It consists of a central parameter server, with each worker node maintaining communication with this server to synchronize model parameters. Each worker operates independently, handling a portion of the data and subsequently transmitting the outcomes to the parameter server. To facilitate efficient communication among workers, two methods are employed: broadcast [19] and gather (allgather) [20]. The broadcast method ensures uniform model updates by transmitting parameters from the parameter server to all worker nodes. Conversely, the gather (allgather) method collects the results computed by the individual workers and conveys them to the parameter server for consolidation and updating.
Figure 2 depicts the ring all-reduce architecture. In this setup, multiple worker nodes are organized in a circular ring configuration. Each node within the ring possesses data of identical size, transmitting model parameters that necessitate updates while also receiving data from other nodes. At each node, the received data are merged with its own data, resulting in the generation of new data, which is then passed on to the next node in the ring. This process continues iteratively, with data propagating along the ring until all nodes eventually acquire the same updated data.

2.2. Communication Primitives for Distributed Deep Learning

Collective communication encompasses a range of operations, and Figure 3 illustrates the most commonly employed methods: broadcast, gather, allgather, reduce, and all-reduce.
  • Broadcast: The broadcast operation involves transmitting data originating from one node to all other nodes. Essentially, one node sends out the data, and all remaining nodes receive those identical data. Broadcast is utilized when the same data are required across all nodes.
  • Gather: The gather operation consolidates data gathered from all nodes into a single designated node. Each node forwards its data to the central node, which accumulates all received data. Gather is used when you need to collect distributed data into a single dataset.
  • Allgather: Similar to the gather operation, allgather also involves transferring data collected from all nodes, but in this case, it disseminates the data to all nodes. Every node transmits its data to all other nodes, leading to each node possessing a comprehensive set of data.
  • Reduce: The reduce operation focuses on transmitting results derived from computations performed across multiple nodes to a single designated node, where these results are combined. Each node shares its data with other nodes, executes computations, and transmits the computed results to one designated node. Reduce is instrumental in aggregating and summarizing outcomes obtained from multiple nodes.
  • All-reduce: The all-reduce operation extends the concept of reduce by transferring the results of operations performed by several nodes to all nodes, and subsequently, combining these results. Each node communicates its data with all other nodes, and following computations, all nodes collect and integrate the computation results. All-reduce is crucial in distributed learning for gradient updates, parameter synchronization, and comprehensive model updates across all nodes.
These collective communication operations underpin distributed deep learning, facilitating the efficient exchange and coordination of models and data. Each operation transmits and combines data according to a specific communication pattern, and together they are essential for efficient model training and communication in distributed learning.
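As an illustration, the sketch below exercises each of these primitives using PyTorch’s torch.distributed API; it assumes a process group has already been initialized (for example with the GLOO backend) and that every rank executes the same script. It is intended only to make the primitives concrete, not to reproduce our benchmark code.

```python
import torch
import torch.distributed as dist

def demo_primitives() -> None:
    rank, world = dist.get_rank(), dist.get_world_size()
    t = torch.full((4,), float(rank))        # each rank starts with its own data

    # Broadcast: rank 0's tensor overwrites every other rank's copy.
    b = t.clone()
    dist.broadcast(b, src=0)

    # Gather: rank 0 collects one tensor from every rank.
    bucket = [torch.zeros(4) for _ in range(world)] if rank == 0 else None
    dist.gather(t, gather_list=bucket, dst=0)

    # Allgather: every rank ends up with the full list of tensors.
    everyone = [torch.zeros(4) for _ in range(world)]
    dist.all_gather(everyone, t)

    # Reduce: element-wise sum of all tensors, result only on rank 0.
    r = t.clone()
    dist.reduce(r, dst=0, op=dist.ReduceOp.SUM)

    # All-reduce: element-wise sum, result available on every rank.
    a = t.clone()
    dist.all_reduce(a, op=dist.ReduceOp.SUM)
```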

2.3. Collective Communication Libraries for Distributed Deep Learning

MPI, GLOO, and NCCL are three prominent collective communication libraries employed in distributed deep learning. MPI is a widely adopted standard interface for high-performance parallel computing [21]. It facilitates message exchange and synchronization among multiple processes within a distributed memory system. MPI is instrumental in enabling parallel operations, with each process possessing a unique rank and the ability to send and receive messages to and from other processes. This framework provides various synchronization functions to control the order of operations and prevent deadlock situations. MPI is available in multiple programming languages (such as C, C++, and Fortran) and finds extensive use in high-performance computing clusters and supercomputers. Additionally, MPI serves as a communication library not only in standalone programs but also within distributed deep learning frameworks.
GLOO is a distributed parallel processing framework developed by Facebook, primarily integrated into the PyTorch deep learning framework. It specializes in facilitating distributed deep learning training. GLOO’s primary role is to exchange and synchronize data among multiple worker nodes. It is optimized for handling small data exchange tasks, particularly the parameter updates of deep learning models. GLOO provides a range of distributed algorithms designed to efficiently communicate between multiple workers. Its exceptional performance is evident when each worker engages in small-sized data exchange tasks. GLOO supports CPU-based distributed processing and can be used in conjunction with MPI or other communication libraries. It enjoys widespread use in distributed deep learning model training and serves as the foundation for PyTorch’s distributed training functionality.
NCCL is a specialized library developed by NVIDIA for enabling collective communication between GPUs. Its primary purpose is to optimize distributed processing that involves multiple GPUs during deep learning model training. NCCL is compatible with CUDA-based GPU operations and excels in streamlining data transfer and synchronization among GPUs. This optimization significantly accelerates operations like parameter updates in deep learning models. NCCL demonstrates excellent performance in multi-GPU systems, making it particularly valuable for expediting training tasks involving large-scale deep learning models and datasets. It can also be seamlessly integrated with other communication libraries and is commonly used in conjunction with Horovod [22], a distributed deep learning framework.
In summary, MPI, GLOO, and NCCL are collective communication libraries employed for parallel computing and distributed processing, each serving distinct fields and purposes. MPI is a standard interface in high-performance computing, while GLOO and NCCL are primarily tailored for distributed deep learning training, with GLOO optimized for CPU-based processing and NCCL specialized for GPU-based operations.

3. Distributed DL Communication: Libraries and Architectures Compared

This section offers a comparison of the flowcharts and architectures of the collective communication libraries utilized in the Linux shell and PyTorch environments. Each environment is subdivided into subsections, with dedicated explanations for the parameter server and ring all-reduce methods. The experiments conducted in this paper used the architectures outlined here as their fundamental structure. The code for the Linux shell and PyTorch experiments was implemented directly from these flowcharts and contains no additional elements that would affect measured performance.

3.1. Linux Shell Environment

Within the Linux shell environment, measurements were taken using MPICH [23], OpenMPI [24], CUDA-aware OpenMPI, and the NCCL libraries. Figure 4 provides a flowchart illustrating the communication library architecture in the Linux shell environment. Figure 4a shows broadcast, Figure 4b shows gather (allgather), and Figure 4c shows a flow chart of ring all-reduce. Note that since the architectures of MPI, CUDA-aware MPI, and NCCL operate differently, the function call methods are uniquely configured for each subroutine. Figure 4a,b represents the communication primitives employed in the parameter server method. In the case of using the MPI library, it is essential to invoke CUDA-aware MPI. For non-MPI scenarios, the NCCL library is utilized. There are also variations among libraries in how memory is managed. Given that the MPI library solely supports communication between CPU memories, the worker node must execute the cudaMemcpy function to facilitate data transfer between CPU and GPU memories. Conversely, CUDA-aware MPI allows for direct communication between GPUs, obviating the need for the cudaMemcpy process. NCCL exclusively supports GPU-to-GPU communication, thus employing a representative worker node known as the chief worker to facilitate communication between the parameter server and workers by enabling communication between CPU memory and GPU memory.
As described above, the all-reduce method presented in Figure 4c operates without a central server and involves direct communication among workers. Like the parameter server method, MPI mandates the inclusion of the cudaMemcpy process. After parameter updates are executed through the rotation of worker nodes forming a ring, parameters stored in CPU memory are copied to GPU memory. Following the completion of gradient operations in the GPU, there is a subsequent step of transferring the gradients to CPU memory. Conversely, in the case of CUDA-aware MPI and NCCL, which permit direct GPU communication, parameter communication is exclusively accomplished through the all-reduce function. The architectural intricacies of each function are expounded upon in detail in Figure 5, Figure 6 and Figure 7.

3.1.1. Parameter Server Method

In the parameter server method, both the broadcast operation, which disseminates parameters to worker nodes, and the gather (allgather) operation, which aggregates updated gradients from workers to the parameter server, occur within a single iteration. Figure 5 provides an overview of the broadcast architecture for each library. In the case of MPI, the parameter server employs the MPI_Bcast function to transmit the parameters created within the parameter server to the CPU memory of each worker. Subsequently, each worker utilizes the cudaMemcpy function to copy data from CPU memory to GPU memory. Due to CUDA-aware MPI’s GPU communication capabilities, cudaMemcpy is omitted from the architecture, allowing for direct communication with GPU memory.
NCCL only supports communication between GPUs, so it cannot communicate with a parameter server that does not have a GPU allocated to it. Allocating GPUs to the parameter server would waste resources. In such cases, the approach involves using MPI_Send to send data to the CPU memory of a representative worker node (chief worker). Subsequently, the data are copied to the GPU memory of the chief worker using cudaMemcpy, and then NCCL_Bcast is utilized to broadcast these data to the other worker nodes.
Figure 6 illustrates the gather architecture for the parameter server method in each library. In MPI, the process begins with each worker generating 1 GB of data on the GPU. The cudaMemcpy function is invoked in the reverse order of the broadcast, copying the GPU memory to CPU memory. Subsequently, each worker’s data are transmitted to the parameter server. In the case of CUDA-aware MPI, each worker’s gradient is configured to be sent directly to the parameter server’s CPU memory. NCCL synchronizes the data across all workers, including the chief worker, through the allgather operation. Among these, the chief worker performs a CPU memory copy allocated to that node using cudaMemcpy and then utilizes MPI_Send to transmit the aggregated data to the parameter server.
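To make the flow above concrete, the following is an illustrative PyTorch analogue of one parameter server cycle for the plain MPI case: the GLOO backend stands in for MPI communication between CPU memories, and the .cuda()/.cpu() calls stand in for the cudaMemcpy transfers. The NCCL chief-worker path is omitted, and this sketch is a simplification under those assumptions rather than the C implementation used in our experiments.

```python
import torch
import torch.distributed as dist

PS_RANK = 0                              # parameter server rank (performs no GPU work)
NUM_ELEMS = 256 * 1024 * 1024            # ~1 GB of float32 data, as in the experiments

def parameter_server_iteration(rank: int, world: int) -> None:
    """One broadcast + gather cycle; GLOO stands in for CPU-side MPI."""
    if rank == PS_RANK:
        params_cpu = torch.rand(NUM_ELEMS)                 # parameters live on the server
        dist.broadcast(params_cpu, src=PS_RANK)            # analogue of MPI_Bcast
        bucket = [torch.empty(NUM_ELEMS) for _ in range(world)]
        # The server's own dummy tensor occupies bucket[PS_RANK] after the gather.
        dist.gather(torch.zeros(NUM_ELEMS), gather_list=bucket, dst=PS_RANK)
    else:
        params_cpu = torch.empty(NUM_ELEMS)
        dist.broadcast(params_cpu, src=PS_RANK)            # receive parameters on CPU
        params_gpu = params_cpu.cuda()                     # stands in for cudaMemcpy H2D
        grads_gpu = torch.rand_like(params_gpu)            # worker "computes" gradients
        grads_cpu = grads_gpu.cpu()                        # stands in for cudaMemcpy D2H
        dist.gather(grads_cpu, gather_list=None, dst=PS_RANK)
```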

3.1.2. Ring All-Reduce Method

As depicted in Figure 7, the ring all-reduce method involves the creation of data in GPU memory, with data circularly shared among the workers. CudaMemcpy replicates the data to the CPU memory of each worker, and the data are then transmitted clockwise between the CPU memories of the nodes via MPI. Once all transfers are completed and the data from each worker are gathered, they are copied back from CPU memory to GPU memory (cudaMemcpy). CUDA-aware MPI and NCCL support direct communication between GPUs. However, for multi-node communication, data need to be transmitted via CPU memory; therefore, CUDA-aware MPI requires the cudaMemcpy operation, as shown in Figure 7b. This means that, for intra-node communication, data must first be copied from the GPU to CPU memory and then transmitted to the other nodes.

3.2. PyTorch Environment

PyTorch supports several communication backends, including MPI, GLOO, and NCCL. MPI uses the CUDA-aware OpenMPI library, GLOO by Facebook targets distributed deep learning using CPU memory, and NCCL by NVIDIA focuses on GPU-to-GPU communication. Figure 8 shows the collective communication library architecture in the PyTorch environment. In a parameter server setup, MPI and GLOO handle CPU–GPU communication, but NCCL is exclusively for GPU–GPU communication. To use NCCL with the parameter server, a data duplication step between the CPU and GPU must be added using CUDA functions. All three libraries share the same architecture for ring all-reduce. When performing actual distributed deep learning, the parameter server architecture integrates the broadcast and gather routines. Additionally, since the architecture of NCCL differs from that of the other communication libraries, different processes are performed during model training. Likewise, the execution process for all-reduce is the same for the three collective communication libraries. In the process of collecting the actual model parameters, both architectures must perform an averaging calculation to obtain the final aggregated parameter values.
In the PyTorch environment, additional experiments were conducted in practical distributed deep learning settings. This included configuring parameter servers and all-reduce architectures for three collective communication libraries to train local models on real datasets and aggregate updated model parameters through inter-node communication. Figure 9 illustrates the processes for both architectures.
Figure 9a illustrates the distributed deep learning process with the parameter server architecture. The process involves loading the Cifar-10 dataset and conducting training according to specified epochs and batch sizes. Different procedures are followed depending on whether the NCCL backend is used during the distributed process. Firstly, when using CPU communication-based methods such as MPI or GLOO, the gather function is invoked to collect local training parameters from workers into the parameter server. Subsequently, the parameters are averaged. The parameter server then receives the averaged parameters, broadcasts them to the workers, and proceeds to the next iteration. On the other hand, if NCCL is utilized, the allgather function gathers worker parameters into the parameter server’s GPU memory. Following this, a CPU memory copy is performed, and parameter averaging takes place. The averaged parameters are then allocated to the GPU memory and broadcast to workers for subsequent training iterations.
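A condensed sketch of this gather–average–broadcast step for the CPU-communication case (MPI or GLOO backend) is shown below. It assumes that rank 0 acts as the parameter server and keeps a copy of the model only to know the parameter shapes; synchronizing each parameter tensor individually is an illustrative simplification of the process in Figure 9a.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

PS_RANK = 0  # rank 0 acts as the parameter server

def ps_sync_step(model: nn.Module) -> None:
    """Gather worker parameters on the server, average them, broadcast back."""
    rank, world = dist.get_rank(), dist.get_world_size()
    for p in model.parameters():
        cpu_param = p.data.cpu()                           # workers copy GPU -> CPU
        if rank == PS_RANK:
            bucket = [torch.empty_like(cpu_param) for _ in range(world)]
            dist.gather(torch.zeros_like(cpu_param), gather_list=bucket, dst=PS_RANK)
            avg = torch.stack(bucket[1:]).mean(dim=0)      # skip the server's dummy entry
            dist.broadcast(avg, src=PS_RANK)
        else:
            dist.gather(cpu_param, gather_list=None, dst=PS_RANK)
            avg = torch.empty_like(cpu_param)
            dist.broadcast(avg, src=PS_RANK)
            p.data.copy_(avg.to(p.device))                 # copy averaged parameters back
```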
Figure 9b illustrates the all-reduce architecture, which follows a uniform procedure across three collective communication libraries. After each worker completes local training, the all-reduce function is invoked to synchronize parameters. Then, parameter averaging is performed by each worker to advance the training iteration.
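The corresponding all-reduce synchronization step, which is identical for all three backends, can be sketched as follows; every rank is a worker and averages the summed parameters locally.

```python
import torch.distributed as dist
import torch.nn as nn

def allreduce_sync_step(model: nn.Module) -> None:
    """Sum each parameter over all workers, then divide to obtain the average."""
    world = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data.div_(world)
```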

3.2.1. Parameter Server Method

In the PyTorch environment, the broadcast operations of each library are illustrated in Figure 10. For MPI and GLOO, the data are transmitted directly from the parameter server to the workers’ GPU memory. In the case of NCCL, however, only GPU-to-GPU communication is supported, so a GPU must be assigned to the parameter server. Because the parameter server method has a central server–client structure, no worker role is assigned to the central server; the data travel from the parameter server’s GPU memory to the workers’ GPU memory via inter-GPU communication. This setup can lead to inefficient GPU resource usage when resources are limited. Figure 11 illustrates the architecture of each library for the gather operation. The gather subroutine involves transmitting the gradient values calculated by each worker to the parameter server. Since allgather stores the entire gradient value on all worker nodes, MPI and GLOO use the gather subroutine to save memory. In contrast, NCCL generally employs the allgather function, so the architecture with allgather applied is shown. Similar to broadcast, NCCL requires a dedicated GPU for the parameter server, resulting in one less available GPU for workers compared to the MPI and GLOO architectures.

3.2.2. Ring All-Reduce Method

As depicted in Figure 12, in contrast to the parameter server method, the ring all-reduce approach, which is a decentralized method, does not waste GPUs in the NCCL library because all nodes are set up as workers. This means that the architecture of all communication libraries is the same in this case. The ring all-reduce architecture works by dividing the updated gradient stored in the GPU memory of each worker by the number of workers. Then, it transmits this divided gradient value to the next worker and receives the entire gradient value from the previous worker. This process repeats recursively, updating all gradient values through 2(N−1) executions when there are N workers in the system.
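To make the 2(N−1) communication steps explicit, the following is an illustrative ring all-reduce written with point-to-point send/recv calls in torch.distributed. Production backends such as NCCL implement this pattern internally, so the sketch only mirrors the description above; it assumes the tensor has at least as many elements as there are ranks.

```python
import torch
import torch.distributed as dist

def ring_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Average `tensor` across all ranks with the ring algorithm (2(N-1) steps)."""
    rank, world = dist.get_rank(), dist.get_world_size()
    nxt, prv = (rank + 1) % world, (rank - 1) % world
    chunks = list(tensor.chunk(world))                 # one chunk per rank

    # Phase 1: reduce-scatter, N-1 steps. Afterwards, chunk (rank+1) % world
    # on each rank holds the fully summed values for that chunk.
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        send_buf = chunks[send_idx].contiguous()
        recv_buf = torch.empty_like(chunks[recv_idx])
        req = dist.isend(send_buf, dst=nxt)            # non-blocking send to successor
        dist.recv(recv_buf, src=prv)                   # blocking receive from predecessor
        req.wait()
        chunks[recv_idx] = chunks[recv_idx] + recv_buf

    # Phase 2: allgather, N-1 more steps, circulating the reduced chunks.
    for step in range(world - 1):
        send_idx = (rank - step + 1) % world
        recv_idx = (rank - step) % world
        send_buf = chunks[send_idx].contiguous()
        recv_buf = torch.empty_like(chunks[recv_idx])
        req = dist.isend(send_buf, dst=nxt)
        dist.recv(recv_buf, src=prv)
        req.wait()
        chunks[recv_idx] = recv_buf

    return torch.cat(chunks) / world                   # average, as in gradient syncing
```

Because each rank only ever sends one chunk per step, the per-rank traffic stays roughly constant as the number of workers grows, which is the property that makes the ring approach attractive for decentralized training.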

4. Experimental Setup

The performance of distributed deep learning communication libraries was tested using both parameter server and ring all-reduce architectures on both bare metal and container systems. In the parameter server method, broadcast and gather (allgather) subroutines were treated separately, while all-reduce was performed as a single routine. These tests were conducted on various environments, including bare metal, singularity, multiple GPUs within a single-docker container (single docker), and one GPU per docker container (cross docker).
Singularity is a container technology primarily used in scientific and engineering computing, especially in HPC environments. Docker is a container platform that packages servers and applications into containers, making the execution environment independent and simplifying the deployment of applications across different environments. The aim was to compare the performance differences between bare metal and containerized environments and find the most suitable environment for distributed deep learning training.
The experiments generated 1 GB of random tensor data in both the Linux shell and PyTorch, performed each collective communication routine, and measured its latency; the same scenarios were then measured within a PyTorch deep learning workload.
Table 1 below shows the server specifications used in the experiment. For the software, NVIDIA driver 515.48, CUDA 11.3, OpenMPI 4.1.4, MPICH 3.3, NCCL 2.4, PyTorch 2.0.1 and docker 20.10.18 were utilized.
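For illustration, the sketch below shows one way to time a single all-reduce over roughly 1 GB of random float32 data in PyTorch. Calling torch.cuda.synchronize() before and after the collective is our assumption about how GPU-side completion would be awaited; the exact instrumentation used in the experiments may differ in detail.

```python
import time
import torch
import torch.distributed as dist

NUM_ELEMS = 256 * 1024 * 1024            # 256M float32 values, roughly 1 GB

def time_all_reduce(device: str = "cuda") -> float:
    data = torch.rand(NUM_ELEMS, device=device)
    if device == "cuda":
        torch.cuda.synchronize()         # make sure data generation has finished
    dist.barrier()                       # start all ranks together
    start = time.perf_counter()
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    if device == "cuda":
        torch.cuda.synchronize()         # wait for the collective kernel to finish
    return time.perf_counter() - start
```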

5. Experiments in the Linux Shell Environment

Section 5 involves the analysis of collective communication library performance using the Linux shell. Two major methods, the parameter server and ring all-reduce methods, were examined separately. In the parameter server method, the broadcast and gather (allgather) operations were further divided into subroutines for detailed analysis. Each experiment focused on measuring latency while increasing the number of GPUs to assess the impact of GPU scaling. The study considered different server environments, including bare metal, singularity, single-docker containers, and cross-docker containers, and evaluated their performance. The total latency observed was the sum of the execution time of library functions and the time taken for data transfer using cudaMemcpy. Several communication libraries were tested in these experiments, including MPICH, OpenMPI, CUDA-aware MPI (OpenMPI), and NCCL. The aim was to comprehensively analyze and compare the performance of these libraries under various conditions and configurations.

5.1. Parameter Server Subroutine

In the parameter server approach, the broadcast and gather (allgather) operations together make up one cycle, so the total latency of the parameter server method is the sum of the broadcast and gather (allgather) times in that cycle. In our study, we measured each of these subroutines separately. As explained in Section 2, MPI used the gather function, while NCCL used the allgather function for these tasks.

5.1.1. Broadcast

Figure 13 and Figure 14 show the broadcast results in the Linux shell. What stands out is that, when using three GPUs in the singularity environment, the latency is significantly shorter than in the other virtualization environments. When utilizing all available GPUs, the difference is considerable, with up to a 75% reduction in latency compared to the single-docker environment, which is likewise a single-container setup. Also, with MPI (including CUDA-aware MPI), there is no significant difference in latency between the cross-docker and single-docker setups. Interestingly, when comparing the latency of NCCL in the bare metal environment and the single-container environment, the single container shows a reduction of more than 30%. In contrast to MPI, however, NCCL’s cross-docker latency increases significantly, by 213% compared to NCCL’s single-docker setup. Additionally, no pronounced difference in latency between MPI and CUDA-aware MPI is evident in the observed results.

5.1.2. Gather

Figure 15 and Figure 16 show the latency results of the gather (allgather) operation within the parameter server method. Notably, the bare metal setup consistently exhibits outstanding performance, with the CUDA-aware MPI, OpenMPI, and MPICH libraries performing well except when using a single GPU. For instance, with four GPUs, MPICH showed a latency of 2.56 s, OpenMPI of 2.38 s, and CUDA-aware MPI of 2.22 s, a difference of only about 7% between each library. Additionally, compared to the other environments, MPI’s cross-docker configuration demonstrated lower latency, with less than a 0.5 s difference. However, NCCL exhibited a noticeable divergence: the difference between the single-docker and cross-docker setups exceeded 46%, and latency clearly increases in the cross-container environment for both the broadcast and allgather operations of the parameter server method.

5.2. Ring All-Reduce Routine

Figure 17 and Figure 18 illustrate the all-reduce latency results. In the MPI environment, MPICH exhibits higher latency than OpenMPI and CUDA-aware MPI. For example, when all available GPUs were used on bare metal, MPICH had a latency of 3.87 s, OpenMPI 3.296 s, and CUDA-aware MPI 3.22 s, a difference of over 20.4% between MPICH and CUDA-aware MPI. What stands out for singularity among the experimental environments is that MPICH recorded a latency of 2.97 s, which is 30% lower than that of bare metal. NCCL was measured to have up to 78% lower latency than the other libraries. Interestingly, when comparing the different environments, there was no significant difference in latency in the cross-docker setup, unlike what was observed for the parameter server method.

5.3. Linux Shell Function Execution Times

In the Linux shell experiments, the latency for each execution step under the criterion of utilizing all available resources reflects the combined results of function calls from collective communication libraries and cudaMemcpy. The latencies for each process are shown in Table 2, Table 3 and Table 4 as follows.
Table 2 displays the execution times of each function during the broadcast operation. In terms of bare metal benchmarks, for MPICH, the broadcast function takes 0.945 s, occupying 59% of the total time, while cudaMemcpy accounts for 0.653 s, representing 41%. Conversely, OpenMPI requires 1.235 s for the broadcast function, constituting 65% of the total time. Since CUDA-aware MPI does not include cudaMemcpy time, the broadcast function is used as a single process. NCCL performs operations through processes such as broadcast, cudaMemcpy, and MPI send, which transfer data from the parameter server to the chief worker. The time breakdown for NCCL is 0.453 s (45%) for the broadcast function, 0.171 s (17%) for cudaMemcpy, and 0.384 s (38%) for MPI send. When comparing against the server environment, in singularity, the overall function calls and cudaMemcpy times decrease, while in a cross-docker environment, the NCCL backend shows over four times longer broadcast function execution times.
Table 3 shows the execution times of functions during the gather (allgather) operation. Both MPICH and OpenMPI consist of gather functions and cudaMemcpy. For MPICH, the gather subroutine takes 1.806 s (70% of the total time), while for OpenMPI, it takes 1.641 s (67%). In both cases, cudaMemcpy accounts for approximately 0.7 s (30%). CUDA-aware MPI takes 2.565 s for the entire process. NCCL demonstrates a lower execution time of 0.466 s (13%) on bare metal, but in a cross-docker environment, it takes 1.925 s, constituting 37% of the total time.
Table 4 shows the execution times of the functions in the all-reduce architecture. For the MPI libraries, this includes the time for communication between CPU memories and the memory copies between the host (CPU) and the device (GPU). For the other libraries, the cudaMemcpy time is not included because communication occurs directly between GPUs. For CPU-communication MPI (MPICH, OpenMPI), the two libraries show similar latency overall, except that MPICH’s all-reduce function time is 1.3 times longer than OpenMPI’s in the bare metal environment. Additionally, in all libraries, including CUDA-aware MPI, the all-reduce function call time tends to decrease in singularity. NCCL demonstrates superior overall performance compared to the other libraries.

6. Experiments in the PyTorch Environment

Section 6 of the study provides an in-depth analysis of the performance of a collective communication library in PyTorch. Similar to the approach taken in the Linux shell experiments, this section examines both the parameter server and ring all-reduce methods.
In the parameter server method, the broadcast and gather (including allgather) operations were scrutinized by breaking them down into subroutines. In each experiment, latency was measured while incrementally adding GPUs, allowing for an analysis of the results concerning GPU expansion. These experiments were conducted across various server environments, including bare metal, singularity, single-docker containers, and cross-docker containers. Latency was measured based on the execution time of the backend functions, and the communication libraries MPI, GLOO, and NCCL were utilized in these experiments. The goal was to comprehensively assess and compare the performance of these communication libraries under different conditions and configurations within the PyTorch environment.

6.1. Parameter Server Subroutine

6.1.1. Broadcast

The results of the broadcast subroutine in the parameter server method are presented in Figure 19. For MPI_Bcast, latency increases in the order of bare metal, singularity, single-docker, and cross-docker environments. There were some peculiarities as the number of GPUs increased: the latency with four GPUs was lower than the single-container latency with three GPUs. For GLOO_bcast, the latency in the bare metal and singularity environments differs by less than 3%, while the gap between single docker and bare metal is relatively large.
In the case of NCCL_bcast, the difference between bare metal and singularity is less than 1%, but the latency is approximately 19% higher in single docker, which is the same single-container environment as singularity. Moreover, when comparing bare metal to cross docker in a four-GPU environment, cross docker exhibited 89% higher latency.

6.1.2. Gather

In Figure 20, the results of the gather and allgather subroutines of the parameter server method are displayed. For MPI, gather shows lower latency in a single container (singularity, single docker) than on bare metal when using four GPUs. For GLOO_Gather, there is a noteworthy 36% reduction in latency in the cross-docker environment compared to the other environments, which suggests that, for GLOO’s gather, communication across separate containers takes less time than the single-container configurations. In the case of NCCL_Allgather, latency increases significantly as the number of GPUs used in the cross-docker setup increases; with four GPUs in use, the latency is 54% higher than in the bare metal environment. This indicates that NCCL’s performance is particularly affected by the configuration and scaling of GPUs in a cross-docker environment.

6.2. All-Reduce Routine

In Figure 21, the results of the ring all-reduce routine are depicted. When examining the MPI_Allreduce measurements, a significant difference in latency is evident between bare metal and the singularity environment: with three GPUs in use, singularity takes 33% less time than bare metal. For GLOO_Allreduce, there is no substantial difference in latency across the environments; notably, cross docker shows only a 2–7% difference in latency compared to the other environments. NCCL_Allreduce demonstrates strong performance compared to the other backends. On bare metal with four GPUs, NCCL recorded a latency of 0.6470 s, with MPI taking 332% longer and GLOO 149% longer. However, when comparing cross docker to the other NCCL environments, a significant difference in latency is noticeable: with four GPUs, cross docker exhibits a considerably higher latency, 129% higher than bare metal. This suggests that NCCL’s performance is significantly impacted by the configuration and scaling of GPUs in a cross-docker environment.

6.3. Deep Learning Experiment

Based on the results measured for each architecture, experiments were conducted at the deep learning application stage. We measured the time taken to train a ResNet-18 model on the Cifar-10 dataset with a batch size of 32 for 10 epochs. The measurements include the time taken for each collective communication function call and for parameter averaging, as well as the total training time. In the parameter server approach, both the broadcast and gather operations are performed in a single iteration, so the time for each was measured; for gather, this includes the time taken at the parameter server to compute the average of the parameters aggregated from all nodes. Furthermore, since NCCL communicates via GPUs, the number of workers training under the parameter server is N−1 instead of N, which accounts for the difference in the number of steps shown in Table 5.
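A condensed sketch of this measurement for the all-reduce variant is shown below, assuming torchvision provides the ResNet-18 model and Cifar-10 dataset and that an NCCL (or other CUDA-capable) backend has already been initialized; the timer placement and the omission of per-rank data sharding are illustrative simplifications.

```python
import time
import torch
import torch.distributed as dist
import torchvision
from torchvision import transforms

def train_and_time(epochs: int = 10, batch_size: int = 32) -> None:
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    model = torchvision.models.resnet18(num_classes=10).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    data = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transforms.ToTensor())
    # A DistributedSampler would normally shard the data per rank; omitted here.
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

    comm_time, start_total = 0.0, time.perf_counter()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            opt.step()
            t0 = time.perf_counter()          # time only the collective + averaging
            for p in model.parameters():
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                p.data.div_(dist.get_world_size())
            torch.cuda.synchronize()
            comm_time += time.perf_counter() - t0
    total = time.perf_counter() - start_total
    print(f"rank {dist.get_rank()}: all-reduce {comm_time:.2f}s / total {total:.2f}s")
```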
Table 5 presents the results of the parameter server architecture. In all environments, MPI outperformed GLOO, and NCCL demonstrated excellent performance in each collective communication function and computational process, considering the differences in architectures. A notable observation is that, similar to the trends observed in the previous experiments, GLOO in the cross-docker environment showed latency approximately 0.77 times that of bare metal, resulting in a training time approximately 0.82 times that of bare metal. In contrast, NCCL’s latency in the cross-docker environment increased by more than 1.4 times compared to bare metal, a higher rate of increase than for the other libraries.
Table 6 shows the results of deep learning performance in the all-reduce architecture, in which all three libraries share the same architecture. As in the earlier experiments, NCCL performs best in all environments, followed by MPI and then GLOO. In the multi-GPU per container environments, all libraries exhibit lower latency than bare metal, while in the single GPU per container environment, higher latency is observed. Particularly noteworthy is that NCCL shows a significant performance degradation in that setting, with an all-reduce time 1.76 times that of bare metal.

7. Summary of Experimental Results

7.1. Linux Shell

In Figure 22, the results of collective communication executed in the Linux shell for each environment are presented. In the case of MPI_Send, used for communication between the chief worker and the parameter server in the NCCL architecture, only the measured time for OpenMPI was displayed in the overall summary because the time required for MPICH and OpenMPI differed by less than 0.1 s.
When examining the broadcast and gather (allgather) subroutines in the parameter server method, it is evident that MPI’s broadcast performance is lower than NCCL’s in the bare metal and single-container setups. However, NCCL’s latency increases rapidly in the cross-docker environment, where it becomes higher than MPI’s. In the gather (allgather) measurements, MPI outperforms NCCL in all environments, with the gap widening further in cross docker, where NCCL exhibits a 118% higher latency than OpenMPI. For all-reduce, NCCL’s performance is significantly better than MPI’s, with a substantial difference of 78% compared to MPICH in the bare metal setup, the largest gap among the tested scenarios. In comparing MPI and CUDA-aware MPI, there is no significant discrepancy in latency between the two architectures, which suggests that, on a single server, latency is influenced more by the hardware architecture than by direct GPU transmission support.
Table 7 summarizes the best-performing environment and library for each subroutine. In the broadcast subroutine, NCCL_Bcast performed in a single-docker environment achieved the best latency at 0.76 s, outperforming other scenarios. For the gather subroutine, the shortest latency was achieved using the CUDA-aware library in a bare metal environment, with a measurement of 2.225 s. In the ring all-reduce method, NCCL performed exceptionally well in a singularity environment, showing the best performance with a latency of 2.09 s. Overall, it is worth noting that the performance degradation is minimal in multi-GPU allocation environments. When comparing bare metal to single-container setups, all libraries exhibit similar or lower latencies.
Table 8 highlights the lowest performing environment and library for each subroutine. In both the broadcast and gather (allgather) subroutines of the parameter server method, NCCL performed in a cross-docker environment exhibited low performance, with latencies of 2.384 s and 5.135 s, respectively. This indicates that the performance of NCCL is significantly reduced in the single-GPU environment per container configuration of the parameter server method. For the all-reduce subroutine, the highest latency was measured in the MPICH bare metal environment, with a latency of 3.877 s. In contrast, NCCL performed much better in the same environment, with a latency of 0.892 s, showcasing a significant difference of 334%.

7.2. PyTorch

7.2.1. Collective Communication Simulation Experiments

In Figure 23, the results are consolidated for using four GPUs in each subroutine. In MPI broadcast, the latency gradually increases in the order of bare metal, singularity, and single docker. Conversely, for the rest of the gather and all-reduce subroutines, lower latency is observed in the single-container setups compared to bare metal. For example, in the gather subroutine, bare metal recorded 3.53 s, while singularity achieved 3.20 s. In single docker, it took 3.18 s. In MPI all-reduce, bare metal took 2.8 s, whereas singularity recorded 2.38 s, and single docker took 2.39 s, resulting in latency being 17% lower. In GLOO gather, cross container decreased by 36.6% compared to single container, and overall, compared to other communication backends, the gap between single container and cross container is relatively small. NCCL showed excellent overall performance in the bare metal and single-container setups. However, latency increased rapidly in the cross-docker environment, which involves a single GPU per container. When comparing bare metal to cross docker, broadcast was 89% higher, allgather was 54% higher, and all-reduce recorded 131% higher latency.
Table 9 summarizes the scenarios with the lowest latency, while Table 10 presents the scenarios with the highest latency. Across all scenarios, multi-GPU per container setups consistently performed exceptionally well, whereas single GPU per container environments exhibited the opposite trend, showcasing higher latency.
Regarding library performance, NCCL demonstrated the best performance in the broadcast and all-reduce subroutines, while MPI exhibited the lowest latency in the gather subroutine. Conversely, in the lowest-performing scenarios, NCCL’s allgather subroutine recorded a latency of 5.45 s, a 21% difference compared to MPI and a substantial 61% difference compared to GLOO. MPI exhibited lower performance in the broadcast and all-reduce subroutines, and it is noteworthy that the subroutines in which NCCL performed well recorded the lowest latencies overall. Overall, GLOO consistently showed intermediate results across all subroutines.

7.2.2. Deep Learning Experiments

Table 11 presents the results of the minimum latency in deep learning experiments. Similar to the previous collective communication experiments, NCCL’s superior performance is evident once again. The difference lies in the gathering subroutine, where NCCL demonstrates lower latency compared to other libraries. This indicates that GPU-based collective communication libraries exhibit outstanding performance in the deep learning execution phase. However, there are drawbacks, such as the necessity of using GPUs for the parameter server and the resulting GPU wastage within nodes.
Table 12 illustrates the worst results in deep learning experiments. For the broadcast and gather subroutines used in the parameter server approach, GLOO’s performance was consistently the lowest in all environments except for the cross-docker environment. However, due to the reduction in latency observed in the cross-docker environment, MPI’s performance was the lowest. Additionally, in the results of the all-reduce experiments, GLOO exhibited the highest latency.
Through deep learning experiments, we analyzed the actual performance of collective communication libraries at the application level. Comparing the findings with experiments conducted on random data, we observed similar trends for each library. Based on the results presented in this paper, we anticipate that it will contribute to selecting the optimal collective communication library for each server environment and architecture.

7.3. Findings

In this study, we have identified three significant findings:
  • NCCL’s latency spike in cross-docker virtualization environments: We observed a notable increase in latency when using the NCCL library in cross-docker virtualization environments. For instance, when comparing the latency of broadcast operations in the Linux shell with four GPUs, we found a 213% increase when using cross docker compared to the single-docker environment. Similarly, in PyTorch, the latency of allgather operations increased by 54% in the cross-docker setup.
  • Lower latency in cross docker for GLOO gather operations: Our research revealed that GLOO gather operations exhibited 36% lower latency in cross-docker environments compared to single-docker configurations. This finding underscores the importance of considering the choice of virtualization environment when optimizing latency for specific collective communication tasks.
  • NCCL dominance in all-reduce operations: NCCL demonstrated superior performance in all-reduce operations compared to MPI and GLOO. For example, in the bare metal server environment within the Linux shell, NCCL achieved a 78% reduction in latency compared to MPICH during all-reduce operations. In PyTorch, NCCL showed a substantial performance advantage, with MPI taking up to 345% longer than NCCL.
These findings provide valuable insights for practitioners seeking to optimize collective communication library selection and configuration in distributed deep learning environments, particularly when dealing with virtualization variations impacting performance.

8. Related Works

Section 8 of the paper discusses and highlights the related research papers and their explanations, emphasizing the differences between those papers and the current study. One of the papers examined in this section focuses on analyzing latency in various communication libraries in both inter-node and intra-node environments [25]. It delves into the collective communication functions commonly used in distributed deep learning, providing a detailed investigation of each library’s performance. However, this paper is limited to conducting experiments solely on bare metal, without considering performance analysis in a virtualized environment. As emphasized in our study, it is crucial to analyze performance not only in bare metal but also on cloud virtual servers, reflecting the real-world scenarios where distributed deep learning takes place.
Another paper discussed in this section evaluated the performance of collective communication within a container environment [26]. This paper specifically measured and analyzed latency by replicating distributed deep learning architectures within docker container environments through Linux shell commands. However, it is important to note that measuring latency within the Linux shell environment may have limitations, as it differs from measuring collective communication performance in actual deep learning workloads. In contrast, our study addresses this limitation by analyzing communication latency in practical distributed deep learning scenarios, encompassing the collective communication library used in PyTorch alongside the Linux shell environment. This approach provides a more comprehensive understanding of real-world distributed deep learning communication performance.
In another study, GLOO and NCCL, communication libraries provided by PyTorch, were applied to models such as ResNet50 and BERT to evaluate the acceleration of PyTorch’s data-parallel training [27]. This study demonstrated the potential for near-linear scalability in the architecture implemented with the NCCL backend. However, it did not separately consider the parameter server and all-reduce methods, which are the main data parallelism architectures, and it did not experiment with container environments such as docker and singularity.
Other studies focus on performance changes when running deep learning frameworks inside containers [28]. One study benchmarked a deep learning software framework configured inside a docker container against a local server, found no noticeable downside to running deep learning APIs in docker containers, and suggested encapsulation and dissemination as a feasible deployment approach; however, it did not analyze communication overhead. Another study benchmarked deep learning infrastructure under singularity and udocker [29] and reported performance equivalent to that of a local server [30], but it did not comparatively analyze the various collective communication libraries.
Furthermore, one study analyzed the communication of deep learning workloads in an HPC environment [31]. It profiled the performance impact of high-performance interconnects such as InfiniBand, Omni-Path, PCIe, and NVLink integrated into a heterogeneous GPU-based HPC system, analyzing MPI microbenchmarks and DNN training workloads with Horovod. That work differs from ours in that we measure communication latency in a commonly used 1 Gbit/s network environment rather than in a specialized high-performance network environment.
In this paper, we empirically analyze communication time to help reduce the overhead of distributed deep learning on GPU clusters. We replicate the collective communication architectures in various multi-GPU environments, including bare metal and containers, using both the Linux shell and PyTorch, and we measure and compare the latency of the gradient and parameter communication process; the parameter-server exchange we replicate is sketched below. This analysis provides insights into the communication time of each library, together with characteristics such as communication speed and scalability, with a focus on the actual communication overhead arising from direct function calls during deep learning workloads.
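A minimal sketch of the parameter-server communication round we replicate in PyTorch (an illustration, not our exact benchmark code): the root rank broadcasts parameters and collects worker gradients with gather. The function name and the averaging step are assumptions, and with the NCCL backend, for which the experiments substitute allgather for gather, the corresponding call would change accordingly.

    import torch
    import torch.distributed as dist

    def parameter_server_round(params, grads, root: int = 0):
        """One broadcast/gather round of the parameter-server pattern (illustration)."""
        # 1) Parameter distribution: the root (parameter server) broadcasts to all workers.
        for p in params:
            dist.broadcast(p, src=root)

        # 2) Gradient collection: every rank sends its gradient tensors to the root.
        rank, world = dist.get_rank(), dist.get_world_size()
        for g in grads:
            bucket = [torch.empty_like(g) for _ in range(world)] if rank == root else None
            dist.gather(g, gather_list=bucket, dst=root)
            if rank == root:
                g.copy_(torch.stack(bucket).mean(dim=0))  # simple average of worker gradients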

9. Conclusions

In this paper, we compare and analyze distributed deep learning communication libraries in both bare metal and container environments. According to our findings, in the parameter server approach with multi-GPU allocation per server, MPI outperforms GLOO for CPU-based communication, while GLOO is superior when each container is allocated a single GPU. NCCL, which communicates directly between GPUs, performs best in both the parameter server and all-reduce architectures except where GPU allocation constraints apply, and its advantage is most pronounced in all-reduce. Overall, among the container environments, the multi-GPU setup within singularity containers outperforms the other server environments. Based on these results, we suggest the following recommendations.
For the parameter server method, MPI is the recommended choice in bare metal, singularity, and single-docker environments, whereas GLOO is advisable in the cross-docker setup to minimize collective communication time. For the ring all-reduce method, NCCL should be used irrespective of the server environment. From the user’s perspective, it would be useful if a framework such as PyTorch offered a function that selects the collective communication library with the lowest latency for a given server execution environment; a sketch of such a selection rule is given below.
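The sketch below encodes these recommendations as a backend-selection rule. The helper function, its argument names, and the environment labels are hypothetical constructs for illustration only, not an existing PyTorch API, and the MPI option further assumes a PyTorch build with MPI support.

    import torch.distributed as dist

    def pick_backend(architecture: str, environment: str) -> str:
        """Hypothetical helper encoding this study's recommendations.
        architecture: "parameter_server" or "ring_allreduce"
        environment:  "bare_metal", "singularity", "single_docker", or "cross_docker"
        """
        if architecture == "ring_allreduce":
            return "nccl"      # NCCL is preferred regardless of the server environment
        if environment == "cross_docker":
            return "gloo"      # GLOO minimizes collective time for cross-docker parameter servers
        return "mpi"           # MPI for bare metal, singularity, and single-docker parameter servers

    # Example: dist.init_process_group(backend=pick_backend("parameter_server", "single_docker"))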
These results can inform the design and implementation of distributed deep learning systems. Future research may examine the impact of hardware and network topology on communication libraries and, building on these findings, develop an automated framework for selecting the most suitable communication library for a given distributed deep learning environment. In addition, experiments in inter-node server environments would enable a more comprehensive study covering the full range of data parallelism scenarios.

Author Contributions

Conceptualization, J.L. and S.L.; methodology, J.L.; validation, J.L. and S.L.; writing—original draft preparation, S.L.; writing—review and editing, J.L.; supervision, J.L.; project administration, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation of Korea (NRF-2023R1A2C1005750).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ahn, S.; Lim, E. SoftMemoryBox II: A Scalable, Shared Memory Buffer Framework for Accelerating Distributed Training of Large-Scale Deep Neural Networks. IEEE Access 2020, 8, 207097–207111. [Google Scholar] [CrossRef]
  2. Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Lin, H. AI-Generated Content (AIGC): A Survey. arXiv 2023, arXiv:2304.06632. [Google Scholar]
  3. Dryden, N.; Maruyama, N.; Moon, T.; Benson, T.; Yoo, A.; Snir, M.; Van Essen, B. Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems; Technical Report; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2018. [Google Scholar]
  4. Gropp, W.; Lusk, E.; Skjellum, A. Using MPI: Portable Parallel Programming with the Message-Passing Interface; MIT Press: Cambridge, MA, USA, 1999; Volume 1. [Google Scholar]
  5. Arnold, S. Writing Distributed Applications with PyTorch. Available online: https://sebarnold.net/posts/writing_distributed_apps_pytorch_20170614/note.pdf (accessed on 14 June 2017).
  6. Jeaugey, S. Nccl 2.0. GPU Technol. Conf. (GTC) 2017, 2, 23. [Google Scholar]
  7. Cho, S.; Hong, J.; Choi, J.; Han, H. Multithreaded double queuing for balanced CPU-GPU memory copying. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, Limassol, Cyprus, 8–12 April 2019; pp. 1444–1450. [Google Scholar]
  8. Infographic: Big Three Dominate the Global Cloud Market—Statista.com. Available online: https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/ (accessed on 16 July 2023).
  9. Lin, C.Y.; Pai, H.Y.; Chou, J. Comparison Between Bare-metal, Container and VM using Tensorflow Image Classification Benchmarks for Deep Learning Cloud Platform. In Proceedings of the 8th International Conference on Cloud Computing and Services Science (CLOSER 2018), Funchal, Portugal, 19–21 March 2018; pp. 376–383. [Google Scholar]
  10. Xu, P.; Shi, S.; Chu, X. Performance evaluation of deep learning tools in docker containers. In Proceedings of the 2017 3rd International Conference on Big Data Computing and Communications (BIGCOM), Chengdu, China, 10–11 August 2017; pp. 395–403. [Google Scholar]
  11. Rad, B.B.; Bhatti, H.J.; Ahmadi, M. An introduction to docker and analysis of its performance. Int. J. Comput. Sci. Netw. Secur. (IJCSNS) 2017, 17, 228. [Google Scholar]
  12. Kurtzer, G.M.; Sochat, V.; Bauer, M.W. Singularity: Scientific containers for mobility of compute. PLoS ONE 2017, 12, e0177459. [Google Scholar] [CrossRef] [PubMed]
  13. Dryden, N.; Moon, T.; Jacobs, S.A.; Van Essen, B. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), Salt Lake City, UT, USA, 14 November 2016; pp. 1–8. [Google Scholar]
  14. Zhang, H.; Zheng, Z.; Xu, S.; Dai, W.; Ho, Q.; Liang, X.; Hu, Z.; Wei, J.; Xie, P.; Xing, E.P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17), Santa Clara, CA, USA, 12–14 July 2017; pp. 181–193. [Google Scholar]
  15. Li, M.; Zhou, L.; Yang, Z.; Li, A.; Xia, F.; Andersen, D.G.; Smola, A. Parameter server for distributed machine learning. Big Learn. NIPS Workshop 2013, 6, 2. [Google Scholar]
  16. Patarasuk, P.; Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 2009, 69, 117–124. [Google Scholar] [CrossRef]
  17. Kim, Y.; Choi, H.; Lee, J.; Kim, J.S.; Jei, H.; Roh, H. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Clust. Comput. 2020, 23, 2287–2300. [Google Scholar] [CrossRef]
  18. Zhao, Y.; Gu, A.; Varma, R.; Luo, L.; Huang, C.C.; Xu, M.; Wright, L.; Shojanazeri, H.; Ott, M.; Shleifer, S.; et al. Pytorch FSDP: Experiences on scaling fully sharded data parallel. arXiv 2023, arXiv:2304.11277. [Google Scholar] [CrossRef]
  19. Awan, A.A.; Chu, C.H.; Subramoni, H.; Panda, D.K. Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? In Proceedings of the 25th European MPI Users’ Group Meeting, Barcelona, Spain, 23–26 September 2018; pp. 1–9. [Google Scholar]
  20. Kang, Q.; Träff, J.L.; Al-Bahrani, R.; Agrawal, A.; Choudhary, A.; Liao, W.K. Scalable algorithms for MPI intergroup allgather and allgatherv. Parallel Comput. 2019, 85, 220–230. [Google Scholar] [CrossRef]
  21. Gropp, W.; Lusk, E.; Doss, N.; Skjellum, A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Comput. 1996, 22, 789–828. [Google Scholar] [CrossRef]
  22. Sergeev, A.; Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv 2018, arXiv:1802.05799. [Google Scholar]
  23. Thakur, R.; Rabenseifner, R.; Gropp, W. Optimization of collective communication operations in MPICH. Int. J. High Perform. Comput. Appl. 2005, 19, 49–66. [Google Scholar] [CrossRef]
  24. Graham, R.L.; Woodall, T.S.; Squyres, J.M. Open MPI: A flexible high performance MPI. In Proceedings of the Parallel Processing and Applied Mathematics: 6th International Conference, PPAM 2005, Poznań, Poland, 11–14 September 2005; Springer: Berlin/Heidelberg, Germany, 2006. Revised Selected Papers 6. pp. 228–239. [Google Scholar]
  25. Weingram, A.; Li, Y.; Qi, H.; Ng, D.; Dai, L.; Lu, X. xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. J. Comput. Sci. Technol. 2023, 38, 166–195. [Google Scholar] [CrossRef]
  26. Choi, H.; Kim, Y.; Lee, J.; Kim, Y. Empirical Performance Evaluation of Communication Libraries for Multi-GPU based Distributed Deep Learning in a Container Environment. KSII Trans. Internet Inf. Syst. 2021, 15, 911–931. [Google Scholar]
  27. Li, S.; Zhao, Y.; Varma, R.; Salpekar, O.; Noordhuis, P.; Li, T.; Paszke, A.; Smith, J.; Vaughan, B.; Damania, P.; et al. Pytorch distributed: Experiences on accelerating data parallel training. arXiv 2020, arXiv:2006.15704. [Google Scholar] [CrossRef]
  28. Balaji, A.; Allen, A. Benchmarking automatic machine learning frameworks. arXiv 2018, arXiv:1808.06492. [Google Scholar]
  29. Gomes, J.; Bagnaschi, E.; Campos, I.; David, M.; Alves, L.; Martins, J.; Pina, J.; Lopez-Garcia, A.; Orviz, P. Enabling rootless Linux Containers in multi-user environments: The udocker tool. Comput. Phys. Commun. 2018, 232, 84–97. [Google Scholar] [CrossRef]
  30. Grupp, A.; Kozlov, V.; Campos, I.; David, M.; Gomes, J.; López García, Á. Benchmarking deep learning infrastructures by means of tensorflow and containers. In Proceedings of the International Conference on High Performance Computing, Dublin, Ireland, 15–19 July 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 478–489. [Google Scholar]
  31. Ibrahim, K.Z.; Nguyen, T.; Nam, H.A.; Bhimji, W.; Farrell, S.; Oliker, L.; Rowan, M.; Wright, N.J.; Williams, S. Architectural requirements for deep learning workloads in hpc environments. In Proceedings of the 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), St. Louis, MO, USA, 15 November 2021; pp. 7–17. [Google Scholar]
Figure 1. Parameter server architecture.
Figure 2. Ring all-reduce architecture.
Figure 3. Data exchange behaviors in five communication primitives.
Figure 4. Linux shell flowchart.
Figure 5. Linux shell broadcast architecture.
Figure 6. Linux shell gather architecture.
Figure 7. Linux shell all-reduce architecture.
Figure 8. PyTorch flowchart.
Figure 9. PyTorch deep learning flowchart.
Figure 10. PyTorch broadcast architecture.
Figure 11. PyTorch gather architecture.
Figure 12. PyTorch all-reduce architecture.
Figure 13. Linux shell MPI CUDA-aware broadcast.
Figure 14. Linux shell NCCL broadcast.
Figure 15. Linux shell MPI CUDA-aware gather latency.
Figure 16. Linux shell NCCL allgather latency.
Figure 17. Linux shell MPI CUDA-aware all-reduce latency.
Figure 18. Linux shell NCCL all-reduce latency.
Figure 19. PyTorch parameter server broadcast latency.
Figure 20. PyTorch parameter server gather latency.
Figure 21. PyTorch ring all-reduce latency.
Figure 22. Linux shell communication library architecture latency.
Figure 23. PyTorch communication library architecture latency.
Table 1. Hardware overview of experimental system.
Component | Experimental Server
GPU | 4 NVIDIA GeForce RTX 3080 GPUs (12 GiB)
CPU | 1 Intel Core i9-10900 processor (10 cores)
CPU Memory | 32 GB 2933 MHz DDR4
PCIe | bidirectional 16 GBps PCIe (Gen 3)
Table 2. Linux shell function execution times (broadcast).
Server | Backend | Broadcast Time (s) | CudaMemcpy Time (s) | MPI Send Time (s) | Total Latency (s)
Bare metal | MPICH | 0.945 (59%) | 0.653 (41%) | - | 1.598
Bare metal | OpenMPI | 1.235 (65%) | 0.650 (35%) | - | 1.885
Bare metal | CUDA-aware MPI | 1.840 (100%) | - | - | 1.840
Bare metal | NCCL | 0.453 (45%) | 0.171 (17%) | 0.384 (38%) | 1.008
Singularity | MPICH | 0.439 (40%) | 0.649 (60%) | - | 1.088
Singularity | OpenMPI | 0.439 (40%) | 0.650 (60%) | - | 1.089
Singularity | CUDA-aware MPI | 1.091 (100%) | - | - | 1.091
Singularity | NCCL | 0.438 (55%) | 0.166 (20%) | 0.196 (25%) | 0.800
Single Docker | MPICH | 1.262 (66%) | 0.650 (34%) | - | 1.912
Single Docker | OpenMPI | 1.268 (66%) | 0.647 (34%) | - | 1.915
Single Docker | CUDA-aware MPI | 1.920 (100%) | - | - | 1.920
Single Docker | NCCL | 0.420 (55%) | 0.167 (22%) | 0.173 (23%) | 0.760
Cross Docker | MPICH | 1.257 (65%) | 0.658 (35%) | - | 1.915
Cross Docker | OpenMPI | 1.277 (65%) | 0.654 (35%) | - | 1.931
Cross Docker | CUDA-aware MPI | 1.973 (100%) | - | - | 1.973
Cross Docker | NCCL | 1.941 (81%) | 0.167 (7%) | 0.276 (12%) | 2.384
Table 3. Linux shell function execution times (gather).
Server | Backend | Gather (Allgather) Time (s) | CudaMemcpy Time (s) | MPI Send Time (s) | Total Latency (s)
Bare metal | MPICH | 1.806 (70%) | 0.758 (30%) | - | 2.564
Bare metal | OpenMPI | 1.641 (67%) | 0.748 (33%) | - | 2.389
Bare metal | CUDA-aware MPI | 2.225 (100%) | - | - | 2.225
Bare metal | NCCL | 0.446 (13%) | 1.782 (52%) | 1.220 (35%) | 3.448
Singularity | MPICH | 1.825 (71%) | 0.755 (29%) | - | 2.580
Singularity | OpenMPI | 1.574 (61%) | 0.749 (39%) | - | 2.563
Singularity | CUDA-aware MPI | 2.565 (100%) | - | - | 2.565
Singularity | NCCL | 0.429 (12%) | 1.782 (52%) | 1.236 (36%) | 3.447
Single Docker | MPICH | 1.679 (69%) | 0.749 (31%) | - | 2.428
Single Docker | OpenMPI | 1.685 (69%) | 0.746 (31%) | - | 2.431
Single Docker | CUDA-aware MPI | 1.920 (100%) | - | - | 1.920
Single Docker | NCCL | 0.416 (12%) | 1.811 (52%) | 1.286 (36%) | 3.513
Cross Docker | MPICH | 1.664 (65%) | 0.751 (35%) | - | 2.415
Cross Docker | OpenMPI | 1.598 (65%) | 0.747 (35%) | - | 2.345
Cross Docker | CUDA-aware MPI | 1.973 (100%) | - | - | 1.973
Cross Docker | NCCL | 1.925 (37%) | 1.886 (37%) | 1.324 (26%) | 5.135
Table 4. Linux shell function execution times (all-reduce).
Server | Backend | All-Reduce Time (s) | CudaMemcpy (H to D) Time (s) | CudaMemcpy (D to H) Time (s) | Total Latency (s)
Bare metal | MPICH | 2.483 (64%) | 0.639 (16%) | 0.755 (20%) | 3.877
Bare metal | OpenMPI | 1.903 (58%) | 0.638 (19%) | 0.755 (23%) | 3.296
Bare metal | CUDA-aware MPI | 3.226 (100%) | - | - | 3.226
Bare metal | NCCL | 2.285 (100%) | - | - | 2.285
Singularity | MPICH | 1.579 (53%) | 0.642 (22%) | 0.756 (25%) | 2.977
Singularity | OpenMPI | 1.574 (53%) | 0.637 (22%) | 0.744 (25%) | 2.955
Singularity | CUDA-aware MPI | 2.952 (100%) | - | - | 2.952
Singularity | NCCL | 2.096 (100%) | - | - | 2.096
Single Docker | MPICH | 1.893 (57%) | 0.647 (20%) | 0.770 (23%) | 3.310
Single Docker | OpenMPI | 1.870 (57%) | 0.639 (20%) | 0.762 (23%) | 3.271
Single Docker | CUDA-aware MPI | 3.242 (100%) | - | - | 3.242
Single Docker | NCCL | 2.106 (100%) | - | - | 2.106
Cross Docker | MPICH | 2.222 (62%) | 0.634 (18%) | 0.744 (20%) | 3.600
Cross Docker | OpenMPI | 2.396 (62%) | 0.628 (16%) | 0.841 (22%) | 3.838
Cross Docker | CUDA-aware MPI | 3.610 (100%) | - | - | 3.610
Cross Docker | NCCL | 2.200 (100%) | - | - | 2.200
Table 5. Parameter server architecture deep learning latency.
Server | Backend | Steps | Bcast Time (s) | Gather Time (s) | Training Time (s)
Bare metal | MPI | 391 | 213 (-) | 880 (-) | 1095 (-)
Bare metal | GLOO | 391 | 317 (-) | 1359 (-) | 1676 (-)
Bare metal | NCCL | 521 | 12 (-) | 491 (-) | 503 (-)
Singularity | MPI | 391 | 234 (×1.09) | 1058 (×1.20) | 1288 (×1.17)
Singularity | GLOO | 391 | 314 (×0.99) | 1266 (×0.93) | 1580 (×0.94)
Singularity | NCCL | 521 | 10 (×0.83) | 356 (×0.72) | 367 (×0.72)
Single Docker | MPI | 391 | 234 (×1.09) | 1069 (×1.21) | 1301 (×1.18)
Single Docker | GLOO | 391 | 304 (×0.95) | 1301 (×0.95) | 1603 (×0.95)
Single Docker | NCCL | 521 | 10 (×0.83) | 361 (×0.73) | 373 (×0.74)
Cross Docker | MPI | 391 | 363 (×1.70) | 1084 (×1.23) | 1447 (×1.32)
Cross Docker | GLOO | 391 | 334 (×1.05) | 1053 (×0.77) | 1386 (×0.82)
Cross Docker | NCCL | 521 | 14 (×1.16) | 695 (×1.41) | 833 (×1.45)
Table 6. All-reduce architecture deep learning latency.
Server | Backend | Steps | All-Reduce Time (s) | Training Time (s)
Bare metal | MPI | 391 | 291 (-) | 384 (-)
Bare metal | GLOO | 391 | 531 (-) | 650 (-)
Bare metal | NCCL | 391 | 119 (-) | 186 (-)
Singularity | MPI | 391 | 282 (×0.96) | 374 (×0.97)
Singularity | GLOO | 391 | 505 (×0.95) | 623 (×0.95)
Singularity | NCCL | 391 | 96 (×0.80) | 162 (×0.86)
Single Docker | MPI | 391 | 284 (×0.97) | 378 (×0.98)
Single Docker | GLOO | 391 | 488 (×0.91) | 607 (×0.93)
Single Docker | NCCL | 391 | 94 (×0.79) | 164 (×0.88)
Cross Docker | MPI | 391 | 355 (×1.21) | 451 (×1.17)
Cross Docker | GLOO | 391 | 617 (×1.16) | 750 (×1.15)
Cross Docker | NCCL | 391 | 210 (×1.76) | 283 (×1.51)
Table 7. Best results of each experiment (Linux shell).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | multi-GPU per container | single docker | NCCL | 0.76
Gathering | local server | bare metal | CUDA-aware | 2.22
All-reduce | multi-GPU per container | singularity | NCCL | 2.09
Table 8. Worst results of each experiment (Linux shell).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | single GPU per container | cross docker | NCCL | 2.38
Gathering | single GPU per container | cross docker | NCCL | 5.13
All-reduce | local server | bare metal | MPICH | 3.87
Table 9. Best results of each experiment (PyTorch).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | multi-GPU per container | singularity | NCCL | 1.06
Gathering | multi-GPU per container | single docker | MPI | 3.18
All-reduce | multi-GPU per container | single docker | NCCL | 0.64
Table 10. Worst results of each experiment (PyTorch).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | single GPU per container | cross docker | MPI | 2.60
Gathering | single GPU per container | cross docker | NCCL | 5.45
All-reduce | single GPU per container | cross docker | MPI | 2.85
Table 11. Best results of each experiment (PyTorch deep learning).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | multi-GPU per container | singularity | NCCL | 9.98
Gathering | multi-GPU per container | singularity | NCCL | 356.06
All-reduce | multi-GPU per container | single docker | NCCL | 94.10
Table 12. Worst results of each experiment (PyTorch deep learning).
Experiment | GPU Allocation Type | Environment | Library | Latency (s)
Broadcasting | single GPU per container | cross docker | MPI | 363.35
Gathering | single GPU per container | cross docker | MPI | 1084.23
All-reduce | single GPU per container | cross docker | GLOO | 617.26