1. Introduction
Supercomputing services are currently among the most important services provided by national supercomputing centers worldwide [1]. The term typically refers to the use of aggregated computing power to solve advanced computational problems related to scientific research [2]. Most of these services are built on cluster-based supercomputers and provide the computational resources and applications needed in various scientific fields [3]. Supercomputing services are often divided into high-performance computing (HPC) services and high-throughput computing (HTC) services [4] according to the workflow of the job. An HPC service targets tightly coupled parallel jobs, such as message passing interface (MPI) parallel jobs, which must complete a large number of computations within a short time. In contrast, an HTC service targets independent, loosely coupled jobs, which must complete a large number of distributed computations over a longer period, such as a month or a year [5].
Most national supercomputing centers are plagued with inconvenience and inefficiency in handling these types of jobs. Therefore, in this study, we focused on the current problems facing HPC users and supercomputing center administrators seeking to provide efficient systems. HPC users face four main problems when performing traditional HPC jobs; these are complexity, compatibility, application expansion, and a pay-as-used arrangement. Conversely, center administrators face five main problems with traditional HPC systems; these are cost, flexibility, scalability, integration, and portability.
One way to address these problems with traditional HPC is to adopt cloud computing technology and provide HPC services in a cloud environment [6]. With the maturation of cloud computing technology and the stabilization of its services, providing HPC services in the cloud has become a major interest of many HPC administrators, and many researchers have entered this field under labels such as HPC over Cloud, HPC Cloud, HPC in Cloud, or HPC as a Service (HPCaaS) [7]. Although these approaches solve the typical scalability problems of the cloud environment and reduce deployment and operational costs, they fail to meet high-performance requirements because of the overhead of virtualization itself. In particular, the performance degradation of overlay networking between Virtual Machines (VMs) has become one of the most serious issues; studies are currently underway to program fast packet processing between VMs on the data plane using a data plane development kit [8]. Another serious problem is the inflexible mobility of VM images, which bundle particular combinations or versions of, e.g., the operating system, compiler, libraries, and HPC applications. Thus, the biggest challenge for HPC administrators is to achieve cloud scalability, performance, and image service in virtualized environments.
One way to overcome this challenge is to implement an HPC cloud in a containerized environment. However, integrating existing HPC systems with a containerized environment adds complexity to the design of the HPC workflow, defined here as the flow of tasks that must be executed to compute on HPC resources; the diversity of these tasks increases the complexity of the infrastructure to be implemented. We defined a set of requirements for an HPC workflow in a containerized environment and compared them with those of projects such as Shifter [9], Sarus [10], EASEY [11], and JEDI [12], which have been proposed by several national supercomputing centers. Our analysis of this related work shows that it is difficult to design an architecture that includes all the functionality required to satisfy both users and administrators.
In this study, we proposed an HPC cloud architecture that can reduce the complexity of HPC workflows in containerized environments to provide supercomputing resource scalability, high performance, user convenience, diverse HPC applications, and management efficiency. To evaluate the serviceability of the proposed architecture, we developed a platform as part of the Partnership and Leadership of Supercomputing Infrastructure (PLSI) project [13] led by the Korean National Supercomputing Center, Korea Institute of Science and Technology Information (KISTI), and built a test bed based on the PLSI infrastructure. In addition, the platform offers an easy-to-use interface that requires minimal prior knowledge, ensuring user convenience.
The remainder of this paper is organized as follows: In Section 2, we describe related work on container-based HPC cloud solutions involving national supercomputing resources. In Section 3, we explain the design of the implemented platform, including its system architecture and workflows, and the detailed implementation is described in Section 4. In Section 5, we present the results of evaluating various aspects of the developed platform. Finally, we provide concluding remarks in Section 6.
3. Platform Design
To meet the requirements discussed in the previous section, we designed a container-based HPC cloud platform based on a system analysis. The system architecture and workflows were designed with the requirements of current users and administrators in mind. The workflows of the image management system, the job management system, and metering data management are explained in detail below.
3.1. Architecture
This platform was designed in accordance with the three service roles of the cloud architecture (Figure 7): the service creator, service provider, and service consumer roles, which must be distinguished to enable self-service. In this figure, items in blue boxes were implemented using existing open source software, those in green boxes were developed by integrating the necessary components, and those in yellow boxes were newly developed. The service creator manages templates through the Template Create and Template Evaluate processes. Previously verified templates can be searched using Template List and Template Search and deleted using Template Delete; these functions were developed as CLI tools and provided as a service. Each verified template is an automatic installation and configuration script covering specific versions of the Operating System (OS), libraries, compilers, and applications. All container images are built from verified templates, and all jobs (including containers and applications) are executed based on these built images.
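To illustrate the shape of such template-management tools, the following is a minimal sketch of a CLI with list, search, and delete subcommands. The command name, the shared template directory, and the storage layout are assumptions for illustration only, not the actual PLSICLOUD tools.

```python
# Hypothetical template-management CLI sketch (Python 3); paths and command
# names are assumptions, not the platform's actual implementation.
import argparse
import os

TEMPLATE_DIR = "/shared/templates"  # assumed location of verified templates


def list_templates():
    for name in sorted(os.listdir(TEMPLATE_DIR)):
        print(name)


def search_templates(keyword):
    for name in sorted(os.listdir(TEMPLATE_DIR)):
        if keyword in name:
            print(name)


def delete_template(name):
    os.remove(os.path.join(TEMPLATE_DIR, name))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="template")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("list")
    p_search = sub.add_parser("search")
    p_search.add_argument("keyword")
    p_delete = sub.add_parser("delete")
    p_delete.add_argument("name")
    args = parser.parse_args()
    if args.cmd == "list":
        list_templates()
    elif args.cmd == "search":
        search_templates(args.keyword)
    else:
        delete_template(args.name)
```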
HPCaaS is provided by our container-based HPC cloud service provider to service consumers through the container-based HPC cloud management platform, which consists primarily of job integration management and image management processes. Our platform provides services based on two container platforms, whose hardware-level management is handled by Container Platform Management. Image management is based on a distributed system and underpins Image Manager's workload distribution, task parallelization, auto-scaling, and image provisioning functions. We designed container and application provisioning by developing integration packages for the different job types, because different container solutions require different container workload managers. The on-demand billing function was implemented using measured metering data. We also designed real-time monitoring of CPU and memory usage, as well as a function for providing various types of Job History Data records based on the collected metering data. In addition, interfaces connecting the service roles were implemented: Service Development Interface sends the template created by the service creator to the service provider, and the image service and job service created from this template are delivered as HPC Image Service and HPC Job Service through Service Delivery Interface.
The workflow diagram in Figure 8 presents a detailed design of the container-based HPC cloud platform, including image management (yellow boxes) and job integration management (blue boxes). We proposed a distributed system for image management to reduce the workload resulting from user requests. When HPC users request a desired container image, Image Manager automatically generates and provides the image based on existing templates created by administrators. For example, when a user wants to execute an MPI job, a container image that includes MPI is checked first. If the requested image exists, the user submits the MPI job with that image; if not, the user can request an image build from an existing template, and if no suitable template exists, the user can request one from the administrator. Each user can request images and can also share them with other users. In addition, we designed an auto-scaling scheduler for the Image Manager nodes, which are regarded as workers. For job integration management, we integrated a batch scheduler and a container orchestration mechanism to deploy the container and the application simultaneously. After a container is created and its application executed, all processes and containers are automatically deleted to release the allocated resources. Additionally, a data collector for metering data was designed. Finally, My Resource View was designed to show the resources generated by each user and is used to implement multitenancy; to share the resource pool among different or isolated services for each user, My Resource View provides usage statistics for each user's resources.
3.2. Image Management
The workflow of image management is shown in Figure 9. The user can submit image requests on the login node; when user requests are received, Image Manager delivers them to Docker Daemon or Singularity Daemon to create the images and then records the history in Image Metadata Storage on the image manager node. Docker Daemon uses templates written in the Dockerfile format to create Docker images automatically according to the user's request, and all created Docker images are stored in Docker Image Temporary Storage for the steps that follow. Singularity Daemon likewise uses templates written in the definition file format to create Singularity images automatically, and all created Singularity images are stored in Singularity Image Temporary Storage. When the user requests Docker Image Share, Image Manager uploads the requested image to the Private Docker Registry that has already been built as a local hub; this local hub serves as Docker Image Storage to ensure high-speed transmission and security. When a user requests Singularity Image Share, Image Manager uploads the requested image to the parallel file system mounted on all nodes. Once the image is uploaded, a job request can be submitted using this image.
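The sketch below illustrates, in simplified form, how an image manager might dispatch a build to the Docker or Singularity tooling from a template and then share the result. The registry address, shared file system path, image tags, and the exact CLI invocations are assumptions (in particular, very old Singularity releases such as 2.2 used "create"/"bootstrap" rather than "build"); this is not the platform's actual code.

```python
# Illustrative image-build dispatch; registry, paths, and tags are assumptions.
import subprocess

PRIVATE_REGISTRY = "registry.local:5000"   # assumed Docker Private Registry address
SHARED_FS = "/gpfs/images"                 # assumed mounted parallel file system


def build_docker_image(template_path, tag, context_dir="."):
    # Build a Docker image from a Dockerfile-format template.
    image = "%s/%s" % (PRIVATE_REGISTRY, tag)
    subprocess.check_call(
        ["docker", "build", "-f", template_path, "-t", image, context_dir])
    return image


def share_docker_image(image):
    # Push to the local private registry instead of compressing and copying files.
    subprocess.check_call(["docker", "push", image])


def build_singularity_image(definition_path, name):
    # Build a Singularity image from a definition-file template directly into
    # the shared file system ("singularity build" is the newer CLI syntax).
    target = "%s/%s.simg" % (SHARED_FS, name)
    subprocess.check_call(["singularity", "build", target, definition_path])
    return target
```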
Image Manager is the key component of our platform. We designed a distributed and parallel architecture to reduce the workload on Image Manager caused by multiple users accessing Image Manager Server. As depicted in Figure 10, six features were designed as parallel workloads, allowing users to send requests simultaneously. On the login node, we deployed clients Client A to Client F; every client has a matching worker, designated Worker A to Worker F. Docker Image Create, Docker Image Share, and Docker Image Delete were designed as features for the Docker workers; Singularity Image Create, Singularity Image Share, and Singularity Image Delete were designed for the Singularity workers. Between clients and workers, we placed a Task Queue that lists user requests by task unit. When User A creates task ① and User B creates task ②, Image Manager receives task ① first and then task ②, according to the queue order. Likewise, tasks ③ and ④ requested by subsequent users are queued in that order and executed immediately after the preceding tasks are completed. Unlike the other features, Image List and Image Search do not use this workload distribution; they are designed separately and connect only to Image Metadata Storage.
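A minimal sketch of this client/worker split, using Celery with a Redis broker as in our implementation, is shown below. Queue names, the broker address, and the task bodies are assumptions chosen for illustration; they do not reproduce the platform's actual task definitions.

```python
# Sketch of the Task Queue design of Figure 10: one Celery task per feature,
# routed to a named worker queue. Broker address and queue names are assumed.
from celery import Celery

app = Celery("image_manager",
             broker="redis://master-worker:6379/0",
             backend="redis://master-worker:6379/1")


@app.task(name="docker.image.create")
def docker_image_create(user, template):
    # Worker side: build a Docker image from a verified template (details omitted).
    return "built %s for %s" % (template, user)


@app.task(name="singularity.image.create")
def singularity_image_create(user, template):
    # Worker side: build a Singularity image from a definition-file template.
    return "built %s for %s" % (template, user)


# Client side (login node): each request becomes one queued task routed to a
# specific worker queue, so requests from different users are served in order.
if __name__ == "__main__":
    result = docker_image_create.apply_async(
        args=["userA", "centos7-openmpi.dockerfile"], queue="worker_a")
    print(result.id)
```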
We designed an auto-scaling scheduler composed of scale-up and scale-down rules. As shown in Figure 11, users submit image requests from Login Node to Auto-scaling Group, which consists of a master worker node and several slave worker nodes. Redis-server runs on the Master Worker node and synchronizes the slave workers. Each worker automatically sends the created image to the shared mounted file system or to Docker Private Registry, depending on the platform type. To estimate the waiting time, the execution times of the queued image tasks are accumulated as latency; by comparing the execution time with the waiting time, the scheduler determines whether to scale the slave workers up or down.
Figure 12 presents the flowchart of the auto-scaling scheduler. We designed auto-scaling algorithms for each defined task queue that cannot be executed in parallel. When a user requests a task queue, the system performs the following determination steps, starting with Create Image Start. First, the system checks whether there are currently active tasks in the auto-scaling worker group using Check Task Active. If no tasks in the worker group are active, any worker is selected, the task is activated on it using Add Consumer, and the task is sent to the selected worker with Send Task using the routing key. When the task on the worker node finishes, Cancel Consumer deactivates the task, and if the node is not the master worker node, it is released using Remove Worker. The image-generation task queue then scales down and waits for the next user request.
However, if there are currently active tasks in the auto-scaling worker group, an arbitrary worker is first brought in through the following steps: a worker node is added, the task queue is activated using Add Consumer, and the task is sent by specifying the routing key. When the work is finished and the worker node is released, the group returns to the scaled-down state and waits for another request. If tasks are currently running on all workers, the worker with the smallest total waiting time is selected and the task is sent to it. The variables used in Equations (1) and (2) to obtain the worker with the minimum total waiting time are summarized in Table 2.
Equation (1) defines $\Phi_{WT}$, the total waiting time of the active tasks. After obtaining the list of currently running tasks, the time each task has been executing up to the present is calculated as $\Phi_{CT} - \Phi_{ST}$, where $\Phi_{ST}$ is the start time of the active task, $\Phi_{CT}$ is the current system time, and $n$ is the number of currently active tasks. $\Phi_{WT}$ is thus the sum of the latencies of the active tasks on each worker node that has active tasks. Finally, Equation (2) selects the worker $\Phi_{W}$ with the minimum total waiting time among the workers:

$$\Phi_{WT} = \sum_{i=1}^{n}\left(\Phi_{CT} - \Phi_{ST_i}\right) \qquad (1)$$

$$\Phi_{W} = \operatorname*{arg\,min}_{1 \le j \le N} \Phi_{WT_j} \qquad (2)$$
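A compact sketch of this selection rule follows: for every worker with active tasks, the elapsed times of its tasks are summed (Equation (1)) and the worker with the smallest total is chosen (Equation (2)). The data structure (a mapping from worker name to task start times) is an assumption made for illustration.

```python
# Worker-selection sketch for Equations (1) and (2); data layout is assumed.
import time


def total_waiting_time(task_start_times, now=None):
    """Equation (1): sum of (current time - start time) over active tasks."""
    now = time.time() if now is None else now
    return sum(now - st for st in task_start_times)


def select_worker(active_tasks_by_worker):
    """Equation (2): worker with the minimum total waiting time."""
    now = time.time()
    return min(active_tasks_by_worker,
               key=lambda w: total_waiting_time(active_tasks_by_worker[w], now))


# Example: worker "w2" has the smaller backlog and would receive the next task.
active = {"w1": [time.time() - 900, time.time() - 300],
          "w2": [time.time() - 120]}
print(select_worker(active))
```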
However, when these equations are applied to an actual cluster environment, the following limitations exist. We implemented one worker per node, so $N$ also stands for the number of Image Manager nodes. A higher value of $N$ is better, but it is limited by the number of network switch ports connecting nodes in the cluster configuration; considering the availability of our HPC resources, we tested $N$ up to 3. We also implemented one task per process, so $n$ also stands for the number of processes on the host. The maximum value of $n$ is the number of cores per node; however, because tasks are not executed completely independently but share the resources of the host, we created and tested 4 tasks in practice. Optimizing resource use with respect to $n$ is an area of further research.
3.3. Job Management
We presented the design of the job integration management system. After creating and uploading an image, the user can use Submit Job Request to schedule jobs, allocate computing resources, create containers, and execute HPC applications; the platform then automatically deletes the containers to release the allocated resources. Depending on the container platform type, Docker or Singularity, our system uses different job submission processes. The key aspect of job integration management is the integration with a traditional batch scheduler, so that traditional HPC users can also use our system. In addition, our system automates container creation, application execution, and resource release through Submit Job Request. The system provides three main features, i.e., Job Submit, Job List, and Job Delete, whose flowcharts are presented in the following sections.
Different container platforms require different schedulers: the Singularity platform can directly use the traditional HPC batch scheduler to submit jobs, whereas the Docker platform must be integrated with its default container scheduler. Job integration management in our system was designed to support both platforms. As shown in Figure 13, the user can submit jobs through Job Submit; because the job submission process varies by platform, we designed both Docker Job Submit and Singularity Job Submit. Job List, Job Delete, and History Delete were also designed.
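The sketch below illustrates the two submission paths in simplified form: Singularity jobs go straight to the PBSPro batch scheduler, while Docker jobs also pass through the Kubernetes container scheduler. The PBS directives, script contents, and the kubectl invocation are assumptions; the platform's actual integration package between PBSPro and Kubernetes is more involved than applying a prepared manifest.

```python
# Simplified job-submission dispatch; options and file contents are assumed.
import subprocess
import tempfile


def submit_singularity_job(image_path, command, nodes=1, ppn=24):
    # Singularity path: wrap the containerized application in a PBS script.
    script = (
        "#!/bin/bash\n"
        "#PBS -l nodes=%d:ppn=%d\n"
        "mpirun singularity exec %s %s\n" % (nodes, ppn, image_path, command)
    )
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    subprocess.check_call(["qsub", path])


def submit_docker_job(manifest_path):
    # Docker path: container objects are created through Kubernetes; here we
    # only apply a prepared manifest as a stand-in for the integration package.
    subprocess.check_call(["kubectl", "apply", "-f", manifest_path])
```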
Metering data management is implemented by Data Collector, which consists of Resource Measured Data Collector, Real-time Data Collector, and Job History Data Collector, as shown in Figure 14. Resource Measured Data Collector collects the resource usage data provided by the batch scheduler. Real-time Data Collector collects the current CPU and memory usage of containers running in the container platform group by sending requests with the sshpass command every 10 s while the container is running. Job History Data Collector organizes the measured data and real-time data associated with each JobID into a history dataset that can be reconstructed into history data for a given period. Sending requests with sshpass every 10 s creates overhead on both the metering node and the compute nodes; this could be improved by collecting logs over another communication protocol or by installing a log collection agent on the compute nodes that transmits to the metering node, which we leave as future work.
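To make the Real-time Data Collector concrete, the following is a minimal polling loop that queries each compute node over SSH every 10 s for per-container CPU and memory usage. The node names, the use of `docker stats` as the usage source, and the output handling are assumptions for illustration; the deployed collector writes into the metering store rather than printing.

```python
# Sketch of a 10 s polling collector; node list and parsing are assumed.
import subprocess
import time

COMPUTE_NODES = ["node01", "node02"]   # assumed compute node names


def sample_node(node):
    # "docker stats --no-stream" prints one usage line per running container.
    cmd = ["sshpass", "-e", "ssh", node,
           "docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemUsage}}'"]
    out = subprocess.check_output(cmd).decode()
    return [(node, line) for line in out.splitlines() if line]


def collect_forever(interval=10):
    while True:
        for node in COMPUTE_NODES:
            for record in sample_node(node):
                print(time.time(), record)   # in practice, stored in the metadata server
        time.sleep(interval)
```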
4. Platform Implementation
Our research goal was to develop a system that provides container-based HPCaaS in the cloud. To evaluate the system, we created a cluster environment and verified the serviceability of our container-based HPC cloud platform using supercomputing resources. As depicted in Figure 15, the platform was constructed based on the network configuration of the KISTI cluster. There are three types of networks: a public network connected by a 1G Ethernet switch (black line), a management network connected by a 10G Ethernet switch (black line), and a data network connected by an InfiniBand switch (red line). We constructed three Image Manager Server nodes as an auto-scaling group; these are connected to Docker Private Registry and to the GPFS parallel file system storage through the management network. As the figure shows, we configured two container platforms: Docker Container Platform (green), which uses the Calico overlay network for container network communication, and Singularity Container Platform (blue), which uses the host network. The data networks of both container platforms are connected to InfiniBand Switch, and the Job Submit Server is deployed with PBSPro and Kubernetes. For Docker Container Platform, the integrated PBSPro and Kubernetes scheduler creates jobs, whereas for Singularity Container Platform only PBSPro creates jobs. We also built Image Metadata Server and Job Metadata Server for storing and managing image metadata and job metadata, respectively.
Table 3 shows the installed software. We hold software licenses for PBSPro and GPFS and therefore used open source software compatible with them, chosen for ease of use and wide adoption. We installed MariaDB v5.5.52, a MySQL-compatible relational database, for Image Metadata Server, and MongoDB v2.6.12, a Not Only SQL (NoSQL) database, for Job Metadata Server. Job Metadata Server is better suited to a NoSQL database than a relational one for storing metering data on resource use, because NoSQL improves latency and throughput by providing highly optimized key-value storage for simple retrieval and appending operations. We selected MongoDB to implement the NoSQL database because it is based on the document data model, provides drivers for various programming languages, and offers a simple query syntax in Python. Job Submit Server and Image Submit Server, which implement the distributed system, were installed using a combination of Python v2.7.5, Celery v4.1, and Redis-server v3.2.3. PBSPro v14.1.0 was installed as the batch scheduler, and Kubernetes v1.7.4 as the container scheduler. Docker v17.05-ce, the latest version at the time, was installed with Calico v2.3 to configure the overlay network with Kubernetes for Docker containers. For Singularity Container Platform, we installed Singularity 2.2, the most stable version at the time. PLSICLOUD is a user command line tool with which users submit image and job requests; it was installed on Image Submit Server and Job Submit Server.
5. Evaluation
In this study, we evaluated the platform against essential attributes, i.e., on-demand self-service, rapid elasticity and scalability, auto-provisioning, workload management, multitenancy, portability of applications, and performance, which address the requirements of HPC users and administrators listed in Table 1.
5.1. On-Demand Self-Service
On-demand self-service means that a consumer can unilaterally provision computing capabilities, such as computing time and storage, as needed, automatically, and without requiring human interaction with each service provider [20]. HPC users can self-service both image resources and computational resources, which are provided automatically by the service provider, as shown in Figure 7. Users can request their own images and submit jobs that allocate computational resources through the plsicloud_create_image and plsicloud_submit_job commands. We also provide on-demand billing management on this platform, which enables a pay-as-used model. By integrating the tracejob command of PBSPro, we implemented a per-user resource usage calculation, shown in Figure 16. As the figure shows, we provide average CPU utilization, total number of used processes, total used wall time, CPU time, memory size, and virtual memory size for on-demand billing management. Based on this resource usage information, our supercomputing center can apply a pricing policy for the pay-as-used model.
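As an illustration of this approach, the sketch below pulls per-job usage out of tracejob output and applies a toy pricing rule. The exact accounting fields emitted by tracejob depend on the PBSPro version, so the field names and the pricing constants here are assumptions rather than the platform's billing code.

```python
# Sketch of tracejob-based usage extraction for pay-as-used accounting;
# field names and the pricing rule are assumptions.
import re
import subprocess

USAGE_RE = re.compile(r"resources_used\.(\w+)=(\S+)")


def job_usage(job_id):
    out = subprocess.check_output(["tracejob", str(job_id)]).decode()
    usage = {}
    for key, value in USAGE_RE.findall(out):
        usage[key] = value          # e.g., cput, mem, vmem, walltime, ncpus
    return usage


def charge(usage, price_per_cpu_hour=0.05):
    # Toy pricing rule: bill wall-clock hours times allocated CPUs.
    h, m, s = (int(x) for x in usage.get("walltime", "00:00:00").split(":"))
    hours = h + m / 60.0 + s / 3600.0
    return hours * int(usage.get("ncpus", 1)) * price_per_cpu_hour
```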
5.2. Rapid Elasticity and Scalability
Rapid elasticity and scalability mean that capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand [20]. One important factor affecting rapid elasticity on our platform is the rapid movement of image files. When a user submits a job request for a resource with a required image, the system must quickly move the image created in Image Temporary Storage to the compute nodes. For Singularity, no special handling is needed, because the image is a single file that can be run directly as a container. The problem is deploying the Docker image. Because Docker uses a layered image structure managed by the Docker engine, the image must be compressed into a file to be moved to the compute nodes; considering the compression, transfer, and decompression times, this is very inefficient. To solve this problem, we built Docker Private Registry, connected to the management network, to store images. We measured the time reduction with and without Docker Private Registry for handling a 1.39 GB image: with Docker Private Registry, the transfer took 122 s, which is 20 s faster than without it.
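The following sketch shows one way such a comparison could be scripted: timing the save/compress/copy path against a tagged push through the private registry. The image name, registry address, target host, and the omission of the final load/pull step on the compute node are all simplifying assumptions; it is not the measurement script used for the 1.39 GB result.

```python
# Rough timing sketch for image movement; names and hosts are assumptions.
import subprocess
import time


def timed(cmd):
    start = time.time()
    subprocess.check_call(cmd, shell=True)
    return time.time() - start


image = "hpc-app:latest"
registry_image = "registry.local:5000/hpc-app:latest"

# Path 1: archive the layered image and copy it to a compute node
# (a "docker load" on the target would still be needed afterwards).
t_archive = timed("docker save %s | gzip > /tmp/hpc-app.tar.gz" % image)
t_copy = timed("scp /tmp/hpc-app.tar.gz node01:/tmp/")

# Path 2: push through the private registry on the management network.
t_registry = timed("docker tag %s %s && docker push %s"
                   % (image, registry_image, registry_image))

print("save+copy: %.1f s, registry push: %.1f s" % (t_archive + t_copy, t_registry))
```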
To reduce the workload of creating images, we developed an auto-scaling scheduler with three workers implemented as a group on our platform. We applied this custom algorithm and compared the waiting time of each task with and without auto-scaling. With one worker, the second task must wait for the previous task to complete; the third task must wait 2059 s, almost twice the first waiting time; and the fourth task in the queue waits 3083 s, giving a total waiting time of 6166 s. With the auto-scaling algorithm, the three workers run in parallel, so the fourth task needs only 1080 s and the total waiting time is only 1200 s.
5.3. Auto-Provisioning
Automatic provisioning of resources is an integral part of our platform. In a containerized HPC cloud environment, the provisioning process requires not only an image and a container but also their applications; more specifically, jobs need to be provisioned at the application level rather than at the container level. To address this, we evaluated the auto-provisioning of image and job resources by defining their life cycles.
Figure 17 shows the auto-provisioning life cycle of the image process. Once a user requests an image build, the configuration for image creation is verified and the state becomes Creating. After image creation is complete, the state changes to Created and the image is automatically uploaded to Docker Private Registry or the shared file system; during this process, the image status is displayed as Uploading, and once the upload finishes, the state changes to Uploaded. If an error occurs in the Creating, Uploading, or Deleting state, the image is automatically placed in the Down state. Images in the Created, Uploaded, and Down states can be deleted, and deleted images are automatically expunged from the repository.
Figure 18 shows the auto-provisioning life cycle of the job process. Once a user submits a job, containers are executed using a private or shared image; in this process, the state changes from Building to Creating. After creation is complete, the application is executed, and the state changes to Created and then automatically to Running. When the application execution finishes, the job daemon is automatically updated to the Finished state to complete the operation, and the allocated resources are released by expunging the container, including its applications, in the Expunged state. Jobs can be deleted in the Created, Running, Finished, and Down states. If a forced request for resource release is received, the state is shown as Expunged and the allocated resources are released.
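The two life cycles can be summarized as simple transition tables, as in the sketch below, against which a provisioning daemon could validate state changes. The dictionaries merely restate the states described above; how the actual platform enforces transitions is not shown here and the error paths are simplified assumptions.

```python
# Transition-table encoding of the life cycles in Figures 17 and 18 (sketch).
IMAGE_TRANSITIONS = {
    "Creating":  ["Created", "Down"],
    "Created":   ["Uploading", "Deleting"],
    "Uploading": ["Uploaded", "Down"],
    "Uploaded":  ["Deleting"],
    "Down":      ["Deleting"],
    "Deleting":  ["Expunged", "Down"],
}

JOB_TRANSITIONS = {
    "Building": ["Creating", "Down"],
    "Creating": ["Created", "Down"],
    "Created":  ["Running", "Expunged"],
    "Running":  ["Finished", "Expunged"],
    "Finished": ["Expunged"],
    "Down":     ["Expunged"],
}


def can_transition(table, current, nxt):
    # A provisioning daemon would reject any state change not listed here.
    return nxt in table.get(current, [])


assert can_transition(JOB_TRANSITIONS, "Running", "Finished")
assert not can_transition(IMAGE_TRANSITIONS, "Creating", "Uploaded")
```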
5.4. Workload Management
Three workloads are considered in our platform: image tasks, container execution, and application execution. Workload distribution and parallelization for image tasks are implemented by defining clients, task queues, and workers with Celery as the distributed framework and Redis as the message broker. Workloads for container and application execution are handled by the existing batch scheduler. We evaluated the management of these workloads by integrating commands of the container scheduler (Kubernetes) and the batch scheduler (PBSPro).
Figure 19 shows the resulting resource monitoring, which includes information about containers and batch jobs obtained using the plsicloud my_resource command.
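A minimal sketch of such a combined view is given below: batch jobs are listed via qstat and containers via kubectl, side by side for one user. The label selector, the text parsing, and the output format are assumptions and do not reproduce the actual plsicloud command.

```python
# Sketch of a combined "My Resource View" built from qstat and kubectl output;
# the owner label and parsing are assumptions.
import subprocess


def batch_jobs(user):
    out = subprocess.check_output(["qstat", "-u", user]).decode()
    return [line for line in out.splitlines() if user in line]


def containers(user):
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-l", "owner=%s" % user, "--no-headers"]).decode()
    return out.splitlines()


def my_resource_view(user):
    print("Batch jobs:")
    for line in batch_jobs(user):
        print(" ", line)
    print("Containers:")
    for line in containers(user):
        print(" ", line)


my_resource_view("userA")
```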
5.5. Multitenancy
Multitenancy means that the provider's computing resources are pooled to serve multiple consumers, with resources dynamically assigned and reassigned according to consumer demand [20]. Our platform provides each user with an isolated environment over shared resource pools. Figure 20 shows the concept of multitenancy as implemented in our platform. Image metadata and job metadata are stored in two different types of databases, MySQL and NoSQL, according to the data characteristics of the resources. PLSICLOUD is used to obtain the information served to each user; the development of the PLSICLOUD CLI tool allows the cloud multitenancy model to be evaluated.
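The sketch below illustrates per-user metadata isolation across the two stores: image records are read from the MariaDB/MySQL database and job records from MongoDB, each filtered by user name. The host names, credentials, database, table, and field names are assumptions chosen for illustration, not the platform's actual schema.

```python
# Sketch of per-user queries against the two metadata stores; schema assumed.
import pymysql
from pymongo import MongoClient


def user_images(user):
    conn = pymysql.connect(host="image-metadata", user="plsicloud",
                           password="***", database="images")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT name, state FROM image WHERE owner=%s", (user,))
            return cur.fetchall()
    finally:
        conn.close()


def user_jobs(user):
    client = MongoClient("mongodb://job-metadata:27017/")
    return list(client["metering"]["jobs"].find({"owner": user}))


print(user_images("userA"))
print(user_jobs("userA"))
```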
5.6. Portability of Applications
We evaluated the portability of applications by providing containerized applications. As shown in Table 4, we verified the containerization of frequently used container versions, OS versions, and compilers together with the parallel libraries they depend on, to meet user requirements. We tested serviceability by containerizing HPL, a benchmark that measures HPC system performance, which is installed automatically, from the OS up to the application, based on our templates [21]. Currently, our system supports Docker v17.06.1-ce and the stable Singularity v2.2, with CentOS 6.8 and 7.2, and OpenMPI v1.10.6 and v2.0.2. Each configuration includes the mathematical library GotoBLAS2 built on the corresponding parallel library.
5.7. Performance Evaluation with MPI Tests
We constructed two nodes to run point-to-point MPI parallel tests and measured different types of bandwidth and latency with the benchmark tool osu-micro-benchmarks v5.4 [22]. The latency test measures the latency of sending and receiving messages of varying sizes between the two nodes using a ping-pong test. The latency results for bare-metal, singularity, and docker-calico are almost identical and are not discussed further in this paper [21]. The bandwidth results (Figure 21) include the bandwidth, bi-directional bandwidth, and multiple-bandwidth tests. As the figure shows, the peak bandwidth occurs within a certain message size interval, and this interval is almost the same for bare-metal, singularity, and docker-calico, except for some instability in docker-calico.
Results between two nodes are insufficient to evaluate the performance of the container-based HPC platform in an HPC environment. Therefore, we constructed 8 nodes with 16, 32, 64, and 72 cores and measured latency for various MPI blocking collective operations (barrier in Figure 22, gather in Figure A1, all-gather in Figure A2, reduce in Figure A3, all-reduce in Figure A4, reduce-scatter in Figure A5, scatter in Figure A6, all-to-all in Figure A7, and broadcast in Figure 23) using the same osu-micro-benchmarks tool.
In Figure 22, the barrier latency of singularity is almost the same as that of bare-metal, except with 64 cores, while docker-calico shows a noticeable gap relative to both singularity and bare-metal. For the remaining operations, as the number of cores increases on the same number of nodes, the overall results for bare-metal, singularity, and docker-calico are almost the same. However, in Figure 23, the results of docker-calico are worse than those of the other two cases in a specific message size range.
6. Conclusions and Future Work
The container-based approach to the HPC cloud is expected to ensure efficient management and use of supercomputing resources, which are areas that present challenges in the HPC field. This study demonstrates the value of technology convergence by attempting to provide users with a single cloud environment through integration with container-based technology and traditional HPC resource management technology. It provides container-based solutions to the problems of HPC users and administrators, and these can be of practical assistance in resolving issues such as complexity, compatibility, application expansion, pay-as-used billing management, cost, flexibility, scalability, workload management, and portability.
The deployment of a container-based HPC cloud platform remains a challenging task. Thus far, our proposed architecture has defined and used resource measurements mainly for KISTI's compute-intensive HPC jobs. In future work, we must consider HTC jobs, network-intensive jobs, and GPU-intensive jobs, especially for machine learning and deep learning applications, and add measurements for the new resources that characterize these jobs. Another potential research task is to automate the process of creating and evaluating templates for HPC service providers acting as service creators; in the current platform, the degree of automation is not generalized enough to accommodate the application characteristics introduced whenever a new template is added.
In the current job integration management component, additional development is required to interwork the batch scheduler for general jobs with Kubernetes for container jobs. In our platform, the user cannot access the container directly; container creation, application execution, and container deletion are all automated. This issue can be addressed by developing a package that links the Application Programming Interfaces (APIs) of Kubernetes and the existing batch job scheduler, so that the user is connected to the machine on which the job is running while submitting the job according to their requirements. The evaluation of our platform was conducted in a small cluster environment; if the platform is applied to a large cluster environment, further evaluation of availability, deployment efficiency, and execution efficiency will be needed. We hope that our proposed architecture will contribute to the widespread deployment and use of container-based HPC cloud services in the future.