Article

Modelling the Impact of Cloud Storage Heterogeneity on HPC Application Performance

Faculty of Engineering, Universidad Autonoma de Occidente, Cali 760030, Colombia
* Author to whom correspondence should be addressed.
Computation 2024, 12(7), 150; https://doi.org/10.3390/computation12070150
Submission received: 15 May 2024 / Revised: 22 June 2024 / Accepted: 11 July 2024 / Published: 19 July 2024
(This article belongs to the Section Computational Engineering)

Abstract

Moving high-performance computing (HPC) applications from HPC clusters to cloud computing clusters, also known as the HPC cloud, has recently been proposed by the HPC research community. Migrating these applications from the former environment to the latter can have an important impact on their performance, due to the different technologies used and the suboptimal use and configuration of cloud resources such as heterogeneous storage. Probabilistic models can be applied to predict the performance of these applications and to optimise them for the new system. Modelling the performance in the HPC cloud of applications that use heterogeneous storage is a difficult task, due to the variations in performance. This paper presents a novel model based on Extreme Value Theory (EVT) for the analysis, characterisation and prediction of the performance of HPC applications that use heterogeneous storage technologies in the cloud and high-performance distributed parallel file systems. Unlike standard approaches, our model focuses on extreme values, capturing the true variability and potential bottlenecks in storage performance. Our model is validated using return level analysis to study the performance of representative scientific benchmarks running on heterogeneous cloud storage at a large scale and gives prediction errors of less than 7%.

1. Introduction

High-performance computing (HPC) can be defined as the use of supercomputers to efficiently solve complex computational problems [1], while cloud computing is defined as ubiquitous and on-demand access to configurable computing resources [2]. Although these two technologies were initially designed for different purposes, recent attempts to integrate them have given rise to the new concept of the HPC cloud. Netto et al. [3] defined the HPC cloud as “the use of cloud resources to run HPC applications”. Both new challenges and new opportunities have emerged from this work.
From around 2009 onwards, some HPC users started to consider the cloud as a cost-effective alternative to high-cost HPC clusters. The idea of a pay-on-demand business model that was flexible and provided more customer control over resources was tempting [4,5]. Cloud computing allows for the flexible provision of computing resources such as CPUs, memory, storage, networks and graphics processing units. These resources are elastic, meaning that they can be scaled according to the requirements of each application. Instantaneous access to computational resources can help users to deploy or test their application when they need it [6].
The deployment of HPC applications over cloud computing clusters presents several challenges that have yet to be resolved. One potential problem concerns storage systems, as cloud clusters do not use the same types of storage systems as HPC clusters. Storage systems can also be considered an issue in the adoption of cloud systems for HPC applications [3,7,8]. Anticipating the performance of large-scale applications in heterogeneous storage systems is challenging. Technical documentation on aspects such as the throughput and latency of storage technologies is not sufficient to predict the performance of HPC applications, given the impacts of multiple sources of interference.
In this paper, we describe a statistical model based on Extreme Value Theory (EVT) that can be used to characterise and predict the performance of HPC applications that rely on heterogeneous storage in cloud systems. Our paper makes the following contributions:
  • We present an EVT-based model for characterising the performance of HPC applications that make use of heterogeneous storage technologies in cloud computing systems.
  • We develop a method for predicting the performance of HPC applications based on return level analysis with different numbers of storage nodes, which can inform storage algorithms and hence improve the performance of applications.
  • We evaluate the proposed model by using it to predict the performance of HPC applications running on heterogeneous cloud storage at large scales.
This paper is organised as follows: Section 2 describes the main concepts associated with heterogeneous storage in cloud computing infrastructures, parallel file systems for supporting HPC applications in the cloud, and Extreme Value Theory. In Section 3, we describe our approach to modelling the impact of using heterogeneous cloud storage on the performance of HPC applications, and we validate this model in Section 4 using return level analysis. Section 5 describes some of the main related works. Finally, we present conclusions and suggestions for future work in Section 6.

2. Background

In this section, we describe the storage systems currently used in general-purpose cloud computing environments, and show how these can be leveraged efficiently in HPC. We then focus on BeeGFS as a flexible and scalable alternative for the deployment of a parallel file system that is optimised for HPC on cloud infrastructure. Finally, we describe key concepts related to the modelling of application performance using EVT.

2.1. Heterogeneous Storage Systems in Cloud Computing

In cloud computing clusters, the performance of data-intensive applications is limited by disk data transfer rates, among other factors. To mitigate the impact on performance, cloud systems that offer hierarchical and heterogeneous storage architectures are becoming commonplace. The integration of different storage alternatives such as solid-state drives (SSDs), hard disk drives (HDDs), or even RAMDISK (a block of RAM used as volatile storage) may improve the performance of applications by taking advantage of the characteristics of each type of storage [9].
Table 1 presents the storage technologies and services offered by the top three cloud providers. It can be seen that cloud providers offer a variety of heterogeneous storage devices, each of which has certain particularities. For instance, Azure provides disks with high throughput, high IOPS and low latency (Ultra disk), but also provides a low-cost disk with standard throughput, IOPS and latency (Standard HDD). In addition to a general-purpose SSD, AWS provides a type of SSD with configurable IOPS (io2-io1), and another optimised for speed. Each of these disks is designed for special cases, and offers different capabilities such as volume size, maximum IOPS and maximum throughput. GCP also provides different types of disks and offers zonal and regional replication, as well as high-performance disks (Extreme Persistent Disk). The three main types of storage service are object, block and file storage, which make use of the available disk types.

2.2. Leveraging Heterogeneous Cloud Storage for HPC

Modern HPC systems are designed to provide high performance for scientific applications by offering low-latency networks such as InfiniBand and optimised distributed parallel file systems. LustreFS [10] and BeeGFS [11] are two of the most commonly used parallel file systems in the top500 HPC clusters. Both of these parallel systems are being increasingly deployed on cloud platforms [11,12,13], with the aim of achieving fast access to large amounts of data in cloud clusters.
Lustre and BeeGFS help to maximise read and write throughput for large amounts of data in cloud clusters. In this work, we focus on the BeeGFS parallel file system, as it has certain features that make it particularly suitable for cloud environments, such as its simple installation. It is also available for different Linux distributions, is hardware agnostic, and supports high concurrency. An additional benefit is that the management service does not need to be hosted on a dedicated machine.

BeeGFS Parallel File System

BeeGFS is a client-server model file system that was developed with a focus on performance and scalability. Figure 1 shows the architecture of BeeGFS, which is composed of three types of nodes: a management server, a metadata server, and an object storage server. The management server is the node in charge of the configuration of the BeeGFS file system and its other components; there is usually only one management server in a BeeGFS configuration. The metadata server contains the metadata target, a storage device that carries the structure of the file system and the file names. This server also manages indexes and namespaces. The object storage server is responsible for receiving the data sent from the client and storing it in the object storage target (OST). A given BeeGFS configuration may involve a high number of OSTs, and BeeGFS will try to store data efficiently by splitting the workload among them.
BeeGFS was designed for easy installation and management. Two important aspects of this simplicity are that BeeGFS does not require a kernel patch and that it ships with graphical Grafana [14] dashboards. BeeGFS also provides a striping feature, for which the chunk size, the number of storage targets to use, and the heterogeneous storage support can be specified.
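For illustration only, the following toy sketch (not BeeGFS code; the names `stripe`, `chunk_size` and `num_targets` are hypothetical) shows the striping idea of splitting a file into fixed-size chunks and placing them round-robin across a configurable number of storage targets, mirroring the tunables mentioned above.

```python
# Toy illustration of striping: split data into chunks and assign them
# round-robin across storage targets. Conceptual sketch only, not BeeGFS code.
def stripe(data: bytes, chunk_size: int, num_targets: int):
    placement = {t: [] for t in range(num_targets)}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        target = (i // chunk_size) % num_targets   # round-robin placement
        placement[target].append(chunk)
    return placement

# Example: a 4 MiB file, 512 KiB chunks, 4 storage targets -> 2 chunks per target
layout = stripe(b"\x00" * (4 * 1024 * 1024), 512 * 1024, 4)
print({t: len(chunks) for t, chunks in layout.items()})
```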

2.3. Modelling Extreme Values

EVT is a field of statistics that relates to the behaviour of exceptional or extreme values of a set of random variables, i.e., those which deviate from the median of the probability distribution [15]. EVT has been applied in areas such as statistical quality control [16], finance [17,18,19], transportation [20], study of calcium content [21] and in hydrology to calculate the probability of floods in a certain period [22,23,24]. In computer science, EVT has been used to predict HPC applications’ performance [25] and the impact of interference sources on HPC applications’ performance [26,27] on HPC clusters.
Our work focuses on the impact of storage operations on HPC applications provisioned in cloud environments. Standard statistical analyses concentrate on average performance and may overlook significant outliers that affect these applications. EVT models precisely these outliers and provides predictions of worst-case performance. Measuring extreme execution times is crucial because it captures the system's true variability: the extremes may reveal hidden bottlenecks or early signs of future issues. By understanding them, resources can be provisioned proactively for peak loads, and EVT also quantifies the likelihood of such events, enabling systems to be designed with sufficient capacity. This focus on extremes, rather than averages alone, leads to more reliable and efficient resource provisioning.
EVT was developed based on results by Fisher and Tippett [28] and Gnedenko [29]. These works showed that the distribution of the maximum of a set of independent, identically distributed (i.i.d.) random variables with an unknown underlying distribution converges to one of three possible asymptotic distributions: Fréchet, Gumbel or Weibull. Jenkinson [30] introduced the generalised extreme value (GEV) distribution, which combines these three families. Equation (1) gives the cumulative distribution function (cdf) of the GEV distribution.
$F(x \mid \xi, \mu, \sigma) = \exp\left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\}$ (1)
Equation (1) has three parameters: scale $\sigma$, location $\mu$ and shape $\xi$, where $-\infty < \xi < \infty$, $-\infty < \mu < \infty$ and $\sigma > 0$. In this equation, the scale parameter defines the dispersion of the distribution or its variability, whereas the location parameter defines where the distribution is centred on the real axis, and the shape parameter, also known as the extreme value index, defines whether the distribution is a Gumbel ($\xi = 0$), Fréchet ($\xi > 0$) or Weibull ($\xi < 0$) distribution [31] (see Figure 2).
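For readers who want to evaluate Equation (1) numerically, a minimal sketch using SciPy is shown below; note that SciPy's genextreme parameterises the shape as c = −ξ, so a Weibull-type distribution (ξ < 0) corresponds to c > 0. The example reuses the Fallocate/RAMDISK parameters later reported in Table 3.

```python
import numpy as np
from scipy.stats import genextreme

def gev_cdf(x, xi, mu, sigma):
    """GEV cdf of Equation (1); SciPy's shape convention is c = -xi."""
    return genextreme.cdf(x, -xi, loc=mu, scale=sigma)

# Example with the Fallocate/RAMDISK parameters from Table 3 (seconds).
x = np.linspace(300.0, 1500.0, 5)
print(gev_cdf(x, xi=-0.16, mu=566.95, sigma=154.47))
```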
To fit the GEV distribution, samples of extreme values from the measured values must be obtained. There are two common methods for carrying out this process: peaks over threshold (POT) and the block maxima method (BMM). In the POT approach, a threshold is established and samples are selected from above that threshold. BMM consists of dividing the measured values into blocks of size n and selecting the maximum value from each block. BMM has been widely used in hydrology, where the block size is set to obtain the GEV parameters for seasonal variation over several years [22].
Three-parameter estimation has been widely studied in relation to the GEV [32]. The best-known methods are maximum likelihood estimation (MLE) [33] and the method of L-moments (LMOM) [34]. MLE is the method most frequently used to estimate GEV parameters in conjunction with BMM; its main advantages are the asymptotic normality of its estimators and its ability to estimate parameters when the form of the underlying distribution is known. According to Smith [35], to establish the asymptotic property of the GEV and obtain suitable estimators, $\xi$ must lie in the range $(-0.5, \infty)$. If $\xi$ lies in $(-1, -0.5)$, estimators can be obtained but the asymptotic property is not established. Finally, if $\xi$ lies in $(-\infty, -1)$, estimators cannot be obtained.
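A minimal sketch of the two sampling schemes (block maxima and peaks over threshold) followed by an MLE fit, assuming SciPy is available and using synthetic data rather than the measurements in this paper:

```python
import numpy as np
from scipy.stats import genextreme

def block_maxima(samples, block_size):
    """Block maxima method (BMM): keep the maximum of each block of size block_size."""
    n_blocks = len(samples) // block_size
    blocks = np.asarray(samples[: n_blocks * block_size]).reshape(n_blocks, block_size)
    return blocks.max(axis=1)

def peaks_over_threshold(samples, threshold):
    """Peaks over threshold (POT): keep every sample above a fixed threshold."""
    samples = np.asarray(samples)
    return samples[samples > threshold]

# Synthetic storage times (seconds), for illustration only.
rng = np.random.default_rng(0)
times = rng.gamma(shape=4.0, scale=100.0, size=800)

maxima = block_maxima(times, block_size=8)       # 100 block maxima
c, loc, scale = genextreme.fit(maxima)           # MLE fit of the GEV
print(len(maxima), round(-c, 2))                 # -c is the shape xi in our notation
```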
Once the GEV model has been determined and the parameters have been calculated, it is possible to calculate the return level values of the extremes. A return level is defined as the value that will be exceeded on average only once for every N samples, or for every N blocks of the distribution when using the block maxima method [36]. We can calculate this return value as:
$RL = F^{-1}(P)$ (2)
In Equation (2), $F$ represents the GEV distribution and $P = 1 - 1/i$, where $i$ is the return period, defined as the average length of time between events of the same or greater magnitude. Return level analysis has historically been applied more often in domains such as finance [37] and hydrology [38] than in computer science [26].
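Equation (2) amounts to evaluating the GEV quantile function at $P = 1 - 1/i$; a minimal sketch (again using SciPy's sign convention for the shape) is:

```python
from scipy.stats import genextreme

def return_level(xi, mu, sigma, return_period):
    """Return level RL = F^{-1}(1 - 1/i) of a GEV(xi, mu, sigma) for return period i."""
    p = 1.0 - 1.0 / return_period
    return genextreme.ppf(p, -xi, loc=mu, scale=sigma)

# Level exceeded on average once every 100 blocks, using the
# Fallocate/RAMDISK parameters from Table 3.
print(return_level(xi=-0.16, mu=566.95, sigma=154.47, return_period=100))
```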

3. Modelling Heterogeneous Cloud Storage Impact on HPC Application Performance

This section describes our proposed stochastic model for the analysis of heterogeneous storage performance and the way in which we estimate the parameters for this model, which allows us to explore the impact of heterogeneous storage on HPC cloud systems. Unlike previous approaches, our model leverages Extreme Value Theory (EVT) to address the variability and extremities in storage performance without assuming any a priori distribution of the storage times. This novel application of EVT provides a more flexible and accurate representation of storage performance in parallel distributed systems such as BeeGFS, thereby offering new insights into the performance dynamics of HPC applications.

3.1. Modelling Approach

The storage time in distributed parallel file systems such as BeeGFS is dominated by the time taken by the slowest storage node in the cluster. We use EVT to model the data storage performance in this type of system. To do this, we assume that the time that a distributed parallel file system takes to store each chunk of a file is i.i.d.
We use BMM to model the extremes. The block and sample sizes must be large enough to obtain a good model fit. Several methods have been proposed for testing the goodness of fit of a model [39,40]; these studies present approaches for selecting the block size used in BMM and the sample size, and suggest techniques for testing the asymptotic property using graphs or by analysing the estimated parameter values. A rule of thumb for testing the asymptotic property was given in Section 2.3.
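The following short simulation (purely illustrative, with synthetic lognormal chunk times) mirrors this modelling view: the per-file storage time is taken as the maximum of the per-node chunk times, which is exactly the block maximum used in the rest of the paper.

```python
import numpy as np

# Illustrative simulation of the modelling assumption in Section 3.1: the
# time to store a striped file is dominated by the slowest storage node,
# i.e. the maximum of the (assumed i.i.d.) per-chunk storage times.
rng = np.random.default_rng(1)
n_runs, n_nodes = 100, 8                                                  # 100 blocks of size 8
chunk_times = rng.lognormal(mean=6.0, sigma=0.3, size=(n_runs, n_nodes))  # synthetic seconds
file_times = chunk_times.max(axis=1)                                      # one block maximum per run
print(file_times[:5].round(1))
```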

3.2. Estimating the Model Parameters

In our block maxima formulation, the block size is the number of storage nodes configured in BeeGFS, and we sample the execution time of the slowest node. After selecting a set of block maxima, we need to estimate the parameters of the GEV distribution. Henwood et al. [41] suggested a minimum of 60 blocks; we use 100 blocks and follow a method based on the normality test suggested in [40] to determine whether the values are normally distributed at a 95% confidence level and to avoid biased estimations. We then apply MLE to estimate the GEV parameters.
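A sketch of this estimation pipeline is given below; the Shapiro–Wilk call is only a placeholder for the normality test of [40], whose exact procedure is not reproduced here, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import genextreme, shapiro

def estimate_gev_params(slowest_node_times, n_blocks=100):
    """Estimate GEV parameters as in Section 3.2.

    Each entry of slowest_node_times is the execution time of the slowest of
    the storage nodes in one run, i.e. one block maximum. The normality check
    below is a placeholder (Shapiro-Wilk); the paper defers the exact test to [40].
    Returns shape, location and scale in the paper's notation.
    """
    maxima = np.asarray(slowest_node_times, dtype=float)
    assert len(maxima) >= n_blocks, "need at least n_blocks block maxima"
    maxima = maxima[:n_blocks]
    _, p_value = shapiro(maxima)              # placeholder check at the 95% level
    c, loc, scale = genextreme.fit(maxima)    # maximum likelihood estimation
    return {"xi": -c, "mu": loc, "sigma": scale, "normality_p": p_value}
```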

3.3. Predicting Performance on Heterogeneous Storage

Predicting the performance of applications in cloud environments is essential to achieve efficient resource provisioning. We use return level analysis and our model to predict the performance of scientific applications using cloud infrastructure at scale. After obtaining estimations for the three GEV parameters, we can use return level analysis based on Equation (2), as described in Section 2.3. In our model, the return period in Equation (2) is the block number used in the block maxima method.
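As a sketch of this prediction step, one plausible reading (our assumption; the exact mapping between cluster scale and return period is not spelled out here) is to treat an N-node configuration as N/8 blocks of the fitted size and evaluate the return level at that period:

```python
from scipy.stats import genextreme

def predict_storage_time(xi, mu, sigma, n_nodes, fitted_block_size=8):
    """Predict storage time at a larger scale via return level analysis.

    Assumption (not stated in the text): an n_nodes configuration is treated as
    n_nodes / fitted_block_size blocks of the fitted size, and the prediction is
    the return level at that return period.
    """
    return_period = n_nodes / fitted_block_size
    p = 1.0 - 1.0 / return_period
    return genextreme.ppf(p, -xi, loc=mu, scale=sigma)

# Predictions for the scales evaluated in Section 4 (Fallocate/RAMDISK fit).
for n in (16, 32, 64, 128):
    print(n, round(predict_storage_time(-0.16, 566.95, 154.47, n), 1))
```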

4. Validation of the Model

In this section, we present a validation of our proposed model using representative scientific benchmarks. We study the performance of these benchmarks running on heterogeneous cloud storage, and use our model to predict performance at larger scales. The results are validated against simulations and observed values. To collect the initial data to fit our GEV model, we configured a BeeGFS cluster consisting of eight nodes. We used a block size equal to the number of nodes to apply BMM and MLE for GEV parameter estimation. Subsequently, we expanded the cluster up to 16, 32, 64 and 128 nodes to collect performance data at those scales and compare against the results of the return level analysis.

4.1. Experimental Setup

For our experiments, we used nodes from CloudLab [42], a project of the University of Utah (Salt Lake City, UT, USA), Clemson University (Clemson, SC, USA), the University of Wisconsin–Madison (Madison, WI, USA), the University of Texas at Austin (Austin, TX, USA), the University of Massachusetts Amherst (Amherst, MA, USA), and US Ignite (Washington, DC, USA). Each of the configured nodes had three available storage targets (RAMDISK, SSD, HDD) (see Figure 3). For these experiments, we used c220g2 nodes. Each of these nodes had two Intel E5-2660 v3 10-core CPUs at 2.60 GHz (Haswell EP), 160 GB of ECC memory (10 × 16 GB DDR4 2133 MHz dual-rank RDIMMs), one Intel DC S3500 480 GB 6G SATA SSD, two 1.2 TB 10K RPM 6G SAS SFF HDDs, and a dual-port Intel X520 10 Gb NIC (PCIe v3.0, eight lanes); frequency scaling was disabled. The operating system was CentOS 7 with kernel version 3.10.0-1127.19.1.el7.x86_64.

4.1.1. Benchmarks

We selected several representative benchmarks (see Table 2) to validate our model. These benchmarks replicate fundamental parts of applications and simulate their behaviour:
  • Iozone is a tool used for the analysis of file systems. It includes different operations such as write, read, re-write, and re-read, and allows for latency and bandwidth analysis. We use the write operation with a single-stream measurement so that BeeGFS receives the entire file as a single unit and then splits it over all the storage nodes.
  • Fallocate is a Linux utility that is widely used as a file system benchmark, as it can preallocate space for files of a specific size. It first checks whether the file system has sufficient free space and then reserves that space for the file.
  • BT is part of the NAS Parallel Benchmarks (NPB) and is extensively used for testing HPC clusters. It solves a highly configurable block-tridiagonal (BT) problem using MPI.
  • PIOS is a test tool created to act as an I/O simulator for file systems. It simulates a load from many clients generating I/O in the file system. Due to its parallel nature, PIOS can write to the same file or to different files at the same time; in this study, it is used to write to a single file only.

4.1.2. Data Collection

For data collection, we followed an experimental methodology based on the recommendations in [49] to avoid erroneous measurements. These include record-keeping practices, such as adding labels and metadata to the collected data so that they can be quickly retrieved and checked later. We used four benchmarks (Fallocate, Iozone, BT_C, and PIOS) and three types of storage (RAMDISK, SSD, and HDD), and aimed to ensure that each run of each benchmark had the same testing conditions in the cluster. We therefore developed a script to sample the execution time of the slowest node for each benchmark, repeating every run 100 times and collecting all the possible benchmark-to-storage times at each iteration. Here, the benchmark-to-storage time is the execution time for a node to store its piece of data; in our configuration, we record the time of the slowest node.

4.2. Estimation of GEV Parameters

In order to estimate the GEV parameters for each benchmark, we used BMM and MLE with 100 blocks and a block size of eight. Table 3 shows the parameters obtained with this approach.
The results in Table 3 allow us to determine which of the three GEV types gives the best fit. They indicate that all the distributions are of the Weibull type ($\xi < 0$). The shape parameters in all cases are greater than −0.5, thus meeting the asymptotic condition for the GEV distribution. The runtime distributions of the selected benchmarks using BeeGFS on heterogeneous storage devices show similar tailedness and therefore fit a Weibull distribution. Figure 4 also shows that, although all the distributions are upper-bounded, BT_C on RAMDISK, BT_C on HDD, and PIOS on SSD are close to the light-tailed (Gumbel) case.
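A small sketch that classifies the fitted shape parameters of Table 3 and checks the asymptotic condition $\xi > -0.5$ programmatically:

```python
# Classify the fitted shape parameters from Table 3 and check Smith's
# asymptotic condition xi > -0.5 (Section 2.3).
shapes = {
    "Fallocate RAMDISK": -0.16, "Fallocate SSD": -0.17, "Fallocate HDD": -0.17,
    "Iozone RAMDISK": -0.22, "Iozone SSD": -0.38, "Iozone HDD": -0.39,
    "BT_C RAMDISK": -0.05, "BT_C SSD": -0.11, "BT_C HDD": -0.07,
    "PIOS RAMDISK": -0.17, "PIOS SSD": -0.08, "PIOS HDD": -0.46,
}

def gev_type(xi):
    if xi > 0:
        return "Frechet"
    return "Gumbel" if xi == 0 else "Weibull"

for name, xi in shapes.items():
    print(f"{name}: {gev_type(xi)}, asymptotic condition met: {xi > -0.5}")
```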
The skewness values show that all benchmarks except PIOS have the same location of the tail for the SSD and HDD experiments. PIOS has a left-tailed skewness for HDD, and a right-tailed skewness for SSD. This is an interesting outcome, as PIOS on SSD also has the highest coefficient of variation, indicating that this experiment has the most variation in its data; this generates more extreme values, making the prediction process more challenging.

4.3. Return Level Analysis

In this section, we use return level analysis to predict the performance of HPC applications on BeeGFS. Figure 5, Figure 6, Figure 7 and Figure 8 show the return levels, computed using Equation (2), for 16, 32, 64 and 128 BeeGFS object storage servers for Fallocate, Iozone, BT_C and PIOS, respectively. These figures compare the results of the return level analysis with the observed values, which allows us to predict the performance of the applications at scale and to assess the predictions against the observed performance. The largest prediction error was 6.64% and the smallest was 0.46%, for PIOS on 64 nodes with RAMDISK and BT_C on 32 nodes with SSD, respectively.
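These prediction errors are relative errors between the return level prediction and the observed value; the following trivial sketch (with hypothetical numbers, not the measured ones) shows the computation:

```python
def relative_error_pct(predicted, observed):
    """Relative error (%) between a return level prediction and an observed value."""
    return abs(predicted - observed) / observed * 100.0

# The paper reports a largest error of 6.64% and a smallest of 0.46%.
# The numbers below are hypothetical and only illustrate the computation.
print(round(relative_error_pct(predicted=213.3, observed=200.0), 2))  # 6.65
```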
The variations in the return levels between 16 and 128 nodes were larger for BT_C on HDD, PIOS on RAMDISK and PIOS on SSD (see Figure 7c and Figure 8a,b), which were the experiments whose shape parameters were closest to zero, a situation that often indicates higher variability in performance outcomes. For instance, the higher variability observed for BT_C on HDD could be due to the inherently slower performance and higher latency of HDDs compared with SSDs or RAMDISK. Similarly, the performance variability for PIOS on RAMDISK might be influenced by the memory management and data handling characteristics of RAMDISK storage.
The return level values for Iozone did not produce differences as large as those for BT_C on HDD, PIOS on RAMDISK, and PIOS on SSD (see Figure 6). The return level behaviour of Iozone exhibits less variability than other tools because of its single-stream measurement approach and the efficient file distribution handled by BeeGFS. This leads to more consistent performance metrics, as the tool focuses on straightforward write operations without the added complexity of parallel I/O patterns as in PIOS or the computation and communication overheads introduced by BT. The inherent design of Iozone and its specific use case in this study (i.e., writing a single stream) contribute to its stable and predictable performance, reflected in the less variable return levels observed in the analysis.

5. Related Work

Cloud platforms are increasingly incorporating SSDs into their storage systems. Huang et al. [50] proposed a black-box model to predict the performance of SSDs in terms of the latency, bandwidth and throughput, by applying statistical machine learning algorithms. They evaluated their model using micro-benchmarks and real-world traces from online transaction processing (OLTP) applications, and recorded errors of 9% in the latency prediction and 1% for the bandwidth and throughput. Unlike our work, in which we consider different storage types in our model, this work focused on predicting performance in SSDs but not HDDs.
Mondragon et al. [26] presented a model for analysing the performance of bulk synchronous HPC applications based on the use of EVT. They used their model to characterise the impact of next-generation interference sources on applications and predicted the performance of applications at large scales. Their model obtained a prediction error of less than 7.4% for HPC applications running on HPC clusters. We also apply EVT to characterise and predict the performance of HPC applications using heterogeneous storage in cloud systems.
Dominguez-Trujillo et al. [51] presented an approach for modelling variations in the performance of large-scale HPC systems using EVT through a study of the maximum length of the distributed workload time interval for bulk synchronous HPC applications, using parametric and non-parametric ping. This work focused on the variability generated by the hardware and software used in HPC clusters. Our work instead focuses on analysing and predicting performance variations for heterogeneous storage systems in cloud systems.
Another approach that used EVT was an analysis of the worst-case execution time (WCET) [52], where the authors studied the accuracy of EVT in detecting the WCET for several processes. This approach has also been used for CUDA kernel tasks [53] and the analysis of automotive applications in embedded safety-critical systems [54], but not for the performance of HPC applications using heterogeneous storage systems in the cloud.

6. Conclusions

An increasing number of researchers are considering the use of the cloud to run their HPC applications. This work contributes to the adoption of the cloud as a feasible environment for HPC applications by providing a model for HPC applications in cloud systems and the integration of a high-performance storage file system such as BeeGFS into the cloud. In this paper, we have presented an EVT-based model for analysing, characterising and predicting the performance of HPC applications that use heterogeneous storage technologies in the cloud.
Modelling and understanding the performance of HPC applications that use heterogeneous storage can benefit both cloud providers and cloud users. An accurate prediction of the performance of HPC applications that takes into consideration the storage performance can help cloud providers to offer this feature in their platforms, and can help cloud users to identify the resources they will need more specifically.
In addition to predicting storage performance, our extreme value model can guide the design of intelligent data placement algorithms for the heterogeneous storage infrastructure increasingly offered by cloud providers, taking advantage of the characteristics of each storage type as well as data locality and data access patterns.
Although our model obtained accurate results with low prediction errors, modelling applications that exhibit heavy-tail storage time distributions such as PIOS is challenging. In those cases, the model might benefit from the use of techniques such as statistical bootstrapping or smoothing sample extremes, in order to more precisely fit collected data to distributions using a small number of samples.
One future direction for this work would be to include in the model other storage devices such as the new NVRAM. The development of a model of the performance of HPC applications based on the use of heterogeneous storage in cloud systems represents only one piece of a larger challenge, and additional cloud resources or even multi-cloud resources could be integrated into the model in future work.

Author Contributions

Conceptualization, J.M. and O.H.M.; Investigation, J.M. and O.H.M.; Methodology, J.M. and O.H.M.; Software, J.M.; Supervision, O.H.M.; Validation, J.M. and O.H.M.; Visualization, J.M.; writing—original draft preparation, J.M. and O.H.M.; writing—review and editing, J.M. and O.H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are presented in the main text.

Acknowledgments

The results presented in this paper were obtained using the CloudLab testbed, which is supported by the U.S. National Science Foundation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Neuwirth, S.; Paul, A.K. Parallel i/o evaluation techniques and emerging hpc workloads: A perspective. In Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER), Portland, OR, USA, 7–10 September 2021; pp. 671–679. [Google Scholar]
  2. Mell, P.; Grance, T. The NIST Definition of Cloud Computing; NIST: Boulder, CO, USA, 2011. [Google Scholar]
  3. Netto, M.A.S.; Calheiros, R.N.; Rodrigues, E.R.; Cunha, R.L.F.; Buyya, R. HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges. ACM Comput. Surv. 2018, 51, 1–29. [Google Scholar] [CrossRef]
  4. Borin, E.; Drummond, L.M.A.; Gaudiot, J.L.; Melo, A.; Alves, M.M.; Navaux, P.O.A. High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment; Springer Nature: Berlin/Heidelberg, Germany, 2023. [Google Scholar]
  5. Dancheva, T.; Alonso, U.; Barton, M. Cloud benchmarking and performance analysis of an HPC application in Amazon EC2. Clust. Comput. 2024, 27, 2273–2290. [Google Scholar] [CrossRef]
  6. Aithal, P. Information communication & computation technology (ICCT) as a strategic tool for industry sectors. Int. J. Appl. Eng. Manag. Lett. (IJAEML) 2019, 3, 65–80. [Google Scholar]
  7. dos Santos, M.A.; Cavalheiro, G.G.H. Cloud infrastructure for HPC investment analysis. Rev. Informática Teórica E Apl. 2020, 27, 45–62. [Google Scholar] [CrossRef]
  8. Cheriere, N.; Dorier, M.; Antoniu, G. How fast can one resize a distributed file system? J. Parallel Distrib. Comput. 2020, 140, 80–98. [Google Scholar] [CrossRef]
  9. Subramanyam, R. HDFS Heterogeneous Storage Resource Management Based on Data Temperature. In Proceedings of the 2015 International Conference on Cloud and Autonomic Computing, Boston, MA, USA, 21–25 September 2015; pp. 232–235. [Google Scholar] [CrossRef]
  10. Braam, P. The Lustre storage architecture. arXiv 2019, arXiv:1903.01955. [Google Scholar]
  11. Heichler, J. An introduction to BeeGFS. 2014. Available online: http://www.beegfs.de/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf (accessed on 1 April 2024).
  12. Souza Filho, P.; Felipe, L.; Aragão, P.; Bejarano, L.; de Paula, D.T.; Sardinha, A.; Azambuja, A.; Sierra, F. Large Scale Seismic Processing in Public Cloud. In Proceedings of the 82nd EAGE Annual Conference & Exhibition, Amsterdam, The Netherlands, 8–11 June 2020; Volume 2020, pp. 1–5. [Google Scholar]
  13. Rao, M.V. Data duplication using Amazon Web Services cloud storage. In Data Deduplication Approaches: Concepts, Strategies, and Challenges; Academic Press: Cambridge, MA, USA, 2020; p. 319. [Google Scholar]
  14. Chakraborty, M.; Kundan, A.P. Grafana. In Monitoring Cloud-Native Applications: Lead Agile Operations Confidently Using Open Source Software; Springer: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  15. Haan, L.; Ferreira, A. Extreme Value Theory: An Introduction; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3. [Google Scholar]
  16. Reghenzani, F.; Massari, G.; Fornaciari, W. Probabilistic-WCET reliability: Statistical testing of EVT hypotheses. Microprocess. Microsyst. 2020, 77, 103135. [Google Scholar] [CrossRef]
  17. Omar, C.; Mundia, S.; Ngina, I. Forecasting value-at-risk of financial markets under the global pandemic of COVID-19 using conditional extreme value theory. J. Math. Financ. 2020, 10, 569–597. [Google Scholar] [CrossRef]
  18. Embrechts, P.; Klüppelberg, C.; Mikosch, T. Modelling Extremal Events: For Insurance and Finance; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 33. [Google Scholar]
  19. Coles, S.; Bawa, J.; Trenner, L.; Dorazio, P. An Introduction to Statistical Modeling of Extreme Values; Springer: Berlin/Heidelberg, Germany, 2001; Volume 208. [Google Scholar]
  20. Wang, C.; Xu, C.; Xia, J.; Qian, Z.; Lu, L. A combined use of microscopic traffic simulation and extreme value methods for traffic safety evaluation. Transp. Res. Part C Emerg. Technol. 2018, 90, 281–291. [Google Scholar] [CrossRef]
  21. Beirlant, J.; Goegebeur, Y.; Segers, J.; Teugels, J.L. Statistics of Extremes: Theory and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  22. Ouellette, P.; El-Jabi, N.; Rousselle, J. Application of extreme value theory to flood damage. J. Water Resour. Plan. Manag. 1985, 111, 467–477. [Google Scholar] [CrossRef]
  23. Merz, B.; Basso, S.; Fischer, S.; Lun, D.; Blöschl, G.; Merz, R.; Guse, B.; Viglione, A.; Vorogushyn, S.; Macdonald, E.; et al. Understanding heavy tails of flood peak distributions. Water Resour. Res. 2022, 58, e2021WR030506. [Google Scholar] [CrossRef]
  24. Tabari, H. Extreme value analysis dilemma for climate change impact assessment on global flood and extreme precipitation. J. Hydrol. 2021, 593, 125932. [Google Scholar] [CrossRef]
  25. Haskins, K.; Wofford, Q.; Bridges, P.G. Workflows for performance predictable and reproducible hpc applications. In Proceedings of the 2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 23–26 September 2019; pp. 1–2. [Google Scholar]
  26. Mondragon, O.H.; Bridges, P.G.; Levy, S.; Ferreira, K.B.; Widener, P. Understanding performance interference in next-generation HPC systems. In Proceedings of the SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, 13–18 November 2016; pp. 384–395. [Google Scholar]
  27. Seelam, S.; Fong, L.; Tantawi, A.; Lewars, J.; Divirgilio, J.; Gildea, K. Extreme scale computing: Modeling the impact of system noise in multicore clustered systems. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, GA, USA, 19–23 April 2010; pp. 1–12. [Google Scholar]
  28. Fisher, R.A.; Tippett, L.H.C. Limiting forms of the frequency distribution of the largest or smallest member of a sample. In Proceedings of the Mathematical Proceedings of the Cambridge Philosophical Society; Cambridge University Press: Cambridge, UK, 1928; Volume 24, pp. 180–190. [Google Scholar]
  29. Gnedenko, B. Sur La Distribution Limite Du Terme Maximum D’Une Série Aléatoire. Ann. Math. 1943, 44, 423–453. [Google Scholar] [CrossRef]
  30. Jenkinson, A.F. The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Q. J. R. Meteorol. Soc. 1955, 81, 158–171. [Google Scholar] [CrossRef]
  31. Markose, S.; Alentorn, A. The generalized extreme value distribution, implied tail index, and option pricing. J. Deriv. 2011, 18, 35–60. [Google Scholar] [CrossRef]
  32. Lu, L.H.; Stedinger, J.R. Variance of two-and three-parameter GEV/PWM quantile estimators: Formulae, confidence intervals, and a comparison. J. Hydrol. 1992, 138, 247–267. [Google Scholar] [CrossRef]
  33. Hirose, H. Maximum likelihood estimation in the 3-parameter Weibull distribution. A look through the generalized extreme-value distribution. IEEE Trans. Dielectr. Electr. Insul. 1996, 3, 43–55. [Google Scholar] [CrossRef]
  34. Hosking, J.R. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. J. R. Stat. Soc. Ser. B (Methodol.) 1990, 52, 105–124. [Google Scholar] [CrossRef]
  35. Smith, R.L. Extreme value theory based on the r largest annual events. J. Hydrol. 1986, 86, 27–43. [Google Scholar] [CrossRef]
  36. McNeil, A.J. Calculating Quantile Risk Measures for Financial Return Series Using Extreme Value Theory; Technical Report; ETH Zurich: Zurich, Switzerland, 1998. [Google Scholar]
  37. Mehta, N.J.; Yang, F. Portfolio optimization for extreme risks with maximum diversification: An empirical analysis. Risks 2022, 10, 101. [Google Scholar] [CrossRef]
  38. Gu, X.; Ye, L.; Xin, Q.; Zhang, C.; Zeng, F.; Nerantzaki, S.D.; Papalexiou, S.M. Extreme precipitation in China: A review on statistical methods and applications. Adv. Water Resour. 2022, 163, 104144. [Google Scholar] [CrossRef]
  39. Beretta, S. More than 25 years of extreme value statistics for defects: Fundamentals, historical developments, recent applications. Int. J. Fatigue 2021, 151, 106407. [Google Scholar] [CrossRef]
  40. Cai, Y.; Hames, D. Minimum sample size determination for generalized extreme value distribution. Commun. Stat. Comput. 2010, 40, 87–98. [Google Scholar] [CrossRef]
  41. Henwood, R.; Watkins, N.W.; Chapman, S.C.; McLay, R. A parallel workload has extreme variability in a production environment. arXiv 2018, arXiv:1801.03898. [Google Scholar]
  42. Duplyakin, D.; Ricci, R.; Maricq, A.; Wong, G.; Duerig, J.; Eide, E.; Stoller, L.; Hibler, M.; Johnson, D.; Webb, K.; et al. The Design and Operation of CloudLab. In Proceedings of the 2019 USENIX Annual Technical Conference (ATC 2019), Renton, WA, USA, 10–12 July 2019; pp. 1–14. [Google Scholar]
  43. Fragalla, J. Configure, Tune, and Benchmark a Lustre FileSystem. In 2014 Oil & Gas HPC Workshop. 2014. Available online: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Fragalla-Xyratex_Lustre_PerformanceTuning_Fragalla_0314.pdf (accessed on 1 April 2024).
  44. Norcott, W. Iozone Filesystem Benchmark. 2003. Available online: http://www.iozone.org/ (accessed on 1 April 2024).
  45. Conway, A.; Bakshi, A.; Jiao, Y.; Jannen, W.; Zhan, Y.; Yuan, J.; Bender, M.A.; Johnson, R.; Kuszmaul, B.C.; Porter, D.E.; et al. File systems fated for senescence? nonsense, says science! In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST 17), Santa Clara, CA, USA, 27 February–2 March 2017; pp. 45–58. [Google Scholar]
  46. Yu, W.; Vetter, J.; Canon, R.S.; Jiang, S. Exploiting lustre file joining for effective collective io. In Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid’07), Rio de Janeiro, Brazil, 14–17 May 2007; pp. 267–274. [Google Scholar]
  47. Wong, P.; Der Wijngaart, R. NAS Parallel Benchmarks I/O; Version 2.4; Tech. Rep. NAS-03-002; NASA Ames Research Center: Moffet Field, CA, USA, 2003. [Google Scholar]
  48. Oracle. Lustre 1.6 Operations Manual. 2010. Available online: https://docs.oracle.com/cd/E19091-01/lustre.fs16/820-3681-11/820-3681-11.pdf (accessed on 6 April 2024).
  49. Amaral, J.N. About Computing Science Research Methodology. 2011. Available online: https://webdocs.cs.ualberta.ca/~amaral/courses/MetodosDePesquisa/papers/Amaral-research-methods.pdf (accessed on 2 April 2024).
  50. Huang, H.H.; Li, S.; Szalay, A.; Terzis, A. Performance modeling and analysis of flash-based storage devices. In Proceedings of the 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST), Denver, CO, USA, 23–27 May 2011; pp. 1–11. [Google Scholar]
  51. Dominguez-Trujillo, J.; Haskins, K.; Khouzani, S.J.; Leap, C.; Tashakkori, S.; Wofford, Q.; Estrada, T.; Bridges, P.G.; Widener, P.M. Lightweight Measurement and Analysis of HPC Performance Variability. In Proceedings of the 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS); IEEE: New York, NY, USA, 2020; pp. 50–60. [Google Scholar]
  52. Lima, G.; Dias, D.; Barros, E. Extreme value theory for estimating task execution time bounds: A careful look. In Proceedings of the 2016 28th Euromicro Conference on Real-Time Systems (ECRTS), Toulouse, France, 5–8 July 2016; pp. 200–211. [Google Scholar]
  53. Berezovskyi, K.; Santinelli, L.; Bletsas, K.; Tovar, E. WCET measurement-based and extreme value theory characterisation of CUDA kernels. In Proceedings of the 22nd International Conference on Real-Time Networks and Systems, Versailles, France, 8–10 October 2014; pp. 279–288. [Google Scholar]
  54. Castillo, J.D.; Padilla, M.; Abella, J.; Cazorla, F.J. Execution time distributions in embedded safety-critical systems using extreme value theory. Int. J. Data Anal. Tech. Strateg. 2017, 9, 348–361. [Google Scholar] [CrossRef]
Figure 1. BeeGFS architecture.
Figure 2. Probability density function.
Figure 3. The architecture of BeeGFS.
Figure 4. Density Plots (in seconds) for Block Maxima Sample.
Figure 5. Performance prediction for Fallocate on BeeGFS using heterogeneous storage.
Figure 6. Performance prediction for Iozone on BeeGFS using heterogeneous storage.
Figure 7. Performance prediction for BT_C on BeeGFS using heterogeneous storage.
Figure 8. Performance prediction for PIOS on BeeGFS using heterogeneous storage.
Table 1. Storage services offered by the top three cloud providers.

Cloud Platform | Disk Types | Storage Services/File Systems
Azure | Premium SSD; Standard SSD; Standard HDD; Ultra Disk | Blob Storage; Azure Files; Queues; Tables; Disks
Amazon Web Services | SSD-io2-Provisioned IOPS; SSD-io2 Block Express; SSD-io1; SSD-gp3-General Purpose; SSD-gp2; HDD-st1-Optimized Speed; HDD-sc1-Cold | S3; Elastic Block Store; Elastic File System; Glacier; Storage Gateway; FSx Windows; FSx Lustre
Google Cloud Platform | SSD-NVMe; SSD; HDD; Zonal Persistent Disk; Regional Persistent Disk; Extreme Persistent Disk | Block Storage-Persistent disk; Cloud Storage; Standard storage; Nearline storage; Coldline storage; Archive; Filestore
Table 2. Selected benchmarks for performance analysis.

Benchmark | Ref.
Iozone | [43,44]
Fallocate | [45]
BT-NPB | [46,47]
PIOS | [48]
Table 3. Estimated GEV parameters for the block maxima sample.

Benchmark/Storage | Shape (ξ) | Scale (σ) (s) | Location (μ) (s)
Fallocate RAMDISK | −0.16 | 154.47 | 566.95
Fallocate SSD | −0.17 | 177.52 | 711.15
Fallocate HDD | −0.17 | 192.99 | 785.41
Iozone RAMDISK | −0.22 | 70.03 | 265.10
Iozone SSD | −0.38 | 75.98 | 339.61
Iozone HDD | −0.39 | 103.81 | 467.63
BT_C RAMDISK | −0.05 | 116.63 | 1006.72
BT_C SSD | −0.11 | 89.46 | 1268.03
BT_C HDD | −0.07 | 498.15 | 2026.47
PIOS RAMDISK | −0.17 | 21.99 | 57.42
PIOS SSD | −0.08 | 25.67 | 69.20
PIOS HDD | −0.46 | 36.71 | 126.54
