Intelligent Performance Prediction: The Use Case of a Hadoop Cluster
Abstract
1. Introduction
2. Related Work
3. Proposed Architecture
- An automated collection of monitoring data from as many heterogeneous types of sources as possible.
- A sophisticated analysis mechanism, based on a range of AI/ML algorithms, for dynamic resource allocation.
- Easy integration with existing orchestrator frameworks.
- A scalable and resilient architecture, able to handle a high number of monitored entities and to support hierarchical deployments.
- Support for near real-time responsiveness.
3.1. Monitoring Framework
- The Prometheus server [21] is the central point of event monitoring, storage, and alerting. All performance metrics are collected using an HTTP pull model and stored in a time-series database. The key features that make this server suitable for the proposed architecture are: (a) the support of a flexible query language (PromQL), which eases the interconnection with external systems; (b) the existence of many open-source implementations (exporters) for exposing the monitoring metrics of various applications, and the ease of creating new ones; (c) the autonomy it provides, as it does not rely on any complex distributed storage mechanisms; and (d) the fact that new monitoring targets can be easily added via reconfiguration or through the file-based service discovery mechanism.
- The Prometheus Pushgateway [22] allows batch jobs and short-lived microservices to expose their metrics to Prometheus. Since such jobs may not live long enough to be scraped, they can instead push their metrics to the Pushgateway, which then exposes them to the Prometheus server (a minimal usage sketch follows this list).
- The Alertmanager [23] handles alerts sent by client applications such as the Prometheus server. It is responsible for deduplicating, grouping, and routing the alerts to the correct receiver integrations, such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibiting alerts, which helps keep the number of generated notifications manageable.
- Grafana [24] is an open-source solution that retrieves metrics and alerts from the Prometheus server and provides interactive visualization web dashboards. These dashboards simplify the visualization of the collected performance metrics, so that after each notification the user can consult the corresponding dashboard and identify the problem by observing the charts.
- Netdata.io [25] is a powerful real-time monitoring agent that collects thousands of metrics from systems, hardware, virtual machines, and applications with zero configuration. It runs permanently on physical/virtual servers, containers, cloud deployments, and edge/IoT devices, and is safe to install on a system mid-incident without any preparation.
- cAdvisor [26] provides metrics on the resource usage and performance characteristics of running containers. It runs as a daemon that collects, aggregates, processes, and exports information about running containers, maintaining per-container resource isolation parameters, historical resource usage, histograms of complete historical resource usage, and network statistics.
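To make the data flow between these components concrete, the short Python sketch below first pushes a batch-job metric to the Pushgateway using the open-source prometheus_client library, and then queries the Prometheus HTTP API with a PromQL expression for a cAdvisor container metric. This is an illustration only, not part of the original paper: the hosts and ports are the Prometheus and Pushgateway defaults (9090 and 9091), and the job and metric names are hypothetical rather than taken from the deployed test bed.

```python
# Illustrative sketch (not from the paper): push a metric from a short-lived
# job to the Pushgateway, then read container metrics back via PromQL.
import requests
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# 1. A batch job records when it last finished and pushes the value to the
#    Pushgateway (default port 9091), from which Prometheus scrapes it.
registry = CollectorRegistry()
last_success = Gauge(
    "job_last_success_unixtime",
    "Unix time when the batch job last completed successfully",
    registry=registry,
)
last_success.set_to_current_time()
push_to_gateway("localhost:9091", job="hadoop_batch_job", registry=registry)

# 2. Query the Prometheus server (default port 9090) for the 5-minute CPU
#    usage rate of every container, as exported by cAdvisor.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "rate(container_cpu_usage_seconds_total[5m])"},
)
for series in resp.json()["data"]["result"]:
    container = series["metric"].get("name", "<unnamed>")  # container name label
    timestamp, value = series["value"]
    print(f"{container}: {float(value):.4f} CPU cores used")
```

Grafana dashboards and Alertmanager rules operate on top of this same PromQL layer, so the query shown here is representative of what the visualization and alerting components consume.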
3.2. Analysis Server
4. Implementation Test Bed—Evaluating a Real-Life Service
4.1. The Deployed Sandbox Test-Bed
4.2. The Evaluation Procedure
4.3. Results
4.3.1. Profiling of Critical System Metrics from Three Layers
4.3.2. Predictions Using Machine Learning
5. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AWS | Amazon Web Services
AI | Artificial Intelligence
CN | Cloud Native
DT | Decision Tree
DNN | Deep Neural Network
HDFS | Hadoop Distributed File System
IO | Input-Output
IDS | Intrusion Detection System
k-NN | k-Nearest Neighbors
ML | Machine Learning
MANO | Management and Orchestration
MG | Media Gateway
LR | Multiple Linear Regression
PR | Multivariate Polynomial Regression
NFV | Network Function Virtualization
NS | Network Service
OSM | Open Source MANO
QoD | Quality of Decisions
QoE | Quality of Experience
QoS | Quality of Service
RF | Random Forest
SLA | Service Level Agreement
SP | Service Providers
SSSP | Single Source Shortest Path
SVR | Support Vector Regression
VE | Video Encoder
VM | Virtual Machines
VNF | Virtual Network Functions
NFVI | VNF Infrastructures
References
1. ETSI GR NFV-IFA 041 V4.1.1. Available online: https://www.etsi.org/deliver/etsi_gr/NFV-IFA/001_099/041/04.01.01_60/gr_NFV-IFA041v040101p.pdf (accessed on 23 August 2021).
2. Palumbo, F.; Aceto, G.; Botta, A.; Ciuonzo, D.; Persico, V.; Pescapé, A. Characterization and analysis of cloud-to-user latency: The case of Azure and AWS. Comput. Netw. 2020, 184, 107693.
3. Wood, T.; Cherkasova, L.; Ozonat, K.; Shenoy, P. Profiling and Modeling Resource Usage of Virtualized Applications. In Proceedings of the ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Leuven, Belgium, 1–5 December 2008.
4. Giannakopoulos, I.; Tsoumakos, D.; Papailiou, N.; Koziris, N. PANIC: Modeling Application Performance over Virtualized Resources. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering, Tempe, AZ, USA, 9–13 March 2015; pp. 213–218.
5. Duplyakin, D.; Brown, J.; Ricci, R. Active Learning in Performance Analysis. In Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 12–16 September 2016.
6. Giannakopoulos, I.; Tsoumakos, D.; Koziris, N. A decision tree based approach towards adaptive modeling of big data applications. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017.
7. Cao, L.; Sharma, P.; Fahmy, S.; Saxena, V. NFV-VITAL: A framework for characterizing the performance of virtual network functions. In Proceedings of the 2015 IEEE Conference on Network Function Virtualization and Software Defined Network (NFV-SDN), San Francisco, CA, USA, 18–21 November 2015; pp. 93–99.
8. Peuster, M.; Karl, H. Understand Your Chains: Towards Performance Profile-Based Network Service Management. In Proceedings of the 2016 Fifth European Workshop on Software-Defined Networks (EWSDN), Den Haag, The Netherlands, 10–11 October 2016.
9. Van Rossem, S.; Tavernier, W.; Peuster, M.; Colle, D.; Pickavet, M.; Demeester, P. Monitoring and debugging using an SDK for NFV-powered telecom applications. In Proceedings of the 2016 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Palo Alto, CA, USA, 7–10 November 2016.
10. Rosa, R.V.; Bertoldo, C.; Rothenberg, C.E. Take Your VNF to the Gym: A Testing Framework for Automated NFV Performance Benchmarking. IEEE Commun. Mag. 2017, 55, 110–117.
11. Peuster, M.; Karl, H. Profile your chains, not functions: Automated network service profiling in DevOps environments. In Proceedings of the IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Berlin, Germany, 6–8 November 2017.
12. Iglesias, J.O.; Aroca, J.A.; Hilt, V.; Lugones, D. Orca: An orchestration automata for configuring VNFs. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Las Vegas, NV, USA, 11–15 December 2017; pp. 81–94.
13. Sciancalepore, V.; Yousaf, F.Z.; Costa-Perez, X. z-TORCH: An Automated NFV Orchestration and Monitoring Solution. IEEE Trans. Netw. Serv. Manag. 2018, 15, 1292–1306.
14. Nam, J.; Seo, J.; Shin, S. Probius: Automated Approach for VNF and Service Chain Analysis in Software-Defined NFV. In Proceedings of the Symposium on SDN Research (SOSR'18), Los Angeles, CA, USA, 28–29 March 2018.
15. Khan, M.G.; Bastani, S.; Taheri, J.; Kassler, A.; Deng, S. NFV-Inspector: A Systematic Approach to Profile and Analyze Virtual Network Functions. In Proceedings of the 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), Tokyo, Japan, 22–24 October 2018.
16. Van Rossem, S.; Tavernier, W.; Colle, D.; Pickavet, M.; Demeester, P. Profile-Based Resource Allocation for Virtualized Network Functions. IEEE Trans. Netw. Serv. Manag. 2019, 16, 1374–1388.
17. Van Rossem, S.; Tavernier, W.; Colle, D.; Pickavet, M.; Demeester, P. Optimized Sampling Strategies to Model the Performance of Virtualized Network Functions. J. Netw. Syst. Manag. 2020, 28, 1482–1521.
18. Schneider, S.; Satheeschandran, N.P.; Peuster, M.; Karl, H. Machine Learning for Dynamic Resource Allocation in Network Function Virtualization. In Proceedings of the 2020 6th IEEE Conference on Network Softwarization (NetSoft), Ghent, Belgium, 29 June–3 July 2020.
19. Trakadas, P.; Karkazis, P.; Leligou, H.C.; Zahariadis, T.; Papadakis, A. Scalable monitoring for multiple virtualized infrastructures for 5G services. In Proceedings of the International Symposium on Advances in Software Defined Networking and Network Functions Virtualization, Athens, Greece, 22–26 April 2018.
20. Al-Hazmi, Y.; Gonzalez, J.; Rodriguez-Archilla, P.; Alvarez, F.; Orphanoudakis, T.; Karkazis, P.; Magedanz, T. Unified representation of monitoring information across federated cloud infrastructures. In Proceedings of the IEEE 2014 26th International Teletraffic Congress (ITC), Karlskrona, Sweden, 9–11 September 2014.
21. GitHub—Prometheus/Prometheus: The Prometheus Monitoring System and Time Series Database. Available online: https://github.com/prometheus/prometheus (accessed on 23 August 2021).
22. GitHub—Prometheus/Pushgateway: Push Acceptor for Ephemeral and Batch Jobs. Available online: https://github.com/prometheus/pushgateway (accessed on 23 August 2021).
23. GitHub—Prometheus/Alertmanager: Prometheus Alertmanager. Available online: https://github.com/prometheus/alertmanager (accessed on 23 August 2021).
24. GitHub—Grafana/Grafana. Available online: https://github.com/grafana/grafana (accessed on 23 August 2021).
25. GitHub—Netdata/Netdata: Real-Time Performance Monitoring. Available online: https://github.com/netdata/netdata (accessed on 23 August 2021).
26. GitHub—Sonata-nfv. Available online: https://github.com/sonata-nfv (accessed on 23 August 2021).
27. "OSM ETSI" Git. Available online: https://osm.etsi.org/gitweb/ (accessed on 23 August 2021).
Ref. | Testing Service/App | Examined Metrics | Examined Layers | ML Algorithms
---|---|---|---|---
[6] | Spark k-means, Spark Bayes, Hadoop Wordcount, MongoDB | Execution time, throughput | mainly service | Regression trees
[7] | Clearwater, Snort, Suricata | Successful call rate; CPU, memory, and network usage; packet processing speed vs. traffic | virtual | -
[8] | Single Video Encoder (VE), blackbox profiling scenario, entire service chain | Number of CPU cores, CPU time | virtual | -
[9] | Intrusion Detection System (IDS) | CPU, memory, traffic rate, packet loss | virtual | Rule-based
[10] | SIPp prober | Avg. transaction rate, CPU usage | virtual | -
[11] | A chain of three VNFs | Throughput vs. CPU time | virtual | Plug-in arbitrary analysis scripts
[12] | Media Gateway (MG) service composed of various VNFs | CPU load, network drops, rejected calls, latency | virtual | Stochastic Gradient Descent regressor (SGDR)
[13] | MME, S-GW, HSS, PCRF, PDN-GW | Quality of Decisions (QoD) | virtual | Q-learning approach
[14] | 7 VNFs (such as Suricata-IPS, NAT, Tcpdump, Firewall, Netsniffer-ng) | Many from each layer, such as CPU and memory usage, disk read/write I/O requests and bytes, etc. | service, virtual, physical | Various criteria for behavior classification
[15] | HSS, MME, PGW-U-SGW-U, SGW-C-PGW-C, INET-GW, and eNB | CPU, CPU cache, memory, bandwidth, disk | virtual | Decision-tree-based multilabel classification
[16] | Virtual router, switch, firewall, and cache server | CPU usage, packet loss, cache response time | virtual | Linear Regression, k-NN, Interpolation, ANN, Curve Fitting
[17] | Virtual firewall (pfSense), virtual streaming server (Nginx) | CPU usage, packet loss, lag ratio | virtual | Support Vector Regression, Random Forest, Gaussian Process, k-NN, Interpolation Method, Curve Fitting
[18] | Squid cache, Nginx proxy | CPU, total delay, # of VNF instances | virtual | Linear Regression, Support Vector Regression, Decision Trees, Ensemble Learning, Neural Networks
[19] | Scalable monitoring framework | NS, VNFs, NFVI, SDN controllers | virtual | Rule-based
This work | Hadoop | CPU usage, disk usage, memory usage, throughput, average I/O rate | service, virtual, physical | Linear Regression, Polynomial Regression, Decision Tree, Random Forest, Support Vector Regression, k-NN, and rule-based
PowerEdge R230 | Specification
---|---
CPU | 4 × Intel(R) Xeon(R) CPU E3-1220 v6 @ 3.00 GHz
RAM | 16 GB DDR4
HDD | 8 TB
NETWORK | 2 × 1 GbE LOM
Software Tool | Version
---|---
Prometheus Server | v2.19.3
Prometheus Pushgateway | v1.2.0
Prometheus Alertmanager | v0.23.0
cAdvisor | v0.32.0
Netdata.io | v1.23.2
Apache Hadoop | v3.2.1
Optimal hyperparameter values per examined metric:

Method | Hyperparameter | CPU Usage PM | Disk Usage PM | CPU Usage NM | Memory Usage NM | Throughput | Average I/O Rate
---|---|---|---|---|---|---|---
Multiple Linear Regression (LR) | order of polynomial | 1st | 1st | 1st | 1st | 1st | 1st
Multivariate Polynomial Regression (PR) | order of polynomial | 2nd | 2nd | 2nd | 2nd | 2nd | 2nd
Decision Tree (DT) | depth | 5 | 5 | 5 | 5 | 5 | 5
Random Forest (RF) | depth, number of estimators | 5, 4 | 5, 4 | 5, 4 | 5, 5 | 5, 5 | 5, 5
Support Vector Regression (SVR) | C, γ | 50, 0.01 | 50, 0.0075 | 50, 0.01 | 50, 5 × 10⁻⁵ | 50, 5 × 10⁻³ | 50, 5 × 10⁻³
k-Nearest Neighbors (k-NN) | k | 2 | 2 | 2 | 2 | 1 | 1
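As an illustration of how the tuned models in the table can be instantiated, the following sketch builds the six regression methods with scikit-learn, using the optimal hyperparameters reported above for the CPU Usage PM target (SVR with C = 50, γ = 0.01; DT depth 5; RF depth 5 with 4 estimators; k-NN with k = 2). The feature matrix and target vector are synthetic placeholders, since the paper's dataset is not reproduced here; this is a minimal sketch under those assumptions, not the authors' original code.

```python
# Minimal sketch: the six regression methods of the table, instantiated with
# the optimal hyperparameters reported for the "CPU Usage PM" target.
# X (monitoring features) and y (target metric) are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((500, 3))  # hypothetical features, e.g., load level, #containers
y = X @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(500)

models = {
    "LR (1st-order)": LinearRegression(),
    "PR (2nd-order)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "DT (depth=5)": DecisionTreeRegressor(max_depth=5),
    "RF (depth=5, 4 estimators)": RandomForestRegressor(max_depth=5, n_estimators=4),
    "SVR (C=50, gamma=0.01)": SVR(C=50, gamma=0.01),
    "k-NN (k=2)": KNeighborsRegressor(n_neighbors=2),
}

# Hold out 20% of the samples and compare prediction error across methods.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {mae:.4f}")
```

For the other five target metrics, only the hyperparameter values change (e.g., γ = 5 × 10⁻⁵ for Memory Usage NM, k = 1 for Throughput and Average I/O Rate), so the same scaffold applies.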
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).