A Novel Multi-Task Performance Prediction Model for Spark
Abstract
1. Introduction
- (1)
- We designed and implemented an automated Spark history log reader that efficiently extracts historical execution data from Spark and stores it in a text file.
- (2)
- Among the many configuration parameters of Spark, we carefully selected those that have a significant impact on execution efficiency. In addition, we employed a dimensionality reduction algorithm to reduce the complexity of the data.
- (3)
- We developed and implemented a neural-network-based multi-task performance prediction model for Spark. The model can accurately predict the execution time of single or multiple Spark applications. In addition, we conducted a series of comprehensive experiments from different perspectives to demonstrate the accuracy and applicability of our model.
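The log reader of contribution (1) can be sketched as follows. Spark's history server stores each application's events as one JSON object per line; the helper below is a hypothetical minimal sketch (not the authors' implementation) that extracts the application start and end timestamps to compute the execution time the model later predicts.

```python
import json

def read_app_duration(log_path):
    """Scan a Spark event log (one JSON object per line) and return the
    application's wall-clock execution time in seconds.

    Event and field names follow the Spark history log format; this
    helper itself is illustrative only."""
    start = end = None
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            name = event.get("Event")
            if name == "SparkListenerApplicationStart":
                start = event["Timestamp"]  # epoch milliseconds
            elif name == "SparkListenerApplicationEnd":
                end = event["Timestamp"]
    if start is None or end is None:
        raise ValueError("incomplete event log")
    return (end - start) / 1000.0
```

In practice one such record per application, together with its configuration parameters, becomes one row of the training set.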
2. Related Work
3. Proposed Model
3.1. Data Collection
3.2. Data Preprocessing
3.2.1. One-Hot Feature Encoding and Feature Normalization
3.2.2. Principal Component Analysis (PCA) Dimensionality Reduction Algorithm
- (1)
- Data standardization: since PCA is sensitive to the variances of the initial variables, all features must be standardized to have a mean of 0 and a standard deviation of 1. We arrange the features into a feature matrix X, as shown in Equation (4). Finally, the feature matrix X is standardized; the standard transformation of each element of X is shown in Equation (7).
- (2)
- Calculate the covariance matrix: we denote the covariance matrix by C; it represents the covariance between pairs of features and is calculated over all data points to quantify, from a global perspective, how the features in the dataset relate to each other. The resulting covariance matrix is shown in Equation (8), and each element of C is computed as shown in Equation (9).
- (3)
- Calculate the eigenvalues and eigenvectors: we decompose the covariance matrix into its eigenvalues and eigenvectors. The eigenvalues and eigenvectors of the covariance matrix satisfy the conditions of Equation (10).
- (4)
- Selection of principal components: the eigenvectors corresponding to the k largest eigenvalues are selected to form an m × k matrix P, with m being the dimension of the eigenvectors. Specifically, this is shown in Equation (11).
- (5)
- Mapping to a lower-dimensional space: in this step, we transform the standardized data using the eigenvector matrix P. This achieves dimensionality reduction, producing the data matrix in the new lower-dimensional space, as detailed in Equation (12).
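The five steps above follow the standard PCA procedure and can be sketched with NumPy; this is a generic illustration (function and symbol names are assumptions), not the authors' exact code.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce an n x m data matrix X to n x k principal components."""
    # (1) standardize each feature to mean 0, standard deviation 1
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # (2) covariance matrix C (m x m)
    C = np.cov(Z, rowvar=False)
    # (3) eigenvalues and eigenvectors (eigh: C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # (4) eigenvectors of the k largest eigenvalues -> m x k matrix P
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order]
    # (5) map the standardized data to the lower-dimensional space
    return Z @ P
```

The first output column then carries the largest variance, the second the next largest, and so on.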
3.3. MHAC Model
3.3.1. Custom Multi-Head Attention Networks
3.3.2. Convolutional Neural Network
Algorithm 1 Residual CNN
1. Input: feature matrix X
2. For each convolutional layer l do
3. Convolution operation with kernel h: Y = h ∗ X + b
4. Apply batch normalization: Ŷ = γ(Y − μ)/√(σ² + ε) + β
5. Apply ReLU activation function: Ŷ = ReLU(Ŷ)
6. If l > 1, residual connection: Ŷ = Ŷ + X
7. End for
8. Mean pooling layer: F = AvgPool(Ŷ)
9. Apply fully connected layer to obtain the prediction: ŷ = W·F + b
10. Output: prediction result ŷ
- Step 1.
- Initialize the input data, where X represents the input features and is the final representation produced by the multi-head attention module (MultiHead).
- Step 2.
- Iterate through each layer l of the CNN.
- Step 3.
- Perform the convolution operation using kernel h, with b as the bias term, to obtain the output Y. The size of h is 3 × 3, as shown in Figure 3.
- Step 4.
- Apply batch normalization to standardize the output Y from Step 3: Y is normalized using the scaling factor γ and shifted by the bias factor β, yielding Ŷ. Here, ε is a small constant that prevents division by zero.
- Step 5.
- Apply the ReLU activation function, known for its simplicity and effectiveness in mitigating gradient vanishing during neural network training.
- Step 6.
- For the second and deeper convolutional layers, apply a residual connection: add the current layer’s output Y to the layer’s input X. This design enables the network to learn residual mappings rather than a direct input–output mapping, which improves deep-network training and model performance. Here, X and Y denote the inputs and outputs involved.
- Step 7.
- If all convolutional layers have been processed, proceed; otherwise, return to Step 2.
- Step 8.
- For each pooling window, perform average pooling; the pooled outputs form the new feature matrix F.
- Step 9.
- Execute the fully connected layer to obtain the prediction: W is the weight matrix and b is the bias vector, both learned during training; F is the feature matrix after pooling, and ŷ is the model’s prediction.
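Putting Steps 1–9 together, a minimal forward pass of the residual CNN might look as follows. This is an illustrative sketch under simplifying assumptions (plain NumPy, a single feature map per layer, 2 × 2 mean pooling), not the authors' implementation; `conv2d_same` computes cross-correlation with zero padding, as deep learning frameworks do.

```python
import numpy as np

def conv2d_same(X, h, b=0.0):
    """3x3 cross-correlation with zero ('same') padding, single channel."""
    Xp = np.pad(X, 1)
    out = np.zeros(X.shape, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(Xp[i:i + 3, j:j + 3] * h) + b
    return out

def residual_cnn_forward(X, kernels, W, b_fc, gamma=1.0, beta=0.0, eps=1e-5):
    """Conv -> batch norm -> ReLU -> residual per layer, then
    2x2 mean pooling and a fully connected layer."""
    for l, h in enumerate(kernels, start=1):
        Y = conv2d_same(X, h)                                       # Step 3
        Y = gamma * (Y - Y.mean()) / np.sqrt(Y.var() + eps) + beta  # Step 4
        Y = np.maximum(Y, 0.0)                                      # Step 5: ReLU
        if l > 1:
            Y = Y + X                                               # Step 6: residual
        X = Y                                                       # feed the next layer
    n, m = X.shape
    # Step 8: 2x2 average pooling (truncating odd edges)
    F = X[:n // 2 * 2, :m // 2 * 2].reshape(n // 2, 2, m // 2, 2).mean(axis=(1, 3))
    # Step 9: fully connected prediction
    return W @ F.ravel() + b_fc
```

The residual sum in Step 6 requires the layer's input and output to share a shape, which the 'same' padding guarantees here.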
4. Experimentation
4.1. Dataset and Experimental Environment
4.2. Spark Parameter
4.3. Comparative Modeling and Experimental Metrics
4.4. Experimental Results
4.5. Ablation Experiments
4.6. Baseline Dataset Comparison Experiments
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Application Name | Input Data Size | Step
---|---|---
Pi | 1–50 | 10,000
PageRank | 1–50 | 10,000 Pages
WordCount | 1–50 | 100 MB
LR | 1–50 | 100 MB
Parameter Name | Description
---|---
spark.num.executors | Number of executors
spark.executor.cores | Number of CPU cores per executor
spark.executor.memory | Memory size per executor
spark.driver.memory | Total amount of memory allocated to the driver
spark.driver.cores | Number of CPU cores allocated to the driver process
spark.executor.instances | Number of executor instances
spark.default.parallelism | Default number of tasks per stage
spark.memory.fraction | Fraction of memory used for execution and storage
spark.task.cpus | Number of CPU cores allocated to each task
spark.shuffle.memoryFraction | Fraction of memory occupied by the shuffle process
spark.shuffle.file.buffer | Buffer size for file writes during shuffle
spark.reducer.maxSizeInFlight | Buffer size during shuffle reads
spark.reducer.maxReqSizeShuffleToMem | Maximum size of the data buffer
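For illustration, several of the parameters above can be set via `--conf` flags when submitting an application. The class name, jar, input file, and values below are placeholders, not the tuned settings used in the experiments.

```shell
# Illustrative spark-submit invocation; all values are placeholders.
spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --conf spark.executor.instances=4 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=2g \
  --conf spark.driver.cores=1 \
  --conf spark.default.parallelism=16 \
  --conf spark.memory.fraction=0.6 \
  --conf spark.task.cpus=1 \
  --conf spark.shuffle.file.buffer=64k \
  --conf spark.reducer.maxSizeInFlight=96m \
  examples.jar input.txt
```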
| Task Name | Model Name | MAPE | RMSE | MAE | R² |
|---|---|---|---|---|---|
| WordCount | KNN | 0.208 | 5.410 | 4.059 | 0.503 |
| | LR | 0.250 | 6.166 | 4.735 | 0.355 |
| | SVC | 1.000 | 22.758 | 21.425 | −7.787 |
| | RF | 0.180 | 5.493 | 3.671 | 0.488 |
| | GBDT | 0.304 | 6.701 | 5.251 | 0.238 |
| | MHAC | 0.148 | 3.917 | 2.834 | 0.740 |
| PageRank | KNN | 0.283 | 1.715 | 1.361 | −0.132 |
| | LR | 0.328 | 2.134 | 1.615 | −0.752 |
| | SVC | 0.999 | 5.744 | 5.513 | −11.694 |
| | RF | 0.284 | 1.921 | 1.389 | −0.420 |
| | GBDT | 0.291 | 1.564 | 1.328 | 0.059 |
| | MHAC | 0.255 | 1.419 | 1.212 | 0.225 |
| Pi | KNN | 0.113 | 0.636 | 0.341 | 0.511 |
| | LR | 0.118 | 0.720 | 0.384 | 0.375 |
| | SVC | 1.000 | 2.624 | 2.461 | −7.307 |
| | RF | 0.121 | 0.745 | 0.374 | 0.330 |
| | GBDT | 0.236 | 0.810 | 0.598 | 0.208 |
| | MHAC | 0.089 | 0.537 | 0.278 | 0.651 |
| LR | KNN | 0.164 | 5.964 | 4.135 | 0.872 |
| | LR | 0.334 | 10.917 | 8.336 | 0.572 |
| | SVC | 0.999 | 31.995 | 27.305 | −2.677 |
| | RF | 0.114 | 5.304 | 3.300 | 0.899 |
| | GBDT | 0.583 | 13.999 | 10.617 | 0.296 |
| | MHAC | 0.098 | 3.994 | 2.647 | 0.943 |
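The metrics reported in the comparison tables can be reproduced with a few lines of NumPy. The helper below is a generic sketch of the standard definitions; it assumes the unnamed fourth metric is the coefficient of determination R², which is consistent with the negative values reported.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAPE, RMSE, MAE, R^2) for predicted execution times."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mape = np.mean(np.abs(err / y_true))          # mean absolute percentage error
    rmse = np.sqrt(np.mean(err ** 2))             # root mean squared error
    mae = np.mean(np.abs(err))                    # mean absolute error
    ss_res = np.sum(err ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                    # can be negative for poor fits
    return mape, rmse, mae, r2
```

A perfect predictor yields MAPE = RMSE = MAE = 0 and R² = 1; a model worse than predicting the mean yields R² < 0, as seen for SVC.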
| Task Name | Model Name | MAPE | RMSE | MAE | R² |
|---|---|---|---|---|---|
| PageRank–WordCount | KNN | 2.772 | 7.952 | 5.860 | −0.722 |
| | LR | 0.377 | 3.444 | 2.575 | −0.417 |
| | SVC | 0.992 | 8.759 | 6.325 | −1.089 |
| | RF | 2.793 | 8.298 | 6.162 | −0.875 |
| | GBDT | 2.052 | 5.930 | 4.014 | 0.043 |
| | XGB | 1.689 | 5.785 | 4.058 | 0.089 |
| | MHAC | 0.333 | 2.494 | 1.958 | 0.257 |
| PageRank–WordCount–LR | KNN | 2.772 | 7.952 | 5.860 | −0.722 |
| | LR | 0.417 | 6.903 | 4.064 | −0.306 |
| | SVC | 0.993 | 8.761 | 6.326 | −1.089 |
| | RF | 2.796 | 8.362 | 6.162 | −0.904 |
| | GBDT | 2.052 | 5.930 | 4.014 | 0.043 |
| | XGB | 1.689 | 5.785 | 4.058 | 0.089 |
| | MHAC | 0.477 | 5.213 | 3.515 | 0.255 |
| Pi–PageRank–WordCount–LR | KNN | 2.348 | 6.675 | 4.814 | −0.213 |
| | LR | 1.917 | 6.516 | 4.363 | 0.115 |
| | SVC | 14.596 | 31.133 | 29.386 | −25.389 |
| | RF | 2.056 | 5.765 | 4.082 | 0.095 |
| | GBDT | 2.023 | 5.673 | 4.033 | 0.124 |
| | XGB | 1.989 | 5.785 | 4.058 | 0.089 |
| | MHAC | 1.863 | 5.625 | 4.021 | 0.138 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Shen, C.; Chen, C.; Rao, G. A Novel Multi-Task Performance Prediction Model for Spark. Appl. Sci. 2023, 13, 12242. https://doi.org/10.3390/app132212242