The Optimization Strategies on Clarification of the Misconceptions of Big Data Processing in Dynamic and Opportunistic Environments
Abstract
1. Introduction
- Assess the behavior dynamics and opportunism of volunteers;
- Process generic MapReduce big data problems;
- Record system performance.
- The number of volunteers;
- The heterogeneity of volunteers;
- A single overlay or multiple overlays;
- The varying workload and/or varying volunteer numbers.
- Compilation of a multiple-factor profile to synthesize the dynamics (free join, leave or crash) of large-scale volunteers and their impact on a large part of the overlay and on a large number of tasks.
- Confirmation that the scalability of volunteer computing for big data processing follows a logarithm-like curve in terms of speedup and a reciprocal-like curve in terms of speedup growth rate.
- Identification of the convergence points of speedup growth rate.
- Proposal of strategies to plan the optimal overlay size, the overlay numbers and the overlay structures for a given problem scale and given dynamics of volunteers.
2. Related Work
3. Preamble of Discussion
3.1. MapReduce Workflow in Dynamic and Opportunistic Environments
3.2. The Measurement of Performance
3.3. The Setting of Dynamics and Workload
- Heterogeneity (H) reflects the differences in compute capacity among volunteers. If the base capacity is tier 1, a volunteer that is two-fold slower is of tier 2.
- Download/Upload Speed (DUS) reflects the internet speed of a volunteer. For example, assuming a moderate internet speed of 25/10 Mbps for download/upload, the DUS is 20/51 s for a 64 MB dataset.
- Round Trip Time (RTT) reflects the time to establish an internet connection before, or close the connection after, a communication between volunteers. A reasonable RTT should be no more than 8 s for a moderate-speed internet connection.
- Map/Reduce Ratio (MRR) reflects the type of a big data application: data aggregation (input > output), data expansion (input < output), data transformation (input ≈ output) or data summary (input >> output) [23]. The most common big data applications perform data aggregation. For example, a 20% MRR means that the workload and data scale of the reduce tasks are 20% of those of the map tasks.
- Redistribution Factor (RF) reflects the diversity of the keys in an intermediate result set. For example, an RF of 200 means that an intermediate result set from the map step needs to be redistributed into 200 reduce tasks in the shuffle step.
- Churn Rate (CR) reflects the percentage of the total volunteers that behave dynamically or opportunistically, i.e., leave or crash, in the course of computing.
- Start Position (SP) reflects how long a volunteer stays on the overlay before churning.
- Occurrence Interval (OI) reflects the time period within which a volunteer may churn.
- The number of map tasks (NMT) and the number of reduce tasks (NRT);
- The computing load of each task (CLET);
- The size of a map/reduce task or of a map/reduce result set;
- The lookup time of a map task or a reduce task on the overlay;
- The communication speed.
- The compute intensity, e.g., if the NMT is 1,400,000 (1.4 M) and NRT is 280,000 (0.28 M) and the CLET is 8000 (8 K) time units, the overall computing load is (1.4 M + 0.28 M) × 8000 = 11.2 G + 2.24 G = 13.44 G time units.
- The problem scale, e.g., if the size of a map or a reduce task or a map or a reduce result set is 64 MB, the overall data size to be processed is 1,400,000 × 64 + 280,000 × 64 = 107,520,000 MB ≈ 108 TB.
- The communication intensity, e.g., if the communication speed falls into the 5 download/upload speed tiers (in Mbps) of 12/1, 25/5, 25/10, 50/20 and 100/40, as provided by the Australian National Broadband Network (NBN), the download/upload time (in seconds) of a 64 MB dataset falls into the 5 tiers 43/512, 20/102, 20/51, 10/26 and 5/13.
- The problem property: by varying NMT and NRT, data aggregation (input > output), data expansion (input < output), data transformation (input ≈ output) and data summary (input >> output) [24] applications can be configured, e.g., if NRT/NMT = 20%, a data aggregation application is set.
- Overlay setting: in the format of (H, DUS, RTT, CR, SP, OI, MRR, RF), the overlay dynamics setting is (6 tiers, 20/51, 8, 30%, 250 K, 30, 20%, 200).
- Workload setting: for 1,400,000 (1.4 M) map tasks, 280,000 (0.28 M) reduce tasks (if 20% MRR is assumed) and the computing load of each map or reduce task of 8000, the computing workload is 13.44 G in total.
- Data setting: for each map or reduce task or a result set of 64 MB, the total amount of data to be processed is about 108 TB (89.6 TB of map + 17.92 TB of reduce if 20% MRR is assumed).
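As a cross-check of the arithmetic behind the three settings above, they can be encoded in a short script. This is a sketch, not code from the paper; all variable names are illustrative.

```python
# Sketch: encode the example settings of Section 3.3 and reproduce the
# totals quoted above. All names are illustrative, not from the paper.

NMT = 1_400_000          # number of map tasks
MRR = 0.20               # map/reduce ratio (a data aggregation application)
NRT = int(NMT * MRR)     # number of reduce tasks -> 280,000
CLET = 8_000             # computing load of each task, in time units
TASK_MB = 64             # size of each map/reduce task or result set, in MB

# Overall computing load: (1.4 M + 0.28 M) x 8 K = 13.44 G time units
workload = (NMT + NRT) * CLET

# Overall data size: (1,400,000 + 280,000) x 64 MB = 107,520,000 MB ~ 108 TB
data_mb = (NMT + NRT) * TASK_MB

# Download/upload time (s) of a 64 MB dataset on the five NBN Mbps tiers
def transfer_seconds(mbps, size_mb=TASK_MB):
    return size_mb * 8 / mbps          # MB -> Mb, then divide by Mbps

nbn_tiers = [(12, 1), (25, 5), (25, 10), (50, 20), (100, 40)]
dus = [(round(transfer_seconds(d)), round(transfer_seconds(u)))
       for d, u in nbn_tiers]   # [(43, 512), (20, 102), (20, 51), (10, 26), (5, 13)]
```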
4. Misconception and Clarification
4.1. Misconception 1
4.2. Misconception 2
HV1(1; 6666), HV2(2; 13,333), HV3(3; 20,000), HV4(4; 26,666), HV5(5; 33,333), HV6(6; 40,000), HV7(7; 46,666), HV8(8; 53,333), HV9(9; 60,000), HV10(10; 66,666), HV11(11; 73,333), HV12(12; 80,000), HV13(13; 86,666), HV14(14; 93,333), HV15(15; 100,000).
HC1(HV1→HV2), HC2(HV2→HV3), HC3(HV3→HV4), HC4(HV4→HV5), HC5(HV5→HV6), HC6(HV6→HV7), HC7(HV7→HV8), HC8(HV8→HV9), HC9(HV9→HV10), HC10(HV10→HV11), HC11(HV11→HV12), HC12(HV12→HV13), HC13(HV13→HV14), HC14(HV14→HV15).
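The prose surrounding these points is not preserved here, so the exact meaning of HV/HC must be inferred; the fifteen HV values are, however, evenly spaced between 6666 and 100,000, consistent with taking HV_i = (i; ⌊i × 100,000/15⌋), with each HC_i the segment between consecutive HV points. A hypothetical regeneration of the list under that assumption:

```python
# Sketch (assumption): regenerate the HV points, taking the i-th point to be
# (i; floor(i * 100000 / 15)); HC_i is the segment from HV_i to HV_(i+1).
hv = [(i, i * 100_000 // 15) for i in range(1, 16)]
hc = [(f"HV{i}", f"HV{i + 1}") for i in range(1, 15)]
```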
4.3. Misconception 3
4.4. Misconception 4
5. Optimization Strategy
5.1. Optimization Strategy
- Assumption: the dynamics of volunteers, the workload and the dataset of a big data problem are given in the format of the three settings in Section 3.3.
- Goal: an overlay structure that can achieve optimal performance for the given big data problem and the dynamics of volunteers.
5.2. Case Study
- The volunteer dynamics of 30% churn and heterogeneity of 6 tiers as in the overlay setting in Section 3.3;
- The dataset of 108 TB as in the data setting in Section 3.3;
- The overall speedup is: (11,200,000,000 + 2,240,000,000)/(1,854,767 + 413,697) = 5925 times;
- The overall speedup growth rate is: ((5925 − 4267)/4267) × 100% = 38.86%;
- The overall improvement is: 5925 − 4267 = 1658.
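The three figures above follow directly from the case-study numbers; a minimal check (a sketch; `baseline_speedup` is the 4267 figure the growth rate is measured against in the text):

```python
# Sketch: reproduce the overall speedup, growth rate and improvement quoted above.
map_load, reduce_load = 11_200_000_000, 2_240_000_000   # sequential load (time units)
map_time, reduce_time = 1_854_767, 413_697              # parallel time (time units)
baseline_speedup = 4267                                 # speedup compared against

speedup = round((map_load + reduce_load) / (map_time + reduce_time))           # 5925
growth_rate = round((speedup - baseline_speedup) / baseline_speedup * 100, 2)  # 38.86
improvement = speedup - baseline_speedup                                       # 1658
```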
- The compute capacities of volunteers are evenly distributed across tiers, and
- The compute capacities of volunteers can be evenly distributed into each overlay,
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Oracle. An Enterprise Architect’s Guide to Big Data—Reference Architecture Overview. Oracle Enterprise Architecture White Paper. 2016. Available online: http://www.oracle.com/technetwork/topics/entarch/articles/oea-big-data-guide-1522052.pdf (accessed on 12 March 2021).
2. Sarmenta, L. Volunteer Computing. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
3. ATLAS@Home. Available online: http://lhcathome.web.cern.ch/projects/atlas (accessed on 12 March 2021).
4. Asteroids@home. 2021. Available online: http://asteroidsathome.net/ (accessed on 12 March 2021).
5. Einstein@Home. Available online: https://einsteinathome.org/ (accessed on 12 March 2021).
6. Li, W.; Guo, W.; Li, M. The Impact Factors on the Competence of Big Data Processing. Int. J. Comput. Appl. 2020.
7. Casado, R. The Three Generations of Big Data Processing. 2013. Available online: https://www.slideshare.net/Datadopter/the-three-generations-of-big-data-processing (accessed on 12 March 2021).
8. Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 2008, 51, 107–113.
9. Stoica, I.; Morris, R.; Liben-Nowell, D.; Karger, D.R.; Kaashoek, M.F.; Dabek, F.; Balakrishnan, H. Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications. IEEE/ACM Trans. Netw. 2003, 11, 17–32.
10. Kaffille, S.; Loesing, K. Open Chord (1.0.4) User’s Manual 2007; The University of Bamberg: Bamberg, Germany, 2007. Available online: https://sourceforge.net/projects/open-chord/ (accessed on 12 March 2021).
11. Fadika, Z.; Govindaraju, M.; Canon, R.; Ramakrishnan, L. Evaluating Hadoop for Data-Intensive Scientific Operations. In Proceedings of the IEEE 5th International Conference on Cloud Computing, Honolulu, HI, USA, 24–29 June 2012; pp. 67–74.
12. Dede, E.; Fadika, Z.; Govindaraju, M.; Ramakrishnan, L. Benchmarking MapReduce Implementations under Different Application Scenarios. Future Gener. Comput. Syst. 2014, 36, 389–399.
13. Cheng, D.; Rao, J.; Guo, Y.; Jiang, C.; Zhou, X. Improving Performance of Heterogeneous MapReduce Clusters with Adaptive Task Tuning. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 774–786.
14. Hadoop. Available online: https://cwiki.apache.org/confluence/display/HADOOP2/ProjectDescription (accessed on 12 March 2021).
15. Jothi, A.; Indumathy, P. Increasing Performance of Parallel and Distributed Systems in High Performance Computing using Weight Based Approach. In Proceedings of the International Conference on Circuits, Power and Computing Technologies, Nagercoil, India, 19–20 March 2015.
16. Yildiz, O.; Ibrahim, S.; Antoniu, G. Enabling Fast Failure Recovery in Shared Hadoop Clusters: Towards Failure-aware Scheduling. Future Gener. Comput. Syst. 2017, 74, 208–219.
17. Singh, S.; Garg, R.; Mishra, P.K. Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster. In Proceedings of the International Conference on Computing, Communication and Automation, Greater Noida, India, 29–30 April 2016; pp. 87–94.
18. Ardagna, D.; Bernardi, S.; Gianniti, E.; Aliabadi, S.; Perez-Palacin, D.; Requeno, J. Modeling Performance of Hadoop Applications: A Journey from Queueing Networks to Stochastic Well-Formed Nets. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing, Granada, Spain, 14–16 December 2016; pp. 599–613.
19. Perarnau, S.; Sato, M. Victim Selection and Distributed Work Stealing Performance: A Case Study. In Proceedings of the 28th IEEE International Symposium on Parallel and Distributed Processing, Phoenix, AZ, USA, 19–23 May 2014; pp. 659–668.
20. Vu, T.T.; Derbel, B. Link-Heterogeneous Work Stealing. In Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Chicago, IL, USA, 26–29 May 2014; pp. 354–363.
21. Zhang, X.; Wu, Y.; Zhao, C. MrHeter: Improving MapReduce Performance in Heterogeneous Environments. Clust. Comput. 2016, 19, 1691–1701.
22. Li, W.; Guo, W. The Optimization Potential of Volunteer Computing for Compute or Data Intensive Applications. J. Commun. 2019, 14, 971–979.
23. Zhang, X.; Qin, Y.; Yuen, C.; Jayasinghe, L.; Liu, X. Time-Series Regeneration with Convolutional Recurrent Generative Adversarial Network for Remaining Useful Life Estimation. IEEE Trans. Ind. Inform. 2021, 17, 6820–6831.
24. Chen, Y.; Ganapathi, A.; Griffith, R.; Katz, R. The Case for Evaluating MapReduce Performance Using Workload Suites. In Proceedings of the 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, Singapore, 25–27 July 2011; pp. 390–399.
Table: The impact of volunteer dynamics on the workflow of each task.

| Dynamics | Lookup | Download | Compute | Upload |
|---|---|---|---|---|
| Join or re-join | No impact | - | - | - |
| Leave | No impact | No impact | Checkpointed | Must be done |
| Crash | No impact | No impact | Must be redone | Must be redone |
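The table can be turned into a small lookup structure that a scheduler might consult after a churn event. This is an illustrative sketch, not code from the paper:

```python
# Sketch: the dynamics-impact table above as a lookup structure. None marks
# a cell that does not apply (a joining volunteer has no in-flight task).
IMPACT = {
    "join":  {"lookup": "no impact", "download": None,
              "compute": None, "upload": None},
    "leave": {"lookup": "no impact", "download": "no impact",
              "compute": "checkpointed", "upload": "must be done"},
    "crash": {"lookup": "no impact", "download": "no impact",
              "compute": "must be redone", "upload": "must be redone"},
}

def phases_to_redo(event):
    """Workflow phases that still must be executed after the given event."""
    return [phase for phase, effect in IMPACT[event].items()
            if effect in ("must be done", "must be redone")]
```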
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, W.; Tang, M. The Optimization Strategies on Clarification of the Misconceptions of Big Data Processing in Dynamic and Opportunistic Environments. Big Data Cogn. Comput. 2021, 5, 38. https://doi.org/10.3390/bdcc5030038