Leveraging History to Predict Infrequent Abnormal Transfers in Distributed Workflows
Abstract
1. Introduction
- We present a set of stratified sampling techniques to address the challenge that there are far fewer slow transfers than normal ones. The best sampling approach achieves an F1 score of 0.926 for predicting under-performing transfers.
- We devise a strategy to capture the network state by utilizing information from recently completed file transfers. (Although this technique was central to earlier work [12], we highlight it here because it is effective and broadly applicable.) We define a set of features engineered from the most recent transfer between the same source/destination subnets, on the premise that transfers between similar locations and hosts tend to show similar behaviors.
- This study utilizes extensive network monitoring data from an active scientific computing facility. This larger data collection allows us to explore different sampling strategies and definitions of slow traffic, and to show the relative importance of individual features.
2. Data Description and Exploration
2.1. Description of tstat Data
2.2. High-Level Observations about Data Transfers
2.3. Data Cleaning and Statistics about tstat Data
3. Prediction Methodology
3.1. Defining “Slow” Transfers
3.2. Sampling Strategies
- Baseline (train1/test1): The baseline method is a uniform random sample of the 41 million large-file transfers. We drew 10,000 transfers from each of the training and testing periods and named the two subsets train1 and test1. Figure 4a shows a histogram of the train1 subset. Note that this subsample does not address the class-imbalance problem; in particular, test1 contains only three instances of slow transfers with throughput below the slow-transfer threshold.
- Stratified 2 (train2/test2): To address the class-imbalance problem, all of our stratified samplings keep every slow transfer and differ only in how they select normal transfers. Since there are 8986 slow transfers, we also select 8986 normal transfers. The simplest method is to sample the normal transfers uniformly. The subset selected this way from the training data (year 2021) is named train2, and the similarly selected subset from the year 2022 is named test2. The distribution of train2 is shown in Figure 4b, which reveals a clear throughput gap around the slow-transfer threshold. A model trained on this dataset might not learn where the actual decision boundary lies. Conversely, test results on test2 might look very good, since few samples near the decision boundary challenge the classifier.
- Stratified 3 (train3/test3): To place more data samples near the decision boundary, we employ another stratified sampling of the normal transfers. Specifically, we divide the normal transfers into bins based on the logarithm of their throughput; this binning was chosen after experimenting with several alternatives. Because the concern with train2 is that too few samples lie near the decision boundary, we deliberately concentrate samples there: from each bin we draw a number of normal transfers inversely proportional to the bin's lower boundary, which samples significantly more normal transfers with low throughput than with high throughput. The subset selected this way from the training data (year 2021) is named train3; we similarly created test3 from the 2022 testing data. The histogram of train3 is shown in Figure 4c. With this distribution, we anticipate that training will yield a more precise model, because more points sit near the decision boundary; for the same reason, testing on test3 should show lower performance, since test cases near the decision boundary are challenging for the classifier.
- Stratified 4 (train4): Another stratified sampling strategy we examine selects the same number of records from each logarithmic bin for normal transfers while keeping all slow transfers, producing a training sample named train4. The resulting distribution is shown in Figure 4d.
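The sampling strategies above can be sketched in a few lines; the following is a minimal illustration on synthetic data, where the lognormal throughput distribution, the 1e7 bps slow cutoff, and the column names are assumptions for illustration, not values from the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic transfer log: lognormal throughputs in bits per second.
# Distribution and slow-transfer cutoff are illustrative assumptions.
transfers = pd.DataFrame({"tput_bps": rng.lognormal(mean=17, sigma=3, size=100_000)})
SLOW_THRESHOLD = 1e7  # hypothetical "slow" cutoff (bps)

slow = transfers[transfers.tput_bps < SLOW_THRESHOLD]
normal = transfers[transfers.tput_bps >= SLOW_THRESHOLD]

# Stratified 2: keep every slow transfer and uniformly sample an equal
# number of normal transfers.
train2 = pd.concat([slow, normal.sample(len(slow), random_state=0)])

# Stratified 3: bin normal transfers by log10(throughput); draw from each
# bin a count inversely proportional to the bin's lower boundary, so most
# normal samples sit just above the decision boundary.
log_edges = np.arange(np.log10(SLOW_THRESHOLD), 12.0)        # 7, 8, ..., 11
bin_idx = np.digitize(np.log10(normal.tput_bps), log_edges)  # bins 1..5
weights = 1.0 / 10.0 ** log_edges                            # 1/lower boundary
counts = (weights / weights.sum() * len(slow)).astype(int)

parts = [slow]
for b, n in enumerate(counts, start=1):
    grp = normal[bin_idx == b]
    if len(grp) and n:
        parts.append(grp.sample(min(n, len(grp)), random_state=0))
train3 = pd.concat(parts)
```

Stratified 4 would replace the inverse-proportional `counts` with the same fixed count for every bin.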
3.3. Extracting Network States from Recently Completed Transfers
3.4. Prediction Algorithms
4. Evaluation
4.1. Experimental Setting
4.2. Best Model Performance
4.3. Impact of Features
4.4. Alternative Threshold Setting
5. Related Work
6. Conclusions and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Li, Z.; Huang, Q.; Jiang, Y.; Hu, F. SOVAS: A scalable online visual analytic system for big climate data analysis. Int. J. Geogr. Inf. Sci. 2020, 34, 1188–1209.
- Reiter, T.; Brooks, P.T.; Irber, L.; Joslin, S.E.; Reid, C.M.; Scott, C.; Brown, C.T.; Pierce-Ward, N.T. Streamlining data-intensive biology with workflow systems. GigaScience 2021, 10, giaa140.
- Alekseev, A.; Kiryanov, A.; Klimentov, A.; Korchuganova, T.; Mitsyn, V.; Oleynik, D.; Smirnov, A.; Smirnov, S.; Zarochentsev, A. Scientific Data Lake for High Luminosity LHC project and other data-intensive particle and astro-particle physics experiments. J. Phys. Conf. Ser. 2020, 1690, 012166.
- Beermann, T.; Chuchuk, O.; Di Girolamo, A.; Grigorieva, M.; Klimentov, A.; Lassnig, M.; Schulz, M.; Sciaba, A.; Tretyakov, E. Methods of Data Popularity Evaluation in the ATLAS Experiment at the LHC. EPJ Web Conf. 2021, 251, 02013.
- Behrmann, G.; Fuhrmann, P.; Grønager, M.; Kleist, J. A distributed storage system with dCache. J. Phys. Conf. Ser. 2008, 119, 062014.
- Enders, B.; Bard, D.; Snavely, C.; Gerhardt, L.; Lee, J.; Totzke, B.; Antypas, K.; Byna, S.; Cheema, R.; Cholia, S.; et al. Cross-facility science with the superfacility project at LBNL. In Proceedings of the 2020 IEEE/ACM 2nd Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP), Atlanta, GA, USA, 12 November 2020; pp. 1–7.
- Weaver, B.A.; Blanton, M.R.; Brinkmann, J.; Brownstein, J.R.; Stauffer, F. The Sloan digital sky survey data transfer infrastructure. Publ. Astron. Soc. Pac. 2015, 127, 397.
- Abbasi, M.; Shahraki, A.; Taherkordi, A. Deep Learning for Network Traffic Monitoring and Analysis (NTMA): A Survey. Comput. Commun. 2021, 170, 19–41.
- D’Alconzo, A.; Drago, I.; Morichetta, A.; Mellia, M.; Casas, P. A Survey on Big Data for Network Traffic Monitoring and Analysis. IEEE Trans. Netw. Serv. Manag. 2019, 16, 800–813.
- Chandrasekaran, B. Survey of Network Traffic Models; Technical Report CSE 567; Washington University in St. Louis: St. Louis, MO, USA, 2009.
- Juve, G.; Deelman, E. Scientific Workflows and Clouds. XRDS 2010, 16, 14–18.
- Shao, R.; Kim, J.; Sim, A.; Wu, K. Predicting Slow Network Transfers in Scientific Computing. In Proceedings of the Fifth International Workshop on Systems and Network Telemetry and Analytics, SNTA’22, Minneapolis, MN, USA, 30 June 2022; pp. 13–20.
- Finamore, A.; Mellia, M.; Meo, M.; Munafo, M.M.; Di Torino, P.; Rossi, D. Experiences of internet traffic monitoring with tstat. IEEE Netw. 2011, 25, 8–14.
- Kettimuthu, R.; Liu, Z.; Foster, I.; Beckman, P.H.; Sim, A.; Wu, K.; Liao, W.-k.; Kang, Q.; Agrawal, A.; Choudhary, A. Towards autonomic science infrastructure: Architecture, limitations, and open issues. In Proceedings of the 1st International Workshop on Autonomous Infrastructure for Science, Tempe, AZ, USA, 11 June 2018; pp. 1–9.
- Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259.
- Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
- Lu, D.; Qiao, Y.; Dinda, P.A.; Bustamante, F.E. Characterizing and predicting tcp throughput on the wide area network. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05), Columbus, OH, USA, 6–10 June 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 414–424.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Basat, R.B.; Einziger, G.; Friedman, R.; Kassner, Y. Optimal elephant flow detection. In Proceedings of the IEEE INFOCOM 2017-IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–9.
- Chhabra, A.; Kiran, M. Classifying elephant and mice flows in high-speed scientific networks. Proc. INDIS 2017, 1–8.
- Syal, A.; Lazar, A.; Kim, J.; Sim, A.; Wu, K. Automatic detection of network traffic anomalies and changes. In Proceedings of the ACM Workshop on Systems and Network Telemetry and Analytics, Phoenix, AZ, USA, 25 June 2019; pp. 3–10.
- Nakashima, M.; Sim, A.; Kim, J. Evaluation of Deep Learning Models for Network Performance Prediction for Scientific Facilities. In Proceedings of the 3rd International Workshop on Systems and Network Telemetry and Analytics, Stockholm, Sweden, 23 June 2020; pp. 53–56.
- Cai, W.; Encarnacion, R.; Chern, B.; Corbett-Davies, S.; Bogen, M.; Bergman, S.; Goel, S. Adaptive sampling strategies to construct equitable training datasets. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 1467–1478.
- Katharopoulos, A.; Fleuret, F. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2525–2534.
- Li, J.; Zhang, H.; Liu, Y.; Liu, Z. Semi-supervised machine learning framework for network intrusion detection. J. Supercomput. 2022, 78, 13122–13144.
- Andresini, G.; Appice, A.; Malerba, D. Autoencoder-based deep metric learning for network intrusion detection. Inf. Sci. 2021, 569, 706–727.
- Atefinia, R.; Ahmadi, M. Network intrusion detection using multi-architectural modular deep neural network. J. Supercomput. 2021, 77, 3571–3593.
- Dubey, R.; Zhou, J.; Wang, Y.; Thompson, P.M.; Ye, J. Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. NeuroImage 2014, 87, 220–241.
- Weinger, B.; Kim, J.; Sim, A.; Nakashima, M.; Moustafa, N.; Wu, K.J. Enhancing IoT anomaly detection performance for federated learning. Digit. Commun. Netw. 2022, 8, 314–323.
- Jan, M.A.; Zakarya, M.; Khan, M.; Mastorakis, S.; Menon, V.G.; Balasubramanian, V.; Rehman, A.U. An AI-enabled lightweight data fusion and load optimization approach for Internet of Things. Future Gener. Comput. Syst. 2021, 122, 40–51.
- Zebari, R.; Abdulazeez, A.; Zeebaree, D.; Zebari, D.; Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 2020, 1, 56–70.
- Wang, J.; Zhao, C.; He, S.; Gu, Y.; Alfarraj, O.; Abugabah, A. LogUAD: Log unsupervised anomaly detection based on Word2Vec. Comput. Syst. Sci. Eng. 2022, 41, 1207.
- Liu, C.; Li, K.; Li, K. A Game Approach to Multi-Servers Load Balancing with Load-Dependent Server Availability Consideration. IEEE Trans. Cloud Comput. 2021, 9, 1–13.
- Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125.
- Deng, A.; Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4027–4035.
- Choi, K.; Yi, J.; Park, C.; Yoon, S. Deep learning for anomaly detection in time-series data: Review, analysis, and guidelines. IEEE Access 2021, 9, 120043–120065.
DTN | Large Transfers | Total Transfers | Ratio |
---|---|---|---|
DTN01 | 737,498 | 172,657,723 | 0.43% |
DTN02 | 39,917,258 | 163,310,125 | 24.43% |
DTN03 | 137,272 | 86,034,525 | 0.16% |
DTN04 | 101,119 | 72,204,065 | 0.14% |
Total | 40,893,147 | 494,206,438 | 8.3% |
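The Ratio column above can be reproduced directly from the other two columns (the table rounds these percentages):

```python
# Large and total transfer counts per data transfer node, from the table.
large = {"DTN01": 737_498, "DTN02": 39_917_258, "DTN03": 137_272, "DTN04": 101_119}
total = {"DTN01": 172_657_723, "DTN02": 163_310_125, "DTN03": 86_034_525, "DTN04": 72_204_065}

# Percentage of large transfers per DTN, and overall.
ratios = {dtn: 100.0 * large[dtn] / total[dtn] for dtn in large}
overall = 100.0 * sum(large.values()) / sum(total.values())
```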
Dataset | Collection Year | Description |
---|---|---|
train1 | 2021 | uniform random sample from the population (non-stratified) |
train2 | 2021 | keep all slow transfers and randomly sample the same number of normal transfers |
train3 | 2021 | keep all slow transfers and sample normal transfers progressively per bin (placing more samples near the decision boundary) |
train4 | 2021 | keep all slow transfers and sample a fixed number of normal transfers per bin |
test1 | 2022 | uniform random sample from the population (non-stratified) |
test2 | 2022 | keep all slow transfers and randomly sample the same number of normal transfers |
test3 | 2022 | keep all slow transfers and sample normal transfers progressively per bin (placing more samples near the decision boundary) |
Feature | Description |
---|---|
prev_tput | Latest throughput measured between the same src/dst networks (“a.b.c.0”) |
prev_size | Latest transfer size (in bytes) between the same src/dst networks (“a.b.c.0”) |
size_ratio | Ratio between the latest transfer size (prev_size) vs. current transfer size |
prev_durat | Latest transfer duration (in msec) between the same src/dst networks (“a.b.c.0”) |
prev_min_rtt | Latest minimum RTT between the same src/dst networks (“a.b.c.0”) |
prev_rtt | Latest average RTT between the same src/dst networks (“a.b.c.0”) |
prev_max_rtt | Latest maximum RTT between the same src/dst networks (“a.b.c.0”) |
prev_retx_rate | Latest retransmission rate between the same src/dst networks (“a.b.c.0”) |
time_gap | Time gap between latest vs. current transfers between the same src/dst networks (“a.b.c.0”) |
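The prev_* features above amount to a group-by over the source/destination network ("a.b.c.0") pair followed by a one-row shift; the toy schema and raw column names below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical transfer log; feature names follow the table above, but
# the raw input schema here is an assumption for illustration.
log = pd.DataFrame({
    "ts":       pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:05",
                                "2021-01-01 00:07", "2021-01-01 00:20"]),
    "src_net":  ["10.0.1.0", "10.0.1.0", "192.168.2.0", "10.0.1.0"],
    "dst_net":  ["172.16.5.0", "172.16.5.0", "172.16.5.0", "172.16.5.0"],
    "tput_bps": [9.5e8, 1.2e9, 3.0e8, 8.8e8],
    "size_b":   [2e9, 4e9, 1e9, 3e9],
})

log = log.sort_values("ts")
grp = log.groupby(["src_net", "dst_net"])

# prev_* features: values of the most recent earlier transfer between the
# same src/dst networks; NaN when a pair has no history yet.
log["prev_tput"] = grp["tput_bps"].shift(1)
log["prev_size"] = grp["size_b"].shift(1)
log["size_ratio"] = log["prev_size"] / log["size_b"]
log["time_gap"] = (log["ts"] - grp["ts"].shift(1)).dt.total_seconds()
```

The remaining prev_* columns (RTTs, retransmission rate, duration) follow the same shift pattern.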
Train–Test Pair | F1 Score | Accuracy | Precision | Recall |
---|---|---|---|---|
train2-test2 | 0.926 | 0.925 | 0.921 | 0.931 |
train4-test2 | 0.907 | 0.907 | 0.905 | 0.908 |
train3-test2 | 0.885 | 0.888 | 0.914 | 0.857 |
train1-test2 | 0.875 | 0.888 | 0.988 | 0.785 |
train2-test3 | 0.709 | 0.590 | 0.550 | 0.996 |
train3-test3 | 0.682 | 0.533 | 0.518 | 0.998 |
train4-test3 | 0.682 | 0.533 | 0.518 | 0.998 |
train1-test1 | 0.566 | 0.999 | 0.778 | 0.444 |
train4-test1 | 0.415 | 0.998 | 0.303 | 0.659 |
train3-test1 | 0.340 | 0.998 | 0.285 | 0.421 |
train1-test3 | 0.252 | 0.559 | 0.845 | 0.148 |
train2-test1 | 0.234 | 0.995 | 0.145 | 0.603 |
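The F1 column in the table above is the harmonic mean of the precision and recall columns; for example, the train2-test2 and train1-test2 rows:

```python
def f1(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values taken from the results table above.
best = f1(0.921, 0.931)  # train2-test2 row
base = f1(0.988, 0.785)  # train1-test2 row
```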
Feature | Number of Occurrences |
---|---|
prev_tput | 12 |
country | 10 |
size_ratio | 9 |
prev_retx_rate | 7 |
prev_min_rtt | 6 |
prev_max_rtt | 6 |
Shao, R.; Sim, A.; Wu, K.; Kim, J. Leveraging History to Predict Infrequent Abnormal Transfers in Distributed Workflows. Sensors 2023, 23, 5485. https://doi.org/10.3390/s23125485