Feature Selection for Improving Failure Detection in Hard Disk Drives Using a Genetic Algorithm and Significance Scores
Abstract
1. Introduction
2. The Proposed Methodology
2.1. Feature Selection
- Filter Based Methods: These methods use a fitness function to evaluate and rank the features, and then select the subset of features whose fitness values exceed a certain threshold. In essence, they filter out the bad features before the machine learning model is constructed. This approach is usually more efficient [28,29], but its performance depends on the quality of the fitness function. The feature selection method used in this study can be categorized as a filter based method.
- Wrapper Based Methods: These methods do not filter out the bad features before constructing the machine learning model. Instead, they use the classifier itself to identify them. For example, the classifier may be trained on different combinations of features, and the combination that yields the highest classification accuracy is selected as the best set of features. This approach can be very time consuming and may only result in a sub-optimal solution [28,29].
- Embedded or Hybrid Methods: As the name suggests, these methods combine aspects of the other two. Unlike wrapper based methods, which iterate through different combinations of features and select the best subset on the basis of classifier accuracy, they do not use the classifier iteratively, which makes them faster. Unlike filter based methods, they do not use a separate fitness function to rank the features. Instead, they may use the output of the classifier itself to select the best subset of features. For example, the weights assigned to the inputs (features) in logistic regression or a neural network may be used to rank them and select the best subset.
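The contrast between the filter and wrapper approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the study: the toy data, the Fisher-style score used as the fitness function, the threshold value, and the nearest-centroid classifier that stands in for a real model. The embedded approach is not shown.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical toy data: rows are samples, columns are features.
# Labels: 0 = healthy, 1 = failed. Only feature 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.8, 0.9],
    [0.9, 5.1, 0.4],
    [3.0, 5.0, 0.3],
    [3.2, 4.9, 0.8],
    [2.9, 5.2, 0.5],
]
y = [0, 0, 0, 1, 1, 1]


def fisher_score(column, labels):
    """Illustrative fitness function: squared gap between the class means
    divided by the sum of the class variances."""
    a = [v for v, l in zip(column, labels) if l == 0]
    b = [v for v, l in zip(column, labels) if l == 1]
    spread = pstdev(a) ** 2 + pstdev(b) ** 2
    return (mean(a) - mean(b)) ** 2 / (spread + 1e-12)


def filter_select(X, y, threshold):
    """Filter approach: score each feature on its own, without a classifier,
    and keep the features whose score exceeds the threshold."""
    columns = list(zip(*X))
    return [j for j, col in enumerate(columns) if fisher_score(col, y) > threshold]


def centroid_accuracy(X, y, features):
    """Stand-in classifier: nearest class centroid on the chosen features.
    Accuracy is measured on the training data to keep the sketch short."""
    rows = [[row[j] for j in features] for row in X]
    centroids = {}
    for label in set(y):
        members = [r for r, l in zip(rows, y) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def predict(r):
        return min(
            centroids,
            key=lambda c: sum((u - v) ** 2 for u, v in zip(r, centroids[c])),
        )

    return mean(1.0 if predict(r) == l else 0.0 for r, l in zip(rows, y))


def wrapper_select(X, y):
    """Wrapper approach: evaluate the classifier on every non-empty feature
    subset and keep the first subset with the highest accuracy."""
    n = len(X[0])
    best_subset, best_acc = None, -1.0
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            acc = centroid_accuracy(X, y, subset)
            if acc > best_acc:
                best_subset, best_acc = subset, acc
    return best_subset, best_acc


print(filter_select(X, y, threshold=1.0))
print(wrapper_select(X, y))
```

On this toy data both approaches keep only feature 0, but the filter scores each of the three features once, whereas the exhaustive wrapper runs the classifier on all seven non-empty subsets. The number of subsets doubles with every added feature, which is why wrapper methods become time consuming and in practice rely on a search heuristic that may return a sub-optimal subset.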
2.2. Feature Selection Using a Genetic Algorithm
2.3. Feature Selection Using Significance Scores
2.4. Classification Using the Naive Bayes Classifier
3. The SMART Dataset
4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
AE | Acoustic emission |
AFR | Annual failure rate |
ARR | Annual (disk) replacement rate |
BIOS | Basic input output system |
CRC | Cyclic redundancy check |
DFP | Disk failure prediction |
DMA | Direct memory access |
FAR | False alarm rate |
FC | Fiber channel |
FDR | Failure detection rate |
FPR | False positive rate |
GA | Genetic algorithm |
HDD | Hard disk drive |
IT | Information technology |
MTTF | Mean time to failure |
MVMN | Multi-variate multi-nomial |
NB | Naive Bayes |
SATA | Serial advanced technology attachment |
SCSI | Small computer system interface |
SMART | Self-monitoring, analysis and reporting technology |
SVM | Support vector machine |
ROC | Receiver operating characteristic |
TPR | True positive rate |
References
- Coughlin, T. Near and Far—Digital Storage Supporting Today’s Mobile Devices [The Art of Storage]. IEEE Consum. Electron. Mag. 2014, 3, 64–67.
- Gantz, J.; Reinsel, D. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC IView IDC Anal. Future 2012, 2007, 1–16.
- Schroeder, B.; Gibson, G.A. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage (TOS) 2007, 3, 8-es.
- Schroeder, B.; Gibson, G.A. Understanding failures in petascale computers. J. Phys. Conf. Ser. 2007, 78, 012022.
- Hard Drive Data and Stats Volume Q1–Q3 2015. Available online: https://www.backblaze.com/b2/hard-disk-test-data.html (accessed on 30 October 2016).
- Schroeder, B.; Gibson, G.A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST; USENIX: San Jose, CA, USA, 2007; Volume 7, pp. 1–16.
- Sankar, S.; Shaw, M.; Vaid, K.; Gurumurthi, S. Datacenter scale evaluation of the impact of temperature on hard disk drive failures. ACM Trans. Storage (TOS) 2013, 9, 1–24.
- Wang, G.; Zhang, L.; Xu, W. What can we learn from four years of data center hardware failures? In Proceedings of the 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Denver, CO, USA, 26–29 June 2017; pp. 25–36.
- Kamarthi, S.; Zeid, A.; Bagul, Y. Assessment of current health of hard disk drives. In Proceedings of the 2009 IEEE International Conference on Automation Science and Engineering, Vancouver, BC, Canada, 22–26 August 2009; pp. 246–249.
- Jiang, W.; Hu, C.; Zhou, Y.; Kanevsky, A. Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. ACM Trans. Storage (TOS) 2008, 4, 1–25.
- Ottem, E.; Plummer, J. Playing It SMART: The Emergence of Reliability Prediction Technology; Technical Report, Seagate Technology Paper; Seagate Technology: Scotts Valley, CA, USA, 1995.
- Hamerly, G.; Elkan, C. Bayesian approaches to failure prediction for disk drives. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, pp. 202–209.
- Hughes, G.F.; Murray, J.F.; Kreutz-Delgado, K.; Elkan, C. Improved disk-drive failure warnings. IEEE Trans. Reliab. 2002, 51, 350–357.
- Murray, J.F.; Hughes, G.F.; Kreutz-Delgado, K. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res. 2005, 6, 783–816.
- Henry, R.K. Monitoring PC Hardware Sounds in Linux Systems Using the Daubechies D4 Wavelet. Master’s Thesis, East Tennessee State University, Tennessee, TN, USA, 2005.
- Pinheiro, E.; Weber, W.D.; Barroso, L.A. Failure Trends in a Large Disk Drive Population; USENIX: San Jose, CA, USA, 2007.
- Wang, Y.; Miao, Q.; Pecht, M. Health monitoring of hard disk drive based on Mahalanobis distance. In Proceedings of the 2011 Prognostics and System Health Management Conference, Shenzhen, China, 24–25 May 2011; pp. 1–8.
- Wang, Y.; Ma, E.W.; Chow, T.W.; Tsui, K.L. A two-step parametric method for failure prediction in hard disk drives. IEEE Trans. Ind. Inform. 2013, 10, 419–430.
- Qian, J.; Skelton, S.; Moore, J.; Jiang, H. P3: Priority based proactive prediction for soon-to-fail disks. In Proceedings of the 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), Boston, MA, USA, 6–7 August 2015; pp. 81–86.
- Botezatu, M.M.; Giurgiu, I.; Bogojeska, J.; Wiesmann, D. Predicting disk replacement towards reliable data centers. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 39–48.
- Zhang, S.; Bahrampour, S.; Ramakrishnan, N.; Shah, M. Deep Symbolic Representation Learning for Heterogeneous Time-series Classification. arXiv 2016, arXiv:1612.01254.
- Black, R.; Donnelly, A.; Harper, D.; Ogus, A.; Rowstron, A. Feeding the pelican: Using archival hard drives for cold storage racks. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), Denver, CO, USA, 20–21 June 2016.
- Zhang, T.; Wang, E.; Zhang, D. Predicting failures in hard drivers based on isolation forest algorithm using sliding window. J. Phys. Conf. Ser. 2019, 1187, 042084.
- Huang, S.; Liang, S.; Fu, S.; Shi, W.; Tiwari, D.; Chen, H.B. Characterizing disk health degradation and proactively protecting against disk failures for reliable storage systems. In Proceedings of the 2019 IEEE International Conference on Autonomic Computing (ICAC), Umea, Sweden, 16–20 June 2019; pp. 157–166.
- Cantu-Paz, E. Feature subset selection, class separability, and genetic algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference, Seattle, WA, USA, 26–30 June 2004; pp. 959–970.
- Saberi, M. Feature selection method using genetic algorithm for the classification of small and high dimension data. Proc. Int. Symp. Info. Com. Tech. 2004, 13–16.
- Min, S.H.; Lee, J.; Han, I. Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Syst. Appl. 2006, 31, 652–660.
- Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
- Khalid, S.; Khalil, T.; Nasreen, S. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the 2014 Science and Information Conference, Marrakesh, Morocco, 28–30 May 2014; pp. 372–378.
- Rida, I.; Al-Maadeed, N.; Al-Maadeed, S.; Bakshi, S. A comprehensive overview of feature representation for biometric recognition. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–24.
- Rida, I.; Al Maadeed, S.; Bouridane, A. Unsupervised feature selection method for improved human gait recognition. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1128–1132.
- Chen, Y.; Li, Y.; Cheng, X.Q.; Guo, L. Survey and taxonomy of feature selection algorithms in intrusion detection system. In Proceedings of the International Conference on Information Security and Cryptology, Busan, Korea, 30 November–1 December 2006; pp. 153–167.
- Rida, I.; Boubchir, L.; Al-Maadeed, N.; Al-Maadeed, S.; Bouridane, A. Robust model-free gait recognition by statistical dependency feature selection and globality-locality preserving projections. In Proceedings of the 2016 39th International Conference on Telecommunications and Signal Processing (TSP), Vienna, Austria, 27–29 June 2016; pp. 652–655.
- Okimoto, L.C.; Lorena, A.C. Data complexity measures in feature selection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
- Ng, A.Y.; Jordan, M.I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 9–14 December 2002; pp. 841–848.
- Japkowicz, N. The class imbalance problem: Significance and strategies. In Proceedings of the International Conference on Artificial Intelligence, Melbourne, Australia, 28 August–1 September 2000; Volume 1, pp. 111–117.
- Chawla, N.V.; Japkowicz, N.; Kotcz, A. Special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 2004, 6, 1–6.
- Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
- Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2008, 39, 539–550.
- Klein, A. Backblaze Hard Drive Stats for 2016. Available online: https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/ (accessed on 13 April 2017).
S. No. | SMART ID | Attribute Name |
---|---|---|
1 | 3 | Spin-up Time |
2 | 4 | Start/Stop Count |
3 | 7 | Seek Error Rate |
4 | 10 | Spin Retry Count |
5 | 12 | Power Cycle Count |
6 | 187 | Reported Uncorrected Errors |
7 | 189 | High Fly Writes |
8 | 193 | Load/Unload Cycle Count |
9 | 194 | Temperature |
10 | 197 | Current Pending Sector Count |
11 | 198 | Uncorrectable Sector Count |
12 | 199 | UltraDMA CRC Error Count |
S. No. | SMART ID | Attribute Name |
---|---|---|
1 | 4 | Start/Stop Count |
2 | 7 | Seek Error Rate |
3 | 12 | Power Cycle Count |
4 | 189 | High Fly Writes |
5 | 193 | Load/Unload Cycle Count |
6 | 194 | Temperature |
7 | 197 | Current Pending Sector Count |
8 | 198 | Uncorrectable Sector Count |
9 | 199 | UltraDMA CRC Error Count |
Method | Feature Vector Dimensionality | No. of Folds for Cross Validation | No. of Test Iterations | False Positive Rate (%) | True Positive Rate (%) | Average Accuracy (%) |
---|---|---|---|---|---|---|
Naive Bayes with No Feature Selection | 42 | 3 | 10 | | | |
Naive Bayes with GA only | 12 | 3 | 10 | | | |
Naive Bayes with Proposed Two-Tier Method | 9 | 3 | 10 | | | |
SVM with No Feature Selection | 42 | 3 | 10 | | | |
SVM with GA only | 12 | 3 | 10 | | | |
SVM with Proposed Two-Tier Method | 9 | 3 | 10 | | | |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Ahmad, W.; Khan, S.A.; Kim, C.H.; Kim, J.-M. Feature Selection for Improving Failure Detection in Hard Disk Drives Using a Genetic Algorithm and Significance Scores. Appl. Sci. 2020, 10, 3200. https://doi.org/10.3390/app10093200