4.1. Dataset
The dataset used in the experiment comes from the public hard disk dataset on the Backblaze official website. Backblaze has been compiling and reviewing hard disk data since 2013, posting quarterly and year-end statistics on hard disk usage from various manufacturers and comparing them with previous years to draw conclusions. The company also makes the raw hard disk dataset available for viewing and research.
The data items included in the BackBlaze hard disk dataset are shown in
Table 2. Among them, the date, serial number, model, capacity, and failure flag are always present and have the same meaning for all hard disks, while the S.M.A.R.T. feature sets that follow vary by hard disk model. Because manufacturers define and emphasize their own S.M.A.R.T. features differently, it is normal for the same S.M.A.R.T. feature to carry different meanings across manufacturers. Therefore, on the premise that the number of failed samples is not too small and the disk failure rate falls within the normal range, this experiment selects Seagate's ST4000DM000 hard disk over more than six years, from 2016 to 2022, as the experimental object. The statistical data of some years are shown in
Table 3.
Since the year 2022 had not yet ended, only data from its first two quarters, January to June, were collected, yielding a total of 268 failed samples; the annual failure rate for 2022 is therefore not yet known.
As can be seen from the statistics, the annual failure rate of ST4000DM000 disks over these six years is basically around 2%. Statistically, the annual failure rate of hard disks under normal circumstances should be between 0.5% and 2%. This model was therefore chosen because its failure rate sits at the high end of the normal range, so failed samples are relatively abundant and the amount of data meets our requirements.
4.4. Experiments and Results
The experiments were run on Windows 10 with an Intel Core i5-11260H CPU @ 2.60 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 3050 laptop GPU. The deep learning framework used was PyTorch 1.10.2.
Using the failed samples of the Seagate ST4000DM000 hard disks together with the samples from the 14 days before each anomaly, spanning 2016 to 2022, a total of 12 experimental groups with non-overlapping data were constructed to explore the results under different ratios of training set to test set. After identical standardization of the data, random forest, GBDT, XGBoost, and AdaBoost classifiers were used as the first layer and a BP neural network as the second layer, recorded as blending group A. Logistic regression, k-nearest neighbors, a support-vector machine, and a Gaussian naive Bayes classifier were used as the first layer, again with a BP neural network as the second layer, recorded as blending group B. The results of these two groups were compared with those obtained using traditional methods.
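The blending layout described above can be sketched as follows. This is a minimal, self-contained illustration of the group A structure, with scikit-learn stand-ins: XGBoost is omitted to avoid an extra dependency, an `MLPClassifier` stands in for the BP network, and the synthetic data and hyperparameters are illustrative, not the paper's settings.

```python
# Minimal blending sketch (group A layout): first-layer classifiers feed
# their predicted probabilities to a second-layer network.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
# Blending holds out a validation split; the second layer trains only on it.
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold,
                                                test_size=0.5, random_state=0)

base = [RandomForestClassifier(random_state=0),
        GradientBoostingClassifier(random_state=0),  # GBDT
        AdaBoostClassifier(random_state=0)]          # XGBoost omitted here
for clf in base:
    clf.fit(X_tr, y_tr)

# First-layer probability outputs become the second layer's features.
meta_val = np.column_stack([c.predict_proba(X_val)[:, 1] for c in base])
meta_test = np.column_stack([c.predict_proba(X_test)[:, 1] for c in base])

meta = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
meta.fit(meta_val, y_val)            # stands in for the BP network
acc = meta.score(meta_test, y_test)
print(f"blended test accuracy: {acc:.3f}")
```

The key structural point is that the second-layer model never sees the first layer's training rows, only the held-out validation rows, which prevents the meta-features from being overly optimistic.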
Before that, in order to further verify the validity of our experiment results, we tried to reproduce a three-layer stacked LSTM model [
6] for comparison experiments, but as shown in
Figure 8, its loss function presented an extremely unstable state, and the oscillation was relieved when the model was reduced to two layers.
Therefore, we modified it. With only one LSTM layer and one dropout layer, the loss curve behaved much more normally (as shown in
Figure 9), and the test results were better than those of the more complex model. We speculate that the standardized data were too simple in form, so the complex model exhibited severe oscillation spikes and degraded performance; the simpler model is therefore better suited to this kind of data.
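The simplified detector can be sketched in PyTorch roughly as follows. The hidden size, dropout rate, and input width (here 12 features over the 14-day window) are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the simplified model: one LSTM layer, one dropout layer, and a
# linear classification head (healthy vs. failing).
import torch
import torch.nn as nn

class SimpleLSTM(nn.Module):
    def __init__(self, n_features=12, hidden=64, p_drop=0.5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=1,
                            batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):                # x: (batch, 14 days, n_features)
        out, _ = self.lstm(x)
        last = self.drop(out[:, -1, :])  # keep only the last time step
        return self.head(last)

model = SimpleLSTM()
logits = model(torch.randn(8, 14, 12))   # a mini-batch of 8 sequences
print(logits.shape)                      # torch.Size([8, 2])
```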
In addition, to further verify the rationality of the blending ensemble learning method, we added LSTM to the base learners to obtain better results and metrics. We set up two additional experimental groups by removing, from each of the original two blending groups, the base learner with the least stable classification performance or the least impact on the final result, and replacing it with LSTM, thus obtaining blending group C and blending group D, respectively. Based on observation and analysis of the intermediate results during the experiments, we replaced XGBoost in the base learners of blending group A with LSTM, and the Gaussian naive Bayes classifier in the base learners of blending group B with LSTM, obtaining two new ensemble learning models.
Table A1 and
Table A2 in the appendix give the parameter values of some algorithms, and the algorithms not mentioned use default parameters.
Experiments were carried out once everything was ready. The processed datasets for each year were divided into 12 groups, and repeated experiments were conducted, with the results averaged to make them more convincing. The experimental results are shown in
Table 4 and
Figure 10.
From the experimental results, we can see that the blending ensemble learning method significantly improves the evaluation metrics. In most experimental groups, the LSTM model outperforms the BP neural network alone but not the blending ensemble methods. In addition, from the results in
Table 4, it can be seen that in most cases adding LSTM to the base learners significantly improves performance. The line chart in
Figure 11 shows this change in performance more clearly, demonstrating the effectiveness of the method. However, it should be noted that the LSTM model alone achieved the best results in a few individual experimental groups. We conjecture that, under this dataset processing, a training/test ratio that is too large or too small makes the LSTM model perform poorly or the results unreliable. Because the total amount of data is fixed, when the training proportion is too large, the test set is too small, making the test results insufficiently persuasive; conversely, when the test proportion is too large, the training set is too small for sufficient model training. When the ratio is between 1:1 and 1:0.5, the LSTM network performs best, and around this range the model behaves as expected. However, this conclusion is only a conjecture and needs to be verified by subsequent experiments.
In addition, compared with using the BP neural network directly, the blending ensemble learning method also greatly reduces running time and improves the efficiency of training and testing.
Table 5 shows the training time of each model in some experimental groups. All BP neural networks were trained for 10,000 epochs.
The reason for this gap in training time is that the new training set used in the blending ensemble learning method is smaller. Because it is a part of the original training set, its size is actually that of the split validation set, and the complexity of the new training and test sets is reduced accordingly.
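The size argument can be made concrete with a small back-of-the-envelope calculation, using a hypothetical sample count and a 70/30 train/validation split together with the four base learners of a blending group:

```python
# Illustrative arithmetic (hypothetical numbers, not the paper's): the
# second-layer BP network trains only on the validation-sized meta-feature
# matrix, with one probability column per base learner.
n_train = 20000                   # hypothetical original training rows
val_frac = 0.3                    # hypothetical holdout fraction
n_meta_rows = int(n_train * val_frac)   # rows seen by the second layer
n_meta_features = 4               # one output per base learner
print(n_meta_rows, n_meta_features)     # 6000 4
```

The second layer thus fits on 6000 rows of 4 features instead of 20,000 rows of the full S.M.A.R.T. feature set, which accounts for the large drop in training time.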
Finally, to test the universality of our method, we counted data for six different models of hard disks. In addition to using Seagate hard disks from the same manufacturer as the training data, we counted hard disks from two different hard disk manufacturers, Hitachi and Toshiba. The statistics for the six groups of hard disks used for testing are shown in
Table 6. These six models cover three manufacturers, have capacities of 4 TB, 8 TB, 12 TB, and 16 TB, and the number of failed hard disks counted ranges from 79 to nearly 1000.
The datasets of the six new models were feature-selected and preprocessed in the same way, and all the ST4000DM000 hard disk data used in the experiments above served as the training set for testing the six new datasets in
Table 6. The results obtained are shown in
Table 7.
As can be seen from the results obtained in
Table 7, the model generalizes well; compared with the model obtained by the ensemble learning method, traditional machine learning or deep learning models such as random forest, the BP neural network, and LSTM show poorer robustness. The most obvious improvement from adding LSTM to the blending method appears in the experiment on the Hitachi hard disk. It is worth mentioning that the S.M.A.R.T. feature set selected for the Hitachi hard disks differs considerably from Seagate's, so the initial performance was poor. However, since the vast majority of S.M.A.R.T. features are shared across hard disk models, aligning the features after comparing the previously obtained heatmaps (as shown in
Figure 6) can greatly improve the test results. Therefore, when testing other models of hard disks, the most important step is to align the highly correlated features among those selected, so that good results can be obtained.
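The alignment step can be sketched with pandas as follows; the column names and values here are made up for illustration, and the idea is simply to keep only the S.M.A.R.T. columns shared by the training and test frames, in the training order.

```python
# Hedged sketch of cross-manufacturer feature alignment: restrict the test
# frame to the columns the trained model saw, preserving their order.
# Column names and values are illustrative, not the selected feature set.
import pandas as pd

train = pd.DataFrame({"smart_5_raw": [0, 1],
                      "smart_187_raw": [0, 2],
                      "smart_197_raw": [1, 0]})          # Seagate-style frame
test = pd.DataFrame({"smart_5_raw": [3],
                     "smart_197_raw": [0],
                     "smart_9_raw": [12000]})            # Hitachi-style frame

shared = [c for c in train.columns if c in test.columns]
test_aligned = test[shared]      # same columns, same order as training
print(list(test_aligned.columns))  # ['smart_5_raw', 'smart_197_raw']
```

Features present in the training set but absent from the test manufacturer (here the hypothetical `smart_187_raw`) would still need handling, e.g. by retraining without them or imputing, which is why alignment against the correlation heatmaps matters.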
Taking all the above experimental results into account, the blending ensemble learning method uses a smaller training set to achieve better results, and its running time is greatly reduced. In addition, experiments on other hard disk models, from both the same and different manufacturers, showed that the blending ensemble learning method does not overfit and has good universality. In summary, these experimental results demonstrate the rationality and effectiveness of the method.
4.5. Looking for the Best Match
In the above, we have proved that the blending ensemble learning method can improve the performance of the model. Next, we aim to find a combination of base learners that can achieve good performance on many hard disk models.
Since each base learner has different detection strengths and accuracies for positive and negative samples, the ensemble learning method combines the performance of the individual base learners, which can learn from and complement each other to yield a more stable and efficient model. Taking the hard disk model ST8000DM002 as an example, the numbers of false positives and false negatives in the results of each base learner are shown in
Table 8.
From
Table 8, we can see that the random forest method has the best, and very high, anomaly detection precision, but its drawback is an excessive number of false negatives. Therefore, we take random forest as one of the base learners and then select the three classification models with the lowest false-negative rates, namely LSTM, GBDT, and SVM, as the other three base learners. Using these four models to form a new ensemble learning model on the test set, the numbers of false positives and false negatives are 9 and 23, respectively, and the Matthews correlation coefficient is 0.9723992. Compared with the random forest model alone, although the number of false positives increases slightly, the number of false negatives is greatly reduced, and the overall performance exceeds that of each individual model. The experimental results confirmed our idea, and the following experiments continued along this line.
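The quantities used for this selection can be read directly off a confusion matrix; the following sketch uses made-up labels (not the paper's data) to show how the false-positive count, false-negative count, and Matthews correlation coefficient are obtained.

```python
# False positives, false negatives, and MCC from a confusion matrix.
# The label vectors below are illustrative only.
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]   # 1 = failed disk
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc = matthews_corrcoef(y_true, y_pred)
print(fp, fn, round(mcc, 3))               # 1 1 0.6
```

MCC is a convenient single criterion here because, unlike accuracy, it penalizes both error types even on the heavily imbalanced healthy/failed split.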
According to the above verification idea, experiments were carried out on the other five models of hard disks, and
Table 9 was obtained.
By observing the confusion matrices obtained from each method on each hard disk model, we found that their performance biases are basically stable, while logistic regression and KNN almost always produce large numbers of false positives and false negatives, so we excluded these two classifiers from further experiments. We experimented with multiple ensemble combinations based on these statistics, and the results are shown in
Table 9. It can be seen that the combination of base learners in the last column achieved good results on all six hard disk datasets. It is worth noting that AdaBoost achieved high performance on the Hitachi hard disk HMS5C4040BLE640 that the other methods did not, so combinations without AdaBoost performed poorly on this model. In addition, for the experiments on the Toshiba hard disk MG07ACA14TA, the results of the different combinations differed by only one or two samples. On the whole, therefore, the combination of random forest, SVM, AdaBoost, and LSTM gave the best results on most models, with the second-best results close behind.