1. Introduction
Worldwide, digitally generated data have surpassed 180 ZB in 2025 [1]. Most of these data are stored in data centers, specialized facilities that house the computer systems used for sharing applications and data. Data centers primarily use hard-disk drives (HDDs) for storage, with 90% of data stored on HDDs and the remaining 10% on SSDs. The main advantage of HDDs is their large capacity and low cost per unit of stored data, making them ideal for general data storage and data backup. Unlike HDDs, SSDs have no moving parts, which enables them to achieve better performance and reliability, and they are typically used in low-latency cloud-based applications. SSDs store data using FLASH cells, in which bits are stored as the charge trapped inside the floating gate of a MOSFET. The simplest FLASH cell, known as a Single-Level Cell (SLC), holds one charge level to store one bit of data. The main limitation of SSDs is the limited write endurance of the FLASH medium, which allows up to 100,000 write cycles for SLCs. Furthermore, SLC SSDs have a smaller capacity and a roughly dozen times higher price per stored bit than traditional HDDs. SSD capacity can be increased by storing multiple bits of information in one FLASH cell using discrete charge levels (MLCs and TLCs) [2]. To increase the storage density, the dimensions of the FLASH cells were reduced, but this resulted in reduced write endurance. To overcome the problems caused by scaling planar 2D cells, manufacturers started to use 3D stacking. Thanks to the third, vertical dimension, the size of the FLASH cells did not have to be reduced further, and stacking a few hundred layers achieved high storage density while keeping acceptable write endurance [2].
The content of an already written FLASH cell cannot be changed by simple overwriting, as is the case with HDDs; a read–erase–modify–write cycle is necessary to change the content of a FLASH cell. Since SSDs are based on the NAND FLASH cell organization, FLASH cells are grouped into larger data units called pages, where all the FLASH cells within one page can only be written at once. Furthermore, pages are grouped into larger data units called blocks, where all pages within one block can only be erased at once. Thus, to change a single bit of data, it is necessary to read the page and block containing the designated FLASH cell, erase the entire block, modify the page, and rewrite all pages of that block, which accelerates the wear of the FLASH medium. To avoid rewriting the other pages of the original block, SSDs rely on the Flash Translation Layer (FTL), which remaps the page to a new, empty block where it is written; the page at the old location is marked as invalid. In the background, when a certain number of invalid pages is reached, garbage collection remaps the remaining valid pages of the block to new, empty blocks and erases the block, freeing it for further use. This operation causes additional erase cycles and a phenomenon known as write amplification, where a single write request from the host typically requires between two and four write cycles, further reducing the write endurance of the FLASH medium.
The limited endurance of FLASH storage can be seen as a factor that limits the operational life of SSDs [3]. The inability of worn-out FLASH cells to reliably store data leads to SSD failure. Since the SSD controller applies equal wear leveling to all FLASH cells through cell remapping, all FLASH cells tend to have an equal number of write cycles. It is therefore possible, due to the gradual degradation of the storage medium, to predict SSD failure in advance, which could be used to warn users to perform proactive measures to prevent potential data loss. Failure prediction estimates the failure of an SSD based on the monitoring of operational parameters known as SMART [2]. The simplest algorithms, used by manufacturers, are threshold-based: they monitor each SMART parameter individually and issue a failure warning when its value goes above or below a certain factory-defined threshold. Research in this field is oriented toward the development of algorithms that use multivariate SMART parameters for failure prediction. Researchers have used basic machine learning algorithms for SSD failure prediction, such as the Bayes classifier [4], random forest [3,4,5,6,7,8,9,10], isolation forest [5], decision tree [3,9], gradient-boosted decision tree [4,6,7,8,9], neural networks [3,5,7,8,9], autoencoder [3,5,8], logistic regression [3,9,10], k-nearest neighbors (k-NN) [3,9], and Support Vector Machines (SVMs) [3,5,8,9]. These algorithms showed limited prediction performance, so researchers have developed more sophisticated algorithms to improve the failure prediction rate. Better prediction performance was achieved by LSTM (long short-term memory) [4,7,8,10,11,12], the RUS (Random Under-Sampling) ensemble method [4], multi-view and multi-task random forest [13], WEFR (wear-out-updating ensemble feature ranking) [14], and mutation similarity-based failure rating and diagnosis (MSFRD) [15].
In this work, we present an SSD failure prediction model based on anomaly detection using the Mahalanobis distance measure. The proposed algorithm uses a forward feature selection algorithm to rank the features according to their influence on SSD failure prediction. The main contribution of the developed model is the prediction performance achieved on the Alibaba SSD dataset, where our model was able to detect 64% of SSD failures using the six highest-ranked features while keeping a high precision of 96%.
2. Materials and Methods
This research presents an SSD failure prediction algorithm based on anomaly detection using a distance measure approach. Our current work builds on the foundation established in reference [16] while introducing several significant improvements. While both approaches utilize a common anomaly detection algorithm based on a distance measure, our work focuses on SSD failure prediction rather than HDDs. A key enhancement is the integration of wrapper-based feature selection directly within the anomaly detection process, ensuring that only the most influential features are used, which streamlines the prediction process and improves the overall performance. This targeted adaptation to SSDs and the improved feature selection process represent significant advancements over the methodology presented in [16]. Another key difference from the HDD failure prediction algorithm [16] is that we implement sequential feature selection within the anomaly detection algorithm to create an optimal subset of the most influential SMART features, whereas in [16] SVM recursive feature selection was executed prior to the anomaly detection process and did not generate an optimal subset of SMART features for anomaly detection. By embedding feature selection within the anomaly detection process, our method ensures that only the most relevant features contribute to failure prediction, leading to improved model performance and reliability. Finally, our algorithm uses an adaptive decision boundary, whose search area is bounded by the MD values from the validation set, in contrast to the fixed search area for the decision boundary used in the HDD failure prediction algorithm [16].
The detailed flowchart of the proposed algorithm for SSD failure prediction is presented in Figure 1. The first step in dataset processing removes highly correlated features, i.e., pairs of features whose correlation coefficient is close to ±1. This reduces the computational complexity of the algorithm, in terms of memory and execution time, and avoids matrix singularity when the inverse of the covariance matrix is computed. The dataset is then separated into three independent subsets (the dataset splitting block in Figure 1), which are used in different phases of the algorithm. The training set creates the distribution for the normal operating conditions of healthy SSDs. Since the operation of SSDs is monitored using SMART attributes whose values are expressed on different scales, the multivariate Mahalanobis distance is used as a distance measure. The Mahalanobis distance is measured relative to the central point (or centroid) of the multivariate distribution, which is the overall mean of the multivariate data. In the case of the SSDs, this distribution is created from the training dataset, which defines the normal working area for healthy SSDs.
Since most SMART parameter values are typically not normally distributed, the Box–Cox transformation is used to transform these non-normally distributed values into normally distributed values. The Box–Cox transformation [17] transforms non-negative input data using the exponent λ, which is selected to minimize the standard deviation of the transformed data, thus making the data approximately normally distributed.
The Box–Cox transformation section, outlined in Figure 1, is implemented using three separate blocks, each responsible for the power transformation of a different data subset. These blocks apply the power transformation using the exponent λ, which is determined from the transformation of the training subset.
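As an illustration of this step, the sketch below (in Python, not the authors' MATLAB scripts) fits the exponent λ on the training subset only and then applies the same λ to another subset; the +1 offset for zero-valued counts and the variable names are assumptions.

```python
# Illustrative Box-Cox sketch: lambda is estimated on training data and reused
# for the other subsets, mirroring the three transformation blocks in Figure 1.
import numpy as np
from scipy import stats
from scipy.special import boxcox as boxcox_transform

def fit_boxcox(train_col: np.ndarray) -> float:
    """Estimate lambda on the training subset; input must be strictly positive."""
    shifted = train_col + 1.0          # assumed offset to handle zero counts
    _, lam = stats.boxcox(shifted)     # scipy picks the maximum-likelihood lambda
    return lam

def apply_boxcox(col: np.ndarray, lam: float) -> np.ndarray:
    """Apply the training-derived lambda to another subset of the same feature."""
    return boxcox_transform(col + 1.0, lam)

# Usage with synthetic non-negative data standing in for one SMART feature:
rng = np.random.default_rng(0)
train = rng.exponential(scale=5.0, size=10_000)
test = rng.exponential(scale=5.0, size=2_000)
lam = fit_boxcox(train)
train_t, test_t = apply_boxcox(train, lam), apply_boxcox(test, lam)
```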
SMART attributes have different influences on prediction performance: some have a higher influence than others, while some could even have a negative influence. In order to rank the SMART attributes according to their importance for SSD failure prediction, we implemented a sequential selection algorithm. This algorithm iteratively finds the most relevant subset of features, which maximizes the prediction performance of the anomaly detection model. In our research, we decided to use sequential forward selection, which tends to find a minimal set of features that contribute most to prediction performance, rating the other features as less relevant. Sequential backward selection algorithms start with the entire feature set and try to eliminate the least relevant features, treating the remaining features as relevant. Both techniques have their advantages, depending on the nature of the observed dataset. If most of the features can be treated as relevant, sequential backward selection is more appropriate, since it quickly eliminates the few irrelevant attributes. As discussed above, the number of SMART attributes that contribute to failure prediction is small; thus, the vast majority of SMART attributes can be treated as irrelevant for failure prediction. With this in mind, we chose sequential forward selection for our model. The forward feature selection is implemented by the three feature selection blocks outlined in Figure 1. These blocks select the most relevant subset of features based on the evaluation of model performance on the test set. In the first iteration of forward feature selection, individual features are used in the anomaly detection algorithm, and the feature with the highest prediction performance is chosen as the most relevant. In the following iteration, the most relevant feature is combined with each of the remaining features into a set of bivariate candidates that are evaluated in the anomaly detection algorithm in order to find the second most relevant feature. The relevant features are further combined with the remaining features into multivariate candidates, and the process is repeated until the least relevant feature is found.
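The following minimal sketch illustrates the greedy loop described above; the `evaluate` callback, assumed to train the anomaly detector on a given feature subset and return its accuracy, stands in for the anomaly detection and performance estimation blocks of Figure 1.

```python
# Minimal sequential forward selection sketch around an anomaly detector.
from typing import Callable, Iterable

def forward_selection(all_features: Iterable[str],
                      evaluate: Callable[[list[str]], float]) -> list[str]:
    """Rank features by greedily adding the one that maximizes accuracy."""
    remaining = list(all_features)
    ranked: list[str] = []
    while remaining:
        # Evaluate every candidate added on top of the already selected set.
        scores = {f: evaluate(ranked + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        ranked.append(best)
        remaining.remove(best)
    return ranked  # features ordered from most to least relevant
```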
During the evaluation of a certain subset of features, the Mahalanobis distance MD_i is calculated for every data point i, using the normalized data points of the selected features and the covariance matrix C, according to Equation (2) [16]. The covariance matrix contains the covariances between the used features, according to Equation (3).
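For orientation, a common formulation of the Mahalanobis distance over standardized features is sketched below; it is consistent with the description above but is not necessarily the exact notation of Equations (2) and (3) in [16]. Here z_i is the standardized feature vector of data point i over the k selected features, and C is the covariance matrix estimated from the n training instances.

```latex
% Common formulation (sketch, not the paper's exact Equations (2)-(3)):
\[
  \mathrm{MD}_i \;=\; \sqrt{\,\mathbf{z}_i\, C^{-1}\, \mathbf{z}_i^{\mathsf T}\,},
  \qquad
  \mathbf{z}_i \;=\; \Bigl(\tfrac{x_{i1}-\mu_1}{\sigma_1},\;\ldots,\;\tfrac{x_{ik}-\mu_k}{\sigma_k}\Bigr),
  \qquad
  C \;=\; \frac{1}{n-1}\sum_{i=1}^{n} \mathbf{z}_i^{\mathsf T}\,\mathbf{z}_i .
\]
```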
The Mahalanobis distance is measured for the entire SSD dataset, relative to the distribution of the data in the training subset, which contains only healthy SSD data. Thus, for every instance from the SSD dataset, the distance between the current operating mode of the SSD and the normal operating mode is measured.
The calculation of the Mahalanobis distance is implemented using three blocks, outlined in Figure 1. The Distribution Mean block determines the distribution of the data from the training set and transfers the parameters µ, σ, and C to two blocks, which compute the Mahalanobis distance for the validation and test sets.
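A compact sketch of these blocks is given below (illustrative Python with synthetic data rather than the authors' MATLAB implementation): the training set provides µ, σ, and the inverse of C, which are then reused to score the validation and test instances.

```python
# Sketch: estimate mu, sigma, C on the healthy training set, then compute the
# Mahalanobis distance of other instances to that distribution.
import numpy as np

def fit_distribution(train: np.ndarray):
    """train: (n_samples, n_features) of Box-Cox-transformed healthy data."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0, ddof=1)
    z_train = (train - mu) / sigma
    C = np.cov(z_train, rowvar=False)   # covariance of the standardized features
    C_inv = np.linalg.inv(C)            # requires a non-singular covariance matrix
    return mu, sigma, C_inv

def mahalanobis(data: np.ndarray, mu, sigma, C_inv) -> np.ndarray:
    """Distance of each row of `data` to the healthy training distribution."""
    z = (data - mu) / sigma
    return np.sqrt(np.einsum("ij,jk,ik->i", z, C_inv, z))

# Usage sketch with synthetic data:
rng = np.random.default_rng(1)
train, test = rng.normal(size=(5000, 6)), rng.normal(size=(1000, 6))
mu, sigma, C_inv = fit_distribution(train)
md_test = mahalanobis(test, mu, sigma, C_inv)
```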
In order to classify the operation of an SSD as anomalous, which could soon lead to SSD failure, a decision boundary must be defined based on a certain classification performance metric. In the case of HDD and SSD failure prediction, the most commonly used performance metrics tend to minimize the number of false positive predictions, known as false alarms. The estimation of the decision boundary is performed by the blocks outlined in the lower left part of Figure 1. The validation subset is used to choose the appropriate value for the decision boundary. This subset contains the Mahalanobis distance values for healthy and failed SSDs, and the decision boundary is found by an iterative search within an interval defined by the mean value of the MD for healthy drives. The vast majority of healthy drives are distributed close to the distribution centroid, while the failed drives are located at a greater distance from it. This interval is divided into steps, and the prediction performance for each step is measured on the validation data. Prediction performance is evaluated using the confusion matrix, from which various performance metrics are calculated. SSD failures are classified as positive outcomes, while healthy drives are classified as negative outcomes. Precision represents the ratio between true positive tp predictions and all predicted positives (tp + fp) and is reduced by false positive fp predictions. Recall represents the ratio between true positive tp predictions and actual positive p instances and is also referred to as the failure detection rate (FDR). In our model, accuracy is chosen as the performance metric for the selection of the decision boundary. Accuracy is measured as the ratio between the sum of true positive tp and true negative tn predictions and the total number of positive p and negative n instances, as shown by Equation (4).
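In standard notation, the metrics described above take the following well-known forms; the accuracy expression corresponds to the description of Equation (4), while precision and recall are included only for reference.

```latex
\[
  \mathrm{precision} = \frac{tp}{tp + fp}, \qquad
  \mathrm{recall} = \frac{tp}{p} = \frac{tp}{tp + fn}, \qquad
  \mathrm{accuracy} = \frac{tp + tn}{p + n}.
\]
```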
Selecting the decision boundary that maximizes the accuracy also maximizes the number of true predictions for both the positive and negative classes, which in turn maximizes both precision and recall. In our algorithm, we implemented an adaptive decision boundary, bounded by the MD values from the validation set, in contrast to the fixed decision boundary used in our algorithm for HDD failure prediction [16]. The decision boundary is selected from an interval defined by the mean value µ of the Mahalanobis distance of the healthy instances from the validation set. The boundaries of this interval are set to [10^(µ−1), 10^(µ+1)], and the decision boundary is searched within this interval over ten points on a logarithmic scale. The decision boundary that achieves the highest accuracy on the validation set is selected as optimal.
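A sketch of this search is shown below (illustrative Python; the array names and the convention that label 1 marks failed drives are assumptions), using ten logarithmically spaced candidate thresholds and keeping the one with the highest validation accuracy.

```python
# Adaptive decision boundary search on the validation set.
import numpy as np

def select_boundary(md_val: np.ndarray, y_val: np.ndarray, n_steps: int = 10) -> float:
    """Pick the threshold with the highest validation accuracy from a
    logarithmic grid centred on the mean MD of healthy validation drives."""
    mu = md_val[y_val == 0].mean()
    # Candidate thresholds spaced logarithmically in [10**(mu-1), 10**(mu+1)].
    candidates = np.logspace(mu - 1.0, mu + 1.0, n_steps)
    best_thr, best_acc = candidates[0], -1.0
    for thr in candidates:
        y_pred = (md_val > thr).astype(int)   # above the boundary -> predicted failure
        acc = (y_pred == y_val).mean()
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr
```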
The performance estimation of the anomaly detection algorithm, outlined in the lower right part of Figure 1, has the role of objectively evaluating the prediction performance on the independent test set for the selected decision boundary. Classification results are collected in the form of a confusion matrix, and the feature that achieves the highest accuracy is added to the subset of selected features.
Disk failure prediction algorithms are trained in such a way as to reduce the number of false positive classifications (false alarms) and maximize precision, which is therefore usually reported together with recall. Higher recall values can be achieved by lowering the decision threshold in our algorithm, but this results in reduced precision and a higher number of false alarms. The F-score is a metric used to balance precision and recall, with different versions placing varying importance on each. The F1-score gives equal weight to precision and recall, making it suitable for balanced classification problems where both metrics are equally important. In contrast, the F0.5 score emphasizes precision more than recall by valuing precision twice as much as recall, as shown in Equation (5); it is the most commonly used type of F-score for measuring the performance of SSD and HDD failure prediction algorithms.
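In the usual notation, and consistent with the description of Equation (5), the general F_β score and its β = 0.5 case read:

```latex
\[
  F_{\beta} = (1+\beta^{2})\,\frac{\mathrm{precision}\cdot\mathrm{recall}}
              {\beta^{2}\,\mathrm{precision} + \mathrm{recall}},
  \qquad
  F_{0.5} = 1.25\,\frac{\mathrm{precision}\cdot\mathrm{recall}}
            {0.25\,\mathrm{precision} + \mathrm{recall}}.
\]
```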
3. Results
In this research, we used the publicly available Alibaba dataset [18], which contains SMART logs from their data center collected over a period of two years, from the beginning of 2018 until the end of 2019. During this period, there were almost 500,000 operational SSDs in the data center, originating from three vendors anonymized as MA, MB, and MC. Each vendor provided two drive models, labeled MA1, MA2, MB1, MB2, MC1, and MC2. A software daemon was used to monitor the operation of the SSDs; it collected SMART data for every operational SSD, as well as trouble tickets, which were used to describe unusual drive behavior or drive failure. We developed the proposed model in the form of MATLAB 2017b scripts, which rely on several built-in functions for feature correlation, Box–Cox transformation, and calculation of the Mahalanobis distance. For this research, we extracted the data for the drive model labeled MC1, which was one of the most numerous drive models in the data center. This drive model has a capacity of 1920 GB and is built using 3D TLC NAND technology. The extracted dataset was limited to a period of one year, from 1 January 2018 to 31 December 2018. During this period, 1962 SSDs were reported as failed, and their SMART logs were used to construct the dataset. Since this dataset is significantly imbalanced in favor of healthy drives, a random sample of 4308 non-failed drives was selected for analysis. The dataset used in this research contains 506,456 SMART instances from failed drives and 1,985,064 SMART instances from healthy drives. Each SMART instance is collected for each drive on a particular date and contains both the raw and normalized values that measure the performance of the SSD. Raw SMART data represent unprocessed measurement data, which are vendor specific. Normalized SMART data represent the raw SMART values scaled to a range from 100 to 0. Normalized values close to 100 indicate that the performance parameter is optimal, while values closer to 0 represent unsatisfactory performance. One SMART instance for the MC1 model contains, in total, 51 pairs of raw and normalized SMART attributes, of which 21 SMART attributes with non-zero values are used in the dataset. In order to use as much information as possible, the raw SMART parameters were used, and these attributes are summarized in Table 1.
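A hypothetical extraction sketch is given below; the file name, column names, and labeling convention are assumptions and must be checked against the actual schema of the public Alibaba dataset. It only illustrates the filtering described above: MC1 drives, the 2018 calendar year, and the raw SMART columns.

```python
# Hypothetical extraction of the MC1 subset for 2018 (assumed schema).
import pandas as pd

smart = pd.read_csv("smart_logs.csv", parse_dates=["ds"])        # assumed file/columns
mc1 = smart[(smart["model"] == "MC1")
            & (smart["ds"] >= "2018-01-01")
            & (smart["ds"] <= "2018-12-31")]

# Keep only raw SMART attribute columns (assumed naming convention "r_<id>").
raw_cols = [c for c in mc1.columns if c.startswith("r_")]
dataset = mc1[["disk_id", "ds", "failure_label"] + raw_cols]
```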
SMART attributes can be grouped into several categories based on the mechanism that is affecting attribute performance:
Power and environment—SMART 9, SMART 12, SMART 174, and SMART 194;
Storage medium—SMART 5, SMART 170, SMART 173, SMART 180, and SMART 196;
Errors—SMART 1, SMART 171, SMART 172, SMART 183, SMART 184, SMART 187, SMART 188, SMART 195, SMART 197, SMART 198, SMART 199, and SMART 206.
SMART attributes that fall into the power and environment category depend on exterior causes. In modern data centers, the influence of these parameters on SSD failures is kept to a minimum by using redundant power supplies and efficient cooling. According to the SMART 9 attribute, which measures the number of operating hours, the SSDs used in the dataset were in their second year of operation in 2018. SMART attributes related to the storage medium measure the wear level of the FLASH storage medium. FLASH controllers use efficient FTL algorithms to achieve equal wear leveling of all FLASH cells by using them evenly in order to maximize the SSD lifetime. Since FLASH cells wear evenly over time, these attributes are usually used to measure the RUL (Remaining Useful Life) of SSDs. Error-related attributes capture unusual SSD behavior, with various SMART parameters counting the occurrences of different types of errors. Frequent occurrences of such errors decrease the reliability of SSD operation and could lead to SSD failure. Manufacturers typically use threshold-based techniques to predict SSD failure. These techniques alert users to potential SSD failures when normalized SMART parameter values drop below a predefined threshold. Threshold values are determined by the manufacturer during factory testing. These thresholds are set in such a way as to minimize the number of false positive SSD failures and reduce the number of unnecessary warranty returns by customers.
Machine learning methods can predict SSD failures by monitoring a set of SMART parameters, thus achieving better failure prediction than threshold-based methods, which monitor individual SMART attributes. Using all available SMART attributes for failure prediction is in most cases computationally impractical due to high data dimensionality and redundancy between attributes. Thus, efficient failure prediction typically relies on a subset of SMART attributes with the highest impact on failure prediction. Feature correlation represents the statistical relationship between the values of two features in the dataset. Highly correlated features provide similar information to the trained model, so it is advisable to use only one of them and remove the rest from the model. When redundant features are used, model training typically leads to overfitting, with a negative impact on model generalization. For the MC1 dataset used here, we calculated the feature correlation between all 21 SMART attributes from Table 1, and the correlation matrix is presented in the form of a heat map in Figure 2. The correlation matrix reveals a high correlation between certain pairs of SMART attributes, which are presented in Table 2.
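A short sketch of such correlation-based pruning is given below (illustrative Python; the 0.95 cut-off is an assumed value standing in for "close to ±1", and the DataFrame layout is hypothetical).

```python
# Correlation-based feature pruning sketch: keep one feature of each highly
# correlated pair and drop the other.
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = features.corr().abs()              # absolute Pearson correlation matrix
    to_drop: set[str] = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold and a not in to_drop and b not in to_drop:
                to_drop.add(b)                # keep the first feature of each pair
    return features.drop(columns=sorted(to_drop))
```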
The first group of highly correlated features shows the direct correlation between SMART 5 (Reallocated NAND Blocks), SMART 170 (Reserved Block Count), and SMART 196 (Reallocation Event Count) attributes, which have identical values across the entire dataset. Therefore, the SMART 5 attribute is included in the model, while the attributes SMART 170 and SMART 196 are excluded from the model.
A high correlation is observed between SMART 12 (Power Cycle Count) and SMART 174 (Unexpected Power Loss Count), since each unexpected power loss will also be counted during the new power cycle. The SMART 12 attribute is included in the model, since it is more general than SMART 174, which is removed from the model. Furthermore, a direct correlation is observed between attributes SMART 171 (Program Fail Count) and SMART 206 (Write Error Rate), since each unsuccessful programming of the NAND block will be counted as a write error. Thus, we included SMART 171 in our model, while SMART 206 was removed from the model. Also, attributes SMART 183 (SATA Downshift Error Count) and SMART 188 (Command Timeout Count) are in a direct correlation, so we used SMART 183 in our model, while we removed SMART 188 from our model. After the removal of highly correlated features from the dataset, the subset of sixteen features used in this research is presented in Table 3.
The Alibaba dataset of 6270 MC1 SSD drives is partitioned into three subsets: the training set, which contains 2346 healthy drives; the validation set, which contains 981 healthy and 981 failed drives; and the test set, which contains the remaining 981 healthy and 981 failed drives. The dataset partitioning is presented in Table 4.
The training set is used to define the distribution of the normal operating condition of the SSDs. It contains 1,090,204 instances from 2346 healthy SSDs, which are used to estimate the exponent λ for the Box–Cox transformation and to transform the data of each feature into normally distributed data.
Forward feature selection starts by selecting the features from Table 3 one by one to find the most relevant one. The algorithm calculates the Mahalanobis distance for the validation and test sets based on the distribution of the healthy data from the training set. The validation set is used to determine a decision boundary sufficient to identify failed drives while keeping the number of falsely identified drive failures (false positives) as low as possible. Accuracy is used as the performance metric to maximize both positive and negative predictions. The MD values over the entire dataset for the SMART 187 attribute are represented in Figure 3. Instances are presented in different colors: drives from the validation set are colored blue, drives belonging to the test set are colored red, and healthy drives from the training set are shown in green. The decision boundary is searched within the interval marked with black dashed lines, while the chosen decision boundary is marked with a solid black line. All drives whose Mahalanobis distance is greater than the decision boundary are classified as failed, so the majority of the failed drives in Figure 3 are correctly classified as true positives. The few healthy drives whose Mahalanobis distance surpasses the decision boundary are incorrectly classified as failed, and these drives are marked in Figure 3 as false positives.
For each selected SMART feature, the confusion matrix is collected, from which the prediction accuracy of the anomaly detection algorithm is determined; the results for all features are shown in the first data column of Table 5. The most influential feature is SMART 198, with the highest accuracy of 0.783. In the following round of forward feature selection, each remaining feature is combined with the most influential feature from the first round to calculate the MD, and the collected metrics are shown in the second column of Table 5. The second most influential feature is SMART 1, and the anomaly detection model using two features achieved an accuracy of 0.81. Performance metrics for the following rounds of forward feature selection are shown in the subsequent columns of Table 5.
When all rounds of forward feature selection are finished, the overall model performance in terms of precision, recall, accuracy, F0.5 score, and ROC-AUC is presented in Figure 4, in relation to the number of used features. The anomaly detection model achieved its highest accuracy of 0.81 once the six most influential features were found. The Area Under the Curve (AUC) follows a similar trend to the accuracy, reaching a maximum value of 0.8225 when the six highest-ranked features are used. Additional features did not contribute to increasing the model accuracy; moreover, after less influential features were added, the model accuracy started to decrease. The proposed model detected 64% of failures while maintaining a high precision of 96%, using the six most influential features. The maximum accuracy of 81% was achieved with six SMART attributes ranked in the following order: SMART 198, SMART 1, SMART 183, SMART 187, SMART 184, and SMART 199. The further addition of SMART attributes does not increase the accuracy but could increase the computational complexity.
The proposed anomaly detection model can detect failures in advance. The warning time is determined as the time difference, expressed in working days, between the date when the failure alarm is signaled and the date when the actual SSD failure occurs. The summarized results are shown in Table 6. The presented results show that our model was able to predict 60% of failures at least 7 days prior to the occurrence of the actual SSD failure.
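A sketch of the warning-time computation is given below; the per-drive `alarm_date` (first date on which the MD exceeded the decision boundary) and `failure_date` columns are assumptions, and working days are counted with NumPy's business-day helper.

```python
# Warning time in working days between the first alarm and the actual failure.
import numpy as np
import pandas as pd

def warning_days(alarms: pd.DataFrame) -> pd.Series:
    """Working days between the first alarm and the actual failure, per drive."""
    return pd.Series(
        np.busday_count(alarms["alarm_date"].values.astype("datetime64[D]"),
                        alarms["failure_date"].values.astype("datetime64[D]")),
        index=alarms.index,
    )

# Share of failed drives warned at least 7 working days in advance:
# (warning_days(alarms) >= 7).mean()
```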
4. Discussion
The presented anomaly detection model achieved good prediction performance on the Alibaba SSD dataset, detecting 64% of failures while keeping the proportion of false positives at around 4%. The model achieved the stated predictive performance with just the six highest-ranked of the twenty-one available features, which significantly reduces its computational complexity and makes it suitable for practical, real-time failure prediction. A failure prediction warning was given at least 7 days in advance for 60% of the failed drives, which is sufficient time for a complete data backup.
We compared the prediction performance of our algorithm with previous works in terms of precision, recall, F0.5 score, and accuracy on the Alibaba SSD dataset, and the results are presented in Table 7. We also compared these algorithms according to the number of features used for model training. The majority of the algorithms used various subsets of the original features in order to reduce data redundancy between features. The authors of [14], in addition to the original features from the Alibaba dataset, generated additional statistical features, including minimum and maximum values, span, mean and standard variance, and weighted moving averages for 3-day and 7-day windows. Since the MC1 model originally contained 21 features and seven statistical values were calculated for each of them, we assume 168 features were used for the MC1 drive model.
A detailed comparison of failure prediction models trained and tested on the Alibaba SSD dataset is presented in Table 7. When compared with the failure prediction algorithms presented in [10], which were also tested on the same dataset for the MC1 drive model, our results show significant improvements in terms of accuracy, precision, and recall. Our model achieves better results than the LSTM model with a one-year monitoring window, which is also more computationally complex. The authors of [6] also used the Alibaba dataset in their modified ensemble learning model and achieved a 21% recall with 100% precision, which, despite its high precision, suffered from a very low recall. The Alibaba dataset was also used in [7], where the authors achieved a 69% recall with a precision of 47%, which is lower than our model's performance. The authors of [14] used wear-out-updating Ensemble Feature Ranking (WEFR) on the Alibaba dataset; when their model was tested on the MC1 drive, they achieved a recall of 18% and a precision of 49%, both lower than in our case. The authors of [13] developed a multi-view and multi-task random forest (MVTRF) algorithm, which they tested on the Alibaba dataset, but on a different drive model, MB1, where they achieved a precision of 89%, a recall of 76%, and an F0.5 score of 86%. The authors of [15] presented MSFRD, a transformer-based mutation feature extraction combined with similarity measurement to predict SSD failures; they tested their algorithm on the Alibaba SSD dataset for the MB2 disk model and achieved a precision of 87% with a 27% recall, which is much lower than the results of our algorithm. The authors of [6,13,14,15] presented accuracy results achieved on highly imbalanced datasets, which results in biased accuracy, and these results are marked with an exclamation mark (!) in Table 7. In our algorithm, we performed dataset balancing using random subsampling of the majority class (healthy disks), so that the accuracy metric is more reliable. Our F0.5 score of 88% is among the highest, showing a strong balance between precision and recall, making our model highly suitable for real-world implementation. Our method offers one of the best trade-offs between accuracy, precision, and recall, making it more effective than LSTM, WEFR, and FFN while remaining computationally efficient.
Other research related to SSD failure prediction was conducted using datasets that are not publicly accessible. The researchers in [2] used the Huawei dataset of 3D TLC SSDs, and they achieved a 42.5% recall with 0.00% false positives. The authors of [11] developed a variational autoencoder LSTM (VAE-LSTM), which they applied to a closed SSD dataset, achieving an FDR of 69% when the FAR was less than 0.035%. The authors of [8] developed a Temporal Contextual Attention Network (TCAN), a transformer-based architecture that integrates LSTM; they tested their model on private datasets from the Tencent data center and achieved a precision of 73%, a recall of 68%, and a ROC-AUC of 0.73. In [12], the authors used the SSD dataset from Google's data center and combined LSTMs with generative adversarial networks to achieve a 0.76 AUC for a 14-day prediction window. Comparing the generalized classification performance expressed in the form of the ROC-AUC, our algorithm, with a value of 0.8225, outperforms the results of these previous works. The advantage of our algorithm is reflected in its simpler implementation compared to LSTM, with which it achieves an equal level of performance. We plan to evaluate the classification performance of our algorithm on the datasets used in these works if they become publicly available.
The novelty of the proposed algorithm is in its low computational complexity and good predictive performance, which is favorable for lightweight integration in real-time proactive failure detection systems in data centers. The model achieves a high detection rate (recall) while keeping a low number of false alarms (high precision) by using just the six most relevant features. The proposed algorithm achieves better prediction performance compared to previous works which are based on more complex ML algorithms and use a higher number of features. The classification of SSD failure requires the simple calculation of MD and a comparison against the decision boundary. The algorithm can adapt its covariance matrix to long-term trends caused by SSD aging, while the value of the decision boundary can be adjusted to classify new failures.
A limitation of the proposed method is the requirement for normally distributed data, which is necessary to calculate the Mahalanobis distance properly. Our algorithm uses the Box–Cox transformation, which has limits regarding the type of distribution skew it can transform into nearly normally distributed data. In the case of highly skewed distributions, the use of other transformations, such as the Yeo–Johnson transformation, is recommended. The proposed method is also susceptible to highly correlated features, which could cause singularity problems during the calculation of the inverse of the covariance matrix required for the Mahalanobis distance. This limitation can be avoided by removing the highly correlated features before applying the algorithm. The proposed algorithm is designed for balanced datasets and requires that highly imbalanced datasets be balanced before use. In this study, the algorithm was tested on an SSD dataset with a high class imbalance in favor of the negative class. Since this dataset has a sufficient number of minority-class instances, under-sampling of the majority class was used in testing to achieve class balance.
The developed algorithm can be generalized to other SSD datasets with a similar set of SMART attributes. The generalized ROC-AUC classification performance achieved on the Alibaba SSD dataset shows that our algorithm is capable of providing reliable results. The proposed method can also be used in a wider field of applications in which failures are caused by the gradual degradation of system parameters. However, when applied to different systems, generalization would likely require some modifications to ensure that the algorithm remains accurate and adaptable to the new system and scenario.