The objective for feature extraction and selection is to build a set of representative features that can be used by ML models for making various types of predictions, such as the next offset and hotness of a file. Based on the analysis of file access patterns discussed in Section 3, we extracted a set of 37 features, presented in Table 2 and discussed in Section 4.1, that are derived from information extracted from file requests, file metadata and patterns, directory patterns, and file format patterns. The target variables are presented next in Section 4.2, while feature evaluation is presented in Section 4.3.
4.1. Feature Extraction
As file requests are executed on a storage system, the file metadata information is updated, and the various patterns discussed in Section 3 can be computed and stored at the level of files, directories, and file formats. This aggregated information is then used for generating the features presented in Table 2 in an online fashion, as explained later in Section 5. All features are normalized to values between 0 and 1 but with different normalization approaches. Next, we discuss the set of extracted features and their normalization, grouped by their granularity.
Request Features: These are collected at the level of individual file requests and include the Request Operation, Request Offset, and Request Length. The Request Operation is encoded with a specific number assigned per operation type: 1 for Open, 2 for Write, 3 for Read, 4 for Close, and 5 for Delete, normalized by dividing it by 5. The Request Offset and Request Length are normalized using the size of the file accessed by the request.
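To make the request-level encoding concrete, the following Python sketch computes the three normalized features; the function and dictionary names are ours for illustration, not the system's implementation.

```python
# Illustrative sketch of the request-level feature normalization described
# above; names and structure are assumptions, not the authors' code.

OPERATION_CODES = {"open": 1, "write": 2, "read": 3, "close": 4, "delete": 5}

def request_features(op: str, offset: int, length: int, file_size: int) -> dict:
    """Normalize the three request-level features to [0, 1]."""
    return {
        "request_operation": OPERATION_CODES[op] / 5.0,  # divide by number of op types
        "request_offset": offset / file_size if file_size else 0.0,
        "request_length": length / file_size if file_size else 0.0,
    }
```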
File Features: These are extracted from basic file metadata on a per-file basis, and they include the file ID, file size, time since creation, time since last access, time since last update, history of open times, and current file hotness. They identify the file and aim to determine how frequently this file is accessed and/or updated. The file size is normalized by dividing its fifth root by the fifth root of the largest file size found in the system (32 GB by default). Different normalization approaches were also considered. With min–max normalization, smaller file sizes would acquire very small normalized values that would be almost indistinguishable from each other. For instance, 80% of the files have a size of less than 8 GB, but they would be normalized to less than 0.25 with min–max normalization, while the remaining 20% would acquire values between 0.25 and 1. The fifth-root normalization spreads the normalized values in a more representative fashion: in the above example, those 80% of the files are normalized between 0 and 0.76.
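The fifth-root behavior can be verified with a short sketch (assuming sizes in bytes and the 32 GB default upper bound):

```python
# Illustrative sketch of the fifth-root size normalization (not the authors' code).
MAX_FILE_SIZE = 32 * 2**30  # 32 GB default upper bound

def normalize_size(size: int, max_size: int = MAX_FILE_SIZE) -> float:
    """Fifth-root normalization: spreads small sizes over a wider [0, 1] range."""
    return (size / max_size) ** 0.2  # equals size**(1/5) / max_size**(1/5)

print(normalize_size(8 * 2**30))  # ~0.758, whereas min-max would give 0.25
```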
The history of open times reveals how frequently the file is opened. The size of the history is configurable and set to 10 by default. Timestamps are well-known to be bad features because they continuously grow over time. Hence, instead of using them directly, we use the time difference between each consecutive pair of open timestamps as features. All time differences are normalized using min–max normalization, where max is a configurable time window.
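A minimal sketch of these history features follows; the window value and the zero-padding of short histories are our assumptions (the latter is consistent with the observation in Section 4.3 that History of Open Times 10 is frequently 0).

```python
# Sketch of the history-of-open-times features under the stated defaults;
# HISTORY_SIZE and TIME_WINDOW are configurable in the described system,
# and the 1-hour window here is a hypothetical choice.
HISTORY_SIZE = 10
TIME_WINDOW = 3600.0  # assumed max window, in seconds

def open_history_features(open_timestamps: list[float]) -> list[float]:
    """Deltas between consecutive opens, min-max normalized by a fixed window."""
    recent = open_timestamps[-(HISTORY_SIZE + 1):]
    deltas = [min(b - a, TIME_WINDOW) / TIME_WINDOW
              for a, b in zip(recent, recent[1:])]
    # Pad with zeros when fewer than HISTORY_SIZE deltas are available.
    return [0.0] * (HISTORY_SIZE - len(deltas)) + deltas
```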
The hotness of a file represents the file's access rate over a defined time interval. Intuitively, the more a file is accessed and the more recently, the higher its hotness value. The file hotness $h_f$ is calculated as:

$$ h_f = \frac{\sum_{t} w_t \cdot d_{f,t}}{s_f} \qquad (6) $$

where $d_{f,t}$ is the amount of data read from file $f$ during time period $t$, $w_t$ is the weight of $t$, which is inversely proportional to the time difference between the current time and the beginning of $t$, and $s_f$ is the size of $f$.
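A minimal sketch of Equation (6) is shown below; the per-period data layout and the exact weighting function (here simply 1 over the age of the period) are our assumptions, chosen to match the inverse-proportionality described above.

```python
# Hedged sketch of the file hotness in Equation (6); the weighting scheme
# (1 / age of the period) is an assumption consistent with the description.
def file_hotness(bytes_read_per_period: list[tuple[float, int]],
                 file_size: int, now: float) -> float:
    """bytes_read_per_period: (period_start_time, bytes read in that period)."""
    weighted = sum(bytes_read / (now - start)       # weight inversely proportional
                   for start, bytes_read in bytes_read_per_period  # to period age
                   if now > start)
    return weighted / file_size
```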
File Access Patterns Features: The current temporal, spatial, and length patterns are maintained as a class with a specific number representation (e.g., mild temporal is 0 and intense temporal is 1). In addition to the pattern itself, we also maintain the pattern frequency, which represents the frequency of the intense, sequential, and uniform identification of the temporal, spatial, and length patterns, respectively. The pattern frequencies are normalized with the following:

$$ \frac{c}{c+1} \qquad (7) $$

where $c$ is the frequency of the intense, sequential, or uniform pattern. Finally, three more features are extracted: the file access frequency, file open frequency, and fully read frequency, which aim to quantify the frequency of access. These are normalized using min–max normalization.
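Putting the pattern encoding and Equation (7) together, a sketch follows; only the temporal class codes are given above, so the spatial and length encodings are assumptions.

```python
# Illustrative encoding of the current pattern classes and frequencies;
# the spatial and length class codes are assumptions.
TEMPORAL = {"mild": 0, "intense": 1}      # from the text
SPATIAL = {"random": 0, "sequential": 1}  # assumed encoding
LENGTH = {"variable": 0, "uniform": 1}    # assumed encoding

def pattern_features(temporal: str, spatial: str, length: str,
                     counts: dict[str, int]) -> dict[str, float]:
    return {
        "temporal_pattern": TEMPORAL[temporal],
        "spatial_pattern": SPATIAL[spatial],
        "length_pattern": LENGTH[length],
        # Frequencies of intense/sequential/uniform identifications, Equation (7).
        **{f"{name}_freq": c / (c + 1) for name, c in counts.items()},
    }
```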
Directory and File Format Features: These features are similar and hold an aggregated view of the access patterns for each directory and each file format. We hypothesize that there is often a predominant pattern in the directory or file format, and that this pattern would help the model in its predictions. Each directory and each file format maintains the same set of nine features: an ID representing the directory or file format; the ratio of each pattern identified over all files under the specified directory or file format; the files count, representing the total number of distinct files under the same directory or with the same file format; and the access frequency, counting the total access frequency of the files under the same directory or with the same file format. The file, directory, and file format IDs are normalized by

$$ \frac{x \bmod M}{M} \qquad (8) $$

where $x$ is the hash value of the file name, directory, or file format and $M$ is a configurable max number of files, root directories, and file formats, respectively. We set $M = 1{,}000{,}000$ for file IDs, with separate configurable values for directories and file formats. The counts and access frequency features are normalized by the fifth root, as was done for the file size discussed above.
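A sketch of the hash-based ID normalization in Equation (8) is given below; the specific hash function is our choice, as the text does not name one.

```python
# Sketch of the hash-based ID normalization in Equation (8); the hash
# function choice is an assumption, and only M = 1,000,000 is from the text.
import zlib

M_FILES = 1_000_000  # from the text; directory/format limits are configured separately

def normalized_id(name: str, m: int = M_FILES) -> float:
    """Hash a file/directory/format name and map it into [0, 1)."""
    x = zlib.crc32(name.encode("utf-8"))  # any stable string hash works here
    return (x % m) / m
```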
4.2. Target Variables
Target variables represent the values to predict, such as the next offset to be read and the hotness of a file in the near future. Accurate prediction of these two targets can drive caching and tiering policies into making important decisions that optimize the performance of multi-tier storage systems. Next offsets can be used for prefetching data into the cache proactively, thereby reducing the access latency of future read requests. File hotness is a proxy of how popular a file will become, which is crucial when making cache eviction, cache admission, and tier migration decisions. The target variables are also calculated and added as a target column to the feature vectors to generate training instances.
The Next Offset target variable is a numeric value, and a natural solution is to model it as a regression. The advantage of doing so is that, if accurate enough, it can determine the next portion of the file to be accessed, helping a prefetching policy to make good decisions. However, even a small or moderate error when predicting the next offset may mislead the prefetcher toward the wrong region of the file. This motivates us to also model the prediction of the Next Offset as a 3-class classification problem by determining whether (1) the next offset is contiguous with the current offset, (2) the next offset is at a random file location, or (3) the file will not be read anymore; these classes are labeled sequential, random, and none, respectively. Classification is a relaxation of the regression problem, yet it provides sufficient information for prefetching policies to take action in the presence of primarily sequential workloads.
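The labeling rule can be sketched as follows; the exact contiguity test is not specified in the text, so the equality check here is an assumption.

```python
# Hypothetical labeling rule for the 3-class Next Offset target; the exact
# contiguity test used by the authors is not specified, so this is a sketch.
def next_offset_class(cur_offset: int, cur_length: int,
                      next_offset: int | None) -> str:
    if next_offset is None:                         # file is not read again
        return "none"
    if next_offset == cur_offset + cur_length:      # contiguous with current request
        return "sequential"
    return "random"
```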
The Next File Hotness is also a numeric value (computed using Equation (6)), represents the hotness of a file during a future time interval, and can be modeled using either regression or multi-class classification. Predicting the next file hotness with regression has the advantage of differentiating how hot (or cold) a file is, which can help cache and tiering policies take actions such as prioritizing files during data movement. Contrary to the prediction of the next offset, a small or moderate prediction error does not materially affect determining how hot the file will be in the near future. This motivates also modeling the prediction of the next file hotness as a classification problem. We categorized the file hotness into six distinct classes, with the first class being the coldest and the sixth the hottest. The number of classes was determined based on the distribution of the normalized values of file hotness across all traces.
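The class assignment amounts to binning the normalized hotness; a sketch follows, with hypothetical bin edges, since the paper derives them from the trace distributions.

```python
# Sketch of mapping normalized hotness into six classes; the bin edges are
# hypothetical placeholders (trace-derived in the described system).
import bisect

BIN_EDGES = [0.05, 0.15, 0.30, 0.50, 0.75]  # assumed, trace-derived in practice

def hotness_class(h: float) -> int:
    """Return a class in 1..6 (1 = coldest, 6 = hottest)."""
    return bisect.bisect_right(BIN_EDGES, h) + 1
```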
Overall, we modeled both problems as regression and classification to analyze their performance and differences, computing four target variables: Next Offset and Next Offset Class for predicting the next file offset with regression and classification, respectively, and Next File Hotness and Next File Hotness Class for predicting the future file hotness with regression and classification, respectively.
4.3. Feature Evaluation and Correlation Analysis
During feature evaluation, we analyze the impact of each feature on predicting the four target variables for each trace. Then, we extract a representative and robust subset of features that is sufficiently accurate for performing the predictions across all 17 traces, giving us confidence that these features will also work with other traces. To do so, we use four different and widely used statistical metrics: (i) the p-value test, (ii) the Chi-squared test, (iii) the Gini Importance, and (iv) the Correlation Matrix. Each test provides a different perspective on the relationship between features and the target variable, and using them together leads to more robust and reliable feature selection. The p-value test determines the features that have a statistically significant relationship with the target variable, the Chi-squared test indicates the probability of predicting values closer to the testing samples, the Gini Importance metric (calculated from a Random Forest model) determines the importance of each feature in predicting the target variable, and the Correlation Matrix expresses the pairwise correlation among the extracted features.
Next, we present the detailed results of the four metrics for the Next Offset target. The results for the other target variables are similar. According to the p-value metric, the File Size, History of Open Times 10, Request Offset, File Fully Read Frequency, and File Temporal Pattern Frequency are highly important for predicting the Next Offset. The connection between some of these features and the target is clearly explained by their nature. For instance, the Request Offset is often correlated with the next offset because of the high frequency of sequential operations. However, the connection between features such as the File Size or the History of Open Times and the Next Offset is not so straightforward. This happens because the p-value might produce false positives due to spurious connections with the target variable. For example, History of Open Times 10 is frequently set to 0 because many files are not opened 10 times or more, while the (normalized) next offset is often very close to zero due to the many reads that start at the beginning of a file. Thus, employing the three metrics in conjunction is valuable for diminishing the possible false-positive effect of a single metric.
Following the p-value, the Chi-squared test also gave a high score for the Request Offset and File Fully Read Frequency. These features have higher scores in Traces 14–17, which are traces with a high number of sequential read requests. In contrast with the p-value metric, the File Size received a lower score from the Chi-squared test. It is interesting to observe that file access patterns received high scores from both metrics. For instance, the File Length Pattern and the File Spatial Pattern received a positive p-value score, while their weight received a high Chi-squared score. The access patterns related to files and directories received the highest scores, while the file format patterns received, in general, lower scores. This indicates that the file access patterns are correlated to the target variables and can help the prediction models. The key results from the Gini Importance are similar to the Chi-squared results and are not elaborated due to space constraints.
Finally, our correlation analysis has revealed that most of the features exhibiting a correlation are features related to the access patterns. This happens because many patterns are mutually exclusive. For instance, a file can only be currently either sequentially accessed or not. Note that the patterns for the File Format are typically correlated with the Directory patterns. This happens because many directories often host files with the same formats. The correlation is similar across all target variables.
The objective during feature selection is to decrease the number of features by removing non-representative or correlated features, leaving only a subset that will yield good prediction results across all traces. After calculating the p-value, Chi-squared, and Gini Importance for each target variable across all traces, we combine the computed values to generate one single representation of each metric. For each tested feature, we count how many times it was selected by the p-value test across the 17 traces and normalize the sum to the 0–1 range (by dividing by 17). Next, we average the Chi-squared values per feature across the traces and perform min–max normalization to the 0–1 range. We repeat the latter process for the Gini Importance score. The final score is computed by summing the three normalized values per feature, and we use it to rank the most important features. Summation is used because it is simple, interpretable, and effectively balances feature importance without introducing bias. Additionally, since all scores are already normalized, summation naturally maintains a balanced aggregation without distorting the impact of high rankings. Finally, we use the correlation matrices to remove redundant features.
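The aggregation can be sketched as follows, assuming the per-trace results are held in pandas DataFrames indexed by feature name (a data layout of our choosing, not the authors').

```python
# Hedged sketch of the cross-trace score aggregation; the DataFrame layout
# (rows = features, columns = the 17 traces) is an assumption.
import pandas as pd

def combine_scores(pvalue_selected: pd.DataFrame, chi2: pd.DataFrame,
                   gini: pd.DataFrame) -> pd.Series:
    # Fraction of traces in which the p-value test selected each feature.
    p_score = pvalue_selected.sum(axis=1) / pvalue_selected.shape[1]
    # Average per feature across traces, then min-max normalize to [0, 1].
    chi2_mean = chi2.mean(axis=1)
    chi2_score = (chi2_mean - chi2_mean.min()) / (chi2_mean.max() - chi2_mean.min())
    gini_mean = gini.mean(axis=1)
    gini_score = (gini_mean - gini_mean.min()) / (gini_mean.max() - gini_mean.min())
    # Final score: simple sum of the three normalized components.
    return (p_score + chi2_score + gini_score).sort_values(ascending=False)
```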
Table 2 presents the combined scores for each feature, along with an indication (✓) if the feature was selected during our feature selection process. Features with a final score of less than 1 exhibit low importance when predicting the target variables and thus can be discarded without loss of correctness in the prediction task. For instance, the Time since last update of a file has the lowest score across all target variables. On the other hand, Fully Read Frequency and Request Offset have the highest scores across all targets, indicating that they are helpful in predicting the target values.
The sets of features selected to predict the target variables have some differences, but the majority of the chosen features are shared between all sets. Based on the ranking scores and the correlation matrices, we selected all features related to the file and directory access pattern. From the File Format, we kept only the File Format ID and File Count because the patterns related to file formats either achieved a low score or had a high correlation with the selected directory pattern features. Features associated with the history of accesses are also not selected, except for the Time since last access and Time since creation. The final set of selected features has 24 features for the Next File Hotness, 27 for the Next Offset, 25 for the Next File Hotness Class, and 25 for the Next Offset Class.
To gain deeper insight into how each feature affects model prediction, we employed SHAP (SHapley Additive exPlanations) values [65] derived from tree-based machine-learning models, namely XGBoost models. XGBoost [66] is a gradient-boosting framework that builds an ensemble of trees and works well for both regression and classification problems. For each target variable and each trace, we trained an appropriate model (regressor or classifier), computed SHAP values for all features, and generated violin summary plots. In the case of classification, SHAP values were averaged over all classes to produce a single importance value per feature.
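A minimal sketch of this analysis is shown below, assuming a pandas feature matrix X and target y; default model hyperparameters and the class-averaging details are assumptions, while the 10,000-row sample and top-20 display mirror the text.

```python
# Hedged sketch of the per-trace SHAP analysis; hyperparameters and the
# class-averaging details are assumptions, not the authors' exact setup.
import numpy as np
import shap
import xgboost as xgb

def shap_violin(X, y, classification: bool = False):
    """Train an XGBoost model, compute SHAP values, draw a violin summary plot."""
    model = xgb.XGBClassifier() if classification else xgb.XGBRegressor()
    model.fit(X, y)
    sample = X.sample(n=min(10_000, len(X)), random_state=0)  # as in the text
    values = shap.TreeExplainer(model).shap_values(sample)
    if classification and isinstance(values, list):
        values = np.mean(values, axis=0)       # one array per class -> average
    elif classification and getattr(values, "ndim", 2) == 3:
        values = values.mean(axis=-1)          # (samples, features, classes)
    shap.summary_plot(values, sample, plot_type="violin", max_display=20)
```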
Figure 1 displays the violin summary plots for the four targets for Trace 17, our longest production trace. These plots visualize the distribution of SHAP values for each of the top 20 features across a sample of 10,000 instances, capturing both the magnitude and variability of each feature’s impact on the model’s predictions. The spread on the X-axis indicates how much this feature changes predictions, the direction on the X-axis indicates whether the feature increases or decreases the prediction, while the vertical spread shows the variation of effect across the data points.
In the Next File Hotness regression setting visualized in Figure 1a, the (current) File Hotness is the most impactful feature, with positive SHAP values for higher feature values, indicating that hotter files typically increase the target variable. Features such as File Fully Read Frequency and File Size also exhibited strong influence, while other features have SHAP values generally concentrated near zero but showing asymmetry in certain cases. In contrast, the classification model shown in Figure 1b revealed a broader spread of SHAP values and highlighted additional influential features such as Time Since Creation and Directory Access Frequency, reflecting the model's effort to delineate class boundaries.
As shown in Figure 1c, the (current) Request Offset feature dominates the regression model for Next Offset, indicating a direct relationship to the predicted next offset, which is to be expected for sequential workloads. File Length Pattern and File Size also reveal a strong influence, while Time Since Last Access has an interesting bi-modal effect, whereby both high and low values can push predictions in either direction, suggesting interaction effects. On the other hand, the classification model shown in Figure 1d identifies features like File Length Pattern Frequency and Directory Temporal Mild Ratio as more discriminative for distinguishing between Next Offset classes. The broader range and greater dispersion of SHAP values in the classification plot reflect more complex, nonlinear interactions.