Article

DuCFF: A Dual-Channel Feature-Fusion Network for Workload Prediction in a Cloud Infrastructure

1 School of Artificial Intelligence and Software Engineering, Nanyang Normal University, Nanyang 473000, China
2 Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
3 College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(18), 3588; https://doi.org/10.3390/electronics13183588
Submission received: 21 August 2024 / Revised: 3 September 2024 / Accepted: 9 September 2024 / Published: 10 September 2024

Abstract

Cloud infrastructures are designed to provide highly scalable, pay-per-use services to meet the performance requirements of users. Workload prediction for the cloud plays a crucial role in proactive auto-scaling and the dynamic management of resources, moving toward fine-grained load balancing and job scheduling through its ability to estimate upcoming workloads. However, due to users' diverse usage demands, the changing characteristics of workloads have become more and more complex, including not only short-term irregular fluctuations but also long-term dynamic variations. This prevents existing workload-prediction methods from fully capturing the above characteristics, degrading prediction accuracy. To deal with these problems, this paper proposes a framework based on a dual-channel temporal convolutional network and transformer (referred to as DuCFF) to perform workload prediction. First, DuCFF introduces data preprocessing technology to decouple the different components implied by the workload data and combines them with the original workload to form new model inputs. Then, in a parallel manner, DuCFF adopts a temporal convolution network (TCN) channel to capture local irregular fluctuations in the workload time series and a transformer channel to capture long-term dynamic variations. Finally, the features extracted from these two channels are fused, and workload prediction is achieved. The performance of the proposed DuCFF was verified on various workload benchmark datasets (i.e., ClarkNet and Google) and compared with nine competitors. Experimental results show that the proposed DuCFF achieves average performance improvements of 65.2%, 70%, 64.37%, and 15% in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-squared ($R^2$), respectively, compared to the baseline model CNN-LSTM.

1. Introduction

Cloud infrastructure is the aggregation of hardware and software resources that make up cloud computing, forming the global data centers of different service providers. According to Gartner's latest report, global end-user spending on public cloud services is expected to grow by 20.4% to $678.8 billion in 2024, up from $563.6 billion in 2023 [1]. Moreover, most emerging techniques, e.g., remote sensing, the Internet of Things, fog and edge computing, and cyber–physical systems, rely heavily on cloud computing because the storage and computing capabilities of these systems are insufficient [2,3,4,5,6]. While users' demand for these software systems grows daily, concerns about the availability, reliability, and Quality of Service (QoS) of cloud infrastructure are also rapidly increasing. Dramatic variations in workloads tend to trigger the over- or under-provisioning of resources, which drives cloud service providers (CSPs) to agilely formulate resource-allocation strategies that prevent the violation of service-level agreements (SLAs). Therefore, to reduce energy consumption costs and improve resource usage efficiency, workload prediction is highly needed for dynamic resource provisioning [7,8,9].
In the cloud computing paradigm, resource-provisioning methods mainly fall into two classes: reactive and proactive [10]. Reactive scaling methods allocate resources after demands arrive, which is disadvantageous because the initialization and dynamic migration of hosted virtual machines are not instantaneous. This easily leads to the over- and under-provisioning of computing resources and ultimately causes resource waste, increased power consumption, and SLA violations [11]. Proactive scaling techniques, in contrast, scale computing resources before practical demands arrive, but such methods need to know the upcoming workload in advance. To pursue system availability and effective resource usage, resource demands in the near future can be predicted by analyzing historical usage and request patterns and the current state of the cloud; the predicted demands can then assist in allocating the requisite physical resources in a proactive manner [12]. This effectively reduces the number of active servers by allowing only the required number of physical machines to be active, which improves resource utilization, reliability, and service availability. In addition, the intensity of the workload also affects the aging of the software itself [13,14] and plays a crucial role in the formulation of software rejuvenation strategies [15]. Nevertheless, the heterogeneous nature of workloads in cloud infrastructure (i.e., end-users submit different types of application requests, which require heterogeneous resource capacities), their high burstiness (i.e., local irregular fluctuations), and their long-term dynamic variations (i.e., the long-running characteristics of cloud systems) make accurate workload prediction an intractable task.
A large number of methods have been developed for workload prediction tasks in the cloud, e.g., the autoregressive integrated moving average model (ARIMA) [11,16], the multi-layer perceptron (MLP) [17,18], the convolutional neural network (CNN) [19,20], the recurrent neural network (RNN) and its variants [21,22,23,24,25], and others. However, workload prediction remains intractable since the workload time series changes dynamically in the face of various user demands. Traditional statistical analysis methods, e.g., linear regression, exponential smoothing, the autoregressive moving average model (ARMA) [26], and ARIMA [11,16], are easy to understand and have been widely adopted for cloud workload prediction scenarios. However, these methods often assume a linear relationship among the changes in workload time series data. Therefore, nonlinear variations and long-term temporal correlations may not be fully modeled under linear function constraints [16].
Moreover, traditional machine learning methods, such as support vector regression (SVR) [27] and MLP [17,18], have also been widely adopted in cloud workload prediction. SVR and MLP can capture the mapping relationships between input and output through nonlinear transformations. However, these models may encounter challenges during training, as they can become trapped in local optima and experience gradient problems, which inevitably degrades their prediction performance. More recently, deep learning methods, owing to their powerful representation learning ability, have also been employed in workload modeling tasks. Models such as RNN [21] and its variants, including long short-term memory (LSTM) [25] and gated recurrent units (GRUs) [19], CNN [20], bidirectional LSTM (BiLSTM) [22], and their hybrids [9,20,28,29,30], have been frequently used in this field. Among these models, CNNs are used to capture spatial or local relationships in workload time series data [20], while RNN and its variants are applied to model temporal dependencies in workload data [19,25]. However, CNN-based methods generally possess limited ability to model long-term temporal dependencies because their receptive field is limited by the size of the convolution kernel [31,32]. Moreover, RNN and its variants (e.g., LSTM) suffer from degraded computational efficiency, since they cannot execute parallel computing, and they are confronted with gradient vanishing and exploding problems when modeling long-range temporal relationships [33,34]. In real-world cloud environments, the majority of workload traces show both short-term irregular fluctuations and long-term dynamic variations [35]. Sudden spikes in the incoming workload represent unprecedented variations in the heterogeneous request patterns of different users over time. These problems cause the loss of useful information during the feature-extraction stage, so such models may struggle to extract these features effectively [33]. Therefore, the designed model should possess comprehensive extraction abilities for both local irregular fluctuations and long-term dynamic variation characteristics to provide high prediction accuracy.
For this purpose, this paper presents a new framework, called DuCFF, to handle the above issues in workload-prediction tasks. DuCFF integrates a temporal convolution network (TCN) channel and a transformer channel in parallel to achieve feature extraction of both local irregular fluctuations and long-term dynamic variations. TCN weakens the influence of the convolution kernel size and effectively captures intrinsic local information [31]; it is broadly adopted in natural language processing (NLP) [31], time series prediction and classification [36,37], and many other fields [38]. TCN compensates for the weaknesses of the CNN architecture by using hierarchically structured temporal convolutional filters with larger receptive fields and fewer parameters. The transformer network [34] is an innovative architecture with parallel computing and long-term dependency modeling capabilities. It relies solely on the self-attention mechanism, replacing recurrent or convolutional layers in sequence-to-sequence prediction tasks. The transformer architecture has been broadly employed in numerous fields, including NLP [34], time series prediction [39], computer vision, and pattern recognition [40,41]. In the workload-prediction tasks studied in this article, its ability to process any part of the workload time series without distance limits enables it to better capture coupled long-range temporal relationships.
In particular, DuCFF first adopts the commonly used variational mode decomposition (VMD) [9] method to decouple the different implicit characteristics of the workload time series, which are then used as auxiliary information and combined with the original workload as the model input. Then, DuCFF employs stacked multi-scale TCN blocks to extract local irregular fluctuation features from the original input. In parallel, the proposed model adopts stacked transformer blocks to extract long-term dynamic variation characteristics. By adopting this divide-and-conquer strategy, the robustness of the model is enhanced: the loss of vital information is reduced while the integrity of the extracted information is maintained. Afterward, DuCFF fuses the high-level features extracted from the two channels, and these fused features are input to two fully connected layers to accomplish workload prediction for the next time slot. Experimental analysis shows that the proposed DuCFF achieves average improvements of 65.2%, 70%, 64.37%, and 15% in prediction performance on four evaluation metrics, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-squared ($R^2$), compared with CNN-LSTM. This study makes the following contributions:
  • A workload-prediction framework named DuCFF is presented to fully extract the characteristics of local irregular fluctuation and long-term dynamic variations implied in the original workload time series data of the cloud.
  • The TCN channel is designed to capture local irregular fluctuation information from the raw input. In parallel, the transformer channel is adopted to model long-term dynamic variation features, which improves workload prediction accuracy.
  • The prediction performance of the proposed DuCFF is confirmed by extensive experimental evaluation on three real-world benchmark datasets. The numerical results verify that DuCFF is superior to its nine competitors.
The rest of the paper is organized as follows. Section 2 introduces related research on workload prediction. Section 3 describes the details of the DuCFF framework, and Section 4 depicts the numerical results of the models through extensive experiments. Finally, Section 5 summarizes the conclusion and future work plan.

2. Related Work

Workload prediction estimates upcoming resource utilization and workload demands to help cloud data centers run smoothly. To date, many methods have been developed to predict the future evolution of workloads in cloud infrastructure. Sorting through the existing workload prediction methods, this article divides them into the following three categories: statistical analysis-based models, machine learning-based models, and hybrid models.
A. Statistical analysis models
Workload prediction using statistical analysis models has a long history. Traditional statistical analysis models usually have extremely high computational efficiency and relatively simple model structures [16]. The modeling idea of this type of method is to use the known historical data and current data in the workload time series as input samples to estimate various parameters that need to be configured in the model. After the parameters are determined, the constructed model can fit the future evolution trend of the workload data to a certain extent, thereby realizing the process of trend extrapolation. For instance, Calheiros et al. [16] applied ARIMA to estimate workload based on the historical real traces of requests to web servers and further evaluated the influence of the achieved accuracy in terms of efficiency in resource utilization and QoS. Zharikov et al. [42] constructed six alternative prediction models, Simple Exponential Smoothing (SES), Holt’s Linear Trend, Holt’s Damped Trend, ARIMA, Linear Regression with Trend (LR), and TBATS, to determine the best combination that generates the most accurate prediction in a cloud data center. Farahnakian et al. [43] employed the LR model to predict the short-term future CPU usage based on the usage history in each host, which can determine the over-loaded and under-loaded hosts by prediction in the live migration process. Debusschere and Bacha [44] adopted seasonal ARIMA (SARIMA) to estimate the future 168 h server workload, which allows the ARIMA model to directly address the seasonal patterns.
Nevertheless, due to the complex nature of workload time series, statistical models cannot provide accurate and reasonable prior assumptions. The premise assumptions these models require therefore restrict their application in workload prediction to a certain extent, preventing better predictions.
B. Machine learning models
Compared with traditional statistical models, machine learning models have more powerful nonlinear modeling capabilities [19,25]. They can directly approximate arbitrary mapping functions through training on historical data and can effectively model nonlinear and non-stationary workload time series. For example, Moghaddam et al. [17] adopted multiple prediction models, such as LR, MLP, SVR, decision tree regression, and boosted decision tree regression, to predict over-utilized hosts in cloud data centers; virtual machines were then migrated based on the prediction results, virtual machine migration time, and other constraints to achieve energy-aware VM consolidation. Kumar et al. [18] developed a neural-network-based workload-prediction framework and used an improved, adaptive differential evolution algorithm to improve its learning efficiency; the experimental results confirmed the superiority of this method on four real-world data traces in comparison with the average, back-propagation, and self-adaptive differential evolution (SaDE) algorithms. Duggan et al. [21], considering that cloud resource consumption is constantly changing, constructed a workload prediction model based on the RNN; compared with traditional prediction methods, i.e., the back-propagation neural network, random walk, and moving average, the RNN obtained better prediction accuracy. Recently, Ruan et al. [25] developed a storage workload prediction method called CrystalLP based on the LSTM model and implemented a sensitivity analysis of the LSTM hyperparameters; compared with baseline methods, i.e., ARIMA, SVR, and RNN, CrystalLP achieved improvements in prediction performance. Xu et al. [19] used a GRU-based model (named esDNN) to perform workload prediction and used a rolling-window method to convert the workload data into a supervised learning time series; compared with BiLSTM, esDNN achieved better prediction results. Chen et al. [45] constructed a resource usage prediction method (named RPTCN) based on temporal convolutional networks (TCNs) for cloud systems, further combining fully connected layers and an attention mechanism with TCNs to improve prediction accuracy; the experimental results confirmed that RPTCN significantly outperforms LSTM. In addition, Selvan Chenni Chetty et al. [46] proposed a hierarchical tree-based deep convolutional neural network (T-CNN) model with sheep flock optimization (SFO) to enhance the power efficiency and workload prediction of cloud data centers (CDCs); the proposed model was evaluated using Saskatchewan HTTP traces and NASA benchmark datasets, showing promising predictive accuracy.
The above works verify that the prediction performance of machine learning models is superior to that of traditional statistical analysis models. However, workload data often have multiple overlapping characteristics, and when faced with different scenarios, these models cannot model the different characteristics of workload time series simultaneously, leaving room for improvement in prediction accuracy. Our work differs from these works: we use multiple techniques to capture the different features of the workload time series to achieve complete feature extraction and prediction.
C. Hybrid models
The hybrid model can alleviate the problems of insufficient prediction ability and unstable prediction results of a single model by integrating a variety of different technologies, thereby effectively achieving workload prediction [11]. A variety of hybrid models have been developed by researchers. For instance, Xie et al. [47] constructed a hybrid model combining ARIMA and Triple Exponential Smoothing (TES) to model both the linear and nonlinear correlations in docker container workload time series on a cloud platform and confirmed that this hybrid model achieves a performance promotion. Devi and Valli [48] considered the linear and nonlinear components hidden in resource utilization patterns and adopted the hybrid ARIMA-ANN model to perform prediction in the cloud environment; experimental results indicated that ARIMA-ANN is better than the single models ARIMA and ANN. For high-dimensional and highly variable cloud workload data, Chen et al. [49] integrated a top-sparse auto-encoder (TSA) and a gated recurrent unit (GRU) (named L-PAW) to address the high-dimensional and highly variable features, respectively, and extensive experiments demonstrated the effectiveness and adaptability of L-PAW. In recent years, hybrid models based on CNN, LSTM, and their variants have also received widespread attention. For example, Patel and Kushwaha [20] adopted three parallel dilated 1D CNN layers to filter the noise and used an LSTM layer to learn the temporal dependencies in host CPU usage, and performance comparisons verified the superiority of the proposed model over several competitors. Patel and Kushwaha [28] further integrated parallel and stacked 1-Dimensional CNN (1DCNN) and LSTM networks (named RCP-CL) to model random fluctuations and novel continuous and periodic patterns from contiguous and non-contiguous CPU load values augmented with daily and weekly time patterns; their experimental results confirmed that RCP-CL performs better than LSTM and 1D-CNN. Bi et al. [22] integrated bi-directional and grid long short-term memory networks to predict workload and resource time series, adopting data preprocessing operations, including a Savitzky–Golay (SG) filter and Min–Max scaling, to remove the impact of noise and extreme points before model training; experimental analysis verified the model's advantage over its competitors. Chen et al. [6] adopted multiple technologies, including the SG filter, CNN, and BiLSTM with an attention mechanism, to construct the hybrid model SG-CBA for predicting workload in edge data centers, achieving better performance relative to baseline models, e.g., RNN, GRU, and BiLSTM. Zhang et al. [50] proposed a hybrid model combining the TES method and an LSTM network to perform proactive Docker container workload prediction, which not only captures both short-term and long-term dependencies in container resource time series but also smooths the container resource utilization data. For complex and nonlinear workload and resource time series, Bi et al. [9] combined variational mode decomposition (VMD), an SG filter, a multi-head attention mechanism, and bidirectional and grid versions of LSTM networks to achieve accurate prediction, and extensive experiments confirmed that the proposed model outperforms several advanced prediction models in terms of prediction accuracy. We further illustrate the differences between existing studies and this study along four dimensions, i.e., proposed models, predicted parameters, dataset sources, and metrics used, in Table 1.
Unlike these works, which combine CNN, LSTM, and their variants, or hybridize ARIMA with machine learning methods to perform cloud workload prediction, in this work we use TCNs and transformer networks to construct a new workload-prediction framework, DuCFF, which models the local irregular fluctuation and long-term dynamic variation features of the workload time series in parallel. This framework is committed to further improving workload prediction accuracy while making up for the weaknesses of traditional CNN and LSTM.

3. Proposed Method

3.1. Problem Formulation

The goal of workload prediction is to predict upcoming resource requests based on historical measurement records. In this paper, we target univariate time-series data of user requests collected at regular time intervals from various cloud workloads. Let $x = \{x_1, x_2, x_3, \ldots, x_t\}$ be a workload sequence over time $t$, where $x_t$ indicates the workload value at time $t$. Specifically, we use the workload data of the previous $t$ time slots to predict the workload value of the next time slot $t+1$. The process can be described as
$$\hat{x}_{t+1} = f(x_1, x_2, x_3, \ldots, x_t) \tag{1}$$
The predicted value $\hat{x}_{t+1}$ is compared against the real value $x_{t+1}$, and the prediction error $e_{t+1}$ is calculated at time $t+1$. Finally, the objective of training the proposed model is to minimize $e_{t+1}$.

3.2. Prediction Process of DuCFF

In this section, we detail the proposed DuCFF framework, which addresses the issues mentioned above in cloud workload prediction. The proposed DuCFF builds the mapping relationship directly from the historical workload time series $x = \{x_1, x_2, x_3, \ldots, x_t\}$ to the next time slot $y = x_{t+1}$. The construction process of the proposed DuCFF framework is depicted in Figure 1a; it contains three main modules: the data preprocessing module, the feature extraction module, and the workload prediction module.
First, we implement the data preprocessing step before feeding the workload data into the network for model training. This step consists of three techniques: variational mode decomposition (VMD), data normalization, and moving-window sampling. Then, two feature extraction channels, i.e., the TCN channel and the transformer channel, are constructed in a parallel manner. The former, which includes multiple stacked multi-scale TCN layers, is used to extract local irregular fluctuations, and the latter, which includes multiple stacked transformer layers, is applied to model the long-term dynamic variations in the workload data. Afterward, DuCFF fuses the high-level features extracted from the two channels, and finally, these fused features are input to two fully connected layers to accomplish workload prediction for the next time slot. The specific implementation details of each step are described below.

3.3. Data Preprocessing

Decomposition process of VMD [9]: The initial workload data are decomposed into a finite number $n$ of band-limited intrinsic mode functions (IMFs) by VMD, which was proposed by Dragomiretskiy [51]. This method has the advantages of high computational efficiency and robustness; it solves the modal mixing problem and can effectively denoise the data [9,51]. The decomposition process of VMD is described in [51]. In this paper, the decomposed results for the workload time series are depicted in Figure 2, and the numbers of decomposed subcomponents $n$ (IMFs) are set to 9, 9, and 12 for the three datasets, respectively. It can be observed that VMD can filter noise and extract the seasonal variations and trend information of the workload data, which serve as auxiliary information for DuCFF's workload prediction.
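As an illustration of this preprocessing step, the following Python sketch shows how the decomposition and input assembly could be implemented. The third-party vmdpy package and the parameter values (alpha, tau, tol) are assumptions for illustration; the paper specifies only the number of IMFs $n$.

```python
import numpy as np
from vmdpy import VMD  # pip install vmdpy

def decompose_workload(series: np.ndarray, n_imfs: int = 9) -> np.ndarray:
    """Decompose a 1-D workload series into n IMFs and stack them with
    the raw series to form the multi-channel model input."""
    alpha, tau, dc, init, tol = 2000, 0.0, 0, 1, 1e-7  # commonly used VMD settings
    imfs, _, _ = VMD(series, alpha, tau, n_imfs, dc, init, tol)
    # imfs has shape (n_imfs, T); vmdpy may trim an odd-length series by one sample
    T = imfs.shape[1]
    return np.vstack([series[np.newaxis, :T], imfs]).T  # shape (T, n_imfs + 1)
```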
Data Normalization: In the data preprocessing step, the raw data from different traces are extracted, and the collected workloads are aggregated into fixed time intervals and assembled into a time series. The recorded workload data come from different user demands and therefore have varying scales. To mitigate the impact on model training, we normalize the data to a specific range [6]. We use $n$ to indicate the number of time slots of workload data. Formally, the raw workload data can be represented as $X = \{x_1, \ldots, x_i, \ldots, x_n\}$, where $x_i$ indicates the number of requests at the $i$th moment. This article uses the common Min–Max normalization [6] to map the collected workload data into the range [0, 1] as per Equation (2).
$$x_{i\_norm} = \frac{x_i - x_{min}}{x_{max} - x_{min}} \tag{2}$$
where $x_i$ and $x_{i\_norm}$ indicate the raw and normalized data in the workload time series, and $x_{max}$ and $x_{min}$ indicate the upper and lower bounds of the workload time series, respectively.
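A minimal sketch of this normalization and its inverse (needed later to map predictions back to the original scale) could look as follows; sklearn's MinMaxScaler would serve equally well.

```python
import numpy as np

def min_max_normalize(x: np.ndarray):
    """Scale a workload series into [0, 1] as in Equation (2)."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def inverse_normalize(x_norm: np.ndarray, x_min: float, x_max: float):
    """Map normalized predictions back to the original data scale."""
    return x_norm * (x_max - x_min) + x_min
```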
Moving-window sampling: The workload prediction proposed in this paper is essentially a supervised learning method, so the workload time series data need to be converted into a supervised input form. This paper adopts the moving-window sampling method, which exploits the temporal dependency in workload time series and increases the number of training samples [19], to construct the input of the neural network in a continuous-time manner, as shown in Figure 3 (taking a univariate series as an example). In this way, the workloads observed over time are stored as historical data and used as input, and the next time slot is regarded as the target output. During this procedure, the input samples for training consist of the data samples within the time window, while the target output to be predicted is the request value at the next time slot [7]. Finally, the input samples $x$ and corresponding target outputs $y$ generated by this step can be formally described as in Equation (3).
$$x = \begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_k \\ x_2 & x_3 & x_4 & \cdots & x_{k+1} \\ x_3 & x_4 & x_5 & \cdots & x_{k+2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n-k} & x_{n-k+1} & x_{n-k+2} & \cdots & x_{n-1} \end{bmatrix}, \quad y = \begin{bmatrix} x_{k+1} \\ x_{k+2} \\ x_{k+3} \\ \vdots \\ x_n \end{bmatrix} \tag{3}$$
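The sampling of Equation (3) amounts to the following short routine, shown here as a sketch for the univariate case with window length k:

```python
import numpy as np

def sliding_window(series: np.ndarray, k: int):
    """Convert a series into supervised pairs: x[i:i+k] -> x[i+k]."""
    X = np.stack([series[i:i + k] for i in range(len(series) - k)])
    y = series[k:]
    return X, y  # X: (n - k, k), y: (n - k,)
```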

3.4. Feature Extraction

3.4.1. TCN Channel

The temporal convolution network (TCN) is a variant of the traditional CNN architecture. As shown in Figure 4, compared with CNN, TCN can adjust the receptive field size via the number of stacked layers, the dilation factor, and the filter size, and it can model the temporal dependency in time series data [31]. In practice, the performance of TCN can reach or even exceed that of various RNN-structured models (for example, LSTM and GRU) on various tasks [31]. A TCN architecture consists of causal convolution, dilated convolution, and residual connections [32]. These components are described as follows.
A. Causal convolution: Figure 4b is the illustrative diagram for causal convolution. As an essential part of TCN, causal convolution ensures that an element at time $t$ in the next layer depends only on the element at time $t$ in the previous layer and its preceding elements. The difference from traditional CNN is that the convolution operation of causal convolution cannot see future data; it is a strictly time-constrained model.
B. Dilated convolution: When a traditional CNN captures long temporal dependencies in time series, it is limited by the size of the convolution kernel and needs to stack more convolutional layers linearly to achieve this purpose. However, as the network deepens, the model parameters increase rapidly, requiring more training time and easily introducing gradient-related degradation problems. Therefore, as another essential component of TCN, dilated convolution [31,32] is introduced, as depicted in Figure 4c. Dilation makes the CNN filter sample the input at intervals given by the dilation factor $d$, thereby increasing the receptive field, where the dilation refers to the length of the interval between sampled elements of the input sequence. An illustrative diagram of dilated causal convolution (DCC) is shown in Figure 4d. Given a time series input $x \in \mathbb{R}^n$ and a filter $f$, the computation of DCC for element $s$ is described in Equation (4) [32]:
$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \tag{4}$$
where $d$ indicates the dilation factor, $k$ is the kernel size, $*_d$ is the dilated convolution operation, and $s - d \cdot i$ accounts for the orientation toward the past. For the $j$th DCC layer of a residual block in the TCN channel, the computation process is as follows:
$$DCC^{(j)} = \mathrm{Conv1D}\left(W^{(j)}, b^{(j)}, x_i;\ \mathrm{kernel\_size} = k,\ \mathrm{dilation\_rate} = d\right) \tag{5}$$
Through the normalization and dropout layers, the output of the residual-mapping branch, denoted $DCC$, is obtained. Finally, the output $H(x_i)$ of the shortcut-mapping branch, after a $1 \times 1$ convolution, is added to $DCC$ to obtain the final output $O_{tcn\_i}$ of a residual block, as shown in Equation (6).
$$O_{tcn\_i} = DCC + H(x_i) \tag{6}$$
$$O_{tcn} = \mathrm{Add}(O_{tcn\_1}, O_{tcn\_2}, O_{tcn\_3}) \tag{7}$$
Figure 1b shows the multi-scale TCN layer adopted in this paper, which includes three residual blocks in total. Each block consists of dilated causal convolution, normalization, activation, and dropout layers, and the output of each TCN layer is $O_{tcn}$, as in Equation (7). After TCN channel processing, these extracted local irregular fluctuation features $O_{tcn}$ are fused with the extracted long-term dynamic features, as described in the following subsection.
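To make the structure of this channel concrete, the following Keras sketch implements one residual block and the three-block multi-scale layer; the filter count, kernel size, and the dilation rates (1, 2, 4) are illustrative assumptions rather than the paper's exact configuration.

```python
from tensorflow.keras import layers

def tcn_residual_block(x, filters=32, kernel_size=3, dilation_rate=1, dropout=0.1):
    """One residual block: dilated causal conv (Eqs. 4-5), normalization,
    dropout, plus a 1x1-conv shortcut H(x) added as in Eq. (6)."""
    h = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    h = layers.LayerNormalization()(h)
    h = layers.Dropout(dropout)(h)
    s = layers.Conv1D(filters, 1)(x)  # shortcut branch, matches channel count
    return layers.Add()([h, s])

def multi_scale_tcn_layer(x):
    """Three residual blocks in parallel, summed as in Eq. (7)."""
    outs = [tcn_residual_block(x, dilation_rate=d) for d in (1, 2, 4)]
    return layers.Add()(outs)
```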

3.4.2. Transformer Channel

TCNs have the advantage of modeling the local irregular fluctuation characteristics and temporal dependencies of workload data but have limitations when extracting long-term dynamic variation characteristics: TCNs keep the weight of each input element fixed, so the importance of key elements that are beneficial to prediction is weakened or even ignored. Transformers can make up for these shortcomings of TCNs when extracting long-term dynamic variations in workload data. Compared with the traditional RNN and its variants (LSTM and GRU), as well as CNN, the transformer architecture mainly has the following advantages: (1) the ability to perform parallel computing; (2) the ability to better handle long-distance dependencies; and (3) the ability to directly calculate the correlation between any two sequence elements without passing through hidden layers, which is more direct and efficient.
In addition, a dual-channel architecture is adopted in this paper to separately capture local and long-term variation features. This alleviates the risk of mutual information interference and improves the feature-extraction abilities of the model, thus achieving superior prediction performance. In this study, only the encoder part of the traditional transformer architecture is utilized, building a lighter model structure [52] that is better suited to long-range feature learning in workload prediction. The transformer encoder consists of two sub-modules, namely multi-head self-attention and a feed-forward network; the network architecture is shown in Figure 1c. Each sub-module uses a residual connection [34,52] to transfer information, followed by a layer normalization operation.
A. Multi-head self-attention: The core component of the transformer network is multi-head self-attention (MSA), which is designed to compute the similarity between elements of the input sequence [34], as depicted in Figure 5. First, the input sequence $x$ is embedded and transformed into $d_{model}$-dimensional queries (Q), keys (K), and values (V), as shown below.
$$Q = xW_q, \quad K = xW_k, \quad V = xW_v \tag{8}$$
where $W_q, W_k \in \mathbb{R}^{d_{model} \times d_k}$ and $W_v \in \mathbb{R}^{d_{model} \times d_v}$ are learnable weight parameters and $d_k$ indicates the dimension of the key matrices.
Then, the self-attention matrices can be calculated by adopting the scaled dot-product function with a scale factor of $1/\sqrt{d_k}$; the computation is formulated as
$$S = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{9}$$
where $QK^T/\sqrt{d_k}$ is the scaled attention matrix, which captures the correlation between $Q$ and $K$ and is normalized using the Softmax activation function. However, feature extraction using single-head attention alone is often insufficient. Therefore, it is vital to consider feature information from different representation subspaces to ensure comprehensive information extraction. To address this, the MSA mechanism is developed, which performs $H$ single-head attention calculations and then concatenates the outputs of these heads. Thus, MSA can be calculated as shown below:
$$o_{MH} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_i, \ldots, \mathrm{head}_H) \tag{10}$$
$$\mathrm{head}_i = \mathrm{Attention}(Q, K, V) \tag{11}$$
where $H$ indicates the number of heads. After the MSA computations are completed, the output $o_{MH}$ of this sub-block is obtained through a residual connection and layer normalization and is used as the input of the feed-forward network block.
B. Feed-forward network: In addition to the MSA part, every block of the transformer network contains a feed-forward network (FFN) component. The FFN we use consists of two fully connected layers with a ReLU activation function between them, formulated as
$$\mathrm{FFN}(o_{MH}) = W_2\,\mathrm{ReLU}(W_1 o_{MH} + b_1) + b_2 \tag{12}$$
where $W_1$ and $W_2$ indicate the weight matrices of the FFN. Additionally, to stabilize training, a residual connection is adopted here, yielding the output $O_{trans}$ of a specific transformer block, expressed as
$$O_{trans} = o_{MH} + \mathrm{FFN}(o_{MH}) \tag{13}$$
The transformer channel of this paper includes two transformer blocks, and these two blocks possess the same network structure. The computation processes of each block follow Equations (8)–(13).
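A sketch of one such encoder block using Keras's built-in MultiHeadAttention layer is given below; d_model, the feed-forward width, and the assumption that the input is already embedded to d_model channels are illustrative choices, not the paper's exact configuration.

```python
from tensorflow.keras import layers

def transformer_block(x, d_model=32, num_heads=5, d_ff=64):
    """Encoder block: MSA (Eqs. 8-11) and FFN (Eq. 12), each with a
    residual connection and layer normalization (Eq. 13)."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization()(layers.Add()([x, ff]))
```

Stacking the two blocks used in this paper amounts to applying this function twice.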

3.4.3. Feature Fusion

After the two channels finish the feature extraction process on the workload data, the high-level feature representations extracted from the previous layers are fused. In this paper, since each channel concentrates on different feature information implied in the workload sequence, the feature vectors extracted by the two channels are merged using element-wise addition. This fusion process is described in Equation (14):
$$O_{t\_t} = \mathrm{Add}(O_{tcn}, O_{trans}) \tag{14}$$
With this additive fusion method, the dimension of the fused features $O_{t\_t}$ is the same as that of the features extracted by each channel, so the complexity of the model is not increased.

3.5. Workload Prediction

In this step, the fused features $O_{t\_t}$ are forwarded to the regressor, which includes two fully connected layers. This achieves the relational mapping from the features extracted from the historical workload data to the next time slot. The prediction procedure is calculated as shown in Equation (15):
$$Workload_{pred} = W_2\,\mathrm{ReLU}(W_1 O_{t\_t} + b_1) + b_2 \tag{15}$$
where $Workload_{pred}$ is the estimated workload value. During the training process, the DuCFF model is optimized using the Mean Square Error (MSE), and this loss function is computed as shown below:
$$MSE_{Loss} = \frac{1}{N} \sum_{i=1}^{N} \left( Workload_i^{pred} - Workload_i^{true} \right)^2 \tag{16}$$
where $N$ indicates the number of samples. Before performing comparisons, a reverse normalization process is executed on the normalized prediction results in the output stage, mapping the outputs back to the original data scale. This step exposes the actual differences between the raw ground truths and the prediction outputs.
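Putting the pieces together, a minimal end-to-end sketch (reusing the channel sketches above; all layer sizes are illustrative assumptions) could look as follows:

```python
from tensorflow.keras import layers, Model

def build_ducff(time_lag: int, n_channels: int, d_model: int = 32):
    """Parallel TCN/transformer channels, element-wise fusion (Eq. 14),
    and a two-layer fully connected regressor (Eq. 15) trained with
    the MSE loss of Eq. (16)."""
    inp = layers.Input(shape=(time_lag, n_channels))  # raw series + IMFs
    x = layers.Dense(d_model)(inp)                    # shared embedding
    o_tcn = multi_scale_tcn_layer(x)                  # local irregular fluctuations
    o_trans = transformer_block(x, d_model=d_model)   # long-term dynamic variations
    fused = layers.Add()([o_tcn, o_trans])            # O_t_t, Eq. (14)
    h = layers.Dense(64, activation="relu")(layers.Flatten()(fused))
    out = layers.Dense(1)(h)                          # next-slot workload
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")       # Eq. (16)
    return model
```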

4. Experiments and Results

In this section, we perform extensive experiments using real-world workload trace datasets to evaluate the prediction performance of the proposed DuCFF. Specifically, this section consists of the introduction of the trace datasets and evaluation metrics, the baselines and hyperparameter settings, prediction performance analysis and comparison, parameter sensitivity analysis, and performance overhead analysis.

4.1. Experimental Data Description and Evaluation Metrics

We collect and collate three workload datasets from two real-world workload traces (i.e., ClarkNet–HTTP and the Google Cloud trace) that cover different application categories.
ClarkNet-HTTP trace: This trace contains two weeks of HTTP requests to the ClarkNet WWW server. Each record consists of the host, timestamp, request, HTTP reply code, and bytes in the reply [53]. The raw requests are not structured as a time series with regular, uniform timestamps. For experimental purposes, this paper aggregates the requests into five-minute intervals to build a univariate time series containing 4009 samples, abbreviated as CNH.
Google Cloud trace: This cluster trace includes a total of 1600 log files recording the operating data of different virtual machines. Each file is recorded every 5 min by the virtual machine monitor, resulting in 288 data points per day. Each group of data consists of two columns: CPU and memory utilization. The data containing only machine events are accessible on GitHub (https://github.com/HiPro-IT/CPU-and-Memory-resource-usage-from-Google-Cluster-Data, accessed on 11 January 2024). This paper selects the CPU utilization data of two groups of VMs over ten consecutive days, with 2880 samples each, abbreviated as GC1 and GC2.
In conclusion, this paper collects three workload datasets to evaluate the performance and generalization ability of the model, as depicted in Figure 6. These three workloads are selected because they have special characteristics and dynamics in their workload patterns. For instance, the ClarkNet–HTTP trace shows dynamic spikes and high fluctuations, while the Google traces show seasonality and high fluctuation. In the experiments, 70% of the data are adopted as the training set, while the remaining 30% are used as the testing set.
Evaluation metrics: We adopt four performance evaluation metrics, including MAE, RMSE, MAPE, and $R^2$, to measure model performance; their computations are expressed as shown below:
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i^{pred} - y_i^{true} \right| \tag{17}$$
$$MAPE = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{y_i^{pred} - y_i^{true}}{y_i^{true}} \right| \tag{18}$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i^{pred} - y_i^{true} \right)^2} \tag{19}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i^{pred} - y_i^{true} \right)^2}{\sum_{i=1}^{N} \left( \bar{y}^{true} - y_i^{true} \right)^2} \tag{20}$$
where $y_i^{pred}$ indicates the prediction value and $y_i^{true}$ is the corresponding ground truth. For the MAE, MAPE, and RMSE metrics, the lower the value, the better the model performance. In contrast, for the metric $R^2$, the larger the value, the better the model performance, with an upper limit of 1.
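For reference, Equations (17)-(20) translate directly into NumPy:

```python
import numpy as np

def evaluate(y_pred: np.ndarray, y_true: np.ndarray):
    """Compute MAE, RMSE, MAPE, and R^2 per Equations (17)-(20)."""
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = 100.0 * np.mean(np.abs(err / y_true))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true.mean() - y_true) ** 2)
    return mae, rmse, mape, r2
```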

4.2. Baselines and Parameters Settings

Several existing models possessing state-of-the-art prediction capacities are constructed and compared with the proposed DuCFF. The details of these models are as follows:
SVR [27]: This is a traditional supervised machine learning model that aims to find a regression hyperplane with the shortest distance to all data points in a collection. We implement the algorithm by calling the Scikit-learn library and configure key parameters, such as the radial basis function (RBF) kernel, 'epsilon', and 'C', based on the prediction accuracy.
MLP [17]: This model can capture the non-linear relationships in workload data with good accuracy, and it is also known as the artificial neural network (ANN). We implement this model using a traditional three-layer network structure.
LSTM [25] and GRU [19]: These models can capture the long-term temporal dependency, and they are variants of classical RNN. We implement these two models using a single LSTM/GRU layer structure.
1DCNN [28]: This model can capture the local relationships hidden in workload time series by adopting the one-dimensional convolution operation. We implement this model using a single convolution layer.
BiLSTM [22]: This is the combination of forward LSTM and backward LSTM [6], which can learn the bidirectional long-range context dependencies in workload time series. We implement this model in the form of a single BiLSTM layer.
BiGRU–Attention [19]: This model is the combination of BiGRU and attention mechanism, which can further capture long-term dependencies in the workload sequence and weaken the interference of unimportant elements on the prediction. We implement this model using a single BiGRU layer and attention layer.
1DCNN–LSTM [28]: A workload-prediction model, based on CNN and LSTM, that models random fluctuations and novel continuous and periodic patterns from contiguous and non-contiguous CPU load values augmented with daily and weekly time patterns. We implement this model using a single 1DCNN layer and a single LSTM layer.
1DCNN–BiLSTM–Attention [6]: A workload prediction model that integrates the 1DCNN, BiLSTM, and attention mechanism to achieve feature extraction and prediction. We implement the model in the form of a one-step-ahead prediction.
The selection of parameters for the constructed workload-prediction model plays a crucial role. The parameters of the proposed DuCFF that have a significant impact on prediction performance include the time lag, epoch, number of heads, kernel size, and number of filters. We adopt a grid search strategy to determine the model parameters one by one. Specifically, for all workload datasets, we define the search range [6, 12, 18, 24, 30] for the time lag, [50, 80, 100, 150, 200] for the epoch, [2, 3, 4, 5, 6] for the number of heads, [1, 3, 5] for the kernel size, and [16, 32, 64, 128] for the number of filters and the hidden dimensions. By performing the parameter search, we select the parameters that achieve satisfactory prediction performance to equip the proposed DuCFF framework; the detailed parameter settings are given in Table 2. For all baseline models, the parameter search strategy is consistent with that of the proposed model. After the parameters of all models (including the baselines) are determined, considering the randomness of the initial weight assignment of deep neural networks, each experiment is repeated 10 times, and this paper records and reports the average results. A sketch of this search loop is given below.
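The grid search described above can be sketched as a simple loop over the quoted ranges; train_and_evaluate is a hypothetical placeholder standing in for fitting DuCFF and returning a validation error.

```python
import itertools

def train_and_evaluate(params: dict) -> float:
    # Hypothetical stand-in: build DuCFF with `params`, train it,
    # and return the validation RMSE.
    raise NotImplementedError

grid = {
    "time_lag": [6, 12, 18, 24, 30],
    "epochs": [50, 80, 100, 150, 200],
    "heads": [2, 3, 4, 5, 6],
    "kernel_size": [1, 3, 5],
    "filters": [16, 32, 64, 128],
}

best_params, best_rmse = None, float("inf")
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    rmse = train_and_evaluate(params)
    if rmse < best_rmse:
        best_params, best_rmse = params, rmse
```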

4.3. Prediction Performance Analysis and Comparison

This section investigates the prediction performance of DuCFF and the baseline models through extensive experimental evaluations. The following observations can be made.
We first report the numerical prediction results of each model on the three workload datasets, as shown in Table 3. It is easy to observe that the proposed DuCFF reaches the best predictive performance in all cases (12/12), achieving the best performance gains relative to the baseline models. This is mainly due to DuCFF's powerful feature extraction capability: after VMD preprocessing, the complex features contained in the workload time series can be readily captured by the well-designed network structure, enabling effective prediction. It is also apparent that workload prediction is intricate work for all the baseline models; they cannot achieve both low prediction errors and good fitting capability. Specifically, these baseline models cannot obtain consistent prediction performance across different workload datasets. For example, among the baseline models, SVR achieves the best prediction performance in three cases (3/12), including the MAPE value (18.38) of CNH and the MAE value (1.10) and MAPE value (3.09) of GC2. MLP obtains the optimal prediction performance in only one case (1/12), i.e., the MAE value (1.11) of GC2. GRU obtains superior prediction performance in three cases (3/12), i.e., the RMSE, MAPE, and $R^2$ values (1.43/2.48/0.9523) of GC1. 1DCNN obtains the best performance in two cases, i.e., the RMSE and $R^2$ values (1.92/0.9540) of GC2. 1DCNN–LSTM achieves the optimal prediction in three cases, i.e., the MAE, RMSE, and $R^2$ values (1214999/1736089/0.7293) of CNH.
However, the prediction performance of these baseline models in the remaining cases is unsatisfactory. Moreover, although 1DCNN–BiLSTM–Attention has the capacity to capture both local and long-term dynamic variation features, it still cannot achieve effective prediction due to its serial architecture: the integrity of information may not be ensured, causing the loss of helpful information during the feature extraction process. Furthermore, LSTM, BiLSTM, and BiGRU–Attention perform worse than the GRU model in most cases, which illustrates that a more complex model structure does not necessarily lead to significant performance gains. Relative to these baseline models, the constructed DuCFF adopts a dual channel with different scales to capture the representative features hidden in workload data that are beneficial to better predictions, thereby achieving remarkable prediction results.
Secondly, we further calculate the average improvement proportion of DuCFF relative to the baseline models to more intuitively verify the prediction ability of DuCFF and its generalization ability on different workload time series, as shown in Table 4. We can observe that DuCFF obtains a significant performance boost in terms of all four metrics. Relative to CNN-LSTM, the four evaluation metrics (i.e., MAE, RMSE, MAPE, and $R^2$) of DuCFF improve by 65.20%, 70.00%, 64.37%, and 15.00%, respectively; compared with the GRU model, DuCFF obtains performance promotions of 64.63%, 69.65%, 64.14%, and 15.09%, respectively. Interestingly, the average prediction performance of the hybrid model 1DCNN–BiLSTM–Attention is still inferior to the proposed DuCFF and is even worse than that of GRU and CNN-LSTM. As emphasized earlier, the serial structure of 1DCNN–BiLSTM–Attention extracts features in a layer-by-layer manner, which easily leads to the loss of helpful information. In summary, the elaborate network structure of DuCFF effectively avoids information loss, which contributes to extracting the local and long-term variation features from the workload data.
Finally, due to page limitations, the fitting effects of all models (i.e., DuCFF and the baselines) on the GC2 dataset are selected as an example for visualization, as depicted in Figure 7. We can observe that the DuCFF model accurately models the future variation trend even when faced with highly fluctuating workload data, showing excellent fitting ability with low fitting error; this conclusion is confirmed by the fitting effect at some mutation points in Figure 7. For the baseline models, however, local irregular fluctuations lead them astray, and they are unable to achieve an effective fit, yielding high fitting errors, which is also the root cause of their large prediction errors. The baseline models cannot weaken and filter the influence of these irregular variations on model learning, and excessive attention to these irregular data points greatly reduces a model's prediction ability. The above fine-grained comparisons of numerical results, average improvement proportions, and fitting effects confirm the advantages of the proposed model; the experimental results show that adequate feature extraction is helpful in achieving effective workload prediction.

4.4. Parameter Sensitivity Analysis

This section discusses how parameter combinations influence the prediction performance of the proposed model to explain the rationality of the model parameter selections. For this purpose, a sensitivity analysis is carried out on the CNH dataset as an example, and each parameter is varied within a reasonable range. The parameters involved in this section are the number of IMFs, the time lag, the epoch, the number of filters, and the number of heads, since they have a significant impact on model performance. The numerical results are depicted in Table 5 and Figure 8, where a small MAPE and a large $R^2$ indicate better prediction performance. Note that we only exhibit the MAPE and $R^2$ results because the remaining two metrics show similar trends and due to page limits. From Table 5 and Figure 8, we can draw the following findings:
  • For the parameter n: Within a certain range, model performance can be improved to some extent by increasing the number of decomposed subcomponents (IMFs). This is mainly because more decomposition components can extract the different variation characteristics contained in the raw workload data at a finer granularity. However, more decomposed components as input also increase the complexity of the model. Therefore, this paper chooses n = 9 as a candidate parameter because it already brings considerable performance improvement relative to the baseline models.
  • For the parameter time lag: Different input lengths (time lags) affect model performance differently. In the prediction task of this paper, simply increasing the time lag can deteriorate model performance, and the model achieves optimal results at a specific time lag. For example, when the time lag is 30, model performance is worse than when it is 24. In general, historical workload data often contain irregular changes, which hinder effective learning; the noisy information hidden in the workload data cannot be fully filtered out, so model performance is affected.
  • For the parameter epoch: Increasing the number of epochs does not necessarily improve model performance, because the model has a performance bottleneck. After a certain number of epochs, performance saturates, and excessively large epoch counts also increase the training overhead. Therefore, the epoch is set to 200 in this paper.
  • For the parameter filter: Owing to the parallel multi-scale architecture, the proposed model can achieve satisfactory prediction ability with a small number of filters. Increasing the number of TCN filters does not yield the expected performance improvement and further increases the number of model parameters, easily causing overfitting. Therefore, the number of filters is set to 32 in this paper.
  • For the parameter heads: The number of heads in the transformer block has only a slight impact on model performance. In this paper, heads = 5 is selected as the model parameter since it reduces the model prediction error.

4.5. Performance Overhead Analysis

This section investigates the performance overhead of each model. We first report the runtime of the training and testing stages for each model on three workload datasets and further analyze the trainable parameters of models. The detailed results are reported in Figure 9 and Table 6, and we have the following observations.
On identical training sets, most models involved in this paper spend considerable time on offline training, except for the SVR model, which requires less time. Specifically, the CNN-LSTM, BiLSTM, and GRU models take an average of 95.3 s, 103.3 s, and 64.5 s, respectively, to train. The proposed DuCFF takes an average of 154.5 s to train on the three workload datasets, which is higher than the baseline models. It should be noted that the runtime of model training is related not only to the network architecture but also to the setting of hyperparameters, such as the time lag and the number of epochs: the greater the time lag, the more historical information the model needs to learn, which raises the computational complexity of the model; likewise, the larger the epoch count, the longer the training time. Moreover, in the evaluation phase, the online testing time of each model is minimal and acceptable, all below 0.2 s. Specifically, the average runtime of the test stage for DuCFF on the three workload datasets is 0.18 s, a small gap compared with the baseline models. Table 6 summarizes the trainable parameter sizes of all models on the three workload datasets, as exported by the summary() method in the Keras environment. From the numerical results, we note that DuCFF has fewer trainable parameters than 1DCNN–BiLSTM–Attention, BiLSTM, and BiGRU–Attention on all datasets, yet its performance is significantly better than these baselines. This benefit comes from the rational design of the DuCFF architecture, which uses parallel and multi-scale structures to reduce model complexity while achieving efficient information extraction and better prediction.

4.6. Discussion

In summary, from the horizontal and vertical comparison results obtained in this work, the superiority of the proposed DuCFF can be explained as follows. On the one hand, the prediction performance of some machine learning models (e.g., SVR, MLP) is insufficient for workload data with high fluctuations and long-term dynamic variations; they cannot effectively extract useful features for prediction. For example, DuCFF achieves improvements of 65.65%, 70.95%, 64.50%, and 16.11% in terms of the four metrics compared with SVR. On the other hand, some advanced models adopting cascaded network architectures, such as 1DCNN–LSTM and 1DCNN–BiLSTM–Attention, cannot achieve consistent prediction performance due to the loss of useful information in the feature-extraction process. Overall, the DuCFF model is generally the optimal one in these comparisons. This verifies the rationality of the divide-and-conquer strategy adopted by the proposed architecture, which can more effectively capture the variation characteristics of the workload data for better prediction. It also shows that preprocessing the workload data with the popular VMD technique before further feature extraction can effectively improve prediction ability in workload prediction tasks. In addition, the analyses of parameter sensitivity and performance overhead further confirm the rationality of the proposed DuCFF model in parameter selection and architecture design. The observed numerical results indicate that the proposed DuCFF can satisfy the high-precision requirements of workload prediction problems and provide decision support for task scheduling strategies. However, since we only used three sets of data to evaluate model performance, the diversity of the datasets is limited, which is a limitation of this study. Further meaningful research can be conducted in the future by collecting more workload datasets to verify the generalization ability of the model and confirm its robustness.

5. Conclusions and Future Work

This work constructs a novel workload-prediction method, named DuCFF, which integrates data preprocessing technology, multi-scale TCNs, and the transformer. DuCFF employs stacked multi-scale TCN blocks to extract local irregular-fluctuation features from the original input and, in a parallel architecture, stacked transformer blocks to extract long-term dynamic-variation characteristics. This divide-and-conquer strategy effectively reduces the loss of vital information, enhancing model performance while keeping the input information intact. Extensive experiments on three datasets collected from ClarkNet–HTTP and Google Cloud traces verify the validity of the proposed model. Specifically, DuCFF achieves average improvements of 65.2%, 70%, 64.37%, and 15% across the four evaluation metrics, namely MAE, RMSE, MAPE, and R², compared with CNN-LSTM. Furthermore, the analyses of parameter sensitivity and time overhead show that DuCFF performs consistently well, offering a new approach to workload-prediction research.
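To make the parallel design concrete, below is a minimal Keras sketch of a dual-channel network in the spirit of DuCFF: a dilated causal-convolution branch for local fluctuations and a self-attention branch for long-term variations, fused before the prediction head. The layer sizes, dilations, and head counts are illustrative assumptions; this is not the authors' exact implementation.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_dual_channel_model(time_lag=24, n_features=10):
    # Shared input: `time_lag` past steps of the workload and its VMD components
    inputs = layers.Input(shape=(time_lag, n_features))

    # Channel 1 (TCN-style): causal dilated convolutions capture
    # local irregular fluctuations at multiple scales
    tcn = inputs
    for d in (1, 2, 4):
        tcn = layers.Conv1D(32, kernel_size=3, padding="causal",
                            dilation_rate=d, activation="relu")(tcn)
    tcn = layers.GlobalAveragePooling1D()(tcn)

    # Channel 2 (transformer-style): multi-head self-attention captures
    # long-term dynamic variations across the whole window
    attn = layers.MultiHeadAttention(num_heads=2, key_dim=16)(inputs, inputs)
    attn = layers.LayerNormalization()(layers.Add()([inputs, attn]))
    attn = layers.GlobalAveragePooling1D()(attn)

    # Feature fusion followed by one-step-ahead prediction
    fused = layers.Concatenate()([tcn, attn])
    fused = layers.Dense(64, activation="relu")(fused)
    outputs = layers.Dense(1)(fused)

    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model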
Nevertheless, the runtime cost of the proposed model is relatively high compared with the baseline models, as noted in the previous section. In addition, the generalization ability of the model needs further verification on more diverse workload datasets. Future work will collect workload data from different traces to verify model generalization and explore techniques to reduce model complexity and, hence, runtime cost. We also plan to adopt a more practical dynamic decomposition strategy to improve model performance and to apply feature-selection algorithms to mine the importance of different features for workload prediction.

Author Contributions

Conceptualization, K.J.; methodology, K.J.; software, K.J.; validation, J.X. and B.L.; formal analysis, K.J. and J.X.; investigation, K.J.; resources, B.L.; data curation, J.X.; writing—original draft preparation, K.J.; writing—review and editing, K.J. and B.L.; visualization, K.J.; supervision, B.L.; project administration, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This study uses the ClarkNet-HTTP trace data (https://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html, accessed on 13 June 2024) and the Google cloud trace data (https://github.com/HiPro-IT/CPU-and-Memory-resource-usage-from-Google-Cluster-Data, accessed on 11 January 2024). Both datasets are publicly accessible under a public license. Data generated and acquired for this project will be provided upon request.

Acknowledgments

The authors thank the reviewers for their guiding suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gartner. Gartner Forecasts Worldwide Public Cloud End-User Spending to Reach $679 Billion in 2024. Available online: https://www.gartner.com/en/newsroom/press-releases/11-13-2023-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-reach-679-billion-in-20240 (accessed on 13 November 2023).
  2. Abdelmajeed, A.Y.A.; Albert-Saiz, M.; Rastogi, A.; Juszczak, R. Cloud-Based Remote Sensing for Wetland Monitoring—A Review. Remote Sens. 2023, 15, 1660. [Google Scholar] [CrossRef]
  3. Saxena, D.; Kumar, J.; Singh, A.K.; Schmid, S. Performance Analysis of Machine Learning Centered Workload Prediction Models for Cloud. IEEE Trans. Parallel Distrib. Syst. 2023, 34, 1313–1330. [Google Scholar] [CrossRef]
  4. Kim, I.K.; Wang, W.; Qi, Y.; Humphrey, M. Forecasting Cloud Application Workloads with CloudInsight for Predictive Resource Management. IEEE Trans. Cloud Comput. 2022, 10, 1848–1863. [Google Scholar] [CrossRef]
  5. Kumar, M.; Kishor, A.; Samariya, J.K.; Zomaya, A.Y. An Autonomic Workload Prediction and Resource Allocation Framework for Fog-Enabled Industrial IoT. IEEE Internet Things J. 2023, 10, 9513–9522. [Google Scholar] [CrossRef]
  6. Chen, L.; Zhang, W.; Ye, H. Accurate Workload Prediction for Edge Data Centers: Savitzky-Golay Filter, CNN and BiLSTM with Attention Mechanism. Appl. Intell. 2022, 52, 13027–13042. [Google Scholar] [CrossRef]
  7. Singh, A.K.; Saxena, D.; Kumar, J.; Gupta, V. A Quantum Approach Towards the Adaptive Prediction of Cloud Workloads. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2893–2905. [Google Scholar] [CrossRef]
  8. Ding, Z.; Feng, B.; Jiang, C. COIN: A Container Workload Prediction Model Focusing on Common and Individual Changes in Workloads. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4738–4751. [Google Scholar] [CrossRef]
  9. Bi, J.; Ma, H.; Yuan, H.; Zhang, J. Accurate Prediction of Workloads and Resources with Multi-Head Attention and Hybrid LSTM for Cloud Data Centers. IEEE Trans. Sustain. Comput. 2023, 8, 375–384. [Google Scholar] [CrossRef]
  10. Kumar, J.; Singh, A.K.; Buyya, R. Self-directed Learning based Workload Forecasting Model for Cloud Resource Management. Inf. Sci. 2021, 543, 345–366. [Google Scholar] [CrossRef]
  11. Bi, J.; Yuan, H.; Li, S.; Zhang, K.; Zhang, J.; Zhou, M. ARIMA-Based and Multiapplication Workload Prediction with Wavelet Decomposition and Savitzky–Golay Filter in Clouds. IEEE Trans. Syst. Man. Cybern. Syst. 2024, 54, 2495–2506. [Google Scholar] [CrossRef]
  12. Arbat, S.; Jayakumar, V.K.; Lee, J.; Wang, W.; Kim, I.K. Wasserstein Adversarial Transformer for Cloud Workload Prediction. Proc. AAAI Conf. Artif. Intell. 2022, 36, 12433–12439. [Google Scholar] [CrossRef]
  13. Bao, Y.; Sun, X.; Trivedi, K. A Workload-Based Analysis of Software Aging and Rejuvenation. IEEE Trans. Reliab. 2005, 54, 541–548. [Google Scholar] [CrossRef]
  14. Bovenzi, A.; Cotroneo, D.; Pietrantuono, R.; Russo, S. Workload Characterization for Software Aging Analysis. In Proceedings of the 2011 IEEE 22nd International Symposium on Software Reliability Engineering, Hiroshima, Japan, 29 November–2 December 2011; pp. 240–249. [Google Scholar]
  15. Bruneo, D.; Distefano, S.; Longo, F.; Puliafito, A.; Scarpa, M. Workload-Based Software Rejuvenation in Cloud Systems. IEEE Trans. Comput. 2013, 62, 1072–1085. [Google Scholar] [CrossRef]
  16. Calheiros, R.N.; Masoumi, E.; Ranjan, R.; Buyya, R. Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications’ QoS. IEEE Trans. Cloud Comput. 2015, 3, 449–458. [Google Scholar] [CrossRef]
  17. Mashhadi Moghaddam, S.; O’Sullivan, M.; Walker, C.; Fotuhi Piraghaj, S.; Unsworth, C.P. Embedding Individualized Machine Learning Prediction Models for Energy Efficient VM Consolidation within Cloud Data Centers. Future Gener. Comp. Syst. 2020, 106, 221–233. [Google Scholar] [CrossRef]
  18. Kumar, J.; Saxena, D.; Singh, A.K.; Mohan, A. BiPhase Adaptive Learning-Based Neural Network Model for Cloud Datacenter Workload Forecasting. Soft Comput. 2020, 24, 14593–14610. [Google Scholar] [CrossRef]
  19. Xu, M.; Song, C.; Wu, H.; Gill, S.S.; Ye, K.; Xu, C. esDNN: Deep Neural Network Based Multivariate Workload Prediction in Cloud Computing Environments. ACM Trans. Internet Technol. 2022, 22, 1–24. [Google Scholar] [CrossRef]
  20. Patel, E.; Kushwaha, D.S. A hybrid CNN-LSTM Model for Predicting Server Load in Cloud Computing. J. Supercomput. 2022, 78, 1–30. [Google Scholar] [CrossRef]
  21. Duggan, M.; Mason, K.; Duggan, J.; Howley, E.; Barrett, E. Predicting Host CPU Utilization in Cloud Computing Using Recurrent Neural Networks. In Proceedings of the 2017 12th International Conference for Internet Technology and Secured Transactions (ICITST), Cambridge, UK, 11–14 December 2017; pp. 67–72. [Google Scholar]
  22. Bi, J.; Li, S.; Yuan, H.; Zhou, M. Integrated Deep Learning Method for Workload and Resource Prediction in Cloud Systems. Neurocomputing 2021, 424, 35–48. [Google Scholar] [CrossRef]
  23. Gao, J.; Wang, H.; Shen, H. Machine Learning Based Workload Prediction in Cloud Computing. In Proceedings of the 2020 29th International Conference on Computer Communications and Networks (ICCCN), Honolulu, HI, USA, 3–6 August 2020; pp. 1–9. [Google Scholar]
  24. Zhang, Z.; Tang, X.; Han, J.; Wang, P. Sibyl: Host Load Prediction with an Efficient Deep Learning Model in Cloud Computing. In Proceedings of the Algorithms and Architectures for Parallel Processing, Guangzhou, China, 15–17 November 2018; Springer: Cham, Switzerland, 2018; pp. 226–237. [Google Scholar]
  25. Ruan, L.; Bai, Y.; Li, S.; He, S.; Xiao, L. Workload Time Series Prediction in Storage Systems: A Deep Learning Based Approach. Cluster Comput. 2023, 26, 25–35. [Google Scholar] [CrossRef]
  26. Kumar, A.S.; Mazumdar, S. Forecasting HPC Workload Using ARMA Models and SSA. In Proceedings of the 2016 International Conference on Information Technology (ICIT), Bhubaneswar, India, 22–24 December 2016; pp. 294–297. [Google Scholar]
  27. Singh, P.; Gupta, P.; Jyoti, K. TASM: Technocrat ARIMA and SVR Model for Workload Prediction of Web Applications in Cloud. Cluster Comput. 2019, 22, 619–633. [Google Scholar] [CrossRef]
  28. Patel, E.; Kushwaha, D.S. An Integrated Deep Learning Prediction Approach for Efficient Modelling of Host Load Patterns in Cloud Computing. J. Grid Comput. 2023, 21, 1–24. [Google Scholar] [CrossRef]
  29. Ouhame, S.; Hadi, Y.; Ullah, A. An Efficient Forecasting Approach for Resource Utilization in Cloud Data Center Using CNN-LSTM Model. Neural Comput. Applic. 2021, 33, 10043–10055. [Google Scholar] [CrossRef]
  30. Dogani, J.; Khunjush, F.; Mahmoudi, M.R.; Seydali, M. Multivariate Workload and Resource Prediction in Cloud Computing Using CNN and GRU by Attention Mechanism. J. Supercomput. 2023, 79, 3437–3470. [Google Scholar] [CrossRef]
  31. Bai, S.; Zico Kolter, J.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:cs.LG/1803.01271. [Google Scholar]
  32. Fu, S.; Lin, L.; Wang, Y.; Guo, F.; Zhao, M.; Zhong, B.; Zhong, S. MCA-DTCN: A Novel Dual-Task Temporal Convolutional Network with Multi-channel Attention for First Prediction Time detection and Remaining Useful Life Prediction. Reliab. Eng. Syst. Saf. 2024, 241, 109696. [Google Scholar] [CrossRef]
  33. Peng, H.; Jiang, B.; Mao, Z.; Liu, S. Local Enhancing Transformer with Temporal Convolutional Attention Mechanism for Bearings Remaining Useful Life Prediction. IEEE Trans. Instrum. Meas. 2023, 72, 3522312. [Google Scholar] [CrossRef]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  35. Islam, S.; Venugopal, S.; Liu, A. Evaluating the Impact of Fine-scale Burstiness on Cloud Elasticity. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC ’15, Kohala Coast, HI, USA, 27–29 August 2015; pp. 250–261. [Google Scholar]
  36. Zheng, Z.; Zhang, Z.; Wang, L.; Luo, X. Denoising Temporal Convolutional Recurrent Autoencoders for Time Series Classification. Inf. Sci. 2022, 588, 159–173. [Google Scholar] [CrossRef]
  37. Yao, Y.; Zhang, Z.-Y.; Zhao, Y. Stock Index Forecasting Based on Multivariate Empirical Mode Decomposition and Temporal Convolutional Networks. Appl. Soft Comput. 2023, 142, 110356. [Google Scholar] [CrossRef]
  38. Li, Y.; Zuo, Z.; Pan, J. Sensor-based fall detection using a combination model of a temporal convolutional network and a gated recurrent unit. Future Gener. Comput. Syst. 2023, 139, 53–63. [Google Scholar] [CrossRef]
  39. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 22419–22430. [Google Scholar]
  40. Li, B.; Hu, Y.; Nie, X.; Han, C.; Jiang, X.; Guo, T.; Liu, L. DropKey for Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22700–22709. [Google Scholar]
  41. Liang, J.; Cao, J.; Fan, Y.; Zhang, K.; Ranjan, R.; Li, Y.; Timofte, R.; Van Gool, L. VRT: A Video Restoration Transformer. IEEE Trans. Image Process. 2024, 33, 2171–2182. [Google Scholar] [CrossRef] [PubMed]
  42. Zharikov, E.; Telenyk, S.; Bidyuk, P. Adaptive Workload Forecasting in Cloud Data Centers. J. Grid Comput. 2020, 18, 149–168. [Google Scholar] [CrossRef]
  43. Farahnakian, F.; Liljeberg, P.; Plosila, J. LiRCUP: Linear Regression Based CPU Usage Prediction Algorithm for Live Migration of Virtual Machines in Data Centers. In Proceedings of the 2013 39th Euromicro Conference on Software Engineering and Advanced Applications, Santander, Spain, 4–6 September 2013; pp. 357–364. [Google Scholar]
  44. Tran, V.G.; Debusschere, V.; Bacha, S. Hourly Server Workload Forecasting up to 168 Hours Ahead Using Seasonal ARIMA Model. In Proceedings of the 2012 IEEE International Conference on Industrial Technology, Athens, Greece, 19–21 March 2012; pp. 1127–1131. [Google Scholar]
  45. Chen, W.; Lu, C.; Ye, K.; Wang, Y.; Xu, C.Z. RPTCN: Resource Prediction for High-Dynamic Workloads in Clouds Based on Deep Learning. In Proceedings of the 2021 IEEE International Conference on Cluster Computing (CLUSTER), Portland, OR, USA, 7–10 September 2021; pp. 59–69. [Google Scholar]
  46. Selvan Chenni Chetty, T.; Bolshev, V.; Shankar Subramanian, S.; Chakrabarti, T.; Chakrabarti, P.; Panchenko, V.; Yudaev, I.; Daus, Y. Optimized Hierarchical Tree Deep Convolutional Neural Network of a Tree-Based Workload Prediction Scheme for Enhancing Power Efficiency in Cloud Computing. Energies 2023, 16, 2900. [Google Scholar] [CrossRef]
  47. Xie, Y.; Jin, M.; Zou, Z.; Xu, G.; Feng, D.; Liu, W.; Long, D. Real-Time Prediction of Docker Container Resource Load Based on a Hybrid Model of ARIMA and Triple Exponential Smoothing. IEEE Trans. Cloud Comput. 2022, 10, 1386–1401. [Google Scholar] [CrossRef]
  48. Devi, K.L.; Valli, S. Time Series-Based Workload Prediction Using the Statistical Hybrid Model for the Cloud Environment. Computing 2023, 105, 353–374. [Google Scholar] [CrossRef]
  49. Chen, Z.; Hu, J.; Min, G.; Zomaya, A.Y.; El-Ghazawi, T. Towards Accurate Prediction for High-Dimensional and Highly-Variable Cloud Workloads with Deep Learning. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 923–934. [Google Scholar] [CrossRef]
  50. Zhang, L.; Xie, Y.; Jin, M.; Zhou, P.; Xu, G.; Wu, Y.; Feng, D.; Long, D. A Novel Hybrid Model for Docker Container Workload Prediction. IEEE Trans. Netw. Serv. Man. 2023, 20, 2726–2743. [Google Scholar] [CrossRef]
  51. Dragomiretskiy, K.; Zosso, D. Variational Mode Decomposition. IEEE Trans. Signal Process. 2014, 62, 531–544. [Google Scholar] [CrossRef]
  52. Zhang, Y.; Feng, K.; Ji, J.C.; Yu, K.; Ren, Z.; Liu, Z. Dynamic Model-Assisted Bearing Remaining Useful Life Prediction Using the Cross-Domain Transformer Network. IEEE/ASME Trans. Mechatron. 2023, 28, 1070–1080. [Google Scholar] [CrossRef]
  53. Internet Traffic Archive. Available online: https://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html (accessed on 13 June 2024).
Figure 1. The architecture of DuCFF: (a) main structure, (b) TCN channel, and (c) transformer channel.
Figure 2. The decomposed effects for three workload datasets using VMD. (a) Decomposed effect for ClarkNet–HTTP trace data (requests). (b) Decomposed effect for Google trace data 1 (CPU utilization). (c) Decomposed effect for Google trace data 2 (CPU utilization).
Figure 3. The implementation process of moving-window sampling.
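As a supplement to Figure 3, the moving-window sampling can be sketched as follows, assuming a univariate series and a one-step-ahead target; time_lag corresponds to the lag parameter in Table 2.

import numpy as np

def moving_window_samples(series, time_lag):
    # Slide a fixed-length window over the series; each window of
    # `time_lag` past observations predicts the next value
    X, y = [], []
    for t in range(len(series) - time_lag):
        X.append(series[t:t + time_lag])
        y.append(series[t + time_lag])
    return np.asarray(X), np.asarray(y)

# e.g., X, y = moving_window_samples(workload, time_lag=24)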
Figure 4. A comparison of traditional CNN and TCN.
Figure 5. The calculation process of MSA.
Figure 6. The variation characteristics of workload data collected from ClarkNet and Google traces. (a) CNH. (b) GC1. (c) GC2.
Figure 7. The fitting effects on GC2 for the proposed DuCFF and compared models.
Figure 8. The parameter sensitivity analysis on CNH for the proposed DuCFF in terms of two selected evaluation metrics.
Figure 9. The performance overhead comparisons for all models (time is recorded in seconds).
Table 1. Summary of the differences between existing studies and this study.

Ref. | Proposed Models | Predicted Parameters | Datasets Source | Used Metrics
Calheiros et al. [16] | ARIMA | Number of requests | Wikimedia Foundation | RMSD, NRMSD, MAD, MAPE
Farahnakian et al. [43] | LR | CPU | CloudSim toolkit | Average SLA violation percentage, energy consumption
Tran et al. [44] | SARIMA | Number of requests | Eolas data center | MAPE
Kumar et al. [18] | ANN-BaDE | Number of requests, CPU, memory | Google, NASA | RMSE
Duggan et al. [21] | RNN | CPU | PlanetLab | MAE, MSE
Ruan et al. [25] | LSTM | Number of requests | UMass | RMSE, MAE, MAPE
Xu et al. [19] | GRU | CPU | Alibaba, Google | MSE, RMSE, MAPE
Chen et al. [45] | TCN with attention | CPU | Alibaba | MSE, MAE
Xie et al. [47] | ARIMA-TES | CPU, memory | Google, simulated cloud | MAPE, MSE
Patel and Kushwaha [28] | 1DCNN–LSTM | CPU | Alibaba, Google | MSE, RMSE, MAE
Chen et al. [6] | SG-1DCNN-BiLSTM-Attention | CPU | Alibaba | MSE, MAE, R²
Ours | Multi-scale TCNs with transformer | Number of requests, CPU | Google, ClarkNet | MAE, RMSE, MAPE, R²
Table 2. Parameter settings of DuCFF.

Parameters | CNH | GC1 | GC2
Time lag | 24 | 24 | 12
Epoch | 200 | 200 | 150
Head | 5 | 2 | 2
Kernel size | (1, 3, 5) | (1, 3, 5) | (1, 3, 5)
Filter | 32 | 32 | 32
Hidden_dim | 64 | 64 | 64
Optimizer | Adam | Adam | Adam
Batch size | 32 | 32 | 32
Learning rate | 0.001 | 0.001 | 0.001
Table 3. Results of four metrics for different models in three workload datasets.

Datasets | Metrics | SVR | MLP | LSTM | GRU | BiLSTM | BiGRU-Attention | 1DCNN | 1DCNN-LSTM | 1DCNN-BiLSTM-Attention | DuCFF (Ours)
CNH | MAE | 1,230,239 | 1,241,180 | 1,225,688 | 1,227,365 | 1,227,703 | 1,241,738 | 1,229,902 | 1,214,999 | 1,232,341 | 215,629
CNH | RMSE | 1,767,051 | 1,753,259 | 1,766,878 | 1,746,214 | 1,745,088 | 1,754,335 | 1,751,528 | 1,736,089 | 1,748,057 | 291,797
CNH | MAPE | 18.38 | 19.12 | 18.74 | 18.84 | 18.67 | 18.88 | 18.76 | 18.43 | 18.76 | 3.49
CNH | R² | 0.7153 | 0.7239 | 0.7182 | 0.7248 | 0.7265 | 0.7202 | 0.7231 | 0.7293 | 0.7221 | 0.9922
GC1 | MAE | 1.22 | 1.11 | 1.13 | 1.12 | 1.18 | 1.22 | 1.13 | 1.14 | 1.19 | 0.46
GC1 | RMSE | 1.58 | 1.45 | 1.46 | 1.43 | 1.54 | 1.60 | 1.45 | 1.48 | 1.53 | 0.55
GC1 | MAPE | 2.67 | 2.49 | 2.50 | 2.48 | 2.62 | 2.71 | 2.50 | 2.53 | 2.62 | 0.99
GC1 | R² | 0.9422 | 0.9515 | 0.9490 | 0.9523 | 0.9452 | 0.9408 | 0.9513 | 0.9470 | 0.9459 | 0.9912
GC2 | MAE | 1.10 | 1.13 | 1.16 | 1.11 | 1.19 | 1.16 | 1.12 | 1.13 | 1.14 | 0.53
GC2 | RMSE | 1.95 | 1.97 | 2.00 | 1.95 | 1.98 | 1.95 | 1.92 | 1.94 | 1.93 | 0.70
GC2 | MAPE | 3.09 | 3.22 | 3.26 | 3.17 | 3.35 | 3.28 | 3.17 | 3.19 | 3.25 | 1.55
GC2 | R² | 0.9516 | 0.9516 | 0.9506 | 0.9528 | 0.9505 | 0.9518 | 0.9540 | 0.9530 | 0.9533 | 0.9936
The best results for the three datasets are shown in bold, and the best results of the baseline models are underlined.
Table 4. The proportion of average improvement of DuCFF on three workload datasets.

Metrics | SVR | MLP | LSTM | GRU | BiLSTM | BiGRU-Attention | 1DCNN | 1DCNN-LSTM | 1DCNN-BiLSTM-Attention
MAE | 65.65% | 64.97% | 65.47% | 64.63% | 66.41% | 66.51% | 64.86% | 65.20% | 65.92%
RMSE | 70.95% | 69.90% | 70.21% | 69.65% | 70.67% | 71.03% | 69.63% | 70.00% | 70.33%
MAPE | 64.50% | 64.54% | 64.67% | 64.14% | 65.65% | 65.84% | 64.21% | 64.37% | 65.24%
R² | 16.11% | 15.22% | 15.71% | 15.09% | 15.33% | 15.85% | 15.19% | 15.00% | 15.47%
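For reference, the four metrics and the relative improvements reported in Tables 3 and 4 can be reproduced along the following lines; this is a sketch assuming scikit-learn is available, and since the tabulated values are rounded, recomputed averages may differ slightly from the reported ones.

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

def evaluate(y_true, y_pred):
    # The four metrics used throughout this paper
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAPE": 100.0 * mean_absolute_percentage_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }

def improvement(baseline, proposed, higher_is_better=False):
    # Relative improvement (%) of the proposed model over a baseline score;
    # R^2 improves upward, while MAE/RMSE/MAPE improve downward
    if higher_is_better:
        return 100.0 * (proposed - baseline) / baseline
    return 100.0 * (baseline - proposed) / baseline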
Table 5. The impact of different numbers of IMFs on model performance.

Metrics | n = 3 | n = 6 | n = 9 | n = 12
MAE | 754,031 | 377,653 | 215,629 | 165,622
RMSE | 1,138,665 | 534,591 | 291,797 | 217,306
MAPE | 11.23 | 5.96 | 3.49 | 2.68
R² | 0.8826 | 0.9741 | 0.9922 | 0.9957
The best results for different IMFs are shown in bold.
Table 6. The comparisons of trainable parameter size of each model on three workload datasets.

Datasets | MLP | LSTM | GRU | BiLSTM | BiGRU-Attention | 1DCNN | 1DCNN-LSTM | 1DCNN-BiLSTM-Attention | DuCFF (Ours)
CNH | 1025 | 66,689 | 929 | 133,377 | 103,837 | 3713 | 50,241 | 100,267 | 74,785
GC1 | 513 | 1169 | 3393 | 133,377 | 102,187 | 177 | 20,865 | 264,589 | 49,633
GC2 | 1281 | 16,961 | 12,929 | 133,377 | 102,187 | 1729 | 99,201 | 266,203 | 49,345
