Article

Renewable-Aware Frequency Scaling Approach for Energy-Efficient Deep Learning Clusters

1
Department of Computer and Software Engineering, Wonkwang University, Iksan 54538, Republic of Korea
2
Division of Electronic and Information, Department of Computer Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(2), 776; https://doi.org/10.3390/app14020776
Submission received: 22 November 2023 / Revised: 7 January 2024 / Accepted: 9 January 2024 / Published: 16 January 2024
(This article belongs to the Special Issue Recent Applications of High-Performance Computing)

Abstract: Recently, renewable energy has emerged as an attractive means to reduce energy consumption costs for deep learning (DL) job processing in modern GPU-based clusters. In this paper, we propose a novel Renewable-Aware Frequency Scaling (RA-FS) approach for energy-efficient DL clusters. We have developed a real-time GPU core and memory frequency scaling method that finely tunes the training performance of DL jobs while maximizing renewable energy utilization. We introduce quantitative metrics: Deep Learning Job Requirement (DJR) and Deep Learning Job Completion per Slot (DJCS) to accurately evaluate the service quality of DL job processing. Additionally, we present a log-transformation technique to convert our non-convex optimization problem into a solvable one, ensuring the rigorous optimality of the derived solution. Through experiments involving deep neural network (DNN) model training jobs such as SqueezeNet, PreActResNet, and SEResNet on NVIDIA GPU devices like RTX3060, RTX3090, and RTX4090, we validate the superiority of our RA-FS approach. The experimental results show that our approach significantly improves performance requirement satisfaction by about 71% and renewable energy utilization by about 31% on average, compared to recent competitors.

1. Introduction

As the demands for artificial intelligence (AI) services are tremendously increasing, interest in high-performance computing (HPC) clusters is also growing [1]. Various organizations and institutes have recently deployed HPC clusters, which include hundreds or thousands of computing nodes, for AI service processing. The core computing device in these clusters is the graphics processing unit (GPU), capable of accelerating computation speed for data-rich and parallel task processing by exploiting thousands of small multi-cores (e.g., CUDA cores in NVIDIA [2] and Stream Processors in AMD [3]). Recently, NVIDIA launched GPU-based AI hardware platforms, such as the DGX-A100, DGX-H100, and DGX-GH200. These have been deployed as AI development solutions in cloud data centers by major vendors including Microsoft, Google, and Amazon [4]. Along with the high computational capacity of GPU-based clusters, a new challenge has emerged: the issue of energy consumption [5,6]. Despite ongoing improvements in the performance-per-watt of GPU devices, the energy consumption costs incurred by modern GPU clusters remain a significant obstacle to the expansion of next-generation AI technology across various fields. For example, a single NVIDIA H100 rack, consisting of eight Hopper-based SXM cards, may require a total thermal design power (TDP) of 5600 W (8 × 700 W) under full workload. In this scenario, the rack could consume about 134 kWh per day, corresponding to an electricity usage cost of more than USD 15.6 per day on average in Nevada, USA, as of 2023.
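As a back-of-envelope check of the rack example above (the electricity rate used here is our assumption, roughly the average commercial rate in Nevada in 2023, not a figure from the text):

```python
# Back-of-envelope check of the H100 rack example.
# Assumption: an average electricity rate of 0.116 USD/kWh (approximate
# Nevada commercial average, 2023) -- this rate is our assumption.
RACK_TDP_W = 8 * 700          # eight SXM cards at 700 W TDP each
HOURS_PER_DAY = 24
RATE_USD_PER_KWH = 0.116      # assumed average rate

energy_kwh_per_day = RACK_TDP_W * HOURS_PER_DAY / 1000   # 134.4 kWh
cost_usd_per_day = energy_kwh_per_day * RATE_USD_PER_KWH # roughly 15.6 USD

print(round(energy_kwh_per_day, 1), round(cost_usd_per_day, 2))
```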
The exploitation of renewable energy generation has recently emerged as an attractive solution for reducing energy consumption costs in commercial clusters and data centers [7,8]. Photovoltaic and wind energy are the most popular options in both research and industrial fields, due to their abundant availability as natural resources and the significant improvements in the efficiency of solar panels and wind turbines. The main challenge in efficiently utilizing renewable energies lies in accurately predicting future renewable energy generation, even amidst uncertainty and irregularity. To address this, there has been a surge in studies aiming to adopt AI-based methods for renewable energy prediction. Khan et al. [9] propose an Echo State Network (ESN)–Convolutional Neural Network (CNN) model for accurate renewable energy prediction. They linearly connect the ESN to CNN layers using residual connections, efficiently avoiding the vanishing gradient problem and reducing error rates in predictions. Hwang et al. [10] present a combination of deep learning models, including Extreme Learning Machines, CNN, and Bidirectional Long Short-Term Memory, to accurately forecast the different frequency series found in wind energy data. They apply a secondary decomposition method to enhance the prediction accuracy of low-frequency components. Liao et al. [11] introduce a new forecasting method for power generation curves, based on both Graph Convolutional Network (GCN) and Long Short-Term Memory (LSTM) models. They employ the GCN to capture spatial correlations among multiple renewable energy sources and use the LSTM to track dynamic behaviors in power generation. Although state-of-the-art studies on renewable energy, including those mentioned above, yield impressive results, it is crucial to note that they concentrate solely on AI-based methods for predicting renewable energy generation.
They do not investigate the use of renewable energy as a means to reduce energy costs associated with processing deep learning jobs in GPU-based clusters.
Note that research investigating power and energy consumption behaviors in modern clusters due to deep learning (DL) job processing is in its early stages. Yao et al. [12] propose an energy-aware DL job scheduler to reduce the energy consumption of CNN inference services on GPU devices. Their scheduler coordinates batch processing and dynamic voltage frequency scaling (DVFS) to effectively respond to workload fluctuations, achieving a 28% reduction in energy consumption compared to its competitors while meeting latency service-level objectives. Liu et al. [13] introduce a novel framework, Morak (a multi-knob resource management framework), that conducts GPU resource partitioning with constraints on power and latency. Their framework optimizes GPU frequency scaling and deep neural network (DNN) model training batch sizing without violating the constraints, improving throughput by about 68% compared to other state-of-the-art baselines. Nabavinejad et al. [14] investigate the potential of renewable energy for efficient deep neural network model inference job processing. They focus on the fluctuation of power generation in renewable energy sources, presenting various results of inference throughput according to DNN model types and renewable energy sources (hydroelectric, solar, and wind). Although the aforementioned studies provide impressive insights and methodologies, they lack sophisticated procedures for utilizing renewable energy in DL job processing within modern clusters. These studies fail to account for the explicit deadlines of DNN model training. Additionally, their heuristic-based methods do not guarantee theoretically optimal decision making in DL job management.
In this paper, we introduce a novel Renewable-Aware Frequency Scaling (RA-FS) approach for energy-efficient DL job processing in GPU-based clusters. Our approach focuses on reducing energy consumption costs during DL job processing by harnessing renewable energy generation while meeting the performance requirements of diverse DL job requests. The key technical contributions of our work are as follows:
  • We design a GPU core and memory frequency scaling approach to optimize performance in real-time, tailored to the characteristics of DL job requests and assigned GPU devices. This FS-based approach is universally applicable across different GPU architectures. Even with minimal profiling data, we can determine model coefficients of FS for sophisticated performance tuning.
  • Our approach includes an optimization model that integrates renewable energy generation with DL job performance requirements. We introduce novel metrics: Deep Learning Job Requirement (DJR) and Deep Learning Job Completion per Slot (DJCS) to quantify service quality efficiently. These metrics enable us to easily maximize performance requirement satisfaction. Additionally, we employ a gated recurrent unit (GRU)-based predictor to forecast the irregular energy generation from renewable sources.
  • We present a log-transformation-based re-formulation technique to address our non-convex problem, which involves an energy consumption term defined by the product of power and training time. This technique transforms the non-convex problem into a convex one, ensuring an accurate optimal solution.
  • Utilizing a range of NVIDIA GPU devices, including those with the Ada Lovelace architecture (for example, RTX 4090), we collect data on real-world training performance and energy consumption at different GPU core and memory frequency settings. Additionally, by considering multiple regions of renewable energy generation, we develop practical experimental scenarios. These scenarios are designed to demonstrate the effectiveness of our RA-FS approach, highlighting its advantages over recent works in the field.
This paper is structured as follows. Section 2 outlines the framework architecture that incorporates our proposed RA-FS approach. Section 3 describes our GRU-based renewable energy prediction model. Section 4 details the system model, including the deep learning job model and the GPU energy consumption model. In Section 5, we define our primary optimization problem and discuss its re-formulation based on the log-transformation technique. Section 6 presents preliminary experimental results, showcasing the epoch completion times and energy consumption for DL job processing on NVIDIA GPU-based devices, together with a performance comparison of the RA-FS approach against other recent studies. The paper concludes with a summary of our findings and contributions.

2. Proposed Framework Structure

Figure 1 illustrates the framework structure of our proposed Renewable-Aware Frequency Scaling approach.
Service Users: Service users submit requests for DNN model training jobs necessary for AI-based applications such as image classification, video object detection, language translation, etc. These requests specify the type of DNN model, the accompanying raw dataset, a deadline d, and the Deep Learning Job Requirement (as discussed in Section 4). Service users can range from individual clients to large-scale vendors. Failure to meet the predefined DJR within the stipulated deadline may result in a decline in the reputation of the associated DNN inference service.
Energy Sources: The GPU clusters are powered by a combination of renewable (such as solar panels and wind farms) and traditional energy sources. Large vendors (e.g., Amazon, Apple, and Meta) that operate their own data centers commonly also deploy their own renewable generators. In contrast to the electricity bill for energy purchased from the grid market, which is proportional to the amount of energy used and varies with the day-ahead and real-time market states, the construction and operation cost of renewable generators is fixed regardless of the amount of renewable generation. In that case, we do not need to explicitly consider a dynamic price for renewable generation. For simplicity, this paper assumes that the energy from renewable sources is free of cost (apart from the fixed construction and operation cost of the generators) and only considers charges applying to the consumption of non-renewable energy. In contrast to traditional sources, power generation from renewable sources can fluctuate significantly due to weather conditions (e.g., solar radiation and wind speed). Accurate prediction under these fluctuations is crucial for improving energy cost efficiency. Therefore, we incorporate a gated recurrent unit-based renewable energy predictor module into our framework, with details provided in Section 3.
Servers: In our framework, each server is equipped with a GPU device capable of frequency scaling for both the GPU core and memory. Additionally, each server incorporates a controller and monitor module to manage the frequency scaling of the GPU device. The frequency values for the GPU core and memory are dynamically adjusted, taking into account the architecture of the GPU device and the characteristics of the executed DL job.
RA-FS Manager: The RA-FS manager proposed in our framework is responsible for allocating DL jobs to appropriate servers and scaling the frequency values of GPU devices. This manager comprises two sub-modules: a GRU-based renewable energy predictor, and a log-transformation-based problem optimizer. The GRU-based predictor learns the raw data on energy generation from renewable sources. Utilizing the GRU, a streamlined variant of the LSTM model, this predictor accurately forecasts future renewable energy generation and relays these predictions to the optimizer. The problem optimizer’s objective is to find optimal frequency vectors and DL job allocation vectors based on profiling data from all connected servers. Through our defined optimization formulation (discussed in Section 5), the RA-FS manager can finely tune the training performance and energy consumption for assigned DL jobs.

3. Gated Recurrent Unit-Based Prediction

In order to maximize the training performance while minimizing the energy consumption cost of GPU clusters, we need to accurately predict the renewable generation even under its irregularity. In this paper, we adopt the GRU [15], a modified version of the LSTM model. The GRU combines the forget and input gates into a single update gate and merges the cell state with the hidden state, making it well suited for predicting the sequence-ordered data of energy generation from renewable sources.
Figure 2 illustrates the structure of our defined GRU network model for prediction. Let $v(k)$ represent the historical input data and $\tilde{v}(k)$ denote the predicted output data at time slot $k$. Based on [15], we present the associated functions for training the GRU network model as follows:
$$q(k) = \sigma(W_z \cdot [h(k-1), v(k)]), \quad u(k) = \sigma(W_r \cdot [h(k-1), v(k)]), \quad \tilde{h}(k) = \tanh(W \cdot [u(k) \odot h(k-1), v(k)]), \quad h(k) = (1 - q(k)) \odot h(k-1) + q(k) \odot \tilde{h}(k). \qquad (1)$$
Here, $q(k)$ and $u(k)$ denote the update gate vector and the reset gate vector at time slot $k$, respectively. $W_z$, $W_r$, and $W$ denote the model weight parameter matrices. The terms $\tilde{h}(k)$ and $h(k)$ represent the candidate activation vector and the output (hidden state) vector at time slot $k$, respectively. The hyperbolic tangent function $\tanh(\cdot) = \sinh(\cdot)/\cosh(\cdot)$ is employed for calculating the candidate activation vector $\tilde{h}(k)$, where $\sinh(\cdot)$ and $\cosh(\cdot)$ represent the hyperbolic sine and cosine functions, respectively. Through the feedback-based forward and back-propagation within the GRU network model, the model parameters of each block are updated in response to the discrepancies between the predicted output sequences and the actual output sequences.
Based on historical input sequences of renewable energy, the defined GRU network model is capable of inferring the corresponding output sequences. Let l denote the length of both the input and output sequences. Then, we present our GRU-based prediction model as follows:
$$(\tilde{v}_i(k+1), \ldots, \tilde{v}_i(k+l)) = \mathrm{GRU}\big((v_i(k-l), \ldots, v_i(k)),\ W\big). \qquad (2)$$
Different types of loss functions, such as mean squared error loss and cross-entropy loss, can be considered for updating the model parameters W. In this work, we use the MSE loss function due to its simplicity and compatibility with the PyTorch framework.
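To make the gate equations above concrete, here is a minimal pure-Python sketch of a single GRU step following the update-gate/reset-gate formulation in this section (in practice, the predictor would be built with a framework such as PyTorch and its `torch.nn.GRU` module). The weight matrices and the toy input sequence are random placeholders, not trained values:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Row-wise dot products: W is a list of rows, v a flat vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_cell(v_k, h_prev, Wz, Wr, W):
    """One GRU step: q(k) update gate, u(k) reset gate, h_tilde candidate."""
    concat = h_prev + v_k                                   # [h(k-1), v(k)]
    q = [sigmoid(x) for x in matvec(Wz, concat)]            # update gate q(k)
    u = [sigmoid(x) for x in matvec(Wr, concat)]            # reset gate u(k)
    gated = [ui * hi for ui, hi in zip(u, h_prev)] + v_k    # [u(k) ⊙ h(k-1), v(k)]
    h_tilde = [math.tanh(x) for x in matvec(W, gated)]      # candidate activation
    # h(k) = (1 - q(k)) ⊙ h(k-1) + q(k) ⊙ h_tilde(k)
    return [(1 - qi) * hi + qi * hti
            for qi, hi, hti in zip(q, h_prev, h_tilde)]

# Toy run: hidden size 4, one scalar input per slot (e.g., normalized kWh).
random.seed(0)
n, m = 4, 1
def rand_mat():
    return [[random.uniform(-0.5, 0.5) for _ in range(n + m)] for _ in range(n)]
Wz, Wr, W = rand_mat(), rand_mat(), rand_mat()
h = [0.0] * n
for v in [0.2, 0.4, 0.8, 0.6]:   # a short generation sequence
    h = gru_cell([v], h, Wz, Wr, W)
print(len(h))  # hidden state keeps size n
```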

4. System Model

4.1. Deep Learning Job Model

We consider $|I|$ DL jobs invoked in the cluster. For the i-th DL job, the job description is represented by $j_i = \{a_i, d_i, E_i\}$. Here, $a_i$ denotes the arrival time (i.e., the index of the arrived time slot), $d_i$ is the deadline as defined by the DL service user, and $E_i$ represents the number of required epochs for training. Based on this, we present the following definition.
Definition 1 
(Deep Learning Job Requirement, DJR). Let $E_i$, $M_i$, and $b_i$ denote the total epochs, the total number of mini-batches per epoch, and the associated data chunk size per iteration for the i-th DL job, respectively [16]. Then, we denote $R_i = E_i \cdot M_i \cdot b_i$ as the Deep Learning Job Requirement, which represents the total required workload for training within the pre-established deadline $d_i$.
Now, we present a GPU core frequency-based performance model for DL jobs, utilizing a statistical modeling approach. This model is grounded in the relationship $t \propto 1/f$ (time is inversely proportional to frequency), where $t$ represents the DL job processing time and $f$ denotes the frequency value, as discussed in [17,18]. In this model, $\lambda_i^F$ and $\lambda_i^B$ are the performance model coefficients for the feed-forward and back-propagation processes in DNN model training jobs, respectively. These coefficients are determined by the capability of the Streaming Multiprocessors (SMs) in the GPU devices. The coefficient $\lambda_i^O$ is assigned to other sub-processes, excluding feed-forward and back-propagation. Let $\lambda_i = \lambda_i^F + \lambda_i^B + \lambda_i^O$. Assuming the i-th DL job is running on a GPU worker at time slot $k$, we can derive both the DNN model updating time $t_i^{up}(k)$ and the training time $t_i^{tr}(k)$ as follows:
$$t_i^{up}(k) = t_i^{tr}(k) + \frac{2 b_i}{bw_i}, \qquad (3)$$
$$t_i^{tr}(k) = \frac{\lambda_i \cdot b_i}{f_i(k)}. \qquad (4)$$
Here, $f_i(k)$ represents the GPU core frequency value for the worker running the i-th DL job, and $b_i$ refers to the associated data chunk size for each mini-batch. The term $bw_i$ denotes the bandwidth between the worker and the parameter server. The second term in (3) represents the communication time, during which GPU workers send gradients to and receive updated model parameters from the parameter server. Note that the frequency value $f_i(k)$ does not influence this term. Furthermore, we assume that the parameter server is located on the same machine as the GPU worker. Consequently, the term $2 b_i / bw_i$ can be considered approximately zero, rendering it a negligible constant. Therefore, our focus primarily lies on (4). Building upon (4), we propose the following definition.
Definition 2 
(Deep Learning Job Completion per Slot, DJCS). We define $\theta_i(k)$ as the Deep Learning Job Completion per Slot, which quantifies the workload processed for the i-th DL job at a given frequency value of the worker during time slot $k$. Let $\tau$ denote the duration of one time slot. Then, $\theta_i(k)$ can be expressed via the reciprocal of $t_i^{tr}(k)$, formulated as follows:
$$\theta_i(k) = \frac{\tau}{t_i^{tr}(k)} = \frac{\tau \cdot f_i(k)}{\lambda_i \cdot b_i}. \qquad (5)$$
Clearly, for each DL job, the total DJCS across all time slots must meet the predetermined DJR. Let $x_i(k)$ be the indicator that signifies whether the worker is allocated to the i-th DL job at time slot $k$ (where $x_i(k) = 1$ indicates allocation and $x_i(k) = 0$ indicates non-allocation). Then, we propose the following constraints:
$$\sum_{k \in K} x_i(k) \cdot \theta_i(k) \ge R_i, \quad \forall i \in I. \qquad (6)$$
Based on the definition of DJCS and the constraints outlined in (6), we are able to design a flexible energy control framework for DL job processing, adaptive to renewable generation.
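To make Definition 2 and constraint (6) concrete, here is a small sketch; all numeric values below ($\tau$, $\lambda_i$, $b_i$, the frequencies, and $R_i$) are made-up illustrative numbers, not figures from the paper:

```python
def djcs(tau, f_k, lam, b):
    """DJCS per Definition 2: theta_i(k) = tau * f_i(k) / (lambda_i * b_i)."""
    return tau * f_k / (lam * b)

def meets_djr(alloc, freqs, tau, lam, b, R):
    """Check constraint (6): sum_k x_i(k) * theta_i(k) >= R_i."""
    total = sum(x * djcs(tau, f, lam, b) for x, f in zip(alloc, freqs))
    return total >= R

# Illustrative (made-up) numbers: tau = 60 s slots; with lambda_i * b_i = 0.5,
# one slot at 1500 MHz completes 180,000 work units.
tau, lam, b = 60.0, 0.5, 1.0
alloc = [1, 1, 1]                    # three allocated slots
freqs = [1500.0, 1500.0, 1000.0]     # MHz per allocated slot
print(meets_djr(alloc, freqs, tau, lam, b, R=4.8e5))
```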

4.2. GPU Energy Consumption Model

To estimate the GPU energy consumption based on frequency values, we exploit a fitting-based modeling approach, which is applicable for clusters comprising heterogeneous GPU device architectures. Let $\delta_i^F$, $\delta_i^B$, and $\delta_i^O$ denote the coefficients for feed-forward, back-propagation, and other workloads, respectively, and let $\delta_i = \delta_i^F + \delta_i^B + \delta_i^O$. Similar to the DL job model above, these coefficients are associated with the characteristics of both the DNN model and the GPU device. The power consumption for the i-th DL job at time slot $k$ can be defined as follows:
$$p_i(k) = p_i^{STD} + \delta_i \cdot b_i \cdot x_i(k) \cdot f_i(k), \quad \forall i \in I,\ k \in K, \qquad (7)$$
where $p_i^{STD}$ represents the static power consumption of worker $s_i$. We can derive the integrated coefficient $\delta_i$ using an output-based estimation approach, which eliminates the need for costly offline profiling analysis. Let $E_i(k)$ denote the energy consumption of the worker running the i-th DL job during the interval $[k \cdot \tau, (k+1) \cdot \tau]$. Then, $E_i(k)$ can be presented as follows:
$$E_i(k) = \begin{cases} p_i(k) \cdot \tau, & \text{if } x_i(k) = 1\ (\text{active}), \\ p_i^{STD} \cdot \tau, & \text{if } x_i(k) = 0\ (\text{idle}), \end{cases} \quad \forall i \in I,\ k \in K. \qquad (8)$$
If $x_i(k) = 1$ (indicating that time slot $k$ is allocated for processing the i-th DL job), then the GPU frequency significantly affects the associated energy consumption, as detailed in Equation (7). On the other hand, if $x_i(k) = 0$ (meaning the worker is in the idle state and does not process any invoked DL job during time slot $k$), the energy consumption is determined solely by the static power consumption $p_i^{STD}$.
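The power and energy models (7) and (8) can likewise be sketched in a few lines; the coefficient values below are illustrative placeholders, not measured values:

```python
def power_w(p_std, delta, b, x_k, f_k):
    """Power model (7): p_i(k) = p_i_STD + delta_i * b_i * x_i(k) * f_i(k)."""
    return p_std + delta * b * x_k * f_k

def energy_j(p_std, delta, b, x_k, f_k, tau):
    """Energy model (8): dynamic power counts only in active slots."""
    if x_k == 1:                 # active: full power draw for the whole slot
        return power_w(p_std, delta, b, x_k, f_k) * tau
    return p_std * tau           # idle: static power only

# Illustrative coefficients (not measurements from the paper):
# p_STD = 50 W, delta_i = 0.05, b_i = 1, tau = 60 s.
p_std, delta, b, tau = 50.0, 0.05, 1.0, 60.0
active = energy_j(p_std, delta, b, 1, 1500.0, tau)   # slot at 1500 MHz
idle = energy_j(p_std, delta, b, 0, 1500.0, tau)
print(active, idle)
```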

5. Problem Formulation

5.1. Primary Problem Formulation

The objective of our work is to minimize the energy consumption cost for DL job processing while guaranteeing that the DJR is met within the predetermined deadline. To do this, we define two types of decision variables: the job allocation vectors $X = (x_1, x_2, \ldots, x_{|I|})$ and the frequency setting vectors $F = (f_1, f_2, \ldots, f_{|I|})$, where $x_i = (x_i(1), \ldots, x_i(K))$ with $x_i(k) \in \{0, 1\}$, and $f_i = (f_i(1), \ldots, f_i(K))$. If the i-th DL job is processed on the worker during time slot $k$, then $x_i(k) = 1$, and the associated frequency value $f_i(k)$ may be set within a feasible range. Let $r(k)$ denote the amount of renewable energy generation and $c(k)$ the unit price of non-renewable (grid) energy at time slot $k$. Then, based on Equations (7) and (8), we formulate the energy cost control problem as follows:
Problem 1 
(Energy Cost Control Problem).
$$\min_{X, F} \sum_{k \in K} \max\Big(\sum_{i \in I} E_i(k) - r(k),\ 0\Big) \cdot c(k), \qquad (9)$$
$$\text{s.t.} \quad f_i^- \le f_i(k) \le f_i^+, \quad \forall i \in I,\ k \in K, \qquad (10)$$
$$a_i \le k \cdot x_i(k) \le d_i, \quad \forall i \in I,\ k \in K, \qquad (11)$$
$$\sum_{k \in K} x_i(k) \cdot \theta_i(k) \ge R_i, \quad \forall i \in I, \qquad (12)$$
$$x_i(k) \in \{0, 1\}, \quad \forall i \in I,\ k \in K. \qquad (13)$$
Constraints (10) ensure that the frequency setting values for each worker fall within the feasible range, bounded by the maximum value $f_i^+$ and the minimum value $f_i^-$. Constraints (11) ensure that the completion time of the i-th DL job does not exceed the predetermined deadline $d_i$. Constraints (12) guarantee that the sum of the DJCS across the assigned time slots satisfies the DJR. To comply with these constraints, it is necessary to find a proper slot allocation plan and frequency scaling solution. Lastly, constraints (13) enforce the integrality of the decision variables. Obviously, $d_i - a_i$ represents the number of available time slots for processing the i-th DL job. Let $L_i$ denote the lower bound on the number of time slots required for the i-th DL job to meet the DJR $R_i$. By setting $f_i(k) = f_i^+$, and based on constraints (6), we calculate $L_i$ as follows:
$$L_i = \min_{x_i} \sum_{k \in K} x_i(k) \quad \text{s.t. (6)}. \qquad (14)$$
If we cannot find a value for $L_i$ such that $L_i \le d_i - a_i$, it is impossible to meet the deadline of the i-th DL job with any potential solution. Thus, our proposed RA-FS manager will drop that DL job request. Note that Problem 1 is classified as a non-convex mixed-integer nonlinear problem (non-convex MINLP) due to the product terms of the decision variables $x_i(k)$ and $f_i(k)$ included in both the objective function (9) and the constraints (12). To address this issue, we present a reformulation of the original Problem 1 in the next subsection.
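Under the performance model (4), setting $f_i(k) = f_i^+$ in every slot gives the per-slot maximum $\theta_{\max} = \tau \cdot f_i^+ / (\lambda_i \cdot b_i)$, so the lower bound reduces to $L_i = \lceil R_i / \theta_{\max} \rceil$. A hedged sketch of this drop test, with illustrative numbers only:

```python
import math

def slot_lower_bound(R, tau, f_max, lam, b):
    """L_i when every slot runs at f_i^+: ceil(R_i / theta_max),
    with theta_max = tau * f_max / (lam * b) from Equation (5)."""
    theta_max = tau * f_max / (lam * b)
    return math.ceil(R / theta_max)

def is_feasible(R, tau, f_max, lam, b, a, d):
    """Drop rule: a job is kept only if L_i <= d_i - a_i."""
    return slot_lower_bound(R, tau, f_max, lam, b) <= d - a

# Illustrative numbers: theta_max = 60 * 2000 / 0.5 = 240,000 per slot,
# so R = 1e6 needs ceil(4.17) = 5 slots.
print(slot_lower_bound(R=1e6, tau=60.0, f_max=2000.0, lam=0.5, b=1.0))
print(is_feasible(R=1e6, tau=60.0, f_max=2000.0, lam=0.5, b=1.0, a=0, d=4))
```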

5.2. Problem Re-Formulation

Instead of attempting to solve the original Problem 1 directly, we present a modified convex MINLP problem that is equivalent to Problem 1. To this end, we exploit a log-transformation technique [19] on Problem 1. First, we introduce new variables for this transformation as follows:
$$y_i(k) = x_i(k) + 1, \quad \forall i \in I,\ k \in K. \qquad (15)$$
Obviously, $y_i(k) \in \{1, 2\},\ \forall i \in I,\ k \in K$. Then, the additional variables for the reformulation are defined as follows:
$$\bar{y}_i(k) = \ln(y_i(k)), \qquad (16)$$
$$\bar{f}_i(k) = \ln(f_i(k)), \qquad (17)$$
where $\ln(\cdot)$ denotes $\log_e(\cdot)$. Therefore, $x_i(k) = y_i(k) - 1 = \exp(\bar{y}_i(k)) - 1$ and $f_i(k) = \exp(\bar{f}_i(k))$. Let $z_i(k) \in \{0, 1\},\ \forall i \in I,\ k \in K$, denote new binary integer variables. We set $\bar{Y} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{|I|})$, $\bar{F} = (\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_{|I|})$, and $Z = (z_1, z_2, \ldots, z_{|I|})$, where $\bar{y}_i = (\bar{y}_i(1), \ldots, \bar{y}_i(K))$, $\bar{f}_i = (\bar{f}_i(1), \ldots, \bar{f}_i(K))$, and $z_i = (z_i(1), \ldots, z_i(K))$, as the auxiliary vectors. Then, we propose the reformulated problem as follows:
Problem 2 
(Log-transformed Problem).
$$\min_{\bar{Y}, \bar{F}, Z} \sum_{k \in K} \max\Big[\sum_{i \in I} \big\{p_i^{STD} + \delta_i \cdot b_i \cdot \big(\exp(\bar{y}_i(k) + \bar{f}_i(k)) - \exp(\bar{f}_i(k))\big)\big\} \cdot \tau - r(k),\ 0\Big] \cdot c(k), \qquad (18)$$
$$\text{s.t.} \quad f_i^- \le \exp(\bar{f}_i(k)) \le f_i^+, \quad \forall i \in I,\ k \in K, \qquad (19)$$
$$a_i \le k \cdot \big(\exp(\bar{y}_i(k)) - 1\big) \le d_i, \quad \forall i \in I,\ k \in K, \qquad (20)$$
$$\sum_{k \in K} \big(\exp(\bar{y}_i(k) + \bar{f}_i(k)) - \exp(\bar{f}_i(k))\big) \cdot \frac{\tau}{\lambda_i \cdot b_i} \ge R_i, \quad \forall i \in I, \qquad (21)$$
$$\bar{y}_i(k) = \ln(2) \cdot z_i(k), \quad \forall i \in I,\ k \in K, \qquad (22)$$
$$z_i(k) \in \{0, 1\}, \quad \forall i \in I,\ k \in K. \qquad (23)$$
Transforming the constraints (16) for our log-transformation is not straightforward, so we employ a modeling technique based on a special-ordered set of type-1 (SOS-1), as described in [19]. This technique, combined with constraints (21) and (22), ensures that $\bar{y}_i(k)$ takes values in $\{\ln(1) = 0, \ln(2)\}$ for all $i \in I$ and $k \in K$. Additionally, we have developed a modified formulation for Problem 2 that eliminates the max term from the objective function. For more details, please see the Appendix in [20].
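A quick numerical check of the substitution: for $x_i(k) \in \{0, 1\}$, the bilinear term $x_i(k) \cdot f_i(k)$ equals $\exp(\bar{y}_i(k) + \bar{f}_i(k)) - \exp(\bar{f}_i(k))$ with $\bar{y}_i(k) = \ln(x_i(k) + 1)$, which replaces the product of decision variables with exponentials of their sums:

```python
import math

def bilinear_term(x, f):
    """Original non-convex term x_i(k) * f_i(k)."""
    return x * f

def log_transformed_term(x, f):
    """Same quantity after y = x + 1, y_bar = ln(y), f_bar = ln(f):
    x * f = (y - 1) * f = exp(y_bar + f_bar) - exp(f_bar)."""
    y_bar = math.log(x + 1)      # ln(1) = 0 or ln(2)
    f_bar = math.log(f)
    return math.exp(y_bar + f_bar) - math.exp(f_bar)

# The two terms agree on the whole binary domain of x.
for x in (0, 1):
    for f in (420.0, 1500.0, 2760.0):
        assert abs(bilinear_term(x, f) - log_transformed_term(x, f)) < 1e-6
print("log-transformed term matches x*f on the binary domain")
```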
Algorithm 1 outlines the steps of our proposed RA-FS approach. If there is no feasible solution to meet the DJR $R_i$ for a specific i-th DL job within its deadline $d_i$, the algorithm drops that request (as shown in line 2). Subsequently, for all incoming time slots, the GRU-based predictor forecasts the future generation of renewable energy sources (line 3). Then, the MINLP solver tackles the convex optimization Problem 2 (line 4), determining the optimal solution $(X^*, F^*)$. Utilizing this solution, our approach then executes the actual frequency scaling and allocates the DL jobs accordingly.
Algorithm 1: RA-FS Approach
INPUT:
 Model coefficients $\lambda_i^F, \lambda_i^B, \lambda_i^O, \delta_i^F, \delta_i^B, \delta_i^O, \forall i$;
 Historical data $r(k), c(k)$;
 Specification of DL jobs: arrival time $a_i$, deadline $d_i$, and DJR $R_i, \forall i$;
OUTPUT:
 Optimal solution $(X^*, F^*)$.
1: check current available (not-assigned) time slots
2: drop the DL job requests such that $L_i > d_i - a_i$
3: predict $r(k), c(k), \forall k$ by using the GRU-based predictor
4: solve Problem 2 by using the MINLP solver
5: obtain $(X^*, F^*)$ based on $\bar{Y}, \bar{F}, Z$ and (15)–(17)
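The steps of Algorithm 1 can be sketched as follows; `predict_renewables` and `solve_problem2` are hypothetical placeholders standing in for the GRU-based predictor and the MINLP solver, and the job parameters are made up for illustration:

```python
import math

def ra_fs_step(jobs, tau, predict_renewables, solve_problem2):
    """Hedged skeleton of Algorithm 1; the predictor and solver are
    injected as callables (placeholders here, not the paper's modules)."""
    feasible = []
    for j in jobs:
        # line 2: lower bound L_i at maximum frequency f_i^+
        theta_max = tau * j["f_max"] / (j["lam"] * j["b"])
        L = math.ceil(j["R"] / theta_max)
        if L <= j["d"] - j["a"]:
            feasible.append(j)          # keep; otherwise the request is dropped
    r, c = predict_renewables()          # line 3: forecast r(k), c(k)
    return solve_problem2(feasible, r, c)  # lines 4-5: solve, recover (X*, F*)

jobs = [
    {"R": 1e6, "f_max": 2000.0, "lam": 0.5, "b": 1.0, "a": 0, "d": 10},  # feasible
    {"R": 1e6, "f_max": 2000.0, "lam": 0.5, "b": 1.0, "a": 0, "d": 4},   # dropped
]
result = ra_fs_step(jobs, 60.0,
                    predict_renewables=lambda: ([100.0] * 10, [0.1] * 10),
                    solve_problem2=lambda js, r, c: js)
print(len(result))  # only the feasible job survives the drop rule
```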

6. Experiments

6.1. Preliminary Experiments

In this section, we present the pre-measured experimental results of DNN model training performance and the associated power/energy consumption of computing nodes, in relation to GPU core and memory frequency settings. We employed NVIDIA GPU devices, namely, the RTX3060, RTX3090, and the state-of-the-art RTX4090. The specifications of the computing nodes, including these GPU devices, are detailed in Table 1 and Table 2. We iteratively execute the ‘nvidia-smi rgc’ and ‘nvidia-smi rmc’ commands to adjust the frequency values of each GPU device [21]. Subsequently, we collect data on epoch completion times and power consumption using parsing scripts we developed. For our target DNN models, we conduct training sessions for SqueezeNet [22], PreActResNet18 [23], and SEResNet50 [24] with the CIFAR100 dataset [25], based on the PyTorch framework version 2.1.1 [26], along with the CUDA 11.8 [27] and cuDNN 8.9.6 [28] libraries.
The epoch completion times for training various DNN models on Node1 (RTX3060), Node2 (RTX3090), and Node3 (RTX4090) are depicted in Figure 3. Figure 3a–c illustrate the epoch completion time curves for SqueezeNet model training at different core and memory frequency values. Node3 demonstrates superior training performance with an average of 9.9 s, compared to Node1’s average of 31.6 s and Node2’s average of 20.9 s across all frequency values. This is because the RTX4090 GPU in Node3 possesses over 16,000 CUDA cores, approximately 1.5 times more than the RTX3090 and 5 times more than the RTX3060. Unlike Node1 (RTX3060) and Node2 (RTX3090), Node3 (RTX4090) maintains a consistent epoch completion time across various GPU frequency values. This result is due to the relatively small number of weight parameters in the SqueezeNet model (1.24 million trainable parameters, equivalent to 5 MB). Therefore, Node3 (RTX4090) can complete each training iteration swiftly, even at the lowest frequency setting. Figure 3d–f show the epoch completion time curves for PreActResNet18 model training on each node at different frequency values. The average epoch completion time for all curves is 49 s, approximately 2.5 times that of SqueezeNet model training (20 s). This is due to the PreActResNet18 model having a larger number of weight parameters (11 million trainable parameters), approximately 10 times that of SqueezeNet. Aside from the magnitude of the resulting values, the shapes of these curves are similar to those in Figure 3a–c. Figure 3g–i present the epoch completion time curves for SEResNet50 model training in each case. Given the large parameter size of the SEResNet50 (more than 25.6 million trainable parameters), the difference in training performance due to the frequency setting is also significant.
For Node3 (RTX4090), the epoch completion time at (core: 420 MHz, memory: 5001 MHz) is 77 s, about 35% longer than the 54.7 s measured at (core: 2760 MHz, memory: 5001 MHz). In this scenario, the training performance is heavily influenced by both core and memory frequency values. Node3 (RTX4090) achieves an epoch completion time of 36 s at (core: 2760 MHz, memory: 10,501 MHz), which is approximately 56% faster than the 57 s at (core: 2760 MHz, memory: 5001 MHz).
The power consumption for training each DNN model on Node1 to Node3 is depicted in Figure 4. Figure 4a–c show the bar charts of power consumption for SqueezeNet model training at different core and memory frequency values. Interestingly, Node3 (RTX4090) does not display a significantly higher power consumption (average of 75.9 W) compared to Node1 (average of 50.7 W) and Node2 (average of 124.3 W), contrary to the training performance results presented in Figure 3. This is due to the RTX4090’s enhanced performance per watt (PPW) compared to its predecessors in the Ampere product family, which includes the RTX3090. However, it is worth noting that the purchase price of the RTX4090 is three times that of the RTX3090. Note that at high core frequency values (e.g., above 1620 MHz for the RTX3060/3090 and 2220 MHz for the RTX4090), the potential for improving epoch completion time is relatively small compared to the increase in power consumption. This implies that maximizing GPU frequency values without a judicious policy may lead to an undesirable rise in energy consumption costs for only a minor improvement in performance. Figure 4d–f show the power consumption for PreActResNet18 model training on each node at different frequency values. The average power consumption across all curves is 123.5 W, which is 1.3 times that of SqueezeNet model training (93.4 W). Consistent with the epoch completion time results in Figure 3, training the larger-sized PreActResNet18 model requires more power than the relatively smaller-sized SqueezeNet model. Figure 4g–i show results similar to those in Figure 4a–f above.
Figure 5 illustrates the energy consumption for training each DNN model on Nodes 1 through 3. The energy consumption for each case is calculated by multiplying the power consumption by the epoch completion time. To minimize energy consumption, the optimal frequency values must be determined based on the characteristics of the GPU device architectures and the DNN models. For instance, the optimal frequency pair for training the PreActResNet18 model on Node1 is (core: 1020 MHz, memory: 5000 MHz), while for the SEResNet50 model on Node2, it is (core: 1215 MHz, memory: 9501 MHz). To determine the optimal frequency settings for each experimental case, it is necessary to perform iterative offline profiling. This process involves collecting data on power consumption and DNN model training time across the range of potential frequency scales, utilizing specific parsing scripts.
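For illustration, the energy computation and the offline selection of an energy-optimal frequency pair can be sketched as follows; the profile values below are hypothetical placeholders, not the measured data from Figure 5:

```python
# Hypothetical offline profile: (core_MHz, mem_MHz) -> (power_W, epoch_time_s),
# as would be collected by iterative profiling with parsing scripts.
profile = {
    (1020, 5000): (55.0, 120.0),
    (1620, 5000): (80.0, 95.0),
    (2100, 7500): (120.0, 70.0),
}

def energy_wh(power_w, time_s):
    # Energy [Wh] = power [W] x time [s] / 3600
    return power_w * time_s / 3600.0

# Pick the frequency pair that minimizes energy per epoch
best = min(profile, key=lambda f: energy_wh(*profile[f]))
print(best, round(energy_wh(*profile[best]), 3))
```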

6.2. Competitor Setup

To demonstrate the superiority and efficiency of our proposed RA-FS approach, we compare it with other methods for energy-efficient DL job processing. We define the competitors of our proposed approach as follows:
(1) MIN Frequency Setting (MIN): This approach sets the core and memory frequency values of all GPU devices to their minimum levels, irrespective of the DNN model training performance requirements and renewable energy generation. Despite significantly compromising DNN model training performance, it may still incur undesirable energy consumption because of the extended training time.
(2) MAX Frequency Setting (MAX): In this approach, the core and memory frequency values of all GPU devices are set to their maximum fixed values. Unlike the MIN approach, which prioritizes energy efficiency, this method aims to achieve the highest possible performance in DNN model training, but it may result in undesirably high energy consumption on high-performance GPU devices such as the RTX4090.
(3) Energy Optimal Frequency Setting (eOPT): In this approach, the core and memory frequency values of all GPU devices are set to their energy-optimal fixed values. We pre-determined the optimal frequency settings for each pair of deployed GPU devices and DNN models based on the raw data derived from Figure 5. For example, on Node1 (RTX3060), we set (core: 1020 MHz, memory: 5000 MHz) as the optimal frequency value for all DL models: SqueezeNet, PreActResNet18, and SEResNet50. This approach achieves the minimum energy consumption but does not guarantee the satisfaction of deadlines for each DL job request.
(4) DeepPower-Controller (DP-CTR) [29]: This approach provides real-time power control for cost-efficient DL job processing in GPU-based clusters. Utilizing the concept of model predictive control, it adjusts core and memory frequency values in response to dynamic renewable generation and electricity prices. However, this approach naively minimizes DNN model training time without a defined performance requirement, as in our DJR, potentially leading to inefficient power usage in meeting the performance requirements of diverse DL job requests.
(5) Energy Budget-based Frequency Scaling and Job Allocation (EB-FJ) [30]: This approach involves assigning a limited energy budget and aims to find the optimal frequency setting and DL job allocation plan within this budget. The authors introduce a scheduler called PowerFlow, which minimizes the average Job Completion Time without exceeding a predefined energy budget. Like the DP-CTR approach, it does not account for explicit deadlines of assigned DL jobs, failing to achieve optimal energy budget distribution for each job. For a fair comparison, we set the energy budget for this approach to match the total energy consumption of our proposed RA-FS approach.
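For illustration, the fixed-frequency competitors above can be sketched as simple policy functions; the frequency bounds and the eOPT lookup value below are illustrative (cf. Table 2 and Figure 5), not part of any competitor’s actual implementation:

```python
# Simplified fixed-frequency policies; bounds are illustrative for one device (RTX3060).
CORE_MIN, CORE_MAX = 405, 2160   # MHz
MEM_MIN, MEM_MAX = 405, 7501     # MHz

POLICIES = {
    "MIN": lambda profile: (CORE_MIN, MEM_MIN),
    "MAX": lambda profile: (CORE_MAX, MEM_MAX),
    # eOPT looks up a per-(GPU, model) optimum obtained from offline profiling
    "eOPT": lambda profile: profile["energy_optimal"],
}

node1_profile = {"energy_optimal": (1020, 5000)}  # value taken from the text
for name, policy in POLICIES.items():
    print(name, policy(node1_profile))
```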

6.3. Renewable Energy Source and Workload Setup

Renewable energy can be produced by two types of generators: wind farms and photovoltaic cells (solar panels). We indirectly determined the amount of energy generated by renewable generators using raw data of multiple parameters. To obtain practical renewable energy data, we utilize the values of solar radiance (W/m²), outside air temperature (°C), and wind speed (m/s) from raw data sets in region 1 (NELHA, Kailua Kona, HI, USA), region 2 (NREL M2, Boulder, CO, USA), and region 3 (Oak Ridge LAB, Oak Ridge, TN, USA) for the periods of 1 to 3 June 2018 and 1 to 3 December 2018 [31]. For more details, refer to our previous studies [29,32].
Based on the raw data derived from Section 6.1, we established a simulation environment to verify the performance of our proposed RA-FS approach in a large-scale cluster. We considered 100 GPU worker nodes (33 Node1, 33 Node2, and 34 Node3) and set the entire simulation time to 24 h, with each time slot lasting 15 min. Each generated request was allocated to an individual node (1 request per node), and we randomly determined the arrival time, job deadline, and number of epochs within the ranges [0 h, 8 h], [20 h, 24 h], and [200, 500], respectively. The target model for each request was selected from our defined set of 3 models: SqueezeNet, PreActResNet18, and SEResNet50. Depending on the workload level, we considered three cases: low (SqueezeNet: 50%, PreActResNet18: 30%, SEResNet50: 20%), medium (SqueezeNet: 30%, PreActResNet18: 40%, SEResNet50: 30%), and high (SqueezeNet: 20%, PreActResNet18: 30%, SEResNet50: 50%). For example, in the low workload case, we generated and allocated 50 requests for SqueezeNet, 30 requests for PreActResNet18, and 20 requests for SEResNet50 to the cluster.
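The workload generation procedure above can be sketched as follows; uniform sampling and the helper names are our illustrative assumptions:

```python
import random

MODELS = ["SqueezeNet", "PreActResNet18", "SEResNet50"]
# Workload mixes from the text: fraction of the 100 requests per model
MIX = {"low": (0.5, 0.3, 0.2), "medium": (0.3, 0.4, 0.3), "high": (0.2, 0.3, 0.5)}

def generate_requests(level, n_nodes=100, seed=0):
    rng = random.Random(seed)
    counts = [round(n_nodes * p) for p in MIX[level]]
    requests = []
    for model, count in zip(MODELS, counts):
        for _ in range(count):
            requests.append({
                "model": model,
                "arrival_h": rng.uniform(0, 8),     # arrival time in [0 h, 8 h]
                "deadline_h": rng.uniform(20, 24),  # job deadline in [20 h, 24 h]
                "epochs": rng.randint(200, 500),    # number of epochs in [200, 500]
            })
    return requests

reqs = generate_requests("low")
print(len(reqs))  # one request per worker node
```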

6.4. Performance Comparison Results

Figure 6 presents a comparison of the average epoch completion time and the DJR violation degree (%) for all approaches: MIN, MAX, eOPT, DP-CTR, EB-FJ, and our proposed RA-FS approach. The DJR violation degree measures how far the completion time of a DL job exceeds its pre-defined deadline. For instance, if the deadline is set at 30 s and the actual training completion time is 60 s, the DJR violation degree is 100%. Figure 6a–c display the average epoch completion times for all approaches with renewable energy generation in Regions 1 to 3, respectively. The MIN and MAX approaches serve as baselines (i.e., the upper and lower bounds of average epoch completion time, respectively) for performance comparison. In addition, note that the eOPT approach focuses solely on energy reduction, so it may neither minimize the epoch completion time nor ensure DJR compliance. Our proposed RA-FS approach demonstrates slightly better training performance (an average epoch completion time of 79.3 s) than the other approaches, except for the MAX approach. The DP-CTR and EB-FJ approaches achieve training performance comparable to ours (average epoch completion times of 85.4 s and 80.2 s, respectively). This stems from their ability to minimize DNN model training time through sophisticated optimization modeling. Notably, the EB-FJ approach employs a combination of GPU frequency scaling (micro control) and dynamic DL job allocation (macro control) to achieve performance akin to our RA-FS approach. Figure 6d–f illustrate the DJR violation degree of all approaches in Regions 1 to 3, respectively. In contrast to Figure 6a–c, our RA-FS approach (average of 2.7%) surpasses the DP-CTR (average of 17.8%) and EB-FJ (average of 13.5%) approaches, nearing the performance of the MAX approach (average of 0.7%).
Utilizing DJR- and DJCS-based frequency scaling, our RA-FS approach effectively addresses the performance requirements of heterogeneous DL job requests. It prioritizes high frequency values for DL jobs with tight deadlines, while allocating lower frequency values to those with more flexible deadlines.
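The DJR violation degree used in Figure 6 can be expressed directly from its definition; the function name is illustrative:

```python
def djr_violation_degree(completion_s, deadline_s):
    """DJR violation degree (%): relative overrun past the deadline, 0 if met."""
    return max(0.0, (completion_s - deadline_s) / deadline_s * 100.0)

print(djr_violation_degree(60, 30))  # the worked example from the text: 100.0
print(djr_violation_degree(25, 30))  # deadline met: 0.0
```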
Figure 7 examines the ratio of total energy consumption to renewable energy consumption across all approaches in Regions 1 to 3. To facilitate a fair comparison, we set the energy budget for the EB-FJ approach equal to the total energy consumption incurred by our RA-FS approach; this budget is dynamically adjusted on an hourly basis. For instance, if the total energy consumption of our approach is 200 [kWh] for the 2:00 p.m. to 3:00 p.m. time slot, the energy budget for the EB-FJ approach during the same slot is also set to 200 [kWh]. Figure 7a–c present bar charts illustrating the performance in Region 1, categorized by workload size. In all cases, our proposed RA-FS approach consistently achieves the best renewable energy utilization (average of 147 [kWh]), closely approximating that of the MAX approach (average of 164 [kWh]). This is because our problem formulation explicitly incorporates the E(k) − r(k) term (refer to Problem 1) and adjusts frequency values accordingly. Our approach dynamically allocates larger capacity (i.e., higher frequency values) to DL jobs during periods of high renewable energy generation, and smaller capacity (i.e., lower frequency values) during periods of low renewable generation. The other approaches demonstrate inferior renewable energy utilization compared to ours: the DP-CTR approach averages about 127 [kWh], while the EB-FJ approach averages approximately 69.3 [kWh]. Although the DP-CTR approach explicitly considers the amount of renewable generation, it underperforms compared to our proposed approach because it focuses on minimizing power consumption costs rather than directly reducing ‘energy consumption’ costs. The EB-FJ approach exhibits the worst renewable energy utilization of all the approaches.
This is because it does not explicitly factor in renewable energy consumption, focusing instead on energy efficiency influenced by the architectural characteristics of the deployed GPU devices. Figure 7d–f display the renewable energy utilization in Region 2. The trends in these results are similar to those in Figure 7a–c, with the absolute amount of renewable energy usage increasing due to a rise in renewable generation. Figure 7g–i present the renewable energy utilization in Region 3. In this case too, our RA-FS approach demonstrates the most efficient renewable utilization, averaging 55%, compared to 36% for the DP-CTR approach and 27% for the EB-FJ approach.
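The renewable-aware capacity allocation described above can be illustrated with a simplified linear heuristic; this is a stand-in sketch for intuition only, not the actual RA-FS optimization from Problem 1, and all values are hypothetical:

```python
def scale_frequency(renewable_kw, renewable_peak_kw, f_min, f_max):
    """Map the current renewable generation level to a core frequency:
    higher generation -> higher frequency. A simplified stand-in for the
    renewable-aware frequency scaling, not the actual RA-FS solver."""
    ratio = min(1.0, max(0.0, renewable_kw / renewable_peak_kw))
    return f_min + ratio * (f_max - f_min)

# Low generation at night vs. high generation at midday (illustrative values)
print(scale_frequency(10.0, 200.0, 405, 2160))   # close to the minimum frequency
print(scale_frequency(180.0, 200.0, 405, 2160))  # close to the maximum frequency
```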
In summary, across all experimental scenarios, the RA-FS approach markedly enhances the performance of renewable energy utilization, achieving an average improvement of about 63% compared to competing methods. This outcome suggests that the RA-FS approach holds significant potential for energy cost savings over its competitors. The experimental results confirm that our proposed approach is a promising candidate for creating renewable-aware and energy-efficient DL job processing clusters equipped with modern GPU devices.

7. Conclusions

In this paper, we introduce a novel Renewable-Aware Frequency Scaling (RA-FS) approach for energy-efficient deep learning (DL) job processing clusters. To the best of our knowledge, this is the first study to specifically address renewable energy generation in reducing the energy consumption costs associated with DL job processing, utilizing a real-time GPU frequency scaling method. Furthermore, to address the non-convex optimization challenge posed by the multiplication of power and training time, we introduce a log-transformation technique. This technique effectively transforms the non-convex problem into a convex one, ensuring the optimality of the derived solution. We incorporate deadline constraints, defined as Deep-Learning Job Requirement (DJR) and Deep-Learning Job Completion per Slot (DJCS), into the objective function. By doing this, our approach enables maximizing renewable energy utilization while meeting the performance requirements of each DL job. Using the NVIDIA-SMI utility on recent NVIDIA GPU architectures, including RTX3060, RTX3090, and RTX4090, we measured the actual data of epoch completion times, power consumption, and associated energy consumption for various deep neural network (DNN) models—SqueezeNet, PreActResNet, and SEResNet—trained across a range of GPU frequency settings. Through diverse experimental results, which included renewable generation data from multiple regions, we confirmed that our proposed RA-FS approach surpasses its competitors in terms of training performance and energy consumption cost efficiency. The RA-FS approach achieves an average improvement of approximately 5% in epoch completion time, 71% in deadline satisfaction, 10.7% in energy consumption, and 31% in renewable energy utilization compared to its competitors. We conclude that our work is a highly promising option for future deep learning job processing clusters, especially those aiming for carbon neutrality and the use of natural, renewable energy sources.
In future work, we plan to expand our RA-FS approach to include parallel DL job processing on both homogeneous and heterogeneous worker nodes for practicality. We will develop an online-based RA-FS approach for clusters to ensure acceptable performance, even when dealing with the uncertain behaviors of DL job requests. Furthermore, we will explore diverse strategies, including the integration of an energy storage system and continual learning job allocation, to effectively enhance the utilization of renewable energy.

Author Contributions

Conceptualization, D.-K.K.; methodology, D.-K.K. and H.-G.P.; software, D.-K.K.; validation, D.-K.K. and H.-G.P.; formal analysis, D.-K.K.; investigation, D.-K.K. and H.-G.P.; resources, D.-K.K. and H.-G.P.; data curation, D.-K.K.; writing—original draft preparation, D.-K.K. and H.-G.P.; writing—review and editing, D.-K.K.; visualization, D.-K.K. and H.-G.P.; supervision, D.-K.K.; project administration, D.-K.K. and H.-G.P.; funding acquisition, D.-K.K. and H.-G.P. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Wonkwang University in 2021.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Silvano, C.; Ielmini, D.; Ferrandi, F.; Fiorin, L.; Curzel, S.; Benini, L.; Conti, F. A survey on deep learning hardware accelerators for heterogeneous HPC platforms. arXiv 2023, arXiv:2306.15552. [Google Scholar]
  2. NVIDIA. Available online: https://www.nvidia.com/ (accessed on 21 November 2023).
  3. AMD. Available online: https://www.amd.com/ (accessed on 21 November 2023).
  4. NVIDIA DGX Platform. Available online: https://www.nvidia.com/en-us/data-center/dgx-platform/ (accessed on 21 November 2023).
  5. You, J.; Chung, J.-W.; Chowdhury, M. Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; pp. 119–139. [Google Scholar]
  6. Kang, D.K.; Lee, K.B.; Kim, Y.C. Cost efficient gpu cluster management for training and inference of deep learning. Energies 2022, 15, 474. [Google Scholar] [CrossRef]
  7. Peng, X.; Bhattacharya, T.; Cao, T.; Mao, J.; Tekreeti, T.; Qin, X. Exploiting renewable energy and UPS systems to reduce power consumption in data centers. Big Data Res. 2022, 27, 100306. [Google Scholar] [CrossRef]
  8. Cao, Z.; Zhou, X.; Hu, H.; Wang, Z.; Wen, Y. Toward a systematic survey for carbon neutral data centers. IEEE Commun. Surv. Tutor. 2022, 14, 895–936. [Google Scholar] [CrossRef]
  9. Khan, Z.A.; Tanveer, H.; Ijaz, U.H.; Fath, U.M.U.; Baik, S.W. Towards efficient and effective renewable energy prediction via deep learning. Energy Rep. 2022, 8, 10230–10243. [Google Scholar] [CrossRef]
  10. Goh, H.H.; He, R.; Zhang, D.; Liu, H.; Dai, W.; Lim, C.S.; Kurniawan, T.A.; Teo, K.T.K.; Goh, K.C. A multimodal approach to chaotic renewable energy prediction using meteorological and historical information. Appl. Soft Comput. 2022, 118, 108487. [Google Scholar] [CrossRef]
  11. Liao, W.; Bak-Jensen, B.; Pillai, J.R.; Yang, Z.; Liu, K. Short-term power prediction for renewable energy using hybrid graph convolutional network and long short-term memory approach. Electr. Power Syst. Res. 2022, 211, 108614. [Google Scholar] [CrossRef]
  12. Yao, C.; Liu, W.; Tang, W.; Hu, S. EAIS: Energy-aware adaptive scheduling for CNN inference on high-performance GPUs. Future Gener. Comput. Syst. 2022, 130, 253–268. [Google Scholar] [CrossRef]
  13. Liu, D.; Ma, Z.; Zhang, A.; Zheng, K. Efficient GPU Resource Management under Latency and Power Constraints for Deep Learning Inference. In Proceedings of the 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems (MASS), Toronto, ON, Canada, 25–27 September 2023; pp. 548–556. [Google Scholar]
  14. Nabavinejad, S.M.; Guo, T. Opportunities of Renewable Energy Powered DNN Inference. arXiv 2023, arXiv:2306.12247. [Google Scholar]
  15. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; p. 1724. [Google Scholar]
  16. Bao, Y.; Peng, Y.; Wu, C.; Li, Z. Online job scheduling in distributed machine learning clusters. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications (INFOCOM), Honolulu, HI, USA, 15–19 April 2018; pp. 495–503. [Google Scholar]
  17. Kang, D.K.; Ha, Y.G.; Peng, L.; Youn, C.H. Cooperative Distributed GPU Power Capping for Deep Learning Clusters. IEEE Trans. Ind. Electron. 2022, 69, 7244–7254. [Google Scholar] [CrossRef]
  18. Abe, Y.; Sasaki, H.; Kato, S.; Inoue, K.; Edahiro, M.; Peres, M. Power and performance characterization and modeling of GPU-accelerated systems. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), Phoenix, AZ, USA, 19–23 May 2014; pp. 113–122. [Google Scholar]
  19. Belotti, P.; Kirches, C.; Leyffer, S.; Linderoth, J.; Luedtke, J.; Mahajan, A. Mixed-integer nonlinear optimization. Acta Numer. 2013, 22, 1–131. [Google Scholar] [CrossRef]
  20. Lin, M.; Wierman, A.; Andrew, L.L.; Thereska, E. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Trans. Netw. 2012, 21, 1378–1391. [Google Scholar] [CrossRef]
  21. NVIDIA-SMI. Available online: https://developer.nvidia.com/nvidia-system-management-interface (accessed on 21 November 2023).
  22. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar]
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  25. Pytorch-cifar100. Available online: https://github.com/weiaicunzai/pytorch-cifar100 (accessed on 21 November 2023).
  26. PyTorch. Available online: https://pytorch.org/get-started/locally/ (accessed on 21 November 2023).
  27. CUDA. Available online: https://developer.nvidia.com/cuda-toolkit-archive (accessed on 21 November 2023).
  28. cuDNN. Available online: https://developer.nvidia.com/rdp/cudnn-download (accessed on 21 November 2023).
  29. Kang, D.K.; Youn, C.H. Real-time control for power cost efficient deep learning processing with renewable generation. IEEE Access 2019, 7, 114909–114922. [Google Scholar] [CrossRef]
  30. Gu, D.; Xie, X.; Huang, G.; Jin, X.; Liu, X. Energy-Efficient GPU Clusters Scheduling for Deep Learning. arXiv 2023, arXiv:2304.06381. [Google Scholar]
  31. Measurement and Instrumentation Data Center (MIDC). Available online: https://midcdmz.nrel.gov/ (accessed on 21 November 2023).
  32. Kang, D.K.; Yang, E.J.; Youn, C.H. Deep learning-based sustainable data center energy cost minimization with temporal MACRO/MICRO scale management. IEEE Access 2018, 7, 5477–5491. [Google Scholar] [CrossRef]
Figure 1. The Framework Structure of the proposed Renewable-Aware Frequency Scaling Approach.
Figure 2. Gated recurrent unit-based many-to-many neural network model for renewable generation sequence prediction.
Figure 3. Epoch Completion Time (s) for SqueezeNet, PreActResNet18, and SEResNet50 on Node1 (RTX3060), Node2 (RTX3090), and Node3 (RTX4090).
Figure 4. Power Consumption [W] for SqueezeNet, PreActResNet18, and SEResNet50 on Node1 (RTX3060), Node2 (RTX3090), and Node3 (RTX4090).
Figure 5. Energy Consumption [Wh] for SqueezeNet, PreActResNet18, and SEResNet50 on Node1 (RTX3060), Node2 (RTX3090), and Node3 (RTX4090).
Figure 6. Comparison of Average Epoch Completion Time and DJR Violation Ratio.
Figure 7. Comparison of Ratio of Total Energy Consumption to Renewable Energy Consumption.
Table 1. Computing Node Specifications.
         Node1              Node2              Node3
CPU      i5-11400F          i9-10900K          i9-13900KF
Board    B560M-A            Z490 Ext4          Z790-P WIFI
MEM      DDR4 32 GB         DDR4 64 GB         DDR5 64 GB
GPU      RTX3060 (12 GB)    RTX3090 (24 GB)    RTX4090 (24 GB)
Disk     WD SN350 1 TB      SSD970 1 TB        SHPP41-1000 GM 1 TB
Table 2. GPU Specifications [2].
                    RTX3060         RTX3090         RTX4090
CUDA cores          3584            10,496          16,384
Global Memory       GDDR6 12 GB     GDDR6X 24 GB    GDDR6X 24 GB
Freq. Range (core)  405–2160 MHz    210–2100 MHz    210–3105 MHz
Freq. Range (mem)   405–7501 MHz    405–9751 MHz    405–10,501 MHz
TDP                 170 [W]         350 [W]         450 [W]

Share and Cite

MDPI and ACS Style

Park, H.-G.; Kang, D.-K. Renewable-Aware Frequency Scaling Approach for Energy-Efficient Deep Learning Clusters. Appl. Sci. 2024, 14, 776. https://doi.org/10.3390/app14020776
