1. Introduction
Long-running programs include database systems, operating systems, and platforms that support sensor systems. Such software needs to be very reliable, but should also be efficient in execution time and energy consumption. Its reliability [1] is therefore often assured via checkpoints, so that a failure does not lead to excessive overhead in execution time [2,3,4] and energy consumption [5].
Indeed, among the mechanisms that restore or preserve system consistency after failures [6], Checkpointing and Recovery (CR) is widely used to periodically save an up-to-date copy of the system or program state, which is then used to restart execution if a failure occurs. CR can also be found in high performance systems [7,8,9,10], operating systems including Linux [11,12,13], databases [14], and distributed systems [15,16,17,18].
Checkpoint intervals have thus been widely studied to maximize system availability and minimize program execution time for transaction-oriented systems [19,20,21], and the embedded multiple-level checkpoints introduced in [22,23] were recently studied in [24]. CR [6,25] includes “Application-level Checkpoint and Restart” (ALCR) [26,27], which uses less memory space but requires significant programming skills to insert checkpoints in long-running loops [28,29]. Since longer inter-checkpoint intervals increase the time and energy required for system restart, while shorter intervals increase them through more frequent checkpoints, the checkpoint interval should be chosen to minimize both energy consumption and execution time [30,31].
In recent years the importance of energy savings in information technology and software has often been emphasized [32,33,34,35], and research has addressed the efficient allocation of energy in computer systems [36,37,38], including the use of server or network node vacations to reduce energy consumption [39], techniques to select Cloud servers based on energy efficiency [40], and the use of renewable energy sources [41,42,43,44]. There has been less work on more detailed techniques such as checkpoints to reduce energy consumption [45,46], or on checkpoint optimization in modern software using ALCR [47,48]. In addition, commonly used tools such as ALCR do not offer assistance in selecting checkpoint intervals to optimize energy consumption or execution time, and a software tool was proposed recently to address this issue [49].
Thus in this paper we focus on analyzing checkpoint intervals in a unified manner to effect savings in a weighted combination of execution time and energy, since energy consumption is of importance both in autonomously operating platforms and for software running in large scale Cloud data centers [50]. In the sequel, starting from first principles, we develop a mathematical model to estimate the average execution time as well as the energy consumption of a program with long loops that operates in the presence of failures, without and with ALCR. This allows us to compute the checkpoint interval that minimizes the program’s energy consumption or its average execution time, as well as the interval that minimizes a cost function that is a weighted sum of both, expressed via the Lambert function, with numerical examples that illustrate the results. In addition, we apply these results to a well-known software benchmark.
The rest of the paper is structured as follows. In Section 2, we present the mathematical model that estimates the average execution time and energy consumption of a software program that operates in the presence of failures, with and without checkpoints. In Section 3, based on this mathematical model, the closed-form expression of the optimum checkpoint interval is derived. In Section 4, we illustrate our results through a set of numerical examples. In addition, Section 5 shows how our model can be used to select checkpoints for the popular Rodinia Benchmark of real-world open-source software written in the C and C++ programming languages, which is widely used for software performance evaluation and energy optimization, and in particular for its streamcluster (https://github.com/yuhc/gpu-rodinia/tree/master/opencl/streamcluster) program. Finally, Section 6 concludes the paper and discusses directions for future work.
2. A Single Loop Program with Checkpoints
Consider a program P that executes instructions between its -th and n-th checkpoint, without counting all possible failures and failure recoveries. Now consider the instant when the program creates its n-th checkpoint, and let denote the total number of instructions that the program has executed by time since it started, where does not include all the repeated instructions that were executed due to checkpoints and failure recovery, and obviously: .
Let be the computation time needed to create the n-th checkpoint. This quantity will generally depend on the total memory space occupied by the program, but in certain cases it may depend on , since the program may generate new data as it is executing. Hence we will write where and are constants for the given program.
On the other hand, suppose a failure occurs after the program has successfully executed
y instructions after the
n-th checkpoint, i.e., after the program has executed
instructions. If
is the computation time needed to restart the program from the most recent checkpoint, when the program has successfully executed
instructions after the most recent checkpoint but before the
checkpoint, then we will have:
Therefore the time duration depends on the number y of instructions that have been executed by the program since the last checkpoint was established. In summary, we are assuming that:
The time needed to establish the n-th checkpoint depends on the “age of the program” or the total number of instructions it has executed since the beginning, i.e., ,
The time needed to recover from a failure after the n-th checkpoint, including the time related to re-loading system state after the failure, only depends on , the “computation time undertaken by the program since the last checkpoint”, i.e., .
Similarly, we denote the energy consumption for creating the n-th checkpoint to be , and is the energy used to recover from a failure that occurs when the total number of instructions executed is . Also, we will have , and with and .
Let
be positive constants that represent the relative costs of computation time and energy consumption. We can then define the parameters:
and the total cost of an instruction can be viewed as the weighted sum of its execution time and of its energy consumption
.
Fixed Checkpoint Intervals
Earlier work has shown that “age dependent” checkpoints [
51] can reduce the overall cost of checkpointing and failure recovery, when (for instance) the failure rate of a system increases with time. However, most practical checkpointing schemes use a simpler approach where checkpoints are carried out periodically each time the program has executed successfully a predetermined fixed number of instructions
. Thus, in the sequel we will make this assumption so that checkpoints are placed after
, etc. instructions have been successfully executed, and we will proceed to compute the optimum value of
y, assuming that
n is fixed in advance.
When the program ends after instructions are executed, a further -th checkpoint is not needed, while the first checkpoint is obviously installed before the first instruction is executed.
We can then formulate our problem as that of a program that executes a total fixed number of instructions Y, where we want to choose the constant value y of the number of instructions between checkpoints, or equivalently we can choose N, the number of checkpoints so that so that the total overhead in additional work and energy consumption due to failures and due to checkpoints is minimized.
For a given
y, let us compute
, which is the corresponding total expected execution time including all restarts due to failures, starting from the most recent checkpoint. When the average execution time per instruction is
c, and the failure probability per instruction is
, the total average elapsed time for the execution of
y instructions is:
because with probability
a failure does not occur during the
y instructions, leading to an execution time of
time units, while with probability
at least one failure does occur among the
y instructions, and the first of those requires a program re-start time of
, to which we should add
representing the effect of all future failures after the program has been re-initialised from the checkpoint.
Also, we have to include the execution time plus the amount of additional work needed per executed instruction, until the failure occurs—hence the term
– multiplied by
x and the probability that the failure occurs at instruction
x which is
, summed over
x running from 1 to
y. Since
we obtain:
For the total expected energy consumption
for a number of instructions
y after the most recent checkpoint, we similarly obtain the quantity:
where
denotes the average energy consumption per instruction, so that
Interestingly enough, we can show using l’Hôpital’s Rule, for all
, that:
as would be expected.
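A minimal Python sketch of this renewal argument is given here, under simplifying assumptions that are ours and not the model's: the restart time is a fixed constant `restart` rather than a function of the work already done, each instruction fails independently with probability `p_fail`, and the failing instruction itself costs one instruction time. The names `expected_segment_time`, `simulate_segment_time`, `p_fail` and `restart` are hypothetical. The first function solves the recursion for the expected time to complete y instructions, and the second estimates the same quantity by simulation.

```python
import random


def expected_segment_time(y, c, p_fail, restart):
    """Expected time to complete y instructions from a checkpoint.

    Assumption (ours): a fixed restart cost, and a failure at instruction x
    wastes c * x time units before the segment is restarted from scratch.
    Solves E = q^y * c*y + sum_x P(first failure at x) * (c*x + restart + E).
    """
    q = 1.0 - p_fail
    no_failure = q ** y
    # Time lost up to and including the failing instruction, weighted by
    # the probability that the first failure occurs at instruction x.
    wasted = sum(c * x * p_fail * q ** (x - 1) for x in range(1, y + 1))
    return (no_failure * c * y + wasted + (1.0 - no_failure) * restart) / no_failure


def simulate_segment_time(y, c, p_fail, restart, trials=20000, seed=0):
    """Monte Carlo estimate of the same expected time."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        done = 0
        while done < y:
            total += c                   # every attempted instruction costs c
            if rng.random() < p_fail:
                total += restart         # failure: pay the restart cost ...
                done = 0                 # ... and go back to the checkpoint
            else:
                done += 1
    return total / trials


if __name__ == "__main__":
    print(expected_segment_time(50, 1.0, 0.01, 5.0))
    print(simulate_segment_time(50, 1.0, 0.01, 5.0))
```

With `p_fail = 0` the expression reduces to `c * y`, in line with the limiting behaviour noted above.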
Treating
y as if it were a real number, we can compute the derivative of
. We first note that for a differentiable function
of the real variable
y, we can write:
and therefore
Because , the quantity , and since y is large, is very large and .
3. Minimizing Computation Time and Energy
When we include both the time and energy needed to create each checkpoint, and assuming a fixed number of instructions
y executed between successive checkpoints, we can obtain the total cost of the program up to and including the last instruction executed at
as:
The optimum checkpoint interval
is then the value of
y that minimizes
, the overall cost per unit work that is accomplished, i.e.,
divided by
which is the total number of useful instructions executed over this time:
Therefore, to seek the optimum value of
y, we compute the following derivative and set it to zero:
so that the optimum value of
y is:
Defining
and
we have:
To verify that
is the minimum value, we compute:
where
denote the first and second derivatives of
with respect to
y, and
. Since at
we have
, we can write:
and we need to examine the sign of
. Starting from (
8) we have:
which is positive, so that
is indeed the value of
y at the minimum.
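The analytical optimum can be cross-checked by a direct numerical search. The sketch below assumes a simplified version of the model that is ours, not the paper's: a fixed checkpoint cost `checkpoint`, a fixed restart cost `restart`, a cost `c` per instruction and an independent failure probability `p_fail` per instruction, for which the expected cost of a segment of y instructions is (c/p_fail + restart)((1 - p_fail)^(-y) - 1). All function and parameter names are hypothetical.

```python
import math


def cost_per_instruction(y, c, p_fail, restart, checkpoint):
    """(checkpoint cost + expected segment cost) / y, for 0 < p_fail < 1."""
    lam = -math.log1p(-p_fail)           # (1 - p_fail) ** (-y) == exp(lam * y)
    segment = (c / p_fail + restart) * math.expm1(lam * y)
    return (checkpoint + segment) / y


def best_interval_by_scan(c, p_fail, restart, checkpoint, y_max):
    """Integer y in [1, y_max] with the smallest cost per useful instruction."""
    return min(
        range(1, y_max + 1),
        key=lambda y: cost_per_instruction(y, c, p_fail, restart, checkpoint),
    )


if __name__ == "__main__":
    y_best = best_interval_by_scan(1.0, 1e-4, 10.0, 100.0, 20000)
    print(y_best, cost_per_instruction(y_best, 1.0, 1e-4, 10.0, 100.0))
```

An exhaustive scan is slow for very long programs, but it makes no use of derivatives and is therefore a convenient independent check on a closed-form optimum.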
3.1. The Optimum Checkpoint Using the Lambert Function
Let us first recall the definition of the
Lambert Function [
52,
53,
54,
55]. Consider any two numbers
, which have the following relation:
Thus if we can write , then , and similarly if , then .
Applying (
19) to Equation (
15), we can write the expression for
as:
which provides an explicit solution for the value of the optimum checkpoint interval
. Clearly, if we set
and
, we obtain the optimum checkpoint that simply minimizes the overall execution time, without consideration for the energy consumption.
Also, if in the system under consideration the creation of a checkpoint does not depend on the amount of successful computation that the program has accomplished until the time of the checkpoint, then we simply set in the expression for B, so that which is the case that is usually discussed in the literature.
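To make the role of the Lambert function concrete, the following Python sketch evaluates its principal branch with Newton's method and applies it to a simplified cost model. The simplification is ours and is not the paper's expression: with a fixed checkpoint cost `checkpoint`, a fixed restart cost `restart`, instruction cost `c` and per-instruction failure probability `p_fail`, the cost per instruction is [checkpoint + K(e^{λy} - 1)]/y with K = c/p_fail + restart and λ = -ln(1 - p_fail), and setting its derivative to zero gives y = [1 + W((checkpoint/K - 1)/e)]/λ. The names `lambert_w0` and `optimal_interval` are hypothetical.

```python
import math


def lambert_w0(x, tol=1e-14, max_iter=100):
    """Principal branch W0(x): the w >= -1 with w * exp(w) == x, for x >= -1/e.

    Accuracy degrades for x extremely close to -1/e.
    """
    if x < -1.0 / math.e:
        raise ValueError("W0 is real only for x >= -1/e")
    if x >= 0.0:
        w = math.log1p(x)                # starting point at or above the root
    else:
        p = math.sqrt(max(0.0, 2.0 * (1.0 + math.e * x)))
        if p == 0.0:
            return -1.0
        w = -1.0 + p - p * p / 3.0       # series guess near the branch point
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))   # Newton step on w*e^w - x
        w -= step
        if abs(step) <= tol * max(1.0, abs(w)):
            break
    return w


def cost_per_instruction(y, c, p_fail, restart, checkpoint):
    """Simplified cost per useful instruction (our assumption, see lead-in)."""
    lam = -math.log1p(-p_fail)
    return (checkpoint + (c / p_fail + restart) * math.expm1(lam * y)) / y


def optimal_interval(c, p_fail, restart, checkpoint):
    """Real-valued minimizer of cost_per_instruction, via Lambert W."""
    lam = -math.log1p(-p_fail)
    k = c / p_fail + restart
    u = 1.0 + lambert_w0((checkpoint / k - 1.0) / math.e)
    return u / lam


if __name__ == "__main__":
    print(lambert_w0(1.0))               # the omega constant, about 0.567
    print(optimal_interval(1.0, 1e-4, 10.0, 100.0))
```

Where SciPy is available, `scipy.special.lambertw` can replace the hand-written iteration; the pure-Python version is used here only to keep the sketch free of dependencies.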
3.2. Sensitivity of the Optimum to Energy Consumption and Computation Time
An important question concerns how
varies with changes in the relative importance of the energy expenditure with respect to computation time. To address this issue as a single parameter problem, we will set
, and consider the derivative of
with respect to
. Noting that we can now write
and
, we have:
where we have used the identity:
when
and
. These two conditions will be satisfied because it is unlikely in practice that the system parameters be such that
, furthermore it is impossible that
because
.
Thus we can use the expression (
21) to determine how fast
will vary as a function of
. In particular we have the following very interesting result.
Result : When , then does not depend on the relative weight of the execution time and energy consumption, so that a single value of will minimize the overall cost for and any value of that represents the relative importance of energy consumption to computation time.
4. A Program with a Single Long Loop
In this section, we will apply the previous results to a program with a single long loop of length
L instructions which is executed some number, say
T times, so that
. For this program, we may be constrained to place checkpoints either at the start of a loop so that
with one checkpoint for each
loops, or
n checkpoints may be placed within the loop with
where
, or we set
. We first apply the previous results to compute
:
where:
so that
Let us denote by the integer that is closest to the real number x. Then we compute , and:
If we set ,
If , we set .
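One possible way to turn a real-valued optimum interval into a checkpoint placement for a loop is sketched below in Python. The decision rule is an assumption made for illustration and may differ from the conditions used in this section: if the optimum interval `y_opt` is at least one loop length `loop_len`, a checkpoint is placed every n loops with n the nearest integer to `y_opt / loop_len`; otherwise n checkpoints are placed inside each loop with n the nearest integer to `loop_len / y_opt`. The names `nearest_int` and `place_checkpoints` are hypothetical.

```python
import math


def nearest_int(x):
    """Integer closest to x (halves round up, unlike Python's round())."""
    return math.floor(x + 0.5)


def place_checkpoints(y_opt, loop_len):
    """Map a real-valued optimum interval y_opt > 0 onto a loop of loop_len.

    Returns (mode, n, effective_interval) where mode is either
    "loops_per_checkpoint" (one checkpoint every n loops) or
    "checkpoints_per_loop" (n checkpoints inside each loop).
    """
    if y_opt >= loop_len:
        n = max(1, nearest_int(y_opt / loop_len))
        return ("loops_per_checkpoint", n, n * loop_len)
    n = max(1, nearest_int(loop_len / y_opt))
    return ("checkpoints_per_loop", n, loop_len / n)


if __name__ == "__main__":
    print(place_checkpoints(2400.0, 1000))
    print(place_checkpoints(250.0, 1000))
```

Because the cost is not symmetric around its minimum, a more careful variant would evaluate the cost at the two integers on either side of the real-valued optimum and keep the cheaper one.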
To illustrate these results, numerical examples are provided in order to show the effect of the checkpoint interval n (expressed in terms of the number of loop repetitions between checkpoints) on the expected execution time and the total energy consumption of a software application that operates in the presence of failures. In order to differentiate the effect of computation time and energy consumption, we use to represent the checkpoint interval that minimizes the total computation time, while refers to the optimum checkpoint interval that minimizes the total energy consumption. Note that in the preceding analysis, can be obtained by setting , while is obtained by setting .
These examples consider the case of a program with a single loop in which checkpoints are established at the beginning (or at the end) of the loop. We consider a small, medium, large, and very large program, comprised of
instructions, respectively. The expected execution time of the same program with and without the adoption of the ALCR mechanism is calculated and the corresponding optimization problem is shown numerically. The parameter values that we use are:
In
Figure 1, the example of a small software program (i.e.,
is considered.
Figure 1a compares the expected execution time of the application with and without the ALCR mechanism for different values of
n, while
Figure 1b shows the expected
in terms of expected execution time for different values of
n. The values that correspond to the optimum checkpoint interval
are marked within a rectangle.
Figure 1 illustrates the fact that the optimum checkpoint interval
minimizes the overall execution time of the application and maximizes the overall expected Gain. From
Figure 1 it is clear that the ALCR mechanism will not reduce the expected execution time of a given software application unless the checkpoint interval is optimally selected. Indeed, for some poorly chosen values of
n, the expected execution time of the application with checkpointing is higher than the expected execution time of the same application without checkpoints. For instance, in this example, choosing a very small checkpoint interval (i.e., below 5) actually increases the expected execution time of the software program, compared to the execution time of the same program without checkpointing. This suggests that frequent checkpointing, which enhances the reliability of the software program, may increase the execution time because of the cost of checkpointing itself.
Similar observations can be made for software with longer loops in Figure 2, Figure 3 and Figure 4. This emphasizes the importance of setting n at or close to
, when there is a need for minimizing the execution time of the program.
The examples of Figure 1, Figure 2, Figure 3 and Figure 4 show that a significant reduction in the execution time of a software application can be achieved by the ALCR mechanism, if the checkpoint interval is selected to be at, or close to, the optimum
. In these examples, the Gain ranges from 64% to 80%. However, suboptimal values of the checkpoint interval will lead to a smaller Gain, or even to an average execution time that is larger than when ALCR is not used. The checkpoint interval should therefore not be selected arbitrarily, and must be tuned to a value at, or close to, the optimum
.
Still, there is a relationship between calculations for
and
. However, we must keep in mind that the optimum checkpoint intervals for energy consumption and for execution time will generally differ. Figure 5 shows how they correspond to each other. More specifically, Figure 5a shows how the execution time changes when we use the optimal checkpoint interval calculated for energy consumption. Similarly, Figure 5b shows how the energy consumption changes when we use the checkpoint interval that optimizes execution time.
The numerical example presented in
Figure 5 shows that the checkpoint interval that minimizes the energy consumption does not necessarily minimize the execution time as well and vice versa. In particular, in the given example, setting the value of
n to
will minimize the expected execution time of the software program, but will lead to around half the maximum achievable energy savings. Similarly, setting the value of
n to
will minimize the expected energy consumption of the software program, but will lead to lower than the maximum achievable savings in execution time. Hence, the type of application should also be taken into account when deciding whether to prioritize the execution time or the energy consumption of a given program. It should be noted that the model is highly configurable: the user can define the relative importance of execution time and energy consumption for a given software program by properly setting the
and
parameters of the model (see
Section 3). This enables the calculation of the checkpoint interval that strikes a desired balance between these two quality attributes.
Impact of g and B on the Optimum Checkpoint Interval
The optimum checkpoint interval
is expected to be influenced both by the probability of failure g, and by the cost of checkpointing B. In
Figure 6, the optimum checkpoint interval
is plotted against the probability of failure
g, for three different cases of checkpointing cost B. Four different examples are provided, corresponding to a sample software program of small, medium, large, and very large size. In fact, the same programs that were investigated in Section 4 are considered in this section.
From the different graphs in
Figure 6 and
Figure 7, we notice that the same behavior is observed regarding the impact that the values of B and g have on the optimum checkpoint interval, regardless of program size. Indeed, for a given checkpointing cost B, the higher the probability of failure
g, the lower the optimum checkpoint interval
. This means that for a given checkpointing cost, the higher the probability of failure the more frequently the checkpoints should be generated. This is reasonable since the more frequent the failures are the more frequent the checkpointing should be, in order to reduce the cost incurred by the failure-related re-executions. Conversely, for a specific probability of failure
g, a higher cost of a single checkpoint B
leads to a larger optimum checkpoint interval
. This is also reasonable: for a fixed frequency of failures, the higher the cost of each checkpoint, the less frequently checkpoints should be taken, since frequent checkpointing accumulates checkpoint-related costs.
These observations are highly intuitive since frequent checkpointing should be applied when the probability of failure is high, while checkpoints should be generated less frequently when the checkpointing cost is high. The same observations hold for the case of the optimum checkpoint interval that minimizes the total expected energy consumption of the program.
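These two trends can be reproduced numerically. The Python sketch below uses a simplified cost model of our own (fixed checkpoint cost `checkpoint`, fixed restart cost `restart`, instruction cost `c`, independent per-instruction failure probability `p_fail`), not the exact expressions or parameter values behind Figure 6 and Figure 7, and locates the integer optimum by bisection on the sign of the cost difference between neighbouring intervals. All names and numerical values are hypothetical.

```python
import math


def cost_rate(y, c, p_fail, restart, checkpoint):
    """Cost per useful instruction for an interval of y instructions."""
    lam = -math.log1p(-p_fail)
    return (checkpoint + (c / p_fail + restart) * math.expm1(lam * y)) / y


def best_interval(c, p_fail, restart, checkpoint):
    """Integer minimizer of cost_rate, which has a single minimum in y.

    The search range is capped at about 50 / p_fail instructions, which is
    assumed to be far beyond any sensible optimum.
    """
    lo, hi = 1, max(2, int(50.0 / p_fail))
    while lo < hi:
        mid = (lo + hi) // 2
        if cost_rate(mid, c, p_fail, restart, checkpoint) <= cost_rate(
            mid + 1, c, p_fail, restart, checkpoint
        ):
            hi = mid          # cost no longer decreasing: optimum is at or before mid
        else:
            lo = mid + 1      # cost still decreasing: optimum is after mid
    return lo


if __name__ == "__main__":
    for p_fail in (1e-5, 1e-4, 1e-3):
        row = [best_interval(1.0, p_fail, 10.0, b) for b in (50.0, 100.0, 200.0)]
        print(p_fail, row)
```

In each printed row the optimum grows with the checkpoint cost, and from one row to the next it shrinks as the failure probability increases.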
5. Demonstration through a Real-World Example
In Section 4, we illustrated the effect of the checkpoint interval n (i.e., the number of loop repetitions between consecutive checkpoints) on the expected execution time and energy consumption of a software program that operates in a failure-prone environment through a set of numerical examples. The results of the simulation led us to the observation that the checkpoint interval should be chosen to be at (or close to) its optimum value (computed by our mathematical model) in order to achieve significant gains with respect to execution time or energy consumption and to avoid potential costs that may be caused by assigning arbitrary values to n.
To enhance the completeness of the present work, we also illustrate the effect of the checkpoint interval selection on the computation time and energy consumption of a real-world software program. More specifically, instead of being based on simulated values, we selected a real-world open-source software program with a configurable computational loop and we determined the required model parameters through actual measurements. Then we used our model in order to compute the optimum checkpoint intervals that optimize the execution time and energy consumption of the selected program for different cases of program size (in fact, loop length). We focused on the execution time and energy savings that could be achieved through the selection of the checkpointing interval using the proposed model.
For the purposes of the present experiment, we used the Rodinia Benchmark (https://github.com/yuhc/gpu-rodinia) [56] as the basis of our analysis. The Rodinia Benchmark is a popular benchmark of real-world open-source software programs written in the C and C++ programming languages, which is widely used for benchmarking techniques and mechanisms for software performance and energy optimization. From the different programs that Rodinia contains, we used the streamcluster (https://github.com/yuhc/gpu-rodinia/tree/master/opencl/streamcluster) program as the basis of our example. This program was selected because it contains a highly configurable computational loop, making it suitable for the purposes of our analysis. In fact, by providing the appropriate input, the loop can be made as long as we wish, allowing us to consider different cases of loop length.
To compute the actual parameters that are necessary for the execution of our mathematical model, the Energy Toolbox of the SDK4ED Project was utilized [57,58]. The Energy Toolbox provides measurements of the execution time and energy consumption of a software program at the loop level of granularity, being mainly based on popular profiling tools like Linux Perf (https://perf.wiki.kernel.org/) and Valgrind (http://www.valgrind.org/), as well as on static estimations [57,59]. The provision of loop-level performance and energy measurements made it highly suitable for our case, and was the main reason for its selection. After executing the Energy Toolbox for the selected software program, the following parameters were determined (all measurements were made on an ARM Cortex A57 (Nvidia Jetson TX1) processor):
As already mentioned, since the benchmark is highly configurable, we considered three cases of loop length (in fact, of program size). In particular, we considered the case of a small, medium, and large loop comprising , , and instructions respectively. It should be noted that this characterization is based exclusively on the relative size of the loops that the program contains and it is used to better facilitate the description of the present experiment.
In
Figure 8, the example of the program with the small loop is illustrated (i.e.,
).
Figure 8a compares the expected execution time of the software program with and without checkpointing. Similarly,
Figure 8b compares the expected energy consumption of the selected software program with and without the adoption of the ALCR mechanism. The checkpoint interval that minimizes the expected execution time (
) and the checkpoint interval that minimizes the expected energy consumption (
) are marked within a rectangle in
Figure 8a,b respectively.
Figure 8 shows that important savings in both the expected execution time and energy consumption are achieved for the software program, if the checkpoint interval is selected to be at (or close to) the values of
or
respectively computed by the mathematical model. More specifically, if
n is selected to be equal to
, a
gain in execution time, and a gain of
in energy consumption is obtained when
n is chosen equal to
. It is clear that arbitrary values for the checkpoint interval should be avoided, as they may lead to an excessive increase in execution time and energy consumption, i.e., to additional costs rather than gains.
As can be seen from the given example, if n is set to less than 3 in Figure 8a, the expected execution time of the program will be higher than its expected execution time when checkpointing is not adopted. Similarly, if n is set to a value lower than 8 in Figure 8b, the expected energy consumption of the program will be higher than the expected energy consumption of the same program when checkpointing is not adopted. This indicates that frequent checkpointing may introduce additional costs with respect to execution time and energy consumption. In addition, in both cases, if n is set to a value different (lower or higher) than the optimum values
and
that are computed by our model, lower than the maximum achievable gains in terms of execution time and energy consumption are achieved, leading to omission of important savings. Hence, this suggests that the arbitrary selection of the checkpoint interval should be avoided, as it may lead to omission of important savings or even introduction of additional costs, and, in turn, it verifies that there is a need for a mechanism (model) for recommending the optimum checkpoint interval.
Similar observations can be made for programs with longer loops, as can be seen in Figure 9 and Figure 10. As for the previous case, these examples show that important savings in terms of execution time and energy consumption can be achieved, provided that the checkpoint interval is properly set. Here the maximum execution time savings are
and
, whereas the maximum energy savings are
and
, for the medium-length and long loops, respectively. These examples also show that a poorly chosen value for the checkpoint interval may introduce additional overhead with respect to the execution time and energy consumption of the software program, highlighting the importance of choosing an optimum checkpoint interval. These results for a real program also agree with the “theoretical” conclusions drawn from the numerical examples of Section 4.
Although in our examples it appears that the optimum values for computation time and energy, namely
and
, are relatively close to each other, this will not be generally the case, and depending on various parameters these values can differ significantly. Hence, the end user can decide whether execution time or energy consumption should be prioritized by using the parameters
and
. As mentioned in
Section 3, by carefully setting these parameters, the model can be used in order to compute the optimum checkpoint interval that optimizes the execution time (
and
), energy consumption (
and
), or a weighted combination of those two requirements (
and
). Hence, the mathematical model presented in this paper can be used in practice to satisfy different user needs with respect to energy consumption and execution time of software programs with loops.
6. Conclusions
Checkpoints are widely used to allow a system to recover from failures without having to restart a program’s execution from scratch every time a failure occurs. However, checkpointing may add costs in additional time and energy, even when no failures occur. Thus, we have analyzed the choice of optimum checkpoint intervals in a unified manner from the perspective of energy consumption and execution time. Starting from first principles, we have derived the optimum checkpoint interval for programs with a long-running outer loop. Explicit analytic results have been derived and illustrated with numerical examples. The model was also demonstrated using a real-world software program retrieved from a popular benchmark.
More specifically, in this paper, we have focused on the importance of energy consumption on the appropriate choice of checkpoint intervals for long-running programs that require highly reliable operations. To this effect, we have developed a mathematical model that details the manner in which program execution time and energy consumption interact in a system that is subject to the establishment of regularly spaced checkpoint intervals.
The analysis has been used to determine the optimum number of checkpoints that minimizes either the total average energy consumption, or the total average execution time, or a linear combination of both. The solution to this optimization problem has been shown to relate directly to an expression that includes the classical Lambert function. The sensitivity of the optimum checkpoint interval to variations in all system and checkpointing parameters has also been computed analytically.
The results were then used to derive the optimum checkpointing interval for a program with a long loop, so that checkpoints are installed either within each loop, or at the beginning of some of the loops. Several numerical examples were presented to illustrate the manner in which this approach could be used in a practical setting, for instance, to guide the choices that need to be made with application-level checkpointing and recovery (ALCR). A real-world example using an actual software program retrieved from the Rodinia Benchmark was also presented.
Both the numerical examples and the example that was based on the real-world software program led to some interesting observations. Firstly, in order to achieve important savings (i.e., gains) in terms of execution time and energy consumption, the checkpoint interval should be chosen to be at (or, at least, close to) its optimum value, as reported by our mathematical model. In addition to this, the arbitrary selection of the checkpoint interval should be avoided, as it may lead to lower than the maximum achievable gains in terms of execution time and energy consumption or even to the introduction of additional overheads. This further supports the need for a mechanism (i.e., a model) able to compute the optimum checkpoint interval. Finally, the results of these examples also highlighted the ability of the proposed model to be used in practice for satisfying different user and application needs with respect to execution time and energy consumption through properly setting its parameters. In fact, the proposed model can be used to compute the optimum checkpoint interval that minimizes its execution time, energy consumption, or a combination of those requirements.
Future work will consider nested program structures, and ways of linking checkpointing and program structure in a useful manner, similar to what is done in this paper for programs with a large single loop. The impact of multiple programs running on the same platform also needs to be considered. Indeed the ALCR approach deals with each program singly, while the checkpoint for each program dilates the execution time and energy consumption of each individual program, and by extension of the collection of programs, which share the same platform.