1. Introduction
Most embedded systems are real-time systems, for which timing constraints are among the key design requirements. One of the most important timing constraints in real-time systems is that executions must complete within their corresponding deadlines [1]. Therefore, the correctness of a real-time system depends not only on the correctness of its logical and functional output but also on the time at which that output is generated. One class of real-time systems with strict timing constraints is hard real-time systems. If the timing constraints of a hard real-time system are not met, the consequences can be life-threatening or cause serious economic losses. Therefore, hard real-time system designers must ensure that all timing constraints are met before the target system is actually implemented.
In addition to meeting the timing constraints of a hard real-time system, the system's functional correctness should be guaranteed [2]. An output produced in a timely manner is not reliable if the system deviates from its specified functional behavior, and such a deviation occurs when the system is faulty. Therefore, hard real-time systems must meet all timing constraints of their tasks while operating without faults. Timing constraints in real-time applications can be met through proper job scheduling, and fault tolerance can be achieved by attaining the required level of reliability.
Although many fault-tolerant techniques have previously been implemented in hardware [3], software-based techniques such as rollback, checkpointing, and re-execution have recently been proposed; these were initially designed for uniprocessor systems [4,5,6]. Checkpointing with rollback stores the state of the system in stable storage at each checkpoint and restores the system to the latest checkpoint if a transient error is detected [4]. The re-execution technique runs a job multiple times and selects the correct output obtained during the multiple runs [5,6]. If all outputs of the multiple runs are incorrect, the job is re-executed to improve system reliability. Because tasks execute multiple times under the re-execution technique, their prolonged execution times increase the possibility of deadline misses. Therefore, existing studies on the re-execution technique aim to improve the system reliability of mixed-criticality systems or energy-sensitive real-time systems; however, the schedulability of the system has inevitably been compromised.
In multiprocessor domains, errors can be tolerated by taking advantage of the availability of multiple processors [7,8,9]. The most common approach is the primary-backup approach, in which the backup of a task executes if its primary does not execute successfully [7]. The backup overloading approach schedules backup copies of primary jobs in a time-overlapping manner for operational efficiency [8]. Another efficient overloading algorithm for multiple processors has been proposed through dynamic logical grouping of task copies [9]. As a hardware approach, Cirinei et al. proposed a dynamic reconfiguration of a multiprocessor hardware platform that balances performance and fault tolerance through concurrent replication [10]. N-modular redundancy (NMR) executes N identical copies of each task simultaneously on a multiprocessor platform, and a single correct output (if any) is selected by voting. Because this technique produces N identical copies of each job that execute in parallel, some tasks may miss their deadlines owing to the increased computing demand required to complete their execution [4,5,6].
Although NMR is an effective approach for making a real-time system tolerant of transient faults, it can compromise the schedulability of the target system while improving its reliability [4]. This stems from a limitation of NMR: it forces the same number N of copies on all tasks, where N is determined without schedulability analysis [4,6].
In this study, we propose a task-level N-modular redundancy (TL-NMR) technique that improves the reliability of a target system whose tasks are scheduled by any fixed-priority (FP) scheduling algorithm, without schedulability loss. The TL-NMR framework determines the number of copies of each job of a task individually (rather than using the same number of copies for all tasks) while ensuring that every copied job completes its execution before its absolute deadline, by means of a new response-time analysis (RTA) proposed in this paper. The copies of each job then execute simultaneously on multiple processors under the given FP scheduler, and a single correct output (if any) is selected by voting. The experimental results demonstrate that TL-NMR maintains schedulability while significantly improving the average system safety compared with the existing NMR.
The remainder of this paper is organized as follows. Section 2 presents our system model, including the task and reliability models. Section 3 introduces our proposed fault-tolerant scheduling framework, TL-NMR. Section 4 presents a new RTA for TL-NMR. Section 5 evaluates TL-NMR with various performance metrics. Section 6 concludes this study.
3. TL-NMR Framework
In this section, we propose TL-NMR, which improves the reliability of a target system whose tasks are scheduled by any FP scheduling algorithm, without schedulability loss. The TL-NMR framework determines $N_i$, the individual number of copies of each job of task $\tau_i$, by using a new RTA proposed in this paper. The copies of each job then execute simultaneously on multiple processors under the given FP scheduler, and a single correct output (if any) is selected by voting.
Because a real-time scheduling algorithm that exploits the existing NMR technique schedules N copies of each job, it requires N times the execution demand of each job (i.e., $N \cdot C_i$ for each job of $\tau_i$). Thus, some jobs that meet their absolute deadlines under vanilla FP scheduling (i.e., FP scheduling that does not incorporate NMR) may miss their absolute deadlines owing to the increased execution demand.
Figure 1 illustrates a scenario in which a task set that is schedulable under vanilla RM scheduling (which assigns a higher priority to a task with a smaller period) becomes unschedulable owing to NMR. Two of the tasks have the same priority because they have the same period; we assume that the one with the smaller task index has the higher priority, according to a simple tie-breaking rule. As shown in Figure 1a, the jobs of the three tasks $\tau_1$, $\tau_2$, and $\tau_3$ are schedulable under vanilla RM on $m = 3$ processors. When RM scheduling exploits NMR with $N = 2$, two identical copies of each job are scheduled (Figure 1b). The first copy of one of the jobs starts its execution, is preempted by higher-priority jobs, and then completes its execution. However, the second copy of the same job cannot begin its execution until later because all three processors are occupied by higher-priority jobs. As a result, it cannot execute for its full worst-case execution time before its absolute deadline, and it misses that deadline.
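The effect described above can be reproduced with a small discrete-time simulation. The following is a minimal sketch, not the authors' tool. It assumes integer time, implicit deadlines ($D_i = T_i$), synchronous release at time 0, global preemptive RM with the tie-breaking rule above, and copies of the same job ordered by copy index. The task parameters and the names `Task` and `simulate_global_rm` are hypothetical and do not correspond to Figure 1.

```python
from dataclasses import dataclass
from math import lcm
from typing import List, Sequence


@dataclass(frozen=True)
class Task:
    wcet: int    # C_i in integer time units
    period: int  # T_i; an implicit deadline D_i = T_i is assumed


def simulate_global_rm(tasks: Sequence[Task], copies: Sequence[int], m: int) -> bool:
    """Simulate one hyperperiod of global preemptive RM with copies[i] copies per job.

    Returns True if every copied job finishes by its absolute deadline.
    """
    # RM priority: smaller period first; ties broken by the smaller task index.
    order = sorted(range(len(tasks)), key=lambda i: (tasks[i].period, i))
    rank = {i: r for r, i in enumerate(order)}
    horizon = lcm(*(task.period for task in tasks))
    # Each ready entry: [(task rank, copy index), absolute deadline, remaining time].
    ready: List[list] = []
    for now in range(horizon):
        # A copied job that is still pending at its absolute deadline has missed it.
        if any(job[1] <= now for job in ready):
            return False
        # Synchronous periodic releases; all copies share release time and deadline.
        for i, task in enumerate(tasks):
            if now % task.period == 0:
                for p in range(copies[i]):
                    ready.append([(rank[i], p), now + task.period, task.wcet])
        # The m highest-priority ready copies each execute for one time unit.
        ready.sort(key=lambda job: job[0])
        for job in ready[:m]:
            job[2] -= 1
        ready = [job for job in ready if job[2] > 0]
    # With synchronous release and D_i = T_i, no deadline lies beyond the hyperperiod.
    return not ready


# Hypothetical task set (C_i, T_i), not the one shown in Figure 1.
TASKS = [Task(2, 4), Task(2, 4), Task(3, 6)]
print(simulate_global_rm(TASKS, [1, 1, 1], m=3))  # True: vanilla RM
print(simulate_global_rm(TASKS, [2, 2, 2], m=3))  # False: every job duplicated
print(simulate_global_rm(TASKS, [1, 1, 2], m=3))  # True: only the third task duplicated
```

With these hypothetical parameters on three processors, the set is schedulable with one copy per job and misses a deadline when every job is duplicated. It becomes schedulable again when only the lowest-priority task is duplicated, which suggests why choosing the number of copies per task can pay off.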
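We can easily observe that the task set in Figure 1b becomes schedulable if one copied job of any task is not considered for scheduling (e.g., if the number of copies of any single task is reduced to one). TL-NMR can assign an individual $N_i$ to each task $\tau_i$ without making a schedulable task set unschedulable under any FP scheduling. To achieve this goal, we need to address the following two questions. Question 1: How should $N_i$ be determined for each task $\tau_i$? Question 2: How can it be guaranteed that no job misses its deadline when $N_i$ identical copies of each job of $\tau_i$ are scheduled by the given FP scheduling algorithm?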
To address question 1, TL-NMR assigns the value of $N_i$ using an $N_i$-assignment algorithm in conjunction with a new RTA for TL-NMR, such that $N_i$ of every task never exceeds $m$. With $N_i$ determined for every task $\tau_i$, TL-NMR makes $N_i$ identical copies of each job of $\tau_i$ and schedules the task set according to the given FP scheduling algorithm. To address question 2, we develop a new RTA for TL-NMR that guarantees that no deadline is missed while $N_i$ identical copies of each job are scheduled by the given FP scheduling algorithm.
Algorithm 1 presents the operation of TL-NMR. Initially, $N_i$ is set to one for every task $\tau_i$ (Line 1). Then, $N_i$ of each task is determined (Lines 2–8): for every task $\tau_i$ in the task set $\tau$, schedulability is tested with $N_i = n$ using RTA for TL-NMR, and $n$ is assigned to $N_i$ if $\tau$ is deemed schedulable by the test (Lines 4–6). This procedure is repeated $m$ times (Lines 2–8) because TL-NMR allows at most $m$ copies of each job on an $m$-processor platform. Thereafter, the given task set is scheduled (Lines 9–17). At every time instant $t$, a job $J_i$ of a task $\tau_i$ is inserted into a ready queue Q whenever $J_i$ is released (Lines 9–11). The released jobs in Q are scheduled according to the base FP scheduling algorithm (Line 13). Finally, $J_i$ is removed from Q when all of its copied jobs have completed their execution (Lines 14–16).
Algorithm 1 TL-NMR
1: $N_i \gets 1$ for all tasks $\tau_i \in \tau$
2: for $n$ from 1 to $m$ do
3:     for $\tau_i$ in $\tau$ do
4:         if $\tau$ is deemed schedulable with $N_i = n$ by the new RTA then
5:             $N_i \gets n$
6:         end if
7:     end for
8: end for
9: for every time instant $t$ do
10:     if $J_i$ is released by $\tau_i$ then
11:         Insert $J_i$ into Q
12:     end if
13:     Schedule the jobs in Q ($N_i$ copied jobs for each job $J_i$) according to the given FP scheduling algorithm
14:     if all copied jobs of $J_i$ finish their execution then
15:         Delete $J_i$ from Q
16:     end if
17: end for
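Lines 1–8 of Algorithm 1 can be sketched in Python as follows. This minimal sketch assumes that tasks are indexed in the order in which Algorithm 1 visits them and that the schedulability test is passed in as a function. The names `assign_copies` and `is_schedulable` are hypothetical. The placeholder test in the usage example only checks total demand against $m$; it is not a sufficient schedulability test and is not the RTA of Section 4.

```python
from typing import Callable, List, Sequence


def assign_copies(num_tasks: int, m: int,
                  is_schedulable: Callable[[Sequence[int]], bool]) -> List[int]:
    """Return N_i for every task, following Lines 1-8 of Algorithm 1.

    is_schedulable(copies) must return True if the whole task set is deemed
    schedulable when task i runs copies[i] copies of each of its jobs.
    """
    copies = [1] * num_tasks            # Line 1: N_i <- 1 for all tasks
    for n in range(1, m + 1):           # Line 2: at most m copies per job
        for i in range(num_tasks):      # Line 3: visit every task
            trial = list(copies)
            trial[i] = n
            if is_schedulable(trial):   # Line 4: test the set with N_i = n
                copies[i] = n           # Line 5: keep the larger N_i
    return copies


# Placeholder test for illustration only (NOT a sufficient schedulability test).
utilizations = [0.5, 0.5, 0.5]


def demand_fits(copies: Sequence[int], m: int = 2) -> bool:
    return sum(n * u for n, u in zip(copies, utilizations)) <= m


print(assign_copies(len(utilizations), 2, demand_fits))  # [2, 1, 1]
```

Because a trial assignment is kept only if the test accepts it, the set is never moved from an accepted configuration to a rejected one. This mirrors the text's claim that a schedulable task set stays schedulable, provided the supplied test is sound.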
4. New RTA for TL-NMR
Because our goal is to improve reliability while guaranteeing schedulability under TL-NMR, we must be able to judge whether the task set is schedulable with the given value of $N_i$ for every task $\tau_i$ as $N_i$ is assigned by TL-NMR. Hence, we develop a new RTA that can be incorporated into TL-NMR.
The response time of a task $\tau_k$ can be upper-bounded by the sum of its worst-case execution time $C_k$ and the worst-case amount of time during which the execution of a job of $\tau_k$ is hindered. Because $C_k$ is given, RTA for TL-NMR upper-bounds the latter by exploiting the notion of interference in an interval $[r_k, r_k + \ell)$, where $r_k$ is the release time of the job of $\tau_k$ under analysis and $\ell$ is limited to $D_k$.
The interference $I_k(\ell)$ on a copied job $J_k^p$ (the $p$-th copy of a job $J_k$ of $\tau_k$) in the interval $[r_k, r_k + \ell)$ is defined as the cumulative length of all the intervals in which $J_k^p$ is ready to be executed but cannot be scheduled on any processor because $m$ higher-priority jobs are executing [12].
When interference occurs on $J_k^p$ at a certain time instant, at least $m$ higher-priority jobs exist that hinder the execution of $J_k^p$. Thus, to upper-bound $I_k(\ell)$, we need to calculate how much the execution of the higher-priority jobs contributes to the interference on $J_k^p$. An important property of schedules under TL-NMR is that the execution of a copied job $J_k^p$ can be hindered not only by jobs of other tasks $\tau_i$ but also by the other copied jobs of $J_k$, which share the same release time $r_k$ and absolute deadline $d_k$. For example, in Figure 1b, a copied job cannot execute in the interval [2, 4) because three other jobs, including another copied job of the same job, occupy all three processors. Hence, we first let $\tau_k^p$ denote the set of the $p$-th copied jobs of $\tau_k$, to which the job of interest $J_k^p$ belongs. Then, we define the interference $I_{i \to k}(\ell)$ of a higher-priority task $\tau_i$ on $J_k^p$ and the interference $I_{q \to p}(\ell)$ of another copied job $J_k^q$ of $J_k$ (i.e., a copy other than the job of interest $J_k^p$) on $J_k^p$ in an interval $[r_k, r_k + \ell)$.
The interference $I_{i \to k}(\ell)$ of $\tau_i$ on $J_k^p$ in the interval $[r_k, r_k + \ell)$ is defined as the cumulative length of all the intervals in which $J_k^p$ is ready to be executed but cannot be scheduled on any processor while jobs of $\tau_i$ execute.
The interference $I_{q \to p}(\ell)$ of another copied job $J_k^q$ (i.e., a copy other than the job of interest $J_k^p$) of $J_k$ on $J_k^p$ in the interval $[r_k, r_k + \ell)$ is defined as the cumulative length of all the intervals in which $J_k^p$ is ready to be executed but cannot be scheduled on any processor while $J_k^q$ executes.
To upper-bound $I_{i \to k}(\ell)$, RTA for TL-NMR uses the notion of the workload $W_i(\ell)$ of $\tau_i$ in an interval of length $\ell$, which is defined as the amount of computation time required by $\tau_i$ in that interval [13]. The upper part of Figure 2 illustrates the scenario in which the workload of a task $\tau_i$ is maximized under preemptive work-conserving TL-NMR scheduling. In the upper part of Figure 2, the first job of $\tau_i$ starts its execution at the beginning of the interval and completes it $C_i$ time units later, executing for $C_i$ time units without any interference or delay. Thereafter, the following jobs (i.e., the second and third jobs of $\tau_i$) are released and scheduled in turn. The upper bound on $W_i(\ell)$ is obtained by counting the jobs that execute for their full $C_i$ within the interval, adding the portion of $C_i$ executed by the remaining job, and multiplying the sum by the number of copies $N_i$, because all $N_i$ copies of each job of $\tau_i$ execute.
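For concreteness, one way to write such a workload bound is sketched below. This is an assumption, not a reproduction of the paper's equation. It takes the carry-in workload bound of Bertogna and Cirinei for global multiprocessor scheduling, which matches the scenario of Figure 2 (the first job executes for $C_i$ at the start of the interval and later jobs execute as soon as they are released), and scales it by the number of copies $N_i$. $F_i(\ell)$ is a hypothetical name for the number of jobs that execute for their full $C_i$.

$$
F_i(\ell) = \left\lfloor \frac{\ell + D_i - C_i}{T_i} \right\rfloor,
\qquad
W_i(\ell) = N_i \Big( F_i(\ell)\, C_i + \min\big(C_i,\; \ell + D_i - C_i - F_i(\ell)\, T_i\big) \Big)
$$

For example, with $C_i = 1$, $T_i = D_i = 4$, $N_i = 2$, and $\ell = 5$, this gives $F_i(5) = \lfloor 8/4 \rfloor = 2$ and $W_i(5) = 2\,(2 + \min(1, 0)) = 4$.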
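Because the other copied jobs of $J_k$ share the same release time $r_k$ and absolute deadline $d_k$ with $J_k^p$, $\ell$ is limited to $D_k$. The interference $I_{q \to p}(\ell)$ of another copied job $J_k^q$ of $J_k$ on $J_k^p$ in the interval $[r_k, r_k + \ell)$ can be upper-bounded by the execution time of $J_k^q$, i.e., $C_k$, as shown in the lower part of Figure 2. As seen in Figure 2, both the workload of each higher-priority task and the execution of the other copied jobs of $J_k$ can contribute to $I_k(\ell)$. In addition, $J_k^p$ is interfered with at a time instant only if $m$ other higher-priority jobs execute at that instant. Therefore, $I_k(\ell)$ is upper-bounded by the sum of $W_i(\ell)$ over all higher-priority tasks $\tau_i$ and the bounds on $I_{q \to p}(\ell)$ over the other copied jobs of $J_k$, divided by $m$, as given by Equation (7). Therefore, if $C_k + I_k(\ell)$ is not larger than $\ell$, $J_k^p$ can finish its execution at or before $r_k + \ell$, where $\ell$ is limited to $D_k$.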
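The reasoning above can be written as the following sketch. It is not a reproduction of Equation (7). It assumes that each of the $N_k - 1$ other copied jobs of $J_k$ executes for at most $C_k$, and it uses $hp(k)$, a hypothetical name, for the set of tasks with higher priority than $\tau_k$.

$$
I_k(\ell) \;\le\; \frac{1}{m}\left( \sum_{\tau_i \in hp(k)} W_i(\ell) \;+\; (N_k - 1)\, C_k \right)
$$

The division by $m$ reflects that, in every time unit counted in $I_k(\ell)$, all $m$ processors execute higher-priority work, either of tasks in $hp(k)$ or of other copies of $J_k$.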
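Although (7) safely upper-bounds $I_k(\ell)$, the value derived by (7) is highly overestimated because it includes the execution of higher-priority jobs and of the other copied jobs that can be performed in parallel with $J_k^p$. As seen in Figure 3, a portion of the execution of a higher-priority task $\tau_i$ is performed in parallel with the job of interest $J_k^p$, and this portion cannot contribute to $I_k(\ell)$. The same phenomenon also occurs with the other higher-priority tasks or with the other copied jobs of $J_k$ if their execution in the interval is large enough to overlap with that of $J_k^p$. Thus, RTA for TL-NMR limits the amount of $W_i(\ell)$ of each $\tau_i$ (and, similarly, the execution of each other copied job of $J_k$) that can potentially contribute to $I_{i \to k}(\ell)$ and $I_{q \to p}(\ell)$. This yields an upper bound on $I_k(\ell)$ that is tighter than (7), given by Equation (8).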
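A sketch of such a tightened bound follows, written in the style of Bertogna and Cirinei's response-time analysis with integer time; it is an assumption, not necessarily the paper's Equation (8). Each contribution is capped at $\ell - C_k + 1$: $J_k^p$ executes for $C_k$ time units within the interval, so work running in parallel with it cannot count as interference. The $+1$ term and the floor follow that analysis' integer-time formulation, and $hp(k)$ is a hypothetical name for the set of higher-priority tasks.

$$
I_k(\ell) \;\le\; \left\lfloor \frac{1}{m}\left( \sum_{\tau_i \in hp(k)} \min\big(W_i(\ell),\; \ell - C_k + 1\big) \;+\; (N_k - 1)\,\min\big(C_k,\; \ell - C_k + 1\big) \right) \right\rfloor
$$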
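Based on this reasoning, RTA for TL-NMR tests the schedulability of $\tau_k$ as follows.
Theorem 1. A task $\tau_k$ is schedulable under TL-NMR if a copied job $J_k^p$ of $\tau_k$ satisfies the following inequality for some $\ell \le D_k$, where $I_k(\ell)$ is the upper bound given by Equation (8):
$$C_k + I_k(\ell) \le \ell. \qquad (9)$$
Proof. Suppose that $J_k^p$ cannot complete its execution before its absolute deadline $d_k = r_k + D_k$ even though Equation (9) holds. By the definition of $I_k(\ell)$, the execution of $J_k^p$ is hindered by higher-priority jobs in $[r_k, r_k + \ell)$ for at most $I_k(\ell)$ time units. In addition, $J_k^p$ requires at most $C_k$ time units to complete its execution. Thus, $J_k^p$ completes its execution within $C_k + I_k(\ell)$ time units after its release. By Equation (9), $C_k + I_k(\ell) \le \ell$ holds, and $\ell$ is less than or equal to $D_k$. Therefore, $J_k^p$ completes its execution by $d_k$, which contradicts the supposition. □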
The remaining issue is to find a value of $\ell$ and the corresponding upper bound on $I_k(\ell)$. RTA for TL-NMR works as follows. Initially, $\ell$ is set to $C_k$, and RTA tests whether inequality (9) holds. If it holds, $\tau_k$ is deemed schedulable. Otherwise, RTA sets $\ell$ to the current value of the left-hand side of the inequality, $C_k + I_k(\ell)$, and repeats the test until either the inequality holds or $\ell > D_k$; the latter means that $\tau_k$ is deemed unschedulable. When the inequality holds, $\tau_k$ is deemed schedulable, and the value of $\ell$ satisfying it is taken as the response time $R_k$ of $\tau_k$, i.e., $R_k \le D_k$ holds.
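The iteration just described can be sketched as follows. This is a minimal sketch, not the paper's implementation, and it rests on several assumptions:

- integer time and constrained deadlines ($D_i \le T_i$);
- tasks listed in decreasing priority order;
- higher-priority tasks meeting their deadlines, so that $D_i$ can be used inside the workload bound;
- the carry-in workload bound of Bertogna and Cirinei scaled by $N_i$;
- each contribution to interference capped at $\ell - C_k + 1$.

These choices stand in for Equations (7)–(9) and may differ from them. All function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Task:
    wcet: int        # C_i
    period: int      # T_i
    deadline: int    # D_i (constrained deadline, D_i <= T_i)
    copies: int = 1  # N_i


def workload(task: Task, length: int) -> int:
    """Carry-in workload bound of `task` in a window of `length`, times N_i."""
    span = length + task.deadline - task.wcet
    full_jobs = span // task.period                            # jobs executing all of C_i
    partial = min(task.wcet, span - full_jobs * task.period)   # last, partly executed job
    return task.copies * (full_jobs * task.wcet + partial)


def interference_bound(tasks: Sequence[Task], k: int, length: int, m: int) -> int:
    """Upper bound on I_k(length); tasks[:k] are the higher-priority tasks."""
    me = tasks[k]
    cap = length - me.wcet + 1  # work running in parallel with J_k^p cannot interfere
    total = sum(min(workload(t, length), cap) for t in tasks[:k])
    total += (me.copies - 1) * min(me.wcet, cap)  # the other copied jobs of J_k
    return total // m  # interference requires all m processors to be busy


def response_time(tasks: Sequence[Task], k: int, m: int) -> Optional[int]:
    """Iterate on l; return R_k, or None if tau_k is deemed unschedulable."""
    me = tasks[k]
    length = me.wcet                          # start with l = C_k
    while length <= me.deadline:
        lhs = me.wcet + interference_bound(tasks, k, length, m)
        if lhs <= length:                     # C_k + I_k(l) <= l holds
            return length
        length = lhs                          # retry with l = previous left-hand side
    return None


def is_schedulable(tasks: Sequence[Task], m: int) -> bool:
    return all(response_time(tasks, k, m) is not None for k in range(len(tasks)))


def with_copies(tasks: Sequence[Task], copies: Sequence[int]) -> List[Task]:
    """Return the tasks with N_i taken from `copies` (hypothetical helper)."""
    return [replace(t, copies=n) for t, n in zip(tasks, copies)]


BASE = [Task(1, 4, 4), Task(1, 4, 4), Task(2, 6, 6)]
print(response_time(BASE, 2, m=2))                            # 4
print(response_time(with_copies(BASE, [1, 1, 2]), 2, m=2))    # 5
print(is_schedulable(with_copies(BASE, [2, 2, 2]), m=2))      # False
```

Because $\ell$ strictly increases whenever the inequality fails, the loop terminates either at a fixed point or once $\ell$ exceeds $D_k$. `is_schedulable` can serve as the test passed to an $N_i$-assignment loop such as Lines 1–8 of Algorithm 1.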
Figure 4 illustrates how RTA judges the schedulability of a task scheduled by TL-NMR.
5. Results
In this section, we evaluate the performance of the proposed TL-NMR compared with that of the existing fault-tolerant techniques. As performance metrics, we measure the number of randomly generated task sets that are deemed schedulable (called the schedulable ratio) and the average system safety of the considered task sets (called the average system safety). Task sets for our evaluation are randomly generated based on a popular task set generation method that has been used in a number of existing studies on real-time scheduling [14,15,16]. The task set generation method used in our study has two parameters: the number of processors $m$ and the input parameter (bimodal or exponential) of the distribution of individual task utilization ($u_i = C_i / T_i$). If the bimodal parameter $p$ is given, a value of $u_i$ is selected uniformly in [0, 0.5) with probability $p$ and in [0.5, 1) with probability $1 - p$, according to the bimodal distribution. When the exponential parameter is given instead, the value is selected according to an exponential distribution with that parameter. Then, the parameters of a task $\tau_i$ in a task set are determined as follows: $T_i$ is chosen uniformly at random within a given range, $C_i$ is determined from the utilization drawn with the bimodal or exponential parameter and the already determined $T_i$, and $D_i$ is chosen uniformly at random within a given range. Ten thousand task sets are randomly generated for each value of $m$.
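The per-task generation step can be sketched as follows. This sketch assumes integer time and several values that the text does not give:

- periods drawn uniformly from $[1, t_{\max}]$ with a hypothetical $t_{\max} = 1000$;
- $C_i = \lfloor u_i T_i \rfloor$, but at least 1;
- $D_i$ drawn uniformly from $[C_i, T_i]$;
- an exponential density $(1/\mu)\,e^{-x/\mu}$ with redraws until $u_i < 1$.

The fixed task count `n_tasks` is also a simplification, because the method of [14,15,16] additionally decides how many tasks each set contains. All names are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    wcet: int      # C_i
    period: int    # T_i
    deadline: int  # D_i


def draw_utilization(rng: random.Random, bimodal_p: Optional[float] = None,
                     exp_mean: Optional[float] = None) -> float:
    """Draw u_i = C_i / T_i from a bimodal or an exponential distribution."""
    if bimodal_p is not None:
        # Light task in [0, 0.5) with probability p, heavy task in [0.5, 1) otherwise.
        if rng.random() < bimodal_p:
            return 0.5 * rng.random()
        return 0.5 + 0.5 * rng.random()
    if exp_mean is not None:
        # Assumed density (1/mean) * exp(-x/mean); redraw until u < 1.
        while True:
            u = rng.expovariate(1.0 / exp_mean)
            if u < 1.0:
                return u
    raise ValueError("give either bimodal_p or exp_mean")


def draw_task_set(rng: random.Random, n_tasks: int, t_max: int = 1000,
                  **dist: float) -> List[Task]:
    tasks = []
    for _ in range(n_tasks):
        period = rng.randint(1, t_max)            # T_i uniform in [1, t_max]
        u = draw_utilization(rng, **dist)
        wcet = max(1, math.floor(u * period))     # integer C_i, at least one unit
        deadline = rng.randint(wcet, period)      # D_i uniform in [C_i, T_i]
        tasks.append(Task(wcet, period, deadline))
    return tasks


rng = random.Random(42)
for task in draw_task_set(rng, 3, bimodal_p=0.3):
    print(task)
```

Seeding a dedicated `random.Random` instance keeps each generated task set reproducible, which helps when the same sets are fed to several schedulers, as in the comparison below.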
Next, we discuss the performance (i.e., schedulable ratio and average system safety) and properties of the following techniques:
TL-NMR-RM: RM scheduling incorporating the proposed TL-NMR discussed in Section 3; and
xMR-RM: RM scheduling incorporating NMR, in which x identical copies of each job are executed to improve reliability, as proposed in existing studies [4,5,6] (e.g., 3MR-RM denotes the scheduler in which three identical copies of each job are executed under RM scheduling).
Please note that although we also conducted experiments for other well-performing scheduling algorithms, such as earliest quasi-deadline first and deadline monotonic, we do not explicitly discuss them because they show trends similar to that of RM scheduling. 1MR-RM is equivalent to RM scheduling without NMR (i.e., vanilla RM), because a single copy of each job of each task is executed.
Figure 5 presents the experimental results regarding the schedulable ratio and average system safety of the considered techniques. Figure 5a,b shows the schedulable ratio of the considered techniques for varying task set utilization under two different numbers of processors $m$, respectively. TOT represents the number of generated task sets at each task set utilization shown on the horizontal axis; for example, approximately 430 of the generated task sets have a task set utilization of about 1.5 (represented by the z-axis). Furthermore, Figure 5c–f plots the average system safety of the considered techniques under the considered parameter settings.
From the figures, we make the following observations:
TL-NMR-RM maintains the same schedulable ratio (i.e., the number of task sets deemed schedulable) as 1MR-RM for both values of $m$ considered, including $m = 16$ (Figure 5a,b);
TL-NMR-RM improves the average system safety over 1MR-RM for all task set utilizations, whereas a higher number of job copies in NMR decreases both the schedulable ratio and the average system safety (Figure 5c–f); and
TL-NMR-RM outperforms 1MR-RM by a larger margin for higher arrival rates of transient errors (Figure 5c–f).
Observation 1 is due to the key property of TL-NMR: it improves system reliability without schedulability loss by determining the individual number of copies of each task (in conjunction with the RTA proposed in Section 4) according to Algorithm 1. Thus, a task set that is schedulable under 1MR-RM (i.e., vanilla RM) never becomes unschedulable under TL-NMR-RM.
Observation 2 is a natural consequence of Observation 1: system reliability is enhanced owing to the increased $N_i$ (as Equation (2) indicates), while schedulability under TL-NMR-RM is unchanged compared with 1MR-RM. However, the average system safety decreases for a higher number of job copies in NMR. This counterintuitive phenomenon occurs because schedulability affects the average system safety more strongly than the increase in system reliability does. Recall that the system safety is zero when a task set is unschedulable. System reliability inevitably improves if $N_i$ of each task increases, but schedulability is not guaranteed when such an increase is made without schedulability analysis such as RTA. Thus, a higher number of copies in NMR degrades the average system safety by compromising schedulability, even though it may improve reliability.
Observation 3 highlights an advantage of TL-NMR: the improvement in system reliability of TL-NMR-RM over 1MR-RM grows as the arrival rate of transient errors increases. As Equation (2) indicates, the system reliability decreases as the arrival rate of transient errors increases but increases with $N_i$. Thus, the average system safety under 1MR-RM (and the other xMR-RM variants) decreases sharply as the arrival rate of transient errors increases, whereas TL-NMR-RM compensates for this degradation by increasing $N_i$ without schedulability loss.