1. Introduction
Modern computing continues to evolve with increasing complexity in both hardware and software. More applications of different types are concurrently executed on platforms featuring an increasing number and type of parallel computing resources (cores) [
1,
2]. The advantages are clear, as parallel computing can help delay the potential saturation of Moore’s Law and better use the performance and energy efficiency opportunities provided by technology scaling [
3,
4]. However, managing resources in this complex space for energy efficiency is proving highly challenging, especially when different application scenarios (single or concurrent) need to be taken into account [
5,
6].
Contemporary processors, such as those from Arm and Intel, feature dynamic voltage frequency scaling (DVFS) as a means of handling the energy and performance trade-off [
7,
8]. Power governors enable DVFS at the system software level. For instance, Linux includes different power governors that can be activated based on the system requirements. These include
powersave for low-power, low-performance mode,
ondemand for performance-sensitive DVFS,
performance for higher performance, and
userspace for user-specified DVFS. These governors attempt to suitably tune the voltage/frequency pairs according to performance and energy requirements and workload variations. The voltage/frequency can be tuned to just satisfy the performance requirements according to the workload, but not more, in order to reduce energy consumption.
Performance requirements continue to increase, making DVFS alone less effective [
9]. As a result, DVFS is often coupled with task mapping (TM), which distributes workloads among multiple cores [
10]. When satisfying the same performance requirement, using more cores means that each core has a lighter load and aggressive DVFS can be applied to reduce the overall energy consumption. On the other hand, in order to achieve such energy efficiency, it is crucial to understand the synergy between hardware and software [
11].
Core allocations to threads (TM) are usually handled by a scheduler, instead of the governor which takes care of DVFS [
12]. A typical Linux scheduler performs load balancing, i.e., it distributes the overall workload at any time across all available cores to achieve maximum utilization. Although this objective is rational, the implementation tends to be crude. For instance, the scheduler usually does not discriminate between types of task or thread, such as CPU- or memory-intensive [
12]. Given particular performance requirements, different types of threads should be treated differently for performance and energy optimization. Indiscriminate treatment may lead to sub-optimal energy efficiency [
13].
A number of approaches have been reported on the research of using DVFS and TM synergistically for energy-efficient multi-core computing [
13]. These approaches broadly fit into two types: offline (OL) and run-time (RT). In OL approaches, the system is extensively analyzed to derive energy and performance models [
13,
14]; run-time decisions are then based on these models, which stay constant. In RT approaches, the models are typically learned from monitored information [
15,
16]. Since RT modeling is costly in terms of resources, a combination of OL and RT is often used [
13].
Section 2 provides a brief review of these approaches. A recurring theme in these approaches is that the focus is primarily on single-application workloads in isolation. However, the same application can exhibit different energy/performance trade-offs depending on whether it is running alone or concurrently with other, different workloads. This is because: (a) a workload application may switch between memory- and CPU-intensive routines, and (b) architectural sharing between applications affects the energy/performance trade-offs (see
Section 7.1.2).
Table 1 summarizes the features and limitations of existing approaches.
Tackling energy efficiency in concurrent applications considering the workload behavior changes highlighted above is non-trivial. When mapping onto heterogeneous multi-core systems, this becomes more challenging because the state space is large and each application requires different optimization. The hardware state space of a multi-core heterogeneous system includes all possible core allocations and DVFS combinations. Here, we discuss this using the scenario of multiple parallel applications running on one of the example experimental platforms used in this paper, the Odroid XU3 (detailed in
Section 4.2), which has two types of CPU cores, A7 and A15, organized into two DVFS domains, as a motivational example. Here,
is the total number of concurrent applications running on the system;
is the number of A7 cores used and
is the number of A15 cores used.
Table 2 shows the number of possible core allocations for a total
number of applications running on the Odroid with
= 3 and
= 4. The “brute force” value represents
combinations, not all of which are actually allowed considering the following rules: (1) each application must have at least one thread; and (2) no more than one thread per core is allowed. However, there is no simple algorithm to iterate through only valid core allocations and an explosion of the search state space is inevitable. The number of possible core allocations is then multiplied by the number of DVFS combinations, which is calculated as
, where
is the number of DVFS points in the A7 domain, and
is the number of DVFS points in the A15 domain.
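The size of this state space can be illustrated with a short enumeration. The following is a minimal sketch under stated assumptions, not the counting procedure used to produce Table 2: it assumes that an allocation is described only by how many A7 and A15 cores each application receives (not which physical cores), that the brute-force count lets every application independently take any number of cores in each cluster, and that the DVFS combinations are the product of the points in the two domains. The function names are hypothetical, and the frequency ranges are the ones given for the Odroid XU3 in Section 4.2.

```python
from itertools import product


def dvfs_points(f_min_mhz, f_max_mhz, step_mhz):
    """Number of frequency points in an inclusive range with a fixed step."""
    return (f_max_mhz - f_min_mhz) // step_mhz + 1


def count_allocations(n_apps, n_a7, n_a15):
    """Return (brute_force, valid) counts of core allocations.

    Each application gets a pair (a7_cores, a15_cores).
    Assumption: only core counts matter, not core identities.
    """
    per_app = list(product(range(n_a7 + 1), range(n_a15 + 1)))
    brute_force = len(per_app) ** n_apps
    valid = 0
    for alloc in product(per_app, repeat=n_apps):
        # Rule 1: each application must have at least one thread (core).
        if any(a7 + a15 == 0 for a7, a15 in alloc):
            continue
        # Rule 2: at most one thread per core, so the totals cannot
        # exceed the number of cores available in each cluster.
        if sum(a7 for a7, _ in alloc) > n_a7:
            continue
        if sum(a15 for _, a15 in alloc) > n_a15:
            continue
        valid += 1
    return brute_force, valid


if __name__ == "__main__":
    m_a7 = dvfs_points(200, 1400, 100)    # A7 domain, Section 4.2
    m_a15 = dvfs_points(200, 2000, 100)   # A15 domain, Section 4.2
    for n_apps in range(1, 4):
        # Assumed reading of the text: 3 usable A7 cores, 4 A15 cores.
        brute, valid = count_allocations(n_apps, n_a7=3, n_a15=4)
        print(n_apps, brute, valid, valid * m_a7 * m_a15)
```

Even this filtered enumeration still has to visit every brute-force candidate before rejecting it, which is the practical cost the text refers to.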
In this work, we address these limitations with an adaptive approach, which monitors application scenarios at RT. The aim is to determine the optimal system configuration such that the power-normalized performance can be maximized at all times. The approach is based on profiling single and concurrent applications through power and performance measurements. For the first time, our study reveals the impact of parallelism in different types of heterogeneous cores on performance, power consumption, and power efficiency in terms of instructions per second (IPS) per unit power (i.e., IPS/Watt) [
20]. In our proposed approach, we make the following specific contributions:
using empirical observations and CPU performance counters, derive RT workload classification thresholds;
based on the workload classification and multivariate linear regression (MLR) to model power and performance trade-offs expressed as instructions per second (IPS) per Watt (IPS/Watt), propose a low-complexity approach for synergistic controls of DVFS and TM;
using synthetic and real-world benchmark applications with different concurrent combinations, investigate the approach’s energy efficiency, measured by power-normalized performance in IPS/Watt, and implement the low-complexity approach as a Linux power governor and validate through extensive experimentation with significant IPS/Watt improvements.
To the best of our knowledge, this is the first RT optimization approach for concurrent applications based on workload classification, refined further with MLR-based modeling, practically implemented and demonstrated on both heterogeneous and homogeneous multi-core systems. The rest of the paper is organized as follows.
Section 2 reviews the existing approaches. The proposed system approach is described in
Section 3.
Section 4 shows the configuration of systems used in the experiments and the applications. Workload classification techniques are described in
Section 5, where
Section 5.2 details the run-time implementations.
Section 6 deals with combining workload classification with multivariate linear regression, with the decision space of the latter significantly reduced by the former.
Section 7 discusses the experimental results, and, finally,
Section 8 concludes the paper.
2. Related Work
Energy efficiency of multi-core systems has been studied extensively over the years. A power control approach for multi-core processors executing a single application has been proposed in Reference [
21]. This approach has three layers of design features also shown by other researchers: firstly, adjusting the clock frequency of the chip depending on the power budget; secondly, dynamically grouping cores to run the same applications (as also shown in Reference [
22,
23]), and finally, modifying the frequency of each core group (as also shown in Reference [
11,
24]). Among others, Goraczko et al. [
17] and Luo et al. [
18] proposed DVFS approaches with software task partitioning and mapping of single applications using a linear programming (LP) based optimization during RT to minimize the power consumption. Goh et al. [
25] proposed a similar approach of task mapping and scheduling for single applications described by synthetic task graphs.
Several other works have demonstrated power minimization approaches through practical implementations on heterogeneous platforms. For example, Sheng et al. [
19] presented an adaptive power minimization approach using RT linear regression-based modeling of the power and performance trade-offs. Using the model, the task mapping and DVFS are suitably chosen to meet the specified performance requirements. Nabina and Nunez-Yanez [
15] presented another DVFS approach for field-programmable gate array (FPGA)-based video motion compensation engines using RT measurements of the underlying hardware.
A number of works have also presented analytical studies using simulation tools, such as gem5 together with McPAT [
14,
26], for single applications. These works have used DVFS, task mapping, and offline optimization approaches to minimize the power consumption for varying workloads.
Over the years, substantial research has been carried out on RT energy minimization and/or performance improvement approaches. These approaches have considered single-metric optimization: primarily performance-constrained power minimization, or performance improvement within a power budget [
27]. For example, Shafik et al. proposed an RT DVFS control approach for power minimization of multiprocessor embedded systems [
28]. Their approach uses performance and user experience constraints to derive the lowest possible operating voltage/frequency points through reinforcement learning and transfer principles. Das et al. presented another power minimization approach that models RT workload characterization to continually update the DVFS and core allocations through multinomial logistic regression based predictive controls [
29].
An RT classification of workloads and corresponding DVFS controls based on similar principles is proposed by Wang and Pedram for performance-constrained power minimization [
16]. As far as performance optimization within a power budget is concerned, Chen and Marculescu proposed a distributed reinforcement learning algorithm to model power and performance trade-offs during RT [
30]. Using this model, the DVFS and core allocations are adapted dynamically using feedback from the performance counters. Another power-limited performance optimization approach is presented by Cochran et al. showing programming model based power budget annotations and corresponding controls [
31]. Based on application requirements, Nabina and Nunez-Yanez [
15] presented a DVFS approach for FPGA-based video motion compensation engines. Santanue et al. [
32] and Tiago et al. [
33] suggested a smart load balancing technique to improve energy efficiency for single applications running on heterogeneous systems. This technique depends on the sense-predict-balance process through the variation of workload and performance/power trade-offs.
Gem5 with McPAT has been used to demonstrate four different core types, with each core operating at a fixed frequency. Petrucci et al. [
14] proposed a thread scheduling algorithm called lucky, which is based on lottery scheduling. This algorithm is implemented using the Linux 2.6.34 kernel with a performance monitor to optimize the thread-to-core affinity. Matthew et al. [
34] proposed a DVFS approach with varying core allocations for controlling concurrent applications executed on homogeneous systems at RT.
Numerous studies have focused on using classification-based techniques together with DVFS in run-time dynamic power management [
9,
31,
35,
36,
37,
38,
39]. For instance, Gupta et al. [
9] proposed a new run-time approach based on workload classification. To build this classifier, extensive offline experiments are performed on heterogeneous many-core platforms, and MATLAB is used to determine the classifier parameters offline. A Pareto function is used to determine the optimal configuration. However, this classification is heavily based on offline analysis results and assigns an application a fixed type, regardless of its operating context. It also requires application programmers to annotate applications using a special API.
Dey et al. [
40] suggested a new management technique for a power and thermal efficiency agent for mobile MPSoC platforms based on reinforcement learning. Fundamental to this approach is the use of a software agent to explore the DVFS settings of the mobile CPU and GPU based on the user’s interaction behavior. This approach has been validated on a Galaxy Note 9 smartphone using the Exynos 9810. The experimental results show that this new management technique can increase performance while reducing temperature and power consumption.
A model-free RT workload classification (WLC) approach with corresponding DVFS controls is proposed by Wang and Pedram [
16]. This approach employs reinforcement learning, with the size of the action space a major concern for the authors, even though it targets only homogeneous systems at a much higher granularity than CPU cores. WLC has also been used OL, but this produces a fixed class for each application [
13] and cannot deal with workload behavior changes during execution.
For a comprehensive survey on the wider related field of energy management in energy-critical systems, see Reference [
41]. This paper is based on our previous work published in PATMOS 2018 [
42], with substantial extensions.
4. System Fundamentals
In this section, we describe the platforms, workload applications, and performance counters used in this investigation. We study a homogeneous and a heterogeneous parallel processing platform, which both provide all the performance counters and power monitors we need for the methodology. We chose standard benchmark application workloads, which provide a variety of degrees of concurrency and memory access and CPU usage scenarios. The two hardware platforms, PARSEC workload applications, and performance counters are further detailed below.
4.1. Homogeneous System
The homogeneous experimental platform is a PC based on an Intel Core i7 Sandy Bridge CPU which contains no on-chip GPU facility. This CPU is chosen because it has a reasonable number of hard (4) and soft (8) cores, has no on-chip GPU to complicate the power consumption and communications, and has a relatively large number of possible operating frequencies and voltages. The operating system is Ubuntu Linux.
Run-time power monitoring is developed for the experimental platform for validation purposes. This is done by inserting a precision shunt resistor into the earth side of the power connection to the CPU. As high-precision current meters tend to have a 1 A upper limit, which many CPU operations will exceed, the shunt resistor allows the current to be inferred by measuring voltage.
The performance and power utility
Likwid [
43] is used to obtain the majority of the experimental data.
Likwid makes use of on-chip performance counters (sensors) in Intel CPUs to collect performance and power data. For instance, the Running Average Power Limit (RAPL [
44]) counters are accessed to infer power dissipation. The form factor of the platform allows the actual measurement of CPU power by way of a shunt resistor inserted into the CPU power supply circuit, and readings from these measurements were used in initial experiments to build confidence in the RAPL readings.
Before the main experiments, Likwid was first confirmed to be accurate for the experimental platform through cross-validation with physical power measurements using the shunt resistor described above. The use of performance counters rather than external power measurement in most of the experiments is motivated by the desire to develop a run-time management (RTM) approach which, for practicality and wide applicability, can rely only on built-in sensors and not on shunt resistors.
4.2. Heterogeneous System
The popularity of heterogeneous architectures, containing two or more types of different CPU cores, continues to grow [
45]. These systems offer better performance and power trade-off flexibility; however, it may be more complicated to ensure optimal energy consumption. The Odroid-XU3 board supports techniques, such as DVFS, affinity, and core disabling, commonly used to optimize system operation in terms of performance and energy consumption [
46,
47].
The Odroid-XU3 board is a small eight-core computing device implemented on energy-efficient hardware. The board can run Ubuntu 14.04 or Android 4.4 operating systems. The main component of the Odroid-XU3 is the 28 nm System-on-Chip (SoC) Exynos 5422. This SoC is based on the ARM big.LITTLE heterogeneous architecture and consists of a high-performance Cortex-A15 quad-core processor block, a low-power Cortex-A7 quad-core block, a Mali-T628 MP6 GPU cluster, and 2 GB of LPDDR3 DRAM. The board contains four real-time current sensors that allow power measurement on the four separate power domains: big (A15) CPUs, LITTLE (A7) CPUs, GPU cluster, and DRAM. In addition, there are four temperature sensors, one for each of the A15 CPUs, and one sensor for the GPU cluster. This work only concerns the CPU blocks, and the other parts of the SoC may be investigated in future work.
On the Odroid-XU3, for each CPU power domain, the supply voltage (Vdd) and clock frequency can be tuned through a number of pre-set pairs of values. The performance-oriented Cortex-A15 block has a range of frequencies between 200 MHz and 2000 MHz with a 100 MHz step, whilst the low-power Cortex-A7 quad core block can scale its frequencies between 200 MHz and 1400 MHz with a 100 MHz step.
4.3. Workload Applications
The PARSEC [
48] benchmark suite attempts to represent both current and emerging workloads for multiprocessing hardware. It is a commonly used benchmark suite for evaluating concurrency and parallel processing. We therefore use PARSEC on the Odroid-XU3 platform, in which heterogeneity can be representative of different design choices that can greatly affect workloads. PARSEC applications exhibit different memory behaviors, different data sharing patterns, and different workload partitions from most other benchmark suites in common use. The characteristics of applications, according to Reference [
48], which are used in this paper can be seen in
Table 3.
Whilst we experimented with all PARSEC applications at various stages of work, six applications from the suite are selected for presentation in the paper to represent CPU-intensive, memory-intensive, and a combination of both. Such a classification reduces the effort of model characterization for combinations of concurrently running applications (
Section 5). We found no surprises worth reporting in the accumulated experimental data with regard to the other PARSEC applications.
4.4. Performance Counters
In this work, we use performance counters to monitor system performance events (e.g., cache misses, cycles, instructions retired) and, at the same time, capture the voltage, current, power, and temperature directly from the sensors of the Odroid-XU3. For the Intel Core i7, real power measurements with a shunt resistor were used initially to establish confidence in the RAPL power counters, whilst the majority of experiments are based on performance counter readings once this confidence has been established. The performance counter facility consists of two modules: a kernel module and a user space module.
For the Odroid, the hardware performance counter readings are obtained using the method presented by Walker et al. [
49], with similar facilities used through
Likwid for the Core i7.
Here, we describe the Odroid case in more detail. In the user space module, the event specification is the means to provide details of how each hardware performance counter should be set up.
Table 4 lists notable performance events, some of which are explained as follows:
INST_RETIRED is the number of instructions retired and is part of the widely reported instructions per cycle (IPC) metric.
Cycles is the number of core clock cycles.
MEM_ACCESS is a memory read or write operation that causes a cache access to at least the level of data.
L1I_CACHE is level 1 instruction cache access.
5. Workload Classification RTM
This section makes use of both heterogeneous and homogeneous systems in its investigations but mainly concentrates on the heterogeneous Odroid XU3 in its discourse, unless otherwise noted. Different types of cores are especially useful for demonstrating the advantages of the approach.
5.1. Workload Classification Taxonomy
The taxonomy of workload classes chosen for this work reflects differentiation between CPU-intensive and memory-intensive workloads, with high- or low-activity. Specifically, workloads are classified into the following four classes:
Class 0: low-activity workloads;
Class 1: CPU-intensive workloads;
Class 2: CPU- and memory-intensive workloads; and
Class 3: memory-intensive workloads.
Extensive exploratory experiments are run in this work to investigate the validity of these general concepts.
The experiments are based on our synthetic benchmark, called
mthreads [
50], which attempts to controllably re-create the effect of memory bottleneck on parallel execution. The tool accomplishes this by repeatedly mixing CPU-intensive and memory-intensive operations, with the ratio of each type controlled by the parameter M. The CPU-intensive operation is a simple integer calculation. The memory-intensive operation is implemented by randomly writing to a 64 MB pre-allocated array. The randomization helps reduce the effect of caching. Setting M = 0 gives CPU-intensive execution and M = 1 leads to memory-intensive execution; the values in between provide a linear relation to the number of memory accesses per instruction. The execution is split into N identical parallel threads, each pinned to a specific core.
Figure 2 presents the flowchart of the tool.
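The mixing principle can be sketched in a few lines. The code below is an illustrative stand-in, not the mthreads implementation: the function name, the smaller buffer, and the deterministic credit scheme used to interleave the two operation types are all assumptions, and a Python loop does not reproduce the timing behavior of a native benchmark. It only shows how a single parameter M in [0, 1] can set the fraction of memory writes linearly.

```python
import random


def mixed_workload(m, iterations, array_len=1 << 16, seed=0):
    """Run `iterations` operations, a fraction `m` of them memory writes.

    Returns (cpu_ops, mem_ops) so that the resulting mix can be checked.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must be in [0, 1]")
    rng = random.Random(seed)
    buffer = bytearray(array_len)  # stands in for the pre-allocated array
    acc = 0        # accumulator for the CPU-intensive integer operation
    credit = 0.0   # fractional memory operations owed so far
    cpu_ops = mem_ops = 0
    for i in range(iterations):
        credit += m
        if credit >= 1.0:
            # Memory-intensive operation: write to a random location,
            # which reduces the benefit of caching.
            buffer[rng.randrange(array_len)] = i & 0xFF
            credit -= 1.0
            mem_ops += 1
        else:
            # CPU-intensive operation: a simple integer calculation.
            acc = (acc * 31 + i) & 0xFFFFFFFF
            cpu_ops += 1
    return cpu_ops, mem_ops
```

In a full tool, N copies of such a loop would run in parallel, each pinned to its own core.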
Figure 3 shows the energy efficiency of
mthreads running on 2 to 4 A7 cores (one of the A7 cores may have a heavy operating system presence—if C0 is turned off, the operating system stops; hence, this data does not include the single core case, which would be skewed by this system behavior) with M values ranging from 0 to 1. It can be seen that it is better to use fewer cores for memory-intensive tasks (larger M), but it is better to run more cores in parallel for CPU-intensive tasks (smaller M). Characterization results sweeping through the frequency ranges and core combinations with
mthreads confirm the validity of the classification taxonomy and establish a TM and DVFS strategy based on relative CPU and memory use rates. The full set of
mthreads data, supported by experiments with applications other than
mthreads including the entire PARSEC suite, is used to generate our run-time management (RTM) presented in subsequent sections.
5.2. Run-time Management Based on Workload Classification
Figure 1 presents the general architecture of RTM inside a system. In this section, we explain the central RTM functions: classification and control actions based on performance monitors and actuators (e.g., TM and DVFS). The general approach does not specify the exact form of the taxonomy into which workloads are classified, the monitors and actuators the system needs to have, or the design figure of merit. Our examples classify based on differentiating CPU and memory usage and execution intensity, try to maximize IPS/Watt through core allocation and DVFS, and get information from system performance counters [
42].
The governor implementation is described in
Figure 4, which refines
Figure 1. At time
, task
i is added to the execution via the system function
execvp(). The RTM makes TM and DVFS decisions based on metric classification results, which depend on hardware performance counters and power monitors to directly and indirectly collect all the information needed. This helps avoid instrumenting applications and/or special APIs (unlike, e.g., Reference [
51]), providing wider support for existing applications. The TM actuation is carried out indirectly via system functions. For instance, core pinning is done using
sched_affinity(pid), where
pid is the process ID of a task. DVFS is actuated through the
userspace governor as part of
cpufreq utilities.
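The two actuation paths can be sketched as follows. This is a minimal illustration under stated assumptions, not the governor's code: it uses Python's os.sched_setaffinity (the wrapper around the Linux affinity system call) for core pinning and the standard cpufreq sysfs files scaling_governor and scaling_setspeed for DVFS. The helper names are hypothetical, and the CPU index and frequency in the usage note are example values only.

```python
import os

CPUFREQ_ROOT = "/sys/devices/system/cpu"


def pin_task(pid, cores):
    """Restrict process `pid` to the given set of core indices (Linux only)."""
    os.sched_setaffinity(pid, set(cores))


def dvfs_writes(cpu, freq_khz):
    """Return the (path, value) sysfs writes that request a frequency on `cpu`.

    scaling_setspeed only takes effect while the userspace governor is active,
    so the governor is selected first.
    """
    base = f"{CPUFREQ_ROOT}/cpu{cpu}/cpufreq"
    return [
        (f"{base}/scaling_governor", "userspace"),
        (f"{base}/scaling_setspeed", str(freq_khz)),
    ]


def apply_writes(writes):
    """Perform the sysfs writes; this typically requires root privileges."""
    for path, value in writes:
        with open(path, "w") as handle:
            handle.write(value)
```

For example, apply_writes(dvfs_writes(4, 1400000)) would request 1.4 GHz for the frequency domain containing CPU 4, and pin_task(pid, {4, 5}) would confine a task to two cores. Separating the planning of writes from their execution keeps the decision logic testable without hardware access.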
5.3. Workload Classification
Real applications do not have precisely tuneable memory usage rates, unlike
mthreads. They may also have phases during which they may appear to be one class or another during their execution; therefore, attempts at classifying each application as a whole offline (as seen in Reference [
13]) may be of limited value (see
Section 7.1.1 for detailed discussions). In this work, information from performance counters is used to derive the classes of all applications running on the system for each control decision cycle. The assumption is that, during a control decision cycle, the class of an application is unlikely to change. This assumption requires that the length of control cycles is sufficiently short relative to the rate of class change of the applications (according to the Nyquist/Shannon sampling principle). The choice of control cycle length therefore depends on expected application scenarios and what happens when/if Nyquist/Shannon is violated should be carefully considered by the designer. This point will be discussed in detail in
Section 7.1.2, with the help of system design case studies.
The classification using performance counter readings is based on calculating a number of metrics from performance counter values recorded at set time intervals and then deriving the classes based on whether these metrics have crossed certain thresholds. Example metrics and how they are calculated are given in
Table 5.
Normalized instructions per clock (nipc) measures how intensive the computation is. It is the instructions per unhalted cycle (IPC) of a core, normalized by the maximum IPC, which can be obtained from manufacturer literature. Cycles is the number of unhalted cycles counted. Normalization allows nipc to be used independently of core types and architectures.
Instructions per reference clock (iprc) contributes to determining how active the computation is. ClockRef is the total number of clock cycles, given by the product of Freq and Time, both obtained from the system software.
Normalized non-memory IPC (nnmipc) discounts memory accesses from nipc, indicating CPU activity. From experiments with our synthetic benchmark, this shows an inverse correlation to the memory use rate.
CPU to memory ratio (cmr) relatively compares CPU to memory activities.
Unhalted clock to reference clock ratio (urr) determines how active an application is.
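The metric definitions above can be sketched as a small function over one sampling interval. This is an illustration, not a transcription of Table 5: nipc, iprc, and urr follow directly from their descriptions, with ClockRef assumed to be the product of frequency and time, whereas the exact expressions for nnmipc and cmr are assumptions made here (memory accesses subtracted from retired instructions, and non-memory instructions divided by memory accesses, respectively). The type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    inst_retired: int  # INST_RETIRED events in the interval
    cycles: int        # unhalted core clock cycles in the interval
    mem_access: int    # MEM_ACCESS events in the interval
    freq_hz: float     # core clock frequency during the interval
    time_s: float      # interval length in seconds


def compute_metrics(s, ipc_max):
    """Return the classification metrics for one counter sample."""
    if s.cycles <= 0 or s.freq_hz <= 0 or s.time_s <= 0:
        raise ValueError("cycles, frequency and time must be positive")
    clock_ref = s.freq_hz * s.time_s          # assumed: ClockRef = Freq * Time
    non_mem = s.inst_retired - s.mem_access   # non-memory instructions
    return {
        "nipc": (s.inst_retired / s.cycles) / ipc_max,
        "iprc": s.inst_retired / clock_ref,
        # Assumed form: IPC with memory accesses discounted, then normalized.
        "nnmipc": (non_mem / s.cycles) / ipc_max,
        # Assumed form: non-memory work relative to memory accesses.
        "cmr": non_mem / s.mem_access if s.mem_access else float("inf"),
        "urr": s.cycles / clock_ref,
    }
```

Because every metric is a ratio over the same interval, the values are comparable across sampling periods of different lengths.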
The general relationships between these metrics and the application (workload) classes are clear, e.g., the higher
nnmipc is, the more CPU-intensive a workload will be. A workload can be classified by comparing the values of metrics to thresholds. Decision-making may not require all metrics. The choice of metrics and thresholds can be made by analyzing characterization experiment results for each platform. From studying the relationship between M and the list of metrics from
mthreads experiments on the Odroid XU3, we find that
nnmipc and
cmr show the best spread of values across different values of M (see
Figure 5). Which of the two to use depends on the designer's preferred range of threshold values between application classes. We choose
nnmipc to differentiate CPU and memory usage rates and
urr to differentiate low and high activity. The thresholds used are determined based on our
mthreads characterization database and given in
Table 6. We tested this approach by running PARSEC programs and obtaining values of the chosen metrics, with the results shown in
Table 7. These confirm that
nnmipc can be used to differentiate CPU- and memory-intensive applications. For instance,
ferret is regarded as CPU-intensive [
52] and its per-core
nnmipc value is above 0.35. The other metrics may work better on other platforms and are included here as examples of potential candidates, depending on how an
mthreads-like characterization program behaves on a platform with regard to the relationships between M values and the metrics.
To confirm our approach, another set of experiments was carried out on the Intel Core i7 platform, as can be seen in
Table 8. These results agree with those found from the Odroid XU3. Based on these experiments, we also choose
nnmipc to differentiate CPU and memory usage rates and
urr for differentiating low and high activity. Threshold values are established from Core i7 characterization experiments and are different from those for Odroid XU3.
In principle, for each hardware platform, based on the available performance counters, the choice of metrics and the classification threshold values should both be based on classification results obtained from that platform.
5.4. Control Decision Making
This section presents an RTM control algorithm that uses application classes to derive its decisions. The behavior is specified in the form of two tables: a threshold table (
Table 6), used for determining application classes; and a decision table (
Table 9), providing a preferred action model for each application class.
The introduction of new concurrent applications or any other change in the system may cause an application to change its behavior during its execution. It is therefore important to classify and re-classify regularly. The RTM works in a dedicated thread, which performs classification and decision-making action every given time frame. The list of actions performed every RTM cycle is shown in Algorithm 1.
Algorithm 1 Inside the RTM cycle.
1: Collect monitor data
2: for each application do
3:     Compute classification metrics ▹ Section 5.3
4:     Use metric and threshold table to determine application class ▹ Table 6
5:     Use decision table to find core allocation and frequency preferences ▹ Table 9
6:     Distribute the resources between the applications according to the preferences
7:     Wait for the control cycle interval ▹ Section 5.4
8: end for
9: return
In Algorithm 1,
is the time between two RTM control cycles. The RTM determines the TM and DVFS of power domains once each control cycle, and these decisions remain constant until the next control cycle. The data from the system monitors (performance counters and power meters) is collected asynchronously. Every core has a dedicated monitor thread, which spends most of its time in a sleep state and wakes every
to read the performance counter registers. The readings are saved in the RTM memory. This means that the RTM always has the latest data, which is at most
old. This is mainly done because ARM performance counter registers can be accessed only from code on the same CPU core. In this case, asynchronous monitoring has been empirically shown to be more efficient. In our experiments, we chose
ms, which has shown a good balance between RT overhead and energy minimization. The time the RTM takes (i.e., RT overhead) is negligible compared to 500 ms for the size of our system. This interval can easily be reduced at the cost of slightly higher overheads, or increased at the cost of some energy efficiency. The flowchart of the entire RTM cycle is shown in
Figure 6.
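The asynchronous arrangement described above, with per-core monitor threads feeding a control loop that always sees the latest readings, can be sketched as follows. This is a structural illustration under stated assumptions rather than the governor's implementation: the class and function names are hypothetical, the counter read is an injected callable, and on real ARM hardware each monitor thread would additionally have to be pinned to the core it reads.

```python
import threading
import time


class LatestReadings:
    """Thread-safe store holding the most recent sample per core."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, core, sample):
        with self._lock:
            self._data[core] = sample

    def snapshot(self):
        with self._lock:
            return dict(self._data)


def monitor(core, read_counters, store, period_s, stop):
    """Per-core monitor: save a reading, then sleep until the next period."""
    while not stop.is_set():
        store.put(core, read_counters(core))
        stop.wait(period_s)  # sleeps, but returns early when stopped


def rtm_loop(store, decide, actuate, control_period_s, cycles):
    """Run a fixed number of RTM control cycles."""
    for _ in range(cycles):
        readings = store.snapshot()   # at most one monitor period old
        actuate(decide(readings))     # classification and decision lookup
        time.sleep(control_period_s)  # decisions stay fixed until next cycle
```

Decoupling the monitor period from the control period means the control loop never blocks on a counter read, at the price of acting on data that may be up to one monitor period old.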
The RTM uses monitor data to calculate the classification metrics discussed in
Section 5.2. These metrics form a profile for each application, which is compared against the thresholds (
Table 6). Each row of the table represents a class of applications and contains a pre-defined value range for each classification metric. Value ranges may be unbounded. A metric
x can be constrained to the range
, equivalent to
. An application is considered to belong to a class if its profile satisfies every range in a row. If an application does not satisfy any class, it is marked as “unclassified” and receives a special action from the decision table. An application is also unclassified when it first joins the execution; in that case, it is placed on an A15 core for classification.
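The range-matching rule above can be sketched in a few lines of Python. The table contents here are purely illustrative: the class names, metric names, and numeric bounds are hypothetical and are not taken from the threshold table in this paper. Treating both bounds as inclusive and returning the first matching row are also assumptions.

```python
import math

# Hypothetical threshold table: class name -> {metric name: (low, high)}.
# None marks an unbounded side of a range. All values are illustrative.
THRESHOLDS = {
    "class_a": {"metric_1": (0.5, None)},
    "class_b": {"metric_1": (None, 0.5), "metric_2": (0.1, 0.4)},
}


def in_range(value, low, high):
    """Check low <= value <= high, where None means unbounded."""
    lo = -math.inf if low is None else low
    hi = math.inf if high is None else high
    return lo <= value <= hi  # inclusive bounds are an assumption


def classify(profile, table):
    """Return the first class whose every range is satisfied by the profile."""
    if profile is None:  # application has just joined: no metrics yet
        return "unclassified"
    for name, ranges in table.items():
        if all(in_range(profile[m], lo, hi) for m, (lo, hi) in ranges.items()):
            return name
    return "unclassified"
```

For example, `classify({"metric_1": 0.2, "metric_2": 0.9}, THRESHOLDS)` matches no row and yields `"unclassified"`, which is the case handled by the special action in the decision table.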
The decision table (
Table 9) contains the following preferences for each application class, related to system actuators (DVFS and core allocation decisions): number of A7 cores, number of A15 cores, and clock frequencies. The number of cores can take one of the following values: none, single, or maximum. The frequency preference can be minimum or maximum. The CPU-intensive application class (Class 1) runs on the maximum number of available A15 cores at the maximum frequency, as this has been shown to give the best energy efficiency (in terms of power-normalized performance) in our previous observations [
7].
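A decision table of this shape amounts to a small lookup structure, as in the following sketch. Only two facts come from the text: Class 1 prefers the maximum number of A15 cores at maximum frequency, and unclassified applications go to a single A15 core. The A7 entry for Class 1, the frequency for unclassified applications, and all identifiers (`Cores`, `Freq`, `Preference`, `DECISIONS`, `decide`) are assumptions made for illustration.

```python
from enum import Enum
from typing import NamedTuple


class Cores(Enum):
    NONE = "none"
    SINGLE = "single"
    MAX = "maximum"


class Freq(Enum):
    MIN = "minimum"
    MAX = "maximum"


class Preference(NamedTuple):
    a7_cores: Cores
    a15_cores: Cores
    frequency: Freq


DECISIONS = {
    # A15/frequency entries follow the text; the A7 entry is an assumption.
    "class_1": Preference(Cores.NONE, Cores.MAX, Freq.MAX),
    # Single A15 core follows the text; the frequency is an assumption.
    "unclassified": Preference(Cores.NONE, Cores.SINGLE, Freq.MAX),
}


def decide(app_class):
    """Look up preferences; unknown classes fall back to 'unclassified'."""
    return DECISIONS.get(app_class, DECISIONS["unclassified"])
```

Keeping the preferences symbolic (none, single, maximum) rather than numeric defers the actual core counts to the allocation step, which knows how many applications compete for each core type.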
Table 6 and
Table 9 are constructed OL in this work based on large amounts of experimental data, with those involving PARSEC playing only a supporting role. For instance, although
ferret is regarded as CPU-intensive, it is so only on average and has non-CPU-intensive phases (see
Section 7.1.1). Therefore,
Table 9 is obtained mainly from analyzing experimental results from our synthetic benchmark
mthreads (which has no phases), with PARSEC used only to check for gross disagreements (none were found). Because of the empirical nature of the process, true optimality is not claimed.
In this work, we assume that the RTM does not have to deal with more threads than there are cores in the system; if there are more threads than cores, some will not be scheduled by the system scheduler, which is outside the domain of the RTM. Our experiments therefore do not feature more concurrent applications than the number of cores in the system. The RTM attempts to satisfy the preferences of all running applications. In the case of conflicts between frequency preferences, priority is given to the maximum frequency. When multiple applications request cores of the same type, the RTM distributes all available cores of that type as fairly as possible. When these conflicting applications are of different classes, each application is guaranteed at least a single core. Core allocation (TM) is done through the following algorithm.
Algorithm 2 shows the procedure APPLYDECISION for mapping the RTM decisions to the core affinity masks. The RTM provides a decision for each app and for each core type
, where
is the core type, and
is the app index, given the total number of apps
m. The decisions are arranged in arrays
and
. Additional constants used by the algorithm are:
are the total number of little and big cores, respectively, and the IDs of cores by type are listed in the pre-defined
. The complexity of the algorithm is linear in
m. The result of the algorithm is the set of core IDs
, which can be used to call the
sched_setaffinity function for the respective app
i.
Algorithm 2 Mapping the RTM decisions to the core affinities
1: procedure ApplyDecision()
2:   ▹ Get per-app number of little cores
3:   ▹ Get per-app number of big cores
4:   for do
5:
6:
7:     ▹ Use to set core affinity mask for the app i.
8:   end for
9: end procedure
10: function ReqCores()
11:
12:
13:   if then
14:     ▹ v is the MAX number of cores
15:     ▹ w is the remainder
16:   end if
17:   for do
18:     if then
19:       if then ▹ Distribute the remainder
20:
21:
22:       else
23:
24:       end if
25:     else if then
26:
27:     else
28:
29:     end if
30:   end for
31:   return
32: end function
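The following Python sketch gives one plausible reading of the allocation procedure, based on the prose description and the surviving comments in Algorithm 2 (a per-app share v and a remainder w that is handed out one core at a time). It is not the authors' code. The request encoding (`"none"`, `"single"`, `"max"`), the rule that single-core requests are reserved before the rest is shared, the function names, and the example core IDs are all assumptions. The sketch also assumes that the requests fit within the available cores.

```python
def resolve_frequency(preferences):
    """Conflicting frequency preferences resolve to the maximum."""
    return "max" if "max" in preferences else "min"


def req_cores(requests, total):
    """Turn per-app requests for one core type into per-app core counts."""
    n_single = sum(1 for r in requests if r == "single")
    n_max = sum(1 for r in requests if r == "max")
    v = w = 0
    if n_max > 0:
        free = max(total - n_single, 0)  # cores left after single requests
        v, w = divmod(free, n_max)       # v: fair share, w: remainder
    counts = []
    for r in requests:
        if r == "max":
            if w > 0:                    # distribute the remainder
                counts.append(v + 1)
                w -= 1
            else:
                counts.append(v)
        elif r == "single":
            counts.append(1)
        else:
            counts.append(0)
    return counts


def apply_decision(little_req, big_req, little_ids, big_ids):
    """Map per-app decisions to sets of core IDs (affinity masks)."""
    n_little = req_cores(little_req, len(little_ids))
    n_big = req_cores(big_req, len(big_ids))
    masks, li, bi = [], 0, 0
    for nl, nb in zip(n_little, n_big):
        masks.append(set(little_ids[li:li + nl]) | set(big_ids[bi:bi + nb]))
        li += nl
        bi += nb
    return masks
```

Each pass over the applications is a single loop, which matches the linear complexity in m stated above. On Linux, each resulting set could then be passed to `os.sched_setaffinity(pid, mask)` for the corresponding application, mirroring the `sched_setaffinity` call mentioned in the text.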
8. Conclusions
An optimization scheme targeting power-normalized performance was developed for controlling concurrent application executions on platforms with multiple cores.
In the first instance, models are obtained off-line from experimental data. Model simplification is shown to be successful: using the simplified models in RT control algorithms yields largely optimal results and, in many cases, significant improvements over existing Linux governors.
A run-time workload classification management approach is proposed for multiple concurrent applications of diverse workloads running on heterogeneous multi-core platforms. The approach is demonstrated by a governor aimed at improving system energy efficiency (IPS/Watt). This governor classifies workloads according to their CPU and memory signatures and makes decisions on core allocation and DVFS. Owing to its model-free approach, the governor has low RTM complexity (linear in the number of applications and cores) and low cost (lookup tables of limited size). The governor implementation does not require application instrumentation, allowing for easy integration into existing systems. Experiments show that the governor provides a significant energy-efficiency advantage over existing approaches. Detection of low parallelizability improves the stability of the governor. A synthetic benchmark with tunable memory use supports the characterization process.
This method is further improved by tuning the results of workload classification with a learning-based optimization using multivariate linear regression. Because workload classification drastically reduces the modeling space, the regression-based learning is shown to work effectively. This RTM is demonstrated on both heterogeneous and homogeneous platforms.
In experiments on homogeneous and heterogeneous systems, we demonstrated a novel RT approach capable of workload classification and power-aware performance adaptation under sequential and concurrent application scenarios in heterogeneous multi-core systems. The approach is based on power and performance models that can be obtained at RT by multivariate linear regression, using low-complexity hypotheses of power and performance for a given operating frequency. The approach is extensively evaluated using the PARSEC-3.0 benchmark suite running on the Odroid-XU3 heterogeneous platform.
A selection of experimental results was presented to illustrate the trade-offs across a variety of concurrent application scenarios, core allocations, and DVFS points, highlighting improvements in power-normalized performance (IPS/Watt) of between 26% and 139% for a range of applications. It is expected that designers of modern embedded and high-performance systems will benefit from the proposed approach through systematic power-aware performance optimization under variable workload and application scenarios. Our future work will include investigating the scalability of the approach to more complex platforms and higher levels of concurrency.