1. Introduction
RNA secondary structure prediction poses a fundamental and computationally intensive challenge in biological computing. The objective is to predict the non-crossing secondary structure of a given RNA sequence that minimizes the total free energy. Early dynamic programming algorithms by Smith and Waterman [1] and Nussinov et al. [2] focused on maximizing the number of complementary base pairs.
Zuker et al. [3] introduced a sophisticated dynamic programming algorithm that predicts the most stable secondary structure for a single RNA sequence by computing its minimal free energy, utilizing a “nearest neighbor” model. The algorithm estimates thermodynamic parameters for neighboring interactions, scoring all possible structures based on loop entropies. The RNA secondary structure is composed of four independent substructures (stack, hairpin, internal loop, and multi-branched loop), with the energy of the structure being the sum of these substructure energies.
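Schematically (a standard statement of the nearest-neighbor model, included here only for illustration), for a structure $S$ decomposed into a set of loops $\mathcal{L}(S)$, the total free energy is

$$E(S) = \sum_{l \in \mathcal{L}(S)} e(l),$$

where each contribution $e(l)$ is a stack, hairpin, internal-loop, or multi-branched-loop energy.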
Zuker’s algorithm comprises two steps. The first, most time-consuming step calculates the minimal free energy of the input RNA sequence using the recurrence relations given below. The second step performs a trace-back to recover the secondary structure with its base pairs. While the trace-back step is computationally less demanding, optimizing the energy matrix calculation in the first step is critical for enhancing overall algorithm performance [4].
This paper examines the performance of tiled Zuker loop nest codes generated by selected automatic optimizers based on the polyhedral model.
The polyhedral model represents loop nests as polyhedra with affine loop bounds and schedules, offering advanced loop transformations and the ability to analyze data dependencies. By utilizing this model, compilers can automatically optimize loops, enhance performance (especially locality, using loop tiling), and exploit parallelism [5].
Loop tiling, also known as loop blocking or loop partitioning, is a compiler optimization technique that improves cache utilization and enhances the performance of loop-based computations [6]. It partitions a loop into smaller sub-loops or blocks, known as tiles, which fit effectively into the cache. The primary goal of loop tiling is to exploit locality: data elements stored close together in memory are accessed together (spatial locality), and iterations within a tile reuse data elements before they are evicted (temporal locality), reducing cache misses and optimizing memory access patterns.
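As a minimal sketch of the transformation (the arrays, bounds, and tile size here are illustrative only, not part of the Zuker code):

```c
#include <stdio.h>

#define N 1024
#define B 32                               /* tile (block) size, tuned to cache capacity */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static double A[N][N], C[N][N];

int main(void) {
    /* Tiled version of: for (i) for (j) C[i][j] += A[i][j];
       The two outer loops enumerate tiles; the two inner loops sweep one
       B x B tile, so successive iterations touch data that is still cached. */
    for (int it = 0; it < N; it += B)
        for (int jt = 0; jt < N; jt += B)
            for (int i = it; i < MIN(it + B, N); i++)
                for (int j = jt; j < MIN(jt + B, N); j++)
                    C[i][j] += A[i][j];
    printf("%f\n", C[0][0]);
    return 0;
}
```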
Bioinformatics algorithms such as those of Zuker, Nussinov, or Smith–Waterman can benefit from parallelization using the widely recognized loop skewing method implemented in polyhedral compilers. This loop transformation modifies the iteration order of loop nests to create a more favorable schedule. The fundamental concept behind loop skewing is to alter the original iteration space of a loop nest by applying an affine transformation to the loop indices. This transformation introduces a skewing factor that dictates the new relationship between the loop indices, ultimately changing the execution order of loop iterations. It is worth noting that the effectiveness of loop skewing depends on the tiling algorithms implemented in compilers, and the resulting parallel codes may differ.
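A minimal sketch of skewing applied to a simple two-dimensional recurrence (not the Zuker nest itself): substituting $t = i + j$ turns every anti-diagonal into a wavefront whose iterations are mutually independent and can therefore run in parallel.

```c
#define N 1024
static double A[N][N];

/* Skewed (wavefront) form of:
     for (i = 1; i < N; i++)
       for (j = 1; j < N; j++)
         A[i][j] = A[i-1][j] + A[i][j-1];
   With t = i + j, both sources of every dependence lie on wavefront t - 1,
   so all iterations of a given t are independent of one another. */
void wavefront(void) {
    for (int t = 2; t <= 2 * (N - 1); t++) {
        int jlo = t - (N - 1) > 1 ? t - (N - 1) : 1;   /* clip j to [1, N-1] */
        int jhi = t - 1 < N - 1 ? t - 1 : N - 1;
        #pragma omp parallel for
        for (int j = jlo; j <= jhi; j++) {
            int i = t - j;                              /* recover the row index */
            A[i][j] = A[i - 1][j] + A[i][j - 1];
        }
    }
}
```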
Several numerical approaches within the polyhedral model aim to enhance structure prediction accuracy for RNA sequences. The algorithm by Zhi J. Lu and colleagues introduces the Maximum Expected Accuracy (MEA) concept, incorporating base pair and unpaired probabilities [7]. The NPDP Benchmark Suite, encompassing the Nussinov, Zuker, and MEA algorithms, serves as a comprehensive resource for numerical sources and aspects, emphasizing the challenges of optimizing these tasks with conventional tiling strategies [8,9,10].
Efforts to optimize RNA folding algorithms include the “Transpose” technique proposed by Li et al. for Nussinov’s RNA folding, later extended to optimize Zuker’s code [11]. Zhao et al. improved the “Transpose” method and conducted an experimental study of energy-efficient codes for Zuker’s algorithm based on the LRU cache model [12].
While PLUTO, an advanced polyhedral code generation tool, has been widely used for optimizing C/C++ programs, it faces limitations in achieving maximal code locality and performance for NPDP problems, such as Nussinov’s RNA folding and the McCaskill probabilistic RNA folding kernel [13,14]. The inability of PLUTO to tile the innermost loop of Nussinov’s RNA folding and the third loop nest in Zuker’s code highlights these challenges.
As Mullapudi and Bondhugula presented for Zuker’s optimal RNA secondary structure prediction, dynamic tiling involves 3D iterative tiling for dynamic scheduling, calculated using reduction chains [13]. However, this approach emphasizes dynamic scheduling of tiles rather than generating a static schedule.
Wonnacott et al. introduced 3D tiling for “mostly-tileable” loop nests in RNA secondary-structure prediction codes, focusing on serial codes due to limitations in handling more complex instances [14]. Tchendji et al. explored tiling correction and Four-Russians RNA folding, proposing a parallel tiled and sparsified Four-Russians algorithm for Nussinov’s RNA folding [15].
In our previous endeavors, we introduced a tiling technique aimed at transforming original rectangular tiles into target ones, ensuring validity under lexicographic order [16]. Tile correction was carried out using the transitive closure of loop dependence graphs. This approach yielded a substantial speed-up in the generated tiled code compared to state-of-the-art source-to-source optimizing compilers, and it found its practical implementation within the polyhedral TRACO compiler.
Our exploration into tiling correction and Four-Russians RNA folding attracted the attention of Tchendji et al., leading to an in-depth study. They proposed a parallel, tiled, and sparsified Four-Russians algorithm tailored for Nussinov’s RNA folding [15]. This approach is more cache-friendly, strategically organizing Four-Russians blocks into parallelogram-shaped tiles. Their experimental study, spanning CPUs and massively parallel GPU architectures, showcased superior performance compared to the outcomes of our previous work [16,17]. While the authors concentrated only on a manual treatment of the Nussinov loop nest, they expressed a commitment to exploring other NPDP problems in the future.
In a more recent contribution, our team introduced a space–time loop-tiling approach in a paper published in 2019 [18]. This methodology generates target tiles by applying the intersection operation to sets representing sub-spaces and time slices. Each time partition encompasses independent iterations, facilitating parallel execution, while the enumeration of time partitions follows lexicographic order. This approach extends our ongoing efforts in space–time tiling, showcasing promising prospects for the development of new polyhedral optimizing compilers. The corresponding codes were generated using the DAPT compiler, introduced in a previous publication [19].
There are other state-of-the-art polyhedral compilers, such as Tiramisu [20], AlphaZ [21], Pencil [22], Halide [23], AutoGen [24], and Apollo [25], that generate code for CPUs and GPUs. However, we have not found them to be applicable to NPDP problems; they are either based on the PLUTO framework or are not fully documented, maintained source-to-source projects. As a result, we do not explore their capabilities further in this article.
In the realm of RNA folding algorithms, the Zuker kernel and Nussinov RNA folding pose challenges for optimizing compilers due to the mathematical operations they perform over affine control loops within the polyhedral model [13]. The acceleration of Zuker RNA folding proves particularly intricate, as it resides in the domain of non-serial polyadic dynamic programming (NPDP), a subset with non-uniform data dependencies [16]. Moreover, Zuker’s loop structure is more challenging for automatic tiling strategies than Nussinov’s algorithm, featuring quadruple-nested loops with more instructions and data dependencies.
The paper by Yuan et al. [26] introduces a novel two-level tessellation scheme for stencil computations, aiming to exploit data locality and parallelism more efficiently than traditional blocking methods. It designs a set of blocks that tessellate the spatial space in various ways, allowing for parallel processing without redundant computation. Experimental results demonstrate up to a 12% performance improvement over existing concurrent schemes. The paper by Bertolacci et al. [27] leverages Chapel parallel iterators to implement advanced tiling techniques, including time-dimension tiling, to improve the parallel scaling of stencil computations on multicore processors. It proposes parameterized space and time tiling iterators provided through libraries, facilitating code reuse and easier tuning for improved programmer productivity; the approach demonstrates better scaling than traditional data-parallel schedules.

There is also a study [28] that presents a method for constructing tiled computational processes organized as a two-dimensional structure for algorithms represented by multidimensional loops. The method allows data exchange operations to be confined within rows or columns of processes, optimizing parallel computations on distributed-memory computers. Some studies have investigated the problem of obtaining global dependencies, i.e., informational dependencies between tiles, in the context of parametrized hexagonal tiling applied to algorithms with a two-dimensional computational domain; the paper [29] contributes to the efficient use of multilevel memory and the optimization of data exchanges in both sequential and parallel programming. Experimental results demonstrate that model-based tiled Sparse Matrix–Dense Matrix Multiplication (SpMM) and Sampled Dense–Dense Matrix Multiplication (SDDMM) achieve high performance relative to current state-of-the-art methods [30]. In paper [31], the authors introduce monoparametric tiling, a restricted parametric tiling transformation for polyhedral programs that retains the closure properties of the polyhedral model; the technique facilitates efficient autotuning and run-time adaptability while preserving those properties.
The remaining sections of the paper are organized as follows. In the next section, we delve into the polyhedral representation of the Zuker loop nests, elucidating the application of three polyhedral compilers: PLUTO, TRACO, and DAPT. The Results section comprehensively explores the time and energy benefits, as well as the locality and scalability, of the generated codes. The final section compares the experimental study with our previous work and concludes the paper with insights into future work.
2. Materials and Methods
Zuker defines two energy matrices, $W$ and $V$, with pairs $(i, j)$ satisfying the constraints $1 \le i < j$ and $j \le N$, where $N$ is the length of a sequence. $W(i, j)$ represents the total free energy of a sub-sequence defined by indices $i$ and $j$, while $V(i, j)$ represents the total free energy of a sub-sequence starting at index $i$ and ending at index $j$ if $i$ and $j$ form a pair; otherwise, $V(i, j) = \infty$.
The main recursion of Zuker’s algorithm for all i, j with , where N is the length of a sequence, is the following:
Below, we present the computation of
V:
eH (hairpin loop),
eS (stacking) and
eL (internal loop) are the structure elements of energy contributions in the Zuker algorithm.
The computation of Equations (1), (2), (5), (6), and (7) takes $O(1)$ steps, and Equations (4) and (8) require $O(N)$ steps. The time complexity of a direct implementation of this algorithm is $O(N^4)$ because we need $O(N^2)$ operations to compute Equation (3). This formulation as a computational kernel involves two $N \times N$ float arrays, $V$ and $W$, and minimum operations.
The computation domain and dependencies for Zuker’s recurrence cell $(i, j)$ are more complex than those of Nussinov’s recurrence. Equations (3), (4), and (8) generate long-range (non-local) dependencies for cell $(i, j)$, while the other equations have short-range (local) dependencies. The computation of the element $V(i, j)$ in Equation (3) spans a triangular area of several dozen to hundreds of cells.
Listing 1 shows the affine loop nest for finding the minimums of the V and W energy matrices.
The Zuker affine loop nest implies that loop bounds, conditional statements, and array addresses are represented by affine expressions. Within the Zuker statements, there are numerous non-uniform dependencies characteristic of NPDP problems. Non-uniform dependencies are dependencies between iterations of a computation where the relationship, or distance, between dependent iterations varies during execution. Examples include accesses such as $V(k, m)$ or $W(k + 1, j)$, where the dependency pattern changes dynamically based on the varying values of the involved variables. Another non-uniform dependency arises from the conditional expression based on Equation (8). Non-uniform dependencies in the context of NPDP problems present significant challenges for parallelization and optimization because they require algorithms and optimization strategies that can adapt to the data-dependent nature of these dependencies.
Listing 1. Zuker’s recurrence loop nest.

```c
for (i = N - 1; i >= 0; i--) {
  for (j = i + 1; j < N; j++) {
    for (k = i + 1; k < j; k++) {
      for (m = k + 1; m < j; m++) {
        if (k - i + j - m > 2 && k - i + j - m < 30)
          V[i][j] = MIN(V[k][m] + EL(i, j, k, m), V[i][j]);              // Equation (3)
      }
      W[i][j] = MIN(MIN(W[i][k], W[k + 1][j]), W[i][j]);                 // Equation (8)
      if (k < j - 1)
        V[i][j] = MIN(W[i + 1][k] + W[k + 1][j - 1], V[i][j]);           // Equation (4)
    }
    V[i][j] = MIN(MIN(V[i + 1][j - 1] + ES(i, j), EH(i, j)), V[i][j]);   // Equations (1) and (2)
    W[i][j] = MIN(MIN(MIN(W[i + 1][j], W[i][j - 1]), V[i][j]), W[i][j]); // Equations (5), (6) and (7)
  }
}
```
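To make the non-uniform dependencies discussed above concrete, the following sketch (illustrative arrays and bounds only) contrasts a uniform dependence, whose distance is a constant, with a Nussinov-like non-uniform one, whose source moves with the iteration point:

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Uniform dependence: every iteration reads the value produced exactly
   one iteration earlier, so the dependence distance is the constant 1. */
void uniform_dep(double *A, int n) {
    for (int i = 1; i < n; i++)
        A[i] = A[i - 1] + 1.0;
}

/* Non-uniform dependence: iteration (i, j) reads B[i][k] and B[k][j] for
   every i < k < j, so the distance between the source and the destination
   of a dependence varies with the iteration point itself, as in
   Equations (3), (4), and (8) of the Zuker nest. */
void nonuniform_dep(int n, double B[n][n]) {
    for (int i = n - 2; i >= 0; i--)
        for (int j = i + 1; j < n; j++)
            for (int k = i + 1; k < j; k++)
                B[i][j] = MIN(B[i][k] + B[k][j], B[i][j]);
}
```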
Fortunately, Zuker loop nests can be transformed to exploit parallelism and locality within the polyhedral model. They contain no non-linear expressions and no break or continue statements. Most of the opportunities for improving computational efficiency arise from loop tiling: the resulting improvement in data locality reduces the number of accesses to the slower main memory, leading to shorter execution times and higher overall performance of the algorithm.
Polyhedral compilers utilize a sophisticated approach for analyzing and transforming loop structures in programs, leveraging mathematical abstractions such as polyhedra. This method represents loop nests and their iteration spaces as polyhedra, geometric objects that capture the multidimensional space of loop indices. By doing so, compilers can precisely understand the dependencies and execution order of iterations within loops. Once the loops are represented as polyhedra, polyhedral compilers use operations such as the following:
Affine transformations: These are mathematical transformations that maintain the straight-line nature of code segments and allow the compiler to reorder, fuse, or tile loops for optimization purposes.
Dependency analysis: By examining the vertices and edges within the polyhedral model, compilers can determine which iterations of the loop depend on others, allowing for safe parallelization or reordering of loops without altering the program’s semantics.
Scheduling: The compiler can generate an optimized execution schedule that improves data locality and parallel execution by analyzing the polyhedral representation. This is particularly effective for optimizing memory access patterns and leveraging cache memory more efficiently.
This methodology facilitates the optimization of nested loops in a way that is provably correct and often results in significant improvements in execution time, especially for high-performance computing applications.
Well-known polyhedral compilers like PLUTO, TRACO, and DAPT realize dependence analysis and code generation by applying similar polyhedral techniques. These optimizing compilers start by extracting the relevant loops from the input C program and representing them in the polyhedral model. This representation captures the loop iteration space, dependencies, and data access patterns as mathematical entities. Next, they perform a comprehensive dependency analysis to understand the data dependencies within the program. This analysis ensures that any transformations preserve the program’s semantics, meaning the transformed program will produce the same output as the original. Using the polyhedral model, they automatically apply transformations aimed at improving data locality and exposing parallelism. This involves optimizing the loop order, tiling (breaking down loops into smaller blocks), and loop fusion or fission to better utilize the cache memory and enable parallel execution. All three compilers operate with default settings; for PLUTO, the options --tile and --parallel are specified, while the other two tools work in that mode by default.
Dependency analysis in loop optimization involves identifying relationships between loop iterations to determine data dependencies. PET (Polyhedral Expression Translator) [5] is a tool that aids in this process, translating source code into a polyhedral model for advanced dependency analysis. PET facilitates various optimization strategies, including loop interchange and loop unrolling, ultimately enabling the generation of optimized code for different hardware architectures.
Code generation in the context of libraries like the Chunky Loop Generator (CLooG) [32] and the Integer Set Library (isl) [5] relies on the polyhedral model. CLooG provides tools for generating efficient loop code from complex polyhedral models, enabling optimizations such as loop interchange and loop space transformations. isl, often used in conjunction with CLooG, is a library for manipulating sets of integers, crucial for the polyhedral model. These tools are utilized for the automatic generation of optimized code, particularly for nested loops in numerical programs.
The primary distinction among compilers like PLUTO, TRACO, and DAPT lies in loop program transformations, particularly the techniques applied for loop blocking to generate cache-efficient code. Nevertheless, all these tools share a common foundation, as they are implemented using isl. The library provides operations on sets and relations, including union, intersection, negation, transitive closure of relations, projection, and others. These operations are employed in the form of matrices or sets/relations, offering a unified basis for the functionalities of these compilers.
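For illustration, a minimal program using the isl C API to build a dependence relation and compute its transitive closure, one of the operations listed above (the toy chain relation and the symbolic size N are ours, not taken from the Zuker code):

```c
#include <stdio.h>
#include <isl/ctx.h>
#include <isl/map.h>

int main(void) {
    isl_ctx *ctx = isl_ctx_alloc();

    /* A toy dependence relation: iteration i is a source for iteration i + 1. */
    isl_map *dep = isl_map_read_from_str(ctx,
        "[N] -> { S[i] -> S[i + 1] : 0 <= i < N - 1 }");

    /* Transitive closure R+ of the dependence relation; isl reports
       whether the computed result is exact. */
    int exact;
    isl_map *closure = isl_map_transitive_closure(isl_map_copy(dep), &exact);

    printf("exact: %d\n", exact);
    isl_map_dump(closure);   /* expected: { S[i] -> S[o] : 0 <= i < o < N } */

    isl_map_free(dep);
    isl_map_free(closure);
    isl_ctx_free(ctx);
    return 0;
}
```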
The advanced PLUTO compiler [8] utilizes the affine transformation framework (ATF) to generate parallel tiled code, employing execution-reordering loop transformations to facilitate multi-threading and improve locality. An embedded Integer Linear Programming (ILP) cost function helps create effective tiling hyperplanes, optimizing parallelism while minimizing communication and enhancing code locality in the processor space. PLUTO supports both one-dimensional and multi-dimensional time schedules for loop nest statement instances.
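Schematically (following Bondhugula et al.’s formulation; the notation here is ours), a per-statement affine function $\phi_S$ defines a legal tiling hyperplane when it never reverses a dependence, while the ILP cost function minimizes an affine upper bound on the dependence distances in terms of the program parameters $\vec{n}$:

$$\phi_{S_t}(\vec{t}) - \phi_{S_s}(\vec{s}) \ge 0 \quad \text{and} \quad \phi_{S_t}(\vec{t}) - \phi_{S_s}(\vec{s}) \le \vec{u} \cdot \vec{n} + w \quad \text{for every dependence } \vec{s} \rightarrow \vec{t},$$

where minimizing $\vec{u}$ and $w$ drives the search toward hyperplanes with short communication distances.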
TRACO [33], on the other hand, employs the transitive closure of dependence relation graphs to form valid target tiles. This process involves partitioning the iteration space into original tiles and then correcting the tiles by removing invalid dependence destinations. TRACO achieves this by applying the transitive closure of the dependence graph to the iteration subspace, eliminating invalid dependence destinations and redistributing them to tiles with lexicographically greater identifiers. The compiler uses loop skewing for parallelizing NPDP tiled codes but does not implement ATF techniques.
DAPT [19] addresses non-uniform dependencies by approximating them with uniform counterparts, simplifying the complexities associated with nonlinear time-tiling constraints. DAPT has successfully normalized non-uniform dependencies and has an advantage over PLUTO, as it supports three-dimensional tiling of benchmarks such as Nussinov, nw, and sw [8].
To assess the performance of polyhedral tiling techniques, we manually crafted code inspired by Li’s transformation [11], employing transposed arrays to reduce inefficient column reading. Li originally introduced the code for the Nussinov RNA folding algorithm; in Listing 2, we introduce a customized version designed for Zuker’s loop nests. Adhering to Li’s methodology [11], at the end of the loop body, each newly computed cell of the upper right triangle is mirrored into the otherwise unused lower left triangle, keeping the transposed copy of the W array valid.
Listing 2. Optimized Zuker’s recurrence loop nest with the Transpose technique.

```c
for (diag = 2; diag <= N - 1; diag++) {
  #pragma omp parallel for shared(diag) private(col, row, k, m)
  for (row = 0; row <= N - diag - 1; row++) {
    col = diag + row;
    for (k = row; k < col; k++) {
      for (m = k + 1; m < col; m++)
        if (k - row + col - m > 2) {
          V[row][col] = MIN(V[k][m] + EFL[row][col], V[row][col]);
        }
      W[row][col] = MIN(MIN(W[row][k], W[col][k + 1]), W[row][col]);
      if (k < col - 1) {
        V[row][col] = MIN(W[row + 1][k] + W[col - 1][k + 1], V[row][col]);
      }
    }
    V[row][col] = MIN(MIN(V[row + 1][col - 1], EHF[row][col]), V[row][col]);
    W[row][col] = MIN(MIN(MIN(W[row + 1][col], W[row][col - 1]), V[row][col]), W[row][col]);
    W[col][row] = W[row][col];   // mirror into the lower triangle (transposed copy)
  }
}
```
3. Results
To carry out the experiments, we used a machine with an AMD Epyc 7542 processor (2.35 GHz, 32 cores, 64 threads, 128 MB cache) and a machine with an Intel Xeon Gold 6240 processor (2.6 GHz, 3.9 GHz turbo, 18 cores, 36 threads, 25 MB cache). The optimized codes were compiled using the GNU C++ compiler, version 9.3.0, with the -O3 flag.
AMD EPYC processors offer several advantages, including a higher core count per socket, improved memory bandwidth, lower power consumption, and more PCIe 4.0 lanes. However, they may exhibit lower single-core performance compared to Intel Xeon processors, and some software may not be optimized for AMD architectures. On the other hand, Intel Xeon processors excel in higher single-core performance, superior virtualization support, a strong brand reputation, and broad software compatibility. Nonetheless, they come with fewer cores per socket, higher power consumption, and fewer PCIe lanes than their AMD EPYC counterparts. Hence, in our research, we examine time, locality, scalability, and energy efficiency. Other parameters, such as cache misses or RAM usage, can also be measured; however, they are more challenging to observe and, at the same time, are strongly correlated with energy consumption and execution time. Memory consumption remains invariant and depends solely on the size of the problem at hand rather than on the specifics of the algorithm employed. It is important to note that the principal benefit of reduced execution time is predominantly attributable to enhanced data locality. The algorithm’s complexity is $O(N^4)$, and variations in runtime stem mainly from the selection of tile sizes that yield improved locality.
Tests were conducted using ten randomly generated RNA sequences with lengths ranging from 1000 to 5000. The discussion in papers [11,16] shows that the performance of cache-efficient code does not depend on the strings themselves but on the length of the string.
We compared the performance of tiled codes generated with the presented approaches: (i) PLUTO parallel tiled code (based on affine transformations) [34]; (ii) tiled code based on the space–time technique [18], generated with DAPT; (iii) tiled code based on the correction technique TileCorr [16], generated with TRACO; and (iv) Li’s manual cache-efficient implementation of Zuker’s RNA folding, Transpose [11]. All codes are multi-threaded within the OpenMP standard [35]. All three compilers utilize affine transformations; additionally, TRACO employs transitive closure to enhance the discovery of optimization opportunities, while DAPT uses dependency approximation when searching for affine transformations, which allows locality to be increased further. All compilers allow the specification of the block size; additionally, DAPT enables the partitioning of the time space, which facilitates further optimization of the loop boundaries, thereby enabling a better fit to cache memory.
The tile size of 16 × 16 × 1 × 16 for the PLUTO code was chosen empirically (PLUTO does not tile the third loop) as the best among the many sizes examined. The tile size of 16 × 16 × 16 × 16 for the tile correction technique was chosen according to paper [36]. For the space–time tiled code, we chose the same tile sizes; our preliminary empirical testing did not yield better ones for this algorithm.
Table 1 presents execution times in seconds for ten sizes of the RNA sequence on the AMD Epyc 7542. Problem sizes from 1000 to 5000 (roughly the length of the longest human mRNA) were chosen to illustrate the advantages for both smaller and larger instances. The output codes were executed with 64 threads. We can observe that the presented space–time tiling approach yields cache-efficient tiled code that significantly outperforms the other examined implementations for every RNA strand length. The second most efficient code is the tiled code produced by the PLUTO compiler.
Figure 1 depicts the speed-up for the execution times presented in Table 1.
Table 2 presents execution times in seconds using two Xeon Gold 6240 processors and 36 threads. The presented space–time tiling strategy strongly outperforms the other studied techniques for all RNA strand lengths. On this machine, the Transpose technique yields faster code than the ATF tiled code and the tile correction code.
Figure 2 depicts speed-ups for the execution times in Table 2.
Table 3 and Table 4 present energy consumption in kJ for the AMD Epyc and Intel Xeon Gold processors, respectively. The optimized codes significantly reduce energy consumption. We observe that the AMD machine is more energy-efficient than the Intel Xeon one; however, the Intel Xeon completes the calculations about 4 min faster despite consuming about twice the energy for all optimized codes. We can also observe a stronger correlation between shorter execution time and lower energy consumption on the AMD machine.
Figure 3 and Figure 4 illustrate shorter processing times for larger thread counts on the AMD Epyc and Intel Xeon Gold, respectively. Hence, we can assert that the presented codes are scalable with respect to both the size of the problem and the number of threads.
4. Discussion
The most favorable time results are achieved through the implementation of space–time tiling within the DAPT compiler. The initial version of this technique was implemented in the TRACO compiler and tested with Nussinov’s algorithm in 2019 [18]. Subsequently, the algorithm was rewritten in the newer DAPT compiler, incorporating additional capabilities and replacing the Petit dependence analyzer [37] and CLooG [32] modules with PET and isl [19]. The Zuker algorithm, a task with $O(N^4)$ complexity, along with another NPDP RNA folding task, Maximum Expected Accuracy (MEA) prediction, was analyzed in a paper published in 2020 [38]. That paper solely presented a comparison of ATF with tile correction, excluding space–time tiling.
The main challenge in leveraging tiling strategies for the Zuker algorithm lies in its sequential dependencies, as computations for subsequent stages rely on the outcomes of earlier ones. Moreover, the irregular memory access patterns and the need for synchronization between different units of the algorithm can hinder optimal performance gains when applying tiling strategies. However, given the rapid increase in the volume of biological data, speeding up the execution of bioinformatics algorithms through parallelization and tiling is essential. Moreover, the concept of dimensionality expansion presents a significant opportunity for bioinformatics: transforming the problem space into higher dimensions can overcome some of the inherent sequential constraints.
To continue our research, this paper explores the application of polyhedral techniques to space–time tiling for NPDP kernels where ATF fails to expose the potential parallelism or locality. In the future, we plan to explore other NPDP tasks, such as Multiple Sequence Alignment (MSA) or the Smith–Waterman algorithm applied to three DNA sequences.
Also, the use of GPUs to enhance the efficiency of parallel computations with tiling presents promising prospects. GPUs, with their high degree of parallelism and numerous cores, are well suited for executing many parallel threads, making them ideal for tiled algorithms, where computations can be distributed across cores. Additionally, for computations performed on GPUs, the appropriate partitioning of loops into blocks is crucial, as it directly impacts the efficiency and scalability of the parallel processing by ensuring optimal utilization of the GPU computational resources.
In summary, the introduced space–time tiled code demonstrates enhanced and scalable performance on multi-core processors, irrespective of the number of threads or problem size. We also observed a significant improvement in energy efficiency with the automatically generated code. The space–time tiling strategy implemented in the polyhedral compiler DAPT emerges as a promising solution for optimizing NPDP tasks. Future work includes exploring its application in other NPDP bioinformatics problems on both CPU and GPU platforms.