1. Introduction
Quantum chemistry provides the building blocks and first principles needed to derive the molecular and physical properties arising from (sub)atomic interactions. As such, the field has been able to predict and explain the formation of electronic structures and chemical reactions across science, engineering, and medical applications. One of the fundamental challenges in the realization of practical quantum chemistry lies in solving the Schrödinger equation [
1] to compute the electronic orbitals representing molecular properties:
$\hat{H}\Psi = E\Psi,$
where $\hat{H}$ is the Hamiltonian operator representing the total energy of the molecular system (kinetic plus potential energy), $\Psi$ denotes the wave function encoding the quantum state of the electrons within the molecule (electron kinematics), and $E$ is the energy eigenvalue.
The central aim of quantum chemistry is to find solutions to atomic and molecular Schrödinger equations. Yet, as molecules of practical interest are typically polyatomic, involving a large number of nuclei and electrons, solving the Schrödinger equation analytically becomes intractable due to the large number of (sub)atomic interactions (the many-body problem). Using reasonable assumptions on electron kinematics, approximate schemes, such as the Hartree–Fock method [
2,
3], have become attractive to estimate energy states and the overall electronic structures [
4]. The Hamiltonian operator $\hat{H}$ of an N-electron system is as follows:
$\hat{H} = \sum_{i=1}^{N} \hat{h}(i) + \sum_{i=1}^{N} \sum_{j>i}^{N} \hat{v}(i,j),$
where $i$ and $j$ index the electrons, $\hat{h}(i)$ is a one-electron operator representing the kinetic energy of electron $i$ and the electron–nucleus Coulomb attraction, and $\hat{v}(i,j)$ is the two-electron operator representing the electron–electron repulsion. The minimization of the expectation value of $\hat{H}$ leads to the Hartree–Fock equations:
$\hat{F}\phi_i = \varepsilon_i \phi_i, \qquad i = 1, \ldots, N,$
where $\hat{F}$ is the Fock operator.
Solutions to the above (3) render the orbital functions $\phi_i$ and the orbital energies $\varepsilon_i$. To tackle the above, the Self-Consistent Field (SCF) method, introduced by Hartree [2,5,6], is often used to estimate the orbital functions $\phi_i$ iteratively. By assuming an approximate initial guess $\phi_i^{(0)}$ and replacing the corresponding operator $\hat{F}^{(0)}$ in (3), one obtains the following:
$\hat{F}^{(0)}\phi_i^{(1)} = \varepsilon_i^{(1)}\phi_i^{(1)},$
in which the first solutions are computable from a standard eigenvalue problem. The first solutions $\phi_i^{(1)}$ and $\varepsilon_i^{(1)}$ are expected to be better than the initial guess. Afterward, a subsequent set of solutions is obtained after computing the operator $\hat{F}^{(1)}$ from the previous solutions, and the procedure is repeated until convergence. In the above, the current guess is used to compute the potential field felt by each electron. As such, the Hartree–Fock method assumes a single-electron (mean-field) configuration, in which each electron is subject to the average potential of the neighboring electrons.
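To make the fixed-point structure of the SCF procedure concrete, the following toy sketch iterates a stand-in update until self-consistency is reached; scf_step() is a hypothetical placeholder for "build the Fock operator from the current orbitals and solve the resulting eigenvalue problem", not actual quantum chemistry machinery.

```cuda
// Toy illustration of the SCF fixed-point loop. scf_step() is a hypothetical
// placeholder for "build F^(k) from the current orbitals, then solve F phi = eps phi";
// here a simple contraction mapping is used so the iteration visibly converges.
#include <cmath>
#include <cstdio>

struct Guess { double value; };                 // stand-in for the orbital set {phi_i}

static Guess scf_step(const Guess& prev) {
    return { 0.5 * std::cos(prev.value) };      // placeholder for F^(k) -> phi^(k+1)
}

int main() {
    Guess phi{1.0};                             // initial guess phi^(0)
    for (int it = 0; it < 100; ++it) {
        Guess next = scf_step(phi);             // one SCF iteration
        double change = std::fabs(next.value - phi.value);
        phi = next;
        std::printf("iteration %2d, change %.3e\n", it, change);
        if (change < 1e-10) break;              // self-consistency reached
    }
    return 0;
}
```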
A relevant procedure is evaluating the potential encoded through a corresponding basis set and integrals. Often, solutions of the orbital functions $\phi_i$ at each step of the SCF procedure are expressed as a linear expansion of M basis functions:
$\phi_i = \sum_{\mu=1}^{M} C_{\mu i}\, \chi_\mu,$
where M is the number of basis functions, and $C_{\mu i}$ is the expansion coefficient. Furthermore, the Hartree–Fock method solves the set of integrals, often termed the
two-electron repulsion integrals (Basis-ERIs), encoding the Coulombic repulsion interactions between electron pairs across different orbitals. Several basis functions have been explored in the field; popular ones include the Slater basis functions, which exhibit cusps at the nucleus and exponential decay [
7], the primitive Gaussian functions suggested by Boys in the 1950s [
8], and the Gaussian functions with contraction coefficients and exponents, also known as Contracted Gaussian Functions (CGFs) [
9]. When the Hartree–Fock machinery uses Gaussian basis functions, the molecular orbital functions $\phi_i$ are expressed as a linear expansion of Gaussian-type orbitals (GTOs). A GTO is a mathematical function used to approximate atomic orbitals in quantum chemistry calculations. It takes the form of a Gaussian function, which allows for efficient computation of integrals due to its simple analytical properties. Unlike Slater-type orbitals (STOs), which more accurately describe the exponential decay of atomic orbitals, GTOs facilitate faster integral evaluations. As a result, GTOs have been widely adopted in quantum chemistry studies. As such, wave functions become computable from linear combinations of GTOs, and the computation of Basis-ERIs is reduced to the problem of computing GTO-ERIs.
1.1. Related Works
For systems with many electrons, the Hartree–Fock method becomes computationally expensive due to the challenge of solving a large number of integrals, e.g., Basis-ERIs require the computation of every quartet of basis functions representing the molecule of interest. In order to realize the efficient computation of electron repulsion integrals over Gaussian functions (GTO-ERIs), various algorithms have been proposed. Two key algorithms in the field are the well-known McMurchie–Davidson method (MD) [
10] and the Obara–Saika method (OS) [
11]. Inspired by the idea of using Cartesian Gaussian Functions [
8], the MD method computes molecular integrals through Hermite Gaussians, expressed in compact form by auxiliary functions, and through recursive relations that enable efficient calculation of higher angular momenta. Also, the MD method, due to its relatively simple mathematical structure, offers greater generality and extensibility compared to other approaches. The OS method uses the Gaussian Product theorem to derive recurrence equations that simplify composite integrals. Several other efficient schemes exist, such as the Pople–Hehre (PH) method [
12], which uses a standard orientation of basis functions sharing common computable information, allowing coordinate components to become zero or constant and decreasing the computational load of the corresponding integrals, and the Dupuis–Rys–King (DRK) method [
13] using root search of (orthogonal) Rys polynomials, often termed the Rys quadrature method [
14], yielding representations for basis functions of arbitrarily high angular momentum. Another approach is the Head–Gordon–Pople method (HGP) [
15] that (1) extends the recurrence relations from OS to consider the shifting angular momentum, and (2) benefits from efficient evaluation of integrals by evaluating basis functions sharing the same information (coefficients and exponents), by evaluating similar integrals simultaneously, and by grouping similar atomic orbitals. The PH method is particularly suitable for calculations involving high-angular-momentum systems, whereas the HGP method is more efficient for systems with low-angular-momentum functions. Also, the PRISM method [
16] capitalizes on the trichotomy between angular momentum and degree of contraction in ERIs to find tailored computation paths. As such, PRISM allows contraction steps to be performed precisely where needed and realizes a degree of flexibility that covers a wide range of ERI classes. The reader may find a further detailed description of the above in [
17].
Computing the electronic structure configurations through Hartree–Fock methods is amenable to parallelization and acceleration schemes by Graphics Processing Units (GPUs). As such, the active developments of new GPU architectures and the concomitant Compute Unified Device Architecture (CUDA) have enabled several Hartree–Fock estimation schemes to capitalize on the merits of GPU hardware.
The community has made several attempts at GPU acceleration of the MD method [
18,
19,
20,
21,
22,
23]. In 2008, Ufimtsev and Martinez [
18] were among the first to accelerate the MD algorithm through tailored forms of stream processing on GPUs and by evaluating specific mappings between threads and integral computation, e.g., one thread per contracted integral and one thread per primitive integral. The resulting algorithm achieved a 130× speedup over a CPU implementation.
Also, a year later in 2009, Ufimtsev and Martinez [
19] extended their acceleration schemes by (1) incremental computations of the Fock matrix throughout successive iterations, allowing reduction of the number of computable elements of the Fock matrix, in particular, in late iterations when convergence is near, (2) double screening of integrals: in a first stage,
ket and
bra pairs become ordered by angular momentum, and at the second stage, such pairs become ordered within each angular momentum according to their contribution to the Schwarz upper bound, enabling effective parallelization and improved load balancing, and (3) computation of the Fock matrix components by blocks and by following the ordering of the integral grid while ignoring the integrals with a Schwarz bound below a threshold. The extended algorithm achieved a speedup of up to 650× compared to a CPU implementation. However, the implementation in [
19] is restricted to s- and p-type Gaussian basis functions and assumes that the communication time between the GPU and CPU is negligible compared to the total time.
In 2012, Titov et al. [
20] proposed automated code generation to accelerate the MD method by GPU and CPU kernels for integral evaluation. The generated kernels are expressed in symbolic form and are subject to algebraic transformations in the Maple CAS system. The resulting framework was implemented in TeraChem, a popular tool in quantum chemistry, rendering kernels of about 2000 lines to support up to
d angular momentum. Although the approach by Titov et al. [
20] has the potential to explore diverse kernel configurations for specific GPU hardware, the selection of a suitable kernel follows empirical testing, resembling a trial-and-error approach. Furthermore, the challenge of fully automating kernel generation and optimization arises from the limitations faced by compilers, which are often constrained by both compilation time and the availability of source code. This makes it impractical to scale such approaches to accommodate large angular momenta.
In 2014, Yasuda and Maruoka [
21] proposed an acceleration scheme for MD through a calculation order for electron repulsion integrals (ERIs), and a method using multiple cooperating threads to reduce register usage. Computational experiments on CPUs and GPUs showed that the approach significantly accelerated the computation by a factor of four for transition metal clusters. Although the method proposed by Yasuda and Maruoka [
21] enables finding calculation orders that minimize the intermediate ERIs that need to be simultaneously retained, scalability to higher angular momentum is still a challenge as minimizing register spills in ERI calculations is a nonpolynomial time search problem, allowing for only approximate solutions. Whereas compilers can effectively address simpler cases, the limited scalability to high-angular-momentum ERI calculations limits the overall effectiveness.
In 2017, Kalinowski et al. [
22] proposed the GPU acceleration of the MD method for arbitrary angular momentum. The approach is mainly oriented towards performing integral pre-screening (integral list generation) on the CPU side by the ORCA package, followed by CPU-to-GPU memory transfers of the computed lists. As such, the computational load is distributed so that each GPU thread calculates a single integral batch at a time. The acceleration scheme achieved speedups of up to a factor of 30 relative to serial and parallel CPU implementations. Although the approach is presented for arbitrary angular momentum, aspects such as the complexity of high angular momentum restrict the practical usage of the GPU code to f-type integrals.
Furthermore, since the Gaussian basis function proposed by Boys in the 1950s [
8] was coupled with the DRK method in [
24] (termed UM09), several enhanced approaches have been proposed: the GPU accelerations of the DRK, also termed the Rys quadrature method [
25,
26,
27], the GPU acceleration of the HGP method and its adaptation up to the
f-type ERIs [
28], the multi-GPU implementations of the Rys quadrature [
29], the acceleration of the HGP method benefiting from permutational symmetries and dynamic load balancing schemes [
30], the optimized code generator via a heuristic and sieve method for HGP-OS implementing efficient integral screening and symmetry [
31] (termed Brc21), the multinode multi-GPU implementations [
32], the matrix multiplication form of the MD method on NVIDIA V100 GPU [
33], and the GPU implementation for PYSCF up to
g functions through the Rys quadrature [
34]. The above-mentioned algorithms compute the GTO-ERIs by repeatedly evaluating recurrence relations, which become deeper and more computationally expensive as the azimuthal quantum number of the basis functions increases. Here, UM09 and Brc21 were reported to be promising throughout practical molecular benchmarks, yet the two approaches differ in principle: whereas Brc21 exploits ERI symmetry and optimizes the computation per ERI, UM09 benefits from dividing the exchange computations and performing efficient screening. Furthermore, although several algorithms have been proposed to efficiently compute Basis-ERIs with high azimuthal quantum numbers [
32,
35,
36,
37], most of the existing algorithms limit the supported values of the azimuthal quantum number. For instance, the study by Johnson et al. [
32] supports basis functions with azimuthal quantum numbers up to 4, generating optimized code for each specific case, and the hybrid between UM09 and Brc21 for multi-GPU domains has been evaluated with up to
f-type Gaussian functions [
38].
1.2. Contributions
Our previous related work [
23] investigated the acceleration of the McMurchie–Davidson (MD) method and developed a parallel algorithm optimized for efficient GPU implementation. This paper builds upon [
23] and further extends the GPU acceleration frameworks. Our previous study [
23] successfully improved the evaluation efficiency of the recurrence relation (Equation (
22)) by employing a batch-based computational algorithm. However, two major issues remained, which we address in this study:
Although the batch-based approach enhanced the efficiency of evaluating the recurrence relation (Equation (
22)), there was computational redundancy, as the Boys function needed to be evaluated every time a GTO-ERI within a Basis-ERI was computed.
In the previous implementation, each ERI calculation was assigned to a single CUDA block. This assignment strategy led to inefficient utilization of computational resources, particularly for ERI calculations with low computational cost, where the available computational resources within a CUDA block could not be fully utilized, increasing execution time.
The related approaches in MD acceleration, as shown in
Table 1, focus on different aspects/configurations of GPU-enabled MD acceleration. For instance, Ufimtsev and Martinez [
18,
19] focus on the most efficient computation of the Fock matrix and the effective screening of integrals, Titov et al. [
20] focus on the automated generation of (symbolic) kernels to evaluate integrals, each of which can be tailored to GPU hardware, Yasuda and Maruoka [
21] focus on how to find an optimal order of ERIs by (recursive) search, and how to enable the cooperating threads to reduce register use, and Kalinowski et al. [
22] focus on (serial) screening by CPU and the thread-based integral evaluations. Compared to the aforementioned related works (as summarized in
Table 1), and departing from particular aspects/configurations of MD acceleration, our approach is a generalized algorithm capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms can be stored. To this end, our approach capitalizes on the use of shells, batches, and triple-buffering of the shared memory, and efficient block and thread mapping strategies in CUDA, thus exploring the overall effective parallelism frontiers. The following points briefly describe the key contributions of this paper.
1.2.1. Batch-Based ERI Computation
Our approach capitalizes on the amenability of parallelization through shells (a feature of basis functions) and batches (having the role of partitioning independent computable groups). Basically, the proposed algorithm divides the components of the recurrence relation of the MD method into K + 1 batches, where K is a parameter determined by the sum of the azimuthal quantum numbers in the four GTOs used in the GTO-ERI. As such, the proposed approach enables efficient ERI computation with a unified code structure applicable to any input.
1.2.2. Triple-Buffering of Shared Memory
As a standard batch-based algorithm would require storing the computed elements of the recurrence sequentially, the overall storage requirements become impractical for large K. The value of K at which storage becomes impractical depends on the available GPU memory. For example, on the A100 Tensor Core GPU used in our experiments, storing all the computed elements of the recurrence becomes infeasible when K exceeds 24. As such, aiming at reducing the overall memory usage required in the batch-based algorithm, this paper proposes the triple-buffering of shared memory, which takes advantage of the property of the batch algorithm that, during the computation of the elements in batch k, only the elements from batches k − 1 and k − 2 are required. By cyclically switching between three shared memory buffers, the algorithm overwrites memory containing values that are no longer needed for subsequent computations. Our estimations show that the proposed approach successfully reduces shared memory usage by up to 65% compared to a standard approach that would preserve all elements in the shared memory. Furthermore, as K increases, the memory reduction rate becomes comparatively significant.
1.2.3. Block and Thread Mapping Strategies
Using the CUDA framework, we implemented four GPU parallel thread mapping methods to maximize computational efficiency and GPU resource utilization. A brief description of each method is given below.
- BBM
Basis-Block Mapping: each CUDA block is assigned to a Basis-ERI, and all CUDA threads within a CUDA block compute a Basis-ERI in parallel. This implementation is identical to the one proposed in our previous study and is used as the baseline for comparison.
- BTM
Basis-Thread Mapping: each CUDA thread is assigned to a Basis-ERI, with each CUDA thread independently evaluating a Basis-ERI.
- SBM
Shell-Block Mapping: each CUDA block is assigned to a shell-based ERI, and all threads within the CUDA block collaboratively compute a shell-based ERI in parallel.
- STM
Shell-Thread Mapping: each CUDA thread is assigned to a shell-based ERI, with each CUDA thread independently evaluating a shell-based ERI.
The definitions of Basis-ERI and shell-based ERI are provided in
Section 2 and
Section 4.1, respectively.
1.2.4. Computational Experiments
We assessed the computational efficiency of ERI calculations for various molecules using the above-mentioned four proposed GPU implementations. Experiments using an NVIDIA A100 Tensor Core GPU demonstrated substantial performance improvements. In particular, the proposed method achieved up to a 72× speedup compared to our previous GPU implementation. Furthermore, when benchmarked against a naive CPU implementation on an AMD EPYC 7702 processor (AMD, Santa Clara, CA, USA), it achieved a maximum speedup of 4500×. Our findings underscore the effectiveness of our approach in significantly accelerating ERI calculations for both monoatomic and polyatomic molecules.
1.3. Paper Organization
The structure of this paper is organized as follows:
Section 2 describes the Basis-ERIs, the primary focus of this study,
Section 3 overviews the McMurchie–Davidson method (referred to as MD throughout the paper), a well-known approach for computing GTO-ERIs,
Section 4 describes our proposed approach to accelerate the computation of MD, followed by the discussion of
Section 5 formulating relevant strategies to implement the proposed algorithm in GPUs.
Section 6 describes and discusses the results obtained through rigorous computational experiments involving ERI calculations of representative different molecular systems. Finally,
Section 7 concludes our paper by summarizing the key findings of this study.
2. Two-Electron Repulsion Integrals
This section describes the preliminaries behind the computation of two-electron repulsion integrals (termed Basis-ERIs). Concretely speaking, a Basis-ERI is defined as a double integral over the entire space involving four basis functions $\chi_\mu$, $\chi_\nu$, $\chi_\lambda$, and $\chi_\sigma$, as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = \int\!\!\int \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2,$
where
- $(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma)$: notation often used in quantum chemistry to denote a Basis-ERI.
- $(\chi_\mu \chi_\nu|$: bra term.
- $|\chi_\lambda \chi_\sigma)$: ket term.
- $\mathbf{r}_1$, $\mathbf{r}_2$: position vectors representing the three-dimensional coordinates (x, y, z) of the two electrons.
- $|\mathbf{r}_1 - \mathbf{r}_2|$: norm of the vector $\mathbf{r}_1 - \mathbf{r}_2$ (Euclidean distance) representing the Coulomb interaction between the two electrons.
- $\chi$: notation for a basis function.
In the above-mentioned formulation, the Basis-ERIs are to be computed for every combination of four basis functions. Therefore, if one considers M basis functions, the reader may note that the total number of combinations of Basis-ERIs is $M^4$. For instance, when one considers a molecular system with basis functions $\chi_1$ and $\chi_2$, thus $M = 2$, one is to compute $2^4 = 16$ Basis-ERIs. In order to illustrate the above-mentioned configuration, Figure 1a shows the matrix-based configurations of the possible interactions between the two basis functions $\chi_1$ and $\chi_2$, in which the elements in the column direction correspond to the bra terms, whereas the elements in the row direction correspond to the ket terms. Also, each cell in the matrices of Figure 1a represents a single Basis-ERI.
One can benefit from the symmetry implications in Equation (6), as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = (\chi_\nu \chi_\mu | \chi_\lambda \chi_\sigma) = (\chi_\mu \chi_\nu | \chi_\sigma \chi_\lambda) = (\chi_\nu \chi_\mu | \chi_\sigma \chi_\lambda) = (\chi_\lambda \chi_\sigma | \chi_\mu \chi_\nu) = (\chi_\sigma \chi_\lambda | \chi_\mu \chi_\nu) = (\chi_\lambda \chi_\sigma | \chi_\nu \chi_\mu) = (\chi_\sigma \chi_\lambda | \chi_\nu \chi_\mu).$
As a result, the number of Basis-ERIs that require computation is reduced to
$\frac{1}{2}\left(\frac{M(M+1)}{2}\right)\left(\frac{M(M+1)}{2}+1\right) \approx \frac{M^{4}}{8}.$
In order to illustrate the merits of the symmetry relations,
Figure 1b shows the possible combinations under symmetry considerations. By observing
Figure 1a,b, the reader may note that, by including symmetry in the formulation of Basis-ERIs, it becomes possible to reduce the total number of Basis-ERIs, as shown by the upper triangular matrix in
Figure 1b. As such, the overall Basis-ERI calculation determines each element in the upper triangular matrix of
Figure 1b.
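As a quick check of the reduction above (assuming the standard eight-fold permutational symmetry of real Gaussian basis functions), the M = 2 case illustrated in Figure 1 works out as follows:

```latex
% Worked example of the symmetry reduction for M = 2 basis functions
\[
  M^{4} = 2^{4} = 16 \ \text{Basis-ERIs without symmetry}, \qquad
  \frac{M(M+1)}{2} = 3 \ \text{unordered pairs}, \qquad
  \frac{1}{2}\cdot 3\cdot(3+1) = 6 \ \text{unique Basis-ERIs}.
\]
```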
Furthermore, in quantum chemistry, the basis function $\chi_\mu(\mathbf{r})$ is often expressed as a linear combination of Gaussian basis functions as follows:
$\chi_\mu(\mathbf{r}) = \sum_{s=1}^{K} d_s\, g_s(\mathbf{r}),$
where $g_s$ denotes the Gaussian-type orbitals (GTOs), $K$ denotes the number of terms in the linear combination, and $d_s$ denotes the coefficients of the linear combination. In the above, the Gaussian-type orbital $g(\mathbf{r})$ is defined as follows:
$g(\mathbf{r}) = N\,(x - R_x)^{i}\,(y - R_y)^{j}\,(z - R_z)^{k}\,\exp\!\left(-\alpha\,|\mathbf{r} - \mathbf{R}|^{2}\right),$
where $\mathbf{R} = (R_x, R_y, R_z)$ represents the orbital center (Cartesian coordinate center), $N$ is a normalization constant, $i$, $j$, and $k$ are non-negative integers that determine the orbital shape, and $\alpha$ is a parameter representing the spread of the orbital. In other words, a basis function consists of a set of Gaussian-type orbitals with the same center $\mathbf{R}$ and the same azimuthal quantum numbers $i$, $j$, $k$, but different coefficients $d$ and parameters $\alpha$. The orbitals are often named based on the sum of the azimuthal quantum numbers $i + j + k$. For example, an orbital with $i + j + k = 0$ is classified as an s-orbital, which has a spherical shape. Likewise, $i + j + k = 1, 2, 3$ render the p-, d-, and f-orbitals, respectively [
39]. Furthermore, within the context of contracted Gaussian basis functions (CGFs), $K$ is the degree of contraction, and $i + j + k$ denotes the total angular momentum.
The above-described parameters are uniquely determined by both the types of basis functions and the number of atoms.
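For illustration, a contracted Gaussian basis function can be represented by a simple record holding its center, Cartesian exponents, and primitive data. The layout below is a hypothetical sketch (field names and the MAX_CONTRACTION bound are assumptions, not the paper's data structures); the numeric values are close to the commonly tabulated STO-3G hydrogen 1s parameters, and the Basis Set Exchange should be consulted for authoritative data.

```cuda
// Illustrative data layout for a contracted Gaussian basis function (CGF).
// Field and type names are hypothetical, not taken from the paper's code.
#include <cstdio>

constexpr int MAX_CONTRACTION = 16;   // assumed upper bound on the degree of contraction

struct ContractedGaussian {
    double center[3];                 // orbital center R = (Rx, Ry, Rz)
    int    i, j, k;                   // Cartesian exponents; i + j + k = total angular momentum
    int    K;                         // degree of contraction (number of primitive GTOs)
    double alpha[MAX_CONTRACTION];    // primitive exponents (orbital spread)
    double d[MAX_CONTRACTION];        // contraction coefficients
};

int main() {
    // A hydrogen-like s-type CGF with three primitives (STO-3G-style layout).
    ContractedGaussian s = {{0.0, 0.0, 0.0}, 0, 0, 0, 3,
                            {3.42525091, 0.62391373, 0.16885540},
                            {0.15432897, 0.53532814, 0.44463454}};
    std::printf("s-orbital: total angular momentum = %d, degree of contraction = %d\n",
                s.i + s.j + s.k, s.K);
    return 0;
}
```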
Selecting basis functions is crucial as it significantly influences both the accuracy and computational cost of quantum chemical calculations [
40]. For instance, the Slater-type orbital with three Gaussian functions (often termed STO-3G) is a well-known minimal basis set representing molecular orbitals as the linear combination of three Gaussian-type orbitals [
41]. To support research and thorough evaluations, the community has developed the Basis Set Exchange (BSE) [
42], a comprehensive and convenient database of basis sets used in quantum chemistry calculations.
As the Hartree–Fock method involves the evaluation of quartet electron repulsion integrals (ERIs), we define the ERI of four Gaussian-type orbitals as follows:
$[g_a g_b | g_c g_d] = \int\!\!\int g_a(\mathbf{r}_1)\, g_b(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, g_c(\mathbf{r}_2)\, g_d(\mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2,$
where $g_a$, $g_b$, $g_c$, and $g_d$ denote Gaussian-type orbitals; such integrals are termed GTO-ERIs. Using the above-mentioned formulation, Equation (6) can be rewritten as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = \sum_{a}\sum_{b}\sum_{c}\sum_{d} d_a\, d_b\, d_c\, d_d\, [g_a g_b | g_c g_d].$
For clarity of exposition, in order to relate notions of basis sets and Gaussian-type orbitals while computing the electronic configuration of molecules,
Figure 2 shows the relation between the Basis-ERIs and the GTO-ERIs. In particular,
Figure 2 shows the combinations of four Basis-ERIs under symmetry considerations, in which the terms over the row (column) directions represent the
bra (
ket) Basis-ERIs. Each cell of the upper triangular matrix of
Figure 2 corresponds to a single Basis-ERI and is the result of the sum of GTO-ERIs, as expressed in Equation (
9).
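The structure of Equation (9) amounts to a quadruple contraction loop over primitives. The sketch below illustrates this under the assumption of a simple primitive record; gto_eri() is a stub standing in for the McMurchie–Davidson evaluation described in Section 3, not an actual integral routine.

```cuda
// Sketch of Equation (9): a Basis-ERI assembled as a contraction-weighted sum
// of GTO-ERIs. The primitive integral gto_eri() is a hypothetical stub here.
#include <cstdio>

struct Primitive { double alpha, d; };          // exponent and contraction coefficient

// Placeholder for a primitive [g_a g_b | g_c g_d] integral.
static double gto_eri(const Primitive&, const Primitive&,
                      const Primitive&, const Primitive&) {
    return 1.0;                                  // stub value for illustration only
}

static double basis_eri(const Primitive* A, int Ka, const Primitive* B, int Kb,
                        const Primitive* C, int Kc, const Primitive* D, int Kd) {
    double sum = 0.0;
    for (int a = 0; a < Ka; ++a)                 // quadruple loop over primitives
        for (int b = 0; b < Kb; ++b)
            for (int c = 0; c < Kc; ++c)
                for (int d = 0; d < Kd; ++d)
                    sum += A[a].d * B[b].d * C[c].d * D[d].d
                         * gto_eri(A[a], B[b], C[c], D[d]);
    return sum;
}

int main() {
    Primitive p[3] = {{3.4, 0.15}, {0.62, 0.54}, {0.17, 0.44}};   // toy 3-term CGF
    std::printf("Basis-ERI (stubbed primitives): %f\n",
                basis_eri(p, 3, p, 3, p, 3, p, 3));               // 3^4 = 81 GTO-ERIs
    return 0;
}
```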
3. McMurchie–Davidson Method
Our approach is centered on the efficiency frontiers and the acceleration prospects of the McMurchie–Davidson (MD) method [
10]. Thus, this section describes the preliminaries and key ideas behind the approach to efficiently compute electron repulsion integrals over Gaussian functions (GTO-ERIs) by the MD method. Generally speaking, the MD method uses Hermite Gaussian functions and auxiliary (tabulated) functions through an incremental approach, whose elegant formalism offers a high degree of versatility to parallelization schemes in GPUs.
For simplicity and without loss of generality of our exposition, let us consider the GTO-ERI $[g_a g_b | g_c g_d]$ composed of the following four Gaussian-type orbitals (GTOs), each of which neglects normalization constants:
$g_a(\mathbf{r}_1) = (x_1 - A_x)^{i_a} (y_1 - A_y)^{j_a} (z_1 - A_z)^{k_a}\, e^{-\alpha_a |\mathbf{r}_1 - \mathbf{A}|^2},$
$g_b(\mathbf{r}_1) = (x_1 - B_x)^{i_b} (y_1 - B_y)^{j_b} (z_1 - B_z)^{k_b}\, e^{-\alpha_b |\mathbf{r}_1 - \mathbf{B}|^2},$
$g_c(\mathbf{r}_2) = (x_2 - C_x)^{i_c} (y_2 - C_y)^{j_c} (z_2 - C_z)^{k_c}\, e^{-\alpha_c |\mathbf{r}_2 - \mathbf{C}|^2},$
$g_d(\mathbf{r}_2) = (x_2 - D_x)^{i_d} (y_2 - D_y)^{j_d} (z_2 - D_z)^{k_d}\, e^{-\alpha_d |\mathbf{r}_2 - \mathbf{D}|^2}.$
Following the MD method [
10], each of the GTO-ERIs is calculated as follows:
$[g_a g_b | g_c g_d] = \frac{2\pi^{5/2}}{pq\sqrt{p+q}} \sum_{t,u,v} E_t^{i_a i_b} E_u^{j_a j_b} E_v^{k_a k_b} \sum_{\tau,\nu,\phi} (-1)^{\tau+\nu+\phi}\, E_\tau^{i_c i_d} E_\nu^{j_c j_d} E_\phi^{k_c k_d}\, R_{t+\tau,\, u+\nu,\, v+\phi}.$
Furthermore, in Equation (14), $p = \alpha_a + \alpha_b$, $q = \alpha_c + \alpha_d$, $\mathbf{P} = (\alpha_a \mathbf{A} + \alpha_b \mathbf{B})/p$, and $\mathbf{Q} = (\alpha_c \mathbf{C} + \alpha_d \mathbf{D})/q$, where $\alpha = pq/(p+q)$. The recurrence formulas $E_t^{i_a i_b}$, $E_u^{j_a j_b}$, and $E_v^{k_a k_b}$ are terms related to the GTOs $g_a$ and $g_b$. For example, $E_t^{i,j}$ (with $i = i_a$ and $j = i_b$) is defined by the following recurrence equations:
$E_t^{i+1,j} = \frac{1}{2p} E_{t-1}^{i,j} + X_{PA}\, E_t^{i,j} + (t+1)\, E_{t+1}^{i,j},$
$E_t^{i,j+1} = \frac{1}{2p} E_{t-1}^{i,j} + X_{PB}\, E_t^{i,j} + (t+1)\, E_{t+1}^{i,j},$
$E_0^{0,0} = \exp\!\left(-\frac{\alpha_a \alpha_b}{p}\, X_{AB}^2\right), \qquad E_t^{i,j} = 0 \ \ \text{for} \ \ t < 0 \ \text{or} \ t > i + j,$
where $X_{PA}$, $X_{PB}$, and $X_{AB}$ denote the x-components of $\mathbf{P} - \mathbf{A}$, $\mathbf{P} - \mathbf{B}$, and $\mathbf{A} - \mathbf{B}$, respectively. Also, the terms $E_u^{j_a j_b}$ and $E_v^{k_a k_b}$ can be calculated using the coordinates y and z, respectively, and the terms $E_\tau^{i_c i_d}$, $E_\nu^{j_c j_d}$, and $E_\phi^{k_c k_d}$ can be calculated for $g_c$ and $g_d$ in the same fashion. Furthermore, although evaluating the recurrence formula $E$ is computationally trivial and inexpensive in most cases, implementing the terms of $E$ as a recursive function over a large number of basis sets is expected to contribute to the overall computational load and become a bottleneck due to the large number of computation paths. As such, in this study, in order to realize efficient and tailored computation paths across a diverse set of molecular configurations, we improved the evaluation of the term $E$ not only by expanding the recurrence relations but also by creating device functions tailored to each combination of the indices $(i, j, t)$ of $E_t^{i,j}$; as such, the overall computation load is anticipated to decrease [
43].
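For reference, the one-dimensional Hermite expansion coefficients can be evaluated with a direct recursive function, shown below as a minimal sketch consistent with the recurrences above (the paper instead expands the recurrence and generates a dedicated device function per index combination; parameter names here are illustrative).

```cuda
// Recursive reference implementation of the one-dimensional Hermite expansion
// coefficients E_t^{i,j} of the MD method (sketch only; the paper generates
// index-specific, fully expanded device functions instead).
#include <cstdio>
#include <math.h>

__host__ __device__
double hermite_E(int i, int j, int t,
                 double alpha_a, double alpha_b,
                 double Xab, double Xpa, double Xpb) {
    const double p = alpha_a + alpha_b;                 // combined exponent
    if (t < 0 || t > i + j) return 0.0;                 // coefficients vanish outside range
    if (i == 0 && j == 0 && t == 0)
        return exp(-(alpha_a * alpha_b / p) * Xab * Xab);   // E_0^{0,0}
    if (i > 0)                                          // step down in i
        return 0.5 / p * hermite_E(i - 1, j, t - 1, alpha_a, alpha_b, Xab, Xpa, Xpb)
             + Xpa     * hermite_E(i - 1, j, t,     alpha_a, alpha_b, Xab, Xpa, Xpb)
             + (t + 1) * hermite_E(i - 1, j, t + 1, alpha_a, alpha_b, Xab, Xpa, Xpb);
    return 0.5 / p * hermite_E(i, j - 1, t - 1, alpha_a, alpha_b, Xab, Xpa, Xpb)   // step down in j
         + Xpb     * hermite_E(i, j - 1, t,     alpha_a, alpha_b, Xab, Xpa, Xpb)
         + (t + 1) * hermite_E(i, j - 1, t + 1, alpha_a, alpha_b, Xab, Xpa, Xpb);
}

int main() {
    // Two s-type primitives (i = j = 0) whose centers are a distance 1.0 apart.
    double e = hermite_E(0, 0, 0, 0.5, 0.5, 1.0, 0.5, -0.5);
    std::printf("E_0^{0,0} = %f\n", e);    // expected exp(-0.25) ~ 0.7788
    return 0;
}
```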
Central to the MD method is the computation of the following recurrence relations:
$R^{(n)}_{t+1,u,v} = t\, R^{(n+1)}_{t-1,u,v} + X_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{t,u+1,v} = u\, R^{(n+1)}_{t,u-1,v} + Y_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{t,u,v+1} = v\, R^{(n+1)}_{t,u,v-1} + Z_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{0,0,0} = (-2\alpha)^{n}\, F_n(x),$
where $X_{PQ}$, $Y_{PQ}$, and $Z_{PQ}$ denote the x-, y-, and z-components of the vector $\mathbf{P} - \mathbf{Q}$. Thus, $\mathbf{P}$ is the point dividing $\mathbf{A}$ and $\mathbf{B}$ in the ratio $\alpha_b : \alpha_a$, where $P_x$, $P_y$, and $P_z$ represent the x-, y-, and z-coordinates of $\mathbf{P}$, respectively.
Furthermore, the term $F_n(x)$ in Equation (22), in which $x = \alpha\,|\mathbf{P} - \mathbf{Q}|^{2}$ and $|\mathbf{P} - \mathbf{Q}|$ is the norm of the vector $\mathbf{P} - \mathbf{Q}$, denotes the Boys function [
8], defined as follows:
$F_n(x) = \int_0^1 t^{2n}\, e^{-x t^{2}}\, dt.$
The equations from Equation (
14) to Equation (
26) follow the formulations in [
10], but have been rewritten in a different form to improve readability. Although the above-mentioned Boys function is widely used in frameworks aiming at reducing the computation of complex integrals into amenable forms, the closed-form analytical solutions are nonexistent/intractable. In this study, to realize the highly accurate estimations with utmost efficiency frontiers, we leveraged the numerical approach rendered from GPU acceleration schemes proposed by Tsuji et al. [
44]. Although the computational burden behind evaluating the Boys function is high, it is possible to attain enhanced efficiency frontiers by devising strategies to reduce the number of evaluations in special scenarios. For instance, a special case occurs when $x = 0$; thus, the Boys function becomes $F_n(0) = \frac{1}{2n+1}$; as such, the computational burden is expected to reduce significantly. The reader may note that the scenario $x = 0$ often occurs when the Cartesian centers $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ of the four GTOs used in the GTO-ERI are identical.
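The special case above can be captured with a simple guard in the evaluator. The sketch below pairs the $x = 0$ shortcut with a naive Simpson-rule quadrature for the general case, purely for illustration; it is not the GPU scheme of Tsuji et al. [44] used in this work.

```cuda
// Minimal sketch of a Boys function evaluator with the x == 0 shortcut.
// The general branch uses a naive composite Simpson rule for illustration only.
#include <cstdio>
#include <math.h>

__host__ __device__
static double boys_integrand(int n, double x, double t) {
    return pow(t, 2.0 * n) * exp(-x * t * t);
}

__host__ __device__
double boys(int n, double x) {
    if (x < 1e-14) return 1.0 / (2.0 * n + 1.0);      // special case F_n(0) = 1/(2n+1)
    const int m = 512;                                // even number of Simpson subintervals
    const double h = 1.0 / m;
    double s = boys_integrand(n, x, 0.0) + boys_integrand(n, x, 1.0);
    for (int i = 1; i < m; ++i)
        s += (i % 2 ? 4.0 : 2.0) * boys_integrand(n, x, i * h);
    return s * h / 3.0;                               // Simpson weight
}

int main() {
    std::printf("F_0(0) = %.6f (exact 1)\n", boys(0, 0.0));
    std::printf("F_0(1) = %.6f (exact ~0.746824)\n", boys(0, 1.0));
    return 0;
}
```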
5. Implementations in Graphics Processing Unit (GPU)
This section describes the implementations of the above-mentioned algorithms in Graphics Processing Units (GPUs) through the Compute Unified Device Architecture (CUDA).
Since general-purpose programming of GPU hardware became possible, CUDA has enabled the seamless implementation and exploration of parallel algorithms on GPU devices through several programming models [
45]. GPUs embed multiple streaming multiprocessors (SMs), each of which contains several processing cores.
The programming model behind CUDA uses key elements such as CUDA threads, CUDA blocks, and shared memory, among others. A CUDA thread is an individual execution instance that performs a specific task/computation. Threads are organized into CUDA blocks of up to 1024 threads, which can be executed concurrently. Furthermore, CUDA threads collaborate using shared memory, a high-speed storage area accessible to all CUDA threads within the same CUDA block. The following section presents a parallel batch technique that efficiently uses the shared memory.
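To make these notions concrete, the following toy kernel (unrelated to ERI evaluation) uses CUDA blocks of 256 threads and a shared-memory buffer to cooperatively reduce 256 values per block:

```cuda
// Toy kernel illustrating CUDA blocks, threads, and shared memory:
// each block cooperatively sums 256 values using a shared-memory buffer.
#include <cstdio>

__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];                    // visible to all threads of the block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid]; // each thread loads one element
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];       // one result per block
}

int main() {
    const int blocks = 4, threads = 256, n = blocks * threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    block_sum<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum of block 0 = %.0f (expected 256)\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```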
5.1. Triple-Buffering of Shared Memory
With the goal of reducing the overall memory usage required in the batch-based algorithm, this section proposes the concept of triple-buffering of the shared memory, which is inspired by the notion of avoiding the storage of unnecessary values from the computed recurrences. In the batch-based algorithm (Section 4.2), the values of the recurrence R for batch k are computed sequentially, starting from batch 0 to batch K. By observing the features of Equation (22) and Figure 4, one can note that only the values for batch k − 1 and batch k − 2 are needed to calculate the recurrence R for batch k. Therefore, it becomes unnecessary to store all the values of the recurrence R for the entire batch algorithm. Concretely speaking, for implementations, three buffers are prepared in the shared memory, in which each buffer stores the values computed from the recurrence R for one batch. To show the basic idea behind the triple-buffering of the shared memory, Figure 5 shows the dependencies behind computing the corresponding recurrences R for each batch under the triple-buffering scheme. By observing the dependency paths in Figure 5, computations switch between buffers. For instance, during the computation of batch k, the R values from batch k − 1 and batch k − 2, stored in two of the buffers, are used to calculate the R values for batch k, whose results are then stored in the remaining buffer. In the subsequent computation of batch k + 1, the R values from batch k and batch k − 1, stored in two of the buffers, are used to compute the R values for batch k + 1, which are then stored in the buffer that previously held the R values for batch k − 2. By repeating the above-mentioned process, all R values can be calculated efficiently using only three buffers.
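A schematic of the buffer rotation is shown below; the recurrence is a toy stand-in for Equation (22), and the per-batch buffer size is an assumed constant, but the indexing pattern (batch k overwrites the buffer last used three batches earlier) mirrors the scheme described above.

```cuda
// Schematic of triple-buffered shared memory for a recurrence in which
// batch k depends only on batches k-1 and k-2 (toy recurrence, not the
// actual MD R recurrence; buffer sizes and names are illustrative).
#include <cstdio>

#define BATCH_SIZE 128   // assumed maximum number of values per batch

__global__ void triple_buffer_demo(float* result, int K) {
    __shared__ float buf[3][BATCH_SIZE];           // three rotating batch buffers
    int t = threadIdx.x;

    buf[0][t] = 1.0f;                              // batch 0
    buf[1][t] = 2.0f;                              // batch 1
    __syncthreads();

    for (int k = 2; k <= K; ++k) {
        float* prev1 = buf[(k - 1) % 3];           // batch k-1
        float* prev2 = buf[(k - 2) % 3];           // batch k-2
        float* curr  = buf[k % 3];                 // reuses the buffer of batch k-3 (once k >= 3)
        curr[t] = 0.5f * prev1[t] + 0.25f * prev2[t];   // toy stand-in for Equation (22)
        __syncthreads();
    }
    if (t == 0) result[blockIdx.x] = buf[K % 3][0];
}

int main() {
    float* d;
    cudaMallocManaged(&d, sizeof(float));
    triple_buffer_demo<<<1, BATCH_SIZE>>>(d, 24);  // K = 24 batches
    cudaDeviceSynchronize();
    printf("value of batch K at element 0: %f\n", *d);
    cudaFree(d);
    return 0;
}
```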
The triple-buffering technique is a well-known approach for optimizing memory usage. However, in the context of the MD method, its effectiveness depends on the specific recurrence dependencies unique to this method, which are observed in Equation (
22). Our work leverages these dependencies to minimize memory requirements while preserving computational efficiency. By carefully managing buffer usage in accordance with the recurrence properties, we achieve a significant reduction in shared memory consumption, as demonstrated in our analysis.
The above-mentioned implementation requires significantly less shared memory compared to an approach that would preserve all the computed values of the recurrence R. In particular, such an algorithm would require the following size of
shared memory:
In contrast, the buffer size used in the triple-buffering method is:
Since three buffers become necessary, the required memory is three times the (constant) size of a single buffer. Additionally, the values computed from the recurrence in Algorithm 2 are retained separately, which requires additional elements of memory.
It is possible to estimate the required size of the
shared memory needed to compute all the
batches. First, we consider a straightforward approach that would store all the values rendered from the recurrences
R into the
shared memory. From
Section 4.2, the number of
R values computed in
batch k is given by
Therefore, the total number of
R values for all batches, from batch 0 to batch K, is:
On the other hand, the proposed three-buffering approach requires three buffers to store the maximum number of
R values needed for all relevant
batches k (0 ≤ k ≤ K). As such, the maximum number of
R values required for any batch
k is as follows:
In the approach mentioned above, the batch values can be computed sequentially. However, to execute Algorithm 2, the values computed from the recurrence in each batch are to be preserved while avoiding loss and overwrites, implying the requirement of a separate storage. To store all such values for batch k, the following size of
shared memory is additionally required:
In order to portray the performance differences in terms of the required size of the
shared memory,
Figure 6 compares the required memory size between the proposed triple-buffering scheme and a naive implementation that retains all values rendered from the recurrence R, over the relevant range of K. By observing the trend and behavior of the required shared
memory size in
Figure 6, we can observe the following facts:
the difference in required shared memory size is trivial in configurations with small K, whereas the smaller sizes achieved by the proposed triple-buffering scheme become noticeable in configuration profiles with larger K, and
as K increases, the memory reduction rate becomes comparatively clear and significant, achieving a reduction to 0.344 times the original memory usage, or approximately 65% less space.
The above-mentioned results portray the potential, feasibility, and versatility of implementing the proposed triple-buffering scheme across different classes of GPU devices, including those with limited shared memory capacity.
5.2. Proposed CUDA Block and Thread Mapping Strategies
In order to investigate the performance and efficiency frontiers of CUDA-enabled GPU devices, we propose the following four implementations that consider the versatile configuration of GPU thread assignments:
- BBM
Basis-Block Mapping, a GPU implementation mapping each Basis-ERI to a CUDA block.
- BTM
Basis-Thread Mapping, a GPU implementation mapping each Basis-ERI to a CUDA thread.
- SBM
Shell-Block Mapping, a GPU implementation mapping each shell-based ERI to a CUDA block.
- STM
Shell-Thread Mapping, a GPU implementation mapping each shellbased ERI to a CUDA thread.
In the following, we describe the details of the above-mentioned implementations.
5.2.1. Basis-Block Mapping (BBM)
Basis-Block Mapping (BBM) assigns GPU threads based on Algorithm 1. Concretely speaking, one CUDA block is assigned to one Basis-ERI, and the CUDA threads in that CUDA block compute the Basis-ERI by the parallel algorithm shown in Algorithm 2. The BBM assignment approach implements our previous study [
23] and serves as a key reference for further evaluations. In order to show the basic idea behind the BBM assignment mechanism,
Figure 7 shows an example of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BBM. Here, in each block, the GTO-ERIs in Equation (
9) are computed using CUDA threads with the
batch parallel algorithm presented above, one by one, exhaustively. The reader may note the relevant roles of Algorithms 1 and 2 in the construction of the components. In the BBM approach, we use the triple-buffering technique to cache the values of the recurrences
R and then store the corresponding results of the Basis-ERI calculations. Furthermore, the Basis-ERI computation is independent of other CUDA blocks; thus, no conflicts in memory access occur when the resulting values of the Basis-ERIs are stored in memory.
5.2.2. Basis-Thread Mapping (BTM)
Similarly to the approach in BBM, the Basis-Thread Mapping (BTM) computes the Basis-ERI in Algorithm 1. However, it differs from the former in that BTM assigns a Basis-ERI computation to one thread. Concretely speaking, each thread performs GTO-ERIs through Algorithm 2 serially. To show the basic idea of thread assignment behind the BTM approach,
Figure 8 illustrates a basic example of the parallel assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BTM. The reader may note the relevant roles of Algorithms 1 and 2 in the construction of the components. Compared with BBM, this thread assignment appears at first glance to be computationally inefficient due to the large amount of computation in each thread. However, when the target system is small, specifically, when the number of GTOs in the basis functions is small and/or when the azimuthal quantum number is small, BTM can utilize computational resources efficiently. Similarly to BBM, the Basis-ERI computation is independent of other CUDA threads, and no conflicts in memory access occur when storing the results.
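The difference between the two mappings reduces to how the ERI index is derived from the CUDA grid. The hypothetical skeletons below (with placeholder device functions standing in for Algorithms 1 and 2) contrast the two index computations:

```cuda
// Hypothetical skeletons contrasting the two basis-function-level mappings.
// The eval_basis_eri_* functions are placeholders; only the indexing matters here.
#include <cstdio>

__device__ void eval_basis_eri_cooperative(int /*eri_id*/) { /* all threads of the block cooperate */ }
__device__ void eval_basis_eri_serial(int /*eri_id*/)      { /* one thread evaluates the whole ERI */ }

// BBM: one CUDA block per Basis-ERI; the threads of the block cooperate.
__global__ void bbm_kernel(int num_basis_eris) {
    int eri_id = blockIdx.x;                                  // block index selects the Basis-ERI
    if (eri_id < num_basis_eris) eval_basis_eri_cooperative(eri_id);
}

// BTM: one CUDA thread per Basis-ERI; each thread works independently.
__global__ void btm_kernel(int num_basis_eris) {
    int eri_id = blockIdx.x * blockDim.x + threadIdx.x;       // global thread index
    if (eri_id < num_basis_eris) eval_basis_eri_serial(eri_id);
}

int main() {
    const int n = 1000, threads = 256;
    bbm_kernel<<<n, threads>>>(n);                            // one block per Basis-ERI
    btm_kernel<<<(n + threads - 1) / threads, threads>>>(n);  // one thread per Basis-ERI
    cudaDeviceSynchronize();
    printf("launched BBM and BTM skeleton kernels\n");
    return 0;
}
```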
5.2.3. Shell-Block Mapping (SBM)
Shell-Block Mapping (SBM) assigns GPU threads based on Algorithm 3. More specifically, one CUDA block is assigned to one shell-based ERI, and the CUDA threads in that CUDA block compute the corresponding ERI by the parallel algorithm shown in Algorithm 2. To show the basic idea behind the thread assignment in SBM,
Figure 9 shows a parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in SBM. The reader may note the relevant roles of Algorithms 2 and 3 in the construction of the components. As described in
Section 4.1, when computing the GTO-ERIs within four shells, performance can be improved by efficiently reusing previously calculated values of the Boys function. By capitalizing on this notion, the CUDA threads in each CUDA block in line 2 of Algorithm 3 compute all the values of the Boys function required for the GTO-ERI calculations in advance. Afterward, the computed values are reused in subsequent GTO-ERI calculations to avoid redundant computation. Different from BBM and BTM, the resulting value of the calculation for each CUDA block is accumulated, as shown in line 4 of Algorithm 3, implying that the same memory addresses may be accessed at the same time and thus inducing the requirement for exclusive memory access. CUDA provides the atomicAdd function, which performs the accumulation in memory atomically [
46]. Therefore, the proposed implementation capitalizes on the atomicAdd function to realize race-free parallel memory access. As such, SBM offers the advantage of significantly reducing the number of Boys function evaluations compared to BBM, yet it incurs increased memory access operations.
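The accumulation step can be illustrated with a minimal kernel in which many threads add their contributions to a single output entry via atomicAdd (toy values and a hypothetical layout; double-precision atomicAdd requires compute capability 6.0 or higher, which the A100 satisfies):

```cuda
// Minimal illustration of accumulating per-thread contributions into a shared
// result with atomicAdd, as needed when several shell-based ERI contributions
// map to the same Basis-ERI entry.
// Compile with: nvcc -arch=sm_60 (or newer) atomic_demo.cu
#include <cstdio>

__global__ void accumulate(double* basis_eri_out) {
    // Each thread stands in for one GTO-ERI (or shell-based ERI) contribution.
    double contribution = 0.001 * (threadIdx.x + 1);
    // Without atomicAdd, concurrent += on the same address would race.
    atomicAdd(&basis_eri_out[0], contribution);
}

int main() {
    double* out;
    cudaMallocManaged(&out, sizeof(double));
    *out = 0.0;
    accumulate<<<1, 256>>>(out);
    cudaDeviceSynchronize();
    printf("accumulated value: %f (expected %.3f)\n", *out, 0.001 * 256 * 257 / 2);
    cudaFree(out);
    return 0;
}
```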
5.2.4. Shell-Thread Mapping (STM)
Shell-Thread Mapping (STM) assigns each CUDA thread to one shell-based ERI in Algorithm 3. Although this strategy of GPU thread allocation appeared in [
32,
47], our implementation uses a different shell-based ERI calculation algorithm. To exemplify the basic idea of thread assignment in STM,
Figure 10 shows an example of the parallel assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in STM. The reader may note the relevant roles of Algorithms 2 and 3 in the construction of the components. Each thread is assigned to one element in line 1 of Algorithm 3. Namely, the computation of lines 2–5 of Algorithm 3, as well as the computation of Algorithm 2, is performed sequentially. Therefore, the computation cost of each thread becomes larger than in the above mappings. Following the same motivation as in SBM, we capitalize on the merits of the
atomicAdd function to enable the cumulative computations in memory exclusively.
5.3. Sorting of ERIs
In order to further enhance the computational efficiency of the proposed algorithms, this section describes the mechanisms to ensure effective parallelism of the implemented algorithms. BTM and STM are designed to perform distinct ERI computations as thread-based mappings. One of the key performance considerations in thread-based implementations lies in the ability to avoid the
warp divergence phenomenon [
46], which occurs when threads within a
warp (a group of 32 threads) take different computation paths, causing the serial execution of computation paths and thus the inefficient use of GPU resources. As such, an effective strategy to minimize the occurrence of the
warp divergence lies in the notion of ordering and grouping similar ERI calculations within a warp to ensure threads follow identical computation paths, thus improving the overall effective parallelism.
Ordering similar ERIs to improve GPU resource utilization is a commonly used strategy in ERI computations. In this study, we also adopt this approach to reduce warp divergence and enhance computational efficiency. While this technique itself is not novel, it is an essential step to ensure that the proposed GPU implementation operates efficiently.
5.3.1. Sorting of Basis-ERIs
Here, we describe the mechanism to sort Basis-ERIs considering the following factors:
The number of GTOs in each basis function determines the number of GTO-ERIs within the Basis-ERI. For instance, when computing a Basis-ERI whose four basis functions each contain three GTOs, the total number of GTO-ERIs is $3^4 = 81$. These four GTO counts have a direct influence on the behavior of lines 3–5 in Algorithm 1. Consequently, to optimize performance and minimize the occurrence of warp divergence, it is highly desirable to place Basis-ERIs with the same numbers of GTOs adjacent to each other.
On the other hand, the azimuthal quantum numbers of each GTO determine the behavior of the MD method in the calculation of the GTO-ERI. Thus, the azimuthal quantum numbers have a direct influence on the behavior of Algorithm 2, leading to significant losses if similar values are not clustered together.
In this study, we use a 64-bit key, as shown by
Figure 11, to sort Basis-ERIs. Concretely speaking, a 64-bit key is assigned to each Basis-ERI and sorting is performed based on the key. The upper 28 bits of the key are used to represent the number of GTOs in each basis function, whereas the lower 36 bits are used to represent the azimuthal quantum numbers of each GTO.
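A possible construction of such a key is sketched below; the split of the 28-bit region into four 7-bit GTO counts and of the 36-bit region into four 9-bit angular-momentum fields is an assumption for illustration, not necessarily the exact layout of Figure 11.

```cuda
// Illustrative construction of a 64-bit sort key: the upper 28 bits encode the
// four GTO counts (7 bits each, an assumed split) and the lower 36 bits encode
// the azimuthal quantum numbers of the four GTOs (9 bits each, also assumed).
#include <cstdint>
#include <cstdio>
#include <algorithm>
#include <vector>

static uint64_t make_key(const int gto_count[4], const int ang_mom[4]) {
    uint64_t key = 0;
    for (int i = 0; i < 4; ++i)                     // upper 28 bits: GTO counts
        key = (key << 7) | (uint64_t)(gto_count[i] & 0x7F);
    for (int i = 0; i < 4; ++i)                     // lower 36 bits: azimuthal quantum numbers
        key = (key << 9) | (uint64_t)(ang_mom[i] & 0x1FF);
    return key;
}

int main() {
    int counts_a[4] = {3, 3, 3, 3}, l_a[4] = {0, 0, 1, 1};
    int counts_b[4] = {3, 3, 3, 3}, l_b[4] = {0, 0, 0, 0};
    std::vector<uint64_t> keys = {make_key(counts_a, l_a), make_key(counts_b, l_b)};
    std::sort(keys.begin(), keys.end());            // groups similar Basis-ERIs together
    std::printf("first key after sorting: 0x%016llx\n", (unsigned long long)keys[0]);
    return 0;
}
```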
5.3.2. Sorting of Shell-Based ERIs
The computation paths for shell-based ERIs are determined by the type of shells involved (
Section 4.1). For instance, the computation of a shell-based ERI proceeds sequentially over its constituent GTO-ERIs in a fixed order. In addition, since each calculation is a GTO-ERI, and all the azimuthal quantum numbers involved are the same, the sorting of the shell-based ERIs is based on the types of the four shells.
6. Performance Evaluation
In this section, in order to evaluate the performance frontiers of the proposed algorithms, we report computational experiments involving the proposed implementations of the McMurchie–Davidson method for shell-based ERI computations of molecules of interest. Thus, this section describes our observations and findings.
The experiments used an NVIDIA A100 Tensor Core GPU and an Intel Xeon Gold 6338 CPU. We measured the ERI computation times for the four GPU implementations presented in
Section 5. The number of threads per CUDA block was set to 256. The number of CUDA blocks launched was determined as follows: for BBM, the number of CUDA blocks was equal to the number of Basis-ERIs; for SBM, it corresponded to the number of shell-based ERIs. In the cases of BTM and STM, the number of CUDA blocks was set to 1/256 of those used in BBM and SBM, respectively. Considering the above-mentioned points, we conducted the following two experiments:
Experiment 1: In this experiment, our goal was to evaluate the performance frontiers (execution time) of ERI calculations in monatomic molecules. As such, we used monatomic hydrogen as the molecule of interest, and evaluated through five types of basis function sets defined in the correlation-consistent basis sets [
48].
Experiment 2: In this experiment, our goal was to evaluate the performance frontiers (execution time) of ERI calculations in polyatomic molecules. As such, benzene (C6H6), naphthalene (C10H8), and copper oxide (CuO) were selected as the polyatomic molecules of interest. Furthermore, we used the STO-3G [
41] and 6-31G** [
49] basis sets.
While direct comparisons with other state-of-the-art approaches could be insightful, many existing software packages incorporate acceleration techniques, such as screening, to enhance performance. These optimizations significantly modify the computational workflow, making a straightforward comparison with our method difficult. Therefore, rather than making a direct comparison, this study evaluates the efficiency of the proposed parallelization strategies within a controlled computational framework.
One of the key motivations behind conducting the above-mentioned experiments lies in the structural differences between the central coordinates of the basis functions of monatomic and polyatomic molecules. Whereas the central coordinates are identical in monatomic molecules, they are different in polyatomic molecules. Concretely speaking, the parameter
x of the Boys function in the GTO-ERI calculations of monatomic molecules is always zero, leading to $F_n(0) = \frac{1}{2n+1}$, as described in
Section 3. The above context results in a significant simplification of the Boys function evaluation, which is a computational bottleneck in the GTO-ERI calculations.
In order to show a glimpse of the performance frontiers of the proposed algorithms,
Table 3 shows the results of Experiment 1. Here, following the observations of
Section 2, the number of Basis-ERIs in Table 3 is calculated from the relation $\frac{1}{2}\left(\frac{M(M+1)}{2}\right)\left(\frac{M(M+1)}{2}+1\right)$, where M is the number of basis functions. Similarly, the number of shell-based ERIs in Table 3 is calculated from the relation $\frac{1}{2}\left(\frac{N(N+1)}{2}\right)\left(\frac{N(N+1)}{2}+1\right)$, where N is the number of shells. The results indicate that, for monatomic molecules, the shell-based methods performed slower than the basis-function-based methods. The poor performance can be attributed to two reasons.
First, as mentioned earlier, the computation of the Boys function can always be simplified to $F_n(0) = \frac{1}{2n+1}$
in the context of GTO-ERIs of monatomic molecules. As a result, the Boys function evaluation does not pose a significant computational overhead. Consequently, the key advantage of shell-based methods in reducing the number of Boys function evaluations is diminished. Instead, as discussed in
Section 5.2.3, the increased memory access in the shell-based method leads to longer execution times.
Second, by looking at the number of shells presented in
Table 3, the reader may easily note that for each basis function, the number of shells with the largest azimuthal quantum number is always one. For example, when using the cc-pV6Z basis set, there is only one h-shell. In this scenario, the highest computational cost per shell-based ERI arises from the shell-based ERI involving four h-shells; however, there is only one such shell-based ERI. Therefore, the SBM (STM) algorithm deals with such computation by a single CUDA block (CUDA thread).
In contrast, by observing the results from
Table 3, the number of basis functions with the largest azimuthal quantum number is 21. Hence, the number of Basis-ERIs formed by four h-orbitals can be calculated accordingly; thus, the BBM (BTM) algorithm uses 28,014 CUDA blocks (CUDA threads) for parallel computation. As a result, the shell-based algorithms fail to achieve effective parallelism for certain ERI computations compared to the basis function-based algorithms. The lack of effective parallelism becomes a computational bottleneck, causing the shell-based algorithms to underperform.
Additionally, the implementations that assign one thread per ERI (BTM, STM) were found to be slower than the implementations that assign one block per ERI (BBM, SBM) in most cases. This observation is particularly clear and pronounced between the performance of SBM and STM. Similarly to the above-mentioned insights, we argue that the lack of effective parallelism is one of the key reasons explaining the diminished performance of schemes based on one thread per ERI (BTM, STM).
Under similar conditions, the SBM algorithm computes the shell-based ERI involving four h-shells using one CUDA block (i.e., 256 threads), whereas the STM algorithm performs the same computation using only a single CUDA thread. Consequently, the computation time for such ERIs increases significantly, creating a computational bottleneck and resulting in decreased performance.
We also compare our proposed implementations with a (naive) CPU implementation (AMD EPYC 7702 processor), in which the Basis-ERIs are evaluated sequentially using the recursive formulae described in
Section 3. To show the performance frontiers of the CPU-based implementations,
Table 4 shows the obtained results. By observing the results in
Table 4, we achieved a speedup of up to 4500×; however, in certain scenarios, the GPU implementation was slower than the CPU implementation. The above-mentioned observations indicate that, in those cases, the overhead associated with parallelization outweighed the potential speedup achieved through GPU acceleration.
In order to show the performance frontiers in polyatomic molecules,
Table 5 presents the results of Experiment 2. The results indicate that the thread-based methods (BTM and STM) achieved up to a 72× speedup compared to the block-based methods. However, depending on the molecule, the thread-based methods occasionally exhibited slower performance, with the worst case running at 0.21× the speed of the block-based counterpart. By comparing the results of BBM and BTM, BTM consistently outperformed BBM, suggesting that the benefit of increased parallelization of Basis-ERI calculations in the BTM implementation outweighed the advantage of
batch parallelization provided by the BBM scheme. Comparing the results of SBM and STM, a similar observation applies to benzene and naphthalene. However, in the case of copper oxide, STM exhibited slower performance.
The above-mentioned observations can be attributed to the lack of effective parallelism in shell-based ERI calculations, as discussed in the analysis of Experiment 1, in which the STM implementation is seen to be more affected by such limitation. Lastly, examining the results of BTM and STM, the speedup for naphthalene ERI calculations using the 6-31G** basis set was particularly notable. This observation suggests that for sufficiently large molecules, shell-based ERI calculations may offer a greater advantage.
In the same manner as the investigations in Experiment 1, we compared the performance of the four proposed methods against those of the (naive) CPU implementations. To show the obtained results,
Table 6 presents a comparison of the computation times of ERIs for polyatomic molecules. By observing the obtained results in
Table 6, we note a maximum speedup of 1112×. Moreover, unlike in Experiment 1, the proposed algorithms outperformed the CPU implementations in all cases. The enhanced performance can be attributed to the parallelization benefits gained when computing the computationally expensive Boys functions of polyatomic molecules. Consequently, the parallelization merits of the GPU implementations become more pronounced in such scenarios.