1. Introduction
Quantum chemistry provides the building blocks and first principles needed to derive the molecular and physical properties arising from (sub)atomic interactions. As such, the field has been able to predict and explain the formation of electronic structures and chemical reactions across science, engineering, and medical applications. One of the fundamental challenges in the realization of practical quantum chemistry lies in solving the Schrödinger equation [
1] to compute the electronic orbitals representing molecular properties:
$\hat{H}\Psi = E\Psi,$
where $\hat{H}$ is the Hamiltonian operator representing the total energy of the molecular system (kinetic plus potential energy), $\Psi$ denotes the wave function encoding the quantum state of the electrons within the molecule (electron kinematics), and $E$ is the energy eigenvalue.
The central aim of quantum chemistry is to find solutions to atomic and molecular Schrödinger equations. Yet, as molecules of practical interest are typically polyatomic, involving a large number of nuclei and electrons, solving the Schrödinger equation analytically becomes intractable due to the large number of (sub)atomic interactions (the many-body problem). Using reasonable assumptions on electron kinematics, approximate schemes, such as the Hartree–Fock method [
2,
3], have become attractive to estimate energy states and the overall electronic structures [
4]. The Hamiltonian operator $\hat{H}$ of an N-electron system is as follows:
$\hat{H} = \sum_{i=1}^{N} \hat{h}(i) + \sum_{i=1}^{N} \sum_{j>i}^{N} \hat{v}(i,j),$
where $i$ and $j$ index the electrons, $\hat{h}(i)$ is a one-electron operator representing the kinetic energy of electron $i$ and the electron–nucleus Coulomb attraction, and $\hat{v}(i,j)$ is the two-electron operator representing the electron–electron repulsion. The minimization of the expectation value of $\hat{H}$ leads to the Hartree–Fock equations:
$\hat{F}\phi_i = \varepsilon_i \phi_i, \qquad i = 1, \ldots, N,$
where $\hat{F}$ is the Fock operator.
Solutions to the above (3) render the orbital functions $\phi_i$ and the orbital energies $\varepsilon_i$. To tackle the above, the Self-Consistent Field (SCF) method, introduced by Hartree [2,5,6], is often used to estimate the orbital functions $\phi_i$ iteratively. By assuming an approximate initial guess $\phi_i^{(0)}$ and replacing the corresponding operator $\hat{F}^{(0)}$ in (3), one obtains the following:
$\hat{F}^{(0)}\phi_i^{(1)} = \varepsilon_i^{(1)}\phi_i^{(1)},$
in which the first solutions are computable from a standard eigenvalue problem. The first solutions $\phi_i^{(1)}$ and $\varepsilon_i^{(1)}$ are expected to be better than the initial guess. Afterward, a subsequent set of solutions is obtained after computing the operator $\hat{F}^{(1)}$ from the previous solutions, and the procedure is repeated until convergence. In the above, the current guess is used to compute the potential field felt by each electron. As such, the Hartree–Fock method assumes a single-electron (mean-field) configuration, in which each electron is subject to the average potential of the neighboring electrons.
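To make the fixed-point structure of the SCF procedure concrete, the following toy sketch iterates a stand-in update until self-consistency is reached; scf_step() is a hypothetical placeholder for "build the Fock operator from the current orbitals and solve the resulting eigenvalue problem", not actual quantum chemistry machinery.

```cuda
// Toy illustration of the SCF fixed-point loop. scf_step() is a hypothetical
// placeholder for "build F^(k) from the current orbitals, then solve F phi = eps phi";
// here a simple contraction mapping is used so the iteration visibly converges.
#include <cmath>
#include <cstdio>

struct Guess { double value; };                 // stand-in for the orbital set {phi_i}

static Guess scf_step(const Guess& prev) {
    return { 0.5 * std::cos(prev.value) };      // placeholder for F^(k) -> phi^(k+1)
}

int main() {
    Guess phi{1.0};                             // initial guess phi^(0)
    for (int it = 0; it < 100; ++it) {
        Guess next = scf_step(phi);             // one SCF iteration
        double change = std::fabs(next.value - phi.value);
        phi = next;
        std::printf("iteration %2d, change %.3e\n", it, change);
        if (change < 1e-10) break;              // self-consistency reached
    }
    return 0;
}
```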
A relevant procedure is evaluating the potential encoded through a corresponding basis set and integrals. Often, solutions of the orbital functions $\phi_i$ at each step of the SCF procedure are expressed as a linear expansion of M basis functions:
$\phi_i = \sum_{\mu=1}^{M} C_{\mu i}\, \chi_\mu,$
where M is the number of basis functions, and $C_{\mu i}$ is the expansion coefficient. Furthermore, the Hartree–Fock method solves the set of integrals, often termed the
two-electron repulsion integrals (Basis-ERIs), encoding the Coulombic repulsion interactions between electron pairs across different orbitals. Several basis functions have been explored in the field; popular ones include the Slater basis functions, which exhibit cusps at the nucleus and exponential decay [
7], the primitive Gaussian functions suggested by Boys in the 1950s [
8], and the Gaussian functions with contraction coefficients and exponents, also known as Contracted Gaussian Functions (CGFs) [
9]. When the Hartree–Fock machinery uses Gaussian basis functions, the molecular orbital functions $\phi_i$ are expressed as a linear expansion of Gaussian-type orbitals (GTOs). A GTO is a mathematical function used to approximate atomic orbitals in quantum chemistry calculations. It takes the form of a Gaussian function, which allows for efficient computation of integrals due to its simple analytical properties. Unlike Slater-type orbitals (STOs), which more accurately describe the exponential decay of atomic orbitals, GTOs facilitate faster integral evaluations. As a result, GTOs have been widely adopted in quantum chemistry studies. As such, wave functions become computable from linear combinations of GTOs, and the computation of Basis-ERIs is reduced to the problem of computing GTO-ERIs.
1.1. Related Works
For systems with many electrons, the Hartree–Fock method becomes computationally expensive due to the challenge of solving a large number of integrals, e.g., Basis-ERIs require the computation of every quartet of basis functions representing the molecule of interest. In order to realize the efficient computation of electron repulsion integrals over Gaussian functions (GTO-ERIs), various algorithms have been proposed. Two key algorithms in the field are the well-known McMurchie–Davidson method (MD) [
10] and the Obara–Saika method (OS) [
11]. Inspired by the idea of using Cartesian Gaussian Functions [
8], the MD method computes molecular integrals through Hermite Gaussians, expressed in compact form by auxiliary functions, and through recursive relations that enable efficient calculation of higher angular momenta. Also, the MD method, due to its relatively simple mathematical structure, offers greater generality and extensibility compared to other approaches. The OS method uses the Gaussian Product theorem to derive recurrence equations that simplify composite integrals. Several other efficient schemes exist, such as the Pople–Hehre (PH) method [
12], which uses a standard orientation of basis functions sharing common computable information, allowing coordinate components to become zero or constant and decreasing the computational load of the corresponding integrals, and the Dupuis–Rys–King (DRK) method [
13] using root search of (orthogonal) Rys polynomials, often termed the Rys quadrature method [
14], yielding representations for basis functions of arbitrarily high angular momentum. Another approach is the Head–Gordon–Pople method (HGP) [
15] that (1) extends the recurrence relations from OS to consider the shifting angular momentum, and (2) benefits from efficient evaluation of integrals by evaluating basis functions sharing the same information (coefficients and exponents), by evaluating similar integrals simultaneously, and by grouping similar atomic orbitals. The PH method is particularly suitable for calculations involving high-angular-momentum systems, whereas the HGP method is more efficient for systems with low-angular-momentum functions. Also, the PRISM method [
16] capitalizes on the trichotomy between angular momentum and degree of contraction in ERIs to find tailored computation paths. As such, PRISM allows contraction steps to be performed precisely where needed and realizes a degree of flexibility that covers a wide range of ERI classes. The reader may find a further detailed description of the above in [
17].
Computing the electronic structure configurations through Hartree–Fock methods is amenable to parallelization and acceleration schemes by Graphics Processing Units (GPUs). As such, the active developments of new GPU architectures and the concomitant Compute Unified Device Architecture (CUDA) have enabled several Hartree–Fock estimation schemes to capitalize on the merits of GPU hardware.
The community has made several attempts at GPU acceleration of the MD method [
18,
19,
20,
21,
22,
23]. In 2008, Ufimtsev and Martinez [
18] were among the first to accelerate the MD algorithm through tailored forms of stream processing on GPUs and by evaluating specific mappings between threads and integral computation, e.g., one thread per contracted integral and one thread per primitive integral. The resulting algorithm achieved a 130× speedup over a CPU implementation.
Also, a year later in 2009, Ufimtsev and Martinez [
19] extended their acceleration schemes by (1) incremental computations of the Fock matrix throughout successive iterations, allowing reduction of the number of computable elements of the Fock matrix, in particular, in late iterations when convergence is near, (2) double screening of integrals: in a first stage,
ket and
bra pairs become ordered by angular momentum, and at the second stage, such pairs become ordered within each angular momentum according to their contribution to the Schwarz upper bound, enabling effective parallelization and improved load balancing, and (3) computation of the Fock matrix components by blocks and by following the ordering of the integral grid while ignoring the integrals with a Schwarz bound below a threshold. The extended algorithm achieved a speedup of up to 650× compared to a CPU implementation. However, the implementation in [
19] is restricted to s- and p-type Gaussian basis functions and assumes that the communication time between the GPU and CPU is negligible compared to the total time.
In 2012, Titov et al. [
20] proposed automated code generation to accelerate the MD method by GPU and CPU kernels for integral evaluation. The generated kernels are expressed in symbolic form and are subject to algebraic transformations in the Maple CAS system. The resulting framework was implemented in TeraChem, a popular tool in quantum chemistry, rendering kernels of about 2000 lines to support up to
d angular momentum. Although the approach by Titov et al. [
20] has the potential to explore diverse kernel configurations for specific GPU hardware, the selection of a suitable kernel follows empirical testing, resembling a trial-and-error approach. Furthermore, the challenge of fully automating kernel generation and optimization arises from the limitations faced by compilers, which are often constrained by both compilation time and the availability of source code. This makes it impractical to scale such approaches to accommodate large angular momenta.
In 2014, Yasuda and Maruoka [
21] proposed an acceleration scheme for MD through a calculation order for electron repulsion integrals (ERIs), and a method using multiple cooperating threads to reduce register usage. Computational experiments on CPUs and GPUs showed that the approach significantly accelerated the computation by a factor of four for transition metal clusters. Although the method proposed by Yasuda and Maruoka [
21] enables finding calculation orders that minimize the intermediate ERIs that need to be simultaneously retained, scalability to higher angular momentum is still a challenge as minimizing register spills in ERI calculations is a nonpolynomial time search problem, allowing for only approximate solutions. Whereas compilers can effectively address simpler cases, the limited scalability to high-angular-momentum ERI calculations limits the overall effectiveness.
In 2017, Kalinowski et al. [
22] proposed the GPU acceleration of the MD method for arbitrary angular momentum. The approach is mainly oriented towards performing integral pre-screening (integral list generation) on the CPU side by the ORCA package, followed by CPU-to-GPU memory transfers of the computed lists. As such, the computational load is distributed so that each GPU thread calculates a single integral batch at a time. The acceleration scheme achieved speedups of up to a factor of 30 relative to serial and parallel CPU implementations. Although the approach is presented for arbitrary angular momentum, aspects such as the complexity of high angular momentum restrict the practical usage of the GPU code to f-type integrals.
Furthermore, since the Gaussian basis function proposed by Boys in the 1950s [
8] was coupled with the DRK method in [
24] (termed UM09), several enhanced approaches have been proposed: the GPU accelerations of the DRK, also termed the Rys quadrature method [
25,
26,
27], the GPU acceleration of the HGP method and its adaptation up to the
f-type ERIs [
28], the multi-GPU implementations of the Rys quadrature [
29], the acceleration of the HGP method benefiting from permutational symmetries and dynamic load balancing schemes [
30], the optimized code generator via a heuristic and sieve method for HGP-OS implementing efficient integral screening and symmetry [
31] (termed Brc21), the multinode multi-GPU implementations [
32], the matrix multiplication form of the MD method on NVIDIA V100 GPU [
33], and the GPU implementation for PYSCF up to
g functions through the Rys quadrature [
34]. The above-mentioned algorithms compute the GTO-ERIs by repeatedly evaluating recurrence relations, which become deeper and more computationally expensive as the azimuthal quantum number of the basis functions increases. Here, UM09 and Brc21 were reported to be promising throughout practical molecular benchmarks, yet the two approaches differ in principle: whereas Brc21 exploits ERI symmetry and optimizes the computation per ERI, UM09 benefits from dividing the exchange computations and performing efficient screening. Furthermore, although several algorithms have been proposed to efficiently compute Basis-ERIs with high azimuthal quantum numbers [
32,
35,
36,
37], most of the existing algorithms limit the supported values of the azimuthal quantum number. For instance, the study by Johnson et al. [
32] supports basis functions with azimuthal quantum numbers up to 4, generating optimized code for each specific case, and the hybrid between UM09 and Brc21 for multi-GPU domains has been evaluated with up to
f-type Gaussian functions [
38].
1.2. Contributions
Our previous related work [
23] investigated the acceleration of the McMurchie–Davidson (MD) method and developed a parallel algorithm optimized for efficient GPU implementation. This paper builds upon [
23] and further extends the GPU acceleration frameworks. Our previous study [
23] successfully improved the evaluation efficiency of the recurrence relation (Equation (
22)) by employing a batch-based computational algorithm. However, two major issues remained, which we address in this study:
Although the batch-based approach enhanced the efficiency of evaluating the recurrence relation (Equation (
22)), there was computational redundancy, as the Boys function needed to be evaluated every time a GTO-ERI within a Basis-ERI was computed.
In the previous implementation, each ERI calculation was assigned to a single CUDA block. This assignment strategy led to inefficient utilization of computational resources, particularly for ERI calculations with low computational cost, where the available computational resources within a CUDA block could not be fully utilized, increasing execution time.
The related approaches in MD acceleration, as shown in
Table 1, focus on different aspects/configurations of GPU-enabled MD acceleration. For instance, Ufimtsev and Martinez [
18,
19] focus on the most efficient computation of the Fock matrix and the effective screening of integrals, Titov et al. [
20] focus on the automated generation of (symbolic) kernels to evaluate integrals, each of which can be tailored to GPU hardware, Yasuda and Maruoka [
21] focus on how to find an optimal order of ERIs by (recursive) search, and how to enable the cooperating threads to reduce register use, and Kalinowski et al. [
22] focus on (serial) screening by CPU and the thread-based integral evaluations. Compared to the aforementioned related works (as summarized in
Table 1), and departing from particular aspects/configurations of MD acceleration, our approach is a generalized algorithm capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms can be stored. To this end, our approach capitalizes on the use of shells, batches, and triple-buffering of the shared memory, and efficient block and thread mapping strategies in CUDA, thus exploring the overall effective parallelism frontiers. The following points briefly describe the key contributions of this paper.
1.2.1. Batch-Based ERI Computation
Our approach capitalizes on the amenability of parallelization through shells (a feature of basis functions) and batches (having the role of partitioning independent computable groups). Basically, the proposed algorithm divides the components of the recurrence relation of the MD method into K + 1 batches, where K is a parameter determined by the sum of the azimuthal quantum numbers in the four GTOs used in the GTO-ERI. As such, the proposed approach enables efficient ERI computation with a unified code structure applicable to any input.
1.2.2. Triple-Buffering of Shared Memory
As a standard batch-based algorithm would require storing the computed elements of the recurrence sequentially, the overall storage requirements become impractical for large K. The value of K at which storage becomes impractical depends on the available GPU memory. For example, on the A100 Tensor Core GPU used in our experiments, storing all the computed elements of the recurrence becomes infeasible when K exceeds 24. As such, aiming at reducing the overall memory usage required in the batch-based algorithm, this paper proposes the triple-buffering of shared memory, which takes advantage of the property of the batch algorithm that, during the computation of the elements in batch k, only the elements from batches k − 1 and k − 2 are required. By cyclically switching between three shared memory buffers, the algorithm overwrites memory containing values that are no longer needed for subsequent computations. Our estimations show that the proposed approach successfully reduces shared memory usage by up to 65% compared to a standard approach that would preserve all elements in the shared memory. Furthermore, as K increases, the memory reduction rate becomes comparatively significant.
1.2.3. Block and Thread Mapping Strategies
Using the CUDA framework, we implemented four GPU parallel thread mapping methods to maximize computational efficiency and GPU resource utilization. A brief description of each method is given below.
- BBM
Basis-Block Mapping: each CUDA block is assigned to a Basis-ERI, and all CUDA threads within a CUDA block compute a Basis-ERI in parallel. This implementation is identical to the one proposed in our previous study and is used as the baseline for comparison.
- BTM
Basis-Thread Mapping: each CUDA thread is assigned to a Basis-ERI, with each CUDA thread independently evaluating a Basis-ERI.
- SBM
Shell-Block Mapping: each CUDA block is assigned to a shell-based ERI, and all threads within the CUDA block collaboratively compute a shell-based ERI in parallel.
- STM
Shell-Thread Mapping: each CUDA thread is assigned to a shell-based ERI, with each CUDA thread independently evaluating a shell-based ERI.
The definitions of Basis-ERI and shell-based ERI are provided in
Section 2 and
Section 4.1, respectively.
1.2.4. Computational Experiments
We assessed the computational efficiency of ERI calculations for various molecules using the above-mentioned four proposed GPU implementations. Experiments using an NVIDIA A100 Tensor Core GPU demonstrated substantial performance improvements. In particular, the proposed method achieved up to a 72× speedup compared to our previous GPU implementation. Furthermore, when benchmarked against a naive CPU implementation on an AMD EPYC 7702 processor (AMD, Santa Clara, CA, USA), it achieved a maximum speedup of 4500×. Our findings underscore the effectiveness of our approach in significantly accelerating ERI calculations for both monoatomic and polyatomic molecules.
1.3. Paper Organization
The structure of this paper is organized as follows:
Section 2 describes the Basis-ERIs, the primary focus of this study,
Section 3 overviews the McMurchie–Davidson method (referred to as MD throughout the paper), a well-known approach for computing GTO-ERIs,
Section 4 describes our proposed approach to accelerate the computation of MD, followed by the discussion of
Section 5 formulating relevant strategies to implement the proposed algorithm in GPUs.
Section 6 describes and discusses the results obtained through rigorous computational experiments involving ERI calculations of representative different molecular systems. Finally,
Section 7 concludes our paper by summarizing the key findings of this study.
2. Two-Electron Repulsion Integrals
This section describes the preliminaries behind the computation of two-electron repulsion integrals (termed Basis-ERIs). Concretely speaking, a Basis-ERI is defined as a double integral over the entire space involving four basis functions $\chi_\mu$, $\chi_\nu$, $\chi_\lambda$, and $\chi_\sigma$, as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = \int\!\!\int \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\,\chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2,$
where
- $(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma)$: notation often used in quantum chemistry to denote a Basis-ERI.
- $(\chi_\mu \chi_\nu|$: bra term.
- $|\chi_\lambda \chi_\sigma)$: ket term.
- $\mathbf{r}_1$, $\mathbf{r}_2$: position vectors representing the three-dimensional coordinates (x, y, z) of the two electrons.
- $|\mathbf{r}_1 - \mathbf{r}_2|$: norm of the vector $\mathbf{r}_1 - \mathbf{r}_2$ (Euclidean distance) representing the Coulomb interaction between the two electrons.
- $\chi$: notation for a basis function.
In the above-mentioned formulation, the Basis-ERIs are to be computed for every combination of four basis functions. Therefore, if one considers M basis functions, the reader may note that the total number of combinations of Basis-ERIs is $M^4$. For instance, when one considers a molecular system with basis functions $\chi_1$ and $\chi_2$, thus $M = 2$, one is to compute $2^4 = 16$ Basis-ERIs. In order to illustrate the above-mentioned configuration, Figure 1a shows the matrix-based configurations of the possible interactions between the two basis functions $\chi_1$ and $\chi_2$, in which the elements in the column direction correspond to the bra terms, whereas the elements in the row direction correspond to the ket terms. Also, each cell in the matrices of Figure 1a represents a single Basis-ERI.
One can benefit from the symmetry implications in Equation (6), as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = (\chi_\nu \chi_\mu | \chi_\lambda \chi_\sigma) = (\chi_\mu \chi_\nu | \chi_\sigma \chi_\lambda) = (\chi_\nu \chi_\mu | \chi_\sigma \chi_\lambda) = (\chi_\lambda \chi_\sigma | \chi_\mu \chi_\nu) = (\chi_\sigma \chi_\lambda | \chi_\mu \chi_\nu) = (\chi_\lambda \chi_\sigma | \chi_\nu \chi_\mu) = (\chi_\sigma \chi_\lambda | \chi_\nu \chi_\mu).$
As a result, the number of Basis-ERIs that require computation is reduced to
$\frac{1}{2}\left(\frac{M(M+1)}{2}\right)\left(\frac{M(M+1)}{2}+1\right) \approx \frac{M^{4}}{8}.$
In order to illustrate the merits of the symmetry relations,
Figure 1b shows the possible combinations under symmetry considerations. By observing
Figure 1a,b, the reader may note that, by including symmetry in the formulation of Basis-ERIs, it becomes possible to reduce the total number of Basis-ERIs, as shown by the upper triangular matrix in
Figure 1b. As such, the overall Basis-ERI calculation determines each element in the upper triangular matrix of
Figure 1b.
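As a quick check of the reduction above (assuming the standard eight-fold permutational symmetry of real Gaussian basis functions), the M = 2 case illustrated in Figure 1 works out as follows:

```latex
% Worked example of the symmetry reduction for M = 2 basis functions
\[
  M^{4} = 2^{4} = 16 \ \text{Basis-ERIs without symmetry}, \qquad
  \frac{M(M+1)}{2} = 3 \ \text{unordered pairs}, \qquad
  \frac{1}{2}\cdot 3\cdot(3+1) = 6 \ \text{unique Basis-ERIs}.
\]
```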
Furthermore, in quantum chemistry, the basis function $\chi_\mu(\mathbf{r})$ is often expressed as a linear combination of Gaussian basis functions as follows:
$\chi_\mu(\mathbf{r}) = \sum_{s=1}^{K} d_s\, g_s(\mathbf{r}),$
where $g_s$ denotes the Gaussian-type orbitals (GTOs), $K$ denotes the number of terms in the linear combination, and $d_s$ denotes the coefficients of the linear combination. In the above, the Gaussian-type orbital $g(\mathbf{r})$ is defined as follows:
$g(\mathbf{r}) = N\,(x - R_x)^{i}\,(y - R_y)^{j}\,(z - R_z)^{k}\,\exp\!\left(-\alpha\,|\mathbf{r} - \mathbf{R}|^{2}\right),$
where $\mathbf{R} = (R_x, R_y, R_z)$ represents the orbital center (Cartesian coordinate center), $N$ is a normalization constant, $i$, $j$, and $k$ are non-negative integers that determine the orbital shape, and $\alpha$ is a parameter representing the spread of the orbital. In other words, a basis function consists of a set of Gaussian-type orbitals with the same center $\mathbf{R}$ and the same azimuthal quantum numbers $i$, $j$, $k$, but different coefficients $d$ and parameters $\alpha$. The orbitals are often named based on the sum of the azimuthal quantum numbers $i + j + k$. For example, an orbital with $i + j + k = 0$ is classified as an s-orbital, which has a spherical shape. Likewise, $i + j + k = 1, 2, 3$ render the p-, d-, and f-orbitals, respectively [
39]. Furthermore, within the context of contracted Gaussian basis functions (CGFs), $K$ is the degree of contraction, and $i + j + k$ denotes the total angular momentum.
The above-described parameters are uniquely determined by both the types of basis functions and the number of atoms.
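For illustration, a contracted Gaussian basis function can be represented by a simple record holding its center, Cartesian exponents, and primitive data. The layout below is a hypothetical sketch (field names and the MAX_CONTRACTION bound are assumptions, not the paper's data structures); the numeric values are close to the commonly tabulated STO-3G hydrogen 1s parameters, and the Basis Set Exchange should be consulted for authoritative data.

```cuda
// Illustrative data layout for a contracted Gaussian basis function (CGF).
// Field and type names are hypothetical, not taken from the paper's code.
#include <cstdio>

constexpr int MAX_CONTRACTION = 16;   // assumed upper bound on the degree of contraction

struct ContractedGaussian {
    double center[3];                 // orbital center R = (Rx, Ry, Rz)
    int    i, j, k;                   // Cartesian exponents; i + j + k = total angular momentum
    int    K;                         // degree of contraction (number of primitive GTOs)
    double alpha[MAX_CONTRACTION];    // primitive exponents (orbital spread)
    double d[MAX_CONTRACTION];        // contraction coefficients
};

int main() {
    // A hydrogen-like s-type CGF with three primitives (STO-3G-style layout).
    ContractedGaussian s = {{0.0, 0.0, 0.0}, 0, 0, 0, 3,
                            {3.42525091, 0.62391373, 0.16885540},
                            {0.15432897, 0.53532814, 0.44463454}};
    std::printf("s-orbital: total angular momentum = %d, degree of contraction = %d\n",
                s.i + s.j + s.k, s.K);
    return 0;
}
```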
Selecting basis functions is crucial as it significantly influences both the accuracy and computational cost of quantum chemical calculations [
40]. For instance, the Slater-type orbital with three Gaussian functions (often termed STO-3G) is a well-known minimal basis set representing molecular orbitals as the linear combination of three Gaussian-type orbitals [
41]. To support research and thorough evaluations, the community has developed the Basis Set Exchange (BSE) [
42], a comprehensive and convenient database of basis sets used in quantum chemistry calculations.
As the Hartree–Fock method involves the evaluation of quartet electron repulsion integrals (ERIs), we define the ERI of four Gaussian-type orbitals as follows:
$[g_a g_b | g_c g_d] = \int\!\!\int g_a(\mathbf{r}_1)\, g_b(\mathbf{r}_1)\,\frac{1}{|\mathbf{r}_1 - \mathbf{r}_2|}\, g_c(\mathbf{r}_2)\, g_d(\mathbf{r}_2)\, d\mathbf{r}_1\, d\mathbf{r}_2,$
where $g_a$, $g_b$, $g_c$, and $g_d$ denote Gaussian-type orbitals; such integrals are termed GTO-ERIs. Using the above-mentioned formulation, Equation (6) can be rewritten as follows:
$(\chi_\mu \chi_\nu | \chi_\lambda \chi_\sigma) = \sum_{a}\sum_{b}\sum_{c}\sum_{d} d_a\, d_b\, d_c\, d_d\, [g_a g_b | g_c g_d].$
For clarity of exposition, in order to relate notions of basis sets and Gaussian-type orbitals while computing the electronic configuration of molecules,
Figure 2 shows the relation between the Basis-ERIs and the GTO-ERIs. In particular,
Figure 2 shows the combinations of four Basis-ERIs under symmetry considerations, in which the terms over the row (column) directions represent the
bra (
ket) Basis-ERIs. Each cell of the upper triangular matrix of
Figure 2 corresponds to a single Basis-ERI and is the result of the sum of GTO-ERIs, as expressed in Equation (
9).
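The structure of Equation (9) amounts to a quadruple contraction loop over primitives. The sketch below illustrates this under the assumption of a simple primitive record; gto_eri() is a stub standing in for the McMurchie–Davidson evaluation described in Section 3, not an actual integral routine.

```cuda
// Sketch of Equation (9): a Basis-ERI assembled as a contraction-weighted sum
// of GTO-ERIs. The primitive integral gto_eri() is a hypothetical stub here.
#include <cstdio>

struct Primitive { double alpha, d; };          // exponent and contraction coefficient

// Placeholder for a primitive [g_a g_b | g_c g_d] integral.
static double gto_eri(const Primitive&, const Primitive&,
                      const Primitive&, const Primitive&) {
    return 1.0;                                  // stub value for illustration only
}

static double basis_eri(const Primitive* A, int Ka, const Primitive* B, int Kb,
                        const Primitive* C, int Kc, const Primitive* D, int Kd) {
    double sum = 0.0;
    for (int a = 0; a < Ka; ++a)                 // quadruple loop over primitives
        for (int b = 0; b < Kb; ++b)
            for (int c = 0; c < Kc; ++c)
                for (int d = 0; d < Kd; ++d)
                    sum += A[a].d * B[b].d * C[c].d * D[d].d
                         * gto_eri(A[a], B[b], C[c], D[d]);
    return sum;
}

int main() {
    Primitive p[3] = {{3.4, 0.15}, {0.62, 0.54}, {0.17, 0.44}};   // toy 3-term CGF
    std::printf("Basis-ERI (stubbed primitives): %f\n",
                basis_eri(p, 3, p, 3, p, 3, p, 3));               // 3^4 = 81 GTO-ERIs
    return 0;
}
```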
3. McMurchie–Davidson Method
Our approach is centered on the efficiency frontiers and the acceleration prospects of the McMurchie–Davidson (MD) method [
10]. Thus, this section describes the preliminaries and key ideas behind the approach to efficiently compute electron repulsion integrals over Gaussian functions (GTO-ERIs) by the MD method. Generally speaking, the MD method uses Hermite Gaussian functions and auxiliary (tabulated) functions through an incremental approach, whose elegant formalism offers a high degree of versatility to parallelization schemes in GPUs.
For simplicity and without loss of generality of our exposition, let us consider the GTO-ERI $[g_a g_b | g_c g_d]$ composed of the following four Gaussian-type orbitals (GTOs), each of which neglects normalization constants:
$g_a(\mathbf{r}_1) = (x_1 - A_x)^{i_a} (y_1 - A_y)^{j_a} (z_1 - A_z)^{k_a}\, e^{-\alpha_a |\mathbf{r}_1 - \mathbf{A}|^2},$
$g_b(\mathbf{r}_1) = (x_1 - B_x)^{i_b} (y_1 - B_y)^{j_b} (z_1 - B_z)^{k_b}\, e^{-\alpha_b |\mathbf{r}_1 - \mathbf{B}|^2},$
$g_c(\mathbf{r}_2) = (x_2 - C_x)^{i_c} (y_2 - C_y)^{j_c} (z_2 - C_z)^{k_c}\, e^{-\alpha_c |\mathbf{r}_2 - \mathbf{C}|^2},$
$g_d(\mathbf{r}_2) = (x_2 - D_x)^{i_d} (y_2 - D_y)^{j_d} (z_2 - D_z)^{k_d}\, e^{-\alpha_d |\mathbf{r}_2 - \mathbf{D}|^2}.$
Following the MD method [
10], each of the GTO-ERIs is calculated as follows:
$[g_a g_b | g_c g_d] = \frac{2\pi^{5/2}}{pq\sqrt{p+q}} \sum_{t,u,v} E_t^{i_a i_b} E_u^{j_a j_b} E_v^{k_a k_b} \sum_{\tau,\nu,\phi} (-1)^{\tau+\nu+\phi}\, E_\tau^{i_c i_d} E_\nu^{j_c j_d} E_\phi^{k_c k_d}\, R_{t+\tau,\, u+\nu,\, v+\phi}.$
Furthermore, in Equation (14), $p = \alpha_a + \alpha_b$, $q = \alpha_c + \alpha_d$, $\mathbf{P} = (\alpha_a \mathbf{A} + \alpha_b \mathbf{B})/p$, and $\mathbf{Q} = (\alpha_c \mathbf{C} + \alpha_d \mathbf{D})/q$, where $\alpha = pq/(p+q)$. The recurrence formulas $E_t^{i_a i_b}$, $E_u^{j_a j_b}$, and $E_v^{k_a k_b}$ are terms related to the GTOs $g_a$ and $g_b$. For example, $E_t^{i,j}$ (with $i = i_a$ and $j = i_b$) is defined by the following recurrence equations:
$E_t^{i+1,j} = \frac{1}{2p} E_{t-1}^{i,j} + X_{PA}\, E_t^{i,j} + (t+1)\, E_{t+1}^{i,j},$
$E_t^{i,j+1} = \frac{1}{2p} E_{t-1}^{i,j} + X_{PB}\, E_t^{i,j} + (t+1)\, E_{t+1}^{i,j},$
$E_0^{0,0} = \exp\!\left(-\frac{\alpha_a \alpha_b}{p}\, X_{AB}^2\right), \qquad E_t^{i,j} = 0 \ \ \text{for} \ \ t < 0 \ \text{or} \ t > i + j,$
where $X_{PA}$, $X_{PB}$, and $X_{AB}$ denote the x-components of $\mathbf{P} - \mathbf{A}$, $\mathbf{P} - \mathbf{B}$, and $\mathbf{A} - \mathbf{B}$, respectively. Also, the terms $E_u^{j_a j_b}$ and $E_v^{k_a k_b}$ can be calculated using the coordinates y and z, respectively, and the terms $E_\tau^{i_c i_d}$, $E_\nu^{j_c j_d}$, and $E_\phi^{k_c k_d}$ can be calculated for $g_c$ and $g_d$ in the same fashion. Furthermore, although evaluating the recurrence formula $E$ is computationally trivial and inexpensive in most cases, implementing the terms of $E$ as a recursive function over a large number of basis sets is expected to contribute to the overall computational load and become a bottleneck due to the large number of computation paths. As such, in this study, in order to realize efficient and tailored computation paths across a diverse set of molecular configurations, we improved the evaluation of the term $E$ not only by expanding the recurrence relations but also by creating device functions tailored to each combination of the indices $(i, j, t)$ of $E_t^{i,j}$; as such, the overall computation load is anticipated to decrease [
43].
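For reference, the one-dimensional Hermite expansion coefficients can be evaluated with a direct recursive function, shown below as a minimal sketch consistent with the recurrences above (the paper instead expands the recurrence and generates a dedicated device function per index combination; parameter names here are illustrative).

```cuda
// Recursive reference implementation of the one-dimensional Hermite expansion
// coefficients E_t^{i,j} of the MD method (sketch only; the paper generates
// index-specific, fully expanded device functions instead).
#include <cstdio>
#include <math.h>

__host__ __device__
double hermite_E(int i, int j, int t,
                 double alpha_a, double alpha_b,
                 double Xab, double Xpa, double Xpb) {
    const double p = alpha_a + alpha_b;                 // combined exponent
    if (t < 0 || t > i + j) return 0.0;                 // coefficients vanish outside range
    if (i == 0 && j == 0 && t == 0)
        return exp(-(alpha_a * alpha_b / p) * Xab * Xab);   // E_0^{0,0}
    if (i > 0)                                          // step down in i
        return 0.5 / p * hermite_E(i - 1, j, t - 1, alpha_a, alpha_b, Xab, Xpa, Xpb)
             + Xpa     * hermite_E(i - 1, j, t,     alpha_a, alpha_b, Xab, Xpa, Xpb)
             + (t + 1) * hermite_E(i - 1, j, t + 1, alpha_a, alpha_b, Xab, Xpa, Xpb);
    return 0.5 / p * hermite_E(i, j - 1, t - 1, alpha_a, alpha_b, Xab, Xpa, Xpb)   // step down in j
         + Xpb     * hermite_E(i, j - 1, t,     alpha_a, alpha_b, Xab, Xpa, Xpb)
         + (t + 1) * hermite_E(i, j - 1, t + 1, alpha_a, alpha_b, Xab, Xpa, Xpb);
}

int main() {
    // Two s-type primitives (i = j = 0) whose centers are a distance 1.0 apart.
    double e = hermite_E(0, 0, 0, 0.5, 0.5, 1.0, 0.5, -0.5);
    std::printf("E_0^{0,0} = %f\n", e);    // expected exp(-0.25) ~ 0.7788
    return 0;
}
```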
Central to the MD method is the computation of the following recurrence relations:
$R^{(n)}_{t+1,u,v} = t\, R^{(n+1)}_{t-1,u,v} + X_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{t,u+1,v} = u\, R^{(n+1)}_{t,u-1,v} + Y_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{t,u,v+1} = v\, R^{(n+1)}_{t,u,v-1} + Z_{PQ}\, R^{(n+1)}_{t,u,v},$
$R^{(n)}_{0,0,0} = (-2\alpha)^{n}\, F_n(x),$
where $X_{PQ}$, $Y_{PQ}$, and $Z_{PQ}$ denote the x-, y-, and z-components of the vector $\mathbf{P} - \mathbf{Q}$. Thus, $\mathbf{P}$ is the point dividing $\mathbf{A}$ and $\mathbf{B}$ in the ratio $\alpha_b : \alpha_a$, where $P_x$, $P_y$, and $P_z$ represent the x-, y-, and z-coordinates of $\mathbf{P}$, respectively.
Furthermore, the term $F_n(x)$ in Equation (22), in which $x = \alpha\,|\mathbf{P} - \mathbf{Q}|^{2}$ and $|\mathbf{P} - \mathbf{Q}|$ is the norm of the vector $\mathbf{P} - \mathbf{Q}$, denotes the Boys function [
8], defined as follows:
$F_n(x) = \int_0^1 t^{2n}\, e^{-x t^{2}}\, dt.$
The equations from Equation (
14) to Equation (
26) follow the formulations in [
10], but have been rewritten in a different form to improve readability. Although the above-mentioned Boys function is widely used in frameworks aiming at reducing the computation of complex integrals into amenable forms, the closed-form analytical solutions are nonexistent/intractable. In this study, to realize the highly accurate estimations with utmost efficiency frontiers, we leveraged the numerical approach rendered from GPU acceleration schemes proposed by Tsuji et al. [
44]. Although the computational burden behind evaluating the Boys function is high, it is possible to attain enhanced efficiency frontiers by devising strategies to reduce the number of evaluations in special scenarios. For instance, a special case occurs when $x = 0$; thus, the Boys function becomes $F_n(0) = \frac{1}{2n+1}$; as such, the computational burden is expected to reduce significantly. The reader may note that the scenario $x = 0$ often occurs when the Cartesian centers $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ of the four GTOs used in the GTO-ERI are identical.
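The special case above can be captured with a simple guard in the evaluator. The sketch below pairs the $x = 0$ shortcut with a naive Simpson-rule quadrature for the general case, purely for illustration; it is not the GPU scheme of Tsuji et al. [44] used in this work.

```cuda
// Minimal sketch of a Boys function evaluator with the x == 0 shortcut.
// The general branch uses a naive composite Simpson rule for illustration only.
#include <cstdio>
#include <math.h>

__host__ __device__
static double boys_integrand(int n, double x, double t) {
    return pow(t, 2.0 * n) * exp(-x * t * t);
}

__host__ __device__
double boys(int n, double x) {
    if (x < 1e-14) return 1.0 / (2.0 * n + 1.0);      // special case F_n(0) = 1/(2n+1)
    const int m = 512;                                // even number of Simpson subintervals
    const double h = 1.0 / m;
    double s = boys_integrand(n, x, 0.0) + boys_integrand(n, x, 1.0);
    for (int i = 1; i < m; ++i)
        s += (i % 2 ? 4.0 : 2.0) * boys_integrand(n, x, i * h);
    return s * h / 3.0;                               // Simpson weight
}

int main() {
    std::printf("F_0(0) = %.6f (exact 1)\n", boys(0, 0.0));
    std::printf("F_0(1) = %.6f (exact ~0.746824)\n", boys(0, 1.0));
    return 0;
}
```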
5. Implementations in Graphics Processing Unit (GPU)
This section describes the implementations of the above-mentioned algorithms in Graphics Processing Units (GPUs) through the Compute Unified Device Architecture (CUDA).
Since general-purpose programming of GPU hardware became possible, CUDA has enabled the seamless implementation and exploration of parallel algorithms on GPU devices through several programming models [
45]. GPUs embed multiple streaming multiprocessors (SMs), each of which contains several processing cores.
The programming model behind CUDA uses key elements such as CUDA threads, CUDA blocks, and shared memory, among others. A CUDA thread is an individual execution instance that performs a specific task/computation. Threads are organized into CUDA blocks of up to 1024 threads, which can be executed concurrently. Furthermore, CUDA threads collaborate using shared memory, a high-speed storage area accessible to all CUDA threads within the same CUDA block. The following section presents a parallel batch technique that efficiently uses the shared memory.
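To make these notions concrete, the following toy kernel (unrelated to ERI evaluation) uses CUDA blocks of 256 threads and a shared-memory buffer to cooperatively reduce 256 values per block:

```cuda
// Toy kernel illustrating CUDA blocks, threads, and shared memory:
// each block cooperatively sums 256 values using a shared-memory buffer.
#include <cstdio>

__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];                    // visible to all threads of the block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid]; // each thread loads one element
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];       // one result per block
}

int main() {
    const int blocks = 4, threads = 256, n = blocks * threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    block_sum<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum of block 0 = %.0f (expected 256)\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```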
5.1. Triple-Buffering of Shared Memory
With the goal of reducing the overall memory usage required in the batch-based algorithm, this section proposes the concept of triple-buffering of the shared memory, which is inspired by the notion of avoiding the storage of unnecessary values from the computed recurrences. In the batch-based algorithm (Section 4.2), the values of the recurrence R for batch k are computed sequentially, starting from batch 0 to batch K. By observing the features of Equation (22) and Figure 4, one can note that only the values for batch k − 1 and batch k − 2 are needed to calculate the recurrence R for batch k. Therefore, it becomes unnecessary to store all the values of the recurrence R for the entire batch algorithm. Concretely speaking, for implementations, three buffers are prepared in the shared memory, in which each buffer stores the values computed from the recurrence R for one batch. To show the basic idea behind the triple-buffering of the shared memory, Figure 5 shows the dependencies behind computing the corresponding recurrences R for each batch under the triple-buffering scheme. By observing the dependency paths in Figure 5, computations switch between buffers. For instance, during the computation of batch k, the R values from batch k − 1 and batch k − 2, stored in two of the buffers, are used to calculate the R values for batch k, whose results are then stored in the remaining buffer. In the subsequent computation of batch k + 1, the R values from batch k and batch k − 1, stored in two of the buffers, are used to compute the R values for batch k + 1, which are then stored in the buffer that previously held the R values for batch k − 2. By repeating the above-mentioned process, all R values can be calculated efficiently using only three buffers.
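A schematic of the buffer rotation is shown below; the recurrence is a toy stand-in for Equation (22), and the per-batch buffer size is an assumed constant, but the indexing pattern (batch k overwrites the buffer last used three batches earlier) mirrors the scheme described above.

```cuda
// Schematic of triple-buffered shared memory for a recurrence in which
// batch k depends only on batches k-1 and k-2 (toy recurrence, not the
// actual MD R recurrence; buffer sizes and names are illustrative).
#include <cstdio>

#define BATCH_SIZE 128   // assumed maximum number of values per batch

__global__ void triple_buffer_demo(float* result, int K) {
    __shared__ float buf[3][BATCH_SIZE];           // three rotating batch buffers
    int t = threadIdx.x;

    buf[0][t] = 1.0f;                              // batch 0
    buf[1][t] = 2.0f;                              // batch 1
    __syncthreads();

    for (int k = 2; k <= K; ++k) {
        float* prev1 = buf[(k - 1) % 3];           // batch k-1
        float* prev2 = buf[(k - 2) % 3];           // batch k-2
        float* curr  = buf[k % 3];                 // reuses the buffer of batch k-3 (once k >= 3)
        curr[t] = 0.5f * prev1[t] + 0.25f * prev2[t];   // toy stand-in for Equation (22)
        __syncthreads();
    }
    if (t == 0) result[blockIdx.x] = buf[K % 3][0];
}

int main() {
    float* d;
    cudaMallocManaged(&d, sizeof(float));
    triple_buffer_demo<<<1, BATCH_SIZE>>>(d, 24);  // K = 24 batches
    cudaDeviceSynchronize();
    printf("value of batch K at element 0: %f\n", *d);
    cudaFree(d);
    return 0;
}
```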
The triple-buffering technique is a well-known approach for optimizing memory usage. However, in the context of the MD method, its effectiveness depends on the specific recurrence dependencies unique to this method, which are observed in Equation (
22). Our work leverages these dependencies to minimize memory requirements while preserving computational efficiency. By carefully managing buffer usage in accordance with the recurrence properties, we achieve a significant reduction in shared memory consumption, as demonstrated in our analysis.
The above-mentioned implementation requires significantly less shared memory compared to an approach that would preserve all the computed values of the recurrence R. In particular, such an algorithm would require the following size of
shared memory:
In contrast, the buffer size used in the triple-buffering method is:
Since three buffers become necessary, the required memory is three times the (constant) size of a single buffer. Additionally, the values computed from the recurrence in Algorithm 2 are retained separately, which requires additional elements of memory.
It is possible to estimate the required size of the
shared memory needed to compute all the
batches. First, we consider a straightforward approach that would store all the values rendered from the recurrences
R into the
shared memory. From
Section 4.2, the number of
R values computed in
batch k is given by
Therefore, the total number of
R values for all batches, from batch 0 to batch K, is:
On the other hand, the proposed three-buffering approach requires three buffers to store the maximum number of
R values needed for all relevant
batches k (0 ≤ k ≤ K). As such, the maximum number of
R values required for any batch
k is as follows:
In the approach mentioned above, the batch values can be computed sequentially. However, to execute Algorithm 2, the values computed from the recurrence in each batch are to be preserved while avoiding loss and overwrites, implying the requirement of a separate storage. To store all such values for batch k, the following size of
shared memory is additionally required:
In order to portray the performance differences in terms of the required size of the
shared memory,
Figure 6 compares the required memory size between the proposed triple-buffering scheme and a naive implementation that retains all values rendered from the recurrence R, over the relevant range of K. By observing the trend and behavior of the required shared
memory size in
Figure 6, we can observe the following facts:
the difference in required shared memory size is trivial in configurations with small K, whereas the smaller sizes achieved by the proposed triple-buffering scheme become noticeable in configuration profiles with larger K, and
as K increases, the memory reduction rate becomes comparatively clear and significant, achieving a reduction to 0.344 times the original memory usage, or approximately 65% less space.
The above-mentioned results portray the potential, feasibility, and versatility of implementing the proposed triple-buffering scheme across different classes of GPU devices, including those with limited shared memory capacity.
5.2. Proposed CUDA Block and Thread Mapping Strategies
In order to investigate the performance and efficiency frontiers of CUDA-enabled GPU devices, we propose the following four implementations that consider the versatile configuration of GPU thread assignments:
- BBM
Basis-Block Mapping, a GPU implementation mapping each Basis-ERI to a CUDA block.
- BTM
Basis-Thread Mapping, a GPU implementation mapping each Basis-ERI to a CUDA thread.
- SBM
Shell-Block Mapping, a GPU implementation mapping each shell-based ERI to a CUDA block.
- STM
Shell-Thread Mapping, a GPU implementation mapping each shellbased ERI to a CUDA thread.
In the following, we describe the details of the above-mentioned implementations.
5.2.1. Basis-Block Mapping (BBM)
Basis-Block Mapping (BBM) assigns GPU threads based on Algorithm 1. Concretely speaking, one CUDA block is assigned to one Basis-ERI, and the CUDA threads in that CUDA block compute the Basis-ERI by the parallel algorithm shown in Algorithm 2. The BBM assignment approach implements our previous study [
23] and serves as a key reference for further evaluations. In order to show the basic idea behind the BBM assignment mechanism,
Figure 7 shows an example of the parallel thread assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BBM. Here, in each block, the GTO-ERIs in Equation (
9) are computed using CUDA threads with the
batch parallel algorithm presented above, one by one, exhaustively. The reader may note the relevant roles of Algorithms 1 and 2 in the construction of the components. In the BBM approach, we use the triple-buffering technique to cache the values of the recurrences
R and then store the corresponding results of the Basis-ERI calculations. Furthermore, the Basis-ERI computation is independent of other CUDA blocks; thus, no conflicts in memory access occur when the resulting values of the Basis-ERIs are stored in memory.
5.2.2. Basis-Thread Mapping (BTM)
Similarly to the approach in BBM, the Basis-Thread Mapping (BTM) computes the Basis-ERI in Algorithm 1. However, it differs from the former in that BTM assigns a Basis-ERI computation to one thread. Concretely speaking, each thread performs GTO-ERIs through Algorithm 2 serially. To show the basic idea of thread assignment behind the BTM approach,
Figure 8 illustrates a basic example of the parallel assignment of CUDA blocks and CUDA threads to the Basis-ERI computation in BTM. The reader may note the relevant roles of Algorithms 1 and 2 in the construction of the components. Compared with BBM, this thread assignment appears at first glance to be computationally inefficient due to the large amount of computation in each thread. However, when the target system is small, specifically, when the number of GTOs in the basis functions is small and/or when the azimuthal quantum number is small, BTM can utilize computational resources efficiently. Similarly to BBM, the Basis-ERI computation is independent of other CUDA threads, and no conflicts in memory access occur when storing the results.
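The difference between the two mappings reduces to how the ERI index is derived from the CUDA grid. The hypothetical skeletons below (with placeholder device functions standing in for Algorithms 1 and 2) contrast the two index computations:

```cuda
// Hypothetical skeletons contrasting the two basis-function-level mappings.
// The eval_basis_eri_* functions are placeholders; only the indexing matters here.
#include <cstdio>

__device__ void eval_basis_eri_cooperative(int /*eri_id*/) { /* all threads of the block cooperate */ }
__device__ void eval_basis_eri_serial(int /*eri_id*/)      { /* one thread evaluates the whole ERI */ }

// BBM: one CUDA block per Basis-ERI; the threads of the block cooperate.
__global__ void bbm_kernel(int num_basis_eris) {
    int eri_id = blockIdx.x;                                  // block index selects the Basis-ERI
    if (eri_id < num_basis_eris) eval_basis_eri_cooperative(eri_id);
}

// BTM: one CUDA thread per Basis-ERI; each thread works independently.
__global__ void btm_kernel(int num_basis_eris) {
    int eri_id = blockIdx.x * blockDim.x + threadIdx.x;       // global thread index
    if (eri_id < num_basis_eris) eval_basis_eri_serial(eri_id);
}

int main() {
    const int n = 1000, threads = 256;
    bbm_kernel<<<n, threads>>>(n);                            // one block per Basis-ERI
    btm_kernel<<<(n + threads - 1) / threads, threads>>>(n);  // one thread per Basis-ERI
    cudaDeviceSynchronize();
    printf("launched BBM and BTM skeleton kernels\n");
    return 0;
}
```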
5.2.3. Shell-Block Mapping (SBM)
Shell-Block Mapping (SBM) assigns GPU threads based on Algorithm 3. More specifically, one CUDA block is assigned to one shell-based ERI, and the CUDA threads in that CUDA block compute the corresponding ERI by the parallel algorithm shown in Algorithm 2. To show the basic idea behind the thread assignment in SBM,
Figure 9 shows a parallel thread assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in SBM. The reader may note the relevant roles of Algorithms 2 and 3 in the construction of the components. As described in
Section 4.1, when computing the GTO-ERIs within four shells, performance can be improved by efficiently reusing previously calculated values of the Boys function. By capitalizing on this notion, the CUDA threads in each CUDA block in line 2 of Algorithm 3 compute all the values of the Boys function required for the GTO-ERI calculations in advance. Afterward, the computed values are reused in subsequent GTO-ERI calculations to avoid redundant computation. Different from BBM and BTM, the resulting value of the calculation for each CUDA block is accumulated, as shown in line 4 of Algorithm 3, implying that the same memory addresses may be accessed at the same time and thus inducing the requirement for exclusive memory access. CUDA provides the atomicAdd function, which performs the accumulation in memory atomically [
46]. Therefore, the proposed implementation capitalizes on the atomicAdd function to realize race-free parallel memory access. As such, SBM offers the advantage of significantly reducing the number of Boys function evaluations compared to BBM, yet it incurs increased memory access operations.
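The accumulation step can be illustrated with a minimal kernel in which many threads add their contributions to a single output entry via atomicAdd (toy values and a hypothetical layout; double-precision atomicAdd requires compute capability 6.0 or higher, which the A100 satisfies):

```cuda
// Minimal illustration of accumulating per-thread contributions into a shared
// result with atomicAdd, as needed when several shell-based ERI contributions
// map to the same Basis-ERI entry.
// Compile with: nvcc -arch=sm_60 (or newer) atomic_demo.cu
#include <cstdio>

__global__ void accumulate(double* basis_eri_out) {
    // Each thread stands in for one GTO-ERI (or shell-based ERI) contribution.
    double contribution = 0.001 * (threadIdx.x + 1);
    // Without atomicAdd, concurrent += on the same address would race.
    atomicAdd(&basis_eri_out[0], contribution);
}

int main() {
    double* out;
    cudaMallocManaged(&out, sizeof(double));
    *out = 0.0;
    accumulate<<<1, 256>>>(out);
    cudaDeviceSynchronize();
    printf("accumulated value: %f (expected %.3f)\n", *out, 0.001 * 256 * 257 / 2);
    cudaFree(out);
    return 0;
}
```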
5.2.4. Shell-Thread Mapping (STM)
Shell-Thread Mapping (STM) assigns each CUDA thread to one shell-based ERI in Algorithm 3. Although this strategy of GPU thread allocation appeared in [
32,
47], our implementation uses a different shell-based ERI calculation algorithm. To exemplify the basic idea of thread assignment in STM,
Figure 10 shows an example of the parallel assignment of CUDA blocks and CUDA threads to the shell-based ERI computation in STM. The reader may note the relevant roles of Algorithms 2 and 3 in the construction of the components. Each thread is assigned to one element in line 1 of Algorithm 3. Namely, the computation of lines 2–5 of Algorithm 3, as well as the computation of Algorithm 2, is performed sequentially. Therefore, the computation cost of each thread becomes larger than in the above mappings. Following the same motivation as in SBM, we capitalize on the merits of the
atomicAdd function to enable the cumulative computations in memory exclusively.
5.3. Sorting of ERIs
In order to further enhance the computational efficiency of the proposed algorithms, this section describes the mechanisms to ensure effective parallelism of the implemented algorithms. BTM and STM are designed to perform distinct ERI computations as thread-based mappings. One of the key performance considerations in thread-based implementations lies in the ability to avoid the
warp divergence phenomenon [
46], which occurs when threads within a
warp (a group of 32 threads) take different computation paths, causing the serial execution of computation paths and thus the inefficient use of GPU resources. As such, an effective strategy to minimize the occurrence of the
warp divergence lies in the notion of ordering and grouping similar ERI calculations within a warp to ensure threads follow identical computation paths, thus improving the overall effective parallelism.
Ordering similar ERIs to improve GPU resource utilization is a commonly used strategy in ERI computations. In this study, we also adopt this approach to reduce warp divergence and enhance computational efficiency. While this technique itself is not novel, it is an essential step to ensure that the proposed GPU implementation operates efficiently.
5.3.1. Sorting of Basis-ERIs
Here, we describe the mechanism to sort Basis-ERIs considering the following factors:
The number of GTOs in each basis function determines the number of GTO-ERIs within the Basis-ERI. For instance, when computing a Basis-ERI whose four basis functions each contain three GTOs, the total number of GTO-ERIs is $3^4 = 81$. These four GTO counts have a direct influence on the behavior of lines 3–5 in Algorithm 1. Consequently, to optimize performance and minimize the occurrence of warp divergence, it is highly desirable to place Basis-ERIs with the same numbers of GTOs adjacent to each other.
On the other hand, the azimuthal quantum numbers of each GTO determine the behavior of the MD method in the calculation of the GTO-ERI. Thus, the azimuthal quantum numbers have a direct influence on the behavior of Algorithm 2, leading to significant losses if similar values are not clustered together.
In this study, we use a 64-bit key, as shown by
Figure 11, to sort Basis-ERIs. Concretely speaking, a 64-bit key is assigned to each Basis-ERI and sorting is performed based on the key. The upper 28 bits of the key are used to represent the number of GTOs in each basis function, whereas the lower 36 bits are used to represent the azimuthal quantum numbers of each GTO.
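A possible construction of such a key is sketched below; the split of the 28-bit region into four 7-bit GTO counts and of the 36-bit region into four 9-bit angular-momentum fields is an assumption for illustration, not necessarily the exact layout of Figure 11.

```cuda
// Illustrative construction of a 64-bit sort key: the upper 28 bits encode the
// four GTO counts (7 bits each, an assumed split) and the lower 36 bits encode
// the azimuthal quantum numbers of the four GTOs (9 bits each, also assumed).
#include <cstdint>
#include <cstdio>
#include <algorithm>
#include <vector>

static uint64_t make_key(const int gto_count[4], const int ang_mom[4]) {
    uint64_t key = 0;
    for (int i = 0; i < 4; ++i)                     // upper 28 bits: GTO counts
        key = (key << 7) | (uint64_t)(gto_count[i] & 0x7F);
    for (int i = 0; i < 4; ++i)                     // lower 36 bits: azimuthal quantum numbers
        key = (key << 9) | (uint64_t)(ang_mom[i] & 0x1FF);
    return key;
}

int main() {
    int counts_a[4] = {3, 3, 3, 3}, l_a[4] = {0, 0, 1, 1};
    int counts_b[4] = {3, 3, 3, 3}, l_b[4] = {0, 0, 0, 0};
    std::vector<uint64_t> keys = {make_key(counts_a, l_a), make_key(counts_b, l_b)};
    std::sort(keys.begin(), keys.end());            // groups similar Basis-ERIs together
    std::printf("first key after sorting: 0x%016llx\n", (unsigned long long)keys[0]);
    return 0;
}
```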
5.3.2. Sorting of Shell-Based ERIs
The computation paths for shell-based ERIs are determined by the type of shells involved (
Section 4.1). For instance, the computation of a shell-based ERI proceeds sequentially over its constituent GTO-ERIs in a fixed order. In addition, since each calculation is a GTO-ERI, and all the azimuthal quantum numbers involved are the same, the sorting of the shell-based ERIs is based on the types of the four shells.
6. Performance Evaluation
In this section, in order to evaluate the performance frontiers of the proposed algorithms, we report computational experiments involving the proposed implementations of the McMurchie–Davidson method for shell-based ERI computations of molecules of interest. Thus, this section describes our observations and findings.
The experiments used an NVIDIA A100 Tensor Core GPU and an Intel Xeon Gold 6338 CPU. We measured the ERI computation times for the four GPU implementations presented in
Section 5. The number of threads per CUDA block was set to 256. The number of CUDA blocks launched was determined as follows: for BBM, the number of CUDA blocks was equal to the number of Basis-ERIs; for SBM, it corresponded to the number of shell-based ERIs. In the cases of BTM and STM, the number of CUDA blocks was set to 1/256 of those used in BBM and SBM, respectively. Considering the above-mentioned points, we conducted the following two experiments:
Experiment 1: In this experiment, our goal was to evaluate the performance frontiers (execution time) of ERI calculations in monatomic molecules. As such, we used monatomic hydrogen as the molecule of interest, and evaluated through five types of basis function sets defined in the correlation-consistent basis sets [
48].
Experiment 2: In this experiment, our goal was to evaluate the performance frontiers (execution time) of ERI calculations in polyatomic molecules. As such, benzene (C6H6), naphthalene (C10H8), and copper oxide (CuO) were selected as the polyatomic molecules of interest. Furthermore, we used the STO-3G [
41] and 6-31G** [
49] basis sets.
While direct comparisons with other state-of-the-art approaches could be insightful, many existing software packages incorporate acceleration techniques, such as screening, to enhance performance. These optimizations significantly modify the computational workflow, making a straightforward comparison with our method difficult. Therefore, rather than making a direct comparison, this study evaluates the efficiency of the proposed parallelization strategies within a controlled computational framework.
One of the key motivations behind conducting the above-mentioned experiments lies in the structural differences between the central coordinates of the basis functions of monatomic and polyatomic molecules. Whereas the central coordinates are identical in monatomic molecules, they are different in polyatomic molecules. Concretely speaking, the parameter
x of the Boys function in the GTO-ERI calculations of monatomic molecules is always zero, leading to $F_n(0) = \frac{1}{2n+1}$, as described in
Section 3. The above context results in a significant simplification of the Boys function evaluation, which is a computational bottleneck in the GTO-ERI calculations.
In order to show a glimpse of the performance frontiers of the proposed algorithms,
Table 3 shows the results of Experiment 1. Here, following the observations of
Section 2, the number of Basis-ERIs in Table 3 is calculated from the relation $\frac{1}{2}\left(\frac{M(M+1)}{2}\right)\left(\frac{M(M+1)}{2}+1\right)$, where M is the number of basis functions. Similarly, the number of shell-based ERIs in Table 3 is calculated from the relation $\frac{1}{2}\left(\frac{N(N+1)}{2}\right)\left(\frac{N(N+1)}{2}+1\right)$, where N is the number of shells. The results indicate that, for monatomic molecules, the shell-based methods performed slower than the basis-function-based methods. The poor performance can be attributed to two reasons.
First, as mentioned earlier, the computation of the Boys function can always be simplified to $F_n(0) = \frac{1}{2n+1}$
in the context of GTO-ERIs of monatomic molecules. As a result, the Boys function evaluation does not pose a significant computational overhead. Consequently, the key advantage of shell-based methods in reducing the number of Boys function evaluations is diminished. Instead, as discussed in
Section 5.2.3, the increased memory access in the shell-based method leads to longer execution times.
Second, by looking at the number of shells presented in
Table 3, the reader may easily note that for each basis function, the number of shells with the largest azimuthal quantum number is always one. For example, when using the cc-pV6Z basis set, there is only one h-shell. In this scenario, the highest computational cost per shell-based ERI arises from the shell-based ERI involving four h-shells; however, there is only one such shell-based ERI. Therefore, the SBM (STM) algorithm deals with such computation by a single CUDA block (CUDA thread).
In contrast, by observing the results from
Table 3, the number of basis functions with the largest azimuthal quantum number is 21. Hence, the number of Basis-ERIs formed by four h-orbitals can be calculated accordingly; thus, the BBM (BTM) algorithm uses 28,014 CUDA blocks (CUDA threads) for parallel computation. As a result, the shell-based algorithms fail to achieve effective parallelism for certain ERI computations compared to the basis function-based algorithms. The lack of effective parallelism becomes a computational bottleneck, causing the shell-based algorithms to underperform.
Additionally, the implementations that assign one thread per ERI (BTM, STM) were found to be slower than the implementations that assign one block per ERI (BBM, SBM) in most cases. This observation is particularly clear and pronounced between the performance of SBM and STM. Similarly to the above-mentioned insights, we argue that the lack of effective parallelism is one of the key reasons explaining the diminished performance of schemes based on one thread per ERI (BTM, STM).
Under similar conditions, the SBM algorithm computes the shell-based ERI involving four h-shells using one CUDA block (i.e., 256 threads), whereas the STM algorithm performs the same computation using only a single CUDA thread. Consequently, the computation time for such ERIs increases significantly, creating a computational bottleneck and resulting in decreased performance.
We also compare our proposed implementations with a (naive) CPU implementation (AMD EPYC 7702 processor), in which the Basis-ERIs are evaluated sequentially using the recursive formulae described in
Section 3. To show the performance frontiers of the CPU-based implementations,
Table 4 shows the obtained results. By observing the results in
Table 4, we achieved a speedup of up to 4500×; however, in certain scenarios, the GPU implementation was slower than the CPU implementation. The above-mentioned observations indicate that, in those cases, the overhead associated with parallelization outweighed the potential speedup achieved through GPU acceleration.
In order to show the performance frontiers in polyatomic molecules,
Table 5 presents the results of Experiment 2. The results indicate that the thread-based methods (BTM and STM) achieved up to a 72× speedup compared to the block-based methods. However, depending on the molecule, the thread-based methods occasionally exhibited slower performance, with the worst case running at 0.21× the speed of the block-based counterpart. By comparing the results of BBM and BTM, BTM consistently outperformed BBM, suggesting that the benefit of increased parallelization of Basis-ERI calculations in the BTM implementation outweighed the advantage of
batch parallelization provided by the BBM scheme. Comparing the results of SBM and STM, a similar observation applies to benzene and naphthalene. However, in the case of copper oxide, STM exhibited slower performance.
The above-mentioned observations can be attributed to the lack of effective parallelism in shell-based ERI calculations, as discussed in the analysis of Experiment 1, in which the STM implementation is seen to be more affected by such limitation. Lastly, examining the results of BTM and STM, the speedup for naphthalene ERI calculations using the 6-31G** basis set was particularly notable. This observation suggests that for sufficiently large molecules, shell-based ERI calculations may offer a greater advantage.
In the same manner as the investigations in Experiment 1, we compared the performance of the four proposed methods against those of the (naive) CPU implementations. To show the obtained results,
Table 6 presents a comparison of the computation times of ERIs for polyatomic molecules. By observing the obtained results in
Table 6, we note a maximum speedup of 1112×. Moreover, unlike in Experiment 1, the proposed algorithms outperformed the CPU implementations in all cases. The enhanced performance can be attributed to the parallelization benefits gained when computing the computationally expensive Boys functions of polyatomic molecules. Consequently, the parallelization merits of the GPU implementations become more pronounced in such scenarios.