Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

High Performance Solution of Skew-symmetric Eigenvalue Problems with Applications in Solving the Bethe-Salpeter Eigenvalue Problem

High Performance Solution of Skew-symmetric Eigenvalue Problems with Applications in Solving the... We present a high-performance solver for dense skew-symmetric matrix eigenvalue problems. Our work is motivated by applica- tions in computational quantum physics, where one solution approach to solve the Bethe-Salpeter equation involves the solution of a large, dense, skew-symmetric eigenvalue problem. The computed eigenpairs can be used to compute the optical absorption spectrum of molecules and crystalline systems. One state-of-the art high-performance solver package for symmetric matrices is the ELPA (Eigenvalue SoLvers for Petascale Applications) library. We exploit a link between tridiagonal skew-symmetric and symmet- ric matrices in order to extend the methods available in ELPA to skew-symmetric matrices. This way, the presented solution method can benefit from the optimizations available in ELPA that make it a well-established, efficient and scalable library. The solution strategy is to reduce a matrix to tridiagonal form, solve the tridiagonal eigenvalue problem and perform a back-transformation for eigenvectors of interest. ELPA employs a one-step or a two-step approach for the tridiagonalization of symmetric matrices. We adapt these to suit the skew-symmetric case. The two-step approach is generally faster as memory locality is exploited better. If all eigenvectors are required, the performance improvement is counteracted by the additional back transformation step. We exploit the symmetry in the spectrum of skew-symmetric matrices, such that only half of the eigenpairs need to be computed, making the two- step approach the favorable method. We compare performance and scalability of our method to the only available high-performance approach for skew-symmetric matrices, an indirect route involving complex arithmetic. In total, we achieve a performance that is up to 3.67 higher than the reference method using Intel’s ScaLAPACK implementation. Our method is freely available in the current release of the ELPA library. Keywords: Distributed memory, Skew-symmetry, Eigenvalue and eigenvector computations, GPU acceleration, Bethe-Salpeter, Many-body perturbation theory 1. Introduction value problems running on distributed memory machines such as compute clusters. n×n A matrix A ∈ R is called skew-symmetric when A = The skew-symmetric case [5] lacks the ubiquitous presence T T −A , where . denotes the transposition of a matrix. We are of its symmetric counterpart and has not received the same ex- interested in eigenvalues and eigenvectors of A. tensive treatment. We close this gap by extending the ELPA The symmetric eigenvalue problem, i.e. the case A = A , methodology to the skew-symmetric case. has been studied in depth for many years. It lies at the core of Our motivation stems from the connection to the Hamilto- many applications in different areas such as electronic structure nian eigenvalue problem which has many applications in con- computations. Many methods for its solution have been pro- trol theory and model order reduction [6]. A real Hamiltonian posed [1] and successfully implemented. Optimized libraries matrix H is connected to a symmetric matrix M via the matrix for many platforms are widely available [2, 3]. With the rise of 0 I J = , where I denotes the identity matrix, more advanced computer architectures and more powerful su- −I 0 percomputers, the solution of increasingly complex problems M = JH. comes within reach. Parallelizability and scalability become key issues in algorithm development. The ELPA library [4] If M is positive definite, in the following denoted by M > 0, is one endeavor to tackle these challenges and provides highly the Hamiltonian eigenvalue problem can be recast into a skew- competitive direct solvers for symmetric (and Hermitian) eigen- symmetric eigenvalue problem using the Cholesky factorization M = LL . The eigenvalues of H are given as eigenvalues of ∗ the skew-symmetric matrix L JL and eigenvectors can be trans- Corresponding author formed accordingly. Email address: penke@mpi-magdeburg.mpg.de (Carolin Penke) These authors were supported by BiGmax, the Max Planck Societys Re- This situation occurs for example in [7], where a structure- search Network on Big-Data-Driven Materials Science. preserving method for the solution of the Bethe-Salpeter eigen- Preprint submitted to Parallel Computing April 21, 2020 arXiv:1912.04062v2 [math.NA] 20 Apr 2020 value problem is described. Solving the Bethe-Salpeter eigen- is tridiagonal. This is done by accumulating Householder value problem allows a prediction of optical properties in con- transformations densed matter, a more accurate approach than currently used Q = Q Q ···Q , trd 1 2 n−1 ones, such as time-dependent density functional theory (TDDFT) [8]. In this application context, the condition M > 0 ultimately where Q = I−τ v v represents the i-th Householder trans- i i i follows from much weaker physical interactions represented in i formation that reduces the i-th column and row of the the off-diagonal values [9, 10]. When larger systems are of T T updated Q ···Q AQ ···Q to tridiagonal form. The 1 i−1 interest, the resulting matrices easily become very high-dimen- i−1 1 matrices Q are not formed explicitly but are represented sional. This calls for a parallelizable and scalable algorithm. by the Householder vectors v . These are stored in place The solution of the corresponding skew-symmetric eigenvalue of the eliminated columns of A. problem can be accelerated via the developments presented in 2. Solve the tridiagonal eigenvalue problem, i.e. find or- this paper. thogonal Q s.t. The remaining paper is structured as follows. Section 2 diag reintroduces the methods used by ELPA and points out the nec- Λ = Q A Q . essary adaptations to make them work for skew-symmetric ma- diag trd diag trices. The Bethe-Salpeter problem is presented in Section 3. In ELPA, this step employs a tridiagonal divide-and-con- Section 4 provides performance results of the ELPA extension, quer scheme. including GPU acceleration, and points out the speedup achieved 3. Transform the required eigenvectors back, i.e. perform in the context of the Bethe-Salpeter eigenvalue problem. the computation 2. Solution Method Q = Q Q . trd diag 2.1. Solving the Symmetric Eigenvalue Problem in ELPA The ELPA solver comes in two flavors which define the The ELPA library [4, 11, 12] is a highly optimized parallel details of the transformation steps, i.e Steps 1 and 3. ELPA1 MPI-based code [13]. It shows great scalability over thousands works as described, the reduction to tridiagonal form is per- of CPU cores and contains low-level optimizations targeting formed in one step. ELPA2 splits the transformations into two specific compute architectures [14]. When only a portion of parts. Step 1 becomes eigenvalues and eigenvectors are needed, this is exploited algo- 1. (a) Reduce A to banded form, i.e. compute orthogonal rithmically and results in performance benefits. We briefly de- Q s.t. band scribe the well-established procedure employed by ELPA. This forms the basis of the method for skew-symmetric matrices de- A = Q AQ band band band scribed in the next subsection. ELPA contains functionality to deal with symmetric-definite is a band matrix. generalized eigenvalue problems. In this paper, we focus on the (b) Reduce the banded form to tridiagonal form, i.e. standard eigenvalue problem for simplicity. This is reasonable compute orthogonal Q s.t. trd as it is the most common use case and forms the basis of any method for generalized problems. We only consider real skew- A = Q A Q trd band trd trd symmetric problems. The reason is that any skew-symmetric problem can be transformed into a Hermitian eigenvalue prob- is tridiagonal. lem by multiplying it with the imaginary unit i. This problem Accordingly, the back transformation step is split into two parts can be solved using the available ELPA functionality for com- plex matrices. For the real case this induces complex arithmetic 3. (a) Perform the back transformation corresponding to which should obviously be avoided, but for complex matrices the band-to-tridiagonal reduction this is a viable approach. Q = Q Q . We consider the symmetric eigenvalue problem, i.e. the or- trd diag thogonal diagonalization of a matrix, (b) Perform the back transformation corresponding to Q AQ = Λ, the full-to-band reduction T n×n where A = A ∈R is the matrix whose eigenvalues are sought. Q = Q Q. band We are looking for the orthogonal eigenvector matrix Q and the diagonal matrix Λ containing the eigenvalues. The solution is The benefit of the two-step approach is that more efficient carried out in the following steps. BLAS-3 procedures can be used in the tridiagonalization pro- cess and an overlap of communication and computation is pos- 1. Reduce A to tridiagonal form, i.e. find an orthogonal sible. As a result, a lower runtime can generally be observed transformation Q s.t. trd in the tridiagonalization, compared to the one-step approach. This comes at the cost of more operations in the eigenvector A = Q AQ trd trd trd 2 Algorithm 1 Solution of a Skew-symmetric Eigenvalue Prob- back transformation due to the extra step that has to be per- lem formed. Therefore, ELPA2 is superior to ELPA1 in particular T n×n when only a portion of the eigenvectors is sought. In the context Input: A =−A ∈ R n×n of skew-symmetric eigenvalue problems, this becomes pivotal Output: Unitary eigenvectors Q ∈ C , λ ,...,λ ∈ R s.t 1 n as the purely imaginary eigenvalues come in pairs±λi, λ ∈ R. Q AQ = diag{λ i,... ,λ i}. 1 n The eigenvectors are given as the complex conjugates of each 1: Reduce A to tridiagonal form, i.e. generate Q s.t. trd other. It is therefore enough to compute half of the eigenvalues   0 α and eigenvectors. Both approaches are extended to skew-symmetric matrices   −α 0 T 1   Q AQ = A = . trd trd trd in this work.   . . . . . . n−1 −α 0 n−1 2.2. Solving the Skew-symmetric Eigenvalue Problem 2: Solve the eigenvalue problem for the symmetric tridiago- Like a symmetric matrix, a skew-symmetric matrix can be H 2 n nal matrix−iD A D, where D = diag{1,i,i ,... ,i }, i.e. trd reduced to tridiagonal form using Householder transformations. generate Q s.t. diag A Householder transformation represents a reflection onto a scaled first unit vector e . Let H be a transformation that acts 1     0 α on a vector v s.t. Hv = αe . Obviously −v is transformed to  .  .   α 0 T 1 H(−v) =−αe by the same H. Therefore all tridiagonalization   Q Q = .  .  diag diag   . . . methods that work on symmetric matrices, such as the ones im- . . . . n−1 plemented in ELPA, can in principle work on skew-symmetric α 0 n−1 matrices as well. 3: Back transformation corresponding to symmetrization (see A skew-symmetric tridiagonal matrix is related to a sym- Lemma 1), i.e. compute metric one via the following observation [5]. n×n 2 n−1 Q← DQ ∈ C . Lemma 1. With the unitary matrix D = diag{1,i,i ,... ,i }, diag where i denotes the imaginary unit, α ∈ R, it holds 4: Back transformation corresponding to band-to-tridiagonal     0 α 0 α 1 1 reduction, i.e. compute . . . .     . . −α 0 α 0 H 1 1     −iD D = . (1) Q← Q Q. trd     . . . . . . . . . . . . α α n−1 n−1 −α 0 α 0 n−1 n−1 . denotes the Hermitian transpose of a matrix. back transformation corresponding to tridiagonalization do not After the reduction to tridiagonal form, the symmetric tridi- change, because all they do is to apply Householder transfor- agonal system is solved using a divide-and-conquermethod [11]. mations to non-symmetric (and non-skew-symmetric) matrices. As a first step of the back transformation, the resulting (real) They are applied on the real and imaginary part independently, eigenvectors have to be multiplied by the (complex) matrix D. realizing the complex back transformation in real arithmetic. Then the back transformations corresponding to the tridiagona- The symmetric tridiagonal eigensolver can be used as is. Ma- lization take place. Algorithm 1 outlines the process. It is very king it aware of the zeros on the diagonal might turn out to be similar to the method employed for symmetric eigenvalue prob- numerically or computationally beneficial. lems. The differences are the addition of step 3 and changes in We now examine the implementation of the two tridiago- the implementation, which are given in detail in Sections 2.3.1 nalization approaches in ELPA1 and ELPA2 in more detail. At and 2.3.2. many points in the original implementation, symmetry of the In ELPA2 the transformation steps (1 and 4 in Algorithm 1) matrix is assumed in order to avoid unnecessary computations are both split into two parts as described in Section 2.1. and to efficiently reuse data available in the cache. In this sec- tion we recollect some details of the tridiagonal reduction in 2.3. Implementation order to point out these instances. Here, the implicit assump- Extending ELPA for skew-symmetric matrices means adding tions can be changed from “symmetric” to “skew-symmetric” the back transformation step involving D. In contrast to sym- by simple sign changes. metric matrices, skew-symmetric matrices have complex eigen- ELPA is based on the well established and well documented vectors and strictly imaginary eigenvalues. Computationally 2D block-cyclic data layout introduced by ScaLAPACK for load complex values are introduced in Algorithm 1 with D in step balancing reasons. It is therefore compatible to ScaLAPACK 3. Further transformations have to be performed for the real and can act as a drop-in replacement while no ScaLAPACK and the imaginary part individually. It is preferable to set up routines are used by ELPA itself. In general, each process an array with complex data type entries representing the eigen- works on the part of the matrix that was assigned to it. This vectors as late as possible, so that we can benefit from efficient chunk of data resides in the local memory of the process. Com- routines in double precision. The routines for the eigenvector 3 munication between processes is realized via MPI. Each pro- in Section 2.3.1. cess calls serial BLAS routines. Additional CUDA and OpenMP T T T A← (I−VTV ) A(I−VTV ) (8) support is available. T T T T T = A +V (0.5T V AVTV − T V A) | {z } 2.3.1. Tridiagonalization in ELPA1 In ELPA1, the tridiagonalization is realized in one step us- T T T ing Householder transformations. The computation of the House- + (0.5VT V AVT − AVT)V (9) | {z } holder vectors is not affected by the symmetry of a matrix. Es- sentially, the tridiagonalization of a matrix comes down to a = A + V U U V . (10) series of rank-2 updates [15], described in the following. Given 2 1 a Householder vector v, the update of the trailing submatrix is It holds U = U if A is symmetric, and U =−U if A is skew- 1 2 1 2 performed as symmetric. Each process computes the relevant parts of U in T T a series of (serial) matrix operations and updates the portion A← (I−τvv )A(I−τvv ) (2) of A that resides in its memory. Here, the symmetry of A is 2 T T T 2 T T = A + v(0.5τ v Avv −τv A)+(0.5τ vv Av−τAv)v (3) assumed and exploited at various points in the implementation. | {z } | {z } 2 Sign changes have to be applied at these instances. For the banded-to-tridiagonal reduction, the matrix is redis- T T = A + vu + u v (4) tributed in the form of a 1D block cyclic data layout. Each = A + v u u v . (5) process owns a diagonal and a subdiagonal block. The reduc- 2 1 tion of a particular column introduces fill-in in the neighboring For symmetric matrices it holds u = u . This is assumed in 1 2 block. The “bulge-chasing” is realized as a pipelined algorithm the original ELPA implementation. For skew-symmetric matri- where computation and communication can be overlapped by v u ces it holds u =−u . In ELPA1, the two matrices and 1 2 2 reordering certain operations [11, 17]. The update of the diagonal blocks takes the same form as in u v are stored explicitly. Actual updates are performed ELPA1 (Equations (2) to (5)). Here, no matrix multiplication is using GEMM and GEMV routines. The matrices differ in the data employed but BLAS-2 routines are used working directly with layout, i.e. which process owns which part of the matrix. After the Householder vectors. It holds u = u for symmetric A and the vector u is computed, it is transposed and redistributed to 1 2 u = −u for skew-symmetric A. In the symmetric case, the represent u in v u . Here, for the skew-symmetric variant, 1 2 2 2 update is realized via a symmetric rank-2 update (SYR2). We a sign change is introduced. The skew-symmetric update now implemented a skew-symmetric variant of this routine which reads T T realizes the skew-symmetric rank-2 update A← A− vu + uv . A← A + v −u u v . (6) For the setup of u, a skew-symmetric variant of the BLAS rou- 1 1 tine performing a symmetric matrix vector product (SYMV) is During the computation of u , symmetry is assumed in the necessary. computation of A v. In particular, the code assumes that an off- The other parts of Algorithm 1 are adopted from the sym- diagonal matrix tile is the same as in the transposed matrix. An- metric implementation without changes. The computation of other sign change corrects this assumption for skew-symmetric Householder vectors, the accumulation of the Householder trans- matrices. formations in a triangular matrix and the update of the local block during reduction to banded form do not have to be changed 2.3.2. Tridiagonalization in ELPA2 compared to symmetric ELPA. This is because they act on the In ELPA2, the tridiagonalization is split into two parts. First, lower part of the matrix so that possible (skew-)symmetry has the matrix is reduced to banded form, then to tridiagonal form. no effect. For the reduction to banded form, the Householder vectors are computed by the process column owning the diagonal block. nb×nb 3. The Bethe-Salpeter Eigenvalue Problem They are accumulated in a triangular matrix T ∈ R , where nb is the block size. The product of Householder matrices is Ab initio spectroscopy aims to describe the excitations in stored via its storage-efficient representation [16] condensed matter from first principles, i.e. without the input of any empirical parameters. For light absorption and scattering, Q = H ···H = I−VTV , (7) 1 nb the Bethe-Salpeter Equation (BSE) approach is the state-of-the- art methodology for both crystalline systems[18, 19, 20, 21, 8] v ··· v where V = contains the Householder vectors. 1 nb as well as condensed molecular systems [22, 23, 24, 25]. This H = I−τ v v is the Householder matrix corresponding to the i i i approach takes its name from the Bethe-Salpeter Equation [26], i-th Householder transformation. the equation of motion of the electron-hole correlation function, In this context, the update of the matrix A takes the follow- as derived from many-body perturbation theory [27, 8]. In prac- ing shape, analogous to the direct tridiagonalization described tice, the problem of solving the BSE is mapped to an effective eigenvalue problem. Specifically, its eigenvalues and -states are 4 employed to construct dielectric properties, such as the spectral Let density, absorption spectrum, and the loss function [7, 28]. An Re(A + B) Im(A− B) M = JH = (13) appropriate discretization scheme leads to a finite-dimensional −Im(A + B) Re(A− B) representation in matrix form H that shows a particular block BS structure [29]: be the symmetric matrix associated with the Hamiltonian ma- trix H. Its positive definiteness follows from property (12), A B A B H = = , (11) which can be seen in the following way. Let the matrices S BS H T ¯ ¯ −B −A −B −A and Ω be given as H T n×n A = A , B = B ∈ C . I A B S = , Ω = , (14) Note that the Hermitian transpose . as well as the regular trans- ¯ ¯ −I B A pose without complex conjugation . play a role in this struc- i.e. H = SΩ. With the matrix Q from Theorem 2 we have BS ture. In general, we are interested in all eigenpairs of the Hamil- M =−iJQ SΩQ. (15) tonian, as they contain valuable information on the excitations of the system. Specifically, they describe the bound excitons, It is easily verified that localized electron-hole pairs that form due to correlation be- −iJQ SQ = I , (16) tween an excited electron and a hole. The BSE eigenstates are used to reconstruct the excitonic wavefunction and obtain the i.e. −iJQ S is the inverse of Q. The construction of M (15) excitonic binding energy. can therefore be seen as a similarity transformation of Ω. If Ω In this paper, we present a solution strategy for the most is positive definite (12), so is M. The method described in [7] general formulation of the BSE problem. As such, A and B are relies on this property in order to guarantee the existence of the generally dense and complex-valued, which holds in the case of Cholesky factorization of M. excitations in condensed matter. It performs the following steps. H belongs to the slightly more general class of J-sym- BS 1. Construct M as in (13). metric matrices [30]. This class of matrices display a symmetry (λ ,−λ) in the spectrum. The additional structure in H leads 2. Compute a Cholesky factorization M = LL . BS ¯ ¯ to an additional symmetry (λ ,−λ ,λ ,−λ) and a relation be- 3. Compute eigenpairs of the skew-symmetric matrix L JL, 0 I tween the corresponding eigenvectors. Following [7], we con- where J = . −I 0 sider the definite Bethe-Salpeter eigenvalue problem. H is BS 4. Perform the eigenvector back transformation associated called definite when the property with Cholesky factorization and transformation to Hamil- I 0 A B tonian form (Theorem 2). H = > 0 (12) BS ¯ ¯ 0 −I B A The eigenvalues and eigenvectors can be used to compute is fulfilled, which often holds in practice. In this case, the the optical absorption spectrum of the material in a postpro- eigenvalues are real and therefore come in pairs (λ ,−λ). The cessing step. method presented in this work relies on this assumption. The main workload is given as the solution of a skew-sym- We aim for a solution method that preserves this structure metric eigenvalue problem (Step 3). As a proof of concept, under the influence of inevitable numerical errors, i.e. that guar- solution routines for the symmetric eigenvalue problem from antees that the eigenvalues come in pairs or quadruples, respec- the ScaLAPACK reference implementation [3] were adapted to tively. General methods for eigenvalue problems, such as the the skew-symmetric setting. The matrix is reduced to tridiag- QR/QZ algorithm, destroy this property. In this case it is not onal form using Householder transformations. The tridiagonal clear anymore which eigenpairs correspond to the same excita- eigenvalue problem is solved via bisection and inverse iteration. tion state. The ScaLAPACK reference implementation is not regarded A structure-preserving method running in parallel on dis- as a state-of-the art solver library. When performance and scal- tributed memory systems is developed in [7] and has been made ability are issues, one generally turns to professionally main- available as BSEPACK. It relies on assumption (12) and ex- tained and optimized libraries such as ELPA [4] or vendor- ploits a connection to a Hamiltonian eigenvalue problem given specific implementations such as Intel’s MKL. Within BSE- in the following Theorem. PACK, ScaLAPACK can be substituted by ELPA working on skew-symmetric matrices. The resulting performance benefits I −iI Theorem 2. Let Q = , then Q is unitary and 2 are discussed in Section 4.2. I iI A B Im(A + B) −Re(A− B) Q Q = i =: iH, 4. Numerical Experiments ¯ ¯ −B −A Re(A + B) Im(A− B) 4.1. ELPA Benchmarks where H is real Hamiltonian, i.e. JH = (JH) with In this section we present performance results for the skew- 0 I J = . symmetric ELPA extension. All test programs are run on the −I 0 5 Table 1: Execution time speedups achieved by different aspects of the solution approach. #Cores Compl. Compl. Skew-Sym. Skew-Sym. ELPA2 ELPA2 50% ELPA2 50% ELPA2 50% 100% vs. vs. Compl. vs. Compl. vs. Compl. Compl. MKL 50% ELPA2 50% MKL 50% MKL 100 % 16 1.10 1.41 2.33 3.28 32 1.29 1.41 2.30 3.24 64 1.11 1.40 2.32 3.25 128 1.18 1.33 2.20 2.93 256 1.17 1.28 2.16 2.76 512 1.21 1.51 1.87 2.82 16 32 64 128 256 512 Number of cores Complex ELPA1, 100% Complex ELPA1, 50% Complex ELPA2, 100% Complex ELPA2, 50% Complex MKL 100% Complex MKL 50% Skew-Symmetric Skew-Symmetric ELPA1, 100% ELPA1, 50% Skew-Symmetric Skew-Symmetric ELPA2, 100% ELPA2, 50% Figure 1: Scaling of the ELPA solver for skew-symmetric matrices. For com- parison the runtimes for the alternative solution method via complex Hermitian solvers is included. Here, ELPA and Intel’s MKL 2018 routines pzheevd and pzheevr are used. The matrix has a size of n = 20000. 16 32 64 128 256 512 Number of cores mechthild compute cluster, located at the Max Planck Institute ELPA2: Band-to- ELPA2: Full-to-Band for Dynamics of Complex Technical Systems in Magdeburg, Tridiagonal Germany. Up to 32 nodes are used, which consist of 2 Intel ELPA2: Full-to- ELPA1: Full-to-Tridiagonal Tridiagonal Xeon Silver 4110 (Skylake) processors with 8 cores each, run- PDSSTRD NB = 16 PDSSTRD NB = 64 ning at 2.1 GHz. The Intel compiler, MPI library and MKL PDSSTRD NB = 256 in the 2018 version are used in all test programs. The com- putations use randomly generated skew-symmetric matrices in Figure 2: Scaling of the tridiagonalization in two steps (ELPA2) and one step double-precision. (ELPA1). We compare it to the runtimes of the tridiagonalization routine for Figure 1 shows the resulting performance and the scaling skew-symmetric matrices PDSSTRD available in BSEPACK [7] for different properties of ELPA for a medium sized skew-symmetric ma- block sizes NB. The matrix size is n = 20000. trix (n = 20000). As an alternative to the approach described in this work, the skew-symmetric matrix can be multiplied with in Table 1). When all eigenpairs are computed, ELPA1 and the imaginary unit i. The resulting complex Hermitian matrix ELPA2 yield very similar runtime results which is why only can be diagonalized using available methods in ELPA or Intel’s ELPA2 is considered in Table 1. The two-step approach em- ScaLAPACK implementation shipped with the MKL. This rep- ployed by ELPA2 pays off in particular when not all eigen- resents the only previously available approach to solve skew- pairs are sought, which is the case here. When complex 50% symmetric eigenvalue problems in a massively parallel high- solvers are compared (ELPA2 vs. MKL, column 3 in Table performance setting. 1), the achieved speedup increases to a value between 1.28 and For skew-symmetric matrices, only 50% of eigenvalues and 1.51. The largest impact on the performance is caused by avoid- eigenvectors need to be computed, as they are purely imaginary ing complex arithmetic. This is represented by the speedup and come in pairs ±λi,λ ∈ R. The runtime measurements for achieved by the skew-symmetric 50% ELPA2 implementation 100% are included for reference. compared to the complex 50% ELPA2 implementation (column Figure 1 shows that all approaches display good scalabil- 4 of Table 1). This accounts for an additional speedup of 1.87 ity in the examined setting. Skew-symmetric ELPA runs 2.76 to 2.33. to 3.28 times faster than the complex MKL based solver, where The tridiagonalization is an essential step in every consi- both only compute 50% of eigenpairs. The data gives further in- dered solution scheme and contributes a significant portion of sight into how this improvement is achieved. Table 1 compares the execution time. The fewer eigenpairs are sought, the more the runtimes for different solvers and presents the achieved speed- dominant it becomes with respect to computation time. Figure 2 ups. When we compare complex 100% solvers, ELPA already displays the runtimes and scalability of available tridiagonaliza- improves performance by a factor of 1.1 to 1.29 (column 2 Runtime in s Runtime in s Table 2: Execution time speedups achieved by different aspects of the solution approach. Matrix Compl. Compl. Skew-Sym. Skew-Sym. ELPA2 ELPA2 50% ELPA2 50% ELPA2 50% size 100% vs. vs. Compl. vs. Compl. vs. Compl. Compl. MKL 50% ELPA2 50% MKL 50% MKL 100 % 50 000 1.17 1.45 2.32 3.35 75 000 1.16 1.46 2.39 3.50 100 000 1.17 1.47 2.42 3.57 125 000 1.17 1.49 2.46 3.67 50000 75000 125000 Matrix Size n Complex ELPA1, 100% Complex ELPA1, 50% Complex ELPA2, 100% Complex ELPA2, 50% Complex MKL 100% Complex MKL 50% Skew-Symmetric Skew-Symmetric ELPA1, 100% ELPA1, 50% Skew-Symmetric Skew-Symmetric ELPA2, 100% ELPA2, 50% 200 Figure 3: Runtimes for solving eigenvalue problems of larger sizes. 256 CPU cores were used, i.e. 16 nodes on the mechthild compute cluster. 1024 8192 16384 24576 32768 tion techniques for skew-symmetric matrices. As an alternative Matrix Size n implementation to the presented approaches there is a tridiago- 2x Intel Xeon Silver 4110, ELPA2 nalization routine PDSSTRD shipped in BSEPACK [7]. It is an 2x Intel Xeon Silver 4110 + 1x Nvidia P100, ELPA1 adapted version of the ScaLAPACK reference implementation. Figure 4: Runtimes for solving eigenvalue problems on one node on the All discussed implementations are based on the 2D-block- mechthild compute cluster employing a GPU. cyclic data distribution established by ScaLAPACK. Here, the matrix is divided into blocks of a certain size NB. The blocks are distributed to processes organized in a 2D grid in a cyclic the ELPA library [32]. The design approach is to stick with the manner. Typically, the block size is a parameter chosen once same code base as the CPU-only version, and offload compute- in a software project. The data redistribution to data layouts intense parts, such as BLAS-3 operations, to the GPU in or- defined by other block sizes is avoided as this involves expen- der to benefit from its massive parallelism. This is done using sive all-to-all communication. The main disadvantage of the the CUBLAS library provided by NVIDIA. Because ELPA2 PDSSTRD routine is that it is very susceptible to the chosen employs more fine-grained communication patterns, this ap- block size, both with regard to scalability and overall perfor- proach works best for ELPA1. Here, the performance can ben- mance. This makes it less suitable to be included in larger soft- efit when the computational intensity is high enough, i.e. when ware projects, where the block size is a parameter predefined big chunks of data are being worked on by the GPU. by other factors. ELPA (both the one and two-step version) on Figure 4 shows the performance that can be achieved on one the other hand does not have this problem and performs equally node of the mechthild compute cluster, that is equipped with well for all data layouts [31]. an NVIDIA P100 GPU as an accelerator device. The GPU ver- Figure 2 also displays the advantage of the two-step tridiag- sion is based on ELPA1 and therefore does not benefit from the onalization over the one-step approach. Here the performance faster tridiagonalization in ELPA2 (see Figure 2 and the dis- is dominated by the first step, i.e. the reduction to banded form. cussion in the previous section). Despite this fact, the GPU- In the context of electronic structure computations, the matri- accelerated ELPA1 version eventually outperforms the ELPA2 ces of interest can become extremely large. Figure 3 displays CPU-only version, if the matrix is large enough. In our case the achieved runtime improvements for larger matrices up to a the turning point is at around n = 15000. For smaller matrices size of n = 125000. The individual speedups are presented in the additional work of setting up the CUDA environment and Table 2. For large matrices we achieve a speedup of up to 3.67 transferring the matrix counteracts any possible performance compared to the available MKL routine. benefits and results in a larger runtime. For matrices of size n = 32768 employing the GPU can reduce the runtime from 4.1.1. GPU Acceleration 570 seconds to 328 seconds, i.e. by 41%. For the 1-step tridiagonalization approach (ELPA1), there The take-away message of these results is the following. If is a GPU-accelerated version available that gets shipped with nodes equipped with GPUs are available and to be utilized, it is Runtime in s Runtime in s NB = 64, but choosing a larger block size can increase the per- formance dramatically, as can be seen in Figure 5 for NB = 256. Typically, software packages (e.g. [42, 28]) developed for elec- tronic structure computations are large and contain many fea- tures, implementing methods for different quantities of inter- est. The block size is typically predetermined by other con- siderations. It would mean a serious effort to change it, in or- der to optimize just one building block of the software. Fur- thermore the optimal block size of the original BSEPACK is probably dependent on the given hardware and the given ma- trix size. Autotuning frameworks could help, but are also very costly and impose an additional implementation effort. A soft- 32 128 16 64 256 512 ware, that does not show this kind of runtime dependency is Number of cores greatly preferable. Employing ELPA for the main computa- BSEPACK NB = 64 BSEPACK NB = 256 tional task in BSEPACK fulfills this requirement. The perfor- BSEPACK + ELPA2 mance of ELPA is independent of the chosen NB, because the block size on the node level for optimal cache use is decoupled Figure 5: Scaling of the direct, complex BSEPACK eigenvalue solver for com- from the block size defining the multi-node data layout. puting the optical absorption spectrum of hexagonal boron nitride. The Bethe- Salpeter matrix (11) has a size of 51200. The ELPA-accelerated version is up to 9.22 times as fast as the original code with the default block size. Even when the block size is increased, using the new solver always yields important to make sure each node has enough data to work on. a better performance. In the case of NB = 256, the ELPA- This way, the available resources are used most efficiently. version still performs up to 2.76 times as fast. Choosing even larger block sizes has in general no further positive effect on the 4.2. Accelerating BSEPACK performance of the original BSEPACK. Employing ELPA also We consider the performance improvements that can be a- leads to an improved scalability over the number of cores. chieved by using the newly developed skew-symmetric eigen- value solver in the BSEPACK [7] software, described in Sec- 5. Conclusions tion 3. In this procedure, Step 3, the computation of eigenpairs of the skew-symmetric matrix L JL, is now performed by the We have presented a strategy to extend existing solver li- ELPA library. braries for symmetric eigenvalue problems to the skew-sym- To demonstrate the speedup, we consider the example of metric case. Applying these ideas to the ELPA library, makes hexagonal boron nitride at a fixed size of the BSE Hamiltonian. it possible to compute eigenvalues and eigenvectors of large The excitations in hexagonal boron nitride are widely studied skew-symmetric matrices in parallel with a high level of ef- both experimentally and theoretically [33, 34, 35, 36, 37, 38, ficiency and scalability. We benefit from the maturity of the 39], as its wide band gap and the layered geometrical structure ELPA software project, where many optimizations have been yield strong effects of electron-hole correlation, such as the for- realized over the years. All of these, including GPU support, mation of bound excitons. Previous studies have shown that the find their way into the presented skew-symmetric solver. As far BSE approach yields the optical absorption and excitonic prop- as we know, no other solvers dedicated to the skew-symmetric erties with high accuracy. In our calculations, the BSE Hamil- eigenvalue problem exist in an HPC setting. It is always pos- tonian is constructed on a 16×16×4 k-grid in the 1st Brillouin sible to solve a complex Hermitian eigenvalue problem instead zone, the 5 highest valence and 5 lowest conduction bands are of a skew-symmetric one. Our newly developed solver outper- employed to construct the transition space, leading to a matrix forms this strategy, implemented via Intel MKL ScaLAPACK, size of 2×16×16×4×5×5 = 51200. In the calculation of the by a factor of 3. We also observe an increase in performance BSE Hamiltonian, single-particle wavefunctions and the static concerning the Bethe-Salpeter eigenvalue problem. Here we dielectric function are expanded in plane waves with a cut-off improve the runtime of available routines by a factor of almost of 387 eV and 132 eV, respectively. The static dielectric func- 10, making the BSEPACK library with ELPA a viable choice tion is obtained from ABINIT [40], while the BSE Hamiltonian as a building block for larger electronic structure packages. is constructed using the EXC code [41]. Figure 5 displays the achieved runtimes of BSEPACK for 6. Acknowledgment this fixed-size matrix for different core counts. We compare the original version and a version that employs ELPA. The perfor- We thank Francesco Sottile for fruitful discussion and his mance of the original solver is highly dependent on the cho- support in generating the BSE Hamiltonian for hexagonal BN. sen block size (see also Figure 2). This parameter determines how the matrix is distributed to the available processes in the form of a 2D block-cyclic data layout. The default is given as Runtime in s References [19] L. X. Benedict, E. L. Shirley, R. B. Bohn, Optical absorp- tion of insulators and the electron-hole interaction: An ab [1] G. H. Golub, C. F. Van Loan, Matrix Computations, 4th Edition, Johns initio calculation, Phys. Rev. Lett. 80 (1998) 4514–4517. Hopkins Studies in the Mathematical Sciences, Johns Hopkins University doi:10.1103/PhysRevLett.80.4514. Press, Baltimore, 2013. [20] S. Albrecht, L. Reining, R. Del Sole, G. Onida, Ab ini- [2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, tio calculation of excitonic effects in the optical spectra of A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. So- semiconductors, Phys. Rev. Lett. 80 (1998) 4510–4513. rensen, LAPACK Users’ Guide, SIAM, Philadelphia, PA, 2nd Edition doi:10.1103/PhysRevLett.80.4510. (1995). doi:10.1137/1.9780898719604. [21] S. Sagmeister, C. Ambrosch-Draxl, Time-dependent density functional [3] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, theory versus Bethe-Salpeter equation: an all-electron study, Phys. Chem. J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, Chem. Phys. 11 (2009) 4451–4457. doi:10.1039/B903676H. D. Walker, R. C. Whaley, ScaLAPACK User’s Guide, Vol. 4 of Soft- [22] J. C. Grossman, M. Rohlfing, L. Mitas, S. G. Louie, M. L. ware, Environments and Tools, SIAM Publications, Philadelphia, PA, Cohen, High accuracy many-body calculational approaches for ex- USA, 1997. doi:10.1137/1.9780898719642. citations in molecules, Phys. Rev. Lett. 86 (3) (2001) 472. [4] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, doi:10.1103/PhysRevLett.86.472. A. Heinecke, H.-J. Bungartz, H. Lederer, The ELPA library: scalable [23] C. Faber, P. Boulanger, C. Attaccalite, I. Duchemin, X. Blase, Ex- parallel eigenvalue solutions for electronic structure theory and compu- cited states properties of organic molecules: from density func- tational science, Journal of Physics: Condensed Matter 26 (21) (2014) tional theory to the GW and Bethe–Salpeter Green’s function for- 213201. doi:10.1088/0953-8984/26/21/213201. malisms, Phil. Trans. R. Soc. A 372 (2011) (2014) 20130271. [5] R. C. Ward, L. J. Gray, Eigensystem computation for skew-symmetric doi:10.1098/rsta.2013.0271. and a class of symmetric matrices, ACM Trans. Math. Softw. 4 (3) (1978) [24] C. Cocchi, C. Draxl, Optical spectra from molecules to crystals: Insight 278–285. doi:10.1145/355791.355798. from many-body perturbation theory, Phys. Rev. B 92 (2015) 205126. [6] P. Benner, D. Kreßner, V. Mehrmann, Skew-Hamiltonian and Hamil- doi:10.1103/PhysRevB.92.205126. tonian eigenvalue problems: Theory, algorithms and applications, [25] D. Hirose, Y. Noguchi, O. Sugino, All-electron G W+ Bethe-Salpeter in: Z. Drmacˇ, M. Marusic, Z. Tutek (Eds.), Proc. Conf. Appl calculations on small molecules, Phys. Rev. B 91 (20) (2015) 205111. Math. Scientific Comp., Springer-Verlag, Dordrecht, 2005, pp. 3–39. doi:10.1103/PhysRevB.91.205111. doi:10.1007/1-4020-3197-1_1. [26] E. E. Salpeter, H. A. Bethe, A relativistic equation for [7] M. Shao, F. H. da Jornada, C. Yang, J. Deslippe, S. G. Louie, Structure bound-state problems, Phys. Rev. 84 (1951) 1232–1242. preserving parallel algorithms for solving the Bethe-Salpeter eigenvalue doi:10.1103/PhysRev.84.1232. problem, Linear Algebra and its Applications 488 (Supplement C) (2016) [27] G. Strinati, Application of the Green’s functions method to the study of 148 – 167. doi:10.1016/j.laa.2015.09.036. the optical properties of semiconductors, Riv. Nuovo Cimento 11 (12) [8] G. Onida, L. Reining, A. Rubio, Electronic excita- (1988) 1–86. doi:10.1007/BF02725962. tions: density-functional versus many-body Greens- [28] C. Vorwerk, B. Aurich, C. Cocchi, C. Draxl, Bethe–Salpeter equa- function approaches, Rev. Mod. Phys. 74 (2) (2002) 601. tion for absorption and scattering spectroscopy: implementation doi:10.1103/RevModPhys.74.601. in the exciting code, Electronic Structure 1 (3) (2019) 037001. [9] F. Furche, On the density matrix based approach to time-dependent den- doi:10.1088/2516-1075/ab3123. sity functional response theory, The Journal of Chemical Physics 114 (14) [29] T. Sander, E. Maggio, G. Kresse, Beyond the Tamm-Dancoff approxima- (2001) 5982–5992. doi:10.1063/1.1353585. tion for extended systems using exact diagonalization, Phys. Rev. B 92 [10] J. C´ızˇek, J. Paldus, Stability conditions for the solutions of the Hartree- (2015) 045209. doi:10.1103/PhysRevB.92.045209. Fock equations for atomic and molecular systems. application to the [30] P. Benner, H. Faßbender, C. Yang, Some remarks on the complex J- Pi-electron model of cyclic polyenes, The Journal of Chemical Physics symmetric eigenproblem, Linear Algebra and its Applications 544 (2018) 47 (10) (1967) 3976–3985. doi:10.1063/1.1701562. 407 – 442. doi:10.1016/j.laa.2018.01.014. [11] T. Auckenthaler, V. Blum, H.-J. Bungartz, T. Huckle, R. Johanni, [31] P. Benner, A. Marek, C. Penke, Improving the performance of numerical L. Kra¨mer, B. Lang, H. Lederer, P. Willems, Parallel solution of par- algorithms for the Bethe-Salpeter eigenvalue problem, Proc. Appl. Math. tial symmetric eigenvalue problems from electronic structure calcula- Mech. 18 (1) (2018). doi:10.1002/pamm.201800255. tions, Parallel Computing 37 (12) (2011) 783 – 794, 6th International [32] P. Ku˚s, H. Lederer, A. Marek, GPU optimization of large-scale eigen- Workshop on Parallel Matrix Algorithms and Applications (PMAA’10). value solver, in: F. A. Radu, K. Kumar, I. Berre, J. M. Nordbotten, I. S. doi:10.1016/j.parco.2011.05.002. Pop (Eds.), Numerical Mathematics and Advanced Applications ENU- [12] A. Alvermann, A. Basermann, H.-J. Bungartz, et al., Benefits from using MATH 2017, Springer International Publishing, Cham, 2019, pp. 123– mixed precision computations in the ELPA-AEO and ESSEX-II eigen- 131. doi:10.1007/978-3-319-96415-7_9. solver projects, Japan Journal of Industrial and Applied Mathematics [33] G. Cappellini, G. Satta, M. Palummo, G. Onida, Optical properties of BN 36 (2) (2019) 699–717. doi:10.1007/s13160-019-00360-8. in cubic and layered hexagonal phases, Phys. Rev. B 64 (2001) 035104. [13] Message Passing Interface Forum, MPI: A message-passing interface doi:10.1103/PhysRevB.64.035104. standard, Tech. rep., Knoxville, TN, USA (1994). [34] X. Blase, A. Rubio, S. G. Louie, M. L. Cohen, Quasiparticle band struc- [14] P. Ku˚s, A. Marek, S. Ko¨cher, H.-H. Kowalski, C. Carbogno, C. Scheurer, ture of bulk hexagonal boron nitride and related systems, Phys. Rev. B 51 K. Reuter, M. Scheffler, H. Lederer, Optimizations of the eigen- (1995) 6868–6875. doi:10.1103/PhysRevB.51.6868. solvers in the ELPA library, Parallel Computing 85 (2019) 167 – 177. [35] S. Galambosi, L. Wirtz, J. A. Soininen, J. Serrano, A. Marini, K. Watan- doi:10.1016/j.parco.2019.04.003. abe, T. Taniguchi, S. Huotari, A. Rubio, K. Ha¨ma¨la¨inen, Anisotropic [15] R. S. Martin, C. Reinsch, J. H. Wilkinson, Householder’s tridiagonaliza- excitonic effects in the energy loss function of hexagonal boron nitride, tion of a symmetric matrix, Numerische Mathematik 11 (3) (1968) 181– Phys. Rev. B 83 (2011) 081413. doi:10.1103/PhysRevB.83.081413. 195. doi:10.1007/BF02161841. [36] G. Fugallo, M. Aramini, J. Koskelo, K. Watanabe, T. Taniguchi, [16] R. S. Schreiber, C. Van Loan, A storage-efficient WY representation for M. Hakala, S. Huotari, M. Gatti, F. Sottile, Exciton energy-momentum products of Householder transformations, SIAM J. Sci. Statist. Comput. map of hexagonal boron nitride, Phys. Rev. B 92 (2015) 165122. 10 (1989) 53–57. doi:10.1103/PhysRevB.92.165122. [17] T. Auckenthaler, H.-J. Bungartz, T. Huckle, L. Krmer, B. Lang, [37] P. Cudazzo, L. Sponza, C. Giorgetti, L. Reining, F. Sottile, M. Gatti, Ex- P. Willems, Developing algorithms and software for the parallel solution citon band structure in two-dimensional materials, Phys. Rev. Lett. 116 of the symmetric eigenvalue problem, Journal of Computational Science (2016) 066803. doi:10.1103/PhysRevLett.116.066803. 2 (3) (2011) 272 – 278. doi:10.1016/j.jocs.2011.05.002. [38] J. Koskelo, G. Fugallo, M. Hakala, M. Gatti, F. Sottile, P. Cu- [18] M. Rohlfing, S. G. Louie, Electron-hole excitations in semicon- dazzo, Excitons in van der Waals materials: From monolayer to ductors and insulators, Phys. Rev. Lett. 81 (1998) 2312–2315. bulk hexagonal boron nitride, Phys. Rev. B 95 (2017) 035125. doi:10.1103/PhysRevLett.81.2312. doi:10.1103/PhysRevB.95.035125. 9 [39] W. Aggoune, C. Cocchi, D. Nabok, K. Rezouali, M. A. Belkhir, C. Draxl, Dimensionality of excitons in stacked van der Waals materials: The example of hexagonal boron nitride, Phys. Rev. B 97 (2018) 241114. doi:10.1103/PhysRevB.97.241114. [40] X. Gonze, F. Jollet, F. Abreu Araujo, D. Adams, B. Amadon, T. Ap- plencourt, C. Audouze, J. M. Beuken, J. Bieder, A. Bokhanchuk, E. Bousquet, F. Bruneval, D. Caliste, M. Coˆte´, F. Dahm, F. Da Pieve, M. Delaveau, M. Di Gennaro, B. Dorado, C. Espejo, G. Geneste, L. Gen- ovese, A. Gerossier, M. Giantomassi, Y. Gillet, D. R. Hamann, L. He, G. Jomard, J. Laflamme Janssen, S. Le Roux, A. Levitt, A. Lherbier, F. Liu, I. Lukacˇevic´, A. Martin, C. Martins, M. J. T. Oliveira, S. Ponce´, Y. Pouillon, T. Rangel, G. M. Rignanese, A. H. Romero, B. Rousseau, O. Rubel, A. A. Shukri, M. Stankovski, M. Torrent, M. J. Van Set- ten, B. Van Troeye, M. J. Verstraete, D. Waroquiers, J. Wiktor, B. Xu, A. Zhou, J. W. Zwanziger, Recent developments in the ABINIT soft- ware package, Computer Physics Communications 205 (2016) 106–131. doi:10.1016/j.cpc.2016.04.003. [41] Exc webpage, www.bethe-salpeter.org, accessed: 2019-11-26. [42] A. Gulans, S. Kontur, C. Meisenbichler, D. Nabok, P. Pavone, S. Riga- monti, S. Sagmeister, U. Werner, C. Draxl, Exciting: A full-potential all- electron package implementing density-functional theory and many-body perturbation theory, J. Phys. Condens. Matter. 26 (36) (2014) 363202. doi:10.1088/0953-8984/26/36/363202. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Mathematics arXiv (Cornell University)

High Performance Solution of Skew-symmetric Eigenvalue Problems with Applications in Solving the Bethe-Salpeter Eigenvalue Problem

Loading next page...
 
/lp/arxiv-cornell-university/high-performance-solution-of-skew-symmetric-eigenvalue-problems-with-AtLBwVq2JG

References (49)

ISSN
0167-8191
eISSN
ARCH-3343
DOI
10.1016/j.parco.2020.102639
Publisher site
See Article on Publisher Site

Abstract

We present a high-performance solver for dense skew-symmetric matrix eigenvalue problems. Our work is motivated by applica- tions in computational quantum physics, where one solution approach to solve the Bethe-Salpeter equation involves the solution of a large, dense, skew-symmetric eigenvalue problem. The computed eigenpairs can be used to compute the optical absorption spectrum of molecules and crystalline systems. One state-of-the art high-performance solver package for symmetric matrices is the ELPA (Eigenvalue SoLvers for Petascale Applications) library. We exploit a link between tridiagonal skew-symmetric and symmet- ric matrices in order to extend the methods available in ELPA to skew-symmetric matrices. This way, the presented solution method can benefit from the optimizations available in ELPA that make it a well-established, efficient and scalable library. The solution strategy is to reduce a matrix to tridiagonal form, solve the tridiagonal eigenvalue problem and perform a back-transformation for eigenvectors of interest. ELPA employs a one-step or a two-step approach for the tridiagonalization of symmetric matrices. We adapt these to suit the skew-symmetric case. The two-step approach is generally faster as memory locality is exploited better. If all eigenvectors are required, the performance improvement is counteracted by the additional back transformation step. We exploit the symmetry in the spectrum of skew-symmetric matrices, such that only half of the eigenpairs need to be computed, making the two- step approach the favorable method. We compare performance and scalability of our method to the only available high-performance approach for skew-symmetric matrices, an indirect route involving complex arithmetic. In total, we achieve a performance that is up to 3.67 higher than the reference method using Intel’s ScaLAPACK implementation. Our method is freely available in the current release of the ELPA library. Keywords: Distributed memory, Skew-symmetry, Eigenvalue and eigenvector computations, GPU acceleration, Bethe-Salpeter, Many-body perturbation theory 1. Introduction value problems running on distributed memory machines such as compute clusters. n×n A matrix A ∈ R is called skew-symmetric when A = The skew-symmetric case [5] lacks the ubiquitous presence T T −A , where . denotes the transposition of a matrix. We are of its symmetric counterpart and has not received the same ex- interested in eigenvalues and eigenvectors of A. tensive treatment. We close this gap by extending the ELPA The symmetric eigenvalue problem, i.e. the case A = A , methodology to the skew-symmetric case. has been studied in depth for many years. It lies at the core of Our motivation stems from the connection to the Hamilto- many applications in different areas such as electronic structure nian eigenvalue problem which has many applications in con- computations. Many methods for its solution have been pro- trol theory and model order reduction [6]. A real Hamiltonian posed [1] and successfully implemented. Optimized libraries matrix H is connected to a symmetric matrix M via the matrix for many platforms are widely available [2, 3]. With the rise of 0 I J = , where I denotes the identity matrix, more advanced computer architectures and more powerful su- −I 0 percomputers, the solution of increasingly complex problems M = JH. comes within reach. Parallelizability and scalability become key issues in algorithm development. The ELPA library [4] If M is positive definite, in the following denoted by M > 0, is one endeavor to tackle these challenges and provides highly the Hamiltonian eigenvalue problem can be recast into a skew- competitive direct solvers for symmetric (and Hermitian) eigen- symmetric eigenvalue problem using the Cholesky factorization M = LL . The eigenvalues of H are given as eigenvalues of ∗ the skew-symmetric matrix L JL and eigenvectors can be trans- Corresponding author formed accordingly. Email address: penke@mpi-magdeburg.mpg.de (Carolin Penke) These authors were supported by BiGmax, the Max Planck Societys Re- This situation occurs for example in [7], where a structure- search Network on Big-Data-Driven Materials Science. preserving method for the solution of the Bethe-Salpeter eigen- Preprint submitted to Parallel Computing April 21, 2020 arXiv:1912.04062v2 [math.NA] 20 Apr 2020 value problem is described. Solving the Bethe-Salpeter eigen- is tridiagonal. This is done by accumulating Householder value problem allows a prediction of optical properties in con- transformations densed matter, a more accurate approach than currently used Q = Q Q ···Q , trd 1 2 n−1 ones, such as time-dependent density functional theory (TDDFT) [8]. In this application context, the condition M > 0 ultimately where Q = I−τ v v represents the i-th Householder trans- i i i follows from much weaker physical interactions represented in i formation that reduces the i-th column and row of the the off-diagonal values [9, 10]. When larger systems are of T T updated Q ···Q AQ ···Q to tridiagonal form. The 1 i−1 interest, the resulting matrices easily become very high-dimen- i−1 1 matrices Q are not formed explicitly but are represented sional. This calls for a parallelizable and scalable algorithm. by the Householder vectors v . These are stored in place The solution of the corresponding skew-symmetric eigenvalue of the eliminated columns of A. problem can be accelerated via the developments presented in 2. Solve the tridiagonal eigenvalue problem, i.e. find or- this paper. thogonal Q s.t. The remaining paper is structured as follows. Section 2 diag reintroduces the methods used by ELPA and points out the nec- Λ = Q A Q . essary adaptations to make them work for skew-symmetric ma- diag trd diag trices. The Bethe-Salpeter problem is presented in Section 3. In ELPA, this step employs a tridiagonal divide-and-con- Section 4 provides performance results of the ELPA extension, quer scheme. including GPU acceleration, and points out the speedup achieved 3. Transform the required eigenvectors back, i.e. perform in the context of the Bethe-Salpeter eigenvalue problem. the computation 2. Solution Method Q = Q Q . trd diag 2.1. Solving the Symmetric Eigenvalue Problem in ELPA The ELPA solver comes in two flavors which define the The ELPA library [4, 11, 12] is a highly optimized parallel details of the transformation steps, i.e Steps 1 and 3. ELPA1 MPI-based code [13]. It shows great scalability over thousands works as described, the reduction to tridiagonal form is per- of CPU cores and contains low-level optimizations targeting formed in one step. ELPA2 splits the transformations into two specific compute architectures [14]. When only a portion of parts. Step 1 becomes eigenvalues and eigenvectors are needed, this is exploited algo- 1. (a) Reduce A to banded form, i.e. compute orthogonal rithmically and results in performance benefits. We briefly de- Q s.t. band scribe the well-established procedure employed by ELPA. This forms the basis of the method for skew-symmetric matrices de- A = Q AQ band band band scribed in the next subsection. ELPA contains functionality to deal with symmetric-definite is a band matrix. generalized eigenvalue problems. In this paper, we focus on the (b) Reduce the banded form to tridiagonal form, i.e. standard eigenvalue problem for simplicity. This is reasonable compute orthogonal Q s.t. trd as it is the most common use case and forms the basis of any method for generalized problems. We only consider real skew- A = Q A Q trd band trd trd symmetric problems. The reason is that any skew-symmetric problem can be transformed into a Hermitian eigenvalue prob- is tridiagonal. lem by multiplying it with the imaginary unit i. This problem Accordingly, the back transformation step is split into two parts can be solved using the available ELPA functionality for com- plex matrices. For the real case this induces complex arithmetic 3. (a) Perform the back transformation corresponding to which should obviously be avoided, but for complex matrices the band-to-tridiagonal reduction this is a viable approach. Q = Q Q . We consider the symmetric eigenvalue problem, i.e. the or- trd diag thogonal diagonalization of a matrix, (b) Perform the back transformation corresponding to Q AQ = Λ, the full-to-band reduction T n×n where A = A ∈R is the matrix whose eigenvalues are sought. Q = Q Q. band We are looking for the orthogonal eigenvector matrix Q and the diagonal matrix Λ containing the eigenvalues. The solution is The benefit of the two-step approach is that more efficient carried out in the following steps. BLAS-3 procedures can be used in the tridiagonalization pro- cess and an overlap of communication and computation is pos- 1. Reduce A to tridiagonal form, i.e. find an orthogonal sible. As a result, a lower runtime can generally be observed transformation Q s.t. trd in the tridiagonalization, compared to the one-step approach. This comes at the cost of more operations in the eigenvector A = Q AQ trd trd trd 2 Algorithm 1 Solution of a Skew-symmetric Eigenvalue Prob- back transformation due to the extra step that has to be per- lem formed. Therefore, ELPA2 is superior to ELPA1 in particular T n×n when only a portion of the eigenvectors is sought. In the context Input: A =−A ∈ R n×n of skew-symmetric eigenvalue problems, this becomes pivotal Output: Unitary eigenvectors Q ∈ C , λ ,...,λ ∈ R s.t 1 n as the purely imaginary eigenvalues come in pairs±λi, λ ∈ R. Q AQ = diag{λ i,... ,λ i}. 1 n The eigenvectors are given as the complex conjugates of each 1: Reduce A to tridiagonal form, i.e. generate Q s.t. trd other. It is therefore enough to compute half of the eigenvalues   0 α and eigenvectors. Both approaches are extended to skew-symmetric matrices   −α 0 T 1   Q AQ = A = . trd trd trd in this work.   . . . . . . n−1 −α 0 n−1 2.2. Solving the Skew-symmetric Eigenvalue Problem 2: Solve the eigenvalue problem for the symmetric tridiago- Like a symmetric matrix, a skew-symmetric matrix can be H 2 n nal matrix−iD A D, where D = diag{1,i,i ,... ,i }, i.e. trd reduced to tridiagonal form using Householder transformations. generate Q s.t. diag A Householder transformation represents a reflection onto a scaled first unit vector e . Let H be a transformation that acts 1     0 α on a vector v s.t. Hv = αe . Obviously −v is transformed to  .  .   α 0 T 1 H(−v) =−αe by the same H. Therefore all tridiagonalization   Q Q = .  .  diag diag   . . . methods that work on symmetric matrices, such as the ones im- . . . . n−1 plemented in ELPA, can in principle work on skew-symmetric α 0 n−1 matrices as well. 3: Back transformation corresponding to symmetrization (see A skew-symmetric tridiagonal matrix is related to a sym- Lemma 1), i.e. compute metric one via the following observation [5]. n×n 2 n−1 Q← DQ ∈ C . Lemma 1. With the unitary matrix D = diag{1,i,i ,... ,i }, diag where i denotes the imaginary unit, α ∈ R, it holds 4: Back transformation corresponding to band-to-tridiagonal     0 α 0 α 1 1 reduction, i.e. compute . . . .     . . −α 0 α 0 H 1 1     −iD D = . (1) Q← Q Q. trd     . . . . . . . . . . . . α α n−1 n−1 −α 0 α 0 n−1 n−1 . denotes the Hermitian transpose of a matrix. back transformation corresponding to tridiagonalization do not After the reduction to tridiagonal form, the symmetric tridi- change, because all they do is to apply Householder transfor- agonal system is solved using a divide-and-conquermethod [11]. mations to non-symmetric (and non-skew-symmetric) matrices. As a first step of the back transformation, the resulting (real) They are applied on the real and imaginary part independently, eigenvectors have to be multiplied by the (complex) matrix D. realizing the complex back transformation in real arithmetic. Then the back transformations corresponding to the tridiagona- The symmetric tridiagonal eigensolver can be used as is. Ma- lization take place. Algorithm 1 outlines the process. It is very king it aware of the zeros on the diagonal might turn out to be similar to the method employed for symmetric eigenvalue prob- numerically or computationally beneficial. lems. The differences are the addition of step 3 and changes in We now examine the implementation of the two tridiago- the implementation, which are given in detail in Sections 2.3.1 nalization approaches in ELPA1 and ELPA2 in more detail. At and 2.3.2. many points in the original implementation, symmetry of the In ELPA2 the transformation steps (1 and 4 in Algorithm 1) matrix is assumed in order to avoid unnecessary computations are both split into two parts as described in Section 2.1. and to efficiently reuse data available in the cache. In this sec- tion we recollect some details of the tridiagonal reduction in 2.3. Implementation order to point out these instances. Here, the implicit assump- Extending ELPA for skew-symmetric matrices means adding tions can be changed from “symmetric” to “skew-symmetric” the back transformation step involving D. In contrast to sym- by simple sign changes. metric matrices, skew-symmetric matrices have complex eigen- ELPA is based on the well established and well documented vectors and strictly imaginary eigenvalues. Computationally 2D block-cyclic data layout introduced by ScaLAPACK for load complex values are introduced in Algorithm 1 with D in step balancing reasons. It is therefore compatible to ScaLAPACK 3. Further transformations have to be performed for the real and can act as a drop-in replacement while no ScaLAPACK and the imaginary part individually. It is preferable to set up routines are used by ELPA itself. In general, each process an array with complex data type entries representing the eigen- works on the part of the matrix that was assigned to it. This vectors as late as possible, so that we can benefit from efficient chunk of data resides in the local memory of the process. Com- routines in double precision. The routines for the eigenvector 3 munication between processes is realized via MPI. Each pro- in Section 2.3.1. cess calls serial BLAS routines. Additional CUDA and OpenMP T T T A← (I−VTV ) A(I−VTV ) (8) support is available. T T T T T = A +V (0.5T V AVTV − T V A) | {z } 2.3.1. Tridiagonalization in ELPA1 In ELPA1, the tridiagonalization is realized in one step us- T T T ing Householder transformations. The computation of the House- + (0.5VT V AVT − AVT)V (9) | {z } holder vectors is not affected by the symmetry of a matrix. Es- sentially, the tridiagonalization of a matrix comes down to a = A + V U U V . (10) series of rank-2 updates [15], described in the following. Given 2 1 a Householder vector v, the update of the trailing submatrix is It holds U = U if A is symmetric, and U =−U if A is skew- 1 2 1 2 performed as symmetric. Each process computes the relevant parts of U in T T a series of (serial) matrix operations and updates the portion A← (I−τvv )A(I−τvv ) (2) of A that resides in its memory. Here, the symmetry of A is 2 T T T 2 T T = A + v(0.5τ v Avv −τv A)+(0.5τ vv Av−τAv)v (3) assumed and exploited at various points in the implementation. | {z } | {z } 2 Sign changes have to be applied at these instances. For the banded-to-tridiagonal reduction, the matrix is redis- T T = A + vu + u v (4) tributed in the form of a 1D block cyclic data layout. Each = A + v u u v . (5) process owns a diagonal and a subdiagonal block. The reduc- 2 1 tion of a particular column introduces fill-in in the neighboring For symmetric matrices it holds u = u . This is assumed in 1 2 block. The “bulge-chasing” is realized as a pipelined algorithm the original ELPA implementation. For skew-symmetric matri- where computation and communication can be overlapped by v u ces it holds u =−u . In ELPA1, the two matrices and 1 2 2 reordering certain operations [11, 17]. The update of the diagonal blocks takes the same form as in u v are stored explicitly. Actual updates are performed ELPA1 (Equations (2) to (5)). Here, no matrix multiplication is using GEMM and GEMV routines. The matrices differ in the data employed but BLAS-2 routines are used working directly with layout, i.e. which process owns which part of the matrix. After the Householder vectors. It holds u = u for symmetric A and the vector u is computed, it is transposed and redistributed to 1 2 u = −u for skew-symmetric A. In the symmetric case, the represent u in v u . Here, for the skew-symmetric variant, 1 2 2 2 update is realized via a symmetric rank-2 update (SYR2). We a sign change is introduced. The skew-symmetric update now implemented a skew-symmetric variant of this routine which reads T T realizes the skew-symmetric rank-2 update A← A− vu + uv . A← A + v −u u v . (6) For the setup of u, a skew-symmetric variant of the BLAS rou- 1 1 tine performing a symmetric matrix vector product (SYMV) is During the computation of u , symmetry is assumed in the necessary. computation of A v. In particular, the code assumes that an off- The other parts of Algorithm 1 are adopted from the sym- diagonal matrix tile is the same as in the transposed matrix. An- metric implementation without changes. The computation of other sign change corrects this assumption for skew-symmetric Householder vectors, the accumulation of the Householder trans- matrices. formations in a triangular matrix and the update of the local block during reduction to banded form do not have to be changed 2.3.2. Tridiagonalization in ELPA2 compared to symmetric ELPA. This is because they act on the In ELPA2, the tridiagonalization is split into two parts. First, lower part of the matrix so that possible (skew-)symmetry has the matrix is reduced to banded form, then to tridiagonal form. no effect. For the reduction to banded form, the Householder vectors are computed by the process column owning the diagonal block. nb×nb 3. The Bethe-Salpeter Eigenvalue Problem They are accumulated in a triangular matrix T ∈ R , where nb is the block size. The product of Householder matrices is Ab initio spectroscopy aims to describe the excitations in stored via its storage-efficient representation [16] condensed matter from first principles, i.e. without the input of any empirical parameters. For light absorption and scattering, Q = H ···H = I−VTV , (7) 1 nb the Bethe-Salpeter Equation (BSE) approach is the state-of-the- art methodology for both crystalline systems[18, 19, 20, 21, 8] v ··· v where V = contains the Householder vectors. 1 nb as well as condensed molecular systems [22, 23, 24, 25]. This H = I−τ v v is the Householder matrix corresponding to the i i i approach takes its name from the Bethe-Salpeter Equation [26], i-th Householder transformation. the equation of motion of the electron-hole correlation function, In this context, the update of the matrix A takes the follow- as derived from many-body perturbation theory [27, 8]. In prac- ing shape, analogous to the direct tridiagonalization described tice, the problem of solving the BSE is mapped to an effective eigenvalue problem. Specifically, its eigenvalues and -states are 4 employed to construct dielectric properties, such as the spectral Let density, absorption spectrum, and the loss function [7, 28]. An Re(A + B) Im(A− B) M = JH = (13) appropriate discretization scheme leads to a finite-dimensional −Im(A + B) Re(A− B) representation in matrix form H that shows a particular block BS structure [29]: be the symmetric matrix associated with the Hamiltonian ma- trix H. Its positive definiteness follows from property (12), A B A B H = = , (11) which can be seen in the following way. Let the matrices S BS H T ¯ ¯ −B −A −B −A and Ω be given as H T n×n A = A , B = B ∈ C . I A B S = , Ω = , (14) Note that the Hermitian transpose . as well as the regular trans- ¯ ¯ −I B A pose without complex conjugation . play a role in this struc- i.e. H = SΩ. With the matrix Q from Theorem 2 we have BS ture. In general, we are interested in all eigenpairs of the Hamil- M =−iJQ SΩQ. (15) tonian, as they contain valuable information on the excitations of the system. Specifically, they describe the bound excitons, It is easily verified that localized electron-hole pairs that form due to correlation be- −iJQ SQ = I , (16) tween an excited electron and a hole. The BSE eigenstates are used to reconstruct the excitonic wavefunction and obtain the i.e. −iJQ S is the inverse of Q. The construction of M (15) excitonic binding energy. can therefore be seen as a similarity transformation of Ω. If Ω In this paper, we present a solution strategy for the most is positive definite (12), so is M. The method described in [7] general formulation of the BSE problem. As such, A and B are relies on this property in order to guarantee the existence of the generally dense and complex-valued, which holds in the case of Cholesky factorization of M. excitations in condensed matter. It performs the following steps. H belongs to the slightly more general class of J-sym- BS 1. Construct M as in (13). metric matrices [30]. This class of matrices display a symmetry (λ ,−λ) in the spectrum. The additional structure in H leads 2. Compute a Cholesky factorization M = LL . BS ¯ ¯ to an additional symmetry (λ ,−λ ,λ ,−λ) and a relation be- 3. Compute eigenpairs of the skew-symmetric matrix L JL, 0 I tween the corresponding eigenvectors. Following [7], we con- where J = . −I 0 sider the definite Bethe-Salpeter eigenvalue problem. H is BS 4. Perform the eigenvector back transformation associated called definite when the property with Cholesky factorization and transformation to Hamil- I 0 A B tonian form (Theorem 2). H = > 0 (12) BS ¯ ¯ 0 −I B A The eigenvalues and eigenvectors can be used to compute is fulfilled, which often holds in practice. In this case, the the optical absorption spectrum of the material in a postpro- eigenvalues are real and therefore come in pairs (λ ,−λ). The cessing step. method presented in this work relies on this assumption. The main workload is given as the solution of a skew-sym- We aim for a solution method that preserves this structure metric eigenvalue problem (Step 3). As a proof of concept, under the influence of inevitable numerical errors, i.e. that guar- solution routines for the symmetric eigenvalue problem from antees that the eigenvalues come in pairs or quadruples, respec- the ScaLAPACK reference implementation [3] were adapted to tively. General methods for eigenvalue problems, such as the the skew-symmetric setting. The matrix is reduced to tridiag- QR/QZ algorithm, destroy this property. In this case it is not onal form using Householder transformations. The tridiagonal clear anymore which eigenpairs correspond to the same excita- eigenvalue problem is solved via bisection and inverse iteration. tion state. The ScaLAPACK reference implementation is not regarded A structure-preserving method running in parallel on dis- as a state-of-the art solver library. When performance and scal- tributed memory systems is developed in [7] and has been made ability are issues, one generally turns to professionally main- available as BSEPACK. It relies on assumption (12) and ex- tained and optimized libraries such as ELPA [4] or vendor- ploits a connection to a Hamiltonian eigenvalue problem given specific implementations such as Intel’s MKL. Within BSE- in the following Theorem. PACK, ScaLAPACK can be substituted by ELPA working on skew-symmetric matrices. The resulting performance benefits I −iI Theorem 2. Let Q = , then Q is unitary and 2 are discussed in Section 4.2. I iI A B Im(A + B) −Re(A− B) Q Q = i =: iH, 4. Numerical Experiments ¯ ¯ −B −A Re(A + B) Im(A− B) 4.1. ELPA Benchmarks where H is real Hamiltonian, i.e. JH = (JH) with In this section we present performance results for the skew- 0 I J = . symmetric ELPA extension. All test programs are run on the −I 0 5 Table 1: Execution time speedups achieved by different aspects of the solution approach. #Cores Compl. Compl. Skew-Sym. Skew-Sym. ELPA2 ELPA2 50% ELPA2 50% ELPA2 50% 100% vs. vs. Compl. vs. Compl. vs. Compl. Compl. MKL 50% ELPA2 50% MKL 50% MKL 100 % 16 1.10 1.41 2.33 3.28 32 1.29 1.41 2.30 3.24 64 1.11 1.40 2.32 3.25 128 1.18 1.33 2.20 2.93 256 1.17 1.28 2.16 2.76 512 1.21 1.51 1.87 2.82 16 32 64 128 256 512 Number of cores Complex ELPA1, 100% Complex ELPA1, 50% Complex ELPA2, 100% Complex ELPA2, 50% Complex MKL 100% Complex MKL 50% Skew-Symmetric Skew-Symmetric ELPA1, 100% ELPA1, 50% Skew-Symmetric Skew-Symmetric ELPA2, 100% ELPA2, 50% Figure 1: Scaling of the ELPA solver for skew-symmetric matrices. For com- parison the runtimes for the alternative solution method via complex Hermitian solvers is included. Here, ELPA and Intel’s MKL 2018 routines pzheevd and pzheevr are used. The matrix has a size of n = 20000. 16 32 64 128 256 512 Number of cores mechthild compute cluster, located at the Max Planck Institute ELPA2: Band-to- ELPA2: Full-to-Band for Dynamics of Complex Technical Systems in Magdeburg, Tridiagonal Germany. Up to 32 nodes are used, which consist of 2 Intel ELPA2: Full-to- ELPA1: Full-to-Tridiagonal Tridiagonal Xeon Silver 4110 (Skylake) processors with 8 cores each, run- PDSSTRD NB = 16 PDSSTRD NB = 64 ning at 2.1 GHz. The Intel compiler, MPI library and MKL PDSSTRD NB = 256 in the 2018 version are used in all test programs. The com- putations use randomly generated skew-symmetric matrices in Figure 2: Scaling of the tridiagonalization in two steps (ELPA2) and one step double-precision. (ELPA1). We compare it to the runtimes of the tridiagonalization routine for Figure 1 shows the resulting performance and the scaling skew-symmetric matrices PDSSTRD available in BSEPACK [7] for different properties of ELPA for a medium sized skew-symmetric ma- block sizes NB. The matrix size is n = 20000. trix (n = 20000). As an alternative to the approach described in this work, the skew-symmetric matrix can be multiplied with in Table 1). When all eigenpairs are computed, ELPA1 and the imaginary unit i. The resulting complex Hermitian matrix ELPA2 yield very similar runtime results which is why only can be diagonalized using available methods in ELPA or Intel’s ELPA2 is considered in Table 1. The two-step approach em- ScaLAPACK implementation shipped with the MKL. This rep- ployed by ELPA2 pays off in particular when not all eigen- resents the only previously available approach to solve skew- pairs are sought, which is the case here. When complex 50% symmetric eigenvalue problems in a massively parallel high- solvers are compared (ELPA2 vs. MKL, column 3 in Table performance setting. 1), the achieved speedup increases to a value between 1.28 and For skew-symmetric matrices, only 50% of eigenvalues and 1.51. The largest impact on the performance is caused by avoid- eigenvectors need to be computed, as they are purely imaginary ing complex arithmetic. This is represented by the speedup and come in pairs ±λi,λ ∈ R. The runtime measurements for achieved by the skew-symmetric 50% ELPA2 implementation 100% are included for reference. compared to the complex 50% ELPA2 implementation (column Figure 1 shows that all approaches display good scalabil- 4 of Table 1). This accounts for an additional speedup of 1.87 ity in the examined setting. Skew-symmetric ELPA runs 2.76 to 2.33. to 3.28 times faster than the complex MKL based solver, where The tridiagonalization is an essential step in every consi- both only compute 50% of eigenpairs. The data gives further in- dered solution scheme and contributes a significant portion of sight into how this improvement is achieved. Table 1 compares the execution time. The fewer eigenpairs are sought, the more the runtimes for different solvers and presents the achieved speed- dominant it becomes with respect to computation time. Figure 2 ups. When we compare complex 100% solvers, ELPA already displays the runtimes and scalability of available tridiagonaliza- improves performance by a factor of 1.1 to 1.29 (column 2 Runtime in s Runtime in s Table 2: Execution time speedups achieved by different aspects of the solution approach. Matrix Compl. Compl. Skew-Sym. Skew-Sym. ELPA2 ELPA2 50% ELPA2 50% ELPA2 50% size 100% vs. vs. Compl. vs. Compl. vs. Compl. Compl. MKL 50% ELPA2 50% MKL 50% MKL 100 % 50 000 1.17 1.45 2.32 3.35 75 000 1.16 1.46 2.39 3.50 100 000 1.17 1.47 2.42 3.57 125 000 1.17 1.49 2.46 3.67 50000 75000 125000 Matrix Size n Complex ELPA1, 100% Complex ELPA1, 50% Complex ELPA2, 100% Complex ELPA2, 50% Complex MKL 100% Complex MKL 50% Skew-Symmetric Skew-Symmetric ELPA1, 100% ELPA1, 50% Skew-Symmetric Skew-Symmetric ELPA2, 100% ELPA2, 50% 200 Figure 3: Runtimes for solving eigenvalue problems of larger sizes. 256 CPU cores were used, i.e. 16 nodes on the mechthild compute cluster. 1024 8192 16384 24576 32768 tion techniques for skew-symmetric matrices. As an alternative Matrix Size n implementation to the presented approaches there is a tridiago- 2x Intel Xeon Silver 4110, ELPA2 nalization routine PDSSTRD shipped in BSEPACK [7]. It is an 2x Intel Xeon Silver 4110 + 1x Nvidia P100, ELPA1 adapted version of the ScaLAPACK reference implementation. Figure 4: Runtimes for solving eigenvalue problems on one node on the All discussed implementations are based on the 2D-block- mechthild compute cluster employing a GPU. cyclic data distribution established by ScaLAPACK. Here, the matrix is divided into blocks of a certain size NB. The blocks are distributed to processes organized in a 2D grid in a cyclic the ELPA library [32]. The design approach is to stick with the manner. Typically, the block size is a parameter chosen once same code base as the CPU-only version, and offload compute- in a software project. The data redistribution to data layouts intense parts, such as BLAS-3 operations, to the GPU in or- defined by other block sizes is avoided as this involves expen- der to benefit from its massive parallelism. This is done using sive all-to-all communication. The main disadvantage of the the CUBLAS library provided by NVIDIA. Because ELPA2 PDSSTRD routine is that it is very susceptible to the chosen employs more fine-grained communication patterns, this ap- block size, both with regard to scalability and overall perfor- proach works best for ELPA1. Here, the performance can ben- mance. This makes it less suitable to be included in larger soft- efit when the computational intensity is high enough, i.e. when ware projects, where the block size is a parameter predefined big chunks of data are being worked on by the GPU. by other factors. ELPA (both the one and two-step version) on Figure 4 shows the performance that can be achieved on one the other hand does not have this problem and performs equally node of the mechthild compute cluster, that is equipped with well for all data layouts [31]. an NVIDIA P100 GPU as an accelerator device. The GPU ver- Figure 2 also displays the advantage of the two-step tridiag- sion is based on ELPA1 and therefore does not benefit from the onalization over the one-step approach. Here the performance faster tridiagonalization in ELPA2 (see Figure 2 and the dis- is dominated by the first step, i.e. the reduction to banded form. cussion in the previous section). Despite this fact, the GPU- In the context of electronic structure computations, the matri- accelerated ELPA1 version eventually outperforms the ELPA2 ces of interest can become extremely large. Figure 3 displays CPU-only version, if the matrix is large enough. In our case the achieved runtime improvements for larger matrices up to a the turning point is at around n = 15000. For smaller matrices size of n = 125000. The individual speedups are presented in the additional work of setting up the CUDA environment and Table 2. For large matrices we achieve a speedup of up to 3.67 transferring the matrix counteracts any possible performance compared to the available MKL routine. benefits and results in a larger runtime. For matrices of size n = 32768 employing the GPU can reduce the runtime from 4.1.1. GPU Acceleration 570 seconds to 328 seconds, i.e. by 41%. For the 1-step tridiagonalization approach (ELPA1), there The take-away message of these results is the following. If is a GPU-accelerated version available that gets shipped with nodes equipped with GPUs are available and to be utilized, it is Runtime in s Runtime in s NB = 64, but choosing a larger block size can increase the per- formance dramatically, as can be seen in Figure 5 for NB = 256. Typically, software packages (e.g. [42, 28]) developed for elec- tronic structure computations are large and contain many fea- tures, implementing methods for different quantities of inter- est. The block size is typically predetermined by other con- siderations. It would mean a serious effort to change it, in or- der to optimize just one building block of the software. Fur- thermore the optimal block size of the original BSEPACK is probably dependent on the given hardware and the given ma- trix size. Autotuning frameworks could help, but are also very costly and impose an additional implementation effort. A soft- 32 128 16 64 256 512 ware, that does not show this kind of runtime dependency is Number of cores greatly preferable. Employing ELPA for the main computa- BSEPACK NB = 64 BSEPACK NB = 256 tional task in BSEPACK fulfills this requirement. The perfor- BSEPACK + ELPA2 mance of ELPA is independent of the chosen NB, because the block size on the node level for optimal cache use is decoupled Figure 5: Scaling of the direct, complex BSEPACK eigenvalue solver for com- from the block size defining the multi-node data layout. puting the optical absorption spectrum of hexagonal boron nitride. The Bethe- Salpeter matrix (11) has a size of 51200. The ELPA-accelerated version is up to 9.22 times as fast as the original code with the default block size. Even when the block size is increased, using the new solver always yields important to make sure each node has enough data to work on. a better performance. In the case of NB = 256, the ELPA- This way, the available resources are used most efficiently. version still performs up to 2.76 times as fast. Choosing even larger block sizes has in general no further positive effect on the 4.2. Accelerating BSEPACK performance of the original BSEPACK. Employing ELPA also We consider the performance improvements that can be a- leads to an improved scalability over the number of cores. chieved by using the newly developed skew-symmetric eigen- value solver in the BSEPACK [7] software, described in Sec- 5. Conclusions tion 3. In this procedure, Step 3, the computation of eigenpairs of the skew-symmetric matrix L JL, is now performed by the We have presented a strategy to extend existing solver li- ELPA library. braries for symmetric eigenvalue problems to the skew-sym- To demonstrate the speedup, we consider the example of metric case. Applying these ideas to the ELPA library, makes hexagonal boron nitride at a fixed size of the BSE Hamiltonian. it possible to compute eigenvalues and eigenvectors of large The excitations in hexagonal boron nitride are widely studied skew-symmetric matrices in parallel with a high level of ef- both experimentally and theoretically [33, 34, 35, 36, 37, 38, ficiency and scalability. We benefit from the maturity of the 39], as its wide band gap and the layered geometrical structure ELPA software project, where many optimizations have been yield strong effects of electron-hole correlation, such as the for- realized over the years. All of these, including GPU support, mation of bound excitons. Previous studies have shown that the find their way into the presented skew-symmetric solver. As far BSE approach yields the optical absorption and excitonic prop- as we know, no other solvers dedicated to the skew-symmetric erties with high accuracy. In our calculations, the BSE Hamil- eigenvalue problem exist in an HPC setting. It is always pos- tonian is constructed on a 16×16×4 k-grid in the 1st Brillouin sible to solve a complex Hermitian eigenvalue problem instead zone, the 5 highest valence and 5 lowest conduction bands are of a skew-symmetric one. Our newly developed solver outper- employed to construct the transition space, leading to a matrix forms this strategy, implemented via Intel MKL ScaLAPACK, size of 2×16×16×4×5×5 = 51200. In the calculation of the by a factor of 3. We also observe an increase in performance BSE Hamiltonian, single-particle wavefunctions and the static concerning the Bethe-Salpeter eigenvalue problem. Here we dielectric function are expanded in plane waves with a cut-off improve the runtime of available routines by a factor of almost of 387 eV and 132 eV, respectively. The static dielectric func- 10, making the BSEPACK library with ELPA a viable choice tion is obtained from ABINIT [40], while the BSE Hamiltonian as a building block for larger electronic structure packages. is constructed using the EXC code [41]. Figure 5 displays the achieved runtimes of BSEPACK for 6. Acknowledgment this fixed-size matrix for different core counts. We compare the original version and a version that employs ELPA. The perfor- We thank Francesco Sottile for fruitful discussion and his mance of the original solver is highly dependent on the cho- support in generating the BSE Hamiltonian for hexagonal BN. sen block size (see also Figure 2). This parameter determines how the matrix is distributed to the available processes in the form of a 2D block-cyclic data layout. The default is given as Runtime in s References [19] L. X. Benedict, E. L. Shirley, R. B. Bohn, Optical absorp- tion of insulators and the electron-hole interaction: An ab [1] G. H. Golub, C. F. Van Loan, Matrix Computations, 4th Edition, Johns initio calculation, Phys. Rev. Lett. 80 (1998) 4514–4517. Hopkins Studies in the Mathematical Sciences, Johns Hopkins University doi:10.1103/PhysRevLett.80.4514. Press, Baltimore, 2013. [20] S. Albrecht, L. Reining, R. Del Sole, G. Onida, Ab ini- [2] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, tio calculation of excitonic effects in the optical spectra of A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, D. So- semiconductors, Phys. Rev. Lett. 80 (1998) 4510–4513. rensen, LAPACK Users’ Guide, SIAM, Philadelphia, PA, 2nd Edition doi:10.1103/PhysRevLett.80.4510. (1995). doi:10.1137/1.9780898719604. [21] S. Sagmeister, C. Ambrosch-Draxl, Time-dependent density functional [3] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, theory versus Bethe-Salpeter equation: an all-electron study, Phys. Chem. J. J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, Chem. Phys. 11 (2009) 4451–4457. doi:10.1039/B903676H. D. Walker, R. C. Whaley, ScaLAPACK User’s Guide, Vol. 4 of Soft- [22] J. C. Grossman, M. Rohlfing, L. Mitas, S. G. Louie, M. L. ware, Environments and Tools, SIAM Publications, Philadelphia, PA, Cohen, High accuracy many-body calculational approaches for ex- USA, 1997. doi:10.1137/1.9780898719642. citations in molecules, Phys. Rev. Lett. 86 (3) (2001) 472. [4] A. Marek, V. Blum, R. Johanni, V. Havu, B. Lang, T. Auckenthaler, doi:10.1103/PhysRevLett.86.472. A. Heinecke, H.-J. Bungartz, H. Lederer, The ELPA library: scalable [23] C. Faber, P. Boulanger, C. Attaccalite, I. Duchemin, X. Blase, Ex- parallel eigenvalue solutions for electronic structure theory and compu- cited states properties of organic molecules: from density func- tational science, Journal of Physics: Condensed Matter 26 (21) (2014) tional theory to the GW and Bethe–Salpeter Green’s function for- 213201. doi:10.1088/0953-8984/26/21/213201. malisms, Phil. Trans. R. Soc. A 372 (2011) (2014) 20130271. [5] R. C. Ward, L. J. Gray, Eigensystem computation for skew-symmetric doi:10.1098/rsta.2013.0271. and a class of symmetric matrices, ACM Trans. Math. Softw. 4 (3) (1978) [24] C. Cocchi, C. Draxl, Optical spectra from molecules to crystals: Insight 278–285. doi:10.1145/355791.355798. from many-body perturbation theory, Phys. Rev. B 92 (2015) 205126. [6] P. Benner, D. Kreßner, V. Mehrmann, Skew-Hamiltonian and Hamil- doi:10.1103/PhysRevB.92.205126. tonian eigenvalue problems: Theory, algorithms and applications, [25] D. Hirose, Y. Noguchi, O. Sugino, All-electron G W+ Bethe-Salpeter in: Z. Drmacˇ, M. Marusic, Z. Tutek (Eds.), Proc. Conf. Appl calculations on small molecules, Phys. Rev. B 91 (20) (2015) 205111. Math. Scientific Comp., Springer-Verlag, Dordrecht, 2005, pp. 3–39. doi:10.1103/PhysRevB.91.205111. doi:10.1007/1-4020-3197-1_1. [26] E. E. Salpeter, H. A. Bethe, A relativistic equation for [7] M. Shao, F. H. da Jornada, C. Yang, J. Deslippe, S. G. Louie, Structure bound-state problems, Phys. Rev. 84 (1951) 1232–1242. preserving parallel algorithms for solving the Bethe-Salpeter eigenvalue doi:10.1103/PhysRev.84.1232. problem, Linear Algebra and its Applications 488 (Supplement C) (2016) [27] G. Strinati, Application of the Green’s functions method to the study of 148 – 167. doi:10.1016/j.laa.2015.09.036. the optical properties of semiconductors, Riv. Nuovo Cimento 11 (12) [8] G. Onida, L. Reining, A. Rubio, Electronic excita- (1988) 1–86. doi:10.1007/BF02725962. tions: density-functional versus many-body Greens- [28] C. Vorwerk, B. Aurich, C. Cocchi, C. Draxl, Bethe–Salpeter equa- function approaches, Rev. Mod. Phys. 74 (2) (2002) 601. tion for absorption and scattering spectroscopy: implementation doi:10.1103/RevModPhys.74.601. in the exciting code, Electronic Structure 1 (3) (2019) 037001. [9] F. Furche, On the density matrix based approach to time-dependent den- doi:10.1088/2516-1075/ab3123. sity functional response theory, The Journal of Chemical Physics 114 (14) [29] T. Sander, E. Maggio, G. Kresse, Beyond the Tamm-Dancoff approxima- (2001) 5982–5992. doi:10.1063/1.1353585. tion for extended systems using exact diagonalization, Phys. Rev. B 92 [10] J. C´ızˇek, J. Paldus, Stability conditions for the solutions of the Hartree- (2015) 045209. doi:10.1103/PhysRevB.92.045209. Fock equations for atomic and molecular systems. application to the [30] P. Benner, H. Faßbender, C. Yang, Some remarks on the complex J- Pi-electron model of cyclic polyenes, The Journal of Chemical Physics symmetric eigenproblem, Linear Algebra and its Applications 544 (2018) 47 (10) (1967) 3976–3985. doi:10.1063/1.1701562. 407 – 442. doi:10.1016/j.laa.2018.01.014. [11] T. Auckenthaler, V. Blum, H.-J. Bungartz, T. Huckle, R. Johanni, [31] P. Benner, A. Marek, C. Penke, Improving the performance of numerical L. Kra¨mer, B. Lang, H. Lederer, P. Willems, Parallel solution of par- algorithms for the Bethe-Salpeter eigenvalue problem, Proc. Appl. Math. tial symmetric eigenvalue problems from electronic structure calcula- Mech. 18 (1) (2018). doi:10.1002/pamm.201800255. tions, Parallel Computing 37 (12) (2011) 783 – 794, 6th International [32] P. Ku˚s, H. Lederer, A. Marek, GPU optimization of large-scale eigen- Workshop on Parallel Matrix Algorithms and Applications (PMAA’10). value solver, in: F. A. Radu, K. Kumar, I. Berre, J. M. Nordbotten, I. S. doi:10.1016/j.parco.2011.05.002. Pop (Eds.), Numerical Mathematics and Advanced Applications ENU- [12] A. Alvermann, A. Basermann, H.-J. Bungartz, et al., Benefits from using MATH 2017, Springer International Publishing, Cham, 2019, pp. 123– mixed precision computations in the ELPA-AEO and ESSEX-II eigen- 131. doi:10.1007/978-3-319-96415-7_9. solver projects, Japan Journal of Industrial and Applied Mathematics [33] G. Cappellini, G. Satta, M. Palummo, G. Onida, Optical properties of BN 36 (2) (2019) 699–717. doi:10.1007/s13160-019-00360-8. in cubic and layered hexagonal phases, Phys. Rev. B 64 (2001) 035104. [13] Message Passing Interface Forum, MPI: A message-passing interface doi:10.1103/PhysRevB.64.035104. standard, Tech. rep., Knoxville, TN, USA (1994). [34] X. Blase, A. Rubio, S. G. Louie, M. L. Cohen, Quasiparticle band struc- [14] P. Ku˚s, A. Marek, S. Ko¨cher, H.-H. Kowalski, C. Carbogno, C. Scheurer, ture of bulk hexagonal boron nitride and related systems, Phys. Rev. B 51 K. Reuter, M. Scheffler, H. Lederer, Optimizations of the eigen- (1995) 6868–6875. doi:10.1103/PhysRevB.51.6868. solvers in the ELPA library, Parallel Computing 85 (2019) 167 – 177. [35] S. Galambosi, L. Wirtz, J. A. Soininen, J. Serrano, A. Marini, K. Watan- doi:10.1016/j.parco.2019.04.003. abe, T. Taniguchi, S. Huotari, A. Rubio, K. Ha¨ma¨la¨inen, Anisotropic [15] R. S. Martin, C. Reinsch, J. H. Wilkinson, Householder’s tridiagonaliza- excitonic effects in the energy loss function of hexagonal boron nitride, tion of a symmetric matrix, Numerische Mathematik 11 (3) (1968) 181– Phys. Rev. B 83 (2011) 081413. doi:10.1103/PhysRevB.83.081413. 195. doi:10.1007/BF02161841. [36] G. Fugallo, M. Aramini, J. Koskelo, K. Watanabe, T. Taniguchi, [16] R. S. Schreiber, C. Van Loan, A storage-efficient WY representation for M. Hakala, S. Huotari, M. Gatti, F. Sottile, Exciton energy-momentum products of Householder transformations, SIAM J. Sci. Statist. Comput. map of hexagonal boron nitride, Phys. Rev. B 92 (2015) 165122. 10 (1989) 53–57. doi:10.1103/PhysRevB.92.165122. [17] T. Auckenthaler, H.-J. Bungartz, T. Huckle, L. Krmer, B. Lang, [37] P. Cudazzo, L. Sponza, C. Giorgetti, L. Reining, F. Sottile, M. Gatti, Ex- P. Willems, Developing algorithms and software for the parallel solution citon band structure in two-dimensional materials, Phys. Rev. Lett. 116 of the symmetric eigenvalue problem, Journal of Computational Science (2016) 066803. doi:10.1103/PhysRevLett.116.066803. 2 (3) (2011) 272 – 278. doi:10.1016/j.jocs.2011.05.002. [38] J. Koskelo, G. Fugallo, M. Hakala, M. Gatti, F. Sottile, P. Cu- [18] M. Rohlfing, S. G. Louie, Electron-hole excitations in semicon- dazzo, Excitons in van der Waals materials: From monolayer to ductors and insulators, Phys. Rev. Lett. 81 (1998) 2312–2315. bulk hexagonal boron nitride, Phys. Rev. B 95 (2017) 035125. doi:10.1103/PhysRevLett.81.2312. doi:10.1103/PhysRevB.95.035125. 9 [39] W. Aggoune, C. Cocchi, D. Nabok, K. Rezouali, M. A. Belkhir, C. Draxl, Dimensionality of excitons in stacked van der Waals materials: The example of hexagonal boron nitride, Phys. Rev. B 97 (2018) 241114. doi:10.1103/PhysRevB.97.241114. [40] X. Gonze, F. Jollet, F. Abreu Araujo, D. Adams, B. Amadon, T. Ap- plencourt, C. Audouze, J. M. Beuken, J. Bieder, A. Bokhanchuk, E. Bousquet, F. Bruneval, D. Caliste, M. Coˆte´, F. Dahm, F. Da Pieve, M. Delaveau, M. Di Gennaro, B. Dorado, C. Espejo, G. Geneste, L. Gen- ovese, A. Gerossier, M. Giantomassi, Y. Gillet, D. R. Hamann, L. He, G. Jomard, J. Laflamme Janssen, S. Le Roux, A. Levitt, A. Lherbier, F. Liu, I. Lukacˇevic´, A. Martin, C. Martins, M. J. T. Oliveira, S. Ponce´, Y. Pouillon, T. Rangel, G. M. Rignanese, A. H. Romero, B. Rousseau, O. Rubel, A. A. Shukri, M. Stankovski, M. Torrent, M. J. Van Set- ten, B. Van Troeye, M. J. Verstraete, D. Waroquiers, J. Wiktor, B. Xu, A. Zhou, J. W. Zwanziger, Recent developments in the ABINIT soft- ware package, Computer Physics Communications 205 (2016) 106–131. doi:10.1016/j.cpc.2016.04.003. [41] Exc webpage, www.bethe-salpeter.org, accessed: 2019-11-26. [42] A. Gulans, S. Kontur, C. Meisenbichler, D. Nabok, P. Pavone, S. Riga- monti, S. Sagmeister, U. Werner, C. Draxl, Exciting: A full-potential all- electron package implementing density-functional theory and many-body perturbation theory, J. Phys. Condens. Matter. 26 (36) (2014) 363202. doi:10.1088/0953-8984/26/36/363202.

Journal

MathematicsarXiv (Cornell University)

Published: Dec 9, 2019

There are no references for this article.