A configuration-based heatbath-CI for spin-adapted multireference electronic structure calculations with large active spaces

This work reports on a spin-pure configuration-based implementation of the heatbath configuration interaction (HCI) algorithm for selective configuration interaction. Besides the obvious advantage of being spin-pure, the presented method combines the compactness of the configurational ansatz with the known efficiency of the HCI algorithm and a variety of algorithmic and conceptual ideas to achieve a high level of performance. In particular, through pruning of the selected configurational space after HCI selection by means of a more strict criterion, a more compact wave-function representation is obtained. Moreover, the underlying logic of the method allows us to minimize the number of redundant matrix-matrix multiplications while making use of just-in-time compilation to achieve fast diagonalization of the Hamiltonian. The critical search for 2-electron connections within the configurational space is facilitated by a tree-based representation thereof as suggested previously by Gopal et al. Usage of a prefix-based parallelization and batching during the calculation of the PT2-correction leads to a good load balancing and significantly reduced memory requirements for these critical steps of the calculation. In this way, the need for a semistochastic approach to the PT2 correction is avoided even for large configurational spaces. Finally, several test-cases are discussed to demonstrate the strengths and weaknesses of the presented method.


| INTRODUCTION
Multireference (MR) electronic structure calculations allow for a correct description of molecular systems with strongly correlated electrons.[3] Nowadays, the vast majority of MR calculations rely on the "active space" concept that invokes the choice of a set of active electrons and orbitals.This active space is treated on a higher level of theory than the remaining inactive electrons and orbitals.In the framework of the complete active space (CAS) approach, the electronic wave function is expanded in a complete set of N act -electron functions within the active orbital space.In other words, the full configuration interaction (FCI) problem is solved within the active space.Accordingly, the number of wave function parameters grows exponentially with the number of active electrons and orbitals and thus quickly becomes intractable.To alleviate the unfavorable computational scaling, a number of approximate FCI solvers has been developed and employed in practical use.[17][18][19][20][21][22][23][24][25][26][27] The aim of SCI (as well as the other approximate FCI solvers) is to find a lower-dimensional representation of the CAS wave function without significant loss of accuracy.In the first formulation of SCI by Huron et al., denoted CIPSI, 28 the set of basis functions chosen to represent Ψ CAS is selected in an iterative fashion, starting from a small reference space P.During every iteration, P is supplemented by the set of Slater determinants (SDs) or configuration state functions (CSFs) that are deemed important according to a first-order perturbation theory based selection criterion.Over the years, alternative selection criteria have been proposed.Most notably, Umrigar and coworkers introduced a new variant termed heatbath configuration interaction (HCI), in which a many-body basis function is selected for P, if it interacts sufficiently strongly with the reference wave function through the Hamiltonian operator. 17This selection criterion allows for an efficient reformulation of the selection of SDs (CSFs) compared to the CIPSI method such that significant speedups are be obtained.A recent blind challenge for the ground state energy of benzene highlighted the great performance of the semistochastic HCI (SHCI) implementation by Umrigar and coworkers in this context. 19,20,29[32] Importantly, the aforementioned SHCI implementation relies on a wave function expansion in terms of Slater determinants.On one hand, this choice of basis simplifies the expressions for all Hamiltonian matrix elements, which in turn enables the implementation of an extremely efficient computer code.On the other hand, in SD-basis, the spin symmetry is not strictly conserved.Furthermore, the SDrepresentation of the wave function is less compact than in alternative formulations based on CSFs or configurations. 24,33rein, we report on a configuration-based implementation of HCI.Instead of selecting single SD or CSFs individually, our algorithm selects complete sets of spin-adapted CSFs that belong to a given electronic configuration.While such a selection scheme necessarily entails larger expansion spaces than CSF-based SCI, 24 it facilitates a largely vectorized implementation and does not suffer from an active orbital ordering dependence of the total energy. 26A vital factor for the efficiency of the presented code is the availability of one-electron coupling coefficients at a low computational cost as provided by the recursive approach recently introduced by us. 34Another important aspect of a configuration-based formulation is the concomitant strict obedience to spin symmetry that enables the user to directly target specific spin states-a feature that is particularly helpful during studies of transition metal compounds with multiple open shells or close-lying spin states (vide infra).In addition to the aforementioned conceptual advantages (and disadvantages) of the presented method, this work puts a strong focus on the details of its implementation as they are a key factor to its performance and applicability.
This work begins with a brief recapitulation of SCI and its HCI variant in Section 2.1 before describing the intricacies of spinadaptation in the present context in Section 2.2.Subsequently, Section 3 provides a detailed description of the current implementation.Finally, the results of calculations on four test cases reported in Section 4 demonstrate the capabilities and shortcomings of the presented program.

| THEORY
In the following, indices i, j, k, l are used for internal orbitals, a, b, c, d for external orbitals, t, u, v, w for active orbitals and p, q, r, s for general molecular orbitals.Capital indices such as I, J are used to denote general many-body basis functions such as SDs or CSFs.

| The heatbath-CI algorithm
Within the CAS approach, one seeks to find the eigenfunction Ψ j i that corresponds to the lowest eigenvalue of in a complete set of N act -electron functions f Φ I j ig that are constructed from the active orbital set 1) are spin-traced replacement operators in second quantization notation while h tu and ðtujvwÞ are molecular one-and twoelectron integrals, respectively.For active space sizes up to 16-18   electrons and orbitals, this problem can be solved within the full f Φ I j ig-space (FCI) by means of Davidson 35,36 or Lanzcos 37 algorithms. 34,38,39The central quantity that needs to be computed in these algorithms is the σ-vector that is obtained by multiplying the Hamiltonian matrix with a trial vector, Owing to the steep increase of the number of N-electron basis functions, Φ I j i P CAS , with growing number of active electrons and orbitals, the calculation of σ becomes intractable for sufficiently large active spaces.As mentioned in the previous section, this problem can be alleviated by selecting a subset of FCI configurations that is well suited to represent Ψ j i.Such a selective CI concept was first introduced by Huron et al. in their seminal work on configuration interaction by perturbation with multiconfigurational zeroth-order wave function selected by iterative process (CIPSI) 28 from 1973 and has found its way into various quantum chemistry programs since then.The starting point of CIPSI is a set of reference functions P that could, for example, consist of only the Hartree-Fock (HF) wave function Φ HF j i.In the following, P will be referred to as the variational space.If P consists of multiple functions, the lowest eigenvalue E P of the Hamiltonian matrix H P and the corresponding eigenvector, are found through diagonalization.In the next cycle of the CIPSIalgorithm, a function Φ A j i= 2 P, is selected for a new variational space With P 0 at hand, the pair of E P 0 and Ψ P 0 j i are again obtained through Hamiltonian diagonalization.This cycle is repeated until either no new many-body functions are found important according to Equation (5) or the change in energy between two iterations is below a certain threshold, that is, ΔE ¼ jE P 0 À E P j > ε E .Obviously, for a function Φ A j i= 2 P to be selected, it has to belong to the first order interacting space (FOIS) of P. Hence, in case of P consisting only of Φ HF j i in the first iteration, only up to doubly excited N-electron basis functions can be selected.In the second iteration, up to quadruply excited functions can be selected and so on (see Figure 1).
The final variational energy can be improved by calculating a perturbative correction, Thus, the total CIPSI energy contains a variational and a non- . By incorporating the perturbative correction, the CIPSI method is able to calculate energies near the FCI-limit while the corresponding wave function representation is considerably more compact than that of FCI.It should be noted at this point that when SCI methods like CIPSI are used in the context of active space calculations, the perturbative correction takes only perturber functions into account that are part of the active space.[42] In 2016 Umrigar and coworkers introduced the Heatbath CI (HCI) algorithm, a particularly efficient SCI variant. 17Within HCI, a function for any Φ I j i P.This modified selection criterion emphasizes the role of the numerator in Equation ( 5) and refers to the interaction strength between elements of P and its FOIS.More importantly, it allows for an efficient reformulation of the selection algorithm (see  From the reference function Φ 0 j i the set of most important interacting functions are identified based on a simple criterion (represented by the blue arrow) and appended to the variational space P. In the next iteration, the most important interacting functions are identified for P 0 (represented by the green arrows).This procedure is repeated until self-convergence is reached.This Figure has been adapted from Reference 3.

Algorithm 1 HCI selection procedure.
Then, doubly excited functions Φ A j i¼E tu E vw Φ I j i FOIS are created by looping overall Φ I j i P, orbital pairs ðtuÞ and finally orbital pairs ðvwÞ.If the product jH AI C I j exceeds the user-defined threshold ε sel , Φ A j i is selected for P 0 , otherwise it is discarded.In the original formulation of HCI all functions Φ I and Φ A are SDs.Hence, for a given combination of I and ðt, uÞ, the value of f HCI ð Φ A j iÞ only depends on the size of jH AI j ¼ jðtujvwÞj.Since these have been ordered in a descending order all the following values f HCI ð Φ A j iÞ within the loop over ðv, wÞ will necessarily descend, too.Accordingly the loop over ðv, wÞ can be terminated after the first f HCI ð Φ A j iÞ value falls below ε sel .In this way, many functions Φ A j i are never considered, which substantially reduces the computational effort of the selection step.The algorithm described in here can also be applied to reduce the cost of computing the perturbative correction ΔE PT2 .Finally, it should be mentioned that the procedures described for the electronic ground state in this section are readily transferred to electronically excited states.

| Spin-adaptation
The Hamiltonian in Equation ( 1) commutes with both, the Ŝ z and the Consequently, a common basis of eigenfunctions of Ĥ, Ŝ z , and Ŝ 2 can be found.That means, the electronic states can be characterized as spin a singlet, triplet and so forth with a magnetic spin quantum number M. While Slater determinants composed of molecular orbitals are eigenfunctions of Ŝ z , they are not necessarily eigenfunctions of Ŝ 2 .
In fact, as soon as two electrons that do not occupy the same spatial orbital are antiferromagnetically coupled, the corresponding Slater determinant is not a true spin eigenfunction anymore.While the use of spin-adapted guess vectors as a starting point in CI calculations alleviates this problem, spin contamination can still occur through usage of ill-suited preconditioners and the effect of numerical noise. 43,446][47] CSFs on the other hand obey spin symmetry by construction.Hence, a CSF-based CI formulation comes with certain advantages.For example, since Ĥ transforms as a S ¼ 0 operator it takes a block-diagonal form when expanded in a basis of CSFs.Accordingly, only a submatrix of H has to be diagonalized which usually leads to a more compact representation of Ψ j i.Perhaps more importantly, a (selected) CI formulation in terms of CSFs allows the user to target specific spin states.
A given CSF is characterized by its orbital occupation pattern or configuration (CFG) n, spin quantum numbers S and M and an additional index μ that is necessary to distinguish the f NS degenerate spin functions for a given number of N unpaired electrons and total spin S. 48 As discussed elsewhere in detail, there are multiple ways of constructing the f NS spin eigenfunctions that belong to a given combination of N (and hence n) and S. 48 Herein, all CSFs correspond to spin eigenfunctions that were generated according to the genealogical coupling scheme.Thus each spin eigenfunction X μ ðN, S, M S ¼ SÞ corresponds to a Yamanouchi-Kotani branching diagram.Using this basis, the central quantities during a SCI calculation, that is, σ-vector and the perturbative correction ΔE PT2 , become contractions of molecular integrals, state vectors and spin-adapted coupling coefficient matrices (vide infra). 24,34These coupling coefficients are matrix elements of second quantized replacement operators between CSFs, that is, As is common practice in nonrelativistic quantum chemistry, only the principal component with M ¼ S will be considered in this work.
All other M components give rise to degenerate states as long as no spin-dependent terms enter the Hamiltonian.Due to a prototypical symmetry, only a limited number of unique coupling coefficients exist.
More precisely, the value of A Iμ,Jν pq depends only on the following quantities 34 : • ðp rel , q rel Þ, the position of indices p and q relative to the singly occupied orbitals in n I .
• N I , the number of unpaired electrons in n I .
• S I , the total spin of Φ I .
• ðμ, νÞ, the pair of indices that distinguish the spin eigenfunctions of Φ I and Φ J , respectively, among the f NISI and f NJSJ degenerate sets of spin eigenfunctions.
Throughout this work, all degenerate spin eigenfunctions for a given combination of n and S will be treated simultaneously.Accordingly, the unique coupling coefficients required in Hamiltonian diagonalization and ΔE PT2 calculation are handled as coupling coefficient (CC-) matrices A NS p rel q rel with dimension ðf NISI Â f NJSJ Þ.This simultaneous treatment of all CSFs belonging to a given configuration renders the presented approach "configuration-based."

| IMPLEMENTATION
This section outlines the details of our configuration-based HCI implementation.Subsection A describes the HCI selection procedure before subsections B and C elaborate on the efficient calculation of the σ-vector and the HCI perturbative correction, respectively.In the following, orbital configurations (CFGs) that can be represented by an occupation number vector n will be labeled with indices I, J and so forth.Configuration state functions (CSFs) that correspond to an occupation vector n together with a spin eigenfunction carry two labels, for example, Iμ.

| Configuration selection
In the configuration-based HCI, the reference space P consists of a set of configurations f Φ I j ig.All configurations Φ I j i P can interact through the Hamiltonian with up to at most doubly excited configurations that are created as with Φ A j i, Φ B j i= 2 P. The action of the replacement operator E pq on configuration Φ I j i is defined by reducing the q'th and increasing the p'th occupancy of n I by one.Of course, if through this process any occupancy of n I becomes larger than 2 or smaller than 0, the action results in 0. In the HCI-algorithm, the importance of configurations iin the wave function is determined by the absolute integral values, t pq and v pq,rs defined as t pq ¼ jh eff pq À 0:5 More precisely, the importance function for singly and doubly excited configurations is defined as A configuration is selected for inclusion in P if its corresponding importance function exceeds a user-defined threshold ε sel .Note that the same threshold is used for single and double excitations.Furthermore, the values of coupling coefficients are removed from the importance function since they are bounded by 2 and 4 for single and double excitations, respectively.
As outlined in Section 2.1, the efficiency of HCI configuration selection originates from the early termination of many loops over orbital indices p, q, r, s during excited configuration generation.This is made possible by sorting the configurations and integrals in such a way that the outermost loop is terminated as soon as the importance function drops below the selection threshold ε sel .A setup of the configuration selection consists of the following steps: 1. Create and store maps I 1 and I 2 : In I 1 , every key p is connected to a list of values fðq,t pq Þg that consist of pairs of orbital indices q and their corresponding integrals t pq , that is, each element is a pair ðq, t pq Þ. Importantly, the list fðq, t pq Þg is sorted in descending order of t pq .Likewise, I 2 connects key pairs ðp, qÞ with sorted lists fðr, s, v pq,rs Þg.During the creation of I 1 and I 2 , t max ¼ max pq ft pq g and v max ¼ max pqrs fv pq,rs g are determined and stored.
2. Create a set of generator configurations, P gen as a subset of all variational configurations of P, that satisfy c max I > ε gen and sort the generator configurations in a descending order based on their 49 With this data at hand, the configuration generation and selection using the HCI-algorithm can be readily carried out.Algorithm 2 summarizes all steps of the process of selecting singly-excited configurations.
The selection of doubly-excited configurations works analogously.
Although the HCI-algorithm is known for its efficiency, 29 it tends to not provide a compact wave function due to the underlying approximations as pointed out by Tubman et al. 50Therefore, we utilize the CIPSI method for further pruning of configurations after an initial HCI-selection.To describe our CIPSI-implementation briefly, we shall recall the corresponding importance function, where E is the variational energy of the previous iteration. 28The importance function of a configuration Φ A j i is taken as the maximum value of f CIPSI of all corresponding CSFs Φ Aμ , that is, During the calculation of matrix elements H Aμ,Iν , the HCIalgorithm is utilized to screen only significant configuration connections.The summation in Equation ( 16) does not run over all configurations in P but rather over the generator configurations.Hence, while generating the HCI-configurations according to Algorithm 2 and its twoelectron counterpart, we also find the configuration connections necessary to conduct the additional CIPSI-pruning according to Equations ( 16) and ( 17).This two-step selection procedure allows us to make efficient use of CIPSI, which would be significantly more time-demanding if all singly and doubly excited configurations were generated and probed.

| Prefix algorithm
The HCI-algorithm for configuration selection is used in both, the variational part and the perturbative energy correction in Equation ( 6).
Algorithm 2 HCI singly excited configuration selection.
A way to further speed up these steps is through parallel processing.
However, a parallelization over the generator configurations f Φ I g j i is problematic, since generator configurations belonging to different processes might interact with the same excited configuration Φ A j i through Ĥ.Therefore, we propose a prefix-based parallelization within the FOIS, which ensures that the parallel processes generate nonoverlapping sets of excited configurations.In this scheme, different processes generate excited configurations that feature distinct patterns in fixed segments of the occupation vector, called prefixes.This idea is perhaps similar to the recent constraint-based PT2 approach by Tubman et al. 21A straightforward way to generate the prefixes would be to consider the first k orbitals and create 3 k prefixes from all possible occupation patterns.However, this implies that k needs to be small and it would infer generation of many prefixes that will likely never occur in the HCI configuration space, such as "00000."We propose a different procedure for the creation of prefixes that aims to create more realistic prefixes and thus achieve an improved load balancing.In our procedure, the length k of the prefix is chosen as the number of occupied orbitals in the HF-reference.An initial set of prefixes of length k is obtained by extracting all of the ones that occur in the generator space P gen .As in a single HCI iteration, only up to double excitations can occur, so this initial set is extended by singly and doubly excited prefixes.Since the single and double excitations can also involve orbitals out of ½1,k, prefixes that are obtained by incomplete excitations have to be considered.For example, for E pq with p > k and q < k, we have to apply only the annihilation operation â q on the prefix, whereas â † p is discarded.With the final set of prefixes at hand, they are distributed evenly among the parallel processes.The method is summarized in the following steps: 1. Gather all the unique prefixes present in the set of generators.In this intermediate loop, it is tested whether the prefix can be reached from the current generator by single and double excitations (jΔnj ≤ 4).If so, configuration generation and selection is performed from that generator.We shall mention that the loop over prefixes is set as the outermost one during computation of the PT2-correction (cf.Equation ( 6)).There, the loop over prefixes is divided into batches such that the PT2-energy is calculated incrementally.This allows for an efficient adaption of the procedure to the available memory.

| Configuration connections
An efficient approach to compute σ-vectors during the diagonalization of Ĥ within the P-subspace is to utilize configuration connections.In this work, the construction of these connections is similar to the singles and doubles Hamiltonian construction algorithm from the review by Tubman et al. 50The one-and two-electron connections are defined and stored as lists of triplets, where Ω conf is the set of configuration indices while Ω 1 orb and Ω 2 orb are the sets of composite indices for one-and two-electron excitations respectively.The time-critical step in constructing configuration connections is performing the required searches on the list of P-configurations.In our implementation, the configurations are stored as C++ strings which suggests the use of an unordered map for storage and performing the searches.However, despite the undoubted efficiency of hash maps, this method can become expensive for large orbital and configuration spaces.An alternative, in this context more efficient, approach is to make use of a prefix-tree or trie. 24A trie is a recursive data structure that can be used to store configurations as a set of linked nodes.In the present implementation, each node consists of an array of three pointers, which correspond to the three possible orbital occupations 0, 1, and 2. These pointers store the memory locations of the next nodes.A walk in the trie from top to bottom corresponds to a configuration.Figure 2 illustrates this concept for two example configurations, 20 j i and 02 j i.The main advantage of using a trie is that one can perform prefix searches, that is, if a node has no children (three crossed lines), the search is terminated early.Therefore, a trie-based algorithm for configuration searches consists of walking down the trie and applying creation and annihilation operators on the starting configuration along the way.For this procedure to work efficiently, the creation and annihilation operations in E pq or E pq E rs have to be applied in an ordered manner.Consider the sum of one-electron replacement operators as it appears in the Hamiltonian, For finding all one electron connections within P, two separate walks are performed-one for p > q and the other for p < q.For the Algorithm 3 HCI prefix-based singly excited configuration selection.
diagonal terms, no configuration-searches are performed.For p < q, the search procedure is outlined in Algorithm 4. The search when q < p, works similarly by swapping the pq-indices.
In case of two-electron connections, the search works analogously, but to allow for efficient trie searches many more relations between the indices have to be considered.A complete list of search patterns is given in Appendix A. Finally, it should be mentioned that during the process of finding configuration connections, the relevant information (vide supra) about which coupling coefficients are associated with the connection is gathered as well.

| σ-vector calculation
The most time-consuming step in CI-calculations is usually the evaluation of the σ-vector σ ¼ Hc with elements Using a configurational resolution-of-the-identity (RI) ansatz, the two-electron coupling coefficient CC-matrix can be rewritten as While Siegbahn's approach for evaluating σ as product of vectorized intermediates 51 works remarkably well in the context of FCI, where the CI-space coincides with the RI-space, it may lead to memory and performance problems for SCI calculations as the configurational RI space is usually significantly larger than the CI space P. As an alternative, the presented program constructs, stores and utilizes twoelectron configuration connections during the σ-vector evaluation.
Clearly, the bottleneck in this approach is the calculation of matrix A simple configuration trie.The left path of connecting lines corresponds to 02 j i and the right one to 20 j i.The crossed lines refer to null pointers, which means that a node has no children.
Algorithm 4 Generation of one electron configuration connections when p < q.
products A IK pq A KJ rs .We shall first note that since replacement operators provide a one-to-one mapping between configurations, the summation over K in Equation ( 22) can be removed leading to To minimize the number of matrix-times-matrix multiplications, the loops in the σ-vector calculation are arranged such that the reading of one-electron CC-matrices from memory or the computation of 2-electron CC-matrices is done in the outermost loop.Each CC is then used many times in multiplications with c.Such a loop arrangement can be done by a making use of two maps, M 1 and M 2 , that are constructed during the construction of configuration connections.Each key of the maps is a set of numbers that uniquely specifies either a one-or two electron coupling coefficient matrix, whereas the value is a list of configuration connections corresponding to the specified CCmatrix.In the case of an arbitrary one-electron excitation E pq Φ I j i! Φ J j i, the involved one-electron CC-matrix is uniquely identified by a tuple of 5 integers, comprising the excitation type e, the numbers of unpaired electrons n I , n J in configurations I and J and relative positions p rel , q rel of donor and acceptor orbitals p and q with respect to the singly occupied orbitals.Similarly, for a double excitation fully identifies the corresponding 2-electron CC-matrix.These tuples can be conveniently used as a key for an ordered map in C++ by making use of the STL objects std::tuple and std::map.Thus, M 1 and M 2 have the following structure: where C 1 and C 2 are configuration connection lists.The σ-vectors are constructed through loops over keys in M 1 and M 2 that in turn feature loops over connections in the corresponding lists as outlined in Algorithm 5.
Due to the frequent application of matrix-matrix or matrix-vector operations in the time-limiting parts of the code, the method is wellsuited for making use of fast linear algebra libraries.In particular, the rearrangement of loops in the σ-vector calculation poses a suitable case for using just-in-time (JIT) compilation.By applying JIT to the outermost-loop of Algorithm 5, a runtime function or kernel is prepared to carry out the matrix-vector products in the innermost loop.
We have found that typically for a single matrix-product in the outermost loop, many subsequent matrix-vector products are calculated.
This means that the time for kernel-preparation is negligible.In addition, most of the CC-matrices have relatively small dimensions (below 100), in which case JIT methods are known to be advantageous over conventional BLAS/LAPACK routines.For the matrixvector product via JIT-kernel, we used the LIBXSMM library. 52For the rest of linear algebra operations, we used the C++ Armadillo library linked with Intel MKL. 53,54Since the CC-matrices have some sparsity and some elements equal to one, we believe that further speedups could be gained in the future by developing more specific runtime-compiled kernels.

| Perturbative correction
The perturbative correction to the SCI energy is calculated by During the evaluation of Equation ( 28), the prefix algorithm described above (see Section 3.2) is employed to achieve parallelization with a reasonable load-balance.Furthermore, it allows for the calculation of ΔE PT2 in a batched manner, which reduces the memory requirement of this step, thereby avoiding the need for a semistochastic approach. 20In the present implementation, each batch comprises 20 prefixes.This choice yielded a reasonable performance with respect to the memory usage in our test calculations.As mentioned above, the configuration connections through Ĥ in the summation of Equation ( 28) are efficiently screened in the same way as during the configuration selection (see, for example, Algorithm 2).In order to achieve accurate results, the selection thresholds are chosen typically 1-3 orders of magnitude smaller and all elements of P are considered "generators," that is, P gen ¼ P.

| RESULTS
This section presents results of a series of test calculations that have been chosen to demonstrate some key properties, strengths and also weaknesses of the presented approach.Computational details of the underlying calculations that are not of immediate relevance to the presented data (e.g., basis sets) can be found in Section 5.
Algorithm 5 Two-electron contributions to the σ-vector calculation.

| Cyanogen
Three sets of test calculations on the cyanogen molecule ((CN) 2 , 18 electrons in 32 orbitals) were conducted with the aim of demonstrating different aspects of the implemented method.First, the behavior of variational energy with respect to different configuration selection procedures and parameters is investigated.Then, the dependence of PT2 correction on the corresponding selection thresholds are scrutinized before convergence of the total energy towards FCI limit is compared to DMRG, an alternative approximate FCI method that employs a spin-adapted basis.(II) HCI with ε gen ¼ 0.
Tables 1 and 2 present the variational energies as obtained with and without the generator configuration approximation, respectively, for varying selection thresholds.We shall discuss the results from Table 1 first.As expected, the CIPSI-pruned scheme provides a wave function with a smaller CSF-dimension compared to "pure" HCI-selection.For combinations of loose thresholds (ε gen ≥ 0:5 Â 10 À1 ), CIPSI-pruning reduces the CSF dimension between 4% and 17%.This reduction comes at the cost of a variational energy increase between 3.3 mE h ðε gen ¼ 1 Â 10 À1 ) and 0.8 mE h (ε gen ¼ 0:5 Â 10 À1 ).Upon further tightening of thresholds, the energy difference between CIPSI-pruned HCI and "pure" HCI selection schemes become smaller while the CSF dimension reduction slightly increases to values between 15% and 25%.
Thus, the quality of CIPSI-pruned selection relative to the "pure" HCI selection improves with tighter thresholds.This finding can be rationalized by considering the perturbative nature of CIPSI, which is expected to become more accurate as the quality of the reference wave function reference improves.In general, CIPSI-pruning acts to partially remedy the tendency of HCI to select a considerable number of unimportant configurations in addition to the important ones.Most likely, the improvement in this regard achieved by CIPSI pruning is due to the inclusion of the denominator in Equation ( 16).In terms of timings, CIPSI-pruning has two opposing effects.While the time required for configuration selection is necessarily increased, the reduced dimension of P infers lower computational costs during the Davidson diagonalization procedure.Our data indicates that for loose thresholds, the former effect outweighs the latter, leading to slightly larger calculation times, whereas the CIPSI-pruned calculation times are reduced by up to 40% when tight thresholds are invoked.
While the general trends observed in the previous paragraph are also seen when P gen ¼ P, that is, for scheme II and IV, the difference between "pure" HCI selected and CIPSI-pruned selection is larger in terms of both, variational energies and CSF dimension.It should furthermore be noted that when ε var is chosen such that similar CSF dimensions are obtained, the CIPSI-pruned scheme provides lower energies.For example, compare the results with ε var ¼ 0:9 Â 10 À3 for the "pure" HCI selection and with ε var ¼ 0:3 Â 10 À3 for the CIPSIpruned scheme.Since CIPSI-pruning in general leads to more compact wave function representations at only moderately higher computational costs (if at all higher), we would advocate use of the HCI/CIPSI-scheme for configuration selection.

| PT2-correction
The combination of the HCI algorithm with the above described prefix algorithm enables efficient calculation of perturbative corrections in our configuration-based HCI implementation.While the former reduces the space of considered perturber-functions, the latter allows for a wellbalanced parallelization and memory-efficient batching.The results and timings for a set of calculations with varying ε PT2 presented in Table 3 support this assessment.For example, even with a tight threshold of The calculations were performed on one compute node with AMD EPYC 7451 24-core processor.

| Fe(II)Porphyrin
The prediction of triplet-quintet porphyrin has served as a challenging test case for multireference methods in many instances [56][57][58][59] and despite intense efforts the size of the gap is still subject of an ongoing debate. 59A proper quantitative description of this system is out of the scope of the current work since at least two important ingredients for such a treatment are not considered but will be subject to future work: active-active orbital rotations and dynamical electron correlation.
Instead, the presented results will serve to demonstrate the performance of the current HCI method in describing higher spin states.In Table 5, we have compiled the results of HCI calculations of the lowest triplet and quintet states of Fe(II) porphyrin with an active space of 40 electrons in 42 orbitals.Starting from a set of quasi-natural NEVPT2 orbitals 60,61 (see Section 5 for details), the molecular orbitals were optimized under consideration of internal-active, internalexternal, and active-external rotations.The orbital coefficients of the final set of orbitals are provided in the Electronic Supporting Information.When only the variational part of the HCI energies is considered, the T-Q gap is predicted to amount to ΔE T-Q ¼ 6:5 mE h (0.17 eV), which corresponds to a quintet ground state.Upon inclusion of the perturbative correction the value decreases to ΔE T-Q ¼ 0:0 mE h (0.0 eV).[64][65][66][67][68][69] In the current context it is instructive to consider and discuss the number of configurations and CSFs involved in the different steps of the underlying HCI calculations.The number of configurations selected in the variational and perturbative part of the quintet calculation is smaller than for the triplet calculation but both CSF dimensions are larger.This finding is connected to the increasing number of CSFs for a given configuration with increasing spin multiplicity. 70On average, the multiplicity of selected configurations in the variational and perturbative step of the quintet calculation is larger.Consequently, the calculation of the quintet state energy required significantly more computer time than the corresponding triplet calculation, especially so during the perturbative step.At this point we would like to mention that while the presented deterministic approach to the PT2 correction is able to take into account quite a large number of CSFs, the 3-step semistochastic approach presented in Reference 20 appears to take into account a significant larger number of Slater determinants.Yet, a direct comparison of the two approaches for the same system on the same computational setup has not been made for this work.
Finally, it is noteworthy that the variational energy calculation of the triplet state is quite efficient compared to the DMRG for this particular case.A DMRG calculation on the triplet state with M ¼ 2000 resulted in a total energy of E DMRG ¼ À2244:1619 E h in 41,073 s using the same computational setup as the aforementioned HCI calculations.In comparison, the variational HCI energy is by 2.5 mE h lower while the calculation was about 10-times faster faster.In our view, this clearly indicates the potential of a configuration-based HCI implementation for applications in transition metal chemistry.

| Co-based valence tautomer
An additional set of test calculations on [Co(sq) 2 (bpy)] (sq-semiquinone, bpy-2,2'-bipyridine, see Figure 3), 1, a known valence tautomer, 71 aims to highlight one of the key features of the presented configuration-based HCI method: its ability to selectively target a user-defined number of states for a given spin state.In its electronic ground state, 1 features a Co III center with a d 6 low-spin configuration together with a ligand-centered radical.On account of the presence of two symmetry-equivalent quinone ligands, two (near-) degenerate T A B L E 4 Calculation of approximate FCI energies of (CN) 2 near the FCI-limit.Note: M denotes the DMRG bond-dimension.For HCI the thresholds were ε gen ¼ 1 Â 10 À3 , ε var ¼ 1 Â 10 À6 and ε PT2 ¼ 1 Â 10 À7 .Energies are given in Hartrees, calculation time in seconds. a The calculations were performed on three compute nodes with AMD EPYC 7451 24-core processor-in total 72 cores.
T A B L E 5 Calculated HCI energies for iron porphyrin.The calculations were performed on three compute nodes with AMD EPYC 7451 24-core processor-in total 72 cores.
states with the unpaired electron being localized on either of the two ligands exist.Upon heating 1 beyond a solvent-dependent critical temperature T c , the nature of the electronic state changes considerably as indicated by a number of experimental observations.It is assumed that this change of spectroscopic and physical properties is induced by a ligand-to-metal charge transfer as observed in numerous other Co based valence tautomers. 72Such a charge transfer gives rise to a d 7 configuration on the Co center and two holes in the ligand valence shell thereby giving rise to a number of near-degenerate states of doublet, quartet and sextet spin multiplicity.
Table 6 summarizes the calculated energies for the four lowest doublet states, three lowest quartet states and the lowest sextet state of 1 and the corresponding computer timings.For these calculations, the CIPSI-pruned HCI method with an active space of 20 electrons in 20 orbitals was used.Analogous to the implementation by Holmes et al., 73 the union of selected configurations for the ground and each excited state is chosen to build the variational space during HCI-iterations.
Overall, the calculation of the three quartet states was most time consuming whereas the calculation of the four doublet states required the least computer time.This observation can likely be attributed again to the increase of the number of CSFs with increasing spin multiplicity of a given configuration.Owing to the lowest number of calculated roots and the concomitant reduction of N var ðCSFÞ the variational part of the sextet calculation was about one order of magnitude faster than the other two sets of calculations.Yet, t PT2 is on the same order of magnitude for all three sets; again owing to the connection between spin multiplicity and NðCSFÞ.Finally, we would like to emphasize that it is the spin-adapted nature of the manyelectron basis in the presented HCI implementation that allows for an a priori targeting of a user-specified number of roots for each spin multiplicity.

| Nickel-oxyl complex
5][76] An interesting question with regard to these compounds is their electronic F I G U R E 3 Tautomerization of the investigated [Co(sq) 2 (bpy)] complex 1.
distribution in the vicinity of the active metal center, for example, whether the compound can be characterized as metal-oxo (M=O) or metal-oxyl (MOÁ).Obviously, the spin density is a useful means to judge the relative importance of the two resonance structures as it provides immediate access to the localization of unpaired electrons.In the presented HCI method, the spin density is readily calculated through contraction of operator matrices T IJ p,q that contain matrix elements of the spin tensor operator with state vectors.Appendix C provides the details of how these operator matrices are efficiently calculated for the principle components with M ¼ S. To demonstrate the utility of this feature of our HCI method, we have conducted a HCISCF calculation on Ni-oxo intermediate 2 (top panel of Figure 4) which has been studied previously on the DFT level of theory by Figg and Cundari. 77The bottom panel of Figure 4 depicts an isosurface of the spin density that corresponds to the lowest triplet state obtained with an active space of 20 electrons in 20 orbitals.These orbitals comprise a blend of metal 3d orbitals and ligand based π orbitals and oxygen p-orbitals as suggested by the ASS1ST procedure. 60,61It is obvious that the spin density is considerably spread over both, the Ni center and the oxygen ligand thus indicating the existence of Ni(III) oxyl rather than a Ni(IV) oxo species.
This assessment is further substantiated by Mulliken spin populations of 1.1 and 1.0 on the Ni O atoms, respectively.

| COMPUTATIONAL DETAILS
This section outlines the computational steps and methods used to produce the results presented in the previous section.
Cyanogen.The geometry of the cyanogen molecule (CN) 2 was taken from Reference 78.Molecular orbitals for the presented MR calculations were obtained from a restricted HF calculation with ORCA (version 5.0.3) using Ahlrichs SV basis set. 79,80Subsequent HCI and DMRG calculations were performed with an active space of 18 electrons in 32 orbitals.provided further improved starting orbitals for the presented HCI calculations. 60,61During the ASS1ST procedure a maximum bond dimension of M ¼ 500 was used.The ASS1ST quasi-natural orbitals were then fed as starting orbitals for HCISCF orbital optimization.Several optimizations were carried out by sequentially tightening the thresholds.Finally, a HCI calculation including the PT2-correction was performed to yield final energies for the calculation of the triplet-quintet energy gap.In the HCISCF orbital optimization, active-active rotations and the PT2-correction were not included.All calculations on Fe(II) porphyrin employed the def2-SVP basis set. 82-valence tautomers.We considered the cobalt complex [Co(sq) 2 (bpy)] (sqsemiquinone, bpy -2,2'-bipyridine) from a previous DFT-study by Sato and coworkers. 71e geometry was optimized at DFT level using the TPSSh functional with a ZORA def2-TZVP(-f) basis set.The starting orbitals for SCI calculations were obtained from a state-averaged DMRGSCF optimization with M ¼ 3000.In the start of SCI calculations, the basis set was projected from def2-SVP to the def2-TZVP basis set. 80For geometry optimization Orca5 was used. 79The multireference calculations were carried out with the MOLBLOCK/BLOCK codes. 55ckeloxyl.The geometry was taken from a previous study by Figg and Cundary. 77The final active space was chosen to be 20 electrons in 20 orbitals based on results from the ASS1ST procedure with M ¼ 500.The orbitals from ASS1ST were optimized using HCISCF, in the end of which spin densities were obtained.All calculations on 2 utilized the def2-SVP basis set.

| CONCLUSIONS
We have developed and implemented a configuration-based SCI method designed to capture static electron correlation within the active space framework for the electronic ground and excited states Yet, we believe it should be noted at this point that a simple comparison of timings might not always be the single relevant factor when evaluating the utility of a given theoretical method for a planned computational investigation.Other criteria such as variationality or spinsymmetry of the wave function might be similarly important.
A key feature of the method presented in here is that the computed wave functions are expanded in a strictly spin-adapted basis.As a result, electronic states with a given spin multiplicity can be specifically targeted by the user, which can turn out to be essential, for example, during computational studies of molecules with multiple open shells like the Co-based valence tautomer 1.Furthermore, since all spin eigenfunctions for a given configuration are treated simultaneously, the calculated electronic energies do not depend on the order of active orbitals and all relevant computations rely on efficient matrix operations.Through a simple adaptation of the underlying generation routines for one-electron coupling coefficients, the presented configuration based HCI program allows for the evaluation of spin densities at a small extra cost.
While the HCI algorithm is inherently efficient, the demonstrated performance of the method relies to a high degree on details of its implementation.For example, the chosen loop structure minimizes the number of redundant matrix operations which in turn are optimized by choosing regular BLAS or JIT-compiled routines depending on the matrix-size.Our prefix-based parallelization scheme ensures an efficient and well-balanced parallelization of the workload during calculation of the σ-vector and the perturbative correction.In this way, a (semi-) stochastic treatment of the perturbative correction as required in previous HCI implementations could be avoided.Moreover, the usage of a trie-based data structure leads to significant speedups during the many necessary searches in the configurational space.
Despite the performance of our configuration-based HCI implementation, we envisage several routes for improvements and extensions.First, the selection of spin functions for configurations with a large number of unpaired electrons ( > 20) will have to be done in a CSF-specific manner to avoid an unfavorable scaling of required computer memory and time.Moreover, a combination of MPI-and OpenMP-based parallelization would improve the parallel computational performance while keeping memory requirements lower.Finally, the presented implementation will form the basis for a specifically designed adaptation of the NEVPT2 approach to the calculation of dynamic electron correlation effects and nuclear gradients for state-averaged CASSCF calculations with large active spaces.

APP E NDIX A: TWO-ELECTRON CONFIGURATION CONNECTIONS
To reduce the number of connections to be stored and also save some time in the σ-vector calculation, we make use of the symmetries, where we have defined Δn ¼ P M p jn I p À n J p j to denote the accumulative difference between two configurations.We split the two-electron replacement operator into multiple parts, 83 1 2 To construct the two electron configuration connections, we have to apply the replacement operators on the starting configurations.
There are multiple ways to order the creation and annihilation operations in a double excitation and we shall consider all of them here.In the following, the superscript 0 on orbital index denotes application of both annihilation and creation or vice versa.When jΔnj ¼ 4, we have enforced that pq < rs.Since we can write pq ¼ pM þ q, it follows that p < r ) pq < rs and if p ¼ r, q < s ) pq < rs.In the 4-th case of E pq E rs , the constraint p < r leaves q and s undetermined.Their order does not change the final configuration that is created in the excitation, but it determines which kind of CCs are involved.Hence in the implementation, for each E pq E rs -excitation, we also consider an excitation with q, s-swapped in the generation of configuration connections.
• jΔnj ¼ 2 1. E pq E qr q 0 < p † < r, r < q 0 < p † , ðA10Þ p † < r < q 0 , ðA11Þ r < p † < q 0 : ðA12Þ 2. E pr E rp p 0 < q < r † , ðA13Þ Note that, although the order of creation and annihilation operations can be swapped for the purpose of performing prefix searches, the order has to be restored back to the canonical order when the connection is established.For example, for the ordering p † < r † < q < s, we have to restore the order pqrs-the two-electron integrals and CCmatrices have some symmetry, but they are not completely permutationally invariant.

APP E NDIX B : GENERATION OF PREFIXES
In the configuration selection procedure, only single and double excitations are applied on the set of generator configurations.This constrains, which kind of prefixes can occur in the space of excited configurations.Because the prefix is usually smaller than the configuration, k ≤ M, it can happen that the creation and annihilation operations of a single or double replacement operator split between the prefix and the rest of the configuration.We have to consider all of these possible splittings to ensure that we do not miss any possible excitations.Below, we list the different possibilities of annihilation and/or creation operations that can be applied on the prefixes: p > k, q ≤ k : q, ðB2Þ p ≤ k, q ≤ k : p † q: 2. E pq E rs p ≤ k, q > k, r ≤ k, s > k : Note that the possibilities, where only one operation occurs in the prefix, was covered in the E pq -case.

AP PE NDIX C: SPIN TENSOR OPERATOR MATRICES
Just like their singlet counterparts E pq , the triplet spin tensor operators T0 pq used to produce spin densities in here correspond to single excitations on the level of occupation number vectors n I .Accordingly, four different kinds of matrix elements can be formed, DOMO!SOMO, SOMO!Virtual, SOMO!SOMO and DOMO!Virtual.As demonstrated in Reference 34, the first two classes and the last two classes are connected through simple matrix transpositions.Furthermore, just like for A IJ pq -matrices the number of nonredundant matrices is rather limited.Analogous to eq. ( 24) of Reference 34, operator matrices that correspond to SOMO!SOMO excitations can be computed through where sðp, qÞ ¼ 1, À 1 is a sign factor connected to the relative positions of p and q with respect to the singly occupied orbitals in the bra where f 11 ¼ f NÀ2Sþ1 and f 12 ¼ f NÀ2S are the numbers of independent spin functions for N À 2 unpaired electrons and total spins of S À 1 and S, respectively.Note, that Equation (C2) is a slight variation of eq. ( 21) of Reference 34.
Triplet spin tensor operators that correspond to DOMO!SOMO excitations are calculated according to where N þ 1 corresponds to an empty dummy orbital as described in the seminal work by Knowles and Werner. 84Hence, T N S ð TpNþ1 Þ and A N S ð ÊNþ1q Þ are readily evaluated using Equations (C1) and ( 24) of Reference 34, respectively.

Algorithm 1 )
. Before any f HCI ð Φ A j iÞ values are actually calculated, the two-electron integrals ðtujvwÞ are sorted in descending order of their absolute value for each orbital pair ðtuÞ.As pointed out by Umrigar and coworkers in their original work, these absolute values correspond to the magnitude of a double excitation in which electrons in orbitals w and u are excited to orbitals v and t.17

F I G U R E 1
Schematic representation of the iterative selection procedure in the framework of CIPSI and modern variants thereof.

2 .
Increase the set of prefixes by performing single and double excitations on the current prefixes.For the details on this step, please look at Appendix B. 3. Distribute the prefixes among processes.The prefix algorithm essentially introduces an intermediate forloop in the HCI configuration generation/selection (see Algorithm 3).

4. 1 . 1 |
Variational partAs outlined above, there are several options for performing configuration selection in our HCI implementation.The HCI configuration selection can be used by itself or together with CIPSI-pruning.Additionally, the selection procedure can be performed based on all configurations in the current variational space ðε gen ¼ 0Þ or based on the generators subset ðε gen > 0Þ.Accordingly, four different settings for the configuration selection have been tested and compared (I) HCI with ε gen ≠ 0.

Fe
(II) porphyrin.A CASSCF (12e,12o) and PC-NEVPT2 calculation was performed using the ORCA program package to yield PC-NEVPT2 natural orbitals as the starting orbitals.Then, a DMRGSCF (24e,22o) calculation was carried out with the MOLBLOCK program which interfaces the BLOCK code as an active space DMRG solver. 55,81On top of the converged DMRGSCF calculation, ASS1ST and ket occupation number vectors while U S ð RN Þ and U S ð PN Þ are representation matrices of permutation operators PN and QN in the basis of genealogical spin eigenfunctions with N unpaired electrons and totals spin S. Operators PN and RN act to bring the two involved singly occupied molecular orbitals p and q to positions N À 1 and N with respect to the other singly occupied orbitals.Finally, the projection matrix C takes the form Calculated variational HCI energies for the cyanogen molecule using approximations I and III, the generator approximation is applied, ε gen ≠ 0.
T A B L E 1Note: All energies are given in Hartrees, wall time is given in seconds.a ε gen ¼ Á10 À1 , ε var ¼ Á10 À4 .b The calculations were performed on one compute node with AMD EPYC 7451 24-core processor.
PT2 is less than 10 À5 E h , whereas the calculation time is reduced by a factor of $ 5. Of course, when ε PT2 is relaxed NðCSFÞ and t PT2 are further reduced at the cost of accuracy.Yet, at ε PT2 ¼ 1:0 Â 10 À6 the error Convergence of the DMRG energy with respect to the bond dimension indicates that the final energy at M ¼ 4000 is sufficiently accurate for the current purpose.With thresholds of ε gen ¼ 1 Â 10 À3 , ε var ¼ 1 Â 10 À6 and ε PT2 ¼ 1 Â 10 À7 the HCI energy is calculated to be 0.02 mE h below this DMRG result at Calculated variational HCI energies for the cyanogen molecule using approximations II and IV, the generator approximation is not applied, ε gen ¼ 0.Note: The energies are given in Hartrees, wall time is given in seconds.Note: For the variational part, fixed thresholds ε gen ¼ 10 À2 and ε var ¼ 10 À5 , were used.The energies are given in Hartrees and the timings in seconds.
À6, only about a billion CSFs out of the complete set of over 7 billion CSFs in the FOIS (cf.value of NðCSFÞ for ε PT2 ¼ 0:0) are considered during the calculation of E PT2 , resulting in a significant speedup at almost no loss of accuracy.In fact, the difference of E Lastly, we demonstrate the ability of the HCI method to efficiently provide energies near the FCI-limit.In Table4, the approximate FCI energies as obtained from the DMRG with different bond dimensions are compared to the HCI energy.approximately5% of the computational cost.The variational and perturbative parts took 1797 and 2182 s on three CPU nodes with 24 cores, respectively.It must be noted, however, that the DMRG energy is fully variational whereas the perturbative correction of the HCI energy that amounts to about 2.5 mE h in this case is nonvariational.T A B L E 2 a ε var ¼ Á10 À3 .bThecalculationswere performed on one compute node with AMD EPYC 7451 24-core processor.T A B L E 3 Calculated HCI energies of (CN) 2 with varying thresholds for the PT2-correction.a ε PT2 ¼ Á10 À6 .b The energies are given in Hartrees and time in seconds.The thresholds were ε gen ¼ 1 Â 10 À2 , ε var ¼ 1 Â 10 À5 and ε PT2 ¼ 1 Â 10 À6 .