MultiPsi: A python‐driven MCSCF program for photochemistry and spectroscopy simulations on modern HPC environments

We present MultiPsi, an open-source MCSCF program for the calculation of ground- and excited-state properties of strongly correlated systems. The program currently implements a general MCSCF code with excited states available using either state-averaging or linear response. It is written in a highly modular fashion using Python/C++, which makes it well suited both as a development platform, enabling easy prototyping of novel methods, and as a teaching tool using interactive notebooks. The code is also very efficient and designed for modern high-performance computing environments using hybrid OpenMP/MPI parallelization. This efficiency is demonstrated with the calculation of the CASSCF energy and linear response of a molecule with more than 700 atoms, as well as a fully optimized conventional CI calculation on more than 400 billion determinants.


| INTRODUCTION
Most quantum chemistry methods are designed with the assumption that the wavefunction of the molecule is qualitatively well described by a single electronic configuration. However, there are many cases where this approximation breaks down, the so-called strongly correlated systems. The archetypical example is bond dissociation, but similar problems can be found even around equilibrium geometries, for example in transition metals or other molecules with partially filled electron shells. In these cases, the most reliable solutions are provided by multireference methods. While these methods have largely been confined to benchmarking purposes due to their cost and the relative difficulty in setting up a calculation, recent progress has significantly changed the situation and arguably made these methods realistic alternatives in production calculations.1

The multiconfiguration self-consistent field (MCSCF) method2 has a central role among multireference methods, both as one of the most computationally affordable methods and as a starting point for subsequent treatments of dynamical correlation effects, for example with multireference perturbation theory,3,4 multireference configuration interaction5 or multireference coupled-cluster methods.6 Alternatively, the MCSCF machinery can be used to implement some form of multiconfigurational density functional theory.7 Because of its central place in multiconfigurational quantum chemistry, it is essential to have a very efficient implementation of MCSCF.

An MCSCF wavefunction is often defined by the separation of the orbital space into active and inactive orbitals, with some form of configuration interaction (CI) performed among the active orbitals while the other orbitals are treated within the mean-field approximation. If the CI includes all possible configurations, the method is known as complete active space self-consistent field (CASSCF).8 Due to the very steep (factorial) scaling of full CI, the cost of CASSCF grows very quickly beyond around 15 orbitals. In order to limit the cost, the expansion can be restricted using the restricted active space SCF (RASSCF)9 or the more general generalized active space SCF (GASSCF).10 In the past decades, several methods have also been designed to replace the CI in order to allow the treatment of large active spaces, such as the FCI quantum Monte Carlo (FCIQMC) method,11 selected CI,12 or the density matrix renormalization group (DMRG) methods.13

While the CI cost tends to dominate for small molecules and/or large active spaces, for large molecules the efficiency of MCSCF and MCSCF response is dictated by the orbital optimization, which consists mostly in the computation and manipulation of two-electron integrals. In particular, the most expensive step of conventional MCSCF and linear response is the transformation of two or more of the two-electron integral indices from the atomic orbital (AO) basis to the active molecular orbital (MO) basis.26

Here, we describe MultiPsi, a new open-source MCSCF energy and linear response platform designed to run on modern massively parallel HPC architectures. It is written in a modular design using Python and C++ and is meant as a multiconfigurational extension of VeloxChem.27
The CI step has been designed using a hybrid parallel architecture and takes special care of the non-uniform memory access (NUMA) that is becoming prevalent in modern HPC. Additionally, we implement a direct Fock-matrix based molecular orbital transformation, using the Fock matrix constructor from VeloxChem, which shows near-optimal scaling and allows systems comprising several thousands of basis functions to be tackled efficiently. These algorithms thus truly enable large-scale MCSCF energy and linear response calculations, finally allowing strongly correlated systems to be treated with an efficiency on par with that achieved by modern algorithms for single-reference methods.

| General MCSCF theory
A general MCSCF wavefunction can be written as

|Ψ(κ, C)⟩ = exp(−κ̂) Σ_i C_i |Ψ_i⟩,   (1)

with κ̂ the orbital rotation operator defined as

κ̂ = Σ_pq κ_pq Ê_pq   (with κ an antisymmetric matrix),

and |Ψ_i⟩ a set of Slater determinants (SD) with the associated CI coefficients C_i. Note that for CASSCF wavefunctions, the orbital space is divided into inactive (occupied), active and virtual orbitals, and orbital rotations within each subspace are redundant. In this case, only inactive-active, inactive-virtual and active-virtual rotations are needed, but if an approximate solver is used within the active space, active-active rotations may have to be included.
Optimizing this wavefunction requires the simultaneous optimization of the orbital and CI coefficients κ_pq and C_i. However, it is common to use only approximate second-order algorithms where the direct coupling between the CI and orbital rotations is neglected, allowing the two sets to be optimized separately.8 While this typically requires more iterations, it significantly reduces the cost of each step and is thus often more efficient overall. In this scheme, one MCSCF iteration consists of a CI optimization followed by a pseudo-second-order orbital step. The coupling of the orbital optimization with the CI is partially recovered through the Hessian update.9 A great review of optimization methods for CASSCF is found in Ref. 28.

| Parallel CI algorithm
The CI equation can be seen as a simple projection of the time-independent Schrödinger equation onto an n-electron basis of SDs,

Σ_j ⟨Ψ_i|Ĥ|Ψ_j⟩ C_j = E C_i,

and is thus most straightforwardly solved as a simple diagonalization. However, as the size of the CI expansion grows, the explicit construction and storage of the Hamiltonian matrix becomes unfeasible. Instead, the equation is solved in an iterative fashion, usually using the Davidson method or its variants. In this method, the most expensive step is the computation of the sigma vector

σ = HC,   with elements   σ_i = Σ_j H_ij C_j = Σ_j ⟨Ψ_i|Ĥ|Ψ_j⟩ C_j.

The main limitation for traditional CI algorithms is the memory required to store the CI coefficients.20-22 At the same time, the computation of the sigma vector couples a large number of elements together, and would thus require a very significant amount of communication between different nodes.
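To make the iterative procedure concrete, the following is a minimal numpy sketch of a Davidson-type solver for the lowest root. It is not the MultiPsi implementation: the sigma routine and the Hamiltonian diagonal hdiag are assumed to be supplied by the caller, and the restart logic is deliberately simplified.

```python
import numpy as np

def davidson(sigma, hdiag, n, tol=1e-8, max_space=20, max_iter=100):
    """Find the lowest eigenpair given a sigma(c) routine returning H @ c."""
    b = np.zeros(n)
    b[np.argmin(hdiag)] = 1.0          # start from the lowest diagonal element
    B, S = [b], [sigma(b)]             # subspace vectors and their sigma vectors
    for _ in range(max_iter):
        Bm, Sm = np.array(B), np.array(S)
        Hs = Bm @ Sm.T                 # small subspace Hamiltonian
        w, V = np.linalg.eigh(Hs)
        e, v = w[0], V[:, 0]
        c = v @ Bm                     # current CI vector
        s = v @ Sm                     # current sigma vector
        r = s - e * c                  # residual
        if np.linalg.norm(r) < tol:
            return e, c
        t = r / (hdiag - e + 1e-12)    # Davidson preconditioning
        t -= Bm.T @ (Bm @ t)           # orthogonalize against the subspace
        t /= np.linalg.norm(t)
        if len(B) >= max_space:        # cap the subspace to limit memory
            B, S = [c], [s]
        B.append(t)
        S.append(sigma(t))
    raise RuntimeError("Davidson did not converge")
```

For a small test matrix H one would simply pass sigma=lambda v: H @ v and hdiag=np.diag(H); in a CI code, sigma is instead the direct-CI routine discussed below and H is never formed explicitly.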
A few years ago, Vogiatzis et al.23 provided a clever way to reduce the parallel communication problem by making use of the locality of the Hamiltonian in excitation space, namely that the Hamiltonian only couples elements which are at most two excitations away from each other. They achieve this by using the GAS structure of their CI code to divide the workload. With this algorithm, they showcased the possibility of a CASSCF step with several billion determinants and even a single CI step with nearly a trillion determinants.
In a GAS, the active space is divided into an arbitrary number of subspaces, and the set of SDs is correspondingly divided into subsets depending on their total α and β occupation in each subspace. For example, in a singlet molecule with an active space containing 2N electrons in the same number of orbitals, one could split the orbital space into the lowest N orbitals and the rest. The SDs could then be sorted depending on the number of electrons in the first N orbitals. The first subset of SDs would have 2N electrons and correspond to the Hartree-Fock reference (assuming energy ordering of the orbitals), while the next two subsets would contain 2N − 1 electrons in the lowest N orbitals, corresponding to the α and β single excitations, respectively, and so on. By sorting SDs this way, it becomes easy to restrict the excitations to, for example, CI singles and doubles.
But this sorting is also convenient as it exposes the local structure of the Hamiltonian. One could thus alternatively keep all SD subsets but split them across different compute nodes. Thanks to the splitting of the active space, the amount of data communicated between the nodes is reduced, since the sigma vector element subset corresponding to M electrons in the first N orbitals does not require the CI coefficients of the SDs having fewer than M − 2 or more than M + 2 electrons in this orbital subspace; see the example in Figure 1. Thus, this algorithm translates the locality in excitation space into data locality in the MPI distribution.
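As an illustration of this partitioning and coupling rule, the Python sketch below (with hypothetical helper functions, not MultiPsi code) groups the strings of one spin by their occupation of the first orbital subspace and lists which subsets may be coupled by the Hamiltonian.

```python
from itertools import combinations

def partition_by_first_space(n_orb_first, n_orb_total, n_elec):
    """Group strings of one spin by their occupation of the first subspace."""
    subsets = {}
    for occ in combinations(range(n_orb_total), n_elec):
        n_first = sum(1 for p in occ if p < n_orb_first)
        subsets.setdefault(n_first, []).append(occ)
    return subsets

def coupled(label_bra, label_ket):
    """Hamiltonian blocks vanish unless the labels differ by at most two."""
    return abs(label_bra - label_ket) <= 2

# Example: 5 electrons of one spin in 10 orbitals, split 5 + 5 as in Figure 1.
subsets = partition_by_first_space(5, 10, 5)
for m in sorted(subsets):
    partners = [n for n in sorted(subsets) if coupled(m, n)]
    print(f"{len(subsets[m]):4d} strings with {m} electrons in space 1, "
          f"coupled to labels {partners}")
```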
By introducing more and more subspaces, one can divide the CI expansion into as many subsets as desired (up to the number of Slater determinants), and thus distribute the CI vector over as many nodes as are available for computation. As the SD subsets generated in this way can have very different sizes, load balancing often requires dividing the CI vector into significantly more subsets than the number of nodes.
One of the main issues with this algorithm is that the splitting of the expansion into GAS subspaces increases the overall cost of the sigma vector computation. This was already observed in the original article, where the authors noted, for example, an increase from about 500 to 710 min when increasing the number of GAS spaces from 6 to 8 in a CAS(20,20) calculation on 32 cores. Similarly, they noticed that the cost of their calculation depended more heavily on the number of GAS subsets than on the number of SDs in the expansion. Thus, while their algorithm shows very good scaling with the number of nodes for a given number of GAS spaces, in practice one would typically increase the number of GAS spaces as the number of nodes increases, resulting in a much worse scaling performance than expected.
We partially solve this problem by making two changes in our implementation compared with that of Vogiatzis et al. First, their implementation only uses MPI, which is designed for distributed memory. This means that even the multicore parallelization within a shared-memory node requires the creation of more GAS subspaces. This problem would be even more critical for our code since we decided not to implement symmetry and thus cannot take advantage of symmetry as a cost-free way to split the CI expansion between nodes. We instead use a hybrid strategy, with the aforementioned MPI scheme for parallelization between nodes but a simpler OpenMP shared-memory multithreading within a node. This allows us to use significantly fewer GAS spaces than in the reference algorithm.
Additionally, as mentioned, we often use more subsets than we have nodes in order to reduce load imbalance. This means we typically have several subsets per node. Our strategy is to merge the subsets back together as much as possible, typically into one to three larger subsets per node, further reducing the fragmentation of our data and thus the loss of performance. This is done in such a way that the efficient vectorized and OpenMP-parallelized loops run over the entire merged set.
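The following sketch shows the load-balancing idea in its simplest form; the greedy largest-first assignment and the subset sizes are illustrative only, and the actual bookkeeping in MultiPsi (including the merging of co-located subsets into contiguous blocks) is more involved.

```python
def assign_subsets(subset_sizes, n_ranks):
    """Greedy largest-first assignment of CI subsets to MPI ranks."""
    loads = [0] * n_ranks
    owners = [[] for _ in range(n_ranks)]
    for idx in sorted(range(len(subset_sizes)), key=lambda i: -subset_sizes[i]):
        rank = loads.index(min(loads))      # least loaded rank so far
        owners[rank].append(idx)
        loads[rank] += subset_sizes[idx]
    return owners, loads

sizes = [120, 80, 75, 40, 35, 20, 10, 5]    # hypothetical subset sizes
owners, loads = assign_subsets(sizes, 3)
print(owners, loads)                        # per-rank totals end up roughly equal
```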
Several CI algorithms have been devised over time. In particular, since the realization by Handy in 198029 that writing Slater determinants as products of α and β strings allows the storage of CI vectors as α-β matrices, which is advantageous for producing efficient code, a few algorithms have been proposed which differ in operation count and vectorization. We opted for the algorithm with the minimum operation count, originally presented by Olsen.30 The two main parts of the algorithm, which we will refer to as the α-α and α-β parts, correspond to the calculation of the σ^αα and σ^αβ contributions, respectively (Algorithms 1 and 2).

FIGURE 1 Example of GAS partitioning of the CI and sigma vector between four nodes, assuming a CAS of 9 electrons in 10 orbitals split into 5 and 5. Each CI subset is labeled by the number of electrons in the first orbital space. For simplicity, only the total number of electrons in the first space is shown, ignoring all possible α and β combinations.
The α-α algorithm is very simple (Algorithm 1). The α-β part is more complex, but follows the same concept (Algorithm 2).
Let us now analyze the best way to implement multithreading for the algorithms displayed here, starting with the simpler αα part. The most obvious approach would be to distribute the tasks over the different threads in the first loop over I_α. This way, we divide both the most time-consuming step (the vectorized I_β loop) and the overheads from the computation of the allowed iajb indices and the J_α index and sign. However, this strategy turned out to be undesirable on modern NUMA architectures, regardless of how we localize the C and σ arrays. For example, if the C and σ arrays are localized such that I_α is sliced between the different memory subunits, then the σ vector of one slice depends on the C vector residing on other slices, requiring (relatively slow) communication. Instead, we moved the multithreading to the vectorized I_β loop itself, and partitioned the σ and C vectors according to the β index. Since in our code the β index is the "fast index" of the σ and CI vectors, the best way to slice the array along the β index was to create individual arrays for each β slice. Thus, each OpenMP thread has its own part of the CI and σ arrays, and no communication between threads is required. The (integral-screened) list of J_β and signs for each I_α is precomputed with OpenMP multithreading of the I_α loop.
For the calculation of the ββ excitations, we first create new copies of the CI and σ vectors, transposed and sliced along the β index, allowing us to use the αα code without any other modification.
For the α-β part, since we have both α and β loops, at least one of them will require non-NUMA-local access. However, the most computationally demanding step is by far the inner vectorized loop over I_α, and it is thus natural to make this section of the code NUMA-local and keep the communication in the less expensive first half of the algorithm. Of course, it would have been possible to instead allocate a different MPI task to each NUMA domain and only use OpenMP for the multithreading within that unit. However, the overheads from splitting the CI expansion into more GAS spaces far outweigh the limited remaining NUMA communication in this algorithm.
The Davidson algorithm needs to keep a few CI vectors and the corresponding σ vectors in memory (since our code does not write the vectors to disk). The subspace can be kept below a maximum size to reduce the memory footprint.
We note that, due to the use of Slater determinants throughout, there is no guarantee that the resulting wavefunction is a spin eigenfunction or has the spin value desired by the user. Such a constraint can be enforced either by an efficient transformation to and from configuration state functions (see e.g., Ref. 31) or by adding a spin-dependent penalty function to the CI optimization (see Ref. 32), and both options are being considered for future implementation in MultiPsi.

| MCSCF orbital optimization
In order to optimize the orbitals, we use an approximate second-order optimization, where we compute the gradient as well as the approximate orbital-orbital part of the Hessian proposed by Chaban et al.33 With this Hessian choice, all gradient and Hessian terms depend on the same set of Fock and Q matrices:

• The inactive Fock matrix:
F^I_pq = h_pq + Σ_i [2(pq|ii) − (pi|iq)]   (8)

• The active Fock matrix:
F^A_pq = Σ_tu γ_tu [(pq|tu) − ½(pt|uq)]   (9)

• The Q matrix:
Q_xt = Σ_uvw Γ_tuvw (xu|vw)

themselves depending on the (state-specific or state-averaged) one- and two-body density matrices γ_tu and Γ_tuvw. Here we have used the convention that Greek letters refer to atomic orbitals, i, j to occupied (inactive) orbitals, t, u, v, w to active orbitals, a, b to virtual orbitals, and p, q, r, s, x to general orbitals. Note that Chaban et al. do not provide an approximate Hessian for active-active rotations. Those are redundant in a CASSCF calculation, but can contribute when using RASSCF or other approximate solvers. For the active-active diagonal Hessian terms, we use an approximate expression constructed from the corresponding elements G of the orbital gradient. The main cost of the orbital optimization is the formation and transformation of the two-electron integrals (pq|rs). In addition to the Fock and Q matrices, these integrals are also needed for the active integrals (tu|vw) required by the CI calculation.
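For a small system where the full MO-basis two-electron integrals fit in memory, these intermediates can be written down directly with numpy. The sketch below assumes an inactive-active-virtual orbital ordering and integrals eri[p,q,r,s] = (pq|rs); it only serves to fix the index conventions and does not reflect how MultiPsi evaluates these quantities for large systems.

```python
import numpy as np

def build_intermediates(h, eri, ni, na, gamma, Gamma):
    """h: one-electron MO integrals; eri[p,q,r,s] = (pq|rs) in the MO basis;
    ni/na: number of inactive/active orbitals; gamma/Gamma: active one- and
    two-body density matrices."""
    o = slice(0, ni)            # inactive orbitals
    a = slice(ni, ni + na)      # active orbitals
    # inactive Fock matrix, Equation (8)
    Fi = h + 2.0 * np.einsum('pqii->pq', eri[:, :, o, o]) \
           - np.einsum('piiq->pq', eri[:, o, o, :])
    # active Fock matrix, Equation (9)
    Fa = np.einsum('tu,pqtu->pq', gamma, eri[:, :, a, a]) \
       - 0.5 * np.einsum('tu,ptuq->pq', gamma, eri[:, a, a, :])
    # Q matrix: Q_xt = sum_uvw Gamma_tuvw (xu|vw)
    Q = np.einsum('tuvw,xuvw->xt', Gamma, eri[:, a, a, a])
    return Fi, Fa, Q
```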
Our code uses the integral routines from VeloxChem.27 The VeloxChem code is a Fock-matrix driven code, able to handle hundreds of Fock matrices simultaneously and tailored to parallel CPU and GPU architectures. The inactive and active Fock matrices are straightforward to compute since they already have the form of a general Fock matrix. However, the (pu|vw) and (tu|vw) integrals needed for the Q matrix and the CI code require a partial MO transformation. Traditionally, this is done by transforming one index at a time, with a resulting aN^4 computational scaling, where a is the number of active orbitals and N the number of basis functions. However, this algorithm is not very well suited to parallel programming.
Instead, we use a Fock-matrix driven transformation:34

J^tu_μν = Σ_ρσ (μν|ρσ) D^tu_ρσ,   with   D^tu_ρσ = C_ρt C_σu,

that is, one Coulomb-type Fock matrix is built for each generalized active-pair density D^tu. While the cost of this half-transformation is formally higher, a^2 N^4, screening reduces it significantly, especially as active orbitals tend to be relatively well localized, and the resulting algorithm is better adapted to parallel computing. In most cases, we found that the most efficient strategy is to scatter the densities across MPI ranks, such that each MPI process computes a subset of all the Fock matrices. This means the different MPI processes may compute some of the same integrals, but the improved screening reduces this penalty. A keyword allows switching to a different parallelization scheme where all nodes compute contributions to all Fock matrices, which can be more advantageous for basis sets with a large number of primitive functions.
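The idea of the Fock-matrix driven transformation can be illustrated with a small numpy sketch using dense AO integrals, without the screening and MPI distribution of the actual implementation: one generalized pair density D^tu is built per active orbital pair and contracted with the AO integrals exactly like an ordinary Coulomb matrix.

```python
import numpy as np

def half_transform(eri_ao, C_act):
    """eri_ao[m,n,r,s] = (mn|rs) in the AO basis; C_act[:, t] are the active MO
    coefficients. Returns J[t,u,m,n] = (mn|tu)."""
    nao, nact = C_act.shape
    J = np.empty((nact, nact, nao, nao))
    for t in range(nact):
        for u in range(nact):
            D_tu = np.outer(C_act[:, t], C_act[:, u])          # D^{tu}_{rho sigma}
            J[t, u] = np.einsum('mnrs,rs->mn', eri_ao, D_tu)   # Coulomb-like build
    return J

# The fully active integrals needed by the CI step follow from a further
# two-index transformation of each matrix: C_act.T @ J[t, u] @ C_act gives
# (vw|tu), which equals (tu|vw) by permutational symmetry.
```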
The inactive Fock matrix and the half-transformed integrals are computed simultaneously, thus requiring only a single computation of the integrals. However, while the inactive Fock matrix and active integrals are needed before the CI step, the active Fock and Q matrices can only be computed once the CI is finished, as they depend on the active density matrices. The Q matrix simply requires storing the three-quarter-transformed integrals, to be contracted with the two-body density matrix. However, the active Fock matrix needs to be computed a posteriori, thus requiring a second integral pass for each MCSCF iteration.
The overall structure of an MCSCF iteration is as follows:

1. Compute the inactive Fock matrix and the active integrals (pu|vw).
2. Converge the CI problem to reasonable accuracy (linked to the current orbital gradient norm).
3. Compute the (state-averaged) one- and two-body density matrices.
4. Compute the active Fock matrix.
5. Form the gradient and approximate Hessians.
6. Check for convergence and exit if converged.
7. Perform a quasi-Newton step, with the augmented Hessian technique and a BFGS update (see the extensive discussion of MCSCF convergence in Ref. 28).
8. Canonicalize the orbitals and return to the first step.

We note that the orbitals are canonicalized at each iteration, meaning that the active density matrix is diagonal and the inactive Fock matrix is diagonal in the inactive and virtual blocks. We found that the canonicalization step often improves convergence noticeably. With this set-up, as long as a reasonable active space is chosen, convergence is most often very rapid, whether the starting orbitals come from a prior Hartree-Fock, DFT or MCSCF calculation or from a simple semi-empirical guess. MultiPsi implements a simple extended Hückel guess and a built-in orbital viewer/active space selector designed for Jupyter-notebook environments to facilitate a quick and user-friendly workflow.

| MCSCF linear response
For MCSCF, electronically excited states are often computed using a state-specific or state-averaged optimization. However, this approach has some drawbacks, in particular the fact that all orbitals involved in the excitation process need to be included in the active space. Also, the quality of state-averaged results degrades with an increasing number of states, while state-specific optimizations can be very difficult to converge. For these reasons, using linear response can be advantageous.18,35,36 MCSCF linear response (MCLR) calculations usually consist in solving

(E^(2) − ω S^(2)) X = 0,

with ω the excitation energy, X the so-called response vector, and E^(2) and S^(2) the electronic Hessian and the second derivative of the metric matrix, respectively. In this form, it is clear that the equation is simply a generalized eigenvalue equation. For MCSCF, using the parameterization suggested above in Equation (1), the response vectors should contain both orbital rotation parameters and CI parameters (or equivalently CI state-transfer parameters). However, as suggested in Ref. 36, it is most convenient to separate the CI and orbital parameters, since the response tends to converge at different rates with respect to these two sets of parameters.
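For a small, dense toy problem the paired structure of this generalized eigenvalue equation is easy to verify numerically. The matrices below are invented purely for illustration; a real MCLR calculation never builds E^(2) and S^(2) explicitly.

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[2.0, 0.1],
              [0.1, 3.0]])            # "excitation-excitation" block (toy)
B = np.array([[0.05, 0.0],
              [0.0, 0.05]])           # "excitation-de-excitation" coupling (toy)
E2 = np.block([[A, B], [B, A]])       # electronic Hessian (toy)
S2 = np.block([[np.eye(2), np.zeros((2, 2))],
               [np.zeros((2, 2)), -np.eye(2)]])  # metric (toy)

omega, X = eig(E2, S2)                # solves E2 X = omega S2 X
print(np.sort(omega.real))            # eigenvalues appear in +/- pairs
```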
In its full form (which we will denote multiconfigurational random-phase approximation, MC-RPA), the response vectors include both excitation and de-excitation parameters, which leads to a paired structure of the equations and their solutions. To exploit this, it is most convenient to express the response vectors as combinations of symmetric (gerade) and antisymmetric (ungerade) trial vectors.37,38 Like CI, standard linear response is also solved through a Davidson or related algorithm, and the most expensive step is once again the construction of the σ vector. However, in MCSCF linear response, the response and σ vectors contain CI and/or orbital components, and the expression depends on whether the Hessian is applied to an orbital vector or a CI vector, thus giving four different blocks.
The CI-CI σ term is very similar to the one in the CI optimization, σ^CC = ĤC − E_0 C. The other diagonal term, namely the orbital-orbital term, is most conveniently expressed as a standard MCSCF gradient, but constructed with one-index-transformed integrals,

h̃_pq = Σ_m (κ_pm h_mq + κ_qm h_pm)

and

(pq|rs)̃ = Σ_m [κ_pm (mq|rs) + κ_qm (pm|rs) + κ_rm (pq|ms) + κ_sm (pq|rm)],

with κ the orbital rotation matrix of the trial vector.
When constructing the two-electron part of the one-index-transformed inactive and active Fock matrices, we can see that each contains two contributions where the transformed index is one of the summed indices (i and t, u in Equations (8) and (9), respectively) and two where the transformed index is one of the remaining indices of the Fock matrix. The latter are easily computed by multiplying the ground-state inactive and active Fock matrices by the κ matrix. Doing so also accounts for the one-index-transformed one-electron integrals. For the former terms, the simplest approach is to one-index transform the ground-state density matrix with the κ matrix and then compute a standard Fock-matrix two-electron contraction with this transformed density matrix. This means computing two Fock matrices (inactive and active) for each trial vector, and for large molecules this is the dominant cost. As in the MCSCF algorithm, the construction of these Fock matrices is distributed over the nodes.
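A minimal MO-basis sketch of this contraction, assuming dense integrals and one particular sign convention for κ (the production code works with AO quantities and VeloxChem's distributed Fock builder), reads:

```python
import numpy as np

def one_index_transformed_fock(eri, D, kappa):
    """eri[p,q,r,s] = (pq|rs); D is the (inactive or active) density matrix and
    kappa the antisymmetric orbital-rotation matrix of the trial vector."""
    D_t = kappa @ D + D @ kappa.T               # one-index transformed density
    J = np.einsum('pqrs,rs->pq', eri, D_t)      # Coulomb-like term
    K = np.einsum('prsq,rs->pq', eri, D_t)      # exchange-like term
    return J - 0.5 * K                          # two-electron Fock contribution
```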
The orbital-orbital σ also contains a Q-matrix term. Again, one contribution corresponds to a simple multiplication of the ground-state Q matrix by the κ matrix, but the other three are more costly, as they involve two-electron integrals with two active and two general (AO) indices, (tu|μν) and (tμ|uν). Computing these integrals on the fly for each trial vector or each iteration would be too costly; they are instead computed only once at the start of the calculation and saved for subsequent use. In a recent efficient implementation, the use of the resolution of the identity reduced this storage bottleneck.18 Here, instead, we store these intermediates by making use of the entire distributed memory. This also effectively constitutes the MPI parallelization strategy for these intermediates.
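A sketch of this distributed-storage strategy using mpi4py is shown below; the toy dimensions and the round-robin assignment of (t, u) pairs are illustrative only.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_act, n_ao = 14, 200                     # toy dimensions
pairs = [(t, u) for t in range(n_act) for u in range(t + 1)]
my_pairs = pairs[rank::size]              # round-robin share of (t, u) pairs

# Each rank allocates, fills and later contracts only its own slice of the
# half-transformed integrals (tu|mu nu); no rank ever holds the full set.
local_ints = np.zeros((len(my_pairs), n_ao, n_ao))
```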
The orbital-CI terms require the active Fock matrix and Q matrix built with transition densities, which can be computed directly from these half-transformed intermediates at low cost. Finally, the CI-orbital term corresponds to a CI σ vector construction with the fully active one-index-transformed integrals, which are readily available from the orbital-orbital term.
For MC-RPA, we only store the "excitation part" of the vectors, and compute both the gerade and ungerade sigma vectors, since both can be obtained simultaneously for the cost of one.

| Program structure
The use of Python to varying degrees has been adopted by several quantum chemistry programs in recent years, notably Psi439 and PySCF,40 both of which implement, among other things, an MCSCF module. In MultiPsi, the philosophy is to use Python as much as possible, gaining efficiency through the use of numpy linear algebra routines.41 The core CI routines are written in C++ and exposed to Python using pybind11, and the integrals are mostly accessed through the Fock matrix constructor of VeloxChem. The general structure of the program is shown in Figure 2.
We note that the use of Python is not only beneficial for developers but also offers a high level of interactivity, which benefits users and creates a great platform for learning. MultiPsi is thus one of the programs used to develop the eChem learning platform.42

FIGURE 2 Structure of the program.

| CI performance

First, we demonstrate the performance of our parallel CI implementation. As a first test case, we use pyrene, including the π system in the active space, that is, CAS(16,16), corresponding to 82,824,885 determinants. The OpenMP scaling obtained on a single 64-core AMD EPYC processor, with 8 NUMA regions of 8 cores each, is shown in Figure 3a.
The results are close to linear and even show some superlinear scaling, probably due to the parallelization introducing better batching of the calculation. With all 64 cores, a single σ calculation takes only about 15 s.
Going toward even larger sizes, we can assess the efficiency of the MPI parallelization beyond a single node on benzophenanthrene, for which the π system constitutes a CAS(18,18), amounting to more than a billion determinants (1,181,976,510 to be precise). The results are shown in Figure 3b. Here, the calculation takes 44 s on 16 MPI ranks (1024 cores).
We note that the MPI scaling is less favorable for smaller expansions, as the size of each GAS space becomes smaller and the benefits of parallelization are outweighed by the overheads. At this point, a better strategy for state-averaged MCSCF or linear response, where many sigma vectors have to be computed at each iteration, is to let each node compute a fraction of the sigma vectors.
Still, with this favorable scaling, the main limit to the size of the CI expansion that can be reached is the amount of distributed memory available. To showcase the sizes that can be reached, we re-explore a molecule that the author investigated a few years ago. The [Ni-Fe] hydrogenase in its native structure does not have any symmetry, and for an accurate description of the singlet-triplet gap, we used an active space containing 22 electrons in 22 orbitals for the triplet state.43 At the CASSCF level, this corresponds to 418,151,049,316 determinants.
The largest conventional σ vector computed so far was by Vogiatzis et al. with more than 900 billion determinants, but it was a single σ step, not a fully optimized CI, and it made use of symmetry.23 Here, we successfully converged the full CASCI for our hydrogenase model using 128 nodes of two 64-core AMD EPYC processors each, or a total of 16,384 cores. The calculation took 2 h 54 min per CI iteration and converged in 21 iterations, for a total calculation time of a little over 2 days. This is, to our knowledge, by far the largest conventional full CI calculation ever converged.

| MCSCF and linear response performance
MCSCF and MCSCF linear response include both CI sigma calculations and orbital Fock matrices. Since we have already looked in detail at the CI part, we focus here on the performance of the orbital part. For this, we choose a larger molecule with a decent-size active space, as the orbital part also depends on the active space size, but few determinants. First, we look at the substituted iron porphyrin shown in Figure 4. This molecule has 91 atoms, and with the def2-SV(P)44 basis set, this corresponds to 1012 basis functions. Our active space is a CAS(15,11), which with a sextet spin multiplicity amounts to a small 5082 Slater determinants.

FIGURE 3 Scaling of the wall time of the CI step in two different calculations, with the number of OpenMP threads or MPI ranks.
In Figure 4, we illustrate the parallel performance of both the entire MCSCF step and the initialization and iteration time of the MCSCF linear response. The parallelization includes both OpenMP multithreading (from 1 to 32 cores) and MPI multinode parallelization (from 1 to 4 nodes). Note that we chose 10 states for the MCLR calculation, which gives up to 20 Fock matrices per iteration. Since the code tries to distribute the Fock matrices over the different nodes, the best performance is obtained when the number of Fock matrices is a multiple of the number of nodes. The speedups shown here correspond to the iterations with 20 Fock matrices, and thus we are in the optimal scenario.
Looking first at the MCSCF step, we again see almost ideal speedup with an increasing number of cores, even superlinear at times. Both MCLR steps show a noticeable performance drop going from 16 to 32 cores, probably due to the communication overheads in the dual-socket node; however, the speedup is back to linear beyond one node.
With this favorable scaling, we can target larger molecules than ever before. As an example, we built a model system of the Green Fluorescent Protein (GFP) by taking all residues within 6 Å of the chromophore in the atomic-resolution X-ray structure 2WUR,45 for a total of 701 atoms; see Figure 5. With a modest def2-SV(P) basis, this amounts to 5250 basis functions. The active space covers the entire π system of the chromophore, CAS(14,13).
The CASSCF calculation took 1 h 35 min per iteration on 8 × 128 cores, and the subsequent linear response calculation of the five lowest excited states took 2 h 50 min for the initialization and 1 h 30 min on average per iteration on 16 × 128 cores. These are, to our knowledge, the largest CASSCF and CASSCF linear response calculations ever performed, especially considering that they do not make use of density fitting or similar approximations.

| SUMMARY AND OUTLOOK
MultiPsi is designed as a flexible platform for multiconfigurational calculations of ground and excited states, currently implementing a general MCSCF code with excited states computed through state-averaging or linear response. It is designed to work both in interactive teaching or development environments and on modern massively parallel architectures, allowing the calculation of larger molecular systems than ever possible before. It is freely available under the GPLv3 license and easily installed using regularly updated conda builds.
We plan to continually expand the program's capabilities, with our main priorities being analytical gradients, polarizable embedding, and several forms of multireference correlated methods, including multiconfigurational density functional theory.

ALGORITHM 1
Minimum-operation-count algorithm for σ^α-α

loop over all I_α
  loop over all possible iajb excitations
    J_α, sgn = f(I_α, ia, jb)
    σ^αα(I_α, I_β) += sgn × (ia|jb) × C(J_α, I_β)   ⊳ vectorized over I_β
  end loop
end loop

ALGORITHM 2
Minimum-operation-count algorithm for σ^α-β

loop over all ia
  loop over all possible
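For reference, a numpy transcription of Algorithm 1 could look as follows. This is illustrative only: the production kernel is written in C++ with OpenMP, and the list of allowed excitations, signs and integrals for each I_α is assumed to have been precomputed as described in the text.

```python
import numpy as np

def sigma_alpha_alpha(C, excitation_table):
    """C[I_alpha, I_beta]: CI coefficients stored as an alpha-beta matrix.
    excitation_table[I_alpha]: precomputed list of (J_alpha, sign, integral)
    entries for the allowed excitations out of string I_alpha."""
    sigma = np.zeros_like(C)
    for I_alpha, entries in enumerate(excitation_table):
        for J_alpha, sign, integral in entries:
            # the update is vectorized over all I_beta at once
            sigma[I_alpha, :] += sign * integral * C[J_alpha, :]
    return sigma
```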

FIGURE 4 Parallel performance for MCSCF and linear response on Fe-tetrakis(4-sulfonatophenyl)porphyrin. The calculations are run on nodes consisting of dual Intel Xeon Gold 6130 CPUs with 16 cores each, resulting in 32 cores per node.

FIGURE 5 Visualization of an active π* orbital of the CASSCF calculation on a 701-atom model of the GFP active site.