Keywords:

  • quantum chemistry;
  • parallelization;
  • force constant;
  • excited-state gradient;
  • AOFORCE;
  • EGRAD

Abstract


The programs ESCF, EGRAD, and AOFORCE are part of the TURBOMOLE program package and compute excited-state properties and ground-state geometric Hessians, respectively, for Hartree-Fock and density functional methods. The range of applicability of these programs has been extended by allowing them to use all CPU cores on a given node in parallel. The parallelization strategy is not new and duplicates what is standard today in the calculation of ground-state energies and gradients. The focus is on how this can be achieved without extensive modifications of the existing serial code. The key ingredient is to fork off worker processes with separated address spaces as they are needed. Test calculations on a molecule with about 80 atoms and 1000 basis functions show good parallel speedup up to 32 CPU cores. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2011


Introduction


For more than a decade, the speed of CPUs used in commodity computing has doubled every 18–24 months. Exponential growth of this kind was first formulated in the famous Moore's law,1 which states in its original form that the number of transistors on a single chip doubles every year. In recent years, however, the speed increase of a single CPU core has leveled off considerably; instead, more and more CPU cores have been put on a single chip. This development has a clear consequence: any large-scale scientific calculation that is limited to a single CPU core will require a wall-clock time that is no longer acceptable, because even entry-level computers are equipped with several CPU cores in a single node ("box"). Although eight CPU cores per node have been a standard feature for a few years, low-cost nodes with 32 or more CPU cores are available now. This creates a pressing demand to actually use these CPU cores in parallel in quantum chemical calculations.

The two most important standards and paradigms in parallel programming are MPI (message passing interface)2 and OpenMP.3 MPI allows parallelization across the CPU cores of several nodes, but MPI programs are somewhat more difficult to invoke for the average user because they depend on external libraries and programs. This is mostly because the startup of all the individual processes (workers) on the nodes has to be organized. OpenMP programs, on the other hand, are much easier to use: the number of CPU cores is either determined automatically or specified by the user in a very simple way, and apart from that, the program is invoked in exactly the same way as in a serial calculation. The main drawback of OpenMP and any other scheme based on shared memory is that only the CPU cores of a single node can be used. If one wants to use 128 CPU cores or more, this essentially means that one has to use MPI.

OpenMP is based on lightweight processes or threads, that is, all parallel processes access the same memory. This requires a separation of public and private data: public data can be accessed by all threads, whereas private data belong to one and only one thread. The problem is that many quantum chemistry codes contain parts that are quite old (legacy code, often simply called "dusty decks"). In these codes, compute kernels often communicate with each other through static data ("common blocks" in the FORTRAN programming language). This structure prevents OpenMP parallelization because, clearly, if one thread stores an intermediate result in such a static memory location, it will soon be overwritten by another thread, which leads to errors. As a result, the legacy code has to be reorganized considerably before an OpenMP parallelization can even be started, and this reorganization often involves a much higher programming effort than the parallelization itself.
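A small, self-contained C sketch may help to illustrate the point (this is not TURBOMOLE code; the static variable scratch and the toy kernel are invented for illustration). The kernel parks an intermediate result in static storage, the C analogue of a Fortran common block; threads sharing this location would overwrite each other's intermediates, whereas processes created with fork each receive a private copy of the static data, so the kernel itself needs no modification:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static double scratch;            /* "common block": shared by threads, private per process */

static double legacy_kernel(int task)
{
    scratch = 2.0 * task;         /* intermediate parked in static storage           */
    return scratch + 1.0;         /* a concurrent thread could corrupt scratch here  */
}

int main(void)
{
    const int nworkers = 4;

    for (int w = 0; w < nworkers; ++w)
        if (fork() == 0) {                        /* child: private address space    */
            printf("worker %d: result %.1f\n", w, legacy_kernel(w));
            _exit(0);
        }
    while (wait(NULL) > 0)                        /* parent reaps the workers        */
        ;
    return 0;
}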

More than 15 years ago (the OpenMP standard had not yet been specified), the present author developed a technique for shared-memory parallelization that does not require such a reorganization of the code; it was already used in the first publication4 on the implementation of density functional (DFT) methods within the TURBOMOLE5, 6 program package. The technique makes parallelization quite easy (see next section), and it was possible to parallelize the DFT energy and gradient programs of that time within 2 weeks. The present work applies this technique to parallelize the AOFORCE, ESCF, and EGRAD programs of a current TURBOMOLE7 release. The main use of the AOFORCE program8 is the computation of ground-state harmonic force constants at the Hartree-Fock or DFT level. In many typical DFT investigations, this is by far the most CPU time-consuming step, so the lack of a parallel version has been a major deficiency of TURBOMOLE. The ESCF program9, 10 calculates SCF instabilities, frequency-dependent polarizabilities and optical rotations, and electronic excitation energies including the transition moments needed to compute UV/vis and CD spectra. Because we do not parallelize programs but rather individual subroutines (see below), the parallelization of ESCF is essentially a by-product of having parallelized AOFORCE: the numerically intensive part of ESCF is the solution of response equations (also called coupled perturbed Hartree-Fock or coupled Kohn-Sham equations), which also occurs in a force constant calculation, and both programs use the same code for this purpose. The EGRAD program11, 12 calculates geometrical derivatives of excited-state energies and/or ground-state polarizabilities. Again, after parallelizing AOFORCE, little extra work (related to geometrical derivatives) was left to be done for EGRAD.

Shared-Memory Parallelization with Separated Address Spaces


It seems natural to start with the parallelization of the calculation/processing of two-electron integrals and the numerical integrations, as nearly all of the computational effort goes into these steps. Here, we followed a recipe already well established in the literature4, 13, 14: two-electron integrals as well as grid points are partitioned into disjoint subsets, and each parallel worker performs the computations associated with its subset. Although this allows existing code to be adapted with minimal changes, the approach cannot be extended to massive parallelism because it is not data parallel (replicated-data approach). This means that each parallel worker has a memory requirement of the same order of magnitude as that of a serial calculation, and (even worse) the amount of communication for a given problem grows with the number of parallel workers.

The next question is how to create the parallel workers and keep them busy. As explained in the introduction, we refrained from using OpenMP because we do not want to reorganize large amounts of legacy code. Instead, we create new processes using the UNIX system call fork and exploit its semantics. The process created by fork is an exact replica of the original one, such that the parallel workers inherit all the information (location of atoms, basis sets, density matrices, etc.) they need. The program starts as a single process, reads the input files, and does all required initialization. Everything proceeds exactly as in the serial case until a subroutine that processes two-electron integrals or performs a numerical integration is entered. In a parallel calculation, the process then replicates itself by forking off n parallel workers, where n is the number of CPU cores to use. At the same time, communication channels between the original process (called "master" from now on) and the workers are created with the socketpair system call. The workers then start to process two-electron integrals or perform a numerical integration, whereas the master, which does not do any such calculations, just takes care of the dynamic load balancing (see below). At the end of such a parallel step, a global sum operation is performed and the worker processes terminate. New workers are created at a later time when they are needed again. This scenario is shown schematically in Figure 1: the master (bold line) forks off four workers (thin lines) in each step of the calculation, and the workers terminate (filled circles) when the work associated with that step is done.
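As a minimal sketch of this process layout (plain C with POSIX calls; names, sizes, and the trivial "work" are invented for illustration, and error handling is omitted), the following program initializes an array in the master, forks off workers that inherit it, lets each worker send its partial result back through its socketpair channel, and finally performs the global sum in the master before reaping the terminated workers. The work is split statically here; the dynamic load balancing used in the actual implementation is sketched below.

#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define N        1000000
#define NWORKERS 4

int main(void)
{
    /* "Initialization" done once by the master; the workers inherit this array. */
    double *data = malloc(N * sizeof *data);
    for (long i = 0; i < N; ++i) data[i] = 1.0 / (i + 1);

    int chan[NWORKERS][2];
    for (int w = 0; w < NWORKERS; ++w) {
        socketpair(AF_UNIX, SOCK_STREAM, 0, chan[w]);     /* master<->worker channel */
        if (fork() == 0) {                                /* ----- worker -----      */
            double partial = 0.0;
            for (long i = w; i < N; i += NWORKERS)        /* disjoint subset of work */
                partial += data[i];
            write(chan[w][1], &partial, sizeof partial);  /* send contribution       */
            _exit(0);                                     /* worker terminates       */
        }
        close(chan[w][1]);                                /* master keeps chan[w][0] */
    }

    /* ----- master: global sum over the workers' contributions ----- */
    double total = 0.0, partial;
    for (int w = 0; w < NWORKERS; ++w) {
        read(chan[w][0], &partial, sizeof partial);
        total += partial;
    }
    while (wait(NULL) > 0) ;                              /* reap the workers        */
    printf("global sum = %.6f\n", total);
    free(data);
    return 0;
}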


Figure 1. Schematic view of processes created during a “four-processor” run. The bold line indicates the master process, which performs the serial work outside the parallel regions and dynamically balances the load within. Worker processes (thin lines) are forked off and terminate (solid circles) by the end of the parallelized subroutine.


The separation of the address spaces when forking off the worker processes is an essential prerequisite for easy parallelization of legacy code. Each worker process may use any static data at its own discretion, without any interference with the master or the other workers. On the other hand, this separation requires that the communication between the master and the workers be done by exchanging messages, so that the parallelization is not entirely different from the MPI case; the main difference is that the master need not transfer any information to the workers, which inherit it when being forked off. In principle, the present shared-memory parallelization needs less memory than an equivalent MPI run on the same machine. The reason is that after forking, only those parts of the address space that are actually updated (i.e., written to) are physically duplicated ("copy-on-write" mechanism). In typical quantum chemical calculations, more than half of the address space (code, static data, density matrices used as input to the two-electron routines, etc.) is not updated within the compute kernel and thus resides only once in memory, irrespective of the number of worker processes generated. In some sense, we leave the separation of public and private data, which involves so much handwork in the case of OpenMP parallelization, to the operating system.

Once it is established how to create the workers, the next question is how to keep them busy. Ideally, the work is distributed such that all workers finish simultaneously. Because the work associated with, say, a subset of two-electron integrals depends very strongly on the angular momentum and the contraction of the basis functions in that particular subset, dynamic load balancing is absolutely necessary. The duty of the master process is thus to keep the workers busy by distributing the work in small packets (tasks). As the focus of this investigation is on keeping things simple without sacrificing (too much) efficiency, a global counter mechanism is used to implement dynamic load balancing. Such a global counter can be viewed as a ticket machine, where the workers draw a (numbered) ticket whenever they are ready to do some work. They then perform the computation (task) associated with the number on the ticket. These numbers are distributed consecutively, and a given number is handed out only once, to exactly one worker. When a worker needs no more tickets, it informs the master. As soon as all workers have indicated that they need no more tickets, the master subroutine issuing the tickets terminates.
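The ticket machine can be sketched as follows (again plain C with invented names and task counts, no error handling; not the TURBOMOLE implementation): each worker repeatedly asks the master for the next ticket over its socketpair, the master hands out consecutive numbers exactly once each from a poll loop, a ticket number beyond the task list tells a worker that nothing is left, and the master's counter routine returns once every worker has announced that it needs no more tickets.

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4
#define NTASKS   20
#define DONE     (-1)                  /* "I need no more tickets"                */

int main(void)
{
    int sv[NWORKERS][2];

    for (int w = 0; w < NWORKERS; ++w) {
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv[w]);
        if (fork() == 0) {                                /* ----- worker -----      */
            int fd = sv[w][1], ticket, request = 0;
            for (;;) {
                write(fd, &request, sizeof request);      /* draw a ticket           */
                read(fd, &ticket, sizeof ticket);
                if (ticket >= NTASKS) break;              /* task list exhausted     */
                usleep((1 + ticket % 5) * 1000);          /* "work" on this task     */
            }
            request = DONE;                               /* tell the master         */
            write(fd, &request, sizeof request);
            _exit(0);
        }
        close(sv[w][1]);                                  /* master keeps sv[w][0]   */
    }

    /* ----- master: the global counter ("ticket machine") ----- */
    struct pollfd pfd[NWORKERS];
    for (int w = 0; w < NWORKERS; ++w)
        pfd[w] = (struct pollfd){ .fd = sv[w][0], .events = POLLIN };

    int next = 0, active = NWORKERS;
    while (active > 0) {
        poll(pfd, NWORKERS, -1);
        for (int w = 0; w < NWORKERS; ++w) {
            if (!(pfd[w].revents & POLLIN)) continue;
            int msg;
            read(pfd[w].fd, &msg, sizeof msg);
            if (msg == DONE) { active--; pfd[w].fd = -1; continue; }
            write(pfd[w].fd, &next, sizeof next);         /* hand out ticket `next`  */
            next++;                                       /* each number given once  */
        }
    }
    while (wait(NULL) > 0) ;
    printf("distributed tickets 0..%d to %d workers\n", NTASKS - 1, NWORKERS);
    return 0;
}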

To be more specific and to demonstrate how simple this is in practice, Figure 2 shows (in some sort of pseudo-code) all the modifications made to the subroutine d2jk, which contracts the two-particle density matrix with second-derivative two-electron integrals to update the molecular Hessian. Only the few lines indicated by a solid bar have been added to the code. The utility function InquireSMP indicates whether this is an SMP-parallel calculation. In our implementation, this is triggered either by setting an environment variable or by specifying the number of CPUs in the input file, unless the calculation runs in parallel under a message passing software such as MPI or GlobalArrays: because some parts of the TURBOMOLE package are parallelized this way, we do not want to interfere in these cases. If it is not an SMP-parallel calculation, the subroutine proceeds as it did before parallelization, except for the order in which the two-electron integrals are computed. The function nextval returns the number of the next "ticket," such that the subroutine within a worker process handles only a subset of the two-electron integrals. The statements with a hollow bar at the left have been modified: First, the name of the subroutine has been changed because the original name (d2jk) is now used for a wrapper subroutine (see below). Second, the loops with the loop variables i and j have been modified to count backward. This ensures that, on average, the more expensive two-electron tasks are processed earlier; good load balancing requires the tasks at the end of the list to be small so that they can fill the gaps (a toy C rendition of this kernel layout is given after Fig. 2).

A new wrapper with the name of the original subroutine encloses the original compute kernel (see Fig. 3). If it is not an SMP-parallel calculation, the "original" routine is called without further ado. In a parallel calculation, after the workers have been forked off, each worker clears the target data, calls the compute kernel (as described above), which adds onto the target data, tells the master that no more tickets are needed (subroutine end_nextval), sends the data to the master upon request, and terminates. During this time, the master calls the global counter server routine nextval_server, which returns as soon as all workers have called end_nextval; the master then gathers the computed data from the workers, adds it onto the target (in the case of d2jk: the Hessian), and finally waits for the termination of the worker processes. It is important to note that the master process does not do any computational work except distributing the "tickets" while the workers are busy, and it consumes only a tiny amount of CPU cycles during this time (see last section). As a result, we can fork off as many workers as there are CPU cores, and each worker gets >99% CPU usage even though the number of active processes (including the master) is larger than the number of available CPU cores. This is different from most MPI implementations, where dynamic load balancing requires a dedicated MPI task and the number of workers doing actual computation is one less than the number of available CPUs.


Figure 2. Layout of a parallelized two-electron routine. Only the statements indicated by a solid bar at the left side have been added to the original code; those with a hollow bar have been modified.

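As announced above, here is a toy C rendition of the kernel layout of Figure 2 (hypothetical names and a fake Hessian contribution; the nextval stub below simply counts 0, 1, 2, ..., which reproduces the serial path, whereas in a parallel worker it would request the next number from the master as sketched earlier): the shell-pair loops run backward so that expensive pairs tend to come first, and a pair is processed only when its running index matches the ticket currently held.

#include <stdio.h>

#define NSHELL 6                      /* toy number of basis-function shells      */

static int serial_next = -1;

/* Stand-in for nextval: serial counting here; a parallel worker would query the
 * master through the socketpair channel instead.                                 */
static int nextval(void) { return ++serial_next; }

/* Toy analogue of the d2jk kernel. */
static void d2jk_kernel(double *hessian)
{
    int icount = 0;
    int next = nextval();
    for (int i = NSHELL - 1; i >= 0; --i)            /* backward: big tasks first  */
        for (int j = i; j >= 0; --j) {
            if (icount++ != next) continue;          /* not our ticket: skip pair  */
            hessian[0] += 1.0 * (i + 1) * (j + 1);   /* fake integral contribution */
            next = nextval();                        /* draw the next ticket       */
        }
}

int main(void)
{
    double hessian[1] = { 0.0 };
    d2jk_kernel(hessian);
    printf("accumulated %.1f over %d shell pairs\n",
           hessian[0], NSHELL * (NSHELL + 1) / 2);
    return 0;
}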


Figure 3. Layout of a wrapper routine that organizes the parallel calculation. All computation is done within the “original” compute kernel (see Fig. 2).


Other two-electron and numerical integration routines have been parallelized in the same way. In the numerical integrations, it is the loop over batches of grid points in which the step counter is incremented. The target data to be updated are either the Hessian, the right-hand sides of the response equations, or the matrices constructed in the iterative solution of the response (also known as "coupled Kohn-Sham") equations. Note that after parallelizing such a subroutine, all programs that use that particular subroutine can run this part of the calculation in parallel.

Performance Analysis


B3LYP Calculations

To assess the performance of our parallelization, benchmark calculations were done for the 12-helicene molecule. It consists of 12 fused benzene rings that form a helix, giving a reasonably compact (globular) molecule with 50 carbon and 28 hydrogen atoms. This molecule was chosen because molecules of this size and shape occur in many routine applications of DFT. The molecular geometry (C2 symmetry) has been taken from the TURBOMOLE benchmark suite; point-group symmetry has been exploited in the calculations. Figure 4 shows an ORTEP plot15 of the molecule. We first report on calculations using the B3LYP16 exchange-correlation functional as implemented in TURBOMOLE. This functional contains a portion of Fock exchange, so the Hartree and Fock terms are calculated using four-center two-electron integrals. The calculations use the TZVP basis set,17 a typical basis set for routine applications, which gives 1118 basis functions for this molecule. As the programs ESCF and EGRAD show virtually identical behavior, we show timings only for EGRAD and AOFORCE; the measured CPU and wall-clock times are given in Tables 1 (for EGRAD) and 2 (for AOFORCE). Timings were obtained on two different machines: a small one with two quad-core Intel Nehalem CPUs, giving eight CPU cores, and, to investigate the behavior on a large number of CPU cores, a big one with eight quad-core AMD Opteron CPUs (32 CPU cores).


Figure 4. ORTEP15 plot of the 12-helicene molecule C50H28.


Table 1. Parallel Performance of EGRAD (Calculations on a 12-Helicene C50H28 Molecule, TZVP Basis Set with 1118 Basis Functions, B3LYP Functional)
Data for an eight-core node (two Nehalem X5500 CPUs):

  #Workers        1      2      4      6      8
  CPU(a)        486    491    496    519    518
  Wall(b)       486    246    125     87     66
  CPU/wall(c)   1.0    2.0    4.0    6.0    7.9
  Speedup(d)    1.0    2.0    3.9    5.6    7.4

Data for a 32-core node (eight Opteron 8378 CPUs):

  #Workers        1      8     16     24     32
  CPU(a)        751    773    788    807    820
  Wall(b)       751     98     51     35     28
  CPU/wall(c)   1.0    7.9   15.5   22.8   29.7
  Speedup(d)    1.0    7.7   14.8   21.2   27.2

  (a) CPU time (in minutes) summed over all processes.
  (b) Wall-clock time (in minutes) of the calculation.
  (c) This number is probably the best measure for the "real-world" speedup (see text).
  (d) Wall-clock time ratio of serial and parallel calculation.

A coarse look at the data already shows that the wall-clock time required to complete the calculation is dramatically reduced by using more and more CPU cores, demonstrating that the parallelization was successful. More detailed information can be obtained from the ratio of the consumed CPU time (summed over the master process and all terminated workers) to the required wall-clock time. A deviation from the ideal factor (namely, the number of CPU cores used) may result from insufficient parallelization of the code or from improper load balancing. The measured data are only compatible with (virtually) perfect load balancing and a remaining serial part of the calculation that consumes less than 1% of the total CPU time: according to Amdahl's law,18 a serial portion of 1% would limit the speedup to less than 7.5 with eight workers and to less than 25 with 32 workers. Note that only a few "hot spots" of the code, namely the calculation/processing of the four-center two-electron integrals and the numerical integrations, have been parallelized, and the relative weight of the two-electron steps in particular grows if the basis set is extended for a given molecule. Although it would be quite straightforward to apply our parallelization technique to the one-electron integrals as well, this would not improve matters much: reducing the serial part to something well below 1% would probably require modifications all over the place.
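For reference, these limits follow directly from Amdahl's law for a serial fraction s and N workers:

  S(N) = \frac{1}{s + (1 - s)/N}, \qquad
  S(8)\big|_{s=0.01} = \frac{1}{0.01 + 0.99/8} \approx 7.5, \qquad
  S(32)\big|_{s=0.01} = \frac{1}{0.01 + 0.99/32} \approx 24.4 .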

Although the data discussed so far are quite convincing, the average user is more interested in how much faster, in terms of wall-clock time, results can be obtained compared with a serial calculation. The parallel speedup defined this way is also given in Tables 1 and 2, and it is always smaller than the ratio of total CPU and wall-clock times. The reason is quite obvious: the CPU time, summed over the master and all worker processes, grows with the number of CPU cores used. Although the increase in CPU time is mostly below 10%, it reaches 23% for the AOFORCE calculation on the Opteron machine (Table 2). A small increase in the total CPU time when using more and more CPU cores is to be expected because there is always some overhead in parallel calculations (e.g., communication, summing up partial results), but this seemingly large overhead called for a more detailed investigation.

Table 2. Parallel Performance of AOFORCE (Calculations on a 12-Helicene C50H28 Molecule, TZVP Basis Set with 1118 Basis Functions, B3LYP Functional)
Data for an eight-core node (two Nehalem X5500 CPUs):

  #Workers          1       2       4       6       8
  CPU(a)        9,041   9,250   9,355   9,919  10,073
  Wall(b)       9,041   4,631   2,348   1,664   1,272
  CPU/wall(c)     1.0     2.0     4.0     6.0     7.9
  Speedup(d)      1.0     2.0     3.9     5.4     7.1

Data for a 32-core node (eight Opteron 8378 CPUs):

  #Workers          1       8      16      24      32
  CPU(a)       15,612  16,600  16,810  17,725  19,155
  Wall(b)      15,612   2,100   1,083     776     644
  CPU/wall(c)     1.0     7.9    15.5    22.8    29.8
  Speedup(d)      1.0     7.4    14.4    20.1    24.3

  (a) CPU time (in minutes) summed over all processes.
  (b) Wall-clock time (in minutes) of the calculation.
  (c) This number is probably the best measure for the "real-world" speedup (see text).
  (d) Wall-clock time ratio of serial and parallel calculation.

In the serial AOFORCE calculation (see Table 3), 9% of the CPU time is spent on the calculation/processing of derivative four-center two-electron integrals, another 9% on the numerical integrations (DFT exchange-correlation terms), and 82% on the nondifferentiated two-electron integrals (mostly needed to solve the response equations). In the same calculation run in parallel on 32 CPU cores, the CPU time for the derivative two-electron integrals is essentially unchanged, whereas the CPU time for the numerical integrations increases by 70% and that for the nondifferentiated two-electron integrals by 20%. The CPU time increase is thus very different for different parts of the calculation and correlates well with the intensity of memory usage. This suggests that memory contention rather than parallelization overhead is responsible for the CPU time increase. Note that all calculations have been run on a dedicated node with no interference from other jobs. This means that in the serial calculations the memory bandwidth, the CPU caches, and the lookup table for translating virtual memory addresses into physical memory pages are used exclusively by a single process, whereas many processes share these resources in a parallel calculation. This would also explain why the increase is larger for AOFORCE than for EGRAD, as the former program uses memory more intensively because many response equations are solved simultaneously.

Table 3. Performance of AOFORCE (Calculations on a 12-Helicene C50H28 Molecule, TZVP Basis Set with 1118 Basis Functions, B3LYP Functional)
                            Serial    Swarm(a,b,c)     Parallel(c)
  Deriv. 2e integrals(d)     1,417    1,406 (-1%)      1,426 (+1%)
  DFT(e)                     1,339    1,945 (+45%)     2,282 (+70%)
  2e integrals(f)           12,832   16,596 (+29%)    15,424 (+20%)
  Total CPU                 15,612   20,014 (+28%)    19,155 (+23%)

  CPU times (in minutes) are reported for a serial calculation running on a dedicated machine, for a swarm of 32 identical serial calculations, and for a parallel calculation with 32 workers. All calculations were done on a 32-core node (eight Opteron 8378 CPUs).
  (a) The total CPU time for the 32 members of the swarm varied from 19,216 to 20,470 min.
  (b) Average over the 32 members of the swarm.
  (c) Increase of CPU time relative to the serial calculation in parentheses.
  (d) Calculation/processing of derivative two-electron integrals.
  (e) Numerical integrations.
  (f) Calculation/processing of nondifferentiated two-electron integrals.

To corroborate this conjecture, a "swarm" of 32 identical, simultaneously started serial AOFORCE calculations on the 12-helicene molecule was run on the 32-core Opteron node to simulate the behavior of a serial calculation on a nondedicated node. In Table 3, we also report the average CPU time for the different parts of the calculations in this swarm. One sees that the increase in CPU time for the parallel calculation is even smaller than the increase observed when going from a single serial run on a dedicated node to a swarm of identical calculations. The difference between the increase for the DFT part and that for the two-electron part is larger in the parallel calculation than in the swarm, which reflects that in the parallel calculation all processes are synchronized (that is, all processes either do numerical integration or process two-electron integrals), whereas the calculations quickly get out of sync in the swarm (note the wide spread in total CPU times). For a "real-world" speedup, one should compare the parallel calculation against the average run time of the calculations in the swarm. In this example at least, the CPU/wall ratio of the parallel calculation is a much more reasonable measure of the "real-world" speedup than a comparison of serial and parallel calculations on a dedicated node.

Density-Fitting (RI) Calculations

If one repeats the above calculations with a nonhybrid exchange-correlation functional that does not contain Fock exchange, the timings are not much different. It is, however, possible to speed up the calculation considerably using the so-called RI-J method, which has been implemented for the programs AOFORCE, ESCF, and EGRAD.19–21 In the AOFORCE program version we are using, the RI-J approximation replaces only the nondifferentiated four-center two-electron work but not the derivative integrals. The acronym RI stands for resolution of the identity and denotes an approximation for the four-center two-electron integrals that goes back to ref. 22 and reads

  (μν|ρσ) ≈ ∑_PQ (μν|P) [V⁻¹]_PQ (Q|ρσ)    (1)

  V_PQ = (P|Q)    (2)

In these equations, μ, ν, ρ, σ are basis functions used to expand the molecular orbitals, P and Q are functions from an auxiliary basis, and the Mulliken notation (•|•) is used for the four-, three-, and two-center two-electron integrals. The RI-J method uses this approximation to calculate the matrix elements of the Coulomb operator, J_μν = ∑_ρσ D_ρσ (μν|ρσ) (D is the density matrix), in three steps:

  b_P = ∑_ρσ (P|ρσ) D_ρσ          (step A)    (3)

  a_Q = ∑_P [V⁻¹]_QP b_P          (step B)    (4)

  J_μν ≈ ∑_Q (μν|Q) a_Q           (step C)    (5)

Note that this special application of the RI is equivalent to variational density fitting introduced in density functional calculations much earlier.23

Because a Cholesky decomposition of the matrix V is computed at program start and kept in memory, by far the most CPU time goes into steps A and C, so these have also been parallelized using the recipe explained in the last section (see, e.g., Fig. 2), with a slight modification: for a given (i,j) shell pair, the required computing time may be as small as the latency of the global counter, in which case the master cannot keep the workers busy. We therefore process more than one (i,j) shell pair, depending on the angular momenta and contraction lengths, for a given value of the step counter. If there is enough memory to keep a significant fraction of the three-center two-electron integrals, their contributions to steps A and C can be evaluated by fast matrix-vector multiplications. In an AOFORCE calculation, several J matrices have to be calculated simultaneously, and we have rewritten these "memory contributions" as matrix-matrix multiplications in this case, which makes this step much faster already in a serial calculation. We have not explicitly parallelized the memory contributions (matrix operations) but rely on automatic parallelization through the use of multithreaded BLAS routines (Intel's MKL library has been used in our case).
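The effect of batching several J matrices can be sketched as follows (hypothetical array names, sizes, and layout, not the TURBOMOLE data structures; a CBLAS implementation is assumed for linking): with a block B of stored three-center integrals and the density matrices packed as columns, step A for all densities and step C for all Coulomb matrices each become a single matrix-matrix multiplication, which a multithreaded BLAS then parallelizes automatically.

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int naux  = 300;      /* auxiliary basis functions P                     */
    const int npair = 500;      /* packed basis-function pairs (mu,nu)             */
    const int nmat  = 6;        /* number of density/Coulomb matrices in the batch */

    double *B = malloc((size_t)naux  * npair * sizeof *B);  /* stored (P|mu nu)    */
    double *D = malloc((size_t)npair * nmat  * sizeof *D);  /* packed densities    */
    double *G = calloc((size_t)naux  * nmat,   sizeof *G);  /* step A results b_P  */
    double *J = calloc((size_t)npair * nmat,   sizeof *J);  /* step C results J    */

    for (size_t i = 0; i < (size_t)naux  * npair; ++i) B[i] = 1e-3 * (i % 7);
    for (size_t i = 0; i < (size_t)npair * nmat;  ++i) D[i] = 1e-3 * (i % 5);

    /* Step A for all densities at once:  G = B * D   (naux x nmat)                */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                naux, nmat, npair, 1.0, B, npair, D, nmat, 0.0, G, nmat);

    /* Step B (solving with the Cholesky factors of V) is omitted here; pretend
     * that G already holds the fitted coefficients a_Q.                           */

    /* Step C for all Coulomb matrices at once:  J += B^T * G   (npair x nmat)     */
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                npair, nmat, naux, 1.0, B, npair, G, nmat, 1.0, J, nmat);

    printf("J(0) of the first matrix in the batch: %e\n", J[0]);
    free(B); free(D); free(G); free(J);
    return 0;
}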

To test the implementation, we have repeated the above calculations using the BP (nonhybrid) exchange-correlation functional.24, 25 We only report timings for AOFORCE because the EGRAD calculation can now be done within 10 min on a single CPU core, and our parallelization technique, because of the overhead of forking off processes, is not suited to reducing a wall-clock time of minutes down to seconds.

We have performed two sets of calculations, one without any memory contributions and one where all three-center two-electron integrals (4.5 GB) were kept in memory. Both sets of calculations behave very similarly, and the gain from keeping all the integrals in memory is rather limited. The reason is that in a serial calculation the fraction of the total CPU time spent on the RI terms is only 5.5% without keeping integrals in memory, and it goes down to 0.3% if all integrals are kept. In both cases, the remaining CPU time is used in equal amounts by the derivative two-electron integral work and the numerical integrations, and these parts determine the overall efficiency. It is therefore sufficient to present only the timings for the case where no three-center two-electron integrals are kept in memory (Table 4).

Table 4. Parallel Performance of AOFORCE (Calculations on a 12-Helicene C50H28 Molecule, TZVP Basis Set with 1118 Basis Functions, BP Functional and Using the Density Fitting (RI) Approximation, all Three-Center Integrals Calculated Repeatedly)
Data for an eight-core node (two Nehalem X5500 CPUs):

  #Workers          1       2       4       6       8
  CPU(a)        1,529   1,528   1,568   1,656   1,687
  Wall(b)       1,529     770     400     286     222
  CPU/wall(c)     1.0     2.0     3.9     5.8     7.6
  Speedup(d)      1.0     2.0     3.8     5.4     6.9

Data for a 32-core node (eight Opteron 8378 CPUs):

  #Workers          1       8      16      24      32
  CPU(a)        2,378   2,510   2,816   3,250   3,460
  Wall(b)       2,378     337     207     172     151
  CPU/wall(c)     1.0     7.4    13.6    18.9    22.9
  Speedup(d)      1.0     7.0    11.5    13.8    15.7

  (a) CPU time (in minutes) summed over all processes.
  (b) Wall-clock time (in minutes) of the calculation.
  (c) This number is probably the best measure for the "real-world" speedup (see text).
  (d) Wall-clock time ratio of serial and parallel calculation.

Had we not used the RI approximation, CPU times would be comparable to those of Table 2; replacing the computational steps involving nondifferentiated two-electron integrals by RI code gives the expected acceleration. More relevant in the present context, however, is that the parallel efficiency of the RI calculations is worse than observed above: for example, using all 32 CPU cores on the Opteron node we only achieve a speedup of 16 (compared with a serial calculation on a dedicated node), and the CPU/wall ratio is also significantly worse than in the B3LYP calculation. This made it necessary to look more closely at where this degradation in parallel efficiency comes from. Even if the algorithm itself scales perfectly, there are possible performance leaks associated with our parallel setup (Fig. 1). We therefore measured the wall-clock time required to fork off the worker processes (fork), the CPU time spent in the master while keeping the workers busy (global counter), and the wall-clock time elapsed for the global sum operation. In the 32-core calculation of Table 4, fork required 2.5 min (145 s), the global counter only 7 CPU s, and the global sum was the most expensive with 20 min (1197 s) of wall-clock time. As expected, the wall-clock times for fork and global sum are cut by a factor of 2 if only 16 CPU cores are used, whereas the CPU time of the global counter is fairly insensitive to the number of parallel workers. Forking off the workers is not a critical step, but the time required to compute the global sum (20 min) is a sizeable fraction of the total wall-clock time (151 min). If we look at this for individual parts of the calculation, we clearly see that it is the amount of data transferred that counts. For example, for step A of the RI procedure [eq. (3)] the result is the small vector b, whereas for step C it is the large matrix J. Consequently, the wall-clock time needed for the global sum operation is negligible for step A but significant for step C.

We have finally measured the global sum operation in a synthetic benchmark. It takes 70 s on the 32-core machine to sum up 1 GB of data from each of the 32 workers; this corresponds to an effective data transfer rate of about 500 MB/s and matches the performance seen in the actual AOFORCE calculation. Two-thirds of this time is needed to actually transfer the data from the workers to the master, and one-third is required for the master to add the data onto the target matrices with a BLAS daxpy operation. We have run this benchmark on several different machines and found that the result depends strongly on the memory bandwidth; for the eight-core machine (two Nehalem CPUs) used in this study, the effective data transfer rate is more than four times higher. It thus seems that the degradation in parallel performance seen in Table 4 for the 32-core machine stems from its rather poor memory bandwidth. Note finally that the data collection time is virtually the same in the B3LYP calculations presented before, but there it does not degrade the parallel efficiency simply because the overall CPU time is much larger.
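A rough re-creation of such a synthetic global-sum benchmark might look as follows (this is not the authors' benchmark code; worker count, array sizes, and the timing harness are invented, error handling is omitted, and a CBLAS library is assumed for the daxpy): each forked worker streams a large partial-result array to the master through its socketpair, and the master accumulates the incoming chunks onto the target with daxpy while timing the whole operation.

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NWORKERS 4
#define NELEM    (1L << 22)            /* 4 Mi doubles = 32 MB per worker          */
#define CHUNK    (1L << 16)            /* elements transferred per read/write      */

int main(void)
{
    double *target = calloc(NELEM, sizeof *target);   /* global-sum result         */
    double *buf    = malloc(CHUNK * sizeof *buf);     /* master's receive buffer   */
    int chan[NWORKERS][2];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int w = 0; w < NWORKERS; ++w) {
        socketpair(AF_UNIX, SOCK_STREAM, 0, chan[w]);
        if (fork() == 0) {                            /* ----- worker -----        */
            double *part = malloc(NELEM * sizeof *part);
            for (long i = 0; i < NELEM; ++i) part[i] = 1.0;    /* fake result      */
            for (long i = 0; i < NELEM; i += CHUNK) {          /* stream it out    */
                const char *p = (const char *)(part + i);
                size_t left = CHUNK * sizeof(double);
                while (left > 0) {                    /* handle short writes       */
                    ssize_t n = write(chan[w][1], p, left);
                    p += n; left -= n;
                }
            }
            _exit(0);
        }
        close(chan[w][1]);
    }

    /* ----- master: receive every chunk and daxpy it onto the target ----- */
    for (int w = 0; w < NWORKERS; ++w)
        for (long i = 0; i < NELEM; i += CHUNK) {
            char *p = (char *)buf;
            size_t left = CHUNK * sizeof(double);
            while (left > 0) {                        /* a chunk may arrive in pieces */
                ssize_t n = read(chan[w][0], p, left);
                p += n; left -= n;
            }
            cblas_daxpy(CHUNK, 1.0, buf, 1, target + i, 1);
        }
    while (wait(NULL) > 0) ;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("summed %.0f MB from %d workers in %.2f s (target[0] = %.1f)\n",
           NWORKERS * NELEM * 8.0 / 1e6, NWORKERS, secs, target[0]);
    return 0;
}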

Conclusions


The technique described in this work allows the parallelization of legacy code without a time-consuming reorganization of that code. Most of the programming could be done within a few days, and a very useful result has been obtained that is of high interest to the average user of the TURBOMOLE program package. The parallel efficiency is not limited by the algorithm itself but rather by the communication rate between the master and the workers, which depends strongly on the memory bandwidth of the machine. Using more than 32 parallel workers in an efficient calculation, such as one that uses the RI approximation, is therefore not recommended. This restriction could be alleviated by implementing a collective global sum operation in which all workers participate, e.g., by using shared-memory segments for data exchange.

Binary executables of the AOFORCE, ESCF, and EGRAD programs that result from this work will be available free of charge to licensed TURBOMOLE users from COSMOlogic, the distributor of the program package.

References
