A note on generating finer-grain parallelism in a representation tree


Christof Vömel, Computer Science Department, University of California at Berkeley, Berkeley, CA 94720, U.S.A.

E-mail: voemel@eecs.berkeley.edu


The representation tree lies at the heart of the algorithm of Multiple Relatively Robust Representations for computing orthogonal eigenvectors of a symmetric tridiagonal matrix without Gram–Schmidt orthogonalization. A representation tree describes the incremental shift relations between relatively robust representations of eigenvalue clusters of an unreduced tridiagonal matrix; these shifts are needed to strongly separate close eigenvalues in the relative sense. At the bottom of the representation tree, each leaf defines a relatively isolated eigenvalue to high relative accuracy.
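As an illustration of the structure just described, the following sketch models a representation-tree node holding an index range of eigenvalues and an incremental shift relative to its parent's representation. This is a hypothetical toy model for exposition, not LAPACK's actual data layout; the class and field names are invented.

```python
# Illustrative sketch (hypothetical, not LAPACK's data structure): a node of a
# representation tree groups an index range of eigenvalues that share a
# relatively robust representation (RRR), obtained by shifting the parent RRR.

class RTNode:
    def __init__(self, first, last, shift, children=None):
        self.first = first            # index of first eigenvalue in the cluster
        self.last = last              # index of last eigenvalue in the cluster
        self.shift = shift            # incremental shift relative to parent RRR
        self.children = children or []

    def is_leaf(self):
        # A leaf holds a single, relatively isolated eigenvalue.
        return self.first == self.last

def leaves(node):
    # Collect the leaves; each corresponds to one eigenvalue/eigenvector pair.
    if node.is_leaf():
        return [node]
    return [l for c in node.children for l in leaves(c)]

# A tiny tree: the root cluster {0, 1, 2} splits into the singleton {0} and a
# child cluster {1, 2}, which in turn splits into two singletons.
leaf0 = RTNode(0, 0, 0.0)
leaf1 = RTNode(1, 1, 0.0)
leaf2 = RTNode(2, 2, 0.0)
cluster = RTNode(1, 2, 0.5, [leaf1, leaf2])
root = RTNode(0, 2, 0.0, [leaf0, cluster])

print(len(leaves(root)))  # → 3, one leaf per eigenvalue
```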

The shape of the representation tree plays a pivotal role in both complexity and available parallelism: a deeper tree, consisting of multiple levels of nodes, implies tasks associated with more work (i.e., eigenvalue refinement to resolve eigenvalue clusters) and less parallelism (i.e., a longer critical path, as well as additional data movement and synchronization). An ideal, embarrassingly parallel tree, on the other hand, consists only of a root and leaves.
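The contrast between the two tree shapes can be made concrete with a small sketch. Under a hypothetical cost model (unit work per node, invented for illustration and not taken from the paper), a deep chain of fat nodes yields a long critical path and coarse tasks, while a root-plus-leaves tree yields a critical path of just two nodes and uniformly small tasks.

```python
# Illustrative sketch with a hypothetical cost model: a tree is a nested dict
# {"work": w, "children": [...]}. We compare a deep chain of fat nodes with a
# flat, embarrassingly parallel root-plus-leaves tree.

def critical_path(node):
    # Longest root-to-leaf sum of per-node work: the sequential bottleneck.
    kids = node["children"]
    return node["work"] + (max(critical_path(c) for c in kids) if kids else 0)

def max_task(node):
    # Largest single-node task: a proxy for task granularity.
    kids = node["children"]
    return max([node["work"]] + [max_task(c) for c in kids])

# Deep tree: a sequential chain of 4 fat cluster nodes (work 10 each).
deep = {"work": 10, "children": []}
for _ in range(3):
    deep = {"work": 10, "children": [deep]}

# Ideal tree: a root plus 8 independent leaves (work 1 each).
flat = {"work": 1,
        "children": [{"work": 1, "children": []} for _ in range(8)]}

print(critical_path(deep), critical_path(flat))  # → 40 2
print(max_task(deep), max_task(flat))            # → 10 1
```

The flat tree's critical path is independent of the number of leaves, which is exactly why a shallower representation tree exposes more parallelism.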

As highly parallel hybrid graphics processing unit (GPU)/multicore platforms with large memory become available as commodity hardware, exploiting parallelism in traditional algorithms becomes key to modernizing the components of standard software libraries such as LAPACK. This paper focuses on LAPACK's Multiple Relatively Robust Representations algorithm and investigates the critical case in which a representation tree contains a long sequential chain of large (fat) nodes that hampers parallelism. This key problem needs to be addressed because it concerns all kinds of computing environments: distributed computing, symmetric multiprocessor, and hybrid GPU/multicore architectures. We present an improved representation tree that often offers a significantly shorter critical path and a finer computational granularity of smaller tasks that are easier to schedule. In a study of selected synthetic and application matrices, we show that an average 75% reduction in the length of the critical path and an 82% reduction in task granularity can be achieved. Copyright © 2011 John Wiley & Sons, Ltd.