A note on generating finer-grain parallelism in a representation tree
Article first published online: 22 NOV 2011
Copyright © 2011 John Wiley & Sons, Ltd.
Numerical Linear Algebra with Applications
Volume 19, Issue 5, pages 869–879, October 2012
How to Cite
Vömel, C. (2012), A note on generating finer-grain parallelism in a representation tree. Numer. Linear Algebra Appl., 19: 869–879. doi: 10.1002/nla.828
- Issue published online: 19 SEP 2012
- Manuscript Accepted: 2 OCT 2011
- Manuscript Revised: 11 SEP 2011
- Manuscript Received: 11 SEP 2010
- Multiple Relatively Robust Representations (MRRR);
- representation tree;
- spectrum peeling;
The representation tree lies at the heart of the algorithm of Multiple Relatively Robust Representations for computing orthogonal eigenvectors of a symmetric tridiagonal matrix without Gram–Schmidt. A representation tree describes the incremental shift relations between relatively robust representations of eigenvalue clusters of an unreduced tridiagonal matrix, which are needed to strongly separate close eigenvalues in the relative sense. At the bottom of the representation tree, each leaf defines a relatively isolated eigenvalue to high relative accuracy.
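The shift relation mentioned above can be written out as a short sketch. The notation is assumed here and is not taken from the abstract: $L D L^{T}$ denotes the parent representation, $\tau$ a shift chosen close to an eigenvalue cluster, $L_{+} D_{+} L_{+}^{T}$ the resulting child representation, and $\operatorname{relgap}$ one common way of measuring relative separation.

```latex
L_{+} D_{+} L_{+}^{T} = L D L^{T} - \tau I,
\qquad
\operatorname{relgap}(\lambda_i) = \min_{j \neq i} \frac{|\lambda_i - \lambda_j|}{|\lambda_i|}.
```

Shifting by $\tau$ leaves the absolute gaps $|\lambda_i - \lambda_j|$ unchanged but makes the shifted eigenvalues $\lambda_i - \tau$ small in magnitude, so the relative gaps within the cluster grow.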
The shape of the representation tree plays a pivotal role in both complexity and available parallelism: a deeper tree consisting of multiple levels of nodes involves tasks associated with more work (i.e., eigenvalue refinement to resolve eigenvalue clusters) and less parallelism (i.e., a longer critical path as well as potential data movement and synchronization). An embarrassingly parallel, ideal tree, on the other hand, consists of a root and leaves only.
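The contrast between a deep tree and the ideal tree can be illustrated with a minimal Python sketch. Everything in it is a hypothetical model for illustration only: the `Node` class, the helper names, and the two metrics (tree depth as a stand-in for critical path length, and the largest non-root cluster as a stand-in for task granularity) are assumptions and not the paper's definitions or LAPACK's data structures.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical tree node holding a cluster of `size` eigenvalues."""
    size: int
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def depth(node: Node) -> int:
    """Edges on the longest root-to-leaf path (assumed critical path proxy)."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def largest_child_cluster(root: Node) -> int:
    """Size of the largest non-root cluster node (assumed granularity proxy)."""
    sizes = [n.size for n in walk(root) if n is not root and n.children]
    return max(sizes, default=0)


def ideal_tree(n: int) -> Node:
    """Root whose n eigenvalues are all leaves: embarrassingly parallel."""
    return Node(n, [Node(1) for _ in range(n)])


def chain_tree(n: int) -> Node:
    """Worst case sketch: each level peels off one leaf and keeps a fat cluster."""
    if n == 1:
        return Node(1)
    return Node(n, [Node(1), chain_tree(n - 1)])


ideal = ideal_tree(4)
chain = chain_tree(4)
print(depth(ideal), largest_child_cluster(ideal))
print(depth(chain), largest_child_cluster(chain))
```

In this toy model, the ideal tree has depth 1 regardless of the matrix size, whereas the chain's depth grows with the number of eigenvalues, which is the kind of long sequential chain of fat nodes the abstract is concerned with.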
As highly parallel hybrid graphics processing unit/multicore platforms with large memory become available as commodity platforms, exploiting parallelism in traditional algorithms becomes key to modernizing the components of standard software libraries such as LAPACK. This paper focuses on LAPACK's Multiple Relatively Robust Representations algorithm and investigates the critical case where a representation tree contains a long sequential chain of large (fat) nodes that hamper parallelism. This key problem needs to be addressed because it affects all kinds of computing environments: distributed computing, symmetric multiprocessors, and hybrid graphics processing unit/multicore architectures. We present an improved representation tree that often offers a significantly shorter critical path and finer computational granularity, with smaller tasks that are easier to schedule. In a study of selected synthetic and application matrices, we show that an average 75% reduction in the length of the critical path and an 82% reduction in task granularity can be achieved.