Research Article
Solving the block–Toeplitz least-squares problem in parallel
Article first published online: 29 NOV 2004
DOI: 10.1002/cpe.883
Copyright © 2005 John Wiley & Sons, Ltd.
Issue
Concurrency and Computation: Practice and Experience
Volume 17, Issue 1, pages 49–67, January 2005
Additional Information
How to Cite
Alonso, P., Badía, J. M. and Vidal, A. M. (2005), Solving the block–Toeplitz least-squares problem in parallel. Concurrency Computat.: Pract. Exper., 17: 49–67. doi: 10.1002/cpe.883
Publication History
- Issue published online: 29 NOV 2004
- Article first published online: 29 NOV 2004
- Manuscript Accepted: 14 JAN 2004
- Manuscript Revised: 28 OCT 2003
- Manuscript Received: 29 OCT 2002
Funded by
- Ministerio de Educación, Cultura y Deporte, Spain. Grant Number: CICYT TIC 2000-1683-C03-03
Keywords:
- least-squares problem;
- block–Toeplitz matrices;
- Toeplitz–block matrices;
- displacement structure;
- Generalized Schur Algorithm;
- parallel algorithms;
- clusters of personal computers
Abstract
In this paper we present two versions of a parallel algorithm to solve the block–Toeplitz least-squares problem on distributed-memory architectures. We derive a parallel algorithm based on the seminormal equations arising from the triangular decomposition of the product T^T T. Our parallel algorithm exploits the displacement structure of Toeplitz-like matrices, using the Generalized Schur Algorithm to obtain the solution in O(mn) flops instead of the O(mn^2) flops required by algorithms for unstructured matrices. The strong regularity of the product T^T T and an appropriate computation of the hyperbolic rotations improve the stability of the algorithms. We have reduced the communication cost of previous versions, and have also reduced the memory access cost by appropriately arranging the elements of the matrices. Furthermore, the second version of the algorithm has a very low spatial cost, because it does not store the triangular factor of the decomposition. The experimental results show good scalability of the parallel algorithm on two different clusters of personal computers.
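
To make the seminormal-equations idea concrete, the following is a minimal serial sketch in Python, not the authors' algorithm. It forms T^T T explicitly, computes its upper triangular Cholesky factor R, and solves R^T R x = T^T b by two triangular substitutions. This dense approach costs O(mn^2) flops; the paper instead obtains the triangular factor from the displacement structure with the Generalized Schur Algorithm and runs in parallel, neither of which is shown here. The function names (`toeplitz`, `cholesky_upper`, `seminormal_solve`) and the small scalar Toeplitz test matrix are assumptions made for illustration only.

```python
import math


def toeplitz(col, row):
    """Dense m x n scalar Toeplitz matrix; entry (i, j) depends only on i - j.

    col is the first column, row is the first row (col[0] should equal row[0]).
    """
    m, n = len(col), len(row)
    return [[col[i - j] if i >= j else row[j - i] for j in range(n)]
            for i in range(m)]


def cholesky_upper(a):
    """Return upper triangular R with R^T R = a (a symmetric positive definite)."""
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        diag = a[i][i] - sum(r[k][i] ** 2 for k in range(i))
        r[i][i] = math.sqrt(diag)
        for j in range(i + 1, n):
            s = sum(r[k][i] * r[k][j] for k in range(i))
            r[i][j] = (a[i][j] - s) / r[i][i]
    return r


def seminormal_solve(t, b):
    """Least-squares solution of min ||T x - b|| via R^T R x = T^T b."""
    m, n = len(t), len(t[0])
    # Gram matrix T^T T (n x n) and right-hand side T^T b, formed densely.
    gram = [[sum(t[k][i] * t[k][j] for k in range(m)) for j in range(n)]
            for i in range(n)]
    rhs = [sum(t[k][i] * b[k] for k in range(m)) for i in range(n)]
    r = cholesky_upper(gram)
    # Forward substitution with the lower triangular R^T: R^T y = T^T b.
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(r[k][i] * y[k] for k in range(i))) / r[i][i]
    # Back substitution with the upper triangular R: R x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(r[i][j] * x[j] for j in range(i + 1, n))) / r[i][i]
    return x


# Hypothetical 5 x 3 example with a consistent right-hand side b = T x_true.
t = toeplitz([4.0, 1.0, 0.5, 0.25, 0.1], [4.0, 2.0, 1.0])
x_true = [1.0, -2.0, 3.0]
b = [sum(tij * xj for tij, xj in zip(row, x_true)) for row in t]
x = seminormal_solve(t, b)
print(x)
```

Because b is built from a known x_true, the computed x should agree with x_true up to rounding. Note that the factor R is the only thing the two substitutions need, which is why an algorithm that can regenerate or stream R, rather than store it, can lower the memory footprint as the abstract describes for the second version.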
