A high-performance sorting algorithm for multicore single-instruction multiple-data processors
Article first published online: 19 JUL 2011
Copyright © 2011 John Wiley & Sons, Ltd.
Software: Practice and Experience
Volume 42, Issue 6, pages 753–777, June 2012
How to Cite
Inoue, H., Moriyama, T., Komatsu, H. and Nakatani, T. (2012), A high-performance sorting algorithm for multicore single-instruction multiple-data processors. Softw: Pract. Exper., 42: 753–777. doi: 10.1002/spe.1102
- Issue published online: 4 MAY 2012
- Manuscript Accepted: 17 MAY 2011
- Manuscript Revised: 25 APR 2011
- Manuscript Received: 20 JUN 2010
Keywords
- parallel algorithms
Many sorting algorithms have been studied, but only a few can effectively exploit both single-instruction multiple-data (SIMD) instructions and thread-level parallelism. In this paper, we propose a new high-performance sorting algorithm, called aligned-access sort (AA-sort), that exploits both the SIMD instructions and the thread-level parallelism available on today's multicore processors. Our algorithm consists of two phases: an in-core sorting phase and an out-of-core merging phase. The in-core sorting phase uses our new sorting algorithm, which extends combsort to exploit SIMD instructions. The out-of-core merging phase is based on mergesort and uses our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating the unaligned memory accesses that would otherwise reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated AA-sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, when sorting 32 million random 32-bit integers on PowerPC 970MP, a sequential version of AA-sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times. In addition, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms.
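The two-phase structure described in the abstract can be illustrated with a minimal scalar sketch in C. This is not the authors' implementation: it uses no SIMD instructions, does no alignment handling, and is single-threaded. It only shows the general shape of sorting fixed-size blocks with a plain combsort and then merging the sorted blocks pairwise. The names `combsort`, `merge_runs`, and `two_phase_sort`, the `block` parameter, and the shrink factor of roughly 1.3 are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Plain scalar combsort on one block (hypothetical helper, no SIMD). */
static void combsort(int32_t *a, size_t n)
{
    size_t gap = n;
    int swapped = 1;
    while (gap > 1 || swapped) {
        gap = (gap * 10) / 13;          /* assumed shrink factor of about 1.3 */
        if (gap < 1) gap = 1;
        swapped = 0;
        for (size_t i = 0; i + gap < n; i++) {
            if (a[i] > a[i + gap]) {    /* compare-exchange at distance gap */
                int32_t t = a[i];
                a[i] = a[i + gap];
                a[i + gap] = t;
                swapped = 1;
            }
        }
    }
}

/* Scalar two-way merge of src[lo, mid) and src[mid, hi) into dst[lo, hi). */
static void merge_runs(const int32_t *src, size_t lo, size_t mid, size_t hi,
                       int32_t *dst)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid) dst[k++] = src[i++];
    while (j < hi)  dst[k++] = src[j++];
}

/* Phase 1: sort each block independently.
 * Phase 2: merge sorted runs pairwise until one run remains.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int two_phase_sort(int32_t *data, size_t n, size_t block)
{
    if (n < 2) return 0;
    if (block == 0) block = 1;

    int32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;

    for (size_t start = 0; start < n; start += block) {
        size_t len = (n - start < block) ? n - start : block;
        combsort(data + start, len);
    }

    int32_t *src = data, *dst = tmp;
    for (size_t width = block; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = (n - lo < width) ? n : lo + width;
            size_t hi  = (n - lo < 2 * width) ? n : lo + 2 * width;
            merge_runs(src, lo, mid, hi, dst);
        }
        int32_t *t = src; src = dst; dst = t;   /* ping-pong the buffers */
    }

    if (src != data) memcpy(data, src, n * sizeof *data);
    free(tmp);
    return 0;
}
```

A call such as `two_phase_sort(values, count, 4096)` sorts `count` integers in place, with `4096` as an arbitrary block size for the first phase. In the approach the abstract describes, the compare-exchange step in the first phase and the merge step in the second are the parts that would be replaced by SIMD operations on aligned vectors.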