Research Article
Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions
Article first published online: 27 FEB 2001
DOI: 10.1002/cpe.549
Copyright © 2001 John Wiley & Sons, Ltd.
Issue
Concurrency and Computation: Practice and Experience
Volume 13, Issue 2, pages 103–119, February 2001
Additional Information
How to Cite
Aberdeen, D. and Baxter, J. (2001), Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions. Concurrency and Computation: Practice and Experience, 13: 103–119. doi: 10.1002/cpe.549
Publication History
- Issue published online: 27 FEB 2001
- Article first published online: 27 FEB 2001
- Manuscript Revised: 27 JUL 2000
- Manuscript Received:
Funded by
- Australian Research Council
Keywords:
- GEMM;
- SIMD;
- SSE;
- matrix multiply;
- deep memory hierarchy
Abstract
Generalized matrix–matrix multiplication forms the kernel of many mathematical algorithms, so a faster matrix–matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices using the Intel Pentium single instruction multiple data (SIMD) floating point architecture. The main difficulty with the Pentium and other commodity processors is the need to use the cache hierarchy efficiently, particularly given the growing gap between main-memory and CPU clock speeds. We give a detailed description of the register allocation and the Level 1 and Level 2 cache blocking strategies that yield the best performance for the Pentium III family. Our results show performance that is on average 2.09 times faster than the leading public domain matrix–matrix multiply routines and comparable to Intel's SIMD small matrix–matrix multiply routines.
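
The cache blocking idea mentioned in the abstract can be illustrated with a minimal sketch in portable C. This is not the Emmerald implementation: it leaves out the SSE instructions and the register allocation entirely, uses a single level of blocking rather than separate Level 1 and Level 2 strategies, and handles only square row-major matrices. The function names and the tile size `BLOCK` are hypothetical choices made for illustration, not values taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge, chosen only for illustration; it is not a
 * value from the paper and would need tuning for a real cache. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A * B for row-major n x n matrices. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] += sum;
        }
    }
}

/* Blocked version: C += A * B, computed one BLOCK x BLOCK tile at a
 * time so that the tiles of A, B and C being worked on are reused
 * while they are likely still resident in cache. */
void matmul_blocked(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_size(jj + BLOCK, n);

                /* Multiply one tile of A by one tile of B. The inner
                 * loop walks B and C with unit stride. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a_ik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first when a plain product is wanted. The two versions perform the same arithmetic in a different order, so their floating point results can differ slightly in general.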
