Abstract

This paper studies computation and communication costs on the RP3 and examines issues in algorithm design for a multiprocessor with a three-level memory hierarchy. Using very simple algorithms (vector-add, vector-sum, saxpy, …), we compare implementations that differ in data localization (global or local) and data cacheability (cacheable or non-cacheable). The comparison uses a performance monitoring system (VPMC) that records instructions, data movement, cache requests, and cache misses. The VPMC output then serves as input to an analytical performance model, from which we compute the elemental computation and communication times of each basic algorithm. Regarding cacheability (marking data cacheable instead of non-cacheable), we found it worthwhile as long as the data are blocked adequately. For our simple 1-D data structures, a block size equal to a multiple of the cache line size gives the best results. However, when possible load imbalance is taken into account, a block size equal to the cache line size seems optimal. Regarding localization (copying data from global to local memory, working on the local data instead of the global data, and copying the data back), we found it ineffective, at least with the RP3 local and global communication speed ratios (1:10:15).
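
The trade-off between block size and load balance described above can be illustrated with a minimal sketch. It is not the implementation used in the study. It assumes that blocks of a 1-D vector are dealt out to processors round-robin, and it uses a hypothetical cache line of 8 elements and hypothetical values for the vector length and processor count; none of these are stated in the source.

```python
# Illustrative sketch only: the round-robin (block-cyclic) assignment and all
# numeric values below are assumptions, not taken from the paper.

def block_cyclic_counts(n, num_procs, block_elems):
    """Count the elements each processor owns when n elements are split into
    blocks of block_elems and the blocks are assigned round-robin."""
    counts = [0] * num_procs
    for block_index, start in enumerate(range(0, n, block_elems)):
        owner = block_index % num_procs
        # The final block may be shorter than block_elems.
        counts[owner] += min(block_elems, n - start)
    return counts


def imbalance(counts):
    """Difference between the most and least loaded processors."""
    return max(counts) - min(counts)


LINE_ELEMS = 8   # hypothetical: vector elements per cache line
N = 1000         # hypothetical vector length
P = 4            # hypothetical processor count

if __name__ == "__main__":
    for lines_per_block in (1, 2, 4, 8):
        counts = block_cyclic_counts(N, P, lines_per_block * LINE_ELEMS)
        print(lines_per_block, counts, imbalance(counts))
```

Under these assumptions every block is aligned to a cache line boundary whatever the multiple chosen, but the gap between the most and least loaded processors can widen as the block grows. This is one way to read the remark that a block of one cache line seems optimal once load imbalance is considered.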