A GPU-accelerated Conjugate Gradient solver is tested on eight matrices with diverse structural and numerical characteristics. The first four matrices are obtained by discretizing the 3D Poisson equation, which arises in many fields such as computational fluid dynamics and heat transfer. Their relatively low bandwidth and low condition numbers make them ideal targets for GPU acceleration. We chose the other four matrices from the opposite end of the spectrum: ill-conditioned and with very large bandwidth. This paper concentrates on the computational aspects of running the solver on multiple GPUs. We develop a fast distributed sparse matrix–vector multiplication routine using optimized data formats that allows communication to be overlapped with computation while, at the same time, sharing part of the work with the CPU. Through a thorough analysis of the time spent in communication and computation, we show that the proposed overlapped implementation outperforms the non-overlapped one by a large margin and provides almost perfect strong scalability for large Poisson-type matrices. We then benchmark the performance of the entire solver, using both double precision and single precision combined with iterative refinement, and report up to 22× acceleration when using three GPUs compared with one of the most powerful Intel Nehalem CPUs available today. Finally, we show that using GPUs as accelerators not only brings an order-of-magnitude speedup but also up to a 5× increase in power efficiency and over a 10× increase in cost effectiveness. Copyright © 2010 John Wiley & Sons, Ltd.
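The mixed-precision scheme mentioned above, in which inner Conjugate Gradient solves run in single precision while an outer iterative-refinement loop accumulates the solution and recomputes residuals in double precision, can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the paper's GPU implementation; the function names, tolerances, and iteration limits are illustrative choices.

```python
import numpy as np

def cg(A, b, tol=1e-4, max_iter=1000):
    """Plain Conjugate Gradient, run in whatever precision A and b carry."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def cg_iterative_refinement(A, b, outer_tol=1e-10, max_outer=20):
    """Solve A x = b for SPD A: single-precision CG corrections,
    double-precision residuals and solution accumulation."""
    A32 = A.astype(np.float32)            # low-precision copy for inner solves
    x = np.zeros_like(b)                  # double-precision accumulator
    for _ in range(max_outer):
        r = b - A @ x                     # residual in double precision
        if np.linalg.norm(r) < outer_tol * np.linalg.norm(b):
            break
        d = cg(A32, r.astype(np.float32)) # cheap correction in single precision
        x += d.astype(np.float64)
    return x
```

The benefit on GPUs is that the inner solves, which dominate the runtime, execute at single-precision throughput and halve memory traffic, while the outer loop recovers double-precision accuracy provided the matrix is not too ill-conditioned relative to single-precision roundoff.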