The block Wiedemann (BW) algorithm is frequently used to solve sparse linear systems over GF(2). Its most time-consuming operation is the iterated sparse matrix–vector multiplication. The need to accelerate this step arises from the application of BW to the very large matrices that occur in the linear algebra step of the number field sieve (NFS) for integer factorization. In this paper, we derive an efficient CUDA implementation of this operation based on a newly designed hybrid sparse matrix format. On a single graphics processing unit (GPU), this yields speedups by a factor of 4 to 8 over an optimized multicore implementation for a number of tested NFS matrices. We further present a GPU cluster implementation of the full BW algorithm for NFS matrices. A small GPU cluster is able to outperform larger CPU clusters for large matrices such as the one obtained from the Kilobit special NFS factorization. Copyright © 2012 John Wiley & Sons, Ltd.
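
To make the central operation concrete, the following is a minimal CPU sketch of the iterated sparse matrix–vector product over GF(2). It is not the hybrid format or the CUDA kernel described in this paper: it assumes a plain compressed sparse row (CSR) layout that stores only the column positions of the nonzero entries, and it assumes that 64 vectors are packed into one 64-bit word per coordinate, so that one row of the product reduces to XORing the words selected by that row. The names `spmv_gf2`, `iterate`, `row_ptr`, and `col_idx` are illustrative only.

```python
MASK64 = (1 << 64) - 1  # assumed block width: 64 vectors packed per word


def spmv_gf2(row_ptr, col_idx, x):
    """Return y = A * x over GF(2).

    A is given in CSR form without a value array, since every stored
    entry equals 1. x[j] is a word whose bit b is coordinate j of the
    b-th packed vector. Addition in GF(2) is XOR, so row i of the
    product is the XOR of x[j] over the nonzero columns j of row i.
    """
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc ^= x[col_idx[k]]
        y[i] = acc & MASK64
    return y


def iterate(row_ptr, col_idx, x, steps):
    """Apply the square matrix A to x repeatedly: returns A^steps * x."""
    for _ in range(steps):
        x = spmv_gf2(row_ptr, col_idx, x)
    return x


# Small example: A = [[1,1,0],[0,1,1],[1,0,1]] over GF(2).
row_ptr = [0, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 2]
x0 = [0b01, 0b10, 0b11]
print(spmv_gf2(row_ptr, col_idx, x0))    # [3, 1, 2]
print(iterate(row_ptr, col_idx, x0, 2))  # [2, 3, 1]
```

The inner loop performs one irregular memory read per nonzero entry and no arithmetic beyond an XOR, which illustrates why the choice of storage format has such a strong influence on the performance of this step.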