This work presents our strategy, the optimizations applied and the results of our effort to exploit the computational capabilities of graphics processing units (GPUs) under the CUDA environment in order to solve the Laplace PDE. The parallelizable red/black successive over-relaxation (SOR) method was used. Additionally, a program for the CPU was developed as a performance reference. Several optimizations were applied, which yielded significant speedup. Memory access patterns proved to be a critical factor in efficient program execution on GPUs, and data reorganization is therefore appropriate in order to achieve the highest feasible memory throughput. The same approach also benefits the performance of the CPU version. Finally, the best-performing versions were compared directly. The CUDA version, running on an NVIDIA GTX480 GPU (NVIDIA Corp., Santa Clara, CA, USA) and exceeding 142 GB/s of bandwidth, achieved a 10× speedup over the single-threaded CPU version running on an Intel Core i7 2600K CPU. The results show that the global memory cache added on recent GPU architectures assists in achieving high performance without requiring the use of the special memory types provided by the GPU (i.e., shared, texture or constant memory). Copyright © 2012 John Wiley & Sons, Ltd.
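For readers unfamiliar with the red/black ordering mentioned above, the following is a minimal serial sketch in Python of red/black SOR applied to the 2D Laplace equation with fixed (Dirichlet) boundary values. It is an illustration only and not the CUDA or CPU implementation evaluated in this work; the function name `red_black_sor`, the relaxation factor `omega = 1.5`, the tolerance and the grid layout are all assumptions chosen for the example.

```python
def red_black_sor(grid, omega=1.5, tol=1e-8, max_iters=10_000):
    """Relax the interior of `grid` in place; boundary values stay fixed.

    `grid` is a list of equal-length lists of floats (hypothetical layout).
    Returns the number of iterations performed.
    """
    rows, cols = len(grid), len(grid[0])
    for iteration in range(1, max_iters + 1):
        max_change = 0.0
        # Colour 0 ("red") holds points with (i + j) even,
        # colour 1 ("black") holds points with (i + j) odd.
        for colour in (0, 1):
            for i in range(1, rows - 1):
                # First interior column whose parity matches this colour.
                start = 1 + ((i + 1 + colour) % 2)
                for j in range(start, cols - 1, 2):
                    # All four neighbours belong to the other colour, so
                    # points of one colour never depend on each other.
                    average = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                      + grid[i][j - 1] + grid[i][j + 1])
                    change = omega * (average - grid[i][j])
                    grid[i][j] += change
                    max_change = max(max_change, abs(change))
        if max_change < tol:
            return iteration
    return max_iters


if __name__ == "__main__":
    n = 16
    # Top edge held at 1.0, all other boundary and interior values at 0.0.
    u = [[1.0 if i == 0 else 0.0 for _ in range(n)] for i in range(n)]
    iterations = red_black_sor(u)
    print("iterations:", iterations)
    print("value near the centre:", u[n // 2][n // 2])
```

Because every point of one colour reads only points of the other colour, each half-sweep can be executed in parallel, which is the property that makes the method suitable for GPUs. The sketch stores both colours interleaved in one array; it does not attempt the data reorganization discussed above.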