Techniques used to implement an unstructured grid solver on modern graphics hardware are described. The three-dimensional Euler equations for inviscid, compressible flow are considered. Effective memory bandwidth is improved by reducing total global memory access and overlapping redundant computation, as well as using an appropriate numbering scheme and data layout. The applicability of per-block shared memory is also considered. The performance of the solver is demonstrated on two benchmark cases: a NACA0012 wing and a missile. For a variety of mesh sizes, an average speed-up factor of roughly 9.5× is observed over the equivalent parallelized OpenMP code running on a quad-core CPU, and roughly 33× over the equivalent code running in serial. Copyright © 2010 John Wiley & Sons, Ltd.