Graphics processing unit computing and exploitation of hardware accelerators

Correspondence to: Enrique S. Quintana-Ortí, Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071-Castellón, Spain.

E-mail: quintana@icc.uji.es

SUMMARY

This special issue contributes to this promising field with extended and carefully reviewed versions of selected papers from two workshops, namely the 2nd Minisymposium on GPU Computing, which was held as part of the 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011) in Torun (Poland); and the Workshop on Exploitation of Hardware Accelerators (WEHA 2011), which was held in conjunction with the 2011 International Conference on High Performance Computing & Simulation in Istanbul (Turkey). Copyright © 2012 John Wiley & Sons, Ltd.

The importance of hardware accelerators (graphics processing units (GPUs), the Cell processor, field-programmable gate arrays, …) is rapidly increasing in performance-sensitive areas. They are particularly relevant in high-throughput disciplines such as high-quality 3D computer graphics and vision, real-time data stream processing, and high-performance scientific computing. The main reason behind this trend is that these accelerators can potentially yield speedups and power savings orders of magnitude higher than those obtained with optimized implementations for general-purpose CPU cores. As a result, during the past few years, these architectures have become powerful, capable, and inexpensive mainstream coprocessors for a wide variety of applications.

The growing relevance of these devices has given rise to a very rich programming environment, particularly in comparison with the landscape only a few years ago. Thus, on top of the two major programming frameworks, compute unified device architecture (CUDA) and OpenCL, libraries (e.g., cuFFT) and high-level interfaces (e.g., Thrust) have been developed that allow fast access to the computing power of GPUs and other accelerators without detailed knowledge of the underlying hardware. Annotation-based programming models (e.g., PGI Accelerator), GPU plug-ins for existing mathematical software (e.g., Jacket for MATLAB), GPU scripting languages (e.g., PyOpenCL), and new data-parallel languages (e.g., Copperhead) are also helping to bring the programming of hardware accelerators to a new level.

Altogether, the advances both in the hardware and in the programmability of accelerators, coupled with their potentially appealing performance/power ratio for a wide range of applications, have pushed organizations to invest in heterogeneous systems that include accelerators, and have motivated researchers to port their algorithms to such systems and develop novel tools to facilitate their usage.

This special issue contributes to this promising field with extended and carefully reviewed versions of selected papers from two workshops, namely the 2nd Minisymposium on GPU Computing, which was held as part of the 9th International Conference on Parallel Processing and Applied Mathematics (PPAM 2011) in Torun (Poland); and the Workshop on Exploitation of Hardware Accelerators (WEHA 2011), which was held in conjunction with the 2011 International Conference on High Performance Computing & Simulation in Istanbul (Turkey).

Fourteen papers were published in the conference proceedings of these two events after one or two review rounds. Extended versions of a subset of these papers then went through two additional review rounds, resulting in the papers selected for this special issue. The topics offer a good cross-section of current GPU challenges: further abstraction of the hardware and the programming model, improvements of basic parallel algorithms in discrete mathematics and linear algebra, and the utilization of the parallel processing power of GPUs for real-world applications.

In [1], the authors present GPU and CPU implementations of the red/black successive over-relaxation (SOR) method and compare them for a variety of problem sizes. Five GPU kernels are implemented, tuned, and compared. The optimization decisions are analyzed, and some of them are also applied to the CPU implementation, as they appeared to be beneficial on both platforms. The results also show that the global memory cache added in recent GPU architectures helps achieve high performance without requiring the use of the special memory types provided by the GPU (i.e., shared, texture, or constant memory).
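The red/black ordering at the heart of this method can be sketched in a few lines. The following NumPy version is a simplified model for a 2D Laplace problem, not the authors' CUDA code; it illustrates why each color sweep is fully data-parallel, and hence GPU-friendly:

```python
import numpy as np

def redblack_sor(u, omega=1.5, iters=100):
    """Red/black SOR sweeps for the 2D Laplace equation on a square grid.

    Interior points are split into 'red' ((i + j) even) and 'black'
    ((i + j) odd) sets; all points of one color can be updated in
    parallel, which is what makes the ordering attractive on GPUs.
    """
    n = u.shape[0]
    i, j = np.meshgrid(np.arange(1, n - 1), np.arange(1, n - 1), indexing="ij")
    for _ in range(iters):
        for parity in (0, 1):  # 0 = red sweep, 1 = black sweep
            mask = (i + j) % 2 == parity
            # 4-point Gauss-Seidel stencil over the current iterate
            gs = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                         + u[1:-1, :-2] + u[1:-1, 2:])
            # Over-relaxed update, applied only to the current color
            u[1:-1, 1:-1][mask] = ((1 - omega) * u[1:-1, 1:-1]
                                   + omega * gs)[mask]
    return u
```

Because the black sweep recomputes the stencil after the red sweep, it sees the freshly updated red values, exactly as in the sequential Gauss-Seidel ordering.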

In [2], the authors address the thread divergence problem in branch-and-bound algorithms for solving the flow-shop scheduling optimization problem. By adapting the selection operator and reordering the data for single-instruction multiple-data (SIMD) processing, they significantly reduce thread divergence in this highly irregular algorithm and obtain speedups in excess of 50× over a single CPU core on large problems on an NVIDIA Tesla C2050.
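The benefit of reordering work for SIMD execution can be illustrated with a toy cost model (entirely hypothetical; the function and the 0/1 branch flags are illustrative and not from the paper): a warp pays one pass when all its lanes take the same branch and two passes when both branches are present, so grouping same-branch work reduces the total pass count:

```python
import numpy as np

def warp_divergence_passes(flags, warp_size=32):
    """Count the SIMD passes needed when each warp's lanes may diverge.

    A warp needs one pass if all lanes take the same branch and two
    passes if both branches occur (the two paths are serialized).
    """
    passes = 0
    for start in range(0, len(flags), warp_size):
        w = flags[start:start + warp_size]
        passes += len(np.unique(w))
    return passes

# Hypothetical workload: nodes whose bounding test succeeds (1) or fails (0).
rng = np.random.default_rng(0)
flags = rng.integers(0, 2, size=1024)

unordered = warp_divergence_passes(flags)
reordered = warp_divergence_passes(np.sort(flags))  # group same-branch work
```

With random flags, nearly every warp is mixed; after sorting, at most one warp straddles the 0/1 boundary, so the pass count drops to roughly half.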

In [3], the authors present an efficient implementation of affinity propagation on clusters of GPUs. They propose a decomposition scheme that distributes the calculations of affinity propagation over multiple GPUs with a low communication-to-computation ratio. By distributing the calculations among multiple GPUs, they are able to efficiently find exemplars in data sets that would not fit in one, or even a few, devices.
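Why a row-wise decomposition of affinity propagation communicates little can be seen from its responsibility update, r(i,k) = s(i,k) − max_{k'≠k} {a(i,k') + s(i,k')}, which depends only on row i of the similarity and availability matrices. The following NumPy sketch of a per-block update is a hypothetical illustration, not the authors' multi-GPU code:

```python
import numpy as np

def responsibilities_block(S, A, rows):
    """Responsibility update for one block of rows (one device's share).

    r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k')) depends only on
    row i, so row blocks can be updated independently on different
    devices with no communication during the update itself.
    """
    AS = A[rows] + S[rows]
    idx = np.argmax(AS, axis=1)                   # column of the row maximum
    lanes = np.arange(len(rows))
    first = AS[lanes, idx]                        # largest a + s per row
    AS[lanes, idx] = -np.inf
    second = AS.max(axis=1)                       # second largest per row
    R = S[rows] - first[:, None]                  # subtract max over k' != k
    R[lanes, idx] = S[rows][lanes, idx] - second  # except at the argmax column
    return R
```

Updating two disjoint row blocks and stacking the results reproduces the full update exactly, which is the property the decomposition exploits.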

In [4], the authors present a cost-effective multi-GPU implementation of a finite-volume scheme for solving pollutant transport problems in a shallow-water system. Their contribution describes the optimization process from a naïve single-GPU implementation to the final optimized version, separately evaluating each of the improvements applied both to the GPU kernels and to the extension that exploits several GPUs.

In [5], the authors introduce a unified framework for the presentation of inversion algorithms for different types of matrices (general, symmetric positive definite, and symmetric) and illustrate the superior performance of Gauss–Jordan elimination over the conventional use of Gaussian elimination for this particular operation on NVIDIA 'Fermi' GPUs.
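For reference, a minimal dense Gauss–Jordan inversion can be written as follows. This NumPy sketch covers the general (unblocked, partially pivoted) case only and is not the blocked GPU algorithm of the paper; it shows that each step performs one bulk rank-1 update over the whole augmented matrix, the kind of regular pattern that suits GPUs better than the triangular solves of the Gaussian-elimination approach:

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a square matrix via Gauss-Jordan elimination with
    partial pivoting on the augmented matrix [A | I]."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])   # augmented [A | I]
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:, k]))       # partial pivoting
        M[[k, p]] = M[[p, k]]                     # swap pivot row into place
        M[k] /= M[k, k]                           # scale pivot row
        rows = np.arange(n) != k
        M[rows] -= np.outer(M[rows, k], M[k])     # eliminate column k everywhere
    return M[:, n:]                               # right half now holds A^{-1}
```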

In [6], the author describes the development of single-precision CUDA kernels for the level-1 and level-2 BLAS (Basic Linear Algebra Subprograms) and shows how auto-tuning can be applied with great success to optimize the performance of these kernels on graphics processors from the NVIDIA Tesla 20-series.
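The basic auto-tuning loop is simple to sketch: enumerate candidate parameters, time each variant on the actual input and hardware, and keep the fastest. The Python model below is hypothetical; `gemv_blocked` merely stands in for a tunable GPU kernel and is not from the paper:

```python
import time
import numpy as np

def gemv_blocked(A, x, block):
    """Row-blocked matrix-vector product: a stand-in for a kernel whose
    performance depends on a tunable blocking parameter."""
    y = np.empty(A.shape[0])
    for i in range(0, A.shape[0], block):
        y[i:i + block] = A[i:i + block] @ x
    return y

def _time_once(func, cfg, *args):
    """Wall-clock time of a single call with the given configuration."""
    start = time.perf_counter()
    func(*args, cfg)
    return time.perf_counter() - start

def autotune(func, candidates, *args, reps=3):
    """Exhaustive auto-tuning: time every candidate configuration
    (best of `reps` runs) and return the fastest one."""
    best_cfg, best_t = None, float("inf")
    for cfg in candidates:
        t = min(_time_once(func, cfg, *args) for _ in range(reps))
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg
```

Real GPU auto-tuners additionally prune configurations that violate hardware limits (registers, shared memory) before timing, but the measure-and-select core is the same.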

We would like to thank the authors for their excellent contributions to this special issue. The anonymous reviewers, who helped to greatly improve the quality of the papers, also deserve special acknowledgement; without their selfless effort, this special issue would not have been possible. We hope that the body of work in this special issue inspires future research in the area of heterogeneous computing.
