SU-C-207B-01: A Novel Graphics Processing Units (GPU) Implementation of Discrete Wavelet Transformation




To design and implement a GPU-based discrete wavelet transformation (DWT) for use in medical image reconstruction, image processing, and data compression. DWTs are widely used in medical physics, but computing them is time-consuming for large volumetric data. An efficient parallel implementation of DWTs is essential for many time-sensitive applications, such as 4DCT.


We chose Daubechies wavelet transformations as a benchmark, implementing both the DWT and the inverse DWT (IDWT). The reference CPU code is from “Numerical Recipes in C”. We implemented the GPU-based code in C++ with CUDA 7.5. A GPU is a specialized processor with a highly parallel structure, originally designed for manipulating computer graphics. GPU computation is highly memory-bound. To optimize the GPU memory access pattern and achieve the best performance for the transformation along the Y-direction, the data are transposed before and after the DWT (IDWT), so that the transform runs along contiguous memory, which sharply reduces computation time. Both the CPU and GPU codes are unit-tested, and the differences between the results of the two implementations are verified to be within rounding error. The hardware platform was a desktop with an Intel Xeon E5-1607 3.00 GHz CPU and an NVIDIA Quadro K2000 GPU.
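The transpose-then-transform idea can be illustrated with a small CPU-side sketch in C++. This is not the authors' code: it assumes the 4-tap Daubechies filter (D4), a single decomposition level, and periodic wraparound at the boundary, and the function names `daub4_forward`, `daub4_inverse`, `transpose`, and `dwt_columns_via_transpose` are hypothetical. It only shows how a column (Y-direction) transform can be expressed as a transpose, a pass over contiguous rows, and a second transpose.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Daubechies-4 filter coefficients (assumed filter; the abstract does not
// state which Daubechies order was used).
static const double S3 = std::sqrt(3.0);
static const double DN = 4.0 * std::sqrt(2.0);
static const double C0 = (1.0 + S3) / DN;
static const double C1 = (3.0 + S3) / DN;
static const double C2 = (3.0 - S3) / DN;
static const double C3 = (1.0 - S3) / DN;

// One forward D4 level on n contiguous samples (n even, n >= 4).
// Smooth coefficients go to the first half, details to the second half.
void daub4_forward(double* a, std::size_t n) {
    assert(n >= 4 && n % 2 == 0);
    const std::size_t nh = n / 2;
    std::vector<double> w(n);
    for (std::size_t i = 0; i < nh; ++i) {
        const std::size_t j = 2 * i;
        const double x0 = a[j], x1 = a[j + 1];
        const double x2 = a[(j + 2) % n], x3 = a[(j + 3) % n];  // periodic wrap
        w[i]      = C0 * x0 + C1 * x1 + C2 * x2 + C3 * x3;
        w[i + nh] = C3 * x0 - C2 * x1 + C1 * x2 - C0 * x3;
    }
    for (std::size_t i = 0; i < n; ++i) a[i] = w[i];
}

// Inverse of daub4_forward: the forward matrix is orthogonal, so the
// inverse applies its transpose.
void daub4_inverse(double* a, std::size_t n) {
    assert(n >= 4 && n % 2 == 0);
    const std::size_t nh = n / 2;
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < nh; ++i) {
        const std::size_t j = 2 * i;
        const double s = a[i], d = a[i + nh];
        x[j]           += C0 * s + C3 * d;
        x[j + 1]       += C1 * s - C2 * d;
        x[(j + 2) % n] += C2 * s + C1 * d;
        x[(j + 3) % n] += C3 * s - C0 * d;
    }
    for (std::size_t i = 0; i < n; ++i) a[i] = x[i];
}

// Out-of-place transpose of a row-major rows x cols matrix.
std::vector<double> transpose(const std::vector<double>& in,
                              std::size_t rows, std::size_t cols) {
    std::vector<double> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
    return out;
}

// Transform along the column (Y) direction by transposing, running the
// 1-D transform over contiguous rows, and transposing back.
void dwt_columns_via_transpose(std::vector<double>& img, std::size_t rows,
                               std::size_t cols, bool inverse) {
    std::vector<double> t = transpose(img, rows, cols);  // now cols x rows
    for (std::size_t c = 0; c < cols; ++c) {
        if (inverse) daub4_inverse(&t[c * rows], rows);
        else         daub4_forward(&t[c * rows], rows);
    }
    img = transpose(t, cols, rows);
}
```

In a GPU version, each contiguous row would typically be handled by a group of threads so that neighboring threads read neighboring addresses; the sketch above only mirrors the data layout, not the parallel execution.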


With the GPU implementation, we process a medical CT image of size 512×512×256 in 0.5 seconds, which is 60 times faster than the CPU implementation. A naive, non-optimized GPU implementation by direct parallelization achieves only about a 10-fold speedup for 2-D and 3-D DWT and IDWT. Optimizing the GPU memory access pattern yields an additional 6-fold speedup.


We developed an efficient GPU-based implementation of DWT/IDWT for medical image processing. To achieve the best performance for any medical image processing algorithm, developers should pay attention to memory access pattern optimization when implementing on GPU architectures.

This project is supported by CPRIT under grant RP150485.