The computation of electron repulsion integrals (ERIs) is the most time-consuming step in density functional calculations that use Gaussian basis sets. Many temporary (intermediate) ERIs are calculated, and most must be held in slower storage, such as cache or main memory, because of the shortage of registers, which are the fastest storage in a central processing unit (CPU). Moreover, heavy register usage makes it difficult to launch enough concurrent threads on a graphics processing unit (GPU) to hide latency. Hence, we propose to optimize the calculation order of one-center ERIs to minimize the number of registers used, and to calculate each ERI with three or six cooperating threads. The performance of this method is measured on a recent CPU and a GPU. The proposed approach is found to be efficient for basis functions of high angular momentum on a GPU. When combined with a recent GPU, it accelerates the computation almost 4-fold. © 2014 Wiley Periodicals, Inc.