Extended Conference Paper
Improving TLB performance on current chip multiprocessor architectures through demand-driven superpaging
Article first published online: 1 MAY 2012
Copyright © 2012 John Wiley & Sons, Ltd.
Software: Practice and Experience
Volume 43, Issue 6, pages 705–729, June 2013
How to Cite
Qasem, A. and Magee, J. (2013), Improving TLB performance on current chip multiprocessor architectures through demand-driven superpaging. Softw: Pract. Exper., 43: 705–729. doi: 10.1002/spe.2128
- Issue published online: 8 MAY 2013
- Manuscript Accepted: 11 APR 2012
- Manuscript Revised: 10 APR 2012
- Manuscript Received: 5 FEB 2011
- Texas State Research Enhancement Program. Grant Number: DE-SC001770
- Department of Energy. Grant Number: DE-SC001770
Translation Lookaside Buffers (TLBs) can play a critical role in improving the performance of emerging parallel workloads. Most current chip multiprocessor systems include multilevel TLBs and support superpages at both the hardware and software levels. Judicious use of superpages can significantly reduce the number of TLB misses and improve overall system performance. However, indiscriminate superpage allocation results in page fragmentation and an increased application footprint, costs that often outweigh the benefits of fewer TLB misses. Previous research has explored policies for smart allocation of superpages from an operating system perspective. This paper presents a compiler-based strategy for automatic and profitable memory allocation via superpages. A significant advantage of a compiler-based approach is the availability of data-reuse information within an application. Our strategy employs data-locality analysis to estimate the TLB demands of both single-threaded and multi-threaded programs and uses this metric to apply superpage allocation selectively. Apart from its obvious utility in improving TLB performance, this strategy can improve the effectiveness of certain data-layout transformations and can serve as a useful tool in benchmarking and automatic performance tuning. We demonstrate the effectiveness of this strategy with experiments on three multicore platforms, using a workload that contains both sequential and parallel applications.