Keywords:

  • computer architecture;
  • parallel and distributed technologies;
  • high performance computing

ABSTRACT

This special issue focuses on new developments in high performance applications, as well as the latest trends in computer architecture and parallel and distributed technologies. It is based on extended, thoroughly revised papers from the 22nd International Symposium on Computer Architecture and High Performance Computing. The authors were invited to provide extended versions of their original papers, taking into account the comments and suggestions raised during the peer review process as well as feedback from the conference audience. Copyright © 2012 John Wiley & Sons, Ltd.

1 BACKGROUND

This special issue focuses on new developments in high performance applications, as well as the latest trends in computer architecture and parallel and distributed technologies. It is based on extended, thoroughly revised papers from the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). The authors were invited to provide extended versions of their original papers, taking into account the comments and suggestions raised during the peer review process as well as feedback from the conference audience.

Relevant contributions have been provided by Coutinho et al., Miller et al., Yu et al., Camata and Coutinho, Fournier et al., Gaona-Ramirez et al., and Jones and Koenig. These contributions focus on the following:

  • profiling divergences in graphics processing unit (GPU) applications;
  • runtime failure rate targeting for energy-efficient reliability in chip microprocessors;
  • a queuing model-based approach for the analysis of transactional memory (TM) systems;
  • performance analysis of a parallel linear octree finite element mesh generation scheme;
  • multiple threads and parallel challenges for large simulations to accelerate a general Navier–Stokes computational fluid dynamics (CFD) code on massively parallel systems;
  • design of energy-efficient hardware TM systems; and
  • clock synchronization in high-end computing environments: a strategy for minimizing clock variance at runtime.

The 22nd edition of the SBAC-PAD series encouraged researchers to submit and present original work related to the latest trends in computer architecture and parallel and distributed technologies. Within this overall scope, this SBAC-PAD edition emphasized issues of application-specific systems; benchmarking, performance measurements, and analysis; clouds, grids, clusters, peer-to-peer systems, and embedded and pervasive systems; GPUs, field-programmable gate arrays, and other accelerator architectures; languages, compilers, and tools for parallel and distributed programming; modeling and simulation methodology; operating systems and virtualization; parallel and distributed systems, algorithms, and applications; power and energy-efficient systems; processor, cache, memory, storage, and network architecture; real-world applications and case studies; and reconfigurable and fault-tolerant systems. The conference was held in Petrópolis, RJ, Brazil.

2 SPECIAL ISSUE PAPERS

Coutinho et al. [1, 2] introduce a dynamic profiler for GPUs that pinpoints the program regions where threads have followed divergent execution paths. Divergences, a phenomenon typical of single-instruction multiple-data execution models, occur when threads that run in lockstep are forced down different paths by branches in the program code. This profiler helps developers speed up their general-purpose GPU (GPGPU) programs. As an example, the authors used it to identify key performance bottlenecks in a well-known version of the parallel quicksort algorithm, thereby improving its performance by almost 11%.
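
To make the notion of divergence concrete, the following sketch shows a small, hypothetical CUDA kernel; it is not taken from the paper, and the kernel name and parameters are illustrative only. Threads of the same warp evaluate the branch on different data, so the hardware serializes the two sides of the branch, which is exactly the kind of program point a divergence profiler is designed to flag.

    // Hypothetical CUDA kernel illustrating branch divergence (not from [1, 2]).
    // Threads of a warp run in lockstep; when the data-dependent branch below
    // sends some threads down the "if" side and others down the "else" side,
    // the warp executes both sides serially, wasting execution slots.
    __global__ void scaleOrNegate(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;              // guard against out-of-range threads

        if (in[i] > 0.0f) {              // data-dependent branch: source of divergence
            out[i] = in[i] * 2.0f;       // taken by some threads of the warp
        } else {
            out[i] = -in[i];             // taken by the remaining threads
        }
    }

A profiler such as the one described by the authors would attribute the serialized execution to this branch, pointing the developer at the code region worth restructuring (for example, by partitioning the input so that threads of a warp tend to take the same path).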

Miller et al. [3, 4] propose a new approach to microprocessor reliability management that achieves reliable and energy-efficient operation by dynamically adapting the amount of error protection to the characteristics of individual chips (to account for the effects of process variation), their runtime behavior (to account for workload variability), and the desired level of error resiliency. The ability of the system to adapt dynamically allows it to operate reliably within defined targets without wasting energy on high safety margins or over-provisioning. A flexible, manufacturer-defined or user-defined reliability target allows the same chip multiprocessor to be deployed in systems with different reliability requirements, such as laptops or servers. The paper also demonstrates that a machine learning-based dynamic control mechanism performs the runtime optimization required by the system accurately and quickly.

Yu et al. [5, 6] present an analytical model that describes the execution efficiency of TM systems. The model employs queuing theory to analyze the impact of an essential set of TM design parameters and is validated through extensive experiments. The study shows that, for a given TM-based program, the frequency of performing conflict detection can be carefully chosen to minimize the mean transaction completion time, and it demonstrates the importance of reducing implementation overhead. This study is expected to be useful for designing TM systems and applications.
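
Purely as an illustration of the kind of quantity such a model exposes (the authors' model is considerably richer, capturing conflict detection frequency, aborted transactions, and implementation overhead), the textbook M/M/1 queue relates the mean response time T to the transaction arrival rate λ and service rate μ by

    T = 1 / (μ − λ),  for λ < μ,

so even small changes in the effective service rate, for example extra overhead per transaction, can have a large effect on completion time when the system is heavily loaded; TM design parameters shift these effective rates in an analogous way.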

Camata and Coutinho [7, 8] present an efficient parallel implementation of a linear octree-based mesh generation scheme designed to create reasonable-quality, geometry-adapted unstructured hexahedral meshes automatically from triangulated surface models. The paper describes the main algorithms for the construction, 2:1 balancing (ensuring that neighboring octants differ by at most one refinement level), and meshing of large linear octrees on supercomputers. To handle arbitrary surfaces, which usually represent complex solid boundaries in computational solid and fluid mechanics applications, the authors use efficient computer graphics algorithms for surface detection based on bounding box trees and ray tracing. They conducted experiments to determine whether their surface detection algorithm can capture arbitrary surfaces and how quickly hexahedral mesh structures can be extracted from balanced linear octrees. Experiments using two complex triangulated surfaces show that the surface detection algorithm captures arbitrary surfaces properly, with small relative errors between the enclosed solid volume and that of the octree. Finally, an isogranular performance analysis demonstrates good scalability: the implementation executes the 2:1 balancing operations over 3.4 billion octants in less than 10 s per 1.6 million octants per CPU core.

Fournier et al. [9, 10] explore the performance, scalability, and optimization of Code_Saturne, an industry-standard application developed by Électricité de France and commonly used in the field of CFD, an increasingly important application domain for scientists and engineers across many industries, including automotive, aerospace, and energy. The research was conducted to evaluate various features that will be supported in the IBM Blue Gene/Q supercomputer, a highly scalable platform capable of achieving 20 petaflops. The work was initially motivated by the challenge of CFD simulation problems that require modeling larger, higher-resolution mesh structures and more complex simulations for new turbulence models, such as hybrid Reynolds-averaged Navier–Stokes (RANS)/large eddy simulations, which require smaller cell aspect ratios than pure RANS. Production simulation problems are now reaching the range of hundreds of millions of mesh cells and are predicted to require billion-cell meshes over the next few years. There is a near-term demand for solutions to problems with complex multibillion-cell meshes and for the petaflop computational capabilities required to make such simulations practical. CFD approaches and implementations capable of simulating billions of cells or particles are beginning to emerge within the research community, but extending this capability to general Navier–Stokes CFD industrial software remains a major challenge.

Gaona-Ramirez et al. [11, 12] present a characterization of the performance and energy consumption of the two most important hardware TM flavors to date: eager–eager (EE) and lazy–lazy (LL). These approaches employ opposite policies for data versioning and conflict management: EE systems update memory in place and detect conflicts as accesses occur, whereas LL systems buffer updates and detect conflicts at commit time. The results show that even though LL beats EE on average, there are considerable deviations in performance depending on the particular characteristics of each application and the settings of both systems. The paper shows that a significant part of the energy consumed by some applications in EE is spent in the back-off delay phase, and it explores more energy-efficient hardware back-off mechanisms. For LL systems, the way in which memory lines are assigned to the L2 cache banks affects the number of parallel commits in some applications, and an alternative fine-grained assignment is studied.

Finally, Jones and Koenig [13, 14] present the design of a new software-based clock synchronization scheme inspired by the needs of system software on extreme-scale machines. Results obtained from the authors' implementation indicate that it reliably achieves levels of time agreement sufficient to enable new opportunities. In addition to describing their design, the authors present performance measurements on a large Cray XT5 machine and discuss how the technology can be applied to coordinate the scheduling of independent Linux kernels.

ACKNOWLEDGEMENTS

We would like to thank the authors for contributing papers on their research on the latest trends in computer architecture and parallel and distributed technologies to this special issue, and all the reviewers for their constructive reviews and for helping to shape this special issue. Finally, we would like to thank Prof. Geoffrey Fox for providing us with the opportunity to bring this special issue to the research community.

REFERENCES

  • 1
    Coutinho B, Sampaio D, Pereira FMQ, Meira W. Profiling divergences in GPU applications. Concurrency and Computation: Practice and Experience 2013; 25(6):775–789.
  • 2
    Coutinho B, Sampaio D, Pereira FMQ, Meira W. Performance debugging of GPGPU applications with the divergence map. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 33–40. doi: 10.1109/SBAC-PAD.2010.38
  • 3
    Miller T, Surapaneni N, Teodorescu R. Runtime failure rate targeting for energy-efficient reliability in chip microprocessors. Concurrency and Computation: Practice and Experience 2013; 25(6):790–807.
  • 4
    Miller T, Surapaneni N, Teodorescu R. Flexible error protection for energy efficient reliable architectures. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 1–8. doi: 10.1109/SBAC-PAD.2010.37
  • 5
    Yu X, He Z, Hong B. A queuing model based approach for the analysis of transactional memory systems. Concurrency and Computation: Practice and Experience 2013; 25(6):808–825.
  • 6
    Yu X, He Z, Hong B. An analytical model on the execution of transactional memory. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 175–182. doi: 10.1109/SBAC-PAD.2010.29
  • 7
    Camata JJ, Coutinho ALGA. Parallel implementation and performance analysis of a linear octree finite element mesh generation scheme. Concurrency and Computation: Practice and Experience 2013; 25(6):826–842.
  • 8
    Camata JJ, Coutinho ALGA. Parallel linear octree meshing with immersed surfaces. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 151–158. doi: 10.1109/SBAC-PAD.2010.26
  • 9
    Vezolle P, Heyman J, D'Amora B, Braudaway G, Magerlein K, Magerlein J, Fournier Y. Multiple threads and parallel challenges for large simulations to accelerate a general Navier–Stokes CFD code on massively parallel systems. Concurrency and Computation: Practice and Experience 2013; 25(6):843–861.
  • 10
    Vezolle P, Heyman J, D'Amora B, Braudaway G, Magerlein K, Magerlein J, Fournier Y. Accelerating computational fluid dynamics on the IBM Blue Gene/P supercomputer. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 159–166. doi: 10.1109/SBAC-PAD.2010.27
  • 11
    Gaona-Ramirez E, Titos-Gil R, Fernandez J, Acacio ME. On the design of energy-efficient hardware transactional memory systems. Concurrency and Computation: Practice and Experience 2013; 25(6):862–880.
  • 12
    Gaona-Ramirez E, Titos-Gil R, Fernandez J, Acacio ME. Characterizing energy consumption in hardware transactional memory systems. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 9–16. doi: 10.1109/SBAC-PAD.2010.11
  • 13
    Jones T, Koenig GA. Clock synchronization in high-end computing environments: a strategy for minimizing clock variance at runtime. Concurrency and Computation: Practice and Experience 2013; 25(6):881–897.
  • 14
    Jones T, Koenig GA. A clock synchronization strategy for minimizing clock variance at runtime in high-end computing environments. Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 27–30 October 2010; 207–214. doi: 10.1109/SBAC-PAD.2010.33