SICSA multicore challenge editorial preface†
†This article is published with the permission of the Controller of HMSO and the Queen's Printer for Scotland.
This special issue reports on the SICSA multicore challenge, which commenced in 2010 and remains an ongoing activity. The aim is to produce a comparative evaluation of a range of standard parallel programming tools and techniques on a representative set of parallelizable problems, executing on commodity multicore platforms.
In this introductory article, Section 2 outlines the contemporary multicore computing trends that give rise to the challenge, and Section 3 describes its specific motivation. Section 4 summarizes the N-body problem, which was selected for implementation in the second challenge phase and is studied in the papers in this special issue. Finally, Section 5 summarizes participation in challenge activities to date and gives an overview of the papers by challenge participants that appear in this special issue.
We intend this special issue to present the main lessons learnt from the challenge, providing material of general relevance for multicore application developers and researchers.
2 MULTICORE LANDSCAPE
This section sets the context for the SICSA multicore challenge in terms of recent advances in both hardware and software.
2.1 Multicore architectures
The widespread adoption of the multicore architectural paradigm has taken little more than a decade, from the seminal academic article to the present industrial reality of octa-core ARM processors in commodity smartphones‡ and 60-core Intel processors in Xeon Phi accelerator boards§. The growth in processing elements per chip seems to be following the familiar exponential trend of Moore's Law.
One interesting observation is that the current range of multicore architectures is highly diverse. Some commodity processors (generally of x86 ancestry) are homogeneous multicores, with fairly powerful individual processing elements. Other systems, such as Tilera, feature a larger array of less powerful cores. The other end of the homogeneous multicore spectrum is the general-purpose GPU, which may have many hundreds of specialized, small cores optimized for SIMD-style stream processing.
There are also heterogeneous architectures, which combine diverse processing elements. For instance, the Cell processor has one complex Power core and many simpler processing elements for specialized computation. Alternatively, asymmetric multicore systems have a homogeneous instruction set, but different power characteristics. The big.LITTLE system from ARM follows this approach.
This proliferation of designs leads us to one of two conclusions. Perhaps it is no longer the case (if it ever was) that one size fits all. Alternatively, the evolution of multi/many-cores is in an early, highly exploratory stage (somewhat akin to geology's Cambrian explosion period) and it is not yet clear which approach is most suitable.
2.2 Changing cost models
The cost model for multicore parallelism is generally different to previous models for parallel computing, such as cluster-based parallelism. In a multicore scenario, parallel hardware threads are highly available and relatively cheap to fork. For example, Contreras and Martonosi report that the overhead of thread spawns using the Intel TBB library is measured in tens of cycles. Further, the overhead of communication between threads is often reduced, because of improved physical proximity between processing elements. For example, Barrow-Williams et al. use a 10-cycle latency for shared L2 cache access in their characterization of communication patterns of parallel benchmarks on modern multicore architectures.
Given these developments, it has become cheaper (i) to create new threads, and (ii) to share data between threads. Therefore, programmers have the freedom to write high-performance parallel software in a fundamentally different way. In general, any flexibility in implementation strategy for parallel applications should be guided by formal resource analysis and reliable cost models.
2.3 Parallelism in programming languages
The rapid emergence of commodity multicore machines has propelled parallel programming into the mainstream of software development. Various programming models can exploit potential parallelism but are sometimes restricted to certain classes of architectures or applications. It is not yet clear which parallel programming models can exploit multicore to best advantage. Hence, the software situation mirrors the diversity of current multicore hardware design: a wide variety of multicore programming approaches have been proposed.
The most conservative approach is a backwards compatible one, i.e. to map longstanding parallel constructs onto the multicore paradigm. These include shared memory parallel approaches like OpenMP and message passing parallel approaches like MPI.
A more adventurous solution involves the adoption of newly introduced libraries for multicore parallelism, such as Java fork/join, Intel Threading Building Blocks and the .NET Task Parallel Library. These libraries inter-operate with existing mainstream high-level languages (Java, C++, and C# respectively).
An increasingly popular approach is to introduce parallelism in the context of well-understood parallel patterns or algorithmic skeletons. Provided that the problem can be formulated in terms of one or more of these patterns, this approach offers high parallel performance with only small changes to the program, in particular without the need for explicitly controlling the coordination of the parallel program. The success of Google's MapReduce pattern is evidence of the flexibility and applicability of this approach.
The most radical approach involves the creation of entirely new languages, in which parallelism is a fundamental design consideration. Such languages include Chapel, X10 and Fortress, all created as part of the DARPA High Productivity Computing Systems initiative. The new class of partitioned global address space (PGAS) languages incorporates key concepts of locality control, developed for these languages, and brings them to mainstream parallel programming in the form of UPC and CoArray Fortran.
Multicore has also brought about a resurgence of interest in functional languages. The main reason is that properties like referential transparency and function purity make it easier for programmers to express and reason about concurrency. Among the most prominent functional languages for multicore are F#, Erlang, Haskell and Scala, which also supports object-oriented programming but is functional-first.
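To make the pattern-based style concrete, the following is a minimal sketch of a map-reduce skeleton in Python. It is not Google's MapReduce API or any skeleton framework discussed in this issue; the names `map_reduce`, `count_words` and `merge_counts` are hypothetical and chosen only for illustration. The point is the division of labour: the programmer writes two sequential functions, and the skeleton owns the coordination. Note that in CPython a thread pool shows the shape of the pattern rather than a CPU-bound speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(map_fn, reduce_fn, items, initial, workers=4):
    """Apply map_fn to every item concurrently, then fold the results.

    The caller supplies only the two sequential functions; the pool
    decides how the mapped work is scheduled.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, items))  # results keep input order
    return reduce(reduce_fn, mapped, initial)


def count_words(line):
    """Map step: word frequencies for a single line of text."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce step: combine two frequency tables into a new one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged
```

A call such as `map_reduce(count_words, merge_counts, lines, {})` then counts words across a list of lines without any explicit thread management in user code.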
The four papers appearing in this special issue represent a sample of the approaches enumerated above. Two papers focus on imperative languages (C, Fortran and Pascal) with restrictions or extensions to support effective parallelization. One paper uses a patterns-based approach with various skeleton frameworks for C++. The final paper uses the Haskell functional language.
3 MOTIVATION FOR CHALLENGE
The primary aim of the SICSA multicore challenge is to assess the suitability of currently available languages and systems for multicore programming, by comparing implementations of several challenge applications in terms of achieved performance and ease of implementation on state-of-the-art multicore machines.
Given the diversity of parallel programming approaches for multicore, we want to gain some insight regarding the strengths and weaknesses of various techniques. We explicitly informed all participants that they should not devise new parallel languages or techniques during the challenge; rather, they should engage in the evaluation of existing approaches. Ideally, we intended all participants to evaluate their software submissions on a reference platform, which was a 64-bit Intel Xeon multicore Linux server. However, given the broad range of approaches used and the fact that some techniques are only applicable on specialized platforms, this original goal proved impossible to achieve.
One non-negotiable objective of the challenge was that all participants should implement the same problems and use the standard algorithms. In this way, we hope to maintain some degree of comparability between the solutions. Nonetheless, we are well aware of the notorious difficulty of empirical comparisons between programming languages.
3.1 About SICSA
The Scottish Informatics and Computer Science Alliance (SICSA¶) is a £30m research pool supported by the Scottish Funding Council. It brings together researchers from 14 universities across Scotland to create a world-leading quality research cluster in informatics and computer science. One of SICSA's four key research themes is complex systems engineering. The multicore challenge activities fit within this theme.
4 CHALLENGE PROBLEM
In each phase, one challenge application is posed, inviting implementations in different systems and languages. We aim for simple yet representative programs, each comprising the computational core of some high-performance application.
The first challenge application involves the generation of a concordance for word sequences occurring in a text file. The second challenge application is the N-body problem. All papers in this special issue describe implementations of the second challenge application; hence, this section gives details of the second challenge problem and associated data sets.
4.1 N-body problem
The N-body problem involves the prediction of motion of a system of N bodies that interact with each other gravitationally. The computation proceeds over time steps. In each time step, the net force on every body has to be calculated; then, this result is used to update the position and velocity of that body for that time step.
The input is a tab-separated value file of floating-point numbers. The first number specifies the number of bodies, N. The subsequent 3N numbers denote the x, y, z positions of the N bodies. These are laid out as a vector of x positions, followed by a vector of y positions, followed by a vector of z positions. The subsequent 3N numbers denote the vx, vy, vz velocity vector components of the N bodies. Again, these are laid out in vectors for the three orthogonal directions. For the challenge problem, N = 1024 and the time step is 0.001. The actual input file is available online∥.
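The layout described above can be read with a few lines of code. The following Python sketch is illustrative only: the function name `parse_bodies` is hypothetical, and it assumes that the file contains exactly 1 + 6N numbers separated by arbitrary whitespace (tabs or newlines), which is our reading of the description rather than a formal specification of the file.

```python
def parse_bodies(text):
    """Parse the challenge input layout into (n, positions, velocities).

    Layout assumed: N, then all x, all y, all z positions,
    then all vx, all vy, all vz velocity components.
    """
    values = [float(token) for token in text.split()]
    n = int(values[0])  # N is stored as a floating-point number
    if len(values) != 1 + 6 * n:
        raise ValueError(f"expected {1 + 6 * n} numbers, found {len(values)}")

    def column(k):
        # k-th block of n consecutive numbers after the leading N
        start = 1 + k * n
        return values[start:start + n]

    xs, ys, zs = column(0), column(1), column(2)
    vxs, vys, vzs = column(3), column(4), column(5)
    positions = list(zip(xs, ys, zs))      # one (x, y, z) tuple per body
    velocities = list(zip(vxs, vys, vzs))  # one (vx, vy, vz) tuple per body
    return n, positions, velocities
```

Converting from the column-wise file layout to one tuple per body is a design choice for readability; an implementation tuned for vector hardware might well keep the structure-of-arrays form.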
The output is the computed overall energy of the system before and after updating the positions and velocities of all N bodies. The reference implementation runs for 20 time steps.
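As a rough illustration of the quantity being reported, the sketch below computes the total energy as kinetic plus pairwise gravitational potential energy. The input description does not mention masses or units, so unit masses and a gravitational constant of 1 are assumptions made here, not properties of the reference implementation; `total_energy` is a hypothetical name.

```python
import math


def total_energy(positions, velocities, masses=None, g=1.0):
    """Kinetic plus gravitational potential energy of the system."""
    n = len(positions)
    if masses is None:
        masses = [1.0] * n  # assumption: unit masses

    # Kinetic energy: sum of (1/2) m |v|^2 over all bodies.
    kinetic = 0.0
    for m, (vx, vy, vz) in zip(masses, velocities):
        kinetic += 0.5 * m * (vx * vx + vy * vy + vz * vz)

    # Potential energy: -G m_i m_j / r_ij, each unordered pair counted once.
    potential = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            potential -= g * masses[i] * masses[j] / r
    return kinetic + potential
```

Comparing this value before and after the simulation gives a simple sanity check, since the total energy of an isolated system should change only slightly over a small number of short time steps.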
There is a range of well-known algorithms to solve the N-body problem. The simplest approach is the all-pairs algorithm that computes all pair-wise forces directly. This requires O(N²) operations at each time step. The all-pairs algorithm is conceptually simple but inefficient for large-scale simulations. More complex algorithms, such as Barnes–Hut, group together bodies that are close together and compute their cumulative effect on far-away bodies. This reduces the amount of computation, but at the cost of a slight decrease in precision.
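The all-pairs algorithm can be sketched sequentially as follows. This is an illustration, not the reference implementation: the integrator is not specified in the problem statement, so a simple Euler-style update (velocity first, then position using the new velocity) is assumed, along with a gravitational constant of 1 and no softening term. The names `accelerations` and `step` are hypothetical.

```python
import math


def accelerations(positions, masses, g=1.0):
    """All-pairs accelerations: every body visits every other body."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz
            s = g * masses[j] / (r2 * math.sqrt(r2))  # G m_j / r^3
            acc[i][0] += s * dx
            acc[i][1] += s * dy
            acc[i][2] += s * dz
    return acc


def step(positions, velocities, masses, dt, g=1.0):
    """Advance the system by one time step; returns new lists."""
    acc = accelerations(positions, masses, g)
    new_v = [tuple(v[k] + dt * a[k] for k in range(3))
             for v, a in zip(velocities, acc)]
    new_p = [tuple(p[k] + dt * v[k] for k in range(3))
             for p, v in zip(positions, new_v)]
    return new_p, new_v
```

The doubly nested loop in `accelerations` is the quadratic kernel; the update in `step` is linear in N and rarely dominates the run time.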
The multicore challenge instructions explicitly encouraged participants to concentrate on simpler implementations (i.e. all-pairs rather than Barnes–Hut) in order to focus on parallelization details rather than algorithmics. One of the four papers in this special issue tackles Barnes–Hut in addition to all-pairs, using a high-level Haskell-based parallel programming model. The other three papers restrict attention to the all-pairs algorithm only.
The N-body problem is a classical, scientific high-performance computing application. It has a kernel that is representative of a wider class of parallel applications. A perceived weakness of this problem is that the code is embarrassingly parallel, thus parallelization is generally trivial. Further, since there is a range of potential algorithms, we must be clear which approaches each participant implements, in order to make appropriate and fair comparisons.
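The embarrassingly parallel structure can be seen in a short sketch: within one time step, the acceleration of each body depends only on the (read-only) positions of all bodies, so the bodies can be split into independent chunks. The Python code below assumes unit masses and a gravitational constant of 1, and uses hypothetical names throughout. It uses a thread pool purely to show the decomposition; in CPython, threads do not run this arithmetic on several cores at once, so this is a structural illustration rather than a performance recipe.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def chunk_accelerations(positions, start, stop):
    """Accelerations for bodies start..stop-1 (unit masses, G = 1 assumed)."""
    out = []
    for i in range(start, stop):
        ax = ay = az = 0.0
        xi, yi, zi = positions[i]
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            ax += dx * inv_r3
            ay += dy * inv_r3
            az += dz * inv_r3
        out.append((ax, ay, az))
    return out


def parallel_accelerations(positions, workers=4):
    """Split the bodies into contiguous chunks, one task per chunk."""
    n = len(positions)
    size = max(1, math.ceil(n / workers))
    bounds = [(s, min(s + size, n)) for s in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No task writes shared state, so no locking is needed.
        parts = pool.map(lambda b: chunk_accelerations(positions, *b), bounds)
        return [a for part in parts for a in part]
```

Because the chunks share no mutable state, the only synchronization point is the join at the end of each time step, which is what makes the all-pairs kernel straightforward to parallelize.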
5 SUMMARY OF CHALLENGE ACTIVITIES
The SICSA multicore challenge was initiated in mid-2010. To date, there have been two challenge phases as reported in this special issue. Each challenge phase concluded with a workshop. The first workshop was at Heriot-Watt University on 13 December 2010. Nine speakers presented their parallel implementations of the concordance application. The second workshop was at Glasgow University on 27 May 2011. Eight speakers described their implementations of the N-body problem. Around 40 people attended the workshops, mainly from Scotland but with visitors from other parts of Europe.
The workshops identified a number of common themes and general trends in the field.
One such trend is towards high-level abstraction for parallelism, making it easier to develop an initial parallel version of a program, by minimizing the code changes required to introduce parallelism. Examples of this trend include parallel extensions to the functional programming language Haskell.
A related trend is the identification of patterns of parallel computation, such as algorithmic skeletons, which can be flexibly instantiated and used to parallelize a wide range of applications, without the need to control parallelism directly.
Finally, heterogeneous architectures are of increasing importance to the general area of high-performance computing. In particular, specialized hardware acceleration, e.g. as used in the Cell processor, presents a very powerful compute engine, but poses additional challenges for efficient parallelization.
Participants have found the challenge to be very valuable in identifying strengths and weaknesses of individual platforms, through a comparative evaluation. A typical comment describes the challenge activities as ‘an extremely valuable experience, gaining insights [regarding] the relative advantages of each of the approaches.’
We hope that the papers presented in this special issue will form the core of an enduring initiative. Interest in this initiative continues to grow, e.g. the mailing list currently has almost 100 subscribers. The SICSA multicore challenge is ongoing; it has now entered the third phase. A particular problem has not yet been selected. Potential candidates include the maximum flow graph problem or the boolean satisfiability problem (SAT). Further information is available at the website**.
5.2 Articles in this special issue
For this special issue, we solicited papers from challenge participants. We also encouraged people who could not attend the workshops to consider submitting a paper. We asked authors to report on the parallelization of one or both challenge applications, summarizing the parallel performance achieved, assessing the ease of parallelization, reflecting on the performance tuning of the application and finally remarking on pragmatic aspects such as the availability and stability of tools and libraries.
We had eight submissions, each of which was reviewed by at least three members of our review committee. We finally accepted four papers for publication, all of them describing implementations of the second challenge application. Table 1 summarizes the papers featured in this special issue, giving details of the challenge problem tackled, and the architecture and framework used for evaluation.
Table 1. Summary of papers in this special issue.
|Paper|Challenge problem|Architecture|Framework|
|---|---|---|---|
|Cockshott et al.|N-body|Cell|Vector Pascal, Fortran subset|
|Šinkarovs et al.|N-body|x86|Single-assignment C|
|Goli et al.|N-body|x86 and NVidia GPU|Skeletons (FastFlow, SkePU and Thrust)|
|Totoo et al.|N-body|x86|Parallel Haskell (GpH/Eden)|
The review committee consists of nine members, principally drawn from Scottish universities, with a few industrial and academic colleagues. We would like to thank the committee for their careful consideration of the submitted papers for this special issue:
Paul Cockshott (University of Glasgow)
Murray Cole (University of Edinburgh)
Kevin Hammond (University of St Andrews)
Tim Harris (Oracle)
Hans-Wolfgang Loidl (Heriot-Watt University)
Rita Loogen (Philipps-Universität Marburg)
Ross McIlroy (Google)
Greg Michaelson (Heriot-Watt University)
Jeremy Singer (University of Glasgow)