A performance analysis of the first generation of HPC‐optimized Arm processors

In this paper, we present performance results from Isambard, the first production supercomputer to be based on Arm CPUs that have been optimized specifically for HPC. Isambard is the first Cray XC50 “Scout” system, combining Cavium ThunderX2 Arm‐based CPUs with Cray's Aries interconnect. The full Isambard system will be delivered in the summer of 2018, when it will contain over 10 000 Arm cores. In this work, we present node‐level performance results from eight early‐access nodes that were upgraded to B0 beta silicon in March 2018. We present node‐level benchmark results comparing ThunderX2 with mainstream CPUs, including Intel Skylake and Broadwell, as well as Xeon Phi. We focus on a range of applications and mini‐apps important to the UK national HPC service, ARCHER, as well as to the Isambard project partners and the wider HPC community. We also compare performance across three major software toolchains available for Arm: Cray's CCE, Arm's version of Clang/Flang/LLVM, and GNU.


INTRODUCTION
The development of Arm processors has been driven by multiple vendors for the fast-growing mobile space, resulting in the rapid innovation of the architecture, greater choice for system companies, and competition between vendors. This market success inspired the European FP7 Mont-Blanc project to explore Arm processors designed for the mobile space, investigating the feasibility of using the Arm architecture for workloads relevant to the HPC community. Mont-Blanc's early results were encouraging, but it was clear that chips designed for the mobile space could not compete with HPC-optimized CPUs without further architecture and implementation developments. Since Mont-Blanc, Cavium announced it will release its first generation of HPC-optimized, Arm-based CPUs in 2018.
In response to these developments, the Isambard* project was set up to provide the world's first production Arm-based supercomputer.
Isambard will be a Cray XC50 "Scout" system and will be run as part of Tier 2 within the UK's national integrated HPC ecosystem. † Cavium ThunderX2 processors will form the basis of the system; these processors use the Armv8 instruction set but have been optimized specifically for HPC workloads. ThunderX2 CPUs are noteworthy in their focus on delivering class-leading memory bandwidth: each 32-core CPU uses eight DDR4 memory channels, enabling a dual-socket system to deliver in excess of 250 GB/s of memory bandwidth. The Isambard system represents a collaboration between the GW4 Alliance (formed from the universities of Bristol, Bath, Cardiff, and Exeter), the UK's Met Office, Cray, Arm, and Cavium, and is funded by EPSRC. Although Isambard is due to arrive in July 2018, this paper presents results using the project's early-access nodes, delivered in October 2017 and upgraded from A1 to B0 beta silicon in March 2018. These results will be among the first published for the near-production silicon of the Arm-based Cavium ThunderX2 processor and the first using Cray's CCE tools for Arm.
The results we present focus on single-node, dual-socket performance in comparison to other state-of-the-art processors found in the majority of supercomputers today. These results will also lay the foundations for a future study of at-scale, production-style workloads running on Isambard, which will utilize the Cray Aries interconnect.
* http://gw4.ac.uk/isambard/
The top 10 most heavily used codes that are run on the UK's national supercomputer, ARCHER ‡ (an Intel x86-based Cray XC30 system), along with a selection of mini-apps, proxy applications, and applications from project partners, provide a good representation of the styles of codes used by today's HPC community. 1 As such, they provide an ideal vehicle with which to benchmark the new Cavium ThunderX2 architecture, and the Isambard system presents a unique opportunity for such comparative benchmarking. Its heterogeneous design enables direct comparison between Cavium ThunderX2 CPUs and the best of today's mainstream HPC hardware, including x86 Intel Xeon and Xeon Phi processors, and NVIDIA P100 GPUs.

ISAMBARD: SYSTEM OVERVIEW
The most exciting aspect of the Isambard system will be the full cabinet of XC50 ''Scout'' with Cavium ThunderX2, delivering 10 752 high-performance Armv8 cores. Each node includes two 32-core ThunderX2 processors running at 2.1 GHz. The processors each have eight 2666-MHz DDR4 channels, yielding a STREAM triad result of over 250 GB/s per node. The XC50 Scout system packs four dual-socket nodes into each blade and then 42 such blades into a single cabinet. Pictures of a Scout blade and an XC50 cabinet are shown in Figure 1.
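The core count and bandwidth figures quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch rather than part of the original study. The blade, node, and channel counts come from the text, while the 8-byte (64-bit) channel width is a standard DDR4 assumption that the text does not state explicitly.

```python
# Back-of-envelope check of the Isambard XC50 figures quoted in the text.
NODES_PER_BLADE = 4
BLADES_PER_CABINET = 42
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 32
CHANNELS_PER_SOCKET = 8
TRANSFERS_PER_SECOND = 2666e6   # DDR4-2666
BYTES_PER_TRANSFER = 8          # assumption: 64-bit wide DDR4 channel


def total_cores():
    """Cores in one full cabinet of dual-socket nodes."""
    nodes = NODES_PER_BLADE * BLADES_PER_CABINET
    return nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET


def peak_node_bandwidth_gbs():
    """Theoretical peak memory bandwidth of one dual-socket node in GB/s."""
    per_socket = CHANNELS_PER_SOCKET * TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER
    return SOCKETS_PER_NODE * per_socket / 1e9


def stream_fraction_of_peak(stream_gbs):
    """Fraction of the theoretical peak achieved by a STREAM result."""
    return stream_gbs / peak_node_bandwidth_gbs()


if __name__ == "__main__":
    print(total_cores())
    print(round(peak_node_bandwidth_gbs(), 1))
    print(round(stream_fraction_of_peak(250.0), 2))
```

Under these assumptions, a STREAM triad result of 250 GB/s corresponds to roughly three quarters of the theoretical per-node peak.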
The results presented in this paper are based on work performed at two hackathons, using the Isambard early-access nodes. These early-access nodes use the same Cavium ThunderX2 CPUs as the full Isambard XC50 system, but in a Foxconn whitebox form factor. These nodes were upgraded from A1 to B0 beta silicon in March 2018 and were run in a dual-socket, 32-core, 2.

Applications
The Isambard system has been designed to explore the feasibility of an Arm-based system for real HPC workloads. As such, it is important to ensure that the most heavily used codes are tested and evaluated. To that end, eight real applications have been selected for this study, taken from the top 10 most used codes on ARCHER, UK's national supercomputer. 1 These applications represent over 50% of the usage of the whole supercomputer. Therefore, the performance of these codes on any architecture captures the interests of a significant fraction of UK HPC users, and any change in the performance of these codes directly from the use of different architectures is important to quantify. The test cases were chosen by the group of core application developers and key application users who came to two Isambard hackathon runs in October 2017 and February 2018; details of the attendees are found in our acknowledgments at the end of this paper. Given that we had to focus on comparing the performance of single nodes, we had to choose test cases that were of scientific merit, yet could run in a reasonable time on a single node.
CP2K § : This code simulates the ab initio electronic structure and molecular dynamics of different systems such as molecules, solids, liquids, and so on. Fast Fourier transforms (FFTs) form part of the solution step, but it is not straightforward to attribute these as the performance-limiting factor of this code. The memory bandwidth of the processor and the core count both have an impact. The code already shows sublinear scaling up to tens of cores. Additionally, the performance of the code on a single node does not necessarily have the same performance-limiting factors as when running the code at scale. We will need the full Isambard XC50 system to test these effects; in the meantime, we have used the H2O-64 benchmark, which simulates 64 water molecules (consisting of 192 atoms and 512 electrons) in a 12.4 Å³ cell for 10 time steps. This is an often studied benchmark for CP2K and, therefore, provides sufficient information to explore the performance across the different architectures in this paper.
GROMACS ¶ : This molecular dynamics package is used to solve Newton's equations of motion. Systems of interest such as proteins contain up to millions of particles. It is thought that GROMACS is bound by the floating-point performance of the processor architecture. This has motivated the developers to handwrite vectorized code in order to ensure an optimal sequence of such arithmetic. 9 The hand-optimized code is written using vector intrinsics, which results in GROMACS not supporting some compilers (such as the Cray Compiler) because they do not implement all the required intrinsics. For each supported platform, computation is packed so that it saturates the native vector length of the platform, eg, 256 bits for AVX2, 512 bits for AVX-512, and so on. For this study, we used the ion_channel_vsites benchmark. # This consists of the membrane protein GluCl, containing around 150 000 atoms (which, for GROMACS, is typically small), and uses "vsites" and a five-femtosecond time step. On the ThunderX2 processor, we used the ARM_NEON_ASIMD vector implementation, which is the closest match for the Armv8.1 architecture. However, this implementation is not as mature as some of those targeting x86.
NAMD: This molecular dynamics simulation program is designed to scale up to millions of atoms. 10 It is able to achieve scaling to more than half a million processor cores using the Charm++ parallel programming framework, a high-level library that abstracts the mapping of processors to work items (or chares) away from the programmer. 11 The test case used is the STMV benchmark, which simulates one of the smallest viruses in existence and which is a common set of inputs for measuring scaling capabilities. This benchmark includes PME calculations, which use FFTs, and so, its performance is heavily influenced by that of the FFT library used. Due to the complex structure of atomic simulation computation and the reliance on distributed FFTs, it is hard to define a single bounding factor for NAMD's performance: compute performance, memory bandwidth, and communication capabilities all affect the overall performance of the application.

NEMO:
The Nucleus for European Modelling of the Ocean ‖ (NEMO) code is one ocean modeling framework used by the UK's Met Office and is often used in conjunction with the Unified Model atmosphere simulation code. The code consists of simulations of the ocean, sea ice, and marine biogeochemistry under an automatic mesh refinement scheme. As a structured grid code, the performance-limiting factor is highly likely to be memory bandwidth. The benchmark we used was derived from the GYRE_PISCES reference configuration, with a 1/12° resolution and 31 model levels, resulting in 2.72M points, running for 720 time steps.

OpenFOAM:
Originally developed as an alternative to early simulation engines written in Fortran, OpenFOAM is a modular C++ framework aiming to simplify writing custom computational fluid dynamics (CFD) solvers. The code makes heavy use of object-oriented features in C++, such as class derivation, templating and generic programming, and operator overloading, which enables a powerful, extensible design methodology. 12 OpenFOAM's flexibility comes at the cost of additional code complexity; a good example is that every component is compiled into a separate dynamic library, and these libraries are then combined at runtime based on the user's input to form the required executables. The two features mentioned above grant OpenFOAM a unique position in our benchmark suite. In this paper, we use the simpleFoam solver for incompressible, turbulent flow from version 1712 of OpenFOAM,** the most recent release at the time of writing. The input case is based on the RANS DrivAer generic car model, which is a representative case of real aerodynamics simulation and, thus, should provide meaningful insight into the benchmarked platforms' performance. 13 OpenFOAM is almost entirely memory bandwidth bound.
OpenSBLI: This is a grid-based finite difference solver † † used to solve compressible Navier-Stokes equations for shock-boundary layer interactions. The code uses Python to automatically generate code to solve the equations expressed in mathematical Einstein notation and uses the Oxford Parallel Structured (OPS) software for parallelism. As a structured grid code, it should be memory bandwidth bound under the Roofline model, with low computational intensity from the finite difference approximation. We used the ARCHER benchmark for this paper, ‡ ‡ which solves a Taylor-Green vortex on a grid of 1024 × 1024 × 1024 cells (around a billion cells).

Unified Model:
The UK's Met Office code, the Unified Model § § (UM), is an atmosphere simulation code used for weather and climate applications. It is often coupled with the NEMO code. The UM is used for weather prediction, seasonal forecasting, and climate modeling, with timescales ranging from days to hundreds of years. At its core, the code solves the compressible nonhydrostatic motion equations on the domain of the Earth discretized into a latitude-longitude grid. As a structured grid code, the performance-limiting factor is highly likely to be memory bandwidth. We used an AMIP benchmark 14
The benchmark utilized is known as PdO, because it simulates a slab of palladium oxide. It consists of 174 atoms, and it was originally designed by one of VASP's developers, who also found that (on a single node) the benchmark is mostly compute bound; however, there exist a few methods that benefit from increased memory bandwidth. 16

Platforms
The early-access part of the Isambard system was used to produce the Arm results presented in this paper. Each of these Arm nodes houses two 32-core Cavium ThunderX2 processors running at 2. an L1 and L2 cache per core, along with a shared L3. This selection of CPUs provides coverage of both the state of the art and the status quo of current commonplace HPC system design. We include high-end models of both Skylake and Broadwell in order to make the comparison as challenging as possible for ThunderX2. It is worth noting that in reality, most Skylake and Broadwell systems will use SKUs from much further down the range, of which the Xeon Gold part described above is included as a good example. This is certainly true for the current Top 500 systems.
A summary of the hardware used, along with peak floating-point and memory bandwidth performance, is shown in Table 1, whereas a chart comparing key hardware characteristics of the main CPUs in our test (the three near-top-of-bin parts: Broadwell 22c, Skylake 28c, and ThunderX2 32c) is shown in Figure 2. There are several important characteristics that are worthy of note. First, the wider vectors in the x86 CPUs give them a significant peak floating-point advantage over ThunderX2. Second, wider vectors also require wider datapaths into the lower levels of the cache hierarchy. This results in the x86 CPUs having an L1 cache bandwidth advantage, but we see the advantage reducing as we go up the cache levels, until once at external memory, it is ThunderX2 that has the advantage, due to its greater number of memory channels. Third, as seen in most benchmark studies in recent years, dynamic voltage and frequency scaling (DVFS) makes it harder to reason about the percentage of peak performance that is being achieved. For example, while measuring the cache bandwidth results shown in Figure 2, we observed that our Broadwell 22c parts consistently increased their clock speed from a base of 2.2 GHz up to 2.6 GHz. In contrast, our Skylake 28c parts consistently decreased their clock speed from a base of 2.1 GHz down to 1.9 GHz. Our ThunderX2 parts ran at a consistent 2.2 GHz. At the actual, measured clock speeds, the values of the fraction of theoretical peak bandwidth achieved at L1 for Broadwell 22c, Skylake 28c, and ThunderX2 32c were 57%, 55%, and 51%, respectively.
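To illustrate why DVFS complicates statements about the percentage of peak achieved, the sketch below computes the fraction of theoretical L1 bandwidth for the same measured value at two different clock speeds. The clock speeds are the Broadwell 22c values quoted above. The bytes-per-cycle figure and the measured bandwidth are hypothetical placeholders chosen only for illustration, not values from this study.

```python
def peak_l1_bandwidth_gbs(cores, clock_ghz, bytes_per_cycle):
    """Theoretical aggregate L1 bandwidth in GB/s at a given clock speed."""
    return cores * clock_ghz * bytes_per_cycle


def fraction_of_peak(measured_gbs, cores, clock_ghz, bytes_per_cycle):
    """Measured bandwidth as a fraction of the theoretical peak."""
    return measured_gbs / peak_l1_bandwidth_gbs(cores, clock_ghz, bytes_per_cycle)


if __name__ == "__main__":
    cores = 2 * 22                 # dual-socket Broadwell 22c
    bytes_per_cycle = 64           # hypothetical per-core L1 width
    measured = 4000.0              # hypothetical measured bandwidth, GB/s

    at_base = fraction_of_peak(measured, cores, 2.2, bytes_per_cycle)
    at_observed = fraction_of_peak(measured, cores, 2.6, bytes_per_cycle)
    # The same measurement looks noticeably less efficient once the
    # observed (boosted) clock is used as the reference.
    print(round(at_base, 3), round(at_observed, 3))
```

Because the peak scales linearly with clock speed, the reported fraction changes by the ratio of the two clocks, which is why the fractions above are quoted at the measured rather than the nominal frequency.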
In order to measure the sustained cache bandwidths as presented in Figure 2, we used the methodology described in our previous work. 17 The Triad kernel from the STREAM benchmark was run in a tight loop on each core simultaneously, with problem sizes selected to ensure cache residency. This portable methodology was previously shown to attain the same performance as handwritten benchmarks, which only work on their target architectures. 18
## https://ark.intel.com/
We evaluated three different compiler families for ThunderX2 in this study: GCC 7 and 8, the LLVM-based Arm HPC Compiler 18.2 and 18.3, and Cray's CCE 8.6 and 8.7. We believe this is the first study to date that has compared all three of these compilers targeting Arm. The compiler that achieved the highest performance in each case was used in the result graphs displayed below. Likewise for the Intel processors, we used GCC 7, Intel 2018, and Cray CCE 8.5-8.7. Table 2 lists the compiler that achieved the highest performance for each benchmark in this study.
Figure 3 compares the performance of our target platforms over a range of representative mini-apps.
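For the sustained cache bandwidth measurements described in this section, one plausible way to choose a Triad problem size that stays resident in a given cache level is to bound the footprint of the three arrays by a fraction of the cache capacity. This is a sketch only: the function name, the 50% occupancy factor, and the example cache sizes are assumptions for illustration and are not taken from the methodology cited above.

```python
BYTES_PER_DOUBLE = 8
TRIAD_ARRAYS = 3  # a, b, and c in a[i] = b[i] + scalar * c[i]


def triad_elements_for_cache(cache_bytes, occupancy=0.5):
    """Largest per-array element count whose three arrays fit in the budget.

    occupancy is a hypothetical safety factor that leaves room for other
    data sharing the cache.
    """
    budget = int(cache_bytes * occupancy)
    return budget // (TRIAD_ARRAYS * BYTES_PER_DOUBLE)


def triad_footprint_bytes(elements):
    """Total bytes touched by the three Triad arrays."""
    return elements * TRIAD_ARRAYS * BYTES_PER_DOUBLE


if __name__ == "__main__":
    # Hypothetical per-core cache sizes, used purely as examples.
    for name, size in [("L1", 32 * 1024), ("L2", 256 * 1024)]:
        n = triad_elements_for_cache(size)
        print(name, n, triad_footprint_bytes(n))
```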

STREAM:
The STREAM benchmark measures the sustained memory bandwidth from the main memory. For the processors tested, the available memory bandwidth is essentially determined by the number of memory controllers. Intel Xeon Broadwell and Skylake processors have four and six memory controllers per socket, respectively. The Cavium ThunderX2 processor has eight memory controllers per socket. The results in Figure 3 show a clear trend that Skylake achieves a 1.64× improvement over Broadwell, which is to be expected, given Skylake's faster memory.
The use of nontemporal store instructions is an important optimization for the STREAM benchmark, as Raman et al showed, where, for the Triad kernel, the use of nontemporal store instructions resulted in a 37% performance improvement. 19 On the Intel architecture, using these instructions for the write operation in the STREAM kernels ensures that the cache is not polluted with the output values, which are not reused; note that it is assumed that the arrays are larger than the last-level cache. As such, if these data occupied space in the cache, it would reduce the capacity available for the other arrays that are being prefetched into the cache. The construction of the STREAM benchmark with arrays allocated on the stack with the problem size known at compile time allows the Intel compiler to generate nontemporal store instructions for all the Intel architectures in this study. Although the GCC compiler does not generate nontemporal stores for the Cavium ThunderX2 architecture (in fact, it cannot generate nontemporal store instructions for any architecture), the implementation of these instructions within the ThunderX2 architecture does not result in a bypass of cache. Instead, the stores still write to the L1 cache, but in a way that exploits the write-back policy and the least recently used eviction policy to limit the disruption on cache.
As such, this may be limiting the achieved STREAM performance on ThunderX2, as memory bandwidth is being wasted, evicting the output arrays of the kernels.
Despite this lack of true streaming stores, it is clear that the additional memory controllers on the ThunderX2 processors provide a clear external memory bandwidth advantage over Broadwell and Skylake processors.
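The effect of streaming stores on memory traffic can be made concrete with a small sketch of the Triad kernel and its byte accounting. This is an illustrative model, not the benchmark code used in the study: it assumes 8-byte elements and a simple write-allocate cache in which every stored line is first read from memory. Pure Python is far too slow to measure real bandwidth, so only the kernel semantics and the traffic model are shown.

```python
def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, write_allocate, bytes_per_element=8):
    """Bytes moved to and from main memory for one Triad pass.

    Without write-allocate (streaming stores): read b, read c, write a.
    With write-allocate: the output array is also read before being written.
    """
    streams = 4 if write_allocate else 3
    return streams * bytes_per_element * n


if __name__ == "__main__":
    n = 8
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    triad(a, b, c, 3.0)
    print(a[0])
    extra = triad_bytes(n, True) / triad_bytes(n, False) - 1.0
    print(f"extra traffic without streaming stores: {extra:.0%}")
```

In this simplified model, a cache that allocates on write moves one third more data than the three streams STREAM counts, which is one way to see why the absence of true streaming stores may cap the reported bandwidth.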

CloverLeaf:
The normalized results for the CloverLeaf mini-app in Figure 3 are very consistent with those for STREAM on the Intel Xeon and Cavium ThunderX2 processors. CloverLeaf is a structured grid code, and a majority of its kernels are bound by the available memory bandwidth.
It has been shown previously that the memory bandwidth increases from GPUs result in proportional improvements for CloverLeaf. 4 The same is true on the processors in this study, with the improvements on ThunderX2 coming from its greater memory bandwidth. Therefore, for structured grid codes, we indeed see that the runtime is proportional to the external memory bandwidth of the system, and ThunderX2 provides the highest bandwidth out of the processors tested.
It has been noted that the time per iteration increases as the simulation progresses on the ThunderX2 processor. This phenomenon is not noticeable on the x86 processors. We believe this is due to the data-dependent need for floating-point intrinsic functions (such as abs, min, and sqrt); this can be seen in the viscosity kernel, for instance. As the simulation progresses, the need for such functions grows, and therefore, the kernel's runtime increases. Although these kernels are memory bandwidth bound, the increased number of floating-point operations raises the computational intensity, and so the kernels become slightly less bandwidth bound (under the Roofline model).
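The Roofline argument above can be sketched in a few lines. The bandwidth of 253 GB/s is the STREAM Triad figure reported for ThunderX2 in this paper; the peak floating-point rate of 1000 GFLOP/s and the two arithmetic intensities are hypothetical values chosen purely to illustrate the model.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Roofline model: performance is capped by compute or by memory traffic.

    arithmetic_intensity is in FLOPs per byte moved from main memory.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity above which a kernel becomes compute bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    peak = 1000.0       # hypothetical peak, GFLOP/s
    bandwidth = 253.0   # STREAM Triad bandwidth quoted in this paper, GB/s
    for intensity in (0.1, 0.2):  # hypothetical kernel intensities
        print(intensity, attainable_gflops(intensity, peak, bandwidth))
    print(round(ridge_point(peak, bandwidth), 2))
```

A kernel whose intensity rises but stays below the ridge point remains bandwidth bound, yet each byte moved now has to pay for more arithmetic, which is consistent with the behavior described for the viscosity kernel.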

TeaLeaf:
The TeaLeaf mini-app again tracks the memory bandwidth performance of the processors, as previously shown on x86 and GPU architectures. 5 The additional memory bandwidth of the ThunderX2 processor clearly improves the performance over those processors with fewer memory controllers.

SNAP:
Understanding the performance of the SNAP proxy application is difficult. 7 If truly memory bandwidth bound, the extra bandwidth available on the ThunderX2 processor would increase the performance as with the other memory bandwidth-bound codes discussed in this study; however, the ThunderX2 processor is almost 30% slower than Broadwell for the tested problem for this code. The Skylake processor does give an improvement, however, and while this does have memory bandwidth improvements over Broadwell, this is not the only significant architectural change. Skylake has 512-bit vectors, which creates a wide data path through the cache hierarchy: a whole cache line is loaded into registers for each load operation. In comparison, Broadwell has 256-bit vectors, and ThunderX2 has 128-bit vectors, moving half and a quarter of a 64-byte cache line per load operation, respectively.
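The relationship between vector width and the portion of a cache line moved per load follows directly from the sizes quoted above. As a short sketch, using only the vector widths and the 64-byte line size stated in the text:

```python
CACHE_LINE_BYTES = 64


def cache_line_fraction(vector_bits):
    """Fraction of a 64-byte cache line moved by one full-width vector load."""
    return (vector_bits // 8) / CACHE_LINE_BYTES


if __name__ == "__main__":
    for name, bits in [("Skylake", 512), ("Broadwell", 256), ("ThunderX2", 128)]:
        print(name, cache_line_fraction(bits))
```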
The main sweep kernel in the SNAP code requires that a cache hierarchy support both accessing a very large data set and simultaneously keeping small working set data in low levels of cache. The CrayPat profiler reports the cache hit rates for L1 and L2 caches when using the Cray Compiler. On ThunderX2, the hit rates for the caches are both at around 84%, which is much higher than the hit rate for the STREAM benchmark (where the latter is truly main memory bandwidth bound). As such, this shows that the SNAP proxy application is heavily reusing data in cache, and so, the performance of the memory traffic to and from cache is a key performance factor. On the Broadwell processors in Isambard Phase 1, CrayPat reports cache hit rates of 89.4% for L1 and 24.8% for L2. Again, the L1 cache hit rate is much higher than in the STREAM benchmark, indicating high reuse of data in the L1 cache, but that the L2 is not used as efficiently. It is the subject of future work to understand how the data access patterns of SNAP interact with the cache replacement policies in these processors. However, for this study, it is clear that main memory bandwidth is not necessarily the performance limiting factor, but rather, it is the access to the cache where the x86 processors have an advantage.
Neutral: In previous work, it was shown that the Neutral mini-app has algorithmically little data reuse, due to the random access to memory required for accessing the mesh data structures. 8 Additionally, the access pattern is data driven and thus not predictable, and so, any hardware prefetching of data into cache according to common access patterns is likely to be ineffective, resulting in a relatively high cache miss rate.
Indeed, CrayPat shows a low percentage (27.3%) of L2 cache hits on the ThunderX2 processor and a similar percentage on Broadwell. The L1 cache hit rate is high on both architectures, with over 95% hits. As such, the extra memory bandwidth available on the ThunderX2 processor does not provide an advantage over the Intel Xeon processors. Note from Figure 3 that the ThunderX2 processor still achieves performance close to that of Broadwell, with Skylake 28c only offering a small performance improvement of about 29%.

Mini-app performance summary
Many of these results highlight the superior memory bandwidth offered by ThunderX2's eight memory channels, which deliver 253 GB/s for the STREAM Triad benchmark, 2 which is twice that of Broadwell and 18% more than Skylake. This performance increase can be seen in the memory bandwidth bound mini-apps such as CloverLeaf and TeaLeaf, with the ThunderX2 processor showing similar improvements to STREAM over the x86 processors.
The SNAP and Neutral mini-apps, however, rely more on the on-chip memory architecture (the caches), and so, they are unable to leverage the external memory bandwidth on all processors. As such, the additional memory controllers on the ThunderX2 processors do not seem to improve the performance of these mini-apps relative to processors with less memory bandwidth. The exact nature of the interaction of these algorithms with the cache hierarchy is the subject of future study.
Figure 4 compares the performance of dual-socket Broadwell, Skylake, and ThunderX2 systems for the real application workloads described in Section 3.2.

Applications
CP2K: CP2K comprises many different kernels that have varying performance characteristics, including floating-point-intensive routines and those that are affected by external memory bandwidth. While the floating-point routines run up to 3× faster on Broadwell compared to ThunderX2, the improved memory bandwidth and higher core counts provided by ThunderX2 allow it to reach a ∼15% speedup over Broadwell for this benchmark. The 28c Skylake processor provides even higher floating-point throughput and closes the gap in terms of memory bandwidth, yielding a further 19% improvement over ThunderX2. The H2O-64 benchmark has been shown to scale sublinearly when running on tens of cores, 20 which impacts the improvements that ThunderX2 can offer for its 64 cores.

GROMACS:
The GROMACS performance results are influenced by two main factors. First, the application is heavily compute bound, and the x86 platforms are able to exploit their wider vector units and wider datapaths to cache. Performance does not scale perfectly with vector width due to the influence of other parts of the simulation, particularly the distributed FFTs. Second, because GROMACS uses hand-optimized vector code for each platform, x86 benefits from having the more mature implementation, one that has evolved over many years. Since Arm HPC platforms are new, it is likely that the NEON implementation is not yet at peak efficiency for ThunderX2.

NAMD:
As discussed in Section 3.2, NAMD is not clearly bound by a single factor, and thus, it is hard to underline a specific reason why one platform is slower (or faster) than another. It is likely that results are influenced by a combination of memory bandwidth, compute performance, and other latency-bound operations. The results observed do correlate with memory bandwidth, making Broadwell the slowest platform of the three for this application. Running more than one thread per core in SMT increased NAMD performance, and this is the most pronounced on ThunderX2, which can run four hardware threads on each physical core. Furthermore, due to NAMD's internal load balancing mechanism, the application is able to efficiently exploit a large number of threads, which confers yet another advantage to ThunderX2 for being able to run more threads (256) than the x86 platforms (112 on Skylake 28c). As a result, while ThunderX2 32c does not quite match the top-bin Skylake 28c, it does outperform the mainstream Skylake 20c by about 18%.
FIGURE 4 Comparison of Broadwell, Skylake, and ThunderX2 for a set of real application codes. Results are normalized to Broadwell
NEMO: For the NEMO benchmark, ThunderX2 is 1.49× faster than Broadwell 22c and is slightly faster than Skylake 20c, while not quite matching Skylake 28c. While the benchmark should be mostly memory bandwidth bound, leading to significant improvements over Broadwell, the greater on-chip cache bandwidth of Skylake gives the top-bin part a slight performance advantage over ThunderX2. Running with multiple threads per core to ensure that the memory controllers are saturated provides a small improvement for ThunderX2.
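The hardware thread counts quoted in the NAMD discussion follow from the socket, core, and SMT configuration of each dual-socket node. A minimal sketch is given below. The four threads per core for ThunderX2 comes from the text, whereas the two threads per core for Skylake is the standard Hyper-Threading configuration and is an assumption consistent with the quoted total of 112.

```python
def hardware_threads(sockets, cores_per_socket, threads_per_core):
    """Total hardware threads available on one node."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    print("ThunderX2 32c:", hardware_threads(2, 32, 4))
    print("Skylake 28c:", hardware_threads(2, 28, 2))
```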

OpenFOAM:
The OpenFOAM results follow the STREAM behavior of the three platforms closely, confirming that memory bandwidth is the main factor that influences performance here. With its eight memory channels, ThunderX2 yields the fastest result, at 1.87× the Broadwell performance. Skylake runs 1.57× to 1.66× faster than Broadwell, ie, a bigger difference than in plain STREAM, likely because it gains additional benefit from its improved caching, which is not a factor in STREAM. This benchmark strongly highlights ThunderX2's strength: performance can be improved significantly by higher memory bandwidth.
OpenSBLI: The OpenSBLI benchmark exhibits a similar performance profile to OpenFOAM, providing another workload that directly benefits from increases in external memory bandwidth. The ThunderX2 system produces speedups of 1.69× and 1.21× over Broadwell 22c and Skylake 20c, respectively, and almost matches Skylake 28c.
Unified Model: Comprising 2 million lines of Fortran, the Unified Model is arguably the most challenging of the benchmarks used in this study, stressing the maturity of the compilers as well as the processors themselves. Skylake 28c only yields a 19% improvement over Broadwell 22c, indicating that the performance of this test case is not entirely correlated to memory and cache bandwidth or floating-point compute and that the relatively low-resolution benchmark may struggle to scale efficiently to higher core counts. The ThunderX2 result is around 8% slower than our top-bin Broadwell, but demonstrates the robustness of the Cray software stack on Arm systems by successfully building and running without requiring any modifications. Interestingly, when running on just a single socket, ThunderX2 provides a ∼ 15% improvement over Broadwell.
We also observed performance regressions in the more recent versions of CCE on all three platforms; the Broadwell result was fastest using CCE 8.5, which could not be used for either Skylake or ThunderX2.

VASP:
The calculations performed by the VASP benchmark are dominated by floating-point-intensive routines, which naturally favor the x86 processors with their wider vector units. While the higher core counts provided by ThunderX2 make up for some of the difference, the VASP benchmark exhibits a similar profile to GROMACS, with ThunderX2 around 24% slower than Broadwell 22c, and sitting around half the speed of Skylake 28c.

Compiler comparison
Figure 5 compares the latest versions of the three available compilers on the ThunderX2 platform, normalized to the best performance observed for each benchmark. The benefit of having multiple compilers for Arm processors is clear, as none of the compilers dominates performance, and no single compiler is able to build all of the benchmarks. The performance for the mini-apps is broadly similar across all of the compilers, with 15%-20% variations for SNAP and Neutral, whose more complex kernels draw out differences in the optimizations applied by the compilers. The Arm HPC compiler uses the LLVM-based Flang, a relatively new Fortran frontend, which, at the time of writing, produces an internal compiler error while building CP2K. Both CP2K and GROMACS crash at runtime when built with CCE; this issue also occurs on the Broadwell system and, hence, is not specific to Arm processors. While the NAMD benchmark builds and runs correctly with GCC 7, it currently hangs after initialization with GCC 8. At the time of writing, it is unclear whether these issues are a result of bugs in the applications themselves or miscompilations by the compilers. NAMD failed to build with CCE because Charm++ uses some inline assembly syntax that is not yet supported by the Cray Compiler.
OpenFOAM exhibits multiple syntax errors in its source code, which are only flagged as issues by GCC 8 and CCE. The largest performance difference we observed between the compilers was with the OpenSBLI benchmark, where the code generated by CCE is 2.5× faster than that generated by any of the other compilers. On x86, performance is much closer between all of the compilers, and some other testing we have performed leads us to believe that this performance discrepancy is most likely due to MPI library differences between the CCE and non-CCE builds of OpenSBLI.
Investigations are continuing, and we believe that GCC and Arm will come much closer to CCE once the issue has been resolved.
FIGURE 5 Efficiency of different compilers running on Cavium ThunderX2. The BUILD and CRASH labels denote configurations that either failed to build or crashed at runtime, respectively. The "-" indicates that the build configuration was not supported by the benchmark at the time of writing

Performance per Dollar
So far, we have focused purely on performance, but one of the main advantages of the Arm ecosystem for HPC should be the increased competition it brings. While Performance per Dollar is hard to quantify rigorously and price is always subject to negotiation, we can still reveal some illuminating information by using the published list prices of the CPUs to compare the respective Performance per Dollar of the processors. In the results to follow, we only consider the CPU list prices. This deliberately leaves every other factor out of the comparison; factoring our results into a subjective whole system cost is left as an exercise for the expert reader. Figures 6 and 7 show the Performance per Dollar of our platforms of interest, normalized to Broadwell 22c. The numbers are calculated by taking the application performance numbers shown in Figures 3 and 4 and simply dividing them by the published list prices described in Section 4.1, before renormalizing against Broadwell 22c. There are a number of interesting points to highlight. First, considering the mini-apps in Figure 6, we see that ThunderX2's RRP of just $1795 gives it a compelling advantage compared to all the other platforms. ThunderX2 consistently comes ahead of not just the top-bin (and therefore expensive) Skylake 28c but also of the more cost-conscious, mainstream Skylake 20c SKU. The picture for real applications in Figure 7 is similar: even in cases where ThunderX2 had a performance disadvantage, such as in the compute-bound codes GROMACS and VASP, it becomes competitive once cost is considered. Where ThunderX2 was already competitive on performance, adding in cost makes it look even more competitive, often achieving a performance/price advantage of 2× or more over the more cost-oriented Skylake 20c SKUs.
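The normalization described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the scripts used to produce Figures 6 and 7. The function name `perf_per_dollar` is hypothetical, and the input values are placeholders: the only figure taken from the text is the ThunderX2 list price of $1795, while the Broadwell price and both performance values are invented for the example.

```python
def perf_per_dollar(perf, price, baseline="Broadwell 22c"):
    """Return Performance per Dollar for each platform, renormalized to `baseline`.

    perf  -- dict mapping platform name to application performance
             (higher is better; any consistent unit)
    price -- dict mapping platform name to CPU list price in dollars
    """
    # Step 1: divide performance by list price.
    raw = {name: perf[name] / price[name] for name in perf}
    # Step 2: renormalize so that the baseline platform scores exactly 1.0.
    return {name: value / raw[baseline] for name, value in raw.items()}


# Placeholder inputs for illustration only. These are NOT the measured
# results or the Section 4.1 list prices, except for the $1795 ThunderX2
# price quoted in the text.
example_perf = {"Broadwell 22c": 1.0, "ThunderX2": 0.8}
example_price = {"Broadwell 22c": 4000.0, "ThunderX2": 1795.0}

ratios = perf_per_dollar(example_perf, example_price)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

By construction the baseline platform always maps to 1.0, so a value above 1.0 means that a platform delivers more performance per list-price dollar than Broadwell 22c, even if its absolute performance is lower.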

Performance summary
Overall, the results presented in this section demonstrate that the Arm-based Cavium ThunderX2 processors are able to execute a wide range of important scientific computing workloads with performance that is competitive with state-of-the-art x86 offerings. The ThunderX2 processors can provide significant performance improvements when an application's performance is limited by external memory bandwidth, but are slower in cases where codes are compute bound. When processor cost is taken into account, ThunderX2's proposition is even more compelling. With multiple production-quality compilers now available for 64-bit Arm processors, the software ecosystem has reached a point where developers can have confidence that real applications will build and run correctly, in the vast majority of cases with no modifications.
Some of the applications tested highlight the lower floating-point throughput and lower L1/L2 cache bandwidth of ThunderX2. Both of these characteristics stem from the narrower vector units relative to AVX-capable x86 processors. In 2016, Arm unveiled the Scalable Vector Extension ISA, 21 which will enable hardware vendors to design processors with much wider vectors of up to 2048 bits, compared to the 128 bits of today's NEON. We therefore anticipate that the arrival of SVE-enabled Arm-based processors in the next couple of years will likely address most of the issues observed in this study, enabling Arm processors to deliver even greater performance for a wider range of workloads.
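To put the vector widths in context, the short sketch below counts how many double-precision values fit in a single vector register at each width. The 128-bit NEON and 2048-bit SVE figures come from the text; the 256-bit AVX2 and 512-bit AVX-512 widths are added here for comparison. The helper name `lanes` is hypothetical, and lane count is only one ingredient of peak throughput, which also depends on the number of vector units per core and the clock speed.

```python
def lanes(vector_bits, element_bits=64):
    """Number of elements of `element_bits` that fit in one vector register."""
    return vector_bits // element_bits


# Vector register widths in bits. NEON and the SVE maximum are quoted in the
# text; the AVX2 and AVX-512 widths are added here for comparison.
widths = {
    "NEON": 128,
    "AVX2": 256,
    "AVX-512": 512,
    "SVE (architectural maximum)": 2048,
}

fp64_lanes = {name: lanes(bits) for name, bits in widths.items()}
for name, count in fp64_lanes.items():
    print(f"{name}: {count} double-precision lanes per vector")
```

At 128 bits, a NEON register holds two double-precision values, whereas an implementation of SVE at the architectural maximum would hold 32.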

REPRODUCIBILITY
With an architecture such as Arm, which is new to mainstream HPC, it is important to make any benchmark comparison as easy to reproduce as possible. To this end, the Isambard project is making all of the detailed information about how each code was compiled and run, along with the input parameters to the test cases, available as an open-source repository on GitHub. The build scripts will show which compilers were used in each case, what flags were set, and which math libraries were employed. The run scripts will show which test cases were used and how the runs were parameterized. These two sets of scripts should enable any third party to reproduce our results, provided that they have access to similar hardware. The scripts do assume a Cray-style system but should be easily portable to other versions of Linux on non-Cray systems.

CONCLUSIONS
The results presented in this paper demonstrate that Arm-based processors are now capable of providing levels of performance competitive with state-of-the-art offerings from the incumbent vendors, while significantly improving Performance per Dollar. The majority of our benchmarks compiled and ran successfully out of the box, and no architecture-specific code tuning was necessary to achieve high performance. This represents an important milestone in the maturity of the Arm ecosystem for HPC, where these processors can now be considered as viable contenders for future procurements. Future work will use the full Isambard system to evaluate production applications running at scale on ThunderX2 processors.
We did not address energy efficiency in this paper. Our early observations suggest that the energy efficiency of ThunderX2 is in the same ballpark as the x86 CPUs we tested. This is not a surprise: for a given manufacturing technology, a FLOP will take a certain number of Joules, and moving a byte a certain distance across a chip will also take a certain amount of energy. There is no magic, and the instruction set architecture has very little impact on the energy efficiency when most of the energy is being spent moving data and performing numerical operations.
Overall, these results suggest that Arm-based server CPUs that have been optimized for HPC are now genuine options for production systems, offering performance competitive with best-in-class CPUs, while potentially offering attractive price/performance benefits.

ACKNOWLEDGMENTS
As the world's first production Arm supercomputer, the GW4 Isambard project could not have happened without support from a lot of people. In particular, we thank the attendees of the first two Isambard hackathons, who did most of the code porting behind the results in this paper, and Federica Pisani from Cray, who organized the events.