An evaluation of the state of time synchronization on leadership class supercomputers

We present a detailed examination of time agreement characteristics for nodes within extreme‐scale parallel computers. Using a software tool we introduce in this paper, we quantify attributes of clock skew among nodes in three representative high‐performance computers sited at three national laboratories. Our measurements detail the statistical properties of time agreement among nodes and how time agreement drifts over typical application execution durations. We discuss the implications of our measurements, why the current state of the field is inadequate, and propose strategies to address observed shortcomings.


INTRODUCTION
The trend towards increasing node counts in high-performance computing (HPC) is motivating a move toward greater levels of concurrency in HPC systems. Today's software environment is now being called on to produce new solutions for emerging issues including managing system power, resilience, and performance characteristics. The distributed algorithms that underlie such services operate much more efficiently in the presence of tightly synchronized clocks. For example, tightly synchronized clocks benefit well-known gang scheduling techniques and complex consensus algorithms. To illustrate the point, such time synchronization enables more aggressive assumptions about communication and synchronization patterns, the removal of unnecessary locks, and a wide range of other applications. Clock-based techniques are already frequently deployed in cloud and data center distributed systems for precisely these reasons.
We examined the time synchronization on some of the world's fastest and most powerful machines. These leadership-class systems employ high-end hardware connected by an extremely low-latency, low-jitter, interconnect in a carefully controlled environment, in contrast to widely distributed cloud-based systems based on commodity hardware and networks. Because of this, we assumed that these systems would have more stable, predictable hardware clocks, and close base time agreement using only standard time synchronization systems like Network Time Protocol (NTP). We did not believe that the complex hardware and software techniques used to provide time synchronization in wide-area systems would be necessary in leadership systems.
Our results demonstrate that the actual time uncertainty for leadership-class machines is often unexpectedly large, in some cases over 600 milliseconds despite network latencies of less than two microseconds. Building on this, we set out to thoroughly quantify the magnitude of the time synchronization challenge in leadership-class systems. This study shows that the current time protocol in use, NTP, is not suitable for providing the level of time synchronization necessary for important system software tasks such as coordinated scheduling. Based on this, we conclude that more complex time synchronization techniques are in fact needed to provide tight time synchronization in these systems, and we discuss the specific techniques most appropriate to HPC systems according to our findings.

Timing terminology
A perfect clock, denoted by t, reports one second of elapsed time per second of real time and has an origin t = 0 at a specified previous instant. A given real clock is imperfect; it can have acceleration, temperature, and other environmental sensitivities. Using the notation found in Veitch et al., 1 a real clock reads C(t) at the true instant t and suffers from an error or offset which we denote as θ(t) = C(t) − t at true time t. The skew corresponds to the rate of change in clock offset: γ(t) = dθ(t)/dt. Uncertainty is the maximum offset between any pair of observed real clocks: U = max |C_i(t) − C_j(t)| for i, j in a set of clocks and t over a time interval under consideration. To synchronize is to coordinate the occurrence of multiple entities with respect to time. Two clocks are synchronized to a specified uncertainty if their measurements of the time of a single event at an arbitrary time differ by no more than that uncertainty.
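These definitions can be illustrated with a short sketch (ours, for illustration only; the clock readings below are hypothetical):

```python
def offset(C_t, t):
    # offset theta(t) = C(t) - t: error of a clock reading C(t) at true time t
    return C_t - t

def skew(theta0, theta1, t0, t1):
    # skew: rate of change of the offset between two samples
    return (theta1 - theta0) / (t1 - t0)

def uncertainty(readings):
    # uncertainty: maximum offset between any pair of clocks read at the same true time
    return max(readings) - min(readings)

# clock A reads 2 ms fast at t = 0 s and 4 ms fast at t = 10 s
theta_a0 = offset(0.002, 0.0)
theta_a1 = offset(10.004, 10.0)
skew_a = skew(theta_a0, theta_a1, 0.0, 10.0)  # ~0.0002 s/s, i.e., 200 us of drift per second

# at t = 10 s, clock B reads 1 ms slow; pairwise uncertainty is 5 ms
u = uncertainty([10.004, 9.999])
```

Note that uncertainty grows over time whenever the clocks in the set have differing skews, which is why periodic resynchronization is needed.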
NTP stands for Network Time Protocol, and it is an Internet protocol used to synchronize the clocks of computers to some time reference. NTP is an Internet standard protocol originally developed by Professor David L. Mills at the University of Delaware. 2

Modern HPC architecture
The basic building blocks of modern HPC machines are nodes connected via a communications interconnect. A node usually consists of a processor or multiprocessor, memory, an interface to the interconnect and, optionally, a local disk. Modern interconnects are capable of sustaining concurrent transfers measuring gigabytes per second between multiple nodes. The typical latency between the request for a transfer and the first bytes appearing at the destination is on the order of a few microseconds or less. Figure 1 depicts a modern high-performance computer such as those found among the world's most powerful machines. 3 The machine features a large number of compute nodes that are able to pass information among each other via the interconnect. As their name would suggest, compute nodes are optimized for computation; they usually do not contain a local disk. For other aspects of computer activity (e.g., interactive user activity, disk drive accesses, and so forth), support is usually carried out by service nodes set aside for these necessary operations including login nodes and I/O nodes. For example, the Titan machine at Oak Ridge National Laboratory consists of 18,688 compute nodes and 20 service nodes that are used for interactive logins and access to the parallel file system.
The ability to efficiently handle timesharing within compute nodes has become a requirement for HPC machines. Traditionally, these supercomputers have operated in a manner designed to optimize the machine's resources. As such, application teams submit production-ready versions of applications to a batch queue of jobs. Each job request contains requirements for the number of compute nodes and the expected overall execution time. When the job reaches the top of the queue, it is given sole access to a matching number of compute nodes.

FIGURE 1 A typical high-performance computing architecture

While no other user time-shares the reserved compute nodes, there will be periods when other applications time-share with the application; such is the case for system software like file system daemons, health and maintenance daemons, and so forth. As described below, these necessary system tasks that time-share with the user application, termed noise or detours in the literature, may lead to undesirable performance impacts on the user application if left unchecked. Efficient timesharing on compute nodes is very important whether the need arises from enabling colocated threads (e.g., in situ analytics or coupled applications), or from accommodating computational support for an ongoing experiment, or simply from supporting system software daemons efficiently.
Current-generation supercomputing systems with relatively large node counts cause time synchronization issues to manifest as a large number of processes reporting skewed clocks, as described and quantified in this paper. The issue of fine-granularity time synchronization is also relevant to next-generation supercomputing systems, where system architectures are trending toward a relatively smaller count of nodes with each node containing a much higher degree of parallelism. The expectation is that the processes within a single node would likely all read consistent time from a single node-specific time source. Accordingly, it follows that across an entire next-generation supercomputer, large groups of processes would all agree on a value for the current time, with time skew observed between groups. The challenge is that a time perturbation on a single node can simultaneously affect a large number of processes participating in a computation, while the opportunity is that bringing the nodes into tight time synchronization with one another can quickly improve the time base for all processes in a computation.
Two more definitions will help to acquaint those unfamiliar with HPC to the basic environment. Message Passing Interface (MPI) is a standardized and portable message-passing system designed for the development of portable and scalable large-scale parallel applications. Finally, Bulk synchronous parallelism (BSP) algorithms are those that consist of concurrent execution of local computation across processes, and allow parallel progress through the use of synchronization points. 4

Contributions of this paper
To our knowledge, ours is the first paper that presents a detailed statistical analysis of clock agreement on high-performance computer environments. We introduce collected results from a number of tests conducted on some of the world's most powerful computers. We then analyze these results and how they might apply to different applications and scenarios. Finally, we discuss the implications of our measurements and why the current state of the field is inadequate, and we propose strategies to address observed shortcomings. The data associated with this paper are being made available online; the registered DOI value is 10.11578/1130048.
At the outset of the work presented in this paper, we expected to find a much lower degree of uncertainty among node clocks in large-scale supercomputers than our results uncovered. We believe that the reason time synchronization has been neglected to this point is due to the way that supercomputers are used by the research community versus how they are provided as resources by supercomputing centers. System software researchers are possibly in the best position to be able to leverage the benefits of tight synchronization of node clocks but are unlikely to be in a position to effect a change in how a given supercomputer is run. On the other hand, the operations personnel who are responsible for maintaining supercomputer systems may not recognize the value of tightly synchronized node clocks or know the degree to which node clocks disagree, mistakenly believing that the temperature- and power-controlled environment in data centers, along with the use of commodity protocols to synchronize time such as NTP, is enough. Our aim with this paper is to highlight these ideas, supported by our measurements and findings, to the supercomputing community.

Paper outline
The rest of this paper is organized as follows. Section 2 provides a more complete motivation for our line of study including its significance and a quantitative assessment of the impact of time synchronization to parallel applications. Section 3 describes work related to our own. Section 4 provides a series of relevant data measurements, followed by a discussion of these measures in Section 5. Section 6 presents recommendations stemming from our study. The paper closes with our future plans and a conclusion.

IMPORTANCE OF TIME AGREEMENT
The main motivation to this line of inquiry is efficiency, primarily in wall-clock runtime of parallel applications written in the bulk synchronous parallel (BSP) style. Efficiently employing concurrency is critical to success in HPC. With growth in clock frequency stalled, performance improvements are being achieved by an exponential increase in the number of processing elements per chip, an increase in the hardware threading per core, and an increase in the total node count. These trends underscore the importance of parallel speedup.
Other kinds of efficiency, such as in maintenance, machine uptime, and code development are impacted by time agreement, although these are more difficult to quantify. For example, resilience root cause analysis relies on good time agreement to inform log analysis when dependency is ambiguous. 5 Performance analysis tools also rely on time agreement for effective distributed debugging. 6 Finally, timed guarantees can be used to avoid some communication. Such clock-based techniques are already frequently deployed in cloud and data center distributed systems 7 ; higher-quality time synchronization would allow the use of these techniques in HPC systems. The negative consequences of high time uncertainty and the benefits of low time uncertainty are discussed in this section.

Improved clock synchronization significance
The lock-step progress of BSP style programs is bounded by the slowest task among the parallel task set. That is to say, the parallel application cannot advance until all tasks reach a common point. However, as node counts increase, the likelihood of one task experiencing an interruption becomes a key factor in performance; as the BSP per-loop iteration time decreases, the effects of even smaller noise perturbations increase, further emphasizing the usefulness of tightly synchronized clocks. [8][9][10] To mitigate the potentially disastrous hit on performance, sophisticated runtime systems and operating systems seek to overlap any disruption in progress across all compute nodes. When system software* is able to schedule all interrupting tasks on every node such that they occur simultaneously, the cascading effect on the parallel algorithm is kept in check, a strategy known as coordinated scheduling.
A number of works 10,11 have used coordinated scheduling-based techniques to minimize the impact of potentially interfering workloads. In this kind of solution, co-schedulers on each node ensure that the parallel tasks are scheduled at the same time. This can be achieved by coordinating interfering short-lived system tasks. Coordinated scheduling schemes face several technical challenges at various levels of scale. To support such schemes, future architectural designs for HPC systems must provide low-overhead synchronization/coordination mechanisms. The scheduler must guarantee precise CPU reservations for cooperative codes and accurate timing to run applications with coordination needs across nodes. 8,9 In this scenario, precise inter-node time agreement is critical. Moreover, the absence of suitable time agreement prevents effective gang scheduling and coordinated scheduling. 10,12-14
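The effect that coordinated scheduling guards against can be shown with a minimal simulation (our sketch, with hypothetical work and detour parameters, not a model taken from the cited works): each BSP iteration ends at a barrier, so the iteration takes as long as the slowest node.

```python
import random

def bsp_runtime(n_nodes, n_iters, work, detour, detour_prob, coordinated, seed=0):
    # total runtime of a BSP loop; each iteration is bounded by the slowest node
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iters):
        if coordinated:
            # all nodes take their detours at the same moment, so the barrier
            # absorbs at most one detour per iteration
            total += work + (detour if rng.random() < detour_prob else 0.0)
        else:
            # each node is interrupted independently; the barrier waits for
            # the slowest node, so one unlucky node delays everyone
            hit_any = any(rng.random() < detour_prob for _ in range(n_nodes))
            total += work + (detour if hit_any else 0.0)
    return total

t_coord = bsp_runtime(10_000, 1_000, work=1.0, detour=0.5, detour_prob=0.01, coordinated=True)
t_uncoord = bsp_runtime(10_000, 1_000, work=1.0, detour=0.5, detour_prob=0.01, coordinated=False)
# with 10,000 nodes, some node is interrupted in nearly every iteration, so the
# uncoordinated loop pays close to the full detour cost each time
```

The same toy model also shows why the problem worsens with scale: the probability that at least one node is interrupted per iteration approaches one as n_nodes grows.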

How precise is enough? The impact of time synchronization to applications
Each parallel application has its own sensitivity to noise, and therefore its own capacity to be aided by the low-time-variance strategy of coordinated scheduling. The extent of wall-clock speedup depends on how frequently synchronizing collectives (e.g., MPI_Barrier, MPI_Allreduce) are invoked within the application. At the upper bound, our earlier work demonstrated that a coordinated scheduling strategy can provide a speedup of 285% for sensitive bulk synchronous workloads. 15 Here, a sensitive application is one in which an MPI synchronizing collective call (e.g., MPI_Barrier, MPI_Allreduce, MPI_Allgather, and so on) is made at least once per iteration in an iterative simulation or algorithm. Previous works [16][17][18] have shown that interference sources have potentially greater impact on applications with fine-grained parallelism (i.e., applications with shorter per-iteration intervals). For example, in Seelam et al., 17 the authors show that jitter can generate slowdowns as high as 8% for applications with computation intervals of 100 ms and over 16% for applications with 10-ms computation intervals at 32 K CPUs.
The benefit of coordinated scheduling varies nonlinearly with time agreement and requires an uncertainty bound of several microseconds before attaining the best results. A figure reproduced from the work of Levy et al. 19 presents analytics synchronization levels that range from the perfectly coordinated case to the completely uncoordinated case for 64 K nodes. Different levels of synchronization are simulated by adding offsets to the time at which noise traces start on different analytics processes. 19 In this work, the authors show that the amount of synchronization needed to mitigate interference from system software and other workloads can vary depending on the characteristics of the application and the interference source. For high-impact noise sources (e.g., asynchronous checkpointing or co-scheduled analytics), synchronization to within 20 to 100 ms is necessary to reduce the performance impact on most HPC applications to below 10%. The same studies show that common application communication patterns can synchronize global activities to within approximately 100 ms but not necessarily more tightly. 19 While scheduling is often critical to parallel algorithms and the impact of harmful interruptions or detours should be minimized, some BSP algorithms can be made to operate in a more asynchronous manner. In perhaps the largest impact reported, Hammouda et al. studied a classic bulk synchronous implementation of an explicit stencil calculation in a range of environments spanning from the absence of random detours to an increasingly large presence of random detours. They demonstrated that refactored explicit stencil calculations ran 1 to 37 times faster than a traditional noise-sensitive algorithm in an uncoordinated timing environment. 20 Beyond HPC, the impact of timing on more general concurrency scenarios is also interesting.
Google's Spanner uses coordinated time to achieve scalability and guarantee correctness of its distributed concurrency features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads. 21 Liskov showed how synchronized time is used for generating authentication tickets in Kerberos, for ensuring cache consistency in an object persistence system, for ensuring atomicity in a distributed database, and for improving the performance of commit window negotiation in distributed filesystems. Liskov points out that the use of synchronized clocks in a distributed system can improve performance by replacing communication with local computation. 22

*System software here refers to the various software that supports the execution of the application, including the OS kernel, daemons, and programming system runtime libraries that manage resource allocation and provide basic system services to the application.

The time agreement tradeoff: interruptions versus time agreement precision
Given the importance of time agreement to parallel applications, one might think that simply relying on a client that frequently restores a common time base among a distributed set of nodes would be a simple solution. In theory, a subset of nodes with access to a reference clock could run a protocol with all other nodes to overcome the drift of each node's local clock. Indeed, such clients exist, but they frequently fail to provide the desired benefit due to a subtle design tradeoff: most client designs require frequent updates among all nodes to reach very high levels of time agreement, but frequent updates result in increased system software interrupts to carry out the client's message traffic. On the one hand, the advantages of very precise time agreement (i.e., the clocks on all nodes agree to within 1 microsecond) are very attractive to many parallel uses including consensus algorithms and parallel tools. On the other hand, the clients that enable precise time agreement may also negatively impact common operations such as synchronizing collectives through their heavy use of software interrupts.
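This tradeoff can be made concrete with a back-of-envelope model (our sketch; the drift rate and per-update interrupt cost below are hypothetical): the achievable uncertainty bound grows with the update interval, while the interrupt overhead shrinks with it.

```python
def worst_case_uncertainty(drift_rate, update_interval):
    # between updates, a clock can drift away from the reference at its drift rate
    return drift_rate * update_interval

def interrupt_overhead(update_cost, update_interval):
    # fraction of CPU time spent servicing the client's synchronization traffic
    return update_cost / update_interval

# hypothetical client: 10 us/s of local drift, 50 us of interrupt time per update
drift_rate, update_cost = 10e-6, 50e-6
tight = worst_case_uncertainty(drift_rate, 0.1)      # update every 100 ms -> 1 us bound
tight_cost = interrupt_overhead(update_cost, 0.1)    # but 0.05% of CPU lost to interrupts
loose = worst_case_uncertainty(drift_rate, 300.0)    # update every 300 s -> 3 ms bound
loose_cost = interrupt_overhead(update_cost, 300.0)  # negligible interrupt overhead
```

Under these assumed numbers, an update interval of a few hundred seconds keeps interrupt overhead negligible but lets the worst-case uncertainty grow to milliseconds.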
Ferreira et al. studied the impact of harmful system software interrupts or detours to BSP applications. 16 They demonstrated that a 2.5% net processor noise rate (interruption rate) at 10,000 nodes can have no impact or can result in over a factor of 20 slowdown for the same application, depending solely on how the noise is generated. Their study featured important DOE scientific applications including the CTH shock physics simulation, the POP parallel ocean simulation program, and SAGE, SAIC's adaptive grid Eulerian hydrocode. Mitigating such noise rates would require a bounded time uncertainty of less than 5 microseconds.
Hoefler et al. showed the scale at which noise becomes a bottleneck for three applications: SWEEP3D, AMG, and POP. 23 SWEEP3D solves a neutron transport problem on a 3D Cartesian geometry, and AMG is an algebraic multigrid solver. They found that the impact of uncoordinated interrupts at 16,000 processes ranges from a 4% slowdown (AMG) to over 100% slowdown (SWEEP3D and POP). To avoid these slowdowns, a time uncertainty bound of several microseconds would be required.
In our measurements described below, the time agreement clients on each distributed node were configured to perform agreement consensus every few hundred seconds with the hope that any localized drift during the period between updates would be inconsequential. As we show below, the time agreement for this strategy was surprisingly poor.

Relevant questions related to time agreement in HPC environments
A detailed understanding of the degree and characteristics of clock agreement is useful for the above lines of inquiry. Moreover, environments that include parallelism, distributed agents, or coordination may be concerned with subtle aspects of clock synchronization, including
• What does the distribution of offsets from a reference clock look like? What are the statistical properties?
• What is the magnitude of time uncertainty across a set of clocks?
• How can we reason about clock skew and synchronization over short and long durations?
• What do these results mean for various applications?

RELATED WORK
Clock synchronization in distributed systems is an important problem that impacts design decisions in a wide range of applications. To that end, clock synchronization has been well surveyed and has a long history with many important contributions. [24][25][26][27] This is in keeping with the wide range of usage models that benefit from synchronized clocks. For example, software updates rely on accurate file time stamps across distributed nodes to support a wide range of activities from system administration configuration tracking to software development team code repository tracking. These file time stamp usage cases typically depend on agreement within a few seconds. Parallel environments like those found in high-end computing have much stricter requirements. For example, application development tools like message trace analysis tools can show problematic message-traffic patterns, but their effectiveness is undermined without tight enough constraints on time variance between nodes. 28 Another important potential use for high-performance distributed clocks is found in the general area of system software. For example, file system tokens used for flow control and metadata management benefit from skews held to a few dozen microseconds or less. 29

Perhaps one of the best known clock synchronization systems is NTP. 2 NTP uses a hierarchical system of levels of clocks, with the highest level ("Stratum 0") considered as the reference and in practice typically being atomic clocks or GPS clocks. A node in the network usually synchronizes its local clock against one or more clocks in the same stratum or the stratum one level higher. If the remote clocks are accessed over the public Internet, the node can usually maintain its clock to within ten milliseconds of the reference clock; accuracies of one millisecond are possible if a time server is available on the local intranet. 30 Hierarchical approaches trade scalability for relaxed tolerance to variance: each leaf node is no longer necessarily synchronized with the Stratum 0 reference, and there is no guarantee that the Stratum 0 reference is exactly in sync with lower-level strata (only with the NTP protocol itself). Mills provides a survey of the characteristics of roughly 10,000 machines on the Internet, 31 using a specially equipped test bed with a GPS receiver to measure delay characteristics. The findings among 1861 public time servers showed that synchronization protocols were able to achieve a median error of 2 to 5 milliseconds, but with a long tail.
Google's Spanner 21 is a scalable, globally distributed database that directly addresses challenges involving scalability and consistency of trillions of rows of data that span multiple data centers. An API called TrueTime provides each participating process access to a time interval (T_earliest, T_latest) with a bounded uncertainty and an absolute time that is guaranteed to fall within the interval. The uncertainty increases between time synchronizations, which the authors report as happening at approximately 30-second intervals. The resulting worst-case time uncertainty is in the range of 1 to 7 milliseconds (a drift rate of approximately 200 microseconds/second). Spanner synchronizes time via GPS and atomic clock time references.
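The behavior described above can be sketched as a small class (our illustration of a TrueTime-style interface, not Google's implementation; the parameter values follow the figures reported in the text):

```python
class BoundedTimeInterval:
    # now() returns an interval guaranteed (with high probability) to contain
    # the true time; the interval widens between synchronizations
    def __init__(self, base_uncertainty, drift_rate):
        self.base_uncertainty = base_uncertainty  # uncertainty just after a sync
        self.drift_rate = drift_rate              # assumed worst-case local drift
        self.last_sync = 0.0

    def now(self, local_time):
        # uncertainty grows linearly with the time elapsed since the last sync
        eps = self.base_uncertainty + self.drift_rate * (local_time - self.last_sync)
        return (local_time - eps, local_time + eps)

    def synchronize(self, local_time):
        # a successful sync resets the interval to its base width
        self.last_sync = local_time

# ~1 ms just after a sync, drifting at ~200 us/s, resynchronized every ~30 s
tt = BoundedTimeInterval(base_uncertainty=1e-3, drift_rate=200e-6)
earliest, latest = tt.now(30.0)  # one-sided uncertainty: 1 ms + 30 s * 200 us/s = 7 ms
```

A caller that must order two events can simply wait until the intervals no longer overlap, which is the essence of how Spanner achieves external consistency.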
In Liskov, 22 Liskov provides a survey of applications of synchronized clocks in distributed systems. A key observation is that clock synchronization cannot be provided absolutely, only with some high probability of falling within a bounded range. Our work presented in this paper provides empirical measurements for clock skew bounds, and we believe that this is important toward developing an understanding of the types of applications that are possible.
In Maillet and Tron, 37 the authors present statistical methods to estimate the relationship between the values of different clocks. Two algorithms proposed in the literature are analyzed; the sample data needed by the estimators are described and the qualities of the estimators are compared.
They note a tradeoff in the design of statistical clock synchronization methods: a long interval of data collection is useful for reducing the impact of noise in the delay measurements, but the accompanying overhead is undesirable to other workloads which may be impacted by the delay to achieve time synchronization. To address this issue, the authors propose using a sample before and after methodology when possible.
In Gurewitz et al. 38 by Gurewitz, Cidon, and Sidi, a more complex peer-to-peer approach for time synchronization is presented. The approach, referred to as the Classless Time Protocol or simply CTP, is non-hierarchical and based on peer node exchanges. The CTP design explicitly tunes to minimize a global network-wide cost function. The authors assume an environment with significant differences from our intended environment: (1) CTP assumes a generalized network with no assumptions on uniformity of links (other than that link delays cannot be negative); (2) they assume no clock skew among non-master clocks, although they reference techniques that may remove skew. With these underlying premises, CTP uses minimum one-direction delays and an assumption that clock offsets cancel in circular paths. The authors of CTP state in Gurewitz et al. 39 that one-way delay is preferable because round-trip delays may be subject to route differences between outbound and return. Our approach differs from these generalized approaches in several key ways. First, in our unique environments, we are only concerned with symmetric networks within a supercomputer and dedicated hardware. Unlike a generalized network, we have guarantees on routing, symmetry, and workload, which lead to round-trip times that are extremely consistent and three or more orders of magnitude faster than arbitrary networks. The magnitude of round-trip times in our network environment is orders of magnitude below the clock offsets, so the additional precision potentially gained with the CTP approach (which would still need a skew removal technique) would not change our conclusions.
Generalized network strategies like CTP utilize statistical estimation techniques to determine the message delay. When the variable delay in message delivery is Gaussian distributed, the optimal parameter estimator is relatively easy to derive. However, when the variable delay is not Gaussian, a more sophisticated technique is required. In Jeske, 40 Jeske addresses the delay characteristics for exponentially distributed delays.
Jeske develops the maximum likelihood estimator (MLE), which features particularly attractive characteristics due to its asymptotic properties: it is unbiased and achieves the Cramer-Rao bound (CRB) for joint skew and offset estimation given a large enough number of data samples.
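The idea can be sketched as follows (our simplified illustration, estimating offset only, whereas Jeske's estimator covers joint skew and offset): under exponentially distributed one-way delays, the minimum observed delays are the key statistics, and the offset estimate is their half-difference.

```python
import random

def mle_offset(forward, backward):
    # forward[i]  = t_slave_recv - t_master_send = theta + d_f  (offset plus delay)
    # backward[i] = t_master_recv - t_slave_send = -theta + d_b (delay minus offset)
    # with exponential delays, the offset MLE uses the minimum of each sample set
    return (min(forward) - min(backward)) / 2.0

rng = random.Random(1)
theta = 0.003          # slave clock 3 ms ahead of master (simulated ground truth)
mean_delay = 0.0005    # one-way delays drawn from Exp(mean 0.5 ms)
forward = [theta + rng.expovariate(1 / mean_delay) for _ in range(50)]
backward = [-theta + rng.expovariate(1 / mean_delay) for _ in range(50)]
estimate = mle_offset(forward, backward)  # close to the true 3 ms offset
```

The estimator's error shrinks as more samples are collected, because the minimum of n exponential delays concentrates near zero at rate 1/n.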
In Doleschal et al., 41 Doleschal et al. introduce a two-step clock synchronization process suitable for program development tools such as a parallel event tracer. First, local times are captured from the independent clocks running on separate nodes of a large parallel machine. Second, a post mortem step is used after the completion of the application to reconcile the clock differences found within a trace log and restore the relationship of concurrent events. To minimize the impact on the workload being traced, local clocks are accessed during operations which are performed on all nodes (thereby reducing the likelihood of imbalanced overhead). The authors are able to take advantage of certain constraints introduced in the environment to perform the post mortem step (e.g., a message must be sent before it is received). Using a strategy based on the aforementioned CTP method, the post mortem step is able to display events in their corrected order and relation.
In our previous work, 42,43 we designed a clock synchronization protocol that yields excellent time agreement precision while controlling the deleterious effects of client interrupts. We were able to achieve agreement levels of a few microseconds. Our technique leverages the MPI profiling interface (PMPI) to intercept calls to collective operations in order to piggyback all necessary protocol traffic.

The work we present in this paper differs from the related work in two ways. First, the work presented here is not a mechanism for synchronizing node clocks such as NTP, RADclock, or Precision Time Protocol. Second, the work presented here is not a piece of system software that leverages tightly synchronized node clocks such as Spanner or the work presented in Liskov's survey. Instead, the work presented here is a technique for measuring node clock agreement in supercomputers, along with the resulting measurements.

IMPLEMENTATION DETAILS AND DATA MEASUREMENTS
In this section, we describe the test environment, our method for collecting the measurements, and characterize our measurements.

Test environment
We conducted our experiments on three different supercomputers: Titan, Edison, and Mira. Three different interconnects were explored: Titan uses a Cray Gemini interconnect, Edison uses a Cray Aries interconnect, and Mira uses a Blue Gene/Q interconnect (see Figure 3).

Titan
Titan is located at Oak Ridge National Laboratory. 44 A Cray XK7, it features a hybrid architecture with a peak performance of 27 Petaflops, containing both 16-core AMD Opteron CPUs and NVIDIA Kepler GPUs. 45 Titan incorporates 18,688 compute nodes and 710 TB of memory.
Titan is based on Cray's Gemini interconnect with a three-dimensional torus topology (see Figure 3). This 3D torus has the dimensions Z=24, X=25, Y=16 (that is, 24 blades per cabinet, 25 cabinets, 2*8 rows). In Gemini, near-neighbors are given more bandwidth than farther clients, which receive a geometrically smaller share of bandwidth according to fan-in/distance. This means that for large machines, unless each link has adequate bandwidth for the traffic pattern, the chances of contention increase as messages travel farther; large fan-in communications therefore can pose significant challenges for such topologies. End point and per-hop latencies for Gemini are presented in Table 1. 46

Edison
Edison is a 5576-node Cray XC30 cluster with a total of 357 TB of memory. 47 Each node contains two 12-core Intel® Xeon® E5-2695 (Ivy Bridge) processors, giving the Edison compute partition a total of 133,824 processor cores.

Edison uses Cray's Dragonfly topology; systems can be configured to meet bandwidth requirements by varying the number of optical connections. The interconnect of Edison is configured with up to 240 Blue links using 60 optical cables, 4 links per cable, between each cabinet; there are fifteen two-cabinet groups.
Bidirectional bandwidth for two Aries nodes at 4-K message size is approximately 14.3 GBytes/s. Peak global bandwidth is 11.7 GBytes/s per node for a full network; with a payload efficiency of 64% percent this equates to 7.5 GBytes/s per direction. 48 This topology enables a higher bandwidth and reduced latency in comparison with Gemini. Table 1 shows the end point and per-hop latencies for Aries. 49

Mira
The Mira supercomputer is a 49,152-node IBM Blue Gene/Q system located at Argonne National Laboratory. The machine has a total of 786,432 cores, 786 TB of memory, and a peak rate of 10 petaflops. Each node is outfitted with 16 GB of RAM and a 1.6-GHz, 16-core PowerPC A2 processor. The whole system is housed in 48 cabinets.
Mira uses a proprietary 5D torus interconnect. Each node has 10 chip-to-chip links with a bidirectional bandwidth of 4 GB/s. 50 Prior measurements 51 show that latency on Mira is not significantly affected by the distance between nodes. Table 1 shows the end-to-end and per-hop latencies for directly connected nodes and for nodes separated by the longest distance (96-rack system, 31 hops). 52

Timeline: a tool for evaluating clock agreement
To carry out the experiments described in this paper, we developed Timeline, an MPI program designed for HPC machines with low-latency interconnects. Timeline is capable of high-accuracy time agreement measurements among the members of an MPI communicator. First, a reference node is located. For simplicity, this reference node is chosen to be rank 0 in the MPI computation, although any node within the computation would be a valid candidate. The algorithm measures the time agreement offset between the reference (master) node's time and each other (slave) node's time. The process begins by exchanging one message with one of the slave processes. The master process initiates the message at time t_m0 on the master's clock, and the slave process responds to the message at time t_s0 on the slave's clock. Time measurements from this initial message exchange are discarded because they are likely to include one-time anomalous overheads. Next, ten sets of messages are exchanged between the master process and the slave process, and the results are recorded in a report block. In each set, the first message is initiated at time t_m1 on the master's clock, the response to this message is sent at time t_s1 on the slave's clock, and the response is received at time t_m2 on the master's clock. The resulting tuple for set i, <t_m1, t_s1, t_m2>_i, is recorded in the experiment's report block. After recording ten tuples for the slave, we determine the tuple with the minimum round-trip time, t_m2 − t_m1. We denote this tuple by <t_m1, t_s1, t_m2>, dropping the index i, and output it as our measurement for the slave. Finally, this entire process is repeated for all slaves in the computation. On the Cray systems, clock access is accomplished by calling MPI_Wtime(), which reads the processor's time-stamp counter (TSC) register; updating this register is a hardware operation, and reading it and storing its value is also fast. During Timeline initialization, each node's starting time is collected.
During the subsequent experiment's duration, each node's time is computed as an offset from that node's starting time. Each of these new measurements reveals the rate at which the TSC clock is advancing relative to the designated reference node. This results in a clock offset, Δ. If Δ equals zero, the trip time t_m2 − t_m1 will be approximately twice the out time t_s1 − t_m1. However, if the TSC clock on the remote node has deviated from that of the TSC clock on the reference node, the magnitude of out, t_s1 − t_m1, will include the amount of the deviation Δ. The value for Δ is therefore Δ = (t_s1 − t_m1) − (t_m2 − t_m1)/2. For Mira, a comparable low-overhead clock access is used, the optimized MPI_Wtime, and the same reasoning about Δ applies. Note that the first t_s1 occurs immediately after the sync response message in Figure 4. Mira reports out, t_s1 − t_m1, as a negative value in accordance with the semantics of MPI_Wtime (i.e., a globally synchronized time). Indeed, we expect the magnitude of out, t_s1 − t_m1, to be approximately half the magnitude of trip, t_m2 − t_m1.
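The tuple selection and offset arithmetic described above can be sketched in a few lines. This is an illustrative model, not the Timeline implementation itself; in particular, `ping_pong` is a hypothetical stand-in for a real timed MPI message exchange:

```python
def best_tuple(report_block):
    """From a report block of (t_m1, t_s1, t_m2) tuples, keep the exchange
    with the minimum round-trip time t_m2 - t_m1, which is least likely to
    have been inflated by interrupts or network contention."""
    return min(report_block, key=lambda t: t[2] - t[0])

def clock_offset(t_m1, t_s1, t_m2):
    """Estimate the slave clock's offset relative to the master under the
    symmetric-path assumption: out = trip/2 + offset, hence
    offset = (t_s1 - t_m1) - (t_m2 - t_m1) / 2."""
    return (t_s1 - t_m1) - (t_m2 - t_m1) / 2

def measure_slave(ping_pong, n_sets=10):
    """Per-slave procedure: one discarded warm-up exchange, then n_sets
    recorded exchanges, of which the minimum-trip tuple is the result."""
    ping_pong()                                  # warm-up, discarded
    block = [ping_pong() for _ in range(n_sets)]
    return best_tuple(block)
```

For example, a selected tuple of (0.0, 8e-6, 10e-6) yields an estimated offset of +3 microseconds: the slave stamped its reply 8 µs after t_m1, whereas half of the 10 µs round trip predicts 5 µs if the two clocks agreed.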

DISCUSSION OF MEASUREMENTS
In the following experiments, the Timeline tool was used to quantify timing characteristics. In all experiments, the master clock is chosen to belong to the rank 0 processing element within the computation, although this choice is arbitrary and any process could be selected to act in this role. For experiments conducted on the Cray supercomputers, Titan and Edison, the locations of all processes on the interconnect topology were recorded.
On Titan, the location information is the x, y, z coordinates on the Gemini 3D torus. On Edison, the location information is the coordinates on the Aries dragonfly network. This location information allows postmortem analysis on the effects of relative locations to the clock skew measurements.
Our analysis is divided into two sections. First, in Section 5.1, we make observations regarding interim results that occurred during the course of an experiment. Then, in Section 5.2, we focus on the overall interpretation of our aggregated results.

Analysis of short time frame results
This subsection presents an analysis of the short time frame results of the experiments conducted on Titan, Edison, and Mira. For this analysis, Table 2 gives an overview of the data used. Per rank (one rank per node), we use 20 measurement replications on Titan, 10 on Mira, and 200 on Edison, each resulting in a similar amount of data because of the different numbers of nodes involved. Measurements are collected with an inner loop over ranks and an outer loop over replications. Figure 5 depicts the median trip time heatmap for each slave node in the experiment as a function of its physical communication fabric location.
Medians are taken over replications. The location of each node, in terms of its coordinates on the communication fabric, was obtained using system library calls. These mapping calls are directly available on the Cray architecture. Accordingly, the figure shows results for Titan and Edison. For both of these cases, coordinates along the XY axes of the communication fabric correspond to horizontal alignment (i.e., parallel to the floor of the data center) while coordinates along the Z axis are vertical. We show the complete supercomputer in both cases to indicate where PBS assigned our runs.
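The per-rank statistic behind these heatmaps can be sketched as follows; the nested sample layout here is a hypothetical stand-in for Timeline's recorded output (inner loop over ranks, outer loop over replications):

```python
from statistics import median

def median_trip_per_rank(replications):
    """replications[k][rank] holds the trip time t_m2 - t_m1 recorded for
    `rank` in replication k. Collapse the outer (replication) dimension to
    the per-rank median plotted at each node's fabric coordinates."""
    ranks = replications[0].keys()
    return {r: median(rep[r] for rep in replications) for r in ranks}
```

Taking the median over replications suppresses occasional outlier exchanges that survive even the per-block minimum-trip filtering.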
Note that a few slots are empty throughout, perhaps indicating blades taken out of service. In terms of physical locations on the floor, the Titan 3D torus communication fabric is folded and interleaved to control maximum cable length. Figure 5A confirms the torus interconnect: nodes in the middle of all three dimensions are furthest from the origin and show the highest latency (blue). The results for Edison reflect the characteristics of the small-diameter dragonfly interconnect, which places one subset of slave nodes one hop away from the master and a larger subset of slave nodes two hops away (see Figure 5B).
The differences between the three systems in Figure 7 are striking in at least two respects: the scales on the vertical axes differ by orders of magnitude, and the patterns are very different, although some similarities can be seen between the two Cray systems. To see what patterns exist in the clock offsets, we map the median out times onto the physical communication fabric of Titan and Edison in Figure 8.
Each recorded out time is already the minimum of ten repeats in a report block (see Sec. 4.2). The median is taken over replications (outer loop), which are separated by polling all other nodes (inner loop). Because of the time span between replications, the offset can drift slightly even in this short time frame analysis, so we report the median. Titan clock offsets (Figure 8A) generally increase with distance in the X coordinate. This is unlike the latency heatmap of trip, which reflects the torus interconnect. We suspect that this is an artifact of the NTP synchronization algorithm. A similar conclusion can be reached about Figure 8B for Edison.

FIGURE 9 Titan master/slave clock offsets for ten-hour duration on rank 1 and rank 865 across 10 runs

Analysis of long time frame results
Recall that the out and back times are dominated by node clock offsets on Titan and Edison, as we determined in the short time frame analysis. For this reason, we now refer to graphs of out and back as "offset" graphs. Figure 9 (left) shows the offset for a given node (in this case, rank 1). Each line depicts the offset of rank 1 during a separate 10-hour run (ten runs total). We see that the offset meanders between about 130 and 245 milliseconds with different skews that occasionally change, probably due to NTP corrections. Each run gives a different pattern, yet the curves are very smooth over the long time scale. The right panel in the figure provides a similar graph for rank 865, chosen arbitrarily.
To get a sense for the offsets of a large collection of ranks in a single run, Figure 10 selects two runs of 1024 nodes to illustrate a typical set and the most extreme set from the 10 experiments. In both Figures 9 and 10, NTP corrections can be seen as discontinuities in the skew.
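Because the offset curves are smooth between NTP corrections, the local skew can be modeled by an ordinary least-squares line over a short window of (time, offset) samples. A minimal sketch, using hypothetical sample values (not measurements from our runs):

```python
def fit_skew(samples):
    """Ordinary least squares for offset(t) = a + b*t over (t, offset)
    pairs; b is the skew rate (seconds of offset per second of runtime)."""
    n = len(samples)
    st = sum(t for t, _ in samples)
    so = sum(o for _, o in samples)
    stt = sum(t * t for t, _ in samples)
    sto = sum(t * o for t, o in samples)
    b = (n * sto - st * so) / (n * stt - st * st)
    a = (so - b * st) / n
    return a, b

# Hypothetical example: offset starts at 130 ms and grows 1 ms per hour
samples = [(0.0, 0.130), (3600.0, 0.131), (7200.0, 0.132)]
a, b = fit_skew(samples)
predicted = a + b * 10800.0  # extrapolated offset one hour ahead
```

A fit of this kind would break at the discontinuities introduced by NTP corrections, so in practice the window must be restarted after each step change.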
Because trip, out, and back times on Mira are of roughly the same magnitude, they can be shown in the same graph ( Figure 11). What is remarkable on Mira is that the offset is small, several microseconds, and controlled throughout the run duration, especially compared to the Titan offsets.
For some runs, a small subset of the nodes in the computation develops noticeable clock offsets, which are eventually mostly resolved by NTP. For other runs, a majority of nodes in the computation develops significant clock offsets. At the worst point, the offset range exceeds 600 milliseconds across some participating nodes in the computation. That is over half a second on a machine that sends messages in about a microsecond! Such a large discrepancy would completely preclude time-based coordination techniques for tasks such as fine-granularity operating system scheduling, where agreement would need to be on the order of tens of microseconds or less to be effective. In resilience applications, time stamps in system message logs are rendered practically useless for postmortem root cause analysis. Other affected applications include resolving race conditions in parallel programming environments, optimistic locking strategies in database systems or filesystems, and other similar types of systems programs.
The primary takeaway from the long time frame results is the ineffectiveness of relying solely on NTP. Similar results were observed for Edison.

RECOMMENDATIONS
Cloud service providers and compute farms have successfully employed multiple strategies to enable distributed time agreement with uncertainties of a few microseconds. 7,55,56 Many of the same strategies employed by cloud service providers and distributed computing facilities would seem to be suitable for deployment on leadership-class machines. Given the demonstrated sensitivity of many parallel workloads and the surprisingly high 0.6-second time uncertainty across a single large-scale parallel machine, we encourage the following best practices.

FIGURE 10 Titan master/slave clock offsets for ten-hour duration across 1024 ranks in a typical run (left) and most extreme out of 10 (right)

FIGURE 11 Mira master/slave clock offsets for seven-hour duration (rank 1)
• First, we recommend replacing NTP-only solutions with higher resolution time synchronization mechanisms. With NTP, accuracies are limited to one millisecond in the best case if a time server is available on the local intranet. 30 We believe that improving the resolution of time synchronization within contemporary supercomputers will help to drive system software use cases that leverage tightly synchronized node clocks to improve the performance and efficiency of system software.
• Second, we recommend the use of hardware-based time synchronization mechanisms whenever possible. Hardware-based approaches may be available in high-performance computing environments, for example, Mellanox ConnectX-4 time-stamping. 57 For institutions acquiring new machines, we recommend the procurement process as a vehicle for requiring a hardware-based solution.
• Third, in cases where hardware-based time synchronization mechanisms are not possible, perhaps due to cost considerations, we recommend the use of newer software-based time synchronization mechanisms such as RADclock. Software-based approaches are gaining traction in distributed cloud computing environments where point-to-point hardware connections do not exist. Leveraging these approaches in high-performance computing environments is likely to be straightforward and potentially highly beneficial. One advantage of the time synchronization approach that we advocate is that it can be applied to just the nodes and processes participating in a given computation. That is to say, with a system-wide deployment of an approach such as RADclock, the overhead of clock synchronization is always present at the system software level. Our approach could instead be implemented on a per-computation basis, thus paying the overhead of tight time synchronization only in instances where it is truly useful.
• Fourth, we recommend that, as part of the installation process, the system administrator responsible for installing MPI perform an analysis of the underlying clocks available to MPI_Wtime implementations and document their characteristics for the MPI user community.
• Fifth, we advocate for continuing research that explores new strategies for improving time uncertainty bounds in high-performance computers.
Because we show that time skew is a smooth function, there is an opportunity to build predictive statistical and machine learning models that rely on minimal master clock data and provide uncertainty bounds. CTP is one method that advances in this direction by arranging the nodes that participate in time synchronization operations in a ring and coordinating time among the ring members. However, in a supercomputer context, approaches such as CTP face scalability challenges as the number of nodes in a computation increases.
• Finally, we recommend that customers and vendors in the high-performance computing arena aim for a bounded uncertainty of under 10 microseconds. This recommendation is based on the observation that one-way interconnect latencies in current supercomputers are under 5 microseconds, so synchronizing all node clocks to within 10 microseconds should be achievable using only software-based synchronization techniques. In cases where hardware-based synchronization techniques are available, even tighter synchronization bounds are likely to be possible.
These recommendations are targeted toward high-performance computers, that is, systems with the following three characteristics: (a) the machine is intended for applications characterized by large-scale parallelism; (b) the machine contains at least 1000 nodes; and (c) the machine contains a communication interconnect with a latency of 5 microseconds or less between any node pair. However, we believe they may prove useful for other distributed and parallel environments as well.

CONCLUSION
This paper has presented a statistical analysis of the default clock agreement of three representative leadership-class computer systems.
Among our findings, we observe that the NTP-based time agreement system allows discrepancies exceeding acceptable uncertainty requirements for many potential applications. In some cases, discrepancies of hundreds of milliseconds are observed, including a worst case measurement exceeding 600 milliseconds across some nodes of a computation. These large discrepancies would prohibit using time-based coordination techniques for things such as fine-granularity operating system scheduling where agreements would need to be on the order of tens of microseconds or less to be effective. Other affected applications would include resolving race conditions in parallel programming environments, optimistic locking strategies in database systems or filesystems, and other similar types of systems programs.
Based on the results presented in this paper, we have provided a set of recommendations that call for the removal of unaltered NTP as the primary tool for time agreement on high performance computers.
To date, our work has assumed symmetric timings for both outbound and return trips of selected round-trip messages traveling between two peers. Specifically, we have assumed those messages chosen from a large set of transfers with minimum round-trip time have symmetric outbound and return timings. In the future, we intend to incorporate methods that characterize asymmetric influences (e.g., adaptive routing, network traffic, OS noise, and so on) as well as incorporate minimum message delay strategies into our investigations.