Scheduling data streams for low latency and high throughput on a Cray XC40 using Libfabric

Achieving efficient many‐to‐many communication on a given network topology is a challenging task when many data streams from different sources have to be scattered concurrently to many destinations with low variance in arrival times. In such scenarios, it is critical to saturate but not to congest the bisectional bandwidth of the network topology in order to achieve a good aggregate throughput. When there are many concurrent point‐to‐point connections, the communication pattern needs to be dynamically scheduled in a fine‐grained manner to avoid network congestion (links, switches), overload of the nodes' incoming links, and receive buffer overflow. Motivated by the use case of the Compressed Baryonic Matter experiment (CBM), we study the performance and variance of such communication patterns on a Cray XC40 with different routing schemes and scheduling approaches. We present a distributed Data Flow Scheduler (DFS) that reduces the variance of arrival times from all sources by at least a factor of 30 and increases the achieved aggregate bandwidth by up to 50%.

appropriately. Indirectly, a short duration and low variance of time-slice completion lead to a more regular usage of the receive buffers per input link at the compute nodes, which improves the scalability of the system. If the hardware components are fixed and we cannot afford a linear growth of main memory for receive buffers at the compute nodes, the buffer space we can provide per input link decreases linearly with an increasing number of input nodes. Thus, being able to free buffer space early thanks to a low variance of time-slice completion at the compute nodes is a desirable system property for scalability. In general, the buffer size determines the amount of variance a system can tolerate.
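The buffer-space argument above is simple arithmetic; the following sketch (our own illustration with a hypothetical 64 MiB total buffer, not a value from the evaluation) shows how the per-input-link share shrinks as the system grows.

```python
# Sketch: per-input-link receive buffer at a compute node when the total
# buffer memory is fixed. The 64 MiB total is a hypothetical value.

def buffer_per_input_link(total_buffer_bytes: int, num_input_nodes: int) -> int:
    """Fixed total buffer, split evenly across all input links."""
    return total_buffer_bytes // num_input_nodes

total = 64 * 1024 * 1024  # hypothetical total receive buffer per compute node
for n in (16, 64, 256):
    per_link = buffer_per_input_link(total, n)
    print(f"{n:4d} input nodes -> {per_link // 1024} KiB per input link")
```

With 256 input nodes, each link gets only a sixteenth of the space it had with 16 input nodes, which is why freeing buffers early matters.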
We make the following main contributions throughout this paper to achieve these goals.
• We discuss the challenges of scaling FLESnet in detail and present a potential solution for its communication pattern, scalability constraints, and synchronization demands (Section 2).
• We derive a design for a Data-Flow Scheduler (DFS) that works distributed across input and compute nodes and describe how it addresses the scalability challenges without introducing a central component or single point of failure (Section 3).
• To study the effectiveness of DFS, we implemented micro-benchmarks using Libfabric and MPI resembling FLESnet's communication pattern between two disjoint groups of input and compute nodes and evaluate the performance on our test machines (Section 5.1).
• With DFS, we demonstrate a reduction in the variance of arrival times by up to a factor of 30 and an overall increase in throughput of up to 50% compared to FLESnet using a best-effort approach (Section 5.2.1). Furthermore, we show an improved synchronization (Section 5.2.2), a more regular usage of the receive buffers in compute nodes (Section 5.2.4), and that the overall throughput achieves at least 77% of our micro-benchmark.
This special issue paper is an updated version of our original Cray User Group (CUG) publication. 11

FLESNET COMMUNICATION PATTERN
The CBM experiment 8 aims to detect and discover phenomena at different times during beam-time with hundreds of sensors surrounding the experiment. Each sensor transmits its measured, time-stamped data chunks to an input node, which buffers and then distributes the contributions to compute nodes (Figure 1). To build a time-slice on a compute node for a meaningful data analysis, CBM requires every data sensor/input to contribute. The system that distributes the individual data streams from input nodes to compute nodes and builds complete time-slices is called FLESnet.
FLESnet's communication pattern is mostly unidirectional and needs some buffering at input and compute nodes. Input nodes receive data streams from sensors, buffer them, chop the data into micro-time-slices (mtss), and scatter the mtss to compute nodes in a round-robin manner.
Each input node distributes mtss to all active compute nodes in each round, ie, a round consists of as many mtss as there are active compute nodes in the system. Each compute node, in turn, receives mtss as contributions for a particular time-slice from all input nodes and buffers them until the last missing contribution arrives, the time-slice is complete, and it can be passed on to the data analysis.
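The round-robin scatter and time-slice completion described above can be sketched as follows (a minimal simulation of our own, not FLESnet code; node counts are arbitrary examples):

```python
# Sketch of the round-robin scatter: in each round every input node sends
# one mts to every compute node, and a time-slice is complete once a
# compute node holds contributions from all input nodes.

NUM_INPUTS, NUM_COMPUTES = 4, 3

def destination(mts_index: int) -> int:
    """Round-robin mapping of an input node's mts stream onto compute nodes."""
    return mts_index % NUM_COMPUTES

# Simulate two rounds: each input emits NUM_COMPUTES mtss per round.
received = {c: [] for c in range(NUM_COMPUTES)}
for rnd in range(2):
    for inp in range(NUM_INPUTS):
        for k in range(NUM_COMPUTES):
            m = rnd * NUM_COMPUTES + k          # mts index in this input's stream
            received[destination(m)].append((inp, m))

# Each compute node now holds complete time-slices: contributions with the
# same mts index from all NUM_INPUTS inputs.
for c, items in received.items():
    complete = {m for _, m in items
                if sum(1 for _, mm in items if mm == m) == NUM_INPUTS}
    print(f"compute node {c}: complete time-slices {sorted(complete)}")
```

Note how all mtss with the same index land on the same compute node, so that node can assemble them into one time-slice.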
In the original implementation of FLESnet without DFS, the communication follows a best-effort approach 12 and is not well coordinated between the nodes. Input nodes distribute their rounds independently of the progress of the other input nodes. Thus, compute nodes are prepared to collect several mtss for different time-slices from the same input node by allocating a part of their limited local memory as a receive buffer per input node. To prevent buffer overflow at the compute nodes, a ticket-based flow control mechanism 13 is established per pair of input and compute nodes. Input nodes receive a set of tickets from each compute node and use one ticket to send each mts. Once they run out of tickets, they wait until they receive new tickets, which are issued when time-slices are completed (mtss from all input nodes have arrived) and buffer space becomes available again. While an input node waits for new tickets, the sensors keep measuring new data, which is stored in a local buffer at the input node to avoid data loss until this buffer is exhausted as well. The lack of coordination between input nodes can lead to unfair use of the network (every node keeps sending on a best-effort basis), individual input nodes may fall far behind the others, limited only by buffer exhaustion, and delays in finishing a time-slice become much larger than necessary.
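The per-connection ticket mechanism described above can be sketched as follows (class and method names are our own; the real implementation differs):

```python
# Hedged sketch of ticket-based flow control on one input/compute node
# pair: an input node may only send an mts while it holds a ticket;
# tickets return when time-slices complete and buffer space is freed.

from collections import deque

class TicketedLink:
    def __init__(self, initial_tickets: int):
        self.tickets = initial_tickets
        self.pending = deque()          # mtss waiting at the input node

    def try_send(self, mts) -> bool:
        """Consume a ticket if available; otherwise buffer locally."""
        if self.tickets > 0:
            self.tickets -= 1
            return True                 # mts goes onto the wire
        self.pending.append(mts)        # input-node buffer absorbs the data
        return False

    def on_timeslice_complete(self, new_tickets: int = 1):
        """The compute node freed buffer space and returned tickets."""
        self.tickets += new_tickets
        while self.tickets > 0 and self.pending:
            self.pending.popleft()      # drain the backlog first
            self.tickets -= 1

link = TicketedLink(initial_tickets=2)
print(link.try_send("mts0"), link.try_send("mts1"), link.try_send("mts2"))
link.on_timeslice_complete()
print(len(link.pending))
```

The third send blocks for lack of a ticket and is absorbed by the local buffer; the completed time-slice then returns a ticket and drains the backlog.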

Scalability constraints and coordination
With increasing system size, the overall bandwidth increases as well. However, the buffer space in the compute nodes typically does not grow linearly with the number of input nodes. This has several effects and consequences for the system. With more input nodes, compute nodes receive mtss from more input nodes, but the fraction of the local buffer space per input node decreases, and consequently, fewer tickets are provided to send mtss. In a system using eager best-effort sending and ticket-based flow control only, two trends can be expected that at first seem to be oppositional.
On the one hand, having fewer tickets per input node increases the dependency and implicit synchronization between different input nodes, as some nodes might run out of tickets and get blocked when others cannot keep up. Such blockage happens earlier with smaller buffers at the compute nodes than with larger ones, so stragglers can catch up and the network usage becomes more uniform. As the difference between the input node with the most progress and the one with the least progress decreases, the time to complete full time-slices should also decrease. Thus, scaling the system can help to implicitly synchronize it.
On the other hand, when input nodes are blocked, the overall throughput will not be optimal, as some links remain unused until new tickets become available. The larger the system, the more input nodes will be blocked waiting for the latest contribution of a time-slice to arrive at a compute node so that they can get their next tickets. Moreover, we risk losing measurement data when the input buffer runs full, as running out of tickets is the common case rather than the exception.
For good scalability and sustained high throughput, more coordination is needed. When some input nodes fall behind, the other input nodes should not try to saturate all available bandwidth to give the stragglers a chance to catch up. That way, compute nodes can collect all contributions to finish time-slices earlier, buffer fill levels are expected to remain more regular without buffers running full, and input nodes should not regularly run out of tickets so that they can keep sending to achieve a reasonable sustained aggregate throughput. With more coordination, smaller buffers at the compute nodes may be sufficient, which would help to scale to more input nodes as smaller buffers mitigate the buffer space limitation outlined above.
Such coordination among the input nodes should happen without adding much communication/processing overhead as that might decrease the achievable bandwidth. It should consider the network status, latency, and clock drifts of different nodes. The aim would be to steer the distribution of mtss so that compute nodes receive all contributions for a time-slice in a timely manner, can pass it to the analysis, and can reuse the buffer space.

Synchronization aspects
To support the approaches outlined in the previous paragraphs, input nodes need to behave more synchronized in time than with the default best-effort approach. For the offset-based round-robin data distribution, input nodes should start their data distribution roughly at the same time to effectively separate the link usage. It might turn out that coordination at such a fine-granular level is not possible or too costly. Then, coordination of data transfers on a round or group-of-rounds level might be sufficient when an asynchronous best-effort approach is used in between. At least, as we have discussed, it helps to assign a large share of the available network capacity to stragglers so that they can catch up. This can be built into the synchronization mechanism of the system, or one can integrate a clock synchronization that indirectly detects deviations and clock drift rates and provides enough feedback to each input node so that they behave in a synchronized manner.

DISTRIBUTED DATA-FLOW SCHEDULER
Based on the considerations outlined in Section 2, we introduce a new deterministic scheduling mechanism, called Data-Flow Scheduler (DFS) to increase the aggregated bandwidth, to stabilize the network latency, to reduce the occurrence of network congestion, and to reduce the local memory space usage for stream processing. The DFS coordinates and orchestrates the data flow by performing the following tasks.
• It synchronizes input nodes to reduce the variance of arrival times of mtss for a time-slice at compute nodes so that time-slices can be completed with shorter duration.
• It schedules the injection rate at the input nodes to avoid network and endpoint congestion so that the overall network can be used more uniformly, which supports lower variance of arrival times.
• It dynamically adapts the injection rate trying to improve the network utilization.
• It collects and distributes coordination information in a decentralized manner among all input and compute nodes, so there is no central component limiting scalability.
The DFS divides the transmission time into time-intervals (intvl). Each interval consists of a configurable number of rounds, where in each round each input node transmits one mts to each compute node. Thereby, each compute node receives a complete time-slice per round. DFS coordinates input and compute nodes on the level of time-intervals to keep the communication overhead between the processing nodes low compared to a per-round management. DFS assigns a unique identifier intvl_id to each interval in ascending order over time. Each interval has its metadata, which consists of the first time-slice to build, the number of rounds, the absolute start time, and the duration.
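The interval metadata just listed can be captured in a small record; the field names below are our assumption based on the description, not the paper's actual data layout:

```python
# Sketch of the per-interval metadata the DFS exchanges.

from dataclasses import dataclass

@dataclass(frozen=True)
class IntervalMetadata:
    intvl_id: int          # unique identifier, ascending over time
    first_timeslice: int   # first time-slice built in this interval
    rounds: int            # configurable number of rounds per interval
    start_time: float      # absolute proposed start time
    duration: float        # proposed duration of the whole interval

def next_interval(prev: IntervalMetadata, duration: float) -> IntervalMetadata:
    """One complete time-slice is built per round, so interval j+1 starts
    `rounds` time-slices after interval j and directly after it in time."""
    return IntervalMetadata(
        intvl_id=prev.intvl_id + 1,
        first_timeslice=prev.first_timeslice + prev.rounds,
        rounds=prev.rounds,
        start_time=prev.start_time + prev.duration,
        duration=duration,
    )
```

The `next_interval` helper makes explicit how consecutive intervals tile both the time-slice sequence and the transmission time.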
The DFS operates distributedly across the input and compute nodes. We call the part running on input nodes the Input Engine (IE) and the part running on compute nodes the Distributed Deterministic Engine (DDE). Figure 2 depicts the components of each engine.

System assumptions
We assume a homogeneous system where all input and compute nodes are connected with the same capabilities to the network, ie, identical network cards and network links. Otherwise, the slowest node would limit the scalability of the system, or we would have to load-balance the time-slice building over the compute nodes, an aspect we plan to address in the future. The aggregated network capacity of the input nodes should be similar to or smaller than the capacity of the compute nodes. The bisectional bandwidth should be close to or higher than the aggregate network capacity of the input nodes. Varying adversarial traffic from colocated applications might limit the scalability: instead of adapting to small changes, the DFS would have to constantly adapt to large changes in the available network capacity.

Distributed deterministic engine
The DDE consists of three modules: History Manager, Proposer, and Clock Synchronizer. The history manager module collects metadata of completed intervals from the input engines and calculates statistics. The proposer module calculates the interval metadata of upcoming intervals based on the statistics in the history manager module, with the aim to synchronize the IEs and to utilize the network continuously, taking changes in the achievable aggregate network bandwidth into account. The proposer then broadcasts the proposed metadata to the IEs. The clock synchronizer module compensates the different clock drift rates of the different machines to improve the synchronicity of the input nodes in case the system clocks are not well synchronized.

History manager
The history manager module measures and receives the actual metadata of intervals, which consists of the measured duration durm_i,j and the measured start time tm_i,j for each input node i and time interval j. As different machines can have different clock speeds, it passes each metadata record to the clock synchronizer module to adjust it to the local machine clock. After that, it stores the actual interval metadata of each IE and triggers the completion of each interval once it has received the metadata from all IEs. Different IEs may start an interval at different times or take longer than proposed. Therefore, the history manager module calculates a unified metadata for each interval that represents all IEs.
The unified metadata of an interval j consists of the average start time avg_tm_j and the median duration med_durm_j over all IEs. The history manager module calculates and uses the average start time avg_tm_j to account for straggler nodes in its upcoming interval proposals and to resynchronize all input nodes. When an IE starts an interval later than proposed, this can have several reasons: (1) the input node is slow, and/or (2) one of the compute nodes is slow, and/or (3) a network link is congested. Using the average start time takes extreme values into account and delays upcoming intervals to reduce pressure on the network and to synchronize the input nodes in time.
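The unification step is a direct computation over the per-IE reports; a minimal sketch (variable names follow the text):

```python
# Sketch of the history manager's unified per-interval metadata: the
# average measured start time and the median measured duration over all
# input engines.

from statistics import mean, median

def unified_metadata(tm: list[float], durm: list[float]) -> tuple[float, float]:
    """tm[i], durm[i]: measured start time and duration reported by IE i."""
    avg_tm = mean(tm)        # the average start accounts for stragglers
    med_durm = median(durm)  # the median duration is robust against outliers
    return avg_tm, med_durm

# Three IEs, one of which started late and one of which ran long:
avg_tm_j, med_durm_j = unified_metadata([10.0, 10.2, 11.0], [2.0, 2.1, 3.5])
print(avg_tm_j, med_durm_j)
```

The choice of average for the start and median for the duration matches the text: the late starter pulls the unified start time back, while the one slow duration is filtered out.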

Proposer module
The proposer module calculates the proposed start time tp_j and duration durp_j of an upcoming interval j based on the metadata of the completed intervals. As the last completed interval w could represent an extreme state of the network that does not represent the majority of intervals, the proposer module calculates the median interval duration med_durm_hist over a configurable number hist_cnt of the last completed intervals. It uses this median duration to estimate the start time of the upcoming interval j: tp_j is the completion time of the last completed interval w plus the time needed to complete the intervals between w and j, estimated via med_durm_hist. The proposer module could use med_durm_hist directly as the proposed duration for an upcoming interval j. However, the measured duration of the last intervals could be exceptionally high due to temporary problems such as network/endpoint congestion. Therefore, the proposer module applies a staged-speed-up (SSU) mechanism that gradually reduces the proposed duration of intervals in order to reach the maximum achievable bandwidth over time and to recover from temporary slow-downs. When the connections between input and compute nodes are established at the beginning, this mechanism helps to gain speed, resembling the slow-start of TCP. 18 The proposer module applies the SSU in such a way that it does not overload or congest the network once the maximum bandwidth is reached.
The proposer module applies the SSU only when the average variance between the proposed and the measured metadata over the last hist_cnt completed intervals is smaller than a configurable threshold ssu_TH. A low variance indicates synchronized nodes and a relaxed network. The SSU reduces the proposed duration durp_j of an interval j by a configurable percentage ssu_pct, as described in Equation (1). Once the proposer module has calculated the metadata of an interval, it adjusts it using the clock synchronizer module and transmits it to all IEs in order to synchronize the data transmission.
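The proposer's two steps can be sketched as follows. This is our reconstruction from the prose (the original equations are not reproduced here), so the exact formulas, in particular how the gap between intervals w and j is counted and how ssu_pct is normalized, are assumptions:

```python
# Hedged sketch of the proposer: start-time estimation from the median
# history duration, plus the staged-speed-up (SSU) duration reduction.

from statistics import mean, median

def propose(j: int, w: int, end_w: float, history_durations: list[float],
            variances: list[float], ssu_th: float, ssu_pct: float):
    """Propose start time tp_j and duration durp_j for upcoming interval j.

    j                 -- id of the interval to propose
    w                 -- id of the last completed interval (w < j)
    end_w             -- unified completion time of interval w
    history_durations -- measured durations of the last hist_cnt intervals
    variances         -- |proposed - measured| deviations of recent intervals
    """
    med_durm_hist = median(history_durations)
    # Start time: completion of w plus the estimated time for the
    # intervals that run between w and j (assumed to be j - w - 1 of them).
    tp_j = end_w + (j - w - 1) * med_durm_hist
    # Staged speed-up: shorten the duration only when the nodes look
    # synchronized and the network looks relaxed.
    durp_j = med_durm_hist
    if mean(variances) < ssu_th:
        durp_j *= (1.0 - ssu_pct)   # ssu_pct assumed to be a fraction here
    return tp_j, durp_j

tp, durp = propose(j=12, w=10, end_w=100.0,
                   history_durations=[2.0, 2.2, 2.1],
                   variances=[0.01, 0.02], ssu_th=0.05, ssu_pct=0.1)
print(tp, durp)
```

The median over the history keeps one congested interval from inflating the proposal, while the SSU guard keeps the scheduler from speeding up into a congested network.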

Clock synchronizer
The clock synchronizer module calculates the clock offsets between the local machine and the input nodes. It also tracks the clock drift of each input node over time to calculate the metadata of the upcoming intervals correctly. When an IE finishes transmitting an interval j, it immediately sends the actual metadata and the median network latency lat_i of the connection, where i is the index of the ith DDE. The DDE records the local time lt when the message is received and calculates the end time of the interval, tmend_j. Then, it calculates the clock offset for interval j as toffset_i,j (see Equation (2), where i is an index indicating the ith IE). The clock synchronizer module stores the history of the clock offsets of each input node and calculates the clock drift accordingly, where j is the index of the latest finished interval.
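Since Equation (2) is not reproduced here, the following sketch shows one consistent reading of the offset and drift computation; both the exact offset formula and the least-squares drift estimate are our assumptions:

```python
# Hedged sketch of the clock synchronizer: the IE's reported interval end
# (in the IE's clock) minus the DDE's latency-corrected estimate of the
# same instant (in the DDE's local clock) gives the clock offset.

def clock_offset(tm_ij: float, durm_ij: float, lt: float, lat_i: float) -> float:
    """tm_ij + durm_ij: end of interval j in IE i's clock.
    lt - lat_i: the same instant estimated in the DDE's local clock."""
    tmend_j = lt - lat_i
    return (tm_ij + durm_ij) - tmend_j

def clock_drift(offsets: list[tuple[float, float]]) -> float:
    """Drift rate from a history of (local_time, offset) samples via a
    least-squares line fit (our simplification of the tracking)."""
    n = len(offsets)
    mx = sum(t for t, _ in offsets) / n
    my = sum(o for _, o in offsets) / n
    num = sum((t - mx) * (o - my) for t, o in offsets)
    den = sum((t - mx) ** 2 for t, _ in offsets)
    return num / den if den else 0.0

off = clock_offset(tm_ij=50.0, durm_ij=2.0, lt=53.5, lat_i=0.5)
print(off)  # 52.0 - 53.0 = -1.0
```

A negative offset here means the IE's clock runs behind the DDE's; the drift slope lets the DDE extrapolate that offset when adjusting future proposals.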

Input engine
The input engine receives the proposed metadata of intervals and schedules the data transmission accordingly. Initially, when a connection is established between each pair of IE and DDE, each IE transmits its mtss without any guidance from the DDEs. The first intervals determine the initial duration of an interval. After the DDEs receive the metadata of the first intervals, they calculate the required duration and propose the metadata of the upcoming intervals accordingly. When the IE updates the DDEs with the actual metadata of an interval j, it requests the proposal metadata of a particular interval k, where k ≥ j + 2. The IEs use the last proposed metadata when they do not receive new guidance from the DDEs in time. Therefore, the DFS is a nonblocking mechanism.
When the IE starts a new interval, it records the local time as tm_j, divides the proposed interval duration durp_j evenly among the interval's rounds(j) rounds, and calculates the round start time tr_j,y and duration durr_j accordingly, where j is the interval index and y is the round index, as outlined in Equation (4):

durr_j = durp_j / rounds(j)
tr_j,y = tm_j + durr_j · y

The IE attempts to start each round at its proposed time and spreads the transmission of its mtss evenly over the time until the next round. The IE receives an acknowledgment when a compute node receives a contribution. It measures the latency of each acknowledgment and calculates the median latency medlat at the end of each interval. After transmitting all contributions of a particular interval, the IE waits until it has received a specific threshold of acknowledgments before starting a new interval. If a node is slow and reaches the proposed end time of an interval, tp_j + durp_j, without having sent all contributions, the IE transmits the remaining contributions as best-effort traffic.
As a result, this node utilizes the network as fast as it can, while other input nodes might idle. The idle time of other nodes would be kept short as this IE tries to catch up.
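Equation (4) above is a straightforward even split; a minimal sketch (the example numbers are arbitrary):

```python
# Sketch of the IE's round scheduling: the proposed interval duration is
# split evenly over the rounds of the interval.

def round_schedule(tm_j: float, durp_j: float, rounds_j: int):
    """Return (round duration durr_j, list of round start times tr_j,y)."""
    durr_j = durp_j / rounds_j
    tr = [tm_j + durr_j * y for y in range(rounds_j)]
    return durr_j, tr

durr, starts = round_schedule(tm_j=100.0, durp_j=8.0, rounds_j=4)
print(durr, starts)  # 2.0 [100.0, 102.0, 104.0, 106.0]
```

Each round thus gets an equal time budget, and a round that finishes early simply waits for its successor's proposed start time.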
When the IE receives the last acknowledgment of a particular interval, it records the local time and calculates the actual interval duration durm_j.
Then, it broadcasts this metadata to all DDEs. Therefore, the DFS is a deterministic mechanism for scheduling data distribution because the DDEs get the same history of completed intervals, and thus, they propose the same metadata for the upcoming intervals.

Fault tolerance
The DDEs propose the same metadata for each interval and are thus replicas of each other. The IEs use the first received metadata of an interval proposal without waiting for the other DDEs. Therefore, DFS only needs an IE running at each input node and at least one DDE running on any of the compute nodes. When a system runs multiple DDEs on different compute nodes, it tolerates the failure of all but one of them.

IMPLEMENTATION
The original FLESnet was implemented* using the API of InfiniBand Verbs 19 and relied on connection-oriented communication. To support other modern interconnects like Omni-Path, 20 Ethernet, and GNI, 21 we ported † FLESnet to the OpenFabrics Interface (OFI) Libfabric. 22,23 Input and compute processes run on separate single cores, and each process uses a single thread. To initially synchronize input and compute processes, FLESnet uses an MPI_Barrier once the connections are established and before the transmission of mtss starts. We assume that all processes leave the barrier at a similar time, and most often they actually do so. Each process records the local time once it leaves the barrier, and the input processes broadcast their recorded time to the compute processes. The DDE, which runs on compute processes, uses these times to calculate the initial clock offset of each machine. We implemented our DFS ‡ on top of FLESnet.
FLESnet uses two types of messages to communicate between nodes.
• Remote Direct Memory Access (RDMA) writes 24 : Input nodes write the mtss into the memory of remote compute nodes using RDMA writes.
RDMA is a one-sided communication; therefore, compute nodes are not informed or interrupted when a mts is written into their memory.
• Message Passing (SYNC) messages: To coordinate between input and compute nodes, message passing is used. SYNC messages are only used when a node wants to inform another node about changes. Input nodes use this message type to inform compute nodes about the written data, actual metadata of intervals, and to request proposal metadata of a particular interval. On the other hand, compute nodes use SYNC messages to send tickets or proposed metadata of upcoming intervals and to inform input nodes about completed time-slices.
FLESnet is designed to receive mtss of variable size; therefore, the buffer spaces are implemented as ring buffers in order to make efficient use of the available memory. When the end of the ring buffer is almost reached and only a small space remains at the end while enough space is free at the beginning of the buffer, the input node splits the mts into two parts to fit the available space and writes it using two RDMA writes instead of one. Moreover, each mts has a 20-byte descriptor that describes the component and content of the mts. This descriptor is written in a separate RDMA write after transmitting the mts content. After the mts is written, input nodes use a SYNC message to inform the compute node about the written mts. Once an IE has transmitted a whole interval, it broadcasts the actual metadata using SYNC messages.
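The wrap-around split can be sketched as a small planning function (our own illustration, not the FLESnet implementation):

```python
# Sketch: splitting an mts across the wrap-around point of a ring buffer,
# so that it is written with two RDMA writes instead of one.

def plan_writes(offset: int, size: int, capacity: int):
    """Return the (offset, length) segments needed to store `size` bytes
    starting at `offset` in a ring buffer of `capacity` bytes."""
    first = min(size, capacity - offset)
    if first == size:
        return [(offset, size)]                  # fits without wrapping
    return [(offset, first), (0, size - first)]  # tail wraps to the front

print(plan_writes(offset=60, size=16, capacity=64))  # [(60, 4), (0, 12)]
print(plan_writes(offset=10, size=16, capacity=64))  # [(10, 16)]
```

Each returned segment would correspond to one RDMA write into the remote compute node's buffer.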

EVALUATION
We evaluated the Data Flow Scheduler on a Cray XC40 using up to 384 nodes, each equipped with two Intel Xeon E5-2680v3 processors and 64 GiB of main memory. We used the GNI provider 21 of Libfabric 1.6.2 and Cray MPICH 7.5.1, compiled with the Cray GCC 6.2.0 compiler. The micro-benchmarks for Libfabric § and MPI ¶ are adapted from other works, 25,26 respectively. We modified these benchmarks to divide the processing nodes into two groups: half of the nodes as senders and the other half as receivers. Each processing node runs a single thread, either as sender or as receiver. The benchmark measures, by repeated executions, the average duration of a round (each sender writes a message into each receiver's memory buffer) and the average bandwidth of RDMA writes for different numbers of processes in a single run.

Libfabric/MPI micro-benchmark
The MPI and Libfabric micro-benchmarks use the MPI_Put and fi_write operations, respectively, to write data into the receiver's memory. To improve performance, the benchmark uses huge pages, which default to a size of 2 MiB in the benchmark. The benchmark aligns the senders to start writing data at roughly the same time using an MPI_Barrier. Each sender finishes writing data independently of the other senders after a given number of iterations. The results show that Libfabric needs a longer duration to transmit a round of messages when the message size is larger than 128 KiB, as depicted in Figure 3. MPI achieves 8.7% and 50.65% shorter durations for 128 KiB and 1 MiB messages, respectively, when 192 nodes are used.
Conversely, Libfabric achieves a better bandwidth than MPI when the message size is at most 128 KiB, as depicted in Figure 4. While MPI improves its performance for larger messages, Libfabric suffers a significant performance drop for messages larger than 8 KiB. We use huge pages of 2 MiB to improve the performance of large messages. In general, using multiple threads per node would help. 27 This micro-benchmark only shows the achievable performance when RDMA writes overwrite the memory space without any need for tickets or coordination. It differs from FLESnet in several aspects.
• FLESnet transmits different types of messages with different sizes, as explained in Section 4, while the benchmark transmits one message type (RDMA writes) with a fixed message size.
• FLESnet has a limited buffer space at each node, and tickets have to be available to write more mtss. To receive a new ticket for the same buffer place, input nodes have to transmit the mts, wait to receive the completion event of the networking layer, send a SYNC message to inform the compute node about the written mts, wait to complete the time-slice at the compute node (collect all contributions from other nodes), and then receive a SYNC message containing a new ticket. The benchmark, on the other hand, can send as many messages as fast as it can without these limitations.
• The nodes of FLESnet depend on each other's progress. One straggler node quickly affects other nodes, as discussed in Section 2. In the benchmark, in contrast, senders and receivers do not depend on each other at all.

Figure 5 shows results for a workload that additionally includes small messages (one for each time-slice) and 2 million 64-byte SYNC messages. We see a significant performance drop when small messages are added to the workload, especially in the case of MPI. These results confirm our decision to port FLESnet to Libfabric in order to support various sizes of time-slices performance-wise. We will use the Libfabric benchmark results for the workload including small messages (rightmost bars in Figure 5) to evaluate the effectiveness of DFS.

DFS performance
We configured FLESnet to assign 1 MiB of main memory to each node and each mts is 64 KiB. Table 1 shows the values of the other DFS parameters.

Achieved throughput
We benchmarked FLESnet and FLESnet with DFS (DFS for short) and compared the results with the Libfabric micro-benchmark results for a mix of message sizes (see Section 5.1). This micro-benchmark is comparable with FLESnet regarding the data distribution only, because FLESnet uses a combination of SYNC messages (87 bytes), mts RDMA writes (64 KiB in these test cases), and mts descriptor RDMA writes (20 bytes). We discuss how the system limitations of FLESnet (see Section 2) cause a significant performance drop on larger systems. For larger systems, the synchronization overhead increases.

Synchronization overhead
One of the goals of DFS was to shorten the duration to receive a complete time-slice at the compute nodes in order to free the buffer space as soon as possible. Figure 7 depicts the time difference between the arrival of the first and the last mts of each time-slice. DFS shortens this duration by at least a factor of 30.

Bandwidth recovery
The DFS is able to recover from a temporary bandwidth drop caused by an event such as network congestion. We simulated an artificial bandwidth drop by increasing the actual duration of an interval by 25%. Figure 8 depicts the recovery behavior.

Buffer usage
Due to the reduced completion time for collecting a time-slice, DFS is able to free buffer space sooner than bare FLESnet. Figure 9 illustrates the buffer fill levels aggregated across compute nodes and over time. For each sample, the buffer fill levels at each compute node are sorted in descending order. The figure shows that many buffers fill up completely in the case of FLESnet, ie, input nodes run out of tickets. With DFS, the buffer fill level is only ≈10% for all connections. As explained earlier, this is essential for the system's scalability because the local memory space does not scale with the system size.

RELATED WORK
We studied FLESnet in the context of the CBM experiment, but DFS is not restricted to this setting. A similar communication pattern occurs in O2, 28 the online analysis software used at CERN # for the ALICE Experiment. 29 O2 distributes and processes data iteratively, similar to FLESnet.
Our Data-Flow scheduler approach could help saturate the network more uniformly and reduce the latency to collect all contributions for a time-slice.
Another practical application where the communication pattern is known in advance and thus DFS could improve the network usage is Facebook's SVE, 30 which performs distributed video processing.
There is a variety of measures to detect, avoid, and cope with congestion in different disciplines. A common technique is credit-based flow control, 31 which is similar to our approach and uses credits to manage the network throughput. The Aries network 32 uses adaptive routing to route around congested parts of the network.

Monitoring and detection
Everflow 33 is a packet-level network telemetry system built on top of existing functionality in common switches. It distributes collected packets over several analysis servers and uses ''guided probes'' to detect and analyze faults. In contrast, we decided to use the existing servers to monitor the network.
SketchVisor 34 uses counters to detect problematic flows with low overhead. The prototype is built on top of Open vSwitch, and the so-called fast path is used for more detailed analysis. It uses the Misra-Gries top-k algorithm to identify offending flows. In our approach, we keep detailed information about each flow.
In contrast to the previous systems, Trumpet 35 performs active monitoring of network flows. Instead of using switches to monitor traffic, it employs the end-hosts, which analyze all incoming and outgoing traffic. Users can deploy triggers at the nodes. This resembles our approach, as we also rely on end-hosts for analyzing the network. However, our approach knows the overall communication pattern in advance and can thus plan the schedule instead of merely reacting to the occurring load. Facebook uses a distributed video processing system called SVE. 30 Similar to our approach, it is streaming-based and has predictable network flows. SVE is a practical scenario where the usage of DFS could be advantageous.

Traffic shaping
Instead of using a centralized resource for managing traffic, Carousel 37 lets the end-hosts control their data-center network. Traffic shaping includes packet pacing, rate-based congestion control, and policy-based bandwidth allocation to flows. The DFS also uses a distributed monitoring and scheduling system for scalability. Instead of reacting to the arising network load, it knows the overall communication pattern in advance and can thus plan the schedule ahead.
# https://home.cern

FIGURE 9 The minimum, maximum, 10th and 90th percentiles, and the median of the buffer fill level on ordered connections of 64, 128, and 192 nodes, respectively, aggregated over all compute nodes for a whole run, sampled every second. Each run took at least 18 minutes, and each input node has access to a buffer space at the compute nodes that can hold up to 16 mtss, so the overall buffer space scales linearly with the number of compute nodes in this evaluation. Half of the nodes are input nodes, and the other half are compute nodes.

With an improved switch queuing algorithm, multipath routing, and a new transport protocol, NDP 38 provides low latency, isolation between different workloads in data-center networks, and fairness, and avoids congestion. It is optimized for Clos networks.
While Carousel follows the trend of moving traffic shaping to the end-hosts, DRILL 39 goes in the opposite direction: it uses switches to perform micro load balancing based on queue occupancy and randomization of the traffic. In contrast to these approaches, DFS uses an end-host-based approach and works independently of and on top of the underlying network routing.
Using credit-based flow control, ExpressPass 40 provides bandwidth allocation and fine-grained packet scheduling. It has similar goals to our approach: fast convergence, low buffer occupancy, and high utilization.
Similar to our approach, some systems rely on end-hosts to perform traffic shaping, while other systems employ dedicated servers to manage the network. We rely solely on the end-hosts. The literature also shows several uses of credit-based flow control, which we employ to manage data flows.
The typical network traffic in a data-center is highly challenging as it is unpredictable and constantly changing with micro bursts. Similar to SVE, 30 our traffic pattern is more constant and predictable, which allows us to use different techniques.

CONCLUSION
We presented a distributed data-flow scheduler (DFS), which runs on a set of senders and receivers to steer high-volume data stream distributions with high throughput, as needed for the Compressed Baryonic Matter (CBM) experiment. DFS aims at achieving a fair network usage so that stream chunks from the same observation time are aggregated at the compute nodes with little delay and low buffer usage.
DFS is nonblocking, distributed, and provides deterministic data flow schedules for the senders based on the behavior observed in the recent past. Due to the use of Libfabric, it works not only with Cray Aries/GNI but also with other modern interconnects. Compared to generic data-center solutions, DFS is coupled with the application and thus knows the intended communication pattern in advance, which gives DFS the advantage of being able to calculate a schedule that is handed to the input nodes so that they can try to follow it. This way, we outperform the network's integrated adaptive routing because DFS can leverage knowledge that is not available to a reactive system.
We have shown that DFS improves both system scalability and performance. It increases the aggregated bandwidth (80% vs 53% of the practically achievable bandwidth for FLESnet on 128 nodes) and reduces the duration to collect all time-slice contributions by a factor of up to 45. DFS synchronizes the input nodes so that complete time-slices arrive at the compute nodes in a timely manner. As a result, DFS needs only 10% of the buffer space at the compute nodes, which is essential for system scalability. DFS distributes the load more evenly over the network resources to saturate all communication links at any time. It reduces network congestion by scheduling the outgoing data packets of input nodes to different compute nodes at a time. DFS detects dynamic network changes and reschedules the traffic accordingly. It also tolerates failures of its distributed instances.