Blue Waters system and component reliability

The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) and single-socket CPU, single-GPU (XK) nodes. Primary storage is provided by Cray's Sonexion/ClusterStor Lustre storage system, delivering 35 PB (raw) of storage at 1 TB/s. The statistical failure rates over time for each component, including CPU, DIMM, GPU, disk drive, power supply, and blower, and their impact on higher-level failure rates for individual nodes and the system as a whole are presented in detail, with particular emphasis on identifying any increase in rate that might indicate the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray to minimize the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented.


BRIEF HISTORY OF BLUE WATERS
The Blue Waters project started in 2007, when NCSA was awarded a grant from NSF to deploy a sustained-petaflop system that could provide advanced computing capabilities to serve the science and engineering communities.2 At the beginning of the project, a selection of key science and engineering applications was identified, and NCSA staff worked extensively with the developers of those applications to ensure that their codes would be ready to scale effectively to a petaflop level of sustained performance.3 The deployment of Blue Waters hardware started in early 2012, when Cray installed at NCSA a 32-cabinet partition that could be used by selected Early Science applications. This partition contained only AMD Interlagos processors. During the summer and fall of 2012, the major remaining parts of the system were installed, including 244 additional cabinets with a mix of CPUs and GPUs, the full Gemini interconnection network, and the final storage sub-system. Starting in mid-summer 2012, the testing and benchmarking workload ran very intensively, along with some early science work.
After intensive on-site testing by NCSA with assistance from Cray,4 Blue Waters was formally accepted by NSF near the end of 2012.5 For acceptance, beyond traditional functionality tests, sustained-petascale performance was measured on a set of fully functional scientific applications,6 using a metric based on a method that considers time-to-solution as the key factor in the evaluation of performance.7 Early in 2013, a nearline storage component was added, containing a high-capacity tape sub-system. The operation of Blue Waters in a production manner started in April 2013. That operation was briefly interrupted in the summer of 2013 to integrate 12 additional cabinets comprising exclusively XK7 GPU nodes, aiming to further promote the adoption of GPU computing by the scientific community. This integration expanded the Blue Waters physical topology to its final configuration, comprising 288 cabinets of compute nodes.8 In summary, Blue Waters has the organization depicted in Figure 1. The compute partition contains 27,648 nodes, including XE nodes (dual-socket AMD processors) and XK nodes (an AMD processor and an NVIDIA K20X GPU). The system has a combined total of 57,930 CPU processors and 4,228 GPUs. There are 201,568 DIMMs in the memory system, and 17,712 disk drives, with 2 TB each, for storage. There are 582 LNET nodes in the storage system.
During the period when Blue Waters was serving NSF-based allocations,9 the job scheduling policy favored large jobs. To improve the spatial geometric shapes of the sets of nodes allocated to such jobs, a topology-aware job scheduler was developed and successfully deployed.9 This special job scheduler was deactivated in 2020, when the new workload consisted predominantly of jobs with fewer nodes.

COMPONENT RELIABILITY
We start our discussion of reliability by presenting, in this section, the observed failure data for each individual type of system component. This analysis shows the actual behavior of each particular class of components across more than 8 years of Blue Waters operation in production.

AMD processor faults
The monthly failure rates for the AMD Opteron/Interlagos processors, the CPUs in Blue Waters nodes, are shown in Figure 2. This figure gives the total number of failures observed in each month since 2013. In the first years of operation, Cray would immediately replace any processor from a node detected as failing. As time went on, the removed processors were tested offline, and most of those failures proved to be transient.
Thus, a simple node reboot would be sufficient to recover the failing nodes. This rebooting practice was adopted in 2017 as the first attempt to bring a failing node back to normal operation. If the node still failed after the reboot, the processor was actually replaced. Of a total of 1525 processor fault events, only 140 were confirmed to correspond to failed processors.
The spike in failures observed in December 2014, initially attributed to the processors, was later identified to be due to a memory issue, as we discuss in the next subsection. Overall, the AMD processors have shown very stable behavior, and the low number of failures observed after January 2020 reflects the transition in the system workload, as the NSF allocations expired at the end of 2019. After that point, the new workload has been less CPU-intensive and more demanding in terms of both GPU computing and I/O. The processor failure rate on the XK nodes is slightly lower than on the XE nodes: XK sockets account for 5% of processor failures while making up 8% of the processor population.
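As a rough check on these proportions, the short calculation below (a sketch in Python; only the counts and percentages quoted above are used, and the interpretation of the 5%/8% split as shares of failures and of population is ours) derives the confirmed-replacement fraction and the relative per-socket failure rate of XK versus XE processors.

# Sketch: confirmed-replacement fraction and XK vs. XE relative failure rate.
# Counts are those quoted in the text above.

fault_events = 1525          # processor fault events observed
confirmed_failures = 140     # events confirmed as failed processors after offline testing

replacement_fraction = confirmed_failures / fault_events
print(f"Confirmed replacements: {replacement_fraction:.1%} of fault events")  # ~9.2%

xk_share_of_failures = 0.05    # XK sockets account for ~5% of processor failures
xk_share_of_population = 0.08  # ...while holding ~8% of the processor population

# Relative per-socket failure rate of XK vs. XE (ratio of rates, not raw counts)
xk_rate = xk_share_of_failures / xk_share_of_population
xe_rate = (1 - xk_share_of_failures) / (1 - xk_share_of_population)
print(f"XK per-socket rate is ~{xk_rate / xe_rate:.2f}x the XE rate")  # ~0.6x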

DIMM faults
The monthly number of DIMM fault events is presented in Figure 3. As with the processors, DIMM failure events initially led to replacement of the affected DIMMs, but later Cray started to work around those node failures by simply rebooting the nodes. However, for DIMMs, rebooting a node was not as effective in fixing the failure as it was for processors, and actual replacements were often required.
Blue Waters employs DIMMs from four different manufacturers, as shown in Table 1. The majority of the DIMMs currently installed come from Micron, and that is also the type of DIMM that has been in the system for the longest time; hence, it accumulates the highest percentage of failures.
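Raw failure counts per manufacturer are hard to compare directly, since both the installed base and the time in service differ; a fairer view normalizes by exposure. The sketch below (Python, with placeholder numbers rather than the actual Table 1 data) illustrates a DIMM-months normalization that could be applied to the logged counts.

# Sketch: normalize DIMM failures by exposure (DIMM-months) per manufacturer.
# The numbers below are placeholders, not Blue Waters data; substitute the
# installed counts from Table 1 and the failure counts from the maintenance logs.

fleet = {
    # manufacturer: (installed DIMMs, months in service, observed failures)
    "vendor_A": (120_000, 100, 900),
    "vendor_B": (50_000, 60, 250),
}

for vendor, (installed, months, failures) in fleet.items():
    exposure = installed * months           # DIMM-months of service
    rate = failures / exposure              # failures per DIMM-month
    print(f"{vendor}: {rate * 1e6:.1f} failures per million DIMM-months")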
The three spikes observed in Figure 3, for December 2014, September 2016, and March 2021, correspond to "row-hammer" events in the DIMMs10: due to a design problem, certain memory access patterns on writes cause changes in neighboring memory cells. When those changes are detected by the processor, a node interrupt is triggered. On Blue Waters, this problem was observed only on the Samsung DIMMs, and because it is associated with particular kinds of memory access, it is triggered by only a few applications. The problem was corrected in each case by working with the application team to slightly modify their code or, often, just change the compiler optimization level.

GPU faults
Although other large Cray systems employing GPUs, such as ORNL's Titan, presented a high number of GPU failures in the past,11 the number of observed GPU failures on Blue Waters has been moderate. Figure 4 shows the number of GPU failures by month, indicating a noticeable increase in the number of monthly failures in recent years. The average GPU utilization was high and steady through the end of 2019 but dropped substantially after the start of 2020, with a significant portion of XK node usage coming from CPU-only applications. Since the GPU failure rate is close to flat from 2018 through 2021, we conclude that the failures are age related rather than load related. Despite the modest increase in the failure rate, the number of failures is still well within manageable levels, and there are sufficient spare parts to last several years.
The reasons for the GPU failures are detailed in Figure 5, showing that most of the faults are due to either page retirement or double-bit errors in the GPU memory structure. Such errors are well documented in Reference 12, and their occurrence is not surprising for a system with the dimensions of Blue Waters. Nearly one third of the failures required replacement of the GPU, whereas the remaining failures could be handled with a reboot of the corresponding XK node.
The low number of GPU replacements (only 123 out of more than 4200 parts, or less than 3%) is even more remarkable given that most of these GPUs were shipped by NVIDIA directly to NCSA and were from the same manufacturing batch as those shipped to ORNL for the Titan system. In the fall of 2012, when most of Blue Waters was already in place undergoing testing, the K20X GPUs, which had just entered production at NVIDIA, were installed by Cray personnel into empty sockets of the XK nodes. Thus, these nodes did not receive the extensive factory testing that Cray typically performs before deploying a system at a customer's site. Nevertheless, during system acceptance, specific tests were conducted to verify the proper behavior of all XK nodes.

Hard drive failures
For a system with more than 17,000 traditional (i.e., mechanical) hard disk drives, one would expect the storage sub-system to be a critical component for reliability. However, the Blue Waters disks continue to exhibit very good behavior, despite the age of the system. Figure 6 depicts the number of disk replacements by month, including replacements required due to actual failures and replacements recommended because of degraded metrics observed for the drives.
Starting in January 2018, Cray implemented a more rigorous policy for preemptively replacing disk drives: any drive was replaced that (a) contained more than 1000 replaced sectors, (b) accumulated more than 100 uncorrected reads/writes, or (c) presented consistently slow response.
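Such a policy lends itself to automatic evaluation against the health counters collected for each drive. The sketch below (Python) applies the three thresholds above; the record layout and the interpretation of "consistently slow response" as a median-latency test are illustrative assumptions, not the actual Cray implementation.

# Sketch: flag drives for preemptive replacement under the 2018 policy.
# The three thresholds come from the text; the field names and the notion of
# "consistently slow" (median of recent latency samples) are assumptions.

def should_replace(drive: dict) -> bool:
    too_many_replaced_sectors = drive["replaced_sectors"] > 1000
    too_many_uncorrected = drive["uncorrected_rw_errors"] > 100
    samples = sorted(drive["recent_latency_ms"])
    consistently_slow = samples[len(samples) // 2] > drive["latency_threshold_ms"]
    return too_many_replaced_sectors or too_many_uncorrected or consistently_slow

drive = {
    "replaced_sectors": 1203,
    "uncorrected_rw_errors": 4,
    "recent_latency_ms": [12, 15, 11, 14, 13],
    "latency_threshold_ms": 50,
}
print(should_replace(drive))   # True: the replaced-sector threshold is exceeded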
Under this new policy, disks were replaced more often, with a monthly average of 11.1 replaced drives over the last 3 years, whereas the lifetime rate for the system is 9.7 replaced drives per month. Nevertheless, over the past 12 months, the replacement rate was 10.1 drives per month, showing that the storage sub-system of Blue Waters has not yet reached the ramp-up of the "bathtub curve" traditionally expected for aging components.13 This new disk replacement policy was motivated by the spike in failures observed in January 2018: two drives in the same RAID set failed, and during the rebuild of that RAID set, a third drive in the same set was preemptively failed due to its high rate of observed errors. This required manual reconstruction of the RAID set, and fortunately no data was lost. To minimize the likelihood of such failures recurring in a given RAID set, the new replacement scheme was adopted. In addition, the preemptive-failure logic was modified so that it would not fail a drive while that drive was part of a rebuild process.
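For context, these monthly figures correspond to a very small annualized fraction of the 17,712-drive fleet; the minimal calculation below (Python, using only the numbers quoted above) makes this explicit.

# Sketch: annualized drive replacement rate implied by the monthly averages above.
drives_installed = 17_712

for label, per_month in [("lifetime", 9.7), ("last 3 years", 11.1), ("last 12 months", 10.1)]:
    annual_rate = per_month * 12 / drives_installed
    print(f"{label}: ~{annual_rate:.2%} of drives replaced per year")
# All three figures stay below ~0.8% per year, consistent with the claim that the
# ramp-up of the bathtub curve has not yet been reached.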

Liquid cooling system failures
The Cray XE/XK rack design employs a mix of liquid and air cooling. External to the cabinets, Liebert XDP heat-exchanger units cool R-134a refrigerant using facility chilled water. The R-134a provides the liquid cooling that serves as the basis for cooling a given cabinet.14 At the bottom of each cabinet there is a blower that pushes air from the lower to the upper part of the cabinet, providing cooling air through the compute blades, which are positioned vertically. The air that circulates inside the cabinet is cooled by the liquid provided externally by the XDPs. The external liquid-cooling mechanism is supported by an advanced structure15 in the building where Blue Waters is installed.
For Blue Waters, each XDP unit feeds four compute cabinets; in normal operation the design is room neutral in temperature but is not a closed loop. When an XDP unit fails, the exhaust air from the affected four racks quickly heats up and can easily exceed the maximum inlet temperature for the racks, causing them to power off. Since the exhaust air mixes with that of neighboring cabinets, all four racks do not always fail. In addition, if the XDP issue can be anticipated, the rear doors of the affected cabinets can be propped open and vent tiles arranged to provide facility air from the raised floor, keeping the racks running while the XDP is serviced. In the Blue Waters facility, one challenge for this system is that the cooling water supplied to the XDPs has two sources. In the cooler months, on-site evaporative cooling towers provide water at up to 60°F, while the rest of the year mechanically chilled water is provided at 43°F. This temperature range stresses the water control valves, and the valve controls have been a significant maintenance issue. In addition, the original pump gaskets degraded over time, resulting in refrigerant leaks that could shut down four compute cabinets. Ten single or multiple compute-rack interrupts were attributed to XDP issues. However, many more issues were proactively detected and corrected without impacting the compute system; all pump gaskets and valve control arms were proactively replaced.

Blower failures
As mentioned in the previous subsection, each Cray XE/XK rack uses a single 7.5 HP blower to keep the rack cool. When that blower fails, the rack almost immediately powers off due to exceeding its thermal limits. Fortunately, blower failures are rare, with 10 total failures in 8 years and a maximum of two in any single year.

Power supplies
Each Cray XE/XK cabinet uses seven power supplies to convert the 480 V AC input to DC for distribution inside the rack. The power supplies are designed with redundancy such that one can fail without impacting the rack. However, in certain failure modes an arc is generated that creates an inrush current high enough to trip the facility supply breaker, taking down the cabinet. This unusual failure mode has caused six single-rack failures in 8 years of operation.

SYSTEM-WIDE RELIABILITY
We now discuss failures of wider scope, such as those interrupting an entire node or even the full system. Thanks to good engineering and careful maintenance, these interruptions did not cause severe downtime during the lifespan of Blue Waters. Cray designed the XE/XK platform to be fault tolerant and very serviceable. High-failure-rate components have redundancy at the system and rack levels. At the node level, components are socketed for good serviceability. Failed nodes and node boards were replaced once or twice per week, when the number of failures justified applying a non-intrusive re-routing configuration to the high-speed network. While the network quiesces did not have a severe impact, it is still instructive to analyze their frequency and main causes.

Daily node interrupt rate
The daily node failure rate for Blue Waters is shown in Figure 7, for both XE (dual-CPU) and XK (CPU + GPU) nodes. As the figure shows, the failure rate has been quite stable. In the first years of operation, many node failures were caused by system software issues (including Lustre); the affected nodes were returned to service quickly with a reboot, though the long-term fix came via system software updates. In more recent years, a higher proportion of failures has been caused by hardware, since software updates have been much less frequent in this period.
Over the lifetime of Blue Waters, we have observed an average of two node failures per day. Since 2016, as the system software became more mature, the rate dropped to 1.6 node failures per day. Furthermore, over the last 12 months, we have observed an average of only one node failure per day, which is quite impressive for a system with more than 27,000 nodes.
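Expressed per node, those rates imply a very long mean time between failures for an individual node; the short calculation below (Python, using the node count and daily rates quoted above) makes the point explicit.

# Sketch: per-node mean time between failures implied by the daily rates above.
nodes = 27_648

for label, failures_per_day in [("lifetime", 2.0), ("since 2016", 1.6), ("last 12 months", 1.0)]:
    node_mtbf_years = nodes / failures_per_day / 365.25
    print(f"{label}: ~{node_mtbf_years:.0f} years mean time between failures per node")
# At one failure per day across 27,648 nodes, an individual node fails on average
# only about once every 76 years.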
Because there are no periodic interruptions for preventive maintenance on Blue Waters, whenever a node fails and cannot be rebooted by software, that node is left down until there is an opportunity for its replacement. When the number of down nodes exceeds a value agreed upon by Cray and NCSA, Cray conducts (at most once a week) a procedure of "warm-swapping" the blades containing failed nodes. This is preceded by a re-routing of the high-speed interconnect, such that the affected blades can be safely powered down and physically replaced while the rest of the system continues in regular operation. After the blades are replaced and powered up, the interconnect is reconfigured back to its original routing. The removed nodes are then tested offline by Cray, in a test cabinet, to diagnose their status.
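The warm-swap procedure follows a fixed sequence of steps; the outline below is a sketch of that sequence in Python pseudocode, with illustrative stub functions rather than actual Cray tooling or commands.

# Sketch of the warm-swap sequence described above. The helper functions are
# illustrative stubs, not actual Cray commands or tools.

def reroute_interconnect(blades):
    print(f"quiesce and re-route the high-speed network around {blades}")

def replace_blade(blade):
    print(f"power down, physically replace, and power up blade {blade}")

def restore_routing():
    print("reconfigure the interconnect back to its original routing")

def warm_swap(failed_blades, threshold):
    # Performed at most once a week, and only once enough nodes are down.
    if len(failed_blades) < threshold:
        return
    reroute_interconnect(failed_blades)
    for blade in failed_blades:
        replace_blade(blade)            # the rest of the system keeps running
    restore_routing()
    print(f"send removed blades {failed_blades} to the test cabinet for diagnosis")

warm_swap(["blade-A", "blade-B"], threshold=2)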

System-wide interrupts
Various factors can lead to a system-wide outage in a large system. Figure 8 shows the number of monthly system-wide interrupts since Blue Waters entered production. As in the case of node interrupts, most outages in the first 3 years were caused by software issues. The most impactful issues were in the software controlling the high-speed network. Since Blue Waters was over 50% larger than the next biggest Cray XE/XK system, it is not surprising that multiple issues were encountered in the high-speed network stack that had not been seen on other systems. Since 2016, however, the number of interrupts per month has been stable and quite low. Because the system software is no longer updated frequently, the recent outages have had other causes, such as occasional hardware failures in the interconnect or problems in the storage sub-system triggered by particular access patterns in the workload.
The second notable aspect relates to unscheduled reboots. Rebooting a large system like Blue Waters is a very costly operation, which can take several hours depending on how the system was last shut down. Fortunately, as noted in Figure 8, the number of such operations was typically low in any given year. Furthermore, given the policy of avoiding regular maintenance operations discussed in the previous subsection, the time lost to powering the system up or down was kept to a minimum.
Figure 7: Daily node interrupt rates by month.
Figure 8: System-wide interrupts.

Power events
Power is provided to the Blue Waters facility via four 8 MW, 13.8 kV feeds from the University of Illinois' 100 kV substation, providing 24 MW of usable power, with the option of using multiple feeds for power diversity. Since this power comes from the local utility, it is susceptible to local severe-weather events, though those interrupts usually last less than a second. Due to the capital and operating costs, no Blue Waters equipment uses a UPS or other similar backup power. The Blue Waters storage sub-system uses two feeds to each rack, which has proven very effective at preventing power-related issues. However, each Cray XE/XK rack allows only a single power feed. Thus, the 288 XE/XK racks are distributed across the four feeds, and an interrupt on any one feed has the potential to power off a quarter of the system, requiring a reboot to recover system operation. In 8 years of operation there have been 15 full-system outages due to facility power issues. Ten of those were caused by external weather-related events, while the others were a mixture of component failure and human error.
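A short calculation (Python, using the feed and rack counts above; the even split of racks and nodes across feeds is assumed for illustration) shows the exposure of the compute partition to the loss of a single feed.

# Sketch: compute-partition exposure to the loss of one utility feed.
feeds = 4
racks = 288
compute_nodes = 27_648

racks_per_feed = racks // feeds            # 72 racks share each feed
nodes_per_feed = compute_nodes // feeds    # ~6,912 nodes, a quarter of the system
print(f"{racks_per_feed} racks / ~{nodes_per_feed} nodes exposed per feed "
      f"({racks_per_feed / racks:.0%} of the compute partition)")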

BLUE WATERS DATA AVAILABILITY
Blue Waters is possibly one of the most intensively monitored supercomputers in the world. In addition to its large size, which naturally leads to a huge volume of data from various metrics, the management of the system involves the collection and analysis of several distinct kinds of operating-system and application data. The monitoring of the physical machine uses a holistic approach that collects information from all system components,16 using a scalable, lightweight mechanism that captures that information from each source at a frequency of one set of samples per minute.17,18 In addition to hardware-related data, extensive logs of activity in the system are maintained, aiming to support detailed analysis of system behavior, application performance, or any other investigation of interest. To offer the HPC community a rich source of real-life system data, many of these logs, properly anonymized, are publicly available via the Globus transfer tool at the following site: https://bluewaters.ncsa.illinois.edu/data-sets

The available data consist of the following:

• Collected metrics from hardware sub-components
• Various system-related logs
• Logs from the job scheduler and job execution
• Darshan logs from I/O activity in applications
• Logs of users' view of the quality of service of storage