Availability analysis of redundant and replicated cloud services with Bayesian networks

Due to the growing complexity of modern data centers, failures are not uncommon anymore. Therefore, fault tolerance mechanisms play a vital role in fulfilling availability requirements. Multiple availability models have been proposed to assess compute systems, among which Bayesian network models have gained popularity in industry and research due to their powerful modeling formalism. In particular, this work focuses on assessing the availability of redundant and replicated cloud computing services with Bayesian networks. So far, research on availability has only focused on modeling either infrastructure or communication failures in Bayesian networks, but has not considered both simultaneously. This work addresses practical modeling challenges of assessing the availability of large-scale redundant and replicated services with Bayesian networks, including cascading and common-cause failures from the surrounding infrastructure and communication network. In order to ease the modeling task, this paper introduces a high-level modeling formalism to build such a Bayesian network automatically. Performance evaluations demonstrate the feasibility of the presented Bayesian network approach to assess the availability of large-scale redundant and replicated services. The model is not only applicable in the domain of cloud computing; it can also be applied to general cases of local and geo-distributed systems.


I. INTRODUCTION
Due to the growing complexity of modern data centers, failures are not the exception anymore; they are the norm [1]. For example, the OVHcloud data center incident in 2021 led to the unavailability of multiple online businesses [2], while the Facebook outage in late 2021, caused by a misconfiguration of the backbone routers [3], led to an estimated loss of 65 million dollars in revenue [4]. Cloud operation teams and reliability engineers employ fault tolerance techniques to mask faults through redundancy or replication, deploying multiple instances of the same service to increase availability. These instances are not fault independent. They normally share common causes of failures with the surrounding execution environment and communication network, raising the question whether fault tolerance measures meet the availability requirements. To answer this question, this paper proposes a novel Bayesian network modeling approach to assess the availability of redundant and replicated cloud services in the presence of network and common-cause failures.
This work distinguishes between the terms redundant and replicated cloud services to address two different modeling semantics with respect to service communication, which can lead to different availability outcomes. In a broader sense, redundancy implies independent service instances (copies) that work in parallel. Redundant services can be stateful or stateless. For example, redundant DNS servers are stateful, where multiple DNS instances can independently serve client requests. Examples of stateless redundant services are AWS Lambda and Azure Functions, which are part of the Function-as-a-Service (FaaS) layer. In contrast, replication always involves stateful services that implement a replication protocol to maintain the desired degree of state consistency between the instances. Examples of such systems are replicated databases [5]-[7] and distributed locking services [8]. These instances need to communicate with each other at some point in time, as opposed to the instances of a redundant service.
The ISO/IEC/IEEE International Standard on Systems and Software Engineering defines availability as the "degree to which a system or component is operational and accessible when required" [9]. Similarly, we refer to availability as the likelihood of a cloud service being reachable and operational (up) when required. Figure 1 exemplifies the difference in failure modes when assessing the availability of a redundant or a replicated service. Common-cause failures and cascading faults in the infrastructure can simultaneously lead to the unavailability of multiple service instances. Network faults might lead to network partitioning, which renders service instances unreachable for client requests or segments the instances of a replicated service into groups that cannot agree upon the next states. For example, Figure 1a shows a redundant service. A client application regards the redundant service as available as long as it can reach at least one of the instances. In contrast, Figure 1b depicts a replicated service, which has the overhead of inner-replica communication due to the necessity of implementing a replication protocol. So, the replicated service is available as long as at least one working instance is reachable by the client, and that instance can communicate with sufficient remaining instances to reach the required quorum size, i.e., to correctly implement the replication protocol. As a result, this communication overhead might involve more network components that form an additional source of potential failures, which we need to account for in our availability model.

Fig. 1. To assess the availability of a redundant or replicated service, one needs to consider the reachability of the instances through the communication network, as well as their fault dependencies with the execution context. (a) Redundant system represented by an independent set of instances which all offer the same service. (b) Stateful replication, which necessitates communication between replicas to agree upon the state of the service.
As Michael R. Lyu noted [10], it is not sufficient to assess the reliability or availability of a software system in isolation. It is important to also consider the execution (operational) environment in order to create accurate availability models. However, while researchers acknowledge the significance of infrastructure and communication faults [11], [12], they usually model either the infrastructure [13]-[15] or the communication [16]-[19] part of a system. Moreover, with the advent of cloud computing, reliability engineers face the challenge of modeling the availability of large-scale cloud services.


II. SYSTEM MODEL

The proposed availability model considers redundant or replicated distributed systems (cloud services) as a set of instances. Instances are assumed to run on virtual or physical hosts, placed within the infrastructure of one or more data centers, and linked by a communication network. The network is assumed to consist of components such as switches, routers, and middleboxes, e.g., firewalls, which are placed within the same infrastructure as the hosts themselves.
Specifically, redundant services can be stateless or stateful, where a stateful redundant service does not replicate its state. Replicated services always refer to stateful services whose state is replicated.
A replicated service is available when sufficient replicas are available. Conversely, if too many replicas are unavailable, i.e., have crashed or are not reachable, the service is considered unavailable at the time of the request. A quorum is a certain set of k-out-of-n redundant instances that need to be available to provide a particular service function. Note that different functions, such as reading or writing a data object, can have different quorum sizes, depending on the replication protocol. Therefore, in this work, service availability implicitly refers to the availability of a specific service function or operation.
The model considers two types of communication patterns. For redundant services, we assume that a client only needs to communicate with one instance to issue its request. For replicated services, it is also sufficient for a client to communicate with one instance to initiate the request. However, that instance needs to be able to communicate with sufficient remaining instances to agree upon the result of the client's request. The exact fault tolerance semantics for redundant and replicated services are flexible and can be defined by the reliability engineer as part of the system description.
The hosts and the communication network are part of the infrastructure, which forms a complex component-based system consisting of infrastructure components, such as data centers, racks, power supplies, virtual machines, and network appliances. The model assumes that hardware and software components, including the service instances, follow a crash-recovery model. As soon as a component encounters a failure, it crashes and stops, and eventually recovers. Each component in the infrastructure has its own probability of failing without external influence.
Moreover, the model assumes that infrastructure components have fault relations, representing potential common causes of failures. These fault relationships can form a cause-effect chain, where the failure of one component is the cause of the failure of another component, essentially propagating the failure through the infrastructure until it eventually leads to the failure of the cloud service, i.e., a cascading failure. In order to formalize the relationship between two directly fault-dependent components, the model assumes that the dependence can be described by means of a static fault tree [26].
Client applications and instances can communicate with each other by exchanging messages via the communication network. The network is composed of network components forming a network graph. The end-to-end communication, i.e., channels, between instances and clients can be synchronous or asynchronous and implemented by one or more redundant network routes. A channel crashes when there is no route in the network to connect the two endpoints, and a route becomes unavailable when at least one network component along the route crashes. Client applications might be placed outside of the known infrastructure. In this case, the model considers the paths starting from the network appliance that constitutes the entry point of the data center; or, if the client application is within the data center, its host. Moreover, we assume there exist some dedicated network components, e.g., firewalls or load balancers, that act as gateways, i.e., entry points, for client applications to communicate with the service.
A particular placement of instances to virtual or physical hosts is called a deployment and is known beforehand. Instances do not migrate. If an instance crashes, it does not recover on a different host; it recovers on its former host. Hence, if a host crashes, all its instances can recover when the host recovers. The model places no restrictions on the number of instances per host; multiple instances can run on the same host. In the case of replication, the model does not assume a particular concurrency control method or replication protocol. Either at any given point in time there are enough replicas up and reachable to agree upon the results of a client's request, or too many replicas have crashed or are unreachable, such that the remaining replicas cannot form a quorum for any client request, resulting in the unavailability of the service.

III. HIGH LEVEL MODEL DESCRIPTION
This section addresses the modeling challenge of building a Bayesian network model to infer the availability of a cloud service in the presence of cascading infrastructure and network faults. To ease the modeling process, we present a high-level model description first, which we later translate into a Bayesian network. The model contains three basic sub-models: a failure model for the infrastructure, a model for the network, and a model to describe the fault-tolerance semantics of the service. This provides the advantage of choosing the component granularity of the system. First, we begin with the basic unit of our model, a component.

Definition III.1 (Component).
A component C_i ∈ C, from the finite set of all components of the system C = {C_1, C_2, . . .}, is an indivisible hardware or software entity with the states {F, T}, a probability P(C_i = F) = q_i of observing the component as faulty (unavailable), and P(C_i = T) = 1 − q_i of observing the component as operational or working (up).
The set I = {I_1, . . . , I_n} ⊂ C contains the instances of the service. The remaining components are infrastructure and network components.
Components might have fault dependencies between themselves. We describe these fault dependencies as a directed acyclic graph (DAG). Directed edges are tuples (C_i, C_j), where C_i is said to be a parent component of C_j, and C_j is said to be a child component of C_i. These edges can also define a contained-in relation, to signify that one component is contained within another.

Definition III.2 (Fault Dependency Graph). Given the set of all components C, the model defines the fault dependency graph as a DAG G_FD = (C, E_INF, FT), with edges E_INF ⊆ C × C, and an associated (static) fault tree model FT for every component in C.
In order to express complex component dependencies, FT(C_i) contains the definition of a static fault tree that describes the fault semantics of a component C_i as a function of its parent components. FT(C_i) has as its top event the failure of component C_i and as base events C_i's parent components. For example, Figure 2a shows a fault dependency graph with a host component that depends on three parent components to illustrate how to apply FT. The host fails if the rack breaks, e.g., catching fire, or both power supplies stop working. FT(host) encodes this failure relation at the host component, as shown in Figure 2a. Hence, the corresponding fault tree shown in Figure 2b has the power supplies and the rack as basic input events and the host failure as the top event. The fault tree uses an OR gate to trigger the top event.
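As an illustration of this fault tree semantics, the following Python sketch encodes the host example as a nested gate structure and evaluates it for a given set of failed parent components. The data structures and names are illustrative only and are not part of the formal model.

from dataclasses import dataclass

@dataclass
class Component:
    """Illustrative container for Definitions III.1/III.2: a component
    with an intrinsic fault probability q and a static fault tree FT
    over its parent components."""
    name: str
    q: float = 0.0
    ft: object = None  # nested gates: ("OR"/"AND", operand, ...)

# Host example from Figure 2: the host fails if the rack fails,
# or if both power supplies fail.
host = Component("H1", q=0.005, ft=("OR", "Ra1", ("AND", "PS1", "PS2")))

def ft_fails(ft, failed):
    """Evaluate a static fault tree given the set of failed parents."""
    if isinstance(ft, str):          # base event: a parent component
        return ft in failed
    gate, *operands = ft
    results = [ft_fails(op, failed) for op in operands]
    return all(results) if gate == "AND" else any(results)

print(ft_fails(host.ft, {"PS1", "PS2"}))  # True: both power supplies down
print(ft_fails(host.ft, {"PS1"}))         # False: one power supply alone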
The host fails when the rack fails, or both power supplies fail, represented by the AND gate. The fault tree is part of the host component to determine the cause of a host failure due to external influences, which depends solely on its parent components. Note that the fault dependency model is a DAG, disallowing cyclic fault dependencies, since they would lead to cycles in the final Bayesian network graph, which are not allowed by definition.

Fig. 2. Fault Dependency Graph Example. (a) Fault dependencies of a host. (b) Local fault tree model of the host component.

To account for communication faults, the model needs a representation of the network. Network components represent network appliances such as switches, routers, load balancers, and firewalls. Consequently, the failure of related infrastructure components can influence the failure of a network component, which can lead to communication failures. Unlike the fault dependency graph, the network graph can have cycles.

Definition III.3 (Network Graph). Given a set of hosts H ⊂ C, a set of network components N ⊂ C, and their union C_NET = H ∪ N, the network is a graph G_NET = (C_NET, E_NET) with unidirectional edges, where the edges E_NET ⊆ C_NET × C_NET define the communication links between any two network components.

With this graph notion, reliability engineers can decide the granularity of the network model. Suppose they have little or no knowledge of the network. In that case, they can represent the network as 'one switch' connecting all instances, aggregating all potential failure probabilities into one value for one super component. However, they can also describe more complex network graphs if they have ample knowledge, which improves the model w.r.t. a more realistic representation of the actual network.
The final system description of the cloud service is the unification of the above model definitions.

Definition III.4 (High-level System Model). A system S = (C, Q, G_FD, G_NET, D, P, G, c) is an eight-tuple consisting of the following elements:
C: the set of all infrastructure components, network components, and instances.
Q: the fault tolerance model, defined as a path set of instances Q = {Q_1, . . . , Q_m} ⊆ 2^I.
G_FD: the fault dependency graph.
G_NET: the network graph.
D: the association of instances to hosts, D : I → H.
P: the function of all fault probabilities of the components in C.
G: the set of network components that act as entry points for client applications to establish a communication channel with the instances of the service, G ⊆ C_NET.
c: a Boolean value c ∈ {false, true} indicating whether the service is redundant or replicated.

The parameter Q defines all instance combinations for which the service is considered in a working state in the presence of instance failures. This generic definition fits redundant as well as replicated services. It implies the enumeration of all valid instance combinations to build Q, building a (minimal) path set of the service instances. For example, let us assume a service has three instances I = {I_1, I_2, I_3} and the service works as long as two instances are up. As a result, Q is the enumeration of all combinations with at least two instances: Q = {{I_1, I_2}, {I_1, I_3}, {I_2, I_3}, {I_1, I_2, I_3}}. This definition provides a flexible way to express a wide range of fault tolerance semantics. However, the enumeration of all instance combinations can become inefficient, especially when considering services with hundreds of instances. To alleviate this burden, we suggest an implicit construction method for k-out-of-n redundancy and voting-based replication models, as well as for the special cases of read-one and write-all replication. For these specific models, we define Q as a tuple (V, t), where V = (v_1, . . . , v_n) are instance votes and t is a threshold value. The availability model will then account for the probability of observing sufficient working instances such that their votes exceed the threshold. For example, we can express the previous example as Q = ((1, 1, 1), 2) to implement the majority set without enumerating all possible set combinations.

A service can have different thresholds, that is, different quorum size requirements, per operation, as in read-one/write-all replication: read-one has t = 1 for the read operation, and write-all has t = n for the write operation. The service definition would then refer to one specific operation. Multiple operations can be supported by defining a service model for each operation separately and computing their availability values. At this point, it is up to the reliability engineers how to aggregate the availability of the different operations. They can use the lowest resulting value as a means to assess the probability of the worst-case service model, or they can compute the (weighted) average availability across all operations. Independently of the aggregation method they choose, this work shows how to build the availability model accordingly.
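As a sketch of both representations, the following Python fragment enumerates the explicit path set Q for the three-instance example above and checks the implicit (V, t) form; the helper names are illustrative.

from itertools import combinations

def enumerate_quorums(instances, k):
    """Explicit path-set form of Q: every combination of at least k
    working instances keeps the service available."""
    return [set(c) for r in range(k, len(instances) + 1)
            for c in combinations(instances, r)]

def quorum_reached(up, votes, t):
    """Implicit (V, t) form: do the votes of the working instances
    reach the threshold t?"""
    return sum(votes[i] for i in up) >= t

I = ["I1", "I2", "I3"]
print(enumerate_quorums(I, 2))   # the four sets listed above
votes = {"I1": 1, "I2": 1, "I3": 1}
print(quorum_reached({"I1", "I3"}, votes, t=2))  # True: majority reached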
Let us exemplify the system model by describing a database management system, as shown in Figure 3, which we will use as a running example for the construction of the Bayesian network model in the next section. Figure 3 shows the overall system with the infrastructure and network components that provide the execution environment for the database management system. Although the data center infrastructure might be much larger, we only consider those components which serve the service. The database management system consists of seven replicas I_1 to I_7, placed on hosts within the infrastructure of two data centers. Without loss of generality, the service is available as long as the replicas can form a majority quorum.
Black arrows define fault dependencies between infrastructure components, and blue edges represent communication links between network components. Without restrictions, in this example we assume that a component fails when all its parent components fail; however, our Bayesian network model is also capable of modeling more complex component dependencies, such as redundant power supplies. Each component has its own intrinsic fault probability q, representing the likelihood of a component failure without external influence. Here, the fault probabilities are sampled from a beta distribution with ∀i : q_i ∼ Beta(10, 1000).
Finally, the database management system has the following service description:
• The fault dependency graph G_FD is given by the black arrows in Figure 3. If a component has no parent, e.g., DC_1, then pa returns the empty set.
• The network graph G_NET is given by the blue edges in Figure 3.
• The fault probabilities P of observing the components as unavailable are sampled as described above. For the sake of readability, we assume that instances do not fail due to intrinsic faults; hence, they have an availability of one.
• The entry point for client applications is the firewall: G = {FW}.
• With c = true, the model will consider communication between the instances, describing a replicated service.

For example, the final model would address failure modes where rack Ra_1 fails, which leads to the failure of all its built-in components, including its hosts H_1 to H_3, the firewall, and the switches N_1 and N_2. As a result, the replicas I_1 to I_3 would also fail, since Ra_1 is a common cause of failure here. The Bayesian network model compactly encodes all combinations of component states and their probabilities for which the service is considered available as part of its qualitative representation, without enumerating all potential failure combinations explicitly.

IV. BAYESIAN NETWORK MODEL
The translation of the high-level service model into a Bayesian network consists of three steps. First, it builds a Bayesian network model of the fault dependency graph. Afterward, it extends the initial Bayesian network with the failure model for inner-replica communication when considering replicated services. The third step finalizes the Bayesian network model by including the failure model for the client-to-instance communication. This modeling approach is novel insofar as it can address network partitioning failures by defining the availability of the service as a function of the channels between instances. For instance, in the case of replicated services with voting-based replication, instead of building a model that accounts for at least k-out-of-n working instances, we build a model with which we can infer the probability that, for any reachable instance, there are at least (k−1)-out-of-(n−1) working channels connected to the remaining working instances.

A. Background
We will use the Bayesian network representation of fault tree gates throughout the modeling process. This section provides the necessary background to understand fault trees and their equivalent Bayesian network notation. Readers familiar with this notation are free to skip this subsection.
Fault trees are graphs that describe how certain combinations of component faults, known as base events, can lead to an undesired system failure, known as the top event. Logic gates are used to create intermediate events by forming a Boolean expression that describes which combinations of base events lead to a system failure [27]. There are three basic gate types that all fault tree variants have in common: the AND, the OR, and the k-out-of-n voting gate. The AND gate propagates a fault if all input events trigger a fault, while the OR gate propagates a fault when at least one input event has triggered a fault. The k-out-of-n voting gate propagates a fault when k or more of its n inputs are fault events. The voting gate is suitable to model groups of redundant components: a group is considered available as long as fewer than k of its n components have failed.
Bobbio et al. [28] introduced the general approach to represent fault tree gates with the help of Bayesian networks. This work will use these translation concepts as templates to construct the proposed Bayesian network availability model. A discrete Bayesian network [29] is a DAG G = (X, E) that represents a joint probability distribution P(X) over the set of discrete random variables X = {X_1, X_2, . . . , X_n}. The terms variable and node are used interchangeably to denote the vertices of the Bayesian network graph. For every edge (X_i, X_j) ∈ E between the nodes X_i and X_j, X_i is said to be a parent node of X_j, and X_j is a child node of X_i. Each variable has a conditional probability distribution P(X_i = x_i | pa(X_i)) encoded as a conditional probability table (CPT). The CPT contains the probability of observing a certain state X_i = x_i given the observed states of its parent nodes, denoted by the parent function pa(X_i) = {X_p : ∀(X_p, X_i) ∈ E}. Nodes without parents are called root nodes and have a prior probability distribution P(X_i = x_i).
A Bayesian network compactly entails a full joint probability distribution as the product of all the nodes' conditional probability distributions:

P(X) = ∏_{i=1}^{n} P(X_i | pa(X_i)).    (1)

With the help of the joint probability distribution, one can use inference to compute the posterior distribution P(Y | X_O) of some query Y ⊂ X of uncertain variables from a given subset X_O ⊂ X \ Y of observations of the remaining variables.

Figure 4 (left side) shows the main Bayesian network structure to realize the AND/OR and the k-out-of-n voting gates. The basic structure has n components C_1 to C_n with prior probabilities represented by their eponymous binary random variables with states {F, T}, observing the component either as faulty or as available, respectively. The individual semantics of the gate types are encoded within the CPT of the Gate node. The fault relation of fault tree gates is defined over the fault state of their input events, making the AND/OR semantics counter-intuitive to an actual AND/OR expression in Boolean algebra. However, to be true to the original definition of the fault tree gates, the Boolean expressions of AND and OR gates act upon the components' fault states.

Fig. 4. Basic Bayesian network to represent the fault tree's AND/OR or k-out-of-n voting gates (left). Example instance of a Bayesian network k-out-of-n model (right).
1) AND Model: For every state combination of the parent nodes, we define Gate = F if all parent nodes are observed in state F. Hence, the conditional probability distribution of the Gate node has the following short-hand definition:

P(Gate = F | C_1, . . . , C_n) = 1 if ∀i : C_i = F, and 0 otherwise.    (2)

2) OR Model: For every state combination of the parent nodes, we observe Gate = F if at least one parent node is in state F:

P(Gate = F | C_1, . . . , C_n) = 1 if ∃i : C_i = F, and 0 otherwise.    (3)
3) k-out-of-n Model: The k-out-of-n voting gate triggers a fault event when k or more input events are in a faulty state. Hence, the CPT of the Gate node has to count how many parent nodes are in state F. This is done for each column: we set the probability of state T to 1 if fewer than k parent nodes are in state F, or set the probability of F to 1 if k or more parent nodes are in state F. For example, Figure 4 (right side) shows an instance of the k-out-of-n model for a two-out-of-three voting gate. Formally, the conditional probability distribution of the k-out-of-n model has the following definition:

P(Gate = F | C_1, . . . , C_n) = 1 if ∑_{i=1}^{n} 1_F(C_i) ≥ k, and 0 otherwise,    (4)

where 1_F(x) is an indicator function such that 1_F(x) = 1 if x = F, and 0 otherwise.
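To make the gate models concrete, the following sketch builds a two-out-of-three voting gate with pgmpy, the package also used in Section V. The fault probabilities are illustrative, and older pgmpy releases name the model class BayesianModel instead of BayesianNetwork.

from itertools import product

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

q = [0.01, 0.02, 0.03]  # assumed component fault probabilities
k, n = 2, 3             # two-out-of-three voting gate
# State convention: 0 = F (faulty), 1 = T (working).
bn = BayesianNetwork([(f"C{i}", "Gate") for i in range(n)])
for i, qi in enumerate(q):
    bn.add_cpds(TabularCPD(f"C{i}", 2, [[qi], [1 - qi]]))

# Build the Gate CPT column by column; pgmpy orders the columns like
# itertools.product over the parent states (first parent slowest).
cols = []
for states in product([0, 1], repeat=n):
    faults = states.count(0)
    p_fail = 1.0 if faults >= k else 0.0  # Eq. (4): fault iff >= k inputs fail
    cols.append([p_fail, 1.0 - p_fail])
values = [list(row) for row in zip(*cols)]  # shape (2, 2**n)
bn.add_cpds(TabularCPD("Gate", 2, values,
                       evidence=[f"C{i}" for i in range(n)],
                       evidence_card=[2] * n))
bn.check_model()
print(VariableElimination(bn).query(["Gate"]))  # marginal fault probability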

B. Transformation Overview
Algorithm 1 introduces the pseudocode to build the Bayesian network model based on the high-level service description. Here, the notation (x, y, z) ← S means that a structure, say S, provides its elements x, y, and z to the outer scope, which is known as pattern matching in the context of functional programming. First, we set up an empty Bayesian network with the node set X and edge set E. Afterward, we add our first node S, which is a binary random variable representing the availability of the service. At the end of the procedure, one can then infer the fault probability, or availability, of the service by computing the marginal P(S = F), or P(S = T), respectively. The definition of the conditional probability distribution of S follows in the procedures in line 7 or 9.
For any given service model S, we build the Bayesian network availability model of the fault dependency graph with the method CREATEFAULTGRAPH in line 5, in order to account for cascading and common-cause failures, and then include the concrete service type according to c. If c is true, we include the replicated service model with the method REPLICATEDSERVICE in line 7; otherwise, the procedure builds the redundant service model in line 9. The remainder of this section will introduce each of the three sub-procedures in detail.
Algorithm 1 Generating the service model.
1: procedure CREATESERVICEMODEL(S)
2:   (C, Q, G_FD, G_NET, D, P, G, c) ← S
3:   BN := (X, E), with X = ∅ and E = ∅
4:   X = X ∪ {S}
5:   CREATEFAULTGRAPH(BN, S)
6:   if c = true then
7:     REPLICATEDSERVICE(BN, S)
8:   else
9:     REDUNDANTSERVICE(BN, S)
10:  end if
11:  return BN
12: end procedure

C. Fault Dependency Graph
Given a system model S, the first step in the translation procedure is to build the Bayesian network representation of the fault dependency graph. Perhaps it is not apparent why the fault dependency graph forms the beginning. However, due to the cause-effect semantics of Bayesian networks, it is essential to start with root causes first and then successively attach the effects, which themselves are failure causes for other components. Hence, infrastructure failures form the initial causes of failures.
1) Failure Model of a Component: A component C ∈ C fails either because of an intrinsic failure or because of an external fault caused by its parent components. First, we define the general Bayesian network structure of a single component. This structure will then be used as a building block for the upcoming Bayesian network representation of the fault dependency graph. The procedure creates a binary random variable for every component in C with the states {F, T}, where each variable defines the probability of observing the eponymous component as faulty or available. The procedure applies to each component C the Bayesian network transformation of FT(C) according to [28], where the fault of C is the top event and C's parent components are the base events. For example, Figure 5 shows the Bayesian network representation of a component C that expresses its dependence on its parent components C_p1 to C_pN as a fault tree with one AND gate. Hence, the CPT uses the previously introduced AND model from Equation 2. A component C can also fail due to its intrinsic fault with probability q, which is part of C's CPT definition. The conditional probability distribution of C represents a noisy-AND model. Hence, the CPT of C from Figure 5 has the following definition:

P(C = F | C_p1, . . . , C_pN) = 1 if ∀j : C_pj = F, and q otherwise.    (5)
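A CPT following Equation 5 can be generated mechanically. The following helper is an illustrative sketch of the noisy-AND construction for a component whose fault tree is a single AND gate over its parents; the function name and signature are assumptions, not part of the presented algorithms.

from itertools import product
from pgmpy.factors.discrete import TabularCPD

def noisy_and_cpd(component, parents, q):
    """Eq. (5): the component fails with certainty if all parents
    failed (state 0 = F), and with its intrinsic probability q
    otherwise."""
    cols = []
    for states in product([0, 1], repeat=len(parents)):
        p_fail = 1.0 if all(s == 0 for s in states) else q
        cols.append([p_fail, 1.0 - p_fail])
    values = [list(row) for row in zip(*cols)]
    return TabularCPD(component, 2, values, evidence=parents,
                      evidence_card=[2] * len(parents))

# e.g., a host that certainly fails once all of its parent components
# have failed, and otherwise fails intrinsically with q = 0.01:
cpd_host = noisy_and_cpd("H1", ["Ra1", "PS1", "PS2"], q=0.01)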
2) Translating the Fault Dependency Graph: Algorithm 2 repeats the approach mentioned above for each component: it transforms a given fault dependency graph G_FD into a Bayesian network. First, the procedure creates a node for every component (line 3). Then, it creates their corresponding Bayesian network fault tree representations defined in FT(C) (line 7), using the building formalism introduced by Bobbio et al. [28], and connects the parent components as base events to the resulting structure (line 9). Finally, we also connect the node of the component that represents the top event (TE) with the corresponding component node (line 11). Afterward, it adds the node representations of the instances to the host nodes according to a predefined deployment D (line 15).

Algorithm 2 Translating the fault dependency graph.
1: procedure CREATEFAULTGRAPH(BN, S)
2:   (X, E) ← BN
3:   X = X ∪ C                      ▷ create a node for every component
4:   ▷ translate the local fault tree of every non-instance component
5:   ▷ and connect its parent components as base events
6:   for C ∈ C \ I do
7:     create the Bayesian network model of FT(C) according to [28]
8:     for C_pj ∈ pa(C) do
9:       E = E ∪ (C_pj, base event of C_pj in FT(C))
10:    end for
11:    E = E ∪ (TE(FT(C)), C)
12:    add CPT to C using P and Eq. 5,
13:      with q = P(C = F)
14:  end for
15:  for I_i ∈ I do
16:    E = E ∪ (D(I_i), I_i)
17:    add CPT to I_i using P and Eq. 5,
18:      with q = P(I_i = F)
19:  end for
20:  return BN
21: end procedure
Applying Algorithm 2 to the example S_Example leads to the preliminary Bayesian network shown in Figure 6. Here, without loss of generality and for the sake of readability, the AND fault relations between all infrastructure components are simply combined into one node with the noisy-AND model of the component. With this simplification, the Bayesian network corresponds in its shape to the fault dependency graph illustrated in Figure 3. Moreover, to visually assist the translation procedure, the nodes in Figure 6 are rearranged: all network components are on the left, and all hosts with their processes are on the right side (gray dashed box).

Fig. 6. Bayesian network infrastructure model of the data management system example.
3) Channel Model: In order to model service reachability in the presence of network partitioning failures, we need to discuss how to model the probability of observing communication failures with Bayesian networks. Instances and client applications communicate over channels, where a channel is realized as a route along the network graph. The goal of a channel is to assess the accessibility between two instances in the presence of possible network faults. From an availability perspective, when a route fails because some network component along the route has failed, the channel can be established along a different route if one still exists. Therefore, a channel is considered unavailable when all potential routes have failed. A channel subsumes the fault probability of observing all routes between the two endpoints as interrupted.
Figure 7 shows the Bayesian network structure that contains the node C_Ii−Ij, representing the probability of a communication failure between two instances I_i and I_j. For readability, this section refers to C_Ii−Ij simply as a channel node. A channel node is conditionally dependent on three nodes: an AND node and two nodes for the endpoints of the channel. The AND node represents the failure probability that no route exists, whereas the endpoint nodes represent the failure probability of the corresponding instances. The CPT of the channel node entails an OR model, defining the probability of observing a channel failure when one of the endpoints fails or no working route exists.
The nodes that define the failure of the endpoints, i.e., I_i and I_j, are the node representations of the service instances. However, they could also represent different failure causes that indirectly affect the channel, such as a client application, e.g., the host of the client, or a common endpoint of a second channel. The latter is essential for the replicated service model, in order to model inner-replica communication.
Finally, nodes R_1 to R_n define the failure probabilities of the routes. These route nodes use an OR model for their CPTs and are conditionally dependent on the network components N_1 to N_m that are part of the corresponding route in the network graph. This model also considers correlating route failures when routes share the same network components. For example, if N_i fails, routes R_1 and R_n are both interrupted. The same applies when multiple channels share the same routes.

Fig. 7. Bayesian network representation of a single communication channel.
Algorithm 3 formalizes the construction of a channel as a procedure. Necessary inputs are the source component C_src, the destination component C_dst, and a pair of Bayesian network nodes X_src and X_dst, which represent the failure causes of the channel's endpoints. As discussed briefly, the model distinguishes between the components for which it computes the routes and the parent nodes that provide the failure causes at the channel's endpoints. The node AND_src−dst indicates that the AND node belongs to the channel C_src−dst, in order to distinguish the AND nodes of multiple channels. First, the procedure computes all routes in the network graph at line 3. Afterward, lines 4 to 8 initialize the channel node with its parent nodes. Line 9 iterates over the list of routes and determines whether the route already exists as a node in the Bayesian network graph. If yes, the corresponding route node is directly added to the channel, as shown in line 17. If not, the procedure creates the new route node and connects it with its corresponding network components (lines 10 to 13). The remainder of the procedure finalizes the CPT of the channel node and returns it as a reference.
Algorithm 3 Routine to create the Bayesian network sub-graph for channels.
1: procedure CREATECHANNEL(BN, G_NET, C_src, C_dst, X_src, X_dst)
2:   (X, E) ← BN
3:   routes := compute all paths from C_src to C_dst in G_NET
4:   X = X ∪ {C_src−dst}
5:   X = X ∪ {AND_src−dst}
6:   E = E ∪ (AND_src−dst, C_src−dst)
7:   E = E ∪ (X_src, C_src−dst)
8:   E = E ∪ (X_dst, C_src−dst)
9:   for R in routes do
10:    if R ∉ X then
11:      X = X ∪ {R}
12:      for C ∈ R.components do
13:        E = E ∪ (C, R)
14:      end for
15:      add OR model to CPT of R
16:    end if
17:    E = E ∪ (R, AND_src−dst)
18:  end for
19:  add OR model to CPT of C_src−dst
20:  add AND model to CPT of AND_src−dst
21:  return C_src−dst
22: end procedure

Without a doubt, the number of routes can become intractably large. In this case, one might resort to simplifying the network graph. That can be done either by aggregating multiple network components, or by considering a limited number of routes, or both. However, while this simplification increases performance, it comes at the expense of model fidelity.
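For the route computation at line 3, a standard graph library suffices. The following sketch uses networkx on a toy network graph and bounds the path length, which is one pragmatic way to limit the number of routes, as suggested above; the topology and the chosen cutoff are illustrative, not part of the presented procedure.

import networkx as nx

# Toy network graph; in practice G_NET comes from the system description.
G_NET = nx.Graph()
G_NET.add_edges_from([("FW", "N1"), ("FW", "N2"), ("N1", "N2"),
                      ("N1", "H1"), ("N2", "H1")])

# All simple paths between the channel endpoints; 'cutoff' bounds the
# route length and thus caps the size of the route set.
routes = list(nx.all_simple_paths(G_NET, source="FW", target="H1", cutoff=3))
for route in routes:
    print(route)  # each route later becomes one OR node over its components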

D. Redundant Service Model
Given the channel model, we can build the model of a redundant service first. Successful communication exists when clients can directly access sufficient working instances. In real life, a client application will most likely try to connect to one instance, whereas the Bayesian network represents the probability of connecting to any of those instances. Due to the high user-load assumption, we need to account for the likelihood of observing sufficient working instances, even if only one instance is needed to handle a request.
Algorithm 4 describes how to extend the previously created Bayesian network model of the infrastructure with the redundant service model. We stated in the system model that a client application can access a service through one or more dedicated network components that act as entry points, i.e., gateways, in the network. Therefore, we introduce a new set of binary random variables K = {K_1, . . . , K_m}, with m = |G|, which represents the probability of accessing sufficient instances through the i-th entry point defined in G.
At line 10, the procedure creates the channel nodes from each entry point in the set G to each instance. The channel creation procedure takes as input the network component that acts as an entry point, the host of the instance as defined by the deployment, and the two nodes that represent the failure causes of the channel's endpoints. In a follow-up step (line 11), all channel nodes that are related to the i-th entry point component connect to one node K_i, which implements the reachability requirement of accessing sufficient instances from that entry point as part of its CPT. For example, if one instance is sufficient for a working service, then each K_i would implement an AND model at line 14, representing the fault probability that the i-th client cannot communicate with any instance at all. A detailed discussion on how to integrate general requirements for K_i at line 14 can be found at the end of this section.

Algorithm 4 Implementation of the redundant service model.
1: procedure REDUNDANTSERVICE(BN, S)
2:   (X, E) ← BN
3:   (C, Q, G_FD, G_NET, D, P, G, c) ← S
4:   X = X ∪ K, with K = {K_1, . . . , K_m} and m = |G|
5:   ▷ connect every entry point with every instance;
6:   ▷ the nodes in K implement the reachability requirement
7:   for G_i ∈ G do
8:     for I_j ∈ I do
9:       ▷ channel endpoints: entry point G_i and instance I_j
10:      C_{Gi−Ij} := CREATECHANNEL(BN, G_NET, G_i, D(I_j), G_i, I_j)
11:      E = E ∪ (C_{Gi−Ij}, K_i)
12:    end for
13:    E = E ∪ (K_i, S)
14:    add CPT of K_i according to Q
15:  end for
16:  add AND model to CPT of S
17: end procedure
Finally, Algorithm 4 finishes by introducing the final service node S. This node accounts for the probability that no client at any entry point has sufficient working channels to communicate with the instances. Hence, we can compute the probability of a service failure as the marginal P(S = F), or its availability as P(S = T), using Bayesian inference.
For instance, Figure 8 shows the Bayesian network model of the example service S_Example from Section III, assuming a redundant service. In this example, all clients communicate with the instances via the firewall (represented by node FW). There are three routes R_1 to R_3, which are shared by all seven channels, emphasized by the dashed box. Each channel is connected to the firewall node, representing the client. Since there is only one entry point, the set K = {K_1} contains one node. For example, assuming the service can tolerate three instance failures, node K_1 implements a four-out-of-seven model (see Equation 4).

Fig. 8. The Bayesian network of a redundant service example.

E. Replicated Service Model
For replicated services, we said that clients first send their request to one instance, which then communicates with the remaining instances. This communication pattern subsumes and implements the likelihood of accessing at least one instance that can communicate with sufficient remaining instances. Hence, we will show how to use this communication pattern to encode all possible states in which the instances can, or cannot, reach the desired number of votes, e.g., the quorum size, as defined by the fault tolerance model in Q. Consequently, the final Bayesian network will encode the probability of observing the service in a working state, given potential infrastructure and communication faults.
Algorithm 5 begins by modeling the communication channels between instances. It introduces again a set of binary random variables K = {K_1, . . . , K_n}, where n is the number of instances, which represent the failure probability of communicating with an insufficient number of instances when the i-th instance initiates the replication protocol. Hence, every K_i is a child node of n−1 channel nodes (lines 12 and 13), since the fault probability of instance R_i is already part of one of the endpoints of the channels. Next, the procedure builds a channel node for every entry point G_i to every instance R_j by using K_j as failure cause (line 21). Instead of directly addressing the failure probability of an instance, the model uses K_i to represent the instance R_i. In the case of a network partitioning, K_i contains the probability that R_i can still access sufficient processes in its partition.
Finally, node S accounts for the failure probability that no client can access the service through any gateway (line 25). Hence, one can now infer the fault probability, or availability, of the service by computing the marginal P(S = F), or P(S = T), respectively.
For example, Figure 9 shows the Bayesian network of the database service example, based on the assumption that client applications access the service via the firewall. The left box shows the channel nodes representing the fault probabilities for the communication between clients and service instances. The right box shows the channels of each instance to every other instance. A node K_i has as parent nodes the channel nodes of the i-th instance. Hence, to implement the majority set requirement, one can use a three-out-of-six model for K_i to encode the probability of observing at least three working channels, which implies that the i-th instance is also working.

Fig. 9. The Bayesian network of the indirect communication pattern for the database example.

Algorithm 5 Implementation of the replicated service model.
1: procedure REPLICATEDSERVICE(BN, S)
2:   (X, E) ← BN
3:   (C, Q, G_FD, G_NET, D, P, G, c) ← S
4:   X = X ∪ K, with K = {K_1, . . . , K_n}
5:   ▷ channels for the inner-replica communication
6:   for i = 1; i ≤ n; i++ do
7:     for j = i + 1; j ≤ n; j++ do
8:       C_{Ri−Rj} := CREATECHANNEL(BN, G_NET,
9:                      D(R_i), D(R_j), R_i, R_j)
10:      ▷ the channel is a failure cause for both K_i and K_j
11:      (X, E) ← BN
12:      E = E ∪ (C_{Ri−Rj}, K_i)
13:      E = E ∪ (C_{Ri−Rj}, K_j)
14:    end for
15:  end for
16:  add CPT for all K_i ∈ K according to Q
17:  for j = 1; j ≤ n; j++ do
18:    for G_i ∈ G do
19:      ▷ K_j acts as the failure cause of the instance endpoint
20:      C_{Gi−Rj} := CREATECHANNEL(BN, G_NET,
21:                     G_i, D(R_j), G_i, K_j)
22:      E = E ∪ (C_{Gi−Rj}, S)
23:    end for
24:  end for
25:  add AND model to CPT of S
26: end procedure
Next, we discuss in detail how to implement the CPTs of the nodes in K, as hinted at line 16.
1) Read-one/write-all: Read-one/write-all is a special case of replication, since every operation has its particular quorum requirements. We already gave a brief introduction to read-one/write-all in the last section. There, we discussed how to implement the service requirements for read quorums, Q_ro = 2^R \ {∅}, and for write quorums, Q_wa = {R}. Consequently, each operation needs its own Bayesian network model to assess its availability individually. Read-one can be modeled by using the redundant service model. Hence, the model uses an AND model for all CPTs of the nodes in K to account for the fault probability that no channel works. In contrast, for write-all, it depends on the system design: one can use either the redundant or the replicated service model. Both models use the OR model for the CPTs of the nodes in K, accounting for the fault probability that at least one channel to an instance is faulty.
2) K-out-of-N Voting: In voting-based replication, instances have one vote each to decide on an incoming operation request. The system is available when it can reach k out of n votes for some request, e.g., majority sets require k = ⌊n/2⌋ + 1 votes. For replicated services that use the indirect communication pattern, the i-th replica is part of the voting process, where it must acquire at least k − 1 votes from the remaining n − 1 replicas to consider the service as available. Thus, the CPT of K_i implements an (n−k+1)-out-of-(n−1) model, as defined in Equation 4, i.e., considering the inverse of how many channel failures can be tolerated.
For redundant services that use the direct communication pattern with n instances, where k instances are sufficient to signify that the service does not fail due to overload, the model implements the CPT of K_i using an (n−k+1)-out-of-n model. Thus, the system fails if more than n − k channels are faulty.
3) Weighted Voting: In weighted voting, individual replicas can have multiple votes. This forms the general case of the normal voting-based approach from above. To reach a potential quorum, the total number of votes of the working instances needs to exceed a given threshold t. As a result, this work extends the k-out-of-n model from Equation 4 to account for the individual vote counts of the replicas. We use the tuple notation Q = (V, t), where V = (v_1, . . . , v_n) are the instance votes and t is the threshold value. Given that K_i refers to the i-th instance, the model uses v_j to denote the number of votes of the instance at the opposing endpoint of the j-th channel for a given state combination c_{i−1}, . . . , c_{i−m} of the channel nodes connected to K_i. Here, since the i-th instance initiated the replication protocol, we automatically assume that its votes v_i contribute to the request. Hence, we reduce the threshold by its votes.
For every state combination c_{i−1}, . . . , c_{i−m}, the model builds the weighted sum of the votes of those channels that are available and checks whether the result reaches the reduced threshold:

P(K_i = F | c_{i−1}, . . . , c_{i−m}) = 1 if ∑_{j=1}^{m} v_j · 1_T(c_{i−j}) < t − v_i, and 0 otherwise,

where 1_T(x) = 1 if x = T, and 0 otherwise.
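The following sketch generates the CPT columns of K_i for the weighted case; the vote vector and threshold are illustrative.

from itertools import product

def weighted_voting_columns(votes, v_i, t):
    """CPT columns of K_i under weighted voting: channel state 0 = F,
    1 = T; 'votes' holds the votes v_j at the opposing endpoints of the
    channels connected to K_i, and the initiating instance's own votes
    v_i already count toward the threshold t."""
    cols = []
    for states in product([0, 1], repeat=len(votes)):
        reached = sum(v for v, s in zip(votes, states) if s == 1)
        p_fail = 0.0 if reached >= t - v_i else 1.0
        cols.append([p_fail, 1.0 - p_fail])  # rows: [P(F), P(T)]
    return [list(row) for row in zip(*cols)]

# e.g., replica i holds 2 votes, the three remaining replicas hold
# (1, 2, 1) votes, and a quorum requires t = 4 votes in total:
cols = weighted_voting_columns([1, 2, 1], v_i=2, t=4)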

F. Scalability
Bayesian networks are subject to exponential memory growth with regard to their CPTs [30]. The CPT of a node has to implement a conditional probability distribution for each state combination of its parent nodes. If the parent nodes are binary, then the number of CPT entries is O(2^n) for n parents. Hence, all CPTs of the nodes in K exhibit exponential memory growth in the number of instances. We have a similar situation for nodes that represent the availability of routes: those nodes implement an OR model over the multiple network components that constitute a route. Assuming a CPT entry is just a few bytes large, it is not hard to see that a node with 30 parent nodes will have a CPT of several gigabytes. Therefore, this Bayesian network approach is suitable only for services with up to about 30 instances and short network routes; beyond that, memory becomes the limiting factor.
However, this problem can be mitigated for the AND/OR and k-out-of-n models. Heckerman [25] provides an equivalent AND/OR model that reduces the space complexity to linear, while Bibartiu et al. [24] provide an equivalent (scalable) k-out-of-n model with polynomial complexity. Having these scalable models, we can substitute the existing AND/OR and k-out-of-n models in the Bayesian network with their scalable counterparts. Hence, we can overcome the memory limitations for redundant services and voting-based replication models for large services.
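As a sketch of the idea behind such scalable substitutes, the following fragment replaces one exponential k-out-of-n CPT by a chain of saturating counter nodes V_i = min(V_{i−1} + 1_F(C_i), k), so each CPT stays polynomial in size. This illustrates the decomposition principle only and is not necessarily the exact construction of [24] or [25].

from itertools import product

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

def k_out_of_n_chain(q, k):
    """Gate fails iff at least k of the n components fail, built with
    counter nodes of cardinality k+1 instead of one CPT of size 2^n."""
    n, card = len(q), k + 1
    bn = BayesianNetwork([(f"C{i}", f"V{i}") for i in range(n)] +
                         [(f"V{i-1}", f"V{i}") for i in range(1, n)] +
                         [(f"V{n-1}", "Gate")])
    for i, qi in enumerate(q):
        bn.add_cpds(TabularCPD(f"C{i}", 2, [[qi], [1 - qi]]))
    # V0 counts the first component (state 0 = F adds one failure).
    v0 = [[1.0 if s == min(1 - c, k) else 0.0 for c in (0, 1)]
          for s in range(card)]
    bn.add_cpds(TabularCPD("V0", card, v0, evidence=["C0"], evidence_card=[2]))
    for i in range(1, n):
        cols = []
        for v_prev, c in product(range(card), range(2)):
            nxt = min(v_prev + (1 if c == 0 else 0), k)
            cols.append([1.0 if s == nxt else 0.0 for s in range(card)])
        bn.add_cpds(TabularCPD(f"V{i}", card,
                               [list(row) for row in zip(*cols)],
                               evidence=[f"V{i-1}", f"C{i}"],
                               evidence_card=[card, 2]))
    # The gate fails (state 0) iff the final counter saturated at k.
    gate = [[1.0 if v == k else 0.0 for v in range(card)],
            [0.0 if v == k else 1.0 for v in range(card)]]
    bn.add_cpds(TabularCPD("Gate", 2, gate,
                           evidence=[f"V{n-1}"], evidence_card=[card]))
    return bn

bn = k_out_of_n_chain([0.01] * 10, k=3)  # P(at least 3 of 10 fail)
print(VariableElimination(bn).query(["Gate"]))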

V. EVALUATION
This section provides an in-depth analysis of the performance and modeling feasibility of the presented Bayesian network availability model. The evaluation analyzes the availability, build, and inference performance for redundant and replicated services for an increasing number of instances. All experiments were performed on a 64-bit machine with 64 Intel(R) Xeon(R) E7-4850 v4 CPUs at 2.10 GHz and 1 TB of main memory, running Arch Linux 5.13.12 with GCC 11.1.0, Python 3.9.6, pgmpy 0.1.7 (the Bayesian network modeling package), and NumPy 1.20.3. Bayesian network inference is performed with approximate and exact inference whenever possible. For approximate inference, we use the forward sampling method, and for exact inference, we use the Lauritzen-Spiegelhalter algorithm [31] from the gRain 1.3.2 package [32], [33]. Furthermore, we used in all experiments the scalable Bayesian network representations for AND/OR and voting gates by Heckerman [25] and Bibartiu et al. [24]. The implementation of the algorithms and evaluation methods for the presented Bayesian network model is available as open source.¹
Moreover, all experiments consider two different data center infrastructures. The first corresponds to the example used in Section III, which consists of 19 components; the evaluation refers to it as the small infrastructure. Consequently, the second is called the large infrastructure. It consists of three data centers with 40 hosts each, using a random topology of 20 network components to connect hosts and data centers. Moreover, each data center has 100 additional infrastructure components that influence the hosts and the network components. In total, the large infrastructure has 440 components. All components in the large infrastructure have an availability value sampled from a beta distribution with C ∼ Beta(10000, 1), resulting in an average downtime of 1 hour during a mission time of 10,000 hours. Without loss of generality, we require for both service types that a majority of instances is available for the service to be considered available. Other k-out-of-n schemes are also possible, but a different k changes only the content of the corresponding nodes and not the structure of the Bayesian network.
The plot in Figure 10 shows the expected availability of both service types for an increasing number of instances, using the small and large infrastructure, applying approximate and exact inference. Instances were placed in a round-robin fashion. We computed the availability for services with up to 300 instances using approximate inference. Exact inference was only possible for up to 27 instances for the redundant service experiments and up to 6 instances for the replicated service experiments, independently of the infrastructure size. Approximate inference might vary by nature with every execution, so we compared the results of the exact and approximate inference methods by repeating them 40 times to compute their confidence intervals. As a result, it can be stated with 95% confidence that there is no significant difference between the exact and approximate inference results here.

Fig. 10. The availability results of a service for an increasing number of instances, using approximate and exact inference.
The availability results of the redundant and the replicated service are similar. The availability decreases up until six instances for the small infrastructure experiments. This is mainly because all instances are placed in the first data center. The follow-up placements also consider the second data center in the small infrastructure for services with seven or more instances: the more instances, the fewer common-cause failures are shared. However, adding more instances does not lead to higher availability. The higher the distribution of instances, the higher the risk of communication failures, since more network components are involved. This limits the availability to a point where the influence of the shared infrastructure outweighs the benefits of replication. Even in the large infrastructure example, where we assume a low average downtime per component, the availability does not converge arbitrarily close to 1.

The plot in Figure 11 shows the mean inference time to compute the presented availabilities. Here we can observe the exponential time increase (a linear function in a semi-log plot) of the exact inference method, which contrasts with the polynomial time increase (a log function in a semi-log plot) of the approximate inference method. There are two main observations: first, the inference times of the redundant and replicated services have different polynomial complexities; second, the inference time converges independently of the infrastructure size. Clearly, due to the twenty-fold increase of components in the large infrastructure compared to the small infrastructure, the former is slower than the latter for small numbers of instances. However, the number of channel nodes increases with the number of instances; hence, the more instances, the more channel nodes. The number of channel nodes outgrows the number of infrastructure components until the channels become the dominating factor in the computation. The model of the redundant service has a linear increase of channel nodes, whereas the replicated service has a quadratic increase due to the indirect communication pattern.

Fig. 11. The inference time to compute the availability of a service with an increasing number of instances for the small and large infrastructure example, using approximate and exact inference.
Finally, Figure 12 shows the time required to build the Bayesian network. Clearly, the build time shows a significant difference between the large and small infrastructure examples for small numbers of instances (n less than 30) w.r.t. the service type. However, with increasing numbers of instances, the time difference diminishes. Afterwards, the sole factor that determines the build time is the service type; for large numbers of instances, the infrastructure has almost no significant influence on the build time anymore. Again, the number of channel nodes, which grows in proportion to the number of instances, outweighs the component nodes of the infrastructure.

Fig. 12. Comparing the time to build the Bayesian network model for an increasing number of instances.

VI. DISCUSSION
The evaluation demonstrated the feasibility of the Bayesian network approach to model large-scale redundant and replicated systems. Build and inference times are within manageable time frames for reliability engineers to make informed decisions about the service. Overall, for small service sizes with three to seven replicas, as commonly used for transaction-oriented database systems, the reliability engineer can even use exact inference to assess the availability and compute deterministic results. We discussed that the number of channels has the most influence on the build and inference time. Replicated services lead to a quadratic growth of channel nodes in the number of instances. Also, the build procedure needs to compute all possible routes that constitute a channel. Finding all possible routes in a graph can become a performance impediment, which is why we suggest considering only a subset of essential routes if performance is of higher priority. The largest model, with 300 instances, took about one hour to build. But once the model is built, inference can be performed arbitrarily often. Even updating individual beliefs of component failures can be done directly at the respective nodes if needed, without rebuilding the whole Bayesian network.
A particular modeling challenge is the potential lack of accurate availability data (failure probabilities). Acquiring accurate failure data is a non-trivial task for rare events, which require a large number of observations to conclude statistical significance. However, this issue can be addressed in several ways. First, many vendors already provide mean time to failure (MTTF) information for their software or hardware components. Secondly, cloud providers host large numbers of hardware components in their data centers, which are constantly monitored, providing significant amounts of data even for rare events [34]. Thirdly, for yet unobserved failures of highly available components, one can use rare event analysis (an active research area) in conjunction with expert knowledge acquisition to incorporate prior beliefs first and later refine the estimates with observations during mission time.

Moreover, our model does not consider the effects of long-running requests and the implications of component failures and recoveries during a longer execution time. This would require a dynamic Bayesian network approach [35], [36] to model the time dimension, bringing new challenges w.r.t. model assumptions, which might require additional implementation details of the particular replication protocol, increasing the model complexity. Therefore, we consider this challenge as future work.

VII. RELATED WORK
Modeling complex infrastructures is the subject of various areas of reliability engineering [13], [18], [37], [38]. Jammal et al. [14] provide a hierarchical infrastructure model for cloud services with Petri nets as an evaluation framework. They consider fault propagation within a hierarchical infrastructure model supporting redundant cloud services with a one-out-of-n fault tolerance semantic. However, they do not consider network communication.
Ghosh et al. [16] and Narayanan et al. [38] consider a k-out-of-n redundancy model for their instances; however, their infrastructure models only consider fault-independent compute nodes or data centers, respectively.
The Palladio Component Model [39], [40] provides a holistic modeling approach to evaluate the performance and availability of complex software systems, unifying hardware and software into one model. However, the Palladio availability model supports only a one-out-of-n redundancy model and cannot model quorum requirements.
There are several methods to evaluate the availability of a system, among which Bayesian networks have gained wide acceptance in industry and research [41]-[44].
Bobbio et al. [28], [45] demonstrated the applicability and superiority of Bayesian networks in modeling and evaluating equivalent fault trees [26]. Moreover, Boudali and Dugan [35], [36] showed how to use dynamic Bayesian networks to model dynamic fault trees as well, effectively proving that the Bayesian network formalism is powerful enough to cover all non-state-space models.
Bennacer et al. [15] use Bayesian networks for network diagnostics, introducing a case-based reasoning inference approach to increase diagnostic performance for large-scale Bayesian network models. While they only focus on network communication, they provide a tailored inference technique for efficient diagnostics of root causes, which can also be combined with our Bayesian network model when diagnostics are of interest.
Pitakrat et al. [46] use Bayesian networks for online failure prediction of microservice applications. The Bayesian network represents the interconnection between the microservice instances and updates the fault probabilities of the services based on online monitoring of performance metrics. They consider fault propagation between services; however, replication is not considered.
In summary, a Bayesian network modeling approach, covering a wide range of redundant and replicated services that also includes cascading and correlated faults caused by dependent infrastructure and network communication, was missing.

VIII. CONCLUSION
This work introduced a Bayesian network availability model for redundant and replicated services. The Bayesian network model unifies the fault aspects defined within a high-level model description of the service. The high-level model consists of three sub-models: a fault dependency graph to express the failure relations between components of the infrastructure and execution environment, a network model to address communication and network partitioning failures, and a model to define the fault-tolerance requirements of the service. We showed how to translate the high-level model into one Bayesian network to compute the expected availability. Finally, evaluations demonstrate the feasibility of the Bayesian network approach to represent and assess the availability of large-scale services with hundreds of fault influences and service instances.

IX. ACKNOWLEDGMENTS
This work was supported by the Robert Bosch GmbH.
