Causal chain event graphs for remedial maintenance

The analysis of system reliability has often benefited from graphical tools such as fault trees and Bayesian networks. In this article, instead of conventional graphical tools, we apply a probabilistic graphical model called the chain event graph (CEG) to represent the failures and processes of deterioration of a system. The CEG is derived from an event tree and can flexibly represent the unfolding of asymmetric processes. For this application we need to define a new class of formal intervention, which we call remedial, to model the causal effects of remedial maintenance. A remedial intervention fixes the root causes of a failure and returns the status of the system to as good as new. We demonstrate that the semantics of the CEG are rich enough to express this novel type of intervention. Furthermore, through bespoke causal algebras, the CEG provides a transparent framework with which to guide and express the rationale behind predictive inferences about the effects of different types of remedial intervention. A back-door theorem is adapted to apply to these interventions to help discover when a system is only partially observed.


Introduction
Conventional graphical tools in system reliability include fault trees (FTs) and Boolean decision diagrams (BDDs). An FT is a structured top-down logic diagram starting with the critical system event and decomposing successively into events whose composition or intersection can cause the top event [7]. A BDD can provide an equivalent graphical representation to an FT, where events are ordered in the same way as in an event tree. Neither of these diagrams provides an explicit partial (temporal) order of events, nor can standard statistical modelling methodologies for uncertainty handling and causal reasoning be seamlessly embedded in them. Early work [44,28,10] also criticised traditional reliability analysis, stating that the deficiency of these models lies in their limited ability to model the uncertain dependencies between failures in complex systems, and suggested instead the use of a Bayesian network (BN). Both BNs and chain event graphs (CEGs) enjoy the flexibility of embedding probabilistic knowledge, managing probability propagation and inference, and performing causal analysis. Thus, these two classes of models can be used to inform decision makers or engineers about the potential effects of new policies or actions, enabling the maintenance strategy to be optimised efficiently and effectively. This makes probabilistic graphical tools more appealing for reliability or risk analysis than traditional tools.
Despite the popularity of the BN framework for exploring causal relationships, many researchers [34,37,33] have argued that event tree based inference provides an even more flexible and expressive graph from which to explore causal relationships. Within artificial intelligence, methods based on probability trees are now widely used for various types of causal modelling in many different domains, including causal discovery, decision making and risk analysis [9,23,49,7].
Within the domain of reliability, where the focus of inference is on explaining and repairing failure incidents in a system, the use of trees and their derivative depictions such as CEGs [33,13,14] provides a complementary method to the use of fault trees. Here a collection of paths on these graphs is used to explain the unfolding of events that might have led to the fault. One advantage of the CEG is that, unlike the BN, the asymmetric unfoldings of the process can be directly represented by its topology, so that context-specific causal dependencies can be read from the tree [4,35,3]. This is extremely useful for explicitly encoding the failure processes and deteriorating processes of machines. [31] demonstrated and [34] has long argued that causal assumptions are often easily inferred from tree-like structures because these represent explicitly the hypothesised time orderings of events intrinsic to many causal conjectures. A causal analysis can be performed around the framework of the CEG in much the same way as for the BN. However, although the BN has been successfully applied to support the causal analysis of various problems in reliability [19], to our knowledge the more flexible framework of the CEG has yet to be applied to this domain. We show in this paper that the CEG is a much more expressive graphical representation than the BN for putative causes [3,15,41,14] and that it can embed the sorts of asymmetries met in reliability models, which have yet to be exploited.
Previous work [40,41] has proposed a generic method to translate Pearl's do-calculus [29,30] onto CEGs. The atomic intervention on the BN that forces a variable to take a single value can be simply imported into the CEG: a singular manipulation on a causal BN corresponds to forcing multiple edges along the equivalent causal CEG to take conditional probability one and the others zero. For such conventional types of manipulation it has been established when these and more nuanced interventions can be identified. In particular, [39,41] and [40] formulated the back-door theorem and the front-door theorem on the CEG, analogous to those Pearl [22,29,30] designed for BNs. These support predictive models of how certain natural events might trigger failures until such time that they are remedied.
Causal reasoning about interventions to study the reliability of a system can inform policy makers or engineers about the potential effects of new policies or actions, so that the maintenance strategy can be optimised efficiently and effectively. However, for models of system reliability, we may encounter complicated forms of causal mechanisms and unfamiliar types of intervention not usually studied in standard causal analyses when failure events automatically trigger remedial maintenance. For example, when such remedial acts involve the replacement of failed components, the components will behave as if starting from new rather than starting from a point of embedded usage, as they would in a more conventional causal intervention. The latter is what we would need to assume were we to use conventional algebras. But perhaps even more important is that the remedial interventions we study here are always designed to rectify a subset of the root causes of a failure incident. So interventions associated with remedial maintenance are in practice nearly always very specific types of non-atomic (not singular) interventions. We demonstrate that the manipulations associated with such interventions are unlikely to be represented within the class of vanilla interventions considered in causal BNs, which are far too symmetric. [45] gave a brief introduction to different types of remedies. Here, we formalise these ideas and provide a detailed methodology which imports the concept of a remedy and a root cause analysis into the CEG framework. In this way we are able to establish new causal algebras for the different remedial intervention regimes on CEGs. We show how to use CEGs to determine when, and if so how, we can measure probabilistically the effectiveness of remedies imposed by engineers on perceived faults.
The contributions of this article are three-fold. Firstly, we propose a novel approach for causal analysis which is applicable to reliability data. We formalise the concept of the stochastic manipulation and develop the mathematical formulae to import it into the CEG. Secondly, we show that the causal effects of stochastic manipulations can be identified on CEGs using the adapted back-door theorem. Both graphical criteria and proofs are given to support this new theorem. Thirdly, we define a new type of intervention, the remedial intervention, on the CEG for analysing system reliability. In particular, we emphasise useful causal concepts like "remedy" and "root cause" in reliability and translate them into algebras via CEGs to embellish the standard causal analysis.
In the next section we show how to construct a CEG for analysing system reliability. We then use this framework to formally define a remedial intervention in Section 3. In Section 4 we apply this definition to prove a number of results about whether or not the probabilistic effects of a given intervention are identifiable from information commonly available to an engineer. Section 5 uses a simulated dataset to demonstrate how to perform a causal analysis using the techniques proposed in the previous sections.

Constructing CEGs for reliability analysis
In this section, we will briefly review the elicitation process of CEGs from event trees and introduce new concepts for constructing a CEG for reliability data.

A review of CEGs
Consider a finite event tree $T = (V_T, E_T)$ with vertex set $V_T$ and edge set $E_T$ [35,39,14,36,24]. Let $e_{v,v'} \in E_T$ denote the directed edge emanating from the vertex $v$ and pointing to the vertex $v'$. For any vertex $v \in V_T$, denote the set of its parents by $pa(v) = \{v' \in V_T : e_{v',v} \in E_T\}$ and the set of its children by $ch(v) = \{v' \in V_T : e_{v,v'} \in E_T\}$. The vertex without parents in $T$ is called the root vertex of the tree, denoted by $v_0$. The vertices without children are the leaves of the tree, denoted by $L_T \subset V_T$. The non-leaf vertices are called situations, denoted by $S_T = V_T \setminus L_T$. A path starting from $v_0$ and terminating in $v \in L_T$ is called a root-to-leaf path on the tree. Let $\Lambda_T$ denote the collection of all the root-to-leaf paths of $T$.
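As an illustration of these definitions, the following minimal Python sketch stores an event tree as a list of directed edges and recovers parents, children, leaves, situations and root-to-leaf paths. The toy tree and all function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical toy tree: each edge is a (parent, child) pair, rooted at "v0".
EDGES = [("v0", "v1"), ("v0", "v2"), ("v1", "v3"), ("v1", "v4"), ("v2", "v5")]

def children(edges):
    """Map each vertex v to ch(v), the list of its children."""
    ch = defaultdict(list)
    for parent, child in edges:
        ch[parent].append(child)
    return ch

def parents(edges):
    """Map each vertex v to pa(v), the list of its parents."""
    pa = defaultdict(list)
    for parent, child in edges:
        pa[child].append(parent)
    return pa

def vertices(edges):
    return {v for edge in edges for v in edge}

def leaves(edges):
    """Vertices without children."""
    ch = children(edges)
    return {v for v in vertices(edges) if v not in ch}

def situations(edges):
    """Non-leaf vertices."""
    return vertices(edges) - leaves(edges)

def root_to_leaf_paths(edges, root="v0"):
    """Enumerate every root-to-leaf path as a list of vertices."""
    ch = children(edges)
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = ch.get(path[-1], [])
        if not kids:
            paths.append(path)
        for kid in kids:
            stack.append(path + [kid])
    return sorted(paths)
```

For the toy tree, `leaves(EDGES)` gives the three terminal vertices and `root_to_leaf_paths(EDGES)` lists one path per leaf, as expected for a tree.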
If each edge $e_{v,v'} \in E_T$ has an associated transition probability, denoted by $\theta_{v,v'} \in (0,1)$, then a probability tree with structure $T$ and set of probabilities $\theta_T = \{\theta_v\}_{v \in S_T}$ is well defined, where a probability vector $\theta_v = (\theta_{v,v'})_{v' \in ch(v)}$ is defined for each situation. The probability of traversing any path can then be evaluated. Let $\pi(\cdot)$ represent the path related probability on the tree, so that $\theta_{v,v'} = \pi(v' \mid v)$.
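The evaluation of a path probability is simply the product of the transition probabilities along the path. A minimal sketch, assuming the probability tree is stored as a nested dictionary (the numbers and names below are hypothetical):

```python
import math

# Hypothetical transition probabilities: THETA[v][v_child] = pi(v_child | v).
THETA = {
    "v0": {"v1": 0.6, "v2": 0.4},
    "v1": {"v3": 0.5, "v4": 0.5},
    "v2": {"v5": 0.25, "v6": 0.75},
}

def is_probability_tree(theta, tol=1e-9):
    """Check that the probability vector of every situation sums to one."""
    return all(abs(sum(vec.values()) - 1.0) <= tol for vec in theta.values())

def path_probability(theta, path):
    """Multiply the transition probabilities of consecutive edges on the path."""
    return math.prod(theta[v][w] for v, w in zip(path, path[1:]))
```

For instance, `path_probability(THETA, ["v0", "v1", "v3"])` multiplies 0.6 by 0.5, and the probabilities of the four root-to-leaf paths sum to one.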
A staged tree [14] is a coloured probability tree $(T, \theta_T)$ where different colours represent different stages. Two situations $v, v'$ are in the same stage if the probability distributions over their sets of emanating edges $E(v), E(v')$ are the same. Let $U_T = \{u_0, \cdots, u_n\}$ denote the set of stages of $T$. Given a staged tree, two situations $v, v'$ in the same stage are in the same position if and only if the subtrees $T(v)$ and $T(v')$ rooted at them are isomorphic, i.e. have the same sets of edges and vertices and the same colouring. Let $W_T = \{w_0, \cdots, w_m\}$ denote the set of positions. All the leaves in the staged tree belong to a sink position, denoted by $w_\infty$. A chain event graph (CEG) can then be constructed by merging the vertices that belong to the same position, the vertices in the sink position, and the corresponding edges that share the same colour. Let $C = (V_C, E_C)$ denote the graphical representation of the CEG. The vertex set is $V_C = W_T \cup \{w_\infty\}$. If $e_{v,v'}, e_{w,w'} \in E_T$ and $v, w$ are in the same position, then there exist corresponding edges $f, f' \in E_C$. If also $v', w'$ are in the same position, then $f = f'$. The positions and edges retain their corresponding colour from the staged tree. The probability $\theta_f \in \theta_C$ of an edge $f \in E_C$ in the new graph is the same as the transition probability of the corresponding edges in the staged tree. A CEG is then defined as $(C, \theta_C)$. Let $w_0 \in W_T$ denote the root node of the CEG. A path starting from $w_0$ and terminating in $w_\infty$ is a root-to-sink path. Let $\Lambda_C$ denote the collection of all the root-to-sink paths on $C$.
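The distinction between stages and positions can be sketched in a few lines of Python. This is a simplification made for illustration only: stages are detected here by comparing labelled probability vectors exactly, whereas in practice they would be elicited or learned, and the tree, labels and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical staged tree: TREE[v] lists (edge label, probability, child).
# Vertices that never appear as keys are leaves.
TREE = {
    "v0": [("a", 0.5, "v1"), ("b", 0.5, "v2")],
    "v1": [("x", 0.3, "v3"), ("y", 0.7, "l1")],
    "v2": [("x", 0.3, "v4"), ("y", 0.7, "l2")],
    "v3": [("fail", 0.9, "l3"), ("ok", 0.1, "l4")],
    "v4": [("fail", 0.4, "l5"), ("ok", 0.6, "l6")],
}

def stage_key(tree, v):
    """Same stage: identical labelled distribution over the emanating edges."""
    return tuple(sorted((label, prob) for label, prob, _ in tree[v]))

def position_key(tree, v):
    """Same position: identical labelled, coloured subtree rooted at v."""
    if v not in tree:  # leaf
        return ()
    return tuple(sorted((label, prob, position_key(tree, child))
                        for label, prob, child in tree[v]))

def group_by(tree, key):
    """Partition the situations of the tree by the given key function."""
    groups = defaultdict(list)
    for v in tree:
        groups[key(tree, v)].append(v)
    return sorted(sorted(group) for group in groups.values())
```

In this toy tree `v1` and `v2` share a stage because their edge distributions coincide, but they are not in the same position because the subtrees below `v3` and `v4` differ. Merging situations that share a position (and merging all leaves into a sink) would yield the vertex set of the CEG.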

A CEG for reliability analysis
An event tree $T$ is next constructed to represent how a system may fail. The sequence of events which would have occurred prior to maintenance is explicitly represented on each root-to-leaf path. We call the labels of events on edges d-events and denote the set of them by $X_T = \cup_{e \in E_T} x(e)$. Leaves represent the final status of the system and are labelled fail or not fail. We add restrictions to the definition of stages: $v$ and $v'$ are in the same stage if and only if $E(v)$ and $E(v')$ represent the same set of d-events and $\theta_v = \theta_{v'}$. Unusually for this application of CEGs it is useful to define two sink nodes for $C$: if $v \in L_T$ represents a failed condition, then $v \in w^f_\infty$; otherwise $v \in w^n_\infty$, which represents an operational but worn-out condition. We call these the failure and the working sink nodes respectively. Thus the vertex set of the CEG is $V_C = W_T \cup \{w^f_\infty, w^n_\infty\}$. Any path $\lambda \in \Lambda_C$ that ends in the sink $w^f_\infty$ is called a failure path and represents a possible pathway to failure. All other paths, those which terminate in $w^n_\infty$, are called deteriorating paths. Let $\Lambda^f_C$ and $\Lambda^n_C$ denote the sets of failure paths and deteriorating paths respectively, so that $\Lambda_C = \Lambda^f_C \cup \Lambda^n_C$. CEGs have the advantage over BNs of explicitly expressing within their topology the pathway to failure. More explicitly, the chronological development of the failure or deteriorating processes can be captured by the root-to-sink paths. We can order the d-events by beginning with root causes, followed by the cascading events initiated by the root causes, i.e. primary faults, secondary faults and so on, and ending with a failure or worn-out condition. Because a cause will always happen before an effect, the order of these cascading events, as expressed in the CEG, embodies the full causal story about what happens if a unit passes along one of its paths. In particular this enables us to infer the nature of a root cause. In this way we can use the topology of the CEG to examine the effects of a given remedial act in fixing the root cause of a failure.
To demonstrate how such a CEG analysis begins, consider an example of failures associated with bushing systems depicted in Figure 1b. Bushings are components in a transformer that provide insulation. We constructed an event tree according to the investigation report by [1]. The tree starts with a classification of the causes that may lead to system failures. This is followed by the potential causes and then the symptoms that might arise from these. The last component represented on the tree is the failure indicator. A staged tree next colours the vertices of $T$ to embed various context-specific conditional independences that an engineer might bring to the study. For example, here we assume that when the environment outside the machine impacts the system negatively, system failure is conditionally independent of the exact exogenous environment. This assumption places situations $v_7$ and $v_8$ in the same stage. The stages are those coloured in Figure 1a. In practice, we can construct an event tree for a system using domain knowledge and elicit the CEG from it based on expert judgement or appropriate assumptions. If data are available for this system, we can apply a structural learning algorithm [15,14] to find the topology of the staged tree that best describes the data and derive the CEG accordingly, just as we would if modelling with a BN.
We can now perform causal analysis on a CEG by extending Pearl's do-operation [30]. This is defined as the singular manipulation [39,41,40]. Thwaites and others [40,5] formalised a causal CEG as follows: under a singular manipulation at a position $w$ so that the event represented on $e_{w,w^*}$ is controlled, the CEG is a manipulated CEG with $\theta_{w,w^*} = 1$, $\theta_{w,w'} = 0$ for $w' \in ch(w)$, $w' \neq w^*$, and all other $\theta_{w''}$, $w'' \in W_T$, $w'' \neq w$, unchanged. Notice that because the different root-to-leaf paths of the underlying event tree are expressed explicitly within the CEG, it is possible to express explicitly within its topology a much wider range of interventions than would ever be possible using just a BN. It is this property which enables us to develop a transparent causal algebra which is particularly suited to support the study of the causes of failure in system reliability.
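The singular manipulation just described can be sketched as a small function acting on the transition probabilities of a CEG. The dictionary layout, position names and numbers are hypothetical; only the rule itself (one edge forced to probability one, its siblings to zero, everything else unchanged) comes from the text.

```python
def singular_manipulation(theta, w, w_star):
    """Return a copy of theta in which the edge w -> w_star has probability
    one, every other edge leaving w has probability zero, and the vectors of
    all other positions are left unchanged."""
    if w_star not in theta[w]:
        raise ValueError(f"{w_star!r} is not a child of {w!r}")
    manipulated = {pos: dict(vec) for pos, vec in theta.items()}
    manipulated[w] = {child: (1.0 if child == w_star else 0.0)
                      for child in theta[w]}
    return manipulated

# Hypothetical idle CEG with positions w0, w1, w2 and a single sink.
IDLE = {
    "w0": {"w1": 0.7, "w2": 0.3},
    "w1": {"w_inf": 1.0},
    "w2": {"w_inf": 1.0},
}
FORCED = singular_manipulation(IDLE, "w0", "w2")
```

The function copies the probability vectors rather than editing them in place, so the idle system remains available for comparison with the manipulated one.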

Causal algebras for the remedial intervention
The term "remedial work" is ubiquitous in many types of engineering reports which record the maintenance of defects or failures. Further, "remedy" is a more familiar term in reliability engineering than "treatment" [32], which is commonly adopted in medical science and has a subtly different meaning. A unit must have failed before a remedy is applied. The remedy aims to find and fix the root cause of the observed failure in order to prevent the same defect or failure reoccurring. In contrast, a treatment can be applied irrespective of the state of the unit. Furthermore, whilst a remedy could be a single act, it is often a combination of acts taken in sequence. So the application of causal analysis is rather different in system reliability than in medicine and public health, where the majority of causal analyses have traditionally been applied.
In light of the two essential concepts, i.e. the remedy and the root causes, we define a novel domain-specific intervention and call it the remedial intervention. This is a typical external intervention customised to different types of remedies. The inferential framework of the remedial intervention focuses on the discovery of the root causes of a fault and the identification of a sequence of actions that will provide a remedy to that fault. Analogously to root cause analysis, this process can be used to understand and prevent defects of a system by tracing and correcting the initial contributing factor of these defects. Here we develop bespoke causal algebras on CEGs where remedial maintenance takes centre stage. These new algebras extend the singular manipulation, which is now well established for CEGs [40,41].

Perfect, imperfect and uncertain remedial interventions
For a repairable system, there are three main categories of maintenance: perfect maintenance, imperfect maintenance and minimal maintenance [27,8]. If the status of the system after maintenance is the same as new and has the same failure rate, then the maintenance is perfect and the post-maintenance status is called as good as new (AGAN) [27,8,7]. If the status of the system after the maintenance returns to the working order just prior to failure, then the maintenance is minimal, and the post-maintenance status is called as bad as old (ABAO) [27,8]. If the status of the system after maintenance is somewhere between ABAO and AGAN, then the maintenance is classified as imperfect [27].
To reflect these standard categories of maintenance, we accordingly define three types of remedial intervention: the perfect remedial intervention, the imperfect remedial intervention, and the uncertain remedial intervention. Here we use the name "uncertain" instead of "minimal" because the causal algebras we develop later in this article concern quantifying the characteristics of the uncertainty associated with this type of intervention.
We first make the following two assumptions before formalising the concept of a remedial intervention.
ASSUMPTION 3.1. The idle CEG or the event tree is faithfully constructed with respect to the domain knowledge of a particular system so that every failure process or deteriorating process that may happen in this system can be identified on the tree and every root cause and symptom is well captured by the semantics of the tree.
ASSUMPTION 3.2. The system modelled by the CEG is repairable, and the AGAN status is reached when the root cause of the failure is completely fixed.
For illustrative purposes, we create a new graphical framework integrating a failure process with the process of maintenance in order to demonstrate the differences between the various types of remedial intervention; see Figure 2.
Here we simplify the root-to-sink path by labelling only the root cause and the symptom. Take a failure process as an example. The root vertex of this path represents an AGAN status while the sink vertex of this path represents a failed condition. We call the root vertex the AGAN vertex and the leaf vertex the fail vertex. The failure path is connected by the solid black edges. The recovery path is defined to be the directed dashed path starting from the fail vertex and ending in the AGAN vertex. It models the status change of the system caused by the maintenance. The black and red dashed edges are associated with observed and unobserved maintenance respectively. Recovery paths are external to the idle CEG. This is because they represent the effect of an external intervention on the status of the equipment, and such a recovery process is not part of the description of the original system before any intervention has taken place.
A remedial intervention is perfect if the root cause of the failure is correctly identified and successfully fixed by the observed maintenance so that the post-intervention status of the part being maintained is AGAN [45,47]. The recovery process is demonstrated in Figure 2a. The recovery path starts from the fail vertex and ends in the AGAN vertex, which means the observed maintenance returns the status of the failed part to full working order. Suppose the CEG in Figure 1b faithfully models the unmanipulated bushing system, and we observe a failed bushing whose failure was caused by a cracked insulator. Then an example of a perfect remedial intervention is that the engineer replaces the cracked insulator with a new one.
If the root cause is not remedied but only a subset of the secondary or intermediate faults are remedied, then after the intervention the status of the repaired component will not return to AGAN. However, it is better than ABAO. We call such an intervention an imperfect remedial intervention. We can visualise the status change of the maintained equipment in Figure 2b. The recovery path consists of a black dashed edge and a red dashed edge. The black dashed edge points from the fail vertex to an interior vertex of the failure path, which means the status of the equipment is improved but not AGAN after maintenance. In order to fully restore the system, additional maintenance is needed. If imperfect remedial work has been carried out at time $t$, then the maintenance log will record only that maintenance has happened. What is further needed to fully restore the system is unknown at that time. This brings uncertainty into this type of remedial intervention. The recovery process corresponding to the additional remedial work is represented by the red dashed edge, which points from the interior vertex to the AGAN vertex.
If the maintenance logs do not record what remedial maintenance was taken, then such an intervention is classified as an uncertain remedial intervention. Diagnostic information has not yet been made available, so the root cause of the failure cannot be determined. A follow-up check and maintenance will be carried out in order to restore the broken part. Therefore the recovery process of this type of intervention is unobserved and so uncertain. The corresponding recovery path is shown in Figure 2c.

Notation and definitions
To model remedial intervention, we first introduce some new variables.
Assume that the root causes of a specific defect or failure could be multiple and are well defined. Note that a remedial intervention is defined to allow multiple root causes to be corrected simultaneously. Such an intervention is of course always non-singular within the CEG representation.
Let $\mathcal{A}$ denote the state space of maintenance events. Let $A^O$ and $A^U$ be random variables taking values in $\mathcal{A}$, representing observed maintenance and uncertain maintenance respectively. Let $A = (A^O, A^U)$ denote the vector of all maintenance. The status of the maintained equipment is observable and is represented by a status indicator $\delta$ such that
$$\delta = \begin{cases} 1, & \text{if the status is AGAN after maintenance}, \\ 0, & \text{otherwise}. \end{cases}$$
When $\delta = 1$, the remedial intervention is perfect. In other words, the root causes are correctly identified and fixed by $A^O$, so the value of the vector of intervention indicators is known; denote this by $I_{A^O}$. When $\delta = 0$, the remedial intervention is imperfect or uncertain. The uncertainty arises from the unobserved additional maintenance $A^U$, and the root causes then need to be inferred. The probability of the intervention indicators in this case differs from that under $\delta = 1$, since the latter is associated with a perfect remedy with a degenerate probability distribution while the former is not. The actual path $\lambda^R$ can be inferred in the same way. Here we apply equation (3.4) for either the imperfect or the uncertain remedial intervention. This is because both types involve uncertainty in maintenance events, which is represented by the model $p(a^U \mid a^O, \lambda^O, \delta = 0)$. When the intervention is uncertain, $a^O = \emptyset$ and $a^U$ is informed by the partially observed failure process $\lambda^O$. In practice, we can specify a parametric model $p(a^U \mid a^O, \lambda^O, \delta = 0; \eta)$. Then $\eta^0_\lambda$ will denote the set of parameters defined over the set of observable maintenance and the root-to-sink paths for the model under the uncertain remedial intervention, while $\eta^1_\lambda$ will denote the set of parameters defined over the root-to-sink paths for the model under the imperfect remedial intervention. Thus, given the observed maintenance, we can infer the probabilities associated with the intervention indicator vector as in equation (3.5). Here $p(a^U \mid a^O, \lambda^O, \delta = 0)$ will be 0 or near 0 for rare events. Furthermore, under Assumption 3.1, for any system of interest, all possible root causes are represented by the tree. Under Assumption 3.2, equation (3.5) enables us to identify the corresponding root causes for any remedial intervention. Therefore, equation (3.5) provides a general form for modelling any remedial intervention.

The stochastic manipulation
Engineers address root causes to prevent the fault or failure caused by these root causes reoccurring. So it is natural to assume that the distribution over root causes is affected by the remedial intervention. We then import this idea to the idle CEG. Denote the set of positions whose emanating edges are labelled by root causes as $w^*$. A stochastic manipulation of the CEG at $w^*$ then satisfies the following conditions:
1. for each $w \in w^*$, there is a well-defined map $\Gamma$ updating the transition probabilities vector, $\hat\theta_w = \Gamma(\theta_w)$, where $\hat\theta_w = (\hat\theta_{w,w'})_{w' \in ch(w)}$ denotes the post-intervention transition probabilities vector;
2. the new transition probabilities vector $\hat\theta_w$ satisfies $\hat\theta_w \neq \theta_w$, $\sum_{w' \in ch(w)} \hat\theta_{w,w'} = 1$ and $\hat\theta_{w,w'} \in (0,1)$ for $w' \in ch(w)$;
3. for a position $w \in W_{\Lambda(w^*)} \setminus w^*$, i.e. a position that lies on any of the paths passing through $w^*$ and is not an intervened position, the corresponding transition probabilities vector remains the same as the pre-intervention version, $\hat\theta_w = \theta_w$, where $\Lambda(w^*) = \cup_{w \in w^*} \Lambda(w)$ denotes the intervened paths;
4. for a position $w' \in ch(pa(w^*)) \setminus w^*$, i.e. one that shares the same parent as $w^*$ but which is not an intervened position, $\hat\theta_{pa(w'),w'} = 0$.
The simplest scenario for the manipulated transition probabilities $\hat\theta_{w^*}$ is when their values are known. But these values may not necessarily be available, in which case we have a more complex scenario where we are required to learn these values from an inferential framework.
We begin with the simplest scenario, when $\hat\theta_{w^*}$ are known. In this case, the post-intervention path related probabilities can be evaluated. Let $\hat{C}$ denote the manipulated CEG and $\hat\pi$ the associated path related probabilities. An example of a manipulated CEG is given in Figure 3. This corresponds to the idle CEG in Figure 1b for the bushing system and an imperfect remedial intervention that did not restore the status of the machine to AGAN. Suppose defects in the gasket or porcelain lead to the failure. Then the intervened position is $w^* = \{w_1\}$ and we can explore the effect of the remedial intervention by stochastically manipulating the distribution over the floret $F(w_1)$. The composition of stages may change when transforming from $C$ to $\hat{C}$. The position $w_8$ in the idle system contains a single stage $u_7$, which consists of the six situations $\{v_7, v_8, v_{13}, v_{14}, v_{15}, v_{16}\}$. In the manipulated CEG, by contrast, vertices $v_7, v_8$ are not traversed by any path in $\Lambda(v_1)$ in the event tree, where $w_1 = \{u_1\} = \{v_1\}$. The root floret $F(w_0)$ is associated with the root cause classifier. Here the manipulated CEG is conditioned on $\Lambda(w_1)$, so $\hat\pi_{\Lambda(w_1)}(w_1 \mid w_0) = 1$ and we are concerned only with the endogenous causes. In fact, exogenous root causes, such as lightning, are difficult to remedy.
The manipulated CEG is associated with an intervened model and expresses what might happen had some variables been controlled under some hypothetical intervention. It allows us to identify the effect of some form of control, e.g. fixing a root cause, from the observed data and to interpret it causally.
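A stochastic manipulation with known post-intervention probabilities can be sketched as follows. The checks mirror the requirements that the new vector differs from the old one, sums to one and has entries strictly between zero and one. One step is an assumption made for this sketch rather than something stated in the text: after edges into non-intervened siblings are set to zero, the parent's remaining probability mass is renormalised over the intervened children, which reproduces the conditioning on the intervened paths in the bushing example. Position names and numbers are hypothetical.

```python
def stochastic_manipulation(theta, new_vectors, tol=1e-9):
    """theta[w][child] holds idle transition probabilities; new_vectors[w]
    holds the known post-intervention vector for each intervened position."""
    intervened = set(new_vectors)
    manipulated = {w: dict(vec) for w, vec in theta.items()}
    for w, new in new_vectors.items():
        old = theta[w]
        if set(new) != set(old):
            raise ValueError("new vector must cover the same edges")
        if abs(sum(new.values()) - 1.0) > tol:
            raise ValueError("new vector must sum to one")
        if not all(0.0 < p < 1.0 for p in new.values()):
            raise ValueError("entries must lie strictly between 0 and 1")
        if all(abs(new[c] - old[c]) <= tol for c in old):
            raise ValueError("new vector must differ from the idle one")
        manipulated[w] = dict(new)
    # Edges into siblings of intervened positions get probability zero.
    # ASSUMPTION: the parent's remaining mass is renormalised over the
    # intervened children (conditioning on the intervened paths).
    for parent, vec in theta.items():
        if parent in intervened or not intervened & set(vec):
            continue
        mass = sum(p for child, p in vec.items() if child in intervened)
        manipulated[parent] = {child: (p / mass if child in intervened else 0.0)
                               for child, p in vec.items()}
    return manipulated

# Hypothetical idle CEG fragment: w1 carries the endogenous root causes.
IDLE = {
    "w0": {"w1": 0.7, "w2": 0.3},
    "w1": {"w3": 0.6, "w4": 0.4},
    "w2": {"w3": 0.5, "w4": 0.5},
}
MANIPULATED = stochastic_manipulation(IDLE, {"w1": {"w3": 0.2, "w4": 0.8}})
```

Positions that are neither intervened nor parents of an intervened position, such as `w2` here, keep their idle probability vectors.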

An inferential framework using the causal algebras
Of course in practice we need to estimate the parameters appearing in the formulae above before proceeding to identify the causal effects, which will be discussed in the next section. However, from a Bayesian perspective this is actually straightforward. [4] and [14] have established a conjugate analysis on the non-causal CEG which translates seamlessly into this new causal setting. Here we only give an example of how Bayesian predictive inference can be performed.
Let $f(\theta_{j,w} \mid \alpha_w)$ denote the prior distribution of $\theta_{j,w}$, which is the transition probability vector of position $w$ for individual $j$. Let $\chi_{j,e}$ indicate whether the d-event on edge $e$ is observed for individual $j$. The parameter of the prior is the vector $\alpha_w$. Let $\gamma_{a,\lambda}$ denote the parameter of the probability distribution of $I_{j,E_\Delta} \mid A^O_j, \lambda^O_j$, where the subscript $a$ denotes the maintenance and $\lambda$ the paths. Let $f(\gamma_{a,\lambda} \mid \beta)$ denote the prior of $\gamma_{a,\lambda}$ with parameter vector $\beta$. The posterior is then formed given the observations, which we denote by $O$. We can sample $\gamma_{a,\lambda}$ from the posterior and simulate the root causes by simulating $I_{E_\Delta}$. We then sample $\theta$ from the posterior and obtain the post-intervention probabilities through the transformation $\Gamma(\{\theta_w\}_{w: I_{w_\Delta} = 1})$. Finally, we can find the predictive distribution over the paths by simulating the paths from the post-intervention transition probabilities. [45] and [46] gave examples of implementing Bayesian inference with the customised causal algebras on CEGs for reliability analysis.
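The sampling scheme can be sketched for a single intervened position using only the Python standard library. This is a simplified illustration under several assumptions: the prior is Dirichlet (so the update is the usual conjugate addition of counts), the remedied root cause is treated as known rather than simulated from $\gamma_{a,\lambda}$, and `gamma_map` is a hypothetical stand-in for the map $\Gamma$, whose true form depends on the application. All numbers are invented.

```python
import random

def posterior_alpha(prior_alpha, counts):
    """Conjugate Dirichlet update: add observed edge counts to the prior."""
    return {edge: a + counts.get(edge, 0) for edge, a in prior_alpha.items()}

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalised gammas."""
    draws = {edge: rng.gammavariate(a, 1.0) for edge, a in alpha.items()}
    total = sum(draws.values())
    return {edge: g / total for edge, g in draws.items()}

def gamma_map(theta_w, remedied, shrink=0.5):
    """Hypothetical stand-in for Gamma: scale the remedied root cause's
    probability by `shrink` and renormalise the vector."""
    scaled = {e: (p * shrink if e == remedied else p) for e, p in theta_w.items()}
    total = sum(scaled.values())
    return {e: p / total for e, p in scaled.items()}

def predictive_failure(prior_alpha, counts, fail_given_cause, remedied,
                       n_draws=2000, seed=0):
    """Monte Carlo estimate of the post-intervention failure probability:
    sample theta, apply gamma_map, then simulate one path per draw."""
    rng = random.Random(seed)
    post = posterior_alpha(prior_alpha, counts)
    failures = 0
    for _ in range(n_draws):
        theta_hat = gamma_map(sample_dirichlet(post, rng), remedied)
        causes = list(theta_hat)
        cause = rng.choices(causes, weights=[theta_hat[c] for c in causes])[0]
        failures += rng.random() < fail_given_cause[cause]
    return failures / n_draws

# Hypothetical root causes, counts and failure probabilities.
PRIOR = {"cracked insulator": 1.0, "gasket defect": 1.0}
COUNTS = {"cracked insulator": 12, "gasket defect": 5}
FAIL = {"cracked insulator": 0.8, "gasket defect": 0.3}
estimate = predictive_failure(PRIOR, COUNTS, FAIL, remedied="cracked insulator")
```

A fuller implementation would also place a prior on the failure probabilities and sample the intervention indicators; the sketch keeps those fixed to show the order of the sampling steps.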

Causal identifiability of a remedial intervention on the CEG

The expression of the causal query
For a variable $X$ represented on a causal BN, the do-operation [30] that forces $X$ to take the value $x$ is denoted by $do(X = x)$. If we are interested in the effect of this intervention on another variable $Y$, then the causal query to be estimated is $p(y \mid do(x))$. This do-operator corresponds to the singular manipulation on the CEG that forces $\Lambda_x$ to be traversed, where $\Lambda_x$ is the collection of root-to-sink paths passing along the edges labelled by $x$. Analogously to $p(y \mid do(x))$, here we identify the effect of forcing $x$ to happen on another event $y$ by estimating the probability $\pi(\Lambda_y \| \Lambda_x)$. The notation $\|$ plays a similar role to the do-operator, imposing an intervention onto the tree [41]. So notationally $\pi(\Lambda_y \| \Lambda_x) = \pi(\Lambda_y \mid do(\Lambda_x))$.
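To make the difference between forcing and observing $\Lambda_x$ concrete, the following sketch evaluates both quantities on a small hypothetical CEG in which an earlier event influences both $x$ and failure. The graph, labels and numbers are invented for illustration, and a single sink is used for brevity.

```python
# Hypothetical CEG: CEG[w] lists (edge label, probability, child position).
CEG = {
    "w0": [("u=0", 0.5, "w1"), ("u=1", 0.5, "w2")],
    "w1": [("x=1", 0.2, "w3"), ("x=0", 0.8, "w4")],
    "w2": [("x=1", 0.9, "w5"), ("x=0", 0.1, "w6")],
    "w3": [("fail", 0.3, "w_inf"), ("ok", 0.7, "w_inf")],
    "w4": [("fail", 0.1, "w_inf"), ("ok", 0.9, "w_inf")],
    "w5": [("fail", 0.8, "w_inf"), ("ok", 0.2, "w_inf")],
    "w6": [("fail", 0.6, "w_inf"), ("ok", 0.4, "w_inf")],
}

def paths(ceg, root="w0", sink="w_inf"):
    """Yield (edge labels, probability) for every root-to-sink path."""
    stack = [(root, (), 1.0)]
    while stack:
        w, labels, prob = stack.pop()
        if w == sink:
            yield labels, prob
            continue
        for label, p, child in ceg[w]:
            stack.append((child, labels + (label,), prob * p))

def prob_of(ceg, label):
    """Total probability of the paths passing along an edge with this label."""
    return sum(p for labels, p in paths(ceg) if label in labels)

def force(ceg, x):
    """Singular manipulation: wherever an edge labelled x leaves a position,
    give it probability one and its sibling edges probability zero."""
    forced = {}
    for w, edges in ceg.items():
        if any(label == x for label, _, _ in edges):
            forced[w] = [(label, 1.0 if label == x else 0.0, child)
                         for label, _, child in edges]
        else:
            forced[w] = list(edges)
    return forced

def causal_query(ceg, y, x):
    """pi(Lambda_y || Lambda_x): probability of y after forcing x."""
    return prob_of(force(ceg, x), y)

def conditional(ceg, y, x):
    """pi(Lambda_y | Lambda_x): probability of y given x was observed."""
    joint = sum(p for labels, p in paths(ceg) if x in labels and y in labels)
    return joint / prob_of(ceg, x)
```

Here `causal_query(CEG, "fail", "x=1")` is about 0.55 while `conditional(CEG, "fail", "x=1")` is about 0.71; the gap arises because the upstream event `u` affects both the chance of `x=1` and the chance of failure.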
We have explained that a remedial intervention imposes a stochastic manipulation on the probability distributions $\theta_{w^*}$, given the intervened positions $w^*$, and that the post-intervention transition probabilities assigned to the florets $F(w^*)$ are $\hat\theta_{w^*}$. Given $\hat\theta_{w^*}$, the causal query of a remedial intervention is $\pi(\Lambda_y \| \hat\theta_{w^*})$. Given an intervention $a^O$, we are interested in $\pi(\Lambda_y \mid do(a^O))$. However, $a^O$ is external to the system represented by the CEG. To identify this causal quantity, we transform the intervention onto the CEG using the formulation explained in the previous section. Recall that the intervened positions are identified through the map $\rho$, the transition probabilities are updated through the map $\Gamma$, and the manipulated CEG is then obtained through $\xi$. Thus, when the remedial intervention is perfect, i.e. $\delta = 1$, the causal query is expressed through these maps with the intervention indicators known. When the remedial intervention is imperfect or uncertain, i.e. $\delta = 0$, we estimate the causal effect with the post-intervention transition probabilities $\hat\theta_{w^*}$ obtained from observations via $\Gamma \circ \rho$. For either type of remedial intervention, we need to identify $\pi(\Lambda_y \| \Gamma(\{\theta_w\}_{w: I_{e_{w,w'}} = 1}))$. In this section, we focus only on the quantity $\pi(\Lambda_y \| \hat\theta_{w^*})$ with a known $\hat\theta_{w^*} = \Gamma(\{\theta_w\}_{w: I_{e_{w,w'}} = 1})$. Note that given the idle CEG we can construct the manipulated CEG with $\hat\theta_{w^*}$. Based on this knowledge, we next show that the effect of the stochastic manipulation, $\pi(\Lambda_y \| \hat\theta_{w^*})$, is identifiable given the idle CEG and the observable information.

Causal effect identifiability of stochastic manipulations
A fine cut [43] is defined to be a set of vertices $W' \subset W_T$ such that $\cup_{w \in W'} \Lambda(w) = \Lambda_C$. The intervened positions under a remedial intervention are not necessarily a fine cut. If $w^*$ is a fine cut of $C$, then the intervened paths are the set of all root-to-sink paths on the CEG, i.e. $\Lambda(w^*) = \Lambda_C$. When the manipulations are asymmetric, or the processes modelled on the idle CEG are asymmetric, $w^*$ might not be a fine cut. In that case a CEG conditional on the intervened paths $\Lambda(w^*)$ can be constructed. Such a conditioned CEG has structure $C^{\Lambda(w^*)} = (V^*, E^*)$, where $V^* = W^{\Lambda(w^*)} \cup \{w_\infty\}$ and $E^* = E^{\Lambda(w^*)}$. The transition probabilities are $\theta^* = \{\theta^*_w\}_{w \in W^{\Lambda(w^*)}}$, where $\theta^*_{w_i,w_j} = \pi^{\Lambda(w^*)}(w_j \mid w_i)$, which is evaluated as:

So the conditioned CEG differs from the manipulated CEG in that it inherits the pre-intervention conditional probabilities.
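Conditioning a CEG on the paths through the intervened positions can be sketched numerically. The code below is an illustration under stated assumptions rather than the paper's own formula: it uses ordinary conditioning, taking the conditioned transition probability of an edge to be the probability of the retained paths using that edge divided by the probability of the retained paths through its parent position. The CEG, its numbers and the function names are hypothetical, and at most one edge is assumed between any pair of positions.

```python
from collections import defaultdict

def root_to_sink_paths(ceg, root):
    """Yield (edges, probability) for every root-to-sink path of a CEG given as
    {position: {child: transition probability}}; sinks have no entry."""
    stack = [(root, (), 1.0)]
    while stack:
        w, edges, p = stack.pop()
        if w not in ceg:
            yield edges, p
            continue
        for child, q in ceg[w].items():
            stack.append((child, edges + ((w, child),), p * q))

def conditioned_transitions(ceg, root, w_star):
    """Transition probabilities after conditioning on the paths through w_star."""
    through_edge = defaultdict(float)
    through_pos = defaultdict(float)
    for edges, p in root_to_sink_paths(ceg, root):
        if not any(w in w_star for w, _ in edges):
            continue                      # path avoids every intervened position
        for w, child in edges:
            through_edge[(w, child)] += p
            through_pos[w] += p
    return {e: through_edge[e] / through_pos[e[0]]
            for e in through_edge if through_pos[e[0]] > 0}

# Hypothetical idle CEG in which w1 is the only intervened position.
idle = {
    "w0": {"w1": 0.7, "w2": 0.3},
    "w1": {"w3": 0.5, "w4": 0.5},
    "w2": {"w4": 1.0},
    "w3": {"w_inf": 1.0},
    "w4": {"w_inf": 1.0},
}
cond = conditioned_transitions(idle, "w0", {"w1"})
```

Here the edge from `w0` to `w2` disappears, the edge from `w0` to `w1` receives probability 1, and the floret at `w1` keeps its pre-intervention probabilities, which matches the remark that the conditioned CEG inherits them.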
Let the intervened d-events be $x(E(w^*)) = \cup_{w \in w^*} \cup_{e \in E(w)} x(e)$, which are the labels of the edges emanating from the intervened positions. For remedial interventions, they are a subset of the root causes.
Given $x(E(w^*))$ and $\hat{\theta}_{w^*}$, estimating $\pi(\Lambda_y \,\|\, \hat{\theta}_{w^*})$ from $(C^{\Lambda(w^*)}, \theta^*)$ is equivalent to manipulating each d-event $x \in x(E(w^*))$ with probability $\pi(\Lambda_x \,\|\, \hat{\theta}_{w^*})$. Estimating the causal query in this way is standard for causal algebras. For example, [30] suggested estimating the causal effects of a stochastic policy from an unmanipulated causal BN in a similar way. Following this idea, we can formulate the causal query as follows.

Definition 4.1 (Causal effect of a stochastic manipulation). Given the intervened d-events $x(E(w^*))$ and the conditioned CEG $(C^{\Lambda(w^*)}, \theta^*)$, the causal effect of the stochastic manipulation of $\hat{\theta}_{w^*}$ on the d-event $y$ is a function from the stochastically manipulated probability vectors $\hat{\theta}_{w^*}$ to the space of path-related probability distributions on $\Lambda_y$, which can be expressed as:

where

Given this definition, the causal effect of such a stochastic manipulation is identifiable if and only if $\pi^{\Lambda(w^*)}(\Lambda_y \,\|\, \Lambda_x)$ can be uniquely estimated for every $x \in x(E(w^*))$ given the CEG and the observations. Recall that the intervened positions $w^*$ may not form a fine cut. So $\sum_{x \in x(E(w^*))} \pi(\Lambda_x)$ may not be equal to 1 unless we condition on $\Lambda(w^*)$.
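One reading of "manipulating each d-event $x$ with probability $\pi(\Lambda_x \,\|\, \hat{\theta}_{w^*})$" is a mixture of singular-manipulation effects, weighted by the post-intervention probabilities. The sketch below encodes that reading only; the mixture form, the function name `stochastic_effect` and all numbers are assumptions made for illustration, not a statement of Definition 4.1.

```python
import math

def stochastic_effect(weights, singular_effects):
    """Mix singular-manipulation effects by post-intervention probabilities.

    weights[x]          : assumed probability of d-event x after the manipulation,
                          normalised over the intervened d-events
    singular_effects[x] : assumed effect on y of forcing x, computed on the
                          conditioned CEG
    """
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1 once conditioned on the intervened paths")
    return sum(weights[x] * singular_effects[x] for x in weights)

# Hypothetical values for two intervened root causes.
effect = stochastic_effect(
    {"root_cause_1": 0.3, "root_cause_2": 0.7},
    {"root_cause_1": 0.5, "root_cause_2": 0.2},
)
```

The normalisation check reflects the remark above: the probabilities of the intervened d-events only sum to 1 after conditioning on $\Lambda(w^*)$.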
No restriction is imposed on the intervened d-events. Under remedial interventions, however, these are normally root causes. [47] also discussed another type of intervention in reliability, in which the intervened d-events can be a component, a typical symptom and so on, depending on the maintenance scheduled and conducted by the engineers. Definition 4.1 formulates a causal query from a generic stochastic manipulation and is not restricted to remedial interventions.
Next we show the causal identifiability of the stochastic manipulation.

Proposition 4.2. Suppose a remedial intervention imposes a stochastic manipulation on the distributions of $\mathcal{F}(w^*)$. Then, given the post-intervention transition probabilities $\hat{\theta}_{w^*}$, the effects of this intervention are identifiable if and only if $\pi^{\Lambda(w^*)}(\Lambda_y \,\|\, \Lambda_x)$ can be uniquely estimated for every $x \in x(E(w^*))$ given the CEG and the observations.
Therefore, the identifiability of $\pi^{\Lambda(w^*)}(\Lambda_y \,\|\, \Lambda_x)$ is a necessary and sufficient condition for the identifiability of $\pi(\Lambda_y \,\|\, \hat{\theta}_{w^*})$. [41] proved the identifiability of a singular manipulation on the CEG by adapting the back-door theorem and the front-door theorem, which are graphical tests designed by [30] to examine the identifiability of the causal effects of an atomic intervention on a causal BN. For simplicity, we focus only on extending the back-door theorem to prove causal identifiability. The idea is to find a partition of the root-to-sink paths $\Lambda_C$, denoted by $\{\Lambda_z\}$, so that the following criteria are satisfied.

Theorem 4.3. For any $e_{w,w'} \in e(x)$, if

hold for every element of $\{\Lambda_z\}$, then $\{\Lambda_z\}$ is the back-door partition [41].
For a stochastic manipulation, as we have explained, we wish to estimate $\pi^{\Lambda(w^*)}(\Lambda_y \,\|\, \Lambda_x)$ from $(C^{\Lambda(w^*)}, \theta^*)$. So we can simply adapt the above two criteria as follows.

Theorem 4.4. For any $w \in w^*$, for all $x \in x(E(w^*))$ and any $e_{w,w'} \in e(x)$, if

hold for every element of $\{\Lambda_z\}$, then $\{\Lambda_z\}$ is the back-door partition for identifying the effects of a remedial intervention.
Note that $w^*$ forms a fine cut of $\Lambda(w^*)$, and $ch(w^*)$ also forms a fine cut of $\Lambda(w^*)$. The set of paths $\{\Lambda_z\}$ is a partition of $\Lambda(w^*)$. Next we formalise the back-door theorem for identifying the causal effects of a remedial intervention.

Theorem 4.5. The effects of a stochastic manipulation are identifiable whenever a back-door partition $\{\Lambda_z\}$ can be found so that

This holds when

here $\hat{\Lambda}_x$ denotes that there is a singular intervention on $\Lambda_x$ only, not on $\Lambda_z$, and

This follows from the two criteria in Theorem 4.4 by adapting the proof of the back-door theorem for the singular intervention [41]. We prove this theorem in the supplementary material. We allow flexibility in choosing $z$: it can be a set of d-events, or a set of stages, positions or edges. Next, we give an example to show how to find an appropriate back-door partition for a remedial intervention.
We can now illustrate how the formulae work for the bushing example.
Example. Recall that we let the intervened position be $w_1$, and the manipulated CEG is shown in Figure 3. The intervened paths are $\Lambda_{x(e^1_{w_1,w_3})} \cup \Lambda_{x(e^2_{w_1,w_3})} \cup \Lambda_{x(e_{w_1,w_4})} \cup \Lambda_{x(e_{w_1,w_5})}$. Suppose we are interested in how the maintenance will affect system failure. Then $\Lambda_y = \Lambda_{\text{fail}}$.
Figure 4: A hypothesised CEG of a conservator system for the example in Section 4.2.
Suppose the post-intervention probabilities $\hat{\theta}_{w_1}$ are known. A stochastic manipulation on the distribution over $\mathcal{F}(w_1)$ can be treated as a singular manipulation on $e_{w_1,w'}$ with probability $\hat{\theta}_{w_1,w'}$, where $w' \in ch(w_1)$. We can validate Theorem 4.4 for all $x \in x(E(w_1))$. To check the first criterion,

We can also compute:

$\cdots = \pi^{\Lambda(w_1)}(\Lambda(e(z_1)) \mid \Lambda(w_1))$. (4.17)

Proceeding in this way, it is easy to check that the first criterion is satisfied for every manipulated $x$ and partition $z$.
Now we check whether the second criterion is satisfied.
As mentioned in the previous section, there is a special case of the stochastic manipulation in which $w^*$ forms a fine cut of the idle CEG. In this case,

Then, to identify the causal effects, we only need to show that for every $x \in x(E(w^*))$ we can find a back-door partition satisfying the criteria in Theorem 4.3.
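To make the role of a back-door partition concrete, the following sketch evaluates an adjustment numerically. It assumes the adjustment has the same form as Pearl's back-door formula [30], $\sum_z p(y \mid x, z)\, p(z)$, with path events in place of variables; the partition blocks, the probabilities and the function name `back_door_adjust` are hypothetical and are not taken from the bushing or conservator examples.

```python
from collections import defaultdict

def back_door_adjust(atoms, x, y):
    """atoms: iterable of (z, x_label, y_label, probability), one per path event,
    with probabilities summing to 1. Returns sum_z p(y | x, z) * p(z)."""
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yxz = defaultdict(float)
    for z, x_label, y_label, p in atoms:
        p_z[z] += p
        if x_label == x:
            p_xz[z] += p
            if y_label == y:
                p_yxz[z] += p
    total = 0.0
    for z, pz in p_z.items():
        if p_xz[z] == 0.0:
            raise ValueError(f"x never occurs within block {z!r}; adjustment undefined")
        total += p_yxz[z] / p_xz[z] * pz
    return total

# Hypothetical numbers: the block z influences both the d-event x and the failure.
p_block = {"z0": 0.5, "z1": 0.5}
p_x_given_z = {"z0": {"a": 0.2, "b": 0.8}, "z1": {"a": 0.8, "b": 0.2}}
p_fail = {("a", "z0"): 0.1, ("a", "z1"): 0.5, ("b", "z0"): 0.3, ("b", "z1"): 0.7}

atoms = [(z, x, y, p_block[z] * p_x_given_z[z][x] * (pf if y == "fail" else 1 - pf))
         for z in p_block for x in ("a", "b")
         for pf in [p_fail[(x, z)]] for y in ("fail", "ok")]

adjusted = back_door_adjust(atoms, "a", "fail")
naive = (sum(p for z, x, y, p in atoms if x == "a" and y == "fail")
         / sum(p for z, x, y, p in atoms if x == "a"))
```

With these numbers the adjusted quantity and the naive conditional probability differ, which is the situation in which a suitable partition is needed before the effect can be read off from the idle system.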
The formulae given in this section are useful because they let us formally identify the efficacy of the intervention, expressed as a quantitative probability score that is a function of terms we can identify from the idle system.

Figure 5: Trees for the simulation study. Numbers in brackets are the true parameters. We use the same colour palette for the trees. Vertices in Figure 5a and Figure 5b with the same colour do not share the same transition probability vector.
Example. Here we give an example of a conservator system of a transformer; see Figure 4. The initial events are root causes: {oil indicator/contact fault, other fault}. Following the root causes, we attach the oil status of the transformer: {leak & level low, other}, where other refers to the condition in which there is no leak and the oil level is normal, or there is only an oil leak, or only the oil level is low. Defects in two components, the Buchholz and the Drycol, may occur after oil problems. Here we consider whether the two components are both faulty or otherwise (either one is faulty or both are functioning). Then we attach the failure indicator.
Suppose there is a remedial intervention that replaced the deteriorated seal. This maintenance remedied the contact fault. In response to the intervention, the conditional probability assigned to $e_{w_0,w_1}$ is expected to decrease, and the probability distribution over $\mathcal{F}(w_0)$ is manipulated.

A simulation study
The engineers' reports in the domain we studied are sensitive. So instead we demonstrate how to apply the causal algebras proposed in previous sections in practice on a simulated dataset. Let Figure 5a … CEG, and we simulate a synthetic dataset comprising 5000 cases. There are 2698 failure cases, which emulate the information extracted from the failure reports. The rest emulate the record of preventive maintenance [46]. We generate perfect remedies by simply assuming

$$p(\delta = 0 \mid \text{endogenous cause, fail, root cause 1}) = p(\delta = 0 \mid \text{endogenous cause, fail, root cause 2}) \tag{5.1}$$
$$= p(\delta = 0 \mid \text{endogenous cause, fail}) = 0.3. \tag{5.2}$$

For the cases with non-perfect remedy we assume the failure processes are only partially observed.
In practice the explanations of the effects of remedial interventions will have been extracted from engineers' reports either manually or automatically [48]. The tree consistent with these explanations and data should begin with the failure indicator. Thus we transform Figure 5a into an equivalent tree, shown in Figure 5b, where the transition probabilities can be calculated using Bayes' rule. We call it the learning tree.
Suppose we know the ground truth event tree for this system, which shares the same topology as the learning tree. We next learn the parameters and stages of the tree from the simulated data. The parameters can be learned in a Bayesian framework based on equation 3.10 using the Metropolis-Hastings algorithm. Here we simply assume Dirichlet priors for the transition probability vectors, $\theta_w \sim \text{Dirichlet}(\alpha_w)$. The choice of $\alpha_w$ is discussed in the supplementary material. The structure of the CEG (i.e. the stages) is learned using the maximum a posteriori (MAP) algorithm; this is the most used Bayesian structural learning method for CEGs, see [4, 15, 45, 46]. After learning the best-scored CEG on the learning tree, we transform it back to the causal CEG. This allows us to perform the causal analysis described in Section 4. To impose the effects of remedies, we manipulate $\theta_{w_1}$ via $\Gamma$. Here we simply let … , where $N_{w_i,w_j}$ counts the observations associated with $e_{w_i,w_j}$. Other choices of $\Gamma$ have been discussed by [45] and [48].
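For intuition about how Dirichlet priors and edge counts $N_{w_i,w_j}$ combine, the sketch below shows the closed-form conjugate update for a single floret. This is not the Metropolis-Hastings scheme referred to above, nor the map $\Gamma$; it is the textbook special case that applies when every edge count of the floret is fully observed. The prior, the counts and the function names are hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: a Dirichlet(alpha) prior with multinomial edge counts
    gives a Dirichlet(alpha + counts) posterior."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts need one entry per edge of the floret")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with parameter vector alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical floret with three edges, a uniform prior and made-up counts.
alpha_w = [1.0, 1.0, 1.0]
counts_w = [50, 30, 17]
posterior_mean = dirichlet_mean(dirichlet_posterior(alpha_w, counts_w))
```

When the failure process is only partially observed, the counts are not all available and a sampling scheme such as the one mentioned in the text is needed instead.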
The best-scored tree learned from the synthetic data has the same structure as Figure 5b, where $v_3$ and $v_5$ are in different stages while $v_4$ and $v_8$ are in the same stage. Its MAP score is −7212.494, which is 2698.319 higher than the score of the tree with $v_3$ and $v_5$ in the same stage. The sensitivity analysis is summarised in Section B of the supplementary material. The manipulation shifts the distribution of $\theta_{w_1,w_3}$ to the left, as shown in Figure 6a. The posterior mean of $\theta_{w_1,w_3}$ is reduced from 0.503 to 0.492 after manipulation. We can then predict the system failure induced by the two root causes based on the estimated distribution of $\hat{\theta}_{w_1}$; see Figure 6b.

Discussion
In this article, we have shown the flexibility of the semantics of a CEG in representing asymmetric processes and capturing the effects of various interventions, even when the manipulations are asymmetric. Given a context-specific CEG for a particular system, we can design bespoke causal algebras for different types of remedial interventions. We can predict a machine's failure probability by imposing the underlying stochastic manipulations on the idle system, so that we can identify the effects of the remedial intervention by finding an appropriate back-door partition on the CEG. The graphical model we have described here therefore provides an excellent framework for translating established formal causal analyses so that these can be embedded into mainstream system reliability.
The original domain that motivated this formal development did not contain the case where the discovery of a fault would encourage an improvement of the system. However, a referee pointed out that there are many domains where a fault would provoke a system upgrade. In such cases, the failure rate could be reduced over the entire lifetime after maintenance. Thus more complex manipulations, beyond restoring root causes to as good as new (AGAN), should be considered. Semi-Markov processes can be modelled on dynamic CEGs [6], enabling us to generalise our proposed algebras and manipulate failure rates directly. Our previous work [45] sketched relevant ideas, which could be formalised in future research.
Finally, we note that the inferential framework we have developed here can be adapted to accommodate natural language data extracted, for example, from maintenance logs where engineers write about the faults they observe and the possible reasons for what they see. We can embed the causal reasoning behind the texts by mapping them onto the event tree and learning the structure of the causal CEG. This requires natural language processing techniques where the d-events play a role as a bridge to link the tree to the free texts. [47] and [48] proposed a naive way to implement this idea which used a hierarchical structure. These can be used to help structure appropriate CEGs like the examples, as well as informing the posterior floret probabilities. More advanced algorithms can be developed in the future to automate this causal learning process.

C Interventions that improve system efficiency
Such an intervention may involve manipulation not only of the probability distribution over root causes: either upstream or downstream florets of the root causes might be manipulated, depending on the maintenance. The algebras proposed in the paper can be generalised to encode such combinatorial manipulations. Moreover, the distribution of time-to-event might also be affected. On a dynamic CEG, we can model semi-Markov processes with semi-Markov kernel $Q_{w_i,w_j}(t) = \theta_{w_i,w_j} P(h_{w_i,w_j} \le t)$, where $h_{w_i,w_j}$ denotes the holding time at $w_i$ just before transitioning to $w_j$. We can specify a parametric distribution for each conditional holding time, for example a Weibull distribution. Then controlling the shape parameter of the Weibull distribution enables us to work on different phases of the bathtub curve.
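The semi-Markov kernel $Q_{w_i,w_j}(t) = \theta_{w_i,w_j} P(h_{w_i,w_j} \le t)$ with a Weibull holding time can be sketched as follows. The parameter values and function names are hypothetical; the Weibull is taken in its standard two-parameter (shape, scale) form, which is an assumption since the text does not fix a parameterisation.

```python
import math

def weibull_cdf(t, shape, scale):
    """P(h <= t) for a Weibull holding time h with the given shape and scale."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def semi_markov_kernel(theta, t, shape, scale):
    """Transition probability times the holding-time distribution function."""
    return theta * weibull_cdf(t, shape, scale)

def weibull_hazard(t, shape, scale):
    """Failure rate of the Weibull holding time at t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

# Hypothetical edge with transition probability 0.4 and an exponential-like holding time.
q = semi_markov_kernel(theta=0.4, t=2.0, shape=1.0, scale=2.0)

# Shape < 1, = 1 and > 1 give decreasing, constant and increasing failure rates,
# which correspond to the three phases of a bathtub curve.
early = [weibull_hazard(t, 0.5, 1.0) for t in (1.0, 2.0)]
useful = [weibull_hazard(t, 1.0, 1.0) for t in (1.0, 2.0)]
wear_out = [weibull_hazard(t, 2.0, 1.0) for t in (1.0, 2.0)]
```

Changing only the shape parameter therefore moves a unit between the early-failure, useful-life and wear-out regimes without altering the transition probability on the edge.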
Figure 1: An example of the staged tree and the CEG derived for a bushing system. (a) The hypothesised staged tree. (b) The CEG derived from (a).
Figure 2: The status monitors for the three types of remedies. (a) The perfect remedy. (b) The imperfect remedy. (c) The uncertain remedy.
Let $E_\Delta = \{e_{l_1}, \dots, e_{l_n}\}$ denote the set of edges labelled by the d-events associated with root causes. For any edge representing a root cause $e_{l_i} \in E_\Delta$, we define a binary variable to indicate whether or not the root cause labelled on $e_{l_i}$ is fixed, and call it an intervention indicator. Let

$$I_{e_{l_i}} = \begin{cases} 1, & \text{if the root cause represented on } e_{l_i} \text{ is fixed by the maintenance}, \\ 0, & \text{otherwise}. \end{cases} \tag{3.2}$$

Then we have a vector of intervention indicators defined over $E_\Delta$, denoted by $I_{E_\Delta} = \{I_{e_{l_1}}, \dots, I_{e_{l_n}}\}$. Let $\lambda_O \in \Lambda$ be the set of possible root-to-sink paths associated with the failure or deterioration. Note that the whole failure or deteriorating process might be only partially observed when the root causes are unknown. Let $\lambda_R \in \Lambda$ denote the actual failure or deteriorating path when root causes are known.
The intervention indicator vector defined for position $w$ is $I_w = (I_{e_{w,w'}})_{w' \in ch(w)}$. Define the intervened position to be a position whose floret $\mathcal{F}(w)$ is assigned a new probability distribution under an intervention. Let $w^*$ denote the set of intervened positions. Then under a remedial intervention, $w^* \subseteq W_\Delta$. If $I_{e_{w,w'}} = 1$, then the root cause represented on $e_{w,w'}$ is intervened upon and $w \in w^*$. We next formalise the manipulation of the probability distributions over $\mathcal{F}(w^*)$.

Definition 3.3 (Stochastic manipulations). A manipulation on a CEG $C$ is called stochastic if there exists a set of positions $w^* \subseteq W$ such that,

Figure 6: Results of simulation study.

Comparison of total errors for different values of α0.
Comparison of situational errors for different values of α0.

Figure 8 uses a bathtub curve to portray the life cycle of a unit, reflecting the change in failure rate. The perfect remedial intervention formalised in this paper corresponds to the black curve in the second life cycle. If the system is improved after maintenance, then the second life cycle may look like the red curve, where the failure rate is reduced throughout the entire lifetime of the unit compared to the first life cycle.