Demonstrating software reliability using possibly correlated tests: Insights from a conservative Bayesian approach

This paper presents Bayesian techniques for making conservative claims about software reliability, particularly when evidence suggests the software's executions are not statistically independent. We formalise informal notions of "doubting" that the executions are independent, and incorporate such doubts into reliability assessments. We develop techniques that reveal the extent to which independence assumptions can undermine conservatism in assessments, and identify conditions under which this impact is not significant. These techniques - novel extensions of conservative Bayesian inference (CBI) approaches - give conservative confidence bounds on the software's failure probability per execution. With illustrations in two application areas - nuclear power-plant safety and autonomous vehicle (AV) safety - our analyses reveal: (1) the confidence an assessor should possess before subjecting a system to operational testing; without it, such testing is futile - favourable operational testing evidence will eventually decrease one's confidence in the system being sufficiently reliable; (2) sometimes, the independence assumption supports conservative claims; (3) in some scenarios, observing a system operate without failure gives less confidence in the system than if some failures had been observed; (4) building confidence in a system is very sensitive to failures - each additional failure means significantly more operational testing is required to support a reliability claim.


Introduction
It is prudent to be conservative when assessing a software-based safety-critical system (SCS), since software failure could significantly harm stakeholders in the system. Rigorous statistical arguments can give support for conservative claims about whether the software is sufficiently reliable, where such claims are based on evidence of achieved levels of reliability. In particular, Bayesian methods provide a natural formalism/calculus for combining various forms of reliability evidence, resulting in probabilistic measures that (given the evidence) articulate one's uncertainty about the reliability of the SCS (see Atwood et al. (2003)); in particular, the reliability of its software. Examples of evidence that can be utilised in Bayesian approaches for reliability assessment include the observed failure behaviour of (similar) software during past operation (e.g., see Wierman et al. (2001), Bunea et al. (2005) and Pörn (1996)).
When using statistical arguments in assessments, a central question is whether the system's software has statistically independent and identically distributed (i.i.d.) "executions". By "an execution" we mean a set of actions performed by the software that can be regarded as a unit of software operation. For example, actions performed in response to each demand/input the software receives from its environment, or actions in response to a sequence of inputs (received over a unit amount of time or distance). In this paper, when a software execution occurs, either all of the actions performed are the required actions, or at least one of these actions is incorrect - i.e., a software execution is either correct or a failure. Our focus is the assessment of the system's software, so we consider only software failures (so, no hardware failures) as defining system failure.
Executions that are i.i.d. make sense for some systems, such as an on-demand system where demands rarely occur, and where the system's state and operational environment are reset between demand occurrences. But there are often reasons to doubt the i.i.d. assumption. An autonomous vehicle (AV) could experience sudden changes in driving conditions that make it very likely for the AV to make a series of consecutive mistakes; an airplane (and its flight control systems) can be put under increasing operational stress when it encounters aggressive weather mid-flight (with turbulent "air pockets"); and "failure clustering" has been observed in various (control) systems. In many situations, at least some doubt about independent executions is warranted.
Even when executions are assumed i.i.d., SCS software will typically be required to exhibit no failures over a large number of executions. An assessor might (must?) consider whether these successful executions are positively correlated after all, and might account for this possibility in assessments, because, at face value, an assumption of i.i.d. executions can seem quite strong - it significantly limits an assessor's hypotheses about which probabilistic laws could characterise a software's failure process. Consequently, one might suspect that assuming independence results in optimistic reliability claims - it's useful to ask whether this is actually the case. Are assessments significantly undermined by assuming independence?
The answer depends on: i) how reliable the software actually is; ii) the sequence of the software's successes/failures during operational testing; and iii) the nature of any dependence between executions. Prior to testing, the assessor is uncertain about (i)-(iii), and is reliant on reliability evidence to shape their beliefs about how reliable the system is. Operational testing provides more evidence that refines these beliefs further.
This process - of an assessor's uncertainties being initially shaped by evidence obtained prior to operational testing, and then further shaped by the system's performance during testing - is formalised in this paper in conservative Bayesian inference (CBI) terms, for an assessor making claims about the system's probability of failure per "execution" (pfe). An example pfe is the probability of failure per demand (pfd) for an on-demand system; another example is the probability of a fatality-event per mile (pfm) for an AV; we consider both examples later in the paper. Our primary interest is in assessing software reliability - specifically, solving constrained optimisation problems, to obtain the least confidence an assessor can justifiably have in a system's pfe being "small enough". CBI makes explicit how (i)-(iii) above affect an assessor's uncertainty under various assessment scenarios. Fig. 1 summarises these scenarios and indicates the sections of this paper where the scenarios are treated. Prior to testing, an assessor may express some confidence in, say: i) the system being sufficiently reliable (e.g. confidence in the unknown pfe being smaller than some target value set for the system developers); ii) the system being fault-free; iii) future executions being negatively or positively dependent, and being failure-free. An assessor uses execution outcomes during testing - such as no failed executions occurring, or some failures separated by runs of successes - to update their confidence.

Summary of the paper's contributions
1) extending CBI techniques that allow an assessor to quantify the potential negative impact of invalid statistical modelling assumptions on reliability claims (e.g., when software executions are assumed i.i.d. when, in actuality, they are not) - see section 4;
2) showing how the outcomes of software executions - whether failures occurred and whether these were clustered or isolated - significantly affect how confident an assessor can (justifiably) be of a system being sufficiently reliable (sections 4, 5);
3) several closed-form solutions for conservative posterior confidence in an upper bound on pfe (see Fig. 1, sections 4, 5 and the supplementary material);
4) illustrating these findings in two scenarios: nuclear power-plant and autonomous vehicle (AV) safety (section 4). We give advice and caution for assessors/practitioners, concerning how confident they should be before embarking on operational testing (sections 4, 5).
The rest of the paper is organised as follows. Related work is detailed in section 2, while section 3 reviews CBI and the Klotz model for correlated executions. Formalisations of doubting i.i.d. executions are given in section 4, and used to derive conservative confidence bounds on system pfe under various scenarios. The sensitivity of these bounds to changes in model parameters is studied in section 5. Section 6 discusses results, and section 7 concludes the paper.

Related Work
The current paper directly continues our development of statistical techniques for conservative reliability assessment first reported in Salako & Zhao (2023). Consequently, the related works detailed in that paper continue to be relevant here; we highlight these works in this section.

Why is Modelling Correlated Executions Necessary?
An early model for sequences of statistically independent executions, used in works on random testing, is due to Thayer et al. (1978) (see Duran & Ntafos (1984) for an application of the model). However, reasons to expect correlated failed executions in various systems became well-known. For example, a system can exhibit "failure clustering" due to the system receiving sequences of inputs that cause the system to fail, where such inputs cluster into subsets of the system's failure region - see Ammann & Knight (1988) and Bishop (1993). The system's operational environment generates input sequences as trajectories (within the set of all inputs) that eventually enter into, and linger in, these failure regions. This phenomenon motivated the development of assessment approaches that account for positive failure correlation between executions - see Csenki (1993); Tomek et al. (1993); Huang et al. (2021). Strigini (1996) gives other reasons for correlated executions; e.g. if the software's internal state is corrupted upon an initial failed execution, making subsequent executions more likely to fail. Or, if the system's operational environment becomes increasingly more stressful (i.e. there's an increasing probability of trajectories in the input space entering the failure region).

Statistical Models of Correlated Executions
A number of models with Markov dependence have been proposed for correlated executions: the binary Markov chain of Chen & Mills (1996); the Markov renewal process of Goseva-Popstojanova & Trivedi (2000) (that builds upon earlier work in Csenki (1993), Tomek et al. (1993)); and the Bondavalli et al. (1995) model that captures benign failures, and the cumulative impact of such failures when assessing iterative software. Bondavalli et al. (1997) improve on this model, demonstrating the model's use with fitted steady-state and transition probabilities.
None of the aforementioned models are demonstrated using inference methods that explicitly account for one's uncertainty about whether the executions are i.i.d. or not. Nor do these models provide demonstrably conservative statistical support for reliability claims about software - where such support is justified by various forms of reliability evidence (in addition to the outcomes of possibly correlated executions during operational testing).

Conservative Bayesian Methods for Assessments
A number of studies have applied Bayesian methods to support software reliability assessment, e.g. Miller et al. (1992); Singh et al. (2001); Littlewood et al. (2002); Popov (2013). The utility of these methods is in the inference process. An assessor's beliefs about the reliability of a system are initially formed by evaluating relevant evidence. Then these beliefs are updated, upon seeing how the system performs during operation.
The usual challenge with Bayesian methods is the need to characterise one's initial (i.e. "prior") beliefs as a prior probability distribution - a distribution that captures all, and only all, of one's prior beliefs. Care must be taken when eliciting a prior distribution; an unrepresentative prior could lead to overly pessimistic, or dangerously optimistic, assessments.
For reliability assessments, there is the added challenge that prior distributions often represent beliefs about continuous random variables, such as an on-demand system's unknown pfd. Requiring that an assessor specify beliefs about the infinitely many ranges of possible pfd values is often impractical.
CBI methods have been developed to address these challenges. CBI is related to robust Bayesian analysis, which studies the sensitivity of the results of Bayesian inference to changes in the inference inputs - see Berger (1990, 1994); Lavine (1991); Berger & Moreno (1994). Inputs such as: the prior distribution; the statistical model that determines the likelihood function; and the posterior measure of interest. An assessor may not be able to use available evidence to fully specify a prior distribution, but the evidence may allow a much more limited partial specification of a prior - e.g. the assessor expects the prior, whatever it may be, to satisfy certain quantiles or moments. By considering all of those prior distributions that satisfy these specifications, CBI determines the most conservative inference result (from using these priors) to give support for a claim that the system is sufficiently reliable. This is one way in which conservatism in assessments is realised via Bayesian inference. Bishop et al. (2011) introduced the CBI idea. A number of studies soon followed, applying CBI in various contexts. For example, i) Strigini & Povyakalo (2013) use CBI to obtain the smallest probability of the system's next m executions being successful, given prior evidence that the system is very reliable and its last n executions were successful; ii) in Zhao et al. (2015), with evidence to support some confidence in the system possibly being fault-free, and some confidence in the system being very reliable, the authors use CBI with operational testing to conservatively gain confidence in the system possibly being fault-free; iii) Salako (2020) bounds the reliability of a binary classifier, given evidence that the classifier's past performance was (un)likely; and iv) Zhao et al. (2019) apply CBI to the problem of assessing AV safety - highlighting circumstances under which attempts to demonstrate the required levels of safety via road testing are in vain. More CBI applications to assessing SCSs are found in Zhao et al. (2015, 2017, 2018, 2019).
These applications all involve "univariate" priors; i.e., distributions of a single unknown, typically the system pfe. More recently, CBI applications have involved "bivariate" priors. Littlewood et al. (2020) consider the assessment of a system in a "new" situation - either the system replaces an older system in a given operational environment, or the system has been deployed in a new environment after operating in a previous environment for some time. They demonstrate how CBI supports dependability claims, when evidence suggests the system's failure propensity in the new situation is "no worse" than the propensity in the "old" situation. Zhao et al. (2020) study "improvement arguments" of this kind in the context of assessing AV safety - but with different dependability measures of interest and a more general failure model for the system - while Salako et al. (2021) consider more general "improvement arguments" for an even wider range of assessment scenarios.
In Salako & Zhao (2023), we introduce a CBI technique for incorporating doubts about the i.i.d. assumption into conservative reliability claims. The statistical model used was the first CBI model to capture correlated executions, and is based on the Klotz (1973) model - a binary Markov chain that predates and agrees with Chen & Mills (1996) and Goseva-Popstojanova & Trivedi (2000). We illustrated how assessments where i.i.d. executions are assumed can be very optimistic. That paper was concerned with assessing on-demand systems where (despite the software containing faults) no software failures are observed during extensive operational testing. The current paper significantly extends this CBI approach, to apply to a wider range of assessment scenarios - e.g., scenarios where some failures are observed during extensive testing, and where the assessor can justify only very weak beliefs about the unknown pfe.

Assessing Continuously Operating Software
The current paper focuses on assessment scenarios where one observes the software's success/failure behaviour on each of a number of "unit" software operations - i.e., on each execution. Example scenarios include assessing on-demand systems (see PD IEC TR 63161:2022 (2022); Rausand (2014)). Hence, we employ "discrete-time" statistical models (i.e., Bernoulli processes) and our reliability measure of interest is pfe.
Contrastingly, for assessments where the reliability measure of interest is the software's failure rate in continuous time, "continuous-time" statistical models are more appropriate (e.g., non-homogeneous Poisson processes (NHPPs)).

A Review of CBI
Consider the following scenario from Zhao et al. (2019). An on-demand system is subjected to operational testing, to determine if its probability of failing on a randomly occurring demand (i.e., its pfe) is acceptably low. Let X be this unknown pfe - i.e., on a random demand, the system fails (with probability X) or succeeds (with probability 1 − X). Demands occur randomly according to an operational profile, see Musa (1993).
Before operational testing, an assessor might have sufficient evidence to fully specify a prior distribution F, representing their beliefs about which values of X are likely to be the true value, and which values aren't. During operational testing, the system correctly responds to all n demands that occur - these successes are assumed to occur in an "independent and identically distributed" (i.i.d.) manner. If the value of X is x, then the probability of observing these successes is L(n; x) = (1 − x)ⁿ. Let p be a required upper bound on X. The assessor's confidence (after seeing the successes) in X being no larger than p is:

 P(X ⩽ p ∣ n failure-free demands) = ∫ 1ₓ₍ₓ ⩽ ₚ₎ (1 − x)ⁿ dF(x) ∕ ∫ (1 − x)ⁿ dF(x) ,    (1)

where 1_φ is an indicator function - it equals 1 when predicate φ is true, and 0 otherwise. Often, there isn't enough evidence to fully justify a specific prior distribution for (1), but there may be enough to justify weaker constraints on the prior (such as a few quantiles). We refer to such constraints as prior knowledge (PK). A basic form of PK is:

Prior Knowledge 1 (certainty in a lower bound). 100% confidence in the system's pfe not being lower than pₗ; i.e., P(X ⩾ pₗ) = 1.
X is a probability, so pₗ should be non-negative. pₗ can be 0 (see Littlewood & Rushby (2012) for possibly "fault-free" software) or a very small number (e.g. the best reliability feasible for the system given current levels of technology).
Prior Knowledge 2 (confidence in satisfying an engineering goal). θ × 100% confidence in the system's pfe being better than, or equal to, an upper bound ε; i.e., P(X ⩽ ε) = θ.
is an "engineering goal": a target pfe value that system developers try to achieve. is typically chosen to be much smaller than the required bound , so  ⩽ .While  is how confident the assessor is, before operational testing, that the engineering goal has been achieved. has to be large enough to support conducting operational testing; reducing the chance that unreliable systems use up the operational testing budget.The following theorem shows one can conservatively gain confidence in a bound  on  (see Zhao et al. (2019)).
Theorem 1 (univariate CBI). Let D be the set of all probability distributions over the interval [0, 1]. Using (1), the optimisation problem

 inf P(X ⩽ p ∣ n failure-free demands), over all priors F ∈ D satisfying PK1 and PK2,

has the solution

 θ(1 − ε)ⁿ ∕ ( θ(1 − ε)ⁿ + (1 − θ)(1 − p)ⁿ ) .

To obtain this solution, one constructs sequences of feasible priors (where each subsequent prior in a sequence gives an increasingly worse value for the objective function) that converge "almost everywhere" to a prior that gives the infimum. This limiting prior differs from the feasible priors converging to it at x = p. As a consequence, it is the value of P(X < p ∣ …) from this prior, rather than P(X ⩽ p ∣ …), that is the infimum.

Insight 1 (the basic CBI idea). One considers the set of all feasible priors that satisfy an assessor's PKs. For a given posterior measure of interest (e.g. posterior confidence in a bound on X), CBI determines a prior that gives the most pessimistic value for this measure - no feasible prior gives a more pessimistic value, and any prior that does must violate at least one PK. The CBI prior is referred to as a "worst-case" prior.
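To make Insight 1 concrete, here is a minimal numerical sketch in Python. It approximates the CBI infimum by searching over two-point priors that satisfy PK1 and PK2; restricting attention to two-point priors, and all parameter values below, are illustrative assumptions of ours, not prescriptions from the paper.

```python
import numpy as np

def posterior_conf(prior_pts, prior_wts, n, p):
    """Posterior confidence P(X <= p | n failure-free executions) for a
    discrete prior with mass prior_wts[i] at pfe value prior_pts[i]."""
    like = (1.0 - prior_pts) ** n            # i.i.d. likelihood (1 - x)^n
    post = prior_wts * like
    return post[prior_pts <= p].sum() / post.sum()

# PK1 and PK2 parameters (illustrative values only)
p_l, eps, theta, p, n = 1e-7, 1e-5, 0.75, 1e-4, 10_000

# Two-point priors satisfying the PKs: mass theta at a <= eps, and mass
# (1 - theta) anywhere in [p_l, 1). Keep the most pessimistic result found.
worst = 1.0
for a in np.geomspace(p_l, eps, 50):
    for b in np.geomspace(p_l, 1.0 - 1e-12, 200):
        pts = np.array([a, b]); wts = np.array([theta, 1.0 - theta])
        worst = min(worst, posterior_conf(pts, wts, n, p))
print(f"approximate worst-case confidence in X <= {p}: {worst:.4f}")
```

The grid search lands near the limiting worst-case prior described above: mass θ at ε, and mass 1 − θ just above the bound p.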

A Model of Correlated Software Executions
The following stochastic failure process for software exhibiting correlated executions - used in Salako & Zhao (2023) - is based on the Klotz (1973) model. A sequence of Bernoulli random variables T₁, … , Tₙ each take on the values 1 or 0, indicating failure or success, respectively, on each of n software executions. Similar to section 3.1, let X be the unconditional probability the next execution is a failure (pfe). Let Λ be the probability that a failure is followed by another failure. That is, for given values x and λ of X and Λ,

 P(Tᵢ = 1) = x  and  P(Tᵢ = 1 ∣ Tᵢ₋₁ = 1) = λ .

If the process is 1st-order stationary, we obtain the Markov model in Fig. 3 (see Klotz (1973), Salako & Zhao (2023)). This stochastic process is well-defined if the transition probabilities lie between zero and one. That is, if

 0 ⩽ λ ⩽ 1  and  0 ⩽ x(1 − λ)∕(1 − x) ⩽ 1 ,    (2)

where x(1 − λ)∕(1 − x) is the probability that a success is followed by a failure. Inequalities (2) define the subset S of the unit square (see Fig. 4a).
Remark 1 (correlation in the Klotz model). The correlation coefficient for two successive executions is (λ − x)∕(1 − x) when 0 ⩽ x < 1, and 1 when x = 1. This defines 3 correlation types: i) when λ = x, the model exhibits independent execution outcomes; ii) when λ > x, execution outcomes tend to cluster more often (e.g. bursts of failures) - positive correlation; and iii) when λ < x, execution outcomes tend to alternate more often, between failure and success - negative correlation.
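As a quick check of Remark 1, the following sketch simulates the Klotz chain and compares the empirical lag-1 correlation with (λ − x)∕(1 − x). The function name and parameter values are ours, chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_klotz(x, lam, n):
    """Simulate n executions of the Klotz chain: P(T_i = 1) = x (the pfe) and
    P(T_i = 1 | T_{i-1} = 1) = lam. Here 1 = failure, 0 = success."""
    p_fail_after_success = x * (1.0 - lam) / (1.0 - x)  # P(1|0); must lie in [0,1], cf. (2)
    t = np.empty(n, dtype=int)
    t[0] = rng.random() < x                             # stationary start
    for i in range(1, n):
        q = lam if t[i - 1] == 1 else p_fail_after_success
        t[i] = rng.random() < q
    return t

x, lam = 0.05, 0.4                      # lam > x: positive dependence
t = simulate_klotz(x, lam, 200_000)
print("empirical pfe:", t.mean())       # ~ x, by 1st-order stationarity
print("lag-1 correlation:", np.corrcoef(t[:-1], t[1:])[0, 1],
      "vs (lam - x)/(1 - x) =", (lam - x) / (1 - x))
```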
Over n executions, suppose the software makes: s transitions from a successful execution to a failed execution, t transitions from "success" to "success", r from "failure" to "failure", and u from "failure" to "success". So, s + t + r + u + 1 = n. Under the Klotz model, for given (x, λ), the probability of observing these transitions is the likelihood L(x, λ; s, t, r, u):

 L(x, λ; s, t, r, u) = λʳ (1 − λ)ᵘ ( x(1 − λ)∕(1 − x) )ˢ ( (1 − 2x + xλ)∕(1 − x) )ᵗ × x, when the 1st execution is a failure;
 L(x, λ; s, t, r, u) = λʳ (1 − λ)ᵘ ( x(1 − λ)∕(1 − x) )ˢ ( (1 − 2x + xλ)∕(1 − x) )ᵗ × (1 − x), when the 1st execution is a success.    (3)
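A minimal sketch of the likelihood (3) in code; the function and argument names are ours.

```python
def klotz_likelihood(x, lam, s, t, r, u, first_is_failure):
    """Likelihood (3): probability of observing s success->failure,
    t success->success, r failure->failure and u failure->success
    transitions, under the Klotz model with pfe x and
    P(failure | previous failure) = lam."""
    p10 = x * (1.0 - lam) / (1.0 - x)    # P(failure | previous success)
    p00 = 1.0 - p10                      # P(success | previous success)
    first = x if first_is_failure else (1.0 - x)
    return first * (lam ** r) * ((1.0 - lam) ** u) * (p10 ** s) * (p00 ** t)
```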
During operational testing, an assessor observes s, t, r and u. But both X and Λ are unknown to the assessor; in particular, the assessor is uncertain about the pfe X. So, upon observing n executions of the system, an assessor's confidence in X being no larger than a bound p is:

 P(X ⩽ p ∣ s, t, r, u) = ∫_S 1ₓ₍ₓ ⩽ ₚ₎ L(x, λ; s, t, r, u) dF(x, λ) ∕ ∫_S L(x, λ; s, t, r, u) dF(x, λ) .    (4)

Upon observing n executions, what is the least confidence an assessor should have about X being "no bigger than" the bound p? We determine this for different scenarios by deriving the greatest lower bound for (4) using PKs 1, 2, 3, 4. Section 3.1 introduced PKs 1, 2, while section 4.2 introduces PKs 3, 4.
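For a discrete joint prior over (x, λ) points, (4) reduces to a weighted sum. A sketch, reusing klotz_likelihood from the sketch above; the 4-point prior here is purely illustrative (it is not one of the worst-case priors of Table 1).

```python
import numpy as np

def posterior_conf_klotz(points, weights, s, t, r, u, first_is_failure, p):
    """Posterior confidence P(X <= p | s, t, r, u), eq. (4), for a discrete
    joint prior putting mass weights[i] at (x_i, lam_i) = points[i]."""
    post = np.array([w * klotz_likelihood(xi, li, s, t, r, u, first_is_failure)
                     for (xi, li), w in zip(points, weights)])
    below = np.array([xi <= p for (xi, _) in points])
    return post[below].sum() / post.sum()

# An illustrative 4-point prior over (x, lam), in the spirit of Fig. 4
points  = [(1e-5, 1e-5), (1e-4, 0.0), (1e-4, 1.0), (0.5, 0.5)]
weights = [0.70, 0.10, 0.15, 0.05]
# 10,000 failure-free executions: t = 9,999 success->success transitions
print(posterior_conf_klotz(points, weights, s=0, t=9_999, r=0, u=0,
                           first_is_failure=False, p=1e-4))
```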
Practical implications of these results - with domain-specific PK parameterisations for nuclear power-plant safety protection systems and AV safety subsystems - are given in sections 4.3 and 4.4 respectively. The parameterisations are illustrative of plausible PK values when assessing functionally redundant software components for safety systems employing fault tolerance - e.g., systems highlighted in Wood et al. (2010); Koopman & Wagner (2016); Hörwick & Siedersberger (2010); Sha (2001).
These CBI techniques/results - for conservatively gaining confidence in the software being sufficiently reliable - are primarily intended for use in assessments based on operational/statistical testing, where the software is treated as a black box. In operational/statistical testing, the test cases for the software are randomly generated software inputs. To do this correctly, the probability of generating a given test case must be consistent with an "operational profile" - i.e., this probability must be the same as the probability of the same "case" occurring when the software is deployed in real operation. Test cases can take different forms, depending on the type of software under study. For example, consider a batch program that receives numerical values for all of its input variables at the start of an execution, executes, and then produces all of its outputs. A test case for such a program could be a fixed-length vector of numerical values - each number in the vector is the value for an input variable. Alternatively, consider a control program that receives multiple numerical values for each input variable over time; here, a test case could be a collection of numerical sequences - each sequence represents the changing values over time for an input variable. More examples of test cases can be found in Lyu (1996) and Strigini & Littlewood (1997).
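As an illustration of test generation that is consistent with an operational profile, the sketch below samples demand classes with their operational occurrence probabilities. The class names and probabilities are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical operational profile: demand classes and their occurrence
# probabilities in real operation (names and numbers are assumptions).
profile = {"nominal": 0.90, "high_load": 0.07, "sensor_dropout": 0.03}

def sample_test_cases(n):
    """Draw n statistically representative test cases: each demand class is
    generated with the same probability as it occurs in deployed operation."""
    classes = list(profile)
    probs = [profile[c] for c in classes]
    return rng.choice(classes, size=n, p=probs)

print(sample_test_cases(10))
```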
Statistical testing supports direct estimates of reliability, for the purposes of reliability assessment and product acceptance. It also supports decisions on whether software is ready for use in a specific system. Thus, CBI techniques can be applied in any software development testing phase where statistical testing may be applied, and where decisions may be taken on whether software is ready to be deployed; e.g., integration or acceptance testing.
Of course, good practice for carrying out statistical testing must be followed when employing our theorems and results; e.g., Strigini & Littlewood (1997) give detailed guidance in this regard. See also Salako & Zhao (2023) for more discussion. Best practice approaches for expert belief elicitation should be followed to elicit the PKs; e.g., PRA Working Group (1994); O'Hagan et al. (2006).
Table 1 lists the worst-case priors used for the curves in the example plots. While these priors are consistent with those in Salako & Zhao (2023) (for assessors that can justify continuous marginal prior distributions of X), the current priors are applicable in many more scenarios (e.g., when assessors can justify only very limited properties of the marginal prior distribution of X, in the form of PKs 1 and 2).

Assessment with Doubts about i.i.d. Executions
Our first assessment scenario is a baseline. Before operational testing, an assessor uses reliability evidence to justify the engineering goal PK2, and the following two PKs about the independence assumption (cf. Remark 1):

Prior Knowledge 3 (confidence in negative dependence). φ₁ × 100% confidence in successive software executions having negative dependence; i.e., P(Λ < X) = φ₁.

Prior Knowledge 4 (confidence in positive dependence). φ₂ × 100% confidence in successive software executions having positive dependence; i.e., P(Λ > X) = φ₂.
Consequently, the assessor's prior confidence in independence is P(Λ = X) = 1 − φ₁ − φ₂. Note that in all of the remaining theorems in this paper, φ₁ = φ₂ = 0 is the special case of i.i.d. execution outcomes - in this limit, all the theorems agree with previously published univariate CBI results.
An assessor observes that all n executions during testing are successful. Using the Klotz model, the CBI problem of determining the least amount of confidence the assessor can justifiably have, about the system pfe satisfying bound p, is the following constrained optimisation problem. Consider the support S of the Klotz likelihood, defined by (2) and depicted in Fig. 4a. Let D be the set of all prior probability distributions over S, and 0 ⩽ pₗ ⩽ ε < p < 1∕2. Then the following theorem, Theorem 2, holds (a generalisation is proved in the supplementary material, A.2). Each of Figs. 4b, 4c and 4d shows the domain S of a joint prior distribution for the (X, Λ) random variables, and the 4 points (i.e., black dots) in S assigned nonzero probabilities by this distribution. So, these joint priors are depicted as if one were looking down on the distribution and its domain "from above".
Example 1 (baseline). Consider an on-demand SCS which acts only upon receipt of a demand from its environment. An assessor is 75% confident the software's pfe - i.e., its probability of failure per demand (pfd) - is no worse than ε = 10⁻⁵ (i.e. the engineering goal of PK2, with θ = 0.75). After n failure-free tests of the SCS, the assessor is some fraction confident that the system meets Safety Integrity Level (SIL) 4, i.e. p = 10⁻⁴ (see IEC (2010)). Fig. 5 shows three plots of this posterior confidence as a function of n using three Bayesian models: univariate CBI (cf. Theorem 1), Bayesian inference (BI) using a Beta prior satisfying PK2, and CBI with confidence φ₁ and φ₂ in negative and positive dependence respectively (cf. Theorem 2). Univariate CBI and BI (using a Beta prior) both assume the executions are statistically independent (so, have likelihood L(n; x) = (1 − x)ⁿ). In Fig. 5, as expected, univariate CBI gives less confidence (so, is more conservative) than BI using a Beta prior, while both of these are more optimistic than CBI with doubts about independence. Failure-free evidence is always "good news" under univariate CBI and BI (i.e., the solid and dash-dot curves monotonically increase to "certainty" in the bound). Contrastingly, with doubts about independence, such evidence can eventually undermine an assessor's posterior confidence in the 10⁻⁴ bound for all sufficiently large n (i.e., the unimodal pattern of the dashed curve). This is because there are pessimistic reasons why the failure-free tests could be occurring - reasons that suggest the successes are occurring despite the SCS not being very reliable. For example, the test cases might be unrepresentatively "easy" for the SCS to correctly respond to, or there may be problems with the test oracle which mean some failures go undetected; see Littlewood & Wright (2007); Barr et al. (2015); Salako & Zhao (2023).
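The eventual decline of the dashed curve can be reproduced qualitatively from the likelihood (3): any prior mass at the point (p, 1) retains the constant likelihood 1 − p under failure-free testing, while mass at small independent pfe values decays like (1 − x)ⁿ. A self-contained two-point illustration (parameter values assumed; the paper's four-point worst-case priors are deliberately simplified away here):

```python
# Prior mass theta on an independent, very reliable point (eps, eps), and the
# residual mass on the "pfe just above the bound, perfectly positively
# correlated" point (p, 1). Failure-free likelihoods: (1 - eps)^n at the
# former, constant (1 - p) at the latter - so confidence must eventually fall.
eps, p, theta = 1e-5, 1e-4, 0.75
for n in [10**3, 10**5, 10**6, 10**7]:
    good = theta * (1.0 - eps) ** n          # posterior weight below the bound
    bad = (1.0 - theta) * (1.0 - p)          # posterior weight (just) above it
    print(n, round(good / (good + bad), 6))  # confidence in the bound p
```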
So, as failure-free evidence accumulates, any prior confidence the assessor has in the executions being positively correlated - i.e., any φ₂ > 0 - will eventually undermine confidence in any pfd upper bound. On the other hand, prior confidence in negatively correlated executions (i.e. φ₁ > 0) has a negligible impact on posterior confidence in the bound (see section 5's sensitivity analysis). Intuitively, the longer testing goes on without failure, the greater the evidence against the tests being negatively dependent. An example of how negative dependence can occur is if an assessor intentionally tries to "stress" the software during testing, by randomly including a disproportionate number of "difficult" demands - demands that are thought likely to cause software failure. So one might expect testing to reveal some negative dependence - a failure quickly followed by a success, then followed by another failure relatively soon afterwards, and so on. However, "no failures" may suggest the "difficult" demands are not actually difficult for the software.

Table 1: A list of the prior distributions for the curves in the example plots of Section 4. (See the supplementary material for any listed figures not included in the paper.)

Assessment with PKs for Nuclear Reactor Safety Systems
Next consider the assessment of a nuclear reactor safety protection system that is simple enough to possibly be fault-free (i.e., the system's pfd could be 0); see Littlewood & Rushby (2012). Typically, failure-free operational testing evidence is required from such a system - otherwise, if a failure occurs, the system is taken offline and fixed, before testing resumes with a new sequence of demands. This scenario is very similar to the baseline of the last section - the testing evidence and most of the PKs are the same - but now, the engineering goal is "perfection" (i.e., PK2 with ε = 0).
As before, we are interested in an assessor's confidence in a pfd bound p upon seeing n failure-free runs, where the assessor harbours doubts about the execution outcomes being i.i.d. (i.e., nonzero φ₁ or φ₂). This confidence in p is given by Theorem 2, simply by replacing PK2 with P(X = 0) = θ; that is, ε = 0 in the distributions of Figs. 4b, 4c and 4d.
Example 2 (nuclear reactor protection systems). Consider a nuclear reactor safety protection system that an assessor is 70% confident contains no faults (i.e., PK2 with ε = 0 and θ = 0.7). Upon seeing n failure-free tests, an assessor's conservative posterior confidence in the pfd bound 10⁻⁴ (SIL 4) is shown in Fig. 6, for three Bayesian models with different PKs: univariate CBI with ε = 0, CBI with doubts about the independence assumption and ε = 10⁻⁵ (i.e., the baseline Example 1), and CBI with doubts about independence and ε = 0.
Example 2 highlights the benefit of the software possibly being fault-free: in contrast to Example 1, accumulated failure-free evidence will not eventually undermine posterior confidence in p. That is, the dotted curve in Fig. 6 is an increasing function of n that is asymptotic to the horizontal line θ∕(θ + (1 − p)φ₂) (the asymptote is obtained by setting ε = 0 in the distribution of Fig. 4b), so it lies above the dashed curve, yet lies below the solid curve (i.e., it's more conservative than the confidence from univariate CBI).
Insight 2 (for a possibly perfect system, failure-free testing cannot undermine confidence in a bound). As more successful executions are observed, the more likely it is that these observations are the result of either a fault-free system (so X = 0) or a perfectly positively correlated system (so Λ = 1). That is, as n increases, the distribution in Fig. 4b tends to a distribution that has probability mass at only two points: a probability θ∕(θ + (1 − p)φ₂) at (0, 0), and a complementary probability at (p, 1). In the limit of large n, the assessor will need more evidence to distinguish between these two possibilities.
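A back-of-envelope check of Insight 2's limit, assuming the limiting support points given above (all numbers are illustrative): the fault-free point (0, 0) keeps likelihood 1 under failure-free testing, while the point (p, 1) keeps the constant likelihood 1 − p, so the posterior confidence climbs towards θ∕(θ + (1 − p)φ₂).

```python
# Fault-free scenario (eps = 0): mass theta at (0, 0) has likelihood 1 for
# any number of failure-free executions; mass phi2 at (p, 1) has constant
# likelihood (1 - p). Every other support point's likelihood vanishes as n
# grows, so confidence tends to the ratio below.
theta, p, phi2 = 0.7, 1e-4, 0.1
asymptote = theta / (theta + (1.0 - p) * phi2)
print(f"limiting confidence in the bound: {asymptote:.4f}")
```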
The sensitivity of these insights - to changes in the strength of the PKs - is explored in section 5, while further implications of these insights are discussed in section 6.

Assessment with PKs for Autonomous Vehicles
We turn our attention to assessing AV safety. In line with Kalra & Paddock (2016) and Zhao et al. (2019), the pfe is the probability of a fatality-event per mile (pfm). Here, each mile is treated as a "unit distance" over which software that enacts AV safety functions must correctly operate. Unlike the (possibly fault-free) protection software of subsection 4.3, AV software cannot be expected to be fault-free - it relies on imperfect, sophisticated machine-learning solutions performing a complex driving task. Consequently, unfortunately, AV safety software failures (leading to fatalities, in particular) have been known to occur; see National Highway Traffic Safety Administration (2022). This suggests a nonzero lower bound pₗ on the pfm (see PK1), and an engineering goal of a "safe enough" system rather than "perfection".
The following theorem gives conservative confidence in the pfm bound p (see the general proof in the supplementary material, A). Failures change the form of the Klotz likelihood from the one in Theorem 2 (see (3)). Here, an assessor doubts the independence of successive software executions as the AV drives over successive miles.

Example 3 (AVs). Consider an assessor's confidence in an AV being as safe as the average human driver in terms of pfm (so p = 10⁻⁸), after a fleet of AVs has driven millions of miles. Using PK parameter values from Zhao et al. (2020), we have: the engineering goal, ε = 10⁻¹⁰, is 2 orders of magnitude safer than the pfm for human drivers; and the lower bound on pfm is pₗ = 10⁻¹⁵. Fig. 8 shows confidence in p under different values of s (number of failures), r (consecutive failures), φ₁, and φ₂.
Three observations from Fig. 8: i) like previous scenarios, the dashed curve shows that doubts about independent executions eventually undermine confidence in p when no failures occur (cf. Figs. 5, 6). However, the other curves in Fig. 8 (except the solid curve) show that failures allow confidence in p to eventually approach 1. This is explained in Insight 3; ii) unsurprisingly, more failures require more testing for confidence in p to grow (i.e., the dash-dot curve lies to the right of the dotted curve); and iii) more consecutive failures require even more testing (i.e., the space-dash curve lies to the right of the dash-dot curve). Section 5's sensitivity analysis explores this further.
Insight 3 (failures can allow confidence in p to grow to 1). Consider the following two possibilities when failures occur: i) no consecutive failures (so s > r = 0), and prior evidence weakly supports positively correlated executions (φ₂ ⩽ θ). Then initially, execution outcomes are evidence of negative correlation (possibly from a system with pfm larger than p). However, as the number of successful executions increases (with no more failures), it becomes less likely that the executions are negatively (or positively) correlated; otherwise, more failures should have been observed. Instead, it's more likely that the successes are occurring because the pfm is smaller than p; ii) consecutive failures (so s > r > 0), and prior evidence weakly supports negatively correlated executions (φ₁ ⩽ θ). Again, initially, correlated executions (possibly from a system with pfm larger than p) are most likely. And like the previous case, as the successful executions increase (with no more failures), it becomes less likely that the executions are correlated, and more likely that the successes are due to the pfm being smaller than p.
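Insight 3's mechanism can be seen directly in the likelihood (3): a single failure→success transition (u ⩾ 1) forces the likelihood to vanish at λ = 1, and a single failure→failure transition (r ⩾ 1) forces it to vanish at λ = 0, ruling out the extreme correlation hypotheses. A sketch, reusing klotz_likelihood from section 3.2 (the counts below are arbitrary illustrations):

```python
x = 1e-4
# s + t + r + u + 1 = n = 1001 in both calls below.
# Perfect positive correlation (lam = 1) is impossible once a failure is
# followed by a success (u >= 1):
print(klotz_likelihood(x, 1.0, s=2, t=996, r=0, u=2, first_is_failure=False))  # 0.0
# Perfect negative correlation at lam = 0 is impossible once a failure is
# followed by another failure (r >= 1):
print(klotz_likelihood(x, 0.0, s=2, t=995, r=1, u=2, first_is_failure=False))  # 0.0
```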

The Sensitivity of Confidence Bounds to Changes in Prior Knowledge and Evidence
For an assessor who is uncertain about PK values, this section illustrates how to check the sensitivity/robustness of confidence in p to changes in PK values. The analyses also give insight into how confidence responds to a "strengthening" of prior reliability evidence. We systematically vary PK parameters, the bound p, and the execution outcomes. The prior distributions used in the numerical analyses are summarised in Table 2.

The Nuclear Safety Protection System Scenario
Fig. 9 provides 3 sub-figures highlighting the effects of changes in PK parameters (except PK1). In Fig. 9a, the "CBI with dependence" curves are asymptotically more conservative than the univariate CBI solid curve. However, because the system could be fault-free, the curves show that confidence in p always increases with increasing failure-free operational evidence. Also, the smaller φ₂ becomes, or the bigger θ or p are, the greater confidence in p can become.
Fig. 9b illustrates how changes in φ₁ have no apparent impact on confidence in p. However, Fig. 9c shows that the smaller φ₂ becomes, the closer the "CBI with dependence" curve gets to the univariate CBI curve. While, for φ₂ ⩾ 1 − θ, the confidence in p is the constant horizontal line θ∕(θ + (1 − θ)(1 − p)) (see Fig. 4c with pₗ = ε = 0).

Table 2: A list of all prior distributions used in the sensitivity analysis of Section 5. (See the supplementary material for any listed figures not included in the paper.)

The AV Scenario
Recall that, unlike the baseline and nuclear protection system scenarios, failures occur during sufficiently long road-testing campaigns (so s > 0) - albeit rarely; due to the severe negative impact failures can have in SCSs, we only consider operational campaigns with no more than a few failures. The confidence in bound p from Theorem 3 depends on whether some of these failures are consecutive (r > 0) or not (r = 0). We conduct two sets of analyses along these possibilities (note, Fig. 8 shows the result of r becoming nonzero for fixed s).

With No Consecutive Failures (s > r = 0)
In Fig. 10a, confidence in p changes in response to increases in ε and pₗ. The increase in ε has no noticeable impact - the solid and dashed curves overlap. However, when pₗ is increased, the required number of executions n to support a given confidence level reduces by a similar order of magnitude. In contrast, an additional failure increases n significantly - so the dotted and solid curves lie to the left of the space-dashed and dash-dotted curves, respectively. This is consistent with the findings of Littlewood & Wright (1997).
Changes in φ₁ have no noticeable effect on confidence in p (see Fig. 10b). Perhaps because there is very little operational evidence to support negative correlations - i.e. only a few instances of "switching" between failure and success.
On the other hand, changes in φ₂ have a clear impact, as shown in Fig. 10c. An increase in φ₂ requires an increase in n. Moreover, when φ₂ ⩾ θ, confidence in p becomes 0 for all n (see the prior distributions for r = 0 in Fig. 23).
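An assessor can run this kind of robustness check numerically. The sketch below scans increasing prior mass on a positively correlated, above-bound point and recomputes (4) via posterior_conf_klotz from section 3.2; the hand-picked two-point priors and all values are illustrative assumptions, not the worst-case priors of Table 2.

```python
eps, p = 1e-10, 1e-8
n, s, r, u = 10**9, 2, 0, 2                 # two isolated failures over 10^9 miles
t = n - 1 - s - r - u
for phi2 in [0.0, 0.1, 0.3, 0.5]:
    # "independent and good" vs "positively correlated and above the bound"
    points  = [(eps, eps), (2 * p, 0.9)]
    weights = [1.0 - phi2, phi2]
    conf = posterior_conf_klotz(points, weights, s, t, r, u,
                                first_is_failure=False, p=p)
    print(f"phi2 = {phi2:.1f}  ->  confidence in bound: {conf:.3f}")
```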

With Some Consecutive Failures (s > r > 0)
Despite consecutive failures, Figs. 11a and 10a broadly give the same insights. When prior confidence in negative correlations is strong (i.e. φ₁ ⩾ θ), Fig. 11b shows confidence in p is 0 for all n, just like the condition on φ₂ in the r = 0 scenario (see the left column of prior distributions in Fig. 23). While a large φ₂ (i.e. φ₂ ⩾ 1 − θ) gives practically 0 confidence in p in Fig. 11c.
We now summarise the results of the sensitivity analysis:

Insight 4 (results of sensitivity analysis). Confidence in p from "CBI with dependence" is insensitive to changes in φ₁, although confidence is 0 for φ₁ ⩾ θ, r > 0. Confidence in p is sensitive to φ₂. Both the number of failures s and the number of consecutive failures r have a significant effect on confidence in p. When there are no failed executions pₗ is irrelevant, but it has some effect when failures occur. The impact of the engineering goal, ε, is consistent with previous CBI models, e.g. Zhao et al. (2020) - a smaller ε gives greater confidence in p when no failures are observed, but has no significant impact when there are failures.


Discussion

Incorporating Doubts about Model Assumptions into Reliability Assessments
In Bayesian assessments, one should always question the properties of the statistical model (such as independent executions), and whether these properties are appropriate in a given real-world context. Our use of CBI illustrates a general, formal method for incorporating such doubts - about the suitability of any statistical model properties - directly into the assessment. Since the results of CBI are guaranteed to be conservative (see Insight 1), this is a conservative version of a Bayesian approach first proposed by Draper (1995). Draper suggests that if one is uncertain about the suitability of model properties, one should perform the inference with an "expanded" model that weakens the properties in question and has the original model as a special case. In this sense, our choice of the Klotz model is not arbitrary: it is the simplest model we know of that exhibits dependent, stationary Bernoulli trials, and that has "independent executions" as a special case. This approach is incremental; if one is doubtful of the Klotz model, a model expansion of the Klotz model can be used for assessment.
Using this approach in Salako & Zhao (2023), we illustrated how to formally check the impact of one's doubts (about i.i.d. execution outcomes) on reliability claims, where such claims are based on Bayesian inference with operational testing data. The results suggested that, for on-demand systems, using the i.i.d. assumption in assessments could result in extremely optimistic claims, but not always. By weakening the prior knowledge an assessor must justify before commencing operational testing, the current paper extends this previous work to a much wider class of scenarios. For example, there are assessment scenarios for which the i.i.d. assumption supports claims "close to being" conservative. "How close" will depend on the strength of the reliability evidence available (e.g., Fig. 8 shows how accumulating operational evidence can make i.i.d.-based claims less optimistic).

Which Prior Beliefs give Conservative Confidence Bounds?
Only certain prior beliefs about the pfe, X, and the dependence between executions, Λ, will ensure an assessor's posterior confidence in a pfe upper bound p is conservative. The CBI solutions of section 4 make clear what these beliefs are - i.e., these beliefs are encoded in the prior distributions that solve Theorems 2 and 3. Specifically, the beliefs are encoded as those (x, λ) locations in region S that each of these distributions assigns nonzero probability to (e.g., see Figs. 4, 7, 25). Four main factors determine such beliefs: i) the execution outcomes during operational testing (i.e., the successes/failures); ii) prior knowledge, e.g. evidence strongly suggests the executions are positively correlated (i.e., large φ₂), not negatively correlated (i.e., small φ₁); iii) which beliefs about (X, Λ) are least likely to have produced the testing outcomes, if the pfe is less than p; and iv) which beliefs are most likely to have produced the outcomes, if the pfe is larger than p. Here, "least likely" and "most likely" are determined by the Klotz likelihood.
In addition to ensuring an assessor's confidence is conservative, here are two more reasons why such beliefs are important. Firstly, they are consistent with the available evidence, since the beliefs are encoded in prior distributions that are (the limits of sequences of prior distributions that are) consistent with the evidence. So, the assessor cannot "rule out" these beliefs without extra evidence, and the consequences of these beliefs should be taken seriously. Secondly, when reliability evidence is "weak", these beliefs can make operational testing futile: the more one observes successful executions, the more doubtful one becomes about the pfe. Assessors should have enough evidence before embarking on operational testing; we elaborate on these points below.
For example, consider when all of the executions are successful (e.g. Example 1 of section 4). Superficially, this suggests the executions are strongly positively correlated, or the system's pfe is low. However, the assessor can take the more conservative view that these successful executions are evidence the system is not quite good enough. The assessor does this by holding the following beliefs: i) if the pfe is larger than p, then successful tests are most likely if the following two beliefs are true: the executions are "perfectly positively correlated" (i.e. λ = 1), and the pfe is "as small as possible, but no smaller than p". In terms of the Klotz likelihood, these beliefs are encoded as the location (p, 1) in S; ii) if instead, the pfe is smaller than p, then successful tests are least likely if the following two beliefs are true: the executions are as "negatively correlated" as possible, and the pfe is "as big as possible, but no bigger than p". These beliefs are encoded as the location (p, 0).
PKs refine these beliefs, giving the prior distributions shown in Figs. 4 and 24. These beliefs imply that as failure-free executions increase without bound, in order for the assessor to be conservative, their confidence in the bound must diminish (e.g., see the dashed curve in Fig. 6). This is because the increasing number of successful executions makes all other beliefs unlikely, except the beliefs that the executions are "perfectly positively correlated" and the pfe is "as small as possible, but no smaller than p". The Klotz likelihood tends to zero at all of the nonzero-probability locations in Fig. 4, except at the point (p, 1), where the likelihood has the constant value (1 − p) for all n. Failure-free executions eventually undermine confidence.
It is possible that the successful executions are occurring because the pfe is very small. The problem with this possibility is that any very small pfe eventually becomes too unlikely to have produced a sufficiently large number of successful executions. Moreover, there are more pessimistic reasons (not disallowed by the evidence) for runs of successful executions. For example, all of the test inputs may have been "easy" for the software to respond to. This could happen by chance: e.g., the operational environment just happens to be submitting a sequence of easy inputs. Overcoming such problems of chance may require an infeasible amount of testing. Another pessimistic reason for the successful executions could be an error in the test-case generation procedure, which systematically fails to generate inputs that lie in the system's failure region. Or an error in the test oracle, which fails to indicate true failures. Such possibilities are consistent with previous works that show how "favourable" operational evidence can undermine confidence during assessments; e.g., Littlewood & Wright (2007); Salako & Zhao (2023). To make progress with using failure-free evidence, an assessor must use appropriate additional evidence to rule out pessimistic reasons for not observing any failures.
If the system could be fault-free - in modelling terms, the engineering goal ε is zero - then a fault-free system could produce the increasing number of successful executions. This possibility allows the assessor's confidence in the bound p to grow, because the confidence an assessor has in the bound being satisfied is never smaller than their confidence in the system being fault-free. The confidence in the system being fault-free increases as the number of successful executions increases, thus increasing confidence in p too. See Fig. 6, where confidence in p increases from θ to θ∕(θ + (1 − p)φ₂) along the dotted curve. Contrarily, the successes could also be produced by more pessimistic mechanisms, so confidence in these more pessimistic reasons also increases. For example, the dotted curve in Fig. 6, and the prior distribution in Fig. 4b that produced this curve, together imply that this confidence increases from φ₂, since φ₂ ⩽ 1 − θ is assumed. So, the assessor will always be uncertain about whether the pfe is better than p.
So far we have discussed conservative beliefs when only failure-free executions are observed. Ironically, a few failures during testing can help overcome the pessimism we have highlighted. As the number of successful executions increases, any initial failures become evidence that the executions cannot be "perfectly positively correlated". Otherwise, if the executions were this strongly correlated, either no successes or no failures would have occurred! The previously pessimistic belief implied by location (p, 1) - where λ = 1 - must now move to a different pessimistic location where x < λ < 1 (indicating less strong positive correlation). Now, as successful executions increase, it eventually becomes most likely that the successes are being produced by a system with a pfe smaller than p. Consequently, the assessor's confidence in p eventually approaches certainty, although this can require a considerable number of successful executions because of the few failures (e.g. Fig. 8).
Like the "failure-free" scenario, there are also situations where testing is futile when some failures occur.The futility here is even more extreme than before: confidence in the bound is now identically zero, no matter how many more successful tests are observed.For instance, note the zero confidence in Fig. s 10 and 11 from priors such as those in Fig. 23.If the failures during testing are isolated and few, then strong confidence in the executions being positively correlated (i.e.,  2 ⩾ ) undermines confidence in the bound .If the failures are clustered and few, then strong confidence in the executions being negatively correlated (i.e.,  1 ⩾ ) undermines confidence in .In both situations, the assessor's prior confidence in the system being "very reliable" is simply not strong enough to rule out pessimistic causes for the testing outcomes.
The various assessment scenarios are summarised in Fig. 12 (an updated version of Fig. 1). Each path through the figure - starting from the far left at the "n executions" node - gives the operational evidence and the prior confidence (i.e., PK parameter ranges) an assessor could have. The paths containing dashed lines are paths to be wary of; paths that our analyses reveal to be asymptotically futile or of little practical interest. Assessors on such paths should seek stronger evidence of the system being sufficiently reliable prior to commencing testing. Concerning the impact of assuming independent executions, Chen & Mills (1996) observe the following when using classical inference with their Markov model: for a given number of successes/failures during testing, i) positively correlated executions give bigger confidence bounds (i.e., worse values for pfe estimates), compared with using Thayer et al.'s independent-executions model; ii) a "not so big" positive correlation gives confidence bounds that are comparable to those given by Thayer et al.'s model. This is consistent with our findings.

Is assuming Independence always very Optimistic?
Indeed, consider the examples in Fig. 13, where all 10⁵ executions are successful, and the system could be "fault-free". Then, for instance, the 99% confidence bound from univariate CBI (i.e., CBI that assumes independence) is 3.7 × 10⁻⁵ - the smallest p value on the 99% confidence bound (dotted) curve, precisely when φ₂ = 0. All other 99% confidence bounds from this curve - i.e., all 99% confidence bounds from CBI using the Klotz model, with ε = 0 and φ₂ > 0 - are larger. Moreover, as φ₂ increases, the confidence bounds p increase. While, the smaller the assessor's prior confidence in positive correlation (i.e., as φ₂ decreases), the closer the Klotz confidence bound becomes to the confidence bound under independent executions (i.e., the intersection of the curves with the horizontal axes).
If the system cannot be fault-free, CBI with the Klotz model is significantly more conservative. The long-dashed curve gives the 99% confidence bounds when ε = 10⁻⁵. Here, the univariate CBI 99% confidence bound is 4.7 × 10⁻⁵ (at φ₂ = 0) - "4 orders of magnitude" smaller than the 99% confidence bound 0.1 from the Klotz model with φ₂ = 3.5 × 10⁻³.
Interestingly, unlike Chen and Mills's other observation (that a reduction in confidence bounds accompanies an increase in negative correlation), section 5 suggests that confidence bounds from failure-free testing are insensitive to prior confidence in negative correlation, φ₁. In fact, since φ₁ ⩾ θ in Fig. 13, the relevant posterior confidence is given by the prior in Fig. 4b, and doesn't depend on φ₁. "No failures" supports confidence in positive correlation, φ₂, and undermines confidence in negative correlation, φ₁.
So, the plots illustrate how conservative confidence bounds can be very sensitive to confidence in positive correlation - i.e., small changes in φ₂ can result in "orders of magnitude" changes in confidence bounds. If evidence strongly supports the executions being positively correlated, the confidence bounds obtained under an assumption of independence can be quite optimistic. On the other hand, appendix B and Fig. 26 of the supplementary material show how being skeptical about independent executions can be initially optimistic, giving smaller confidence in p than the confidence from univariate CBI; while strongly believing in independence can be conservative initially. As successes accumulate, these roles of "skepticism" and "strong belief" are reversed, with "skepticism" about independence eventually giving conservative confidence and "strong belief" giving optimistic confidence.

Limitations, Generalisations and Future Work
The following Klotz model limitations, first highlighted in Salako & Zhao (2023), remain. Using a relatively simple model of dependent Bernoulli trials - i.e., the Klotz model - we have illustrated how one might account for dependent executions in conservative reliability assessments. Of course, there is scope for studying the implications of more expressive failure models. For instance, many systems experience different types of failure, some of which may be considered benign. Some Markov models that capture this include the models of Csenki (1993); Goseva-Popstojanova & Trivedi (2000); Bondavalli et al. (1999). In contrast, the Klotz model treats all failure types (and all success types) identically, in terms of how likely they are to occur, and the model ignores variations in the impact different failure types have on system stakeholders when they occur.
Another Klotz model limitation is that positive correlation in both of its forms - i.e., whether failures are likely to follow previous failures, or successes to follow previous successes - is captured by the size of parameter Λ relative to X (see Remark 1). A further limitation is that the type of dependence - i.e., whether executions are positively or negatively correlated, or independent - is fixed for the duration of the system's operation. Certainly, there are practical situations where dependence can vary significantly over time; e.g., a change in the system's internal state makes failures much more/less likely. Or dependence can exist between several executions separated in time. Or, the sequence of executions could be halted whenever a failure occurs and the software fixed, before the software is allowed to resume executing - thus altering the faults the software contains and the dependence among execution outcomes. Accounting for such dependence variation requires a failure model that explicitly captures time-dependent correlations. These scenarios justify a weakening of the conditional independence in the Klotz model: in the model, Tᵢ is conditionally independent of Tᵢ₋₂, Tᵢ₋₃, … , T₁ given Tᵢ₋₁. In future work, it will be interesting to consider longer dependence structures (over several "time steps" into the past) - e.g., Tᵢ being dependent on the last k − 1 execution outcomes. By applying the general conservative approach illustrated in this paper, an assessor can check the robustness of assessment claims based on models with more general dependence structures.
The Klotz model is a 1st-order stationary stochastic process (see Klotz (1973); Salako & Zhao (2023)). pfes used in reliability assessment make sense when the failure process is stationary, because then the probability of the system failing its i-th execution is the same for all i, and it equals the pfe. This is despite the conditional probability of failing the very next execution depending on, say, the success/failure of the last execution. Consequently, upper confidence bounds on such pfes are useful measures of reliability in those practical scenarios characterised by a stationary failure process. But when failure probabilities are time-dependent, one should forego using pfes in assessment claims and opt for more suitable reliability measures, such as the probability of failure-free operation in the future (see Strigini (1996)).
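This stationarity property is easy to verify numerically. The sketch below assumes the Klotz parameterisation p = P(failure), λ = P(failure | previous failure), with P(failure | previous success) = p(1 − λ)/(1 − p) implied by stationarity; started from the stationary distribution, the marginal failure probability remains p at every execution, even though the chain is positively correlated.

```python
import numpy as np

p, lam = 0.01, 0.6            # illustrative values; lam > p (positive correlation)
mu = p * (1 - lam) / (1 - p)  # P(failure | previous success), implied by stationarity
T = np.array([[1 - mu, mu],   # state 0 = success, state 1 = failure
              [1 - lam, lam]])
dist = np.array([1 - p, p])   # stationary initial distribution
for i in range(1, 6):
    print(i, dist[1])         # marginal failure probability: p at every step
    dist = dist @ T
```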
Even with a 1st-order stationary model, it is still worth studying the impact of the independence assumption on reliability measures like the probability of future failure-free operation. Previous CBI studies have shown that an assessor's justifications for a conservative claim are often different for different measures – even when the justifications are ultimately based on the same PKs. Also, some measures may be more sensitive to changes in the PKs than others.

Concluding Remarks
Statistically independent software executions are often assumed when assessing software reliability. If inappropriate, this assumption can result in (dangerously) optimistic reliability claims. By formalising informal notions of "doubting" the independence assumption, and by employing conservative Bayesian methods, this work demonstrates how such doubts can be accounted for in assessments.
This paper contains analyses of various assessment scenarios. This involved the constrained mathematical optimisation of an assessor's confidence in an upper bound on the probability of failure per execution (pfe), after observing the system in operation. The work highlights a number of practical considerations. For example, a system exhibiting no failures during operation can give less confidence in a pfe bound than if the system had exhibited some failures. Or, confidence can be very sensitive to failures: each additional failure means significantly more failure-free operation is needed for confidence to grow.
The scope of the results makes clear that a nuanced answer is required to the question of whether assuming independence undermines assessments. The answer depends, often sensitively, on various factors outlined in the paper: sometimes the independence assumption has "little to no" impact on conservatism, and sometimes the impact is simply too great to ignore. A "case-by-case" approach to estimating this impact in practice is advised, and the methods and many solutions in this paper provide assessors/practitioners with the means to do this.
A. Proof of Theorem 4
We prove the following theorem.
Theorem 4. Let D be the set of all prior distributions over the region R, and let 0 ⩽ p_l < b < 1/2 (see Fig. 14a). The optimisation problem

inf over D of P(pfe < b | outcomes of n executions), s.t. PK1, PK2, PK3, PK4,

is solved by prior distributions such as those in Figs. 21–24.
Proof. The optimisation constraints can be used to partition R into 7 disjoint subsets (with one of the subsets being the 45° diagonal). Each prior distribution in D must assign 7 probabilities {ω_i}, i = 1, …, 7, to these subsets, in such a way as to satisfy the constraints of the optimisation problem (see Fig. 14a).
The proof progresses in 6 stages:
1. restrict the optimisation from D to its subset D′ of discrete prior distributions. An arbitrary discrete prior assigns its probabilities ω_i to 7 arbitrary points {(p_i, λ_i)}, i = 1, …, 7, within R. Hence, the objective function becomes a rational function of the p_i's, λ_i's and ω_i's;
2. show that the gradient of this objective function is determined by the gradient of the Klotz model likelihood;
3. show the likelihood is unimodal along vertical and horizontal lines in R, as well as along the 45° diagonal;
4. show that the likelihood is also unimodal over all of R, attaining its maximum either at a stationary point in the interior of R, or along the boundary of R;
5. conclude from the previous stages that, starting from any prior in D′ with its assigned probabilities ω_i, we can construct a new prior that gives a smaller value for the objective function: use the gradient of the likelihood to determine new locations within each of the 7 subsets of R, and reassign the ω_i's to these new locations. The new prior can, in turn, have its probabilities reassigned to new points (and so on, indefinitely). In the limit, depending on the constraints and the exponents in the likelihood function, the sequences of points obtained by successive reassignments converge to limit points 17 in each subset. That is, the objective function values converge in a monotonically decreasing manner, as the sequence of "reassigned" priors converges to a limiting distribution with support at no more than 7 limit points;
6. finally, determine the values of the ω_i's that a limiting distribution should assign to the limit points, to ensure that the related sequence of objective function values converges to the infimum. Determining these worst-case ω_i's is a constrained linear-fractional programming problem. One may solve it either numerically, or by a logical allocation of probability masses to the relevant limit points in R; for the CBI solutions in this paper, we use the latter approach. The resulting limiting distributions (illustrated in Figs. 21–24) are worst-case prior distributions – so-called because P(pfe < b | outcomes of n executions) for these distributions equals the infimum we seek.
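For readers who prefer the numerical route in stage 6, the sketch below (with invented limit points and illustrative values for b, θ, θ1, θ2 and n) solves the worst-case allocation as a linear-fractional program, using the Charnes–Cooper substitution to reduce it to a linear program.

```python
import numpy as np
from scipy.optimize import linprog

def prob_no_failures(p, lam, n):
    """Klotz-model probability of n failure-free executions."""
    if p == 0.0:
        return 1.0
    mu = p * (1.0 - lam) / (1.0 - p)  # P(failure | previous success)
    return (1.0 - p) * (1.0 - mu) ** (n - 1)

b, theta, theta1, theta2, n = 1e-3, 0.9, 0.95, 0.01, 1000
# invented limit points (p, lam); the paper derives the true ones from the likelihood's gradient
pts = [(b, 0.0), (b / 2, 0.0), (b, 1.0),
       (b + 1e-6, 1.0), (b + 1e-6, b / 2), (b + 1e-6, b + 1e-6)]
L = np.array([prob_no_failures(p, lam, n) for p, lam in pts])

le_b  = np.array([1., 1., 1., 0., 0., 0.])  # indicator: p_i <= b
below = np.array([1., 1., 0., 0., 1., 0.])  # indicator: lam_i < p_i (negative correlation)
above = np.array([0., 0., 1., 1., 0., 0.])  # indicator: lam_i > p_i (positive correlation)

# Charnes-Cooper: variables y = t*w (6 values) and t; minimise the numerator
c = np.concatenate([le_b * L, [0.0]])
A_eq = np.vstack([
    np.concatenate([L,     [0.0]]),       # denominator normalised to 1
    np.concatenate([le_b,  [-theta]]),    # P(pfe <= b) = theta
    np.concatenate([below, [-theta1]]),   # mass below the diagonal = theta1
    np.concatenate([above, [-theta2]]),   # mass above the diagonal = theta2
    np.concatenate([np.ones(6), [-1.0]])  # total mass = 1
])
b_eq = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 7)
w = res.x[:6] / res.x[6]  # worst-case probability masses at the limit points
print("worst-case posterior confidence:", res.fun)
```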
Let us proceed with the proof. stage 1) The objective function is, by definition, the posterior confidence P(pfe < b | outcomes of n executions), an expectation with respect to the prior. However, the set D can be restricted to the subset D′ of discrete joint distributions – i.e., to those distributions that assign their ω_i's to single points within each subset of R (e.g. see Fig. 14b); see Moreno & Cano (1991). So, for any prior in D′ with support points {(p_i, λ_i)}, the objective function of the optimisation becomes

Σ_{i : p_i < b} ω_i L(p_i, λ_i) / Σ_{i=1}^{7} ω_i L(p_i, λ_i),

where L(p, λ) denotes the Klotz likelihood of the observed execution outcomes. Consequently, the objective function is a rational function of the p_i's, λ_i's and ω_i's.
17 Definition: for a given topology (e.g., the "open balls" topology associated with 2D Euclidean space), a limit point of a subset of the plane is a point that is arbitrarily well-approximated by sequences of points within the subset (see Rudin (1976); Copson (1968)).
stage 2) Consider how this objective function changes when a support point is moved along a vertical line in the subset of R where p ⩽ 1/2. The objective is a rational function of that point's λ-coordinate, so it is smooth (except where its denominator is 0), and its rate of change with respect to λ has the same sign as the rate of change of the Klotz likelihood when p_i < b, and the opposite sign when p_i > b. Consequently, the sign of this derivative indicates how to move the locations of the ω_i's along vertical lines in each subset of R, in order to minimise the objective. The following argument shows how the signs of such derivatives are determined by the gradient of the Klotz likelihood and, thus, where the ω_i's should be allocated to minimise the objective along a given vertical line.
stage 3) Along any vertical line in R where p ⩽ 1/2 is satisfied, the Klotz likelihood is a non-negative unimodal function of λ. To see this, note that setting the likelihood's λ-derivative to 0 gives non-trivial solutions at the λ values where two quadratic functions of λ intersect – that is, solutions to (12). An illustration of these two functions is given in Fig. 15. One function has two roots of opposite sign and a maximum, while the other has a root at λ = 0 and a minimum. This means at least one solution to (12) cannot lie within R – it must be non-positive. The other solution must be positive, and it represents a maximum turning point: the l.h.s. of (12) is bigger than the r.h.s. for λ values slightly smaller than the positive solution, and the l.h.s. is smaller than the r.h.s. for all λ values bigger than the positive solution.
Thus, as λ grows from 0 to 1 along any vertical line in R (where p ⩽ 1/2), there is (at most) one stationary point, at which the likelihood is maximised. The likelihood is monotonic on either side of this maximum along the vertical line.
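This unimodality claim can be spot-checked numerically. The sketch below writes the Klotz likelihood in terms of a sequence's transition counts (denoted r, s, u and v here – our labels, not necessarily the paper's notation, with invented values) and scans the likelihood along a vertical line to count interior local maxima.

```python
import numpy as np

def klotz_lik(p, lam, r, s, u, v, first_is_failure=True):
    """Likelihood of a sequence with r failure->failure, s failure->success,
    u success->failure and v success->success transitions."""
    mu = p * (1 - lam) / (1 - p)  # P(failure | previous success)
    first = p if first_is_failure else 1 - p
    return first * lam**r * (1 - lam)**s * mu**u * (1 - mu)**v

p = 0.3                           # a vertical line with p <= 1/2
lams = np.linspace(0.001, 0.999, 999)
vals = klotz_lik(p, lams, r=2, s=3, u=3, v=50)
interior_maxima = np.sum((vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:]))
print("interior local maxima:", interior_maxima)   # expect 1
```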
The likelihood is also unimodal along the 45° diagonal (i.e., when λ = p), because it has only one non-trivial stationary point 19 there, and this must be a maximum since the likelihood is non-negative with value 0 at the endpoints of the diagonal.
Analogously, the likelihood is unimodal along any horizontal line within R, since it has (at most) one stationary point at which it attains a maximum. The stationary point non-trivially solves the equation obtained by setting the likelihood's p-derivative to 0. Equivalently, the stationary point satisfies the leftmost intersection between a straight line and a quadratic function of p (see Fig. 16). This leftmost intersection must occur to the left of p = 1/(2 − λ), ensuring that the stationary point lies in R. The fact that the line lies above the quadratic before this intersection, and then below the quadratic immediately after, ensures that, as p increases from 0, the p-derivative of the likelihood transitions from being positive to being negative. That is, the stationary point is a maximum.
stage 4) Finally, the likelihood either has a single stationary point within R, at which it attains its maximum value over R, or it attains its maximum value over R on the boundary of R. 18
18 These statements exclude the unimportant edge cases where some exponent in the likelihood function equals 0 or 1.
19 The location of this stationary point depends on the exponents in the likelihood function, and on whether executions begin with a failure or a success.
If the single stationary point lies within R then it must be a maximum, because the stationary curves in Fig. 17 imply that, from any point along the boundary of R, we can always move away from that point along an appropriate path within R that increases the likelihood's value.
stage 5) Stages 1–4 of this proof demonstrate the existence and uniqueness of locations in R that are local or global maxima of the likelihood, as exemplified in Fig. 18. For the region p ⩽ b in R, the "further away" from the maxima the locations to which a prior assigns its probabilities, the smaller the objective function; for p > b, the "closer" the nonzero-probability locations are to the maxima, the smaller the objective function. Here, "further away" and "closer" are in terms of the Klotz likelihood's gradients. That is, given any prior in D′, we can reassign the probabilities ω_i that it allocates to points in R, to new points suggested by the likelihood's gradients – resulting in a new prior with a smaller objective function value. Such reassignments can be carried out indefinitely, creating a sequence of priors with an associated, monotonically decreasing sequence of objective function values. The completeness of the real numbers guarantees that this sequence of objective function values converges 20. Being discrete distributions, it is also clear that the reassigned priors themselves converge to some limiting discrete distribution. Examples of limiting distributions obtained in this manner are illustrated in Fig. 20; the points indicated by black dots in each subfigure are the limits of the sequences of reassigned points – so-called limit points.

stage 6) So, the limiting distributions assign probabilities only to certain limit points of the 7 subsets of R. The exact values of these probabilities will depend on which initial prior (with its probabilities ω_i) was chosen to start the sequence of reassignments. To determine the values of the ω_i's that ensure the sequence of objective function values converges to the infimum, one can systematically allocate probability masses to the limit points. We now illustrate this, and show how the priors (when p_l = 0) in Fig. 24 are obtained. For example, consider the limit points in Fig. 20b, for the case when θ1 ⩾ θ and no failures are observed. Focus on the subset p ⩽ b, and recall the requirement P(pfe ⩽ b) = θ. To be pessimistic, we must allocate probability θ to those limit points within this subset at which the likelihood is smallest; this is the limit point (b, 0). Since the prior confidence in negatively correlated executions satisfies θ1 ⩾ θ, all of the θ probability can be allocated to this limit point "from below" the 45° diagonal and "from the left" of the line p = b (see Fig. 19a). Consequently, because P(pfe ⩽ b) = θ, no more probability can be allocated to any other limit points in p ⩽ b.
Now we need to assign probability 1 − θ to the remaining limit points in R. There are two alternative limit points above the diagonal where we may assign the θ2 probability. Assigning to the point (b, 1) gives more pessimistic results than assigning to (b, b). We can see this by sharing the θ2 probability between the two points, and noting that the objective function monotonically decreases as the amount of θ2 allocated to (b, 1) increases. In effect, all of θ2 should be allocated to any sequence of points that approximates (b, 1) arbitrarily well, "from the right" of the line p = b. This justifies Fig. 19b. Similar reasoning shows that allocating probability θ1 − θ to the point (b, b), "from the right" of p = b, is more pessimistic than allocating it to (b, 0) "from the left"; thus justifying Fig. 19c. Note that these allocations are possible, and do not violate the constraints, because the allocated masses still sum to the required θ, θ1 and θ2 totals. Finally, using similar "approximation"-based reasoning to how θ2 and θ1 − θ were allocated, we must assign the remaining probability 1 − θ1 − θ2 to the point (b, b), via any sequence of points that approximates (b, b) arbitrarily well "from the right", along the diagonal (see Fig. 19d).
Note that probabilities were assigned to limit points that lie along the line p = b, but only by assigning the probabilities to points that approximate these limit points "from the right" of the line p = b. Consequently, our final limiting distribution gives the value of the infimum in the optimisation problem, but only by computing "P(pfe < b | …)" for this distribution, and not by computing "P(pfe ⩽ b | …)".
Using similar arguments to allocate probabilities to limit points, all of the remaining worst-case distributions in Figs. 21–24 are constructed from limit points analogous to those in Fig. 20. ■
Remarks: with very few modifications, the foregoing arguments can be used to derive worst-case prior distributions subject to the additional constraint of PK1, i.e. P(pfe ⩾ p_l) = 1. Such priors solve the optimisation problem in Theorem 3. Indeed, after observing n executions of a system (which include some consecutive failed executions), assessments based on independent executions are conservative initially – the univariate CBI curve initially overlaps with the CBI curve for dependent executions (Fig. 6). But posterior confidence based on independence becomes increasingly optimistic after a large number of tests; that is, the curves begin to deviate significantly as the number of tests increases. So, only assessments that allow for doubt in independence – i.e. nonzero θ1 or θ2 – can support (in the long run) more pessimistic claims about the bound b. Once some doubt in independence has been expressed, an assessor might want to allow for operational evidence to "slowly" allay such doubts. Or, instead, allow for the evidence to "quickly" convince them otherwise – that independence does not hold! PK5 represents an assessor who is initially very confident that the system will exhibit independent, failure-free executions.
Prior Knowledge 5 (strong belief in independence). The probability P(n executions will be independent and failure-free), from the joint prior distribution of (pfe, Λ), has a value that is the solution to the optimisation problem: sup over D of P(n executions will be independent and failure-free)
Prior Knowledge 6 (weak belief in independence). The probability P(n executions will be independent and failure-free), from the joint prior distribution of (pfe, Λ), has a value that is the solution to the optimisation problem: inf over D of P(n executions will be independent and failure-free)

s.t. PK1, PK2, PK3, PK4
Note the difference between the optimisation problems in these PKs and those in CBI Theorems 1, 2 and 3. Here, the optimisations constrain the prior distribution in how it assigns probability mass – hence why these are PKs – while the previous optimisations are constrained by the prior distributions; specifically, constrained by the PKs the priors must satisfy. Consider all prior distributions with the largest prior probability of observing n independent executions with no failures. The most pessimistic posterior confidence in b from these priors is given by the prior in Fig. 25a. Similarly, for all priors with the smallest prior probability of n independent failure-free executions, Fig. 25b gives the most pessimistic posterior confidence.
Theorems 5 and 6 below give the pessimistic posterior confidence in b, for assessors who hold PK5 or PK6 beliefs, respectively. Proved in B.2, some prior distributions that give the pessimistic posterior confidence in these theorems are shown in Fig. 25, and the confidence from these priors, as well as from the priors in Theorems 1 and 2, is compared in Fig. 26. Fig. 26 clearly shows that the confidence from the very skeptical "PK6"-believing assessor is initially the most optimistic (i.e. the widely spaced dotted curve lies above all of the other curves) – noticeably more optimistic than even the confidence based on independent executions (the solid curve). This is in contrast to the assessor who holds the strong PK5 belief in independence. Such beliefs actually support conservative claims initially, even as claims based on independence start becoming optimistic – as suggested by the overlap of the dashed curve and the narrowly spaced dotted curve, where the solid curve lies above both of them. Eventually, however, the roles are reversed: PK5-supported claims become optimistic, while PK6-supported claims become conservative and agree with the dashed curve. In this sense, "strongly believing in" and "being skeptical of" the independence assumption are two halves of "conservatively doubting" the independence assumption. This is the behaviour for the range of PK parameter values in Fig. 26.
Why does PK5 initially support less optimistic claims than PK6, and then less pessimistic claims as the number of successes n rises? It has to do with which prior beliefs about (pfe, Λ) – i.e. which locations in R – are crucial for confidence in the system's executions being independent. For the parameter values in this example, being very skeptical about independent executions (i.e. the prior in Fig. 25b) gives the most optimistic posterior confidence in the bound b shown here, while strong belief in independent executions (i.e. the prior in Fig. 25a) almost results in the most pessimistic confidence in b, at least for n < 4 × 10^4 approximately.
To see why, recall what is needed for conservative confidence in b. There are two principal beliefs a conservative assessor must hold: i) strong doubts about failure-free operation being evidence of a "sufficiently reliable" system – i.e., being evidence of a system with a pfe just smaller than b; ii) strong confidence in failure-free operation being evidence of an "almost sufficiently reliable" system – i.e., being evidence of a system with a pfe slightly worse than b that exhibits perfectly positively correlated executions. With a PK5 belief, failure-free operation initially supports confidence in an "almost sufficiently reliable" system (hence, initially conservative confidence in b). But, eventually, an increasing number of successes could also be due to the system being fault-free (because there is nonzero probability at (p_l, p_l) in Fig. 25a and p_l = 0 in Fig. 26); so, the dotted curve reaches a horizontal asymptote. It is the reverse with a PK6 belief, where an "almost sufficiently reliable" system is initially very unlikely (hence, initially optimistic confidence in b), but becomes arbitrarily more likely as n grows (hence, eventually conservative confidence in b).
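The mechanism behind this asymptote can be reproduced with a small sketch (the point masses below are invented, with p_l = 0): a PK5-like prior places its independence mass at the fault-free point (0, 0), and this mass competes with the "almost sufficiently reliable", perfectly positively correlated point just beyond b, so the posterior confidence levels off at a horizontal asymptote instead of tending to 0 or 1.

```python
def prob_no_failures(p, lam, n):
    """Klotz-model probability of n failure-free executions."""
    if p == 0.0:
        return 1.0
    mu = p * (1.0 - lam) / (1.0 - p)  # P(failure | previous success)
    return (1.0 - p) * (1.0 - mu) ** (n - 1)

def conf(prior, b, n):
    """P(pfe <= b | n failure-free executions) for a discrete prior [(p, lam, mass), ...]."""
    post = [(p, m * prob_no_failures(p, lam, n)) for p, lam, m in prior]
    return sum(w for p, w in post if p <= b) / sum(w for _, w in post)

b = 1e-3
pk5_like = [(b, 0.0, 0.59),         # "sufficiently reliable" but negatively correlated
            (0.0, 0.0, 0.40),       # fault-free and independent (p_l = 0)
            (b + 1e-6, 1.0, 0.01)]  # "almost sufficiently reliable", perfectly positively correlated
for n in (10**3, 10**5, 10**7):
    print(n, conf(pk5_like, b, n))  # settles at a horizontal asymptote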

B.2. Proof of Theorem 5
We derive the prior distribution in Fig. 25a that solves the optimisation problem in Theorem 5; analogous steps can be used to derive the prior in Fig. 25b, which solves Theorem 6.
Proof. For the prior distributions that solve PK5, the probability P(n independent failure-free executions) is made as big as possible (for all n ⩾ 0) by assigning as much probability mass as possible to locations along the diagonal in R where the Klotz likelihood is largest. For failure-free executions, the likelihood is largest at (p_l, p_l) and monotonically decreases along the diagonal as p increases; so, the mass on the diagonal should either be placed at (p_l, p_l) itself, or at diagonal locations arbitrarily close to it, subject to the remaining constraints.
Consequently, the priors that solve PK5 – i.e., the feasible priors in Theorem 5 – must allocate probability along the diagonal in one of the two ways just outlined. In particular, from among those priors that allocate all of the 1 − θ1 − θ2 mass to the point (p_l, p_l), the methods of A justify the prior in Fig. 25a as a solution of Theorem 5. ■

Figure 1: Various assessment scenarios and the sections of this paper that treat them. Each path through the diagram, starting at the "n executions" node on the left, indicates a system's behaviour (during operational testing) and the implications of reliability evidence considered by an assessor (before operational testing).

Figure 4: The support R of the Klotz likelihood function, and the subsets of R related to PKs 1, 2, 3 and 4, are shown in subfig. 4a. Upon observing executions with no failures, subfigs. 4b, 4c and 4d show 3 prior distributions that solve the optimisation problem in Theorem 2. These priors are relevant for the ranges of parameter values indicated in each subfigure.
* An arbitrary Beta distribution satisfying the PKs. ** This curve is a piecewise function – i.e. it is the confidence from the listed priors, in sequence, as n increases. The precise values of n at which the curve switches between confidence from different priors depend on the execution outcomes (see the proof in the supplementary material, A).

Figure 7: Some examples of worst-case priors that give nonzero infima for the optimisation in Theorem 3; please see Figs. 21, 22, 23 of the supplementary material for all of the priors. When the software executes with some isolated and some consecutive failures, the worst-case priors can look like those in subfigs. 7a and 7b. And, if the executions contain some isolated failures but no consecutive failures, worst-case priors can look like subfigs. 7c and 7d instead. The exact locations of the support of these distributions (i.e. the black dots) depend on the values of the exponents in the likelihood function, and on whether the 1st execution is a failure or not.

Figure 10: Sensitivity analysis varying PKs for the AV-safety scenario with no consecutive failures.

Figure 12: An updated overview of the various assessment scenarios analysed in this paper (cf. Fig. 1), indicating the possible testing outcomes and the prior confidence an assessor could have. The dashed paths are scenarios where either testing is futile in supporting reliability claims, or the scenarios are of little practical interest.

Figure 13: The relationship between the prior confidence θ2 in positively correlated executions and the (1 − α) × 100% confidence bound b, when the system is subjected to 10^5 tests without failure. The curves are obtained from the posterior confidence given by the prior in Fig. 4b, and are plotted for α = 0.01, 0.03. The smaller θ2 becomes, the smaller the pfe upper bound b that can be supported at a given level of confidence.

Figure 14: The optimisation problem of Theorem 4 is solved by prior distributions such as those in Figs. 21–24, since P(pfe < b | outcomes of n executions) from these priors (for p_l = 0) equals the infimum.

Figure 15: Two illustrations of two quadratic functions of λ having, at most, one intersection over the range 0 < λ < 1. For fixed p, this geometric fact implies the Klotz likelihood is unimodal along any vertical line in R with p ⩽ 1/2.

Figure 17: Two illustrations of pairs of curves intersecting, at most, once over the range 0 < p < 1, 0 < λ < 1. This geometric fact implies the Klotz likelihood is unimodal over R, with at most one stationary point in the interior of R.

Figure 18: Examples of locations in R (indicated by grey circles) at which local and global maxima of the Klotz likelihood occur. Here, the exponents in the likelihood function are all at least 1.

Figure 19: A systematic allocation of probabilities to limit points, demonstrating how the prior in Fig. 24a is obtained from the limit points in Fig. 20b.

Figure 20: Examples of 3 limiting distributions for sequences of prior distributions (in D′) that give progressively smaller posterior confidence in the failure-rate bound b. These distributions must allocate mass only at certain limit points of each subset of R, as indicated by the black circles. Some relevant stationary points in R are also indicated as grey circles.

Figure 21: Worst-case prior distributions that solve the optimisation problem in Theorem 3 when consecutive failures are observed. It is important to note that the precise locations of the "black dots" for each such distribution are determined by 1) the values of the exponents in the likelihood function, 2) whether the first execution is a success or a failure, and 3) the indicated parameter ranges in each subfigure. The location (p*, λ*) of the global maximum of the Klotz likelihood is indicated by the grey circle. The 4 priors illustrated in subfigures 21a, 21c, 21e and 21g are solutions when θ2 ⩾ 1 − θ, while the priors in 21b, 21d, 21f and 21h solve the problem when θ2 ⩽ 1 − θ and θ1 ⩽ θ. These solutions assume all of the likelihood's exponents are positive.

Figure 23: Worst-case prior distributions that solve the optimisation problem in Theorem 3 when failures are observed, giving 0 posterior confidence. Each distribution's support is determined by the exponents in the likelihood function and by whether the first execution succeeds or fails. The location (p*, λ*) of the global maximum of the Klotz likelihood is indicated by the grey circle. The 4 priors illustrated in subfigures 23a, 23c, 23e and 23g are solutions when θ1 ⩾ θ, while the priors in 23b, 23d, 23f and 23h solve the problem when θ2 ⩾ θ. These solutions assume all of the likelihood's exponents are positive.

Figure 24: For executions with no failures, these worst-case prior distributions solve the optimisation problem in Theorem 4. These priors are relevant for the ranges of parameter values indicated in each subfigure. The support of each distribution is determined by n.

Figure 25: Prior distributions representing extreme beliefs about whether the executions will be independent (i.e. whether pfe = Λ), for the θ1, θ2, θ ranges indicated.

Figure 26: A comparison of posterior confidence in the bound b after operational testing, showing the impact of some confidence in the executions being independent.

Figure 27: A similar comparison to that of Fig. 26, but with the added constraint that there is a probability of 0.7 that the system contains no faults (i.e., pfe = 0). Now, a strong belief in independent executions gives the most pessimistic posterior confidence in b – i.e. the dashed "optimism in independence" and the dotted "CBI with dependence" curves are now indistinguishable.