Selection of representative natural hazard scenarios for engineering systems

Representative hazard scenarios are essential for many tasks in risk management, such as preparedness and emergency response planning. However, criteria and methods for systematically selecting such scenarios for natural hazards are lacking. From a risk perspective, such scenarios should be selected considering the losses they cause. Hence, we propose to define a scenario that is representative for a certain degree of loss, for example, the 100‐year loss, as the most likely one among all possible scenarios leading to this loss. Building on a generic model of natural hazards and their impact on engineering systems, we formally introduce representative scenarios. We then develop algorithms that enable an efficient evaluation of these scenarios. The method and algorithms are demonstrated on a hypothetical example considering a spatially distributed infrastructure system subjected to earthquakes.


INTRODUCTION
Natural hazards pose significant risks to engineering systems in general and infrastructure systems in particular. Past infrastructure failures caused by storms, floods and seismic events, or combinations thereof, attest to this fact, for example during Hurricane Katrina in 2005 1,2 or the 2010 Chile earthquake and tsunami. [3][4][5] To limit the impact of these failures, authorities and utility operators aim at effective risk management. 6 In this context, risk managers commonly work with representative scenarios to assess the risk [7][8][9] and resilience, 10 to test and validate risk management procedures, [11][12][13] and for effective risk communication. 6,14 Such scenarios are usually selected based on expert knowledge 15 and past events. 16 Surprisingly, there has been little research on how to systematically identify representative scenarios for risk management in general, and for natural hazard events in particular. Miller and Baker 8 propose to identify a limited set of hazard and damage scenarios that best approximate the loss exceedance curve, with application to seismic hazards. Romero et al. 17 develop an optimization approach for selecting seismic hazard and damage scenarios. Seismic hazard scenarios are selected as those that best approximate the annual exceedance probability (AEP) curve of the PGA, based on Vaziri et al. 18 Damage scenarios are selected as those that best approximate the component vulnerabilities, following Brown et al. 19 All these approaches aim at an accurate risk estimation with only a limited set of (computationally expensive) evaluations of the full hazard and damage models.
Salgado-Gálvez et al., 9 in the context of probabilistic seismic hazard analysis (PSHA), select a single scenario 'based on choosing the loss corresponding to a return period of 1000 years from the fully probabilistic analysis as the target loss, and then, a single event from the complete stochastic set which caused a similar value of loss'.
Berk et al. 20 propose a method to identify representative rainfall events for flood analysis. Their approach is based on inverse FORM (first-order reliability method), 21 which identifies the hazard event that is the most likely one among all possible events leading to a demand of a certain return period.
In this contribution, we propose a risk-oriented definition of representative hazard scenarios, which takes up ideas from these previous approaches. In line with Miller and Baker 8 and Salgado-Gálvez et al., 9 we identify hazard scenarios that are associated with a given loss return period; for example, the seismic event that is representative for the loss with a 100-year return period. In contrast to these approaches, we define the representative scenario as the most likely one to lead to such a loss. This definition is inspired by the inverse FORM approach, which, however, does not consider losses but failure events and assumes that the system performance is deterministic for given parameters. Our proposed definition addresses the uncertainty in the system response for given hazard parameters. This leads to additional computational challenges in evaluating these scenarios. Therefore, we also develop efficient algorithms to estimate the scenarios.
To enable a systematic and generic definition of representative hazard scenarios, Section 2 presents a general framework for risk assessment of (spatially distributed) engineering systems subject to natural hazards, which is inspired by the PEER framework. 22,23 On this basis, we introduce and discuss the definition of the representative hazard scenarios in Section 3. In Section 4, we propose a workflow for identifying such scenarios based on a combination of surrogate models with active learning (AL) techniques. In Section 5, we investigate and demonstrate the methodology on an idealized example of a power network subject to earthquakes. We end this contribution in Section 6 with a discussion on the limitations and possible extensions of the proposed definitions and methods.

General framework
Risk analysis of infrastructure systems subject to natural hazards requires the combination of models from multiple domains, including models of hazard occurrence, hazard propagation, response of system components, overall system performance and losses. In order to identify hazard scenarios that are representative for losses with a certain return period, it is necessary to combine the different models. In this section, we present a framework for combining these models as a basis for identifying representative hazard scenarios.
In the context of earthquake engineering, PSHA is a methodology for identifying ground motions with a specified exceedance probability. First formulated by Cornell, 24 it is a widely used framework for hazard analysis of engineering structures under seismic hazards. [25][26][27][28] On this basis, the Pacific Earthquake Engineering Research Center (PEER) formulated a performance- or risk-based framework for seismic hazard, which systematically integrates the different model components to assess the seismic risk. 22,23,28,29 For other types of hazards, risk analysis is also based on a similar combination of models. Examples found in the literature include wind storms, 30,31 floods 32,33 and volcanoes. 34,35 In Figure 1, we present a generic framework for risk analysis of infrastructure systems under natural hazards. It is inspired by the PEER framework, yet it is compatible with modelling approaches used for different natural hazards. It is applicable to different types of engineering systems and infrastructure; in particular, it allows consideration of the spatially distributed nature of most infrastructure systems. The framework establishes the connection between the probabilistic model of the hazard occurrence (the hazard rate ν and the hazard parameters Θ) and the target variable (the loss measure L).
The framework consists of a hazard occurrence analysis described in Section 2.2, a hazard propagation analysis (Section 2.3), the component fragility and damage analysis (Section 2.4) and the system response and loss analysis (Section 2.5). The models resulting from these analyses are combined in the risk analysis summarized in Section 2.6. Since the interest is ultimately in the relation between the hazard parameters Θ and the loss L, Section 2.7 discusses the sensitivity of the losses to the uncertainty in Θ.

F I G U R E 1 General framework for risk assessment of spatially distributed engineering systems under natural hazards.

Hazard occurrence analysis
The occurrence of natural hazard events can be described by random processes. For most hazards, the occurrence of extreme events is well described by a stationary Poisson process, or more generally a point process. [36][37][38] We denote the mean rate of occurrence of hazard events by ν. Each hazard event is described by a set of parameters Θ = [Θ_1; Θ_2; ⋯; Θ_n]. For instance, parameters describing a seismic hazard are earthquake magnitude, location and slip type (among others); similarly, for flood hazards, example parameters are rainfall intensity and duration. The selection of these parameters is application-specific and might depend on the model and data availability. For example, for floods, the hazard parameters could also be the maximum discharge instead of rainfall parameters. The hazard parameters are modelled as random variables with joint probability density function (PDF) f_Θ(θ), conditional on the occurrence of a hazard event.
In some cases, the analysis includes only a single hazard parameter Θ; for example, in a fluvial flood risk assessment, the hazard might be characterized only by the maximum discharge. 39,40 In this case, one can find the hazard parameter corresponding to a return period T by

θ_T = F_Θ^(−1)(1 − 1/(νT)), (1)

where F_Θ^(−1) is the inverse cumulative distribution function (CDF) of Θ. If the relation between hazard parameter and resulting losses is deterministic and monotonically increasing, then θ_T is also the hazard leading to the losses with return period T. This corresponds to an AEP neutrality between the losses and the hazard parameter. 41 In general, however, the hazard parameters form a vector and the losses are not deterministic for given θ. As a consequence, AEP neutrality does not hold and Equation (1) is not applicable. This motivates the criterion and methods for identifying representative hazard scenarios that we propose in Section 3.
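As an illustration of Equation (1), the following sketch computes the 100-year value of a single hazard parameter, here an annual-maximum discharge with a purely illustrative lognormal distribution (the rate and distribution parameters are assumptions, not values from this paper):

```python
from scipy import stats

# Hypothetical single-parameter hazard: maximum discharge (m^3/s),
# modelled here (for illustration only) as lognormal.
nu = 1.0          # mean rate of hazard events per year (assumption)
T = 100.0         # target return period in years
discharge = stats.lognorm(s=0.5, scale=800.0)  # assumed distribution of Theta

# Equation (1): theta_T = F_Theta^{-1}(1 - 1/(nu*T))
theta_T = discharge.ppf(1.0 - 1.0 / (nu * T))
print(f"100-year discharge: {theta_T:.1f} m^3/s")
```

With AEP neutrality, this single value would directly define the representative scenario; the rest of the paper addresses the general case where it does not.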

Hazard propagation analysis
To assess the performance of the engineering system, the impact of the hazard at the locations x_i of the system components must be determined, for i = 1, …, K. The impact is quantified through one or more intensity measures IM_i at each of the locations x_i. Examples of intensity measures are the PGA in seismic hazard analysis or the inundation depth in flood risk analysis.
For given hazard parameters, the intensity measure at a location is evaluated through a model. One example of such a model is the so-called ground motion prediction equation (GMPE) that predicts the PGA for a given seismic event. We represent such models by a function h, which can be either deterministic or stochastic, depends on additional deterministic and stochastic parameters, such as shear velocity and roughness, and maps the hazard parameters to the intensity measure at x_i:

IM_i = h(θ, x_i). (2)

The intensity measure is evaluated at all locations x = [x_1; ⋯; x_K], resulting in a vector IM = [IM_1; …; IM_K]. In the general case, h is a stochastic function and IM is a discrete representation of a random field, defined conditional on the hazard parameters θ. A commonly used random field model for a scalar intensity measure is based on multiplying a deterministic model h̄ with a lognormal random field:

IM(x) = h̄(θ, x) · exp[σ_ε ε(x)], (3)

where ε(x) is a standard normal random field, and σ_ε controls the variance of the random field. This model is commonly used in seismic risk analysis, 42,43 wherein h̄ is the deterministic GMPE. More generally, the hazard intensity at a location can be described by a vector of intensity measures IM_i = [IM_i,1, …, IM_i,m]. For example, both PGA and PGV might be used as intensity measures in a seismic risk analysis. In this case, applying the model to all locations x, one obtains the matrix of intensity measures IM, whose (i, j)th entry corresponds to the jth intensity measure evaluated at the location of the ith component.
Based on a model such as given in Equation (3), the intensity measures at all components can be summarized by a conditional joint PDF f_IM|Θ(im | θ).
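The random field model of Equation (3) can be sketched as follows; the site coordinates, median intensities, correlation length and σ value are illustrative assumptions, not values from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical site coordinates (km) and a placeholder deterministic model h_bar
sites = np.array([[0.0, 0.0], [10.0, 0.0], [50.0, 20.0]])
h_bar = np.array([0.30, 0.25, 0.10])   # assumed median PGA (g) at each site

# Squared-exponential correlation of the standard normal field eps(x)
ell = 20.0  # correlation length (km), assumption
d = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
R = np.exp(-((d / ell) ** 2))

# Sample eps and apply Equation (3): IM(x) = h_bar(theta, x) * exp(sigma * eps(x))
sigma = 0.4
eps = rng.multivariate_normal(np.zeros(len(sites)), R)
im = h_bar * np.exp(sigma * eps)
```

One draw of `eps` produces one spatially correlated realization of the intensity field for a fixed hazard scenario θ.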

Component vulnerability analysis
The performance of the system components is represented by fragility models as a function of the intensity measures at the component locations. Fragility functions for different hazard types and system components are described extensively in the literature. 25,27,[44][45][46][47][48] In agreement with the PEER framework, we use the term damage measure DM_i to describe the performance of the ith system component. In the simplest case, one distinguishes only between a functioning (DM_i = 0) and a failed (DM_i = 1) component. However, consideration of multiple damage states is straightforward. 27 For a scalar intensity measure IM_i at a location x_i, and considering only a failure state DM_i = 1, the fragility function is

Pr(DM_i = 1 | IM_i = im_i) = F_i(im_i). (4)

In spatially distributed systems, the damage measure is a random vector DM = [DM_1; …; DM_K]. Commonly, it is assumed that the damage states are conditionally independent among the components given the vector of intensity measures IM. Consideration of such dependence among components is, however, possible 46 and can be included in the framework.
The fragility functions for all components define a conditional joint PMF of the component states given the vector of intensity measures, p_DM|IM(dm | im).
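A minimal sketch of sampling conditionally independent damage states from Equation (4), using a lognormal-CDF fragility with assumed (hypothetical) parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Equation (4) with a lognormal-CDF fragility, a common parametric choice.
# lam, zeta are assumed values (log-median and log-std of the capacity).
def fragility(im, lam=-1.2, zeta=0.5):
    return stats.norm.cdf((np.log(im) - lam) / zeta)

im = np.array([0.15, 0.30, 0.60])       # intensity measures at 3 components
p_fail = fragility(im)                  # P(DM_i = 1 | IM_i = im_i)
dm = rng.random(im.shape) < p_fail      # conditionally independent Bernoulli draws
```

The conditional independence assumption enters only in the last line: each component is sampled with its own failure probability, with no cross-component correlation beyond that induced by the shared intensity field.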

System vulnerability analysis
The performance of the overall system is represented by a loss measure L. It can be measured, for example, by connectivity loss, 49-52 efficiency loss, [53][54][55] or a factor describing the impact on the population. 51 For power networks, it can be measured by the energy not supplied (ENS), 56 and for road networks by the travel time delay. 57 Examples of further metrics can be found in the literature. 50,58 The loss measure can also account for the resilience of the system, that is, the time until the system recovers. 56,59 Depending on the scope of the analysis, L can include only direct consequences or additionally the indirect consequences associated with system disruptions.
In the loss analysis, DM is mapped to L by means of a system model g, which can be deterministic or stochastic:

L = g(DM). (5)

A large variety of system models can be found in the literature, ranging from simple connectivity-based models 51,52,54,55,60 to fully physics-based network models. 61,62 In the numerical investigations presented in Section 5, we utilize a generic network model that allows accounting for cascading effects.
In the general case of a stochastic loss model, g defines the conditional PDF of L given the vector of component damage states, f_L|DM(l | dm).

Risk analysis
The components of the probabilistic hazard analysis described in Sections 2.2-2.5 can be combined to determine the CDF of L conditional on a hazard event:

F_L(l) = ∫ F_L|Θ(l | θ) f_Θ(θ) dθ, (6)

where

F_L|Θ(l | θ) = ∫ Σ_dm F_L|DM(l | dm) p_DM|IM(dm | im) f_IM|Θ(im | θ) d im (7)

is the conditional CDF of the losses given Θ = θ. A common measure of risk is the loss exceedance rate. 8,26 Under the common (albeit not necessarily correct) assumption of independence between the number of hazard events and the losses in any given hazard event, the loss exceedance rate is

λ(l) = ν [1 − F_L(l)]. (8)

From the inverse of λ(l), one can derive the loss with a given return period T as

l_T = λ^(−1)(1/T). (9)
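When the combined model is cheap to evaluate, Equations (8) and (9) can be approximated by plain Monte Carlo. The following sketch uses a toy end-to-end loss model loosely inspired by the illustration in Section 3; all numerical values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
nu, T = 1.0, 100.0

# Toy end-to-end model (assumption): ln L = 0.8*M - 2*ln R + sigma*U
n = 200_000
m = rng.normal(6.0, 0.5, n)            # magnitudes
ln_r = rng.normal(3.0, 0.4, n)         # log hypocentral distances
losses = np.exp(0.8 * m - 2.0 * ln_r + 1.0 * rng.standard_normal(n))

# Equation (9): l_T is the (1 - 1/(nu*T)) quantile of the conditional losses;
# Equation (8): lambda(l_T) = nu * P(L > l_T) should then equal 1/T.
l_T = np.quantile(losses, 1.0 - 1.0 / (nu * T))
exceed_rate = nu * np.mean(losses > l_T)
```

The empirical exceedance rate at the estimated l_T recovers 1/T up to sampling error, which is the consistency check one would apply in practice.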

Sensitivity of the losses to the hazard scenarios
Equation (6) represents the uncertainty in the losses given a hazard event, whereas Equation (7) represents the uncertainty in the losses due to all factors other than Θ. In this section, we show how one can quantify the contribution of the uncertainty in Θ to the overall uncertainty in the losses. For ease of notation, we drop the explicit conditioning on the hazard occurrence in the following. One can decompose the conditional variance σ_L² of L given a hazard occurrence by the law of total variance:

σ_L² = Var_Θ[μ_L(Θ)] + E_Θ[σ_L²(Θ)], (10)

wherein Var_Θ and E_Θ are the variance and expected value with respect to Θ, whereas μ_L(θ) = E[L | Θ = θ] and σ_L²(θ) = Var[L | Θ = θ] are, respectively, the conditional mean and variance of L given θ. The first term in Equation (10) corresponds to the combined effect of Θ on the variance of L. One can observe that if σ_L² = Var_Θ[μ_L(Θ)], then E_Θ[σ_L²(Θ)] = 0 and the losses are a deterministic function of Θ.
Normalizing Equation (10) by the total variance σ_L², one obtains the following equality:

1 = S_Θ + S_{−Θ}^T. (11)

The term S_Θ = Var_Θ[μ_L(Θ)] / σ_L² is the closed Sobol' index of Θ, that is, the sum of the Sobol' sensitivity indices of all orders involving only the hazard parameters. 63 The term S_{−Θ}^T is the total effect sensitivity index of −Θ, that is, the sum of the remaining Sobol' sensitivity indices, which are those that consider at least one input random variable that is not a hazard parameter. One can observe that 0 ≤ S_Θ ≤ 1, and that S_Θ = 1 corresponds to the case where the loss function is deterministic for given θ, whereas S_Θ = 0 is the (hypothetical) case where the uncertainty in the hazard parameters has no effect on the expected losses.
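The decomposition in Equation (11) can be estimated by sampling. For a toy model with ln L = a^⊺θ + σU, the conditional mean of the log-loss given θ is a^⊺θ, so the closed Sobol' index of Θ for the log-loss can be estimated as follows (all parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: ln L = a^T theta + sigma*U, so E[ln L | theta] = a^T theta and
# S_Theta (for the log-loss) = Var(a^T Theta) / Var(ln L).
a = np.array([0.8, -2.0])
sigma = 1.0
n = 500_000
theta = rng.multivariate_normal([6.0, 3.0], [[0.25, 0.0], [0.0, 0.16]], n)
mu_ln = theta @ a                              # conditional mean of ln L
ln_l = mu_ln + sigma * rng.standard_normal(n)  # full log-loss

# Sampling estimate of the first term of Equation (11)
s_theta = mu_ln.var() / ln_l.var()
```

For these assumed values, Var(a^⊺Θ) = 0.8 and Var(ln L) = 1.8, so the estimate converges to S_Θ ≈ 0.44; note that the index for L itself differs from that for ln L because of the nonlinearity of the exponential.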

Definition
The aim of this section is to define hazard scenarios that are representative of specific loss values, typically losses l_T with a specified return period following Equation (9). A hazard scenario is characterized by a vector θ, and we denote the representative scenario of a loss l by θ*_l. We present and later investigate the definition of a representative hazard scenario. All distributions utilized in the following are conditional on the occurrence of a hazard event. For readability, we do not write out this condition.
We define the representative scenario as the most likely parameter values leading to a loss l, that is,

θ*_l = arg max_θ f_Θ|L(θ | l). (12)

Here f_Θ|L(θ | l) is the conditional joint PDF of the hazard parameters, given that the loss equals l. By Bayes' rule,

f_Θ|L(θ | l) = f_L|Θ(l | θ) f_Θ(θ) / f_L(l). (13)
Combining Equations (12) and (13), and noticing that f_L(l) is constant, θ*_l equals

θ*_l = arg max_θ f_L|Θ(l | θ) f_Θ(θ). (14)

f_Θ(θ) can be derived from historical records, literature and expert knowledge; f_L|Θ(l | θ) corresponds to the derivative of Equation (7) with respect to l, evaluated at l.
Equation (14) shows that θ*_l depends on the uncertainty in Θ, as well as the uncertainty associated with the conditional density f_L|Θ(l | θ). In fact, if S_Θ = 1, that is, if all uncertainty in L comes from Θ, then f_L|Θ(l | θ) = δ[l − μ_L(θ)] and Equation (14) becomes a constrained optimization problem:

θ*_l = arg max_θ f_Θ(θ) subject to μ_L(θ) = l. (15)

Equation (15) is analogous to the inverse FORM approach 20,21 for finding representative design parameters. Thereby, l − μ_L(θ) would be equal to the limit state function.
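Equation (15) can be solved with a standard constrained optimizer. The sketch below uses assumed Gaussian hazard parameters and the deterministic log-linear loss model of the later illustration; it is a sketch under these assumptions, not the paper's implementation:

```python
import numpy as np
from scipy import optimize, stats

# Equation (15) sketch: maximize f_Theta(theta) subject to mu_L(theta) = l,
# i.e. minimize -log f_Theta(theta) under the equality constraint.
mean = np.array([6.0, 3.0])                 # assumed mean of [M, ln R]
cov = np.array([[0.25, 0.0], [0.0, 0.16]])  # assumed covariance
f_theta = stats.multivariate_normal(mean, cov)
a = np.array([0.8, -2.0])                   # deterministic model: mu_L = exp(a^T theta)
l_target = 3.21

res = optimize.minimize(
    lambda th: -f_theta.logpdf(th),
    x0=mean,
    constraints=[{"type": "eq", "fun": lambda th: a @ th - np.log(l_target)}],
)
theta_star = res.x
```

For a Gaussian density and a linear constraint in θ, the solution is also available in closed form, which makes this a convenient sanity check for the optimizer.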

Illustration
We illustrate the definition of Section 3.1 through a simple example. The goal is to determine a representative seismic event. At a seismic fault, strong earthquakes occur with a rate of ν = 1.0 yr^−1. The engineering system under consideration consists of a single component. The hazard parameters Θ = [M, ln R]^⊺ are the magnitude and the log of the hypocentral distance from the earthquake source to the component. The distribution of Θ is normal with mean vector μ_Θ and covariance matrix Σ_Θ. The intensity measure IM is the PGA in m/s². The GMPE from Esteva and Villaverde 64 is taken as the hazard propagation model of Equation (2):

ln IM = b^⊺ θ + ε_IM, (17)

wherein θ = [m, ln r]^⊺ are realizations of the hazard parameters Θ, b = [0.8, −2]^⊺, and ε_IM is a normal random variable with zero mean and standard deviation σ_IM. For illustrative purposes and to obtain analytical solutions, we assume here that the damage measure is continuous and proportional to the intensity measure multiplied with a lognormal random variable. We furthermore take the losses to be proportional to the damage measure, also multiplied with a lognormal random variable. Omitting the proportionality constants, it follows that the losses are

L = IM · exp(ε_DM) · exp(ε_L), (18)

wherein ε_DM and ε_L are normal random variables with zero mean and standard deviations σ_DM and σ_L, respectively. They represent the uncertainty in the damage state given the PGA and the losses given the damage state. Equation (18) can be re-written in terms of a single standard normal random variable U and a single standard deviation σ:

ln L = b^⊺ θ + σ U, with σ² = σ_IM² + σ_DM² + σ_L². (19)

For given values of the hazard parameters, L(θ) is a lognormal random variable with parameters

λ(θ) = b^⊺ θ, ζ = σ. (20)

The unconditional loss L is also lognormal with parameters λ_L = −3.16, ζ_L = √(2.46 + σ²). Unless otherwise noted, we set σ = 1 in the following and we consider a scenario with return period T = 100 yr. The loss associated with this return period is l_T = 3.21.
Because the parameters Θ and the log-loss ln L are jointly normal, the conditional density f_Θ|L(θ | l) can be evaluated analytically. It is normal with mean vector

μ_Θ|L=l = μ_Θ + Σ_Θ b (ln l − λ_L) / ζ_L²

and covariance matrix

Σ_Θ|L=l = Σ_Θ − (Σ_Θ b)(Σ_Θ b)^⊺ / ζ_L².

The closed Sobol' index S_Θ is shown in Figure 3 as a function of σ. For σ = 1, it is S_Θ = 0.347.
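The joint-normal conditioning used here can be reproduced in a few lines; the mean vector and covariance matrix below are placeholders, since the values used in the example are not repeated here:

```python
import numpy as np

# Joint-normal conditioning sketch: with ln L = a^T Theta + sigma*U, the pair
# (Theta, ln L) is jointly normal, so the conditional distribution of Theta
# given ln L = ln(l) has a closed-form mean and covariance.
mu = np.array([6.0, 3.0])                   # assumed mean of Theta
cov = np.array([[0.25, 0.0], [0.0, 0.16]])  # assumed covariance of Theta
a = np.array([0.8, -2.0])
sigma = 1.0
l = 3.21

var_lnl = a @ cov @ a + sigma**2            # Var(ln L)
cov_theta_lnl = cov @ a                     # Cov(Theta, ln L)
mu_cond = mu + cov_theta_lnl * (np.log(l) - a @ mu) / var_lnl
cov_cond = cov - np.outer(cov_theta_lnl, cov_theta_lnl) / var_lnl
```

Because the conditional distribution is Gaussian, its mode equals its mean, so `mu_cond` is directly the representative scenario for this loss level under the assumed numbers: a larger magnitude and a smaller distance than the prior mean.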

F I G U R E 2 The conditional distribution of Θ given L = l_T and the representative hazard scenario θ*_{l_T}, with σ = 1

F I G U R E 3 Closed Sobol' sensitivity index S_Θ for different values of σ
In the following, we investigate the effect of σ on θ*_{l_T}. Figure 4 shows θ*_{l_T} for different values of σ. One can observe that for increasing values of σ, θ*_{l_T} approaches the mode of the distribution of Θ. When σ = 0, L(θ) is a deterministic function μ_L(θ), and θ*_{l_T} can be evaluated following Equation (15), that is, as the value with the highest density along the black line shown in Figure 4. For values σ → 0, that is, as the closed Sobol' index S_Θ tends to 1, θ*_{l_T} approaches the solution on this line. Figure 4 illustrates how the relative share of the uncertainty that is associated with the hazard parameters affects the representative scenario. If σ is large, and hence S_Θ is small, then the representative hazard scenario for any loss return period is close to the mode of f_Θ(θ). In this case, the extent of the loss is not determined by the intensity of the hazard, but by other factors. In contrast, if σ is small, and hence S_Θ is large, then extreme losses will always be associated with extreme hazard scenarios.

NUMERICAL EVALUATION OF REPRESENTATIVE HAZARD SCENARIOS
Generally, analytical solutions to Equation (12) are not available and the representative scenarios must be found numerically. Furthermore, the loss exceedance rate is also not available in analytical form; it must be evaluated numerically to determine l_T. In this section, we propose a procedure for doing so efficiently. The procedure assumes that the hazard rate ν and the probability distribution of the hazard parameters Θ are available. The procedure starts with an initial Monte Carlo sampling. The resulting sample evaluations of the losses are used for estimating l_T and to set up an initial surrogate model of the losses using Gaussian process (GP) regression. Depending on the variance in L given Θ, the procedure chooses between two approximation methods. When this conditional variance of the losses is small, that is, for large S_Θ, the procedure employs a GP surrogate of the losses in function of θ. If the conditional variance is large, the procedure approximates f_L|Θ(l | θ) with a kernel density estimation (KDE), which requires additional model evaluations, and learns a GP surrogate of f_L|Θ(l | θ) in function of θ.
To efficiently learn the GP surrogates, an AL method is utilized, which optimally chooses additional samples of Θ for which the system should be evaluated. AL is a commonly used strategy for improving the predictions of GP regression in optimization 65,66 and reliability analysis. 67,68 At the core of AL is the learning or acquisition function, which determines the best next point to be evaluated for reducing the prediction error of the GP regression. The implemented AL methods are detailed in Sections 4.2 and 4.3. The final surrogate model learned from all sampled scenarios is then employed to solve the optimization problem defined by Equation (14) or Equation (15), respectively. Figure 5 shows the different steps of the procedure, which are detailed in Sections 4.1-4.3.

Initial sampling
The first step is to generate n_0 random samples of Θ, denoted by θ^(1), …, θ^(n_0). For each sample, one evaluates the losses with the model, resulting in loss samples l^(1), …, l^(n_0). The choice of n_0 depends on the computational cost of one model evaluation. If the computational budget allows it, we recommend n_0 ≥ 10³ and generate the samples independently from the distribution f_Θ(θ). Then one can obtain an estimate of l_T from the ordered samples l_(1) ≤ … ≤ l_(n_0):

l̂_T = l_(k) + δ [l_(k+1) − l_(k)], (22)

where k is the largest integer smaller than or equal to n_0 (1 − 1/(νT)), and δ is their difference, that is, δ = n_0 (1 − 1/(νT)) − k.

F I G U R E 5 Scheme of the numerical approximation of the representative hazard scenario associated with return period T

Following the initial sampling, it must be decided if the conditional variance of the losses given the hazard parameters is small or large. Quantitatively, this is expressed by S_{−Θ}^T, see Section 2.7. In many cases, it will be clear to the analyst prior to the analysis whether S_{−Θ}^T is small or large. If not, S_{−Θ}^T can be estimated with a sampling approximation based on the generated samples of Θ and L.
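The order-statistic estimate of Equation (22) can be sketched as follows, with a toy lognormal loss sample standing in for the model evaluations:

```python
import numpy as np

rng = np.random.default_rng(4)
nu, T = 1.0, 100.0
n0 = 5000

# Initial Monte Carlo sample of losses (toy lognormal model, assumption)
losses = np.sort(np.exp(rng.normal(-3.16, np.sqrt(3.46), n0)))

# Order-statistic estimate with linear interpolation, Equation (22)
x = n0 * (1.0 - 1.0 / (nu * T))
k = int(np.floor(x))
delta = x - k
# l_(k) is losses[k-1] when converting 1-based order statistics to 0-based indices
l_T_hat = losses[k - 1] + delta * (losses[k] - losses[k - 1])
```

For large n_0 this coincides with the usual interpolated empirical quantile, which makes `np.quantile` a convenient cross-check.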
If the conditional variance is deemed to be small, the procedure proceeds with learning a surrogate of the losses given following Section 4.2 (deterministic approximation). Otherwise, a surrogate modelling of the objective function is employed as described in Section 4.3 (stochastic approximation). If one is unsure about whether the conditional variance is 'large' or 'small', one should choose the stochastic approximation, since it works in all cases (but at a higher computational cost than the deterministic approximation).

Deterministic approximation of the conditional losses
If the conditional variance of the losses given Θ is small, it is justified to replace f_L|Θ(l | θ) with a deterministic model of the losses in function of θ, that is, μ_L(θ) ≈ E[L | θ]. With this approximation, the representative scenario can be found by solving Equation (15). To model μ_L(θ), we employ GP regression 69 and obtain the mean function μ_GP(θ) and standard deviation function σ_GP(θ). The loss function is approximated by the mean function, and Equation (15) is replaced with the approximation:

θ̂*_l = arg max_θ f_Θ(θ) subject to μ_GP(θ) = l̂_T. (24)

The approximation of the loss function with a GP introduces a prediction error. However, one can reduce it near the equality constraint of Equation (24) by adding more scenarios in this region. One can achieve this with an AL method whose acquisition function looks for new scenarios close to the constraint. For that purpose, we employ the Expected Feasibility Function EFF(θ) as acquisition function, 67 which rewards scenarios whose GP prediction is both close to the target level l̂_T and uncertain; its tolerance band is ε ∝ σ_GP(θ). An optimization algorithm is employed to find the scenario with the maximum EFF(θ). This scenario is added to the training set and the GP is re-trained. This procedure is repeated until the computational budget n_3 is exceeded. After adding sequentially n_3 new scenarios with AL, one solves Equation (24) and finds the numerical approximation of θ*_{l_T}. Figure 6 summarizes the approach.

F I G U R E 6 Scheme for finding the representative hazard scenario with the deterministic approximation of the losses
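The EFF acquisition can be sketched as below, following the Expected Feasibility Function of Bichon et al.; the candidate scenarios and their GP predictions are hypothetical:

```python
import numpy as np
from scipy import stats

# Expected Feasibility Function (after Bichon et al.) for GP predictions
# mu(theta), s(theta) around the target level l_hat, with band eps = k*s.
def eff(mu, s, l_hat, k=2.0):
    eps = k * s
    z = (l_hat - mu) / s
    z_lo = (l_hat - eps - mu) / s
    z_hi = (l_hat + eps - mu) / s
    return ((mu - l_hat) * (2 * stats.norm.cdf(z) - stats.norm.cdf(z_lo) - stats.norm.cdf(z_hi))
            - s * (2 * stats.norm.pdf(z) - stats.norm.pdf(z_lo) - stats.norm.pdf(z_hi))
            + eps * (stats.norm.cdf(z_hi) - stats.norm.cdf(z_lo)))

# Hypothetical GP predictions at three candidate scenarios
mu = np.array([2.0, 3.2, 5.0])
s = np.array([0.1, 0.5, 0.1])
scores = eff(mu, s, l_hat=3.21)
best = int(np.argmax(scores))   # candidate both close to l_hat and uncertain
```

The acquisition favours the candidate whose prediction is near the constraint and carries large GP variance, which is exactly where refining the surrogate tightens the solution of Equation (24).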

Stochastic approximation of the conditional distribution of the losses
If the conditional variance of the losses given Θ is large, then one cannot utilize the deterministic approximation μ_L(θ), and a normal approximation of the conditional density of the losses is inappropriate in most cases. Therefore, we propose to approximate f_L|Θ(l | θ) with KDE for selected scenarios θ^(i), i = 1, …, n_1. At these scenarios, the objective function in Equation (14) is approximated as

f̂_L|Θ(l | θ^(i)) = (1/n_2) Σ_{j=1}^{n_2} K_h(l − l^(i,j)), (26)

wherein K_h is a kernel function and h the bandwidth, which can depend on θ^(i). The l^(i,j) are loss samples conditional on θ^(i), which are obtained from system model evaluations, and n_2 is the number of model evaluations at the scenario θ^(i). Note that one can use the same n_2 model evaluations for estimating the conditional density at multiple loss values, corresponding to different return periods.
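Equation (26) corresponds to a standard KDE of the conditional losses. A sketch with a stand-in stochastic loss simulator (the model and all parameter values are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Stand-in stochastic loss simulator for a fixed scenario theta^(i)
def loss_model(theta, n2, rng):
    return np.exp(theta @ np.array([0.8, -2.0]) + rng.standard_normal(n2))

theta_i = np.array([6.5, 2.1])
l_target = 3.21
n2 = 200

# KDE approximation of f_{L|Theta}(l | theta^(i)), Equation (26)
samples = loss_model(theta_i, n2, rng)
kde = stats.gaussian_kde(samples)        # Gaussian kernel, Scott's bandwidth
density_at_l = kde(l_target)[0]
```

The same `kde` object can be evaluated at several target losses, mirroring the remark that one set of n_2 model runs serves multiple return periods.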
To limit the number of evaluations of Equation (26), a GP surrogate of the objective function f_L|Θ(l | θ) f_Θ(θ) is constructed. One needs to select the training set for this GP regression efficiently, because each evaluation of the objective (Equation 26) requires n_2 model evaluations. To this end, we propose to choose from the n_0 scenarios of the initial sampling the n_s scenarios whose sampled losses are closest to l̂_T. One then clusters these scenarios into n_1 clusters with k-medoids, 70 and takes the medoids of the clusters θ^(1), …, θ^(n_1) as the initial training set for learning the GP. We consider n_s = min(n_0, 5 n_1).
Density estimation introduces uncertainty. We assess this uncertainty with bootstrapping. 71 That is, at scenario θ^(i) and after estimating f̂^(i), one computes the standard deviation σ_b^(i) of KDEs from re-sampled loss values. The bootstrap standard deviations of the KDEs σ_b^(1), …, σ_b^(n_1) evaluated at the points θ^(1), …, θ^(n_1) form the training set for a second GP that learns the noise variance over the hazard parameters. We employ AL to select informative scenarios at which to evaluate the objective function. A suitable acquisition function in the context of a noisy optimization problem is the Augmented Expected Improvement AEI(θ), 65,66 which we employ here. The AEI multiplies the classical expected improvement with respect to a representative current best value J*_k at AL iteration k by the factor

1 − σ_n(θ) / √(σ_GP²(θ) + σ_n²(θ)),

wherein σ_n²(θ) is the noise variance, which we approximate with the mean function of the second GP. Based on a similar idea of Picheny et al., 72 with 0 ≤ α ≤ 1, we take J*_k as the maximum over the training scenarios of μ_GP(θ^(i)) − α σ_GP(θ^(i)). We set α = 0.01. The scenario with the largest AEI(θ) is found via a standard optimization algorithm. The scenario is added to the training set and the GP is re-trained. This procedure is repeated until the computational budget n_3 is exceeded.
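The bootstrap standard deviation of the KDE at a given loss level, used as the noise level of the objective, can be computed as follows (toy conditional loss sample; the re-sampling size of 400 matches the one used later in the illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Toy conditional loss sample at a fixed scenario (assumed lognormal model)
losses = np.exp(rng.normal(1.0, 0.8, 200))   # n_2 = 200 samples
l = 3.21
n_boot = 400

# Bootstrap: refit the KDE on resampled losses and record its value at l
boot = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(losses, size=losses.size, replace=True)
    boot[b] = stats.gaussian_kde(resample)(l)[0]
sigma_kde = boot.std(ddof=1)                 # bootstrap std of f_hat(l | theta)
```

This standard deviation feeds the noise-variance GP and, through it, the AEI acquisition.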
After adding sequentially n_3 new scenarios with AL, we solve the optimization problem with the GP surrogate of the objective function, that is,

θ̂*_l = arg max_θ μ_GP(θ).

Figure 7 summarizes the method.
The total number of model evaluations is n_0 + n_2 (n_1 + n_3) − n_1. The last term is related to the re-use of loss evaluations from the initial sampling when evaluating the KDE at the n_1 scenarios θ^(1), …, θ^(n_1).
The choice of n_1 and n_2 has a significant impact on the performance of the methodology. n_1 is the number of scenarios in the initial training set of the GP regression, hence it should be chosen in function of the dimensionality of Θ. Based on numerical investigations, we recommend choosing n_1 in the range 20 ≤ n_1 ≤ max(20, 0.1 n_0). The reason for the upper bound on n_1 is to avoid including too many scenarios in the training set that are far from the solution. If one wants a large n_1, then one should increase n_0 as well.
n_2 is the number of loss evaluations necessary for a KDE of L given Θ = θ. For n_2, the recommendation is to use a value n_2 ≥ 20. n_3 is the number of AL steps, which implies n_3 additional KDEs. For that reason, we suggest a value 5 ≤ n_3 ≤ n_1. We observed that the uncertainty in the density estimation dominates the overall uncertainty in the GP predictions, together with the uncertainty in estimating l_T. Therefore, if the computational budget is large enough, additional model evaluations should focus on n_0 and n_2, rather than on n_1 and n_3. Exemplary choices of the number of samples for different computational budgets are shown in the next section.

Illustration
We illustrate the performance of the proposed method by approximating the solution of the basic example presented in Section 3. The analytical solution of Section 3 is utilized to assess the quality of the approximation method. The return period of interest is T = 100 yr. To investigate the deterministic approximation (Section 4.2), we use the example with the setting σ = 0.01, in which case the uncertainty in the losses given θ is small. To investigate the stochastic approximation (Section 4.3), we set σ = 1.
In the deterministic case, we set a computational budget of 10³ model evaluations, which is split between the initial sampling and the AL iterations. For the stochastic case, we evaluate the solution with different computational budgets, as detailed in Figure 9. In all cases, we compute the bootstrap standard deviation of the KDEs with a re-sampling size of 400.
The stochastic approximation requires 10 times more model evaluations than the deterministic approximation for obtaining an acceptably accurate estimate of the representative hazard scenario, as seen from Figure 9. Most of the evaluations are invested in conditional KDEs close to the solution. In terms of training set sizes, the stochastic approximation utilizes between 20 and 60 scenarios for learning the initial GP surrogate, all of them selected close to the solution.

NUMERICAL EXAMPLE
We apply the methodology to the IEEE39 bus system, whose topology is displayed in Figure 10. It consists of 39 nodes and 43 edges, which represent substations and transmission lines, respectively. The system is assumed to be exposed to earthquake events, whose hazard parameters are the hypocenter location x_0 = [X; Y; Z] and the magnitude M. We consider only events with magnitudes between 6 and 9.5, with a hazard rate of ν = 1 yr^−1. The representative hazard scenario is estimated for a loss return period of T = 100 yr. The hazard model is presented in Section 5.1, and the system model in Section 5.2. The numerical results are shown in Section 5.3.

Hazard model
The hazard parameters Θ = [X; Y; Z; M] are summarized in Table 1. They all follow a beta distribution with parameters a and b. The hazard parameters are assumed to be statistically independent.
The spatially distributed residual of the ground motion model is a Gaussian random field with zero mean, unit variance, and a squared exponential auto-correlation coefficient function with a correlation length of 50 km.
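Sampling such a random field at a set of sites can be sketched via a Cholesky factorization of the correlation matrix; the site coordinates below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_grf(coords_km, corr_length_km=50.0):
    """Sample a zero-mean, unit-variance Gaussian random field with a
    squared exponential auto-correlation function at the given sites."""
    d = np.linalg.norm(coords_km[:, None, :] - coords_km[None, :, :], axis=-1)
    corr = np.exp(-(d / corr_length_km) ** 2)
    # Cholesky factorization with a small jitter for numerical stability
    L = np.linalg.cholesky(corr + 1e-6 * np.eye(len(coords_km)))
    return L @ rng.standard_normal(len(coords_km))

sites = rng.uniform(0.0, 200.0, size=(25, 2))   # hypothetical sites (km)
field = sample_grf(sites)
```

Note that conventions for the squared exponential correlation function differ (some include a factor 1/2 in the exponent); the form above is one common choice.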

System model
All substations have the same fragility function, represented by a lognormal CDF with log-median −1.77 and log-standard deviation 0.35. The parameters correspond to the fragility function of the extensive damage state for high-voltage substations with unanchored elements. 73 No direct failures of transmission lines due to an earthquake are considered.
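As a sketch, assuming the two parameters are the log-median and log-standard deviation of the lognormal fragility, the failure probability of a substation given the local PGA could be evaluated as:

```python
from math import erf, exp, log, sqrt

LOG_MEDIAN = -1.77   # assumed log-median of the lognormal fragility
LOG_STD = 0.35       # assumed log-standard deviation (dispersion)

def failure_probability(pga_g):
    """Probability of reaching the extensive damage state given PGA (in g),
    modeled as a lognormal CDF: Phi((ln(pga) - log_median) / log_std)."""
    z = (log(pga_g) - LOG_MEDIAN) / LOG_STD
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

median_capacity = exp(LOG_MEDIAN)   # PGA with 50% failure probability, ~0.17 g
```

By construction, evaluating the fragility at the median capacity returns exactly 0.5.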
where eff(s, t) in Equation (32) is the efficiency of the most efficient path from source node s to terminal node t, S is the set of source nodes, and T is the set of terminal nodes.
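Assuming, as is common, that the efficiency of a path is the inverse of its length, the source-terminal efficiency measure can be sketched on a toy graph; the adjacency list below is illustrative, not the IEEE 39-bus topology.

```python
from collections import deque

# hypothetical undirected network: adjacency list, sources S, terminals T
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
S, T = [0], [3]

def shortest_path_length(adj, s, t):
    """BFS shortest path length in an unweighted graph; None if disconnected."""
    seen, queue = {s}, deque([(s, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == t:
            return dist
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def network_efficiency(adj, S, T):
    """Mean over source-terminal pairs of the efficiency of the most
    efficient path, taken here as the inverse shortest path length."""
    total = 0.0
    for s in S:
        for t in T:
            d = shortest_path_length(adj, s, t)
            total += 1.0 / d if d else 0.0
    return total / (len(S) * len(T))

eff_intact = network_efficiency(adj, S, T)   # 1/3 for this toy graph
```

Removing failed substations (nodes) from the adjacency list and re-evaluating the efficiency then quantifies the post-event performance drop.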

Estimation of the representative scenario
A preliminary analysis of the system model clearly shows that the conditional variance of the losses given the hazard parameters is considerable, and the stochastic approximation should therefore be employed. (While not necessary in practice, we also evaluate the contribution of the uncertainty in the system model and the loss evaluation to the overall uncertainty, which amounts to 0.45.) We consider a computational budget of N = 10⁴ model evaluations, which we distribute as N₀ = 2020, N₁ = 20, N₂ = 200, N₃ = 20.
We compare the performance of the numerical approximation with a reference solution. This solution first utilizes 10⁶ randomly sampled scenarios for estimating the loss corresponding to the target return period. Thereafter, an additional 10⁴ scenarios are evaluated in a full factorial experimental design, whereby each hazard parameter is discretized to 10 values. At each scenario, the losses are evaluated 220 times. Finally, linear interpolation is employed over the four-dimensional grid to find the representative scenario. In total, the reference solution is based on 3.2 × 10⁶ model evaluations.

Figure 11 shows the results of one computation of the representative scenario. One can observe that the initial training set, which was selected from the initial MCS, is spread out, but not located around the prior mode. The scenarios added with AL show that the chosen acquisition function balances exploration and exploitation well. The resulting GP estimate matches well the conditional distribution of the hazard parameters given the losses obtained with the reference solution. Figure 12 shows the resulting estimates from 20 repeated evaluations. One can observe that the estimates exhibit little variability and are close to the reference solution.
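The reference solution's full factorial design with linear interpolation over the four-dimensional grid can be sketched as follows; the stand-in loss function and parameter ranges are illustrative assumptions.

```python
import numpy as np
from itertools import product
from scipy.interpolate import RegularGridInterpolator

# hypothetical stand-in for the expected loss given four hazard parameters
def mean_loss(x, y, z, m):
    return m**2 + 0.01 * (x + y + z)

# full factorial design: each of the four hazard parameters discretized
# to 10 values, giving 10^4 design points
axes = [np.linspace(0.0, 1.0, 10) for _ in range(3)] + [np.linspace(6.0, 9.5, 10)]
grid = np.array([mean_loss(*p) for p in product(*axes)]).reshape(10, 10, 10, 10)

# linear interpolation over the four-dimensional grid
interp = RegularGridInterpolator(axes, grid)
loss_at = interp([[0.5, 0.5, 0.5, 8.0]])[0]
```

In the reference solution, each design point would additionally be evaluated repeatedly to average over the conditional loss uncertainty before interpolating.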

CONCLUDING REMARKS
Representative hazard scenarios are of relevance for many tasks in risk management of engineering systems. In cases where the hazard is characterized by one dominant parameter, such as the annual maximum discharge in fluvial flood events, such scenarios are typically defined as those with a specific return period, for example, the 100-year flood event.
For hazards that are characterized by multiple stochastic parameters, such as the earthquake hazard for a given region, it was unclear how a scenario can be associated with a given return period. In this contribution, we close this gap by proposing a definition of representative hazard scenarios for engineering systems. The return period of a scenario is thereby evaluated with respect to the losses; that is, a 100-year hazard scenario is one that leads to losses with an AEP of 100 years. The representative scenario is then defined as the one with the highest probability among all scenarios leading to these losses. The proposed definition is applicable to systems with a single location, such as buildings, as well as to spatially distributed systems, such as power networks, or even systems of systems. The definition also accounts for the uncertainty in the system response and the losses.

It is important to realize that the proposed representative hazard scenario is a function not only of the hazard but also of the considered system. One should not refer to a representative scenario without specifying the system for which it is representative. A consequence of this definition is that any action that modifies the system (e.g., adding redundancy or reinforcing component structures) can lead to a change of the representative hazard scenarios. In most cases, in particular for small changes in the system, this change will not be large, and one can keep working with the previously determined representative scenario. In some cases, when larger changes to the system are made, the representative scenario can indeed change, but we see this as a positive feature of our definition. Assume that in a spatially distributed network a weak section is strengthened; then the representative earthquake scenario may change from one in the vicinity of this section to one closer to another weak section.
Such a change in the system should indeed be reflected in the representative hazard scenario.
We developed a methodology for evaluating such representative hazard scenarios numerically. The focus of the proposed algorithms is on an efficient evaluation; nevertheless, the necessary number of system performance evaluations is still in the order of 10³ to 10⁴ when evaluating scenarios with a return period of 100 years.
This large number restricts the complexity of the system models that can be used within the framework. However, using simplified system models for identifying the representative hazard scenarios seems justified. Once the scenarios are determined, more advanced models can be applied to these scenarios to assess the appropriateness of the simpler models.
In this paper, we show how a single representative hazard scenario can be evaluated. The method can be extended easily to obtain representative hazard scenarios for multiple return periods. Essentially, all model evaluations utilized for determining the representative scenario for one return period can be re-used when determining the one for another return period. Only the AL steps need to be performed separately for each return period.
In some cases, one might also wish to obtain multiple hazard scenarios that are representative of the same return period. Our definition does not cover this case. When the distribution of the hazard parameters given the losses is multi-modal, our definition, which currently only identifies the highest mode, can be extended to cover these multiple modes. If this conditional distribution has only a single mode, then one might identify additional 'representative' scenarios at some distance around this mode. We leave this question for future research.
While the examples in this paper consider single-hazard scenarios, the scenario definition and the proposed methodology for evaluation are applicable to multi-hazard scenarios, for example, earthquake followed by tsunami, floods and landslides. In this case, the hazard model corresponds to the combination of the models of the individual hazards.
At present, the methodology is applicable to hazards that are characterized by continuous parameters, whose joint probability distribution is available. Future work should extend the methodology to discrete and categorical hazard parameters, for example, if an earthquake can originate in multiple faults. Furthermore, in some cases, available information on hazards is not in the form of models with continuous random input parameters, but rather in the form of selected scenarios, for example, from an earthquake catalogue. In this case, the scenarios must be parameterized before the methodology is applicable.
The methodology is also restricted to hazard scenarios described by a limited number of parameters, in the order of up to 10 parameters. This is due to the use of GP regression, whose performance deteriorates with increasing dimensions. For most hazards, this restriction is not crucial. However, the method might have difficulties when one wants to include information about a spatial distribution of a hazard, for example, if one would want to describe earthquake hazard in terms of a ground motion map. In such cases, the methodology might be extended with dimensionality reduction techniques, such as principal component analysis (PCA) or partial least squares regression (PLS). 74,75 The general idea behind the definition of representative scenarios could also be extended beyond the hazard parameters to system parameters, for example, to identify scenarios of network failures associated with a specific return period. In these cases, the number of parameters describing these scenarios would be even higher. Therefore, it seems worthwhile to further investigate the extension of the methodology to scenarios defined by a larger number of input parameters.
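A PCA-based reduction of high-dimensional hazard descriptions, as suggested above, can be sketched with a plain SVD; the synthetic "ground motion maps" below are placeholders with a known low-rank structure.

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical ensemble of ground motion maps: 200 samples over 500 grid
# cells, constructed here with an intrinsic dimension of 5
maps = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 500))

# PCA via SVD of the centered data: the rows of Vt are the principal
# components, ordered by explained variance
centered = maps - maps.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5
scores = centered @ Vt[:k].T            # low-dimensional representation
reconstructed = scores @ Vt[:k] + maps.mean(axis=0)
```

The GP surrogate would then operate on the k-dimensional scores instead of the full maps, at the cost of the reconstruction error, which is zero here only because the synthetic data are exactly rank 5.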

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.