An Improved Copula‐Based Framework for Efficient Global Sensitivity Analysis

Global sensitivity analysis (GSA) enhances our understanding of computational models and simplifies model parameter estimation. VarIance‐based Sensitivity analysis using COpUlaS (VISCOUS) is a variance‐based GSA framework. The advantage of VISCOUS is that it can use existing model input‐output data (e.g., water model parameters‐responses) to estimate the first‐ and total‐order Sobol’ sensitivity indices. This study improves VISCOUS by refining its handling of marginal densities of the Gaussian mixture copula model (GMCM). We then evaluate VISCOUS using three types of generic functions relevant to water system models. We observe that its performance depends on function dimension, input‐output data size, and non‐identifiability. Function dimension refers to the number of uncertain input factors analyzed in GSA, and non‐identifiability refers to the inability to estimate GMCM parameters. VISCOUS proves powerful in estimating first‐order sensitivity with a small amount of input‐output data (e.g., 200 in this study), regardless of function dimension. It always ranks input factors correctly in both first‐ and total‐order terms. For estimating total‐order sensitivity, it is recommended to use VISCOUS when the function dimension is not very high (e.g., less than 20) due to the challenge of producing sufficient input‐output data for accurate GMCM inferences (e.g., more than 10,000 data). In cases where all input factors are equally important (a rarity in practice), VISCOUS faces non‐identifiability issues that impact its performance. We provide a didactic example and an open‐source Python code, pyVISCOUS, for broader user adoption.

In comparison to other variance-based GSAs, VISCOUS provides an advantage by eliminating the need for input-output data to follow specific sampling strategies, as required in traditional Monte Carlo methods for Sobol' sensitivity indices (e.g., Homma andSaltelli, 1996, Saltelli, 2002).This is because input-output data in VISCOUS are not used to directly calculate Sobol' sensitivity indices but are used for training the GMCM.Therefore, input-output data can be from previous model runs for other modeling purposes, such as calibration and uncertainty analysis.Moreover, VISCOUS does not impose assumptions on the structure of input-output data, as required in many emulator-based GSA, especially those employing ANOVA (Analysis of Variance).For example, assumptions about negligible higher-order interactions, as required in Borgonovo et al. (2012), Plischke et al. (2013), and Stanfill et al. (2015), are not enforced by VISCOUS.This characteristic enhances VISCOUS's applicability in diverse models.
The motivation of this research is to improve the VISCOUS of Sheikholeslami et al. (2021).In Sheikholeslami et al. (2021), the GMCM marginal densities are defined as the standard normal distribution along all variable dimensions (i.e., zero mean and unit variance).However, these marginal densities are inefficient as they remain fixed during the GMCM inference process, neglecting the impact of updated GMCM parameters.This may result in biased GMCM parameter estimates and inaccurate sensitivity indices, especially when insufficient iterations are allowed in GMCM inference.The objective of this paper is to improve the VISCOUS methodology by refining the GMCM marginal densities.This methodological advance will lead to a more efficient GMCM inference and improved sensitivity index estimates.
The structure of this paper is organized as follows.Section 2 explains the methodology of the improved VISCOUS framework.Section 3 evaluates VISCOUS using three types of Sobol' functions and demonstrate how its performance is affected by function dimension, input-output data size, and GMCM non-identifiability.
LIU ET AL.

Methodology
This section describes the methodology of the improved VISCOUS framework.The essence of VISCOUS is to develop and use the Gaussian mixture copula model (GMCM) to calculate the Sobol' sensitivity indices.To explain VISCOUS, we first review the Sobol' variance-based GSA method and then explain the GMCM method in detail, including the improved handling of the GMCM marginal densities.With both, we provide the derivations of the first-and total-order Sobol' sensitivity indices.Finally, we explain the implementation steps of the VISCOUS framework using Monte Carlo-based approximations.
In the following, random variables are denoted by capital letters, and their values are denoted by lowercase letters.For example, F X (x) is the cumulative distribution function of the random variable X evaluated at x. Bold face letters denote vectors or matrices, such as X = [X 1 ,…,X d ], where d is the number of variables.

Overview of the Sobol' Global Sensitivity Analysis
Assume a water system model is expressed as:  = (1, 2, . . ., ) (1) where a total of d input factors are evaluated in sensitivity analysis.Assume the variance of model outputs is a good proxy of output uncertainty and input factors are random and independent.The variance of model response (Y) can be decomposed into partial variances: first-order variance (V i ), second-order variance (V ij ), …, until d-order variance (V i..d ) (Saltelli, 2002;Saltelli et al., 2008).The first-order sensitivity index (S i ) is calculated as: where V(E(Y|X i )) is the variance of mean Y over X i alone.It represents the contribution of the single input   to the variance of response Y.The first-order sensitivity index is also called the main effect sensitivity index.
The total-order sensitivity index (S Ti ) is calculated as: where  (( |∼)) is the variance of mean Y over all X except X i .It represents the total contribution of non-X i , denoted as  ∼ , to the variance of response   .The total-order sensitivity index is also called the total effect sensitivity index.It includes not only the first-order effects of an input variable but also its higher-order interactions with other input variables.
Sobol' sensitivity indices range from zero to one.The closer an index value is to one, the better the associated input variable explains the model output.Moreover, from Equations 3 and 4, we see that the calculation of conditional expectations, E(Y|X i ) and  ( |∼) , is the cornerstone of the variance-based sensitivity analysis.In the following sections, we will explain the GMCM method and the use of it to calculate E(Y|X i ) and  ( |∼) ., where each u ∈ [0,1] follows the uniform distribution.

Gaussian Mixture Copula Model (GMCM)
The joint distribution F X,Y can be expressed as a function of the marginal distributions based on Sklar's theorem (Sklar, 1959;Tewari et al., 2011): (5) is the joint CDF of (X,Y).C is the copula function defined as the joint CDF of (u x ,u y ).The copula function specifies the distribution of (X,Y) by specifying their marginal distributions and linking the marginal distributions through the copula function.
The joint PDF of (X,Y), f X,Y , is obtained by computing the derivative of Equation 5: where c(u x ,u y ) is the copula density.In GMCM, a Gaussian Mixture Model (GMM) is used to approximate the copula function as there is no simple analytical formula for the copula function (Tewari et al., 2011).

Gaussian Mixture Model (GMM)
A GMM is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions (Singh, 2019;Xu & Jordan, 1996).The GMM CDF is denoted by: where where K is the total number of Gaussian components or clusters.λ k is the weight of the kth Gaussian component.

𝑧𝑧
are the corresponding inverse distribution function.There is no closed form expression for the inverse function, so a linear interpolation is used to obtain the inverse values based on the GMM parameters Θ (Tewari et al., 2011).   and z y are the obtained inverse values of    and u y , respectively.More details about the GMM are in Appendix A.

Gaussian Mixture Copula Model (GMCM)
The GMCM function is derived from the GMM.When the Gaussian mixture copula function is approximated by a GMM: the copula density, c(u x ,u y ), is approximated by: From Equation 9, we see that the dependence structure of the GMCM is obtained from the GMM.Moreover, the GMCM function shares the same parameter set as the GMM function.The GMCM parameters Θ are estimated by a modified Expectation-Maximization (EM) algorithm of Tewari et al. (2011) (detailed in Appendix B).
Here it is worth noting the algorithmic advancement relative to the VISCOUS of Sheikholeslami et al. (2021).
In Sheikholeslami et al. (2021), the marginal densities  Φ   and  Φ  are defined as the standard normal distribution along all variable dimensions (i.e., zero mean and unit variance).As such, the marginal densities are independent of the formula of GMCM.This has a subsequent impact on the GMCM inference as the marginal densities ignore the updated GMCM parameters and remain fixed in the inference process.This may lead to optimizing GMCM parameters taking longer than necessary, and introducing biases in these parameter estimates and inaccurate sensitivity index estimates.The presented methodology in this paper overcomes this shortcoming by defining the GMCM marginal densities based on the formula of GMCM and adopting iteratively updated marginal densities based on the GMCM parameters in the inference process.This methodological advance helps to obtain a GMCM function that fits the input-output data more efficiently and provides better sensitivity index estimates.

GMCM-Based Sobol' Sensitivity Index Estimation
As explained in Session 2.1, the variance-based sensitivity index estimation relies on the conditional expectations, E(Y|X i ) and  ( |∼) .The following explains the general use of GMCM to compute the conditional PDF, f Y|X .With f Y|X , it then explains the computation of E(Y|X i ) and the first-order sensitivity index.The computation of  ( |∼) and the total-order sensitivity index follows a similar logic.

Model Conditional PDF of Y
To compute the conditional PDF of Y, f Y|X , we need the joint PDF of (X,Y) and the marginal PDF of X.In GMCM, the joint PDF of (X,Y), f X,Y , is estimated based on Equations 6 and 9: The marginal PDF of X, f X , is estimated as: where The conditional PDF of Y, f Y|X , is obtained by dividing f X,Y by f X :

First-Order Sensitivity Index
When the input variable X i is fixed to a value x i , the resulting conditional expectation of Y is: where ΩY is a region of Y over which integration is conducted.Equation 13 is approximated using Equation 12: Since  d  d =  () , the above equation becomes: To drop the dependence upon the specific value x i , the variance of E(Y|X i ) is estimated by integrating E(Y|X i = x i ) over the probability density function of X i , expressed as: E(Y|X i ) and V(E(Y|X i )) can be estimated using Monte Carlo approximations, which is the content of the next section.With V(E(Y|X i )), the first-order sensitivity index is computed based on Equation 3. Similar approach can be used to calculate  ( |∼) ,  (( |∼)) , and the total-order sensitivity index by replacing the X i with  ∼ and is detailed in Appendix C. The above also shows that two loops are needed in the computation of V(E(Y|X i )).The inner loop is to compute E(Y|X i ) by integrating over u y .The outer loop is to compute the variance of E(Y|X i ) by integrating over x i .

Steps for Performing VISCOUS
This section explains the implementation steps of the VISCOUS framework using Monte Carlo-based approximations.Six steps are involved (Figure 1).Same as in Section 2.3, we take the first-order sensitivity index of X i as an example.The procedure is the same for the total-order sensitivity index except replacing X i with  ∼ and is detailed in Appendix C. Additionally, Appendix D demonstrates the implementation steps using the two-parameter Rosenbrock function.This didactic example aims to help users to better understand the details within the VISCOUS method and apply it for their own applications.Step 1. Select the evaluated input and output data based on the goal of sensitivity analysis.For example, when calculating the first-order sensitivity index of variable X i , the selected input-output data are (x i ,y).
Step 2. Normalize the selected input-output data using the min-max normalization method.Normalization transforms data into a common scale without changing the relationships among data.This improves the performance and training stability of the GMCM.The produced normalized data  (  ′  ,  ′ ) will be used in the calculation of Sobol' sensitivity indices.
Step 3. Calculate the rank-based empirical CDF for each variable of the normalized data, getting the marginal CDF data (u x ,u y ).Rank transformation is a common procedure to get marginal CDFs when the data distribution is unknown or complex (Saltelli and Sobol', 1995).The marginal CDFs are used to derive the inverse CDF values (z x ,z y ) in the following GMCM inference.

Part B. GMCM Inference
Finding the best fitted GMCM involves solving two problems.The first problem is to determine the optimal number of Gaussian components (K).The second problem is to determine the optimal GMCM parameters (Θ).Therefore, the following two steps are conducted interactively.
Step 4. To find the optimal value of K, we use a statistic known as Akaike information criterion (AIC).AIC estimates the quality of a model by balancing its goodness of fit (log-likelihood) and complexity (penalty to the number of model parameters) (Akaike, 1974).Readers can explore alternative model selection criteria based on their data characteristics and analysis goals.For instance, Bayesian information criterion (BIC) is another popular model selection criterion (Vrieze, 2012) and has been added as an alternative in pyVISCOUS.
Step 4 compares the AICs of multiple GMCMs with different Gaussian component (K) values (e.g., K = 1, 2,…,9 in this study).For each candidate K value, use a modified EM algorithm to estimate its corresponding GMCM parameters (Step 5), and then compute the AIC score for the estimated GMCM.
The GMCM that achieves the lowest AIC value is identified as the best fitted GMCM, and its corresponding K value is the optimal K value.
Step 5. Given a Gaussian component value K, estimate the GMCM parameters using a modified EM algorithm.
The EM algorithm is explained in Appendix B. In the EM, the marginal densities of GMCM change with every GMCM parameter update.The corresponding inverse distribution values (z x ,z y ) vary based on the form of the GMCM.A Python library called Copulas is used to perform the modified EM.

Part C. Sensitivity Index Estimation
Step 6. provides samples for integration over x i to obtain V(E(Y|X i )) in the outer loop.
The second round of sampling generates N 2 samples, for example, where  16is approximated by: With Equations 18 and 19, and Equation 3, the first-order sensitivity index can be computed.The procedure for calculating the total-order sensitivity index is similar and detailed in Appendix C.

Evaluation of the VISCOUS Framework
This section evaluates the improved VISCOUS framework using three types of Sobol' functions.We will first introduce the three types of functions, followed by comparative performance evaluation.We will also investigate three factors that affect the performance of VISCOUS: function dimension, input-output data size, and non-identifiability.Function dimension means the number of uncertain input factors analyzed in GSA, and non-identifiability refers to the inability to estimate the GMCM parameters.

Sobol' Function
According to Kucherenko et al. (2011), any model functions can be classified into three types based on their dependence on variables.
• Type A function: Variables are not equally important in terms of sensitivity.
• Type B function: Variables are equally important, and no interaction exists between variables.Therefore, S i = S Ti , ∑S i = 1, and S i = 1/n.• Type C function: Variables are equally important, and interaction exists between variables.Therefore, S i < S Ti , and ∑S i < 1.
Type A functions are the most common type of functions in practice.For instance, in most water system models, a large proportion of model output variation is often associated with a small proportion of the input factors (Markstrom et al., 2016).In statistics, this is known as the sparsity of effects principle or the Pareto principle (Box & Meyer, 1986).In the context of sensitivity analysis, this phenomenon reflects over-parameterization in model structure or the need for using a wider range of performance metrics for model evaluation.
Type B and C functions have all equally important variables.Equal importance means that all variables have the same sensitivity at all orders (i.e., first-order, second-order, …, and total-order).Type B and C functions differ in the interactions between variables.While these functions are uncommon, they provide valuable insights into the boundaries and limitations of a theory or methodology, aiding in refinement and improvement.Our study, which examines VISCOUS in type B and C functions, allows us to explore the full spectrum of possibilities, validate VISCOUS's robustness, and provide directions for future study.
The popular Sobol' function is adopted to examine the performance of VISCOUS in all three cases (Hu & Mahadevan, 2019;Kucherenko et al., 2011): Set d = 10, then (X 1 ,…,X 10 ) are the 10 input variables uniformly distributed in [0,1].We can conveniently get all the three types of function by changing a i (Kucherenko et al., 2011).
LIU ET AL. 10.1029/2022WR033808 9 of 22 Table 1 lists four functions that belong to the above three types of functions.Functions A1 and A2 are both Type A functions and are used to distinguish whether some (not all) variables are equally important in Type A. In function A1, all X variables are differently important.In function A2, X 1 and X 2 are equally important and X 3 ,…,X 10 are equally important, but function A2 is more sensitive to X 1 and X 2 than to X 3 ,…,X 10 .In functions B and C, all X variables are equally important, but the interactions between the variables are different, as stated above in the definitions of Type B and C functions.

Sensitivity Index Results
Figure 2 shows the first-order and total-order sensitivity index results of the four functions of Table 1 using the VISCOUS and Sobol' methods as well as the analytical true sensitivity index values.The Sobol' method is based on Saltelli (2002); the analytical truth is calculated based on Saltelli et al. (2004); the calculation of each sensitivity index is repeated 50 times to quantify sampling uncertainty.Each of the 50 experiments uses a different set of input-output sample data with size 10,000; and the Monte Carlo sample sizes are N 1 = N 2 = 2,000.For the first-order sensitivity indices (Figures 2a, 2c, and 2e), VISCOUS generates results matching the truth for all variables in all functions, and its uncertainty of sensitivity estimates is smaller than Sobol's.For the total-order sensitivity indices (Figures 2b, 2d, and 2f), for functions A1 and A2, VISCOUS provides slightly higher sensitivity estimates than the truth; for functions B and C, VISCOUS provides quite different sensitivity estimates from the truth.For all functions, VISCOUS is correct in ordering the sensitivity of each variable.Therefore, if one is interested in first-order sensitivity and input factors ranking, VISCOUS is good at achieving this functionality.
To investigate why VISCOUS behaves differently between Type A functions and Type B and C functions, we examined the results of the Sobol' method.The Sobol' method produces many negative sensitivity indices when the total-order sensitivities approach zero (Figures 2b, 2d, and 2f).Negative sensitivity indices do not make theoretical sense and are instead the result of numerical artifacts in the estimation procedure.Moreover, the Sobol' method produces large uncertainties when the total-order sensitivities are the same across all dimensions (Figures 2f and 2h).These reveal the difficulty of calculating the total-order sensitivities when they are close to zero or the same, in other words, when functions are insensitive or equally sensitive to evaluated variables.
We hypothesize that the performance of VISCOUS in estimating sensitivity indices is affected by three factors: function dimension, input-output data size, and non-identifiability of GMCM inference.The following sections check them one by one.

Function Dimension
For all types of functions, high dimensionality (the number of function input variables) has no effect on first-order sensitivity estimation, which is a beauty of VISCOUS, but it poses a challenge to total-order sensitivity estimation.This is because the function dimension has different effects on the number of variables involved in GMCM (and GMCM inference) in first-and total-order sensitivity estimations.
Suppose the number of variables involved in GMCM inference is denoted as D. When calculating first-order sensitivity, D is always equal to two regardless of the function dimension, including the evaluated variable itself (X i ) and the evaluated output variable (Y).When calculating total-order sensitivity, D is equal to the function dimension, including all the input variables except the evaluated variable  (∼) plus the evaluated output variable (Y).
For a GMCM with K components and D variables, the number of GMCM parameters to estimate is equal to K × D × D + K × D + K.These include K covariance matrices each of size D × D, K mean vectors of length D, plus a component weight vector of length K.These GMCM parameter values are determined through GMCM inference.When calculating first-order sensitivity, the GMCM has 7K parameters to estimate because D = 2.When calculating the total-order sensitivity, the GMCM has K × d × d + K × d + K parameters to estimate because D = d (d is the function dimension).This polynomial growth in the number of GMCM parameters can be a problem for high-dimensional functions because it becomes more challenging to produce a sufficient amount of sample data for making accurate GMCM inferences.For example, assuming a two-component GMM is used in GMCM (i.e., K = 2), when the number of X variables varies between 4, 6, 8, 10, 15, 20, 30, and 50, the corresponding number of GMCM parameters becomes 42, 86, 146, 222, 482, 842, 1,862, and 5,012.
To demonstrate the effect of function dimension on VISCOUS performance, we change the number of function variables from 4 to 50 to cover from low-dimensional to high-dimensional cases, and apply VISCOUS to all functions in Table 1.The experiment design is the same as in Section 3.2 except changing the number of function variables.The input-output data size remains 10,000 in all experiments.Figure 3 shows the VISCOUS sensitivity estimate errors.The error is calculated as the mean absolute sensitivity difference between the VISCOUS's result and the analytical truth across all X variables of a function.
For first-order sensitivity index, VISCOUS provides accurate estimates regardless of the function dimension, with a negligible error less than 0.005.For total-order sensitivity index, VISCOUS provides gradually worse estimates as the function dimension increases.Specifically, for Type A functions, when the function dimension is lower than 20, the total-order error increases slowly with the function dimension, and the error is acceptably small, less than 0.2.When the function dimension is higher than 20 (including 20), the total-order error increases rapidly, and the error is large.This difference between total-order errors and first-order errors indicates a potential limitation of the GMCM in capturing complex structures in high-dimensional problems.
Type B and C functions have different total-order error curves from Type A functions.Rapid error increases are observed in Type B and C functions even when the function dimension is low (e.g., d from 4 to 6 in Figure 3b).This implies that, when estimating total-order sensitivity indices, VISCOUS faces difficulties other than high dimensionality, which is particularly influential in Type B and C functions.This will be explained in Section 3.5.

Input-Output Data Size
To investigate how many input-output data are needed for VISCOUS to provide accurate sensitivity estimates, we changed the input-output data sizes from 200 until 10,000, and applied the VISCOUS framework to functions of Table 1.The results are shown in Figure 4.The first-column of Figure 4 shows the effect of input-output data size on first-order sensitivity estimates.For all functions, the first-order sensitivity estimate error effectively reduces as the input-output data size increases.More importantly, the first-order sensitivity errors are tiny for all functions with even only 200 input-output data (i.e., less than 0.003).This is due to the low number of parameters to be estimated in the first-order sensitivity related GMCM inference as explained in Section 3.3.
The second column of Figure 4 shows the effect of input-output data size on total-order sensitivity estimates.For Type A functions, adding input-output data effectively improves the total-order sensitivity estimates of low-and medium-dimensional functions (d < 20).If taking 0.2 as an error threshold, 200 input-output data are needed for VISCOUS to produce accurate total-order sensitivity estimates for 4-and 6-dimensional problems.400, 750, and 5,000 input-output data are needed for 8-, 10-, and 15-dimensional problems, respectively.
However, adding input-output data does not necessarily improve the total-order sensitivity estimates of high-dimensional functions (d ≥ 20) given limited input-output data.For example, in function A1 (Figure 4b), the total-order error increases as the data size rises to 1,000 when d = 30, and to 10,000 when d = 50.This is caused by overfitting.When the GMCM being used is overly complex, the GMCM might fit noise in data rather than capturing the true underlying patterns.As such, the GMCM performs very well on the input-output data but cannot generalize and therefore performs poorly on new data (i.e., GMCM samples in Step 6).This result indicates that estimating total-order sensitivity of high-dimensional functions is difficult because a large amount of input-output data is needed to make good GMCM inferences (e.g., more than 10,000 data).In this case, we recommend applying a screening method (e.g., Elementary Effect Test (Pianosi et al., 2016)) followed by the calculation of the Sobol' total-order sensitivity index on a reduced number of input factors.
Figure 4 also shows that, increasing the input-output data size does not improve the total-order sensitivity estimates for Type B and C functions as effectively as it does for Type A functions.The next section will explain the factor that has a greater effect on the total-order sensitivity estimates of Type B and Type C functions than function dimension and sample size.

Non-Identifiability of GMCM Inference
We hypothesize that the poor performance of VISCOUS in total-order sensitivity estimates for Type B and Type C functions stems from the non-identifiability of GMCM inference.Non-identifiability is the inability to infer some or all parameters of interest from the available data (Renard et al., 2010;Wagener et al., 2001).There is a considerable body of work on non-identifiability in the control-engineering literature, in the context of dynamical models, spanning over 40 years (Dobre et al., 2012;Guillaume et al., 2019).The following explains the reason behind the non-identifiability of GMCM inference.1.For each function, the number of X variables varies between 4 and 50; the input-output data size is 10,000; and the error is calculated as the mean absolute sensitivity difference between VISCOUS and the analytical truth across all X variables per function. 10.1029/2022WR033808 12 of 22

Grouped Component Parameters in GMCM Inference
In GMCM, the log-likelihood of all input-output data is expressed by: where Θ = [λ,μ,Σ] is the GMCM parameter vector, N is the total number of input-output data used for GMCM inference, and n = (1,…,N).z n = (z n,x ,z n,y ) is the nth inverse distribution values marginally based on the GMCM parameter vector (Θ) and the marginal CDF data (u n ).
Consider a simple example of GMCM with two Gaussian components.The log-likelihood is:  1.The number of X variables varies between 4 and 50.
LIU ET AL. 10.1029/2022WR033808 13 of 22 Assuming the two Gaussian components are independent, the log-likelihood function can be re-parameterized as: where The re-parameterized log-likelihood function depends on the weighted sum of Θ 1 and Θ 2 , not on the individual Θ 1 and Θ 2 .Therefore, μ EM and Σ EM are identifiable and their inference problem is well posed, but the individual Θ 1 and Θ 2 are not identifiable.
However, VISCOUS needs well-defined inference on the individual component parameters Θ k .This is because to compute the conditional expectations in variance-based sensitivity indices, both the joint and the marginal distributions of the GMCM are needed (see Equations 15 and C1).The following explains why the non-identifiability has the greatest effect on GMCM inference when the input variables are equally sensitive.

Non-Exchangeable Priors in GMCM Inference
When facing non-identifiability, the strength of the prior information determines if the GMCM inference problem is well-posed (Renard et al., 2010).An inference problem is considered well-posed if it satisfied the following three criteria: a solution must exist, should be unique, and should depend continuously on the given data and assumptions.
The use of non-exchangeable priors can help yield a well-posed GMCM inference problem.Here the non-exchangeable priors mean that the priors for one Gaussian component are distinctly different from the priors for all other Gaussian components: where k and k′ represent two different Gaussian components of the GMM (k,k′ ∈ [1,..,K], k ≠ k′).Otherwise, if   =   ′ and   =   ′ , then the two priors are exchangeable between the kth and   ′ th components.
The challenge in generating non-exchangeable priors exists in functions that are equally sensitive to input variables.The equally sensitive variables have the same distribution and same interaction with other variables (including y), the prior information on these variable dimensions is very similar or even the same.If the data used for GMCM inference induce exchangeable priors and cannot discriminate between components, then the data cannot discriminate between the individual component parameters.In this case, it is impossible for any inference algorithm to explicitly discriminate these component parameters.
The higher the function dimension is, the more difficult it is to generate non-exchangeable priors for the equally sensitive input variables.This explains why the total-order sensitivities of VISCOUS deteriorate much faster in Type B and C functions than in Type A functions as the function dimension increases (see Figure 3).VISCOUS currently uses the k-means method to generate priors for GMCM parameters.Appendix E lists approaches to generating non-exchangeable prior information, though applying these approaches is out of scope of this study.

pyVISCOUS
pyVISCOUS is the open-source Python implementation of VISCOUS, available at https://github.com/CH-Earth/pyviscous.git(Liu et al., 2023).It is developed to streamline the application of VISCOUS.pyVISCOUS offers straightforward installation options -available both as a Python package via pip or directly from the source.We also provide example notebooks demonstrating the utilization of pyVISCOUS across the Rosenbrock function, four Sobol' functions of Table 1, and a real case study of the Bow at Banff basin, Alberta, Canada.Each example notebook includes well-documented code, guiding users on generating input-output data, setting up and running VISCOUS, and evaluating sensitivity index results.

Conclusions
VISCOUS is a variance-based global sensitivity analysis (GSA) framework developed by Sheikholeslami et al. (2021).As a "given-data" method, VISCOUS leverages existing model input and output data (e.g., parameters and responses of water system models) to provide useful approximations of the first-and total-order Sobol' sensitivity indices.The input-output data do not need to follow any specific sampling strategies and thus can be from the previous model runs generated for other modeling purposes, such as calibration and uncertainty analysis.Also, there are no enforced structure assumptions on the input-output data, which enhances VISCOUS's flexibility and applicability to models with complex interactions.
This research has three innovative contributions.First, we improve the VISCOUS methodology by refining the GMCM marginal densities based on the GMCM formula.Second, we conduct comprehensive evaluations of VISCOUS using three types of generic functions and highlight general problems with the application of GSA methods to water system models (e.g., dimensionality challenges associated with computing total-order sensitivity index).Last, we provide a didactic example (Appendix D) and an open-source Python code, pyVISCOUS, to help people understand and apply VISCOUS.pyVISCOUS is model-independent and can be applied with user-provided input-output data.
Our evaluation shows that the performance of VISCOUS is affected by three factors: function dimension, input-output data size, and non-identifiability.VISCOUS is powerful in estimating the first-order sensitivity using a small input-output data set, such as 200 in this study.This holds true across various function dimensions, as VISCOUS is inherently not affected by the function dimension in first-order sensitivity estimation.Moreover, VISCOUS is always correct in ranking input variables in both first-and total-order sensitivity terms regardless of function dimension and input-output data size.
For functions that are differently sensitive to input variables (Type A function, which are common in water system models), VISCOUS can provide good total-order sensitivity estimates for low-and medium-dimensional functions using limited input-output data (e.g., 10,000 or fewer).For instance, in this study, VISCOUS needs only 200 input-output data for 4-and 6-dimensional problems, and 400, 750, and 5,000 input-output data for 8-, 10-, and 15-dimensional problems, respectively.However, like other GSA methods, VISCOUS has difficulties in estimating total-order sensitivities for high-dimensional functions or models.This is because the number of GMCM parameters grows in a polynomial manner with the function dimension, and it is difficult to produce sufficient input-output data to make good GMCM inferences.Therefore, it is advisable to use VISCOUS when the function dimension is not very high (e.g., less than 20).When the function dimension is high, we recommend applying a screening method followed by the calculation of the Sobol' total-order sensitivity index on a reduced number of input factors.
For functions that are equally sensitive to input variables (Type B and C functions, which are rare in water system models), VISCOUS faces a greater challenge than function dimension and data size in total-order sensitivity estimation, that is, the non-identifiability of GMCM inference.The GMCM parameters are grouped in inference, so the individual component parameters are not identifiable.In this context, if a function is equally sensitive to its input variables, the prior information on these variable dimensions is highly exchangeable and cannot be discriminated between the GMCM components.This adds complexity and subjectivity to the GMCM inference.While well-posedness is still achievable, careful consideration and justification of the exchangeable priors are necessary to ensure the validity and robustness of the inference results.VISCOUS currently uses the k-means method to generate priors, and our evaluation confirms that k-means does not perform well for Type B and C functions.Future work is needed to incorporate the method of creating non-exchangeable priors into GMCM inference, so it can handle functions with equally important variables.
We also invite discussion and collaboration with others interested in related issues of sensitivity and uncertainty analysis for computationally expensive models.We seek collaborations to assess pyVISCOUS's effectiveness in large samples of model types and study locations across a variety of hydroclimatic and environmental regimes.This will further help us test, improve, and modify the proposed sensitivity analysis framework.
To drop the dependence upon the specific value  ∼ , the variance of E(Y|X i ) is estimated by integrating  ( |∼ = ∼) over the probability density function of  ∼ , expressed as: The total-order sensitivity index is computed based on  (( |∼)) and Equation 4.
In Monte Carlo-based approximations, the above two equations are estimated as follows.First, use the inferred GMCM to perform two rounds of sampling and generate the Monte Carlo samples (for example, see Equation C3. The conditional expectation,  ( |∼ = ∼) , is approximated by: )) The conditional variance,  (( |∼)) , is approximated by: The total-order sensitivity index can be computed based on Equations C4, C5, and 4.

Appendix D: A Didactic Example of Implementing the VISCOUS Framework
This section uses the two-parameter Rosenbrock function to demonstrate the implementation of the improved VISCOUS framework.This example is intended to help users to understand the details of the VISCOUS methodology, such as the Gaussian components and GMM, and hence help users to utilize the VISCOUS framework for their own applications.
The Rosenbrock function, also referred to as the Valley or Banana function, is a popular test problem for uncertainty analysis, sensitivity analysis, and optimization algorithms (Rosenbrock, 1960).In the two-dimensional form, the Rosenbrock function is defined as:

D1. Part A. Data Preparation
Assume both variables (X 1 ,X 2 ) follow a uniform distribution between their lower and upper bounds.When we compute the first-order sensitivity index of X 1 for the Rosenbrock function, (X 1 ,Y) are included in the VISCOUS framework.We first generate 10,000 sets of (x 1 ,x 2 ) by randomly sampling from each variable's uniform distribution, and then calculate the corresponding y based on Equation D1.Following steps 1-3 in Section 2.4, we get three sets of data: input-output data (x 1 ,y), normalized data  (  ′ 1 ,  ′ ) , and empirical marginal CDF data (   1 , ) .Figure D2 shows the scatter plot of the two-dimensional data among the three data sets.

D2. Part B. GMCM Inference
For ease of visualization, we first used two Gaussian components to estimate the GMCM (K = 2).The resulting visualization can help to understand what the Gaussian components are and how they are grouped together to form the GMM.
Based on two Gaussian components, the GMCM density function is expressed as:  are the marginal CDF for the input-output data (see Figure D2c).We compute the inverse CDF of  (  1 ,  ) within the GMM, getting  (  1 ,  ) . Figure D4a shows the distribution of  (  1 ,  ) data in the GMM.Next, we compute the corresponding joint probability density for each data point  (  1 ,  ) based on the PDF of the GMM (Figure D4b).These probability density values play a crucial role in GMCM inference, specifically serving as key inputs for calculating the log-likelihood in the utilized EM algorithm (see Equation B1).The log-likelihood of this two-components GMCM is 2,697.90.Lastly, to see the appearance of different Gaussian components, we label each  (  1 ,  ) data point with the Gaussian component to which it exhibits the highest  ) . The study is funded by the Global Water Futures programme.The authors wish to acknowledge with respect that they collectively reside on the traditional lands encompassed by Treaties 6, 7, and 8, and the homeland of the Métis Nation.We extend our gratitude to these nations for their enduring care and stewardship of this land and water.Furthermore, we pay homage to the ancestors and custodians of these sacred places for their profound and lasting contributions.
the Python code of VISCOUS called pyVISCOUS.The paper concludes with discussion of future work and potential utility of VISCOUS for different modeling applications.
Y) is the variance of the model response Y.
. Φ is the CDF of a multivariate Gaussian distribution with mean μ k and covariance Σ k .The GMM parameter vector combines the weights, mean vectors and covariance matrices of all the Gaussian components, notated as Θ in the rest of the paper.

Figure 1 .
Figure 1.Flowchart of performing the VISCOUS framework.CDF denotes the cumulative distribution function.GMCM denotes the Gaussian mixture copula model.EM denotes the Expectation-Maximization algorithm.AIC denotes the Akaike information criterion.

Figure 3 .
Figure 3. VISCOUS sensitivity estimate errors for different function dimensions.Functions A1, A2, B, and C are defined in Table1.For each function, the number of X variables varies between 4 and 50; the input-output data size is 10,000; and the error is calculated as the mean absolute sensitivity difference between VISCOUS and the analytical truth across all X variables per function.

Figure 4 .
Figure 4. VISCOUS sensitivity estimate errors for different input-output data sizes.Functions A1, A2, B, and C are defined in Table1.The number of X variables varies between 4 and 50.
D1) where (X 1 ,X 2 ) are the two input variables in range of [−2,2].The global minimum is at (x 1 ,x 2 ) = (1,1), where y = 0.The Rosenbrock function over the domain [−2,2] 2 is shown in Figure D1.It involves a long steep valley and a gradually sloping floor.The Rosenbrock function in its two-dimensional form enables us to visualize the function itself and the implementation steps of VISCOUS.

Figure D1 .
Figure D1.Rosenbrock function in its two-dimensional form.

Figure D2 .ϕFigure
Figure D2.Scatter plot of the two-dimensional input-output data, normalized data, and empirical marginal CDF data.The histograms on the sides represent the marginal distribution.

Figure D3 .
Figure D3.PDFs of two bivariate Gaussian components and the GMM.

Figure D4 .
Figure D4.When using two Gaussian components, the histogram (panel a), joint PDF (panel b), and clustering (panel c) results for  (  1 , Once the best fitted GMCM is determined, generate the Monte Carlo samples  Based on the inferred GMCM, two rounds of sampling are performed to generate Monte Carlo samples.The first round of sampling generates N 1 samples, namely

Table 1
Configurations of Four Sobol' Functions Figure2.First-and total-order sensitivity index results of the Sobol' method, VISCOUS, and the analytical truth for the four functions of Table1.