Expert elicitation and data noise learning for material flow analysis using Bayesian inference

Bayesian inference allows the transparent communication and systematic updating of model uncertainty as new data become available. When applied to material flow analysis (MFA), however, Bayesian inference is undermined by the difficulty of defining proper priors for the MFA parameters and quantifying the noise in the collected data. We start to address these issues by first deriving and implementing an expert elicitation procedure suitable for generating MFA parameter priors. Second, we propose to learn the data noise concurrent with the parametric uncertainty. These methods are demonstrated using a case study on the 2012 US steel flow. Eight experts are interviewed to elicit distributions on steel flow uncertainty from raw materials to intermediate goods. The experts' distributions are combined and weighted according to the expertise demonstrated in response to seeding questions. These aggregated distributions form our model parameters' informative priors. Sensible, weakly informative priors are adopted for learning the data noise. Bayesian inference is then performed to update the parametric and data noise uncertainty given MFA data collected from the United States Geological Survey and the World Steel Association. The results show a reduction in MFA parametric uncertainty when incorporating the collected data. Only a modest reduction in data noise uncertainty was observed using 2012 data; however, greater reductions were achieved when using data from multiple years in the inference. These methods generate transparent MFA and data noise uncertainties learned from data rather than pre‐assumed data noise levels, providing a more robust basis for decision‐making that affects the system.


Introduction
Material flow analysis (MFA) is a foundational tool of industrial ecology research [1] and characterizes how a given material is transported and transformed through a supply chain.MFAs are key to identifying potential resource efficiency improvements (e.g., increased recycling), and to evaluating the upstream and downstream system impacts of local interventions; e.g., the potential to reduce greenhouse gas (GHG) emissions released during material production by improving downstream manufacturing process yields [2].MFAs have been used to help set environmental policies and goals by national governments (e.g., justifying Japan's reduce, reuse, and recycling laws), local governments (e.g., remedial action taken against toxic releases into New York City harbor), and companies (e.g., Toyota's corporate MFA was used to set company goals for emissions and recycling) [3].The proliferation of MFA, however, is hindered by at least two major challenges.First is the long timeline for creating and updating detailed MFAs, currently taking months or even years.Second is the lack of uncertainty quantification (UQ) in most MFA results-a lack of UQ limits insight into the impacts, risks and unintended consequences of system interventions.It is increasingly accepted that UQ must be included in MFA results if they are to meaningful support informed decision-and policy-making [3,4].Bayesian methods help address these challenges of UQ and laboriousness in MFA, as explained below.

Previous work on Bayesian inference in MFA
Bayesian inference is a probabilistic approach to achieve data reconciliation that adjusts MFA variable estimates by combining prior knowledge with collected material flow data [5,6,7,8].The prior information is typically a combination of fact-based knowledge with subjective impressions based on experience [9].The prior belief about an MFA variable, such as the mass flow between two processes in a factory, is expressed as a probability density function (PDF); e.g., a Gaussian PDF could be used to represent a prior belief that a mass flow is expected to be 10 t with a variance of 1 t 2 .MFA data are subsequently collected and the "noise" in the collected data-e.g., due to the error in a mass sensor reading-is also expressed as PDFs.In Bayesian inference, the collected MFA data are combined with the priors to generate an updated posterior belief, represented as updated conditional PDFs.
Bayesian inference presents several advantages over other forms of MFA data reconciliation such as least squares optimization [10,11].First, it allows a rigorous quantification of uncertainty via the probability and statistics formalism, and allows flexible probability distributions able to capture high-order non-Gaussian and correlation effects.For example, the prior knowledge of an MFA variable might be best represented as a uniform rather than normal distribution if nothing is known other than the upper and lower bounds on the variable.Second, it is particularly suited for handling sparse, noisy, and incomplete data, and can incorporate multiple data streams simultaneously.Third, the Bayesian framework provides a natural entryway to inject domain knowledge, such as by using historical data or opinions from subject matter experts to form the prior distribution of the MFA variables [12].Bayesian inference can also be "chained" together, to perform sequential learning that iteratively assimilates new data as they become available.Even when little data is available, the Bayesian approach can provide a practitioner with an MFA with associated uncertainties.
Bayesian inference was first used in MFA in 2010 by Gottschalk et al. [13] to study nano-TiO 2 mass flows.They formed uniform and triangular prior distributions centered on values of historical data and performed Bayesian inference using a Metropolis sampling algorithm with simulated instead of measured data.In 2018, Lupton and Allwood [14] introduced additional MFA prior forms (e.g., Dirichlet priors) and conducted a case study on deriving the global steel flow.Their case study highlights some of the challenges of applying Bayesian inference to MFA: • Assigning proper and rigorously justified prior distributions.Lupton and Allwood's steel flow analysis [14] used previous results from Cullen et al. [15] as the basis of their priors.However, historical data may not be always available and even when it is (and still deemed relevant) it remains unclear how to form a probability distribution that properly reflects the uncertainty.Assuming a prior variance without justification can introduce bias to the posterior results.Alternatively, non-informative or weakly-informative priors may be used; e.g., assigning a wide uniform PDF for a mass flow between 0 Mt and 200 Mt.However, they will likely require more MFA data to be collected in order to decrease uncertainty to desirable levels.
• Assigning noise to collected MFA data.MFA relevant data are typically published without accompanying uncertainty information; e.g., no error bars are given for the commodity mass flow data reported by the U.N. Comtrade Database [16] or from trade associations such as the World Steel Association [17].
How then to model the data noise?Assumptions can be made; e.g., Lupton and Allwood [14] assume the data noise in their collected data follows a Gaussian distribution with a standard deviation equal to 10% of the observed value.Such assumptions can introduce bias into the final posterior MFA results and provide a false sense of confidence if the assumed noise overestimates the data quality.Elsewhere, there are qualitative methods to categorize data into uncertainty levels based on features such as the perceived data source quality and specificity [18], and semi-quantitative approaches such as using a pedigree matrix or confidence score [19,11] which translates an uncertainty level into a probability distribution or numerical value.The strict quantitative identification of MFA data uncertainty has been lacking so far [20]; however, the Bayesian framework can also facilitate the learning of data noise.At the core of MFA is the conservation of mass, which requires the total input flow of material mass for each node to equal its total output flow.We denote the total input (equivalently, total output) flow for node i by z i .The flow along an edge out of node i (e.g., to node j) is then equal to φ ij z i , where φ ij ∈ [0, 1] is the allocation fraction of node i's total outflow going into node j (φ ij = 0 if there is no flow from node i to node j).Hence,

Scope and structure of this article
Furthermore, for each node, its output allocation fractions need to sum to unity: We recommend working with the allocation fractions (φ ij ) as model parameters instead of working directly with the mass flow values for each edge.The allocation fractions offer a convenient method of expressing and allowing the mass balance relationships for the entire MFA to be assembled into a linear system as proposed by Gottschalk et al. [21].For instance, the mass balance equations for the simple MFA shown in Fig. 1 can be expressed as: where I is the n p × n p identity matrix, Φ ∈ R np×np is the adjacency matrix where entries are the allocation fractions φ ij , z ∈ R np depicts all the nodal mass flows, and q ∈ R np represents the external inflows to the network.The material inflows (q i ) originate from outside the network.For example, a material inflow to the U.S. aluminum material flow network could be imports of aluminum billet.For a scenario with given Φ and q, we can compute the model prediction for all nodal mass flows via: Subsequently, from the values of {z, Φ, q}, other common MFA quantities of interest (QoIs) can be predicted, such as mass flows for each edge (φ ij z i ), and sums, products and ratios of mass flows.We express these QoIs as a function, G(φ, q), where φ denotes a flattened vector containing all φ ij 's.

Bayesian Parameter Inference
Given an MFA model, the set of all unknown MFA parameters θ ∈ R n θ and the collected MFA data y ∈ R ny are treated as random variables and associated with a joint PDF p(θ, y).Here, we use θ to denote all MFA model parameters we are interested to learn, which may encompass φ and q as well as other parameters; y is the flattened set of collected and noisy MFA data records that correspond to the model prediction QoIs G(φ, q).Bayes' rule then directly follows from the axioms of probability, stating: where p(θ) is the prior PDF representing the initial belief in the MFA parameters θ before having collected any MFA data; p(y|θ) is the likelihood PDF (see Sec. 2.3); p(θ|y) is the posterior PDF representing the updated belief in the MFA parameters θ after having collected the MFA data y; and p(y) is the model evidence (marginal likelihood) and acts as a normalizing constant for the posterior PDF.Performing Bayesian parameter inference entails computing or characterizing the posterior p(θ|y) while accessing the likelihood and prior.

Modeling the Likelihood p(y|θ)
The likelihood computes the probability of having collected MFA data y if the model parameters had the value equal to θ; that is, it provides a probabilistic measure on the mismatch between observation y and model prediction G(φ, q).There are many ways in which the collected data y may relate to G(φ, q).For example, the discrepancy may be viewed as an additive noise: where k indicates the kth data component.Equation ( 6) is appropriate for MFA data where the error is insensitive to the scale of the measurement; e.g., in the case of a mass sensor which has a sensitivity of ± 10 grams across its measurement range.However, oftentimes the data error increases with the scale of the measurement and can be modeled as a relative noise in the form: In either case, modeling the noise k as a Gaussian centered around zero is a sensible assumption.For its standard deviation σ k , we do not presume its magnitude and will instead learn these values from the data.In the absence of information to the contrary, the noise associated with different pieces of data can be modeled as independent.Therefore, if adopting the relative error form in Eqn.(7) and with the unknown model parameters θ = {q, φ, σ}, the likelihood PDF is:

Expert Prior Elicitation for MFA
The goal of expert prior elicitation is to extract pertinent knowledge for θ from subject matter experts in a form that can used as a proper Bayesian prior PDF.There are two main challenges in expert prior elicitation.An expert does not typically have a preexisting quantification of her belief in the form of a PDF [22].Therefore, the first challenge is how to elicit and synthesize a single expert's knowledge into a "quantified belief prior" [22].Next, when there are multiple experts available, it is desirable to utilize all their opinions to have the prior capture the full diversity of background knowledge.However, Bayesian inference requires a single PDF for a parameter as the prior rather than multiple distributions from several experts.The second challenge is, therefore, how to combine and weight the beliefs of multiple experts into a single prior for each MFA variable of interest.Expert prior elicitation has been widely applied in medicine to design clinical trials [23], assess the effect of specific treatments for rare disorders [24], and inform health-care decision-making [25].Prior elicitation has also been used in environmental assessments to forecast future wind energy costs [26] and regional climate change [27].There has been some preliminary work on expert elicitation in MFA: Montangero and Belevi [28] use expert elicitation to describe the uncertainty regarding the flow of nitrogen and phosphorus in a septic tank.In their work, multiple experts provide uncertainty information as quantiles and their opinions are combined using equal weights.Elsewhere, Mathieux and Brissaud [29] conduct expert elicitation to understand where aluminum is being used in commercial vehicles.In their work, experts are brought together to determine collectively a single distribution for each estimate.In this work, we examine methods for eliciting, weighting, and aggregating multiple experts' beliefs under the Bayesian framework.

Eliciting a Prior from an Expert
Prior elicitation is typically performed using surveys conducted either remotely (e.g., by mail or online), in-person, or via a hybrid format where the expert is assisted by telephone or video conference [30].While the variable of interest θ can be multivariate (e.g., a joint distribution on all the allocation fractions leaving a node), eliciting multi-dimensional PDFs directly is very challenging; therefore, multivariate elicitation typically involves eliciting and then combining univariate marginal distributions [31,32].The common methods to elicit univariate PDFs are either a Variable Interval Method or a Fixed Interval Method [32,33].In a Variable Interval Method, the expert provides estimates of the quantiles; e.g., estimate a and b such that P(θ ≤ a) = 0.25, P(a < θ ≤ b) = 0.5 and P(b > θ) = 0.25 [34,35].In Fixed Interval Methods, the expert estimates the probability of θ within given fixed intervals (e.g., estimate P(a < θ ≤ b) for some given a and b) [36].While it is unclear which set of methods yield more accurate representations of the expert's true belief (Abbas' findings contradicting those of Murphy and Winkler [37,34]) it does appear that participants find the Fixed Interval Method easier to complete [37].Given that many MFA industry experts might be unfamiliar with statistical concepts such as quantiles, we recommend using the Fixed Interval Method for MFA prior elicitation where possible; e.g., for eliciting allocation fractions (φ) which are bounded within [0, 1].Elsewhere, when eliciting external inflows (q) or data noise parameters (σ k ), there is no upper bound a priori so that it is appropriate to either ask the expert to specify an upper bound before defining the intervals used in the Fixed Interval Method or else use a Variable Interval Method (see SI Section 1).

Prior Aggregation from Multiple Experts
The typical methods for combining multiple experts' knowledge into a single proper prior PDF are behavioral aggregation and mathematical aggregation [32].
Behavioral Aggregation: Experts collaborate to define agreed upon priors [32].Thus, very 'informed' experts have the chance to share their knowledge.However, it can be difficult to find a common time when all the experts are available and there are potential issues with strong personalities dominating the decision-making [32], the risk of overconfidence in the group [38], some experts concealing their true views [39,40], group polarisation [39,40], and individuals with unique information being ineffective at sharing [41].Therefore, methods such as the Delphi method [42,43] and its variants [44,45] have been proposed where direct interaction between the experts is restricted to prompt the experts to explain their views rather than relying on reputation or personality [32]; e.g., experts may share their views anonymously, adjust their views based on the received information, and iterate until convergence on a distribution.Mathematical Aggregation: A distribution is elicited from each of the n e experts independently, yielding n e PDFs {p 1 (θ), . . ., p ne (θ)}.An aggregated PDF p(θ) is then calculated using either linear or logarithmic pooling [32,46,47]: where w is the weight associated with expert , and Z is a normalization constant.In logarithmic pooling, the resulting prior p(θ) = 0 whenever any of the experts believe p (θ) = 0.In contrast, the prior generated by linear pooling includes any value considered plausible by any of the experts and is therefore more conservative in terms of not ruling out experts' beliefs (see Fig. 2).While equal weights could be assigned to all experts such that w = 1/n e , it is often desirable to allocate greater weighting to more informed experts, typically using either Social Network Weighting or methods requiring questions on seeding variables, also referred to as seeding questions from hereon.Social Network Weighting assigns weights based on each expert's number of citations [48] or a consensus among the experts on whose opinion should receive the most weight [49].Such methods have been criticized for often excluding experts with predominately industry rather than academic experience and for resulting in prior PDFs with low accuracy [48,49,50].Seeding questions assess each expert's expertise by comparing the experts' responses to collected seeding variable observations.One seeding variable example for a steel MFA study might be on the fraction of U.S. pig iron consumed in Indiana within a certain time period.The experts' responses are compared to the observations from a credible source (e.g., USGS).Typically, Cooke's method [51] is used to convert expert seeding question responses to expert weights where, for expert , the weight w ∝ C K with C being the calibration score and K the information score.The calibration score measures the accuracy of the expert's responses, and the information score penalizes experts weakly informative responses that approach the uniform distribution.The Kullback-Leibler (KL) divergence is used in calculating both scores.The KL divergence is a measure of how two PDFs differ.It is non-symmetric and non-negative with a KL divergence being zero between two identical distributions, and a larger KL divergence implying a greater difference between two distributions [52].For discrete variable X taking values in {1, 2, ..., m}, and two probability mass functions P (x) = p x and Q(x) = q x , the KL divergence from Q to P is given as: Figure 2: Linear versus logarithmic pooling of two equal weight elicited priors for an allocation fraction (φ).In general, the logrithmic pool result is more "concentrated" than the linear pool result [32].
Each expert's information score K is calculated as the KL divergence from the expert's elicited distribution to the uniform distribution, averaged across all the seeding questions (see Fig. 3a).In order to calculate an expert's calibration score C , each response to the seeding questions from expert is split into inter-quantile intervals.Typically, four inter-quantile intervals (3 degrees of freedom) are adopted with a corresponding probability vector Q = {0.05,0.45, 0.45, 0.05}.Each seeding variable observation is then compared to the inter-quantile intervals derived from the expert's response; e.g., the USGS records that 34.9% of U.S. pig iron was consumed in Indiana from 2002-2016, falling into the 50-95% interval of the response from expert shown in Fig. 3b.Then, let P = {p 1 , p 2 , p 3 , p 4 } denote the fraction of all n seeding variable observations that fall into each of the four intervals elicited from expert .Cooke [51] states that for a well-informed "ideal" expert (i.e.where seeding variable observations appear as independently drawn from a distribution consistent with the expert's quantiles) then P tends to Q, and D KL (P ||Q) tends to zero.Cooke defines the calibration score C as the probability that a random variable following a Chi-square distribution (with 3 degrees of freedom if using four inter-quantile intervals) is greater than the likelihood ratio statistic (2 × n × D KL (P ||Q)), see Fig. 3b; therefore, if expert 's knowledge differs from the seeding variable observations to a large extent, the associated KL divergence would be high and C becomes small.While not discussed previously, we believe that Cooke's method of calculating C is only appropriate when there is uncertainty in the seeding variable observation (see SI Section 2.4 for a detailed discussion).Eggstaff et al. state that there is no definitive minimum number of seeding questions; however, regardless of the number of questions, Cooke's method significantly outperforms equally weighting the experts' judgments [53].
Once the expert weights are computed, they can then be used to carry out the mathematical aggregation in Eqn.(9) to complete the prior construction.

Distributions for Modeling the MFA Priors
The next step is fitting the histograms elicited from the MFA experts to a family of parameterized PDFs.The distribution and corresponding hyper-parameters are typically fitted and selected via a least squares procedure [54,55].

Distributions for Allocation Fraction Priors
Lupton and Allwood [14] proposed using a Dirichlet distribution as the prior for the allocation fractions {φ S,d 1 , ..., φ S,d S } from source node S, ensuring that all the allocation fractions remain in [0, 1] and sum to unity.The Dirichlet PDF for φ S,d 1 , ..., φ S,d S given hyper-parameters and for each φ S,i , its marginal PDF follows a Beta distribution characterized by Another benefit of using a Dirichlet distribution is that eliciting each marginal distribution on φ S,i (Eqn.(12)) from an expert is sufficient to fully construct the Dirichlet joint distribution (Eqn.( 11)) [32].After eliciting each marginal histogram of φ S,i from the experts, we fit for the optimal Dirichlet hyper-parameters α * D by minimizing the squared differences between the probability of each Beta marginal that corresponds to the Dirichlet joint distribution and the marginal weighted histograms with n b intervals from the experts, summed across all d S allocation fractions emanating from a source node S [56]:  where F is the Beta CDF.
Other methods exist that may also be used to model the allocation fraction priors [14,57].For example, softmax transformations offer extra flexibility compared to using Dirichlet priors; e.g., they can incorporate a strong belief that φ S,d 1 = φ S,d 2 while the relationship between other allocation fractions is unknown.However, the increased complexity of the procedure complicates the elicitation process because, for example, the CDF in Eqn. ( 13) cannot be evaluated in a closed-form.

Distributions for Data Noise Parameter Priors σ k
Since σ k is the standard deviation of Gaussian distributed k , the prior must take positive support; i.e., p(σ k < 0) = 0.A common choice of prior for the standard deviation of a Gaussian distribution is the Inverse-Gamma distribution, which enables an analytical evaluation of the posterior when the measured data is linear in the parameters of interest [58].However, the Inverse-Gamma and other commonly used positive support distributions such as the Log-Normal distribution place negligible probability of σ at regions close to zero, greatly reducing the possibility that the MFA data is of high quality.We believe that generally in MFA there should be a moderate probability that the collected data is clean and high quality.Consequently, other prior distributions that place nonnegligible probability at regions close to zero should be considered.Such distributions include the half-Cauchy distribution, uniform distribution, and truncated normal distribution.

Distributions for Mass Inflow Priors q
The mass inflows q to the network must be positive.Therefore, uniform and truncated normal distributions can be used.When there is little information about q, the upper bound can be set to a large value for both distributions.Alternatively, both the half-Cauchy and truncated normal distribution can be used without an upper bound, and set such that the shape of p(q) is close to being flat.

Posterior Sampling
Once the prior and likelihood are established, the Bayesian inference problem can be solved as stated in Eqn.(5), updating our knowledge about our model parameters through the posterior distribution.Attempting to compute the posterior PDF would entail evaluating the denominator (model evidence) in Bayes' rule: p(y) = p(y|θ)p(θ) dθ, a task that is generally intractable to perform even numerically except for very low (e.g., < 3) dimensions of θ.Instead of computing the PDF, a major alternative strategy is to generate samples of θ from the posterior distribution.To that end, Markov Chain Monte Carlo (MCMC) algorithms [59,60,61,62] have become the predominant methods for computational Bayes in moderate θ dimensions (e.g., up to ∼ 100), which iterates a Markov chain to generate samples that are consistent with the targeted posterior distribution.The more scalable MCMC methods include Hamiltonian Monte Carlo (HMC) based samplers such as the No-U-Turn (NUTs) algorithm, which explore the parameter space efficiently leveraging the posterior gradient and Hamiltonian energy principles [63].However, HMC suffers from divergence in its time integration step when encountering neighborhoods of high posterior curvature [63,64]; this difficulty was indeed observed in our study when incorporating the data noise parameters σ into the Bayesian inference.Therefore, we opt to use sequential Monte Carlo (SMC) [65] to sample the posterior, which is based on the idea of iteratively re-weighing the samples using on a tempered likelihood [p(y|θ)] β at each stage where β is a tempering parameter that gradually increases from 0 to 1.We describe the SMC algorithm in SI Section 4.

Case Study on the U.S. Steel Flow
The advances in the Bayesian inference approach to MFA discussed above are tested by mapping the U.S. annual flow of steel, where we take the MFA network structure from Zhu et al.'s [11] analysis of U.S. steel flows in 2014.The case study demonstrates rigorous prior development based on expert elicitation and inference of the MFA collected data noise as well as the MFA flow parameters.All the data and code used in this case study are available online (see the SI).
In this case study, the parameters requiring prior formulation are the allocation fractions (φ), the external inputs (q), and the data noise standard deviation (σ) associated with data noise ( ).

Constructing Priors for Allocation Fractions (φ) and Input Flows (q)
Since the elicitation of informative priors for all MFA variables of interest can be time prohibitive, we only elicit expert priors for the upstream allocation fractions (φ) and external inputs (q), while using weakly informative priors elsewhere.Experts on U.S. steel flows were identified by conducting a literature search on steel flows and recycling and by contacting U.S. steel companies (e.g., U.S. Steel and Nucor).All the experts had more than 5 years of working or research experience in the steel industry.Eight experts agreed to an emailed request to take part in the study (see Table 2 in the SI).
For prior construction, the experts independently completed surveys online that contained a total of 32 questions: 23 elicitation questions for allocation fractions associated with import, export, production and consumption of ferrous raw materials, and an extra 9 seeding questions whose actual observations are taken from USGS.
At least one author was present online to answer any questions during survey completion.It took the experts between 25 and 80 minutes each to complete the survey.The Fixed Interval Method was used to elicit the parameters.For allocation fractions φ, the support [0, 1] for φ was divided into 10 equal-width intervals and Dirichlet hyperparameters were fit to the elicited experts' histograms as described in Sec.2.4.For the external inputs, the expert first specifies the lower and upper bound, with the interval then divided into 10 equal-width intervals.Figure 4 shows an illustrative example of a survey question for eliciting an allocation fraction.An expert could enter the probability value for each interval in the box or else drag the bar across for the given interval until the summation of the bars is 1, otherwise the expert cannot continue to the next question.
Linear pooling was used to aggregate the responses from the multiple experts into a single proper prior PDF for each MFA variable.First, weights were assigned to each of the experts based on their responses to 9 seeding questions using Cooke's method.A wide range of expert weighting values were obtained with the most informed expert having a weight of 0.299 and the least informed expert having a weight of less than 0.001.Figure 4 shows how the response of multiple steel experts are combined to form a single aggregated prior PDF for an MFA variable.

Collecting MFA Data Records and Constructing the Prior for Data Noise Standard Deviation (σ)
After construction of the allocation fraction and external inflow priors, U.S. steel flow MFA data were collected.We focus here on 2012 data although any year from the previous 20 years could also have been chosen.The steel flow data were collected from the United States Geological Survey (USGS) [66,67,68], World Steel Association (WSA) [17] and Zhu et al. [11].A complete record of all the collected MFA data is provided in SI Section 5.
The collected MFA data are published without accompanying uncertainty information while data error is inevitable.Subsequently, the noise for each piece of collected data is modeled as an independent relative error (see Eqn. (7)) that follows a Gaussian distribution with zero mean and a standard deviation σ.Expert elicitation could be used to derive an informed prior for σ; e.g., experts could be interviewed on the likely accuracy of USGS and WSA data.In this study, however, the prior on σ is instead modeled as only weakly informative using a normal distribution truncated below zero and above 0.5 with hyperparameters set such that P(σ ≤ 0.1) and P(σ ≤ 0.3) are approximately 0.5 and 0.95 respectively, maintaining a reasonable probability that the data can be of high quality.

Case Study Results: U.S. Steel flow in 2012
The Bayesian inference is implemented using SMC in PyMC3 with the code adapted from Lupton and Allwood [14].It takes approximately 17 hours to generate 10,000 samples using an Intel(R) CoreTM i7-11800H CPU, 2.30 GHz.The prior and posterior results are shown in Fig. 5 and 6 respectively.The width of each line is proportional to the mean of the flow and the color indicates the uncertainty level, with a smaller relative uncertainty displayed in darker blue colors.
Figure 7 shows the prior and posterior distributions for φ and σ associated with three prominent upstream flows.There is a significant uncertainty reduction for all φ shown in Fig. 7 (a)-(c).For example, despite a relatively flat prior for the fraction of (solid or liquid) pig iron consumed in the Basic Oxygen Furnace (BOF), even with only 1 year of data the posterior is able to reflect that pig iron is mainly consumed in the BOF. Figure 7 (d)-(f) presents the prior and posterior for σ.The posterior distribution for σ on "Iron Ore → BF" is more concentrated than the prior, indicating that the data noise can be learned from data.However, in comparison to φ, there is a lower uncertainty reduction in σ.In an effort to enhance data noise learning, we explored the use of multiple years of data to prompt a greater reduction in uncertainty.

Enhanced learning using data collected from multiple years
One interpretation of the modest residual uncertainty for σ shown in Fig. 7 is that there is still a low amount of information from data for inference if using data only from 2012.However, some of the φ are very likely to be similar across years.In addition, the measurement noise in different years is also possibly a realization from a similar distribution.If the allocation fractions φ and data noise parameter σ on regularly reported MFA data (e.g., the USGS data record on "Iron ore → BF") can be verified to be similar across a multi-year time period, then there is the potential to leverage multiple years' worth of MFA data to enhance the learning of the data noise.Therefore, we used Bayes factor analysis (see SI Section 3) to check and justify modeling the allocation fractions and data noise parameters as constant across five years worth of USGS and WSA data (2012-2016), allowing the inference to be rerun using these additional years' data.
Figure 8 shows the posteriors on φ and σ when utilizing one years (2012) versus five years worth (2012-2016) of data.Figure 8 shows a reduction in σ uncertainty when leveraging these extra data.For "Iron ore → BF", utilizing 2012-2016 data reduces both the uncertainty on σ and its mean value.On the other hand, the posterior on "Pig Iron → BOF" shows a reduced uncertainty for σ but an increase in its mean.This could be because the data noise is indeed low in the "Iron ore → BF" data and high in the "Pig Iron → BOF" data.However, readers should note that the data noise calculated here reflects not only the collected MFA data quality but also any inadequacy in the modeling; e.g., from the MFA network structure used to the assumption of constant data noise errors across the five years.

Discussion and Conclusions
This paper has introduced, adapted, and demonstrated the use of expert elicitation techniques to define proper prior distributions for Bayesian inference in MFA, and illustrated how the noise level in collected MFA data may itself be learned from data.Below, we discuss the lessons learned from the case study on U.S. steel flows, limitations of these approaches, and future work to advance the Bayesian approach in MFA.

Using Expert Elicitation in MFA
Expert elicitation allows the use of informed priors in MFA and will result in a quick reduction in parametric uncertainties when combined with collected data.In cases where no or negligible MFA data can be collected, expert elicitation provides a statistically robust method of estimating material flows.Expert elicitation may also reveal model-form mistakes in the MFA structure that were not apparent beforehand.
A potential drawback is the time needed to find experts, develop well-posed elicitation questions, and process the responses.For the case study, the authors were concerned that each interview might take more than one hour and that there would be significant confusion regarding our request to answer questions with histograms and PDFs rather than point values.However, most interviews were completed within 30 minutes and the authors only received approximately one survey clarification question per expert.Even experts who were unaccustomed to PDFs were comfortable completing the questions after reviewing the example question and solution we posted at the start of the survey.There was one case where the sum of upper bound elicited allocation fractions from a single source node was less than unity.While this does not prevent hyper-parameter fitting to a Dirichlet PDF it does suggest that either the expert made a mistake or that the expert thought that the model structure was incorrect, believing that there should be another destination node from that source node.A more sophisticated elicitation procedure could ask experts to evaluate the probability of candidate MFA model structures in addition to MFA parametric values.Overall, the Fixed Interval Method used in the case study was found to be an effective and quick method of eliciting allocation fraction and external inflow priors.
In the case study, a single weight was assigned to each expert.A potential problem is that the expert might not be uniformly informed across the domain of interest; e.g., an expert might be an authority on steel recycling but relatively uninformed on primary steelmaking.In the case study, it was possibly a case of non-uniform expertise that affected the derivation of the aggregated prior for the allocation fraction from (liquid and solid) pig iron to the BOF.The expert who received the highest weighting gave an apparently erroneous answer to the elicitation question on "Pig Iron → BOF", responding that less than 50% of the iron flows into the BOF.The high weighting given to this expert meant that the aggregated prior for Pig Iron to BOF was relatively uninformed (flat) despite the other seven experts responding that the majority of pig iron flows into the BOF (see Fig. 7).One option to avoid such problems would be to assign multiple weights to each expert corresponding to different areas of expertise; however, that would require the development and asking of more seeding questions tailored to those different areas of expertise.
A source of confusion for one of the experts was whether the survey was asking for uncertainty on a given variable (e.g., θ = Amount of Iron Ore Exported Amount of Iron Ore Produced ) or its variability over space and time; e.g., the histogram of the Amount of Iron Ore Exported Amount of Iron Ore Produced for each month in a year or across regions.The survey was eliciting the former and we found it useful to clarify that we were seeking uncertainty rather than variability by demonstrating a "toy" example and also using equations to define the variable of interest.

Learning the MFA Variables and Data Noise Parameters
The uncertainty reduction in the U.S. steel mass flows between the prior and posterior is shown by the increase in dark blue colors between Fig. 5 and 6.The ability to quantify and reduce these uncertainties in a principled manner using Bayesian inference is appealing as it can lead to more informed decision-and policy-making.For example, the benefits of deploying a new, more energy efficient mill technology can be calculated with greater confidence for the cold rolling mill (calculated to have processed an expected 32.3 Mt in 2012 with a standard deviation of 3.27 Mt) compared to the primary mill (calculated to have processed a smaller expected value of 23.1 Mt in 2012 but with a larger standard deviation of 8.69 Mt).
A new MFA approach explored in this article has been to incorporate data noise parameters as random variables.The allocation fractions and the data noise parameters are then learned simultaneously with collected MFA data.Intuitively it is harder to achieve great uncertainty reductions using this method because the same amount of data is used to provide information on more parameters, even though it is a more honest representation if we do not know the true data noise.To combat this, we used Bayes factor analysis to verify the existence of time-invariant data noise parameters, allowing us to incorporate multiple years worth of data for greater uncertainty reductions.
As observed elsewhere [14], a potential drawback of the Bayesian approach to MFA is the computational cost of the Monte Carlo-based algorithms.Modeling the data noise parameters as random variables significantly increased the stochastic dimension of the problem and in turn computational cost, from 3 hours/run if the data noise parameters are prescribed as constants, to 17 hours/run when the data noise parameters are modeled as random variables.Therefore, there is a trade-off between the amount of bias and the computational cost.However, the computational speed can be increased by, for example, using multi-core processors to run algorithms that can be parallelized or applying approximate Bayesian techniques such as variational inference [69].

Supporting Information: Expert Elicitation and Data Noise Learning for Material Flow Analysis using Bayesian Inference
Jiayuan Dong * , Jiankan Liao † , Xun Huan ‡ , and Daniel Cooper § This Supporting Information (SI) document includes data and literature reviews helpful to understanding the main article as well as links to elicitation surveys and Python Scripts that allow other MFA practitioners to use expert elicitation and data noise learning techniques.
Please go to this link (http://remade.engin.umich.edu/MFA_NSF.htm)for downloads of the following: • Presentation of the underlying data used to construct the Sankey diagrams in the main paper (Figures 5 and 6) in a numerical, tabular format • An example Google Qualtrics survey for eliciting allocation fractions from experts • An Excel file for calculating expert weights based on seeding variable question responses • A Python script for fitting prior PDFs to the aggregated and weighted histograms from experts • A Python script for performing Bayesian inference (adapted from Lupton and Allwood [1]) using the fitted prior PDFs and USGS and WSA collected MFA data • A Python script for performing Bayes factor estimation to select best performing model assumptions

Variable versus Fixed Interval Elicitation Methods
Table 1 summarizes the implementation of some common approaches for eliciting a univariate θ from an individual expert.One of the most widely used Variable Interval Methods is the Bisection method, which asks for medians of different intervals and is straightforward for experts to understand.However, some empirical studies reported overconfidence with the Bisection in comparison to the Tertile method [2], which operates similarly to the Bisection method but with more refined intervals.For the Fixed Interval Method, there are various options to define the intervals that are presented to experts to obtain their probability estimates.In [3,4], intervals are chosen with bins concentrated around the median or mode of θ, and focus on the elicitation of regions where the bulk of probability lies.The probabilities of θ within different intervals are asked in specific order to avoid tendencies of anchor assessments, which describes the phenomenon that an expert may adjust their estimates for later questions based on their own earlier responses.The trail roulette method is a Fixed Interval Method which does not ask for specific probabilities [5].Instead, it requires experts to distribute a number of chips among the bins of a histogram, and the proportion of total chips for a bin then represents the expert's probability of the parameter lying within that bin.The procedure may be simplified by increasing the width of bins or decreasing the number of chips although at the expense of losing resolution in representing the distribution.
There are a few experiments comparing Variable Interval Methods and Fixed Interval Methods but without conclusive findings for one being better.Murphy and Winkler [8] invited 4 experienced weather forecasters to make credible interval forecasts for temperature in Denver.Two of them forecasted based on the bisection method and estimated the 12.5, 25, 37.5, 50, 62.5, 75, 87.5th percentiles, while the other two used the Fixed Interval Method and gave probabilities of ±5 • C and ±9 • C intervals around a pre-determined median.In that study, results from the bisection method were much closer to the observed relative frequencies than the fixed interval forecasts.Separately, Abbas et al. [9] conducted a comparison experiment to assess the temperature in Palo Alto with 72 students.The Variable Interval Method involved estimating the 5, 25, 50, 75 and 95th percentiles, and the Fixed Interval Method entailed intervals with boundary points at the 5%, 25%, 50%, 75% and 95% of a total range specified by the judges.The order of questions for both methods were randomizes.When comparing with 345 historical data points, results from the Fixed Interval Method achieved a better fit, contradicting the findings from Murphy and Winkler [8].Moreover, the Fix Interval Method was reported to induce faster responses, and 50 participants indicated preference for the Fixed Interval Method while the other 22 either preferred the Variable Interval Method or showed no preference.

Elicitation from Multiple Experts 2.1 Inconsistency in the Linear and Logarithmic Opinion Pooling Methods
Both the linear and logarithmic opinion pooling methods suffer some form of inconsistency.Linear pooling is not externally Bayesian [2].That is, updating each expert's prior with data and aggregating their posteriors into single one would yield a different result than first combining the experts' priors and then updating the aggregated prior to the posterior.While the logarithmic pooling is externally Bayesian, it does not satisfy coherent marginalization.To satisfy coherent marginalization, for an event "C" equal to "event A or B where A and B are mutually exclusive" (and therefore P(C) = P(A)+P(B)), one can either compute P(C) from each expert and then pool the result, or first pool P(A) and P(B) separately and then sum them up.Logarithmic pooling does not satisfy this marginalization property while linear pooling does [10].O'Hagan et al. [2] further point out that no mathematical formula can be simultaneously externally Bayesian and abide coherent marginalization, and Lindley [11] and Genest and Zidek [12] criticised the evaluation of these fundamental inconsistencies, arguing that the aggregated distribution does not represent the belief of a single individual and therefore need not to behave as one would expect an individual's probability distribution.

Performance of Cooke's Method
To assess the performance of Cooke's method, Cooke and Goossens [13] conducted a psychometric experiment comparing "experienced experts" (teachers at a technical training institution) versus "inexperienced experts" (students at the institution).On technical items, the experienced experts performed significantly better on both the calibration and information scores.On general knowledge items, there was no significant difference between the teachers and students.Therefore, Cooke's method appears effective at reflecting the teachers expertise.
Cooke and Goossens [14] also conducted studies comparing (i) Cooke's method for assigning expert weights, (ii) simple averaging (uniform weights), and (iii) putting all weight on the best expert only as revealed using Cooke's method (i.e. one with the highest original weight).In their work, 45 separate experiments were carried out.In each experiment, every seeding question's histograms from multiple experts were combined into a single histogram using the three aforementioned weighting methods.Under each method, its combined histograms were then used to compute the new information and calibration scores (in the same manner as that in Cooke's method), and the product of these two scores is used as the assessment metric for comparison.Cooke's method achieved the highest metric among the three weighting approaches in 27 of the 45 cases, and it scored higher than the uniform weighting approach in 44 of the 45 cases.However, by using an assessment metric that is the same weighting formula from Cooke's method (i.e.product of information and calibration scores), it may have automatically favored the competition towards Cooke's method.Furthermore, the assessment metric was calculated on the same seeding questions as those used to determine the expert weights, and thus may not reflect the ability for the weighting approaches to generalize to priors for other quantities not covered in the seeding questions.

Experts for Prior Elicitation in the Case Study
Details of experts interviewed for prior elicitation in the case study of this paper are shown in Table 2.

Special Discussion on Seeding Variable
An interesting consideration for the Cooke's method that has not been discussed in existing literature is regarding the nature of the elicitation distribution.In particular, there is inconsistency to Cooke's method if the reference answer value to the seeding question is the true value to the quantity being asked.For example, suppose a seeding question asks "what is the value of π up to tenth digit?" and suppose the reference answer is the exact value 3.141592653, then a perfect expert who perfectly knows π's digit then would provide a histogram that is centered around this value and with a minimally required uncertainty (e.g., due to the minimum bin width of histograms in the survey).If the expert answers all seeding questions optimally in this manner, the empirical interquantile interval distribution (the blue distribution in the middle column of Fig. 3   tempered posterior p(θ|y) β ∝ p(θ)p(y|θ) β is the prior when β = 0 and becomes the posterior when β = 1.At stage t, we start with a population of n samples from the current tempered posterior.The importance weight of each sample is then computed as the ratio of the tempered likelihoods between stage t + 1 and t.The population is then re-sampled based on these weights, and each sample is also perturbed with a small number of steps in a Metropolis-Hastings Markov chain.This procedure is iterated until β reaches 1.The SMC algorithm is summarized in Algorithm 1.For this work, the SMC algorithm is implemented with PyMC3.

Input:
Prior p(θ), likelihood p(y|θ), population sample size N , number of Metropolis-Hastings steps n steps , effective sample size threshold N t .

Reweigh particles by wi
and normalize such that N i=1 wi n = 1.

5:
Estimate Effective Sample Size ( ESS) Re-sample N particles based on empirical distribution using wi n .

8:
end if 9: Perturb each particle for n steps steps of Metropolis-Hasting Markov chain.

2 Formulation 2 . 1 A
Mathematical Representation of MFA An MFA can be represented via a directed graph as shown in Fig. 1.The nodes of the graph (indexed by 1, 2, . . ., n p ) represent different processes, products or locations.Each directed edge connecting two nodes represents the mass flow of material from one process to another.

Figure 1 :
Figure 1: A graphical representation of an MFA network structure (a) Calculating the information score for an expert.(b) Calculating the calibration score for an expert.

Figure 3 :
Figure 3: (a) Each expert's information score K is calculated as the KL divergence from the expert's elicited histogram to the uniform distribution averaged across all seeding question responses.(b) Each expert's calibration score C is calculated by first counting the number (fraction) of actual observations of the seeding questions in each inter-quantile interval across all seeding question responses from an expert, then computing the KL divergence from the ideal expert's inter-quantile interval probabilities to the empirical expert's interquantile interval probabilities, and lastly obtaining the likelihood ratio statistic with the corresponding p-value (i.e.P(χ 2 3 > 2.479) in the above example) being the calibration score.

Figure 4 :
Figure 4: Prior elicitation required steel industry experts to complete an online Qualtrics Survey.The images above show an example question and expert response.The images below show an example calculation of an aggregated prior for the fraction of U.S. Iron Ore that is exported.Only 3 expert responses are shown for the sake of concision.Their weights have been normalized accordingly to sum to unity.

Figure 5 :
Figure 5: The prior for the U.S. Steel flow in 2012.All numbers on the flows refer to the mean of the prior mass flow in units of million metric tons (Mt).The uncertainty percentages refer to the flow standard deviation as a percentage of the flow mean.All mass flows refer to steel except for the iron ore flows that include the non-iron mass (e.g., oxygen and gangue).

Figure 6 :
Figure 6: The posterior for the U.S. Steel flow in 2012.All numbers on the flows refer to the mean of the prior mass flow in units of million metric tons (Mt).The uncertainty percentages refer to the flow standard deviation as a percentage of the flow mean.All mass flows refer to steel except for the iron ore flows that include the non-iron mass (e.g., oxygen and gangue).

Figure 8 :
Figure 8: Examples of posteriors on Allocation Fractions (φ) and Data Noise Parameters (σ) using data only from 2012 versus full data from 2012-2016.
This paper explores how to: (a) Form informative prior distributions for MFA variables by eliciting information from industrial experts, and (b) Learn the data noise from collected data by incorporating data noise as random variables for inference.In Sec. 2, we first introduce the MFA problem math- ematically using the conservation of mass principle (Sec.2.1) and then formulate Bayesian inference for learning MFA model parameters and the collected data noise (Sec.2.2).The two crucial components needed for solving a Bayesian inference problem-the likelihood and prior-are then presented in Sec.2.3 and 2.4 respectively.In particular, Sec.2.4 reviews existing methods for expert elicitation and discusses their use for the MFA framework.In Sec. 3, we apply these methods to derive the U.S. steel flow for 2012.Finally, in Sec. 4 we discuss the lessons learned from the case study.

Table 1 :
Method for elicitation of a univariate θ.

Table 3 :
Bayes factors to compare among models where the allocation fractions and noise parameters are assumed to be either time-invariant or time-dependent across the U.S. steel sector (2012-2016).Model 1 is that all allocation fractions φ and data noise parameters σ are time-invariant; Model 2 is that all φ are invariant but σ are different each year; Model 3 is that all φ are different each year but parameters σ are invariant; Model 4 is that all φ and σ are different each year.The minimum, mean, and maximum values are from repeating 30 trials of 10 5 -sample Monte Carlo estimate.A BF (M A : M B ) greater (smaller) than 1 represents an increase (decrease) of support in favor of model A versus model B given the observed data.In this case, model 1 receives overwhelming support in comparison to all other models.