Future Proofing a Building Design Using History Matching Inspired Level Set Techniques

History Matching is a technique used to calibrate complex computer models, that is, finding the input settings which lead to the simulated output matching up with real world observations. Key to this technique is the construction of emulators, which provide fast probabilistic predictions of future simulations. In this work, we adapt the History Matching framework to tackle the problem of level set estimation, that is, finding input settings where the output is below (or above) some threshold. The developed methodology is heavily motivated by a specific case study: how can one design a building that will be sufficiently protected against overheating and sufficiently energy efficient, whilst considering the expected increases in temperature due to climate change? We successfully manage to address this - greatly reducing a large initial set of candidate building designs down to a small set of acceptable potential buildings.


Introduction
Computer models (simulators) are increasingly common tools in science, used to help understand complex real world phenomena. Given a set of input settings, the real world is simulated (to some degree of simplification) and then some value of interest can be output. These models can be expensive to run, and so statistical surrogate models (emulators) are built to provide fast probabilistic predictions for what the simulator's output might be (Sacks et al., 1989; Kennedy and O'Hagan, 2001; Oakley and O'Hagan, 2002). Perhaps the most widespread emulator is the Gaussian process emulator (see O'Hagan, 2006, for details), which interpolates previously obtained simulator runs and provides uncertainty for these interpolations.
These emulators can have many different uses, depending on the specific goal of a practitioner. Perhaps the two most explored applications are: prediction (what the simulator output is for some new inputs, which is obtained automatically from an emulator), and calibration (given some observed value, finding what the "real" input value was). For calibration, standard Bayesian inference provides one obvious solution (Kennedy and O'Hagan, 2001), but it is not without its flaws (Brynjarsdóttir and O'Hagan, 2014). An alternative method, History Matching (Craig et al., 1997; Vernon et al., 2014; Andrianakis et al., 2015), also exists, providing straightforward (and faster) general implementation, easy utilisation even when the input and output dimension is high, a robustness to low simulation budgets, and the ability to identify when no such "real" input exists (say because the simulator is unfit for purpose).
We apply History Matching techniques to the problem of level set estimation; that is, for what inputs is the output lower (or higher) than some threshold? History Matching techniques have already been extended to the problem of optimisation (Lawson et al., 2016), and so it is a natural step to try and extend them to level set estimation as well. The problem of efficient simulator level set estimation is an open research question, and other efforts exist (see Lyu et al., 2018, for an example). A definitive conclusion as to whether History Matching is "better" than alternatives is still lacking, but many of the same arguments for History Matching in general can be made for our developed History Matching inspired level set methodology.
This methodology is intrinsically very accessible, and easily applied to new problems. We showcase this by developing a framework upon which engineers can 'future-proof' buildings - modifying a given building design such that the building has a sufficiently low chance of overheating towards the end of the 21st century. This case study provides interesting problems, with the field of building performance simulation crying out for greater attention from statisticians.
In this article, Section 2 will provide further details on the building design problem, the building simulator, and the specific building we use as our example. Section 3 will outline the statistical emulators used, and the various nuances provided by this case study. Section 4 then provides an explanation of History Matching, and our History Matching inspired method for level set estimation. Section 5 then applies this method to the discussed building model problem; and Section 6 provides concluding remarks.

Building Model
Our goal is to construct a mechanism that finds modifications to an existing building design which will have satisfactory overheating risks and energy demands, even after the expected increase in temperature by the end of the 21st century. We aim particularly for this notion of "satisfactory" rather than "optimal", as in practice there are often secondary criteria when it comes to building design (such as its appearance); and because it is far easier, and more sensible, for regulation to require a specific threshold standard than it is to require some relative notion of 'most-improved'.
EnergyPlus (Crawley et al., 2000) is a numerical model for simulating many properties of a given building, such as its total annual energy usage, or its hourly temperature. A required input is the shape and design of the building in question. For the purposes of this article, we use a specific building design (an image of this design is given in Figure 1), but the methods discussed are not specific to this building.
This building is fitted with an "ideal loads" heating system, and no air conditioning. No air conditioning may seem like an odd setup, given the intended objective, but this choice is more representative of the UK's building stock, where it has been estimated that only 0.5% of residential buildings have air conditioning (BBC, 2013). Air conditioning could have been included, and with a variable capacity, without any major problems.
Another input that EnergyPlus requires is the outside weather. Standard practice takes this weather as a fixed, known thing (Eames et al., 2015; Eames, 2016). In this work, we opt instead to take the weather as random: each time EnergyPlus is run, a new sample of weather is drawn from a random weather generator, more accurately representing the chaotic and uncertain nature of weather. Taking this whole procedure to now be what we refer to as the simulator, EnergyPlus is now stochastic. No longer is the output of a single run informative; instead, the distribution of the output is the desired output. When EnergyPlus was deterministic, the predicted energy usage (specifically from heating), and whether or not the building overheats, would be the quantities of interest. Now, with EnergyPlus being stochastic, the predicted average heating energy usage and the overheating risk are the quantities of interest.
In this article, our interest lies in 'future-proofing' the building, i.e. making sure it performs well towards the end of the 21st century. With this in mind, our specific choice of weather generator is the UKCP09 weather generator, which can output possible samples of weather for the year 2080, which can then be input into EnergyPlus (Eames et al., 2011). In this way, the choice of weather generator directly affects the analysis made - other weather generators could have been chosen (or even fixed weather could be used) were the goals different.
Certain properties of the building can be edited (for example, the thickness of insulation), a subset of which we will treat as our inputs of interest: variables we will assume we can change in order to improve the building. The specific inputs we consider will be: wall insulation thickness (varying from 0 m to 0.5 m), roof insulation thickness (varying from 0 m to 0.5 m), ground insulation thickness (varying from 0 m to 0.1 m), the size of the windows (varying from 20% of the wall size to 100% of the wall size), the length of window overhangs (varying from 0% to 100% of the height of the windows), the amount the windows can be opened by occupants (varying from 0% to 100% of the window size), the emissivity of the roof (varying from 0.4 to 1) and whether the windows are double or triple glazed. From now on, these will be referred to as $x_1, \ldots, x_8$ and their input ranges shall be rescaled to be between 0 and 1. Other inputs could of course be considered (for example, air conditioning capacity); the specific choices in practice would be down to what options are available.
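As an illustrative sketch of the rescaling described above (not the code used in this work), the following maps each continuous input between its physical range and the unit interval. The dictionary keys are hypothetical labels chosen here for readability; they are not EnergyPlus field names. The binary glazing input $x_8$ is not included, since how its two levels are coded (for example 0 for double and 1 for triple glazing) is an assumption the text does not fix.

```python
# Physical ranges of the seven continuous inputs, as listed in the text.
# The keys are illustrative labels only (an assumption of this sketch).
INPUT_RANGES = {
    "wall_insulation_m": (0.0, 0.5),
    "roof_insulation_m": (0.0, 0.5),
    "ground_insulation_m": (0.0, 0.1),
    "window_size_frac": (0.2, 1.0),
    "overhang_frac": (0.0, 1.0),
    "window_opening_frac": (0.0, 1.0),
    "roof_emissivity": (0.4, 1.0),
}


def to_unit(name, value):
    """Rescale a physical input value to the interval [0, 1]."""
    lo, hi = INPUT_RANGES[name]
    return (value - lo) / (hi - lo)


def from_unit(name, u):
    """Map a rescaled value in [0, 1] back to its physical range."""
    lo, hi = INPUT_RANGES[name]
    return lo + u * (hi - lo)
```

For example, a roof emissivity of 0.7 sits halfway through its range, so it is rescaled to approximately 0.5.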
As mentioned above, the outputs of interest are the average yearly heating energy usage, and the overheating risk. Heating energy usage is directly output by EnergyPlus and thus is easily obtained. Although temperature is output by EnergyPlus, the word "overheating" actually still needs to be defined. Whether a building overheats is fairly subjective - what is too hot? Similarly, a building can be very hot for a very short period of time, or it can be slightly hot for a very long period of time; which one is worse? We bypass this question by using the metric defined by CIBSE (Chartered Institution of Building Services Engineers) (TM52, 2013). As a brief summary, they define a building as overheating if it meets at least two out of three criteria, each of which tries to quantify the various ways a building could be considered uncomfortably hot. It is important to note that this provides a binary classification - a building either overheats, or it does not. Other hypothetical criteria could provide a continuous metric of overheating.
As a final note for our description of the problem, it was mentioned earlier that the goal involves targeting some thresholds for the energy usage and the overheating risk. In this article, we semi-arbitrarily decide to aim for a less than 1% chance of overheating. This represents a 'sufficiently small' value, but a different value could have been chosen. For the average yearly heating energy usage, we aim for less than 15 kWh/m$^2$, which is the requirement set by the Passivhaus standard (Schnieders and Hermelink, 2006), but similarly, a different threshold could have been chosen.

Emulators
Constructing emulators for these two outputs is then non-trivial, but essential. Not only does the model take time to run, which can be alleviated with an emulator, but the outputs of interest are not truly provided by EnergyPlus. One quantity of interest is the risk of overheating, but EnergyPlus only outputs whether the building overheats for a given set of weather. The other quantity of interest is the average heating energy usage, but EnergyPlus only outputs what the energy usage is for a given set of weather. This restriction demands some degree of statistical modelling. This section will explain how these variables were modelled. In many circumstances, particularly when standard deterministic simulators are used (say if the only quantity of interest was the energy usage, and the common fixed weather procedure was considered acceptable), the construction of an emulator would be much easier (O'Hagan, 2006). A complication in this example is the existence of the binary input variable $x_8$. This is non-standard in the emulation community, with a standard Gaussian process formulation requiring all inputs to be continuous. The binary window glazing variable was selected partially to show that binary input variables can still be included; binary variables are likely to be common as potentially adjustable attributes in a building design. In this work, we use the mechanism outlined in Qian et al. (2007) that allows non-continuous variables to be included in the covariance structure of a Gaussian process.
The first emulator described is the one for the overheating risk. Taking the output as $y_{oh}$ (which is binary), the continuous inputs as $x_c = (x_1, \ldots, x_7)$, and the binary input(s) as $x_b = x_8$, we have the following logistic classifier emulator:
$$y_{oh}(x_c, x_b) \sim \text{Bernoulli}\big(p(x_c, x_b)\big), \qquad \text{logit}\big(p(x_c, x_b)\big) \sim \mathcal{GP}\big(m_{oh}(x_c, x_b),\, K_{oh}(x_c, x_b, x_c', x_b')\big). \tag{1}$$
That is, the output is modelled as a Bernoulli random variable, with risk of the building overheating $p(x_c, x_b)$. The logit of this risk is modelled as a Gaussian process. This Gaussian process has a mean function $m_{oh}(x_c, x_b)$, which can be considered the same as the standard part of a logistic linear regression model, modelling the overall trend of the changing $p(x_c, x_b)$. The covariance function $K_{oh}(x_c, x_b, x_c', x_b')$ provides a correlation structure, allowing more nuanced local details to be captured. In this work, the mean function is taken to be 0, letting the covariance function do all of the work; and the covariance function is taken to be:
$$K_{oh}(x_c, x_b, x_c', x_b') = \alpha_{oh}^2 \exp\left(-\sum_{i} \frac{(x_{c,i} - x_{c,i}')^2}{l_{oh,i}^2}\right) \exp\left(-\sum_{i} \phi_{oh,i}\, \mathbb{1}[x_{b,i} \neq x_{b,i}']\right). \tag{2}$$
To clarify, $\alpha_{oh}^2$ is the overall variance of the process; the left most sum controls the correlation between two data points if only the continuous variables vary, with $l_{oh,i}$ modelling the smoothness of the relationship in the $i$th continuous dimension (this is called the squared exponential correlation function); and the right most sum controls the correlation between two data points if only the binary variables vary, with $\phi_{oh,i}$ modelling this correlation for the $i$th binary variable. In our case, we only have one binary variable, so the right most sum can be replaced with a single term, but the full summation is provided here for generality. Simply put, this model allows the overheating risk to be modelled as a very flexible shape.
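To make the covariance structure described above concrete, the following is a minimal sketch of a kernel with an overall variance, a squared exponential term over the continuous inputs, and a term that reduces the correlation whenever a binary input differs. The exact parameterisation (dividing by the squared length scale, and using an exponential of the summed $\phi$ values for mismatched binary inputs) is an assumption of this sketch, as are the function and argument names.

```python
import math


def covariance(xc, xb, xc2, xb2, alpha2, lengthscales, phis):
    """Covariance between two inputs with continuous parts xc, xc2 and
    binary parts xb, xb2 (parameterisation assumed for illustration)."""
    # Squared exponential contribution from the continuous inputs.
    cont = sum((a - b) ** 2 / l ** 2
               for a, b, l in zip(xc, xc2, lengthscales))
    # Contribution from each binary input whose value differs.
    binary = sum(phi for a, b, phi in zip(xb, xb2, phis) if a != b)
    return alpha2 * math.exp(-cont) * math.exp(-binary)
```

Two identical inputs give a covariance equal to the overall variance, while switching the glazing type alone multiplies the covariance by a factor of $\exp(-\phi)$ under this parameterisation.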
Fitting this model (i.e., obtaining values for the unknowns: the latent $p(x_c, x_b)$ at the observed data points, the variance $\alpha_{oh}^2$, the length scales $l_{oh,i}$ and the binary correlations $\phi_{oh,i}$) is done in a fully Bayesian way, using Stan (Stan Development Team, 2015). Fitting the model requires many simulations from EnergyPlus to be made, providing the data needed to infer the unknowns. Preferably the simulations are chosen using a "space-filling" design - using a wide range of value combinations for $x_1, \ldots, x_8$. One such way of deciding these input values is a Latin hypercube design (McKay et al., 2000), or in this case, because we have a binary variable, a sliced Latin hypercube (Ba et al., 2015). If multiple runs of EnergyPlus are taken using the same input values $x_c, x_b$, then fewer latent $p(x_c, x_b)$ must be estimated, speeding up the fitting process.
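A space-filling design with replication can be sketched as follows. This is a simplification rather than the design used in this work: it draws an independent random Latin hypercube for each value of the binary input, whereas a true sliced Latin hypercube additionally ensures that the combined design is itself a Latin hypercube. The function names and default sizes are assumptions chosen for illustration.

```python
import random


def latin_hypercube(n_points, n_dims, rng):
    """Random Latin hypercube on [0, 1]^n_dims: in every dimension,
    each of the n_points equal-width strata contains exactly one point."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_points)]


def replicated_design(n_unique=80, n_reps=5, n_cont=7, seed=0):
    """Design of n_unique input points, split evenly between the two
    values of the binary input, each repeated n_reps times."""
    rng = random.Random(seed)
    design = []
    for binary_value in (0, 1):
        for point in latin_hypercube(n_unique // 2, n_cont, rng):
            design.extend([(tuple(point), binary_value)] * n_reps)
    return design
```

Replicating each point means that only one latent risk per unique input has to be estimated, which is the speed-up mentioned above.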
Predicting new values of the overheating risk is very straightforward, using the standard Gaussian process predictive equations (Rasmussen and Williams, 2006) (and the inverse logit transformation, to go from $\text{logit}(p(x_c, x_b))$ to $p(x_c, x_b)$).
The second emulator is the one for energy usage. Taking the continuous output as $y_{eu}$, and the same inputs as before, we have the following stochastic emulator:
$$y_{eu}(x_c, x_b) \sim \mathcal{N}\big(\bar{y}_{eu}(x_c, x_b),\, \delta^2(x_c, x_b)\big), \qquad \bar{y}_{eu}(x_c, x_b) \sim \mathcal{GP}\big(m_{eu}(x_c, x_b),\, K_{eu}(x_c, x_b, x_c', x_b')\big), \qquad \log \delta^2(x_c, x_b) \sim \mathcal{GP}\big(m_{\delta}(x_c, x_b),\, K_{\delta}(x_c, x_b, x_c', x_b')\big). \tag{3}$$
That is, the output is modelled as a Gaussian process, with an additional intrinsic variability term $\delta^2$, which models the stochasticity of the energy usage output. This intrinsic variability is allowed to be different for different values of $x_c, x_b$, and thus is also modelled with a Gaussian process (or more specifically, $\log(\delta^2)$ is modelled as a Gaussian process, to ensure that the variability remains positive). The mean functions $m_{eu}(x_c, x_b)$ and $m_{\delta}(x_c, x_b)$ are again both taken to be 0, and the covariance functions $K_{eu}(x_c, x_b, x_c', x_b')$ and $K_{\delta}(x_c, x_b, x_c', x_b')$ take the same form as that in Equation 2. Using the same simulation runs as used to fit the binary overheating risk emulator (EnergyPlus can output both the overheating classification and the energy usage at the same time, thus requiring no extra simulations), this model could also be fit in a fully Bayesian way, but this has been shown to be impractically slow (Kersting et al., 2007; Boukouvalas and Cornford, 2009; Binois et al., 2018), so instead we obtain maximum a posteriori estimates of the unknowns using the optimizing function in Stan. This does not provide the same full assessment of uncertainty, but is a necessary decision; it is also less essential in this case to obtain a full uncertainty assessment, as none of the unknowns here are of primary interest (whereas for the binary overheating risk emulator, $p(x_c, x_b)$ is an unknown to be estimated that is also of primary interest).
Predicting the average energy usage is then possible again using the standard Gaussian process predictive equations. Because we are only interested in the mean of the process ($\bar{y}_{eu}$), rather than new values of $y_{eu}$, the estimates of $\delta^2$ for new values are not needed, only the already estimated values at the simulated input points. Using these estimates, the standard noisy predictive equations from Rasmussen and Williams (2006) can be used to predict the average energy usage, but with a vector of intrinsic variability values rather than a single constant. These equations are then also the same as the "stochastic kriging" equations (Ankenman et al., 2010).
Together, these two emulators provide a way of predicting what the overheating risk and the average energy usage are for any values of $(x_c, x_b)$. These predictions will have an uncertainty distribution around them (which is easily obtained from the Gaussian process predictive equations - analytically for the average energy usage, via sampling for the overheating risk). The accuracy, and precision, of these predictions depends on the total number of simulations made.
The next section will outline the History Matching inspired level set estimation methodology - detailing how one can estimate the level set of a simulator. The section after that will then apply said methodology to the above emulators - finding suitable buildings with regards to overheating risk and average energy usage. One key benefit of the proposed methodology, which is worth noting now, is that it is easily generalisable to many types of emulator - as long as an expected value of a prediction can be provided, and a variance of the uncertainty, then the proposed methodology can be used. The two contrasting emulators used in this work showcase this flexibility.
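For reference, the noisy predictive equations with a vector of intrinsic variances can be written out as below. This is a sketch under assumed notation that does not appear in the original text: $x_1, \ldots, x_N$ denote the inputs of all $N$ simulation runs (replicates appear as repeated inputs), $y$ is the vector of their energy usage outputs, $K$ is the $N \times N$ matrix of covariances between the simulated inputs, $k_*$ is the vector of covariances between a new input $x_*$ and the simulated inputs, and $\Delta$ is the diagonal matrix of estimated intrinsic variances. The mean function is taken as zero, as stated above.

```latex
\begin{align*}
\Delta &= \operatorname{diag}\big(\delta^2(x_1), \ldots, \delta^2(x_N)\big), \\
E\big(\bar{y}_{eu}(x_*)\big) &= k_*^{\top} (K + \Delta)^{-1} y, \\
V\big(\bar{y}_{eu}(x_*)\big) &= K_{eu}(x_*, x_*) - k_*^{\top} (K + \Delta)^{-1} k_* .
\end{align*}
```

With a single constant noise variance $\sigma^2$, $\Delta$ reduces to $\sigma^2 I$ and the usual noisy Gaussian process predictive equations are recovered.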

History Matching Level Set Estimation
To begin with, what follows is a brief explanation of History Matching. More details can be found in Craig et al. (1997); Vernon et al. (2014); and Andrianakis et al. (2015). History Matching by default is a calibration mechanism; observed data $y_{obs}$ can be used to narrow down the range of values that unknown inputs $x_{calib}$ could be, assuming some true values exist. An emulator of the simulator is built, using an initial simulated data set, and then values for $x_{calib}$ are discarded as "implausible" if they lead to output values that are sufficiently far away from the observed data. What counts as "sufficiently far away" depends on the degree of uncertainty surrounding the simulator output and the observed value. A key attribute of History Matching is "iterative refocussing", i.e. after an (often large) subset of the input space is discarded as implausible in the first wave, new simulations can be done using "non-implausible" input values, improving the emulator in this region of space, and thus allowing yet more input values to safely be discarded as implausible. Repeating this process for several waves can lead to a very small space of non-implausible input values $x_{calib}$ remaining, with only a comparatively small number of simulation runs having been needed.
We adapt this mechanism to instead apply to level set estimation. In our case, we do not have any observed data, $y_{obs}$; instead we have a single level set threshold we wish to aim for, L (for the overheating risk problem, this is 0.01). Assuming we target values less than L, we define the implausibility metric as follows:
$$I(x) = \frac{E(y(x)) - L}{\sqrt{V(y(x))}}, \tag{4}$$
where E(y(x)) is the expectation of the emulator, and V(y(x)) is its variance. This implausibility can be positive or negative, which is a key difference to standard History Matching. Large, positive values of I(x) suggest the input is not in the level set, as the expectation is much larger than the threshold L. Large, negative values suggest that it is in the level set, as the expectation is much smaller than L. When the level set is defined as values larger than L (instead of values smaller than L), defining the implausibility metric as the negative of that in Equation 4 allows the interpretation of the resulting implausibilities to remain the same.
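The signed implausibility described above can be computed directly from an emulator's predictive mean and variance. The sketch below takes it to be the difference between the emulator expectation and the threshold, divided by the emulator standard deviation; the function and argument names are chosen here for illustration.

```python
import math


def implausibility(mean, variance, threshold, target_below=True):
    """Signed implausibility of an input, given the emulator's predictive
    mean and variance there. Positive values suggest the input is outside
    the level set, negative values suggest it is inside."""
    value = (mean - threshold) / math.sqrt(variance)
    # Flip the sign when the level set is 'output larger than threshold',
    # so that the interpretation of the sign stays the same.
    return value if target_below else -value
```

For instance, an emulator mean of 2.0 with variance 4.0 against a threshold of 0.5 gives an implausibility of 0.75: the input is more likely outside the level set than inside, but far from conclusively so.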
A value of 3 or greater for I(x) is taken as the threshold for a value being 'implausible'. Three is the value often used in calibration, based on Pukelsheim's three sigma rule (Pukelsheim, 1994). Any value of x with I(x) > 3 is ruled-out, and no longer needs to be considered. The set of x values that are not ruled out yet is often referred to as the 'NROY' space (the Not Ruled Out Yet space), and is present in standard History Matching. Similarly, any value of x with I(x) < −3 is 'ruled-in': it is so likely to be part of the level set that it is not worth spending further simulations on, and thus it also no longer needs to be considered (but does need to be remembered). The set of x values which are not ruled in yet will hereon be referred to as the 'NRIY' space (the Not Ruled In Yet space), which is not present in standard History Matching.
For clarification, consider the image in Figure 2. This illustration demonstrates how 4 distinct regions of space emerge from using the implausibility metric from Equation 4. The central line, going from the top left corner to the bottom right corner, represents the set of inputs where the output exactly equals L. The red, top right, region represents the set of inputs where the output is much larger than L; they are therefore almost certainly not in the level set, and thus are ruled-out. The blue, bottom left, region represents the set of inputs where the output is much smaller than L; they are therefore almost certainly in the level set, to the extent that they become uninteresting, and thus are ruled-in. The uncoloured middle regions are the regions of greater interest. The upper uncoloured region, NROY, represents the set of inputs where the implausibility is greater than 0, and thus are not believed to be in the level set; but the implausibility is not large enough to know for certain. On the other hand, the lower uncoloured region, NRIY, represents the set of inputs where the implausibility is smaller than 0, and thus are believed to be in the level set; but the implausibility is not small enough to know this for certain either. With this, we then have a set of x values which are candidates for future simulations (any values where −3 < I(x) < 3). Running simulations for some of these NROY / NRIY values and refitting the emulator will improve the emulator in this space. This process can then be repeated several times, until the space of NROY / NRIY is acceptably small (or does not appear to change).
If at any point, no choices of x are NROY (i.e. all values of I(x) are greater than three), then this implies that no values of x are in the level set.
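The four regions described above amount to a simple classification of each implausibility value. The following sketch uses the region labels as they appear in the illustration; the function name is an assumption of this sketch, as is the placement of a value of exactly zero in the lower region, which the text does not specify.

```python
def classify(i_value, cutoff=3.0):
    """Assign an implausibility value to one of the four regions."""
    if i_value > cutoff:
        return "ruled-out"   # almost certainly not in the level set
    if i_value < -cutoff:
        return "ruled-in"    # almost certainly in the level set
    if i_value > 0:
        return "NROY"        # believed outside, but not conclusively
    return "NRIY"            # believed inside, but not conclusively
```

Only inputs in the two middle regions remain candidates for further simulation.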
If more than one output is being emulated (as in our case study, where we have two outputs of interest), one can take the overall implausibility to be the maximum of the individual implausibilities - if a specific value is implausible for any one of the level sets, then that value is considered implausible overall.
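A short sketch of this combination rule follows; the function name is chosen here for illustration. One consequence of taking the maximum is that an input is only ruled-in overall when every individual implausibility is below −3.

```python
def overall_implausibility(individual):
    """Overall implausibility: the largest of the per-output values."""
    return max(individual)


# An input that is ruled-in for one output (-4.2) but still uncertain
# for the other (1.1) remains uncertain overall.
example = overall_implausibility([-4.2, 1.1])
```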
In the final wave, when a final decision must be made (or a set of final candidate values must be presented to a practitioner), it does not seem reasonable to allow the choice of any 'non-implausible' values if they still have fairly large implausibility, but not quite as large as three. Therefore, in the final wave, we constrain our final set of candidate values to be any 'tenable' values - that is, any input values where the implausibility is less than 0. A more conservative choice would be to only consider values with an implausibility less than −3 (i.e. those ruled-in); in our case, however, we find this to be too strong a requirement, as none of our candidate buildings end up ruled-in.
Given an emulator, this methodology is exceedingly easy to implement, and is conceptually straightforward - we rule out values that are obviously not in the level set, and rule in those that are obviously in the level set; all others can be investigated further.
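The overall procedure can be sketched as a loop over waves. This is an illustrative outline rather than the implementation used in this work: `simulate` and `fit_emulator` are hypothetical callables supplied by the user (the latter returning a function that maps an input to a predictive mean and variance), and for simplicity every wave, including the first, draws its simulation inputs at random from the remaining candidates.

```python
import random


def run_waves(candidates, simulate, fit_emulator, threshold,
              n_waves=3, n_new=80, n_reps=5, cutoff=3.0, seed=0):
    """History Matching inspired level set estimation over a finite
    candidate set. Returns the last implausibility assigned to each
    candidate and the indices still neither ruled out nor ruled in."""
    rng = random.Random(seed)
    latest = [None] * len(candidates)
    active = list(range(len(candidates)))
    data = []  # (candidate index, simulated output) pairs
    for _ in range(n_waves):
        # Simulate a random selection of active candidates, with replicates.
        for i in rng.sample(active, min(n_new, len(active))):
            data.extend((i, simulate(candidates[i])) for _ in range(n_reps))
        predict = fit_emulator([(candidates[i], y) for i, y in data])
        still_active = []
        for i in active:
            mean, variance = predict(candidates[i])
            latest[i] = (mean - threshold) / variance ** 0.5
            if abs(latest[i]) <= cutoff:
                still_active.append(i)
        active = still_active
        # Keep only simulations whose inputs are still under consideration.
        keep = set(active)
        data = [(i, y) for i, y in data if i in keep]
        if not active:
            break
    return latest, active


def tenable(latest):
    """Indices of candidates whose last implausibility is below zero."""
    return [i for i, value in enumerate(latest) if value is not None and value < 0]
```

Once a candidate leaves the active list, its implausibility is never recomputed, so its final value is simply the last one it was assigned.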
In the next section, we shall apply this methodology to the two building criteria described previously.

Results
We start by initialising a set of 1000000 possible buildings - these are chosen by constructing two random Latin hypercubes of size 500000, one for each value of the binary input variable. The goal is to reduce this huge number of potential buildings that could be built, to a more manageable subset of 'future-proofed' buildings.
In the first wave, we fit the two emulators using an initial data set of size 400: 80 unique x input points chosen by a sliced Latin hypercube design, each replicated 5 times. We then calculate $I_{oh}(x)$ and $I_{eu}(x)$ (that is, the implausibilities for the overheating risk emulator and the energy usage emulator) for 1000 new x locations, also chosen by a sliced Latin hypercube design.
As a comment, we are interested in the values of the overheating risk and the average energy usage, not the raw outputs of the simulator. Therefore, in calculating $I_{oh}(x)$ via Equation 4, y(x) is replaced with the logit overheating risk. Similarly, y(x) is replaced with the mean energy usage in the calculation of $I_{eu}(x)$. Because the logit overheating risk is used, rather than the overheating risk itself, we also modify the value of L, the target threshold, to be logit(0.01) rather than just 0.01. The logit overheating risk is the original output of the latent Gaussian process emulator, and is also unbounded. Using the overheating risk itself could also be done, although that quantity is bounded between 0 and 1.
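As a small worked example of the transformed threshold, the sketch below computes the logit of the 1% overheating target; the function and variable names are chosen here for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Threshold on the logit scale for a 1% overheating risk.
L_OH = logit(0.01)
```

Here `L_OH` is approximately −4.595, so on the logit scale the overheating emulator's expectation is compared with this value rather than with 0.01.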
The emulators used are not fast enough to make predictions for 1000000 possible buildings in a short period of time. It is for this reason that the implausibilities for only 1000 buildings are explicitly calculated. Obtaining implausibilities for the larger 1000000 set of candidate buildings is done by interpolation, using standard, deterministic Gaussian processes. These interpolators are much faster than the initial Gaussian process based emulators, because the implausibilities that are interpolated are standard continuous variables. It is because the overheating value is binary, and because the energy usage is stochastic and heteroscedastic, that complex (and thus slower) emulators were needed. With the Gaussian process interpolators, we simply use the mean predictions, ignoring any epistemic uncertainty. Further work could be done incorporating this epistemic uncertainty surrounding the interpolated implausibilities. This interpolation of implausibilities when the emulator is complex serves as an interesting avenue for future research on History Matching in general.
With these two sets of implausibilities, $I_{oh}(x)$ and $I_{eu}(x)$, the overall implausibility can be calculated as $I(x) = \max(I_{oh}(x), I_{eu}(x))$. Any building x where the overall implausibility is greater than 3 is ruled-out (and if any were to be less than −3, those would be ruled-in). A random selection of 80 NROY/NRIY buildings is then chosen, and each is simulated 5 times. This data set, along with any of the older simulated data which is also NROY/NRIY, makes up the newer simulated data set, and the process can be repeated. A key computational attribute here is that once a building is ruled out (or ruled in), it no longer needs to be checked - it has already been ruled out, and its final implausibility value is the last one it was assigned.
We performed three waves of this History Matching inspired level set estimation. In wave 1, the NROY space was reduced to 25.03% of the total input space (i.e. 25.03% of all the initial candidate buildings were non-implausibly future-proof) and 0% of the space was found with I(x) < 0 (i.e. none of the initial candidate buildings were yet tenably future-proof). By wave 2, the NROY space was shrunk further down to 12.23% of the total space and 0.59% of the space was tenable. By the third wave, 10.20% was NROY, and 1.43% of all buildings were tenably future-proof. Any one of these tenable buildings could be recommended in good faith.
To visualise the types of buildings which are most future-proof, we make use of standard (in the History Matching literature) minimum implausibility and optical depth plots (Andrianakis et al., 2017). For every combination of two input variables, a 2D grid is made. Every candidate building is then sorted into the relevant grid cell for the 2D combination of input variables. Minimum implausibility plots present the minimum implausibility of any building within each grid cell, and the optical depth plots present the proportion of buildings that are NROY within each grid cell. These plots then provide information about the shape of the NROY space and the implausibility in this 2D projection. This can then be repeated for all 2D input variable combinations. Of course a preferred option would be to just visualise the entire 8D space, but this is clearly infeasible - the minimum implausibility plots and optical depth plots provide a good alternative. For the binary variable, where there can only be two horizontal grid cells (or vertical, depending on which side the binary variable is on), we have separated the two factors with a dividing line, to increase clarity.
Figure 3 presents the results from wave 3. As a reminder, $x_1$ is wall insulation thickness, $x_2$ is roof insulation thickness, $x_3$ is ground insulation thickness, $x_4$ is window size, $x_5$ is overhang size, $x_6$ is window opening amount, $x_7$ is roof emissivity, and $x_8$ is whether windows are double or triple glazed. From the figure, we can see that small values of $x_2$ and $x_3$ and large values of $x_4$ are all poor choices for a building. There are also key 2D interactions that can be observed here - one example is that between $x_4$ and $x_5$, where large values of $x_4$ and small values of $x_5$ are particularly poor. Another interesting interaction is between $x_2$ and $x_3$. These plots can reveal many interesting 2D relationships, but it is important to realise that these are only 2D projections, with the full 8D space being much more complicated. It is thus not recommended to use these plots to choose the specific building; rather, one should choose one from the found tenable set - these plots, however, can provide an intuitive understanding of the general patterns.
Further waves could be done from here - the non-implausible space and tenable space did change between wave 2 and wave 3; but these changes were sufficiently small such that we found it acceptable to stop here. For the minimum implausibility, the scale is capped above by 3 (above this all buildings are implausible), and below by 0 (below this all buildings are tenable). For the optical depth, the scale is on the log scale, and goes from 0 (i.e. all buildings are non-implausible) down to -10 or lower (i.e. less than 0.0045% of buildings are non-implausible).
For a final choice of building, one could consider any of the tenable set, and we leave such a choice down to a practitioner. Secondary (or in this case, tertiary) criteria often exist. For example, a practitioner might choose the tenably 'future-proof' building which has the largest windows.
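The two grid summaries described above can be computed as in the following sketch for one pair of input dimensions. The names and the number of grid cells are assumptions of this sketch, as is the choice to count a building as NROY whenever its implausibility does not exceed the cutoff. Inputs are assumed to have been rescaled to lie between 0 and 1.

```python
def pairwise_summaries(points, implausibilities, dim_a, dim_b,
                       n_cells=10, cutoff=3.0):
    """Minimum implausibility and NROY proportion in each cell of a 2D
    grid over input dimensions dim_a and dim_b."""
    min_imp, counts, nroy = {}, {}, {}
    for x, imp in zip(points, implausibilities):
        # Grid cell containing this point; an input of exactly 1 is
        # placed in the last cell.
        cell = (min(int(x[dim_a] * n_cells), n_cells - 1),
                min(int(x[dim_b] * n_cells), n_cells - 1))
        min_imp[cell] = min(imp, min_imp.get(cell, float("inf")))
        counts[cell] = counts.get(cell, 0) + 1
        if imp <= cutoff:
            nroy[cell] = nroy.get(cell, 0) + 1
    optical_depth = {c: nroy.get(c, 0) / counts[c] for c in counts}
    return min_imp, optical_depth
```

Repeating this for every pair of input dimensions gives the full matrix of panels; the proportions can then be plotted on a log scale.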

Conclusion
To conclude, we presented a modification to History Matching to deal with the problem of level set estimation. The methodology is intuitive and straightforward; easily applied to different emulators; automatically outputs whether the level set is empty; and because the procedure discards "implausible" values rather than searching for probable values, it can also be robust to high input dimensions and small simulation budgets.
We also presented a case study - applying this methodology to a difficult building performance simulation problem, where the level set methodology is still easily applied. After just three waves of this methodology, 89.80% of the input space was discarded as implausible. Additionally, after only 3 waves, 1.43% of the input space was found as tenably within the level set.
Within this article, we have ignored the notion of 'model discrepancy', where one acknowledges that the simulator is not perfect and is itself flawed. It is straightforward and common to add an additive, constant, zero mean measure of this discrepancy in History Matching (Vernon et al., 2014; Andrianakis et al., 2015), by simply replacing the variance term in the implausibility equation, V(y(x)), with $V(y(x)) + V_{MD}$, where $V_{MD}$ represents the subjective uncertainty around what the difference between the simulated quantity and the real world quantity could be. If the discrepancy is not believed to be constant, or additive, or zero mean (all possible within a level set estimation procedure), then more must be done. Model discrepancy is a key open question when it comes to any form of simulator analysis, an open question when it comes to History Matching (see Goldstein and Rougier, 2009), and most certainly an open question when it comes to History Matching derivatives, such as this level set methodology, or History Matching inspired optimisation.
This article makes reference several times to the ease and intuition of applying History Matching, and the History Matching inspired level set estimation methodology. This does not, however, mean that constructing emulators is always easy. Often (and indeed within this article), constructing an emulator can require careful assessments of potential assumptions. Any subsequent analysis (be that level set estimation, prediction, optimisation, calibration, etc.) depends on this careful emulator construction, lest any conclusions be invalid.
Overall, we believe that emulation, and indeed the described level set methodology, are useful tools in extracting value from a simulator. We also believe that the ideas and techniques discussed herein prove to be useful for the field of building performance simulation.
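Written out, the substitution described above for an additive, constant, zero mean discrepancy gives the following form of the implausibility. It is shown only to make the replacement explicit, with $E(y(x))$, $V(y(x))$ and the threshold $L$ as used in the implausibility metric earlier.

```latex
\[
I(x) = \frac{E(y(x)) - L}{\sqrt{V(y(x)) + V_{MD}}}
\]
```

Since $V_{MD} \geq 0$, including it can only shrink the magnitude of the implausibility, so fewer inputs are ruled out or ruled in at any given wave.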

Figure 1: The geometry of the modelled building.

Figure 2: An illustration of the 4 regions that emerge from the History Matching Level Set Estimation technique.

Figure 3: Minimum implausibility plots (below and left of diagonal) and optical depth plots (above and right of diagonal) for wave 3. For the minimum implausibility, the scale is capped above by 3 (above this all buildings are implausible), and below by 0 (below this all buildings are tenable). For the optical depth, the scale is on the log scale, and goes from 0 (i.e. all buildings are non-implausible) down to -10 or lower (i.e. less than 0.0045% of buildings are non-implausible).