Building stock energy modeling: Feasibility study on selection of important input parameters using stepwise regression

Building energy assessment is essential to meeting the sustainable energy targets of new and existing buildings. Assessing existing buildings with energy models is the most prominent way to plan their retrofitting. Studies have revealed that information about building stocks is still missing, which affects the evaluation of building energy efficiency policies. The literature also suggests that existing energy models are too complex and too unreliable to predict energy use. The reliability of such energy models would improve through a better alignment of the input parameters and the model assumptions. The authors hypothesized that the reliability of models would be improved by identifying the most relevant energy use parameters for the building stocks in different regions and models. One of the most commonly accepted methods for detecting the most dominant input parameters is sensitivity analysis, though its shortcomings include the need for a large number of data samples and long computing time. In this research, the Energy, Carbon, and Cost Assessment for Building Stocks (ECCABS) model is adopted to identify its most important input parameters. The research team uses a version of the model that has been validated by studies of the UK building stock. The feasibility of using stepwise regression to identify the significant input parameters is then assessed and discussed. Results show that the stepwise regression method could produce the same results as sensitivity analysis. This paper also indicates that stepwise regression is considerably faster and less computationally intensive than common sensitivity analysis methods.


| INTRODUCTION
Existing building stock energy models approximate the baseline energy use of building stocks in order to project future demand. Buildings are usually built for long-term use, which amplifies the damaging impacts they exert on the environment. 1 Most software packages are developed as tools for the preliminary design stages. 2 Consequently, detailed building performance assessment with these tools tends to take place only after the final design or the completed construction. The ability to rely solely on natural ventilation for cooling is limited by the climate and thermal loads. 3 The energy models depend on the availability of building data and are not verified or tested for their relevance and reliability. 4 Thus, there is a gap between reality and the model assumptions. Most building stock energy models are bottom-up engineering models and require assumptions about several conditions. 4 The accuracy of building models relies heavily on the quality of information and the selection of the input parameters. 5 Sensitivity analysis is one of the commonly accepted techniques for identifying the relevant parameters and components of a model. 6 Its popularity stems from its reliability and ease of understanding, which are clearly described in the literature. 4 However, the process often forces modelers to make assumptions because of missing information, and it requires high processing capacity to run the models. 7 Its long run times and the complexity of its outputs are further drawbacks.
The objective of this research, and its contribution to the body of knowledge, is to assess the feasibility of applying stepwise regression instead of sensitivity analysis in building stock energy modeling. Doing so speeds up the models and encourages their use. For this reason, the authors have adopted the existing Energy, Carbon, and Cost Assessment for Building Stocks (ECCABS) model, which was previously applied to simulate the UK building stock with 192 samples of residential buildings. 8 The authors selected ECCABS because it is a successfully tested model and because it assesses the effects of energy saving measures (ESMs) on the building stock. The model is a bottom-up engineering model, which means that the energy demand of a sample of individual buildings is calculated from the physical properties of the buildings and their energy use (eg, for lighting, appliances, and water heating), and the results are scaled up to represent the building stock of the region studied. The ECCABS model is designed to have limited complexity and limited inputs, which facilitates data extraction from available databases and keeps calculation times short. The primary outputs of the model are the net energy demand by end use, the energy delivered to the buildings, CO2 emissions, and the costs associated with the implementation of ESMs. 9 The model treats the sample buildings as representative of the region being evaluated. The paper uses the data from the model and examines the feasibility of identifying the important input parameters. This helps to reduce the unwanted or irrelevant input information that increases the complexity of the model and the simulation time. Reducing the number of parameters also removes the need for unavailable information and increases data quality.
Both techniques (ie, sensitivity analysis and stepwise regression) are tested with a restricted number of parameters, and the results are tabulated, with a justification of the better technique, in the following parts of the paper.
In summary, the aim of this paper is to compare the input parameters identified by the sensitivity analysis (ECCABS model) with those identified by the stepwise regression. It also discusses the advantages of the latter over the former. Specific objectives are categorized into:

| RELATED WORKS
This part includes three subsections that review in more detail the existing energy models, the use of sensitivity analysis on energy models and its drawbacks, and the use of stepwise regression and its advantages. Finally, the gap in prior studies is presented, along with how it drives the research methodology of this paper.

| Energy models
For energy model calculations, various approaches, including building energy simulation, statistical analysis based on calculations, and experimentally measured data, can be used to analyze the energy use of buildings. 10 In the initial energy modeling approach, forward energy simulations are applied, which can be classified as the degree-day method, the bin method, or hourly computer simulation tools, to predict annual energy consumption and costs. 11 Kavgic et al 4 compared different assessments of previous models that were developed for the building stock in the UK. In such models, heat balance and empirical relationships were used to calculate the yearly or monthly energy use of individual buildings. Table 1 summarizes various energy models developed by different organizations in the UK.
The above-mentioned models differ in their level of disaggregation. DECarb is highly disaggregated: its data set is arranged into 8064 unique combinations for 6 time periods. 13 In UKDCM, buildings are classified by climate zone, age, type of construction, number of floors, tenancy, and construction method. 14 In BREHOMES, the housing stock is divided into more than 1000 categories, defined by built form, age of construction, tenancy, and ownership of central heating. 5,15 In CDEM, the yearly energy use of only 47 house archetypes is combined, with the archetypes derived from unique combinations of built form and dwelling age. 16 On the other hand, in the Johnston model, only two "notional" dwelling types (pre- and post-1995) have been developed. 17 All the above-mentioned models need some assumptions, based on the available supporting data, both where direct data are absent and in the application of input values.
Building performance simulation (BPS) provides effective resources to support informed design decisions for energy-efficient buildings. 18 A variety of approaches, such as yearly building energy modeling, spreadsheet calculations, and statistical analysis based on empirical data, can be employed to perform whole-building energy analysis. 10

| Statistics and energy models
Numerical investigation is valuable for extracting data on future climate projections, natural disasters, and many other identifiable factors. A framework is needed to interpret explanations from actual or digital data. 19 Many researchers use quantitative numerical methods to validate their hypotheses against their data. Modelers who run their programs on different software should not expect to obtain the same results. The variations in results are due to different modeling capabilities, different time steps, disparate calculation algorithms, and many other assumptions made during the calculation steps. 12 Statistical models are crucial for forecasting and detecting various real-time issues. Moreover, including statistical analysis when reporting results can increase the credibility of the findings. 12 The following sections explain both the sensitivity and stepwise techniques and their pros and cons when applied to energy calculations of buildings.

| Sensitivity analysis
Sensitivity analysis methods have been successfully applied in various fields, including complex engineering systems, economics, physics, and many others. 20 Sensitivity analysis is a helpful tool for identifying critical control points, prioritizing data collection or research schemes, and validating a model. 20 Lomas et al 21 verified three sensitivity analysis tools with building energy modeling software. They stated that the Monte Carlo method is realistic, although its drawback is that it only defines the total uncertainties. Lam et al 22 showed that it is still unclear to researchers how simulation models are repeated iteratively.
Sensitivity analyses have been employed in the Integrated Method to Assess Greenhouse Effect (IMAGE) by using the Weismann and Morris methods. In this approach, the significant input parameters selected by the two methods are compared. 6 Sensitivity analysis approaches need a huge number of iterative runs. Hence, it is essential to automate the procedure by creating the building energy models with different combinations of inputs. Existing literature classifies sensitivity analysis methods into three categories. The first group of methods is based on mathematical approaches and usually calculates outputs for a few values of an input within its possible range. Most methods in this group use the one-parameter-at-a-time (OAT) approach. The second group of methods is based on statistical analysis. In this group, one or more input parameters may vary at a time. One of the benefits of the statistical methods is the ability to quantify the effect of simultaneous interactions among multiple inputs. 36 Examples of statistical methods include linear regression analysis (RA), the analysis of variance (ANOVA), the Fourier amplitude sensitivity test (FAST), the mutual information index (MII), the standardized regression coefficient (SRC), and the standardized rank regression coefficient (SRRC). The third group of methods is based on graphical illustrations. These methods use graphs, charts, or surfaces of inputs vs corresponding outputs. This paper investigates a statistical approach to sensitivity analysis and can therefore be classified into the second group described above. The proposed method, stepwise regression, does not need to vary one or more parameters at a time. Instead, it considers all input parameters and ranks them by their degree of importance.
According to Tian, 23 a number of steps remain the same across most sensitivity analysis techniques. Examples include determining input variations, creating and running energy models, collecting results and performing the sensitivity analysis, and finally presenting the results. 24 The applications differ in how the input factors are varied for various research purposes, which is often ignored in the field of building analysis. 23 As mentioned, sensitivity analysis of building energy models requires costly computational procedures, which may be the most important obstacle to adopting a number of sensitivity analysis methods such as variance-based methods. 37 Existing literature documents PEAR, PCC, and SRC as sensitivity analysis alternatives, but according to 37 these choices are a compromise between accuracy and computational cost. In addition, 38 states that rank transformation sensitivity indices may modify the model under study and may produce results that are not reliable. A number of studies (eg, 39,40 ) have focused on the Morris method as a computationally cheap method. Their findings indicate that this method is not reliable for ranking input variables by their level of importance; rather, it can be used to identify unimportant variables of an energy model.
Almost all sensitivity analysis methods have several issues, chief among them that they predominantly explore only the neighborhood of a base case. 25,26 Another important issue is that the relationships and interactions among the input parameters are not maintained throughout the simulation runs. 23

| Stepwise regression
Regression methods are global sensitivity analysis approaches, among which stepwise regression is used as a standard technique. In this technique, factors are selected based on their significance level, so the selection methodology rests on statistical techniques. 23 As stated by Helton et al, 29 the stepwise regression model is first built with the most important variable and then extended by adding the next most important variable. This means that the sensitivity analysis does not require all of the uncertain variables to be included in the regression model.
Dhar et al 30 used the stepwise regression technique to detect the most relevant factors for a temperature-based and generalized Fourier series for energy purposes. Dhar et al 31 describe the Fourier series methods developed for modeling energy use in commercial buildings. Their study was not focused on finding the most important parameters in building stock energy models. Later, Katipamula 31 determined the strength of the variables in three different buildings with a stepwise regression, executed for dual-duct constant volume (DDCV) systems. Moreover, they compared two models based on the variables selected through such regression analysis. Katipamula 32 focused on building multiple linear regression baseline models that could be used to detect deviations in energy consumption resulting from major operational changes, and therefore did not aim to find the most important input parameters for building stock models. It is worth noting that this regression can influence either the design objectives or the constraints when optimized models are used. According to Helton et al, 29 the forward stepwise method is the most widely used. In this scheme, the most important factor enters the model first, followed by the next most important one, and the process repeats until no significant variables are left according to the statistical test. 23 Existing literature has not examined the possibility of using the stepwise technique in a building stock model, which calculates the energy consumption of whole buildings in an area. Identifying the vital inputs is crucial when data on a number of input parameters are missing during the creation of a building stock energy model. Thus, decreasing the number of inputs supports the work of collecting and disseminating data for analysis. 28

| RESEARCH METHODOLOGY
While sensitivity analysis provides robust and informative results, it requires a huge number of iterative runs that are time-consuming and need high-speed processors. By referring to the outputs of the sensitivity analysis and the stepwise regression, this paper analyzes the benefits of each method over the other. Although other data-driven approaches could be used, the regression modeling approach is simple to develop and easy to use compared to many other approaches, including detailed hourly simulations of energy use in buildings. 31 The methodology for identifying the relevant input parameters through a reliable technique is summarized in Figure 1 and explained below.

| Normality distribution and Box-Cox transformation
Data used in this study are derived from the baseline model created in ECCABS. For each of the 192 sample buildings, the ECCABS model calculates the energy consumption (response variable) based on the inserted input parameters (predictors). Such set of raw data was formatted as a long list of observations including predictors and responses. The created data set was used to perform the stepwise regression approach.
A normality test is performed to determine whether the data are normally distributed, which is required to execute the stepwise regression. Prior to performing the regression, the distribution of the response variable (net energy use) is checked. 28 The Box-Cox transformation is a typical technique used to normalize the data for more robust results. After ensuring that the response variable is normally distributed, the predictors must be arranged for stepwise regression. Some predictors might be strongly correlated, and some may be constant for all sample buildings; such predictors do not enter the regression.
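The predictor screening described above could be sketched as follows. This is only a minimal illustration, not the procedure used by the authors: the function names, the 0.95 correlation threshold, and the toy values are all assumptions made for the example.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def screen_predictors(columns, corr_limit=0.95):
    """Split predictors into kept names and dropped names with a reason.

    columns: dict of predictor name -> list of values (one per sample building).
    corr_limit is an assumed threshold, not a value taken from the paper.
    """
    kept, dropped = [], {}
    for name, values in columns.items():
        if pstdev(values) == 0:
            dropped[name] = "constant"
            continue
        # Compare against predictors already kept; drop near-duplicates.
        twin = next((k for k in kept
                     if abs(pearson(values, columns[k])) >= corr_limit), None)
        if twin is not None:
            dropped[name] = f"correlated with {twin}"
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical toy data: lighting energy = 2.5 * floor area (structural link).
columns = {
    "floor_area": [80.0, 120.0, 95.0, 150.0, 60.0],
    "lighting_energy": [200.0, 300.0, 237.5, 375.0, 150.0],
    "u_value": [1.8, 0.9, 1.2, 1.5, 0.6],
    "occupancy_schedule": [1.0, 1.0, 1.0, 1.0, 1.0],
}
kept, dropped = screen_predictors(columns)
print(kept)
print(dropped)
```

Which member of a correlated pair is kept depends here only on column order; in practice that choice would be made on modeling grounds.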
As mentioned, a common assumption used for building regression models is that the data originate from a normal distribution. 41 If the variables are not normally distributed, they may need to be transformed so that this assumption is approximately satisfied. 34 Transforming the independent and dependent variables helps to create the simplest possible regression model. 34 A Box-Cox normality plot can often be used to find a transformation that will approximately normalize the data. The one-parameter Box-Cox transformation from $y$ to $y^{(\lambda)}$ for positive variables is defined as follows:

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} $$

| Performing stepwise regression
Stepwise regression is used to examine the importance of the input parameters. It can be performed either with the forward method, which starts with a single independent input variable and adds more variables until a certain degree of accuracy is achieved, or with the backward method, which starts with all independent input variables and removes the insignificant ones. 33 This paper uses the forward selection method to identify the most significant variables.

| Validation of the regression pattern and analyzing the results
Once the data are examined for normality, normalized using Box-Cox, and analyzed with stepwise regression, the results are checked for accuracy. Validating a regression model determines whether the obtained results adequately explain the data. The validation can be done by testing the goodness of fit of the regression curve and by studying the regression residuals. 32

| Comparing results
After validation, the results are compared to those obtained from the sensitivity analysis in the ECCABS model, and the advantages of stepwise regression over sensitivity analysis are discussed.
The following sections explain the data collection methodology, the building types considered, and the results from both sensitivity analysis and stepwise regression. Finally, conclusions are presented on the robustness of the different methods and their applicability based on the data type. As a further step, the buildings are divided into two groups (newer and older), and the stepwise regression is repeated on each group to see which parameters are more important for each.

| ANALYSIS AND REGRESSION WORK
The aim of the sensitivity analysis is to examine the changes in the output variables of the model in response to minor changes in the inputs. 16 Sensitivity analysis is a helpful tool for determining which variables have the largest effects on the output variables of the model. 42 This step may take a long time if meta-model methods are used with a large number of input factors. 15,22,26,27 According to Firth et al, 16 sensitivity analysis comprises the following steps:

1. Input parameters are assigned set values ($k_j$).
2. Each input parameter is changed by a small amount $\Delta k_j$ (that is, a ±1% change in the input parameter) while the other input parameters are kept constant.
3. For each change in the input parameters, the model is run again.
4. The new output variables are used to calculate the sensitivity coefficients, which are then normalized.

[Figure 1: methodology flowchart. Steps recoverable from the extraction: Box-Cox Transformation, Stepwise Regression, Validation and Analysis, Compare Results]
The sensitivity coefficients characterize the partial derivatives of the output variables with respect to the input parameters. For a model with n output variables and m input parameters, these coefficients are given by the following equation:

$$ \frac{\partial y_i}{\partial k_j} \approx \frac{y_i(k_j + \Delta k_j) - y_i(k_j - \Delta k_j)}{2\,\Delta k_j} $$

where $y_i$ is the ith output variable, $k_j$ is the jth input parameter (i ranges from 1 to n, and j ranges from 1 to m), $\partial y_i / \partial k_j$ is the sensitivity coefficient for output variable $y_i$ and input parameter $k_j$, $y_i(k_j + \Delta k_j)$ is the value of $y_i$ when the input parameter $k_j$ is increased by $\Delta k_j$, and $y_i(k_j - \Delta k_j)$ is its value when $k_j$ is decreased by $\Delta k_j$.
To be able to compare the sensitivity coefficients of input parameters with different units, normalized coefficients must be defined:

$$ S_{i,j} = \frac{\partial y_i}{\partial k_j}\,\frac{k_j}{y_i} $$

Iterative stepwise regression is a step-by-step procedure that automatically selects the independent input variables. As mentioned, since this paper focuses on the most significant variables, the forward method is adopted. The model has been validated satisfactorily: the modeling results agree with the statistical data (3% deviation).
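The one-at-a-time sensitivity procedure attributed to Firth et al could be sketched as below. This is a hedged illustration rather than the ECCABS implementation: it assumes a central difference with a ±1% perturbation (which matches the two runs per input mentioned in the paper), it normalizes each coefficient by $k_j / y_i$, and `toy_heating_demand` is a hypothetical stand-in for a real energy model.

```python
def normalized_sensitivity(model, base_inputs, rel_step=0.01):
    """Return {input name: normalized sensitivity coefficient}.

    Each input is perturbed by +/- rel_step (assumed 1%) while the others
    are held at their base values, so the model is run twice per input.
    Inputs with a base value of zero are not supported by this sketch.
    """
    y0 = model(base_inputs)
    coeffs = {}
    for name, k in base_inputs.items():
        dk = rel_step * k
        up = dict(base_inputs, **{name: k + dk})
        down = dict(base_inputs, **{name: k - dk})
        dy_dk = (model(up) - model(down)) / (2.0 * dk)  # central difference
        coeffs[name] = dy_dk * k / y0                   # dimensionless
    return coeffs

def toy_heating_demand(p):
    """Hypothetical model: steady-state heat loss through the envelope (W)."""
    return p["u_value"] * p["envelope_area"] * (p["indoor_temp"] - p["outdoor_temp"])

base = {"u_value": 1.2, "envelope_area": 250.0,
        "indoor_temp": 20.0, "outdoor_temp": 5.0}
S = normalized_sensitivity(toy_heating_demand, base)
for name, value in S.items():
    print(f"{name}: {value:+.3f}")
```

A normalized coefficient of 1 means that a 1% change in the input changes the output by about 1%, and a negative value means the output falls as the input rises.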

| RESULTS AND ANALYSIS
The inputs to the ECCABS model are listed in Table 2. The table also shows the results of the sensitivity analysis done by Arababadi 8 based on the initial values set for the UK building stock. The results in Table 2 show that the largest $S_{i,j}$ values all relate to input parameters that influence energy consumption for space heating. The indoor air temperature has the highest sensitivity coefficient (1.63 in the residential sector): a 1% increase in the indoor air temperature increases the energy consumption of the buildings by 1.63%. The next most important inputs are the external surface and the average U-value of the buildings. Negative sensitivity values mean that an increase in those parameters reduces net energy consumption.

| Stepwise regression
The statistical investigation begins with a normality test on the response variable (total energy use of the sample buildings). This paper uses a normal probability plot to interpret the data and to find outliers. Figure 2 displays the test results. The deviations from the straight line indicate departures from normality, and the P-value for these data is smaller than the acceptable range (.005). A small P-value indicates that the hypothesis under consideration may not sufficiently explain the observations, so the null hypothesis is rejected. Therefore, the response variable, the net energy use of the sample buildings, is not normally distributed.
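The idea behind reading a normal probability plot can be illustrated numerically. The sketch below computes the correlation between the sorted data and theoretical normal quantiles, which is a simple straightness measure for the plot. It is an assumed stand-in for illustration only: it does not produce the P-value reported in the paper, and the plotting-position formula (Blom's) and the synthetic samples are choices made here.

```python
import math
from statistics import NormalDist, mean

def probability_plot_correlation(sample):
    """Correlation between sorted data and standard normal quantiles.

    Values close to 1 mean the points lie close to a straight line on a
    normal probability plot; lower values indicate departure from normality.
    """
    n = len(sample)
    ordered = sorted(sample)
    # Blom plotting positions (i - 0.375) / (n + 0.25): an assumed choice.
    quantiles = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    mx, mq = mean(ordered), mean(quantiles)
    cov = sum((x - mx) * (q - mq) for x, q in zip(ordered, quantiles))
    sx = sum((x - mx) ** 2 for x in ordered) ** 0.5
    sq = sum((q - mq) ** 2 for q in quantiles) ** 0.5
    return cov / (sx * sq)

# Synthetic samples: one built from normal quantiles, one right-skewed.
n = 50
z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
symmetric = [100.0 + 15.0 * v for v in z]
skewed = [math.exp(v) for v in z]

r_symmetric = probability_plot_correlation(symmetric)
r_skewed = probability_plot_correlation(skewed)
print(round(r_symmetric, 4), round(r_skewed, 4))
```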
As mentioned, data transformation, and the Box-Cox transformation in particular, is selected as the technique to make the data normal. 34 However, the Box-Cox transformation does not guarantee the normality of the results. 34 Hence, it is necessary to check whether the transformed data are normally distributed by using a probability plot. Figure 3 shows the normality test done on the transformed data. Comparing Figures 2 and 3 shows a substantial improvement in the test result. The data set is still not perfectly normal, but with P-value = .015 the null hypothesis is not rejected. Thus, it is assumed that the data are normally distributed.
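A minimal sketch of the Box-Cox step follows, using the standard one-parameter form of the transform. The way λ is chosen here (maximizing the profile log-likelihood on a coarse grid from -2 to 2) is a common approach but is an assumption of this example, as are the function names and the toy data; the paper does not state how λ was selected.

```python
import math

def box_cox(y, lam):
    """One-parameter Box-Cox transform: (y**lam - 1) / lam, or ln(y) if lam == 0."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_log_likelihood(y, lam):
    """Log-likelihood of lam, assuming the transformed data are normal."""
    n = len(y)
    z = box_cox(y, lam)
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n
    return -0.5 * n * math.log(var_z) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(y, grid=None):
    """Pick the lam with the highest profile log-likelihood on a grid."""
    grid = grid or [i / 10 for i in range(-20, 21)]  # assumed search range
    return max(grid, key=lambda lam: profile_log_likelihood(y, lam))

# Toy data whose logarithms are evenly spaced, so lam = 0 (log) should win.
y = [math.exp(v) for v in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
lam = best_lambda(y)
transformed = box_cox(y, lam)
print(lam)
```

As the text notes, the transformed data should still be checked for normality afterwards, because the transformation does not guarantee it.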
In this paper, the forward selection technique is used. It involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until no addition improves the model. 43 Table 3 shows the results of the regression on the 192 sample buildings. The steps in this table represent the addition of new variables to the regression model until a reliable model is obtained. Note that some of the input parameters are not selected in the regression process because they were either strongly correlated or constant for all 192 samples. Correlated input parameters in building stock energy models are categorized as structural multicollinearity, which means that one input parameter is created from another input parameter. For example, one input parameter might be calculated by multiplying another input parameter by a constant.
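The forward selection loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' analysis: the comparison criterion is the gain in adjusted R² with an assumed minimum gain of 0.01 (statistical packages typically use a significance test instead), the data are synthetic, and all names are hypothetical. Predictors are assumed to have been screened for constant and collinear columns beforehand.

```python
import random

def ols_sse(columns, y):
    """Sum of squared errors of a least-squares fit with an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(p):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
               for r in range(n))

def adjusted_r2(sse, sst, n, k):
    """Adjusted R-squared for a model with k predictors and n samples."""
    return 1.0 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_select(predictors, y, min_gain=0.01):
    """Greedy forward selection; returns predictor names in order of entry."""
    n = len(y)
    y_mean = sum(y) / n
    sst = sum((v - y_mean) ** 2 for v in y)
    selected, best_score = [], 0.0  # intercept-only model scores 0
    remaining = list(predictors)
    while remaining:
        scores = {}
        for name in remaining:
            cols = [predictors[s] for s in selected + [name]]
            scores[name] = adjusted_r2(ols_sse(cols, y), sst, n, len(cols))
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_score < min_gain:
            break  # no remaining variable improves the model enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected

# Synthetic data: the response depends on two of the three predictors.
random.seed(0)
n = 60
predictors = {
    "indoor_temp": [random.uniform(18.0, 23.0) for _ in range(n)],
    "u_value": [random.uniform(0.3, 2.0) for _ in range(n)],
    "frame_coeff": [random.uniform(0.6, 0.9) for _ in range(n)],
}
y = [40.0 * t + 25.0 * u + random.gauss(0.0, 2.0)
     for t, u in zip(predictors["indoor_temp"], predictors["u_value"])]
selected = forward_select(predictors, y)
print(selected)
```

The order in which variables enter gives a ranking of importance, which is what the paper compares against the ranking by sensitivity coefficient.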

| Stepwise regression model validation
As the regression results show, the R-Sq (adj) is 96%, and the RMSE indicator is very small. A high R² alone cannot guarantee that the model is valid. Therefore, the best choice is to use the adjusted R², as it can notably improve the level of confidence in the validation process. 35 Figure 4 shows how the Box-Cox transformation helped to improve the fitted regression model by plotting the residuals both before and after the transformation. When a model is fitted to the data, the residuals estimate the random errors in the relationship between the explanatory variables and the response variable. Thus, if the residuals appear random, it can be concluded that the model fits the data correctly. Figure 4 shows that the fitted model improved after the Box-Cox transformation.
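The validation indicators named above can be computed from observed and fitted values as in the following sketch. The function name and the toy numbers are assumptions, and S is computed here as the standard error of the regression (residual degrees of freedom in the denominator), which is how the "RMSE (S)" figure is interpreted in this example.

```python
def fit_metrics(observed, predicted, n_predictors):
    """Return R-squared, adjusted R-squared, S, and the mean residual."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    sse = sum(r ** 2 for r in residuals)
    sst = sum((o - mean_obs) ** 2 for o in observed)
    dof = n - n_predictors - 1  # residual degrees of freedom
    r2 = 1.0 - sse / sst
    return {
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "s": (sse / dof) ** 0.5,
        "mean_residual": sum(residuals) / n,
    }

# Hypothetical observed and fitted values for a one-predictor model.
observed = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
predicted = [10.5, 11.5, 14.5, 15.5, 18.5, 19.5]
metrics = fit_metrics(observed, predicted, n_predictors=1)
print(metrics)
```

Unlike R², the adjusted value is penalized for each added predictor, and a mean residual near zero is consistent with residuals scattered randomly around zero.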
As Figure 4 shows, the data points are randomly distributed on the fitted plot around zero. Therefore, it can be assumed that the error terms have a mean equal to zero. At this stage, buildings are divided into two groups and the stepwise regression is done on each group. This division is done to see the more important parameters relative to newer and older buildings.

| Built before 1995 (Old Buildings)
In this section, the results of a stepwise regression on buildings built before 1995 are presented, and the residual plots are analyzed. In 1995, the energy policies that standardize the conservation of fuel and power, including those for heating and ventilation systems, were updated. The analyzed results are presented in Table 4.
As the results of stepwise regression for buildings built before 1995 show, the fitted model still has acceptable R-Sq (adj) and RMSE (S), but the residual plots (Figure 5) show that the previous model, which was fitted to the entire data set, was more accurate. This is because R², as a measure of model validity, increases when more variables are added to the model.

| Built after 1995 (Newer Buildings)
The same procedure is done for buildings built after 1995. Table 5 shows the results of stepwise regression for this group of buildings. Note that due to characteristics of data on this group of buildings, more input variables were unable to enter the regression because they are either strongly correlated or constant for all samples.
Residual plots for the model fitted to the buildings built after 1995 are illustrated in Figure 6. The residual behavior of this model is similar to that of the previous model (built before 1995), and the first model, which was fitted to the entire data set, still has better residual behavior.

| DISCUSSION
In Table 6, the most important input parameters found by the stepwise regression and by the sensitivity analysis are compared. Note that the most important input parameters found by sensitivity analysis are the input parameters with the highest normalized sensitivity coefficients. It is clear that both methods identified almost the same parameters in relatively the same order of importance.
TABLE 4 Results of stepwise regression for buildings before 1995
Stepwise regression in this paper produces relatively accurate results and is considerably faster. Obviously, computational times depend heavily on processor capabilities and may vary with computer system speed. Although it would be beneficial to report the computational times for stepwise regression and sensitivity analysis quantitatively, such data were not precisely recorded at the time of this research. However, it was observed that the calculation time is considerably reduced by adopting stepwise regression. To be more specific, performing the sensitivity analysis based on Firth et al 16 requires the energy model to be run 2 × N times (where N is the number of input parameters), and each run takes approximately 30 minutes (using a normal core i3 processor), whereas the stepwise regression could be performed with the same processing capacity in less than 2 minutes. Another advantage is that the analysis can be repeated for different groups of samples, as was done in this paper (buildings built before and after 1995). By performing the stepwise regression on different types of sample buildings, one can compare the most important input parameters for each group. Although stepwise regression has some significant benefits over sensitivity analysis, some drawbacks require more attention. For example, input parameters that are strongly correlated cannot enter the regression process. In building modeling, some input parameters are calculated by multiplying another input parameter by a constant. For example, lighting energy use and appliance energy use are calculated by multiplying the lighting and appliance energy density by the total floor area; thus, both lighting and appliance energy are correlated with total floor area. Additionally, input parameters that are identical for all buildings cannot enter the regression process. In building stock modeling, data gathering is time-consuming and labor intensive, and in many areas data on a number of input parameters may be missing. Therefore, it is common to assume the same values of some input parameters for all sample buildings.
For example, occupancy, lighting, and plug-load schedules are usually assumed to be identical for different clusters of residential buildings. These sorts of input parameters may not be selected in stepwise regression analysis.
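The run-time comparison earlier in this discussion can be made concrete with simple arithmetic. The sketch below takes the 2 × N model runs and the roughly 30 minutes per run from the text; the values of N in the loop are hypothetical, since the exact number of varied inputs is not restated here.

```python
def oat_runtime_hours(n_inputs, minutes_per_run=30.0, runs_per_input=2):
    """Total model run time for a one-at-a-time sensitivity analysis, in hours."""
    return runs_per_input * n_inputs * minutes_per_run / 60.0

# Hypothetical numbers of input parameters.
for n in (10, 20, 40):
    print(f"N = {n:2d}: about {oat_runtime_hours(n):.0f} h of model runs")
```

Set against the reported stepwise regression time of under 2 minutes, the cost of the sensitivity analysis grows linearly with the number of inputs.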

| CONCLUSIONS
This paper illustrates that stepwise regression is a feasible substitute for sensitivity analysis, provided that the input parameters are not strongly correlated and not identical for all samples. Since it is usually easy to identify the correlated and nonvarying parameters in building stock modeling, it might still be possible to use stepwise regression even when some variables are correlated. According to the research content, the following points can be concluded:
• Stepwise regression can help energy modelers save the time that is required to run the simulations for sensitivity analysis.
• Results can help developers identify the most important inputs for their models. They can also help identify those parameters that can be neglected and reduce the number of unwanted input parameters, in turn making simulation easier. Gathering the input data would be easier and less time-consuming when the number of input parameters is reduced. For instance, the results of this paper show that parameters such as the coefficient of solar transmission of the window or the frame coefficient of the window could be neglected.
As a suggestion for the future work, the same method can be used for the data from other countries to both compare the models to the ECCABS model for the UK and to see whether the important input parameters vary regionally.