Testing for knowledge: Application of machine learning techniques for prediction of flashover in a 1/5 scale ISO 13784‐1 enclosure

A machine learning algorithm was applied to predict the onset of flashover in archival experiments in a 1/5 scale ISO 13784‐1 enclosure constructed with sandwich panels. The experiments were originally performed to assess whether a small‐scale model could provide a better full‐scale correlation than the single burning item test. To predict the binary output, a regularized logistic regression model was chosen as the ML environment, for which lasso‐regression significantly reduced the variance at a negligible increase in bias. With the regularized model, it was possible to discern the predictive variables and determine the decision boundary. In addition, a methodology was put forward on how to use the decision boundary to update the learning algorithm iteratively. As a result, it was shown how a learning algorithm can be used to facilitate ongoing experimentation: at first as a crude guideline, and in later stages as an accurate prediction algorithm. It is foreseen that, by iteratively updating the algorithm, by compiling existing and new experiments in databases, and by applying fire safety knowledge, the final learned algorithm will be able to make accurate predictions for unseen samples and test conditions.


| INTRODUCTION
Fire classification of materials is a central element of safe building design. The classification of a product should, in principle, be based on its reaction to fire in a test that represents the end-use situation (often full-scale). However, as large-scale tests are often costly and labor intensive, a tendency exists to try to predict full-scale fire behavior based on small-scale testing. [1][2][3][4][5][6] In order to justify such a scaling methodology, a thorough understanding of the fire behavior is necessary. 7 While this is currently the case for many single burning items (SBIs), a knowledge gap persists for the interaction of a growing fire and combustible linings in an enclosure. Therefore, large-scale testing is still needed to accurately classify such materials.
The full-scale room corner test (RCT) 8 used to be the standard for classification of linings in a variety of countries. However, because it requires rather large samples, it was considered neither cost-efficient nor time-efficient, and it was therefore replaced with the new European intermediate-scale SBI 9 test. The concept of scaling down seems justified, as in 87% of the cases the full-scale fire growth behavior could be captured adequately by the intermediate-scale test. 10 Nevertheless, for materials such as sandwich panels, linear systems (eg, cabling and piping), and polycarbonate panels, the correlation proved to be less accurate. 10 New tests have been developed for some specimens, but the defining test for classification of sandwich panels remained the SBI test. As such, it is foreseen that unwanted situations could arise due to the possible misclassification of sandwich panels. The risk is further magnified as sandwich panels are frequently used as free-standing or frame-mounted construction elements rather than as linings. Free-standing or frame-mounted sandwich panels should be tested in conformance with the large-scale ISO 13784-1 11 standard to represent the end-use situation. Therefore, the correlation between the SBI test and the ISO 13784-1 can be further questioned. 12 For the aforementioned reasons, the dependency on large-scale testing remains for an accurate classification of sandwich panels. 13 Therefore, a new reduced-scale test is needed to provide the industry with an accurate and time- and cost-effective method for quality control and product development. To this end, recent work by Yoshioka et al 14 and Leisted 15 has explored reduced-scale alternatives. The work by Leisted 15 is especially relevant for this article, as it has highlighted that a tool that can both identify experimental configurations for their knowledge benefit and discern relevant parameters would significantly reduce research time and cost.
As a result, the tool would augment the possibility of a successful research outcome.
One possibility to develop a tool that can aid ongoing experimentation is the application of machine learning (ML), which has already proven its merit in many fields. A foremost advantage of an ML algorithm is its capability to learn by way of observation and experience, rather than by using rigid prescribed equations. This means that a relatively simple learning algorithm can produce different learned algorithms, which can be magnitudes more complex, for varying data sets and without the interference of the ML expert. This simplicity inherently means that the learning algorithm can analyze complex problems and large numbers of variables, whereas conventional techniques quickly get overwhelmed, either by the limitations of computational time or computational space, or because they are too complex to be understood by humans.
The currently proposed learning algorithm uses ML techniques to aid ongoing experimentation to derive a new intermediate-scale test procedure for sandwich panels with regard to the ISO 13784-1 standard. The regularized logistic regression model predicts flashover or no-flashover for a polyisocyanurate (PIR) sandwich panel exposed to various burning intensities within the physical confines of a 1/5 scale model of the standardized ISO 13784-1 enclosure.

| EXPERIMENTAL SETUP
The experimental data used for the ML algorithm are a selected subset of the complete data set produced by Leisted 15 in an attempt to develop a screening method for the ISO 13784-1 enclosure. In particular, Leisted 15 researched whether a scale model of the ISO 13784-1 would provide a better correlation than the SBI test with respect to sandwich panels. Toward this end, different experiments were conducted at 1/2 and 1/5 scale, but because only a few 1/2 scale experiments were available, the remainder of this article will focus on the 1/5 scale experiments.
| Geometrical scaling of the compartment and the gas burner

Figure 1A shows the geometry of the full-scale ISO 13784-1 enclosure, and Figure 1B the geometry of its scaled-down 1/5 model.

| Froude scaling of the gas burner HRR and gas burner duration
To ensure that a correlation would exist across the scales, the Froude scaling technique was used to scale down the size of the fire with respect to the geometric scaling of the enclosure. 15 In particular, the gas burner HRR and the gas burner duration were scaled with Equation (1) 16 and Equation (2), 16 respectively. Studies 17,18 showed that fires in commercial premises were often much larger than the prescribed 300 kW and had a longer duration than 30 minutes. Therefore, Leisted 15 added two gas burner regimes with a third burner intensity and two regimes with a continuous burner output to allow for more (severe) testing conditions. The burner regimes were denoted with three subsequent numbers, one for every time step, which can take the following values: 0 for 0 kW, 1 for 1.79 kW, 5 for 5.37 kW, and 10 for 10.74 kW. The scaled stepwise burner regimes are depicted in Figure 2A and the scaled continuous burner regimes in Figure 2B.
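The Froude scaling relations can be checked numerically. The sketch below assumes the standard relations Q ∝ L^(5/2) and t ∝ L^(1/2); the 100 kW and 600 kW full-scale intensities are back-calculated from the scaled values 1.79 kW and 10.74 kW and are shown for illustration only:

```python
# Froude scaling of the gas burner (Equations (1) and (2)):
#   Q_model = Q_full * S**(5/2)  and  t_model = t_full * S**(1/2),
# with S the geometric scale factor (S = 1/5 for the enclosure at hand).

def scale_hrr(q_full_kw: float, s: float) -> float:
    """Scale a full-scale HRR (kW) to model scale."""
    return q_full_kw * s ** 2.5

def scale_time(t_full_min: float, s: float) -> float:
    """Scale a full-scale duration (minutes) to model scale."""
    return t_full_min * s ** 0.5

S = 1 / 5
# Assumed full-scale intensities of 100, 300, and 600 kW reproduce (within
# rounding) the scaled burner intensities of 1.79, 5.37, and 10.74 kW.
for q_full in (100.0, 300.0, 600.0):
    print(f"{q_full:5.0f} kW -> {scale_hrr(q_full, S):5.2f} kW")
print(f"30 min -> {scale_time(30.0, S):.1f} min")
```

Note how the prescribed 300 kW full-scale burner maps directly onto the 5.37 kW intensity of the 1/5 scale regimes.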

| Experimental data from the 1/5 scale experiments
The sandwich panels were exposed in the 1/5 scale model to the aforementioned scaled burner regimes, and the specimen HRR was recorded with the oxygen consumption theory. The resulting HRR profiles are shown in Figure 3. As flashover is defined here by flames propagating outside the boundaries of the enclosure, it can also be deduced from the graph as a sudden spike in the HRR. It should be noted that at the onset of flashover, the burner was turned off regardless of the predefined burner regime. Figure 4A,B show graphs from a 1/5 scaled experiment with no-flashover and with flashover, respectively.

| THE ML ENVIRONMENT
The 14 experimental observations (M = 14) depicted in Figure 3 are a selected subset of all the 1/5 scale experiments performed by Leisted. 15 Initially, the presence of a joint in the specimen buildup and the burner location in the enclosure were also varied. These aspects were not considered for the ML analysis, as only a few data points were available. Furthermore, one training example was omitted due to poor burner mounting, which gave the burner a slight outwards angle, whereas the burner angle is considered vertical in the experimental setup.
The experimental observations used for the ML algorithm are summarized in Table A1, which in the remainder of the manuscript is referred to as the historical data set. The time to flashover t fo is listed as an informative feature for the reader but will not be used in the ML model, as it is not an a priori known variable. Capital letters are used to denote the output variable Y and the five (N = 5) input variables X j . The values of the variables are denoted with lowercase letters x i j and y i for every ith observation (1 ≤ i ≤ M) and jth input variable. The following list summarizes the input and output variables together with the boundaries defined by the historical data set:
• The output variable Y, with y i ∈ {0, 1} for, respectively, flashover and no-flashover.
F I G U R E 3 The specimen HRR profile when exposed to different gas burner regimes

| The decision boundary
Equation (3) shows the general form of f̂ for a first-order linear model.
Note that an extra feature, i.e. the bias unit X 0 with x i 0 = 1, is added to the feature set to accompany the estimated intercept regression coefficient θ̂ 0 , which allows the use of matrix notations.
Once the regression coefficients are estimated, Equation (4) can be used to predict the location ẑ i of the ith training example relative to f̂. If ẑ i = 0, the observation is situated on f̂; otherwise, the observation is situated either above or below f̂.
Substituting the feature values of an observation from Table A1 in Equation (4) will result in an output value ẑ i ∈ ℝ. As the output of interest is binary, i.e. flashover or no-flashover, the sigmoid function, Equation (5), is used to scale ẑ i to a value 0 < p̂ i < 1, as shown in Figure 6. It should be noted that other functions exist which have the same effect, but the logistic function is preferred due to its traceability, interpretability, and smoothness. 19 The obtained value p̂ i can be interpreted as the predicted probability of flashover. In practice, the following interpretations are made in conjunction with Equation (5) to come to the actual predicted output for the ith training example ŷ i , i.e. flashover or no-flashover.
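The chain from Equation (4) to the predicted output can be sketched in a few lines of Python. The coefficient values passed to the function are placeholders, as the fitted coefficients are only determined later in the manuscript:

```python
import numpy as np

def predict(theta: np.ndarray, x: np.ndarray):
    """Predict flashover/no-flashover for one observation x with N features.

    theta has length N + 1; a bias unit x_0 = 1 is prepended to x so that
    Equation (4) can be written as a single dot product.
    """
    x_aug = np.concatenate(([1.0], x))   # add the bias unit x_0 = 1
    z = float(theta @ x_aug)             # Equation (4): position relative to f-hat
    p = 1.0 / (1.0 + np.exp(-z))         # Equation (5): sigmoid, 0 < p < 1
    y_hat = 1 if p >= 0.5 else 0         # threshold the probability at 50%
    return z, p, y_hat
```

With θ̂ = 0 the model is maximally uncertain: z = 0 and p = 0.5 for any input, i.e. the observation sits exactly on the decision boundary.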

| Cost function for unregularized logistic regression
The estimated values for the regression coefficient matrix θ̂ are those which minimize the difference between f and f̂, i.e. minimize the cost function J(θ̂). The cost function used for the model is represented by Equation (7).
The right part of the equation is usually referred to as the log likelihood (log lik) function, as shown in Equation (8).
Equation (8) is usually minimized with mathematical and statistical programs, for example, Matlab, Python, R, that have built-in optimization algorithms. The process of optimizing the cost function is commonly referred to as fitting the model.
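A minimal sketch of this fitting step, i.e. minimizing the cost function of Equation (7) with a built-in optimizer (here SciPy's `minimize`); the toy data are illustrative only and are not the historical data set:

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta, X, y):
    """Equation (7): negative mean log-likelihood of the logistic model."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    eps = 1e-12                                    # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Illustrative toy data (NOT the historical data set): one feature whose
# sign determines the binary outcome.
x1 = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x1), x1])        # prepend the bias unit X_0
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

res = minimize(cost, x0=np.zeros(2), args=(X, y))  # "fitting the model"
theta_hat = res.x
```

After fitting, the sign of X @ theta_hat separates the two classes of the toy data.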

| Model performance and the deviance and R 2 Metric
In order to determine θ̂ while still being able to evaluate the model performance, the historical data set is split into two parts: the training set and the test set. As such, the cost on the training set J(θ̂) train can be calculated with Equation (7) by only taking into account those observations which are allocated to the training set, and by replacing M with the total number of training observations M train . The training set is then used to fit the model, and the test set is used to report on the anticipated ability of the fitted model to accurately predict flashover or no-flashover on "unseen" observations, i.e. the approximation of the generalization error. The reason for using the test set is that the data examples used to calculate J(θ̂) train do not classify as unseen anymore and thus give an optimistic approximation of the generalization error. The deviance on the test set D test , see Equation (9), is a metric which is commonly used to approximate the generalization error for logistic regression. 20 It denotes the difference between the fitted model and the ideal model, i.e. the saturated model. As such, the higher the deviance, the worse the performance of the model. It should be noted that when D test is evaluated over multiple lists, a conservative approach is usually taken and the simplest model, defined by the minimum D test plus one SE, is considered to be the most parsimonious model, 21 which in the remainder of the manuscript is referred to as the ideal scenario.
The model without any features is referred to as the null model, i.e. the worst model, and makes predictions solely with the intercept regression coefficient θ̂ 0 . The deviance of the null model D 0 is calculated with Equation (10) and can be used as a benchmark for D test . As D 0 and D test might be difficult to interpret, especially due to the dependency on the number of observations, they can be used to derive the R 2 value (0 ≤ R 2 ≤ 1), see Equation (11). A value of unity (R 2 = 1) represents a perfect fit, while R 2 = 0 signifies a scenario where the features do not add anything to the regression.
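The deviance and R 2 metrics of Equations (9) to (11) can be sketched as follows. For binary data, the saturated model has a log-likelihood of zero, so the deviance reduces to −2 times the log-likelihood of the fitted model:

```python
import numpy as np

def deviance(y, p_hat):
    """Equation (9): deviance of the fitted model. For binary data the
    saturated model has log-likelihood zero, so D = -2 * log-likelihood."""
    eps = 1e-12                                   # guard against log(0)
    return -2.0 * np.sum(y * np.log(p_hat + eps)
                         + (1 - y) * np.log(1 - p_hat + eps))

def null_deviance(y):
    """Equation (10): deviance of the intercept-only (null) model,
    which predicts the class frequency for every observation."""
    p0 = np.full(len(y), np.mean(y))
    return deviance(y, p0)

def r_squared(y, p_hat):
    """Equation (11): R^2 = 1 - D / D_0, i.e. the fraction of the null
    deviance explained by the features."""
    return 1.0 - deviance(y, p_hat) / null_deviance(y)
```

A perfect set of predicted probabilities (p̂ i = y i) gives a deviance of approximately zero and hence R 2 ≈ 1.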
In order to avoid an exceptionally good (or bad) allocation of observations to the test or training set, the procedure is randomized. As such, the model was fitted and D test was calculated as the average over multiple random training and test lists.

| Bias and variance
To improve the performance of the model, the type of error is determined first. This is particularly important as the type will dictate the possible solutions. A high variance error signifies an f̂ which is too flexible. As such, the model will find a pattern that is not actually true in the real world, 22 see Figure 7A. On the other hand, a model suffering from high bias will not be flexible enough to capture the intricacies of a training set, see Figure 7B.
At this point, it should be clear that a decrease in bias will inevitably mean an increase in variance and vice versa. As such, the ideal situation is a trade-off between the two types of error. A learning curve, or a priori knowledge, allows one to assess whether a model suffers from high bias or high variance. From Figure 8A, it can be seen that D train is approximately zero for every training set size, whereas the high value for D test implies that the model fails to generalize to unseen observations. As such, there remains a large gap between D test and D train , which is typical for a high variance/low bias case. In addition, Figure 8B shows that the predicted probability of flashover p̂ i , indicated by the circles, perfectly matches the experimentally observed output y i , indicated by the crosses, for every training example m i train of one random training list. This is an indication that the model suffers from high dimensionality, which in turn would explain the high variance/low bias error.
Strictly speaking, high dimensionality refers to the case where the number of observations M is smaller than the number of features N. 21 Because many of the same considerations apply when M is only slightly larger than N, the next section will further elaborate the concept of high dimensionality. Table A1 shows that there are no more than five observations in the least prevalent class, i.e. no-flashover, whereas some rules of thumb advise a minimum of 10 to 20 observations of the least prevalent class per feature considered. 23 According to this rule of thumb, to evaluate all five features, approximately 50 to 100 no-flashover observations would be needed. This suggests that the model is too complex for the recorded number of observations. The reason for the earlier mentioned high variance/low bias error can thus be attributed to the lack of observations relative to the number of features. As high dimensionality problems are becoming more and more frequent, mainly due to the large feature collection possibilities of the Internet, 22 numerous solutions were developed, of which one is explored in the next section.

| Cost function for regularized logistic regression
In order to solve the high-dimensionality problem, the choice was made to apply subset selection, i.e. evaluating the effect of deleting certain features. For the model at hand, 2 N = 32 different subsets exist. As such, a shrinkage method was applied to avoid having to identify every possible subset and consequently run the model 32 times.
Shrinkage effectively introduces a shrinkage penalty P α (θ̂ j ) to the cost function applied to the training set, see Equation (12). 24 For α = 0, the estimated regression coefficients of nonpredictive features are reduced toward zero, which is referred to as ridge-regression. For α = 1, the regression coefficients of nonpredictive features are reduced to exactly zero, which is referred to as lasso-regression. A value of 0 < α < 1 represents an elastic-net regression, which can be seen as a trade-off between ridge-regression and lasso-regression. The reason for evaluating α is that it is difficult to know a priori which regression method will perform best: lasso-regression will outperform ridge-regression when only a few features are related to the response, and vice versa.
The tuning parameter λ controls the trade-off between the log-likelihood function and the shrinkage penalty P α (θ̂ j ). For λ → ∞, all coefficients will be near or exactly zero, which defines the null model. For λ → 0, the effect of the shrinkage penalty becomes negligible and the cost function is again represented by Equation (7).
It is advised to standardize the features with Equation (13) when applying shrinkage to make sure all inputs have a SD equal to one and a mean equal to zero. 19 As such, the magnitude of the regression coefficients will only be affected by the size of λ and not by the scaling differences between the features.
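A hedged sketch of the regularized fit: scikit-learn's LogisticRegression exposes the same trade-offs, with l1_ratio playing the role of α and C the role of 1/λ. The synthetic data below merely illustrate the shrinkage behavior and are not the historical data set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustration only): five features, of which only the
# first two actually drive the binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Equation (13): standardize so every feature has zero mean and unit SD,
# ensuring the penalty is not distorted by scaling differences.
X_std = StandardScaler().fit_transform(X)

# Elastic-net logistic regression: l1_ratio plays the role of alpha
# (1.0 = lasso, 0.0 = ridge) and C plays the role of 1/lambda.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=1.0, C=0.5, max_iter=10_000)
model.fit(X_std, y)
coef = model.coef_.ravel()
```

With l1_ratio = 1.0 (lasso), the coefficients of the three nonpredictive features are shrunk toward, and typically to exactly, zero, mirroring the behavior described above.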
F I G U R E 7 Illustration of a decision boundary f̂ which represents A) a high variance/low bias, and B) a high bias/low variance scenario

It should be noted that applying subset selection on a limited historical data set could result in the deletion of information that might be relevant. Therefore, a preference exists to increase the number of observations in order to resolve high dimensionality. Although this is not always possible, the fire safety community should strive toward an easily accessible database in which experimental results are compiled. Recent steps toward this goal were undertaken by Naser, 25 who compiled a library of 12 000 data points for fire-tested timber members.

| Cross-validation
The introduction of the hyperparameters α and λ gives rise to another problem, namely that for every possible combination of α and λ, the model must first be fitted on the training set, after which the best model can be chosen as the one that minimizes D test . As such, the test set cannot be used anymore to approximate the generalization error, because the test observations do not classify as truly unseen anymore, i.e. they were used to establish the ideal hyperparameter combination. In order to fit the model, determine the ideal hyperparameters, and be able to approximate the generalization error, the historical data set must be split into three parts: the training set to fit the model, the cross-validation (CV) set to determine the ideal hyperparameters, and the test set to calculate the generalization error.
Unfortunately, the available data set is not large enough to be split in three ways while still allowing enough data for training and CV. For this reason, it was decided to only split the data into a training and CV set and use D cv as an approximation of D test . In contrast to the training set, the CV set was only used to establish the hyperparameters. As such the model never truly "learns" from the CV set and thus D cv will be a better approximation of D test than D train . Nevertheless, as there is no subset to calculate the approximation of the generalization error, the unbiased performance of the model cannot be reported. It should be noted that the absence of D test is not a problem for the model at hand, as the goal is to provide a guideline for the user on which experiment to conduct next, rather than providing the industry with a finished prediction tool for flashover or no-flashover.
Leave-one-out cross-validation (LOOCV) was applied, 19 rather than randomly assigning observations to different lists. For LOOCV, the model is fitted on M − 1 data examples and the remaining ith data example is used to calculate the CV deviance. The process is then repeated M times, with a different data example held out for CV in every run, and consequently D cv is calculated as the average over M observations. 19 The advantage of LOOCV is the absence of randomness in allocating observations to subsets and the possibility to fit the model on almost the complete data set. With the LOOCV method, the null model deviance was found to be ≈21, which will be used as a benchmark in the following section.
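The LOOCV procedure can be sketched as follows, using scikit-learn's LeaveOneOut splitter. The model settings and the toy data are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def loocv_deviance(X, y):
    """Fit on M - 1 observations, evaluate the deviance contribution of the
    held-out observation, repeat M times, and average over M (D_cv)."""
    eps = 1e-12                                        # guard against log(0)
    contributions = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = LogisticRegression(max_iter=10_000)
        model.fit(X[train_idx], y[train_idx])
        p = model.predict_proba(X[test_idx])[0, 1]     # P(y = 1) for held-out
        yi = y[test_idx][0]
        contributions.append(-2.0 * (yi * np.log(p + eps)
                                     + (1 - yi) * np.log(1 - p + eps)))
    return float(np.mean(contributions))

# Illustrative toy data (NOT the historical data set).
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
d_cv = loocv_deviance(X, y)
```

Because every observation is held out exactly once, no randomness enters the allocation, which is the main advantage noted above.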

| RESULTS AND DISCUSSION
To determine the hyperparameters, the model was fitted with the LOOCV method by minimizing Equation (12) for a range of combinations of α and λ. In particular, by implementing the shrinkage parameter, it is implicitly assumed that the emphasis of the model is directed toward making predictions on unseen observations based upon the current historical data set, and not so much on explaining the underlying correlations between the variables of the historical data set itself. The difference is that for explanatory modeling, the focus is to reduce the bias, i.e. make accurate predictions on the training set, whereas for predictive modeling, the objective is to reduce both bias and variance, for which it might be necessary to sacrifice some theoretical accuracy. The latter was accurately described by Shmueli 26 as: "To explain or to predict." Lastly, in order to arrive at the learned algorithm, i.e. determine the final regression coefficients, the complete historical data set was used for training, in conjunction with the earlier defined hyperparameters. To allow for a two-dimensional plot, the DB was calculated for a set of fixed values for X 3 . In other words, new feature combinations x i* were determined, with x i* 3 = 0, x i* 3 = x i* 2 or x i* 3 = 2x i* 2 , such that the learned model assesses the probability of flashover as p̂ i* = 0.5, see Equation (5). As such, each line of Figure 9 divides the space into a flashover zone, see Figure 10A, situated above the line and denoting p̂ i* ≥ 0.5, and a no-flashover zone, see Figure 10B, situated below the line and denoting p̂ i* < 0.5. Due to the limited number of observations, any extrapolation which does not comply with the following conditions should be treated with caution: x i 1 > 1 kW, x i 2 > x i 1 , 6 < x i 4 < 10 and x i 5 > 465. It is important to understand that herein the aforementioned "learned algorithm" relates to the (historical) data set available at a given point in time.
In other words, the final regression coefficients are not truly "final" from this point onwards but are rather intended to be updated as more data become available. In practice, the algorithm will thus alternate between being learned and learning, as elaborated in the following paragraph.

F I G U R E 1 0 For a third burning intensity equal to zero, the following areas can be discerned in Figure 9: A) The input vectors for which the model predicts a higher than 50% chance of flashover. B) The input vectors for which the model predicts a lower than 50% chance of flashover, i.e., no flashover

The next experiment x i* DB is best chosen on the decision boundary, in a zone with limited available test data, where the predicted probabilities of flashover and no-flashover are equal to each other. Meaning that the model requires further evaluations in this area to update its prediction accuracy. After every experiment x i* DB , the newly obtained knowledge can be used iteratively to update the learned algorithm, and as such the learned algorithm becomes a learning algorithm again.
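The decision-boundary lines of Figure 9 follow directly from setting ẑ = 0 in Equation (4) and solving for one feature. The coefficient values below are hypothetical placeholders, not the fitted values of the manuscript:

```python
import numpy as np

# Hypothetical regression coefficients [theta_0, theta_1, ..., theta_5];
# placeholders for illustration only, NOT the fitted values.
theta = np.array([-3.0, 0.8, 0.4, 0.2, -0.1, 0.001])

def x2_on_boundary(x1, x3, x4, x5, theta):
    """Solve z-hat = theta . x = 0 (i.e. p-hat = 0.5, Equation (5)) for X_2,
    giving the decision boundary as a line in the (X_1, X_2) plane for
    fixed values of the remaining features, as in Figure 9."""
    return -(theta[0] + theta[1] * x1 + theta[3] * x3
             + theta[4] * x4 + theta[5] * x5) / theta[2]
```

Feature combinations above the returned line correspond to p̂ > 0.5 (the flashover zone) when θ̂ 2 > 0, and combinations below it to p̂ < 0.5 (the no-flashover zone).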
The significant benefit of using ML, and thus the model set out herein, is that the model does not need recoding or rewriting to accommodate a changing data set. That is, for every update, the ML algorithm will re-evaluate the values for α and λ and change them accordingly.
For example, if at one point more variables that are correlated to the output (rather than not) are analyzed, the model will prefer ridge-regression over lasso-regression by changing α to a value closer to zero. With every such update, the model is expected to refine its DB and progress from a guidance tool to a more accurate prediction method. Nevertheless, a limited data set inevitably means that the algorithm cannot capture all the physics, as it can only learn from the data it is presented with. As such, predictions with a limited database, i.e. at an early stage, should be used in combination with engineering judgment and within the boundaries prescribed by the historical data set.

| CONCLUSION
The flexibility of ML algorithms is unmatched by current models. Thanks to this, ML might prove to be part of the solution for an ever-changing application of innovative materials and design solutions. Nonetheless, to arrive at a fully learned model that can be used universally, i.e. an algorithm for which the regression coefficients are permanently fixed, a large amount of data is needed.
The tool presented herein partly overcomes the challenges associated with limited data as it is foreseen that the algorithm will develop as the database grows. In other words, the algorithm will make more crude predictions at first, and increasingly more accurate predictions as more data become available. In addition to the improved fidelity, the algorithm is expected to become more and more valuable with time due to the increasing complexity and size of the available data set.
Nevertheless, the end goal of the research, to which the work presented herein is considered to provide a valuable contribution, is to create a screening tool that can be used by anyone to predict the output of large-scale and intermediate-scale tests. For example, it could be used for the SBI test and the RCT, as well as for various compartment geometries and materials, based on parameters obtained from bench-scale tests such as the cone calorimeter. Therefore, all the currently available fire test results need to be compiled in a database, and new reaction-to-fire tests must be selected for their knowledge benefit. Still, due to various limitations, such as anonymity issues and the destructive nature of reaction-to-fire tests, this widespread data-sharing platform is not deemed viable in the foreseeable future. Therefore, the envisioned algorithm will have to be a symbiosis between fire safety science and ML. This symbiosis will allow current science to fill in the knowledge gaps inherent to a limited database, and in turn allow machine learning to complement and enhance the knowledge base in fire safety engineering.
F I G U R E 1 1 For a third burning intensity equal to zero in Figure 9, the decision boundary is the interface between the flashover and no-flashover area. The interface denotes the input vectors, crosses, for which the new test condition x i* DB is predicted to have a 50% chance of flashover and a 50% chance of no-flashover

T A B L E A 1 column headers: i, X 1 (kW), X 2 (kW), X 3 (kW), X 4 (m), X 5 (s), Y (−), t fo (s)