Communicating statistical conclusions of experiments to scientists

The manuscript introduces a framework for presenting the results of the statistical analysis of experiments with multiple responses and multiple factors. We propose the use of factor scaling to enable a transformation that combines main effects, quadratic effects and interactions into a meaningful summary, allowing the scientist or experimenter to immediately recognise the most influential factors for a given response. The framework does not replace a thorough evaluation of the results but provides a clear, high-level summary of the relative importance of the findings. Visualising this factor importance with intensity heatmaps gives an immediate understanding of the results across multiple responses, efficiently guides a subsequent detailed analysis of selected responses and factors, and contributes to the design of follow-up experiments. The methodology is applied to a real industrial experiment and to a simulated data set with a larger number of responses and factors.


INTRODUCTION
Experimentation underpins the most important scientific and technological advances. Efficient statistical design and correct analysis of experiments can generate important new domain and application knowledge in numerous disciplines, such as agroindustry, biopharmaceuticals, food science, materials science, medicine, biology and genetics, while saving time and money. The main aim of a statistically designed experiment is the identification and estimation of an appropriate quantitative relationship between the controllable input variables (factors) and the observed outputs (responses). Such a relationship, or model, can be used to gain scientific understanding of the causal links between system inputs and outputs, to predict the response at unobserved inputs and to optimise the system by finding values of the factors that match a target output.
In industry, the life sciences and the physical sciences, the 'big data' revolution, enabled by modern instrumentation for carrying out experiments and automatically collecting data, brings the need for efficient visualisation techniques that clearly communicate statistical results to applied scientists, for example, engineers, medical doctors, psychologists and so forth. Summarising results and then visualising them is crucial for (i) spotting patterns, (ii) external memorisation, (iii) stimulating and evaluating hypotheses, (iv) saving time and (v) giving people the whole picture. 1 In this work, we develop an exploratory summarisation and visualisation technique that gives the overall picture of the data analysis in experiments with multiple responses and multiple factors.
The method can be broadly applied to any kind of experiment that uses a multifactorial design with multiple responses. For example, in response surface methodology, a sequential strategy of experimentation, statistical modelling and optimisation, the experiment aims to study and understand the relationship between the controllable factors and the response. Typically, such an experiment considers only a few factors. If many factors may be of importance, a fractional factorial, a definitive screening or a supersaturated design should be used first in a screening experiment to identify the important ones. [2][3][4] Experiments are usually designed assuming that the treatments (combinations of factor levels) are completely randomised to the experimental units. However, just as with any designed experiment, there may be structure in the experimental units arising from known nuisance factors that affect the response, which should be accounted for by blocking. For more details on the different types of experiments, we refer to the Handbook of Design and Analysis of Experiments. 5

The large number of factors, in combination with the large number of responses and the extensions of the traditional design framework suggested in the previous paragraph, makes the comparison and presentation of the analysis results across many responses cumbersome, especially when model selection needs to be applied and the final models differ across responses. The methodology proposed in this manuscript is a summarisation framework and graphical representation of the results of the statistical analysis of data from experiments with multiple responses.
The approach, based on transforming parameters and summarising them in an interpretable way, is accompanied by intensity heatmaps that allow the practitioner, with a simple look at the graph, to see which factors are the most important ones for a given response, taking into account the main effects, the quadratic effects and the interactions. Additionally, the practitioner will easily see how often certain factors appear to be important across multiple responses. This immediate understanding supports a correct combination of the statistical output with the scientific knowledge, feeding into the design of subsequent experiments and the modelling of the responses under study. However, this framework definitely does not replace the deeper evaluation of the individual models, responses and factors. It allows for smooth communication of the results and enables a statistician to focus a detailed analysis on the responses and factors with the strongest effects. Finally, the results still need to be evaluated against scientific context and knowledge, taking into account the magnitudes of the effects within the context of each response as well as the relative importance of the responses themselves.
In Section 2, we present the motivating examples in which our method was successfully applied. The first study comes from the literature and involves a rather small number of factors. Its main role is illustrative, to provide a full understanding of the methodology, but even there we can see a benefit from combining quadratic effects and interactions in a concise framework. The second example represents a screening study based on a definitive screening design with a much higher number of factors, which benefits greatly from our visualisation framework. This motivating example was inspired by a real pharmaceutical case study performed by the first author in which the proposed technique was successfully applied. Unfortunately, the data set is proprietary and cannot be shown publicly in any form; hence, a simulated data set with a matching structure is used instead. Note that both case studies use as input the effect estimates, computed after a model selection step has been applied. This is typical for screening designs, where fitting the full model is rarely possible. However, if estimating the full model is feasible, the framework can be applied analogously, with considerable benefit when all the quadratic effects and the interactions are considered. In Section 3, we describe the proposed methodology and possible extensions. In Section 4, we present the results from the application of our method to the real-life experiment and to the simulated data set. We provide the discussion in Section 5. In the supporting information, we provide the computer code in R 6 for the implementation of our technique in both case studies, together with an HTML report compiled from the R code.

The pastry dough experiment
The first case study is an experiment performed to improve the quality of a pastry dough. A non-orthogonally blocked response surface design, with seven blocks of four runs, was used by Trinca and Gilmour. 7 The experiment adjusted three factors: the flow rate x_1, the moisture content x_2 and the screw speed x_3. The four evaluated responses 8,9 are a longitudinal expansion index y_1, a cross-sectional expansion index y_2 and two measures of light transmission in two different bands of the spectrum, y_3 and y_4.
The estimates of the fixed effects used for the purpose of this manuscript, namely, the graphical representation of the results of a statistical analysis, were retrieved from the literature. 10 These estimates were obtained using generalised least squares (GLS) and backward model selection. The usual way to analyse data from this type of experiment is to fit a linear mixed model, typically using GLS for the fixed effects combined with restricted maximum likelihood (REML) estimation of the variance components. 9,11 The estimated effects used can be seen in Table 1.

Simulated data
The main benefit of the data set above is that it simply explains the proposed methodology. However, studies with a large number of factors and responses will benefit the most from fast exploratory visual tools. The first author has applied the method to a definitive screening experiment evaluating 20 responses on 11 factors (representing various settings of the process equipment). Simulated modelling output mimicking the structure of this real study is used as the second case study in the manuscript. Because the subject of the demonstration is the visualisation and not a practical interpretation of the specific effects, there is no practical difference in presenting simulated numerical values for the effect sizes instead of the real data. Definitive screening designs, 3 unlike most screening designs proposed in the literature, include factors at three levels instead of two. In this way, it is possible to detect quadratic effects even at the initial stage of experimentation with just a few runs. Specifically, the minimal definitive screening design with m factors at three levels has 2m + 1 runs, while at least four extra runs are recommended for a better estimate of the residual error. 12 The final design applied to this case study had 30 runs due to the presence of an additional categorical factor.
The simulated output was obtained as follows. For simplicity, we do not simulate raw data and then run the whole model selection process; instead, we simulate directly the hypothetical result of the model fitting and model selection procedure. In other words, we directly simulate the set of factors (and their interactions and quadratic effects) that would be selected for each response by the hypothetical model selection procedure, that is, the final linear model for each response.
The simulation process first randomly selected the number of main effects for a given response and then simulated the effect sizes from a normal distribution with zero mean and variance equal to 25. Quadratic and interaction effects were then simulated under the strong heredity principle, that is, a quadratic effect could only occur if the corresponding main effect was non-zero, and an interaction could only occur if both respective main effects were non-zero. Quadratic effects and interactions were considered present with probabilities of 0.2 and 0.3, respectively. Finally, all the main and quadratic effects for factors 2, 6 and 11 were multiplied by 5 to create factors with higher relative importance. The approach was repeated for each response variable.
In total (summed over all responses), the simulation resulted in 118 main effects (out of 220 possible), 29 quadratic effects and 105 interaction effects. The simulation output comprises 20 distinct linear models, one per response. The models are not shown here due to their size, but the computer code to generate the data is submitted as supporting information for the readers' convenience and reproducibility. Note that we have created only a single simulated data set to demonstrate the value of our visualisation method.
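The simulation procedure described above can be sketched as follows. The manuscript's supporting information provides the actual R code; the snippet below is a minimal Python illustration of the same logic, and all function and variable names are our own, not taken from the supporting code.

```python
import numpy as np

N_FACTORS, N_RESPONSES = 11, 20
P_QUAD, P_INT, EFFECT_SD = 0.2, 0.3, 5.0   # variance 25 => standard deviation 5
BOOSTED = [1, 5, 10]                        # zero-based indices of factors 2, 6 and 11

def simulate_response_model(rng):
    """Simulate the selected effects for one response under strong heredity."""
    main = np.zeros(N_FACTORS)
    quad = np.zeros(N_FACTORS)
    inter = np.zeros((N_FACTORS, N_FACTORS))   # upper triangle holds interactions
    # randomly choose how many, and which, main effects are selected
    active = rng.choice(N_FACTORS, size=rng.integers(1, N_FACTORS + 1), replace=False)
    main[active] = rng.normal(0.0, EFFECT_SD, size=len(active))
    # strong heredity: a quadratic effect requires a non-zero main effect
    for n in active:
        if rng.random() < P_QUAD:
            quad[n] = rng.normal(0.0, EFFECT_SD)
    # strong heredity: an interaction requires both main effects to be non-zero
    for i in range(N_FACTORS - 1):
        for j in range(i + 1, N_FACTORS):
            if main[i] != 0 and main[j] != 0 and rng.random() < P_INT:
                inter[i, j] = rng.normal(0.0, EFFECT_SD)
    # inflate the designated high-importance factors
    main[BOOSTED] *= 5
    quad[BOOSTED] *= 5
    return main, quad, inter

rng = np.random.default_rng(2024)
models = [simulate_response_model(rng) for _ in range(N_RESPONSES)]
```

The counts of non-zero effects (118 main effects, etc.) depend on the random seed; the snippet only reproduces the structure of the simulation, not the exact data set used in the manuscript.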

The framework
Let us assume that we have implemented a design aiming at describing the relationship of N factors with each of K responses of interest. The starting point of the visualisation approach is the result of modelling. The final model can be the full model including all the effects of interest, or some preliminary selection can be applied to reduce the number of model terms. Further, we assume that the quadratic effects and the second-order interactions are the highest order effects of interest. The framework, however, can be extended to higher orders in a straightforward manner. Finally, we assume that the factors are centred and transformed in such a way that the factor levels -1 and 1 are of main interest. In practice, this would often mean that the factors are transformed to have a range between -1 and 1, but there may be exceptions; for example, in central composite designs (CCDs), the axial observations lie beyond the -1 and 1 boundaries. Even then, a change between -1 and 1 is interpretable and relevant from a practical perspective.
Denote the factors as X_1, …, X_N and the responses as Y_1, …, Y_K. Then, the full model considered for any given response Y_k is

$$Y_k = \alpha_k + \sum_{n=1}^{N} \beta_{nk} X_n + \sum_{n=1}^{N} \gamma_{nk} X_n^2 + \sum_{n=1}^{N-1} \sum_{m=n+1}^{N} \delta_{nmk} X_n X_m + \varepsilon_k. \qquad (1)$$

In Equation (1),
• α_k represents the intercept, that is, the mean response value if all the factors are set to zero,
• β_nk represents the main effect of X_n (n = 1, …, N) in the model for Y_k,
• γ_nk represents the quadratic effect of X_n (n = 1, …, N) in the model for Y_k, and
• δ_nmk represents the interaction between X_n and X_m (n = 1, …, N − 1; m = n + 1, …, N).
Note that the effects listed above are defined on transformed factors, with a range of interest between -1 and 1 and the centre at zero. This means that the effects reflect changes in the response when a factor is changed by one on the transformed scale, that is, by half of its range on the original scale.
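To make the model concrete, the following short Python sketch evaluates a full second-order model (intercept, main effects, quadratic effects and two-factor interactions) at a given factor setting. The function and argument names are illustrative choices of ours, not part of the manuscript's supporting code.

```python
import numpy as np

def predict_response(x, alpha, beta, gamma, delta):
    """Evaluate a full second-order model: intercept + main effects
    + quadratic effects + two-factor interactions.

    x     : (N,) factor levels on the transformed [-1, 1] scale
    alpha : intercept (mean response when all factors are zero)
    beta  : (N,) main effects
    gamma : (N,) quadratic effects
    delta : (N, N) matrix of interactions; only the upper triangle is used
    """
    x = np.asarray(x, dtype=float)
    y = alpha + float(beta @ x) + float(gamma @ (x ** 2))
    n_factors = len(x)
    for n in range(n_factors - 1):
        for m in range(n + 1, n_factors):
            y += delta[n, m] * x[n] * x[m]
    return y

# a tiny example with three factors
beta = np.array([1.0, -2.0, 0.5])
gamma = np.array([0.0, 1.0, 0.0])
delta = np.zeros((3, 3))
delta[0, 1] = 0.5
y = predict_response([1.0, 1.0, 0.0], 0.0, beta, gamma, delta)
# y = 1.0 - 2.0 + 1.0 + 0.5 = 0.5
```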

The maximal possible influence
The aim of this manuscript is to provide a visual tool that allows the importance of the factors to be assessed from a global perspective, that is, simultaneously for all Y_1, …, Y_K. An intensity heatmap is a plot type fitting such an aim. Consider the matrix with the responses Y_1, …, Y_K in its columns and the factors X_1, …, X_N in its rows, with each cell value equal to some measurable quantity representing the strength of the relationship between a specific factor and a response. The heatmap simply assigns a colour to the observed quantity in each cell of the matrix, with a colour intensity proportional to the value of the quantity. For example, if our focus were on main effects only, we could represent the relationship between all responses and factors with the matrix of main effects (β_nk), n = 1, …, N; k = 1, …, K. The visualisation of such a matrix via a heatmap is straightforward. However, considering only the main effects may be misleading in the presence of strong quadratic effects and interactions. Another metric is needed to compare the relative contribution of factors. Hence, we suggest a framework that summarises both types of second-order effects in one metric value per factor and response combination.
Let us define the 'maximal possible influence', MPI_nk, of the factor X_n on the response Y_k as the maximal absolute change in Y_k that can be induced by increasing X_n by one within the range [−1, 1], for a certain starting value of X_n and some favourable setting of the other factors X_1, …, X_{n−1}, X_{n+1}, …, X_N. As we will see later, a certain starting value is necessary for adding the quadratic effects, and a favourable setting of the other factors for adding the interactions.
Let us consider a quadratic effect first. We define the numbers b > 0 and c > 0 such that, for a factor X_n and a response Y_k, the absolute value of the main effect is |β_nk| = b and the absolute value of the quadratic effect is |γ_nk| = c. All possible combinations of positive and negative main and quadratic effects are shown in Table 2, which lists, for each sign combination, the values of β_nk X_n + γ_nk X_n² at X_n = −1, 0 and 1, the changes from −1 to 0 and from 0 to 1, and the maximal effect. Given that X_n only takes values in the interval [−1, 1], we can calculate the level change by one, between zero and the boundaries. As the table shows, the larger of the two changes, regardless of the signs of the main and quadratic effects, is always

$$|\beta_{nk}| + |\gamma_{nk}| = b + c. \qquad (2)$$

We proceed analogously with the interaction effects. The interaction term δ_nmk represents the change in the influence of factor X_n on response Y_k conditional on the value of factor X_m. When X_n is increased by one, the response Y_k will change by β_nk − δ_nmk if X_m = −1 and by β_nk + δ_nmk if X_m = 1 (assuming no quadratic effect). Hence, a value of X_m can always be selected such that the sign of β_nk matches either δ_nmk or −δ_nmk. Choosing X_m to match the sign of the interaction effect with the sign of the main effect of X_n reflects the most favourable setting towards the potential effect of X_n. Naturally, the concept can be extended to all the interactions between X_n and the rest of the factors X_1, …, X_{n−1}, X_{n+1}, …, X_N. In summary, we select the most favourable setting of the remaining factors so as to maximise the effect of the factor X_n. Therefore, regardless of the signs of the main effect and the interaction effects, we can obtain the maximal effect in the absolute sense as

$$|\beta_{nk}| + \sum_{m \neq n} |\delta_{nmk}|. \qquad (3)$$

Note that Equations (2) and (3) have the same structure.
Therefore, the maximal possible influence of X_n on Y_k, as defined above, can be calculated as

$$\mathrm{MPI}_{nk} = \beta_{nk} + \operatorname{sign}(\beta_{nk}) \Big( |\gamma_{nk}| + \sum_{m \neq n} |\delta_{nmk}| \Big), \qquad (4)$$

which carries the sign of the main effect, while its absolute value combines the contributions of Equations (2) and (3). The final matrix (MPI_nk), n = 1, …, N; k = 1, …, K, can be plotted directly within the intensity heatmap framework. However, Equation (4) will be practically useful only if the scales of the responses are comparable. Otherwise, the colour intensity will be entirely dominated by the responses with the higher absolute effects. Besides invalidating the comparison across responses, the visual reading of the heatmap will be compromised. Therefore, we recommend scaling the results using one of the following two options. The first is normalising the responses themselves to obtain comparable means and variances. After such a transformation, Equation (4) can be applied as is.
The framework may also be used on model output without access to the raw data, in cases where refitting the models is impossible or when the experimenter prefers to fit the models on the original responses. In such cases, an alternative approach is standardisation within each response Y_k by the highest (in absolute value) observed MPI_nk, n = 1, …, N. In other words, for each response (column), we visualise the MPI_nk relative to the maximum |MPI_nk| achieved for that response, that is, the 'relative importance' of the factors within each Y_k:

$$\mathrm{MPI}^{\mathrm{std}}_{nk} = \frac{\mathrm{MPI}_{nk}}{\max_{n'} |\mathrm{MPI}_{n'k}|}. \qquad (5)$$

Additionally, in the screening context, the sign of the effect is often not of the utmost importance; rather, the aim is the selection of factors with strong effects in either direction that will be further evaluated in the following stages of the process characterisation. Therefore, instead of the MPI_nk or the MPI^std_nk, the absolute values of the respective metrics, |MPI_nk| or |MPI^std_nk|, can be computed and visualised using a heatmap. Reducing to a one-directional intensity scale helps the visual evaluation of the intensity ranges.
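The computations of Equations (4) and (5) can be sketched as follows. The data layout (per-response arrays of main, quadratic and interaction effects) and all names are our illustrative assumptions, not the manuscript's R implementation; the MPI magnitude is the sum of the absolute main, quadratic and interaction effects, signed here by the direction of the main effect.

```python
import numpy as np

def mpi(beta, gamma, delta):
    """Maximal possible influence of each factor on one response.

    beta  : (N,) main effects, gamma : (N,) quadratic effects,
    delta : (N, N) interactions; only the upper triangle is used.
    The magnitude is |beta_n| + |gamma_n| + sum over m != n of |delta_nm|,
    signed by the direction of the main effect.
    """
    beta = np.asarray(beta, dtype=float)
    d = np.triu(np.abs(delta), k=1)
    d = d + d.T                      # delta_nm contributes to both factors n and m
    sign = np.where(beta < 0, -1.0, 1.0)
    return beta + sign * (np.abs(gamma) + d.sum(axis=1))

def mpi_std(mpi_matrix):
    """Standardise column-wise: each response (column) is divided by its
    largest absolute MPI, giving the relative importance within the response."""
    m = np.asarray(mpi_matrix, dtype=float)
    col_max = np.abs(m).max(axis=0)
    col_max = np.where(col_max == 0, 1.0, col_max)   # guard all-zero columns
    return m / col_max

# two factors, one response: beta = (2, -1), gamma = (1, 0), delta_12 = 3
beta, gamma = [2.0, -1.0], [1.0, 0.0]
delta = np.array([[0.0, 3.0], [0.0, 0.0]])
m = mpi(beta, gamma, delta)          # magnitudes 2+1+3 = 6 and 1+0+3 = 4
rel = mpi_std(m.reshape(-1, 1))
```

The N × K matrix of MPI (or |MPI|) values, with factors in rows and responses in columns, can then be passed to any heatmap routine.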

Considerations for follow-up experiment
Screening designs often represent the first part of a two-stage approach, intended to select the factors that will be carefully studied in a second-stage experiment. In practice, the factors carried forward from the screening stage might not be selected solely on their inferential properties (e.g., statistical significance), but also on their actual practical relevance. Hence, it might be useful to see how the MPI_nk of a given factor changes when other factors (not planned to be investigated further) are fixed at the centre value (note that the concept can be extended to any value other than the centre). For example, it may happen that X_n has relatively small interactions with multiple other factors, together leading to a practically relevant MPI_nk, whereas when all the other factors are set to centre values (leading to zero contribution of the respective interactions), the actual potential of X_n is reduced considerably. From this perspective, it may be useful to explore how the heatmap of the potential effects changes if certain factors are excluded from the final set. In terms of the MPI_nk, the interactions with excluded factors are simply not added into the sum.
It is often the case that the ranges of the factors are reduced (or extended) in follow-up studies. Typical reasons for a range reduction include higher cost or run time at the extremes, making the implementation of the experiments demanding. Hence, if we are interested in the potential of the factors to influence the response in a future study, we should take the range adjustments into account. The impact on the interaction terms should then be addressed, because the MPI_nk as defined in Equation (4) assumes the whole range between -1 and 1 for all the factors, so that the interacting factors can be set to the most favourable values of -1 or 1. Hence, if we assume a narrower range for a factor, the interaction terms δ_nmk involving it have to be scaled down appropriately (e.g., if a range between -0.7 and 0.7 is considered for factor X_n, we should multiply all interactions with this factor by 0.7). The generalisation to non-symmetrical ranges is straightforward.
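The range adjustment can be sketched as follows, again as an illustrative Python snippet under our own data-layout assumptions rather than the manuscript's R code. Each interaction contributes to the MPI of a factor scaled by the new half-range of its interacting partner, and setting a half-range to zero drops that factor's interactions entirely, reproducing the factor-exclusion scenario discussed above.

```python
import numpy as np

def mpi_adjusted(beta, gamma, delta, half_ranges):
    """MPI when interacting factors can only be set within reduced ranges.

    beta, gamma : (N,) main and quadratic effects
    delta       : (N, N) interactions; only the upper triangle is used
    half_ranges : (N,) new half-ranges on the coded scale, e.g. 0.7 when a
                  factor will only be varied between -0.7 and 0.7; a value of
                  0 removes the factor's interactions (factor excluded or
                  fixed at the centre)
    """
    beta = np.asarray(beta, dtype=float)
    h = np.asarray(half_ranges, dtype=float)
    d = np.triu(np.abs(delta), k=1)
    d = d + d.T
    # the partner factor m is set to +/- its half-range, so |delta_nm| scales by h[m];
    # a factor's own main and quadratic effects are unchanged, because the MPI is
    # defined as the change when the factor itself is increased by one
    inter = (d * h[np.newaxis, :]).sum(axis=1)
    sign = np.where(beta < 0, -1.0, 1.0)
    return beta + sign * (np.abs(gamma) + inter)

# a two-factor example: factor 2 restricted to [-0.7, 0.7]
beta, gamma = [2.0, -1.0], [1.0, 0.0]
delta = np.array([[0.0, 3.0], [0.0, 0.0]])
m = mpi_adjusted(beta, gamma, delta, [1.0, 0.7])
# factor 1: 2 + 1 + 3*0.7 = 5.1; factor 2: -(1 + 0 + 3*1.0) = -4.0
```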
Note that the considerations above also apply to the case of extending the range of the factors' levels in the follow-up studies. An example of such a case may be the augmentation of a screening factorial design towards a central composite design, where levels of factors beyond the range of -1 to 1 are explored. However, scaling up the MPI_nk for extended ranges relies on the assumption that the estimated effects and the relationship extrapolate outside the studied factor ranges. Such an assumption may often be inappropriate.

FIGURE 1 Pastry dough data: |MPI^std_nk|

RESULTS
The ideas described in the previous section have been applied to both the literature case study and the simulated data set, described in Section 2.

The pastry dough experiment
The MPI_nk for the pastry dough data as defined in Equation (4), that is, the basic version without any scaling or further adjustments, is not shown in this manuscript; for a coloured version of this plot, see the supporting information. Given that the responses have different scales, we proceed directly with the adjusted |MPI^std_nk|, obtained by taking the absolute value in Equation (5), which allows us to compare the relative importance of factors across responses. As a concrete example, consider MPI_34, that is, the factor x_3 and the response y_4, for which the model contains the main effect, the quadratic effect and one interaction; the corresponding estimates are given in Table 1. The resulting heatmap is shown in Figure 1 and is interpreted as follows: relative to the other factors, the potential effect of x_2 seems to be the most important across y_1, y_2 and y_3. For y_4, x_2 is dominated by x_3. Factor x_1 does not seem to be (relative to the effects of the other factors) very important, with the exception of y_1. Let us assume that we decide to perform a follow-up optimisation study in which we would like to reduce the range of x_3 to (−0.1, 0.1). A practical reason might be that the experiment takes too long to perform at low screw speed, while leading to a risk of equipment failure at high screw speed. Such a change would only impact the contribution of the interaction x_2 x_3 for y_4 towards MPI_24. Indeed, the main and quadratic effects do not need to be adjusted, because we define the MPI_nk as a change when the factor itself is increased by one (on the transformed scale). Hence, the range adjustment only affects the contribution of the corresponding interactions (which require the most favourable setting of x_3, affected by the range adjustment). Therefore, we would expect the colour intensity of MPI_24 to decrease due to the reduced interaction term. We can see that this is indeed the case in Figure 2.

Simulated large study
The scaled absolute value results, that is, the |MPI^std_nk| per response, for the simulated study are shown in Figure 3. As expected, factor 6 seems to be the most relevant overall (recall that it was chosen to have high relative importance during the simulation), whereas factors 2 and 11 are slightly less pronounced. As noted in Section 1, any final decision based on these findings needs to be confronted with scientific knowledge, for example, the relative practical importance of the various responses. For instance, if responses 5 and 9 were extremely important, whereas other responses were comparatively less relevant, then factor 10 could be deemed more important than factors 2, 6 and 11.
Let us assume that only six factors will be selected for further evaluation in the optimisation stage: factors 1, 2, 4, 5, 6 and 11. Figure 4 demonstrates why it is important to explore the reduced set of factors before finalising the selection. Factor 1 could have been selected due to its relative importance with respect to responses 7 or 14. However, after removing the interactions with the excluded factors, it seems that factor 1 has limited ability to influence these responses. Additionally, in the final set, there is no factor that has been selected as non-zero for response 9 (a uniformly white column). It may merit discussion whether there is added value in measuring response 9 in further experiments. The most important feature of the presented framework is that it allows a fast exploration of the relative effects, efficiently focusing attention at a deeper level on the relevant choices of factor sets. It is extremely efficient for transmitting statistical results to applied scientists, such as engineers, pharmacists, medical doctors and so forth.

DISCUSSION
The framework for summarisation and visualisation provided above offers a fast and easy overall evaluation of the data set. The definition of the MPI allows all the effects related to a certain factor to be pooled together to quantify its potential to alter the responses of interest. The framework is flexible enough to accommodate further modifications such as scaling, absolute values, changes in ranges or the exclusion of factors. It can easily be extended further, for example, by adding higher order interactions and/or higher order polynomial effects. We do not provide one particular metric and visualisation, but rather a platform/framework to visualise the results of modelling across multiple responses in a general way. From this perspective, the framework may be generalised further, beyond linear models applied to multiple responses of experimental results, to virtually any modelling-based multiresponse framework.
Throughout the manuscript, we have implicitly assumed that all factors considered in the experiment are quantitative. The framework can be readily applied to two-level factors by using analogous equations, a baseline level and no quadratic effects. Various options can be considered in the case of factors with multiple levels, for example, for each factor taking the maximal effect across all its levels as the MPI or treating each level as a separate dummy variable compared with the baseline or with the overall average effect across all factor levels. The most suitable solution depends on the context of each experiment.
Any modelling approach that reaches a final linear model for each response is compatible with the proposed framework. Hence, both univariate and multivariate approaches may be used. When a multivariate model is applied, the correlation among the responses is implicitly reflected in the estimated coefficients. Although the correlation structure among responses could be useful in interpreting the visualisation output (e.g., by downweighting the responses that are highly correlated and focusing on the independent ones), we do not recommend attempting to incorporate the correlation structure into the heatmap visualisation, because it typically results in very complicated graphics and reduces the communication potential. Hence, we recommend producing the correlation matrix (or its heatmap-like visualisation) separately and using it in the interpretation step.
The role of interactions in the visualisation may be tuned as needed for the application. If the interest is in main and quadratic effects only, the interactions can be entirely omitted and the framework simplified (although we discourage such practice without first checking for potentially important interactions). In any case, the role of the interactions should be addressed in the model selection step.
The proposed framework is conditional on the model selection step and starts from the estimates of the chosen models. Model selection would typically be needed in cases such as screening experiments, which include a large number of factors and rarely allow for the estimation of all the main, quadratic and interaction effects. Naturally, the framework can be applied to the full model as well. We note that when it is possible to fit larger models, it is especially important to consider the impact of the interactions and not to evaluate the main effects in isolation.
The framework as presented above uses point estimates of the effects. Hence, it does not address the uncertainty in the effect estimates, which could lead to unreliable conclusions if the variability is generally large or severely imbalanced among the effects (e.g., in split-plot designs). In such cases, several straightforward extensions of the framework are possible. Standard errors of the effect estimates can be visualised in an analogous fashion to create a second heatmap that aids the interpretation of the point estimates. The relative standard error can be considered instead. Replacing the point estimates with upper or lower confidence bounds can be considered as well. Ideally, several of these visualisations may be combined to provide a sensitivity analysis around the simple point estimate-based analysis.
It needs to be clearly understood what the platform does not offer: the MPI and heatmaps cannot be used for direct decision-making on which factors are important and which are not in an absolute sense. The final decision needs to combine the modelling output with scientific knowledge. However, this is not a limitation unique to this framework, but an essential property of any statistical tool: interpretation needs to be accompanied by domain knowledge in order to reach scientifically valid decisions. Visualisation accompanying the statistical analysis output can strongly support the evidence evaluation and informed decision-making. The proposed technique certainly offers an easy way of presenting statistical conclusions on complicated large-scale experiments with a large number of factors and responses. This can be extremely handy in multidisciplinary projects where scientists from different fields need to find a common language to communicate the results.
Finally, the main motivation for the publication of this framework is the lack of such visualisation techniques in scientific communication, as well as the lack of ready-made solutions in standard statistical software. We have not managed to find any alternative to use for direct comparison. Visualisations are often focused on the main effects only, or visualise all effects while losing the link between the main and quadratic effect of the same factor. The closest tool to the purpose of the presented framework is probably the 'prediction profiler' graph from JMP, 13 which shows the predictions simultaneously for all the factors and responses. The main and quadratic effects are directly incorporated in the visualisation, but interactions are addressed only through the option to interactively adjust the levels of the corresponding factors. Hence, although the prediction profiler graph directly addresses the expected value of the response at a certain combination of factors, it does not allow for immediate comparison across multiple factors. Additionally, it does not display the effect sizes using the MPI, but rather the resulting predicted response value. Hence, although both visualisation techniques address similar goals in demonstrating modelling results, the prediction profiler is more focused on response surface experiments with a small number of factors rather than screening studies with a large number of factors and responses.