Discussion on “Testing small study effects in multivariate meta‐analysis” by Chuan Hong, Georgia Salanti, Sally Morton, Richard Riley, Haitao Chu, Stephen E. Kimmel and Yong Chen



INTRODUCTION
The paper under discussion (referred to as "the paper" in this discussion) introduces models for small study effects in multivariate meta-analysis. This is done under the working hypothesis that such effects can be analyzed by studying the effect of the study-specific standard error on the outcomes of interest. That is fine with me. However, I am concerned about the way the data are modeled in the paper and the lack of transparency in the way the data are analyzed. In this discussion, I propose a modification of the model and a simplification of the analysis. I present a more detailed analysis of the two case studies. The P-values obtained in that analysis are close to the ones obtained in the paper. Finally, a sharper correction for multiple testing is presented, and the lack of interpretability of "just P-values" is commented on.

Standardized effect size
The authors use the standardized effect size Y∕s(Y), where Y is the outcome of interest and s(Y) is its standard deviation. The concept comes from the social sciences and is used to correct for differences in scale between studies. Outcomes in different studies are all scaled to standard deviation = 1. The effect of an intervention is then given by the difference between the means of the treated group and the untreated group. This is intrinsically different from the standardization used here. I would prefer not to apply the standardization and to use the more direct model of the outcome Y. The univariate models used in the paper are the meta-regression models (Thompson and Higgins, 2002; Van Houwelingen et al., 2002).

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2020 The Authors. Biometrics published by Wiley Periodicals, Inc. on behalf of International Biometric Society.
Presenting the models this way does not really change them, but it helps to see them in the broader perspective of the meta-regression models. If the relation between √(τ² + sᵢ²) and sᵢ is approximately linear, the regression parts of the models are even approximately equivalent, but the variances are different if τ² > 0. The confusing complication might be that the parameter of model (2) pops up in a different role in Equation (4) of the paper. So, there is no disagreement about what to test, but a difference of opinion about the way of presenting the data.
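To make the meta-regression perspective concrete, here is a small numerical sketch (the data are toy numbers invented for illustration, and the fixed-effect case τ² = 0 is assumed; in that case Egger's regression of Y∕s on 1∕s is, by simple algebra, the same least-squares problem as the weighted meta-regression of Y on s with weights 1∕s²):

```python
import numpy as np

# Toy data (hypothetical): effect estimates y with standard errors s
rng = np.random.default_rng(1)
s = np.array([0.12, 0.20, 0.31, 0.45, 0.58, 0.70])
y = 0.3 + 0.8 * s + rng.normal(0.0, 0.05, size=s.size)

# Egger regression: the standardized effect y/s regressed on precision 1/s;
# a nonzero intercept signals small-study effects
A = np.column_stack([np.ones_like(s), 1.0 / s])
egger, *_ = np.linalg.lstsq(A, y / s, rcond=None)  # (intercept, slope)

# Weighted meta-regression of y on (1, s) with weights 1/s^2 is obtained by
# multiplying response and design by 1/s, i.e., regressing y/s on (1/s, 1)
Xw = np.column_stack([1.0 / s, np.ones_like(s)])
wls, *_ = np.linalg.lstsq(Xw, y / s, rcond=None)   # (beta0, beta1)
```

The two fits are the same regression with the roles of intercept and slope interchanged: the Egger intercept equals the meta-regression slope on s, and the Egger slope equals the meta-regression intercept.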

Alternative model
In his discussion of Lin and Chu (2018), Jackson (2018) showed dislike of the Lin-Chu model. I share his concern. I want to add three points:
• If the model is meant to quantify the effect of selection, the standard error sᵢ can be relevant, but τ² will not be known at the moment of selection and cannot act as a modifier of the study-specific treatment effect. It can only be used retrospectively and not prospectively.
• If a parameter plays a role in both the expected value and the variance, the estimate of the expected value can be biased (van Houwelingen, 1988).
• The Lin-Chu model cannot be analyzed by GEE-type methodology, unless the two τ²'s are disconnected.
The drawback of the Egger model is that it does not contain a random-effect term. A more suitable model would be a hybrid model

Hybrid model: Yᵢ ∼ N(β₀ + β₁sᵢ, τ² + sᵢ²).

The variance term in this model is flexible enough if there is a substantial random effect (τ² ≫ 0), but it is quite rigid if τ² = 0. The authors suggest a two-step approach in which τ² is estimated from the marginal model Yᵢ ∼ N(μ, τ² + sᵢ²). This leads to the hybrid model Yᵢ ∼ N(β₀ + β₁sᵢ, τ̂² + sᵢ²). In this model, the variance is completely specified. I would suggest a more flexible variant:

Flexible Hybrid model: Yᵢ ∼ N(β₀ + β₁sᵢ, σ²(τ̂² + sᵢ²)). (4)

The advantage of this model over the Lin-Chu model is that it can be fitted simply by standard linear regression via the transformation

Yᵢ* = Yᵢ∕√(τ̂² + sᵢ²), x₀ᵢ = 1∕√(τ̂² + sᵢ²), x₁ᵢ = sᵢ∕√(τ̂² + sᵢ²). (5)

The transformed model can be written as

Yᵢ* ∼ N(xᵢ⊤β, σ²) with β = (β₀, β₁)⊤ and xᵢ = (x₀ᵢ, x₁ᵢ)⊤. (6)

It looks very much like the SND of the paper, but it is introduced for computational convenience only. The parameter σ² is missing in the pseudo-likelihood approach of the paper. Moreover, model (6) gives more information on the data than the score test in the paper. Both β and σ² are estimated, and the estimate of β does not depend on σ². This approach will be explored in Section 5 for the two case studies of the paper.
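A minimal sketch of the two-step fit of the Flexible Hybrid model follows. The DerSimonian-Laird moment estimator is assumed for τ² in the marginal step (one common choice; the discussion above does not prescribe a particular estimator), and the function name is mine:

```python
import numpy as np

def fit_flexible_hybrid(y, s):
    """Two-step fit of Y ~ N(beta0 + beta1*s, sigma^2*(tau2_hat + s^2)).

    y: study effect estimates; s: their standard errors.
    Returns (beta_hat, cov_beta, tau2_hat, sigma2_hat).
    """
    # Step 1: DerSimonian-Laird estimate of tau^2 from the marginal
    # model Y ~ N(mu, tau^2 + s^2)
    w = 1.0 / s**2
    mu_fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fe) ** 2)
    k = len(y)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

    # Step 2: transform as in (5): y* = y/d, x = (1, s)/d with d = sqrt(tau2 + s^2)
    d = np.sqrt(tau2 + s**2)
    ystar = y / d
    X = np.column_stack([1.0 / d, s / d])

    # Step 3: ordinary least squares on the transformed data;
    # sigma^2 is estimated from the residuals
    beta, *_ = np.linalg.lstsq(X, ystar, rcond=None)
    resid = ystar - X @ beta
    sigma2 = resid @ resid / (k - 2)
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov_beta, tau2, sigma2
```

Note that β̂ is an ordinary least-squares estimate on the transformed data, so it does not depend on σ²; a Wald test for β₁ = 0 follows directly from β̂ and its covariance matrix.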

MULTIVARIATE DATA
The interest of the paper is in the situation of multiple outcomes, denoted by Y₁, …, Y_J, with corresponding standard errors s₁, …, s_J, and the development of a global test for the slopes, that is, a test of β₁ⱼ = 0 for j = 1, …, J. As in the paper, we concentrate on J = 2. The analysis suggested in Section 3 will yield estimates β̂₁ and β̂₂ with corresponding (estimated) covariance matrices C₁ = cov(β̂₁) and C₂ = cov(β̂₂). The outcomes of interest need not be present in all studies. Lacking is the cross-covariance matrix C₁₂ = cov(β̂₁, β̂₂). As in GEE with a working independence model, that cross-covariance matrix can be estimated from the set of studies that have both outcomes as

C₁₂ = (X₁⊤X₁)⁻¹ (Σᵢ x₁ᵢ e₁ᵢ* e₂ᵢ* x₂ᵢ⊤) (X₂⊤X₂)⁻¹,

with the sum running over the studies with both outcomes. Here x₁ᵢ and x₂ᵢ are the ith row vectors of the x-matrices X₁ and X₂ for outcomes Y₁ and Y₂, respectively, as defined in (5) and (6). Furthermore, e₁ᵢ* = Y₁ᵢ* − x₁ᵢ⊤β̂₁ and e₂ᵢ* = Y₂ᵢ* − x₂ᵢ⊤β̂₂. The estimate depends on the covariance of Y₁* and Y₂* and on the amount of overlap between the two outcomes. From the large covariance matrix C = cov(β̂) of the stacked estimates, the 2 × 2 covariance matrix V of the two slope estimates b̂ can be extracted, and from that a bivariate χ²-test of the hypothesis that both slopes are zero (χ²[2] = b̂⊤V⁻¹b̂). If there is little overlap, the test boils down to adding the two univariate χ²[1]'s. This procedure must have some relation to the one in the paper, but I found the paper quite hard to read on this point. The supplementary material was not very helpful. It is not even clear how outcomes that are present in all studies are handled.
The approach suggested here can be extended to J > 2, but the large covariance matrix for β̂ might fail to be nonnegative definite.
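As a numerical sketch of this construction (simplified so that the global test is written directly in terms of the two scalar slope estimates; the function names and toy numbers are mine, not from the paper):

```python
import numpy as np

def cross_cov(X1, e1, X2, e2, idx1, idx2):
    # Working-independence "sandwich" estimate of cov(beta_hat1, beta_hat2):
    # (X1'X1)^-1 [ sum_i x_1i e*_1i e*_2i x_2i' ] (X2'X2)^-1,
    # the sum running over the studies reporting both outcomes;
    # idx1/idx2 locate those shared studies within each outcome's data set.
    meat = X1[idx1].T @ np.diag(e1[idx1] * e2[idx2]) @ X2[idx2]
    return np.linalg.inv(X1.T @ X1) @ meat @ np.linalg.inv(X2.T @ X2)

def global_chi2(b1, b2, v1, v2, c12):
    # Bivariate Wald chi-square[2] for H0: both small-study slopes are zero,
    # from the two slope estimates, their variances, and their covariance.
    b = np.array([b1, b2])
    V = np.array([[v1, c12], [c12, v2]])
    return float(b @ np.linalg.solve(V, b))
```

With no overlapping studies the "meat" term is empty, the cross-covariance vanishes, and the χ²[2] statistic reduces to the sum of the two univariate χ²[1]'s.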

Case study 1: Heart failure
The two outcomes considered in the paper are Y₁ = "all-cause mortality" and Y₂ = "mental quality of life." The related standard errors are denoted by s₁ and s₂. The data are shown in Figure 1 of the paper. The paper only reports the P-values of the univariate tests, P = 0.06 for Y₁ and P = 0.09 for Y₂, and the multivariate P = 0.03. (Unfortunately, the P-values are presented in the wrong order, which caused serious confusion.) The data were analyzed with model (4), using the transformation of (5) for ease of computation.

TABLE 1 Basic data for case study 1: n, mean, standard deviation, and correlations

Table 1 shows the basic information of mean, standard deviation, and correlation. The most striking feature is the high correlation between s₁ and s₂. Table 2 shows the results of my analysis sketched above. The first observation is that, despite the different approaches (score test vs Wald test), the P-values for the univariate tests agree with the ones given in the paper. Moreover, the P-value for the multivariate test agrees with the one given in the paper. Finally, it also shows the potential of the multivariate approach: the multivariate χ²[2] is larger than the sum of the two χ²[1]'s.

Case study 2: Neuroblastoma
The data are shown in Figure 1 of the paper. As there are many more observations for Disease-Free Survival (DFS) than for Overall Survival (OS), DFS is taken as Y₁ and OS as Y₂. Table 3 again shows a very high correlation between s₁ and s₂ and a somewhat confusing pattern of correlations. One should realize that there are only 10 studies with the OS outcome and only eight studies with both DFS and OS, so all correlations involving Y₂ or s₂ are very shaky.
The results in Table 4 are quite similar to those in Table 2. The difference is that the χ²[2] is smaller than the sum of the two χ²[1]'s. The role of global testing is discussed in the next section.
The high correlations between s₁ and s₂ in both examples suggest a common measure, like sample size, that could be used for all outcomes.

MULTIPLE TESTING
Given a significant global test, the univariate hypotheses can be tested sequentially. Let the ordered P-values of the J outcomes be P₍₁₎ ≤ P₍₂₎ ≤ … ≤ P₍J₎. The corresponding univariate null hypotheses can be rejected sequentially as long as P₍ⱼ₎ ≤ α∕cⱼ with c = (J − 1, J − 1, J − 2, …, 1). In the examples with J = 2, the c's are given by c = (1, 1). This implies that the null hypotheses in the examples can be rejected at α = 0.10 if the multivariate test is significant.
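The sequential rule can be sketched as follows (the function name and default α are mine; the procedure presumes that the global test has already been found significant):

```python
import numpy as np

def sequential_reject(pvals, alpha=0.05):
    """Step-down rejection of the J univariate null hypotheses after a
    significant global test, with thresholds alpha / c_j where
    c = (J-1, J-1, J-2, ..., 1)."""
    J = len(pvals)
    order = np.argsort(pvals)                       # indices of ordered P-values
    c = np.array([max(J - 1, 1)] + [J - j for j in range(1, J)])
    rejected = np.zeros(J, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / c[rank]:
            rejected[idx] = True                    # reject and move on
        else:
            break                                   # stop at the first failure
    return rejected
```

For J = 2 the thresholds are both α, so either both univariate hypotheses survive to be tested at level α or the procedure stops at the smaller P-value.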
Actually, I think there is too much emphasis on P-values in the paper. That is the weakness of score tests: no effect measures are given, only P-values. I strongly suggest giving estimates of the parameters and some help in interpreting them.