Asymptotic-based bootstrap approach for matched pairs with missingness in a single arm

Missing values are a common difficulty when dealing with paired data. Several test procedures have been developed in the literature to tackle this problem. Some of them are even robust under deviations and control the type-I error quite accurately. However, most of these methods are not applicable when missing values are present in only a single arm. For this case, we provide asymptotically correct resampling tests that are robust under heteroscedasticity and skewed distributions. The tests are based on a clever restructuring of all observed information in a quadratic-form-type test statistic. An extensive simulation study exemplifies the tests for finite sample sizes under different missingness mechanisms. In addition, an illustrative data example based on a breast cancer gene study is analyzed.


Introduction
Conducting statistical tests on units measured repeatedly requires taking into account the dependence structure of the resulting random vector. The simplest design is the matched pairs model, where two measurements are taken on the same subject. This design has a large field of application, including the industrial and life sciences. In biomedicine, for example, several studies have focused on identifying genes with up- or down-regulated effects in head and neck squamous, prostate, lung or breast cell carcinoma. [1][2][3] In common statistical analysis, testing the equality of means in a matched pairs design is conducted using the paired t-test. Even for non-normal data, the procedure is asymptotically exact, i.e. for sufficiently large samples the test correctly controls the type-I error. However, a first limitation of the paired t-test arises when the data are only partially observed. Deleting observations with missing values is a sub-optimal solution, since variance or mean estimation based on a complete-case analysis alone can be biased, leading to incorrect statistical inference. This is especially the case when complete samples are small.
To tackle this issue, a simple approach is to impute missing values singly (or multiply) and to carry out statistical tests as if there had been no missing values at all [4][5][6] . However, although leading to small imputation errors [7][8][9] , such approaches may result in inflated type-I error rates or remarkably low power in small to moderate sample sizes 10,11 . Therefore, we do not follow this approach here.
In contrast to imputation, several test procedures that (only) use all observed information in the matched pairs design have been proposed in the literature [12][13][14][15][16][17][18][19][20][21][22] . These tests, however, rely on specific model assumptions such as symmetry or even bivariate normality, which are hard to verify in practice. Moreover, these procedures are usually not robust to deviations and may result in inaccurate decisions caused by possibly conservative or inflated type-I error rates 21,[23][24][25] .
To overcome these problems, the typical recommendation is to combine separate results of adequate test statistics for the underlying paired and unpaired portions of the data, using either weighted test statistics 21,23 , a multiplication combination test 24 , or combined p-values [26][27][28][29] . However, all these methods are only applicable for matched pairs with missingness in both arms. Thus, they cannot be used to analyze data on pathological stage I breast cancer patients from The Cancer Genome Atlas (TCGA) project. This data set consists of observations from 90 patients, of which 74 had entries in only one component and only 16 were complete; see Section 7 below for details. The question is now how to analyze such data. In contrast to the above methods, barely any work can be found that is applicable to this special missing pattern, requires no parametric assumptions, and also leads to valid inference in case of heteroscedasticity or skewed distributions. One exception is the recent proposal of Qi et al. 25 , who recommended a so-called nonparametric combination test (NCT). It is based on merging results from the sign test and the Wilcoxon-Mann-Whitney test. In situations where these two nonparametric procedures show their efficiency, the proposed combination is indeed tempting. However, neither is the sign test known to be very powerful for metric data, nor is the Mann-Whitney test known for being robust against heteroscedasticity. In fact, our simulation studies demonstrate that the NCT inherits these unsatisfying properties to some extent: under heteroscedasticity and/or skewed distributions, the NCT tends not to maintain the pre-assigned type-I error level. The degree of variance heterogeneity, the skewness, and the sample sizes can all affect the level of type-I error rate control.
An overview of the type-I error control of the NCT when heteroscedasticity coincides with a skewed error distribution is displayed in Figure 1. It reveals that, under heteroscedasticity and an exponential distribution, the type-I error rate of the NCT behaves surprisingly like a power function: it increases dramatically with growing sample sizes.
The aim of this paper is therefore twofold: first, we aim to provide a statistical test that is capable of treating single-arm missing values in matched pairs and drops common assumptions such as homoscedasticity and normality; second, the test should satisfactorily control the type-I error while maintaining good power properties. To this end, we propose three different test statistics, analyze their asymptotic behavior under the null hypothesis, and equip them with an asymptotically correct parametric bootstrap procedure for calculating critical values. The paper is structured as follows: we first introduce the statistical model and the hypothesis of interest. In Section 3, we provide different test statistics of quadratic-form type that converge to either a χ 2 or a weighted χ 2 distribution. Proofs providing theoretical guarantees for the proposed methods are deferred to the supplement. In Section 4, we introduce a parametric bootstrap technique to calculate critical values and prove its theoretical correctness. Section 5 is devoted to already existing methods for statistical inference in matched pairs with single-arm missingness, while in Sections 6 and 7 the novel and existing methods are compared in an extensive simulation study and a real data example from a breast cancer gene study. The supplement contains additional analyses.

Statistical Model and Hypotheses
We consider matched pairs given by a sample D_n := {X_1, . . . , X_n}, where X_j = (X_1j, X_2j)^⊤ ∈ R^2 are i.i.d. random vectors with mean vector E[X_1] = µ = [µ_1, µ_2]^⊤ ∈ R^2 and an arbitrary covariance matrix 0 < Γ ∈ R^{2×2}. To incorporate missingness in one arm (say, the second) only, denote by R_2j ∈ {0, 1}, j = 1, . . . , n, the indicator of whether X_2j is observed (R_2j = 1) or missing (R_2j = 0), and set R_j = [1, R_2j]^⊤. If * denotes the component-wise multiplication of vectors, then in practice one observes X_j * R_j, j = 1, . . . , n, and a "0" entry is interpreted as missing. Rubin defines the missing mechanism through a parametric distributional model on R = {R_j}_{j=1}^n and classifies missingness into Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing not at Random (MNAR) schemes. 5 In our work, we first assume an MCAR mechanism, i.e. R is independent of the data. However, we will also study MAR mechanisms in simulations and refer to the supplement for the explicit definition of the missing mechanisms. For notational purposes, let I_{n_c} denote the index set of the |I_{n_c}| = n_c complete pairs, i.e. R_j = [1, 1]^⊤ for all j ∈ I_{n_c}. Similarly, I_{n_u} is the index set of observations with the second component missing (R_j = [1, 0]^⊤, j ∈ I_{n_u}) and |I_{n_u}| = n_u. Thus, there are in total N = 2n_c + n_u observations from n = n_c + n_u subjects. In this framework, we would like to use all available data to test the null hypothesis H_0: {µ_1 = µ_2} of equal means against the alternative H_1: {µ_1 ≠ µ_2}. To construct our test statistics, we first fix estimators of the population means µ_1 and µ_2. For estimating µ_1, we consider two estimators: the sample mean X̄_1^(c) of the first components of the complete pairs, and the sample mean X̄_1^(i) of the first components of the unpaired data.
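The bookkeeping of complete and incomplete pairs can be sketched as follows; this is a minimal Python/NumPy illustration with variable names of our own choosing, not code from the paper:

```python
import numpy as np

def split_pairs(X1, X2):
    """Split matched pairs with missingness in the second arm only.

    X1: first components (fully observed); X2: second components with
    np.nan marking a missing value (R_2j = 0).  Returns the complete
    pairs, the unpaired first components, and the counts (n_c, n_u, N).
    """
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    observed = ~np.isnan(X2)                              # R_2j = 1
    Xc = np.column_stack([X1[observed], X2[observed]])    # complete pairs
    Xi = X1[~observed]                                    # second arm missing
    n_c, n_u = Xc.shape[0], Xi.shape[0]
    N = 2 * n_c + n_u          # total number of observations from n subjects
    return Xc, Xi, n_c, n_u, N

# toy data: 3 complete pairs, 2 pairs with the second arm missing
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [1.1, np.nan, 2.9, np.nan, 5.2]
Xc, Xi, n_c, n_u, N = split_pairs(X1, X2)
print(n_c, n_u, N)   # 3 2 8
```

Note that N = 2n_c + n_u counts individual observations, while n = n_c + n_u counts subjects.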
For estimating the population mean µ_2, we use the sample mean X̄_2^(c) of the second components of the complete pairs. Next, we define a normalized vector Z_n that aggregates the differences between the mean values µ = [µ_1, µ_2]^⊤ and their empirical estimators, and we take their correlations into account through the covariance matrix Σ_n = Cov(Z_n).
For the subsequent asymptotic analyses, we need the following sample-size assumption, which we assume throughout.

Assumption 1. For min{n_c, n_u} → ∞ we require that n_c/n → κ_1 ∈ (0, 1) and n_u/n → κ_2 = 1 − κ_1.

Write σ_1^2 = Var(X_11), σ_2^2 = Var(X_21) and ρ = corr(X_11^(c), X_21^(c)). The statistic Z_n has, asymptotically as n → ∞, a multivariate normal distribution with expectation 0 and a covariance matrix Σ that can be consistently estimated by Σ̂_n, where κ̂_1 = n_c/n, κ̂_2 = n_u/n, σ̂_1^2 is the unbiased empirical variance estimator based on the pooled first components, σ̂_2^2 is its counterpart based on the observed second components, and the correlation factor ρ is estimated through the empirical correlation ρ̂ calculated from the paired data X^(c).
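The plug-in quantities σ̂_1², σ̂_2² and ρ̂ can be computed directly from the split data. A sketch under our reading of the text (σ̂_1² pooled over all first components, σ̂_2² and ρ̂ from the complete pairs; function and variable names are hypothetical):

```python
import numpy as np

def plug_in_estimators(Xc, Xi):
    """Xc: (n_c, 2) complete pairs; Xi: (n_u,) unpaired first components.
    Returns sigma1^2 (pooled first components), sigma2^2 (observed second
    components) and rho (empirical correlation of the complete pairs)."""
    x1_all = np.concatenate([Xc[:, 0], Xi])      # pooled first components
    s1_sq = np.var(x1_all, ddof=1)               # unbiased variance estimator
    s2_sq = np.var(Xc[:, 1], ddof=1)
    rho = np.corrcoef(Xc[:, 0], Xc[:, 1])[0, 1]  # empirical correlation
    return s1_sq, s2_sq, rho

rng = np.random.default_rng(0)
Xc = rng.normal(size=(16, 2))    # mimics the data example: 16 complete pairs
Xi = rng.normal(size=74)         # and 74 unpaired observations
s1_sq, s2_sq, rho = plug_in_estimators(Xc, Xi)
print(round(s1_sq, 3), round(s2_sq, 3), round(rho, 3))
```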

Statistics and Asymptotics
In this section, we propose three different quadratic forms for testing H_0: a Wald-type statistic (WTS), an ANOVA-type statistic (ATS), and a modified ANOVA-type statistic (MATS). To introduce the WTS, denote by B^+ the Moore-Penrose inverse of a matrix B. Then, the WTS is given by T_W = (AZ_n)^⊤ (AΣ̂_n A^⊤)^+ (AZ_n). Thanks to the studentization by (AΣ̂_n A^⊤)^+, the WTS is asymptotically distribution-free, as studied below.
Similar WTS versions have also been studied in the context of heteroscedastic ANOVA or MANOVA [30][31][32][33] . From these settings, it is known that the convergence to the limiting χ 2 -distribution is rather slow 32,34,35 , which has led to several refinements regarding bootstrapping for the calculation of critical values (see Section 4 below) or other structures of test statistics. In particular, Brunner (2001) 36 proposed an alternative quadratic form obtained by omitting the Moore-Penrose inverse involved in the computation of the WTS, resulting in the following ATS: T_A = (AZ_n)^⊤ (AZ_n) / tr(AΣ̂_n A^⊤), which is also applicable in case |Σ̂_n| = 0.
Theorem 3.2. Under the null hypothesis H 0 : {µ 1 = µ 2 }, the test statistic T A has asymptotically, as n → ∞, the same distribution as the random variable where Y i i.i.d ∼ χ 2 1 and the weights λ i are the eigenvalues of AΣA ⊤ where Σ is given in (3).
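Given estimated eigenvalues λ̂_i, the weighted-χ² limit distribution in Theorem 3.2 can be approximated by straightforward Monte Carlo simulation. The following is an illustration of that approximation only, not the bootstrap procedure of Section 4:

```python
import numpy as np

def weighted_chi2_quantile(lams, alpha=0.05, reps=100_000, seed=1):
    """Monte Carlo (1-alpha)-quantile of sum_i lam_i * Y_i with
    Y_i i.i.d. chi^2_1 (the limit law of Theorem 3.2)."""
    rng = np.random.default_rng(seed)
    Y = rng.chisquare(df=1, size=(reps, len(lams)))  # i.i.d. chi^2_1 draws
    return np.quantile(Y @ np.asarray(lams), 1 - alpha)

# sanity check: with equal weights (1, 1) the limit is an ordinary chi^2_2
q = weighted_chi2_quantile([1.0, 1.0])
print(q)  # close to the chi^2_2 0.95-quantile, about 5.99
```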
Another possible test statistic is the modified ANOVA-type statistic (MATS) developed by Friedrich and Pauly (2017) 33 for MANOVA models. Here, it is given by T_M = (AZ_n)^⊤ D̂_n (AZ_n) with D̂_n = diag((AΣ̂_n A^⊤)^+_{ii}).

Theorem 3.3. Under the null hypothesis H_0: {µ_1 = µ_2}, the test statistic T_M has asymptotically, as n → ∞, the same distribution as a weighted sum of random variables Ỹ_i i.i.d ∼ χ 2 1, where the weights λ̃_i are the eigenvalues of DAΣA^⊤ and D = diag((AΣA^⊤)^+_{ii}).
As the weights λ_i and λ̃_i in Theorems 3.2 and 3.3 are unknown and the χ 2 2 -approximation to T_W sets in rather slowly, we develop adequate and asymptotically correct testing procedures based on bootstrap versions of T_W, T_A, and T_M in the subsequent section.
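The three quadratic forms differ only in their studentization. A compact sketch of the generic construction (A, Z and Σ̂ as generic NumPy arrays; the naming is ours):

```python
import numpy as np

def quadratic_forms(Z, Sigma, A):
    """Wald-type, ANOVA-type and modified ANOVA-type quadratic forms
    built from a normalized vector Z, a covariance estimate Sigma and
    a contrast matrix A (a sketch of the generic construction)."""
    AZ = A @ Z
    V = A @ Sigma @ A.T
    Vp = np.linalg.pinv(V)             # Moore-Penrose inverse
    T_W = AZ @ Vp @ AZ                 # WTS: full studentization
    T_A = (AZ @ AZ) / np.trace(V)      # ATS: trace normalization
    D = np.diag(np.diag(Vp))           # MATS: diagonal studentization
    T_M = AZ @ D @ AZ
    return T_W, T_A, T_M

# toy example with a simple contrast of two mean estimators
A = np.array([[1.0, -1.0]])
Z = np.array([0.5, -0.5])
Sigma = np.eye(2)
T_W, T_A, T_M = quadratic_forms(Z, Sigma, A)
print(T_W, T_A, T_M)   # all equal 0.5 in this symmetric toy case
```

The ATS and MATS avoid the full Moore-Penrose inverse of AΣ̂A^⊤ and therefore remain usable when that matrix is singular.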

Parametric Bootstrapping
To estimate critical values, we apply an asymptotic model-based bootstrap approach, which has been applied, e.g., in the context of (M)ANOVA factorial designs 32,33 . To this end, we first generate parametric bootstrap variables X*_1, . . . , X*_n i.i.d. ∼ N_2(0, Γ̂). Here, Γ̂ is the empirical covariance matrix estimator with diagonal entries σ̂_1^2, σ̂_2^2 and off-diagonal entries ρ̂σ̂_1σ̂_2, where σ̂_1^2, σ̂_2^2 and ρ̂ are as in Proposition 2.1. The idea is to reflect the original covariance structure in order to obtain a more accurate finite-sample approximation. Next, we generate missing values under the MCAR scheme by randomly inserting them into the second component of the bivariate vectors X*_j until a fixed number n_u of missing values is reached. This results in a bootstrapped data set consisting of complete pairs X*^(c) and an unpaired portion X*^(i). From these, the bootstrapped versions of the quadratic forms, i.e. the Wald-type statistic T*_W, the ANOVA-type statistic T*_A and the modified ANOVA-type statistic T*_M, are computed, where Σ̂*_n = Σ̂_n(X*^(c), X*^(i)) and D̂*_n is defined analogously. The next theorem shows that all three bootstrapped test statistics can be used to approximate the null distribution of the respective test statistic.
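The generation step described above can be sketched as follows, assuming the plug-in estimators of Proposition 2.1 are available (names hypothetical):

```python
import numpy as np

def bootstrap_sample(n_c, n_u, s1_sq, s2_sq, rho, rng):
    """Draw n = n_c + n_u bivariate normal vectors with the estimated
    covariance Gamma_hat and delete the second component of a randomly
    chosen subset of size n_u (MCAR insertion)."""
    n = n_c + n_u
    off = rho * np.sqrt(s1_sq * s2_sq)
    cov = np.array([[s1_sq, off],
                    [off, s2_sq]])                  # Gamma_hat
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    miss = rng.choice(n, size=n_u, replace=False)   # indices set missing
    X[miss, 1] = np.nan
    Xc_star = X[~np.isnan(X[:, 1])]                 # bootstrapped complete pairs
    Xi_star = X[np.isnan(X[:, 1]), 0]               # bootstrapped unpaired portion
    return Xc_star, Xi_star

rng = np.random.default_rng(2)
Xc_star, Xi_star = bootstrap_sample(16, 74, 1.0, 2.0, 0.3, rng)
print(Xc_star.shape, Xi_star.shape)  # (16, 2) (74,)
```

From such a bootstrapped data set, the statistics T*_W, T*_A and T*_M would then be computed exactly as their original counterparts.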
From Theorem 4.1, we thus obtain asymptotically correct bootstrap tests ϕ*_W, ϕ*_A and ϕ*_M, which reject H_0 whenever the observed statistic exceeds the corresponding (1 − α)-quantile of its bootstrap distribution. To analyze their finite-sample performance, we conduct extensive simulations below (Section 6). Before that, we first discuss other possible candidates from the literature that should or should not be included in our simulation study.

Comparison with existing methods
We briefly review the existing literature on methods that can deal with matched pairs with missing values in one arm only. As outlined in the introduction, only a few exist, which we summarize as follows:
(a) Simple methods, such as using the paired t-test while excluding the unpaired data, or using the independent t-test while ignoring the covariance structure of the data.
(b) Tests based on modified maximum likelihood estimators. 14,16,37
(c) Tests based on simple mean difference estimators. 12,13,37,38
(d) P-value pooling methods. 25
(e) Weighted linear and nonlinear combination tests. 25
However, none of these methods is simultaneously free of distributional assumptions and robust against deviations such as heteroscedasticity and skewed distributions. In particular, the recent paper by Qi et al. 25 already included a simulation study comparing several of the tests mentioned in (a)-(e). As a conclusion, they recommended a so-called nonparametric combination test (NCT).
We will therefore mainly focus on the nonparametric combination method proposed by Qi et al. 25 As an additional competitor to the bootstrap procedures proposed in Section 4, we choose the test of Little. 16 The latter assumes that the data follow a bivariate normal distribution; its test statistic T_LT studentizes the difference of the resulting maximum likelihood mean estimators. 16 The exact distribution of T_LT is rather complicated, and Little suggests approximating it by a t-reference distribution with n_c − 1 degrees of freedom, i.e. the test is given by ϕ_LT := 1{|T_LT| > t_{n_c−1,1−α/2}} for some level α ∈ (0, 1).
In addition, the nonparametric combination test 25 proposed by Qi et al. is based on a linear combination of the sign and the Wilcoxon-Mann-Whitney test statistics. It is proposed 25 to approximate the null distribution of T_NC by a normal distribution with mean 1 and an estimated variance.
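For orientation, the two building blocks of the NCT can be computed as follows. This sketch covers only the raw sign and Wilcoxon-Mann-Whitney counting statistics; the exact combination, standardization and variance estimate are as in Qi et al. 25 and are not reproduced here:

```python
import numpy as np

def sign_statistic(Xc):
    """Number of complete pairs with positive difference X1 - X2."""
    d = Xc[:, 0] - Xc[:, 1]
    return int(np.sum(d > 0))

def wmw_statistic(x, y):
    """Wilcoxon-Mann-Whitney count: number of pairs (x_i, y_j) with
    x_i > y_j, plus 0.5 for each tie."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    gt = (x[:, None] > y[None, :]).sum()
    eq = (x[:, None] == y[None, :]).sum()
    return gt + 0.5 * eq

# toy data: 3 complete pairs and 2 unpaired first components
Xc = np.array([[1.0, 0.5], [2.0, 2.5], [3.0, 1.0]])
Xi = np.array([0.0, 4.0])
print(sign_statistic(Xc), wmw_statistic(Xi, Xc[:, 1]))  # 2 3.0
```

Which portions of the data enter each component is a design choice of the NCT; here the unpaired first components are compared against the observed second components purely for illustration.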

Simulation Study
In this section, we investigate the finite-sample behavior of the methods described in Sections 4 and 5 in extensive simulations. All procedures were studied with respect to their (i) type-I error rate control at level α = 5% and (ii) power to detect deviations from the null hypothesis.
For each scenario, missing values were generated as follows. Under the MCAR mechanism, missing values are inserted randomly into the second component of the bivariate vectors X_j until a fixed number n_u of missing values in the second component is reached. Under the MAR mechanism, the probability of a missing value in the second component of X_j depends on the corresponding value of the first component in the following way: first, we divide the observations into three groups based on their first component values according to a 2σ-rule. The first group is given by {X_j = (X_1j, X_2j): X_1j ∈ (−∞, −2σ̂_1), j = 1, .., n}, the second by {X_j: X_1j ∈ (−2σ̂_1, 2σ̂_1), j = 1, .., n} and the last group by {X_j: X_1j ∈ (2σ̂_1, ∞), j = 1, .., n}, where σ̂_1 is the estimated sample standard deviation of all first components. Then, we randomly insert missing values into the second component with the following missing percentages: 15% for groups one and three and 30% for the second group.

Table 3. Type-I error simulation results (α = 0.05) of the tests for different distributions under varying correlation values (ρ), different total sample sizes n ∈ {10, 20, 30} (numbers of subjects) and homoscedastic covariance matrix Σ1 under the MAR framework. For each setting, the values closest to the prescribed level are printed in bold and values exceeding the upper limit (6.8%) of the 99% binomial interval are in red colour.

Table 4. Type-I error simulation results (α = 0.05) of the tests for different distributions under varying correlation values (ρ), different total sample sizes n ∈ {10, 20, 30} (numbers of subjects) and heteroscedastic covariance matrix Σ2 under the MAR framework. For each setting, the values closest to the prescribed level are printed in bold and values exceeding the upper limit (6.8%) of the 99% binomial interval are in red colour.
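The MAR insertion rule described above (2σ-rule grouping on the first component with missingness probabilities 15%/30%/15%) can be sketched as follows; the function name is ours:

```python
import numpy as np

def insert_mar(X, rng):
    """Set the second component to np.nan with probability 0.30 when the
    first component lies within 2 estimated standard deviations of zero,
    and with probability 0.15 in the two outer groups of the 2-sigma rule."""
    X = np.array(X, float)
    s1 = np.std(X[:, 0], ddof=1)            # estimated sd of the first arm
    inner = np.abs(X[:, 0]) < 2 * s1
    p = np.where(inner, 0.30, 0.15)         # group-wise missing probabilities
    miss = rng.random(len(X)) < p
    X[miss, 1] = np.nan
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
Xm = insert_mar(X, rng)
rate = np.mean(np.isnan(Xm[:, 1]))
print(round(rate, 2))   # roughly 0.29 in expectation for standard normal data
```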
In order to assess the power of all methods, we set µ = [δ, 0] ⊤ with shift parameter δ ∈ {0, 1/2, 1}. All simulations were carried out in the statistical computing environment R, based on n_sim = 10, 000 Monte Carlo runs and B = 1, 000 bootstrap runs (in case of the three bootstrap methods based upon T *_W, T *_A, and T *_M). The algorithm for the computation of the p-value of the parametric bootstrap tests is as follows:
1. For the given incomplete paired data, calculate the observed test statistic, say T.
2. Generate a bootstrap sample as described in Section 4 and calculate the corresponding bootstrap test statistic T*.
3. Repeat step 2 a large number of times (B = 1, 000) and estimate the p-value as the proportion of bootstrap statistics exceeding T.
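The bootstrap p-value computation above amounts to a simple Monte Carlo loop; schematically, with a placeholder statistic purely for illustration:

```python
import numpy as np

def bootstrap_p_value(T_obs, draw_stat, B=1000):
    """Monte Carlo p-value: share of bootstrap statistics exceeding the
    observed one.  draw_stat() must return one bootstrapped statistic."""
    T_star = np.array([draw_stat() for _ in range(B)])
    return float(np.mean(T_star >= T_obs))

# placeholder: pretend the bootstrap statistic is chi^2_2-distributed under H0
rng = np.random.default_rng(4)
p = bootstrap_p_value(T_obs=5.99, draw_stat=lambda: rng.chisquare(2), B=2000)
print(round(p, 2))   # close to 0.05, since 5.99 is near the chi^2_2 0.95-quantile
```

In the actual procedure, `draw_stat` would regenerate a full bootstrap data set and recompute T*_W, T*_A or T*_M each time.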
Type-I-Error Results. Simulation results on the type-I error level of the studied procedures under the MCAR framework for different sample sizes and for homoscedastic as well as heteroscedastic settings are summarized in Table 1 and Table 2, respectively. It can readily be seen that the suggested bootstrap approaches based upon T *_W, T *_A and T *_M tend to control the type-I error rate quite accurately under homoscedasticity as well as heteroscedasticity, and over the whole range of correlation factors, for most settings. Only two cases are problematic. First, in the case of the negatively unbalanced sample sizes (10, 30), particularly under heteroscedasticity, the bootstrapped MATS (T *_M) is not recommended due to its liberal behavior; in this case, however, the other two suggested bootstrap tests T *_W and T *_A control the type-I error rate accurately. Second, in the case of the skewed exponential distribution, the control is not adequate and a liberal behavior is observed; here, however, all the other considered procedures also fail to control the type-I error rate for the underlying sample sizes, which is indicated in bold red throughout all tables. Specifically, in the case of homoscedasticity and balanced sample sizes (10, 10), our three suggested tests still lead to accurate test decisions. For the positively unbalanced sample sizes (30, 10), the bootstrapped ANOVA-type statistic (T *_A) still controls the type-I error rate accurately under homoscedastic as well as heteroscedastic settings; it even shows the best type-I error rate control among all considered methods, identified by the bold entries in the tables.
In contrast, the other tests (T_NC, T_LT) do not consistently control the type-I error level under heteroscedasticity, or even under homoscedasticity, across the considered sample sizes. It can also be seen from Table 1 and Table 2 that the nonparametric combination test T_NC controls the type-I error quite accurately for larger numbers of complete pairs (n_c = 30) but shows liberal behavior for smaller numbers of complete pairs (n_c = 10); it becomes very liberal in the case of heteroscedasticity. Moreover, the test based on the maximum likelihood estimator, T_LT, tends to yield very liberal decisions for smaller numbers of complete pairs together with positive correlation factors ρ. For larger numbers of complete pairs, it leads to accurate type-I error rate control for correlation factors below 0.5. This behavior does not depend on the homoscedasticity assumption.
It was also of interest to assess the type-I error rate control of the tests under attributes similar to the breast cancer gene study, which reflects data sets with few pairs and a large unpaired portion. Simulation results for the type-I error rate of the studied procedures for sample sizes (n_c = 16, n_u = 74) are presented in Tables S.3 and S.4 in the supplement; the correlation ρ in Table S.4 is estimated based on the data. It can easily be seen from Tables S.3 and S.4 that the bootstrap tests are robust under large amounts of missing observations and control the type-I error rate accurately, especially T *_W and T *_A, except in the case of skewed distributions. The alternative approaches T_NC and T_LT show acceptable control under homoscedasticity but, under the exponential distribution combined with heteroscedasticity, they become very liberal.
Simulation results on the type-I error level of the studied procedures under the MAR framework for different sample sizes and covariance structures are summarized in Tables 3 and 4, respectively. There, it can be seen that for moderate to large sample sizes (n ∈ {20, 30}), the bootstrapped ANOVA-type statistic T *_A, the bootstrapped Wald-type statistic T *_W and the nonparametric combination test T_NC exhibit fairly good type-I error rate control in almost all considered scenarios under homoscedasticity as well as heteroscedasticity. Only in the case of the skewed exponential distribution is the control of T *_W and T_NC not adequate; the observed liberal behavior is marked in bold red throughout all tables. In contrast, the bootstrapped MATS T *_M and Little's T_LT tend to be sensitive to the dependency structure of the data. In particular, T *_M exhibits a liberal behavior for negative correlations, while T_LT does so for positive correlations. For small sample sizes (n = 10), the T_NC test tends to be liberal in all considered situations, while T_LT performs reasonably well and is liberal only for positive correlations. In contrast, the bootstrap tests T *_W and T *_M exhibit good type-I error rate control for most settings except for the Laplace distribution, while the bootstrapped ANOVA-type statistic T *_A tends to be very conservative, especially under heteroscedasticity.
Further Investigations on Type-I Error. In addition to the small and moderate sample size settings, we were also interested in the type-I error rate control when sample sizes increase while missing rates remain nearly unchanged. For moderate to large sample sizes, we considered the choices (n_c, n_u) = k · (1, 1) + (10, 10) with k ranging from 0 to 500 (balanced case) and (n_c, n_u) = k · (10, 30) with k ranging from 1 to 50 (unbalanced case). Figure 1 and Figure S.1 (in the supplement) summarize the type-I error rates (α = 0.05) for these settings. The results indicate that the nonparametric combination test T_NC by Qi et al. 25 controls the type-I error rate quite accurately under symmetric distributions but fails to do so under skewed distributions; in fact, it becomes even more liberal with increasing sample sizes. In contrast, Little's test 16 T_LT tends to be highly liberal for small sample sizes such as (n_c, n_u) = (10, 10) and less liberal, with accurate type-I error control, for larger sample sizes. Only the suggested bootstrap approaches T *_A, T *_W and T *_M control the type-I error rate accurately across all considered settings.
In order to cover the effect of increasing missing rates, we studied type-I error control for sample sizes of the form (n_c, n_u) = ((1 − r) · 30, r · 30) with r ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}, covering missing rates (among subjects) from 10% to 80%. Figure 3 and Figure S.2 in the supplement summarize the type-I error rate control for these settings under a homoscedastic and a heteroscedastic covariance structure, respectively. The results indicate that under homoscedasticity, the alternative approaches T_NC and T_LT tend to be slightly liberal; they move closer to the 0.05 threshold for missing rates below 60%. Under heteroscedasticity, T_LT keeps the same behavior, while T_NC becomes more sensitive to the missing rate: it exhibits conservative behavior for lower and liberal behavior for larger missing rates. In contrast, the suggested bootstrap approaches control the type-I error rate more accurately over the range of missing rates r for most settings. Only in the case of the skewed exponential distribution with missing rates greater than 50% is the control not adequate; in this case, however, all the other considered procedures also fail to control the type-I error rate. In particular, Little's test T_LT tends to be very liberal over the whole range of missing rates.
Power. In addition to type-I error rate control, we studied the power of the five tests for all considered settings. Due to the rather liberal behavior of the nonparametric combination test T_NC and the maximum likelihood test T_LT for small numbers of complete pairs, their power functions are not really comparable to the others in that case. We therefore present the power simulation results for large numbers of complete pairs, i.e. positively unbalanced sample sizes (n_c, n_u) = (30, 10) for the MCAR mechanism and n = 20 for the MAR mechanism; the power simulation results for the other scenarios are included in the supplement. The power results under homoscedastic as well as heteroscedastic settings are summarized in Tables 6 and 7 for the MCAR mechanism and Tables S.8 and S.9 in the supplement for the MAR mechanism. Entries that belong to very liberal tests are coloured red in the power tables. It can easily be seen that the five tests exhibit largely similar power behavior under homoscedastic as well as heteroscedastic settings. Only in the heteroscedastic cases with a skewed exponential distribution does the nonparametric combination test T_NC show larger power than the others, which is due to its rather liberal behavior. One should also note that the power behavior of each test varies with the dependency structure of the data, except for the bootstrapped ANOVA-type test (T *_A).

Breast Cancer Study: Gene Expression Data
The Cancer Genome Atlas (TCGA) project is a pilot project launched in 2005 with financial support from the National Institutes of Health. It aims to understand the genetic basis of several types of human cancer through the application of high-throughput genome analysis techniques. TCGA collects molecular information such as miRNA/mRNA expression, protein expression and sample weight, as well as clinical data about the patients.
A breast cancer study was performed by TCGA to improve the ability to diagnose, treat and prevent breast cancer by investigating the genetic basis of carcinoma. The study comprises 1093 breast cancer patients with clinical and RNA sequencing records. Among them, 112 subjects provided both normal and tumor tissues. Here, we were interested in the subset of this data containing patients with pathologic stage I disease. This subset contains a total of n_c = 16 complete pairs and an unpaired sample of size n_u = 74 from patients who provided only tumor tissue. The data can be downloaded from Firehose (www.gdac.broadinstitute.org).
Based on previous studies, six genes have been found to be significantly associated with breast cancer: TP53, ABCC1, HRAS, GSTM1, ERBB2 and CD8A. [39][40][41][42] Two further genes, C1D and GBP3, were also under investigation, although they have not shown any significant relation to breast cancer. 25 In this paper, we test whether the mean genetic expressions of these eight genes differ significantly between normal and tumor tissues of patients with early stage I breast cancer. Boxplots representing the characteristics of the eight genes are shown in Figure 4.
We applied all bootstrap testing methods T *_W, T *_A and T *_M, as well as the alternative approaches T_NC and T_LT, to test the null hypothesis of equal means between normal and tumor tissues (H_0: µ_1 = µ_2) against the two-sided alternative (H_1: µ_1 ≠ µ_2). The results are summarized in Table 5.
It can be seen from Table 5 that the bootstrap approaches T *_W, T *_A and T *_M and Little's method (T_LT) identified three out of the eight genes as having significantly different genetic expressions in normal and tumor tissues: ABCC1, HRAS and ERBB2. The nonparametric combination method T_NC, however, led to a different result for the ERBB2 gene.

Discussion and outlook
The problem of matched pairs with missing values occurs frequently in practice. Most procedures available in the literature are not applicable when missing values occur in a single arm only. One exception is the recent NCT approach by Qi et al. 25 , which utilizes a combination of the sign test and the Wilcoxon-Mann-Whitney rank sum test. For homoscedastic settings with symmetric distributions, this approach can be recommended. If, however, the underlying assumptions do not hold (e.g. in skewed heteroscedastic setups), the NCT may result in highly inflated type-I errors or a considerable loss of power. To overcome these issues, we have provided resampling procedures that are not based on any parametric assumptions and use all observed information of the matched pairs design. They were shown to be asymptotically correct and robust under heteroscedasticity and skewed distributions. The tests are based on restructuring all observed information in a test statistic of quadratic form: a Wald-type statistic (WTS), an ANOVA-type statistic (ATS), or a modified ANOVA-type statistic (MATS). Since the WTS is well known (from other situations 32,34,35 ) to be liberal, while the ATS and MATS tend to be rather conservative or liberal for small to moderate sample sizes, we improved their small-sample behavior by an asymptotic model-based bootstrap approach, whose asymptotic validity was also proven.
In an extensive simulation study, the type-I error rate control of the tests has been examined for symmetric and skewed distributions with homoscedastic and heteroscedastic covariance settings under different missing mechanisms. There, it was seen that the parametric bootstrap versions of the WTS, ATS and MATS improve their small-sample behavior. In particular, our bootstrap tests performed very well in most cases, even with larger amounts of missingness, heteroscedastic covariances or skewed data. Only the type-I error control for the exponential distribution, particularly under heteroscedasticity, MCAR and small paired sample sizes with a rather large unpaired portion (n_c = 10, n_u = 30), is not maintained; in this setting, however, all other considered methods 16,25 also failed to control the type-I error rate. Regarding the individual performance of the bootstrap tests, the parametric bootstrap version of the ATS yielded the most robust results and is therefore recommended. Furthermore, our simulation study shows that the bootstrap procedures' type-I error control is not much affected by less stringent missing data mechanisms such as MAR, although their power behavior is noticeably affected.
In order to simplify the application of our approaches, the three proposed bootstrap methods have been implemented in the PBT function of the freely available R package MissPair. It is available on GitHub (https://github.com/lubnaamro/MissPair) and will be made available on the CRAN repository.
Future research will be concerned with extending our procedures to multivariate settings (MANOVA).

In this supplementary material, we present the proofs of the propositions and theorems in the paper. Further, we recall the definitions of the different missing mechanisms and present additional type-I error and power simulation results for our suggested methods and the alternative approaches from Section 5 of the paper.

Proof of Proposition 2.1:
The results follow from the multivariate central limit theorem (CLT) and the law of large numbers, respectively.

Proof of Proposition 2.2:
The stated convergence follows from Proposition 2.1 and an application of the continuous mapping theorem (CMT).

Proof of Theorem 3.1:
It follows from Proposition 2.2 that $AZ_n \xrightarrow{d} N_2(0, A\Sigma A^\top)$ as $n \to \infty$ under $H_0$. Hence, using the CMT, the quadratic form $\tilde{T}_W = (AZ_n)^\top (A\Sigma A^\top)^+ (AZ_n)$ asymptotically has a central $\chi^2_f$ distribution with $f = \operatorname{rank}(A)$ degrees of freedom. Moreover, as $\hat{\Sigma}_n$ is a consistent estimator of $\Sigma > 0$, the result follows from Slutsky's theorem; see, e.g., Konietschke et al. 1 for similar arguments.
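The $\chi^2$ limit in Theorem 3.1 can be illustrated numerically. The sketch below draws $Z \sim N_3(0, \Sigma)$ under $H_0$, treats $\Sigma$ as known purely for this illustration, forms the Wald-type quadratic form with a rank-2 contrast matrix, and checks that the rejection rate at the $\chi^2_2$ 0.95-quantile is close to 5%:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_sim = 3, 20000
a = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])          # contrast matrix, rank(A) = 2
sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
chol = np.linalg.cholesky(sigma)
m_inv = np.linalg.inv(a @ sigma @ a.T)    # weight matrix of the WTS
crit = 5.991                              # chi^2 0.95-quantile, 2 df

# simulate Z ~ N_3(0, Sigma) under H0 and compute the Wald-type statistic
z = chol @ rng.standard_normal((p, n_sim))      # columns are draws of Z
az = a @ z                                      # shape (2, n_sim)
t_w = np.einsum('ij,ik,kj->j', az, m_inv, az)   # (AZ)^T (A Sigma A^T)^-1 (AZ)
rate = float(np.mean(t_w > crit))
print(rate)
```

With 20,000 replications the empirical rejection rate lands within Monte Carlo error of the nominal 5% level.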

Proof of Theorem 3.2:
Applying again the CMT, it follows that $\operatorname{tr}(A\Sigma A^\top) \cdot T_A = (AZ_n)^\top (AZ_n)$ asymptotically has the same distribution as a weighted sum of independent $\chi^2_1$-distributed random variables. 3 The result then follows from the invariance of the multivariate standard normal distribution under orthogonal transformations and the consistency of $\hat{\Sigma}_n$ by using Slutsky's theorem.
Proof of Theorem 3.3:
Following similar arguments as in Friedrich and Pauly, 4 we obtain the corresponding limit distribution. Thus, the result follows from the representation theorem of quadratic forms. 5
Proof of Theorem 4.1:
First, we apply the multivariate Lindeberg–Feller theorem (MLFT) to show that, given the data, $\sqrt{n}[\bar{X}^{*(c)}_{1\cdot}, \bar{X}^{*(c)}_{2\cdot}]^\top$ converges in distribution to a normally distributed random variable. We start by checking the MLFT conditions; the last step follows from the Cauchy–Schwarz inequality. The first term, $E[\|X^{*(c)}_k\|^2 \mid X]$, is asymptotically bounded, while the second term converges to zero in probability. 1 It follows that the Lindeberg condition is satisfied (in probability), which proves the claim for the conditional distribution of $\sqrt{n}\bar{X}^{*(c)}_{\cdot}$.
In a similar way, we prove that, given the data, $\sqrt{n}\bar{X}^{*(i)}_{1\cdot}$ weakly converges to $\frac{1}{\kappa_2} N(0, \sigma^2_1)$ in probability. Now, due to the MCAR setting, $X^{(c)}$ is independent of $X^{(i)}$ and, by Slutsky's theorem, $Z^*_n = \sqrt{n}[\bar{X}^{*(c)}_{1\cdot}, \bar{X}^{*(c)}_{2\cdot}, \bar{X}^{*(i)}_{1\cdot}]^\top$ converges in distribution to $N_3(0, \Sigma)$, where $\Sigma$ is defined in Section 6 of the paper. Following the same steps as in the proofs of Theorems 3.1–3.3 concludes the proof.
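A minimal sketch of the model-based bootstrap idea behind Theorem 4.1: bootstrap versions of $Z_n$ are drawn from a centered normal with the plug-in covariance estimate, and the observed ATS is compared against their distribution. The data layout, contrast vector, and scalings below are illustrative assumptions; the actual implementation is the PBT function in MissPair:

```python
import numpy as np

def ats(z, sigma, a):
    """ANOVA-type statistic for a scaled mean vector z."""
    az = a @ z
    return (az @ az) / np.trace(a @ sigma @ a.T)

def parametric_bootstrap_pvalue(x_c, x1_i, b=2000, seed=42):
    """Hedged sketch of the model-based bootstrap for H0: mu1 = mu2.

    x_c  : (n_c, 2) array of completely observed pairs
    x1_i : (n_i,)  array of subjects observed in the first arm only
    The layout mirrors Z_n = sqrt(n)[mean_1c, mean_2c, mean_1i]; the
    contrast vector and covariance scalings are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    n_c, n_i = len(x_c), len(x1_i)
    n = n_c + n_i
    # plug-in covariance estimate of Z_n; block-diagonal under MCAR
    sigma_hat = np.zeros((3, 3))
    sigma_hat[:2, :2] = np.cov(x_c, rowvar=False) * n / n_c
    sigma_hat[2, 2] = np.var(x1_i, ddof=1) * n / n_i
    a = np.array([[0.5, -1.0, 0.5]])   # pooled arm-1 mean minus arm-2 mean
    z = np.sqrt(n) * np.array([x_c[:, 0].mean(), x_c[:, 1].mean(),
                               x1_i.mean()])
    t_obs = ats(z, sigma_hat, a)
    # bootstrap versions of Z_n from the fitted centered normal model
    chol = np.linalg.cholesky(sigma_hat + 1e-10 * np.eye(3))
    t_star = np.array([ats(chol @ rng.standard_normal(3), sigma_hat, a)
                       for _ in range(b)])
    return float(np.mean(t_star >= t_obs))
```

Under a clear difference between the arms the p-value is close to zero, while under $H_0$ it is roughly uniform on the unit interval.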

MCAR, MAR and MNAR
To explain the different missing schemes, we define a missing indicator variable $R_j = [R_{1j}, R_{2j}]^\top \in \mathbb{R}^2$ that identifies what is observed and what is missing, i.e. $R_{ij} = 0$ if $X_{ij}$ is missing and $R_{ij} = 1$ otherwise, $j = 1, \ldots, n$. Then, Rubin defines the missing mechanism through a parametric distributional model on $R = \{R_j\}_{j=1}^n$ and classifies missingness into the Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing not at Random (MNAR) schemes. 6 To describe this in our model, let us denote the observed data by $X_{\mathrm{obs}} = (X^{(c)}, X^{(i)})$ and the missing observations by $X_{\mathrm{mis}}$.
The data are said to be MCAR if the probability of an observation being missing depends on neither the observed nor the unobserved data, i.e. if $P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(R)$.
Data are said to be MAR if the probability of missingness may depend on the observed data but not on the unobserved data, i.e. $P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(R \mid X_{\mathrm{obs}})$. Finally, data are said to be MNAR if missingness does depend on the unobserved data. It is easily seen that MAR includes MCAR as a special case. For more details about the different missing mechanisms, we refer to the monograph of Little and Rubin (2014). 7
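The three mechanisms can be mimicked in simulation as follows; the logistic response models and all parameter values are illustrative assumptions for a design with missingness confined to the second arm:

```python
import numpy as np

# Illustrative generation of missing indicators R_j for pairs (X_1j, X_2j)
# with missingness only in the second arm; models/parameters are assumptions.
rng = np.random.default_rng(2024)
n = 1000
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)

# MCAR: P(R_2j = 0) is a constant, independent of all data
r2_mcar = rng.random(n) >= 0.3                 # True = X_2j observed

# MAR: missingness in arm 2 depends only on the always-observed X_1j
p_miss_mar = 1.0 / (1.0 + np.exp(-x[:, 0]))    # logistic in X_1j
r2_mar = rng.random(n) >= p_miss_mar

# MNAR: missingness depends on the (possibly unobserved) value X_2j itself
p_miss_mnar = 1.0 / (1.0 + np.exp(-x[:, 1]))
r2_mnar = rng.random(n) >= p_miss_mnar
```

Under the MAR scheme above, the subjects missing in arm 2 have systematically larger values of the observed $X_{1j}$, which is exactly what distinguishes MAR from MCAR in a complete-case analysis.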

Type-I Error and Power Results
In the sequel, we present some additional type-I error and power results of the Monte Carlo simulation study, which is described in detail in Section 6 of the paper, for testing $H_0^\mu$ for matched pairs with missingness in one arm under the MCAR and MAR schemes.

Table 10. Power simulation results (α = 0.05) of the tests for different distributions under varying correlation values (ρ) with sample size n = 10 and heteroscedastic covariance matrix Σ2 under the MAR framework.

Table 11. Power simulation results (α = 0.05) of the tests for different distributions under varying correlation values (ρ) with sample size n = 20 and homoscedastic covariance matrix Σ1 under the MAR framework.