Random sample sizes in orthogonal mixed models with stability
Abstract
In this work, we present a new approach that considers orthogonal mixed models, under situations of stability, when the sample dimensions are not known in advance. In this case, sample sizes are considered realizations of independent random variables. We apply this methodology to the case where there is an upper bound for the sample dimensions, which may not be attained since failures may occur. Based on this, we assume that sample sizes are binomially distributed. We consider an application on the incidence of unemployed persons in the European Union to illustrate the proposed methodology. A simulation study is also conducted. The obtained results show the relevance of the proposed approach in avoiding false rejections.
1 INTRODUCTION
Situations where the sample dimensions are not known in advance are common when planning a study. An illustrative example was presented by Moreira et al1 and Nunes et al,2, 3 where observations were collected during a fixed time period in a study of patients with several pathologies arriving at a hospital. Another example is the approach proposed by Nunes et al,4 which was used to compare pathologies when one of them was rare. In these situations, samples with dimensions n1,…,nm are assumed to be realizations of independent random variables N1,…,Nm in models with random sample sizes. In the aforementioned papers, it was assumed that the collection of observations was given by a Poisson counting process, so the sample sizes were considered as realizations of independent Poisson variables.
Nonetheless, the earliest studies in which the number of observations was considered as random go back to the 1980s, with the work of Nguyen5 and Singh and Gupta.6 In the work of Nguyen,5 the random number of observations was modeled by a Poisson process, while in the work of Singh and Gupta,6 a random number of observations was used to estimate the mean.
More recently, in the work of Esquível et al,7 some classical statistical inference was extended to the case of a random number of observations. The authors also described some models for the number of observations, obtained by truncation or translation of usual models (Poisson, binomial, geometric, and negative binomial).
The aim of the present paper is to extend the theory of mixed models to the case of random sample sizes, considering stable statistics. In these models, the test statistics have the same distribution whether referring to the fixed or the random part of the model when the tested hypothesis holds (see the work of Ferreira8).

is a realization of the following random variable:


. Furthermore, the vector n=(n1,…,nm)t will be a realization of N=(N1,…,Nm)t.
, which is the mean vector of the vector
of sample means, we have


The remainder of this paper is organized as follows. In Section 2, we show how to test hypotheses on effects and interactions in mixed models, assuming normality, and how the conditional distribution of the test statistics is obtained. Considering that we have random sample sizes, the unconditional distributions of the statistics are presented in Section 3. In Section 4, an application considering the incidence of unemployed persons in the European Union is developed to illustrate this approach. In Section 5, a simulation study is provided to analyze the proportion of false rejections, which may be avoided by using the proposed methodology. Finally, in Section 6, we make some concluding comments.
2 MODEL, HYPOTHESES, AND TEST STATISTICS
In this section, we assume that we have fixed sample sizes n1,…,nm. The total number of observations will be n=n1+⋯+nm.
of ▽j,
, is null if and only if

belonging to the orthogonal complement of ▽j,
.
given by


being a blockwise diagonal matrix with principal elements
, and

, normal independent with null mean vectors and covariance matrices
, j=w′+1,…,w, and gj=rank(Aj). The
, j=w′+1,…,w, are independent from


are the means of the components of
, corresponding to the different samples. Therefore,

will be independent of

, where Ir denotes the identity matrix of order r.








(see Mexia11).
is independent of
, Sj,j=1,…,w, will be independent of




, and consequently ℑj will have a noncentral F distribution with gj and n−m degrees of freedom and noncentrality parameter δj, F(.|gj,n−m,δj). The power of the F test increases with δj, j=1,…,w′ (see the work of Mexia11).

of a random variable with central F distribution with gj and n−m degrees of freedom. In this case, the test power increases with γj, j=w′+1,…,w (see the work of Lehmann and Romano14).
3 UNCONDITIONAL DISTRIBUTIONS

. To estimate r (and consequently ri, i=1,…,m), we can assume that we have a fixed probability q of taking an observation, with r being the least value of the parameter in the binomial distribution that ensures

| n• | p = 0.1 | p = 0.2 | p = 0.3 | p = 0.4 | p = 0.5 | time (s) |
|---|---|---|---|---|---|---|
| 9 | 12 | 14 | 17 | 21 | 26 | 0.103 |
| 13 | 17 | 20 | 24 | 28 | 35 | 0.104 |
| 20 | 25 | 29 | 35 | 42 | 51 | 0.107 |
| 32 | 39 | 46 | 53 | 64 | 78 | 0.108 |
(1)
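The search for the least binomial parameter r described above can be sketched as follows. This is only an illustration under stated assumptions: the exact condition is not recoverable from the text, so we assume it has the form P(N ≥ n•) ≥ level with N binomially distributed with parameters r and q, and we assume level = 0.95. The names `prob_at_least`, `least_r`, and `level` are hypothetical.

```python
from math import comb


def prob_at_least(r, q, n_min):
    """Return P(N >= n_min) for N ~ Binomial(r, q)."""
    return sum(comb(r, k) * q**k * (1 - q) ** (r - k) for k in range(n_min, r + 1))


def least_r(n_min, q, level=0.95, r_max=10_000):
    """Smallest r with P(Binomial(r, q) >= n_min) >= level.

    The criterion and the default level are assumptions made for this sketch.
    """
    for r in range(n_min, r_max + 1):
        # The probability is nondecreasing in r, so the first hit is the least r.
        if prob_at_least(r, q, n_min) >= level:
            return r
    raise ValueError("no r <= r_max satisfies the criterion")


print(least_r(9, 0.9))  # 12 under the assumed level of 0.95
```

Under these assumptions, the value 12 obtained for n•=9 and q=0.9 coincides with the first tabulated entry if the tabulated p is read as 1−q; this reading is our own and is not stated in the text.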
4 AN APPLICATION
In this section, we will apply the presented methodology to the incidence of unemployment in the European Union. The data comes from PORDATA16 and gathers information regarding the age of the unemployed persons.
Given the sizes of the samples, we worked with asymptotic normal distributions to justify the use of F tests (see the work of Ito17). We will consider a two‐way model with one fixed effect factor and one random effect factor. The fixed effect factor will be the time periods, with two levels: 2006‐2007 (before financial crisis) and 2012‐2013 (during financial crisis). The random effect factor will be the countries. Because of the large number of countries, we used simple random sampling to select four countries from the available list. This leads to m=2×4=8 different treatments. Table 2 shows the sample mean age for each selected country in the two time periods.
| Countries (random effect factor) | 2006‐2007 (time period, fixed effect factor) | 2012‐2013 (time period, fixed effect factor) |
|---|---|---|
| Slovakia | 35.47186 | 33.96895 |
| Spain | 35.33543 | 35.88430 |
| Greece | 30.58468 | 34.92408 |
| Czech Republic | 34.85264 | 35.03557 |

We will consider the balanced case, which is robust in the presence of non‐normality and heteroscedasticity (see the work of Scheffé18). In this case, ni=6, i=1,…,8, which corresponds to the number of different classes for the age of the unemployed persons. So, n=8×6=48.


is defined by (1).


, j=1,2,3, from which we can obtain upper bounds for the quantiles of
, j=1,2,3.
When we consider these upper bounds as critical values, the sizes of the derived tests do not exceed the theoretical values (see the work of Nunes et al3). We can use these upper bounds for a preliminary test. If the statistic's value exceeds the upper bound, it also exceeds the real critical value (obtained considering random sample sizes), and in this case we reject the null hypothesis. When the statistic's value is lower than the upper bound, we must compute the real critical values or calculate the minimum value of n• that leads to rejection of the null hypothesis (eg, see the work of Nunes et al3, 19).
The interaction belongs to the random effect part of the model. Although the analysis usually starts with an interaction test and proceeds to the tests of the main effects whenever the interaction is not significant, we do not follow this approach since we are interested in showing how these tests could be carried out through unconditioning (see the work of Nunes et al3).
All computational procedures, namely, the quantiles of the conditional distributions and the upper bounds for the quantiles of the unconditional distributions, as well as the computations in Sections 3 and 5, were conducted using the R software.
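The preliminary test based on the upper bounds can be sketched as a simple decision rule. This is a minimal illustration, not the authors' implementation: the function name `preliminary_test` and the statistic values used in the example are hypothetical, and the "do not reject" branch relies on the statement made in Section 4.1 that the quantiles for random sample sizes exceed the classical ones.

```python
def preliminary_test(statistic, classical_quantile, upper_bound):
    """Classify the outcome of the preliminary test for one significance level.

    statistic          -- observed value of the test statistic
    classical_quantile -- quantile of the conditional (fixed sample size) distribution
    upper_bound        -- upper bound for the quantile of the unconditional distribution
    """
    if classical_quantile > upper_bound:
        raise ValueError("the upper bound must not be below the classical quantile")
    if statistic > upper_bound:
        # The statistic also exceeds the real (unconditional) critical value.
        return "reject"
    if statistic <= classical_quantile:
        # The real critical value is at least the classical one.
        return "do not reject"
    # Between the two: compute the real critical value or the minimum n•.
    return "inconclusive"


# Hypothetical statistic values checked against two illustrative critical values.
for value in (2.0, 4.0, 6.0):
    print(value, preliminary_test(value, 2.8387, 5.4095))
```

Only the middle case requires the more expensive computation of the real critical values, which is the practical appeal of the upper bounds.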
4.1 Fixed effect factor


has as components the values presented in Table 2.
for the denominator of the statistics. In this example, we obtain


| α | 0.10 | 0.05 | 0.01 |
|---|---|---|---|
| z1−α | 2.8354 | 4.0847 | 7.3141 |
| Upper bound for the unconditional quantile | 4.0604 | 6.6079 | 16.2582 |
Let us now assume that N ≥ 13, which means that n•=13. Table 3 also presents the upper bounds for the quantiles for probability 1−α of the unconditional distribution of ℑ1,
. So, assuming n•=13, we do not reject H0,1 at the usual levels of significance.
The quantiles for random sample sizes (obtained when using the unconditional distribution) will exceed the classical ones (obtained when using common conditional distribution) since a new source of variation is considered. So, since in this case we do not reject the null hypothesis using the classical quantiles, we take the same decision using the quantiles for random sample sizes and consequently considering the upper bound approach. Therefore, we conclude that the fixed effect factor is not significant.
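The tabulated quantiles for the fixed effect factor can be checked numerically. The sketch below is ours and uses Python's standard library rather than the R software mentioned in the text. It assumes that the conditional quantiles are those of the central F distribution with g and n−m degrees of freedom, and that the upper bounds are the quantiles of the central F distribution with g and n•−m degrees of freedom, with g=1, m=8, n=48, and n•=13; the last assumption is inferred from the tabulated values, not stated in the text. All function names are hypothetical.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_cdf(z, d1, d2):
    """CDF of the central F distribution with d1 and d2 degrees of freedom."""
    if z <= 0.0:
        return 0.0
    return reg_inc_beta(d1 / 2.0, d2 / 2.0, d1 * z / (d1 * z + d2))


def f_quantile(prob, d1, d2):
    """Quantile of the central F distribution, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, d1, d2) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


g, m, n, n_min = 1, 8, 48, 13  # n_min plays the role of n•
for alpha in (0.10, 0.05, 0.01):
    conditional = f_quantile(1 - alpha, g, n - m)
    upper = f_quantile(1 - alpha, g, n_min - m)
    print(f"alpha={alpha:.2f}  z={conditional:.4f}  upper bound={upper:.4f}")
```

Replacing g=1 by g=3 gives the corresponding check for the random effect factor and the interaction.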
4.2 Random effect factor



| α | 0.10 | 0.05 | 0.01 |
|---|---|---|---|
| z1−α | 2.2261 | 2.8387 | 4.3126 |
| Upper bound for the unconditional quantile | 3.6195 | 5.4095 | 12.0600 |
Assuming that the global minimum dimension is n•=13, the upper bounds for the quantiles for the unconditional distribution of ℑ2,
, are presented in Table 4. These results lead us to reject the hypothesis for α=0.1 and not to reject it for α=0.05 and 0.01, which means that the decisions made under the two approaches are not concordant for α=0.05 and 0.01.
Table 5 presents the total sample sizes needed to reject the hypothesis as well as the execution time (in seconds). So, we have
for all n• ≥ 27 (see Nunes et al3). In this case, we reject H0,2 considering the usual values of α, and the random effect factor is significant, which means that the unemployed's age is significantly different for different countries.
| α | 0.1 | 0.05 | 0.01 | time (s) |
|---|---|---|---|---|
| n• | 12 | 14 | 27 | 0.036 |
4.3 Interaction between factors

(2)
Considering the quantiles for the conditional distribution presented in Table 4, we decide to reject H0,3 for α=0.10, 0.05, and 0.01.
Through the upper bounds for the quantiles,
, of the unconditional distribution presented in Table 4, obtained considering n•=13, we decide to reject H0,3 for α=0.1 and not to reject it for α=0.05 and 0.01. So, these results show that the decision depends on the approach when we consider α=0.05 and 0.01.
The total sample sizes needed to ensure rejection are those presented in Table 5. Therefore, we have
for all n• ≥ 27, which means that we reject H0,3 considering the usual values of α. In this case, the interaction between factors is significant.
- The fixed effect factor is not significant, which means that the unemployed's age is not significantly different for these two periods.
- Considering a sufficiently large dimension for the samples, the random effect factor and interaction are significant.
- For the random effect factor and interaction, we conclude that the common approach may lead to a false rejection for α = 0.05 and 0.01. A rejection is considered a false rejection whenever the null hypothesis is rejected by the conditional approach but not by the unconditional approach.
5 A SIMULATION STUDY
In this section, we carry out a simulation study to verify the proposed method and analyze the proportion of false rejections that may be avoided by using our approach.
In line with Section 4, we considered a mixed model with a fixed effect factor (with two levels) and a random effect factor (with four levels). The sample dimensions, for the eight different treatments, were randomly generated, considering integers from a uniform distribution in the interval
. Next, the observed value of the statistics was computed for the fixed effect factor, random effect factor, and interaction, as well as the respective P values considering the conditional and unconditional approaches (assuming n•=10, n•=30, n•=50, and n•=100). The program ran 1000 times.
Table 6 shows the obtained results. The values in bold correspond to the cases where the null hypothesis is rejected at the 5% significance level. The execution time (in seconds) of this process is also provided. We conclude that, for n•=10, the decisions taken under the two approaches are not concordant for the fixed effect factor and the interaction.
| | ℑj,Obs | Conditional approach p‐value (n=8133) | Unconditional approach p‐value, n•=10 | Unconditional approach p‐value, n•=30 | Unconditional approach p‐value, n•=50 | Unconditional approach p‐value, n•=100 | time (s) |
|---|---|---|---|---|---|---|---|
| Fixed effects | 17.2170 | **<0.001** | 0.0535 | **<0.001** | **<0.001** | **<0.001** | |
| Random effects | 48.1637 | **<0.001** | **0.0204** | **<0.001** | **<0.001** | **<0.001** | 6.767 |
| Interaction | 9.9682 | **<0.001** | 0.0925 | **<0.001** | **<0.001** | **<0.001** | |
The number of rejections was also computed, considering both approaches, as well as the proportion of false rejections, which can be avoided when the proposed methodology is considered. From the results presented in Table 7, we conclude that the proportion of false rejections is very high for n•=10 (above 0.5 in all three tests) and decreases as n• increases.
| | Conditional approach (number of rejections) | Unconditional approach (number of rejections), n•=10 | Unconditional approach (number of rejections), n•=30 | Unconditional approach (number of rejections), n•=50 | Unconditional approach (number of rejections), n•=100 | Number / proportion of false rejections, n•=10 | Number / proportion of false rejections, n•=30 | Number / proportion of false rejections, n•=50 | Number / proportion of false rejections, n•=100 |
|---|---|---|---|---|---|---|---|---|---|
| Fixed effects | 642 | 295 | 609 | 624 | 635 | 347 / 0.540 | 33 / 0.051 | 18 / 0.028 | 7 / 0.011 |
| Random effects | 927 | 364 | 910 | 920 | 923 | 563 / 0.607 | 17 / 0.018 | 7 / 0.008 | 4 / 0.004 |
| Interaction | 942 | 384 | 922 | 936 | 940 | 558 / 0.592 | 20 / 0.022 | 6 / 0.006 | 2 / 0.002 |
The results obtained agree with our previous comments on the relevance of our approach in avoiding false rejections. So, we suggest the use of this approach whenever the sample dimensions are not known in advance.
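The arithmetic behind the false rejection columns of Table 7 can be sketched as follows. The sketch assumes that every rejection under the unconditional approach is also a rejection under the conditional approach, so that the number of false rejections is the difference between the two counts and the proportion is taken relative to the conditional count. The function name `false_rejections` is hypothetical; the counts are those reported for the fixed effect factor.

```python
def false_rejections(conditional_count, unconditional_count):
    """Return (number, proportion) of false rejections.

    A false rejection is a rejection by the conditional approach that is
    not a rejection by the unconditional approach.
    """
    number = conditional_count - unconditional_count
    return number, number / conditional_count


conditional = 642  # fixed effects, conditional approach
unconditional = {10: 295, 30: 609, 50: 624, 100: 635}  # keyed by n•

for n_min, count in unconditional.items():
    number, proportion = false_rejections(conditional, count)
    print(f"n•={n_min}: {number} / {proportion:.3f}")
```

Applied to the other two rows, the same computation agrees with the tabulated values except for the interaction at n•=30, where 20/942 rounds to 0.021 rather than the tabulated 0.022, a difference in the last rounded digit.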
6 FINAL REMARKS
In this work, we try to open a new field in the application of orthogonal mixed models to situations where sample dimensions are unknown. We point out the interesting fact that the behavior of the tests under the null hypothesis is the same for the fixed effect and the random effect parts of the models. This situation of stability was first considered by Ferreira.8 We resorted to the case where the sample dimensions are binomially distributed, which is considered the appropriate choice when observation failures may occur. This approach leads to interesting applications in several research areas, namely in economic research, as can be seen through the application. A simulation study was conducted to verify the proposed methodology and compare it with the classical one. The obtained results confirm the relevance of the proposed approach in avoiding false rejections.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous referees for useful comments and suggestions. This work was supported in part by the FCT‐ Fundação para a Ciência e Tecnologia, through projects UID/MAT/00212/2019 and UID/MAT/00297/2019.
CONFLICT OF INTEREST STATEMENT
The authors declare that there is no conflict of interest regarding the publication of this article.
Biographies
Célia Nunes is an Assistant Professor at the Department of Mathematics, University of Beira Interior (UBI, Covilhã, Portugal), and a member of the Center of Mathematics and Applications, UBI. She received the BSc degree in mathematics and the MSc degree in applied mathematics from the University of Évora (Évora, Portugal), and the PhD degree in mathematics from UBI. She has published research in the fields of probability and statistics and applied statistics. Her research interests include statistical inference in linear models and distribution theory. She has been involved in several research projects and is a guest reviewer of several international journals, being also a scientific committee member of international conferences. For comprehensive access to her work, please refer to: http://orcid.org/0000-0003-0167-4851.
Anacleto Mário is a student in the Mathematics and Applications PhD program of the University of Beira Interior (UBI, Covilhã, Portugal), and is a collaborator in the Center of Mathematics and Applications, UBI. He received the BSc degree in science of education from Agostinho Neto University (Luanda, Angola), and the MSc degree in financial auditing from the Polytechnic University of Madrid (Madrid, Spain). His doctoral research focuses on the extension of the theory of mixed effects models to the case of random sample sizes.
Dário Ferreira is an Assistant Professor at the Department of Mathematics, University of Beira Interior (UBI, Covilhã, Portugal), a member of the Center of Mathematics, UBI, and an active member of the Portuguese Society of Statistics (SPE). He graduated in mathematics from the University of Évora (Évora, Portugal) and received the PhD degree in mathematics and statistics from UBI. His general research area is linear models. More specifically, he works on the estimation of variance components using algebraic and stochastic methods. He has published research in this field, has been invited to participate in several research projects, and has cochaired several international conferences and workshops, being also a program committee member of several international conferences. For more details, please visit http://dar364.wix.com/dario.
Sandra S. Ferreira is a Professor at the University of Beira Interior (UBI, Covilhã, Portugal). She received the PhD degree from UBI in 2006, where she also teaches courses on basic statistics, quantitative methods, hierarchical linear models, and multivariate analysis. She is a member of the working group (WG) CMStatistics (this WG focuses on all computational and methodological aspects of statistics) and a member of the IEOM Society, and serves as an editorial board member of several journals. Her publications and current research interests focus on statistical inference for estimable functions and variance components in linear mixed models with commutative orthogonal block structure. ORCID: 0000‐0002‐9209‐7772.
João T. Mexia is an Emeritus Full Professor at the Faculty of Science and Technology, New University of Lisbon (FCT/UNL, Lisbon, Portugal). He received the BSc degree in forestry from the Technical University of Lisbon (Lisbon, Portugal) and the Habilitation in Mathematics and PhD degree in statistics from FCT/UNL. He has published many research papers in reputed international journals in the fields of probability and statistics, and his current research interest is in linear statistical inference. He has supervised more than 40 PhD students. He has also supervised research projects and participated in the scientific committees of international conferences.






