Models for cluster randomized designs using ranked set sampling

Cluster randomized designs (CRDs) provide a rigorous framework for randomization in studies where treatments are allocated to cluster units rather than to the individual subjects within clusters. It is known that CRDs are less efficient than completely randomized designs, since the randomization of treatment allocation is applied to the cluster units. To mitigate this problem, we embed a ranked set sampling design from survey sampling into the CRD for the selection of both cluster and subsampling units. We show that ranking groups in ranked set sampling act like a covariate, reducing the expected mean squared cluster error and increasing the precision of the sampling design. We provide an optimality result to determine the sample sizes at the cluster and sub-sample levels. We apply the proposed sampling design to a dental study on human tooth size and to a longitudinal study from an education intervention program.


INTRODUCTION
Sometimes the research focus of an experiment is on assessing the treatment or intervention effects on a system, methodology, procedure or policy rather than the effects on individual patients, participants or a single unit. In this case, the treatments are randomly allocated to cluster units rather than to the individual subjects (or units) within clusters. The design that governs the treatment allocation for these types of studies is called a cluster randomized design (CRD). When the target of the intervention is a collective entity or system rather than a particular person, such as a patient or student, CRD is an appealing procedure. 1 For example, CRD is better able to evaluate whether a new standard of care, guideline recommendation, or other practice-wide, hospital-wide, or system-wide change is affecting patient outcomes. In education research, the CRD would be appealing to see if a new education intervention program is more effective than a traditional education program at school or school-district levels. In this setup, CRDs provide logistical flexibility, a real-world view of research and a reduction in the total cost of the experiment, while minimizing the possibility of treatment contamination between units randomized to different intervention programs.
CRDs are particularly useful for studies where the nature of the intervention carries a high risk of contamination. 1,2 If the individuals in a treatment group in traditional clinical trials or education research are in frequent contact with individuals in other treatment groups, they may influence the outcome in that treatment group leading to contamination. In this case, performing a random allocation on cluster units, such as hospitals or schools, minimizes the contamination.
Fast changes in traditional practices in health care systems demand a more practical approach to all aspects of research without sacrificing the basic principles of good design. In this regard, CRDs can be viewed as pragmatic clinical trials. They can be adapted easily to real-world settings and move research into practice more quickly than controlled clinical experiments. Two key features are proposed for a pragmatic clinical trial: (1) it leverages the availability of existing data, such as the electronic health record, and repurposes the data for research; 3 (2) it uses cluster, or group, randomization, 4,5 for which clinics, physicians, school districts or schools are the unit of randomization. An example in Reference 6 shows the pragmatic nature of a cluster randomized design in a trial comparing a 12-h nursing shift to an 8-h shift. A CRD for this situation randomizes the 12- and 8-h shifts to floors, wards or even hospitals instead of randomizing individual staff. This type of randomization reflects the reality of the real-world setting of the intervention program and also minimizes the contamination in each cluster unit.
Use of CRDs has steadily increased in the medical literature since 1970, and CRDs are extensively used in low-income countries. [7][8][9] CRDs are suitable research designs to evaluate the effectiveness of public health intervention programs that aim to change the health behavior of individuals in society, such as improving dental care, 10 promoting hand-washing, 11 attending for immunization, 12 preventing substance abuse and chronic disease, and promoting sexual health. [13][14][15][16][17] CRDs have also been used to evaluate intervention programs designed to help children become 'health messengers', as in a trial assessing whether hypertension education of children had an impact on the blood pressure of their parents. 18 The design, analysis and conduct of CRDs are often more complicated than for individually randomized trials. Many CRDs are run with a small number of cluster units (insufficient sample sizes) in both intervention and control groups. Smaller sample sizes reduce the precision of the estimator and prevent drawing a meaningful conclusion about the statistical significance of the intervention effect. CRDs have two different sample sizes: the cluster and within-cluster sample sizes. The impact of these sample sizes on the cost of the trial and on the precision of the estimator depends on the intra-cluster correlation coefficient (ICC). These sample sizes need to be optimized for a given ICC and the total budget of the experiment. Raudenbush constructed an optimal design for CRDs using a covariate. 19 Porter and Raudenbush showed that a single covariate can reduce the required sample size by half and provide a substantial amount of cost saving in CRDs. 20 Watson et al investigated the impact of small sample sizes on the performance of a three-arm parallel CRD. 21 Li et al 22 constructed a three-level CRD with two covariates to assess treatment effect heterogeneity. Recent developments in CRDs and up-to-date references can be found in review papers. 23,24

In clustered populations, there is often information about the relative positions of the cluster units in a small comparison set. For example, one can easily rank a small set of schools in a school district based on their previous years' performance, their income level or a combination of both. Ranking information is used in ranked set sampling with cluster randomized designs to estimate the intervention effect. 25 Wang et al formulated the CRD as a two-sample problem performed in two stages. 25 The treatments are randomly assigned to cluster units in the first stage, and subsamples are selected from the cluster units in the second stage. Each stage can be performed using either simple random sample (SRS) or ranked set sample (RSS) designs. The inference on the intervention effect relies on the least squares estimates of the difference between the control and treatment population means. It would be useful to relate this inference to a comprehensive approach based on expected mean squares in the analysis of variance (ANOVA) table.
Human tooth-size analysis is a research interest in orthodontics and forensic identification. 26,27 Earlier studies have shown some evidence for difference in tooth sizes between male and female individuals as well as among racial groups. [28][29][30] Lee and colleagues 30 reported a data set that included a total of 293 subjects with normal occlusion selected from young Korean adults through a community dental health survey. The survey was conducted from 1997 to 2005 and involved 178 men and 115 women. A human adult typically has 32 teeth: 8 incisors, 4 canines, 8 premolars and 12 molars (including 4 wisdom teeth). For each of the subjects, the widths of all non-wisdom teeth were measured (in mm) using digital Vernier calipers by two technicians and the averages of their measurements were recorded. In this paper, we used these 293 individuals as our cluster population. In this population, each individual is considered as a cluster unit, and each of his/her teeth as an individual within cluster unit. The parameter of interest Δ is the difference between the mean tooth sizes for men (control) and women (treatment) groups. We analyze this data set in Section 6.
The approach in this paper differs from the one used by Wang et al 25 and from the covariance analysis used by Raudenbush. 19 The main idea in this paper and in Wang et al 25 is to use the ranks (relative positions) of experimental units in a small set to construct a CRD and increase its efficiency. Even though both papers use RSS at either the cluster or sub-sampling level, their focuses and results differ significantly. Wang et al concentrate on the estimation and testing of the intervention effect parameter. This paper takes a broader view and treats the problem as one of experimental design. It considers more than two treatments, identifies all sources of variation including their expected mean squares, and establishes an optimal sample size allocation procedure for stage 1 and 2 sampling. Since our proposed model is in the framework of a linear model, it can be expanded easily to cover other sources of variation and correlation structures among cluster units.
Section 2 introduces a CRD that incorporates ranking information in stage 1 and 2 sampling. ANOVA tables along with the expected mean squares of model components are constructed. The efficiency of the new design is compared with that of traditional CRDs. Section 3 considers the estimation of the intervention effect. For a given cost model and intra-cluster correlation coefficient, optimal sample sizes for the stage 1 and 2 samples are constructed. Under reasonable ranking information, the proposed CRDs provide a substantial amount of improvement in the cost of the experiment. Section 4 investigates the empirical power of the F-test of the treatment effects. Section 5 considers an example to illustrate the use of the proposed design in an education intervention program. Section 6 uses dental survey data to investigate the empirical properties of the proposed design. Section 7 provides concluding remarks.

CLUSTER RANDOMIZED DESIGN WITH RANKING GROUPS
A traditional CRD is performed in two stages. In the first stage, a completely randomized design is constructed with t treatments, each applied to n cluster units, where the cluster units are regarded as experimental units. For notational convenience we use factors A and C for treatment and cluster effects, respectively. In the second stage, sub-samples of size m are selected from each of the experimental units in the first stage of the completely randomized design. The outcomes are measured at the individual level using subsampling in the second stage. A simple model for this design can be written as

Y_ihk → Y_ijk = μ + τ_i + σ_c c_j(i) + σ_ε ε_k(ij), i = 1, …, t; j = 1, …, n; k = 1, …, m, (1)

where μ is the overall mean; τ_i is the treatment effect (factor A); c_j(i) is the j-th cluster effect (factor C) nested within the i-th treatment; and ε_k(ij) is the random error within the i-th treatment and j-th cluster unit. In this model, factor A is considered a fixed effect with the constraint Σ_{i=1}^t τ_i = 0. Factor C is a random effect having a normal distribution with mean zero and variance 1. The error term ε_k(ij) is also random, having a normal distribution with mean zero and variance 1. Two scaling factors, σ_c and σ_ε, are included for the cluster effects and the experimental units, respectively. We assume that c_j(i) and ε_k(ij) are independent.
The very first stage of the CRD in model (1) is to choose a total of tn cluster units, say tn schools from a school district, and then a subsample of m students from each of these tn schools. This design requires a total of tn experimental units at stage 1 and tnm observational units at stage 2 sampling. Table 1 provides the expected mean squares for the two-factor nested design in model (1). Under the assumptions of model (1), the F test for the treatment effect is the ratio of the mean squares for the treatment and cluster effects, and the F test for the cluster effect (ie, σ_c = 0) is the ratio of the mean squares for the cluster effect and residuals.
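To make the nested structure of model (1) concrete, the following sketch simulates data from the model and computes the treatment F-statistic MSA/MSC. The simulation and ANOVA code use standard balanced nested-design formulas and are illustrative, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_crd(t=3, n=8, m=10, tau=None, sigma_c=1.0, sigma_e=1.0):
    """Draw Y_ijk = tau_i + sigma_c*c_j(i) + sigma_e*e_k(ij) (mu = 0)."""
    tau = np.zeros(t) if tau is None else np.asarray(tau, dtype=float)
    c = rng.standard_normal((t, n))        # cluster effects nested in treatments
    e = rng.standard_normal((t, n, m))     # within-cluster errors
    return tau[:, None, None] + sigma_c * c[:, :, None] + sigma_e * e

def f_treatment(y):
    """F-statistic for treatments in the nested design: MSA / MSC."""
    t, n, m = y.shape
    grand = y.mean()
    trt = y.mean(axis=(1, 2))              # treatment means
    clu = y.mean(axis=2)                   # cluster means
    msa = n * m * ((trt - grand) ** 2).sum() / (t - 1)
    msc = m * ((clu - trt[:, None]) ** 2).sum() / (t * (n - 1))
    return msa / msc

# Under a treatment effect satisfying sum(tau) = 0, F tends to exceed 1.
y = simulate_crd(tau=[-0.5, -0.5, 1.0])
print(round(f_treatment(y), 2))
```

Comparing this statistic to an F(t − 1, t(n − 1)) quantile gives the treatment test described above.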
If there is additional ranking information to induce further structure among cluster units, a grouping factor with H1 levels can be created to enable the use of a cluster randomized ranked design (CRRD). We use a school district population to illustrate the construction of a CRRD with t treatments and H1 ranking groups. The ranking group effects will be denoted by factor B. Suppose that n is a multiple of H1, say n = d1H1. Each ranking group will then contain td1 units (schools from the school district). The schools in each ranking group will be selected using a ranked set sample (RSS) design. The RSS approach requires that, for each ranking group h (h = 1, …, H1), we select H1 schools from the pool of all schools in the school district, rank them from smallest to largest based on any or all available information, and identify the unit with rank h. The information used to do this may be incomplete, rough and subjective. This rank is called a judgment rank to reflect the subjective nature of the ranking process. If there exist measurable auxiliary variables, such as the previous year's exam scores, income levels of schools or a combination of both, the ranking process can be done using these secondary variables. We have thus examined H1² schools in order to obtain an independent estimate of each of the H1 possible ranks. We repeat this process td1 times to identify all required (td1) experimental units in each ranking group h (h = 1, …, H1).

TABLE 1 Expected mean squares of model (1).

Source | df | Expected mean squares
Treatment (A) | t − 1 | σ_ε² + mσ_c² + nm Σ_{i=1}^t τ_i²/(t − 1)
Cluster (C) | t(n − 1) | σ_ε² + mσ_c²
Error | tn(m − 1) | σ_ε²

TABLE 2 Cluster randomized ranked design when H1 = 4, t = 3 (rows: ranking-group blocks; columns: treatments).

TABLE 3 Subsampling using SRS in the i-th treatment and the h-th ranking-group (stage 2 units in cell (T_i, h)).
This process creates H1 stochastically ordered ranking groups of schools. In other words, schools in ranking group h are stochastically smaller than schools in ranking group h′ whenever h ≤ h′. The ranking groups (or ranking information) attempt to order the schools according to the outcomes most likely to occur in the absence of treatment effects, and this ordering is likely to be imperfect.
The td 1 schools in each ranking group h are now randomly allocated with d 1 schools assigned to each of the t treatments, so that there are n = d 1 H 1 schools in each treatment group. The first stage of CRRD is illustrated in Table 2. In this table, each treatment and ranking-group cell (T i , h) contains d 1 cluster units from the cluster population, and the number of cluster units applied to treatment T i is n = d 1 H 1 .
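The stage-1 construction above can be sketched as follows; the pool and its auxiliary scores are hypothetical, and for simplicity each comparison set is drawn independently (a field implementation would avoid reusing units across sets):

```python
import numpy as np

rng = np.random.default_rng(11)

def stage1_ranking_groups(scores, H1, t, d1):
    """For each rank h, draw t*d1 comparison sets of H1 schools, rank each
    set by its auxiliary score, and keep the school with judgment rank h.
    Returns selected pool indices, shape (H1, t*d1)."""
    selected = np.empty((H1, t * d1), dtype=int)
    for h in range(H1):
        for r in range(t * d1):
            s = rng.choice(len(scores), size=H1, replace=False)
            selected[h, r] = s[np.argsort(scores[s])][h]  # judgment rank h+1
    return selected

scores = rng.standard_normal(500)              # hypothetical ranking variable
groups = stage1_ranking_groups(scores, H1=4, t=3, d1=2)
print(groups.shape)                            # 4 ranking groups of t*d1 = 6 schools
```

The d1 schools in each row block would then be randomly assigned to each of the t treatments, giving the layout of Table 2.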
Stage 2 of the CRRD considers the selection (or sampling) of secondary units for each of the cluster units chosen at stage 1. This can be done using either a simple random sample (SRS) or a ranked set sample design. In an SRS design, we simply select m students from each of the d1 schools in treatment T_i and ranking-group h. This design is illustrated in Table 3 for a cell (T_i, h). The model for the observed response is then given by

Y_ihjk = μ + τ_i + σ_c β*_h + γ_ih + σ_c c*_j(ih) + σ_ε ε_k(ihj), (2)

where Y_ihjk is the response from the k-th student within the j-th school, the h-th ranking group and the i-th treatment; μ is the overall mean; τ_i is the treatment effect; β*_h is the ranking-group effect, the expected value of the h-th judgment order statistic of a sample of size H1 from a standard normal distribution; γ_ih is the interaction effect between treatment i and ranking group h; c*_j(ih) is the cluster effect nested within the i-th treatment and h-th ranking group; and ε_k(ihj) is the k-th error term for treatment i, ranking group h and cluster unit j. For each combination (i, h), c*_j(ih) is the h-th smallest judgment order statistic (with its mean shifted to zero) from a standard normal distribution in a set of size H1, and these are independent of each other by virtue of the RSS process. We note that even though c*_j(ih) is a judgment order statistic, it has expected value 0, since its mean is moved to β*_h in factor B. The factors A, B and the AB interaction are considered fixed effects with the usual constraints Σ_{i=1}^t τ_i = 0, Σ_{i=1}^t γ_ih = 0 and Σ_{h=1}^{H1} γ_ih = 0. The error terms ε_k(ihj) have independent normal distributions with mean zero and variance one. The cluster errors c*_j(ih) and subsampling errors ε_k(ihj) are mutually independent. The expected mean squares for this model are given in Table 4. The design in model (2) requires td1H1² experimental units at stage 1, but only td1H1 of them are used for treatment allocation and measurement. The other td1H1(H1 − 1) are used to identify the ranking groups.

TABLE 4 Expected mean squares of model (2).
School districts in each set are ranked prior to the start of the experiment using available information. Hence, the ranking process is subjective and the rank order of the within-set units is likely to be in error. Let G_[h](y) be the cumulative distribution function (CDF) of the h-th judgment order statistic c*_j(ih) from a standard normal distribution. The ranking procedure is called consistent if the following equality holds:

(1/H1) Σ_{h=1}^{H1} G_[h](y) = G(y) for all y,

where G(y) is the CDF of the standard normal distribution. Presnell and Bohn 31 showed that a ranking procedure is consistent if it assigns a unique rank to each unit in the set using the same ranking method. Under a consistent ranking scheme we can establish the equalities β*⊤1 = 0 and σ̄*² = 1 − β*⊤β*/H1, where β* and σ̄*² are the mean vector and the average of the diagonal elements of the variance-covariance matrix T*_{H1} of the H1 judgment order statistics from a standard normal distribution. In general, the forms and values of β* and T*_{H1} are unknown; they depend on the ranking procedure.
We provide a ranking model to establish explicit formulas for β*_h and σ*²_h. We assume that there exists a secondary variable ζ_j(ih) that can be used to rank the H1 units in each set and which has correlation ρ1 with the cluster effect. Without loss of generality, we assume this ranking variable has mean zero and variance one. Let u_j(ih) be independent standard normal random variables. We can then model the unordered error term c_j(ih) for the cluster unit j in the treatment and ranking-group cell (i, h) as

c_j(ih) = ρ1 ζ_j(ih) + √(1 − ρ1²) u_j(ih), (3)

where ρ1 is the correlation coefficient between the ranking variable ζ_j(ih) and the error term c_j(ih), ρ1 = cor(c_j(ih), ζ_j(ih)). The random variable u_j(ih) can be considered a noise variable which determines how accurate the ranking process is, ranging from accurate (ρ1 = 1) to random (ρ1 = 0). Let c+_j(ih)(ρ1) be the properly centered value of c_j(ih) ordered according to the values of ζ(1)_j(ih), where ζ(1)_j(ih) is the h-th smallest observation among H1 standard normal random numbers and the superscript (1) indicates that the ordering is done at the cluster-level sampling. Then c+_j(ih)(ρ1) can be written as

c+_j(ih)(ρ1) = ρ1 (ζ(1)_j(ih) − η_h) + √(1 − ρ1²) u_j(ih).

It can immediately be seen that

E[c+_j(ih)(ρ1)] = 0 and Var[c+_j(ih)(ρ1)] = ρ1² σ_h² + 1 − ρ1²,

where η_h and σ_h² are the mean and variance of the h-th order statistic from a standard normal distribution, and can easily be computed.
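Ranking model (3) is easy to check by simulation: generate the ranking variable and the noise, order the errors by the ranking variable, and inspect each rank's mean and variance. The function below is a sketch of that experiment, not the paper's code; it returns the uncentered judgment order statistics, whose rank-h mean estimates the quantity the text centers out.

```python
import numpy as np

rng = np.random.default_rng(3)

def judgment_order_stats(rho, H, n_draws):
    """Simulate c = rho*zeta + sqrt(1-rho^2)*u for sets of size H and
    return c ordered by zeta (column h holds judgment rank h+1)."""
    zeta = rng.standard_normal((n_draws, H))
    u = rng.standard_normal((n_draws, H))
    c = rho * zeta + np.sqrt(1 - rho**2) * u
    order = np.argsort(zeta, axis=1)
    return np.take_along_axis(c, order, axis=1)

c_perfect = judgment_order_stats(rho=1.0, H=5, n_draws=20_000)
c_random = judgment_order_stats(rho=0.0, H=5, n_draws=20_000)
# Perfect ranking reproduces true order statistics (reduced within-rank
# variance); random ranking leaves every rank with variance one.
print(c_perfect.var(axis=0).round(2), c_random.var(axis=0).round(2))
```

Shrinking ρ1 from 1 toward 0 interpolates between these two extremes, which is exactly the behavior the ranking model encodes.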
Under ranking model (3), model (2) can be written as

Y_ihjk = μ + τ_i + σ_c ρ1 η_h + γ_ih + σ_c c+_j(ih)(ρ1) + σ_ε ε_k(ihj). (4)

We note that, in this model, c+_j(ih)(ρ1) is a random effect with mean 0 and variance ρ1²σ_h² + 1 − ρ1². In model (4), the schools are ranked based on the auxiliary variable ζ(1)_j(ih) in ranking model (3), such as income levels or the previous year's test scores of the schools. If the correlation coefficient ρ1 = 1, the ranking process is error free, and the terms c*_j(ih) and β*_h in model (2) reduce to the h-th order statistic and its expected value, respectively. If ρ1 = 0, ranking is random and model (4) reduces to the cluster randomized design in model (1). Other values of ρ1 lead to various degrees of ranking errors among within-set units.
If there is additional ranking information in the secondary populations, subsampling can be performed using an RSS design at this level. For the RSS in stage 2 sampling, we first identify a set size H 2 and an integer d 2 in such a way that m = d 2 H 2 . For each one of the d 1 cluster units in cell (i, h) in Table 2, we then select mH 2 units from the secondary population. We randomly partition these units into m sets, each having H 2 units. The units in each of m sets are judgment ranked from smallest to largest without measurement, and the units having the rank v are measured in d 2 sets for v = 1, … , H 2 . Note that each ranking group has an equal number of measured observations so that the RSS design is balanced. This RSS procedure creates H 2 ranking groups (blocks) each containing d 2 observations, Y ihjvk , k = 1, … , d 2 ; v = 1, … , H 2 for treatment i, ranking group h and cluster unit j. This stage 2 sample is illustrated in Table 5. Under an arbitrary consistent ranking scheme, a model for this design is given below.
Y_ihjvk = μ + τ_i + σ_c β*_h + γ_ih + σ_c c*_j(ih) + σ_ε δ*_v + σ_ε ε*_k(ihjv), (5)

where δ*_v is the ranking effect (factor D) in the stage 2 sampling; it is the expected value of the v-th judgment order statistic for a sample of size H2 from a standard normal distribution, and ε*_k(ihjv) is the k-th error term for the v-th judgment group, having mean zero and variance σ*²_v. The other terms are the same as in model (2). The main difference between models (2) and (5) lies in the second stage sampling: model (5) induces an additional ranking structure among the subsample observations. The expected mean squares of model (5) in Table 6 suggest testing procedures for the treatment (A), block (B) and treatment-block (AB) interaction effects; an approximate F-test rejects the null hypothesis of no treatment effects for large values of the F-statistic F_A.

TABLE 5 Subsampling using RSS in the i-th treatment and h-th ranking-group.

Cluster units in cell (T_i, h). Note: Each column entry is obtained by ranking H2 observational units in a set of size H2.

TABLE 6 Expected mean squares of model (5).
The test of the treatment effects rejects the null hypothesis for large values of the F-statistic

F_A = MSA / MSC,

where MSA and MSC are the mean squares for factors A and C, respectively. The null distribution of F_A can be approximated by an F-distribution with t − 1 and tH1(d1 − 1) degrees of freedom. In a similar fashion, the effectiveness of the stage 1 block (factor B), the stage 1 treatment-block interaction (factor AB) and the stage 2 block (factor D) can be judged from large values of F_B, F_AB and F_D, where MSB, MSAB and MSD are the mean squares for factors B, AB and D, respectively. A significant AB interaction here would imply that treatment effects differ across the levels of ranking, a result that would need to be explored further. The ranking in stage 2 sampling can again be modeled using an auxiliary variable in the secondary populations. In the school district example, students' records within a school can be used to rank the students in a set of size H2. By adapting the notation of model (3) to stage 2 sampling, we write

ε_k(ihjv) = ρ2 ζ_k(ihjv) + √(1 − ρ2²) u_k(ihjv), (6)

where ρ2 is the correlation coefficient between the unordered ranking variable ζ_k(ihjv) and the unordered error term ε_k(ihjv), and ζ(2)_k(ihjv) denotes the v-th order statistic in a sample of size H2 from a standard normal distribution in stage 2 sampling. Under this ranking error model, the model for the design becomes

Y_ihjvk = μ + τ_i + σ_c ρ1 η_h + γ_ih + σ_c c+_j(ih)(ρ1) + σ_ε ρ2 λ_v + σ_ε ε+_k(ihjv)(ρ2), (7)

where λ_v is the expected value of the v-th order statistic of a standard normal sample of size H2, and ε+_k(ihjv)(ρ2) is the judgment-ordered, centered version of ε_k(ihjv), constructed as in model (4).

FIGURE 1 Surface plot of E(MSC) with respect to the correlation coefficients ρ1 and ρ2 when the subsample size is m = 12 and σ² = 5.
It is clear that the test for the treatment effect uses the cluster mean square as the error term. Hence, we examine the impact of the intra-cluster correlation coefficient (ICC) and the ranking information in stage 1 and 2 sampling on the expected mean square cluster error. We rewrite the expected mean square cluster error in Table 6 as a function of the ICC, ICC = σ_c²/(σ_c² + σ_ε²), and the stage 1 and 2 ranking information of the ranked set sample procedure under the ranking models (4) and (6):

E(MSC) = σ² [ (1 − ICC)(1 − ρ2² λ⊤λ/H2) + m ICC (1 − ρ1² η⊤η/H1) ], σ² = σ_c² + σ_ε², (8)

where λ⊤ = (λ1, …, λ_H2) and η⊤ = (η1, …, η_H1) are the expected values of the order statistics from the standard normal distribution for the respective sample sizes H2 and H1. It is clear that E(MSC) has two components: the first is the contribution of the stage 2 sampling error, and the second is the contribution of the stage 1 sampling error. Both components are decreasing functions of the ranking qualities (ρ2, ρ1) and set sizes (H2, H1), but the amount of reduction in each component depends on the magnitude of the ICC. If the ICC is close to zero, a reduction can be achieved in both components for large ρ2, ρ1 and large H2, H1, but most of the reduction comes from the first component. If the ICC is large, the contribution of the second component, the cluster level variation, becomes important, and a significant reduction in variation can be achieved using large ρ1 and large H1; in this case, the reduction in the first component is not very large. Hence, the use of a ranked set sample in stage 2 from the secondary population is not as crucial as the one used in stage 1.
In Figure 1, we provide a surface plot of E(MSC) as a function of ρ1 and ρ2 when H1 = H2 = 5, m = 12 and σ² = 5. The first panel gives the surface plot when the ICC = 0.1. Since the ICC is relatively small, there is a reduction in E(MSC) in both the ρ2 and ρ1 directions. The second panel gives the surface plot when the ICC = 0.5. In this plot, the ICC is relatively large; hence, the surface plot is almost constant in the ρ2 direction while decreasing sharply in the ρ1 direction. This indicates that using an RSS design in cluster level sampling is essential to reduce sampling variation when the ICC is relatively large.
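The trade-off shown in Figure 1 can be reproduced numerically. The sketch below assumes the standard RSS variance-reduction factor 1 − ρ²(η⊤η)/H for each stage, a reconstruction consistent with the discussion in the text rather than the paper's verbatim expression; the expected order statistics are estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(5)

def mean_sq_eta(H, n_draws=200_000):
    """Monte-Carlo estimate of (eta'eta)/H for a standard normal sample of size H."""
    eta = np.sort(rng.standard_normal((n_draws, H)), axis=1).mean(axis=0)
    return (eta ** 2).sum() / H

def e_msc(icc, rho1, rho2, m, H1, H2, sigma2=1.0):
    """Two-component E(MSC): stage-2 (within-cluster) plus stage-1 (cluster)
    error, each shrunk by its ranking quality and set size."""
    stage2 = (1 - icc) * (1 - rho2**2 * mean_sq_eta(H2))
    stage1 = m * icc * (1 - rho1**2 * mean_sq_eta(H1))
    return sigma2 * (stage2 + stage1)

# Large ICC: improving the stage-1 ranking (rho1) cuts E(MSC) far more
# than improving the stage-2 ranking (rho2), as in the second panel.
for rho1, rho2 in [(0.0, 0.0), (0.0, 0.9), (0.9, 0.0)]:
    print(rho1, rho2, round(e_msc(0.5, rho1, rho2, m=12, H1=5, H2=5), 2))
```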

SAMPLE SIZE ALLOCATION IN STAGE 1 AND 2 SAMPLING DESIGNS
The research focus in a cluster randomized ranked design is on drawing inference for contrasts between treatment effects. We consider the estimate of the contrast parameter τ_i − τ_i′. The choice of this parameter provides a connection between the variance of the estimator and the sample sizes n and m of the stage 1 and 2 sampling designs. This connection provides a mechanism to optimize n and m for a given budget for the experiment. The estimate of τ_i − τ_i′ and its variance are given by

τ̂_i − τ̂_i′ = Ȳ_i··· − Ȳ_i′··· and Var(τ̂_i − τ̂_i′) = Δ_i,i′ = 2 E(MSC)/(nm).

Using Equation (8), we rewrite Δ_i,i′ as a function of the ICC, ρ1 and ρ2:

Δ_i,i′ = (2σ²/(nm)) [ (1 − ICC)(1 − ρ2² λ⊤λ/H2) + m ICC (1 − ρ1² η⊤η/H1) ].

Under model (1), the variance of the estimate of τ_i − τ_i′ is given by

Δ*_i,i′ = (2σ²/(nm)) [ (1 − ICC) + m ICC ].

The relative efficiency of the CRRD with respect to the CRD is then given by the ratio

RE = Δ*_i,i′ / Δ_i,i′. (9)

The RE values for selected d1, H1, d2, H2 with ρ1 = 0.9, ρ2 = 0.78 and ICC = 0.21 are given in Table 8, using the High School Longitudinal Study of Section 5. The sampling cost of an experimental unit can be substantially different for cluster and subsample units. It may be desirable to minimize the variance of the contrast estimate for a given budget for the study. A simple cost model for the CRRD is

T = t n C1 + t n m C2,

where T is the total sampling cost of the experiment, and C1 and C2 are the sampling costs of between and within cluster units, respectively. In this model, we assume that the ranking cost is negligible; hence it is ignored. For a given total budget T, we minimize Δ_i,i′ with respect to n and m. The optimal values of m and n are given by

m = √[ (C1/C2) (1 − ICC)(1 − ρ2² λ⊤λ/H2) / ( ICC (1 − ρ1² η⊤η/H1) ) ] and n = T / ( t (C1 + m C2) ).
If m and n are not integers, we round them up to the next integers. The optimal value of m is intuitively revealing. If ρ1, ρ2 and the ICC are held constant, a higher cost C2 of a stage 2 sampling unit requires a smaller sample size at stage 2. If the costs (C1, C2) and the ICC are held constant, use of an efficient ranked set sampling design (large ρ2 and H2) for the subsampling units allows one to use smaller sample sizes at stage 2. If the costs (C1, C2) and the ranking qualities (ρ1, ρ2) are held constant, large values of the intra-cluster correlation coefficient lead to smaller m. If the ICC is large, subsampling units (individuals) within a cluster look alike, and sampling more observations from similar units does not increase the information content of the sample. Hence, a large ICC leads to observing more cluster units at stage 1 and a smaller number of individuals at stage 2.
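The allocation rule above can be packaged as a small calculator. The closed form used here is the classical two-stage allocation with the RSS reduction factors attached; it reproduces worked values in Table 7 (for example, (m, n) = (14, 16) at ICC = 0.01, C1 = 2, C2 = 1, ρ1 = ρ2 = 0), so treat it as a reconstruction consistent with the text, with m rounded to the nearest integer:

```python
import numpy as np

rng = np.random.default_rng(9)

def mean_sq_eta(H, n_draws=200_000):
    """Monte-Carlo estimate of (eta'eta)/H for a standard normal sample of size H."""
    eta = np.sort(rng.standard_normal((n_draws, H)), axis=1).mean(axis=0)
    return (eta ** 2).sum() / H

def optimal_sizes(T, C1, C2, icc, rho1, rho2, H1, H2, t):
    """Optimal (m, n) minimizing the contrast variance under T = t*n*(C1 + m*C2)."""
    r2 = 1 - rho2**2 * mean_sq_eta(H2)     # stage-2 RSS reduction factor
    r1 = 1 - rho1**2 * mean_sq_eta(H1)     # stage-1 RSS reduction factor
    m = np.sqrt((C1 / C2) * (1 - icc) * r2 / (icc * r1))
    m = max(2, int(np.rint(m)))            # m = 1 leaves no within-cluster variance estimate
    n = int(np.ceil(T / (t * (C1 + m * C2))))
    return m, n

print(optimal_sizes(T=500, C1=2, C2=1, icc=0.01, rho1=0, rho2=0, H1=5, H2=5, t=2))
# (14, 16), matching Table 7
```

Raising ρ2 (with everything else fixed) returns a smaller m and larger n, while raising ρ1 does the opposite, matching the directions reported in Table 7.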
In Table 7, we investigate the impact of the ranking quality of stage 1 and 2 sampling on the optimal sample sizes (m, n). The optimal sample sizes are computed for a total budget of T = 500, cluster-unit sampling costs C1 = 2, 10, 50 and within-cluster sampling cost C2 = 1. The ICC values are selected as 0.01 and 0.5. The set sizes of the RSS for stages 1 and 2 are selected to be H1 = H2 = 5. The number of treatments is t = 2. The values of ρ1 and ρ2 are selected as ρ1, ρ2 = 0, 0.25, 0.5, 0.75, 1. Each entry in Table 7 contains two integers (m, n). The first (m) and second (n) integers are the optimal sample sizes for stage 2 and stage 1 sampling, respectively. For fixed C1, C2, T, ICC and ρ1, large values of ρ2 require smaller sample sizes (m) from the secondary populations and larger sample sizes (n) from the cluster population. For example, if ICC = 0.01, C1 = 2 and ρ1 = 0, the optimal sample size m decreases from 14 to 8 and the optimal sample size n increases from 16 to 24 as ρ2 changes from 0 to 1. For fixed values of C1, C2, T, ICC and ρ2, large values of ρ1 produce the opposite effect, that is, they increase m and decrease n. For example, if ICC = 0.01, C1 = 2 and ρ2 = 0, the optimal sample size m increases from 14 to 23 and the optimal sample size n decreases from 16 to 10 as ρ1 changes from 0 to 1.

Note: If the optimal m is 1, we replace it with m = 2, since m = 1 does not allow a variance estimate for the error term.

EMPIRICAL COMPARISON
In this section, we investigate the empirical power of the CRRD for selected values of the simulation parameters. The set and cycle sizes in stage 1 and 2 sampling are selected as H1 = H2 = 4 and d1 = d2 = 3, respectively. The intra-cluster correlation coefficient is varied by taking σ_c = 0.5, 2 and σ_ε = 2, so that ICC = σ_c²/(σ_c² + σ_ε²) takes the values 0.059 or 0.5. We used two sets of ranking correlation structures for (ρ1, ρ2). The empirical results are consistent with Figure 1, where we observed that the mean square of factor C is almost constant with respect to ρ2 when the ICC is large.

APPLICATION TO HIGH SCHOOL LONGITUDINAL STUDY
In this section, we use a data set, the High School Longitudinal Study of 2009 (HSLS09), collected by the National Center for Education Statistics (NCES) to monitor student progress nationwide as students move through the beginning of high school, post-secondary education, the work force and beyond. The data set has a cluster structure in which 944 high schools are selected as a representative sample of all high schools in the USA. In each high school, an average of 25 ninth-graders are included in the study, for a total of approximately 24,000 students across all high schools. These students were given a test to assess their math proficiency in the ninth grade (2009) and again in the 11th grade (2012). Readers are referred to http://nces.ed.gov/surveys/hsls09 for further details about this study. This data set was also used by Wang and colleagues 25 to illustrate the use of a cluster randomized ranked set sample design. They downloaded the public-use student-level data of HSLS09 for the base year (2009) and the first follow-up year (2012) from the NCES website. The data set was pre-processed, and school IDs were removed for confidentiality purposes. The pre-processed data contain 472 high schools and 12,533 students. We use these processed data as our population. The data set contains several variables. In this paper, we consider the students' 2012 mathematics theta score (X2TXMTH) as the response variable. Students had a baseline measurement in 2009, the number of correct scores on 72 math items (X1TXMCR). We used these baseline measurements as ranking information. In stage 1 sampling, we computed the school-average X1TXMCR scores and used these averages to rank the schools on their X2TXMTH values. The correlation coefficient between the X1TXMCR and X2TXMTH school averages was ρ1 = 0.90 at the school level.
In stage 2 sampling, we used the individual X1TXMCR student scores within each school as the ranking variable. In this case, the average of the 472 correlation coefficients from the 472 schools in the population was ρ2 = 0.78. Using the entire data set (all 472 schools), we computed ICC = 0.21, σ_c = 0.53 and σ_ε = 1.013. As we see, the ICC value is moderately large; hence, we expect that a large portion of the variation in the sample comes from stage 1 sampling. We performed a simulation study using the data from the 472 high schools. We fixed the set sizes as H1 = 2, 3, 4, 5 and H2 = 2, 3, 4, 5 and considered two treatments, t = 2. For each set size combination (H1, H2) we computed the optimal cycle sizes (d1 = n/H1, d2 = m/H2) for the total cost T = 100. Stage 1 and 2 unit sampling costs are taken to be C1 = 1 and C2 = 1. For each sample size combination, 20,000 CRD and CRRD designs were constructed. Schools and students were selected with replacement in each simulated sample. To evaluate the power of the F-test, we added a shift parameter to one treatment group with the shift values δ√(σ_c² + σ_ε²), where δ = 0, 0.15, 0.25, 0.45, 0.8. The relative efficiency of the CRRD with respect to the CRD for the estimation of the shift parameter, τ1 − τ2, is computed as

RE = MSC(CRD) / MSC(CRRD),

where MSC(CRD) and MSC(CRRD) are the estimated cluster mean squares from 20,000 simulation replications based on models (1) and (5), respectively. We also computed the theoretical values of RE from Equation (9) for comparison purposes. These efficiency values are provided in Column 3 of Table 8. It is clear that the simulated and theoretical efficiency values are reasonably close to each other. For all sample size combinations, the CRRD is more efficient than the CRD design.

TABLE 8 Efficiency comparison of the intervention effect estimator and the power of the F-test of treatment effects; ρ1 = 0.9, ρ2 = 0.78, ICC = 0.21, simulation size 20,000.
The efficiency increases with set size H1, but is relatively constant in set size H2. Since the ICC (0.21) is moderately large, the efficiency improvement from increasing H1 is larger than that from increasing H2. This is again consistent with the surface plot in Figure 1, where the reduction in MSC is negligible as ρ2 (and also H2) increases when the ICC is large. Table 8 also presents the empirical power of the F-tests for the treatment effects under the CRD and CRRD models. If the shift value δ = 0, the F-tests of the treatment effect in both designs yield type I error rates reasonably close to the nominal value of 0.05. When the shift parameter δ > 0, the empirical power of the F-test under CRRD is substantially higher than that under CRD. For example, when H1 = 5, d1 = 3, H2 = 2, d2 = 2, the ratios of the empirical powers are 1.752, 2.016, 1.795, 1.003 for δ = 0.15, 0.25, 0.45, 0.8, respectively.
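As a quick consistency check on the population quantities reported above: interpreting the two values 0.53 and 1.013 as the between-school and within-school standard deviations (an assumption on our part) and applying the usual one-way variance-components relation recovers the reported ICC:

```python
# Consistency check: with tau = 0.53 (between-school SD) and
# sigma = 1.013 (within-school SD) -- our interpretation of the two
# values reported for this population -- the one-way components
# formula ICC = tau^2 / (tau^2 + sigma^2) recovers the reported ICC.
tau, sigma = 0.53, 1.013
icc = tau**2 / (tau**2 + sigma**2)
print(round(icc, 2))  # 0.21
```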

APPLICATION TO TOOTH SIZE ANALYSIS
Using the dental survey population of 293 individuals' tooth measurements, we performed another simulation study using CRRD. The CRRD involves selecting n = H1 d1 individuals from each treatment group, and then selecting H2 d2 teeth from the 28 non-wisdom teeth of each selected person. The average tooth-width difference between Korean men and women, estimated from the entire data set, is Δ = 0.27 mm. This is taken to be the true value of the parameter in this study. The intra-cluster correlation coefficient was found to be ICC ≈ 0.001, which indicates that the variation among tooth sizes within the same individual is much larger than the variation between individuals.
For the simulation study, we fixed the set sizes as H1 = 2 and H2 = 2, 4 and the cycle sizes d1 = 5, 10 and d2 = 2, 3. These choices lead to the sample sizes n = 10 or 20 and m = 4, 6, 8 or 12, depending on the values of the set and cycle sizes. Since we have only two treatments, we set t = 2. Ranking the cluster units (individuals) is not trivial, since it requires ranking individuals on the total width of all 28 non-wisdom teeth. On the other hand, a ranking of reasonable quality can be performed based on the width of one of the front central teeth, via visual inspection or another cheap method. For example, one can use a simple compass with two sharp pointers, whose distance can be adjusted to copy the width of a front tooth without using digital calipers. In this study, cluster units (individuals) are ranked using the width of the lower third tooth from the center (N3L). The estimated correlation coefficient between N3L and the total width of all 28 teeth was 0.79. Ranking at the subsampling level is performed using the widths of different teeth on the same individual. This can be done either by visual inspection or with a compass to estimate the widths of the teeth.
Ranking quality at the subsampling level is modeled using the imperfect ranking procedure of Wang et al. 25 This ranking method uses the true ranks if the width difference in a pairwise comparison is larger than a threshold of ε millimeters; otherwise, ranks are assigned at random. This ranking process models a realistic situation in which small differences are indistinguishable (without accurate measurement) and a ranker assigns ranks at random. We let ε vary from 0 to 2, with step size 1/3. The simulation size is 5000 replications. Figure 3 presents the empirical powers of the test for treatment effects, and the relative efficiency of the contrast estimator Δ̂ based on CRD and CRRD.

FIGURE 3 Empirical power of the test of treatment effect and the relative efficiency of the contrast estimators.

The empirical powers are computed for the null hypothesis that the difference between the average tooth widths of the male and female populations in Korea is zero (Δ = 0) against the alternative hypothesis that Δ = 0.27 mm. In Figure 3, the panels in the first column contain the empirical powers of CRD for different values of H2 and d2. The panels in the second column present the empirical powers of CRRD. The x-axis represents the quality of ranking; larger values of ε indicate poorer ranking. The empirical powers are almost constant in the first column, since no ranking is involved in CRD. The power curves in the second column are always higher than those in the first column, and they decrease slightly with ε, since large values of ε induce stronger ranking error.
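One way to implement a threshold-based imperfect ranking of this kind is to keep the true order for items separated by more than the threshold and shuffle ranks within chains of near-ties. The sketch below follows that idea; the chain-grouping rule and the function name are our own simplification, not necessarily the exact procedure of Wang et al.:

```python
import numpy as np

def imperfect_ranks(widths, eps, rng):
    """Threshold-based imperfect ranking (a sketch): items whose widths
    differ by more than eps mm keep their true relative order; runs of
    items closer than eps are treated as indistinguishable and their
    ranks are shuffled at random.
    """
    order = np.argsort(widths)
    sorted_w = widths[order]
    ranks = np.empty(len(widths), dtype=int)
    start = 0
    for i in range(1, len(sorted_w) + 1):
        # close a chain of near-ties when the next gap exceeds eps
        if i == len(sorted_w) or sorted_w[i] - sorted_w[i - 1] > eps:
            block = np.arange(start, i)
            rng.shuffle(block)  # random ranks within the tie chain
            ranks[order[start:i]] = block
            start = i
    return ranks

rng = np.random.default_rng(1)
widths = np.array([8.1, 8.9, 9.6, 10.4])  # hypothetical tooth widths in mm
print(imperfect_ranks(widths, eps=0.0, rng=rng))  # eps = 0 gives true ranks
```

With ε = 0 the ranking is perfect; as ε grows past the typical gap between tooth widths, more ranks are assigned at random, degrading the ranking quality exactly as the x-axis of Figure 3 describes.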
The third column in Figure 3 presents the relative efficiency of the contrast estimator Δ̂_CRRD with respect to Δ̂_CRD, RE = MSC(CRD)/MSC(CRRD). The RE values are always greater than 1 for all ranking qualities, which indicates that the estimator based on CRRD is more efficient than the CRD estimator. Efficiency also increases with the within-cluster sample size (m = H2 d2), with the exception of d2 = 3 and H2 = 4. The efficiency curve for d2 = 3 and H2 = 4 is slightly lower than the efficiency curve for d2 = 2 and H2 = 4 in the third column of Figure 3. The reason for this is that the secondary population size is 28 and we sample m = d2 H2 = 12 units out of 28 with replacement. This means that the same unit from the secondary population may appear more than once in the sample, which reduces the effective sample size m. Since the ICC is very small, having a smaller effective sample size in the secondary sampling reduces the efficiency. Again, this is consistent with the surface plot in Figure 1 and the optimal sample size allocation in Section 3, where we observed that we should sample more from the secondary population when the ICC is small.
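The effective sample size argument above can be made concrete: drawing m units from N with replacement, each unit is missed with probability (1 − 1/N)^m, so the expected number of distinct units is N(1 − (1 − 1/N)^m). For m = 12 teeth from N = 28, fewer than ten distinct teeth enter the sample on average:

```python
# Expected number of distinct units when drawing m of N with replacement:
# each unit is missed with probability (1 - 1/N)**m.
N, m = 28, 12
expected_distinct = N * (1 - (1 - 1 / N) ** m)
print(round(expected_distinct, 1))  # 9.9 -- well below the nominal m = 12
```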

CONCLUDING REMARKS
The main focus of this research is to integrate a ranked set sampling design into a cluster randomized design (CRD) in the design of experiments. CRDs allocate treatments to cluster units rather than to individuals within clusters. They are usually less efficient than experimental designs in which treatment allocation is performed at the individual level. A typical CRD involves random sampling of units at two different levels: the first selects the cluster units from the cluster population; the second performs a secondary sampling from each of the cluster units selected in the first stage. Hence, in addition to treatment effects, between- and within-cluster variation are the other two components that contribute to the total variation in the experiment. Variation from these two components can be reduced by replacing simple random sampling with a more efficient sampling design, ranked set sampling (RSS), in stage 1 and stage 2 sampling. We show that the impact of RSS on the efficiency of CRRD depends on the intra-cluster correlation coefficient (ICC). If the ICC is relatively large, using RSS at the cluster level reduces between-cluster variation, while using RSS at the within-cluster level has little impact on the efficiency. On the other hand, if the ICC is small, RSS in stage 1 sampling yields minimal improvement in efficiency, while RSS in stage 2 sampling provides some improvement.
The CRRD requires two-stage sampling. The cost of sampling and the sample sizes in these stages can differ significantly. For a given total budget, we obtained optimal sample sizes for stage 1 and stage 2 sampling when the ICC is either known or well estimated. The optimal sample sizes minimize the variance of the estimator of the treatment contrast between two treatment levels for a given cost function and total budget. The optimal sample size m in stage 2 sampling is usually smaller when the ICC is large and an efficient RSS design is used in stage 2 sampling. The opposite is true when the ICC is small and an efficient RSS design is used in stage 1 sampling.
The CRRD was investigated by Wang et al. 25 for the case where the treatment factor has two levels. They estimated the intervention effect using least squares estimation of the contrast parameter. Our paper approaches the problem from a design perspective. It provides a model that accounts for the other sources of variation, such as ranking group effects in stage 1 and stage 2 sampling and the interaction between treatment and ranking blocks in stage 1 sampling. Using expected mean squares, we also provide guidance on determining appropriate tests for treatment effects and the other sources of variation.

ACKNOWLEDGMENTS
Support from the Grains Research and Development Corporation, Australia is acknowledged. The methodology developed here is a research output of Statistics for Australian Grains Industry (UOA00164).

DATA AVAILABILITY STATEMENT
The school district data set that supports the findings of this study is derived from the following resource available in the public domain: http://nces.ed.gov/surveys/hsls09/. The derived data set is included in the online submission system. The tooth size analysis data set that supports the findings of this study is not publicly available due to privacy or ethical restrictions, but can be obtained from the corresponding author upon reasonable request.
APPENDIX
We now consider the blocking factor B. Let X_B be a basis matrix for the block effect; we have X_B = 1_t ⊗ I_{H1} ⊗ 1_{d1 m}. The expected sum of squares of the block effect can then be written in terms of X_B. Under the consistent ranking condition this expression simplifies, and under ranking models (3) and (6) it reduces further. This completes the proof. We now consider the cluster effect, factor C. Let X_C = I_{t H1 d1} ⊗ 1_m and X*_C = I_{t H1} ⊗ 1_{d1 m}. Using these matrices, the expected sum of squares for factor C follows.