Generalized estimating equations to estimate the ordered stereotype logit model for panel data

By modeling the effects of predictor variables as a multiplicative function of regression parameters being invariant over categories, and category‐specific scalar effects, the ordered stereotype logit model is a flexible regression model for ordinal response variables. In this article, we propose a generalized estimating equations (GEE) approach to estimate the ordered stereotype logit model for panel data based on working covariance matrices, which are not required to be correctly specified. A simulation study compares the performance of GEE estimators based on various working correlation matrices and working covariance matrices using local odds ratios. Estimation of the model is illustrated using a real‐world dataset. The results from the simulation study suggest that GEE estimation of this model is feasible in medium‐sized and large samples and that estimators based on local odds ratios as realized in this study tend to be less efficient compared with estimators based on a working correlation matrix. For low true correlations, the efficiency gains seem to be rather small and if the working covariance structure is too flexible, the corresponding estimator may even be less efficient compared with the GEE estimator assuming independence. Like for GEE estimators more generally, if the true correlations over time are high, then a working covariance structure which is close to the true structure can lead to considerable efficiency gains compared with assuming independence.

In the cross-sectional context, several models are available to estimate the corresponding parameters of scientific interest. Examples are cumulative, sequential, adjacent-categories, or baseline category models. [1][2][3] These models account for the ordinal or nominal character of the response by either restricting the effects of the predictor variables to be identical over the categories (i.e., the proportional odds property) or allow all regression parameters to vary freely. Stereotype models differ from the above models by allowing the effects of predictor variables to be specific for each possible category of the response variable but in a parsimonious way. 4 The ordered stereotype model assumes that the effects of the predictor variables are proportionally changing depending on the response category. This is modeled by introducing a scalar parameter that is specific to each category and is multiplied with the vector of regression parameters. Hence, this model is more flexible than models that restrict the effects of the predictors to be the same for the different categories but more parsimonious than the unrestricted models with freely varying regression parameters. The ordinal nature of the response in the ordered stereotype model is guaranteed by constraints on the order of these scalar parameters. 1,[4][5][6][7] Thus, by introducing the interaction of regression parameters and category-specific effects underlying the ordering constraints, the ordered stereotype model allows a flexible interpretation of the effects of predictors but still respects the ordinal nature of the response. In contrast to other models for ordinal responses which can be derived from latent one-dimensional processes, the ordered stereotype model can be motivated by a latent but inherently multidimensional nature of the response variable. 4 The price for the increased model flexibility is that estimation of the model is not straightforward due to the multiplicative structure of the parameters.
To estimate cross-sectional-ordered stereotype models, maximum likelihood 4 (ML) or generalized least squares 8 (GLS) methods have been proposed. This can be realized via general software to maximize a likelihood function, 8 usually using numerical derivatives, or by embedding the ordered stereotype models into a more general class of models. For example, stereotype models can be interpreted as a subclass of generalized or multivariate nonlinear models 8,9 and estimated, for example, using the gnm package 9 as part of the R system, 10 as restricted versions of multinomial models 11 or as special cases of the class of reduced-rank vector generalized models, 12 which simplifies to a reduced-rank multinomial logit model. Finally, the package ordinalgmifs 13 as part of the R system 10 allows fitting ordered stereotype models when the number of parameters exceeds the sample size, using the generalized monotone incremental forward stagewise method and imposing penalties to a set of specified predictors. This package can also be used in the case of non-high-dimensional data without specifying any penalty on the predictors in the model fitting process.
Estimation of ordered stereotype panel models or models for clustered data is, however, scarce. Notable exceptions propose to estimate the ordered stereotype panel model either via GLS 14 or ML 14,15 using existing software. For consistent estimation of the parameters of interest and their (co)variances, the GLS approaches require that the mean and association structure of dummy variables representing the possible values of the ordered response variable are correctly specified. For ML estimation, correct specification of the high-dimensional distribution of the response variables over time and numerical calculation of high-dimensional integrals is required. Although the latter is no obstacle anymore with today's computing capacity, the former is a strong requirement.
The generalized estimating equations (GEE) approach was proposed as an extension of generalized linear models to panel data. 16 Assuming a marginal generalized linear model for the one-dimensional response variable at each time point, the parameter estimators of the mean structure are consistent and asymptotically normally distributed under weak regularity conditions. A crucial condition is that the mean structure is correctly specified. Dependencies over time are modeled via a "working" correlation matrix avoiding computer-intensive estimation procedures for nonlinear models. For the estimators of the mean structure to be consistent and asymptotically normally distributed, however, correct specification of the working correlation matrix is not required.
GEE estimation of models for one-dimensional response variables at each time point has soon been generalized to vector-valued response variables. Hence, GEE estimation of models for clustered nominal response variables has been proposed 17,18 just as estimation of models that include clustered ordinal response variables 17,[19][20][21][22] -usually assuming a cumulative ordinal model. 2 However, to the best of our knowledge, estimation of the ordered stereotype logit model for panel data that is robust against a misspecified association structure over time has not yet been considered.
Therefore, in this article, we propose estimation of the ordered stereotype model for panel data via GEE. We choose the GEE approach because it avoids the calculation of high-dimensional integrals and is thus computationally efficient. In addition, the GEE approach turned out to be an attractive approach as illustrated in several simulation studies: 16,22 If the true associations are weak to moderate, then GEE estimators assuming independence are not only consistent but are usually efficient relative to ML or GEE estimators that take associations into account. If the true associations are strong, however, then explicitly modeling any association structure usually leads to more efficient GEE estimators compared with assuming independence. The observed effects on the relative efficiency may be weak if specific versions of invariant covariate designs are considered or may be amplified if the number of observations within each cluster varies. [23][24][25][26] In many practical cases, the efficiency loss relative to ML estimators will often be small.
In Section 2, we discuss the ordered stereotype logit model for panel data and in Section 3 its estimation. Sections 3.1 and 3.2 describe estimation of the stereotype model with the help of GEE based on a working covariance matrix. In Section 3.2, we introduce two ways of modeling the association structure: First, via working correlation matrices and second adopting a local odds ratios approach. Section 4 describes a simulation study to illustrate the finite sample properties of the GEE estimator of the stereotype model under different structures of the working covariance matrix. Application of the GEE estimator is illustrated using a dataset from a randomized clinical design to evaluate a treatment of rheumatoid arthritis 27 in Section 5. Finally, in Section 6, we discuss the results from the simulation study and the application and address limitations of the approach.

THE ORDERED STEREOTYPE LOGIT MODEL FOR PANEL DATA
LetỸ it be a q-level ordered categorical (ordinal) response with possible realizationsỹ it ∈ {1, … , q} (eg, a 5-level Likert scale defined as "strongly disagree", "disagree", "neutral", "agree", and "strongly agree") for unit i, i = 1, … , n, at time t = 1, … , T. The ordered stereotype panel logit model for the probability thatỸ it = k can be characterized by the log odds at each time point t, where k , k , and (p × 1)-dimensional are usually unknown parameters, the vector x it is a set of l = 1, … , p predictor variables for unit i at time point t that may include categorical or continuous variables, is invariant over the categories but may vary over time, and k can be interpreted 6 as the effect of x ′ it on the probability of observing , 1 = 0 and 1 = 0. However, to identify the model, one additional restriction is required and we set q = 1. Note that is invariant over the categories and thus a monotone ordering constraint on k , k = 1, … , q, is required forỸ it to be ordinal 4 -without this constraint, the model would be a model for nominal response variables. We choose 28 0 = 1 ≤ 2 ≤ · · · ≤ q−1 ≤ q = 1. Note that model (1) could be formulated based on other restrictions as well. For example, instead of a model implying that 1 = 0, one could choose a version based on ∑ q k=1 k = 0. For the version adopted here, the parameters to be estimated are 2 , … , q , 2 , … , q−1 , and . It should be noted, however, that if = 0, then the model is not identified. 29 A comparison of graph (A) of the first panel in Figure 1 with the other three graphs illustrates the effect of the nondecreasing restriction with respect to the parameters k , k = 1, … , q, onỸ it : If the ordering constraints are met, then the assumed order in the values ofỸ it is justified. Note that when two neighboring -parameters have equal values, we could collapse the corresponding response categories into one single response category as the model still holds with the corresponding subset of -parameters. 1 Model (1) defines the first category at each time point as a reference category. The vector of parameters represents the category-invariant effects of x it on the log odds for the categories k = 2, … , q relative to the reference categoryỸ it = 1. The parameters { 2 , … , q } can be interpreted as weights assigned to the categories: The larger k , the higher the probability of observing category k relative to the reference category. The parameter k affects aspects like skewness and kurtosis of the probability functions Pr(Ỹ it = k ′ |x it ), k ≠ k ′ for a given i, t, and k > 1 only via the restriction 1 = ∑ q k=1 Pr(Ỹ it = k|x it ). For an illustration, compare graph (B) in the first panel of Figure 1 with graph (C) in the second panel. Finally, { 1 , 2 , … , q } are category-specific parameters that affect the probabilities of observingỸ it = k depending on x ′ it . Thus, they have a stronger effect on the position and shape of the probability functions Pr F I G U R E 1 Illustration of effects of k and k as a function of = ′ x graphs (A) and (B) in the first panel and graph (D) in the second panel of Figure 1 illustrates this effect. Note that the assumptions of invariant parameters k , k , and over time can easily be relaxed.

The estimating equations
Treating the ordered stereotype panel model as a multinomial model with restrictions on the parameters, we adopt the GEE approach 16 to estimate the parameters in a parsimonious way, avoiding numerical integration and correct specification of high-dimensional distributions. This approach requires only the correct specification of the mean structure of the dummy variables representingỸ it . However, since the variance estimator 16 for n → ∞ turned out to underestimate the true variances in small to medium-sized samples, we propose an analytical correction for finite samples. Define The marginal distribution of Y it is multinomial and we assume that variables Y i are independent conditional on the predictor variables x i ′ 1 , … , x i ′ T , for all i, i ′ = 1, … , n and t = 1, … , T.
Let = ( ′ , ′ , ′ ) ′ , where = ( 2 , … , q ) ′ and = ( 2 , … , q−1 ) ′ , be the parameters of scientific interest. The GEE estimator̂solves the estimating equations where is the working covariance matrix. The positive definite matrix V i is the covariance matrix of variables Y it treated as independent over time, that is, , where bdiag() denotes a block-diagonal and diag() denotes a diagonal matrix. V 1∕2 i denotes the symmetric square root of V i and R is a "working" correlation matrix, which depends on a "nuisance" parameter that is not of scientific interest 16 but accounts for (linear) dependencies in Y it |x i1 , … , x iT over time, t = 1, … , T.
Generally, estimating equations (3) need to be solved iteratively. Given estimatêj at iteration j, an estimate for ,̂j, is computed as 16̂j where To calculate an estimate for̂, we adopt the strategy of alternating between a modified Fisher scoring to estimate and estimation of .
If the model of the mean i is correctly specified for all i = 1, … , n, that is, if E(Y itk |x it ) = itk at the true value 0 , then under some additional regularity conditions,̂is consistent and asymptotically normally distributed, 16,30,31 and its covariance matrix can be estimated bŷ D i is D i evaluated at =̂andŴ i is the working covariance matrix evaluated at =̂and the estimated association parameterŝ, the latter being equal to zero if independence is assumed. However, an outstanding characteristic of the stereotype model is that the means itk are functions of the product of parameters k and . As a consequence, , where = [ ′ ( ′ 1) ⊗ ′ ] ′ , ⊗ is the Kronecker product, 1 J is a (J × 1) vector of ones, I J is the (J × J) identity matrix , and 0 J×K is a (J × K) matrix of zeros.
Treating the covariance matrix W i as fixed at the observed values, the derivative of estimating equations (3) with respect to is ∑ A major difference between the models for which GEE was proposed 16 and the ordered stereotype model is D (1) , a matrix which would be equal to the identity matrix in most cases, but is a function of parameters and in case of the stereotype model. Consequently, the first term on the right-hand side of (6) would be zero in standard GEE situations.
is zero at the estimateŝand̂in finite samples, the first term on the right-hand side of (6) may not vanish. The second term on the right-hand side, which is ignored in (4) and (5), may also not vanish in small samples.
] ∕ in small to medium-sized samples. Therefore, to calculatêand to estimate its variance, we use the first and the second term in addition to the last term on the right-hand side of (6). Both terms are given in the Appendix A1. It should be noted, however, that although the is negative definite as long as does not take extreme values and the working correlation matrix is positive definite, (6) may not be negative definite for values of outside a close neighborhood of the solution̂.
Denote the derivative (6) as A n . Then the GEE estimator of the ordered stereotype panel model can be calculated by In case of convergence problems, approximation − ∑ i D i W −1 i D ′ i may be used instead of A n until convergence and then be replaced by A n for a few additional iteration steps until the convergence criterion is again met. This strategy requires usually more iterations but will more likely converge to a solution of (4). The variance of̂is estimated by replacing − The ordinal nature of the stereotype model is guaranteed by the restriction 0 ≤ 2 ≤ · · · ≤ q−1 ≤ 1, where at least two inequalities must be strict inequalities. These restrictions are neither necessary for the model to be a valid model nor is it required for the algorithm to solve the estimating equations. The ordering only ensures that the categorical response variable is ordered for the dataset at hand. The computed estimateŝk are realizations of estimators for which only asymptotic properties are available. In the GEE case, the asymptotic distribution of the estimators is the normal distribution. Thus, in finite samples confidence intervals for k may overlap, 1,4 may cover negative values, or values larger than one even if the above restrictions are met for the underlying true data generating process. Consequently, it can also not be ruled out that the above ordering is violated during some iterations in the estimation process. Thus, in applications it might be of interest to test whether the response is in fact ordinal. 4,8 Therefore, we did not enforce the restrictions on the parameter estimateŝk, k = 2, … , q − 1, by the estimation algorithm. 32

The working covariance matrix
For a consistent estimation of the mean structure parameter , a correct specification of the underlying association structure of Y i |X i is not necessary and, if the observations are assumed to be correlated,̂needs only be consistent 16 for some . However, the choice of the working association structure may have an effect on the relative efficiency of̂. To the best of our knowledge, there do not appear to exist simulation studies that consider GEE estimation of stereotype panel models, but simulation studies based on various other models suggest that if the true associations are weak or moderate, which is the case in many applications, then the choice of the working association structure seems to play only a minor role especially in sufficiently large samples. 18,33 If associations are strong, then efficiency improvements can be substantial. 16,18,25,26,34 On the other hand, it has been shown, that these effects depend on the amount of variation in cluster sizes and the covariate design, where for specific designs of invariant predictors, the choice of the working association structure may play only a minor or no role at all. 18,[23][24][25][26] In the context of models with binary response variables and working correlation matrices, it has been pointed out that the range of possible correlations is restricted 35 and alternatives to a working correlation matrix have been proposed based on global or local odds ratios to model associations. 18,36 It is, however, not quite clear what the practical consequences of violations of the corresponding restrictions are. Adopting an approach to estimate the correlation structure parameters in a binary model in a simulation study, it turned out that violations of the restrictions were associated with convergence problems, like multiple solutions of the estimating equations or negative estimated variances for single parameters of scientific interest. 37 In this study, extreme situations with sample sizes as small as n = 20 and T = 4 were simulated. Another simulation study with n ranging from 50 to 1000 and T = 5 reported rather high percentages of violations of up to 47% in a binary model but convergence problems only in one out thousands of simulated datasets and no negative variance estimates. 38 Therefore and because of its simplicity and computational efficiency, we adopt the working correlation matrix approach, 16 but compare it with an approach using local odds ratios. 18,39

Working correlation matrix
In the cross-sectional case with ordered or unordered categorical variables, the covariance matrix of the binary indicators In the case of panel data, however, variables Y itk and Assuming independence in the ordinal model for panel data amounts to adopting the identity matrix as the working correlation andV i as the working covariance matrix. The GEE estimator calculated under the independence assumption will be denoted as GEE I .
In the panel data context, this assumption is not plausible and, if wrong, may lead to inefficient estimators. There are several possible ways to proceed, depending on the permitted flexibility of the working correlation matrix to be estimated. Note that the (q − 1)T × (q − 1)T working correlation matrix consists of T(T − 1)∕2 off-diagonal (q − 1) × (q − 1) matrices, whose structures need to be modeled. Thus, the number of parameters to be estimated may vary between one and (q − 1) 2 T(T − 1)∕2. We consider two basic strategies to generate a working correlation matrix that differ with respect to the flexibility of the working correlation structure and the number of parameters to be estimated. Both simplify to the above block-diagonal matrixV i under the independence assumption. Under both strategies, four different association structures are modeled, that is, an equicorrelation structure, an autoregressive structure of order one (AR(1)), a Toeplitz structure and an unstructured correlation matrix.
In all cases, we first calculate an unstructured covariance matrix of the residuals over all units, A correlation matrixR common to all units that captures associations over time is then calculated asR denotes the inverse of the symmetric square root ofÛ tt , the tth (T × T) matrix on the diagonal ofÛ. Thus, The diagonal block matrices reduce to [(q − 1) × (q − 1)]-identity matrices. The two basic strategies of generating working correlation matrices differ systematically with respect to the treatment of the off-diagonal elements of the [(q − 1) × (q − 1)] matricesR tt ′ . Note that due to the symmetry of correlation matrices we only need to consider either the block matricesR tt ′ in the upper (t < t ′ ) orR t ′ t the lower (t > t ′ ) triangular part ofR.
Under the restrictive within-blocks strategy, we calculate the mean of all off-diagonal elements of those block matri-cesR tt ′ , t < t ′ , which are, given the adopted working correlation structure, assumed to be equal using the Fisher (and its inverse) transformation. If an unstructured correlation matrix is assumed, then all matricesR tt ′ are considered independently from each other. In this case, for each block matrix, an individual mean of the off-diagonal elements is calculated and used as an estimate of the off-diagonal correlations in this block. For example, a (3 × 3) blockR tt ′ with correlations under the restrictive strategy, where r tt ′ is the mean of the correlations r tkt ′ k ′ , k ≠ k ′ . If an AR(1)-or a Toeplitz structure is assumed, then the mean over all off-diagonal elements of all block matricesR tt Assuming an equicorrelation structure amounts to calculating the mean of all off-diagonal elements of all block matricesR tt ′ , t < t ′ .
Calculation of the diagonal elements of eachR tt ′ under the restrictive within-blocks strategy depends on the adopted working correlation structure. If an unstructured, a Toeplitz or an equicorrelation structure is assumed, then the diagonal elements of the blocks R tt ′ of the working correlation matrix are the means of the diagonal elements of all blocksR tt ′ , t < t ′ , assumed to be identical. If an AR(1)-structure is assumed then the diagonal elements are given bŷΔ, wherêis the solution to min Under the flexible within-blocks strategy, the step of generating identical off-diagonal elements within the blocks R tt ′ is avoided. The off-diagonal elements within blocks R tt ′ are given by r tkt ′ k ′ . Similarly, the diagonal elements within blocks R tt ′ are not restricted to be equal if the working correlation matrix that is assumed to be unstructured has a Toeplitz or an equicorrelation structure. In case of an AR(1)-structure, the diagonal elements are calculated as under the restrictive within-blocks strategy and thus are all equal. The number of correlation structure parameters varies considerably, depending on the within-blocks strategy and the working association structure. For example, the assumption of a flexible within-blocks unrestricted structure implies that all elements in all blocks R tt ′ may differ. Hence, the number of correlation parameters to be estimated is (q − 1) 2 T(T − 1)∕2. On the other hand, if an equicorrelation structure under the flexible within-blocks strategy is assumed, then all blocks R tt ′ are equal. In that case, only (q − 1) 2 parameters are required. Under the restrictive within-blocks strategy, assuming an equicorrelation matrix requires estimation of only two parameters.
Working correlation matricesR calculated assuming independence, an unstructured, Toeplitz, AR(1) or equicorrelation matrix will be indicated by subscripts I, U, T, A, and E, respectively. Superscripts R and F will denote the restrictive and the flexible within-blocks version. Finally, individual working covariance matrices are generated bŷ . GEE estimators will be indicated correspondingly as GEE m l , where the superscript is ignored in case of the GEE I estimator.

Association structure based on local odds ratios
To avoid possible problems of inconsistencies of the marginal distributions i and the distribution implied by the working covariance matrix if a working correlation matrix is adopted, we in addition adopt a local odds ratios approach to model the association structure. 18 In a first step, ignoring any predictor variables and assuming independence over the T(T − 1)∕2 blocks.
In a second step, one out of four versions of Goodman's 40 row and column (RC) effects model is estimated, where the log of the local odds ratios is modeled via tt ′ is a parameter representing the association strength between Y it and Y it ′ , and tt ′ tk are scores assigned to the kth response category at time point t if the time pair t, t ′ is considered. 18 The four different models imply four different association structures.
A parsimonious uniform structure is realized by assigning fixed unit-spaced values to the score parameters tt ′ tk and assuming̃=̃t t ′ for all t, t ′ . 18 This implies equal association strength over categories and time, log tkt ′ k ′ =̃. Allowing log tkt ′ k ′ =̃t t ′ , that is, the association strength to vary over the T(T − 1)∕2 blocks, but be invariant over categories within blocks, models an association structure denoted as category exchangeability. 18 Restricting̃=̃t t ′ , k = tt ′ tk for all t, t ′ , k and assuming that k are also unknown parameters to be estimated, implies an association structure denoted as time exchangeability. 18 This model allows the association strength to vary with the involved categories but assumes symmetry in the sense that tkt ′ k ′ = tk ′ t ′ k , and allows no time dependence in the variables Y it . A fourth model denoted as RC model 18 allows in addition the association strength to vary over time. This model is where the parameters tt ′ k are unknown and need to be estimated. Thus, this structure is very flexible but requires estimation of many parameters.
In a third step, based on the local odds ratios, the individual bivariate probabilities are estimated 18 and then adapted to be consistent with the estimated marginal distributions of Y i , i = 1, … , n, using the iterative proportional fitting (IPF) algorithm. 41 Note that the association structure is estimated ignoring the predictors and hence this step can be performed before the iterations to calculate the GEE estimator start. The IPF step, however, is applied in each iteration. If a model for ordinal responses is estimated and it can be assumed that the true association structure is close to being time and category exchangeable, then the uniform association structure may be appropriate. 18 Furthermore, results from a simulation study imply that convergence problems can be expected to increase with increasing flexibility of the association structure. The RC structure is only recommended for large sample sizes, 18 like n = 500. The results of this three-step approach are individual estimated working covariance matricesŴ i .
In the simulation study to be described in the next section, we compare GEE estimators based on a working correlation matrix with GEE estimators using the local odds ratios approach. The two approaches do not generate working covariance matrices that are easily comparable. However, with respect to their flexibility, the GEE estimator based on the uniform structure can be compared with the GEE estimator assuming the equicorrelation structure under the restrictive within-blocks strategy and the GEE estimator assuming time exchangeability with the GEE estimator based on the equicorrelation structure under the flexible within-blocks strategy. The GEE estimator assuming category exchangeability can be compared with the GEE estimator with an unrestricted correlation structure under a restrictive within-blocks strategy and the GEE estimator calculated under an RC strategy can be compared with a GEE estimator with an unrestricted correlation structure under a flexible within-blocks strategy. The GEE estimators using the uniform, the category exchangeable, the time exchangeable, and the RC structure will be denoted by GEE UN , GEE CE , GEE TE , and GEE RC , respectively.

Design of the study
In this section we present the results of a simulation study to evaluate the finite sample properties of the GEE estimator adopting different working covariance structures. All programs were written and run under R versions 10 3.5.3 to 3.6.2 and are available from the corresponding author upon request.
For the main part of the study, we generated data for n = 50, 100, 200, 300, 500 and n = 1000 units, observed at T = 5 time points. For every unit we simulated a response variable with q = 4 possible values at each time point in each of 1000 simulations. To approximate an applied situation we generated three predictor variables from different distributions with low to medium associations between them and over time. The predictor values were held fixed over the simulations. The first predictor was generated from a normal distribution and was correlated over time within each unit with correlation 0.3. The second predictor was a binary variable which was correlated with the first predictor with correlation of approximately 0.35 and correlations close to zero over time. The third predictor was generated from a 2 -distribution with three degrees of freedom, again correlated over time with correlation 0.4, but independent from the first two predictor variables. The true values = (0.3 0 0.4) ′ , = (0.3 0.7) ′ , and = (−.5 1 − .1) ′ were chosen such that the fraction of observations in each category of the response variable at all time points was well above zero.
To study the effect of time-invariant and unit-invariant predictor variables on the relative efficiency of GEE estimators, we generated datasets as described above but with three predictors generated independently from each other from a standard normal distribution. The first variable was generated independently over time and units, the second was invariant over time but varying over units, and the third predictor was invariant over units but varied over time. As true values for we chose = (1 1 1) ′ .
To simulate the response variables from the required distributions, variable values were generated in every simulation in two steps. First, nT (pseudo) random numbers were generated from a multivariate normal distribution with mean zero and one of two different (T × T)-dimensional covariance matrices independently for each category k = 1, … , q. The covariance matrices were both positive definite and equal to Toeplitz correlation matrices with elements In the next step, the generated values were transformed into quantiles of a type I extreme value (or Gumbel) distribution with location parameter equal to zero and scale parameter equal to one by a probability integral transformation. Given predictor variables and true values for , , and , the values of the linear predictor as given by the right-hand side of (1) was calculated. Then, for each i and t, the category for which this value was maximum was selected to be the observed value ofỸ it . The probability of observing the maximum given the predictor variables can be shown to be equal to (2) if the error variables are independent and follow a type I extreme value distribution. 42 It should be noted that the correlations under the low-and the high-correlation conditions are not equal to the correlations of the observed Y itk , Y it ′ k ′ |X i . Instead they are the correlations of corresponding response variables given predictor variables in a latent model. The correlations in the observed variables Y itk , Y it ′ k ′ |X i will be somewhat lower and are not decreasing according to the same relationships as in the latent model.
Both the low-and the high-correlation conditions were realized for each sample size condition. Estimators adopted in the main study are GEE I ,

Notes on the estimation strategy
As noted in the literature, 6,8 the starting values should be chosen carefully to prevent the estimation algorithm from nonconvergence. Thus, we adopted a three-step strategy to generate starting values. In a first step, starting values for a multinomial panel model are generated. As suggested by a reviewer, we used̃k = log , k = 2, … , q, as starting values for the constants. Starting values for the p regression coefficients restricted to be identical for all categories and time points,̃, were generated randomly from the uniform(−1,1) distribution and then transformed such that |x ′ it ||̃| < 3 for all i, t. In the second step, a GEE estimator for a multinomial panel model is estimated. 8 If the working correlation matrix approach was chosen to estimate the stereotype model, then the corresponding working correlation matrix is adopted to estimate the multinomial model. In case of the local odds ratios approach, we adopt the time-exchangeable association structure in this second step. Let the estimates from this step be denoted aŝ= (̂′̂′) ′ and its estimated covariance matrix beĈov(̂). Starting values for are then generated in a third step as the solution to min (̂− ) ′Ĉ ov(̂) −1 (̂− ), where = ( ) = ( ′ ( ′ 1) ⊗ ′ ) ′ . The maximum number of iterations in the second and third step and the estimation procedure for the stereotype model was restricted to 200. Iterations stop if the maximum absolute difference of estimates in consecutive iterations is below 10 −6 .
The working covariance matrices based on the approach using local odds ratios are generated using adapted functions from the R package 10 multgee 39 called with the default values. A strong constraint of this approach is that individual covariance matrices need to be inverted. To prevent the algorithm from too many nonconverging runs, we implemented a two-step approach. If an individual estimated covariance matrix is singular then, in a first step, diagonal values close to zero are forced toward larger values. If an inverse of the resulting matrix does still not exist, a generalized inverse of the original matrix is calculated. Both events are indicated by a warning message.
To fit the stereotype model to the arthritis dataset, we generated the starting values as described above for all but one GEE estimator: for the GEE CE estimator we used the GEE UN estimates as starting values. Because the starting values for the parameters are randomly selected, we needed several draws for some of the GEE estimators to converge. Furthermore, the strategy of first using the approximation to the first derivative and then, after convergence, to adopt the derivative derived in Section 3.1 and iterate again until convergence, turned out to be tolerant with respect to the starting values. Hence this strategy is utilized for the real data example.

Simulation results
In this section, we present detailed results only for n = 200 and n = 500. Results for n = 50, 100, 300 and n = 1000 can be found in the supplementary material.
The estimation results are collected over 1000 simulations under each of the conditions defined by the samples sizes and the association strength (low-and high-correlation conditions). Pseudo random numbers were generated using the same seed under all realized conditions. To compare the results, we calculated the estimated bias as the difference between the true values and the means of the estimates over the simulations, the SDs of the estimates relative to the SD of the GEE I estimates over the simulations, and the actual coverage rates of the true values for symmetric 0.95-level confidence intervals assuming normality. Estimated biases for n = 200 and n = 500 under the low-and high-correlation conditions can be found in Tables 1 and 2, relative SDs in Tables 3 and 4   The results in Table 1 do not imply systematic biases in the different GEE estimators under the low-correlation condition. The estimated biases are almost of the same order for all GEE estimators with a tendency of larger estimated biases for single elements of the GEE RC estimator. As expected, the estimated biases tend to be smaller for the larger sample size. The pattern of results for the high-correlation condition presented in Table 2 is very similar to the pattern in Table 1 for all but the GEE RC estimator. Again, the estimated biases are not large in general and are of the same order for the different GEE estimators. Under the larger sample size n = 500, they tend to be smaller compared with n = 200. Table 3 presents SDs of the various GEE estimators relative to the GEE I estimator under the low-correlation condition. The results suggest that all GEE estimators assuming some association tend to be more efficient than the GEE estimator assuming independence. These efficiency gains are not large and the differences between the various GEE estimators are small. Comparing the results of GEE estimators based on a working correlation matrix implies that the variances of those using a flexible correlation structure tend to be larger compared with those using a restrictive structure. The most efficient estimator, at least with respect to the parameters, of those that are based on local odds ratios tends to be the GEE CE estimator which allows time varying association parameters. The most efficient estimator tends to be the GEE R T estimator which is able to reproduce a correlation structure of decreasing correlations over increasing time distances, allows different correlations r tkt ′ k and r tkt ′ k ′ but at the same time requires estimation of less parameters compared with GEE F T , GEE R U , or GEE F U estimators. The AR(1) structure is less flexible than a Toeplitz structure, which explains why the GEE R A estimator is less efficient than the GEE R T estimator given the simulated true correlation structure. Table 4 implies that the efficiency gains can be considerably larger if the true correlations are higher at least for sample sizes n = 200 and n = 500. Again, however, the GEE estimators based on a flexible correlation structure tend to have larger variances compared with those calculated under a restrictive strategy. These differences tend to increase TA B L E 2 Estimated bias over 1000 simulations for n = 200 and n = 500 under the high-correlation condition, rounded to three decimal places from the GEE estimators assuming an equi-, AR(1), Toeplitz to the GEE estimators assuming an unrestricted working correlation matrix. Like under the low-correlation condition, GEE estimators using local odds ratios seem to be less efficient compared with those based on a working correlation matrix. The most efficient estimator is again the GEE estimator assuming a restrictive Toeplitz structure. The closer the structure of the adopted working correlation matrix to the true underlying correlation structure, the more efficient the GEE estimator, although the extra flexibility of the unrestricted version of the corresponding estimator does not pay, very likely due to the additional number of parameters to be estimated. In summary, the pattern of results is similar, but more pronounced than under the low-correlation condition if n = 200. The coverage rates presented in the left column of Figure 2 imply that under the low-correlation condition most of the coverage rates are within a symmetric 0.99-probability interval (dashed lines) around the value 0.95 based on 1000 independent Bernoulli trials. Hence, with but one exception, there is no evidence of systematic over-or undercoverage. The exception is GEE F U which involves the estimation of many correlations and is associated with systematically lower-and for four parameters too low-coverage rates compared with the other GEE estimators. Note that the results for GEE RC are only shown for the low-correlation condition with a slightly too low coverage rate for one parameter. Under the high-correlation condition (right column) again GEE F U shows undercoverage but so do the estimators GEE UN and GEE CE for one parameter each, although less pronounced. The coverage rates for all other estimators are in an acceptable range. Figure 3 presents the results for sample size n = 500. In this case, the coverage rates of all GEE estimators under the low-and high-correlation conditions are in an acceptable range. Under the high-correlation condition, no results are shown for GEE RC due to massive convergence problems.
The results based on n = 50, 100, 300 and n = 1000 can be found in the supplementary material. Results for GEE RC are only available for sample sizes of n > 100 under the low-correlation condition. In samples of sizes n = 50 and TA B L E 3 SD of estimates relative to assuming independence over 1000 simulations for n = 200 and n = 500 under the low-correlation condition, rounded to three decimal places  n = 100, the model for the association structure was only weakly or not identified. In larger samples, the number of successful simulations varied between 52 and 55 under the high-correlation condition. Hence, these results will not be considered. In general, under the low-and high-correlation condition, the estimated biases tend to decrease as the sample size increases. It should be noted, however, that the results for n = 50 are more volatile than for the other sample sizes and are based on only 867 to 925 successful simulations under the low-correlation condition and 601 to 875 successful simulations under the high-correlation condition for the GEE estimators based on a working correlation matrix. The number of successful simulations for estimators based on local odds ratios varied between 809 and 866 under the low and 568 and 747 under the high-correlation condition. Hence, for n = 50, conclusions about finite sample properties can only be drawn with great caution. A general statement that seems to be warranted even in this case is that under the high-correlation condition considerable efficiency gains are possible if an association structure is modeled, but coverage rates may only be acceptable for very restrictive association structures, like those adopted for the GEE R U or GEE UN estimators.
The number of successful simulations for n = 100 under the low (high)-correlation condition varied between 989 and 998 (973 and 997) if a working correlation structure is adopted and between 982 and 992 (914 and 956) if local odds ratios are modeled. Therefore, these results may be interpreted but with caution.
Under both the low-and high-correlation conditions, the order of the estimated biases are roughly equal for the different GEE estimators. There are only single slightly larger estimated biases for n = 100 and the GEE I , GEE F U and GEE R U , GEE CE and GEE RC estimators. Under the low-correlation condition, systematic, if only small to medium, efficiency gains can be observed from about n = 100 on if an association structure is explicitly modeled. This is different under the high-correlation condition, where the efficiency gains of GEE estimators based on working association structures closer to the true structure are more pronounced and tend to be much larger, with an advantage of the GEE estimators based on a working TA B L E 4 SD of estimates relative to assuming independence over 1000 simulations for n = 200 and n = 500 under the high-correlation condition, rounded to three decimal places  correlation structure compared with those using local odds ratios. One explanation for these latter findings for the local odds ratios versions may be the fact that the association parameters are estimated before any GEE estimates are calculated ignoring any predictor variables. The corresponding association structure may thus deviate markedly from the true structure. A general tendency under the high-correlation condition is that the flexible versions of the working correlation structures lead to less efficient estimators as the restrictive versions but these differences seem to vanish with increasing sample size. For most GEE estimators, the coverage rates are acceptable or close to acceptable from n = 100 on under both correlation conditions. This tends to be true for all estimators from n = 300 on under the low but also under the high-correlation condition. Below this sample size, the very flexible GEE F U estimator is associated with coverage rates that are systematically too low.
Taken together, the results imply that at least for the situations simulated, the proposed GEE estimators allow valid inferences from approximately n = 100 on if the working association structure does not require the estimation of too many parameters.
From the above findings, one may expect that for situations with invariant predictors, the general pattern of results with respect to efficiency of the estimators can be inferred from comparisons of results based on sample sizes n = 200 and n = 500 and estimators GEE I , GEE R T , and GEE R U . Thus, Table 5 presents biases and relative SDs of the estimates for the low-correlation condition, Table 6 those for the high-correlation condition from the simulations based on invariant predictors.
The estimated biases differ depending on the sample size but also on the predictor variable. As before, estimated biases are smaller for larger sample sizes. In addition, if n = 200, the estimated bias for 2 weighting the time-invariant predictor tends to be larger than for the other two parameters. With respect to the relative SDs, the results in Tables 5 and 6 suggest that the efficiency gain of GEE R T or GEE R U compared with GEE I is largest for 1 weighting the freely varying predictor F I G U R E 2 Actual coverage of true k , k and l , 0.95-level confidence intervals, n = 200; dashed lines represent symmetric 0.99-probability intervals around 0.95. High-correlation condition: Results for GEE RC are omitted due to massive convergence problems and smallest for 2 weighting the time-invariant variable. This effect seems to be strongest under the high-correlation condition. Put differently, for time-invariant predictors and lesser so for unit-invariant predictors, adopting a working covariance structure which is closer to the true structure may not lead to more efficient or only slightly more efficient estimators. The coverage rates over at least 998 simulations are almost all in an acceptable range. Only the coverage rates for 2 of GEE I and GEE R U under the low-correlation condition, 0.930 and 0.929, are slightly too low (see Table 7).

THE ARTHRITIS DATASET
The data to illustrate the application of the GEE estimator is from a randomized clinical trial designed to evaluate the effectiveness of the drug Auranofin relative to a placebo therapy for the treatment of rheumatoid arthritis 17,27 and was taken from R package 10 multgee. 39 The dataset can be retrieved by calling data(arthritis) after multgee is loaded * . The dataset includes 302 units out of which n = 289 were completely observed at T = 3 time points, that is, 1, 3, and 5 months after treatment.
*Data availability statement: Data sharing is not applicable to this article as no new data were created or analyzed in this study.

F I G U R E 3
Actual coverage of true k , k and l , 0.95-level confidence intervals, n = 500; dashed lines represent symmetric 0.99-probability intervals around 0.95. High-correlation condition: Results for GEE RC are omitted due to massive convergence problems The ordinal response variable is self-assessment of rheumatoid arthritis surveyed with a 5-point scale from 1 = "very poor" to 5 = "very good" at a baseline before the trial and at the three follow-up observation points. Self-assessment with q = 5 possible values was converted into four 0-1 variables with "very poor" as the reference category. It turned out, however, that the fraction of observed "very good" answers at the baseline was very small leading to convergence problems. Thus, we combined the fourth and the fifth category for the baseline predictors and the response variable.
Baseline assessment variables were treated as predictors (b 2 , … , b 4 ) and the follow-up assessments as the response variable. 17,18 Additional predictors that entered the model are two time-dummies, t 2 and t 3 (reference category is the first month follow-up) and the binary treatment indicator tr with the placebo group being the reference group.
For 13 out of the 302 units in the dataset, the variable self-assessment over the follow-up period was not completely observed. Baseline self-assessment variables were completely observed though. Assuming 17,18 that the missing values are missing completely at random, 43 we ignored the incompletely observed cases in our analysis. Table 8 presents the results of estimating the ordered stereotype logit model assuming independence, adopting a restrictive equi-and a restrictive unstructured working correlation matrix. Estimation results for all other GEE estimators can be found in the supplementary material. Table 9 shows the working correlation matrices over time and categories for the GEE R E (lower triangular matrix) and the GEE R U (upper triangular matrix) estimators. A comparison of the estimation results in Table 8 and the tables in the supplementary material suggests that, given all necessary assumptions are at least approximately met, the general conclusions with respect to 3 , 4 , 3 , t2 , trt , b3 , TA B L E 5 Estimated biases and SDs of estimates relative to assuming independence (relative SD) over 1000 simulations for n = 200 and n = 500 under the low-correlation condition, rounded to three decimal places    and b4 do not depend on the working correlation matrix. For all working correlation matrices, the 0.95-level confidence intervals for 3 , 4 , and t2 cover the value zero. Thus, there is no evidence that these parameters have an effect on the response variable. On the other hand, the 0.95-confidence intervals for 3 , trt , b3 , and b4 do not cover zero. With respect to 2 , 2 , t3 , and b2 the results are inconsistent. However, given that the predictors are all either unit-or time invariant and that the estimated elements of the working correlation matrices are rather low, the choice of the working correlation matrix cannot be expected to have a major effect on the efficiency of the estimators. Since the sample is of medium size, choosing a GEE estimator based on a very flexible correlation matrix or on local odds ratios may, however, not allow substantially more precise inferences compared with the GEE R E estimator. Hence, we will interpret the estimation results based on the GEE R E estimator. All the 0.95-confidence intervals for the parameters cover zero, which implies that there is no evidence that they differ from zero. The results for 2 and 3 imply that they presumably differ from 1 = 0 and 4 = 1. Due to the overlapping confidence intervals, there seems to be no strong evidence for 2 and 3 to be different. Hence, the second and third category could probably be combined into one. The estimates for 2 and 3 , however, do conform with the presupposed ordinal nature of the response variable, that is, 0 <̂2 <̂3 < 1.  Finally, the results for the regression parameters suggest that there seems to be no effect of the first relative to the second follow-up time point, but a positive effect of the third follow-up time point. This result implies that a difference in self-assessment may be observed at the earliest after just over 3 months.
The treatment with Auranofin seems to have a positive effect on self-assessment of rheumatoid arthritis, the estimated increase in the log odds is 1.240 for the fourth category relative to the placebo group. The effect of the baseline self-assessment 2 = "poor" seems to have no different effect on self-assessment at the follow-up time points than the effect of the self-assessment "very poor." However, baseline self-assessments of 3 = "fair" or 4 = "good or very good" before the trial, seem to have a positive effect on later self-assessments. The estimated increase in the log odds for the fourth category is 5.356 relative to the reference category.

DISCUSSION
Estimation of the ordered stereotype logit model in the cross-sectional context is not as straightforward as estimation of sequential or cumulative models. This is due to the intrinsic nonlinearity in the parameters reflecting the increased flexibility of the stereotype model compared with the sequential or cumulative model and justifies simulations as a valuable tool to study the finite sample properties. However, this nonlinearity of the model parameters makes the interpretation of estimation results more difficult, which may also explain why the stereotype model is not very common in applications.
Ordered stereotype panel models have been estimated using the ML or GLS method, but to the best of our knowledge, simulation results studying the finite sample properties seem not to have been published. ML methods require correct specification of the multivariate distribution of the vectors of binary responses indicating the observed category over all observed time points. If the distributional assumptions are correct, then ML estimators are known to be consistent, asymptotically normally distributed and allow asymptotically precise inferences. In this case, ML estimation would be the preferred approach. GLS estimation requires correct specification of the mean, variances, and covariances only, but the estimator tends to have larger variances.
The GEE approach requires only correct specification of the means of the vectors of binary variables, the association structure or higher moments need not be correctly specified. This robustness property comes at a price, which is a loss of efficiency compared with ML estimators. GLS estimation can be made robust by allowing a misspecified association structure as well, but would then also lead to a loss of efficiency and would require a robust estimator of the covariance matrix of the GLS estimator.
In this article we propose a GEE approach with a finite sample correction to estimate ordered stereotype panel logit models with a possibly misspecified association structure. The simulation study and the application to a panel dataset from a randomized clinical trial design to assess the effect of a drug on self-assessment of rheumatoid arthritis at three follow-up observation points after treatment show that GEE estimation of this model is feasible. In fact, the results from the simulation study imply that the finite sample properties of GEE estimators as reported in the literature for other models apply to these models in samples of medium to large size as well and inferences tend to be statistically valid.
The comparison of different GEE estimators in the simulation study implies that, similar to results known from binary models, adopting a working correlation matrix that explicitly models associations compared with the identity matrix may lead to substantial efficiency improvements for the estimators if the true correlations are high. In accordance with known results we found additional efficiency gains if a working correlation matrix is chosen whose structure is close to the true structure. On the other hand and again similar to results found for binary models, if the true correlations are low, then differences between the GEE estimators explicitly allowing for dependencies and even in comparison with the GEE estimator calculated under independence are small. However, the simulation results in this article additionally illustrate a conflict of objectives between the flexibility of the working correlation matrix and the consequences of a weakly identified estimation problem. Possible efficiency gains by choosing a flexible working correlation matrix may be lost or even reversed if the number of association parameters is large relative to the number of observations.
Comparisons of GEE estimators based on a working correlation matrix with those using local odds ratios did not reveal any basic problems of the former that could not be explained by weakly identified models in small samples. Instead, in our simulations we found that GEE estimates based on local odds ratios tend to have larger variances compared with those that use a working correlation matrix, which may be due to the fact that for the former the association structure parameters are estimated ignoring any predictor variables.
The results for the arthritis dataset illustrate once more that if the correlations are low, then the differences between various versions of the GEE estimator relative to the estimator assuming independence can be expected to be small.
As in other models for categorical responses, all possible categories of the response variable should be observed with sufficient frequencies and variation in the predictors should be sufficiently large to avoid weakly identified models and problems of nonconvergence. The greater flexibility of the stereotype model relative to other models requires larger datasets for identification. Hence, based on our simulation results, we suggest to fit an ordered stereotype panel model to datasets with sample sizes below n = 100 only if there are good reasons for doing so. In that case, however, we recommend to adopt only restrictive versions of the proposed GEE estimators, like the estimator based on a restrictive working equicorrelation structure. Similarly and more general, if the true correlations are assumed to be low, then a working correlation matrix with a small number of parameters to be estimated may be chosen. In this case, the efficiency loss compared with the GEE estimator assuming independence tends to be small, but the costs of choosing a GEE estimator which is too flexible in terms of bias, variance and undercoverage may be high. In medium to large samples, say from n = 100 on, with high true correlations, however, efficiency gains by adopting a toeplitz or AR(1)-structure if appropriate may be considerable. Based on the simulation results reported in this article, we do not recommend the use of the flexible versions of the working correlation matrices as long as sample sizes are below n = 300, depending on the type of the working correlation matrix. For example, n = 300 could be sufficient to adopt the flexible version of a working equicorrelation matrix but not for the flexible version of the unrestricted correlation matrix.
Summing up, the GEE estimator with the correction terms proposed in this article seems to be an interesting candidate to estimate the flexible ordered stereotype panel logit model even if the association structure is misspecified. In addition to the above-mentioned limitations and with respect to the arthritis dataset, it should be noted that for GEE estimators to be consistent and asymptotically normal in general, missing values must be missing completely at random. If that is not the case, then one could resort to either a weighting or a multiple imputation strategy. If in addition to the assumption that the means are correctly specified, all the required assumptions about higher moments of the distribution of the response variables are met, particularly with regard to the association structure, then the stereotype panel model may be estimated assuming a random effects model and adopting the ML approach. This approach would allow missing values to be missing at random, although standard estimation procedures will not be able to handle missing values in predictor variables properly. In addition, if the pattern of missing values is nonmonotone then the assumption of missing values being missing at random may be questionable, and ML estimation with standard software will very likely lead to invalid inferences. Finally, note that the GEE estimator proposed in this article is restricted to designs where units are observed at the same fixed time points. The approach does, for example, not apply if in longitudinal studies, predictors and responses of units are recorded based on varying time intervals with possibly different frequencies.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.