Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity



I study a simple, widely applicable approach to handling the initial conditions problem in dynamic, nonlinear unobserved effects models. Rather than attempting to obtain the joint distribution of all outcomes of the endogenous variables, I propose finding the distribution conditional on the initial value (and the observed history of strictly exogenous explanatory variables). The approach is flexible, and results in simple estimation strategies for at least three leading dynamic, nonlinear models: probit, Tobit and Poisson regression. I treat the general problem of estimating average partial effects, and show that simple estimators exist for important special cases. Copyright © 2005 John Wiley & Sons, Ltd.


In dynamic panel data models with unobserved effects, the treatment of the initial observations is an important theoretical and practical problem. Much attention has been devoted to dynamic linear models with an additive unobserved effect, particularly the simple AR(1) model without additional covariates. As is well known, the usual within estimator is inconsistent, and can be badly biased. [See, for example, Hsiao (1986, section 4.2).]

For linear models with an additive unobserved effect, the problems with the within estimator can be solved by using an appropriate transformation, such as differencing, to eliminate the unobserved effects. Then, instrumental variables (IV) can usually be found for implementation in a generalized method of moments (GMM) framework. Anderson and Hsiao (1982) proposed IV estimation on a first-differenced equation, while several authors, including Arellano and Bond (1991), Arellano and Bover (1995), and Ahn and Schmidt (1995), improved on the Anderson–Hsiao estimator by using additional moment restrictions in GMM estimation. More recently, Blundell and Bond (1998) and Hahn (1999) have shown that imposing restrictions on the distribution of initial conditions can greatly improve the efficiency of GMM over certain parts of the parameter space.

Solving the initial conditions problem is notably more difficult in nonlinear models. Generally, there are no known transformations that eliminate the unobserved effects and result in usable moment conditions, although special cases have been worked out. Chamberlain (1992) finds moment conditions for dynamic models with a multiplicative effect in the conditional mean, and Wooldridge (1997) considers transformations for a more general class of multiplicative models. Honoré (1993) obtains orthogonality conditions for the unobserved effects Tobit model with a lagged dependent variable. For the unobserved effects logit model with a lagged dependent variable, Honoré and Kyriazidou (2000) find an objective function that identifies the parameters under certain assumptions on the strictly exogenous covariates.

The strength of semiparametric approaches is that they allow estimation of parameters without specifying a distribution for the unobserved effect. Unfortunately, semiparametric identification hinges on some strong assumptions concerning the strictly exogenous covariates; for example, time dummies are not allowed in the Honoré and Kyriazidou (2000) approach. Honoré and Kyriazidou also reduce the sample to cross-sectional units with no change in any discrete covariates over the last two time periods.

Another practical limitation of the Honoré (1993) and Honoré and Kyriazidou (2000) estimators—and one that often goes unnoticed—is that partial effects on the response probability or conditional mean are not identified. Therefore, the absolute importance of covariates, or the amount of state dependence, cannot be determined from semiparametric approaches.

In this paper I reconsider the initial conditions problem in a parametric framework for nonlinear models. A parametric approach has its usual drawbacks because I specify an auxiliary conditional distribution for the unobserved heterogeneity; misspecification of this distribution generally results in inconsistent parameter estimates. Nevertheless, in some leading cases the approach I take leads to some remarkably simple maximum likelihood estimators. Further, I show that the assumptions are sufficient for uncovering the quantities that are usually of interest in nonlinear applications: partial effects on the mean response, averaged across the population distribution of the unobserved heterogeneity.

Previous research in parametric, dynamic nonlinear models has focused on three different ways of handling initial conditions; these are summarized by Hsiao (1986, section 7.4). The first approach is to treat the initial conditions for each cross-sectional unit as nonrandom variables. Unfortunately, nonrandomness of the initial conditions, yi0, implies that yi0 is independent of unobserved heterogeneity, ci. Even when we observe the entire history of the process {yit}, the assumption of independence between ci and yi0 is very strong. For example, suppose we are interested in modelling earnings of college graduates once they leave college, and yi0 is earnings in the first post-school year. That we observe the start of this process is logically distinct from the strong assumption that unobserved ‘ability’ and ‘motivation’ are independent of initial earnings.

A better approach is to allow the initial condition to be random, and then to use the joint distribution of all outcomes on the response (including that in the initial time period) conditional on unobserved heterogeneity and observed strictly exogenous explanatory variables. The main complication with this approach is specifying the distribution of the initial condition given unobserved heterogeneity. Some authors insist that the distribution of the initial condition represent a steady-state distribution. While the steady-state distribution can be found in special cases, such as the first-order linear model without exogenous variables [see Bhargava and Sargan (1983) and Hsiao (1986, section 4.3)] and the unobserved effects probit model without additional conditioning variables [see Hsiao (1986, section 7.4)], it cannot be done for even modest extensions.

For the dynamic probit model with covariates, Heckman (1981) proposed approximating the conditional distribution of the initial condition. This avoids the practical problem of not being able to find the conditional distribution of the initial value. But, as we will see, it is computationally more difficult than necessary for obtaining both parameter estimates and estimates of averaged effects in nonlinear models.

The approach I suggest in this paper is to model the distribution of the unobserved effect conditional on the initial value and any exogenous explanatory variables. This suggestion has been made before for particular models. For example, Chamberlain (1980) mentions this possibility for the linear AR(1) model without covariates, and Blundell and Smith (1991) study the conditional maximum likelihood estimator of the same model; see also Blundell and Bond (1998). For the binary response model with a lagged dependent variable, Arellano and Carrasco (2003) study a maximum likelihood estimator conditional on the initial condition, where the distribution of the unobserved effect given the initial condition is taken to be discrete. When specialized to the binary response model, the approach here is more flexible, and computationally much simpler: the response probability can have the probit or logit form, strictly exogenous explanatory variables are easily incorporated along with a lagged dependent variable, and standard random effects software can be used to estimate the parameters and averaged effects.

Specifying a distribution of heterogeneity conditional on the initial condition has several advantages. First, we can choose the auxiliary distribution to be flexible, and view it as an alternative approximation to Heckman's (1981). Second, in several leading cases—probit, ordered probit, Tobit and Poisson regression—an auxiliary distribution can be chosen that leads to straightforward estimation using standard software. Third, partial effects on mean responses, averaged across the distribution of unobservables, are identified and can be estimated without much difficulty. I show how to obtain these partial effects generally in Section 4, and Section 5 covers the probit and Tobit models.


We introduce three examples in this section to highlight the important issues; we return to these examples in Section 5.

Example 1 (Dynamic Probit Model with Unobserved Effect): For a random draw i from the population and t = 1, 2, …, T:

$$P(y_{it} = 1 \mid y_{i,t-1}, \ldots, y_{i0}, z_i, c_i) = \Phi(z_{it}\gamma + \rho y_{i,t-1} + c_i) \tag{1}$$

This equation contains several assumptions. First, the dynamics are first order, once zit and ci are also conditioned on. Second, the unobserved effect is additive inside the standard normal cumulative distribution function, Φ. Third, the zit satisfy a strict exogeneity assumption: only zit appears on the right-hand side, even though zi = (zi1, …, ziT) appears in the conditioning set on the left. Naturally, zit can contain lags, and even leads, of exogenous variables.

As we will see in Sections 3 and 4, the parameters in equation (1), as well as average partial effects, can be estimated by specifying a density for ci given (yi0, zi). A homoscedastic normal distribution with conditional mean linear in parameters is very convenient, as we will see in Section 5. □
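To make the data-generating process in Example 1 concrete, here is a minimal simulation sketch. It assumes the response probability is Φ(zitγ + ρyi,t−1 + ci) with a scalar zit, and that ci given yi0 is normal with mean α0 + α1yi0 (the dependence on zi is switched off for brevity). The function name, the 50/50 draw of yi0, and all parameter values are hypothetical choices made only for illustration.

```python
import random

def simulate_dynamic_probit(n_units, n_periods, gamma, rho, alpha0, alpha1, sigma_a, seed=0):
    """Simulate y_it = 1[gamma*z_it + rho*y_{i,t-1} + c_i + u_it > 0], u_it ~ N(0, 1).

    Assumption for illustration: c_i = alpha0 + alpha1*y_i0 + a_i with a_i ~ N(0, sigma_a^2),
    so the heterogeneity is correlated with the initial condition by construction.
    """
    rng = random.Random(seed)
    panel = []
    for i in range(n_units):
        y0 = 1 if rng.random() < 0.5 else 0            # hypothetical initial condition
        c = alpha0 + alpha1 * y0 + rng.gauss(0.0, sigma_a)
        z = [rng.gauss(0.0, 1.0) for _ in range(n_periods)]
        y = [y0]                                       # y[0] is y_i0; y[t] is period t
        for t in range(n_periods):
            latent = gamma * z[t] + rho * y[-1] + c + rng.gauss(0.0, 1.0)
            y.append(1 if latent > 0 else 0)
        panel.append({"id": i, "z": z, "y": y})
    return panel

panel = simulate_dynamic_probit(n_units=500, n_periods=4, gamma=0.5, rho=0.8,
                                alpha0=-0.3, alpha1=0.6, sigma_a=1.0)
```

Because ci shifts with yi0, treating yi0 as independent of the heterogeneity would be wrong in data generated this way, which is the situation the conditional approach is designed for.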

Example 2 (Dynamic Tobit Model with Unobserved Effect): Consider

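$$y_{it} = \max[0,\; z_{it}\gamma + g(y_{i,t-1})\rho + c_i + u_{it}] \tag{2}$$
$$u_{it} \mid (z_i, y_{i,t-1}, \ldots, y_{i0}, c_i) \sim \mathrm{Normal}(0, \sigma_u^2) \tag{3}$$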

for t = 1, 2, …, T. This model applies to corner solution outcomes, where yit is an observed response that equals zero with positive probability but is continuously distributed over strictly positive values. It is not well suited to true data censoring applications, as in that case we would want a lagged value of the latent variable underlying equation (2) to appear. The function g(·) allows the lagged value of the observed response to appear in a variety of ways. For instance, we might have g(y−1) = {1[y−1 = 0], 1[y−1 > 0]log(y−1)}, which allows the effect of lagged y to be different depending on whether the previous response was a corner solution (zero) or strictly positive. In this case, ρ is 2 × 1.
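The two-element choice of g(·) described above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and the lagged response is assumed to be a nonnegative number.

```python
import math

def g(y_lag):
    """Return (1[y_lag = 0], 1[y_lag > 0] * log(y_lag)) for a nonnegative lagged response."""
    if y_lag < 0:
        raise ValueError("a corner solution response is assumed to be nonnegative")
    if y_lag == 0:
        return (1.0, 0.0)
    return (0.0, math.log(y_lag))

def lag_contribution(y_lag, rho):
    """Contribution g(y_lag) * rho to the latent index, with rho a length-2 sequence."""
    g_zero, g_log = g(y_lag)
    return g_zero * rho[0] + g_log * rho[1]
```

With this choice, rho[0] shifts the index when the previous outcome was at the corner, and rho[1] is the coefficient on the log of a strictly positive lagged outcome.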

Honoré (1993) proposes orthogonality conditions that identify the parameters, but partial effects are unidentified. We will show how to obtain $\sqrt{N}$-consistent estimates of the parameters and average partial effects in Section 5. □

Example 3 (Dynamic Unobserved Effects Poisson Model): For each t = 1, …, T, yit given (yi, t−1, …, yi0, zi, ci) has a Poisson distribution with mean

$$E(y_{it} \mid y_{i,t-1}, \ldots, y_{i0}, z_i, c_i) = c_i \exp[z_{it}\gamma + g(y_{i,t-1})\rho] \tag{4}$$

Again, we allow for the lagged dependent variable to appear in a flexible fashion, perhaps as a set of dummy variables for specific outcomes on yi, t−1. The null hypothesis of no state dependence is H0: ρ = 0. Chamberlain (1992) and Wooldridge (1997) have proposed orthogonality conditions based only on equation (4), where no conditional distributional assumptions are needed for yit or ci. Unfortunately, because the moment conditions have features similar to using first differences in a linear equation, the resulting GMM estimators can be very imprecise (even though the parameters would be generally identified). In Section 5 we show how a particular model for a conditional distribution for ci leads to a simple maximum likelihood analysis. □


Let i index a random draw from the cross-section, and let t denote a particular time period. Initially, we assume that we observe (zit, yit), t = 1, …, T, along with yi0. We are interested in the conditional distribution of yit given (zit, yi, t−1, ci), where zit is a vector of conditioning variables at time t and ci is unobserved heterogeneity. (The dimension of zit could be increasing with t, but in our examples its dimension is fixed.) We denote the conditional distribution by D(yit|zit, yi, t−1, ci). The asymptotic analysis is with the number of time periods, T, fixed, with the cross-section sample size, N, going to infinity.

We make two key assumptions on the conditional distribution of interest. First, we assume that the dynamics are correctly specified. This means that at most one lag of yit appears in the distribution given outcomes back to the initial time period. Second, zi = {zi1, …, ziT} is appropriately strictly exogenous, conditional on ci. Both of these can be expressed as follows.

Assumption 1: For t = 1, 2, …, T:

$$D(y_{it} \mid z_{it}, y_{i,t-1}, c_i) = D(y_{it} \mid z_i, y_{i,t-1}, \ldots, y_{i0}, c_i) \tag{5}$$

We next assume that we have a correctly specified parametric model for the density representing equation (5) which, for lack of a better name, we call the ‘structural’ density.

Assumption 2: For t = 1, 2, …, T, ft(yt|zt, yt−1, c;θ) is a correctly specified density for the conditional distribution on the left-hand side of equation (5), with respect to a σ-finite measure ν(dyt). The parameter space, Θ, is a subset of finite-dimensional Euclidean space. Denote the true value of θ by θo∈Θ.

The requirement that we have a density with respect to a σ-finite measure is not restrictive in practice. If yt is purely discrete, ν is a counting measure. If yt is continuous, ν is a Lebesgue measure. An appropriate σ-finite measure can be found for all of the possible response variables of interest in economics.

Most specific analyses of dynamic, nonlinear unobserved effects models begin with assumptions similar to 1 and 2. Together, Assumptions 1 and 2 imply that the density of (yi1, …, yiT) given (yi0 = y0, zi = z, ci = c) is

$$\prod_{t=1}^{T} f_t(y_t \mid z_t, y_{t-1}, c; \theta_o) \tag{6}$$

where we drop the i subscript to indicate dummy arguments of the density. In using equation (6) to estimate θo, we must confront the fact that it depends on the unobservables, c. One possibility is to construct the log-likelihood function that treats the N unobserved effects, ci, as (vectors of) parameters to be estimated. This leads to maximizing the function

$$\sum_{i=1}^{N} \sum_{t=1}^{T} \log f_t(y_{it} \mid z_{it}, y_{i,t-1}, c_i; \theta) \tag{7}$$

over θ and (c1, …, cN). While this approach avoids having to restrict the distribution of ci, it suffers from an incidental parameters problem with fixed T: except in very special cases, the estimator of θo is inconsistent.

The alternative is to ‘integrate out’ the unobserved effect. As we discussed in the Introduction, there have been several suggestions for doing this. A popular approach is to find the density of (yi0, yi1, …, yiT) given zi. If we specify f(y0|z, c) then

$$f(y_0, y_1, \ldots, y_T \mid z, c) = f(y_1, \ldots, y_T \mid y_0, z, c)\, f(y_0 \mid z, c) \tag{8}$$

Next, we specify a density f(c|z). We can then integrate equation (8) with respect to this density to obtain f(y0, y1, …, yT|z).

Rather than trying to find the density of (yi0, yi1, …, yiT) given zi, my suggestion is to use the density of (yi1, …, yiT) conditional on (yi0, zi). Because we already have the density of (yi1, …, yiT) conditional on (yi0, zi, ci)—given by equation (6)—we need only specify the density of ci conditional on (yi0, zi). Because this density is not restricted by the specification in Assumption 2, we can choose it for convenience, or flexibility or, hopefully, both. As in Chamberlain's (1980) analysis of unobserved effects probit models with strictly exogenous explanatory variables, we view the device of specifying f(c|y0, z) as a way of obtaining relatively simple estimates of θo. Specifying a model for f(c|y0, z) seems no worse than having to specify models for f(y0|z, c), which must themselves be viewed as approximations, except in special cases where steady-state distributions can be derived.

Assumption 3: h(c|y0, z;δ) is a correctly specified model for the density of D(ci|yi0, zi) with respect to a σ-finite measure η(dc). Let the parameter space for δ be a subset of finite-dimensional Euclidean space, and let δo denote the true value of δ.

Assumption 3 is much more controversial than Assumptions 1 and 2. Ideally, we would not have to specify anything about the relationship between ci and (yi0, z), whereas Assumption 3 assumes we have a complete conditional density correctly specified. In some specific cases—linear models, logit models, Tobit models and exponential regression models—consistent estimators of θo are available without Assumption 3. But these estimators are complicated and need not have good precision.

Under Assumptions 1, 2 and 3, the density of (yi1, …, yiT) given (yi0 = y0, zi = z) is

$$\int f(y_1, \ldots, y_T \mid y_0, z, c; \theta_o)\, h(c \mid y_0, z; \delta_o)\, \eta(dc) \tag{9}$$

which leads to the log-likelihood function conditional on (yi0, zi) for each observation i:

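$$\ell_i(\theta, \delta) = \log\left[\int f(y_{i1}, \ldots, y_{iT} \mid y_{i0}, z_i, c; \theta)\, h(c \mid y_{i0}, z_i; \delta)\, \eta(dc)\right] \tag{10}$$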

To estimate θo and δo, we sum the log-likelihoods in equation (10) across i = 1, …, N and maximize with respect to θ and δ. The resulting conditional MLE is $\sqrt{N}$-consistent and asymptotically normal under standard regularity conditions. In dynamic unobserved effects models, the log-likelihoods are typically very smooth functions, and we usually assume that the needed moments exist and are finite. From a practical perspective, identification is the key issue. Generally, if D(ci|yi0, zi) is allowed to depend on all elements of zi then the way in which any time-constant exogenous variables can appear in the structural density is restricted. To increase explanatory power, we can include time-constant explanatory variables in zit, but we will not be able to separately identify the partial effect of the time-constant variable from its partial correlation with ci.
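The log-likelihood contribution for one cross-sectional unit can be sketched numerically. The code below is a rough illustration, not an estimator used in the paper: it assumes a scalar ci, approximates the integral over c by the trapezoid rule on a fixed grid, and takes the structural density and the heterogeneity density as user-supplied functions. All function names, the grid, and the probit and normal choices in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

def conditional_loglik_i(y, z, f_t, h, theta, delta, c_grid):
    """Log of the integral over c of prod_t f_t(y_t | z_t, y_{t-1}, c; theta) * h(c | y_0, z; delta).

    y = [y_0, y_1, ..., y_T] and z = [z_1, ..., z_T]; trapezoid rule on the increasing grid c_grid.
    """
    values = []
    for c in c_grid:
        prod = 1.0
        for t in range(1, len(y)):
            prod *= f_t(y[t], z[t - 1], y[t - 1], c, theta)
        values.append(prod * h(c, y[0], z, delta))
    integral = 0.0
    for k in range(1, len(c_grid)):
        integral += 0.5 * (values[k] + values[k - 1]) * (c_grid[k] - c_grid[k - 1])
    return math.log(integral)

_std = NormalDist()

def probit_f(y_t, z_t, y_lag, c, theta):
    """Assumed structural density: probit with index gamma*z_t + rho*y_lag + c."""
    gamma, rho = theta
    p = _std.cdf(gamma * z_t + rho * y_lag + c)
    return p if y_t == 1 else 1.0 - p

def normal_h(c, y0, z, delta):
    """Assumed heterogeneity density: Normal(alpha0 + alpha1*y0, sigma_a^2); z is ignored here."""
    alpha0, alpha1, sigma_a = delta
    return NormalDist(alpha0 + alpha1 * y0, sigma_a).pdf(c)

c_grid = [-8.0 + 0.01 * k for k in range(1601)]
y_i, z_i = [1, 0, 1, 1], [0.2, -0.5, 1.0]
ll = conditional_loglik_i(y_i, z_i, probit_f, normal_h, (0.5, 0.8), (0.0, 0.5, 1.0), c_grid)

# Sanity check: if the structural density does not depend on c, the integral factors.
flat = lambda y_t, z_t, y_lag, c, theta: 0.5
ll_flat = conditional_loglik_i(y_i, z_i, flat, normal_h, None, (0.0, 0.5, 1.0), c_grid)
```

Summing such contributions over i and passing the negative sum to a numerical optimizer would give the conditional MLE. In practice Gauss-Hermite quadrature or existing random effects routines would be a more efficient choice than a fixed grid.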

The log-likelihood in equation (10) assumes that we observe data on all cross-sectional units in all time periods. Nevertheless, for unbalanced panels under certain sample selection mechanisms, we can use the same conditional log-likelihood for the subset of observations constituting a balanced panel. Let si be a selection indicator: si = 1 if we observe data in all time periods (including yi0), and zero otherwise. Then, if (yi1, …, yiT) and si are independent conditional on (yi0, zi), the MLE using the balanced panel will be consistent, and the usual asymptotic standard errors and test statistics are asymptotically valid. Consistency follows from the general argument in Wooldridge (2002, section 17.2.2).

When attrition is an issue, obtaining the density conditional on (yi0, zi) has some advantages over the more traditional approach, where the density would be conditional only on zi. In particular, the current approach allows attrition to depend on the initial condition, yi0, in an arbitrary way. For example, if yi0 is annual hours worked, an MLE analysis based on equation (10) allows attrition probabilities to differ across initial hours worked. In the traditional approach, one would have to explicitly model attrition as a function of yi0 and figure out the appropriate Heckit-type analysis.

Of course, reducing the data set to a balanced panel can discard useful information. But available semiparametric methods have the same feature. For example, the objective function in Honoré and Kyriazidou (2000) includes differences in the strictly exogenous covariates for T = 3. Any observation where Δzit is missing for t = 2 or 3 cannot contribute to the analysis.


In nonlinear models we often need to go beyond estimation of the parameters, θo, and obtain estimated partial effects. Typically, we would like the effect on a mean response after averaging the unobserved heterogeneity across the population. I now show how to construct consistent, $\sqrt{N}$-asymptotically normal estimators of these average partial effects (APEs).

Let q(yt) be a scalar function of yt whose conditional mean we are interested in at time t. The leading case is q(yt) = yt when yt is a scalar, but q(·) could be an indicator function if we are interested in probabilities. Generally, we are interested in

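$$m(z_t, y_{t-1}, c; \theta_o) = E[q(y_{it}) \mid z_{it} = z_t, y_{i,t-1} = y_{t-1}, c_i = c] = \int q(y_t)\, f_t(y_t \mid z_t, y_{t-1}, c; \theta_o)\, \nu(dy_t) \tag{11}$$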

where zt, yt−1 and c are values that we must choose. Unfortunately, since the unobserved heterogeneity rarely, if ever, has natural units of measurement, it is unclear which values we should plug in for c. Instead, we can hope to estimate the partial effects averaged across the distribution of ci. That is, we estimate

$$\mu(z_t, y_{t-1}) \equiv E[m(z_t, y_{t-1}, c_i; \theta_o)] \tag{12}$$

where the expectation is with respect to ci. (For emphasis, variables with an i subscript are random variables in the expectations; others are fixed values.) Under Assumptions 1, 2 and 3, we do not have a parametric model for the unconditional distribution of ci, and so it may seem that we need to add additional assumptions to estimate equation (12). Instead, we can obtain a consistent estimator of equation (12) using iterated expectations:

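$$E[m(z_t, y_{t-1}, c_i; \theta_o)] = E\{E[m(z_t, y_{t-1}, c_i; \theta_o) \mid y_{i0}, z_i]\} = E\left[\int m(z_t, y_{t-1}, c; \theta_o)\, h(c \mid y_{i0}, z_i; \delta_o)\, \eta(dc)\right] \tag{13}$$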

where the outside, unconditional expectation is with respect to the distribution of (yi0, zi). While equation (13) is generally complicated, it simplifies considerably in some leading cases, as we will see in Section 5. In effect, we first compute the expectation of q(yit) conditional on (zit, yi, t−1, ci), which is possible because we have specified the density ft(yt|zt, yt−1, c;θo). Then, we (hopefully) integrate m(zt, yt−1, c) against h(c|yi0, zi;δo) to obtain E[m(zt, yt−1, ci;θo)|yi0, zi].

One point worth emphasizing about equation (13) is that δo appears explicitly. In other words, while δo may be properly viewed as a nuisance parameter for estimating θo, δo is not a nuisance parameter for estimating APEs. Because the semiparametric literature treats h(c|y0, z) as a nuisance function, there seems little hope that semiparametric approaches will deliver consistent estimates of APEs in dynamic, unobserved effects panel data models.

Given equation (13), a consistent estimator of µ(zt, yt−1) follows immediately:

$$\hat{\mu}(z_t, y_{t-1}) = N^{-1} \sum_{i=1}^{N} r(z_t, y_{t-1}, y_{i0}, z_i; \hat{\theta}, \hat{\delta}) \tag{14}$$

where r(zt, yt−1, yi0, zi;θo, δo)≡E[m(zt, yt−1, ci;θo)|yi0, zi]. Under weak assumptions, $\hat{\mu}(z_t, y_{t-1})$ is a $\sqrt{N}$-asymptotically normal estimator of µ(zt, yt−1), whose asymptotic variance can be obtained using the delta method.
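When no closed form for r(·) is available, the averaging estimator can be approximated numerically. The sketch below assumes a scalar ci and, for brevity, a scalar summary of zi. It integrates m against h with the trapezoid rule for each unit and then averages over the sample. The probit mean function, the normal heterogeneity density, and every parameter value are hypothetical and serve only to compare the result with a case where the answer is known.

```python
import math
from statistics import NormalDist

def r_numeric(m, h, z_t, y_lag, y0, z_i, theta, delta, c_grid):
    """Trapezoid approximation to the integral over c of m(z_t, y_lag, c; theta) * h(c | y0, z_i; delta)."""
    vals = [m(z_t, y_lag, c, theta) * h(c, y0, z_i, delta) for c in c_grid]
    return sum(0.5 * (vals[k] + vals[k - 1]) * (c_grid[k] - c_grid[k - 1])
               for k in range(1, len(c_grid)))

def ape_estimate(m, h, z_t, y_lag, sample, theta_hat, delta_hat, c_grid):
    """Sample average of r_numeric over the observed (y_i0, z_i) pairs."""
    total = sum(r_numeric(m, h, z_t, y_lag, y0, z_i, theta_hat, delta_hat, c_grid)
                for y0, z_i in sample)
    return total / len(sample)

_std = NormalDist()

def probit_m(z_t, y_lag, c, theta):
    """Assumed mean function: Phi(gamma*z_t + rho*y_lag + c)."""
    gamma, rho = theta
    return _std.cdf(gamma * z_t + rho * y_lag + c)

def normal_h(c, y0, z_i, delta):
    """Assumed density of c given (y0, z_i): Normal(alpha0 + alpha1*y0 + alpha2*z_i, sigma_a^2)."""
    alpha0, alpha1, alpha2, sigma_a = delta
    return NormalDist(alpha0 + alpha1 * y0 + alpha2 * z_i, sigma_a).pdf(c)

c_grid = [-12.0 + 0.02 * k for k in range(1201)]
sample = [(0, 0.3), (1, -0.2), (1, 1.0)]              # hypothetical (y_i0, z_i) pairs
theta_hat, delta_hat = (0.5, 0.8), (-0.2, 0.7, 0.1, 1.2)
mu_hat = ape_estimate(probit_m, normal_h, 0.4, 1, sample, theta_hat, delta_hat, c_grid)

# Known answer for this probit-normal pair: Phi(index / sqrt(1 + sigma_a^2)), averaged over i.
closed_form = sum(
    _std.cdf((0.5 * 0.4 + 0.8 * 1 - 0.2 + 0.7 * y0 + 0.1 * z_i) / math.sqrt(1 + 1.2 ** 2))
    for y0, z_i in sample) / len(sample)
```

Evaluating the estimate at two values of yt−1 and differencing gives an average partial effect of the lagged outcome.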


We reconsider the examples from Section 2, showing how we can apply the results from Sections 3 and 4. Our focus is on choices of the density h(c|y0, z;δ) that lead to computational simplicity. For notational convenience, we drop the ‘o’ subscript on the true values of the parameters.

5.1. Dynamic Probit and Ordered Probit Models

In addition to equation (1), assume that

$$c_i \mid y_{i0}, z_i \sim \mathrm{Normal}(\alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2,\; \sigma_a^2) \tag{15}$$

where zi is the row vector of all (nonredundant) explanatory variables in all time periods. If zit contains a full set of time period dummy variables these elements would be dropped from zi. The presence of zi in equation (15) means that we cannot identify the coefficients on time-constant covariates in zit, although time-constant covariates can be included in zi in equation (15).

Given equation (1), we can write

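$$f(y_1, \ldots, y_T \mid y_0, z, c; \beta) = \prod_{t=1}^{T} \left\{\Phi(z_t\gamma + \rho y_{t-1} + c)^{y_t}\,[1 - \Phi(z_t\gamma + \rho y_{t-1} + c)]^{1 - y_t}\right\} \tag{16}$$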

where β = (γ′, ρ)′. When we integrate equation (16) with respect to the normal distribution in equation (15), we obtain the density of D(yi1, …, yiT|yi0, zi).

It turns out that we can specify the density in such a way that standard random effects probit software can be used for estimation. If we write

$$c_i = \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2 + a_i \tag{17}$$

where $a_i \mid (y_{i0}, z_i) \sim \mathrm{Normal}(0, \sigma_a^2)$, then yit given (yi, t−1, …, yi0, zi, ai) follows a probit model with response probability

$$\Phi(z_{it}\gamma + \rho y_{i,t-1} + \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2 + a_i) \tag{18}$$

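This is easy to derive by writing the latent variable version of the model as $y_{it}^* = z_{it}\gamma + \rho y_{i,t-1} + c_i + u_{it}$ and plugging in for ci from equation (17):

$$y_{it}^* = z_{it}\gamma + \rho y_{i,t-1} + \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2 + a_i + u_{it} \tag{19}$$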

where uit|(zi, yi, t−1, …, yi0, ai) ∼ Normal(0, 1); equation (18) follows. Thus, the density of (yi1, …, yiT) given (yi0 = y0, zi = z, ai = a) is

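$$\prod_{t=1}^{T} \left\{\Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_0 + z\alpha_2 + a)^{y_t}\,[1 - \Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_0 + z\alpha_2 + a)]^{1 - y_t}\right\} \tag{20}$$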

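and integrating equation (20) against the Normal$(0, \sigma_a^2)$ density gives the density of (yi1, …, yiT) given (yi0 = y0, zi = z):

$$\int_{-\infty}^{\infty} \left(\prod_{t=1}^{T} \left\{\Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_0 + z\alpha_2 + a)^{y_t}\,[1 - \Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_0 + z\alpha_2 + a)]^{1 - y_t}\right\}\right) (1/\sigma_a)\,\phi(a/\sigma_a)\, da \tag{21}$$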

Interestingly, the likelihood in equation (21) has exactly the same structure as in the standard random effects probit model, except that the explanatory variables at time period t are

$$x_{it} \equiv (1, z_{it}, y_{i,t-1}, y_{i0}, z_i) \tag{22}$$

Importantly, we are not saying that ai is independent of yi, t−1, which is impossible. (Dependence between ai and yi, t−1 means that a pooled probit analysis of yit on xit is inconsistent for the parameters and the APEs.) Further, the density in equation (21) is not the joint density of (yi1, …, yiT) given (xi1, …, xiT), as happens in the case with strictly exogenous xit. Nevertheless, the way random effects probit works is by forming the products of the densities of yit given (xit, ai), and then integrating out using the unconditional density of ai, and this is what equation (21) calls for. So we add yi0 and zi as additional explanatory variables in each time period and use standard random effects probit software to estimate γ, ρ, α0, α1, α2 and $\sigma_a^2$.
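The data preparation step implied here is mostly bookkeeping: for each unit, every period's row carries the current covariates, the lagged outcome, the initial outcome, and the full covariate history. A minimal sketch follows, with a hypothetical function name and dictionary layout. It assumes zi stacks every element of every period, so any time dummies in zit would need to be dropped from the stacked history before use.

```python
def build_augmented_rows(unit_id, z, y):
    """Return one row per period t = 1..T with regressors (1, z_it, y_{i,t-1}, y_i0, z_i).

    z is a list of T lists (z_i1, ..., z_iT); y is [y_i0, y_i1, ..., y_iT].
    """
    if len(y) != len(z) + 1:
        raise ValueError("y must hold the initial condition plus one value per period")
    z_all = [value for z_t in z for value in z_t]     # z_i: all periods stacked
    rows = []
    for t in range(1, len(y)):
        x_it = [1.0] + list(z[t - 1]) + [float(y[t - 1]), float(y[0])] + z_all
        rows.append({"id": unit_id, "t": t, "y": y[t], "x": x_it})
    return rows

rows = build_augmented_rows(7, z=[[0.1], [0.4], [0.9]], y=[1, 0, 1, 1])
```

The resulting long-format rows are what a random effects probit routine would take as input, with the unit identifier defining the panel structure.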

Under the assumptions made, we can easily obtain estimated partial effects at interesting values of the explanatory variables. The average partial effects are based on

$$E[\Phi(z_t\gamma + \rho y_{t-1} + c_i)] \tag{23}$$

where the expectation is with respect to the distribution of ci. The general formula in equation (14) turns out to be easy to obtain. Again, replace ci with ci = α0 + α1yi0 + ziα2 + ai, so that expression (23) is

$$E[\Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2 + a_i)] \tag{24}$$

where the expectation is over the distribution of (yi0, zi, ai); zt and yt−1 are fixed values here. Now, as in Section 4, we use iterated expectations:

$$E\{E[\Phi(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2 + a_i) \mid y_{i0}, z_i]\} \tag{25}$$

The conditional expectation inside equation (25) is easily shown to be

$$\Phi(z_t\gamma_a + \rho_a y_{t-1} + \alpha_{a0} + \alpha_{a1} y_{i0} + z_i\alpha_{a2}) \tag{26}$$

where the ‘a’ subscript denotes the original parameter multiplied by $(1 + \sigma_a^2)^{-1/2}$. Now, we want to estimate the expected value of expression (26) with respect to the distribution of (yi0, zi). A consistent estimator is

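$$N^{-1} \sum_{i=1}^{N} \Phi(z_t\hat{\gamma}_a + \hat{\rho}_a y_{t-1} + \hat{\alpha}_{a0} + \hat{\alpha}_{a1} y_{i0} + z_i\hat{\alpha}_{a2}) \tag{27}$$

where the ‘a’ subscript now denotes multiplication by $(1 + \hat{\sigma}_a^2)^{-1/2}$, and the parameter estimates, including $\hat{\sigma}_a^2$, are the MLEs. We can compute changes or derivatives of expression (27) with respect to zt or yt−1 to obtain APEs. Thus, we can determine the economic importance of any state dependence.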
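Given estimates from random effects probit software, the averaged response probability and the implied measure of state dependence are simple to compute. The following sketch assumes the scaling factor is (1 + σa²)^(−1/2), applied to the whole index. The parameter dictionary, the sample of (yi0, zi) pairs, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

_std = NormalDist()

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def average_response_probability(z_t, y_lag, sample, params):
    """Average over i of Phi(scale * (z_t*gamma + rho*y_lag + alpha0 + alpha1*y_i0 + z_i*alpha2))."""
    scale = 1.0 / math.sqrt(1.0 + params["sigma_a"] ** 2)
    total = 0.0
    for y0, z_i in sample:
        index = (dot(z_t, params["gamma"]) + params["rho"] * y_lag
                 + params["alpha0"] + params["alpha1"] * y0 + dot(z_i, params["alpha2"]))
        total += _std.cdf(scale * index)
    return total / len(sample)

def state_dependence_ape(z_t, sample, params):
    """Averaged response probability at y_{t-1} = 1 minus that at y_{t-1} = 0."""
    return (average_response_probability(z_t, 1, sample, params)
            - average_response_probability(z_t, 0, sample, params))

# Hypothetical estimates and data, for illustration only.
params = {"gamma": [0.4], "rho": 0.9, "alpha0": -0.2, "alpha1": 0.7,
          "alpha2": [0.1, -0.1], "sigma_a": 1.2}
sample = [(0, [0.3, 0.5]), (1, [-0.2, 0.1]), (1, [1.0, 0.8])]
ape = state_dependence_ape([0.5], sample, params)
```

The difference is taken at a fixed zt, so it can be reported at several covariate values of interest.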

Allowing for a more flexible conditional mean in equation (15) is straightforward, provided the mean is linear in parameters. For example, including interactions between yi0 and zi is simple, and would be warranted if we included interactions between the elements of zit and yi, t−1 in the structural model. Allowing for heteroscedasticity in Var(ci|yi0, zi) is more complicated and would probably require special programming. Still, specifying, say, $\mathrm{Var}(c_i \mid y_{i0}, z_i) = \sigma_a^2 \exp(\gamma_1 y_{i0} + z_i\gamma_2)$ leads to a tractable log-likelihood function: simply replace $\sigma_a$ in equation (21) with $\sigma_a[\exp(\gamma_1 y_{i0} + z_i\gamma_2)]^{1/2}$. With this variance function, the conditional expectation in equation (25) is still easy to obtain: $\Phi\{(z_t\gamma + \rho y_{t-1} + \alpha_0 + \alpha_1 y_{i0} + z_i\alpha_2)/[1 + \sigma_a^2 \exp(\gamma_1 y_{i0} + z_i\gamma_2)]^{1/2}\}$ (see Wooldridge [2002, problem 15.18(c)] for the static case), and so APEs would be readily computable by averaging across i.

Certain specification tests are easy to compute. For example, after estimating the basic model, terms such as equation image and equation image could be added and their joint significance tested using a standard likelihood ratio test. Obtaining score tests for exponential heteroscedasticity in Var(ci|yi0, zi) or for nonnormality in D(ci|yi0, zi) are good topics for future research.

The binary probit model extends in a straightforward way to a dynamic ordered probit model. If yit takes on values in {0, 1, …, J} then we can specify an ordered probit model with J lagged indicators, 1[yi, t−1 = j], j = 1, …, J, and strictly exogenous explanatory variables, zit. The underlying latent variable model would be $y_{it}^* = z_{it}\gamma + r_{i,t-1}\rho + c_i + e_{it}$, where ri, t−1 is the vector of J indicators, and eit has a conditional standard normal distribution. The observed value, yit, is determined by $y_{it}^*$ falling into a particular interval, where the cut-points must be estimated. If we specify D(ci|yi0, zi) as having a homoscedastic normal distribution, standard random effects ordered probit software can be used. Probably we would allow h(c|y0, z;δ) to depend on a full set of indicators 1[yi0 = j], j = 1, …, J.

Certainly there are some criticisms that one can make about the conditional MLE approach for dynamic probit models. First, suppose that there are no covariates. Then, unless α1 = 0, equation (15) implies that ci has a mixture-of-normals distribution, rather than a normal distribution, as would be a standard assumption. But ci given yi0 has some distribution, and it is unclear why an unconditional normal distribution for ci is a priori better than a conditional normal distribution. For cross-sectional binary response models, Geweke and Keane (1999) find that, empirically, mixture-of-normals probit models fit significantly better than the standard probit model. Granted, the mixing probability here is tied to y0, and the variance is assumed to be constant. But often in econometrics we assume that unobserved heterogeneity has a conditional normal distribution rather than an unconditional normal distribution.

Related to the previous criticism is that, in models without covariates, equation (15) implies a distribution D(yi0|ci) different from the steady-state distribution. This is not ideal—in the linear model, one can allow for a non-steady-state distribution while including the steady-state distribution as a special case—but it is only relevant in models without covariates. Plus, even if there are no covariates, it is not clear why imposing a steady-state distribution is better than that implied by equation (15). Dynamic panel data models are really about modelling the conditional distributions in Assumption 1. One can take issue with any set of auxiliary assumptions.

Another criticism is that if ρ = 0 then, because ci given zi cannot be normally distributed unless α1 = 0, the model is not compatible with Chamberlain's (1980) static random effects probit model. That the model here does not encompass Chamberlain's is true, but it is unclear why normality of ci given zi is necessarily a better assumption than normality of ci given (yi0, zi). Both are only approximations to the truth and, when estimating a dynamic model, it is much more convenient to use equation (15). Plus, Chamberlain's static model does not allow estimation of either ρ or the amount of state dependence, as measured by the average partial effect.

Assumption (15) is also subject to the same criticism as Chamberlain's (1980) random effects probit model with strictly exogenous covariates. Namely, if we want the same model to hold for any number of time periods T, the normality assumption in equation (15) imposes distributional restrictions on the zit. For example, suppose α1 = 0. Then, for equation (15) to hold for T and T − 1, zitα2T given (zi1, …, zi, T−1) would have to have a normal distribution. While theoretically this is a valid criticism, it is hardly unique to this setting. For example, every time an explanatory variable is added in a cross-sectional probit analysis, the probit model can no longer hold unless the new variable is normally distributed. Yet researchers regularly use probit models on different sets of explanatory variables.

5.2. Dynamic Tobit Models

For the Tobit model the density in Assumption 2 is

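$$f_t(y_t \mid z_t, y_{t-1}, c; \theta) = \begin{cases} 1 - \Phi[(z_t\gamma + g(y_{t-1})\rho + c)/\sigma_u], & y_t = 0 \\ (1/\sigma_u)\,\phi[(y_t - z_t\gamma - g(y_{t-1})\rho - c)/\sigma_u], & y_t > 0 \end{cases}$$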

To implement the conditional MLE, we need to specify a density in Assumption 3. Again, it is convenient for this to be normal, as in equation (15). For the Tobit case, we might replace yi0 with a more general vector of functions, ri0 ≡ r(yi0), which allows ci to have a fairly flexible conditional mean. Interactions between elements of ri0 and zi may be warranted. We can use an argument very similar to the probit case to show that the log-likelihood has a form that can be maximized by standard random effects Tobit software, where the explanatory variables at time t are xit ≡ (zit, gi, t−1, ri0, zi) and gi, t−1 ≡ g(yi, t−1). In particular, the latent variable model can be written as $y_{it}^* = z_{it}\gamma + g_{i,t-1}\rho + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2 + a_i + u_{it}$, where uit given (zi, yi, t−1, …, yi0, ai) has a Normal$(0, \sigma_u^2)$ distribution. Again, we estimate $\sigma_a^2$ rather than $\sigma_c^2$, but $\sigma_a^2$ is exactly what appears in the average partial effects.

Denote E(yit|wit = wt, ci = c) as

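$$m(w_t\beta + c, \sigma_u^2) = \Phi[(w_t\beta + c)/\sigma_u]\,(w_t\beta + c) + \sigma_u\,\phi[(w_t\beta + c)/\sigma_u] \tag{28}$$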

where wt = (zt, gt−1). As in the probit case, for estimating the APEs it is useful to substitute for ci:

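$$E[m(w_t\beta + c_i, \sigma_u^2)] = E[m(w_t\beta + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2 + a_i, \sigma_u^2)] = E\{E[m(w_t\beta + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2 + a_i, \sigma_u^2) \mid r_{i0}, z_i]\} \tag{29}$$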

where the first expectation is with respect to the distribution of ci and the second expectation is with respect to the distribution of (yi0, zi, ai). The second equality follows from iterated expectations. Since ai and (ri0, zi) are independent, and $a_i \sim \mathrm{Normal}(0, \sigma_a^2)$, the conditional expectation in equation (29) is obtained by integrating $m(w_t\beta + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2 + a_i, \sigma_u^2)$ over ai with respect to the Normal$(0, \sigma_a^2)$ distribution. Since $m(w_t\beta + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2 + a_i, \sigma_u^2)$ is obtained by integrating max(0, wtβ + α0 + ri0α1 + ziα2 + ai + uit) with respect to uit over the Normal$(0, \sigma_u^2)$ distribution, it is easily seen that the conditional expectation in equation (29) is

equation image(30)

A consistent estimator of the expected value of expression (30) is simply

equation image(31)
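The averaging step can be illustrated numerically. The following is a minimal Python sketch, not the paper's code: it assumes the Tobit conditional mean takes the standard form E[max(0, index + e)] = Φ(index/σ)·index + σ·φ(index/σ) for e ∼ Normal(0, σ²), and it evaluates that function at the combined variance of ai and uit before averaging over i. The function names, argument names and all numbers are hypothetical.

```python
import math


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tobit_mean(index, variance):
    """E[max(0, index + e)] for e ~ Normal(0, variance)."""
    sd = math.sqrt(variance)
    return norm_cdf(index / sd) * index + sd * norm_pdf(index / sd)


def average_tobit_mean(w_beta, hetero_index, sigma2_a, sigma2_u):
    """Sample average of the Tobit mean at the combined variance.

    w_beta       -- estimated index w_t * beta at the chosen w_t (a scalar)
    hetero_index -- one value per individual of
                    alpha0 + r_i0 * alpha1 + z_i * alpha2 (estimated)
    """
    total_variance = sigma2_a + sigma2_u
    values = [tobit_mean(w_beta + h, total_variance) for h in hetero_index]
    return sum(values) / len(values)


# Hypothetical inputs, for illustration only.
hetero_index = [-0.4, 0.1, 0.3, 0.8]
ape_level = average_tobit_mean(0.5, hetero_index, sigma2_a=0.6, sigma2_u=1.0)
print(round(ape_level, 4))
```

Evaluating `average_tobit_mean` at two values of `w_beta` and differencing gives an estimated average partial effect for a discrete change in the covariates.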

Other corner solution responses can be handled similarly. For example, suppose yit is a fractional variable that can take on the values zero and one with positive probability. Then we can define yit as a doubly-censored version of the latent variable equation image introduced earlier. Standard software that estimates two-limit random effects Tobit models is readily applied.

5.3. Dynamic Poisson Model

As in Section 2, we assume that $y_{it}$ given $(y_{i,t-1}, \ldots, y_{i0}, z_i, c_i)$ has a Poisson distribution with mean given in equation (4). For Assumption 3, write

\[
c_i = a_i \exp(\alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2) \tag{32}
\]

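where $r_{i0}$ is a vector of functions of $y_{i0}$. Assume that $a_i$ is independent of $(y_{i0}, z_i)$ and $a_i \sim \text{Gamma}(\eta, \eta)$, which is analogous to Hausman et al. (1984). Then, for each $t$, $y_{it} \mid (y_{i,t-1}, \ldots, y_{i0}, z_i, a_i)$ has a Poisson distribution with mean

\[
a_i \exp(w_{it}\beta + \alpha_0 + r_{i0}\alpha_1 + z_i\alpha_2) \tag{33}
\]

where, as in Section 5.2, $w_{it} \equiv (z_{it}, g_{i,t-1})$. Call the mean in expression (33) $a_i m_{it}$. Then the density of $(y_{i1}, \ldots, y_{iT})$ given $(z_i, y_{i0}, a_i)$ is obtained, as usual, by the product rule:

\[
\prod_{t=1}^{T} \frac{\exp(-a_i m_{it})(a_i m_{it})^{y_t}}{y_t!} = \left[\prod_{t=1}^{T} \frac{m_{it}^{y_t}}{y_t!}\right] a_i^{n} \exp\left(-a_i \sum_{s=1}^{T} m_{is}\right) \tag{34}
\]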

where n = y1 + · · · + yT. When we integrate out ai with respect to the Gamma (η, η) density, we obtain a density that has the usual random effects Poisson form with Gamma (η, η) heterogeneity, as in Hausman et al. [1984, equation (2.3)]. The difference is that the explanatory variables are (zit, gi, t−1, ri0, zi). This makes estimation especially convenient in software packages that estimate random effects Poisson models with Gamma heterogeneity. The Chamberlain (1992) and Wooldridge (1997) moment estimators are compatible with this MLE analysis in the sense that the moment estimators only use the conditional mean assumption (4).
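The integration over ai has a closed form, which a short computation makes concrete. The sketch below is a hedged illustration in Python rather than the estimator used in any particular package: it assumes the conditional density is a product of Poisson densities with means ai·mit and that ai has the Gamma(η, η) density η^η a^(η−1) exp(−ηa)/Γ(η). Under those assumptions the integral equals the non-ai part of the density times η^η Γ(n + η) / [Γ(η)(Σt mit + η)^(n+η)]. The function name and the example values are hypothetical.

```python
import math


def re_poisson_loglik(y, m, eta):
    """Log density of (y_1, ..., y_T) with a_i ~ Gamma(eta, eta) integrated out.

    y   -- list of counts y_1, ..., y_T
    m   -- list of means m_i1, ..., m_iT (each excludes the factor a_i)
    eta -- Gamma shape and rate parameter, so that E(a_i) = 1
    """
    n = sum(y)
    total_m = sum(m)
    # Part of the density that does not involve a_i:
    # sum over t of y_t * log(m_t) - log(y_t!)
    kernel = sum(yt * math.log(mt) - math.lgamma(yt + 1) for yt, mt in zip(y, m))
    # Log of the integral of a^n * exp(-a * total_m) against the Gamma density:
    # eta^eta / Gamma(eta) * Gamma(n + eta) / (total_m + eta)^(n + eta)
    mixing = (eta * math.log(eta) - math.lgamma(eta)
              + math.lgamma(n + eta) - (n + eta) * math.log(total_m + eta))
    return kernel + mixing


# Hypothetical individual with T = 3 periods.
example = re_poisson_loglik(y=[0, 2, 1], m=[0.8, 1.1, 0.9], eta=2.0)
print(round(example, 4))
```

Summing this function over individuals gives a log-likelihood in the coefficients (through mit) and η that a general-purpose optimizer could maximize.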


Vella and Verbeek (1998) (hereafter, VV) use panel data on working men to estimate the union wage differential, accounting for unobserved heterogeneity. I use their data to estimate a simple model of union membership dynamics. Most of the explanatory variables in VV's data set are constant over time. One variable that does change over time is marital status (marrit). A simple dynamic model of union membership is

equation image(35)

where t = 1 corresponds to 1981 and t = T corresponds to 1987. The initial time period is 1980. The unobserved effect, ci, is assumed to satisfy assumption (15), where zi is the 1 × T vector of marital status indicators and yi0 = unioni0. The ηt are unrestricted year intercepts.

Column (1) in Table I contains the maximum likelihood estimates. These were obtained simply by using the Stata®7.0 ‘xtprobit’ command, where a full set of time dummies (not shown), current marital status, lagged union status, union membership status in 1980 (union0) and the marital status dummy variables for 1981 through 1987 (marr1 through marr7) are included as explanatory variables. Asymptotic standard errors are given in parentheses.
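The regressor list just described can be assembled from the raw panel before it is passed to any random effects probit routine. The following is a minimal Python sketch under stated assumptions: the data layout, the function name `build_rows`, the column names `union_lag` and `year2`, and the toy panel are all hypothetical, and they are neither VV's data nor the Stata code used for Table I. Dropping the first year's dummy as the base category is likewise an assumption.

```python
def build_rows(panel):
    """Turn {person_id: {"union": [...], "marr": [...]}} into estimation rows.

    Each list holds T + 1 yearly values, with position 0 the initial period.
    Returns one dict per person-year for t = 1, ..., T.
    """
    rows = []
    for person_id, series in panel.items():
        union, marr = series["union"], series["marr"]
        n_periods = len(union) - 1  # T
        # Time-constant controls: initial union status and the full
        # marital status history marr1, ..., marrT.
        fixed = {"union0": union[0]}
        for r in range(1, n_periods + 1):
            fixed["marr%d" % r] = marr[r]
        for t in range(1, n_periods + 1):
            row = {
                "id": person_id,
                "t": t,
                "union": union[t],          # dependent variable
                "union_lag": union[t - 1],  # lagged union status
                "marr": marr[t],            # current marital status
            }
            # Year dummies for t = 2, ..., T (t = 1 is the base year).
            for s in range(2, n_periods + 1):
                row["year%d" % s] = 1 if s == t else 0
            row.update(fixed)
            rows.append(row)
    return rows


# Toy panel with T = 2 (three yearly observations per person).
toy_panel = {
    1: {"union": [1, 0, 1], "marr": [0, 1, 1]},
    2: {"union": [0, 0, 0], "marr": [1, 1, 0]},
}
toy_rows = build_rows(toy_panel)
print(len(toy_rows), sorted(toy_rows[0]))
```

Each row repeats `union0` and the full set of `marr1`, ..., `marrT` values for that person, which is what lets a standard random effects routine absorb the correlation between the unobserved effect and these variables.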

Table I. Dependent variable: uniont

Explanatory variable      (1)          (2)
$\hat{\sigma}_a$          1.129        1.099
Log-likelihood value      −1,287.48    −1,283.39

Even after controlling for the unobserved effect using the model in Section 5.1, the coefficient on the lagged union status variable is highly statistically significant. The initial value of union status is also very important, which implies that there is substantial correlation between the unobserved heterogeneity and the initial condition. In fact, the coefficient on union0 (1.514) is much larger than the coefficient on uniont−1 (0.875).

Getting married is estimated to have a marginally significant effect on belonging to a union, with a t statistic of about 1.51. The variables marr1, …, marr7 are included to allow for correlation between ci and marital status in all time periods. There is no clear pattern to the coefficients, and only marr7 is statistically different from zero at the 5% level.

In order to explicitly control for some observed heterogeneity, column (2) of Table I includes the time-constant variables educ and black. While we cannot necessarily identify the causal effects of education and race on union membership, we can include them in the model for unobserved heterogeneity in equation (15), which means we just add them as explanatory variables. The coefficient on educ is statistically insignificant, while blacks are significantly more likely to belong to a union. Interestingly, even after educ and black are included, there is much unobserved heterogeneity that cannot be explained by union0, marr1, …, marr7, educ and black: equation image. This means that the unobserved effect ai = ci − E(ci|unioni0, marri1, …, marri7, educi, blacki) accounts for about 56% of the unexplained variance of the composite error, ai + uit, where uit has a conditional standard normal distribution.
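The variance share quoted above follows from a one-line calculation. This short Python sketch assumes that ai and uit are independent, that uit has unit variance as in a probit model, and that the two entries in the Table I row above the log-likelihood are estimates of the standard deviation of ai; the last point is an assumption about a label that did not survive in this copy of the table. The function name is hypothetical.

```python
def heterogeneity_share(sigma_a, sigma_u=1.0):
    """Share of Var(a_i + u_it) due to a_i, for independent a_i and u_it."""
    return sigma_a ** 2 / (sigma_a ** 2 + sigma_u ** 2)


# Values taken from the two columns of Table I.
for column, sigma_a in (("(1)", 1.129), ("(2)", 1.099)):
    print(column, round(heterogeneity_share(sigma_a), 3))
```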

To get at the magnitudes of the state dependence, we estimate the probability of being in a union in 1987 given that the man is or is not in a union in 1986, broken down also by marital status. As discussed in Section 5.1, we average out ci using equation (27). Specifically, Table II reports

equation image

for uniont−1 = 0 or 1 and marrt = 0 or 1, where −0.0738 is the coefficient on the 1987 year dummy and equation image. The equation image are from column (1) of Table I.

For a married man belonging to a union in 1986, the estimated probability of belonging to a union in 1987—averaged across the distribution of ci—is 0.408. For a married man not belonging to a union in 1986, the estimated probability is 0.226. The difference, 0.182, is an estimate of the state dependence of union membership. The magnitude for unmarried men, 0.173, is similar.

Table II. Estimated probability of being in a union, 1987

                       In union, 1986    Not in union, 1986
Married, 1987          0.408             0.226
Not married, 1987      0.370             0.197
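The entries of Table II are sample averages of probit response probabilities, and the state dependence estimates are differences between them. The sketch below shows the mechanics in Python under one assumption that should be checked against Section 5.1: averaging out the normally distributed heterogeneity divides the whole index by (1 + σa²)^(1/2). The function names, argument names and the inputs to `state_dependence` are hypothetical and are not VV's data; only the four probabilities in `table2` come from Table II.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def averaged_probability(index_fixed, hetero_index, sigma_a):
    """Average over i of Phi[(index_fixed + h_i) / sqrt(1 + sigma_a^2)].

    index_fixed  -- year intercept plus coefficients times the chosen
                    values of current marital status and lagged union status
    hetero_index -- one value h_i per individual: the fitted part of the
                    heterogeneity that depends on union_i0 and marr_i1..marr_iT
    """
    scale = math.sqrt(1.0 + sigma_a ** 2)
    probs = [norm_cdf((index_fixed + h) / scale) for h in hetero_index]
    return sum(probs) / len(probs)


def state_dependence(index_without_lag, rho, hetero_index, sigma_a):
    """Averaged probability with lagged status 1 minus that with lagged status 0."""
    p1 = averaged_probability(index_without_lag + rho, hetero_index, sigma_a)
    p0 = averaged_probability(index_without_lag, hetero_index, sigma_a)
    return p1 - p0


# Hypothetical inputs, for illustration only.
gap = state_dependence(-1.2, 0.9, [-0.3, 0.0, 0.4, 1.5], sigma_a=1.1)
print(round(gap, 3))

# Differences implied by the reported Table II entries.
table2 = {"married": (0.408, 0.226), "not married": (0.370, 0.197)}
for status, (p_union, p_no_union) in table2.items():
    print(status, round(p_union - p_no_union, 3))
```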

One way to extend the model is to allow for the interaction term marrit·unioni, t−1 in the structural model. It is then natural to add the interactions between marrir, r = 1, …, T, and unioni0 in the distribution D(ci|unioni0, marri1, …, marri7). When added to the model in column (2) of Table I, the coefficient on marrit·unioni, t−1 has a t statistic of only 0.84 and the p-value for exclusion of all eight interaction terms is 0.981.


I have suggested a general method for handling the initial conditions problem in a dynamic, nonlinear, unobserved effects panel data model. The key insight is that, in general nonlinear models, we can use a joint density conditional on the strictly exogenous variables and the initial condition. Working from an early version of this paper, Erdem and Sun (2001) applied the approach to choice dynamics for five different products and found strong evidence of state dependence in product choice.

The conditional density in Assumption 3 can be modelled flexibly, but perhaps the most important contribution of the paper is that it shows how to obtain simple estimators in dynamic probit, Tobit and Poisson unobserved effects models for specific choices of the auxiliary density. In addition, we have shown how to obtain simple estimates of the partial effects averaged across the distribution of the unobserved heterogeneity; hopefully, APEs will be routinely reported in future empirical work.

Many issues can be studied in future research. For one, it is important to know the consequences of misspecifying the density in Assumption 3. Intuitively, as the size of the cross-section increases, we can make h(c|y0, z;δ) more and more flexible. Unless nonlinearities in the model are caused by true data censoring, any study to evaluate the impact of various choices of h(c|y0, z;δ) should focus on estimates of APEs, not just on θo. As is well known, it often makes no sense to compare parameter estimates across different nonlinear models.

The approach proposed in Section 3 can be modified when some of the explanatory variables fail the strict exogeneity requirement. Wooldridge (2000) lays out a framework for handling models with feedback, but specific implementation issues have yet to be explored.


I would like to thank two anonymous referees, the participants at the Michigan, Michigan State, North Carolina State and Penn State econometrics workshops, and the attendees of the ESRC Study Group Econometrics Conference in Bristol, UK, July 2000, for many helpful comments and suggestions. I am especially grateful to Brandon Bartels for catching two mistakes in coding some of the variables in my original analysis. This is a revised version of ‘The initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity’, University of Bristol Department of Economics Working Paper No. 00/496.