A semiparametric model for binary response and continuous outcomes under index heteroscedasticity



This paper formulates a likelihood-based estimator for a double-index, semiparametric binary response equation. A novel feature of this estimator is that it is based on density estimation under local smoothing. While the proofs differ from those based on alternative density estimators, the finite sample performance of the estimator is significantly improved. As binary responses often appear as endogenous regressors in continuous outcome equations, we also develop an optimal instrumental variables estimator in this context. For this purpose, we specialize the double-index model for binary response to one with heteroscedasticity that depends on an index different from that underlying the ‘mean response’. We show that such (multiplicative) heteroscedasticity, whose form is not parametrically specified, effectively induces exclusion restrictions on the outcomes equation. The estimator developed exploits such identifying information. We provide simulation evidence on the favorable performance of the estimators and illustrate their use through an empirical application on the determinants, and effect, of attendance at a government-financed school. Copyright © 2009 John Wiley & Sons, Ltd.


The last 30 years have witnessed the introduction of several estimators for the semiparametric binary response model under minimal distributional assumptions on the disturbance terms (see, for example, Manski, 1975, 1985; Horowitz, 1992; Powell et al., 1989; Ichimura, 1993; Klein and Spady, 1993). Much of the focus on relaxing distributional assumptions in the binary response model was motivated by the fact that maximum likelihood estimation of discrete choice models would generally lead to inconsistent estimates if the underlying distribution was incorrectly chosen.

In addition to its ‘shape’, the error distribution may also be misspecified in the manner in which it depends on the explanatory variables. For example, if the error exhibits multiplicative heteroscedasticity that is not a function of the ‘mean’ response, then only the above-mentioned estimators of Manski and Horowitz are consistent. However, these estimators will not recover binary response probabilities or marginal effects. By estimating binary quantile models, Kordas (2006) obtains interval estimates of the probabilities under general conditions. Khan obtains marginal effects for a more general model than that considered here, but the estimator may be subject to the ‘curse of dimensionality’ when the model contains many explanatory variables. One of the main objectives of the present paper is to obtain probabilities and associated marginal effects that are reasonably estimated when the dimension of the explanatory variables may be large. Accordingly, we model a binary response probability as depending on two indices, where the distribution of the error may depend on the explanatory variables through one or both of the indices. This specification allows for, but is not restricted to, multiplicative heteroscedasticity that depends on one or both indices.

To estimate the binary response model described above, we extend the estimator in Klein and Spady (1993). The estimator in Klein and Spady depends on a single-index assumption, which in the present context would imply that it can handle heteroscedasticity only if the ‘error’ distribution depends on the same index that determines the ‘mean response’. Here we allow a double index formulation in which the index underlying the ‘mean response’ may differ from that upon which heteroscedasticity depends. Such an index formulation is particularly important in view of a result due to Chen and Khan (2003). They consider a binary response model where the heteroscedasticity depends on an unknown function of the explanatory variables and does not have an index structure. In this case, they show there does not exist a √N-consistent estimator for the model's parameters. Here, we will obtain a √N-consistent estimator under an index specification. As an extension of Klein and Spady (1993), we conjecture that when the error in the binary response model is independent of the explanatory variables, the resulting estimator is efficient in a general class of models that satisfy a double-index restriction.1

It should be emphasized that the estimator developed here depends on density estimators obtained under estimated local smoothing, where underlying density estimators are based on windows that vary for each observation in the sample. This is analogous to characterizing a distribution with a histogram in which the bin interval is allowed to vary depending on whether one is in the tails of the unknown density (where observations are sparse) or in regions where the true density is ‘high’. With such local smoothing, the proofs for the asymptotic properties of the estimator formulated here substantially differ from those in the literature that employ bias-reducing kernels. We pursue this strategy first because density estimators under local smoothing have mean-squared-error optimal properties (Abramson, 1982). Second, and most importantly, in the present context we have found that the finite sample performance of the estimator for the binary response model is much improved under local smoothing in contrast to bias-reducing kernels. We also found further improvements in the finite sample performance of the estimators by employing dependent kernels that depend on an estimated sample covariance matrix as advocated by Fukunaga (1972). Accordingly, all proofs in this paper are for estimation under local smoothing and dependent bivariate kernels.

In adopting the above smoothing strategy, we have found it necessary to employ a property of the derivative of the semiparametric probability function due to Whitney Newey. Namely, when this derivative is taken with respect to index parameters and then evaluated at the true parameter values, it coincides with the corresponding parametric derivative minus its conditional expectation (conditioned on the indices). This ‘residual-type’ property of the derivative function is important below in controlling the bias in gradient terms in the asymptotic normality argument. As is typical for many semiparametric estimators, we will need to downweight (trim) observations where density denominators become ‘too small’. To exploit the residual property of the semiparametric derivative, we will employ a trimming strategy that depends on estimated indices as opposed to the explanatory variables.
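For intuition, the residual-type property can be written out in the single-index special case. The notation below (the index V(η), the function F, and the normalization of the coefficient on X1 to one) is assumed for this illustration only; the paper itself works with two indices.

```latex
% Illustrative single-index case; the notation is assumed, not the paper's.
% Index: V_i(\eta) = X_{1i} + X_{3i}\eta, with E[Y_i \mid X_i] = F(V_i(\eta_0)).
\[
P_i(\eta) \equiv E\left[ Y_i \mid V_i(\eta) \right],
\qquad
\left. \frac{\partial P_i(\eta)}{\partial \eta} \right|_{\eta = \eta_0}
= F'\!\left( V_i(\eta_0) \right)
  \left\{ X_{3i} - E\left[ X_{3i} \mid V_i(\eta_0) \right] \right\}.
\]
% The parametric derivative is F'(V_i(\eta_0)) X_{3i}; subtracting its
% conditional expectation given the index gives the expression above, so
\[
E\left[ \left. \frac{\partial P_i(\eta)}{\partial \eta} \right|_{\eta = \eta_0}
  \,\middle|\, V_i(\eta_0) \right] = 0 .
\]
```

Because the derivative behaves like a regression residual given the index, terms in the gradient that are functions of the index alone are orthogonal to it, which is what allows the bias to be controlled.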

The estimator developed here for the binary response model is also related to those of Ichimura and Lee (1991) and Lee (1995), who examine alternative multiple-index models. While the present paper makes use of several key identification results of the Ichimura and Lee paper, it differs from both in several important respects. First, and most important, we have formulated the estimator and all proofs for the case of estimated local smoothing rather than bias-reducing kernels. Second, we make use of identification results in Ichimura and Lee without imposing exclusion restrictions on the indices. We emphasize that we are not concerned here with recovering the original parameters in the binary response model (which even in the presence of exclusion restrictions are still only obtained up to location and scale). Rather, we are interested in estimating those identifiable functions of the parameters that suffice to identify the semiparametric probability function. It can be argued that with binary response models one is generally not concerned with the parameters themselves but rather with the response probability and marginal effects. Such marginal effects, which examine how the probability function changes as the explanatory variables change, are identified once the probability function is identified. Moreover, while the entire probability function converges pointwise and uniformly to the true function at a rate below the parametric rate of √N, averaged marginal effects converge at the parametric rate. The original parameter values of the model are not required for such identification. In part, for this reason we focus on identifying the probability function itself rather than index parameters.
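The point that marginal effects require only the probability function, and not the underlying index parameters, can be illustrated with a minimal sketch. The probability function, the coefficient values PI and GAMMA, the exponential scale function and the helper names are all hypothetical choices made for this example; in practice the estimated semiparametric probability function would take the place of `prob`.

```python
import math
import random


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def average_marginal_effect(prob, rows, k, step=1e-5):
    """Sample average of the central-difference derivative of prob in variable k."""
    total = 0.0
    for x in rows:
        up = list(x)
        down = list(x)
        up[k] += step
        down[k] -= step
        total += (prob(up) - prob(down)) / (2.0 * step)
    return total / len(rows)


# Hypothetical double-index probability P(x'pi / S(x'gamma)),
# with S = exp(.) and P the normal cdf (both assumed for illustration).
PI = [1.0, -0.5]
GAMMA = [0.3, 0.6]


def prob(x):
    mean_index = sum(p * v for p, v in zip(PI, x))
    scale_index = sum(g * v for g, v in zip(GAMMA, x))
    return norm_cdf(mean_index / math.exp(scale_index))


random.seed(1)
rows = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
ame = [average_marginal_effect(prob, rows, k) for k in range(2)]
```

The function `average_marginal_effect` only ever calls `prob`, so any location or scale normalization of the index parameters that leaves the probability function unchanged leaves the averaged effects unchanged as well.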

While one of our primary objectives is to provide an estimator for this double-index binary choice model,2 we note that applied researchers have become increasingly interested in larger systems in which the choice appears in another equation as an endogenous regressor. This type of model, frequently referred to as an endogenous binary treatment model, is at best poorly identified without an exclusion restriction. The well-known problem here is that the treatment probability, which would serve as an instrument for estimating the continuous outcomes equation, is often approximately linear in its argument. In the absence of an exclusion restriction on the continuous outcome equation, the instrument is then very close to being linearly related to the same exogenous variables in the continuous equation of interest. To resolve this problem here, we consider the case of multiplicative heteroscedasticity in the binary response equation, which is some function of the explanatory variables X. Write this function as S(X). In the next section we show that such heteroscedasticity may be viewed as inducing exclusion restrictions on the continuous outcomes equation. With no parametric assumptions on S(X) (other than that it depends on one or two indices) and with no parametric assumptions on the distribution of the error term in the binary response model, below we will develop an estimator that exploits such identifying information. We will then show that such information is useful both in theory and in practice (as indicated in a series of Monte Carlo experiments and in an empirical application).

For continuous simultaneous equations models, other authors have exploited heteroscedasticity as an identification strategy. For example, in a semiparametric formulation, Klein and Vella exploit such information to identify and estimate triangular simultaneous equations models without exclusion restrictions. In parametric formulations, Rummery et al. (1999) and Rigobon (2003) also exploit heteroscedasticity as an identification strategy for simultaneous equations. From the structure of the problem considered here, there is information in higher-order powers of the X's that could be exploited to construct instruments for the outcomes equation. Dagenais and Dagenais (1997) and Lewbel (1997) exploit such information in models with measurement error. In this paper, since the nature of the heteroscedastic function in the treatment equation is unknown, it is unclear which higher orders of the X's should be used as instruments. Consequently, we pursue an alternative strategy here that involves direct estimation of a double-index binary response model. One could attempt to bypass estimation of this equation and determine the appropriate higher orders of X's to use as instruments by extending Donald and Newey (2001) to the model considered here. However, as the treatment probability is itself of direct interest, we pursue an alternative strategy that employs the estimated treatment probability in estimating the continuous outcomes equation. In the present context, the conditional treatment probability is an optimal instrument (Amemiya, 1975).

The next section outlines the model and the estimation methods. In Section 3 we provide and discuss the assumptions required to establish asymptotic results. When estimating the treatment effect, we note that our procedure is of particular value when there are no exclusion restrictions which provide instruments. Accordingly, we focus on identification in the absence of conventional exclusion restrictions. In Section 4 we establish the asymptotic properties of the estimators for both the binary response and outcome models. In so doing, we sketch out the proofs, and provide complete technical details in the Appendix, which is available online from Wiley Interscience. The proof strategy differs from other arguments in the literature as it relies on estimated local smoothing. Section 5 provides simulation evidence. In Section 6 we provide an empirical application where an individual's total education level (the outcome) depends in part on whether or not the individual attended a state-financed high school in Australia (the treatment). Section 7 concludes.


Consider the following model:

$$Y_{1i} = X_i \beta_0 + \theta_0 Y_{2i} + u_i \tag{1}$$
$$Y_{2i} = \{ X_i \pi_0 > v_i \} \tag{2}$$

where Y1i is the outcome variable and Y2i is a dummy endogenous variable defined through the indicator function {·}; Xi is a vector of exogenous variables; β0, π0 and θ0 are unknown true parameter values; and ui and vi are random disturbances. While the treatment effect, θ0, is invariant across individuals, this assumption can be relaxed as in the empirical application. The disturbances can be characterized as

equation image(3)
$$v_i = S(X_i \gamma_0)\, v_i^* \tag{4}$$

where S(·) is an unknown (positive and non-constant) function; γ0 is an unknown parameter vector; and v*i is a homoscedastic random disturbance which is independent of the elements of Xi but dependent on ui. The model allows heteroscedasticity in each equation, though we only model it explicitly in index form for the binary response model. Note that there may or may not be known restrictions on the parameters in the above model. For example, suppose X≡[X[1], X[2]], where X[2] contains powers and cross-products of the ‘basis’ elements in X[1]. Then, in some formulations it will be reasonable to restrict the elements of β0 and π0 so that the ‘mean effects’ only depend on X[1]. In contrast, one may want to let heteroscedasticity, S, depend on the basis elements X[1] and the higher-order terms X[2]. Alternatively, we could interpret X itself as containing the ‘basis variables’ for the model and impose no exclusion restrictions on β0, π0, or γ0. Because of the aspects of the above model in which we are interested, we permit and indeed focus on this second case of no exclusion restrictions. The estimator developed here is for a model more general than above, but we will specialize to the above case for expositional convenience.

For the model in (1)–(4), the treatment probability has the form

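$$\Pr(Y_{2i} = 1 \mid X_i) = P\!\left( \frac{X_i \pi_0}{S(X_i \gamma_0)} \right) \equiv P(Z_i \pi_0), \qquad Z_i \equiv \frac{X_i}{S(X_i \gamma_0)} \tag{5}$$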

where P(·) is the distribution function for v*i. We estimate this probability function in a double-index formulation based on local smoothing. The estimator will depend neither on the functional form for S nor on the distribution of the disturbances.

We can also employ this probability function as an (optimal) instrument for estimating the continuous outcomes equation. Here we make several observations. First, if there is no heteroscedasticity in the above model, then effectively Z = X, in which case the model can be poorly identified because P is often approximately linear in its argument. When the argument of P is Xπ0 (i.e., Z = X), it is still possible to identify the model provided that P is not linear in Xπ0. However, this form of non-linearity in the function P itself will typically occur in the tails of the equation image and thus relies on a small fraction of the sample for identification. In contrast, in the presence of heteroscedasticity, Z no longer coincides with X and indeed will typically be linearly independent of the columns of X. Consequently, the Z variables are effectively excluded from the continuous outcomes equation. Such induced exclusion restrictions serve to identify the model even in the region of the data for which P is linear in Z.
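A small simulation may help fix ideas about how the treatment probability can serve as an instrument when heteroscedasticity is present. Everything specific in this sketch is an assumption made for illustration: a single explanatory variable, normal errors, the scale function S = exp(·), the parameter values, and the use of the true (rather than estimated) treatment probability as the instrument. The helper names `solve` and `iv_fit` are hypothetical.

```python
import math
import random


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def iv_fit(instruments, regressors, y):
    """Just-identified IV: solve (Z'R) b = Z'y. With Z = R this reduces to OLS."""
    k = len(regressors[0])
    zr = [[sum(z[a] * r[b] for z, r in zip(instruments, regressors))
           for b in range(k)] for a in range(k)]
    zy = [sum(z[a] * yi for z, yi in zip(instruments, y)) for a in range(k)]
    return solve(zr, zy)


random.seed(2)
N, THETA0, PI0, GAMMA0 = 20000, 1.0, 1.0, 0.7   # illustrative values
regs, insts, y1 = [], [], []
for _ in range(N):
    x = random.uniform(-2.0, 2.0)
    s = math.exp(GAMMA0 * x)                    # assumed scale function S
    v_star = random.gauss(0.0, 1.0)             # unscaled treatment error
    u = 0.5 * v_star + 0.3 * random.gauss(0.0, 1.0)   # outcome error depends on v*
    y2 = 1.0 if PI0 * x > s * v_star else 0.0   # treatment indicator
    p = norm_cdf(PI0 * x / s)                   # treatment probability given x
    regs.append([1.0, x, y2])
    insts.append([1.0, x, p])
    y1.append(0.5 + 1.0 * x + THETA0 * y2 + u)

theta_ols = iv_fit(regs, regs, y1)[2]    # ignores endogeneity of the treatment
theta_iv = iv_fit(insts, regs, y1)[2]    # uses the treatment probability as instrument
```

No variable is excluded from the outcome equation in this design; the instrument differs from a linear function of x only because the scale function bends the probability. Under these assumed settings one would expect `theta_iv` to lie much closer to the treatment effect than `theta_ols`.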


We now provide the assumptions and definitions that we employ to establish the asymptotic properties for the estimator.

  • A1. The data. The data (Y1i, Y2i, Xi), i = 1, …, N, are i.i.d. observations from the model in (1)–(4). With X as the N × K matrix of observations on the explanatory variables and with 1 as an N × 1 column vector of ones, the columns of [X, 1] are linearly independent with probability 1.
  • A2. Errors. The error in the continuous outcomes equation (1), ui, is independent over i with E(ui|Xi) = 0 and with equation image uniformly bounded. The error in the binary response model (2) is given as
    $$v_i = S(X_i \gamma_0)\, v_i^*$$
    where the unscaled error, v*i, is i.i.d., supported on the real line,3 and has finite variance. The scaling function S(·) is finite, bounded away from zero, and is not constant. The vector Xi is independent of the unscaled error v*i.
  • A3. Parameter space. The vector of true parameter values for the model in (1)–(4) lies in the interior of a compact parameter space, Θ.
  • A4. Index assumptions. Assume that the vector of indices, I, depends on two distinct (functionally independent) continuous variables, X1 and X2. With X3 containing all other explanatory variables, write
    equation image
    and assume that the 2 × 2 submatrix Γc has rank 2.
  • A5. Reparameterized model. With η≡(η31, η32), define
    equation image
    Under this reparameterization, note that P(Y = 1|I) = P(Y = 1|W). Define W* by replacing η above with η*. With x a realized value of X, write w≡xβ(η) and w*≡xβ(η*). Assume
    equation image
    Following Ichimura and Lee (1991), let equation image. Then, write:
    equation image
    Assume that there exists a set of positive probability on which the above equality may be differentiated with respect to the continuous elements of x3 with t held fixed. Further assume that condition (4) of Ichimura and Lee (1991, Lemma 3) holds.
  • A6. Densities. Assume that all observed continuous variables in the binary-response model have compact support. To provide required smoothness conditions, let Xc≡(X1, X2) be the vector of continuous variables in (A5). Then, with f(xc|X3, Y2) as the indicated conditional density for Xc, denote equation image as the ith and jth cross-partial with respect to the elements of xc≡[x1, x2]. Then, with equation image, assume that f(w|·) has positive support on a compact set A, is bounded away from 0 on any compact subset of its support, and that on equation image is bounded above by a positive finite constant for i + j≤4.

Assumptions A1–3 define the index model that we propose to estimate. An index formulation of low dimension is important for obtaining reasonable results in finite samples. Note that this index assumption permits a more general error structure than that shown in A2. Namely, we require that the binary response probability depend on two indices, but do not otherwise restrict the manner in which the probability depends on the indices. The particular double-index structure implicit in A2 provides a convenient motivating case.

With the possible exception of assumptions A4–5, the above assumptions are somewhat standard in index models. Assumptions A4–5 essentially provide identification conditions. To motivate these assumptions, note that the W-parameterization in A5 is equivalent to the I-parameterization in A4 as both yield the same conditional probability function in x. We employ the W-parameterization to allow for the possibility that there may not be exclusion restrictions in the original I-parameterization. In this lower-dimensional parameterization, we then seek to identify the (nuisance) parameters η. Before proceeding, we note that these parameters have no natural interpretation as they are linear functions of the model's original parameters. However, if these parameters are identified, we can easily recover the binary response probability function and identify the marginal effects which measure how the response probability changes in response to changes in x. Moreover, asymptotic properties for these estimated marginal effects will readily follow from those for η̂. Finally, as elaborated below, the probability function is of interest in estimating a continuous outcomes equation that depends on the binary response variable.

Having reparameterized the model in A5 we then assume that the W-parameterization satisfies the identification conditions in Ichimura and Lee (1991).4 The condition on discrete variables is that given by Ichimura and Lee to identify their coefficients. Note that these identification conditions are based on the underlying assumption of a double-index model. In presenting simulation results, we will present results both for double- and single-index models. If a single-index model generates the data, it will not be possible to identify all of the parameters of a double-index specification. However, it is still possible to identify the probability function of interest. As the focus of this paper is on a double-index specification we defer further discussion of this issue to the simulation section.

Assumption A6 provides smoothness conditions. These conditions and densities satisfying them are discussed in Klein and Spady (1993, p. 393). It is possible to relax the compact support assumption at some technical expense in the proofs.5

In addition to the above assumptions, we also need a number of conditions or definitions that define the densities and probability functions of interest. Throughout, we employ kernel density estimators to estimate the semiparametric probability function entering a quasi likelihood. As is standard in this literature, such density estimators need to have an appropriately low order of bias. Here, we obtain bias reduction first by employing local smoothing as developed by Abramson (1982) and discussed in Silverman (1986). Such local smoothing requires that the windows in the final kernel density estimator vary by observation and depend on a pilot density estimator. Not surprisingly, these windows satisfy the intuitive requirement that they be smaller in the center of the distribution than in the tails. As a second source of bias reduction, we exploit a property of expected semiparametric probability derivatives. Namely, such derivatives have expected value zero when conditioned on the true indices. As will also be discussed below, to improve the finite sample performance of the estimators, we estimate the density for the vector of indices, W, using kernels that depend on the sample covariance matrix for W. Below, we will first define these estimators and then discuss their properties.

  • D1. Density estimators under local smoothing. Let K be a symmetric, smooth univariate kernel function satisfying condition C8 in Klein and Spady (1993, p. 394). The normal kernel, which is employed in the simulations and the empirical example, satisfies this condition. Let T be a matrix such that equation image, the inverse sample covariance matrix for W given that Y2 = s, s = 0, 1. Partitioning T = [T1 T2]′ conformably with the ith observation on W: Wi = [W1i W2i]′, define
    equation image
    With gs(w) as joint density for W≡[W1, W2] conditioned on Y2 = s, s = 0, 1, and with Ps as the unconditional probability that Y2 = s, define an estimator for fs(w)≡Psgs(w) as
    equation image
    For w = Wi, the above averages are taken over the N − 1 observations for which j ≠ i.
  • D2. Smooth trimming functions. Define a smooth trimming function as:
    equation image
  • D3. Estimated local smoothing parameters. Referring to D1, denote s as the geometric mean of the s(w;h, λ)′s and let equation image. Then, for j = 1, …, N, define estimated local smoothing parameters as
    equation image
    where the parameter a in D2 is set here to 0.01.6
  • D4. Multi-stage local smoothing. Employing D3, the estimator for fs(w) is defined under several stages of local smoothing as
    equation image
    where 1 is a vector of ones. With hi = O(N^{−ri}), set r3 = 1/11 and 0 < δ< r3/2. Then, set r2 = (r3 − δ/2)/2, and r1 = (r3 − δ)/4.7
  • D5. Semiparametric probability function. Define
    equation image
    where ĝ(w)≡f̂1(w) + f̂0(w) estimates the unconditional density for W. To define the Δ adjustment factors, first define the smoothed indicator:
    equation image
    where a1≡ε′r3/4, a2≡ε′r3/5, ĉs = Op(1), and ΔN≡Δ1N + Δ0N.
  • D6. Pilot estimator. Let xk be the lower αth sample quantile for the continuous variable Xk (e.g. α = 0.01) and let k be the upper (1 − α)th sample quantile. For the Kc continuous variables, define the indicators: ik≡{xk < xik < k}, k = 1, …, Kc. In the notation of D1, define a pilot probability estimator as
    equation image
    Then, with equation image, the pilot estimator for η0 is defined as
    equation image
  • D7. Final estimator. With equation image defined in D6, let equation image denote the vector of estimated indices. Denote equation image as the lower βth sample quantile for the equation image and let equation image be the corresponding upper (1 − β)th quantile. With τ as the trimming function in D2, the index trimming function is defined as
    equation image
    Then, with probabilities defined in D5, the final estimator for η0 is defined as
    equation image
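The way a trimmed quasi-likelihood is assembled from kernel estimates can be sketched in a heavily simplified form. The sketch below is not the estimator defined in D1–D7: it assumes the W-parameterization W1 = X1 + η1X3 and W2 = X2 + η2X3, uses a fixed window with a product normal kernel (no local smoothing and no covariance-based kernels), trims on sample quantiles of the X variables only, and replaces the Δ adjustment factors with a crude clipping of the probabilities. All function names and the simulated design are hypothetical.

```python
import math
import random


def indices(row, eta):
    """Assumed W-parameterization: W1 = x1 + eta[0]*x3, W2 = x2 + eta[1]*x3."""
    x1, x2, x3 = row
    return (x1 + eta[0] * x3, x2 + eta[1] * x3)


def x_trim(rows, alpha=0.02):
    """1 if every variable lies strictly between its alpha and 1-alpha sample quantiles."""
    n, keep = len(rows), [1] * len(rows)
    for k in range(len(rows[0])):
        col = sorted(r[k] for r in rows)
        lo, hi = col[int(alpha * n)], col[int((1 - alpha) * n) - 1]
        for i, r in enumerate(rows):
            if not lo < r[k] < hi:
                keep[i] = 0
    return keep


def quasi_loglik(eta, rows, y, h=0.4, trim=None):
    """Trimmed quasi log-likelihood with leave-one-out kernel probabilities."""
    n = len(rows)
    w = [indices(r, eta) for r in rows]
    trim = trim or [1] * n
    total = 0.0
    for i in range(n):
        if not trim[i]:
            continue
        f1 = g = 0.0
        for j in range(n):
            if j == i:
                continue                      # leave observation i out
            d1 = (w[i][0] - w[j][0]) / h
            d2 = (w[i][1] - w[j][1]) / h
            k = math.exp(-0.5 * (d1 * d1 + d2 * d2))
            g += k                            # proportional to the density of W
            f1 += k * y[j]                    # proportional to P(Y=1) times density given Y=1
        p = f1 / g if g > 0.0 else 0.5
        p = min(max(p, 0.01), 0.99)           # crude guard, standing in for the adjustments
        total += y[i] * math.log(p) + (1 - y[i]) * math.log(1.0 - p)
    return total / n


# Simulated double-index design (all values illustrative).
random.seed(3)
ETA0 = (1.0, -1.0)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(400)]
y = []
for r in rows:
    w1, w2 = indices(r, ETA0)
    y.append(1 if 2.0 * w1 > math.exp(0.5 * w2) * random.gauss(0, 1) else 0)

trim = x_trim(rows)
q_true = quasi_loglik(ETA0, rows, y, trim=trim)
q_wrong = quasi_loglik((-1.0, 1.0), rows, y, trim=trim)
```

In this design one would expect the objective to be noticeably larger at the data-generating value of η than at a distant value, which is the feature a numerical maximizer would exploit.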

Before discussing the role of the above definitions, as an overview note that there are two general aspects that need to be addressed in estimating semiparametric models. First, it is necessary to control the bias in the underlying density estimators. As discussed below, here we control this bias by employing local smoothing and exploiting a ‘residual’ property of semiparametric probability functions. Second, it is necessary to downweight or trim those observations for which densities become too small. For reasons discussed below, we employ a trimming strategy outlined in D4–6 that is quite similar to that in Klein and Spady (1993).

In explaining why we have defined various estimators as above, turn first to D1. As discussed by Silverman (1986) and advocated by Fukunaga (1972), we have employed bivariate kernels based on a sample covariance matrix. We ‘match’ this feature of the data as follows. Following Fukunaga (1972) we specify a density estimate for the vector W by first constructing the standardized vector W*≡TW. With the covariance matrix for W* being the identity matrix, the density estimator for W* is then somewhat naturally based on a product of independent kernels. The implied density estimator for W is then that given above. Fukunaga (1972) documents the performance of this estimator in a Monte Carlo study. Here, we have found that we obtain ‘better’ estimates of the parameters of interest when we select a density estimator in this manner.
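A minimal bivariate sketch of this construction follows. It assumes a fixed window and a normal kernel, and it takes T to be the inverse of the Cholesky factor of the sample covariance matrix, which is one matrix satisfying T′T = (sample covariance)⁻¹; the function names and simulated data are hypothetical.

```python
import math
import random


def sphering_matrix(data):
    """Return a 2x2 T with T'T equal to the inverse sample covariance matrix."""
    n = len(data)
    m1 = sum(w[0] for w in data) / n
    m2 = sum(w[1] for w in data) / n
    s11 = sum((w[0] - m1) ** 2 for w in data) / n
    s22 = sum((w[1] - m2) ** 2 for w in data) / n
    s12 = sum((w[0] - m1) * (w[1] - m2) for w in data) / n
    # Cholesky factor of the covariance: [[l11, 0], [l21, l22]]
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    # T is the inverse of that lower-triangular factor
    return [[1.0 / l11, 0.0], [-l21 / (l11 * l22), 1.0 / l22]]


def transform(t, w):
    """Standardized vector W* = T W."""
    return (t[0][0] * w[0] + t[0][1] * w[1], t[1][0] * w[0] + t[1][1] * w[1])


def sphered_kde(w, data, t, h):
    """Density estimate for W from a product of normal kernels applied to T W."""
    det_t = abs(t[0][0] * t[1][1] - t[0][1] * t[1][0])   # Jacobian of w -> T w
    a = transform(t, w)
    total = 0.0
    for wj in data:
        b = transform(t, wj)
        d1, d2 = (a[0] - b[0]) / h, (a[1] - b[1]) / h
        total += math.exp(-0.5 * (d1 * d1 + d2 * d2)) / (2.0 * math.pi)
    return det_t * total / (len(data) * h * h)


# Correlated bivariate sample (illustrative).
random.seed(4)
data = []
for _ in range(500):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    data.append((z1, 0.8 * z1 + 0.6 * z2))

T = sphering_matrix(data)
density_at_origin = sphered_kde((0.0, 0.0), data, T, h=0.4)
```

After the transformation the sample covariance of TW is the identity, so a single window is applied along directions in which the standardized data have equal spread.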

For known local smoothing parameters (bounded away from zero), Abramson (1982) showed that the locally smoothed density estimator is optimal in a mean-squared error sense. This estimator also has the desired bias-reducing properties. As the local smoothing parameters are not known, they must be estimated. In using the estimates, we are able to prove that the resulting density estimators have desired bias-reducing properties when estimated in several stages. Namely, first employ a regular kernel density estimator (λ = 1 in the above notation) to construct estimated local smoothing parameters. Second, obtain a density estimator using these estimated local smoothing parameters. Third, and finally, use this second-stage estimator to reconstruct estimated local smoothing parameters and obtain the final density estimator shown in D4. We have been able to show ‘essentially’ that the bias is reduced at each stage. At the third stage, the order of the bias is equation image. This order is sufficiently small to obtain the asymptotic results below. For technical reasons, we smoothly trim in D2 so as to keep the local smoothing parameters above 1/ln(N).8
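The three stages can be sketched for a univariate sample as follows. This is a simplified illustration rather than the estimator in D1–D4: it uses a normal kernel, square-root (Abramson-type) local factors normalized by a geometric mean, illustrative window values rather than the rates in D4, and a hard floor at 1/ln(N) in place of the smooth trimming in D2. The function names are hypothetical.

```python
import math
import random


def kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kde(x, data, h, lam):
    """Kernel density at x; observation j uses its own window h * lam[j]."""
    return sum(kernel((x - xj) / (h * lj)) / (h * lj)
               for xj, lj in zip(data, lam)) / len(data)


def smoothing_factors(density_at_data, floor):
    """Factors (f_j / geometric mean) ** -0.5, kept at or above the floor."""
    log_g = sum(math.log(f) for f in density_at_data) / len(density_at_data)
    return [max(math.exp(-0.5 * (math.log(f) - log_g)), floor)
            for f in density_at_data]


def multi_stage_factors(data, windows):
    """Stage 1 starts from lam = 1; each later stage rebuilds lam from the
    density estimated at the previous stage."""
    n = len(data)
    floor = 1.0 / math.log(n)        # hard floor, a stand-in for smooth trimming
    lam = [1.0] * n
    for h in windows[:-1]:
        dens = [kde(xj, data, h, lam) for xj in data]
        lam = smoothing_factors(dens, floor)
    return lam


random.seed(5)
sample = [random.gauss(0.0, 1.0) for _ in range(300)]
windows = (0.5, 0.45, 0.4)           # illustrative windows for stages 1 to 3
lam = multi_stage_factors(sample, windows)
final_density_at_zero = kde(0.0, sample, windows[-1], lam)
```

Observations in sparse regions receive factors above one and therefore wider windows, while observations where the estimated density is high receive narrower ones.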

The proofs exploit a residual-like property of the derivative (with respect to the parameters) of the true semiparametric probability function, with this derivative having conditional expectation of zero when evaluated at the true parameter values. By using this property, we can further control for the bias in the gradient to the objective function, which is essential to establishing asymptotic normality. To this end, we first estimate the model under X-trimming. The resulting parameter estimates, which we do not require to be √N-convergent, are employed to obtain estimated indices or index densities. The model is then re-estimated with trimming based on estimated indices or their corresponding estimated densities. Such trimming affords ‘protection’ against small denominators when analyzing the gradient as it will be evaluated at the true parameter values. However, this type of trimming is problematic for analyzing the averaged log-likelihood and the Hessian matrix as we need to examine these components away from the truth. As in Klein and Spady (1993), we employ the Δ adjustment factors in D5 above for this purpose. These factors will vanish exponentially provided the density is not ‘too small’. In this manner, such factors will quickly vanish from the gradient where they are not needed, but will serve to control density denominators when analyzing likelihood and Hessian components.


In this section we provide and discuss the asymptotic properties for the estimator for both equations in the endogenous treatment model defined above. The Appendix contains formal proofs for all required intermediate lemmas and the main theorems given below. For expositional and notational purposes we will consider the more difficult case in which every index in the model depends on a linear combination of variables in X. In practice there will certainly be cases in which exclusion restrictions for the various indices are justified. In what follows, we begin with the binary response model and establish consistency using standard uniform convergence arguments. We then turn to the proofs for asymptotic normality.

4.1. Binary Response

To show that the proposed estimator for the binary response model is consistent, denote η≡[η1, η2] as the (nuisance) ‘reduced form’ parameters entering W1 and W2 as above. Then, with the quasi likelihood denoted by Q̂(η), the estimator for this binary response model is given as

equation image

With the semiparametric probability function given as P̂i(η) in (D5),

equation image

where equation image is a trimming function that is defined and discussed in the Appendix.

Obtain Q(η) from Q̂(η) by replacing P̂i(η) with its uniform probability limit, Pi(η). It can then be shown (see the Appendix) that

equation image

From standard uniform convergence arguments, Q(η) converges in probability and uniformly in η to its expectation, E[Q(η)]. Under conditions for identification given above, E[Q(η)] is uniquely maximized at η0. Therefore, we have established the following theorem.

Theorem 1. Under A1–6 and D1–7,

equation image

Note that consistency of η̂ will imply that the probability function is also consistently estimated. It is then also possible to establish consistency for estimated marginal effects. The asymptotic distribution for marginal effects readily follows from that for the estimated nuisance parameters. In the remainder of this section, we outline the normality argument, with the Appendix providing detailed proofs.

As discussed in the Appendix, we first employ fixed trimming on the basis of the X-variables to obtain a convergence rate for a pilot estimator for the parameters

equation image

Using equation image, we then estimate the two W indices and construct a smooth trimming function based on these indices. With ŵi as the estimator for the indices, denote equation image as the estimated trimming function under index trimming (D7). Employing this trimming function, write the objective function as

equation image

Denote Ĝ(η) and Ĥ(η) as the gradient and Hessian for this objective function. Let Q(η) be the objective function obtained by replacing estimated with true probability functions (to which the estimated functions uniformly tend) and denote H(η) as the Hessian for Q(η). Then, with the Appendix containing the details, from a standard Taylor series expansion of equation image about η0

equation image

The normality result then follows from an analysis of the gradient term.

To outline the argument for the gradient, define an estimated weight function

equation image

The gradient to the objective function is then of the form

equation image

To simplify A1, we first show that the estimated weight may be taken as given by showing

equation image

Since Y − P has conditional expectation of zero, a natural strategy would be to establish the above result by showing that equation image converges to zero. After simplifying A1, in the Appendix we provide the required mean-square convergence argument. It then follows that

equation image

Employing Lemma 8 in the Appendix, we are able to show that A2 converges to zero in probability.

Turning to the term in B1, in the Appendix we establish uniform convergence rates under multi-stage local smoothing (to reduce the bias) for estimated probability functions and their derivatives. In the first stage, estimated local smoothing parameters are constructed as functions of a regular kernel density estimator. These estimated local smoothing parameters are then employed (as variable windows) to re-estimate the density, which is in turn used to reconstruct estimated local smoothing parameters, which in the final stage are used to re-estimate the density. When local smoothing parameters are unknown, we show in the Appendix that this multi-stage approach results in increased convergence rates by reducing the order of the bias in the density estimator. Using these convergence rates, in the Appendix we show
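The multi-stage scheme described above can be illustrated in one dimension. The following is a minimal sketch, not the estimator defined in the Appendix: it assumes a Gaussian kernel, a rule-of-thumb pilot window, and an Abramson-style square-root rule for turning an estimated density into local windows. The function names (`kde`, `local_bandwidths`, `multistage_kde`) are hypothetical, and the trimming and density adjustments used in the paper are omitted.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(points, data, bandwidths):
    """Density estimate at each point; bandwidths[j] is the window attached to data[j]."""
    n = len(data)
    return [
        sum(gaussian_kernel((x - xj) / hj) / hj for xj, hj in zip(data, bandwidths)) / n
        for x in points
    ]

def local_bandwidths(density_at_data, h):
    """Assumed rule: window_j = h * (f_hat_j / geometric mean of f_hat) ** -0.5."""
    log_mean = sum(math.log(f) for f in density_at_data) / len(density_at_data)
    geo_mean = math.exp(log_mean)
    return [h * (f / geo_mean) ** -0.5 for f in density_at_data]

def multistage_kde(data, h, stages=2):
    # Pilot stage: regular kernel estimator with a single fixed window.
    windows = [h] * len(data)
    f_hat = kde(data, data, windows)
    # Each later stage rebuilds the local windows from the latest density
    # estimate and then re-estimates the density with those variable windows.
    for _ in range(stages):
        windows = local_bandwidths(f_hat, h)
        f_hat = kde(data, data, windows)
    return f_hat, windows

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]
h0 = 1.06 * len(sample) ** -0.2  # rule-of-thumb pilot window (an assumption)
density, windows = multistage_kde(sample, h0)
```

With `stages=2` the loop mirrors the sequence in the text: local windows from the pilot estimate, a re-estimated density, reconstructed local windows, and a final density estimate. Windows widen where the estimated density is low and narrow where it is high.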

equation image

We then have

equation image

From the above

equation image

Recall that for technical reasons the estimated probability function was defined as a ratio of adjusted, estimated densities. With the adjustment factors vanishing exponentially under trimming, we may ignore these adjustments and replace i with i/ĝi. In the Appendix (using a uniform convergence argument similar to that above), it is shown that

equation image

It now follows that B1 simplifies to

equation image

To further analyze B1 above, it is important to show that it is ‘nearly’ unbiased: E(R) = o(N^(−1/2)). With bias-reducing kernels

equation image

because the density estimators are ‘nearly’ unbiased. However, once we have shown that the gradient has the above form (under locally smoothed kernels), we can control the bias by exploiting a property of semiparametric probability derivatives. Let

equation image

where δ only depends on index values. Then

equation image

from the residual property of the semiparametric probability derivative. By exploiting this property, the Appendix employs a mean-square convergence argument to show that B1 converges to zero in probability (as does the comparable term in single-index models). Similar to the analysis for A2, using Lemma 8 in the Appendix, it can be shown that B2 also vanishes in probability.

Employing the above results:

equation image

Noting that an information equality holds for this problem, a standard central limit theorem then gives asymptotic normality in the theorem below.

Theorem 2. Under A1–6 and D1–7

equation image

4.2. The Outcomes Equation

With θo≡[βo, µo] and Z≡[X, Y2], recall that this equation is given as

equation image

Then, letting Ẑ*(η)≡[X, P̂(η)] be an instrument for Z, the IV estimator is given as

equation image

From Lemma 9 in the Appendix, with Z*≡[X, P(η0)]

equation image

We can now immediately establish that the estimator is consistent and that it is asymptotically distributed as normal with a covariance matrix having the standard White heteroscedastic corrected form.

Theorem 3. Defining equation image, and equation image, let

equation image

With equation image, under (A1–5) and (D1–4)

equation image

Note that we have assumed that E(u|X) = 0. If we assume further that u is independent of these conditioning vectors and let trimming vanish as the sample size increases, then equation image is an optimal IV estimator (see Amemiya, 1975).
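The IV estimator of this subsection, with the fitted probability standing in for the binary response in the instrument set and a White-type covariance matrix, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: trimming is omitted, the function name `spiv` is hypothetical, and the demonstration uses the true probability from a made-up heteroscedastic design in place of the semiparametric estimate.

```python
import math
import numpy as np

def spiv(y, X, d, p_hat):
    """IV estimate of y = X b + d m + u using [X, p_hat] as instruments for [X, d]."""
    Z = np.column_stack([X, d])        # regressors, including the binary response
    Zs = np.column_stack([X, p_hat])   # instruments: d replaced by its probability
    A = Zs.T @ Z
    theta = np.linalg.solve(A, Zs.T @ y)
    resid = y - Z @ theta
    # White heteroscedasticity-robust form: A^{-1} (sum u_i^2 z_i z_i') A^{-T}
    meat = (Zs * resid[:, None] ** 2).T @ Zs
    A_inv = np.linalg.inv(A)
    cov = A_inv @ meat @ A_inv.T
    return theta, cov

# Hypothetical design, for illustration only.
rng = np.random.default_rng(0)
n = 20000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])
v = rng.standard_normal(n)
u = 0.5 * v + math.sqrt(0.75) * rng.standard_normal(n)  # endogeneity through v
index = (x1 + x2) / np.exp(0.5 * x2)                    # index scaled by an assumed S(x)
d = (index > v).astype(float)
p_true = 0.5 * (1.0 + np.vectorize(math.erf)(index / math.sqrt(2.0)))
y = X @ np.ones(3) + 1.0 * d + u                        # all true coefficients equal 1

theta, cov = spiv(y, X, d, p_true)
std_err = np.sqrt(np.diag(cov))
```

The same variables enter the outcome equation and the probability here, so the treatment coefficient is identified only through the non-linearity of the probability in X, which the heteroscedastic scaling strengthens.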


To investigate the performance of the above estimator in a controlled setting, we conducted a Monte Carlo study.9 As the focus of this paper is on a double-index binary response equation, with heteroscedasticity providing the main motivation, one of the designs below is of this form. It is also of interest to examine the consequences of a double-index specification when the true binary response model is generated by a single index. Accordingly, we also present results for this case along with a related discussion of identification issues.

In formulating a design for the double-index case, note that the number of factors determining the nature of the simulation is very large, precluding an exhaustive examination of the estimator under all possible conditions. Accordingly, we adopt the following strategy. We consider the worst-case situation where we are unwilling to make any restrictions on which variables enter the means or the variances. That is, the same variables affect the means and the variances. With all exogenous variables distributed as standard normal, the true model with heteroscedastic errors is given as

equation image (6)
equation image (7)
equation image (8)
equation image (9)

The unscaled errors, equation image and equation image, were generated as normal with expectation zero. Their variances were selected to ensure that the scaled errors, vi and ui, each had unconditional variance of one. Finally, the unscaled errors were generated so as to have correlation of approximately 0.25 with each other. For the case in which the binary response is generated by a single index, we set Sv to a constant such that v has the same unconditional variance in both designs.
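The variance normalization described above can be illustrated with a small simulation. This is a sketch under loud assumptions: the scale function S(x) = exp(GAMMA * x) with a single standard normal x is hypothetical (it is not the design in equations (6)–(9)), and only the binary response error v is scaled. Because the unscaled error is independent of x with mean zero, Var(v) = E[S(x)^2] · Var(unscaled v), so the unscaled variance is set to 1 / E[S(x)^2].

```python
import math
import random

random.seed(1)
N = 50_000
GAMMA = 0.5   # hypothetical strength of the heteroscedasticity
RHO = 0.25    # target correlation between the unscaled errors

# For x ~ N(0, 1) and S(x) = exp(GAMMA * x), E[S(x)^2] = exp(2 * GAMMA^2).
sd_unscaled = math.sqrt(1.0 / math.exp(2.0 * GAMMA ** 2))

v_scaled, v_star, u_star = [], [], []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    e1, e2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    vs = sd_unscaled * e1
    us = sd_unscaled * (RHO * e1 + math.sqrt(1.0 - RHO ** 2) * e2)
    v_star.append(vs)
    u_star.append(us)
    v_scaled.append(math.exp(GAMMA * x) * vs)  # scaled error v = S(x) * unscaled v

def variance(a):
    m = sum(a) / len(a)
    return sum((t - m) ** 2 for t in a) / len(a)

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)
    return cov / math.sqrt(variance(a) * variance(b))

var_v = variance(v_scaled)            # should be close to one
corr_vu = correlation(v_star, u_star)  # should be close to RHO
```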

Turning to the double-index data-generating process, we first examine our ability to recover the reduced form parameters in the binary choice model. Second, we examine the ability of the IV estimator to estimate the outcome equation parameters.

In the first experiment we conduct simulations with a sample size of 1000 and with 500 replications. Under the W-parameterization discussed earlier, x2 is excluded from the first index and x1 is excluded from the second index. The true values for the nuisance parameters (the coefficients on x3 in each index after reparameterization) are 2 and −1, respectively.10 In estimating these parameters we obtained starting values from a coarse grid search. The averages of the estimates for these two parameters are 2.031 and −1.037, with standard deviations of 0.469 and 0.508, respectively. Thus the estimates appear to be unbiased and they are reasonably precisely estimated. In addition to computing the double-index parameters we also estimated a probit model which does not account for the presence of heteroscedasticity.

We also compared probit, semiparametric, and true probability functions. As an overall summary comparison, we estimated the correlation between the true probability that Y2i is equal to 1, given the xi vector, and that from the double index and probit models. The correlation between the probit probability and the true probability over the 500 replications was 0.726 with a standard deviation of 0.018. In contrast, the correlation between the true probability and that from the estimated double-index model was 0.907 with a standard deviation of 0.010. In a more detailed comparison of probability functions, in Table I we report the predicted probabilities for each of five quantiles.11 These tables not only highlight the superior performance of the double-index model, relative to the probit model, but also suggest that the estimator is performing very well in estimating the predicted probability.

Table I. Probability quantiles
True    Probit    Double index
(a) N = 1000
(b) N = 2000

Using the first step estimates we now employ these implied probabilities as an instrument for Y2i in estimating the second equation. In Table II we report the second-step IV and OLS estimates for the Y1 equation. We report the estimates for each of the second-step variables as each contributes differently in the heteroscedasticity index. When the semiparametric probability function is employed as an instrument, we refer to the resulting estimator as SPIV.

Table II. Simulation results
(a) N = 1000
(b) N = 2000

Column 1 reports the average value of the OLS estimates from the second step. Recall that the true value for each coefficient is 1. Each of the coefficients for the exogenous variables displays a level of bias in the range of 3.3–8.7%. The standard errors for the estimates, given below the estimates in parentheses, indicate the degree of precision of the estimates. We report these for comparison with the adjusted coefficients which follow. The average estimate for the intercept is 1.205, revealing that the bias is greatly influencing this coefficient. Finally, focus on the estimate of the treatment effect. The average OLS point estimate is 0.590, which reflects a bias in excess of 40%. Clearly the design employed is generating a substantial degree of endogeneity.
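The bias figures quoted above follow from simple arithmetic on the reported averages, given that every true coefficient equals 1. A short sketch (the helper name `percent_bias` is ours):

```python
def percent_bias(average_estimate, true_value):
    """Bias of the average estimate, as a percentage of the true value."""
    return 100.0 * (average_estimate - true_value) / true_value

# Averages reported in the text for N = 1000; the true value of each coefficient is 1.
treatment_bias = percent_bias(0.590, 1.0)   # about -41%, i.e. "in excess of 40%"
intercept_bias = percent_bias(1.205, 1.0)   # about +20.5%
```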

In column 2 we present the estimates in which we employ arbitrary functions of the explanatory variables as instruments. These included quadratic and cubic terms and all interactions between the variables, including the linear terms. Throughout, we use all of the variables in this available set. Column 2 indicates that this IV procedure reduces the bias on the coefficients on the exogenous variables and the intercept. The bias for the estimated treatment effect, however, is still on the order of 12.2%, although this represents a marked improvement over the OLS estimates.
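An instrument set of the kind described in this paragraph can be generated mechanically. The sketch below builds every monomial of total degree one to three from a row of explanatory variables; reading "all interactions" as including three-way products is our assumption, since the text describes the set only in words, and `polynomial_terms` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(row, max_degree=3):
    """All monomials of the entries of `row` with total degree 1..max_degree."""
    terms = []
    for degree in range(1, max_degree + 1):
        for combo in combinations_with_replacement(range(len(row)), degree):
            terms.append(prod(row[j] for j in combo))
    return terms

# Two variables give 2 linear, 3 quadratic and 4 cubic terms.
instruments = polynomial_terms([2.0, 3.0])
# Three variables give 3 + 6 + 10 = 19 terms.
n_terms_three_vars = len(polynomial_terms([1.0, 1.0, 1.0]))
```

The number of terms grows quickly with the number of variables, which is one reason such polynomial instrument sets can be less efficient than a single well-estimated probability.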

Column 3 presents the estimates from the SPIV procedure. For each of the parameters on the exogenous variables there is a large reduction in the bias in comparison to the OLS estimates. The procedure is successfully eliminating the bias from the endogeneity of the treatment effect. This is also true for the treatment effect itself, which now only displays 2% bias. Note, importantly, that the standard deviation for the treatment effect is smaller for this estimator than that shown in column 2.

We now repeat the same exercises after increasing the sample size to 2000. The averages of the first-step estimates are now 1.986 and −0.988, with standard deviations of 0.241 and 0.249, respectively. Thus the estimates continue to be very accurate and we also see a large decrease in the level of variability. Once again we compute the correlations described above. The correlation between the true probability and the probit probability is 0.727, with a standard deviation of 0.013, while the correlation between the true probability and that from the estimated double-index model is 0.915, with a standard deviation of 0.007. In Table I(b) we report the quantiles for the various probabilities. Again the double-index model not only dominates the probit model but also produces an excellent performance in absolute terms.

We now focus on the estimation of the binary treatment model and this is reported in Table II(b). The SPIV estimator formulated here continues to dominate the alternative estimators. The estimator using the higher orders and the cross-products of the xs continues to eliminate some of the bias but even doubling the sample size has not produced a notable decrease in the degree of bias. Once again, the SPIV estimator is remarkably accurate, with the estimates seemingly unbiased for all coefficients. Perhaps the most remarkable feature of Table II(b) is the increase in efficiency for this estimator as it now displays a standard deviation significantly lower than that for the alternative IV procedure.12

Turn now to the single-index data-generating process noting that with constant Sv the binary response model becomes a probit model. However, suppose that the single-index restriction is not imposed and that we continue to estimate the binary response in double-index form. For this purpose, it is expositionally convenient to rewrite the model in an equivalent but more revealing form. Letting C and A be appropriately dimensioned non-singular matrices, return to the original parameterization and write the binary response as

equation image

The first characterization is the double-index form, while the second follows from a single-index restriction under a conventional normalization. With equation image obtained by imposing the single-index restriction (e.g., as in Klein and Spady, 1993), define the non-singular matrix C as

equation image

Note that the transformed variables are given as

equation image

where equation image is the estimated index under a single-index restriction.

The transformed parameters corresponding to the above transformed variables are given as

equation image

With equation image not identified, consider the set of equation image values such that the upper block of the transformed parameter matrix is non-singular and, as earlier, set A as the inverse of this block. The following double-index form now follows:

equation image

When the model is generated by a single index, equation image is identified. However, once we condition on the single index, equation image, any additional ‘information’ is irrelevant. Namely:

equation image

for all equation image. As a result, while the above expectation (probability) is identified, equation image is not identified. Consequently, when the binary response equation is estimated in double-index form, we expect the estimator for equation image to be close to zero and the estimator for equation image to have a ‘large’ variance. For N = 1000 observations, Table III provides results when the binary response model is estimated in both single- and double-index forms. Under the single-index constraint, the estimated coefficients have small biases and low variances. Furthermore, the distribution of the estimator is such that the mean components are close to the corresponding medians. In contrast, the bottom portion of this table provides results when the model is estimated in double-index form. On average, the estimator for equation image (0.055) is small, as one would expect. The corresponding standard error of 0.65 is relatively large, which is misleading as there were a small number of very large outliers. Note that the median of the estimator (0.0003) is much smaller than the mean and is consistent with the true value for the coefficient being 0. The other parameter is not identified, as is reflected in an extremely large sampling variance.

Table III. Single-index binary response, N = 1000
Coef.             True    Avg equation image    Med equation image
Single-index constraint
Double-index constraint: I1
equation image    1
Double-index constraint: I2
equation image    0
X3                        −12.2801              0.1456

Table IV provides results for sample size equal to 2000. Other than there being less of an outlier issue, these results are similar to those above. Namely, as one would expect, the estimator for the identified parameter is close to 0 and is much more precisely estimated than when the sample size is 1000. Note that the smaller standard error is due largely to a much better estimated binary response probability. The sampling variance for the unidentified parameter is relatively large.

Table IV. Single-index binary response, N = 2000
Coef.             True    Avg (equation image)    Med (equation image)
Single-index constraint
Double-index constraint: I1
equation image    1
X3                0       −0.0026                 0.0016
Double-index constraint: I2
equation image    0
X3                        −0.5424                 −0.2791

Turning to the outcomes equation, shown in Table V, the results are as expected. Note that the estimated probability function converges (pointwise and uniformly) slowly to the truth in double-index form. As a result, and not surprisingly, there is only a slight advantage to the SPIV estimator over the IV estimator. As found earlier, the bias for the OLS estimator is substantial, ranging up to almost 50% for the treatment effect. At the larger sample size (N = 2000), the semiparametric probability is better estimated, which is reflected in a noticeable improvement of SPIV over IV. In particular, the standard error for the estimated treatment effect is approximately 20% lower for the SPIV estimator relative to the IV estimator.

Table V. Outcomes equation, single-index treatment, double-index constraint
N = 1000
N = 2000

It is also instructive to compare the above results across designs in the case of the outcomes equation. When a double index really generates the data, the SPIV estimator has small biases and standard errors. However, we now turn to the case where the model is still estimated in double-index form, but where a single index actually generates the data. In this case, the biases and standard errors are noticeably larger.


We now employ the estimators formulated here to study two questions of interest. There is a large recent literature on the effect of attendance at private schools on educational attainment and subsequent labor market performance (for recent examples see Evans and Schwab, 1995; Neal, 1997; Vella, 1999). This has become an increasingly well-studied area due to the common finding that attending private and Catholic schools increases the number of years of schooling acquired and the level of post-schooling qualifications. Unlike previous papers which examine the effect of Catholic schools on education, we examine the effect of attending a government- or state-financed school. We begin first by estimating the marginal effects of particular variables on the probability of attendance at a government-financed school. This allows us to identify the determinants of the school choice while allowing for general forms of heteroscedasticity and without making distributional assumptions. Second, we examine the impact of attendance at a government-financed school on educational attainment. The issue of endogeneity of school type and education level needs little motivation. Schooling represents a form of human capital investment and the investment can differ in terms of duration and quality. However, as both decisions reflect human capital investments, albeit on different margins, each should be influenced by similar factors. As the unobservable factors are likely to be similar, this highlights the endogeneity. Moreover, as both decisions are likely to be influenced by the same observable factors, the absence of reasonable exclusion restrictions is immediately apparent. Despite the simultaneity, the triangular structure is reasonable, as the school type is chosen first and then the number of years follows from the individual's schooling success and the cost of the investment.

We employ data from the Australian Longitudinal Survey for 1985. The data comprise 5353 observations on youth who have completed their schooling. The binary response variable is the school type of the individual, which we denote as Govt: a binary indicator that the individual attended a government-run high school. The mean of this variable is 0.808. The outcome variable is the number of years of schooling, which has a mean of 11.639. The model is the following:

equation image (10)
equation image (11)

The explanatory variables are those one would expect to influence human capital investment. With three exceptions the variables are indicator functions. For these indicator functions the variable name reflects what it measures. The variable Age is measured in years and Siblings denotes the number of siblings in the family. The one explanatory variable which requires some explanation is Attitudes. This variable is constructed from each individual's responses to a series of questions which aim to elicit the individual's view of the roles of females in the labor market. Vella (1994) investigates the role of this variable in the human capital investment for Australian youth and concludes that the variable captures family forces which influence educational attainment. An important issue in that study, which is equally of relevance here, is whether this variable can be treated as exogenous to human capital investment. While Vella (1994) starts with the conjecture that the attitudes variable is endogenous to human capital investment, that study is unable to provide any evidence that the attitudes variable is endogenous to schooling. Employing the same dataset, we proceed on the assumption that Attitudes is exogenous. The variable takes discrete values from 5 to 35, where a low score reflects a very traditional role for females, while a higher score reflects an attitude of gender equality. We treat this variable and age as continuous for identification purposes.

Before focusing on the estimates, it is useful to consider why the schooling choice equation might exhibit heteroscedasticity. Many of the explanatory variables are indicator functions and their inclusion is meant to capture their average effect on the schooling choice. However, the direction, and magnitude, of these effects might be expected to vary across individuals. For example, consider the indicator function capturing that the individual is Australian born. This captures the contrast with non-Australian-born individuals and for many reasons one might expect that there may be a difference across groups. However, just as it is likely that those comprising the Australian born are very different in various ways, such as family attitudes towards education and scholastic abilities, it is also true that those comprising the non-Australian born are heterogeneous. Accordingly, while the inclusion of the indicator function captures the mean difference across the two groups, there is likely to be a large variance in the effect depending on which individuals from the respective groups are compared. Moreover, this difference may not be correlated with the other explanatory variables and thus it is not easily taken into account. The same type of argument is true for many of the other explanatory variables. Allowing the explanatory variables to affect the variance is an attempt to more accurately capture this effect.

We begin by estimating the schooling type decision. In column 1 of Table VI we present the estimated parameters obtained by probit. In columns 2 and 3 of Table VI we report the estimates from estimating the double-index binary choice model. The standard error for each estimate is shown in parentheses under the estimate. Recall that we are able to transform the model to an equivalent one under a non-singular linear transformation so as to induce exclusion restrictions for purposes of estimating probabilities. Further, we obtain an equivalent model by normalizing the constant term to zero and one of the coefficients in each index to one. In view of these normalizations, it is difficult to interpret the coefficients other than to note that many of the variables have a statistically significant impact. Accordingly, we perform the following exercise using both parametric and semiparametric models. We use the estimates to evaluate the probability of each individual attending a government school with and without each of the characteristics. Then, with the exception of age, the attitudes variable and the number of siblings, we compute the average effect of each individual acquiring the characteristic. For age and attitudes variables, we evaluate the impact of a one standard deviation change, while for siblings we increase the variable by one. These are all reported in Table VII. Without exception, the partial effect for each of the variables has the same sign across estimation procedures. Perhaps the most striking difference across the two procedures is the magnitude of the effect of the variable denoting that the individual is Catholic. In the probit model the estimated effect is over 50 percentage points, while for the double-index model the effect is around 33 percentage points. Thus, while overall the partial effects are quite similar across models, the large difference in the Catholic effect illustrates the value of the double-index approach.
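The partial-effect exercise described above can be sketched as follows. The probability function is passed in as an argument, so the same code applies to a probit or to an estimated double-index probability. The helper names and the probit coefficients in the demonstration are hypothetical and are not the estimates in Table VI.

```python
from statistics import NormalDist

def average_indicator_effect(prob, rows, j):
    """Mean change in prob(row) when indicator j is switched from 0 to 1 for everyone."""
    total = 0.0
    for row in rows:
        with_char = list(row)
        with_char[j] = 1.0
        without_char = list(row)
        without_char[j] = 0.0
        total += prob(with_char) - prob(without_char)
    return total / len(rows)

def average_shift_effect(prob, rows, j, delta):
    """Mean change in prob(row) when variable j is raised by delta (one s.d., or one sibling)."""
    total = 0.0
    for row in rows:
        shifted = list(row)
        shifted[j] += delta
        total += prob(shifted) - prob(row)
    return total / len(rows)

# Hypothetical probit with columns [indicator, continuous variable].
def demo_prob(row):
    return NormalDist().cdf(1.0 - 1.2 * row[0] - 0.02 * row[1])

people = [[1.0, 18.0], [0.0, 20.0], [0.0, 22.0], [1.0, 19.0]]
indicator_effect = average_indicator_effect(demo_prob, people, 0)
shift_effect = average_shift_effect(demo_prob, people, 1, 2.0)
```

Averaging individual differences, rather than evaluating the effect at a single representative point, is what allows the comparison across models even though the index coefficients themselves are normalized differently.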

Table VI. Determinants of attending a government school
                  Probit         S-P            S-P
                  Govt school    Govt school    Govt school
Age               −0.017                        1
Attitudes         −0.022         1
Both parents      −0.094         −0.294         1.382
Mother/degree     −0.583         5.662          −2.451
Father/degree     −0.549         0.865          −0.345
Siblings          0.020          0.165          −0.721
Roman Catholic    −1.270         3.567          −2.339
Males             −0.032         2.961          −6.320
Aust              −0.296         −0.740         2.697

Table VII. Partial effects
                  Probit    Double index
Age               −0.010    −0.007
Attitudes         −0.034    −0.027
Both parents      −0.059    −0.052
Mother/degree     −0.150    −0.162
Father/degree     −0.164    −0.099
Roman Catholic    −0.530    −0.326
Male              −0.009    −0.020
Aust              −0.020    −0.084

While there are some important differences between the estimated marginal effects from the probit and double-index models, it is valuable to test the probit model of government school attendance for the presence of heteroscedasticity and non-normality by employing the conditional moment tests outlined in Pagan and Vella (1989). The tests are implemented via artificial regressions whereby one regresses the product of the generalized residual, the single index from the probit model and the explanatory variable potentially causing the heteroscedasticity against the scores from the probit model and an intercept. The test against the null of no heteroscedasticity is a t-test on the null that the intercept is equal to zero. We conducted this test for each of the variables which appear in the conditional mean of the Govt equation and report the results in Table VIII. The tests indicated the presence of heteroscedasticity operating through several of the variables. More precisely, there was a rejection at the 5% level for the Age, Aust and Both Parents Present variables and at the 10% level for Attitudes. Moreover, the test for the imposed distributional assumptions strongly rejected normality. Note that the presence of both forms of misspecification makes it difficult to fully understand the cause of the rejections. Nevertheless, the evidence suggests that heteroscedasticity is present.

Table VIII. Test values for heteroscedasticity
Variable                Test value
Both parents present    3.313
Mother with degree      1.398
Father with degree      0.365
Roman Catholic          1.288

We now examine how the presence of heteroscedasticity can help detect the exogenous effect of attendance at a government high school. Before we do so, we report the OLS estimates and also employ two alternative approaches for accounting for the simultaneity. In column 1 of Table IX we report the ordinary least squares (OLS) estimates of equation (10). They indicate that attending a government school appears to decrease the years of educational investment by 0.559 years. The standard error is small, indicating the effect is relatively precisely estimated. This effect is not particularly large given the large premium associated with attending a private institution when at high school. For example, in this sample only 47.8% of the individuals attending government schools obtained at least 12 years of schooling, in comparison to 68.3% of the non-government students. Also, while only 2.9% of government students obtained a college degree, the corresponding number for the non-government students is 7.3%. The remaining coefficients are also generally statistically significantly different from zero and are all of a reasonable magnitude, although it is difficult to have strong expectations. The variables capturing the presence of both parents in the household and the level of each parent's education capture the effect of role models as well as higher incomes. The variable reflecting the number of siblings has the expected negative sign and is reasonable in magnitude. As found in Vella (1994), the Attitudes variable has a strong positive effect on years of education acquired.

Table IX. The impact of government school attendance on years of education
                  OLS       IV        CF        SPIV
                  School    School    School    School
Both parents      0.294     0.306     0.303     0.310
Siblings          −0.117    −0.118    −0.118    −0.120
Roman Catholic    −0.045    0.129     0.075     −0.202
Govt              −0.559    −0.050    −0.206    −0.986
Mills ratio                           −0.200

From the above, the OLS estimated impact of attending a government school appears to be too small. Accordingly, we are motivated to consider a model that incorporates the schooling decision, and does so in a general specification. However, first we employ two procedures which do not directly exploit the heteroscedasticity. First we perform IV by using the predicted probability from the probit model as an instrument for the government indicator. The second is to include the inverse Mills ratio, from this parametric estimation of the government equation, as an additional regressor in the years of education equation. Note that the first of these estimates is consistent in the absence of normality, while the latter is not. To implement these procedures, it is necessary to employ the probability that the individual attends a government school from the estimates reported in column 1 of Table VI.
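Both parametric procedures above are built from the first-step probit index. As a sketch, assuming the textbook convention d = 1{index + v > 0} with v standard normal (the sign convention and the function names are our assumptions), the probit probability used as an instrument and the inverse Mills ratio term used as an added regressor are:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def probit_probability(index):
    """P(d = 1 | index): the instrument for the first (IV) procedure."""
    return STD_NORMAL.cdf(index)

def inverse_mills_term(d, index):
    """E[v | d, index] under d = 1{index + v > 0}: the regressor for the second procedure."""
    pdf = STD_NORMAL.pdf(index)
    cdf = STD_NORMAL.cdf(index)
    if d == 1:
        return pdf / cdf
    return -pdf / (1.0 - cdf)

# At an index of zero the two groups receive terms of equal size and opposite sign.
term_if_govt = inverse_mills_term(1, 0.0)
term_if_not = inverse_mills_term(0, 0.0)
```

The added regressor is a correction derived under normality, which is why the second procedure loses consistency when normality fails, whereas a misspecified probability can still serve as a valid, if inefficient, instrument.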

The second column of Table IX presents the estimates of the education equation when we conduct IV by instrumenting the Govt dummy with the predicted probabilities from the probit model. As the same variables appear in the Govt equation and the schooling equation the model is identified from the non-linear mapping from the explanatory variables. In general, the coefficients are similar to those in column 1, although there is a difference with respect to the school and religion variables. The coefficient on the attendance at a government school variable is now unreasonable in that it indicates those who attend a government school, ceteris paribus, will obtain only 0.05 years of education less than those at private schools. This is in complete contrast to the conventional understanding of the effect of attendance at state-financed schools. Note, however, that this coefficient is not statistically different from zero at the 10% significance level. When we adopt the plug-in version of this model we obtain an estimate of the government school effect of − 0.071 with a standard error of 0.891.

In column 3 we report the alternative procedure whereby one includes the inverse Mills ratio from the model in column 1 of Table VI as an additional regressor in the education equation. These results are generally reasonable in magnitude, in that they are similar to the OLS estimates, although the government variable's coefficient is now less than half the OLS estimate in absolute terms. However, the coefficient on this variable is very imprecisely estimated.13 Overall the evidence in columns 2 and 3 confirms our suspicion that there appears to be inadequate non-linearity in the transformations performed to enable accurate estimation of the model. Also note that as the t-statistic associated with the inverse Mills ratio is low there is no evidence to support the conjecture that school type is endogenous to years of education. One suspects that the test has relatively low power given the inaccurate manner in which the parameters are estimated and the associated collinearity.

In the fourth column of Table IX we report the estimates from the schooling equation when we instrument the Govt variable with the estimated probability from the semiparametric binary choice model. The estimates are generally similar to those in the first column. The most striking change is the increase in the magnitude of the Govt school coefficient, which now indicates that the effect is 0.99 years and is statistically significantly different from zero at the 10% level. This estimate seems far more reasonable given the educational behavior of those attending non-government schools. In order to explore the role of the double-index structure in this result we also estimate the model where we first semiparametrically estimated the probability to employ as an instrument via the single-index approach of Klein and Spady (1993). For this approach we found that the point estimate for the Govt coefficient was − 0.852, with a large standard error of 0.723. While the point estimate is similar to the double-index approach, the increased identifying power of the double-index model provides a different conclusion regarding whether the effect is statistically different from zero at conventional levels of testing.

Finally, we explore the possibility that the treatment effect is not constant. To this end, let $X_i$ ($1 \times K$) denote the $i$th observation on the $K$ exogenous variables, and let the treatment variable enter as $\mathrm{Govt}_i \cdot [c_o + X_i\theta_o]$. In this form, the Govt variable interacts with the individual's characteristics. We estimated the resulting model by IV, using the predicted probability from our double-index model, interacted with the individual's characteristics, as instruments for these interaction variables. To examine whether there is any treatment effect, we considered a Wald test of the joint null hypothesis $c_o = 0$ and $\theta_o = 0$. With a P-value of 0.0058, we reject the null hypothesis at conventional significance levels. We also calculated the average treatment effect (at the mean values of the X's) to be −2.975 with an associated standard error of 1.162. Accordingly, there would seem to be a treatment effect whose magnitude is much larger than the average OLS effect previously reported. Not surprisingly, given the above results, we also reject the null hypothesis of a constant treatment effect ($\theta_o = 0$) with an associated P-value of 0.0114.14
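For concreteness, the varying-treatment-effect specification and the quantities reported above can be written out as follows. This is a sketch under assumed notation: the outcome $Y_i$, the coefficient vector $\beta_o$, the disturbance $u_i$, the estimated probability $\hat{P}_i$, and the estimated covariance matrix $\hat{V}$ of $\hat{\gamma} = (\hat{c}, \hat{\theta}')'$ are symbols introduced here for illustration. We also assume that $X_i$ excludes a constant and that $\bar{X}$ is treated as fixed when computing the standard error.

```latex
\begin{align*}
Y_i &= X_i\beta_o + \mathrm{Govt}_i\,[\,c_o + X_i\theta_o\,] + u_i, \\
\text{instruments for } (\mathrm{Govt}_i,\ \mathrm{Govt}_i X_i) &:\ (\hat{P}_i,\ \hat{P}_i X_i), \\
\widehat{\mathrm{ATE}}(\bar{X}) &= \hat{c} + \bar{X}\hat{\theta}, \qquad
\mathrm{se}\big(\widehat{\mathrm{ATE}}(\bar{X})\big) = \sqrt{(1,\ \bar{X})\,\hat{V}\,(1,\ \bar{X})'}, \\
W &= \hat{\gamma}'\,\hat{V}^{-1}\hat{\gamma} \ \overset{a}{\sim}\ \chi^2_{K+1}
\quad \text{under } H_0:\ c_o = 0,\ \theta_o = 0 .
\end{align*}
```

The test of a constant treatment effect uses the same form with $\hat{\theta}$ and the corresponding $K \times K$ block of $\hat{V}$, giving $K$ degrees of freedom.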


The primary objective of this paper is to develop a semiparametric estimator for the binary choice model in the presence of heteroscedasticity. To do so we present a double-index model in which the indices capture the conditional mean and conditional variance, respectively. We then estimate the parameters by maximizing a quasi-likelihood function that depends on these two indices. We note that this procedure is applicable to any discrete choice model that is a function of two indices. We also highlight that, in deriving the asymptotic properties of our procedure, we develop a theoretical argument that justifies the use of local smoothing as a bias-reducing device in discrete choice models with a double-index structure.

The interest in binary response models often follows from the appearance of the response as an endogenous explanatory variable in another equation of interest. A further difficulty is that it is frequently hard to identify variables that determine the response but do not enter directly into the equation in which the response appears as a regressor. We illustrate how the presence of heteroscedasticity can provide identification in such instances. Using the predicted probability from the binary response model as an instrument for the treatment variable, we show that one can consistently estimate the treatment effect. We show that the estimators for both models are consistent and asymptotically (equation image) distributed as normal. We provide simulation evidence that both procedures formulated here work well even when the same variables drive the conditional means and variances of both the treatment and outcome equations.

In an empirical investigation we illustrate the utility of both proposed estimators. In the first step we examine the determinants of the probability of undertaking education at a government-financed school. In the second step we use this probability as an instrument to estimate the impact of attending such a school on the level of education. The evidence suggests that the estimated first-step probability is quite different from that generated by a probit model assuming homoscedasticity. The second-step estimates suggest that the heteroscedasticity in the schooling choice equation may be an effective means of identifying the effect of school type on the level of schooling.


Financial support from the Research Council at Rutgers University is acknowledged. We are grateful to an anonymous referee for useful comments.

  • 1

    However, the estimator may very well not be efficient in the restricted class of double-index models considered here.

  • 2

    Virtually all of the technical difficulties in this paper arise from estimating a double-index specification for the binary response model under estimated local smoothing.

  • 3

    This assumption is not necessary, but simplifies the trimming strategy.

  • 4

    Ichimura and Lee use this differentiability condition in their proof. We have explicitly stated it as an assumption, because it can fail when all continuous variables of an index are functionally related. For example, suppose equation image and write

    equation image

    The derivative condition in A5 does not hold for this case. Moreover, the model is not identified as

    equation image

    In other words, we are unable to distinguish η31 from equation image as both yield the same binary response probability. A similar issue arises in the case of single-index models. The identification argument in Klein and Spady (1993) requires that there is a continuous variable that is functionally independent of other continuous variables in the model. Ichimura (1993) provides weaker identification conditions by relaxing this assumption. It remains the case, however, that when the index is linear in ‘basis’ functions of the same continuous variable, Z, the index is not identified.

  • 5

    This assumption is used in two types of arguments. First, in conjunction with the parameters being in a compact set, it implies that probabilities are bounded away from one and zero. In the absence of this simplifying condition, one would need to make a tail assumption on how fast the probability function tends to one or zero. Second, this compact support assumption simplifies various uniform convergence arguments.

  • 6

    In taking fourth-order Taylor series expansions to examine bias terms, the fourth derivative will involve $N^{4a}$. Consequently, it is important that $a$ be ‘small’. Here, $a = (r_3 - \varepsilon_a)/8$, with $\varepsilon_a$ positive and arbitrarily small. The value $a = 1/100$ satisfies the required constraint and is employed in the Monte Carlo study.

  • 7

    In proving Lemma 8, we require $r_3 < 1/10$, $4(r_1 + r_2 + r_3) > 1/2$, and $0 < \delta < r_3/2$. In making a bias calculation (Lemma 3A–C), we will require $0 < r_1 < r_2 - 2a$ and $0 < r_1 + r_2 < r_3 - 2a$. These conditions are satisfied with the parameter $a$ set as in D4, $r_i$ as in D5, and $\delta$ as in D5.

  • 8

    For bias reasons, it is important to let the densities that define these parameters be closer to zero than the densities upon which the semiparametric probability function is based. As the likelihood trimming ensures that densities have a lower bound of the form B > 0, we permit the local smoothing parameters to slowly tend to zero.

  • 9

    We set trimming and smoothing parameters as follows: $a = 0.01$, $r_3 = 1/11$, $\delta = 1/25$, $r_2 = (r_3 - \delta/2)/2$, and $r_1 = (r_3 - \delta)/4$.

  • 10

    Write the transformed matrix of index coefficients as

    equation image
  • 11

    In constructing this table, probabilities were sorted on the basis of the calculated true probabilities in each sample. Then, for the first N/5 observations, average probabilities were calculated for the true probabilities, probit probabilities, and double-index probabilities. These average probabilities were then averaged over replications (with minimal Monte Carlo sampling error). Similar calculations were made for each of the other reported quantiles.

  • 12

    It may be possible to improve this alternative IV procedure by developing a method for selecting the degree of the approximating polynomial. We have not pursued this strategy here primarily because the semiparametric probability function is of direct interest and secondly because this probability function is an optimal instrument.

  • 13

    Note that the standard errors for this column are underestimated as they have not been corrected to account for the estimation of the inverse Mills ratio.

  • 14

    While several of the individual interactions were significant, a number were not. Thus it would seem reasonable, but beyond the scope of this paper, to further explore variable treatment effects.
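The tuning constants in notes 6, 7 and 9 can be checked against the stated inequalities with exact arithmetic. The sketch below takes r1 = (r3 − δ)/4 and r2 = (r3 − δ/2)/2; this assignment of labels is our reading of note 9, chosen because it is the one under which the condition 0 < r1 < r2 − 2a of note 7 holds. The condition a < r3/8 is how we read a = (r3 − εa)/8 with εa > 0.

```python
from fractions import Fraction as F

# Values reported in note 9 (labels r1 and r2 as read above; an assumption).
a = F(1, 100)
r3 = F(1, 11)
delta = F(1, 25)
r2 = (r3 - delta / 2) / 2
r1 = (r3 - delta) / 4

# Inequalities stated in notes 6 and 7.
checks = {
    "r3 < 1/10": r3 < F(1, 10),
    "4(r1 + r2 + r3) > 1/2": 4 * (r1 + r2 + r3) > F(1, 2),
    "0 < delta < r3/2": 0 < delta < r3 / 2,
    "0 < r1 < r2 - 2a": 0 < r1 < r2 - 2 * a,
    "0 < r1 + r2 < r3 - 2a": 0 < r1 + r2 < r3 - 2 * a,
    "a < r3/8": a < r3 / 8,
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Using `Fraction` avoids floating-point rounding in comparisons such as δ < r3/2, where the two sides differ only in the third decimal place.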