7.1 Simulation Study for the Parameter Estimates
We conducted a simulation study to evaluate the performance of the proposed LSKM estimation method for the semiparametric model (1) by fitting the linear mixed model (11). We considered the following model
yi = xiβ + h(zi1, … , zip) + ei, (19)
where ei ∼ N(0, 1). To allow xi and (zi1, … , zip) to be correlated, xi was generated as xi = 3cos(zi1) + 2ui, with ui independent of zi1 and following N(0, 1); the zij (j = 1, … , p) were generated from Uniform(0, 1). The nonparametric function h(·) was allowed to have a complex form, with nonlinear functions of the z's and interactions among the z's. In our simulations, we first fit the model using the same set of z's as in the true model. In practice, without advance knowledge, the true set of z's is often unknown, and the set of z's that is used might be larger than the true set and contain some noisy z's that are irrelevant to the outcome y. To mimic such a scenario, in the second set of simulations we added some noisy z's to the set of z's and fit (19).
We considered four configurations by varying n (the sample size) and p (the number of covariates z). For each setting, only the Gaussian kernel was used and 300 simulations were run.
Setting 1: n = 60, p = 5, true h(z) = 10cos(z1) − 15z2² + 10exp(−z3)z4 − 8sin(z5)cos(z3) + 20z1z5. Fit the model with the five true z's. This setting mimics the PSA data.
Setting 2: n= 100, p= 8, h(·) is the same as setting 1. Fit the model (19) by including 3 additional irrelevant z6, z7, z8 besides the true z1, … , z5.
Setting 3: n = 200, p = 10, true h(z1, … , z10) = 10cos(z1) − 15z2² + 10exp(−z3)z4 − 8sin(z5)cos(z3) + 20z1z5 + 9z6sin(z7) − 8cos(z6)z7 + 20z8sin(z9)sin(z10) − 15z8³ − 10z8z9 − exp(z10)cos(z10). Fit the model assuming these 10 true z's are used.
Setting 4: n = 300, p = 15, h(·) is the same as in setting 3. Fit the model with 5 additional irrelevant noisy predictors z11, … , z15 besides the true z1, … , z10.
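The data-generating process for Settings 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `h_setting1` and `simulate`, the seed handling, and the use of NumPy are assumptions, and the term written "15z22" in the extracted text is read here as 15 times z2 squared.

```python
import numpy as np

def h_setting1(z):
    """True h(z) of Setting 1; only the first five columns of z are used."""
    z1, z2, z3, z4, z5 = (z[:, j] for j in range(5))
    return (10 * np.cos(z1) - 15 * z2**2 + 10 * np.exp(-z3) * z4
            - 8 * np.sin(z5) * np.cos(z3) + 20 * z1 * z5)

def simulate(n, p, beta=1.0, seed=0):
    """Draw one data set from y = x*beta + h(z) + e (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=(n, p))   # z_ij ~ Uniform(0, 1)
    u = rng.standard_normal(n)               # u_i ~ N(0, 1), independent of z
    x = 3 * np.cos(z[:, 0]) + 2 * u          # x correlated with z1
    e = rng.standard_normal(n)               # e_i ~ N(0, 1)
    h = h_setting1(z)
    y = beta * x + h + e
    return x, z, h, y

# Setting 1: five true covariates, n = 60
x, z, h, y = simulate(n=60, p=5)
```

Calling `simulate(n=100, p=8)` gives the Setting 2 layout: columns 6 to 8 of `z` are generated but never enter `h`, so they act as the irrelevant noisy covariates.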
The point estimate results are presented in Table 2. Because it is difficult to graphically display the fitted value of h(·) as a function of z, we summarized the goodness of fit of h(·) in the following way. For each simulated data set, we regressed the true h on the fitted ĥ, both evaluated at the design points. We then empirically summarized the goodness of fit of ĥ(·) by reporting the average intercepts, slopes, and R²'s obtained from these regressions over the 300 simulations. An intercept close to zero, a slope close to one, and an R² close to one would provide empirical evidence that the estimated multidimensional function ĥ(·) is close to the true function.
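The goodness-of-fit summary described above amounts to an ordinary least-squares regression of the true h on the fitted values. A minimal sketch follows, assuming NumPy; the function name `calibration_summary` is hypothetical, and the noisy vector below is only a stand-in for an actual LSKM fit.

```python
import numpy as np

def calibration_summary(h_true, h_fit):
    """Regress h_true on h_fit by OLS and return (intercept, slope, R2)."""
    h_true = np.asarray(h_true, dtype=float)
    h_fit = np.asarray(h_fit, dtype=float)
    design = np.column_stack([np.ones_like(h_fit), h_fit])
    coef, *_ = np.linalg.lstsq(design, h_true, rcond=None)
    resid = h_true - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(np.sum((h_true - h_true.mean()) ** 2))
    return float(coef[0]), float(coef[1]), 1.0 - ss_res / ss_tot

# Illustrative inputs only: a synthetic "true" h and a perturbed copy of it.
rng = np.random.default_rng(0)
h_true = rng.normal(size=60)
h_fit = h_true + 0.1 * rng.normal(size=60)
intercept, slope, r2 = calibration_summary(h_true, h_fit)
```

In a simulation study the three returned values would be stored for each run and averaged over the runs.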
Table 2. Simulation results of estimated regression coefficients β and the nonparametric function h(·) in model y = xβ + h(z) + e based on 300 runs. True β = 1 and true σ² = 1. Columns β, σ², and ρ are model parameter estimates; Intercept, Slope, and R² are from the regression of h on ĥ.

| Setting | True #z | Used #z | n | β | σ² | ρ | Intercept | Slope | R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 5 | 60 | 1.00 | 0.96 | 5.34a (estimated) | −0.04 | 1.00 | 0.99 |
| | | | 100 | 1.01 | 0.96 | 7.24 (estimated) | −0.01 | 1.00 | 0.99 |
| | | | 100 | 1.00 | 0.92 | 1.00 (fixed) | −0.01 | 1.00 | 0.99 |
| | | | 100 | 1.00 | 1.01 | 100.00 (fixed) | −0.02 | 1.00 | 0.99 |
| 2 | 5 | 8 | 100 | 1.05 | 0.89 | 6.74 (estimated) | 0.16 | 1.00 | 0.98 |
| | | | 100 | 1.06 | 0.30 | 1.00 (fixed) | 0.36 | 0.98 | 0.97 |
| | | | 100 | 1.12 | 2.15 | 100.00 (fixed) | 0.23 | 1.01 | 0.96 |
| 3 | 10 | 10 | 200 | 0.98 | 0.93 | 12.83 (estimated) | −0.07 | 1.00 | 0.99 |
| | | | 200 | 0.92 | 0.30 | 1.00 (fixed) | −0.18 | 0.99 | 0.98 |
| | | | 200 | 0.98 | 1.15 | 100.00 (fixed) | −0.04 | 1.00 | 0.99 |
| 4 | 10 | 15 | 300 | 1.01 | 0.82 | 14.02 (estimated) | 0.03 | 1.00 | 0.99 |
| | | | 300 | 1.01 | 0.75 | 10.00 (fixed) | 0.02 | 1.00 | 0.99 |
| | | | 300 | 1.01 | 1.17 | 100.00 (fixed) | 0.02 | 1.00 | 0.99 |
The results in Table 2 show that, when the true set of z's was included in fitting h(·) and all the model parameters {β, h(·), τ, ρ, σ²} were estimated simultaneously, the LSKM method via the mixed model framework performed well in estimating β, h(·), and σ². However, if the scale parameter ρ in the Gaussian kernel was fixed, as is often done in traditional machine learning, the model estimators could be subject to considerable bias, especially the estimate of σ². When ρ was fixed at values close to the estimated one, the bias was small. Because ρ is unknown in practice, our results suggest that it is useful to estimate the scale parameter ρ from the data. When extra irrelevant z's were used in addition to the true set of z's in fitting h(·), the proposed method still performed well if all model parameters were estimated.
Table 3 compares the estimated standard errors of β̂ using the frequentist method (12) and the Bayesian method (14) with the empirical ones. The results show that both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. Table 3 also compares the estimated standard errors of ĥ(·) (including intercept) using the frequentist method (13) and the Bayesian method (15) with the empirical standard errors. For ease of presentation, for each setting, we averaged the SE estimates across all the grid points and present these averages. The results show that when the scale parameter ρ was estimated, both the frequentist and the Bayesian standard error estimates were close to their empirical counterparts. When the scale parameter was fixed, the Bayesian and frequentist SEs were still close to each other but could be quite different from the empirical SEs. These results further indicate that it is useful to estimate the scale parameter ρ in practice.
Table 3. Simulation study results of standard error estimates of β̂ and ĥ(·) in model y = xβ + h(z) + e based on 300 simulations

| Setting | True #z | Used #z | n | Empirical SE | Bayesian SE | Frequentist SE | ρ |
|---|---|---|---|---|---|---|---|
| SEs of β̂ | | | | | | | |
| 1 | 5 | 5 | 60 | 0.088 | 0.088 | 0.083 | 5.34 (estimated) |
| | | | 100 | 0.054 | 0.057 | 0.055 | 7.24 (estimated) |
| | | | 100 | 0.062 | 0.066 | 0.058 | 1.00 (fixed) |
| | | | 100 | 0.055 | 0.056 | 0.055 | 100.00 (fixed) |
| 2 | 5 | 8 | 100 | 0.066 | 0.065 | 0.058 | 6.74 (estimated) |
| | | | 100 | 0.070 | 0.078 | 0.034 | 1.00 (fixed) |
| | | | 100 | 0.082 | 0.081 | 0.078 | 100.00 (fixed) |
| 3 | 10 | 10 | 200 | 0.044 | 0.047 | 0.042 | 12.83 (estimated) |
| | | | 200 | 0.050 | 0.077 | 0.024 | 1.00 (fixed) |
| | | | 200 | 0.041 | 0.047 | 0.045 | 100.00 (fixed) |
| 4 | 10 | 15 | 300 | 0.039 | 0.042 | 0.033 | 14.02 (estimated) |
| | | | 300 | 0.039 | 0.044 | 0.032 | 10.00 (fixed) |
| | | | 300 | 0.037 | 0.041 | 0.039 | 100.00 (fixed) |
| SEs of ĥ | | | | | | | |
| 1 | 5 | 5 | 60 | 0.635 | 0.662 | 0.601 | 5.34 (estimated) |
| | | | 100 | 0.482 | 0.515 | 0.464 | 7.24 (estimated) |
| | | | 100 | 0.614 | 0.664 | 0.576 | 1.00 (fixed) |
| | | | 100 | 0.458 | 0.470 | 0.456 | 100.00 (fixed) |
| 2 | 5 | 8 | 100 | 0.662 | 0.683 | 0.604 | 6.74 (estimated) |
| | | | 100 | 0.933 | 0.540 | 0.449 | 1.00 (fixed) |
| | | | 100 | 0.741 | 0.731 | 0.645 | 100.00 (fixed) |
| 3 | 10 | 10 | 200 | 0.606 | 0.667 | 0.583 | 12.83 (estimated) |
| | | | 200 | 0.954 | 0.541 | 0.450 | 1.00 (fixed) |
| | | | 200 | 0.559 | 0.630 | 0.596 | 100.00 (fixed) |
| 4 | 10 | 15 | 300 | 0.712 | 0.721 | 0.636 | 14.02 (estimated) |
| | | | 300 | 0.737 | 0.717 | 0.634 | 10.00 (fixed) |
| | | | 300 | 0.632 | 0.732 | 0.684 | 100.00 (fixed) |
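The comparison in Table 3 sets the spread of the estimates across simulation runs (the empirical SE) against the average of the model-based SE estimates. A minimal sketch of that bookkeeping is given below, assuming NumPy; the function name `se_comparison` and the synthetic inputs are illustrative assumptions, not the authors' code or data.

```python
import numpy as np

def se_comparison(estimates, se_estimates):
    """Return (empirical SE, mean model-based SE).

    estimates    : shape (runs,) for a scalar such as beta-hat, or
                   (runs, points) for h-hat evaluated at several points.
    se_estimates : model-based SE estimates with the same shape.
    For the 2-D case both summaries are averaged across points.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(se_estimates, dtype=float)
    empirical = est.std(axis=0, ddof=1)   # spread across simulation runs
    return float(np.mean(empirical)), float(np.mean(se))

# Synthetic illustration: 300 runs of an estimator with true SD 0.05.
rng = np.random.default_rng(0)
beta_hats = rng.normal(loc=1.0, scale=0.05, size=300)
se_hats = np.full(300, 0.05)
emp_se, model_se = se_comparison(beta_hats, se_hats)
```

A model-based SE that tracks the empirical SE, as in the rows with ρ estimated, indicates that the variance formula is well calibrated.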
7.2 The Simulation Study for the Score Test
We next conducted a simulation study to evaluate the performance of the proposed variance component score test for H0 : h(·) = 0 versus
. The true model is the same as (19), where x and the z's were generated in the same way as in Section 7.1 and
and a= 0, 0.2, 0.4, 0.6, 0.8, 1. We studied the size of the test by generating data under a= 0, and studied the power by increasing a. The kernel parameter ρ was fixed at a wide range of values: 0.5, 1, 5, 10, 25, 50, 100, 200. The sample size was 60, mimicking the PSA data example. For the size calculations, the number of simulations was 2000, whereas for the power calculations, the number of runs was 1000.
Table 4 reports the empirical size (a = 0) and power (a > 0) of the variance component score test for H0. The results show that the size of the test was very close to the nominal value 0.05 and was not sensitive to the choice of the scale parameter ρ. As a increased, the power quickly approached 1. The power was not much affected by the value of ρ if a moderate ρ was specified, but was more affected if a large value of ρ was specified.
Table 4. Simulation results for the score test for H0: h(z) = 0

| Scale ρ | Size, a = 0 | Power, a = 0.2 | Power, a = 0.4 | Power, a = 0.6 | Power, a = 0.8 | Power, a = 1.0 |
|---|---|---|---|---|---|---|
| 0.5 | 0.050 | 0.158 | 0.487 | 0.865 | 0.989 | 1.000 |
| 1 | 0.047 | 0.137 | 0.509 | 0.869 | 0.991 | 1.000 |
| 5 | 0.050 | 0.127 | 0.482 | 0.865 | 0.987 | 1.000 |
| 25 | 0.051 | 0.139 | 0.484 | 0.886 | 0.990 | 1.000 |
| 50 | 0.046 | 0.138 | 0.508 | 0.863 | 0.990 | 1.000 |
| 100 | 0.048 | 0.134 | 0.497 | 0.867 | 0.988 | 1.000 |
| 200 | 0.054 | 0.148 | 0.494 | 0.874 | 0.991 | 1.000 |
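Empirical size and power are rejection proportions over the simulation runs, so their Monte Carlo uncertainty follows from the binomial variance. The sketch below is illustrative only: the helper names are hypothetical, and rejecting when the p-value is below the nominal level is an assumption about how the test decision was recorded.

```python
import math

def rejection_rate(p_values, level=0.05):
    """Proportion of runs in which the test rejects at the given level."""
    return sum(p < level for p in p_values) / len(p_values)

def mc_standard_error(rate, n_runs):
    """Binomial Monte Carlo standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_runs)

# Size simulations: 2000 runs at nominal level 0.05.
se_size = mc_standard_error(0.05, 2000)   # roughly 0.0049
```

With 2000 runs the standard error at the nominal level is about 0.0049, so the empirical sizes in Table 4, which range from 0.046 to 0.054, all lie within one Monte Carlo standard error of 0.05.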
7.3 The Simulation Study for Kernel Selection
A simulation study was also conducted to assess the performance of kernel selection using the kernel machine AIC and BIC criteria. The true model we considered is
where e ∼ N(0, 1) and x was generated as x = 3cos(z1) + 2u, with u independent of z1. All u and zj (j = 1, … , 5) were generated from N(0, 1). The sample size was 50, and the number of runs was 300. Three types of kernel functions were used in the simulation: the Gaussian kernel K(u, v) = exp(−‖u − v‖²/ρ), the second-degree polynomial kernel K(u, v) = (uᵀv + 1)², and the first-degree polynomial kernel K(u, v) = uᵀv, which corresponds to ridge regression. For each simulated data set, the AIC and the BIC were calculated for the model under each of the three kernels.
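The three kernels compared here can be written as Gram-matrix functions. The following is a minimal NumPy sketch; the function names and the value ρ = 5 in the usage lines are illustrative assumptions, and the inputs are only a stand-in for the simulated z's.

```python
import numpy as np

def gaussian_kernel(U, V, rho):
    """K(u, v) = exp(-||u - v||^2 / rho) for all row pairs of U and V."""
    sq = (np.sum(U**2, axis=1)[:, None] + np.sum(V**2, axis=1)[None, :]
          - 2.0 * U @ V.T)
    return np.exp(-np.maximum(sq, 0.0) / rho)   # clip tiny negative round-off

def poly2_kernel(U, V):
    """Second-degree polynomial kernel K(u, v) = (u'v + 1)^2."""
    return (U @ V.T + 1.0) ** 2

def linear_kernel(U, V):
    """First-degree polynomial kernel K(u, v) = u'v (ridge regression)."""
    return U @ V.T

# Illustrative use: n = 50 observations of five N(0, 1) covariates.
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 5))
K_gauss = gaussian_kernel(Z, Z, rho=5.0)
K_poly = poly2_kernel(Z, Z)
K_lin = linear_kernel(Z, Z)
```

Each function returns the n × n matrix K with entries K(zi, zj), which is the object a kernel machine fit would take as input.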
The mean AIC and BIC across the 300 simulations are 190.79 (51.31) and 284.21 (50.21), respectively, for the Gaussian kernel (the numbers in parentheses are standard deviations); 269.07 (10.00) and 308.91 (9.58), respectively, for the second-degree polynomial kernel; and 363.67 (2.63) and 371.61 (2.51), respectively, for ridge regression. The AIC and BIC values from each simulated data set are plotted in Figures 1 and 2. These results show that the kernel machine AIC and BIC of the model with the Gaussian kernel are the smallest, whereas those of ridge regression are the largest. Hence the Gaussian kernel is preferred to both the second-degree polynomial kernel and the ridge regression kernel, which is desirable in light of the complicated functional forms of the z's.