Accurate Prediction and Reliable Parameter Optimization of Neural Network for Semiconductor Process Monitoring and Technology Development

Herein, novel neural network (NN) methods that improve prediction accuracy and reduce the output variance of gradient-optimized inputs for cross-sectional data are proposed, together with an approach for evaluating the variability of optimized inputs in the semiconductor process. Specifically, electrical parameter measurements (EPMs) and the power-delay product of industrial high-k metal gate DRAM peripheral 29-stage ring oscillator circuits, including NMOS, PMOS, and interconnects, are considered. The proposed methods find optimized inputs that achieve a lower NN output variance under gradient descent than one multilayer perceptron (MLP) or a mean ensemble of MLPs, even when the variabilities of the devices and interconnects are considered. The local optima problem of one MLP is resolved by utilizing multiple MLPs trained with different training/validation data, their trimmed mean, and an additional learnable layer. Moreover, the learnable layer secures versatility across various parametric datasets. The methods improve the prediction accuracy (R²) by 5.6–15.6% in sparse data space compared to one MLP and the mean ensemble, decrease the NN output variance of the optimized input by 73.0–81.6% compared to one MLP and the mean ensemble, and are successfully verified on EPMs of 3977 test patterns from 314 wafers and 16 lots.


Introduction
Neural network (NN) applications have been widely studied for prediction and optimization in semiconductor technology development.[3,4] However, in a semiconductor process that requires high costs and precise yield control, decision-making based on unreliable optimization and inaccurate process monitoring can have detrimental consequences. Therefore, verification of the results derived from NN applications is extremely crucial. In particular, because the electrical characteristics of manufactured devices have a nonuniform distribution that includes uncontrollable variability,[5,6] the data density is not constant across the entire range, and sparse data space inevitably exists where the accuracy of NN prediction can degrade.
Moreover, in the sparse data space, splitting the training/validation datasets to prevent overfitting of an NN reduces the amount of training data, further magnifying the prediction error in that space. As a result, the prediction accuracy of an NN in the sparse data space varies depending on the training indices, which means that the nodes of the NN do not reach the global minimum of the loss in the training process but only a local minimum for a given training dataset, because the training dataset does not represent the entire data space in a balanced manner.
Furthermore, if we apply gradient descent[7,8] to a single NN (one multilayer perceptron [MLP]) to optimize the input for a low output in the sparse data space, the optimal input exhibits an output deviation that depends on the training indices of the NN. For example, Table 1 contains five input sets optimized by five MLPs with different training/validation datasets and the five corresponding outputs, the power-delay product (PDP) of the 29-stage ring oscillator,[9] estimated by the MLPs. Each MLP outputs a much lower PDP than the other MLPs when the input is optimized by that MLP itself. In other words, the input optimized by one MLP occasionally has significant output variance, and this characteristic of using one MLP degrades the reliability of the optimization. This phenomenon is especially exacerbated in the sparse data space and is not resolved by the mean ensemble alone,[10,11] as discussed in the following section.
In this study, to reduce the prediction error due to the local minima of NN nodes and to derive an optimal input with low output variance in the sparse data space, we propose a network ensemble[10,11] using multiple MLPs trained on different training/validation datasets, a trimmed-mean function, and a learnable layer as a final decision model that can be adapted to the dataset. Additionally, we introduce a variability evaluation approach for the optimized input and use it to evaluate the variability of the optimization results of the proposed NN methods. We apply these approaches to the electrical parameter measurements (EPMs) and PDP of industrial high-k metal gate (HKMG) DRAM 29-stage peripheral ring oscillator circuits (3977 test patterns of 314 wafers and 16 lots); the overall processes are depicted in Figure 1. First, the EPMs with multicollinearity are preprocessed through principal component analysis (PCA) or nonlinear PCA,[12,13] resulting in three groups (NMOS, PMOS, and interconnects) with multicollinearity and preprocessed components (inputs) for each group. Subsequently, the NNs learn the correlation between the preprocessed EPMs (input) and the PDP (output) with different training indices, and the EPMs are optimized for a low PDP. In this process, we verify the PDP-distributional improvement of the optimized EPM by generating random variables (RVs) that follow a Gaussian mixture model with equal weights, the mean of the EPM optimized by each method, and a tied covariance matrix that includes the average variance and covariance of each lot.[14] Moreover, we employ k-nearest neighbor (k-NN) density estimation to verify that the variance of the outputs from multiple NNs increases as the data density decreases.[15,16] The proposed methods decrease the PDP-estimation variance of the optimized EPMs below that of simply using one MLP. Finally, we compare the results of using one MLP, the mean ensemble, and the proposed ensembles with the trimmed mean and final decision model (FDM).
The contributions of this work are as follows: 1) The examined algorithmic network ensemble with trimmed mean improves the prediction errors and prevents NNs from outputting local optima in the sparse data space. 2) The proposed FDM optimizes the PDP more precisely than the trim-mean method and can be generically utilized for different types of data by making the output layer of the trim-mean method learnable.
3) We verify that the variance of the outputs from multiple NNs increases in the sparse data space and that the proposed methods derive reliable optimization results with lower output variance than one MLP and the mean ensemble, even when all variabilities that occur in the semiconductor process are considered. 4) The proposed methods enable accurate process monitoring and correct decision-making in semiconductor manufacturing with low data density. The increase in the costs associated with training, prediction, and optimization of NNs is smaller than the cost incurred by incorrect decision-making based on unreliable optimization. In addition, the defects that process monitoring aims to detect typically occur in the sparse data space, far from the characteristics of well-functioning devices or circuits, and the proposed methods improve the prediction accuracy in that space.

Related Work

Local Minima Reduction
Previous studies have reported learning without local minima in linear feedforward NNs using PCA, whereas the gradient method can get stuck in local minima in the backpropagation of multilayered NNs (MLNs).[17] In addition, it has been demonstrated that these local minima commonly occur with piecewise linear activations, such as ReLU.[18] Naftaly et al.[10] demonstrated that using the averaged predictions of multiple independently trained NNs as the final prediction can prevent local minima by minimizing the variance of the prediction values. However, it is difficult to guarantee good performance by simply averaging various predictions when dealing with sparse data. Krizhevsky et al.[11] produced the final prediction by passing the prediction results of multiple NNs through learnable layers. This approach enables the generalization of results from multiple NNs and can enhance network performance by enforcing nonlinearity on predictions that may be inaccurate in the aforementioned sparse data space.

Prediction and Optimization of Device Characteristics
Previous prediction or optimization studies using machine learning (ML) in technology development have focused on NN modeling for predicting the electrical characteristics of transistors or cells for given structural parameters or physical properties. For instance, several studies have been conducted to predict and optimize various aspects of device performance, such as power and delay for 32 nm node HKMG transistors,[1] RC delay and RF figure-of-merit (FoM) for vertical nanowire field-effect transistors,[19] threshold voltage (V_th) and V_th window for charge trap nitride (CTN),[20] and on-current (I_on), V_th, and subthreshold swing (SS) for ferroelectric transistors.[2] Although these works were based on the prediction of one MLP and optimization using gradient descent, prediction failure and local minima problems in sparse data space did not occur because the datasets used were ideally generated by TCAD simulation. However, in real semiconductor manufacturing, EPMs have a distribution that includes uncontrollable variations around the median, unlike an ideal dataset. In general, EPMs can contain sparse data space where impractical predictions and local minima of one MLP may occur, because the variability of manufactured devices does not always have a uniform probability density over the variation range. EPMs with multicollinearity, which have a high variance inflation factor (VIF), were transformed through PCA and nonlinear PCA.[12,13] In addition, the I_off values of NMOS and PMOS, which span an exponential scale, were log-normalized to bring their scale in line with the other parameters. Finally, the preprocessed EPMs were set as the input, and the PDP of the 29-stage R. O. was set as the output of the NNs.
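For illustration, the following is a minimal sketch of this preprocessing under stated assumptions: the grouping of columns, the indices of the log-normalized I_off columns, and the function name are hypothetical, and nonlinear PCA is omitted for brevity.

```python
# A hedged sketch of the EPM preprocessing described above: log-normalize the
# exponential-scale I_off columns, standardize, and decorrelate one
# multicollinear group with PCA. Column indices are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def preprocess_group(epm_group, log_cols=()):
    X = epm_group.astype(float).copy()
    X[:, list(log_cols)] = np.log10(X[:, list(log_cols)])  # I_off on log scale
    X = (X - X.mean(axis=0)) / X.std(axis=0)                # standardize each EPM
    return PCA().fit_transform(X)                           # decorrelated components
```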

Methodology
EPM and PDP
EPMs consist of the DC characteristics of NMOS and PMOS devices, such as the on/off currents (I_on/I_off) and threshold voltage (V_th), and AC characteristics, such as interconnect RC and parasitic capacitances. Part of the EPMs is shown in Figure 2. After preprocessing to remove outliers outside the range of −5 to +5 standard deviations in each EPM distribution, 3187 data points were used in the experiment.
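A minimal sketch of this ±5σ filter, assuming the EPMs are held in a NumPy array with one column per parameter:

```python
# A minimal sketch of the ±5 standard-deviation outlier filter applied per EPM.
import numpy as np

def filter_outliers(epm, n_sigma=5.0):
    z = (epm - epm.mean(axis=0)) / epm.std(axis=0)  # per-EPM z-scores
    keep = (np.abs(z) <= n_sigma).all(axis=1)       # keep rows within ±5σ in every EPM
    return epm[keep]
```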

Network Training and Variance of Prediction
To prevent one MLP from making impractical predictions with large errors in the sparse data space of the training dataset, as listed in Table 1, we used multiple MLPs with randomly different training/validation datasets. To quantify the sparseness, we adopted the k-NN density estimate p(x) (Equation (1)) as an indicator of the parametric density,[15,16]

p(x) = k/(n · V_d(x))   (1)

where V_d(x) is the volume of the d-dimensional hypersphere whose radius is the distance from x to its kth nearest neighbor, d denotes the dimensionality of the EPMs, n indicates the number of samples (2549 training samples), and k is typically adopted as n^(1/2). Additionally, we evaluated the optimization reliability by comparing the prediction variances for the EPMs optimized by one MLP and by the proposed methods.
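A minimal sketch of this density estimate; the Euclidean metric and the brute-force distance computation are assumptions made for clarity.

```python
# A hedged sketch of the k-NN density estimate p(x) = k / (n * V_d(x)) in
# Equation (1). The Euclidean metric is an assumption; note that if the query
# points belong to the training set, the zero self-distance is included.
import numpy as np
from scipy.special import gammaln

def knn_density(train, query, k=None):
    n, d = train.shape
    k = k or int(np.sqrt(n))                       # k is typically n**0.5
    # Distance from every query point to every training sample (brute force)
    dists = np.linalg.norm(query[:, None, :] - train[None, :, :], axis=-1)
    r_k = np.sort(dists, axis=1)[:, k - 1]         # radius to the kth neighbor
    # Log-volume of a d-dimensional hypersphere of radius r_k
    log_vol = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r_k)
    return np.exp(np.log(k) - np.log(n) - log_vol)
```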

Network Ensemble with Trim-Mean
Trained MLPs tended to predict inaccurately because of the lack of data variety in the sparse data space. To exclude such imprecise predictions from the final output, we propose a network ensemble with a trimmed mean that trims the top and bottom quarters of the multiple MLPs' predictions and averages the remaining predictions, as shown in Figure 3. Here, outliers impractically predicted by some MLPs in the sparse data space are excluded from the final prediction through trimming.
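A minimal sketch of this trimmed-mean ensemble (equivalently, scipy.stats.trim_mean with proportiontocut=0.25):

```python
# A minimal sketch of the trim-mean ensemble: discard the top and bottom
# quarters of the N MLP predictions and average the remaining half.
import numpy as np

def trim_mean_ensemble(preds):
    """preds: (n_models, n_samples) array of per-MLP PDP predictions."""
    n_models = preds.shape[0]
    q = n_models // 4                      # quarter trimmed from each tail
    sorted_preds = np.sort(preds, axis=0)  # sort predictions per sample
    return sorted_preds[q:n_models - q].mean(axis=0)
```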

Trim-FDM-Mean
Although the aforementioned trim-mean method makes it extremely easy to remove some outliers and raises the prediction accuracy even higher, taking the mean after the trim cannot produce a value close to the ground truth (GT) when the GT is located outside the trim section or close to the boundary of the trim range. Therefore, we suggest inserting the FDM between the trim and mean operations (trim-FDM-mean), which can flexibly output values, including a simple mean value, and can be universally applied to various datasets. The detailed trim-FDM-mean process is shown in Figure 4. The procedures of trim-FDM-mean before trimming are the same as those of the trim-mean method. The FDM is an MLP composed of three layers that produces N/2 outputs (Final-PDP) from the N/2 outputs (Pre-PDP) obtained by trimming the outputs of the N MLPs. The FDM can consider the nonlinearity of the trimmed outputs of the multiple MLPs by adding layers after the Pre-PDP produced by the trim alone, resulting in a more flexible Final-PDP.
In trim-FDM-mean, all MLPs are trained using random training/validation combinations; therefore, the ensemble network uses all datasets (training + validation) except the test dataset for training. Consequently, all data other than the test dataset are used for training the FDM. The MSE between the GT and the value predicted by trim-FDM-mean was used as the loss function during the training of the FDM.
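A hedged PyTorch sketch of the trim-FDM-mean pipeline; the exact layer widths and activation placement of the three-layer FDM are assumptions based on the description above (a learnable layer with a sigmoid function).

```python
# A hedged sketch of trim-FDM-mean: trim N MLP outputs to N/2 (Pre-PDP), pass
# them through the FDM to obtain N/2 refined outputs (Final-PDP), then average.
# Layer sizes and the exact activation placement are assumptions.
import torch
import torch.nn as nn

def trim(preds):                                  # preds: (batch, N) MLP outputs
    n = preds.shape[1]
    q = n // 4
    return torch.sort(preds, dim=1).values[:, q:n - q]   # Pre-PDP, (batch, N/2)

class FDM(nn.Module):
    def __init__(self, n_half):
        super().__init__()
        self.net = nn.Sequential(                 # assumed three-layer structure
            nn.Linear(n_half, n_half), nn.Sigmoid(),
            nn.Linear(n_half, n_half),
        )

    def forward(self, pre_pdp):
        final_pdp = self.net(pre_pdp)             # Final-PDP, (batch, N/2)
        return final_pdp.mean(dim=1)              # trim-FDM-mean prediction
```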

Automatic Optimization of EPM within Manufactured Range
The PDP of the DRAM peri. R. O. circuits can be optimized backwardly using gradient descent and a limiter covering the entire input range, as shown in Figure 5.[1] First, randomized input parameters were generated as an initial guess within the range of the preprocessed EPMs. The trained MLP f then predicted the numerical gradient with respect to the latent variables L_z, which were updated by subtracting the gradient (Equation (2)), while the limiter function C(L_z) (Equation (3)) kept the input within its entire range,[1]

L_z ← L_z − η · ∂f(C(L_z))/∂L_z   (2)

C(L_z)_i = z_i^min + (z_i^max − z_i^min) · σ(L_z,i)   (3)

where i denotes the index of the EPM components, η is the step size, and C(L_z) is a differentiable function defined by the maximum and minimum values (z_i^max, z_i^min) of the EPM components with the sigmoid function σ (Equation (4)).[21]

σ(x) = 1/(1 + e^(−x))   (4)

The latent variables and limiter function guarantee an optimized input in the desired range.
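A hedged PyTorch sketch of this loop; autograd replaces the numerical gradient for brevity, and the step size, step count, and network interface are illustrative assumptions.

```python
# A hedged sketch of the EPM optimization with the sigmoid limiter C(L_z).
# Autograd stands in for the numerical gradient, and the learning rate, step
# count, and predict_pdp interface are assumptions.
import torch

def optimize_epm(predict_pdp, z_min, z_max, steps=1000, lr=1e-2):
    L_z = torch.randn(z_min.shape[0], requires_grad=True)   # random initial guess
    for _ in range(steps):
        z = z_min + (z_max - z_min) * torch.sigmoid(L_z)    # limiter C(L_z), Eq. (3)
        pdp = predict_pdp(z.unsqueeze(0)).squeeze()         # predicted normalized PDP
        pdp.backward()                                      # gradient w.r.t. L_z
        with torch.no_grad():
            L_z -= lr * L_z.grad                            # descent step, Eq. (2)
            L_z.grad.zero_()
    with torch.no_grad():
        return z_min + (z_max - z_min) * torch.sigmoid(L_z) # optimized EPM components
```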

RV Generation for Verification of Variability
Manufactured R. O. circuits targeting the EPM optimized for a low PDP have a distribution of electrical characteristics, including variance and covariance due to variabilities in the transistors and interconnects.[6] Therefore, if some points near the optimal point have a much higher PDP than the optimal PDP in the distribution, the optimal point is unstable and carries the risk of increasing the defect rate. Moreover, the EPM distributions of each lot in Figure 2 are similar to Gaussian mixtures having the covariance matrix of the lots and the mean of each lot's median, with an average Bhattacharyya coefficient (BC) of 0.9323 and a BC standard deviation of 0.0134 between each other.[22] Therefore, we generated RVs reflecting the variabilities of the manufactured R. O. circuits around the mean of the optimized EPM and verified their predicted PDP distribution with the 100 trained MLPs. Consequently, 325 RVs x (1 lot = 13 test patterns × 25 wafers) were generated following the Gaussian mixture model λ = {w, EPM_opt, Σ}, with equal weights w, mean μ at EPM_opt, and a common covariance matrix Σ that includes the average variance σ²_lot and covariance cov_lot of each lot in Figure 2.
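A minimal NumPy sketch of this RV generation; because all components share the optimized EPM as their mean and the tied lot covariance, the draw collapses to a single multivariate normal, which is an assumption made for brevity.

```python
# A hedged sketch of the RV generation: one lot of 13 test patterns x 25 wafers
# = 325 RVs drawn around the optimized EPM with the tied lot covariance matrix.
import numpy as np

def generate_lot_rvs(epm_opt, sigma_lot, n_patterns=13, n_wafers=25, seed=0):
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean=epm_opt, cov=sigma_lot,
                                   size=n_patterns * n_wafers)   # (325, d) RVs
```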

Experimental Section
Experiments were performed to compare the prediction and optimization results of one MLP, the mean ensemble, trim-mean, and trim-FDM-mean. First, each of the 100 MLPs was trained with randomly divided training and validation datasets, as described in Section 3.2. Then, we selected the MLP with the lowest loss on the training and validation datasets for the one-MLP method. Moreover, we used the 100 MLPs for the mean ensemble, 30 MLPs for the trim-mean method, and 80 MLPs for the trim-FDM-mean method, which showed the lowest training and validation losses for each method. Each MLP used in the experiments was trained in the MATLAB console with a Levenberg-Marquardt (LM) optimizer on a desktop computer with an Intel i7-8700 (hexa-core, 4.2 GHz) CPU and 32 GB RAM.[23,24] In addition, the FDM was trained with an LM optimizer with a batch size of 2048 on the framework of PyTorch 1.9, CUDA 11.3, and one RTX 3090 GPU. The learning rate of LM was 1e-4, and the maximum number of epochs was 1000 for the training of the MLPs. On the other hand, the maximum number of epochs for the training of the FDM was 10 because it uses the outputs of the pretrained MLPs as its input and consists of a linear layer with a sigmoid function.

Prediction of PDP
One MLP, the mean ensemble, trim-mean, and trim-FDM-mean used the preprocessed EPMs as inputs and predicted the PDP of the R. O. circuit. The prediction results were evaluated using the test dataset. Table 2 presents the PDP errors predicted by one MLP, the mean ensemble, trim-mean, and trim-FDM-mean. The trim-mean and trim-FDM-mean exhibit prediction performance in the R² measurement improved over one MLP by 8.5–8.7% and over the mean ensemble by 1.5–1.7% on the test dataset. Likewise, both proposed methods show prediction performance in the R² measurement improved over one MLP by 15.0–15.6% and over the mean ensemble by 5.6–6.1% on the bottom one-third of the k-NN density in the test dataset.
We compared the prediction results of the methods in the bottom one-third of the PDP (Figure 6a) and of the k-NN density (Figure 6b). The mean ensemble, trim-mean, and trim-FDM-mean have smaller overall errors than one MLP. As the objective of the following optimization is to minimize the PDP, and most of the defects in process monitoring are detected in the sparse data space, prediction accuracies in the low-PDP and low-k-NN-density regions are crucial. In the two regions, the trim-FDM-mean performed the most accurate prediction. Additionally, the trim-mean and trim-FDM-mean performed better prediction than one MLP and the mean ensemble in the bottom one-third of the k-NN density, the sparse data space. Furthermore, we calculated the variance of the PDP values estimated by the 100 MLPs trained with different training indices and illustrate the relationship between the estimation variance and the data density in Figure 7. It shows that the data density and the variance of the 100 MLP estimations are negatively correlated, and MLPs with different training indices tend to make different predictions in the sparse data space. As different predictions of the MLPs for one input point deteriorate the reliability of the prediction, we examine the data density of the derived optimal EPMs and the variance of their 100-MLP-outputs in Section 4.2.

Optimization for Low PDP
Although semiconductor manufacturing necessitates validating the optimized EPM and PDP, doing so is costly and requires a long turn-around time (TAT). Furthermore, even if the actual process for the optimized EPM is implemented, we obtain an R. O. circuit distribution with the mean of the optimized EPM and the variabilities of the devices and interconnects, not the circuit at the optimized EPM point. Therefore, GT verification of the optimized EPM and PDP is practically impossible, and we instead ascertain the difference between the PDP prediction for the optimized EPM of each method and the mean output of 100 pretrained MLPs with high test scores. Meanwhile, as shown in Section 4.1, the MLPs output different values for the same input from network to network, and the input point optimized by one MLP also differs from network to network. This characteristic degrades the reliability of the optimization because the predictions of the MLPs for the optimal inputs have a large deviation. Therefore, we evaluate the convergence of the optimization results by comparing the mean, deviation, and distribution width (the range from the minimal value to the maximal value) of the PDP estimations from the 100 pretrained MLPs among the optima of one MLP, the mean ensemble, and the proposed methods.
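A minimal sketch of these reliability metrics for one optimized input; the function name is illustrative.

```python
# A minimal sketch of the optimization-reliability metrics: the mean, standard
# deviation, and distribution width of the 100 pretrained MLPs' PDP estimates
# for a single optimized input.
import numpy as np

def reliability_stats(outputs):      # outputs: (100,) PDP estimates for one input
    return {"mean": outputs.mean(),
            "std": outputs.std(),
            "width": outputs.max() - outputs.min()}   # max-min distribution width
```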
Table 3 lists the optimization results for each method. The trim-mean obtained a normalized PDP of 0.8548 with an error of −1.0% as an optimization result, and the trim-FDM-mean derived a normalized PDP of 0.8521 with an error of −0.7%. Both proposed methods improved the PDPs by 6.9% and 7.2% compared to one MLP, with prominently lower errors than the −20.2% of one MLP. The mean ensemble obtained a normalized PDP of 0.8452 with zero error because the error indicates the difference between the output of each method and the mean of the 100-MLP-outputs. In addition, the variances of the PDP estimations from the 100 MLPs were the lowest for the trim-mean, 7.577e-4, and the second lowest for the trim-FDM-mean, 7.583e-4, both of which were lower by 89.2% than that of one MLP, 7.015e-3, and by 87.4% than that of the mean ensemble, 6.007e-3. Similarly, the distribution width, indicating the largest difference between the PDP estimations, was the lowest for the trim-FDM-mean, 0.1318, and the second lowest for the trim-mean, 0.1790. Both proposed methods had distribution widths lower by 81.6% and 75.0% than that of one MLP, 0.7161, and by 80.1% and 73.0% than that of the mean ensemble. Furthermore, the optimization results have a lower k-NN density in Table 3 than the average logarithmic value in Figure 7, implying that all methods derived optimization results in the sparse data space.

Table 3. Optimization results of one MLP, mean ensemble, and proposed methods. In the table, the mean of 100 MLPs and the variance of 100 MLPs denote the average and standard deviation of the 100-MLP-outputs for the optimized input, respectively. The output denotes the output value of each method for the optimized input. The bold text indicates the best optimization reliability, that is, the lowest output variance and distribution width or the highest k-NN density.
Figure 8a-d shows the distributions of the PDP estimations (outputs) from the 100 MLPs together with the optimal value (red dotted line) estimated by each method. The inputs optimized by one MLP and the mean ensemble have wide distributions of 100-MLP-outputs in Figure 8a,b. In addition, the optimal PDPs predicted by one MLP and the mean ensemble were biased toward the left sides of the distributions, degrading the optimization reliability. The gradient descent optimization of one MLP derives an input with the minimal output of that MLP. However, the outputs for the input optimized by one MLP are not estimated to be as low by most of the other MLPs. For example, most of the other MLPs have different inputs, with each minimal output located at a local minimum of the respective MLP. Therefore, the optimization of one MLP derived an unreliable input with a large negative output error, and the optimization of the mean ensemble depended on a few MLPs with this characteristic.
In contrast, the inputs optimized by both proposed methods have narrow distributions of 100-MLP-outputs, and the optimal PDPs predicted by the methods (red dotted lines) are close to the distribution centers in Figure 8c,d. This implies that the inputs optimized by trim-mean and trim-FDM-mean are not biased toward the sides of their distributions of 100-MLP-outputs, that is, toward the local minima of some MLPs. The proposed methods trim away the local minima of a few MLPs and are independent of the characteristics of any one MLP. Thus, it is demonstrated that optimization using the proposed methods can avoid improper optimization caused by local optima in the sparse dataset region, which occurs when employing one MLP. In conclusion, both proposed methods performed more accurate and reliable optimization.
Meanwhile, to verify the PDP distribution of the optimized EPMs reflecting the variabilities of the NMOS, PMOS, and interconnects, we generated 325 RVs of one lot based on each optimal input point, as described in Section 3.6 and shown in Figure 9, constrained within the entire range of the total preprocessed EPMs.
In addition, the box plots in Figure 10 indicate the mean and standard deviation of the 100-MLP-outputs for the RV inputs over 100 rounds of RV generation. The RVs derived by the mean ensemble and the proposed methods had much lower means of 100-MLP-outputs than the RVs derived by one MLP, implying that the mean ensemble and the proposed methods outperform one MLP in the optimization. However, the RVs derived by the mean ensemble exhibit a much larger standard deviation and distribution width than those of the proposed methods, resulting in unreliable optimization under the variabilities. In contrast, the proposed methods drastically decrease the distribution width of the 100-MLP-outputs for RV inputs derived by the methods compared to one MLP and the mean ensemble. Therefore, it is verified that the reliability of the PDP optimization is improved even when all the variabilities of the devices and interconnects are reflected.

Comparison between Conventional and Proposed Methods
In this section, we compare the average inference time and the prediction and optimization performance of traditional non-deep-learning algorithms (k-means clustering and support vector regression [SVR]), one MLP, and the proposed NN methods.[25,26] In addition, we analyze their contributions from the perspective of the semiconductor process. Table 4 contains the average inference time for predicting the PDP with each method. In this comparison, a single NVIDIA RTX 3090 GPU and the PyTorch framework were used, and the values were averaged over all test sets. The TAT of the optimization is proportional to the inference time; therefore, comparing the PDP prediction times is significant for the evaluation of our proposed methods.
One MLP was the fastest to predict the PDP, at 0.0848 ms, with the calculation of only one network. The PDP was predicted using k-means clustering and SVR with average inference times of 0.2246 and 0.1523 ms, respectively. The average inference times of the mean ensemble, trim-mean, and trim-FDM-mean were comparatively longer than that of one MLP because they used 100, 30, and 80 MLPs, respectively. Trim-FDM-mean had the second longest average inference time, but it predicted the PDP in 5.7577 ms, which is sufficiently fast for real-time use in the semiconductor process.
Figure 11 compares the prediction performance (MAE) in the sparse data space of Figure 6b and the optimization performance (y-coordinate) in Table 3 against the average inference time (x-coordinate). Here, k-means clustering and SVR showed poor prediction performance, with R² of 0.2946 and 0.5637 and MAE of 0.0180 and 0.0133, so they were excluded from the comparison. Although the one-MLP method was rapid, the absolute prediction error and the distribution width of the input optimized by one MLP were significantly larger than those of the proposed methods because the optimized input set was a local optimum mistaken by one MLP. Likewise, the mean ensemble exhibits a much larger distribution width than the proposed methods. In contrast, the proposed methods optimized inputs with substantially lower absolute prediction errors and distribution widths at a slightly slower inference time. The trim-FDM-mean provided the best PDP prediction accuracy among all methods, and the trim-mean had the second-best prediction accuracy with a significantly shorter inference time than the trim-FDM-mean. In conclusion, the trim-mean method is the most suitable for process monitoring, which demands accurate prediction in the sparse data space within a short period, whereas the trim-FDM-mean method is the most appropriate for optimizing the semiconductor process, which demands small errors and high stability of the optimized PDP.

Conclusion
We propose NN methods for accurate prediction and optimization in industrial HKMG DRAM peri. 29-stage R. O. circuits with sparse data density. The trim-mean method prevents the local optima of one MLP by trimming predictions located in the tails of the distribution. The trim-FDM-mean method achieves higher accuracy than the trim-mean and can be applied to diverse types of random distributions through an additional learnable layer.
The proposed methods exhibit less prediction error in all output regions compared to one MLP and the mean ensemble, with improvements in R² of 8.5–8.7% and 1.5–1.7% on the test dataset, and of 15.0–15.6% and 5.6–6.1% on the bottom one-third sparse test dataset, respectively. Moreover, we confirm that the proposed methods enable more precise and reliable optimization than one MLP and the mean ensemble, with improvements in the 100-MLP-outputs variance of the optimized input of 89.2% and 87.4%, and in the distribution width of the 100-MLP-outputs of the optimized input of 75.0–81.6% and 73.0–80.1%, respectively. The improvements of the optimization result from the outliers among the outputs being disregarded in the proposed methods. Furthermore, we verify the PDP optimization under the average lot variabilities of the NMOS, PMOS, and interconnects by confirming the improved PDP distribution of RVs reflecting the average variance and covariance of each ring oscillator circuit lot. The methods proposed in this study enable precise prediction of the PDP in the sparse data space for process monitoring and correct decision-making in device/circuit design through reliable optimization for technology development.

Figure 1. Overall processes of the approaches. The proposed NN methods focus on a) the EPM and PDP of the 29-stage ring oscillator circuit and ensure b) accurate prediction in the sparse data space and c) reliable optimization with low PDP and PDP variance even when reflecting all variabilities.


Figure 2. Matrix scatter plot of several EPMs and histogram of each EPM (diagonal). The light blue dots include 16 lots of DRAM peri. devices, and the black, magenta, green, and red dots each represent a single lot.

Figure 3. Proposed network structure and data ratios of training, validation, and test sets. The normalized PDP is estimated by the trim-mean of the outputs of N MLPs.

Figure 5. Optimization of the EPM for a low PDP using the trim-mean method. The pretrained network estimates the normalized PDP, and the limiter C(L_z) constrains the EPM components within the manufacturing variation.

Figure 6. Comparison of prediction errors (test dataset) in a) the bottom one-third of the PDP and b) the bottom one-third of the k-NN density among one MLP, the mean ensemble, and the proposed methods. All dots represent absolute errors sorted in ascending order. The values in parentheses denote the mean absolute error (MAE).

Figure 7. Variance of the PDP estimations versus the k-NN density p(x) (Equation (1)) for the training dataset. The mean of log10(p(x)) for all training data is −20.15.

Figure 8. PDP distributions of the 100-MLP-outputs when the MLPs take inputs optimized by a) one MLP, b) the mean ensemble, c) trim-mean, and d) trim-FDM-mean. Each red dotted line indicates the output estimated by each method, and the statistics of the distributions are set out in Table 3.

Figure 10. PDP values of the optimal EPMs and of the RVs based on the optimal EPMs estimated by each method. Crosses indicate the estimation of each colored method for the input optimized by that method (x-coordinate). Box plots show the average and standard deviation of the 100-MLP-outputs, and error bars show the ranges of the 100-MLP-outputs for RV inputs based on the optimal EPM of each method (x-coordinate) over 100 rounds.

Table 4. Comparison of the average inference times over all test datasets for k-means clustering, SVR, one MLP, mean ensemble, trim-mean, and trim-FDM-mean. The values in parentheses represent the standard deviations of the inference times.

Figure 9. Matrix scatter plot of some EPMs (off-diagonal) and histogram of each EPM (diagonal) with the RVs generated based on each optimal EPM. The light blue dots include 16 lots, and the red (one MLP), yellow (mean ensemble), green (trim-mean), and blue (trim-FDM-mean) dots represent the RVs generated based on each optimal EPM. The white dots represent each optimal EPM centered within its RVs.

Table 1. Optimization results of the five MLPs for a low PDP (width: the difference between the highest and lowest PDP values).

Table 2. Comparison of the prediction performance (MSE, R²) among one MLP, the mean ensemble, and the proposed methods. Here, the sparse data represent the bottom one-third of the k-NN density. The bold text indicates the best prediction performance for each dataset, that is, the lowest MSE or the highest R².