Prediction of missing values: a comparison of approaches

We compare four numerical methods for the prediction of missing values in four different datasets [1]: 1) hierarchical maximum likelihood estimation (ℋ‐MLE), and three machine learning (ML) methods, namely 2) k‐nearest neighbors (kNN), 3) random forest, and 4) deep neural network (DNN). Among the ML methods, the best results (for the considered datasets) were obtained by the kNN method with three (or seven) neighbors. On one dataset, the ℋ‐MLE method showed a smaller error than the kNN method, whereas on another the kNN method was better. The ℋ‐MLE method requires a large amount of linear algebra computations and works well on almost all datasets. Its results can be improved by taking a smaller threshold and more accurate hierarchical matrix arithmetic. To our surprise, the well‐known kNN method produced results similar to ℋ‐MLE and worked much faster.


Problem setting, H-MLE, and machine learning methods
Let (s₁, . . . , sₙ) be a set of spatial locations. Let I₁ := (s₁, . . . , s_{n₁}) be the locations with known values Z₁ = (Z(s₁), . . . , Z(s_{n₁}))⊤, and I₂ := (s_{n₁+1}, . . . , sₙ) the locations with unknown values Z₂ = (Z(s_{n₁+1}), . . . , Z(sₙ))⊤. The values Z₂ are to be predicted.

H-MLE method and parameter inference. We model the set of measurements as a realization of a stationary Gaussian spatial random field. Specifically, we let Z = (Z(s₁), . . . , Z(sₙ))⊤, where Z(s) is a Gaussian random field indexed by a spatial location s ∈ R^d, d = 2, 3. We assume that Z has zero mean and a stationary parametric covariance function C(h; θ) = cov{Z(s), Z(s + h)}, where h ∈ R^d is a spatial lag vector and θ ∈ R⁴ is the unknown parameter vector of interest. Statistical inference about θ is often based on the Gaussian log-likelihood function

L(θ) = −(n/2) log(2π) − (1/2) log |C(θ)| − (1/2) Z⊤ C(θ)⁻¹ Z,   (1)

where the covariance matrix C(θ) has entries C(sᵢ − sⱼ; θ), i, j = 1, . . . , n. The maximum likelihood estimator of θ is the value θ̂ that maximizes (1). When the sample size n is large, the evaluation of (1) becomes challenging: the storage of the n-by-n covariance matrix C(θ) requires O(n²) units of memory, and the computation of its inverse and log-determinant costs O(n³) FLOPs. Hence, parallel and scalable methods that reduce this high cost are needed.

We consider the Matérn covariance, which depends only on the distance h := ∥s − s′∥ between any two spatial locations s and s′:

C(h; θ) = (σ² / (2^(ν−1) Γ(ν))) (h/ℓ)^ν K_ν(h/ℓ) + τ² I,   (2)

with parameters θ = (σ, ℓ, ν, τ)⊤. Here σ² is the variance, τ² the nugget, ν > 0 controls the smoothness of the random field (with larger values of ν corresponding to smoother fields), and ℓ > 0 is the spatial range parameter that measures how quickly the correlation of the random field decays with distance. K_ν denotes the modified Bessel function of the second kind of order ν, Γ the gamma function, and I the identity matrix.

Prediction. Estimating the unknown parameters θ is only an intermediate step.
Once this is done, the estimate θ̂ ≈ θ is used for prediction at new locations. Z(s) is a Gaussian random field indexed by the spatial locations in the joint index set (I₁, I₂). After discretisation we obtain a 2 × 2 block covariance matrix [C₁₁ C₁₂; C₂₁ C₂₂], where C₁₁ ∈ R^{n₁×n₁} and C₂₂ ∈ R^{(n−n₁)×(n−n₁)}. The unknown vector Z₂ can then be predicted by the formula Z₂ = C₂₁ C₁₁⁻¹ Z₁. Equivalently, Z₂ given Z₁ has the conditional distribution with mean C₂₁ C₁₁⁻¹ Z₁ and covariance matrix C₂₂ − C₂₁ C₁₁⁻¹ C₁₂.

The classical maximum likelihood method is very time-consuming for large datasets. To make computations faster, we approximated the joint Gaussian log-likelihood function in the hierarchical (H-) matrix format [2,3]. The H-matrix technique drastically reduces the required memory and computing time, making it possible to work with larger sets of observations obtained on unstructured meshes.
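The pieces above — the Matérn covariance, the log-likelihood (1), and the conditional prediction — can be sketched with dense linear algebra. This is only a minimal NumPy/SciPy illustration of the formulas, not the H-MLE implementation: the actual method replaces the dense O(n³) Cholesky factorisation with H-matrix arithmetic, and all function names here are our own.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import kv, gamma

def matern_cov(locs, sigma, ell, nu, tau):
    """Dense Matérn covariance matrix (2) with nugget tau**2 on the diagonal."""
    h = cdist(locs, locs) / ell
    with np.errstate(invalid="ignore"):
        corr = (2.0 ** (1.0 - nu) / gamma(nu)) * h ** nu * kv(nu, h)
    corr[np.isnan(corr)] = 1.0          # limit of the correlation at zero distance
    return sigma ** 2 * corr + tau ** 2 * np.eye(len(locs))

def neg_log_likelihood(theta, locs, z):
    """Negative Gaussian log-likelihood (1), evaluated via a Cholesky factor."""
    sigma, ell, nu, tau = theta
    C = matern_cov(locs, sigma, ell, nu, tau)
    L = np.linalg.cholesky(C)              # the O(n^3) bottleneck H-MLE avoids
    alpha = np.linalg.solve(L, z)          # so z^T C^{-1} z = alpha^T alpha
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (len(z) * np.log(2 * np.pi) + logdet + alpha @ alpha)

def predict_conditional(C11, C12, C21, C22, z1):
    """Conditional mean C21 C11^{-1} z1 and covariance C22 - C21 C11^{-1} C12."""
    L = np.linalg.cholesky(C11)
    w = np.linalg.solve(L.T, np.linalg.solve(L, z1))     # C11^{-1} z1
    V = np.linalg.solve(L.T, np.linalg.solve(L, C12))    # C11^{-1} C12
    return C21 @ w, C22 - C21 @ V
```

In practice one would minimize `neg_log_likelihood` over θ (e.g. with `scipy.optimize.minimize`) and then feed the estimate θ̂ into `matern_cov` to assemble the block matrix used by `predict_conditional`.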

Machine Learning methods
k-nearest neighbors (kNN) is a non-parametric statistical machine learning method that follows a simple idea: for each data point x whose output ŷ needs to be predicted, find its k nearest neighbors x₁, . . . , x_k with respect to some metric, and set ŷ = (1/k) Σ_{i=1}^{k} yᵢ, where yᵢ is the observed value at xᵢ. The value of k should be chosen appropriately, for example, by a cross-validation procedure.

Random forest (RF) is another popular machine learning method, in which a large number of decision (or regression) trees are generated independently on random sub-samples of the data. The final decision for x is computed over the ensemble of trees by averaging the predicted outcomes. The method shows promising results in many practical tasks, especially in the presence of multiple irrelevant features describing the observed data.

Deep neural network (DNN) methods are based on the artificial neural network paradigm, which models the functioning of neurons in the brain. We use a fully connected neural network (FCNN). The FCNN training procedure consists of finding the connection weights between neurons for which the quality metric takes the best value.

By minimizing the RMSE on a validation sample repeatedly obtained by random sub-sampling of the data in the proportion 1:9, we found optimal values of k for kNN, the number of trees for RF, and the number of layers and hidden units for the FCNN. We considered values k ∈ {1, . . . , 20}, numbers of trees in the set {100, . . . , 150}, and several variants of the FCNN architecture with 3 to 10 hidden layers (having 50–100 neurons in each hidden layer).
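The kNN selection procedure described above can be sketched with scikit-learn, using the spatial coordinates as features and a random 1:9 validation split, as in the text. This is an illustrative reconstruction; the function name and grid are ours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def select_k(locs, z, k_grid=range(1, 21), seed=0):
    """Pick k by RMSE on a random 1:9 validation split of the training data."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        locs, z, test_size=0.1, random_state=seed)
    best_k, best_rmse = None, np.inf
    for k in k_grid:
        knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
        rmse = mean_squared_error(y_val, knn.predict(X_val)) ** 0.5
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k, best_rmse
```

The selected model is then refit on all training locations and used to predict Z₂ at the held-out locations.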

Numerical tests
We considered four datasets, taken from [4,5]: two in Test 2a and two in Test 2b. Both datasets in Test 2a contain 90,000 locations for training and 10,000 for testing (prediction). Both datasets in Test 2b contain 900,000 locations for training and 100,000 for prediction.
H-MLE can require up to 8 hours on a modern parallel node for a dataset with 900,000 locations. A possible remedy is to precompute the initial guess: this significantly reduces the number of iterations needed to minimize the log-likelihood and thus the total computing time. A good initial guess can be found, for instance, on a smaller subset of the observations. Another drawback of H-MLE is that the H-matrix approximation of C and its inverse (computed via Cholesky factorisation) are recomputed entirely in every iteration for each new value of τ and σ. It would be much cheaper to update the already existing matrices by adding a new diagonal or by scaling them.
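The cheap-update idea can be illustrated in dense form: for fixed ℓ and ν, the correlation part R of the Matérn model depends on neither σ nor τ, so a new (σ, τ) pair only requires rescaling a cached R and shifting its diagonal, instead of re-evaluating Bessel functions for all O(n²) entries. A hypothetical NumPy sketch (the H-matrix analogue would update the existing hierarchical factors in place):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import kv, gamma

def matern_corr(locs, ell, nu):
    """Expensive part: Bessel-function evaluations for all pairwise
    distances. Depends only on ell and nu, not on sigma or tau."""
    h = cdist(locs, locs) / ell
    with np.errstate(invalid="ignore"):
        R = (2.0 ** (1.0 - nu) / gamma(nu)) * h ** nu * kv(nu, h)
    R[np.isnan(R)] = 1.0            # correlation is 1 at zero distance
    return R

def update_cov(R, sigma, tau):
    """Cheap part: when only sigma or tau change during optimization,
    rescale the cached R and shift its diagonal."""
    return sigma ** 2 * R + tau ** 2 * np.eye(R.shape[0])
```
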
Results. Among all implemented ML methods (kNN, random forest, deep neural network), the best results (for the given datasets) were obtained by the kNN method, with three or seven neighbors depending on the dataset [5].
By trying different numbers of trees in the RF method, we found that an ensemble of 120 regression trees is optimal. Further, we designed the FCNN architecture with 7 hidden layers of 100 neurons each. The number of training epochs was 500, and the batch size was 10,000. We used the tanh(·) activation function and the Adam optimizer. The average computing time of kNN was 0.07 sec; RF required 12 sec, and the FCNN finished after 173 sec.
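For reference, the reported FCNN configuration (7 hidden layers × 100 neurons, tanh, Adam, 500 epochs, batch size 10,000) can be written down, e.g., with scikit-learn's MLPRegressor. The text does not name the framework actually used, so this is an illustrative reconstruction, not the original implementation:

```python
from sklearn.neural_network import MLPRegressor

def build_fcnn(epochs=500, batch_size=10_000, seed=0):
    """FCNN with the settings reported in the text; with solver='adam',
    max_iter counts passes over the training data (epochs)."""
    return MLPRegressor(
        hidden_layer_sizes=(100,) * 7,   # 7 hidden layers, 100 neurons each
        activation="tanh",
        solver="adam",
        batch_size=batch_size,
        max_iter=epochs,
        random_state=seed,
    )
```
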
The results computed with the H-MLE method were compared with those obtained by the kNN method. For Test 2a, the H-MLE method showed a smaller RMSE than the kNN method, whereas for Test 2b the kNN method was better. To conclude, it is not surprising that our H-MLE method worked well on most datasets. What surprised us is that the well-known and straightforward kNN method performed very well and very fast. Since we did not make any theoretical comparison and compared H-MLE and the ML methods only numerically on the given datasets, we cannot, in general, conclude which method is better. We also note that we used kNN only for prediction, without identification of the unknown parameters.