Logistic regression for potential modeling

Regression or regression-like models are often employed in potential modeling, i.e., for the targeting of resources, based either on 2D map images or 3D geomodels in raster mode, or on spatial point processes. Recently, machine learning techniques such as artificial neural networks have also gained popularity in potential modeling. Artificial neural networks obtain decent results in predicting the target event; however, insight into the problem, e.g., about the importance of specific covariates, is difficult to obtain. Logistic regression, on the other hand, has a well-understood statistical foundation and works with an explicit model from which knowledge about the underlying problem can be gained. However, establishing such an explicit model is rather difficult for real-world problems. We propose a model selection strategy for logistic regression which includes nonlinearities for improved classification results while preserving the interpretability of the results.


Logistic Regression
The logistic regression model is a special case of a generalized linear model. It consists of the linear predictor $\eta = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m$ and a link function which links the conditional probability $P(y\,|\,x) = \mu(\eta)$ to the linear predictor. This link function is the logarithm of the odds, called the logit, given as $\operatorname{logit}(\mu) = \log \frac{\mu}{1-\mu}$. The maximum likelihood estimate $\hat\beta$ is obtained by minimizing the negative logarithm of the likelihood function, i.e.,
$$\hat\beta = \arg\min_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log \mu(\eta_i) + (1 - y_i)\log\bigl(1 - \mu(\eta_i)\bigr) \right],$$
where $\mu(\eta) = 1/(1 + e^{-\eta})$ is the logistic function and $y_i \in \{0, 1\}$.
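To make the estimation step concrete, the following is a minimal NumPy sketch of a Newton-Raphson fit of $\hat\beta$ under the definitions above (an illustrative implementation, not the authors' code); each iteration solves one linear system:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson.
    X: (n, m) covariate matrix, y: (n,) labels in {0, 1}."""
    n, m = X.shape
    A = np.hstack([np.ones((n, 1)), X])   # design matrix with intercept
    beta = np.zeros(m + 1)
    for _ in range(n_iter):
        eta = A @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # logistic function mu(eta)
        grad = A.T @ (y - mu)             # gradient of the log-likelihood
        W = mu * (1.0 - mu)               # Newton weights mu(1 - mu)
        H = A.T @ (A * W[:, None])        # Hessian (Fisher information)
        step = np.linalg.solve(H, grad)   # linear system for the Newton step
        beta += step
        if np.linalg.norm(step) < tol:
            break
    return beta
```

For large-scale applications, the dense solve would be replaced by an iterative solver such as the conjugate gradient method mentioned in the text.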
This minimization problem is solved using the Newton-Raphson algorithm, where each new iterate is obtained by solving a system of linear equations. Using the conjugate gradient method with an early stopping criterion, logistic regression can also tackle large-scale applications [2]. The application of logistic regression in potential modeling poses some specific challenges. First, one needs to deal with rare events, i.e., one class occurs with a significantly lower frequency than the majority class. It was shown, e.g., in [4], that logistic regression underestimates the probabilities of the rare events because it tends to be biased towards the more frequent class, which in many practical applications is the less important one. This can be addressed through endogenous sampling, which means taking all positive events and a random sample of the negative events to obtain balanced training data. This sampling makes some corrections necessary, such as using a weighted likelihood [5] and calculating a robust variance estimate [6].
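The sampling step and the weights entering a weighted likelihood can be sketched as follows; this is an illustrative King-Zeng-style prior-correction weighting (the helper names and the exact form of the correction are assumptions, not taken from [5]):

```python
import numpy as np

def endogenous_sample(X, y, rng):
    """Keep all positive events and draw an equally sized
    random subset of the negative events."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg_sub = rng.choice(neg, size=pos.size, replace=False)
    idx = np.concatenate([pos, neg_sub])
    return X[idx], y[idx]

def correction_weights(y_sample, tau):
    """Observation weights for a weighted likelihood:
    tau is the positive-class fraction in the full population,
    ybar the fraction in the balanced sample (here 0.5)."""
    ybar = y_sample.mean()
    w1 = tau / ybar              # down-weight over-represented positives
    w0 = (1.0 - tau) / (1.0 - ybar)
    return np.where(y_sample == 1, w1, w0)
```

These weights would multiply the per-observation terms of the negative log-likelihood during fitting, counteracting the bias introduced by the balanced sampling.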

Model selection strategy
Since logistic regression is only a linear classifier, we provide a dictionary of nonlinear transformations of the data $x$. After choosing the appropriate nonlinearities, these are added to the linear predictor, i.e.,
$$\eta = \beta_0 + \sum_{j=1}^{m} \beta_j x_j + \sum_{j=m+1}^{\tilde m} \beta_j \phi_j(x),$$
where the $\phi_j$ denote the selected nonlinear terms.
This is still linear in the parameters $\beta_i$, $i = 1, \ldots, \tilde m$, where $\tilde m$ is the total number of variables, including the nonlinearities. To obtain such a model, a selection of variables needs to be carried out. The proposed model selection is performed in two main steps. The first step is a coarse selection using the p-value of the Wald test, which tests for the significance of a variable. Because of known problems with the Wald test for rare events and large samples, e.g., [7], many unimportant variables will be left in the model. Therefore, a second selection step is needed. This uses the Bayesian information criterion (BIC). The BIC for a given model is calculated as
$$\mathrm{BIC} = k \log n - 2 \log \hat L,$$
where n and k denote the number of data points and the number of variables, respectively. Furthermore, $\hat L$ is the value of the likelihood function at the MLE $\hat\beta$. Because the model should be able to achieve good results on unseen data, the BIC is calculated on a validation dataset. After calculating the BIC values of all models with one variable dropped, we sort the variables by importance. Then we apply the following method to discard more than one variable at a time: we calculate the BIC for the models in which one-half, one-quarter, and one-eighth of the most unimportant variables are dropped. The model with the smallest BIC is used as the starting model for the next iteration.
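The second selection step can be sketched as follows. Here `fit` and `score` are hypothetical user-supplied helpers (fitting a logistic model on the given columns and returning its negative log-likelihood on the validation data); they are assumptions for illustration, not part of the original text:

```python
import numpy as np

def bic(neg_log_lik, k, n):
    """BIC = k log n - 2 log L, with neg_log_lik = -log L."""
    return k * np.log(n) + 2.0 * neg_log_lik

def select_variables(fit, score, n_val, cols):
    """Greedy BIC-based elimination: rank variables by the BIC of the
    model with that variable dropped (lowest drop-BIC = least important),
    then try discarding the worst 1/2, 1/4, 1/8 at once and keep the
    candidate with the smallest validation BIC."""
    cols = list(cols)
    while True:
        base = bic(score(fit(cols), cols), len(cols), n_val)
        drop_bic = {}
        for c in cols:
            rest = [d for d in cols if d != c]
            drop_bic[c] = bic(score(fit(rest), rest), len(rest), n_val)
        ranked = sorted(cols, key=lambda c: drop_bic[c])  # least important first
        best_cols, best_bic = cols, base
        for frac in (2, 4, 8):
            keep = ranked[len(cols) // frac:]             # drop the worst 1/frac
            b = bic(score(fit(keep), keep), len(keep), n_val)
            if b < best_bic:
                best_cols, best_bic = keep, b
        if best_cols == cols:                             # no improvement: stop
            return cols
        cols = best_cols
```

In practice `fit` would be the weighted logistic regression fit and `score` its negative log-likelihood evaluated on the held-out validation set.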

Results
Experiments on synthetic data, where the true model is known, show that our suggested method recovers the true model when possible, and otherwise approximates it using the given nonlinearities. In both cases, it improves on the performance of a simple logistic regression and comes close to the prediction accuracy of a neural network while remaining interpretable. Due to space limitations, the results on synthetic data are not presented here.
Our experiments with real-world data show similar behavior. Results for the datasets described in Table 1 are presented in Figure 1 and Figure 2. The model selection improves on the simple logistic regression in both cases; for the Cod-RNA dataset it even gives the same prediction result as a neural network with 20 hidden layers and the logistic function as activation function.