Automatic deforestation detectors based on frequentist statistics and their extensions for other spatial objects

This paper is devoted to the problem of detecting forest and non-forest areas in Earth images. We propose two statistical methods to tackle this problem: one based on multiple hypothesis testing with parametric distribution families, the other on non-parametric tests. The parametric approach is novel in the literature and relevant to a larger class of problems: detection of natural objects, as well as anomaly detection. We develop the mathematical background for each of the two methods, build self-sufficient detection algorithms from them and discuss practical aspects of their implementation. We also compare our algorithms with standard machine learning methods using satellite data.


Introduction
Within the last decade, there have been several attempts to create remote deforestation detectors. Contemporary literature on the topic is geared mostly towards machine learning approaches. To name a few examples, Ortega Adarme et al. [2020] employed deep learning for deforestation detection in the Brazilian Amazon, and Shermeyer and Haack [2015] presented k-nearest neighbour classification to track deforestation in Peru. The Bayesian approach has also seen use, e.g. in Reiche et al. [2015] on time series data from Fiji. The present work aims to create methods that have a statistical foundation and that, unlike machine learning methods, are fast, simple and in principle interpretable while being comparably accurate. These qualities are important on their own, but may also be useful for the task of near-real-time deforestation detection. To achieve these characteristics we employ sample means and covariances as well as empirical characteristic functions (ECFs) of colour intensities of image samples. The idea behind this is that forests seen from above, like many other natural objects, show a rather simple homogeneous structure. Moreover, in many cases natural objects are distinguishable by their colour features rather than their spatial patterns. Thus, operating on distributions and their ECFs while completely discarding all spatial dependencies seems reasonable.
The general statistical setting of the problem is as follows. In the visible spectrum, images of geographical objects are represented as three matrices. Each of them consists of intensities of one of the red, green or blue colours. Thus, every pixel of an Earth image is represented by three numbers: one in each of the matrices. We employ supervised learning where tags (forest / non-forest) are assigned to entries of the training data, and we learn features of the distributions of the matrices in those entries. During classification, test images are split into much smaller square sub-images for the purpose of obtaining better resolution, and these sub-images are tested against reference images that are known to contain forest.

†Address for correspondence: Dmitry Otryakhin, Department of Mathematics, Stockholm University, Matematiska institutionen, 106 91 Stockholm, Sweden. E-mail: d.otryakhin.acad@protonmail.ch
We investigate two different approaches to the problem: non-parametric multiple hypothesis testing and a parametric one. In the case of parametric multiple hypothesis testing, we assume parametric distributions for the colour intensities of pixels and use training data to estimate the parameters of those distributions. New data is then tested against the fitted distribution. In the non-parametric setting, we do not assume any known distribution for the colour intensities but instead test whether the multivariate distributions of the colour intensities of the training data and the new data have equal means and covariance matrices.
We build both the parametric and the non-parametric algorithms in such a way that they work on images in the visible spectrum, in particular RGB composites. Thus, the algorithms do not require any particular pre-processing technique for the data and can be applied to a broad range of sources of Earth images. There are at least two ways of obtaining large datasets of Earth images: satellite data obtained through open-access Earth observation missions, e.g. Landsat (Roy et al. [2014]) and Sentinel-2 (Drusch et al. [2012]), and aerial photo imaging. In this work, Sentinel-2 images are used because they provide good spatial and temporal resolution. The Sentinel-2 mission provides data from visible, near-infrared and short-wave infrared sensors in 13 corresponding spectral bands. Although we use RGB composites obtained via a post-processing procedure from Sentinel-2 original tiles, our methods could work on all of the data in those 13 bands.

A practical example
In order to give more intuition about the task we tackle, we present a real-life example of deforestation detection. Figure 1a shows a Sentinel-2 image of a piece of land near the city of Arkhangelsk in Russia suffering from logging. Figure 1b is a black-and-white image representing the detection result. White represents forest and black is used for all types of terrain not covered by it.

Fig. 1: A plot of land with deforestation and the result of detection produced by the non-parametric algorithm. (a) The forest image analysed. (b) The result of detection: white marks forest, black marks non-forest.
One can see that the detector highlighted the areas with logging traces (black rectangles) as well as natural objects: lakes, bare ground and pieces of rivers.
The rest of the paper is organized as follows. Section 2 briefly mentions the key steps in data preparation and discusses what features forest data possess. Sections 3 and 4 describe, correspondingly, the novel non-parametric and parametric methods. Section 5 discusses theoretical properties of the two methods, while the last section compares the two with well-known machine learning methods using satellite data.
Our main contribution is the application of the known theory of Hotelling's T² statistic to the problem of deforestation detection, which occupies Section 3. There, we prove Theorem 1, which gives the asymptotic distribution of a distance between two properly scaled multivariate populations. Using Theorem 1 we build classification and training algorithms in Sections 3.2 and 3.3. Another major contribution is Section 4 (excluding 4.1), where we develop a new setting of hypothesis testing for stable models based on empirical characteristic functions in Section 4.2 and use this setting to build algorithms for the detection of forest in Sections 4.3 and 4.4.

Data and exploratory analysis
To obtain data in the visible spectrum we use a simple preprocessing procedure, collecting the red, green and blue bands from Sentinel-2 Level-1C products and combining them into one GeoTIFF file. It is then handled by GDAL (GDAL/OGR contributors [2019]), which creates three numeric matrices (one for each of the red, green and blue channels) with values between 0 and 1. The procedure is described in more detail in Appendix A.
Initial inspection of the empirical density of the forest data gave us some additional ideas for potentially suitable distributions. We considered distributions including, but not limited to, stable, normal, log-normal, gamma, Cauchy, Weibull and Lévy. We estimated parameters for each of the three colour intensity distributions separately, using the Koutrouvelis regression-type estimator for the stable distributions, discussed in Section 4.1.1, and maximum likelihood for the others. Then we calculated the root-mean-square errors (RMSE)

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \big( f(x_i) - f^*(x_i) \big)^2 },

where f(·) is the probability density function using the estimated parameters, f*(·) is the empirical density and x_i is the colour intensity of data point i = 1, ..., m. The empirical density is estimated using kernel density smoothing. To choose the best-fitting distributions we then plotted the densities of the three distributions with the lowest RMSEs together with the empirical densities and compared them. These distributions appeared to be normal, gamma and stable consistently throughout the set of images. An example of a forest image with such a comparison can be seen in Figure 2, where we can observe the three best-fitting distributions for a forest image with all parameters estimated. From the plots we see that the normal distributions do not fit very well, while the gamma and stable distributions are quite close, with stable edging out gamma, especially for the intensity of the colour red.
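The fit comparison described above can be sketched as follows; `sample` stands in for the colour intensities of one channel, and the candidate set is reduced to normal and gamma for brevity (a sketch of the procedure, not the exact code of the paper, which also fits stable laws via the Koutrouvelis estimator):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical stand-in for one channel's colour intensities
sample = rng.gamma(shape=3.0, scale=0.05, size=2000)

# Empirical density via Gaussian kernel density smoothing
kde = stats.gaussian_kde(sample)
xs = np.linspace(sample.min(), sample.max(), 200)
f_emp = kde(xs)

def rmse_fit(dist, data, xs, f_emp):
    """Fit `dist` by maximum likelihood and return the RMSE between
    its fitted density and the empirical (KDE) density."""
    params = dist.fit(data)
    f_hat = dist.pdf(xs, *params)
    return np.sqrt(np.mean((f_hat - f_emp) ** 2))

rmse = {name: rmse_fit(d, sample, xs, f_emp)
        for name, d in [("normal", stats.norm), ("gamma", stats.gamma)]}
best = min(rmse, key=rmse.get)  # the better-fitting family for this sample
```

The lowest-RMSE families are then inspected visually against the empirical density, as in Figure 2.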
Since the focus of this work is the ability to detect whether an area has forest or not, we describe detection of the presence or absence of this type of natural object. However, our exploratory data analysis shows that the framework developed here could be used to detect other natural objects in satellite images as well. In Figures 10 and 11 in Appendix B, we can see promising results of using the stable distribution to describe the colour intensities of images of mountains and sea. When it comes to man-made objects such as, for example, a city, our approach does not work, and we would have to assume at least a multimodal mixture of some kind, since such satellite images contain distinctly different parts, e.g., an orange roof, green grass and grey pavement.

Detection with a non-parametric method
In this section, the non-parametric method for forest detection is developed. Later, in Section 4, the ideas presented here are used to create the theory and implementation for the parametric one.

Fig. 2: Plots of estimated probability density functions for the distribution of colour intensities in an image of a forest (panel shown: comparison of densities for blue colour intensity). The means of the red, green and blue intensities are respectively 0.16, 0.24 and 0.20.

Comparing images with the scaled, squared Mahalanobis distance
The Mahalanobis distance D was originally introduced as a distance between a given multivariate point q and a distribution. For a point q and a distribution with mean vector µ and covariance matrix Σ, the distance is given by

D(q) = \sqrt{ (q - \mu)^\top \Sigma^{-1} (q - \mu) },

where, in practice, the mean and covariance matrix of the distribution are estimated by the sample mean and covariance matrix. A further extension can be made to calculate the distance between two multivariate populations:

D = \sqrt{ (\bar{X}_1 - \bar{X}_2)^\top \hat{\Sigma}^{-1} (\bar{X}_1 - \bar{X}_2) },    (1)

where X̄₁ and X̄₂ are the sample mean vectors of the two populations. The matrix Σ̂⁻¹ is the inverse of the pooled sample covariance matrix, given by Σ̂ = (n₁Σ̂₁ + n₂Σ̂₂)/(n − 2) for samples with respective sample sizes n₁, n₂, total sample size n = n₁ + n₂ and sample covariance matrices Σ̂₁, Σ̂₂. An interested reader can find more on the topic in Bibby et al. [1979].
The Mahalanobis distance is often used in cases where there is correlation present in the considered multivariate distribution. As the reader might have noticed, the Mahalanobis distance expression is closely related to the statistical concept of standardization, i.e., subtracting the mean and dividing by the standard deviation. The idea is to decorrelate the components, scale each dimension's variance to be 1 and then calculate the Euclidean distance.
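The two-population distance with the pooled covariance can be computed directly; a minimal numpy sketch (variable names are ours):

```python
import numpy as np

def mahalanobis_sq(X1, X2):
    """Squared Mahalanobis distance D^2 between two multivariate samples
    (rows are observations), using the pooled sample covariance matrix."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False, bias=True)  # Sigma_1 hat, normalised by n1
    S2 = np.cov(X2, rowvar=False, bias=True)  # Sigma_2 hat, normalised by n2
    S_pooled = (n1 * S1 + n2 * S2) / (n1 + n2 - 2)
    return diff @ np.linalg.inv(S_pooled) @ diff

rng = np.random.default_rng(1)
X1 = rng.normal(size=(500, 3))
X2 = rng.normal(size=(400, 3))
D2 = mahalanobis_sq(X1, X2)  # small, since the two samples share a distribution
```

For two samples from the same distribution, D² shrinks as the sample sizes grow, which is what Theorem 1 below quantifies.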
Theorem 1. Let X₁ and X₂ be independent samples of sizes n₁ × p and n₂ × p composed of i.i.d. p-dimensional random vectors from distributions with finite mean µ and finite covariance matrix Σ. Then, for D from (1), we have that

\frac{n_1 n_2}{n_1 + n_2} D^2 \xrightarrow{d} \chi^2_p \quad \text{as } n_1, n_2 \to \infty.

For a proof, we refer to Appendix C.
Remark 1. Note that X₁ and X₂ do not have to be drawn from the same distribution. If they are drawn from different ones possessing the same mean and covariance matrix, the result of the theorem still holds.

Hypothesis testing based on Mahalanobis distances between images
Suppose we have an RGB image of forest (Figure 3) of size n × p pixels. It consists of three matrices corresponding to the red, green and blue channels. The matrices contain values of colour intensities of pixels in the range [0, 1]. Each matrix is split into column vectors that are subsequently stacked together to form one column vector of length np. The three resulting large vectors are bound to form an np × 3 matrix X. Since the same operation is applied to every matrix, the triplets of pixel intensities are kept unchanged, so it is possible to analyse dependencies between colours. In contrast, matrix X does not contain any information about the spatial locations of the pixels. From X, a three-dimensional mean vector and a 3 × 3 covariance matrix are calculated.
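The conversion from an RGB array to the matrix X can be sketched as follows (we flatten row-major rather than stacking columns, which yields the same multiset of pixel triplets since spatial order is discarded anyway):

```python
import numpy as np

def image_to_X(img):
    """Flatten an n x p x 3 RGB array (intensities in [0, 1]) into the
    (n*p) x 3 matrix X: one row per pixel, one column per colour channel.
    Spatial locations are discarded; colour triplets stay together."""
    n, p, _ = img.shape
    return img.reshape(n * p, 3)

img = np.random.default_rng(2).random((20, 20, 3))  # hypothetical 20 x 20 image
X = image_to_X(img)
mu = X.mean(axis=0)               # three-dimensional mean vector
Sigma = np.cov(X, rowvar=False)   # 3 x 3 covariance matrix
```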
The assumptions of Theorem 1 hold for data of the type of X: since all entries are bounded between 0 and 1, the means and covariances are always finite. This naturally leads us to a hypothesis testing framework. Having two images, one known to depict forest and one test image, we can use the theorem to accept or reject the null hypothesis that the test image is forest. This is done by first converting the RGB composites into matrices X₁ and X₂ and obtaining their sizes, means and covariance matrices, and second, computing the Mahalanobis distance between X₁ and X₂ and checking whether or not the rescaled Mahalanobis distance T := \frac{n_1 n_2}{n_1 + n_2} D^2 is lower than a certain threshold. A scheme for the multiple hypothesis testing framework is depicted in Figure 4. When it comes to classification of a given image, rescaled Mahalanobis distances T₁, T₂, ... between the image and every forest picture in the training set are computed. Out of all T's, the minimal one is chosen and compared to a threshold obtained during model training. If the minimal T is smaller than the threshold, the test image is tagged forest; if greater, then non-forest.
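The classification step just described can be sketched as follows; the helper mirrors the two-sample distance of Section 3.1, and the threshold is assumed to come from training (here we use an illustrative χ²₃ quantile instead):

```python
import numpy as np

def rescaled_T(X1, X2):
    """T = n1*n2/(n1+n2) * D^2, the statistic compared to the threshold."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S = (n1 * np.cov(X1, rowvar=False, bias=True)
         + n2 * np.cov(X2, rowvar=False, bias=True)) / (n1 + n2 - 2)
    return n1 * n2 / (n1 + n2) * (diff @ np.linalg.inv(S) @ diff)

def classify(X_test, forest_refs, threshold):
    """Tag the test image 'forest' iff the minimal T over all
    reference forest images falls below the threshold."""
    T_min = min(rescaled_T(X_test, X_ref) for X_ref in forest_refs)
    return "forest" if T_min < threshold else "non-forest"

rng = np.random.default_rng(3)
# hypothetical pixel matrices: forest-like intensities around 0.2
refs = [rng.normal(0.2, 0.05, size=(400, 3)) for _ in range(5)]
test_forest = rng.normal(0.2, 0.05, size=(400, 3))
test_other = rng.normal(0.6, 0.05, size=(400, 3))
threshold = 11.34  # chi^2_3 99% quantile, purely illustrative
label1 = classify(test_forest, refs, threshold)
label2 = classify(test_other, refs, threshold)
```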
Remark 2. This setting is equivalent to multiple hypothesis testing with the thresholds set to be equal. This restriction can be dropped, in which case the precision of predictions will generally increase. The downside of such a generalization is that the additional computational complexity during both training and classification would increase linearly with the number of forest images.

Remark 3. This algorithm describes dependencies between colours by means of the pooled sample covariance matrix. In contrast, all information about spatial features is erased. Also, the distributions of colour intensities in each channel are described only by their location parameters.

Training and cross-validation of the non-parametric model
Training of the non-parametric model is schematically depicted in Figure 5. One can observe two sets of images there: reference forest images in the left column (set 1) and images of both types in the upper row (set 2). For every pair of an i-th image from set 1 and a j-th image from set 2, the statistic T̂_ij = \frac{n_1 n_2}{n_1 + n_2} D²_ij is computed, where D_ij is the Mahalanobis distance between the two pictures. Then, for every image from set 2, the minimal statistic min_i T̂_ij is found. Lastly, the threshold T is chosen to minimize the misclassification error when classifying images from set 2.

Aside from simple training on a set of images, we adopt the setting of k-fold cross-validation for this model. Initially, we shuffle the pictures and group them into k folds. In this procedure, in order to preserve internal features, we do not split or merge pictures. We follow a procedure similar to the one described above. Once a fold is chosen, set 1 consists of all forest pictures in the other k − 1 folds, while set 2 is the chosen holdout fold. Then the statistics min_i T̂_ij are computed for each of the k holdout sets, and T is chosen to minimize the misclassification error over all the combinations. Cross-validation is summarized in Algorithm 1.
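The threshold search in the training step can be sketched as follows, given the minimal statistics min_i T̂_ij already computed for every image in set 2 (the function and variable names are ours):

```python
import numpy as np

def train_threshold(T_min_forest, T_min_nonforest, t_grid):
    """Pick the threshold t maximising classification accuracy:
    images whose minimal statistic falls below t are tagged forest."""
    best_t, best_acc = None, -1.0
    for t in t_grid:
        correct = (np.sum(np.asarray(T_min_forest) < t)
                   + np.sum(np.asarray(T_min_nonforest) >= t))
        acc = correct / (len(T_min_forest) + len(T_min_nonforest))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# toy statistics: forest images yield small T's, non-forest ones large T's
t_star, acc = train_threshold([2.1, 4.8, 7.3], [60.0, 85.0, 120.0],
                              t_grid=np.linspace(0, 50, 501))
```

With the toy values above, any threshold between the two groups separates them perfectly, so the search attains accuracy 1.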

Detection with a parametric method
The non-parametric approach described in Section 3 is simple and universal in terms of the underlying data: it does not make any assumptions about the data coming from RGB composites. On the other hand, it possesses a theoretical drawback: it does not take advantage of the distribution of the data, only of its mean and covariance matrix. The parametric approach to detection was designed on the opposite ideas: to capture the actual distributions of the data and parametrize them in order to make use of parametric theoretical machinery.

Stable distributions for signal processing
The second classification method we develop in this study is based on univariate stable distributions, as they proved to fit the empirical data well (Section 2). In this short section we give some context for the use of stable distributions in the field of signal processing, describe how they are fitted to satellite data in the present paper and how we estimate their parameters from sampled data.
The stable distribution family is a class generalising the Gaussian one. In signal processing it appears due to two factors. First, a stable distribution arises from the Generalised Central Limit Theorem, which says that if a limit of a normalised sum of i.i.d. variables with possibly unbounded variances exists, then this limit must have a stable distribution. Thus, this class is a natural choice for modelling signals with potentially large deviations. Second, stable distributions may be asymmetric, which is controlled by one of the parameters. This family of random variables has already seen use in remote sensing areas such as ship detection (Wang et al. [2008]) and video background subtraction (Bhaskar et al. [2010]). While it is unclear to us what role the Generalised Central Limit Theorem plays in shaping the distributions of forest colour intensities, it is certain that parametric control of their shape provides accurate fitting (Figure 2) when the general stable distribution is assumed.

Definition 1. A random variable X is said to have a stable distribution if there are parameters 0 < α ≤ 2, σ ≥ 0, −1 ≤ β ≤ 1 and δ ∈ R such that its characteristic function has the following form:

E e^{itX} = \begin{cases} \exp\{ -\sigma^\alpha |t|^\alpha (1 - i\beta\, \mathrm{sgn}(t) \tan(\pi\alpha/2)) + i\delta t \}, & \alpha \neq 1, \\ \exp\{ -\sigma |t| (1 + i\beta (2/\pi)\, \mathrm{sgn}(t) \ln|t|) + i\delta t \}, & \alpha = 1, \end{cases}    (2)

where sgn(t) = 0 if t = 0, −1 if t < 0 and 1 if t > 0.
The parameter α is known as the index of stability and controls the decay of the tails. The parameter β is not the classical statistical skewness, but it indicates how the distribution is skewed: left skewness if β < 0, right skewness if β > 0 and symmetry if β = 0. The parameter σ is a scale parameter controlling variability, and δ is a location parameter which shifts the distribution. The location parameter corresponds to the mean only when α > 1; otherwise the mean is undefined. In fact, for α < 2 the p-th absolute moment E|X|^p is finite if and only if p < α. Apart from a few cases, the closed-form probability density function is unknown for stable laws. There exists a method for computing stable densities numerically (Nolan [1997]), but it is computationally demanding. That is why we developed the mathematical foundation of our parametric method on CFs and ECFs. The most famous special cases where the probability density function can be expressed explicitly are the Gaussian, Cauchy and Lévy distributions. A stable distribution is Gaussian when α = 2.
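The Gaussian special case can be checked numerically with `scipy.stats.levy_stable`, which computes stable densities by numerical methods; note that in the parameterisation (2), S(2, β, σ, δ) is N(δ, 2σ²), so the matching Gaussian has standard deviation σ√2:

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 7)
# stable density at alpha = 2 (beta is irrelevant there)
p_stable = stats.levy_stable.pdf(x, 2.0, 0.0, loc=0.0, scale=1.0)
# Gaussian N(0, 2 sigma^2) density for sigma = 1
p_gauss = stats.norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))
max_diff = np.max(np.abs(p_stable - p_gauss))  # close to 0
```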

Parameter estimation
This section covers estimation of the four parameters of the univariate stable distribution used in this article. There is a number of techniques available for estimating these parameters, including quantile-based estimators, maximum-likelihood estimators and those based on the method of moments. For this work, the Koutrouvelis regression-type estimator (Koutrouvelis [1980]) was chosen, as it shows a good balance between precision and computational performance. Estimators which performed well but were very slow were also studied, including ones based on the generalized method of moments. Similarly, estimators faster than the Koutrouvelis one were assessed but did not perform well enough for our purposes: the McCulloch estimator (McCulloch [1986]) based on quantiles and the one developed by Press. A more in-depth study of the performance of different parameter estimators may be of interest to readers seeking to improve the algorithm proposed in this article.
The Koutrouvelis regression-type estimator is quite fast, but one big downside is that its estimator of the location parameter δ is often unreliable. For our purposes this turns out not to be a big issue, because experimentally we have seen that α > 1 in all our forest data, and when this is the case the location parameter δ equals the mean of the distribution [Samorodnitsky and Taqqu, 2017, Property 1.2.16]. This implies that the sample mean is a consistent estimator of the location parameter δ for our data.

A test for empirical characteristic functions of stable distributions
In this part, we develop a statistical test that rejects the null hypothesis that a sample comes from a stable distribution defined by (2) with certain parameters. First, we provide a limit theorem for vectors composed of parts of ECFs of stable random variables. For a sequence {X_k}_{k=1,...} of i.i.d. stable random variables the ECF is given by

\hat{\varphi}_n(t) = \frac{1}{n} \sum_{k=1}^{n} e^{itX_k}.

We define new random vectors

Z_n(t) = \big( \mathrm{Re}\, \hat{\varphi}_n(t), \ \mathrm{Im}\, \hat{\varphi}_n(t) \big)^\top \quad \text{and} \quad Z_0(t) = \big( \mathrm{Re}\, \varphi(t), \ \mathrm{Im}\, \varphi(t) \big)^\top,

where φ is the CF from (2).

Proposition 1. \sqrt{n}\, (Z_n(t) - Z_0(t)) \xrightarrow{d} N(0, \Sigma_z), and the elements of the matrix Σ_z are given by

\Sigma_z^{11} = \tfrac{1}{2}\big(1 + \mathrm{Re}\, \varphi(2t)\big) - \big(\mathrm{Re}\, \varphi(t)\big)^2, \quad \Sigma_z^{12} = \Sigma_z^{21} = \tfrac{1}{2} \mathrm{Im}\, \varphi(2t) - \mathrm{Re}\, \varphi(t)\, \mathrm{Im}\, \varphi(t), \quad \Sigma_z^{22} = \tfrac{1}{2}\big(1 - \mathrm{Re}\, \varphi(2t)\big) - \big(\mathrm{Im}\, \varphi(t)\big)^2.

A proof can be found in Appendix C.
Remark 4. Note that when X_j, j = 1, ..., n are also symmetric and centred around the origin, φ(t) is real, and Σ_z reduces to the diagonal form

\Sigma_z = \mathrm{diag}\Big( \tfrac{1}{2}\big(1 + \varphi(2t)\big) - \varphi(t)^2, \ \tfrac{1}{2}\big(1 - \varphi(2t)\big) \Big).

This case was described in Section 3 of Press [1972].
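The ECF and the two-dimensional vector Z_n(t) are straightforward to compute; a minimal sketch in our notation, checked against the known CF of a standard normal sample:

```python
import numpy as np

def ecf_vector(x, t):
    """Z_n(t): real and imaginary parts of the empirical characteristic
    function (1/n) * sum_k exp(i t X_k), returned as a length-2 vector."""
    phi = np.exp(1j * t * np.asarray(x)).mean()
    return np.array([phi.real, phi.imag])

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=50_000)
z = ecf_vector(x, t=1.0)
# theoretical CF of N(0, 1) at t = 1 is exp(-1/2), purely real
z0 = np.array([np.exp(-0.5), 0.0])
err = np.abs(z - z0).max()  # shrinks at rate ~ 1/sqrt(n)
```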
Proposition 1 provides a limiting distribution for the two-dimensional vector Z_n − Z_0, making the result inconvenient for direct statistical inference. To address that, we define an analogue of the Mahalanobis distance, a statistic characterising how far Z_n is from Z_0. The following theorem provides a limiting distribution for this statistic and plays the key role in the parametric inference.
Theorem 2. Fix t. Assume that Σ̂_z →d Σ_z and that Z_n^{(1)}(t), Z_n^{(2)}(t) come from i.i.d. samples X₁ and X₂ with distribution determined by Z_0(t). Then

a) T_{i,0} := n \big( Z_n^{(1)}(t) - Z_0(t) \big)^\top \hat{\Sigma}_z^{-1} \big( Z_n^{(1)}(t) - Z_0(t) \big) \xrightarrow{d} \chi^2_2,    (9)

b) the analogous statistic built on Z_n^{(1)}(t) − Z_n^{(2)}(t), rescaled by n₁n₂/(n₁ + n₂), has the same χ²₂ limit.

Appendix C includes a proof of this result.
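As a sketch of how such a statistic is used in practice, assuming the Hotelling-type form T = n (Z_n − Z_0)ᵀ Σ̂_z⁻¹ (Z_n − Z_0) with a χ²₂ limit (our reading of Theorem 2 a); the plug-in covariance estimate below is an assumption on our part):

```python
import numpy as np

def sigma_z_hat(x, t):
    """Plug-in estimate of the asymptotic covariance of sqrt(n) * Z_n(t):
    the sample covariance of the pairs (cos(t X_k), sin(t X_k))."""
    x = np.asarray(x)
    return np.cov(np.vstack([np.cos(t * x), np.sin(t * x)]))

def ecf_test_statistic(x, t, z0, sigma_z):
    """Hotelling-type distance between the ECF vector of the sample and
    the reference CF vector z0; approximately chi^2_2 under H0."""
    x = np.asarray(x)
    zn = np.array([np.cos(t * x).mean(), np.sin(t * x).mean()])
    d = zn - z0
    return len(x) * d @ np.linalg.inv(sigma_z) @ d

rng = np.random.default_rng(5)
x = rng.normal(size=10_000)           # H0 holds: CF at t=1 is exp(-1/2)
z0 = np.array([np.exp(-0.5), 0.0])
T = ecf_test_statistic(x, 1.0, z0, sigma_z_hat(x, 1.0))
# under H0, T rarely exceeds upper chi^2_2 quantiles
```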

Hypothesis testing based on ECFs of colour intensities
Now, let us return to the data extraction (Figure 6). Assume we have an image of forest and a test image. Having extracted the matrix X, one can estimate the parameters of the stable distributions of the forest picture channels and compute vectors Z_0^r(t), Z_0^g(t) and Z_0^b(t), where the superscripts r, g and b refer to the red, green and blue channels. Similarly, after obtaining X from the test image, empirical characteristic functions are computed and subsequently Z_n^r(t), Z_n^g(t) and Z_n^b(t) are obtained. Note that we use CFs and ECFs of each channel separately, so dependencies between colours are ignored, as are spatial dependencies.
Hypothesis testing in the parametric setting is illustrated in Figure 7.
It implies that in order to be tagged as forest, a test image must have all three colour samples passing their corresponding tests. Since the limiting distributions in (9) are the same, the thresholds for each channel are set to be equal.
For each pair of a test image and a forest one, T_i := min(T_i^r, T_i^g, T_i^b) is computed, then T_min := min_i(T_i) is obtained and compared with the threshold. The training procedure starts with finding the Σ̂_z's, Z_0's and Z_n's for forest images and the Z_n's for non-forest ones in formulas (9). After that, training proceeds as in Section 3.3: T statistics for pairs of images are computed and the threshold is set to minimise the misclassification rate. During classification of new images, the Σ̂_z's and Z_0's corresponding to training forest images are available in advance, so that only the Z_n's of the new images are left to compute in order to perform hypothesis testing based on (9). For this reason we do not build the parametric framework on Theorem 2 b) with Σ̂_z's obtained from both the test image and a training forest image: estimation of the parameters of a stable distribution is quite computationally demanding, although this approach would provide better accuracy. Cross-validation for the parametric model is summarised in Algorithm 2.
Algorithm 2 Parametric cross-validation procedure
1: Inputs: Data set of images labelled forest or non-forest, number of folds k, max threshold T_max, parameter for CFs (t in (2)).
2: For every forest image compute Z_0, the parameters and the covariance matrix.
3: For every non-forest image compute Z_n.
4: Randomly shuffle images and group them into k folds.
5: for i = 1 to k do
6:   for each image in fold i do
7:     Compute 3 T's from (9), between the image and all images labelled forest in the remaining k − 1 folds.
8:     Keep only the smallest T for the image.
9:   end for
10: end for
11: for t in 0 to T_max do
12:   For all images and all T's, tag images with T < t as forest and T > t as non-forest.
13:   Compute the accuracy for the given threshold t, Acc(t).
14: end for
15: return arg max_t Acc(t) and the Z_0's, parameters and covariance matrices of forest images.

Comparison of the methods
Some features are common to both of our methods. First, if by accident there is a non-forest image in the training forest set, test images of its type start to be tagged as forest because they pass the corresponding test. As a consequence, additional false positive results will occur. Specially designed cross-validation may be exploited to eliminate this problem. Second, the setting is multiple hypothesis testing, so it could naturally be re-set to classify forest images into multiple types such as rain forest, coniferous forest, etc. Furthermore, clouds and sunbeam scattering affect the algorithms because they introduce noise into the distributions of objects. This could be cured by atmospheric correction (Main-Knorn et al. [2017]). Also, both models increase in accuracy with growth of the number of points in the training and test data. Improvement of prediction accuracy could be achieved in three ways. Increasing resolution, e.g. by switching from satellite data to aerial imagery, would add more pixels to objects under consideration, while adding new tagged images to the training set would provide better training of thresholds and more tagged classes. In practice, merely increasing the size of an image at fixed resolution (adding more area) works well only in the case of training pictures: enlarging test images would imply observing pictures with objects belonging to different classes and thus tagging some areas incorrectly. Lastly, unlike classical machine learning methods, the models developed in the present article are robust to data transformations such as rotation and reflection.
There is also a number of differences between the two models. An advantage of the parametric model is that it can work on incomplete data: e.g. when only one channel is available, or when different channels are available at different times. The non-parametric method needs at least two channels simultaneously. On the other hand, a disadvantage of the parametric model is that the algorithm generally requires more pixels per image in training, as it involves parameter estimation and because we chose to use Theorem 2 a). The estimation procedure also makes the parametric method slower than the other one. A con of the non-parametric method is that the model does not take into account the actual distributions of colour intensities, which can contribute to additional false positives.
To summarise the comparison, we point out that at the current stage of research it is not possible to claim superiority of one method over the other as they both have pros and cons.

Empirical application
Sentinel-2 satellite images are used to illustrate how the new Mahalanobis distance classifier (MDC) and stable distribution classifier (SDC) perform in comparison to some other commonly used classification methods. The methods used for comparison are a support vector machine (SVM) and a convolutional neural network (CNN). Both have been widely used for classification, particularly in the setting of vegetation remote sensing [Kattenborn et al., 2021, Maulik and Chakraborty, 2017].
Three experiments with real data are shown here: two comparing the accuracy of the methods and one giving a visual representation of performance. As the source of terrain images, the region of Occitanie in France is used, more precisely the vicinity of the city of Albi.
In the first two experiments, we look at how the methods perform using a training and test split of the available dataset. A training set consisting of 510 images of size 20 × 20 pixels with an equal number of forest and non-forest images is used in the first experiment. The test set is also equally balanced between forest and non-forest images, but it consists of 200 images of size 20 × 20 pixels. The second experiment is similar, except that each 20 × 20 image is split into four equal pieces, making 2040 images of size 10 × 10 pixels in the training set and 1000 images of size 10 × 10 pixels in the test set.
In the third experiment, we show how the methods behave when every sub-image of size 10 × 10 pixels of an image of size 2000 × 2000 pixels is classified. This makes it possible to see how the methods behave on different terrains and to get a visual interpretation of the results.
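Splitting a large scene into non-overlapping sub-images for per-block classification can be sketched as:

```python
import numpy as np

def split_into_blocks(img, b):
    """Split an (H, W, 3) image into non-overlapping b x b blocks,
    dropping any incomplete border blocks."""
    H, W, _ = img.shape
    H, W = (H // b) * b, (W // b) * b
    img = img[:H, :W]
    blocks = img.reshape(H // b, b, W // b, b, 3).swapaxes(1, 2)
    return blocks.reshape(-1, b, b, 3)

img = np.random.default_rng(6).random((200, 200, 3))  # hypothetical scene
blocks = split_into_blocks(img, 10)  # 20 * 20 = 400 blocks of 10 x 10 pixels
```

Each block is then classified independently and painted white (forest) or black (non-forest), as in Figure 9.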
In each of the experiments, the SVM is implemented with the radial basis function kernel, which is a commonly used kernel for classification (Chang et al. [2010]). The feature space consists of all pixels and their RGB colour values in the form of flattened input. The Python library 'sklearn' and its default settings are used for the implementation.
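A minimal version of this baseline (rbf kernel on flattened pixel values, scikit-learn defaults) can be sketched on synthetic stand-in data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# hypothetical 10 x 10 x 3 'images': darker blocks play the role of forest,
# lighter blocks the role of non-forest
forest = rng.normal(0.2, 0.05, size=(100, 10, 10, 3))
other = rng.normal(0.6, 0.05, size=(100, 10, 10, 3))
X = np.concatenate([forest, other]).reshape(200, -1)  # flattened pixel input
y = np.array([1] * 100 + [0] * 100)                   # 1 = forest

clf = SVC(kernel="rbf")  # scikit-learn defaults, as in the experiments
clf.fit(X, y)
pred = clf.predict(rng.normal(0.2, 0.05, size=(1, 300)))  # a forest-like block
```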
The CNN is implemented with 'tensorflow' in Python, and the network consists of 2 convolution layers with 8 filters in each layer and a kernel size of 3. ReLU is used as the activation function in the convolution layers, together with maximum pooling of size 2 × 2. A dense layer with sigmoid as activation function is used in the final layer of the network. The loss function is set to 'binary crossentropy' and 'Adam' is the solver of choice. It should be mentioned that we tried and assessed CNNs with all possible combinations of {1, 2, 3} convolution layers, {8, 12, 16} filters, {2, 3} kernel sizes and {0, 0.05, 0.1} dropout. The one described above was the best performing in terms of accuracy.
The performance measure used to evaluate the methods in the first two experiments is the error rate, i.e., the number of incorrectly classified images in the test set divided by the total number of test images. Figure 8 illustrates how the methods perform when the training set varies between 10% and 100% of the total training set, using images of sizes 20 × 20 pixels and 10 × 10 pixels, respectively. Even though the training set size varies, an equal number of forest and non-forest images is always used for training, and the test set is kept unchanged. It should also be noted that all of the images are collected from the same region and under similar conditions. This means that the models are not trained for different seasons or different types of forests. However, it would be possible to make the classification models more general by including more diverse data. Figure 8 shows that the new methods perform better than both of the comparison methods. In the 20 × 20 pixels case, the MDC generally has a lower error rate than the SDC. The error rate corresponding to the MDC drops to zero when the size of the training set is about 300, whereas it takes until about 450 for the SDC. The SVM and CNN perform quite similarly, with possibly a slight advantage to the CNN. However, the SVM is more robust in performance when adding more data to the training set. Neither of the comparison methods attains an error rate below 1%. Analogous results are presented in the 10 × 10 pixels case. However, in this experiment the SDC usually has a lower error rate than the MDC. The error rate drops to zero almost immediately when using the SDC, whereas the error rate corresponding to the MDC gradually decreases from 2.4% and reaches zero when the training set size is about 1800. Again, the SVM and CNN show similar (worse) performance.
Next we turn to the third experiment and visually investigate the methods by classifying 10 × 10 pixels sub-images of a 2000 × 2000 pixels image. The same (complete) training set as in the previous experiments is used to calibrate the methods. This training set does not contain images that are part of the large image, but they were collected from a nearby region under similar conditions. Sub-images that are classified as forest are marked as white and non-forest sub-images are marked as black. The result of the classification is illustrated in Figure 9. Generally, all four models performed well, though there are differences in their outcomes. One can see that, unlike the MDC and SDC, the SVM and CNN are not so sensitive to the intensity of the green colour. This is best illustrated by the performance of the models on forest covered by a cloud's shadow in the right part of Figure 9. The MDC and SDC incorrectly treat the shadowed vegetation as non-forest, while the other two tag it as forest. Shadows from clouds could generally be accounted for by cloud detection, which is a solved problem in Earth image processing (see e.g. Main-Knorn et al. [2017]).

Conclusion
Forest images can be successfully modelled using alpha-stable distributions. This fact was used to create two new frequentist approaches to forest/non-forest classification. In the experiments with real data our methods outperform the benchmark machine learning ones, which indicates superiority of the former methods on small data. The same was not checked for big data, and we note that frequentist techniques tend to be inferior to machine learning in such cases.
The new algorithms are very sensitive to colour intensities, which means that the training data must be obtained under the same conditions as the data for classification.Given this, on a dataset of images collected from different sources under different conditions, machine learning algorithms have an advantage.
Despite the fact that the new algorithms are used for 0-1 classification, they are suited for multi-type classification tasks such as detecting different types of forest.

Fig. 3: Data processing for the non-parametric testing.
Fig. 4: The multiple hypothesis testing scheme in the non-parametric model.
Fig. 5: Training in the non-parametric model. Images from set 2 are tagged according to the result of evaluating the Boolean expression min_i T̂_ij < T; as previously, True corresponds to forest, while False corresponds to non-forest.

Algorithm 1 Non-parametric cross-validation procedure
1: Inputs: Data set of N images labelled forest or non-forest, number of folds k and max threshold T_max.
2: Compute means and covariances for every image.
3: Randomly shuffle images and group them into k folds.
4: for i = 1 to k do
5:   for each image in fold i do
6:     Compute T = \frac{n_1 n_2}{n_1 + n_2} D^2 from Theorem 1, between the image and all images labelled forest in the remaining k − 1 folds.
7:     Keep only the smallest T for the image.
8:   end for
9: end for
10: for t in 0 to T_max do
11:   For all images and all T's, tag images with T < t as forest and T > t as non-forest.
12:   Compute the accuracy for the given threshold t, Acc(t).
13: end for
14: return arg max_t Acc(t) and the n's, means and covariances of forest images.

Fig. 7: Classification in the parametric model.
Fig. 8: Forest classification error rates depending on the training set size using MDC, SDC, SVM and CNN for different image sizes.
Fig. 9: Forest classification using MDC, SDC, SVM and CNN of a 2000 × 2000 pixels image using sub-images of size 10 × 10 pixels. White corresponds to forest and black corresponds to non-forest.
A test image is compared with each of the forest images in the training set.