Thematic quality assessment of land surface geospatial data based on confusion matrices: A matrix set for research on measures and procedures

The confusion matrix has long been adopted as the ‘de facto’ and ‘de jure’ standard method of reporting on the thematic accuracy assessment of any land surface geospatial dataset. This type of data supports decision‐making in many different fields, so suitable quality is essential for making the best decisions. Nevertheless, the creation and exploitation of the confusion matrix remains an open topic, with issues related to sampling design, quantitative indices derived from the matrix, statistical hypotheses that could be applied, etc. A set of confusion matrices would therefore be useful to researchers working on these issues. We have developed such a dataset by retrieving confusion matrices from the literature, mainly research articles published in scientific journals included in the Web of Science (WoS). We have collected almost 200 matrices in a database. This allows us to access the complete matrices and to query properties of the matrices and of the projects in which they were developed, such as matrix size, sample size, location, year of data capture, labels of the classes, quality indices used, and extent and location of the project (where available).


1 | INTRODUCTION
The so-called confusion matrix (also referred to as misclassification matrix or error matrix) is the usual structure by which the classification correctness of land surface geospatial data is represented (see Congalton and Green, 2009; Stehman and Foody, 2019, among many others). Classification correctness is one of the data quality elements established by ISO 19157 (International Organization for Standardization, 2013) for thematic accuracy when describing geospatial data quality. Land surface geospatial data (spatial databases, topographic maps, thematic maps, classified images, remote sensing products, etc.) support decision-making in several fields such as climate change, crop forecasting, forest fires, national defence, civil protection and spatial planning. Therefore, suitable quality is essential to ensure that decisions based on these data are technically the best possible.
The confusion matrix is also included in the list of standardized data quality measures of ISO 19157 (International Organization for Standardization, 2013). Different metrics can be derived from the matrix (see Congalton, 1991), but it is recommended to always report the raw confusion matrix so that the user of the data can derive any metric suitable for their needs (Salk et al., 2018). The structure of a confusion matrix is summarized below.
Suppose that $k$ categories $C_1, C_2, \dots, C_k$ are given (e.g. land cover categories) and $n$ sample units are observed. All sample units are classified into categories by a certain classification method, and the result is displayed in a contingency table called a confusion matrix. The $(i, j)$ element, $n_{ij}$, represents the number of sample units that actually belong to $C_j$ and are classified into $C_i$, for $i, j = 1, \dots, k$. In this way, the columns and rows of the contingency table correspond, respectively, to reference (index $j$) and classified (index $i$) data (Table 1). Consequently, the diagonal elements count correctly classified items and the off-diagonal elements count the confusions, namely the errors due to omission and commission.
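The indices most often derived from this structure follow directly from the diagonal and from the row and column totals. The following is a minimal sketch, assuming a hypothetical 3-class matrix with illustrative counts (not taken from the dataset) and the convention described above (rows are classified data, columns are reference data). The function name `accuracy_indices` is our own and the sketch assumes that no row or column total is zero.

```python
# Hypothetical 3-class confusion matrix (illustrative counts only).
# Rows: classified data (index i); columns: reference data (index j).
matrix = [
    [45, 4, 1],
    [6, 40, 4],
    [9, 6, 35],
]


def accuracy_indices(cm):
    """Return overall, user's and producer's accuracies of a square matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)          # total number of sample units
    diagonal = [cm[i][i] for i in range(k)]  # correctly classified units
    row_totals = [sum(cm[i]) for i in range(k)]                       # classified totals
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # reference totals
    overall = sum(diagonal) / n
    # User's accuracy: share of units classified as C_i that truly belong to C_i
    # (complement of the commission error).
    users = [diagonal[i] / row_totals[i] for i in range(k)]
    # Producer's accuracy: share of reference units of C_j classified as C_j
    # (complement of the omission error).
    producers = [diagonal[j] / col_totals[j] for j in range(k)]
    return overall, users, producers


overall, users, producers = accuracy_indices(matrix)
print(f"Overall accuracy: {overall:.3f}")
print("User's accuracy:", [round(u, 3) for u in users])
print("Producer's accuracy:", [round(p, 3) for p in producers])
```

Because the raw matrix is kept, any other index can be computed later from the same counts, which is the reason given above for always reporting the full matrix.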
The use of a confusion matrix is long-established in research studies that involve land classification processes. Nevertheless, several aspects of the creation and exploitation of the matrix remain topics of interest, such as new tools (e.g. Bratic et al., 2018), sampling design (e.g. Stehman, 2000, 2014), indices derived from the matrix (e.g. Chen et al., 2010; Pontius and Millones, 2011; Salk et al., 2018; Stehman, 2013) and proposals for testing statistical hypotheses (e.g. Foody, 2004; García-Balboa et al., 2018). It would be useful for a researcher in this field to have available a wide set of land classifications in different locations and for different applications, involving different sampling designs or classification strategies, etc., which could be difficult or costly to develop by other means. Here, we propose, as an alternative, to retrieve from the literature a sufficient number of matrices so that a researcher can use those most suitable for a specific research goal.
Our proposal is to make available a database that stores all this information. It covers not only the numeric data of each matrix but also complementary information related to the context of the research and to the metadata of the spatial data involved. As an example, with this database a researcher can explore key issues such as:
• the size of the matrices (number of classes),
• the indices computed from the matrices,
• the sample size,
• the type of geospatial data in both assessed datasets and control datasets, and
• the classifications used.
It also allows information for a meta-analysis to be collected easily.

2 | DATA PRODUCTION METHODS
Key Findings
• The confusion matrix has long been adopted as the standard way of reporting on the thematic accuracy assessment of any land surface geospatial dataset.
• The creation and exploitation of the confusion matrix remains an open topic. A confusion matrix dataset, useful for researchers in this area, has been retrieved from research articles published in scientific journals.
• Almost 200 matrices have been collected in a database in an open format.

TABLE 1 Structure of a confusion matrix with k categories
| Classified data \ Reference data | $C_1$ | $C_2$ | ⋯ | $C_k$ |
|---|---|---|---|---|
| $C_1$ | $n_{11}$ | $n_{12}$ | ⋯ | $n_{1k}$ |
| $C_2$ | $n_{21}$ | $n_{22}$ | ⋯ | $n_{2k}$ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| $C_k$ | $n_{k1}$ | $n_{k2}$ | ⋯ | $n_{kk}$ |

As a primary source of data, we have considered all confusion matrices included in publications that meet the following requirements:
• It is a research article in a scientific journal.
• It is indexed in the Web of Science (WoS) database.
• It belongs to one of the suitable WoS categories.
Therefore, the following search was performed in WoS:
• Topic: "confusion matrix" or "confusion matrices" or "error matrix" or "error matrices". This search examines the following fields: title, abstract and keywords. The search terms are enclosed in quotes to match the exact phrase.
• First filter: document type is article.
• Second filter: WoS categories are 'remote sensing' or 'geosciences multidisciplinary'.
• Third filter: publication year is prior to 2019.
The search returned 435 articles, with an h-index of 57 and an average of 27.81 citations. The following figures summarize the results of the search:
• Figure 1 presents the number of articles published in each WoS category (first 10 categories). More than the two selected categories appear because journals are usually included in more than one category.
• Figure 2 includes the number of articles published in each journal (first 10 journals).
• Figure 3 shows the number of articles published each year, with a clear rising trend.
The previous search produced a list of research articles, each of which may or may not contain one or more confusion matrices that could be added to the confusion matrix set. Therefore, the next goal is to review each article to find out whether it contains a complete confusion matrix. The steps are as follows:
• Export the article list from WoS.
• Access the full text of each article. This step requires a subscription to every journal included in the search results.
• Read each article and locate every confusion matrix. Confusion matrices are usually included as tables throughout the text.
• Annotate the data included in each confusion matrix in order to include them in our confusion matrix database. These data include not only the numeric information but also the name of each class. When entering the matrix data into the database, the criterion of putting the reference data in columns and the classified data in rows was adopted.
• If available in the document, annotate any other complementary information that could be useful to understand the context in which the confusion matrix is applied. This information relates to location (name or coordinates), area size, field of application and indices computed. Metadata of both datasets, the controlled dataset (CDS) and the reference dataset (RDS), are also annotated: type of geospatial data, date, scale, resolution, accuracy, sampling method, etc.

3 | DATASET LOCATION AND FORMAT
All data are included in a database created following a relational model. This contains not only the numerical data of every confusion matrix found in the literature but also any other data collected. The entity-relationship diagram (ERD) is presented in Figure 4. There is a main table with the name 'ConfusionMatrixSet', which stores the numerical data of the matrix and some other complementary data (see Table 2). The primary key ('Id') of the table 'ConfusionMatrixSet' is used as a foreign key in different supplementary tables (see Figure 4), which complement the information about each confusion matrix. A different table is included for different aspects that could be of interest to a researcher.
Additional tables are as follows:
• … unknown, a value of '9999' is set. The type of geospatial data is a text field with the most concise description possible, such as satellite image, SPOT imagery, crop map, CORINE Land Cover and lidar data. If it is unknown, a value of 'NAV' (not available value) is set.
• Table 'RDS_Metadata'. These metadata describe the control/reference data used to assess the data and are derived from the original document. The same fields as the metadata of the assessed data are included (data source type, date, scale/resolution, accuracy). Additional fields are added for the sampling method (e.g. random sampling) and the sample unit (e.g. pixel).
Complementary tables are as follows:
• … contains a sheet for every confusion matrix included in the database. This file is included only as an alternative way to access the numeric data of a confusion matrix. It holds the same data as the field 'CM_Data' in Table 2. Sheet names match the values of the field 'CM_ExcelSheet' in Table 2.

4 | DATASET USE
Classification correctness assessment through confusion matrices is an open and interesting topic. The dataset of published confusion matrices for land surface geospatial data constitutes an important resource for advancing research on geospatial thematic data quality. A researcher can select from the dataset those cases that are suitable for the …
… table 3) is stored in the database as follows: [[445,84,30,5,38], [22,51,17,1,9], [11,11,32,0,9] …
• year of data capture,
• quality indices used and
• extent and location of the project (where available).
As an example, a researcher may be interested in the use of the kappa coefficient (κ) when assessing Sentinel-2 images. Querying the tables 'CDS_metadata' and 'Indices' shows that κ is applied in matrices 210-213, published in Wessel et al. (2018).
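A query of this kind can be sketched as a join between the two supplementary tables on the shared key 'Id'. The sketch below is only an illustration under several assumptions: it uses an in-memory SQLite database as a stand-in for the published database, the column names `DataSourceType` and `IndexName` are hypothetical, and the rows are invented for the example rather than taken from the dataset.

```python
import sqlite3

# In-memory stand-in for the database; only table names and the key 'Id'
# come from the text. Column names DataSourceType and IndexName are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ConfusionMatrixSet (Id INTEGER PRIMARY KEY);
    CREATE TABLE CDS_metadata (
        Id INTEGER REFERENCES ConfusionMatrixSet(Id),
        DataSourceType TEXT
    );
    CREATE TABLE Indices (
        Id INTEGER REFERENCES ConfusionMatrixSet(Id),
        IndexName TEXT
    );
    """
)

# Invented rows for illustration only (not records from the dataset).
conn.executemany("INSERT INTO ConfusionMatrixSet VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO CDS_metadata VALUES (?, ?)",
    [(1, "Sentinel 2 imagery"), (2, "Sentinel 2 imagery"), (3, "SPOT imagery")],
)
conn.executemany(
    "INSERT INTO Indices VALUES (?, ?)",
    [(1, "kappa"), (1, "overall accuracy"), (2, "overall accuracy"), (3, "kappa")],
)

# Matrices whose assessed data are Sentinel 2 images and that report kappa.
query = """
    SELECT DISTINCT m.Id
    FROM CDS_metadata AS m
    JOIN Indices AS x ON x.Id = m.Id
    WHERE m.DataSourceType LIKE '%Sentinel 2%'
      AND x.IndexName = 'kappa'
    ORDER BY m.Id
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
conn.close()
```

The returned identifiers can then be used to retrieve the numeric data of each matrix from the main table 'ConfusionMatrixSet'.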
The whole dataset can also be used for analyses that clarify how the scientific community has used the confusion matrix. As an example, Liu et al. (2007) explored the number of categories through a frequency histogram and found that the majority of matrices have between 3 and 7 categories. A similar histogram is easy to obtain with our dataset (Figure 5). We found that the most frequent matrix size is 5 categories and that the majority range between 2 and 7 categories.
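A frequency count of this kind can be sketched in a few lines. The example assumes that each value of the field 'CM_Data' is a text string holding a nested list that can be parsed as JSON, as suggested by the stored form of the matrices. The three strings below are invented for illustration, and the helper name `matrix_size` is our own.

```python
import json
from collections import Counter

# Invented 'CM_Data' values for illustration (not records from the dataset).
cm_data_values = [
    "[[10,2],[3,15]]",
    "[[8,1,0],[2,9,1],[0,2,7]]",
    "[[12,0,1],[1,11,2],[0,3,9]]",
]


def matrix_size(cm_text):
    """Parse a stored matrix and return its number of categories k."""
    matrix = json.loads(cm_text)
    k = len(matrix)
    if any(len(row) != k for row in matrix):
        raise ValueError("confusion matrix must be square")
    return k


# Frequency of each matrix size, printed as a simple text histogram.
size_counts = Counter(matrix_size(text) for text in cm_data_values)
for size in sorted(size_counts):
    print(f"{size} categories: {'#' * size_counts[size]}")
```

Applied to all records of the main table, the same count would give the data behind a histogram of matrix sizes.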
A similar analysis could be performed for any other aspect of interest, such as the quality indices derived from the matrices. Widely adopted indices for thematic accuracy control based on confusion matrices are the overall accuracy (OA) and κ (see Congalton, 1991 or Congalton and Green, 2009), but more indices, global or category-related, can be found in the literature. Liu et al. (2007) compile 20 global indices and 14 category-related indices; more recently, Morales-Barquero et al. (2019) review popular measures of accuracy in the context of natural resources, and Stehman and Foody (2019) look at temporal trends in accuracy reporting. Our database can support studies of this type. For instance, Figure 6 shows that the most frequent indices in the dataset are OA and κ (global), producer's accuracy and user's accuracy. All other indices are used only rarely.
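For reference, κ compares the observed agreement $p_o$ (the overall accuracy) with the agreement $p_e$ expected by chance from the row and column totals: $\kappa = (p_o - p_e)/(1 - p_e)$. The following is a minimal sketch of this standard formulation, assuming a hypothetical 3-class matrix with illustrative counts (not taken from the dataset); the function name `kappa_coefficient` is our own.

```python
# Hypothetical 3-class confusion matrix (illustrative counts only).
# Rows: classified data; columns: reference data.
matrix = [
    [45, 4, 1],
    [6, 40, 4],
    [9, 6, 35],
]


def kappa_coefficient(cm):
    """Return Cohen's kappa for a square confusion matrix."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of units on the diagonal (overall accuracy).
    p_o = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: sum over classes of (row share) * (column share).
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


kappa = kappa_coefficient(matrix)
print(f"kappa = {kappa:.3f}")
```

The sketch assumes $p_e < 1$, which holds whenever the sample units are not all assigned to a single class in both datasets.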

5 | CONCLUSIONS
The use of the confusion matrix remains an open topic in geospatial data. There are many aspects to take into account when applying this tool and when assessing thematic accuracy: sample size, sampling unit, sampling design, quantitative indices derived from the matrix, statistical hypotheses that could be applied, etc. We have therefore developed a dataset that could be useful for any researcher interested in the matter. We have created a database with almost 200 confusion matrices published in research articles indexed in the WoS database. Any researcher can perform queries to retrieve useful data. The dataset is offered in both open and proprietary formats at Figshare. Further work can expand the database, not only in the number of matrices but also by exploring other literature sources. Reports from recognized institutions and research articles indexed in databases other than WoS can also be of interest.