Comparison of built‐up area maps produced within the global human settlement framework

The validation of built‐up areas derived from different sensors is crucial for gaining a deeper understanding of the consistency and interoperability between them. This article presents the methodology and results of an inter‐sensor comparison of built‐up area data derived from Landsat, Sentinel‐1, Sentinel‐2, and SPOT5/SPOT6. The assessment was performed for 13 cities across the world for which cartographic reference building footprints were available. Several validation approaches were used: cumulative built‐up curve analysis, pixel‐by‐pixel performance metrics, and regression analysis. The results indicate that Sentinel‐1 and Sentinel‐2 contribute greatly to improved built‐up area detection compared to Landsat, within the global human settlement framework. However, Sentinel‐2 tends to show high omission errors while Landsat tends to have the lowest omission error. The built‐up area obtained from SPOT5/SPOT6 shows high consistency with the reference data for all European cities, and hence can potentially be considered as a reference dataset for wall‐to‐wall validation in Europe.


| INTRODUCTION
Knowledge of the spatial distribution of human settlements and monitoring of urban expansion are crucial for a large number of applications, such as exposure mapping and risk assessment, infrastructure planning, biodiversity conservation, climate change, and urban development (Chrysoulakis et al., 2014; Triantakonstantis et al., 2015; Florczyk et al., 2016). One key for understanding worldwide urbanization processes and developing actions toward sustainable urban and rural development is the availability of detailed, up-to-date, accurate, and consistent (in time and space) information on human settlements. Human settlements are closely related to population distribution, as well as to social and economic development which, from the spatial perspective, is not directly measurable. Built-up areas, which refer to the physical space used for human habitation, represent the physical dimension of human settlements and the most appropriate surrogate for urbanization processes. The spatial aspect involved in the definition of built-up areas is what makes remote sensing appealing for extracting global and up-to-date information on the extent of the built-up environment that can be related directly to human settlements.
The availability of open and free, high-spatial-resolution earth observation data (Copernicus Sentinel-1 and Sentinel-2 data, Landsat imagery), combined with robust algorithms for automatic information extraction and high-performance computing systems, is opening the path to a new generation of high-resolution (HR) products (10-30 m) describing built-up areas. Over the past decade, several initiatives have successfully produced pan-European and global maps of built-up areas (Aune-Lundberg & Geir-Harald, 2010; Gong et al., 2013; Ferri et al., 2014; Montero, Van Wolvelaer, & Garzón, 2014). The most exhaustive and widely used maps of built-up areas are those produced with TerraSAR-X data, namely the Global Urban Footprint (GUF) (Esch et al., 2013) or with free Landsat data, such as GlobeLand30 (Chen et al., 2015), and the Global Human Settlement Layer (GHSL) (Pesaresi, Ehrlich et al., 2016).
In the framework of the GHSL project, a generic methodology has been developed for automatic extraction of built-up surfaces from large volumes of remote-sensing images. The main characteristics of the GHSL methodology are scene-based processing and a multi-scale learning paradigm that combines auxiliary datasets with the extraction of textural and morphological image features. This methodological framework proved to be robust and scalable, with the possibility of adapting it for processing imagery in different spatial resolutions from a variety of sensors, including optical and radar imagery. It was applied successfully to the production of continental and global-scale built-up products. At the continental scale, the GHSL methodology was used to produce the European Settlement Map (ESM) from high-resolution SPOT imagery and the South African settlement map (Kemper et al., 2015). The ESM is the first high-resolution built-up layer for Europe and currently the most detailed map expressing the proportion of the pixel area covered by buildings at a spatial resolution of 10 m. At the global scale, a step toward the production of a global human settlements monitoring system was achieved with the first multi-temporal GHSL data derived from Landsat imagery (Pesaresi, Ehrlich et al., 2016), hereinafter referred to as GHSL-Landsat. Compared to concurrent maps of human settlements, GHSL-Landsat currently represents the most up-to-date, multi-temporal, and open data on the physical characteristics and dynamics of human settlements. Drawing on 40 years of Landsat data, the multi-temporal grids describing built-up areas have been produced for the periods 1975, 1990, 2000, and 2015 as part of the global human settlement (GHS) framework.
The open data policy of the GHSL is currently driving the future development of global built-up products. The plan is to maximize the use of Copernicus earth observation missions (i.e. Sentinel-1 and Sentinel-2) in conjunction with Landsat data. In that context, the technology used for producing the first global layers from Landsat data was adapted to the processing of a worldwide coverage of mono-temporal Copernicus Sentinel-1 data (https://ghsl.jrc.ec.europa.eu/s1_2017.php). The output of this experiment was a new global built-up layer at a spatial resolution of 20 m. Deploying the system for the analysis of Sentinel-2 imagery is a cutting-edge technological issue which requires a series of tests to analyze the suitability of this new sensor data for mapping human settlements. In this regard, a prototype has been tested on the first Sentinel-2 images released during the commissioning phase (Pesaresi, Corbane et al., 2016). This study confirmed (i) the noticeable improvement of Sentinel-2 in comparison to Landsat for built-up classification; and (ii) the added value of combining Sentinel-1 and Sentinel-2 data for improving the discriminatory ability compared with results obtained using single-sensor data. The prototype for Sentinel-2 is currently in its final development phase and has also been tested on a set of cities across the world.
Despite the successful adaptation of the GHSL methodology to the analysis of several types of satellite data, currently there is a lack of assessment of the consistency among the derived products and insufficient information on their accuracy. Users of these products should get minimum guidance on which dataset to use and for what purpose. Besides, to fully exploit the synergies between the different sensors and possibly integrate those products in the future, it is essential to understand the potential and limitations of each of them and identify complementarities. In this respect, the validation of built-up data derived from different sensors is crucial for gaining a deeper understanding of the consistency and interoperability between the different products. In the specific case of the Sentinel-2-derived built-up data, validation is necessary for improving the workflow currently under development by providing insights into the main issues that need further research and experimentation.
The aim of this article is to provide validation and comparison of the built-up maps, derived with the GHSL framework, from different input sensors and with different spatial coverages:
• At the local scale, with Sentinel-2 for which the workflow for built-up extraction is currently in the prototyping and testing phase.
• At the regional scale, with SPOT5/SPOT6 used for producing the ESM.
• At the global scale, with Landsat and Sentinel-1 used as input data for the production of GHSL-Landsat and GHSL-Sentinel-1, respectively.
As a benchmark, fine-scale building footprints were used as reference data. The analysis was performed over 13 cities selected from different continents, with diverse landscapes, settlement structures, and densities. To paint a complete picture, we developed and applied a validation framework that incorporates conventional methods of pixel-by-pixel accuracy assessment and analysis of grid-based differences in built-up data in relation to settlement densities.
Through this detailed validation, this article attempts to identify the main differences in the detection of built-up data across sensors and identify potential areas of synergy between the different maps.
In order to provide an overview of the agreement/disagreement between the different products, an intercomparison of several globally available layers was also performed.

| PREVIOUS ASSESSMENTS OF BUILT-UP AREA MAPS
The challenge of providing a consistent and practical definition of "built-up areas" places a limit on the accuracy of the generated products, which is perhaps as significant as the sensor-based errors, such as the spectral data quality or geolocation. In the GHSL paradigm, the fundamental link between earth observation sensor data and human presence is the observable presence of built-up structures or buildings. Specifically, we define built-up areas as the areal units recording the full or partial presence of buildings and the space in-between buildings (Tenerelli & Ehrlich, 2011). Practically speaking, for the purpose of validation, the definition of a built-up area is translated as "the area where the intersection between the reference building footprint and the raster grid is greater than 0." Figure 1 illustrates the concept of "built-up area" in a given grid of regularized pixels. The size of the cell is determined by the spatial resolution of the remote sensor.
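The operational definition above (a cell is built-up when its intersection with the reference footprints is greater than 0) can be illustrated with a minimal sketch. It assumes the footprints have already been rasterized to a 1 m grid of 0/1 values and that the sensor cell size divides the grid evenly; the function name `builtup_mask` and the input layout are hypothetical and are not part of the GHSL tooling.

```python
def builtup_mask(footprint_1m, cell_size):
    """Flag every cell_size x cell_size cell that contains any footprint pixel.

    footprint_1m: list of rows holding 0/1 values on a 1 m grid (assumed input).
    cell_size: sensor pixel size in metres, assumed to divide the grid evenly.
    """
    n_rows = len(footprint_1m) // cell_size
    n_cols = len(footprint_1m[0]) // cell_size
    mask = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Area (in m^2) of the intersection between footprints and this cell.
            overlap = sum(
                footprint_1m[r * cell_size + i][c * cell_size + j]
                for i in range(cell_size)
                for j in range(cell_size)
            )
            # Built-up as soon as the intersection is greater than 0.
            row.append(overlap > 0)
        mask.append(row)
    return mask
```

With this rule a single footprint pixel is enough to mark a whole sensor cell as built-up, so coarser cells tend to yield a larger total built-up area from the same footprints.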
This definition of built-up areas also determines the technical specifications of the reference datasets required for validating the built-up area maps. The reference datasets should contain information based on building footprints or density of footprints. At present, there are no global validation datasets with these properties, and not even a statistically representative sample of reference building footprints. The existing global reference datasets have been produced to validate land cover classes, with the built-up areas implicitly considered under the "urban" or "artificial surfaces" land cover classes (Potere, Schneider, Angel, & Civco, 2009; Schneider, Friedl, & Potere, 2009; Gong et al., 2013; Chen et al., 2015). As a rule, given the need to assess the quality of multiple classes, the number of samples representing spatially rare land cover classes (such as the built-up area) may be very small, therefore limiting their utility for the validation of a thematic map (here, a single class built-up area map).
A large majority of the global validation efforts are also based on reference point datasets collected through visual interpretation of high and very-high-resolution satellite imagery. These point-based reference datasets are inadequate for the purpose of our analysis, which aims at preserving the geometric and thematic reliability of the different maps. Table 1 provides a summary of the validation efforts, with a regional or global scope, aimed at assessing the quality of land cover maps, including the classes related to the built-up environment. Some of these initiatives correspond to mono-class validation targeting only the "built-up areas."

| OVERVIEW OF THE FOUR BUILT-UP PRODUCTS AND THEIR CLASSIFICATION METHODS
We present here a brief description of the four built-up maps under assessment, together with an overview of the information extraction workflows which were developed for producing them using imagery from different satellite sensors (Landsat, Sentinel-1, Sentinel-2, and SPOT5/SPOT6).
FIGURE 1 Operationalization of the "built-up area" concept in a given raster grid (white). Blue line: the reference footprint of a built-up structure. Black line: access road to the building, with a small parking lot. The union of green cells corresponds to "non-built-up areas," while the union of orange cells represents "built-up area"
TABLE 1 An overview of validation studies of global and regional layers: the "built-up specific" column corresponds to a mono-class validation experiment (targeting the "built-up area" class)
The technology at the core of the Landsat, Sentinel-1, and Sentinel-2 built-up products relies on the symbolic machine learning (SML) supervised classifier. The basic concepts of the SML methodology are presented briefly in the next subsection. Then, the application of the SML to the classification of Landsat data for the generation of GHSL-Landsat is recalled, followed by a concise introduction of the adapted methods for the analysis of Sentinel-1 and Sentinel-2 data. Finally, a brief overview of the methodology used for the generation of the ESM from SPOT imagery is also given here for information purposes.
The SML schema is based on two relatively independent steps:
1. Reducing the data instances to their symbolic representation (i.e. a set of unique discrete data sequences).
2. Evaluating the association between the unique data sequences subdivided into two parts: X (input features) and Y (known class abstraction).
The association is measured as a confidence index, called the Evidence-based Normalized Differential Index (ENDI), which provides a continuum of positive and negative values ranging from −1 to 1. The ENDI expresses the strength of association between the image data layers and the reference data. Values close to 1 indicate that the data sequence is strongly associated with the image class of interest (the built-up area in our case), while values close to −1 indicate that the feature is strongly associated with classes other than the built-up area.
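To make the idea of a signed association index concrete, the sketch below scores each unique data sequence with a normalized difference between its frequency in built-up and in non-built-up learning samples. This form is an assumption chosen because it is bounded by -1 and 1, as described above; the exact ENDI formulation used in GHSL production is not given in this section, and the function name `endi_per_sequence` is hypothetical.

```python
from collections import Counter


def endi_per_sequence(sequences, labels):
    """Score each unique sequence in [-1, 1] (assumed normalized-difference form).

    sequences: hashable symbolic sequences, one per data instance (X part).
    labels: truthy for built-up learning samples, falsy otherwise (Y part).
    """
    pos = Counter()  # occurrences of each sequence in built-up samples
    neg = Counter()  # occurrences of each sequence in non-built-up samples
    for seq, label in zip(sequences, labels):
        if label:
            pos[seq] += 1
        else:
            neg[seq] += 1
    # Every sequence seen at least once has a positive denominator.
    return {
        seq: (pos[seq] - neg[seq]) / (pos[seq] + neg[seq])
        for seq in set(pos) | set(neg)
    }
```

A sequence observed only in built-up samples scores 1, one observed only elsewhere scores -1, and a sequence equally frequent in both scores 0.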

| Built-up extraction from Landsat (GHSL-Landsat)
The GHSL-Landsat product was generated using an information extraction technique based on the SML classifier. The input data consists of four Landsat data collections for 1975, 1990, 2000, and 2014.

| Built-up extraction from Sentinel-1
In 2016, the availability of a global coverage of high-resolution SAR data collected from the European Sentinel-1 mission was a motivation for testing the applicability of the SML classifier to this new imagery in view of improving and updating GHSL-Landsat (Corbane, Lemoine et al., 2017).
The SML workflow was adapted to exploit the key features of the Sentinel-1 ground range detected (GRD) data. The learning data at the global level consisted of the union of the built-up data obtained from Landsat and the Global Land Cover map at 30 m resolution (GLC-30). The latter has also been derived from Landsat imagery through operational visual analysis techniques (Chen et al., 2015).
A simplification of the adapted SML workflow for the classification of Sentinel-1 data (S1) is shown in Figure 2, with a total of 21 input features, and a reference built-up layer (derived from the built-up data from Landsat and GLC-30) used for learning in the association analysis.

| Built-up extraction from Sentinel-2
At the time of writing this article, the prototype for built-up extraction from Sentinel-2 imagery was still under development. However, an almost stable version of the algorithm was being tested in different landscapes with satisfactory results. The algorithm builds on the SML workflow with adjustments designed to exploit the key features of Sentinel-2 data: (i) availability of four 10 m spatial resolution bands (B2, blue; B3, green; B4, red; B8, near infrared); (ii) the availability of six bands at 20 m resolution, especially in the near infrared and shortwave infrared (B5, B6, B7, B8a in near infrared; B11, B12 in shortwave infrared).
FIGURE 2 Simplified workflow showing the adaptation of the SML to the classification of Sentinel-1 images at the global level. The input features comprise 18 features derived from dual-polarization Sentinel-1 intensity data and 3 topographic features derived from a global digital elevation model (i.e. SRTM)
The following features derived from Sentinel-2 are used for classification of the Sentinel-2 image with the SML approach:
• Spectral features: four 10 m resolution and six 20 m resolution bands.
• Textural features: four textural features derived from the four input 10 m bands by applying a texture-derived built-up presence index, pantex (Pesaresi, Gerhardinger, & Kayitakire, 2008). These four features were combined into a single feature using the minimum operator. The textural feature is used to refine the output confidence layer by eliminating overdetections.

| ESM based on SPOT5 and SPOT6
The ESM is the first high-resolution built-up layer for Europe at 2.5 m. The ESM classification workflow used more than 3,500 satellite images from SPOT5 and SPOT6 sensors as input data with spatial resolutions of 2.5 and 1.5 m. The ESM production workflow was based strongly on the GHSL methodology introduced by Pesaresi et al. (2013); however, the GHSL approach was adapted to accommodate the 2.5 m input data and to include the use of the vegetated surface detector. The main elements of the GHSL methodology were still the core of the applied method: (i) scene-based processing; and (ii) multi-scale learning which combines auxiliary datasets with the extraction of textural and morphological image features. The validation of the ESM layer against the Land Use/Cover Area frame Survey (LUCAS) data (Eurostat, 2015) gave an overall accuracy of 96% with omission and commission errors lower than 4% and 1%, respectively. Full details on the ESM workflow and validation can be found in Florczyk et al. (2016).
In this work, the validation is performed using the recent release of ESM at 2.5 m spatial resolution (Ferri, Siragusa, Sabo, Pafi, & Halkia, 2017). Table 2 provides a summary of the layers analyzed in this study. It should be noted that the S2 layer is still under production, and the results presented here were derived from sensor images acquired in 2016.

| Ancillary datasets
The SML framework makes extensive use of already available spatial information describing human settlements from various sources at different scales, thematic definition, and completeness or accuracy conditions. SML is a very robust classifier. This has been tested in three typical scenarios of altered learning sets:
• scale generalization noise, in which the learning set was degraded to 10 different spatial resolutions;
• random thematic noise, in which a salt-and-pepper noise was introduced to the learning set;
• spatial displacement noise, in which the learning set was systematically shifted by 36 different displacement vectors with a maximum displacement of 1,080 m.
The results showed that the SML outperforms parametric classifiers in terms of robustness. It is largely agnostic, both with respect to the statistical distribution of the input data and to the relations between the image data and the information to be detected. This fact largely facilitates the generalization of the identical classification process to image data collected by different sensors and with different local data collection conditions (season, illumination, building practices, materials, settlement patterns, natural background).

TABLE 2 Characteristics of the built-up products under validation
All the datasets that were used here as a training set or in the postprocessing step are referred to as ancillary datasets.
• MODIS global urban extents (MODIS) was produced with a supervised decision tree classification algorithm for 1-year MODIS data input (2001-2002). Validation of MODIS with fine-scale building footprints showed an overall accuracy of 0.86 and a kappa coefficient of 0.09 (Pesaresi, Ehrlich et al., 2016).
• The GlobCover 2009 land cover product is the second 300 m global land cover map produced from an automated classification of MERIS time series. Validation of GlobCover 2009 class "artificial surfaces" with fine-scale building footprints yielded an overall accuracy of 0.88 with a kappa coefficient of 0.08 (Pesaresi, Ehrlich et al., 2016).
• GLC-30 was produced with a hybrid pixel-object-knowledge-based classification. Reported overall accuracy is 80%, with a kappa coefficient of 0.75 and a user accuracy of 86.70% for artificial surfaces.
• Both the Soil Sealing Layer from the European Environment Agency (EEA) and Corine Land Cover (CLC) 2006 rasters with 100 m pixel size were used as a training set for the production of the SPOT5/SPOT6 product. The training set used to rescale the pantex and saliency features was defined as an intersection of sealed surface greater than 0 and urban fabrics from the CLC.
All the datasets used, either as a learning set or in the postprocessing step, are summarized in Table 3.

| VALIDATION METHODOLOGY
In this section we propose a validation framework which aims to explore a series of hypotheses related to the sensors under assessment and the features used for built-up extraction from the different sensors. In terms of sensor performance, we test the following:
• The SPOT5/SPOT6-derived product outperforms the other products due to its fine spatial resolution.
• The Landsat-derived product provides the least accurate results due to its lower spatial resolution in comparison to the other sensors.
• The products derived from Sentinel-2 and Sentinel-1 outperform the Landsat-derived product due to their higher spatial resolution. In the case of Sentinel-1, we expect underdetection in dense built-up areas due to shadow effects that occur on the sides of vertical structures facing away from the incoming radar beams.
• Textural features, which allow discrimination of built-up structures from other image features (edges and lines), have the drawback of producing a higher omission rate (Pesaresi et al., 2008).
The validation framework developed for testing these hypotheses is described in this section. It incorporates conventional methods of pixel-by-pixel accuracy assessment and analysis of grid-based differences in built-up data in relation to settlement densities. The purpose of this framework is to achieve a comprehensive and systematic description of the accuracy and validity of a layer under comparison.
First, we determine the absolute accuracies and performances on the basis of a pixel-by-pixel accuracy matrix. Then, the dependencies between the built-up and the physical settlement structures are explored using a grid-based analysis. This is performed following two approaches: (i) analysis of the relationship between built-up densities derived from the reference data and the cumulative built-up area from the different layers; and (ii) correlation analysis between the sums of built-up pixels per cell derived from the different layers and the observed reference built-up cell sums. The rationale and details of these three validation approaches are described in the following subsections.
The methodology and results of the global inter-comparison follow, after the methodology and results of the three previously mentioned validation approaches.

| Performance metrics
This subsection is dedicated to the pixel-based accuracy assessment, which is a de facto standard in the remote sensing community. It is based on the metrics derived from the confusion matrix (i.e. error matrix) (Congalton, 1991). For the analysis of absolute classification accuracies, we follow the recommendations of Foody. The overall accuracy is the proportion of correctly classified pixels among all the pixels in a specific site (Equation 1). However, this metric is known to be biased by class imbalance, as in the case of built-up areas (Jeni, Cohn, & De La Torre, 2013). In this work, we focus on the relative comparison of the different products. Therefore, despite the limitations of this metric, we report the overall accuracies of the different built-up layers in combination with additional performance metrics. The kappa coefficient (Equation 2), introduced by Cohen (1960), highlights the differences between the actual agreement in the error matrix (i.e. the correctly classified sample units presented by the major diagonal) and the chance agreement presented by the column and row totals. Landis and Koch (1977) proposed ranges for interpreting kappa values. The commission error (CE) is the fraction of values classified as built-up that do not belong to the built-up class in the reference data. The omission error (OE) (Equation 4) is defined as the fraction of values that belong to the built-up class but were not classified as such.
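As an illustration of the four metrics, the sketch below derives them from a binary confusion matrix using their standard definitions. It is not the authors' implementation: the function name `performance_metrics` and the flat 0/1 input lists are assumptions, and it presumes that both classes occur in the predicted layer and in the reference.

```python
def performance_metrics(predicted, reference):
    """Overall accuracy, kappa, commission and omission error for one class.

    predicted, reference: equally long sequences of 0/1 pixel values
    (1 = built-up). Both classes are assumed to be present in each.
    """
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1  # built-up in both layers
        elif p:
            fp += 1  # commission: detected but absent in the reference
        elif r:
            fn += 1  # omission: present in the reference but missed
        else:
            tn += 1  # non-built-up in both layers
    n = tp + fp + fn + tn
    oa = (tp + tn) / n
    # Chance agreement computed from the row and column totals.
    chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - chance) / (1 - chance)
    ce = fp / (tp + fp)  # share of detected built-up pixels that are wrong
    oe = fn / (tp + fn)  # share of reference built-up pixels that are missed
    return {"OA": oa, "kappa": kappa, "CE": ce, "OE": oe}
```

Because built-up pixels are usually a small minority, OA can remain high even when CE and OE are large, which is why the metrics are read together.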

The analysis of built-up density introduced by Ferri et al. (2014) and Florczyk et al. (2016) is a grid-based method with a predefined cell size, which allows assessing the relationship between the built-up density and the cumulative sum of built-up area derived per layer under study. This allows us to establish a stronger understanding of the mapping capabilities of each sensor in relation to built-up density and structural characteristics. In this validation experiment, the cell size is set to 500 × 500 m. This cell size is a good compromise between the areas of the cities under analysis, the necessity to capture commission and omission errors, and computational constraints. Several study sites are not big enough to use a 1 km or bigger cell size (see Table 4). Also, an experiment with smaller cell sizes, less than 50 m, is computationally demanding. For each cell, the sum of the built-up area for each layer is calculated. The maximum cell sum is 250,000 m², because each cell measures 500 × 500 m and the common pixel size used for the resampled layers is 1 m. In order to derive the built-up density of reference cells, the total sum of pixels classified as built-up area in a cell is divided by the total area of the cell. The formula is as follows:
$$\mathrm{dens}_i = \frac{\sum_{k=1}^{N} bu_k}{w_i \, h_i}$$
where dens_i is the density for cell i, bu_k is the kth built-up pixel in the specific cell i, N is the maximum number of built-up pixels in one cell, and w_i and h_i are the width and height of the cell, respectively.
The reference densities are ordered from the highest to the lowest. The cumulative built-up area sums are calculated for layers under validation and plotted against the reference built-up area.
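The procedure can be sketched as follows: sum the built-up pixels per cell, derive the reference density, sort the cells from the highest to the lowest reference density, and accumulate the built-up area of each layer in that order. This is a simplified illustration, assuming 1 m rasters stored as lists of 0/1 rows whose dimensions are multiples of the cell size; the function names are hypothetical.

```python
def cell_sums(raster_1m, cell_size):
    """Built-up area (m^2) per square cell, listed row by row."""
    n_rows = len(raster_1m) // cell_size
    n_cols = len(raster_1m[0]) // cell_size
    return [
        sum(
            raster_1m[r * cell_size + i][c * cell_size + j]
            for i in range(cell_size)
            for j in range(cell_size)
        )
        for r in range(n_rows)
        for c in range(n_cols)
    ]


def cumulative_curves(reference_1m, layer_1m, cell_size=500):
    """Return (sorted reference densities, cumulative reference, cumulative layer)."""
    ref = cell_sums(reference_1m, cell_size)
    lay = cell_sums(layer_1m, cell_size)
    cell_area = cell_size * cell_size
    density = [s / cell_area for s in ref]
    # Visit cells from the densest to the sparsest reference cell.
    order = sorted(range(len(ref)), key=lambda i: density[i], reverse=True)
    curve_ref, curve_layer = [], []
    total_ref = total_layer = 0
    for i in order:
        total_ref += ref[i]
        total_layer += lay[i]
        curve_ref.append(total_ref)
        curve_layer.append(total_layer)
    return [density[i] for i in order], curve_ref, curve_layer
```

A layer curve that rises above the reference curve toward the low-density end indicates built-up area detected in cells where the reference contains little or none.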

| Regression analysis of built-up area
The regression analysis uses inputs similar to those of the cumulative built-up density analysis: it explores the correlation between the sums of pixels classified as built-up area per cell derived from the different layers, and the reference sums of built-up area per cell. We analyze the scatterplot matrices displaying pairwise correlations. To support the analysis of scatterplots, we also compute the Pearson coefficient of correlation r, and the slope and intercept from first-order linear regression:
$$y = a\,x + b$$
where a is the slope coefficient and b is the intercept, y corresponds to the specific layer under analysis (SPOT5/SPOT6, S1, S2, or Landsat), and x to the reference layer derived from building footprints.
Through this analysis, we explore the possibility of estimating the "actual" built-up area given an input layer derived from one of the satellite sensors assessed in this work. The outputs of the correlation analysis can also give an indication of the degree of interoperability between the different built-up layers, which may facilitate the task of sharing data between different agencies and organizations working on human settlements.
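A minimal sketch of this step, assuming ordinary least squares on the per-cell sums, is given below. Here `x` holds the reference sums and `y` the sums of the layer under analysis; the function name `linear_fit` is hypothetical, and the inputs are assumed to contain at least two distinct values each.

```python
import math


def linear_fit(x, y):
    """Least-squares slope a, intercept b and Pearson r for y = a*x + b.

    x: reference built-up sums per cell; y: sums of the layer under analysis.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx                   # slope
    b = mean_y - a * mean_x         # intercept
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, b, r
```

Under this reading, a slope above 1 or a positive intercept points to a layer that reports more built-up area per cell than the reference, while r describes how consistently the two vary together.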

| Global inter-comparison
In addition to the selected cities validation, a global inter-comparison experiment was performed covering the following datasets: S1 and Landsat from the GHSL framework, GLC-30, and MODIS (Table 3). The SPOT5/SPOT6 product is available only in Europe, so it was excluded from this analysis as well as the S2 product, which is still in the production phase. All the products were warped to a common pixel size of 30 m. The projection used for the global comparison was Google Mercator.
The built-up area in pixels (30 m) was derived and compared per continent. A cross-correlation of built-up densities was performed by calculating the sums of built-up pixels in a grid with cell size 150 × 150 km. The total number of grid cells was 23,000.

| STUDY SITES AND REFERENCE DATA
The proposed validation framework was applied to 13 study sites located in Europe, North America, Australia, Africa, and Asia. The study sites were selected based on availability of reliable reference data (with preference for cadastre sources) and the need to cover different types of built-up structures, settlement densities, and landscapes.
FIGURE 5 Selected cities for the validation experiment
Reference data consist of fine-scale building footprints obtained from different sources: national mapping agencies, national geoportals, and OpenStreetMap (OSM) official data. Building footprints downloaded from the OSM were first visually inspected with the help of very-high-resolution (VHR) images to ensure completeness of the geographic coverage. A very conservative approach was adopted in which the areas with missing building structures (as visually assessed using VHR) were excluded. This approach ensured that gaps in the OSM reference data are avoided. The reference building footprints were first rasterized to 1 m pixel size to preserve maximum possible detail. All other layers were resampled to the same resolution as the reference data using nearest-neighbor interpolation in order to match the pixel size of reference building footprints of 1 m. A common Universal Transverse Mercator projection was used for all layers. All the layers under assessment were up-sampled in order to not favor a particular layer. The 13 selected cities for which the fine-scale reference data were available are shown in Figure 5. Table 4 gathers, for each city, the source, timestamp (year), and total area of the reference dataset. The second column refers to the extent of the validated area (approximately) and not to the specific city boundary area. This extent was determined according to the reference footprints coverage grid. The total validated area is approximately 12,588 km².
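The effect of nearest-neighbor up-sampling to the 1 m grid can be sketched as below. This simplified version assumes that the coarse grid is already aligned with the 1 m grid and that the ratio between pixel sizes is an integer; actual resampling also involves reprojection, which is not shown. The function name `upsample_nearest` is hypothetical.

```python
def upsample_nearest(raster, factor):
    """Replicate each pixel into a factor x factor block (nearest neighbor).

    raster: list of rows of pixel values; factor: integer ratio between the
    coarse pixel size and the target pixel size (e.g. 30 for 30 m to 1 m).
    """
    out = []
    for row in raster:
        # Repeat every value along the row, then repeat the row itself.
        expanded = [value for value in row for _ in range(factor)]
        out.extend(list(expanded) for _ in range(factor))
    return out
```

Replicating values in this way leaves the class labels and the total built-up area of each layer unchanged, so no layer gains or loses detail through the resampling itself.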
Reference built-up densities for European cities are plotted in Figure 6, with the median value also shown in the corresponding color. The curves are similar in shape for the cities of Milan, Torino, and Novara, where the building footprints cover only dense city areas. For Montpellier, Warsaw, Amsterdam, and Oslo, the building footprints cover a much bigger area, which includes smaller settlements, villages, and other built-up areas. Here we can notice that the curves still increase significantly when the reference density is converging to 0%. The median value for the Warsaw study site is higher than for all the other sites where the reference data does not cover only dense urban zones, which means that there is a significant amount of built-up area outside the Warsaw urbanized zone.

| RESULTS AND DISCUSSION
In this section the proposed validation framework is applied to the selected 13 study areas.

| Results of performance metrics
Figure 7 displays the comparative performance metrics for each city. The analysis of the four different performance metrics shows that the SPOT5/SPOT6 layer resulted in the highest values of overall accuracy and kappa and the lowest value of CE. The performances of S1 and S2 in European cities are very similar, with the average kappa value being slightly higher for S1, while the OE for S2 is significantly higher than the OE of the other layers. This may be explained by the fact that the S2 workflow includes a textural analysis (pantex) aimed at refining the derived built-up layers by removing roads and other non-built-up areas with high confidence values. The textural refinement seems to exclude a lot more than non-built-up areas by also excluding large buildings (e.g. industrial) and some groups of buildings in very dense cities (Figure 8). Landsat has the highest value of CE. This is because Landsat tends to classify some road networks, runways, bare soil in arable land and river beds, and so on as built-up areas (Figure 8). In general, all four layers show high average commission errors (i.e. greater than 0.5). In terms of OE, Landsat shows the best results with very low values, even lower than SPOT5/SPOT6. S1 also shows good overall performance between those of S2 and Landsat.

| Results of built-up density analysis
The results of the built-up density analysis are presented in Figure 9, in order of reference densities. The analysis is made by visual comparison of the curves derived from SPOT5/SPOT6, S1, S2, and Landsat with the "reference curve." Since these are cumulative sums of built-up area, the maximum value is the total built-up area reported by each layer. The curves calculated from SPOT5/SPOT6 are clearly the closest to the reference curves in terms of both shape and distance. It is notable that for Milan and Torino, the cumulative built-up curves for SPOT5/SPOT6 almost coincide with the reference curve. Hence, the final estimated area is close to that of the reference data. However, SPOT5/SPOT6 still tends to overestimate the built-up area in general, except in the case of Torino, where it slightly underestimates the total built-up area. The almost perfect match between the SPOT5/SPOT6 curve and the reference curve, especially in the case of Torino and Milan, can be explained by the fact that the SPOT5/SPOT6 workflow uses the same reference building footprints as an auxiliary dataset for filling the gaps in the final SPOT5/SPOT6 product. This certainly introduces a bias in the presented validation.
Landsat overestimates the built-up area significantly, and the overestimation sets in very quickly. S2 shows good results, with a curve very close and parallel to the SPOT5/SPOT6 curve (e.g. in the case of Montpellier, Oslo, Novara, and Milan). S1 performed better than Landsat, but also has a general tendency to overestimate the built-up area in sparse urban zones. For the cities of Warsaw, Oslo, Amsterdam, and Montpellier, the curves start to diverge when the reference built-up density is around 20%. At that point, the cumulative values for all layers start to differ greatly, and that is when we begin to observe the overestimation. The reference data in those cities cover not only the dense city centers but also sparse built-up areas and some rural zones. In those cases, we can observe the tendency of the different layers to overdetect the built-up area in zones characterized by sparse and scattered settlement patterns. In the Italian cities (Novara, Torino, and Milan) the reference data cover essentially the urban core areas, where the density of the built-up area is very high. In that situation, all the layers show almost the same behavior and start overestimating at a reference built-up density of around 50%.
Overestimation of built-up areas for all three layers (S1, S2, and Landsat) is expected due to the nature of the built-up area observed from the satellite sensors, and due to the inherent semantic definition of settlement areas that does not comply with individual building outlines.
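The cumulative built-up curve analysis can be sketched as follows. This is a minimal illustration, assuming that grid cells are sorted by decreasing reference built-up density and that the built-up area reported by a layer is then accumulated cell by cell; the function name and the toy values are hypothetical.

```python
def cumulative_builtup_curve(ref_density, layer_area):
    """Return (x, y): reference density per cell, sorted from dense to sparse,
    and the running total of built-up area reported by a layer."""
    if len(ref_density) != len(layer_area):
        raise ValueError("inputs must have the same length")
    # Assumption: cells are ordered by decreasing reference density.
    order = sorted(range(len(ref_density)),
                   key=lambda i: ref_density[i], reverse=True)
    xs, ys, total = [], [], 0.0
    for i in order:
        total += layer_area[i]
        xs.append(ref_density[i])
        ys.append(total)
    return xs, ys


# Toy example: four cells, areas expressed as fractions of a cell.
ref = [0.1, 0.8, 0.0, 0.5]      # reference built-up density per cell
layer = [0.3, 0.9, 0.2, 0.7]    # built-up area reported by a layer
x_ref, y_ref = cumulative_builtup_curve(ref, ref)      # the reference curve
x_lay, y_lay = cumulative_builtup_curve(ref, layer)    # the layer curve
```

In this toy case the layer curve ends at about 2.1 against about 1.4 for the reference curve, and most of the gap opens up in the low-density cells at the right-hand end of the curve.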
It should be noted that the total built-up area estimated by S2 for Aizuwakamatsu (Japan) is higher than that estimated from Landsat (see Figure 11 later). This site is characterized by many rural areas surrounding the city. Unlike at the other sites, the OE of S2 (0.27) was lower than that of Landsat (0.33) and S1 (0.36) (Figure 6), due to the predominance of small buildings. Besides this, the Landsat built-up layer rarely (if ever) detects small scattered settlements such as those detected by S2, and usually underestimates them. A visual example is provided in Figure 10.
FIGURE 8 Close view of built-up area as depicted by SPOT5/SPOT6, S1, S2, and Landsat in Amsterdam
[...] and S2, especially for the cities of Montpellier, Novara, and Oslo. S1 and Landsat also showed high similarity in the total built-up area estimated for several cities: Glenorchy, Novara, Dar es Salaam, and Warsaw.

| Results of built-up regression analysis
The results confirm the cumulative built-up curve analysis, implying a more accurate representation of built-up structural variability with SPOT5/SPOT6 than with the other layers. This is evident from the high r values, which range between 0.91 and 0.96. The linear relationship between the SPOT5/SPOT6 and reference datasets is also very prominent, especially in the case of Milan and Torino.
From the scatterplot (Figure 12), it can also be seen that both Landsat and S1 overdetect built-up areas, with Landsat showing a second-order polynomial relation with the reference data: the values in the Landsat vs. reference scatterplot (first column, fifth row) are mostly concentrated in the left and upper-left corner of the plot. That is, for low values of reference built-up density, Landsat gives much higher values than the other layers. A significant number of values are saturated at the top of the plot. The relationship is similar in the Landsat vs. SPOT5/SPOT6, S1, and S2 plots (fifth row, columns two, three, and four), where values are concentrated in the upper parts of the scatterplot for most of the cities. This is a consequence of the fact that many Landsat cells have a built-up density of 100%, while the reference data report lower values for those cells.
Overall, when looking at the results obtained for all study sites, we notice that the r values are the lowest in the [...]
Both S1 and S2 are strongly correlated with the reference layer and with SPOT5/SPOT6. Like Landsat, they tend to overestimate the built-up areas, especially S1. It is important to note that the shape of the relation between S2 and the reference data is different from that observed for S1 and Landsat: there is less overdetection [...]
It is interesting to note that the regression slopes from S2 for four European cities are also in the range [1, 2], except for Warsaw. In the case of Warsaw, the built-up area is more concentrated in low-density areas, which is not common for the other sites (Figure 6). The overestimation tends to be high for all products when the built-up area is more concentrated in areas with low built-up density. The slope value from S2 when all cities are considered together is 2.1.
Using a linear regression model, it is also possible to quantify the overestimation of the built-up area for S1, S2, and Landsat. S2 overestimates the built-up area by a factor of 2.1. In the case of S1 and Landsat, the overestimations show a much larger variability per city, and the slope values from the regression model when using all observations together are 2.8 and 3.1, respectively. Figure 16 shows an example of the level of precision of the built-up area as described in the SPOT5/SPOT6 layer. The similarity and consistency between SPOT5/SPOT6 and the building footprints are striking. In addition, the SPOT5/SPOT6 layer detects some additional buildings that are missing in the reference data.
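The use of the regression slope as an overestimation factor can be illustrated with a short sketch. It assumes an ordinary least-squares fit with an intercept, with the per-cell reference density as the explanatory variable and the per-cell layer density as the response; whether the original analysis fitted an intercept is not stated here, so this is an assumption, and the function name and sample values are hypothetical.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson's r."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Toy example: a layer that reports 2.1 times the reference density in every cell.
reference = [0.0, 0.1, 0.2, 0.3, 0.4]
layer = [2.1 * v for v in reference]
slope, intercept, r = linear_fit(reference, layer)
```

A slope above 1 indicates that the layer reports more built-up area than the reference for the same cells, while r measures how closely the relationship is linear; the two can differ, since a layer can be strongly correlated with the reference and still overestimate it.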
The empirical results confirm most of our hypotheses. The SPOT5/SPOT6 product outperformed the other products because of its fine spatial resolution. The Landsat product was the least accurate since its spatial resolution is the lowest. Also, S1 and S2 outperformed the Landsat product due to their higher spatial resolution. We reject the hypothesis that S1 underdetects the built-up area in dense urban zones due to shadow effects. The textural refinement came at the cost of omission: the pantex feature used in the S2 workflow contributed to the increased omission error.

| Global inter-comparison
The total built-up area for each product and for each continent is shown in Figure 17. [...] sensor (radar or optical), training sets, sensor resolution, and the methodology used for the production of layers.
The correlation matrix is shown in Figure 18. The S1 and GLC-30 products showed very good correlation, with an R value of 0.96, followed by MODIS and Landsat with 0.92. S1 used GLC-30 and Landsat as training data, and therefore the high R values were expected. A similar interdependency is found in the case of MODIS and Landsat, where MODIS was used as a training set for the Landsat product. MODIS and GLC-30 showed the lowest value of R.
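A correlation matrix of this kind can be sketched as pairwise Pearson coefficients between products. The unit of observation used for the global comparison is not specified in this section, so the sketch simply assumes that each product supplies one built-up value per common spatial unit; the product keys and numbers below are placeholders, not results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(products):
    """products: dict mapping a product name to its values per spatial unit.
    Returns a nested dict with matrix[a][b] = Pearson r between a and b."""
    names = list(products)
    return {a: {b: pearson_r(products[a], products[b]) for b in names}
            for a in names}


# Placeholder values for three hypothetical products over four spatial units.
demo = {
    "product_a": [1.0, 2.0, 3.0, 4.0],
    "product_b": [2.0, 4.0, 6.0, 8.0],
    "product_c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(demo)
```

High off-diagonal values in such a matrix do not by themselves indicate accuracy: as noted above, products that share training data are expected to correlate with one another.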

| CONCLUSIONS
An inter-sensor comparison of built-up areas derived from different sensors was presented for 13 selected cities and four built-up layers obtained in the framework of the GHSL project. The quantitative validation was performed using detailed building footprints available from OSM or from national mapping agencies and local authorities. We first determined absolute accuracies and sensor performances based on a pixel-by-pixel accuracy matrix, considering the entire population of reference pixels. Then, we explored the dependencies between the built-up density and the physical settlement structures using a grid-based analysis. This was accomplished using two approaches: (i) analysis of the relationship between built-up densities derived from the reference data and the cumulative built-up area from the different layers; and (ii) correlation analysis between the sums of pixels classified as built-up area per cell derived from the different layers and the observed reference layer. Finally, a visual comparison of built-up classification results was performed, illustrating areas of agreement/disagreement between the layers.
An additional experiment focusing on the global built-up area products comparison was also performed.
This analysis allows us to derive the following main observations:
1. ESM (SPOT5/SPOT6) shows the most accurate results relative to the reference data.
2. Both S1 and S2 showed an improved extraction of the built-up layer compared to that derived from Landsat.
3. All the layers overestimate the built-up area due to their inherent semantic definition of settlement areas that does not comply with individual building outlines. In particular, Landsat, with the lowest spatial resolution, was shown to be the product with the highest overestimation of built-up areas.
4. The results varied significantly across the different types of cities, suggesting the need to group the analysis by type of landscape and settlement pattern.
5. The total built-up area per continent varies significantly between the global products.
Based on all the validation experiments, SPOT5/SPOT6 gave the most accurate representation of the built-up area, independent of the settlement patterns and their densities. Given that reference building footprints are rarely available for large-scale validation, the outputs of this study suggest that the SPOT5/SPOT6 layer represents a good alternative for validation at the European level. Moreover, previous validation work with SPOT5/SPOT6 showed that this layer, generated by automatic image information extraction, achieves 96% agreement with the LUCAS dataset. The omission and commission errors are less than 4% and 1%, respectively. This also confirms its suitability as a reference dataset for the validation of future built-up products within Europe.
Compared to Landsat, S2 and S1 showed significant improvements associated with the exclusion of agricultural fields, parking lots, sand beaches, and roads from the detected built-up area. S2 was good at detecting small, scattered settlements but failed to detect large buildings in dense urban zones. Conversely, S1 did not succeed in identifying scattered settlements, but correctly classified large industrial buildings. These results outline the complementarity between S1 and S2 sensors for increased accuracy in the detection of built-up areas. Deeper insights into the characterization of human settlements can certainly be gained from the integration of S1 and S2 results.
The results of the global built-up area comparison showed significant variation. GLC-30 detected the most built-up pixels in the global comparison for almost all continents, while MODIS detected the least built-up pixels (excluding South America). Interdependencies exist between the global products because of the training sets used for production. S1 was correlated with Landsat and GLC-30, while Landsat was correlated with MODIS.
The large variability in the validation results between the study sites suggests a need to expand the sample to cover cities in different landscapes and with different settlement patterns. Results based on such a widely distributed sample could be used to develop robust cross-sensor built-up models. A validation campaign using image interpretation and crowdsourcing is currently ongoing, with the purpose of validating the GHSL products and other available global products describing human settlements (https://www.geo-wiki.org/branches/urban/). Also ongoing is the progressive development of a global database of reference building footprints, which currently covers 300 cities spread across the world and is used for accuracy assessment of the global built-up maps derived in the context of the GHSL as well as those from other data providers.
Future work will focus on a systematic and exhaustive consistency check of the different products and extensive validation in view of the development of sensor-specific models for deriving "actual" built-up areas from the different available built-up products.
This article showed that different validation experiments, combined with visual examples, can contribute to a better understanding of the similarities and disparities between the different built-up information layers. It also sheds light on the potential for exploiting the synergies between the sensors for consistent mapping of human settlements at regional and global scales.

ACKNOWLEDGMENTS
The authors wish to thank the GHSL team who contributed in several ways to this work mainly: Panagiotis Politis,