An effective approach for mapping permafrost in a large area using subregion maps and satellite data

Permafrost distribution maps are of importance for environmental assessment, climate system modeling, and practical engineering applications. The scarcity of forcing data and parameters often limits the uses of permafrost models over large areas. However, detailed data are often available in a few subregions through field investigations. In this study, we propose a novel approach for mapping permafrost distribution in a large and data‐scarce area using an empirical model with subregion permafrost maps and satellite data as inputs. The surface frost number model (FROSTNUM) was re‐inferred to include an extra soil parameter to represent the thermal and moisture conditions in soils. The optimal soil parameters were determined from the subregion maps of permafrost distribution through spatial clustering, parameter optimization, and the decision tree method. FROSTNUM was fed with satellite‐derived ground surface freezing and thawing indices to map the permafrost distribution over the study area. The proposed approach was evaluated in the Gaize area on the Qinghai–Tibet Plateau, where intensive field studies have been done. The simulated permafrost distribution is consistent with a map of permafrost distribution made from borehole observations and field surveys in Gaize. Due to excellent accuracy, the approach is effective and can be used in large areas with limited data.

However, detailed data are often available in a few subregions through field investigations. In this study, we propose a novel approach for mapping permafrost distribution in a large and data-scarce area using an empirical model with subregion permafrost maps and satellite data as inputs. The surface frost number model (FROSTNUM) was re-inferred to include an extra soil parameter to represent the thermal and moisture conditions in soils. The optimal soil parameters were determined from the subregion maps of permafrost distribution through spatial clustering, parameter optimization, and the decision tree method. FROSTNUM was fed with satellite-derived ground surface freezing and thawing indices to map the permafrost distribution over the study area. The proposed approach was evaluated in the Gaize area on the Qinghai-Tibet Plateau, where intensive field studies have been done.
The simulated permafrost distribution is consistent with a map of permafrost distribution made from borehole observations and field surveys in Gaize. Due to excellent accuracy, the approach is effective and can be used in large areas with limited data.

K E Y W O R D S
permafrost distribution, permafrost mapping, Qinghai-Tibet plateau, remote sensing, surface frost number model

| INTRODUCTION
The permafrost distribution map is of great utility for cold region engineering applications or as a basis for assessing the ecological environment and developing proactive adaptions to climate change.
Improving the accuracy of permafrost mapping has significant scientific and practical significance. [1][2][3] Traditional permafrost mapping approaches involve costly field investigations that are generally timeconsuming, labor-intensive, and potentially unreliable if for a large area with complex conditions. 4 In recent decades, permafrost models at a variety of complexity have been extensively used for mapping permafrost distribution. [4][5][6][7] Permafrost models can be categorized into empirical/semiphysical models and physical models. Given that the distribution of permafrost is related to many external factors such as locations, elevation, and air temperature, many empirical models have been Jianan Hu and Shuping Zhao contributed equally. developed through establishing relationships between the distribution and those factors, such as the elevation model, 8 mean annual ground temperature (MAGT) approach, 9 the temperature at the top of permafrost (TTOP) model, 10 and the frost number model. 11 These models are still in extensive use on data-scarce regions such as the Qinghai-Tibet Plateau (QTP) due to model simplicity and minimal data requirements. [12][13][14][15] However, empirical models assume steady-state conditions and oversimplify the account of soil heterogeneity. They tend to be incapable of reflecting inhomogeneity in local factors that modulate the hydrological and thermal processes within frozen soils.
As a result, empirical models that do not consider the impacts of local factors such as topography, soil texture, and soil moisture condition may lead to large discrepancies in mapping permafrost distribution. 16 Zhao et al. have shown that the performance of empirical models can be improved by explicitly considering the effects of local factors. 5 On the other hand, physical models are theoretically superior to empirical models with regard to their ability to quantify hydrothermal processes occurring in frozen ground. The impacts of local factors on permafrost formation and distribution can be represented by spatially distributed parameters in the model. Physical models have been applied to some site-scale studies on the QTP (such as applications of the Community Land Model, 17 Common Land Model, 18 Simultaneous Heat and Water Model, 19 Coup Model, 20 and the Noah land surface model 21 ) and small areas with rich investigation data. 22 They have been rarely used in a large area with limited data due to their rigid requirements of data and model parameters. In a specific application, the choice of a proper permafrost model depends greatly on the availability of the data and model parameters in the study area. In a data-scarce area, it is practical to consider the empirical/semi-physical model as an alternative for a permafrost mapping effort.
The inputs for a permafrost model include forcing data providing upper and lower boundaries to the model, and model parameters such as soil and vegetation conditions. In a large area such as the QTP, observation sites are minimal. There are about 80 meteorological stations and dozens of permafrost monitoring sites in a total area of 2.6 million square kilometers. Most of the sites are mainly along the Qinghai-Tibet Highway and few in the western part of the QTP. Therefore, it is impossible to provide adequate observation data and model parameters for mapping the entire region. Satellite-derived data and re-analysis data can compensate for the shortage of forcing data.
As remote sensing provides globally or regionally covered, uninterrupted observations at high spatial, spectral, and temporal resolutions, the derived and re-analysis data are presently widely used as the upper boundaries of permafrost models. [23][24][25][26] Model parameters such as soil properties, however, are more dependent on field measurements and are difficult to acquire with indirect approaches. This will introduce considerable uncertainty to the results if sensitive parameters that exhibit strong horizontal and vertical heterogeneity are not properly assigned. In some previous studies, due to the lack of reliable soil parameters, the effects of soil heterogeneity on permafrost have been neglected, leading to compromised reliability in the results.
Recently, several permafrost investigation projects have been initiated in China, the United States, Canada, Russia, and pan-Arctic countries. [27][28][29][30][31] Through these efforts, many data have been collected from field survey and laboratory analysis. Since 2009, a China Minister of Science and Technology sponsored 5-year project was carried out to investigate permafrost and its environment over the QTP. Five representative areas (West Kunlun, Gaize, Wenquan, Aerjin, and National Highway 308) were chosen to perform intensive surveys. 27 The second scientific expeditions on the QTP have already collected and produced more field data for the region with extensive underlying permafrost. 28 Nevertheless, for such a large area like the QTP, investigations remain concentrated in relatively small regions and longlasting efforts are needed to improve the data conditions.
Because the investigation data represent the local conditions with good accuracy, the local permafrost distribution maps based on survey data are generally of good quality. In this paper, we fully mine the underlying information in the subregion maps of permafrost distribution that are of reliable quality, in order to create a high-accuracy permafrost map for a large study area encompassing those subregions through a semi-physical model, which takes into account the effects of local factors, along with remotely sensed surface temperature data.
The approach is evaluated in a transition zone from permafrost to seasonally frozen ground, the Gaize area on the QTP, to be applicable and effective.

| The extended surface frost number model
The surface frost number model (FROSTNUM) 11 was developed to determine the occurrence of permafrost quantified by a frost number, which is calculated from the freezing and thawing indices of ground surface temperature (GST). We extended FROSTNUM to include an extra parameter related to soil hydrothermal properties. 32 The depth of thaw, ξ t , in the thawing season, and the depth of freeze, ξ f , in the freezing season, at any location of frozen ground can be expressed as: where α is the ratio of the depth of thaw over the depth of freeze; α > 1 represents a seasonally frozen ground case and α ≤ 1 a permafrost case.
Equation 1 can be reorganized into: Assuming no convective heat transfer has occurred and there is a uniform soil texture, the freeze and thaw depths can be analytically derived by Stefan's formula given an initial ground surface temperature of 0 C: where DDF and DDT are the freezing and thawing indices ( C.day), respectively, on the ground surface; λ is soil thermal conductivity (W m −1 K −1 ); and Q is heat released or absorbed by a unit volume of soil (kJ m −3 ) during freezing or thawing periods, Q = L · γ · (W − W u ).
L is the latent heat of fusion for ice and liquid (3.3 × 10 5 kJ m −3 ).
W and W u are total and unfrozen soil water contents on a gravimetric basis (%), respectively. γ is the dry density of soil (kg m −3 ). Subscripts f and t denote frozen and unfrozen soils, respectively.
Let F, defined as the frost number, denote 1/(α + 1); we obtain F by substituting Equations 4 and 5 into Equation 3: where E is a dimensionless parameter representing the changes in soil properties from the unfrozen state to the frozen state and is determined by soil thermal properties and moisture conditions on both states. For the seasonally frozen soil, the thaw depth would be larger than freeze depth, i.e., α > 1 and F < 0.5. For permafrost, F > 0.5. It is physically meaningful to take F = 0.5 as a threshold for identifying permafrost and seasonally frozen ground in Equation 6.
In an idealized case ignoring the occurrence of soil property changes during phase transition and the effect of convection on the formation of frozen soils, E = 1, and the original Nelson's frost number model is then obtained: However, under realistic conditions, frozen soils generally have higher thermal conductivity than unfrozen soils. Therefore, a strong thermal

| OVERALL SCHEME FOR THE FROSTNUM/COP APPROACH
The FROSTNUM/COP requires high-quality subregion maps of permafrost distribution within the study area to inversely compute the soil parameter E using spatial clustering and parametric optimization techniques. The soil parameters are then determined for the entire study area using a classification method (i.e., the decision tree). As model inputs, ground surface freezing and thawing indices are estimated from remotely sensed land surface temperature (LST) data. The overall scheme of this approach is described in Figure 1. It includes the following steps: 1 Preparing freezing and thawing indices. As permafrost is a product of long-term climatic conditions, we acquired multiyear LST data 3 Determining soil cluster distribution over the entire study area.
Using the distribution of soil clusters in the subregions as a training set, a decision tree method (i.e., C50) is adopted to determine the distribution of soil clusters on unknown locations. Variables including topography, soil texture, and soil moisture condition serve as the predictors.

| Estimating daily GST from LST
Meteorological stations record 0 cm soil temperature but they are usually sparse and unevenly distributed particularly in a large area.
Therefore, a regional-scale distribution of GST is generally difficult to obtain by ground observations. Remotely sensed data such as Moderate Resolution Imaging Spectroradiometer (MODIS) land data products offer a possible way to estimate regional GSTs with global coverage at high spatial and temporal resolutions.
GST exhibits high spatial heterogeneity and is nonlinearly related to the LST. 33 A Wavelet-ANFIS approach proposed by Huang et al. 34 is recommended to estimate daily GST from two satellite LST observations. Inputs include daytime and night-time MODIS LSTs, which will be decomposed into one low pass and three high passes each by a wavelet transform, and the time of sunrise at the given day and location to be processed. The wavelet function is set to reverse biorthogonal 3.1. As many as possible in situ daily GST records in the study area are collected for training ANFIS to maximize prediction performance. In the case that the study area has too few GST observations, a larger area encompassing the study area can be used to ensure sufficient GST observations. The desired GSTs for the study area can be extracted later.

| Aggregating multiyear daily GSTs and calculating freezing and thawing indices
Ideally, for an m year span, there should be m GST values in a day of the year at every location. However, MODIS LST data are subject to heavy cloud contamination in some regions, especially on the QTP. 35  A cosine interpolation, 36 which can accurately reproduce the annual oscillation of temperature, is used to fill in the missing GSTs if the number of valid days in a location is more than 120. The remainder, which probably accounts for a small portion, is spatially interpolated using Ordinary Kriging, which can satisfactorily provide yeararound daily mean GSTs for the entire study area.
Based on the daily mean GSTs across multiple years, the annual freezing indices for each cell are calculated, per definition, as the sum of degrees that a day's GST is below 0 C. Likewise, the thawing indices are counted as the sum of degrees that a day's GST is above 0 C.

| Spatial clustering
The soil parameter (E) in the extended FROSTNUM is related to soil hydrothermal conditions. Soils that have similar hydrothermal conditions are likely to share the same parametric value. As environmental factors play critical roles in controlling the soil distribution and associated hydrothermal properties, 37

| Parametric optimization
As there is no direct way to measure the soil parameter E, a parametric estimation technique can be used to obtain optimal parameter values for soil clusters, provided that frozen soil distribution maps are available in some subregions. Those subregion maps are usually of relatively high quality as they were created with adequate field data and adjusted with expertise. PSO is prescribed because this populationbased metaheuristic method can converge to an optimal value quickly. 39 The Kappa coefficient is used as the objective function to measure the consistency between the simulation results and the subregion permafrost distribution maps. The optimal values of E associated with soil clusters can be found after intensive computations and a junction table is subsequently created to look up the type of soil cluster in exchange for the E value.

| Experimental study in the Gaize area, Tibet
We selected Gaize (Figure 2) as an experimental study area for evaluating the FROSTNUM/COP approach. The area is representative of a thermally unstable permafrost area, where local factors strongly influence the distribution of permafrost. A high-quality regional permafrost map is available for Gaize. Three subregions (R1, R2, and R3 in (SRTM/DEM) and were resampled to 1 km. Hillslope aspects were reclassified into three types: north, south, and east-west. The soil texture data were a subset from a 1-km soil properties dataset covering the mainland of China. 43 The soil moisture condition factors include topographic wetness index (TWI), which was also derived from SRTM/DEM, and mean annual total precipitation, which was calculated from the daily and 8-km CMORPH bias-corrected data in the same period. In our testing, an 8-year span of precipitation data are sufficient to give an average pattern of precipitation in the area.
When performing parameter optimization, the extended FROSTNUM model was used as the model operator. The range of E was set to 0.5-1.5. Iteration times and population size were set to 1,000 and 20, respectively. The number of clusters was determined as five after multiple trials. Both Cohen's Kappa coefficient (κ) 44 and overall accuracy (OA) 45 were used to evaluate the performance of this approach. The former is generally thought to be a more robust measure than OA in accounting for chance agreement, although sometimes it also suffers from the penalty associated with chance agreement. 46

| Freezing and thawing indices
The The average percentage of invalid GST pixels decreased from 31% to 5%, as shown by the red lines in Figure 3. The remaining 5% missing was then completed by the Ordinary Kriging method.
The quality of the daily GST estimates was verified against the observed GSTs as shown in Figure 4.

| Soil clusters across the subregions
A total of five clusters were formed by the k-prototype approach for the soils across the subregions. As shown in Figure 6, The primary characteristics of the soil clusters are summarized in Table 2. All the clusters have a dominant soil texture of sandy soil.

| Optimal soil parameter values for soil clusters
The values of the soil parameter E were optimized for the soil clusters, as listed in Table 3.

| Distribution of soil clusters over the study area
The distribution of soil clusters over the entire Gaize area was predicted, as shown in Figure 7. Soils in the western part of the study area consist primarily of clusters 4 and 1. Cluster 1 is also seen in the northeast part and usually occurs together with cluster 5. Cluster 3 is mainly located in high elevations in the central Gaize area. Cluster 2 is most likely to occur in the northeast and southern Gaize area.
The areal proportions of soil clusters in Gaize are presented in Table 1. Cluster 4 covers a maximal area of around 30.67%, followed by soil clusters 1 and 2, which share 22.81% and 21.95%, respectively. Cluster 3 occupies the least area (9.88%), whereas cluster 5 covers an extent of 14.69%. The constitution of soil clusters in the entire Gaize area resembles that in the three subregions. It signifies to some degree the representativeness of the selected subregions to the entire study area.

| Simulation and assessment in the subregions
The  OA: 0.66) has the lowest. As indicated in Figure 8, R2 and R3 possess more boreholes than R1 (only one borehole), so the survey-based maps on R2 and R3 are likely to be of better quality than the map of R1. Generally speaking, as the subregion maps serve as the target of optimization, the quality of subregion maps will inevitably impact the simulation accuracy. This indicates the importance of using reliable subregion maps to achieve good simulation performance. However, for the case in subregion R1 that shows relatively low accuracy against the survey map, we cannot simply conclude that the performance in R1 was compromised because the survey-based map in R1 may be subject to large uncertainty. In our simulation, all three subregion maps rather than just the R1 map were used for global optimization. In this way, the adverse effect from a target map with large uncertainty can be alleviated. The use of multiple subregion permafrost maps appears to be beneficial for the simulation accuracy.
By visually examining the simulated and survey-based maps in the three subregions, some noticeable disagreements are apparent.

| Simulation and assessment in the Gaize area
The permafrost area was modeled to be 20.86 × 10 3 km 2 in Gaize ( Figure 9a) and accounts for 51% of the total area (Table 4). Seasonally frozen ground occupies 20.32 × 10 3 km 2 in the simulation and constitutes 49% of the total area. The simulated permafrost area is slightly smaller than the survey-based map (Figure 9b) by~0.57 × 10 3 km 2 (Table 4). Both maps show very close similarity with respect to the spatial pattern, with a Kappa coefficient of 0.69 and an OA of 0.89.
The survey-based map of Gaize was independently made by an empirical approach and was not input to the simulation. Hence, the resulting high consistency between them firmly supports that the proposed approach is able to well reproduce both the broad and the local characteristics of permafrost distribution. The simulation shows that permafrost is mainly distributed in hilly and mountainous areas, while seasonally frozen ground is mostly distributed in lower plains. The continuity of permafrost decreases as elevation decreases.
Some discrepancies between the simulation and the survey-based map can be identified. In the southwest portion of the study area (marked by rectangles in Figure 9), there is less permafrost in the sim- Although it has been well evaluated for performance in this study, simulation accuracy will be affected by the quality of subregion maps and their representativeness in the study area, which calls for further investigations into the error sources and influencing factors. This new approach can be applied and is an effective means to map accurate permafrost distribution in a large data-scare area if reliable subregion permafrost maps exist.