Combined Use of Multiple Cloud‐Free Snow Cover Products in China and Its High‐Mountain Region: Implications From Snow Cover Identification to Snow Phenology Detection

Accurate snow phenology detection, including snow cover days (SCD), snow start date (SSD), and snow end date (SED), is increasingly important for understanding mountain hydrology such as snow heterogeneity and snowmelt seasonality. Multiple cloud‐free daily snow cover products have recently been developed in China, employing diverse retrieval algorithms and cloud‐gap‐filling methods, resulting in varying accuracy levels. However, comprehensive analysis of differences among products and their impact on snow phenology detection is lacking. This study systematically evaluates eight state‐of‐the‐art snow cover products in China, focusing on the challenging Tibetan Plateau (TP). We introduce a novel metric, the consistency‐weighted correlation coefficient (CWR), customized for SSD and SED detection, and propose product‐combining schemes like “ensemble voting” and “sensor preference” to enhance reliability. Our findings highlight the prime influence of retrieval algorithms under clear‐sky conditions on accuracy, surpassing the importance of cloud‐gap‐filling methods. Specifically, a product optimizing normalized difference snow index thresholds for diverse landcover types consistently outperforms others in detecting all three snow phenology parameters, with correlation coefficients for SCD of 0.82 and 0.69, and CWR values for SSD of 0.54 and 0.40, and for SED of 0.53 and 0.37 in both China and the TP, respectively. Moreover, our proposed scheme combining three high‐accuracy products significantly enhances snow cover identification and SCD detection, especially when the best‐performing product alone faces substantial uncertainty. These findings provide immediate, crucial implications for optimizing the use of multiple cloud‐free products to enhance snow phenology detection, ultimately advancing the applicability of derived snow parameters in mountain hydrology research.


Introduction
Snow serves as a vital water source for mountainous runoff and closely interacts with regional climate change (Paudel & Andersen, 2011;Riggs et al., 2017;H. Zhang et al., 2021).To a great extent, snow also meets the water demands for agriculture in global mountainous areas including the arid and semiarid regions of China (Biemans et al., 2019;Yunlong. Li et al., 2021).Recent studies have observed an overall decline in snow cover days (SCDs) and snow masses of China (Zhu et al., 2022), which may further impact the runoff in alpine basins (Thapa et al., 2021).Particularly, the Tibetan Plateau (TP) (mostly in China) is expected to experience a significant decrease in snowmelt due to future climate changes (Kraaijenbrink et al., 2021), potentially impacting water supplies for about 2 billion people in the downstream region (Yao et al., 2022).Therefore, it is of high importance to accurately monitor the dynamics of snow cover.
In remote alpine regions, ground-based observations are usually scarce, a problem that is most serious in the TP, known as the water tower of the world (Immerzeel et al., 2010).To overcome this problem, remote sensing technology has been widely used in monitoring spatiotemporal changes of snow cover globally (Dong, 2018).Compared with passive-microwave based snow products with coarse resolutions (mostly 25 km) which are mainly derived from the Special Sensor Microwave/Imager (SSM/I) (Grody & Basist, 1996) and the Advanced Microwave Scanning Radiometer-Earth Observing System (AMSR-E) (Kelly et al., 2003), the optical based snow products have a clear advantage of high resolutions (20 m-5 km).Such optical sensors include the • Performances of eight state-of-the-art cloud-free products and their combinations in detecting snow phenology parameters are evaluated • Impacts of retrieval algorithm, cloudgap-filling method, and sensor differences are considered for analyzing products' accuracy differences • Proposed scheme combining three high-accuracy products notably enhances snow cover days detection but including less accurate ones degrades Supporting Information: Supporting Information may be found in the online version of this article.
Copernicus Sentinel-2 mission (Gascoin et al., 2019), the Visible Infrared Imaging Radiometer Suite (VIIRS) (Riggs et al., 2017), the Moderate Resolution Imaging Spectroradiometer (MODIS) (Hall et al., 2002), Advanced Very High Resolution Radiometer (AVHRR) (Simpson et al., 1998) and others.Due to shorter revisit time and longer time series, MODIS and AVHRR are the most popularly used sensors for creating snow cover products.The popular snow cover retrieval algorithms include the threshold method using the normalized difference snow index (NDSI) (Salomonson & Appel, 2004), automatically get-endmember method of the MODIS subpixel snowcover algorithm (MODAGE) (Shi, 2012), the model of MODIS Snow-Covered Area and Grain size (MOD-SCAG) (Painter et al., 2009), the SCAmod method based on semi-empirical reflectance model (Metsämäki et al., 2015), and the machine learning method incorporating multispectral information and auxiliary variables (Hou et al., 2020).These products have been widely used to calculate regional SCD and snow phenology (e.g., snow start date (SSD) and snow end date), which have important applications in climate change assessment (Thackeray et al., 2019), calibrating hydrological models (F.Zhang et al., 2015) and the analysis of vegetation growth (Xiaoyue Wang et al., 2018).
Though remote sensing is an efficient approach for snow cover monitoring, a predominant problem limiting its application is cloud cover obscuring the surface (Gao et al., 2010).Continuous daily snow cover data are essential for accurately calculating snow phenology parameters (Ma et al., 2020;Xie et al., 2017).To overcome this problem, several cloud-gap-filling methods have been developed, including but not limited to the combination of MODIS TERRA and AQUA daily snow products (Yu et al., 2016), the Hidden Markov Random Field (HMRF) algorithm (Y.Huang et al., 2018) and the integration of passive-microwave snow depth (PSD) data (Che et al., 2008).Applying these methods has led to the creation of many cloud-free snow cover products in recent years, ranging from regional scales such as the TP (Yupeng Li et al., 2022;F. Pan, Jiang, et al., 2021) and China (X.Hao et al., 2021Hao et al., , 2022;;Hou et al., 2019) to the global scale (Dong & Menzel, 2016;Hall et al., 2019;Sönmez et al., 2014;Yatheendradas & Kumar, 2022).
For these cloud-free snow cover products, their reliability and applicability could vary significantly due to differences in snow cover retrieval algorithms (S.Hao, Jiang, et al., 2019;Masson et al., 2018), cloud-gap-filling methods (Coll & Li, 2018), and sensor sources (e.g., MODIS or AVHRR) (Wu et al., 2021).For instance, Zhang et al. (2021) demonstrated that compared with a fixed NDSI threshold (e.g., 0.4) popularly used before, the terrain-variant NDSI thresholds could significantly improve the accuracy of snow cover identification under clear-day conditions on the TP due to considerations of the scaling issue.Huang et al. (2018) reported an overall accuracy (OA) increase from 85% to 89% under all-weather conditions after applying the HMRF method compared to a simple temporal combination.Currently, few studies have conducted comprehensive comparative analyses of the up-to-date cloud-free products.The current state-of-the-art cloud-free products, benefiting from longer time series of data and advanced methods for snow cover identification and cloud-gap-filling (Yupeng Li et al., 2022;F. Pan, Jiang, et al., 2021;Tang et al., 2022), hold considerable application potential.However, their evaluation within China and its high-mountain sub-region remains limited due to the scarcity of available cloudfree products (≤4) in the past (X.Hao, Luo, et al., 2019;J. Yang et al., 2015;Yuan et al., 2022).
Moreover, understanding the factors contributing to accuracy differences among cloud-free snow products is essential to optimize their utilization and identify sources of error, which can ultimately help improve remote sensing-based snow cover monitoring.However, investigations into the causes of these accuracy variations have been insufficient to date primarily due to a lack of comparative analysis under different data conditions (e.g., cloudy-day and clear-day) and sensor sources.For example, S. Hao, Jiang, et al. (2019), X. Hao, Luo, et al. (2019) evaluated four cloud-free products on the TP and proposed a new product by combining MODIS TERRA, AQUA, and the "Interactive Multisensor Snow and Ice Mapping System" (IMS) data to enhance accuracy in nonforest areas.However, their evaluation only considered the original NDSI-based algorithm for clear-day conditions and applied a simple 3-day temporal combination for cloudy-day conditions, focusing on MODIS-based products.In recent years, various improved snow cover retrieval algorithms that used more spectral characteristics (Luo et al., 2022;Xiaoyan;Wang et al., 2020), advanced cloud-gap-filling methods that considered both the temporal and spatial information (Y.Huang et al., 2018;Tang et al., 2022), and more sensors, such as AVHRR (Naegeli et al., 2022;F. Pan & Jiang, 2022), have been employed to create cloud-free snow cover products.While some studies have suggested the efficiency of combining MODIS data with IMS for regional snow cover monitoring (Y.Li et al., 2019;Yu et al., 2016), the potential for enhancing accuracy through the ensemble use of multiple cloud-gap-filled products with distinct error sources remains largely unexplored.Therefore, there is an urgent need to assess the impact of major factors on accuracy differences among current cloud-gap-filled products to better identify their significant deficiencies.Moreover, it is essential to understand how these factors influence the effectiveness of combining different cloud-gap-filled products.
Another noteworthy issue is that prior studies have mostly evaluated the accuracy of snow cover identification (i.e., snow-covered, or snow-free) among different snow cover products, with limited attention given to their performance concerning snow phenology detection.Snow cover phenology parameters, such as SCD, SSD, and SED, are crucial for understanding the hydrological, ecological, and climatic dynamics in global mountainous areas (Ma et al., 2020).For instance, the timing and dynamics of snow phenology are found closely associated with the onset and intensity of springtime snowmelt-driven floods (C.G. Pan, Kirchner, et al., 2021).SSD and SED could significantly influence alpine vegetation growth by modulating soil thermal regimes and moisture availability (Xu et al., 2022;B. Zhang, Li, et al., 2022).Additionally, SCD has been recognized as an important factor of the well-known elevation-dependent warming over the TP primarily through snow-albedo feedback mechanisms (You et al., 2020;H. Zhang, Immerzeel, et al., 2022).Hence, snow phenology parameters derived from remote sensing are increasingly used (Guo et al., 2022;Xiaoyue;Wang et al., 2018;Zhao et al., 2022), highlighting the need for a comprehensive evaluation.Evaluating solely snow cover identification is thus inadequate.In addition, both evaluating snow cover identification at cloudy days and reliably calculating snow phenology parameters require long-term continuous daily snow observations, a challenging task on a global scale.China, with its hundreds of stations providing daily snow depth observations, is an ideal location for comprehensive evaluation of different cloud-free snow cover products in terms of both identifying snow cover and detecting snow phenology.The nearly 100 stations on the TP are especially valuable in exploring accuracy differences in complex terrains.
Therefore, this study comprehensively investigates major factors contributing to accuracy differences among cloud-free snow cover products, with the ultimate objective of enhancing the reliability of snow cover remote sensing.This study is unique in systematically selecting up to eight state-of-the-art daily cloud-free snow cover products and comparatively evaluating them using a novel assessment metric tailored for snow phenology detection.Furthermore, several approaches of combining multiple cloud-free products are proposed and tested to improve the reliability of snow phenology detection.Specifically, this study aims to: (a) comparatively evaluate the reliability of the eight products in terms of identifying snow cover and detecting snow phenology; (b) examine the potential factors impacting the accuracy differences and identify the major challenge to address in future; (c) assess the benefits of ensemble usage of multiple products to enhance the reliability of snow cover identification and snow phenology detection.

Study Area
China is selected as the study area and it is divided into six subregions including Xinjiang (XJ), Inner Mongolia and Loess Plateau (ILP), northeastern plain and mountain (NPM), eastern plain (EP), Yunnan and Guizhou Plateau (YGP), and the TP following the definitions from Zhang et al. (2019aZhang et al. ( , 2019b)).Among the six subregions, XJ and NPM have the largest multiyear average annual SCD of 57.55 and 75.43 days, respectively, followed by ILP (25.78 days), TP (20.25 days), EP (6.46 days) and YGP (4.62 days).Despite the TP having the highest average elevation of 3511 m, NPM, XJ and ILP have larger SCDs than the TP due to their higher latitudes (Figure 1).
Both SSD and SED show obvious regional differences (Figure S1 in Supporting Information S1).Generally, NPM is the earliest subregion to have continuous snow cover, with a regional mean multiyear average SSD value of 89.45 dsy (day of the snow year), followed by XJ (100.90 dsy), ILP (112.93 dsy), TP (122.19 dsy), EP (129.38),and YGP (134.53 dsy).NPM also marks the subregion with the latest disappearance of snow cover, with a regional mean multiyear average SED value of 178.96 dsy, followed by XJ (175.35 dsy), TP (160.97 dsy), YGP (156.35 dsy), ILP (156.04 dsy), and EP (142.61 dsy).Previous studies indicated that the snow cover in the northern parts of XJ and NPM is temporally continuous in a snow year (Ma et al., 2020).However, with decreasing latitude, the presence of intermittent snow cover becomes more prevalent (Y.Zhang & Ma, 2018).Despite TP's cold conditions attributed to its high altitude, its snow cover duration is short.This is due to its relatively small snow depth and fast snow sublimation, leading to predominantly intermittent snow cover across the TP (Ma et al., 2020).

Observational Data
We started with daily snow depth observations of 2002-2018 from 983 China Meteorological Administration (CMA) stations.The snow depth observation is regularly made manually at 08:00 a.m. using a ruler (CMA, 2003).Following previous studies, the records with snow depth ≥1 cm are considered as snow covered and those with snow depth <1 cm are considered as snow-free (Ke et al., 2016).Screening for extremely low snow cover removed 266 stations with SCD <20 days (H.Zhang, Zhang, et al., 2019).Finally, the remaining 717 stations were used for assessing snow cover identification accuracy.
As the accurate evaluation of snow phenology parameters relies on the availability of uninterrupted daily snow cover records, the remaining 717 stations underwent further filtration following the procedures outlined in Section 3.2.1 for SCD and Section 3.2.2 for SSD and SED.

Cloud-Free Daily Snow Cover Products
This study selected eight state-of-the-art cloud-free daily snow cover products (Table 1) based on the following criteria: (a) High spatial resolution (grid size ≤ 5 km); (b) High temporal resolution (daily data); (c) Robust temporal coverage (a minimum of an 18-year time series).According to data sources of product development, the eight snow cover products can be categorized as three groups: MODIS-based, AVHRR-based, and blended.The MODIS-based group contains four products titled "A new MODIS snow cover extent product over China (2000-2020)" (X.Hao et al., 2022) (referred to as Hao-M), "Daily cloud-free snow cover products for TP from 2002 to 2021" (Y.Huang et al., 2018) (referred to as Huang-M), "MODIS daily cloud-free factional snow cover data set for Asian water tower area (2000-2022)" (F.Pan, Jiang, et al., 2021) (referred to as Pan-M), and "Daily cloud-free MODIS NDSI and snow phenology data set over High Mountain Asia 2000-2021" (Tang et al., 2022) (referred to as Tang-M).These products were developed primarily based on MODIS data.The AVHRR-based group includes three products titled "Daily 5-km Gap-free AVHRR snow cover extent product over China 2000-2020" (X.Hao et al., 2021) (referred to as Hao-A), "High Mountain Asia snow cover product" (Yupeng Li et al., 2022) (referred to as Li-A), and "Snow extent time series derived from AVHRR global area coverage data  in the Hindu Kush Himalayas" (Naegeli et al., 2022;Wu et al., 2021) (referred to as Naegeli-A).The third group contains only IMS (U.S. National Ice Center, 2008), which combines a variety of data sources, including station observations and satellite data from MODIS, AVHRR, Geostationary Operational Environmental Satellite (GOES) Imager and others.Due to the differences in satellite sensors, the MODIS-based products have the highest resolution of 500 m, followed by IMS (4 km) and the AVHRR-based products (5 km).
In addition to the data source of satellite sensors and spatial resolution, the eight products differ significantly in their algorithms for snow cover identification under clear conditions and their methods for gap filling during cloudy days.For example, Huang-M and Hao-M, both created from MODIS data, employ different algorithms for identifying snow cover under clear conditions: Huang-M uses the standard MODIS daily snow cover data developed using a fixed NDSI threshold, while Hao-M uses land-cover-variant NDSI thresholds.During cloudy days, most of the MODIS-based products use a combination of MODIS TERRA and AQUA data to preliminarily fill missing data, and employ various techniques for interpolating most data gaps ranging from the cubic spline interpolation (Tang-M) to the Piecewise Cubic Hermite Interpolating Polynomial algorithm (Pan-M) and HMRF (Huang-M and Hao-M).The surface temperature screening (i.e., using land surface temperature to flag suspicious snow pixels) (Riggs et al., 2017), and the PSD data (Che et al., 2008) are also commonly used to fill remaining data gaps, such as Hao-A, Hao-M and Li-A.Detailed information on the eight snow cover products is summarized in Table 1.
It should also be noted that the products have different spatial extents.The three products with larger spatial coverage, including Hao-M, Hao-A, and IMS, are evaluated on both national and six sub-regional scales in China.The remaining products, including Huang-M, Li-A, Tang-M, Pan-M, and Naegeli-A, are solely evaluated in the TP subregion due to their limited extent.

Auxiliary Data
MODIS daily snow cover products of version 6.1 for TERRA (MOD10A1) (Hall & Riggs, 2021b) and AQUA (MYD10A1) (Hall & Riggs, 2021a) were downloaded for extracting the information about clear-day or cloudyday data conditions during the study period of 2002-2018.The MODIS Land Cover Type product (MCD12Q1) (Friedl & Sulla-Menashe, 2022) was employed to extract land cover types of the stations, using the International Geosphere Biosphere Programme (IGBP) global vegetation classification scheme.Due to the limited number of stations in high-vegetation types like forests, the "grasslands" type was reclassified as "low-vegetation", while the types of forests, shrublands, savannas, and croplands were reclassified as "high-vegetation".Additionally, elevation information for the stations was obtained from CMA.

Methods
In this section, the entire workflow of this study is illustrated in Figure 2. Based on the same station observations, the cloud-gap-filled products were evaluated and compared in terms of both snow cover identification and snow phenology detection.To find out the major causes of performance differences among the products, three potential factors were then analyzed based on the multi-comparison using the paired t-test.Four schemes were utilized, combining different cloud-gap-filled products to investigate the potential benefits of ensemble usage, and provide  guidelines for enhancing the reliability of snow cover and snow phenology for future studies.Subsequent subsections provide additional details on each step.

Assessment Methods of Snow Cover Identification
To evaluate the snow identification accuracy, the confusion matrix is used to compare the station observations with the data from different snow cover products, as shown in Table 2. "TR" (TRue positive) means that both the observation and the product data indicate snow-covered; "FN" (False Negative) means that the observation indicates snow-covered while the product data indicate snow-free; "FP" (False Positive) means that the observation indicates snow-free while the product data indicates snow-covered; "TN" (True Negative) means that both the observation and the product data indicate snow-free.Based on the confusion matrix, four evaluation metrics used in previous studies were calculated, including OA, probability of detection (POD), Precision (PC) and Cohen's Kappa (CK).Their definitions are listed in Table 2. OA quantifies the OA with which snow and non-snow pixels are classified.POD and PC quantify the omission errors (=1 POD) and commission errors (=1 PC) of snow, respectively.Given that OA can be biased by imbalanced sample compositions (e.g., a predominance of snow-free samples), and POD and PC only address one aspect of accuracy each, CK is deemed a more reliable evaluation metric (H.Zhang, Zhang, et al., 2019).Thus, CK is adopted as the main evaluation metric for comparing different products with the other three metrics provided as reference.

Assessment Methods of Snow Phenology Detection
Three widely used snow phenology parameters were selected, including SCD, SSD, and SED.Their definitions and assessment methods are described as follows.

Assessment of SCD
According to the study by Wang and Xie (2009), SCD was calculated following Equation 1: where, n is the total number of days in a snow year, C i is the snow cover status on day i.C i is 1 for snow and 0 for non-snow.Snow year is defined as the period from 1st September to 31st August of the next year (Ke et al., 2016).
For each station, the root-mean-square error (RMSE) and Pearson correlation coefficient (R) were used to assess the accuracy of SCD for each product.They were calculated as follows: where, M i and S i represent the remote sensing and observed SCD in the i-th snow year, respectively, and n is the total number of valid snow years at the station.Since RMSE is sensitive to the magnitude of snow variables (H.Zhang, Zhang, et al., 2022), R is selected as the main evaluation metric of SCD detection, with RMSE serving as a supplementary reference.
To mitigate the influence of missing data, an additional data filtration process was applied to the 717 stations.We considered only those days for each station where both station observations and remote sensing data were available.These instances were deemed valid for computing SCD.Any snow year with valid observation days comprising less than 50% of the total duration was excluded.Only stations with valid data for more than 8 years, corresponding to over 50% of the study period, were retained.After this process, 655 stations in China and 94 stations in the TP region remained for subsequent analysis.

Assessment of SSD and SED
We define a snow cover event within a snow year as a continuous sub-period during which the ground remains covered with snow for at least five consecutive days.A single snow year may encompass multiple snow cover events.SSD is defined as the start date of the first snow cover event in a snow year, and SED is defined as the end date of the final snow cover event within that year (Peng et al., 2013;Tang et al., 2022;Zhao et al., 2022).SSD and SED are frequently used for analyzing the impact of snow cover changes on vegetation growth globally (Möhl et al., 2022;Zhao et al., 2022).If a station or a grid cell from a remote sensing product does not experience a snow cover event during a snow year, the SSD and SED for that site or grid cell in that snow year are considered invalid (or null).Thus, SSD and SED may not always be available due to factors such as snow drought, low snow depths in certain areas, or detection inaccuracies of remote sensing products.Under these circumstances, accuracy measurements such as RMSE and mean errors are deemed inappropriate for assessing the snow phenology detection.For example, in a given year, if a station records valid SSD and SED dates but the corresponding remote sensing product reports invalid or null SSD and SED values, direct calculation of bias for that year becomes problematic.Thus, a new assessment metric, called "consistency-weighted correlation" (CWR), is proposed for evaluating the performance of different products in reproducing the temporal changes of SSD and SED.For each station, CWR is calculated as per the following equations: where, N ss represents the count of snow years in which both the station observation and remote sensing product exhibit valid SSD and SED values.N sn represents the count of snow years in which only the station observation has a valid phenology value while the remote sensing product does not.Similarly, N ns is the number of snow years in which only the remote sensing product possesses a valid phenology value while the station observation does not.R is the correlation coefficient between phenological values derived from station observations and remote sensing products, calculated exclusively for the subset of snow years in which both the station observation and remote sensing product possess valid phenological values.The major difference between CWR and the traditional R lies in the inclusion of consistency weight CW.This weight accounts for the impact of null values of SSDs and SEDs.For example, at a certain station, if observations indicate that there are consecutive valid SSDs during the snow years from 2002 to 2017, while the remote sensing product only detects valid SSDs in the first 10 snow years (i.e., 2002-2011), the traditional R calculation would only be based on the 10-year data.However, in such cases, the snow cover product's inability to detect SSDs in the remaining years (i.e., 2012-2017) is not taken into account, leading to overestimated performances.Thus, the consistency weight can be used to penalize such overestimations.
To mitigate the influence of null values of SSD or SED while ensuring a sufficient sample size, we implemented an additional data filtration process for the 717 stations as follows.In accordance with the definition for SSD and SED, we applied a 5-day moving time window.For any given snow year, if there exists a 5-day window during which the sum of both the number of snow-covered days and the number of days with missing data (not zero) equals 5, that particular year was deemed invalid and subsequently excluded from the calculation.Furthermore, if N ss for a station was less than 4, that station could not calculate R and was thus removed from the analysis.Following this filtration process, a total of 415 stations in China remained for the subsequent SSD and SED analysis.In the TP, there were ultimately 52 stations available for SSD and SED analysis.

Evaluating Main Factors of Performance Differences Between the Snow Cover Products
As stated in Section 2.3, there are typically three primary sources accounting for performance variation among the eight cloud-free products, including the primary sensors used, the snow cover retrieval algorithms employed in clear-day conditions, and the cloud-gap-filling methods used under cloudy-day conditions.The impact of these three underlying factors on snow cover identification is evaluated in detail below.
To evaluate the impact of snow cover retrieval algorithms, an accuracy comparison is made between different products that use the same primary sensor, exclusively under clear-day conditions, where no cloud-gap-filling was needed.A paired t-test is conducted at a 95% confidence level based on individual accuracy measurements (mainly CK) obtained from the same stations to identify significant differences between products.If statistically significant accuracy differences are found between the different products, it indicates that the snow cover retrieval algorithm significantly affect their performances.
Similarly, an accuracy comparison is made between different products that use the same primary sensor, but only under cloudy-day conditions, where spatial, temporal, or spatiotemporal interpolation methods were employed (as listed in Table 1).This allows to evaluate the impact of cloud-gap-filling methods.The definitions of "cloudy-

Water Resources Research
10.1029/2023WR036274 ZHANG ET AL.
day" and "clear-day" depend on the types of primary sensors.For the MODIS-based products, the information about cloudy or clear days was obtained from the original MODIS daily snow cover products including both TERRA (MOD10A1) and AQUA (MYD10A1).For the AVHRR-based products, the information about cloudy or clear days was extracted from the product of Li-A (Yupeng Li et al., 2022).
To evaluate the impact of sensor differences, a multi-comparison is made between the best MODIS-based product, the best AVHRR-based product, and the blended product (IMS) in terms of snow cover identification accuracy only on days when both MODIS and AVHRR data are cloud-free.Moreover, the analysis of the factors is only conducted in the subregion of TP due to the very limited number of available products for the whole of China.

Examining the Potential Benefits of Ensemble Usage of Multiple Cloud-Gap-Filled Products
In order to assess the potential of employing multiple cloud-gap-filled products to improve accuracy, two different strategies were proposed and evaluated in this study: the "voting ensemble" strategy and the "sensor preference" strategy.The former is a popular method used in certain machine learning models such as support vector machines and has not yet been employed for combining multiple snow cover products.Here, we implemented a simple version of the "voting ensemble" method by selecting the majority prediction from the most reliable products for each pixel.Depending on the number of products selected, we evaluated three different schemes: S1, S2, and S3, which respectively combined the three, five, and seven best products.For example, in scheme S1, we used the three best products with the highest CK.At each pixel, if two predicted snow cover and one predicted snow-free, the pixel was classified as snow-covered.Likewise, for scheme S2 (or S3), if three (or four) out of the five (or seven) best products indicated snow cover while the remaining two (or three) did not, the pixel was classified as snow-covered.
The "sensor preference" strategy takes into account the fact that data from the MODIS sensor typically outperforms that from AVHRR due to its higher spatial resolution.In this scheme, we prioritize predictions from MODIS if the original data are cloud-free (regardless of whether the original AVHRR data are cloud-free or not).
If the original MODIS data are cloud-covered while the original AVHRR data are cloud-free, we prioritize predictions from AVHRR.If both sensor data are cloud-covered, we prioritize predictions from the MODISbased cloud-gap-filled product.This scheme is referred to as S4.There have been few studies combining both MODIS and AVHRR data for developing cloud-gap-filled snow cover data (Yupeng Li et al., 2022), and it has remained unclear whether AVHRR data can improve predictions of snow cover when MODIS data are impaired by cloud cover.
Finally, the best-performing of the eight cloud-gap-filled products was used as a baseline reference, referred to as scheme S0.The performance of schemes S1, S2, S3 and S4 was evaluated using the methods described in Section 3.2 and compared with scheme S0 under the same conditions.

Accuracy Comparison for Snow Cover Identification
Among the three products that cover the entire China mainland, Hao-M achieved the highest average CK at 0.61, followed by IMS and Hao-A, with station-based average CKs of 0.55 and 0.51, respectively (Figure 3).These differences were statistically significant (P < 0.05) as determined by the paired t-test based on a total of 717 station-based CK.There were clear regional variations in CK scores for all three products, with the highest scores observed in XJ and NPM and the lowest scores in YGP and TP.However, the superiority of Hao-M remained consistent across almost all sub-regions, except for EP, where IMS demonstrated the highest CK.
On the TP, all the eight products were evaluated and compared (Figure 4).Hao-M showed the highest accuracy with a CK of 0.43, followed by Li-A, Tang-M, Hao-A, Naegeli-A, Huang-M, Pan-M, and IMS, with the CK scores of 0.35, 0.33, 0.31, 0.30, 0.29, 0.26, and 0.22, respectively.The results of multi-comparison indicate that Hao-M presents significantly better performance than the other seven products.The superiority of Hao-M in the TP was consistent with the findings from the entire China mainland.However, IMS showed significantly lower CK than Hao-A in the TP, which is inconsistent with the results of the other subregions of China (Figure 3a).Detailed results including OA, POD and PC are provided in Table 3.

Water Resources Research
10.1029/2023WR036274 ZHANG ET AL.

SCD Comparison
All three products that cover the full extent of China mainland exhibited acceptable accuracy in reproducing temporal changes of annual SCD, with their station-based average R values all >0.70 (Figure 5a).Hao-M obtained the highest R of 0.824, followed by IMS and Hao-A with the R values of 0.713 and 0.710, respectively.The multi-comparison result indicates that Hao-M was significantly better than both IMS and Hao-A, which is consistent with that of snow cover identification (Figure 3a).Regional variations in R values were observed for all three products, with higher R values observed in XJ, NPM, and ILP, while lower R values were found in EP, YGP, and TP.Additionally, Hao-M exhibited the highest R value in all the six sub-regions.The relative order of Hao-M, IMS and Hao-A is somewhat different in terms of RMSE in that IMS has the largest RMSE, attributed to its much larger RMSE in the TP (Figure 5a).

SSD and SED Comparison
The three products evaluated for their ability to detect SSD on the entire Chinese mainland all exhibited relatively low accuracy.The highest CWR achieved was only 0.54 by Hao-M, with IMS and Hao-A having CWR values of 0.42 and 0.40, respectively.Multi-comparison results showed that Hao-M was significantly better than IMS and Hao-A, but the latter two had statistically insignificant differences (Figure 6a).Regional evaluation results showed that higher CWR values were observed in XJ, NPM and ILP, while lower CWR values were found in EP, YGP and TP.Hao-M obtained the highest CWR value for SSD in all the six subregions.In terms of the comparison between the eight products covering the TP, Hao-M had the highest CWR at 0.40, followed by Tang-M, Hao-A, Huang-M, Li-A, Pan-M, Naegeli-A and IMS, with CWR values of 0.30, 0.29, 0.25, 0.24, 0.19, 0.19 and 0.16, respectively.Multi-comparison results showed that Hao-M was significantly better than the other seven products and the second to fifth best products had insignificant differences in their CWR values for SSD (Figure 6c).
The performance of Hao-M, IMS, and Hao-A in detecting SED across China was found to be relatively low too, with their CWR values on the mainland being 0.53, 0.40, and 0.40, respectively.The difference between Hao-M and the other two products were statistically significant while that between IMS and Hao-A is not significant (Figure 6b).Similar to the regional results obtained from SSD evaluations, higher CWR values for SED were found in XJ, NPM and ILP, while lower CWR values were observed in EP, YGP and TP.Among the eight products in the TP, Hao-M had the highest CWR at 0.37, followed by Li-A, Pan-M Tang-M, Huang-M, Naegeli-A, Hao-A and IMS, with CWR values of 0.22, 0.20, 0.20, 0.20, 0.19, 0.18 and 0.15, respectively.Multi-comparison analysis revealed that Hao-M significantly outperformed the other seven products.The CWR values of the second to eighth best products were not significantly different for SED (Figure 6d).

Impacts of the Retrieval Algorithm, Cloud-Gap-Filling Method, and Sensor Difference
The effect of snow cover retrieval algorithm on the products' performance varies between the two primary sensors used (i.e., MODIS and AVHRR).Within the MODIS-based group, Hao-M showed significantly better accuracy than the other three products under clear-day conditions (Figure 7a).This indicates that the snow cover retrieval algorithm plays a significant role in the performance of these MODIS-based products.However, for the AVHRRbased group, the differences in performance between the different products were statistically insignificant (Figure 7b), indicating little impact from the snow cover retrieval algorithm.
The impact of the cloud-gap-filling method appeared to be more pronounced in the AVHRR-based group compared to the MODIS-based group.In the case of the MODIS-based group, it was found that Hao-M exhibited  Water Resources Research 10.1029/2023WR036274 significantly better performance than the other products under both clear and cloudy conditions (Figure 7a).Therefore, determining whether the significant differences between Hao-M and other MODIS-based products under cloudy conditions were due to the cloud-gap-filling method used or the cascading effects from the snow cover retrieval algorithm is challenging.This issue is further elaborated upon in the subsequent discussion (Section 5.1).However, in the case of the AVHRR-based group, the significantly better performance of Li-A in snow cover identification was more likely attributable to the cloud-gap-filling method, as Li-A did not demonstrate significantly better accuracy than the other AVHRR-based products under clear conditions (Figure 7b).
Due to distinct spatial resolutions (Table 1), the selection of primary sensor was found to have a significant impact on the accuracy of snow cover identification (Figure 7c).The strongest evidence was that, even when the original MODIS and AVHRR data were acquired under clear-day conditions (referred to as the "MSAS" condition in Figure 7c) and the effects of cloud-gap-filling methods were removed, significant performance differences still existed between the sensors.Furthermore, it is noteworthy that the difference in CK values between Hao-M and Naegeli-A was 0.13 under the MSAS condition, but increased to 0.16 under the MSAC condition (i.e., MODIS clear while AVHRR cloudy), which is reasonable.Since Li-A relied on MODIS data when the original AVHRR data were obscured by clouds (Yupeng Li et al., 2022), we did not compare Hao-A with Li-A here.Naegeli-A, which ranked second best among the AVHRR-based group and entirely relied on AVHRR data, was ultimately selected for a more rigorous evaluation.

The Performance of Schemes Combining Multiple Cloud-Gap-Filled Snow Cover Products
The comparative analysis of different schemes was only conducted in the subregion of TP because only three products were available for the whole of China.Though the proposed four schemes had combined multiple cloudgap-filled products, only scheme S1, which utilized the best three products, demonstrated a slightly higher CK value (0.44) compared to the baseline scheme S0 (0.43), yet the difference was not statistically significant (Figure 8a).Nonetheless, all four schemes (S1-S4) exhibited higher accuracy than scheme S0 in detecting SCD.Among them, scheme S1 achieved the highest R value of 0.72 while scheme S0 obtained the lowest R value of 0.69.The differences in the station-based R values between schemes S1 and S0 were statistically significant (Figure 8b).
Regarding the detection of SSD and SED, scheme S1 exhibited higher CWR compared to the baseline scheme S0 for both cases.However, the differences between scheme S1 and S0 were not statistically significant (Figure 8c).The other schemes that combined multiple products showed even lower CWR values than the baseline scheme for both cases of SSD and SED.In summary, scheme S1 demonstrated significantly better performance than scheme S0 in detecting SCD, while having almost identical performances in detecting SSD and SED.Despite integrating multiple cloud-gap-filled products, schemes S2, S3 and S4 did not surpass the performance of scheme S0, which solely used Hao-M.

The Prime Importance of Snow Cover Retrieval Algorithm
As described in Section 4.3, the best MODIS-based data set has significantly higher accuracy than the best AVHRR-based due to the much higher spatial resolution of MODIS.Therefore, the subsequent discussion regarding the relative importance of the retrieval algorithm and cloud-gap-filling method is focused on the MODIS-based group.Numerous previous studies took the standard MODIS daily snow cover data that usually employed a similar snow cover retrieval algorithm as the single data source under clear-day conditions (X.Huang et al., 2017;Y. Huang et al., 2018;Tang et al., 2022;Yu et al., 2016), and concentrated on developing all kinds of cloud-gap-filling methods to create cloud-free snow cover products.However, our study highlights the importance of the snow cover retrieval algorithm itself, not just the cloud-gap-filling method.Our results demonstrated that the retrieval algorithm also has a substantial impact on the performance differences between MODIS-based data sets (Figure 7a).
Although it is challenging to differentiate the effects of the retrieval algorithm and cloud-gap-filling method under cloudy-day conditions, the importance of the retrieval algorithm can be further demonstrated by comparing products that use the same cloud-gap-filling method.For example, both Huang-M and Hao-M utilized HMRFbased cloud-gap-filling methods, with Huang-M using the full version of HMRF that incorporated spatiotemporal, spectral, and environmental information and demonstrated superiority in the TP (Y.Huang et al., 2022), while Hao-M employed a simplified version of HMRF that only considered spatiotemporal information for computational efficiency (X.Hao et al., 2022).Our results reveal that Hao-M outperformed Huang-M under cloudy-day conditions (Figure 7a), even though the latter used a more accurate cloud-gap-filling method.This reinforces that the snow cover retrieval algorithm is of greater importance than the cloud-gap-filling method for the four MODIS-based products examined in this study.
Furthermore, the snow cover retrieval algorithm emerges as a key factor explaining why Hao-M performed the best among the eight cloud-gap-filled snow cover products.According to Hao et al. (2022), the retrieval algorithm for Hao-M improved on the NDSI-based method by optimizing the NDSI threshold for different non-forest land cover types.This method also incorporated optimized decision rules considering both normalized difference vegetation index (NDVI) and the normalized difference forest snow index (NDFSI) specifically for forests (Xiaoyan Wang et al., 2020).Therefore, we conducted an investigation of the cloud-gap-filled products' performance under different land cover conditions.The results show that Hao-M had the highest accuracy in all the land cover types among both the three products within China (Figure 9a) and the eight products in the subregion of TP (Figure 9c).These findings demonstrated the superiority of the retrieval algorithm used in Hao-M.It implies that when developing cloud-gap-filled snow cover products for practical applications in mountain hydrology, it is important to also selecting an appropriate retrieval algorithm to apply on the clear-day snow cover data.
Regarding the AVHRR-based group, the primary reason for Li-A obtaining the highest accuracy among the three products under cloudy-day conditions was identified as the cloud-gap-filling method (Figure 7b and Section 4.3).
The cloud-gap-filling method utilized in Li-A incorporated clear-day MODIS data under the cloudy-day condition of AVHRR, which could have been an essential advantage over the other two products that were primarily based on the original AVHRR data (Table 1).However, the insignificant accuracy differences among these products under clear-day conditions may suggest the possibility of developing a more efficient algorithm in the future to further improve the accuracy of snow cover identification based on AVHRR data.

Relationship Between Accuracy of Snow Cover Identification and Snow Phenology
A product with higher CK is expected to have a higher R value of SCD and higher CWR values of SSD and SED, as snow phenology detection depends on accurate identification of snow cover.However, the accuracy rankings were rather inconsistent across different evaluation metrics for the eight products of the TP (Figure 10a), only except that Hao-M performed the best in terms of all the four metrics.For instance, Li-A, the second-best product for snow cover identification, ranked only fifth in terms of SSD detection.This could be related with relatively small differences in accuracy among the non-optimal products.The CK of Hao-M was 0.08 higher than that of Li-A, while the CK of Li-A was merely 0.02 higher than that of Tang-M, the third-best product for snow cover identification (Table 3).Accordingly, in terms of both SSD and SED detection, Hao-M showed statistically significantly better accuracy than both Li-A and Tang-M while the accuracy differences between Li-A and Tang-M were not statistically significant.This implies that only when an improvement in snow cover identification accuracy is sufficiently high, a notable improvement in the accuracy of snow phenology detection can be obtained.
To investigate the cause of such inconsistent product rankings, the relationship between accuracy in snow cover identification (i.e., CK values) and that in snow phenology detection was further examined based on the stationbased accuracy metrics for the three snow phenology parameters of SCD (i.e., R of SCD), SSD (i.e., CWR of SSD), and SED (i.e., CWR of SED), respectively.The results show that the strength of linear relationship between the accuracy of snow cover identification and snow phenology detection is significantly reduced for SSD and SED, compared with SCD, for most products (Figures 10b-10c and Figure S2 in Supporting Information S1).For example, the coefficient of determination (R 2 ) values are 0.47, 0.04, and 0.12 for SCD, SSD and SED in the case of Hao-M, respectively.This implies that an improvement in SSD and SED detection requires more improvements in snow cover identification than improving SCD detection.This is also consistent with our finding that scheme S3 showed clearly better accuracy in SCD detection whereas its accuracy improvement in SSD and SED detection compared to Hao-M was not significant (Figure 8).
It is worth noting that the detection accuracy of SSD and SED was relatively low.Among the three products evaluated in China, the highest CWR values of SSD and SED were 0.54 and 0.53, respectively, while the highest R of SCD was 0.82.This suggests that accurately monitoring SSD and SED based on remote sensing may be more challenging than detecting SCD.Such difference in accuracy could be associated with intrinsic nature of these parameters.SCD essentially quantifies the total number of snow-covered days within a snow year and is less sensitive to the precise timing of snow cover onset and disappearance.As a result, SCD can be more reliably estimated from remote sensing data, as it sums up across the entire snow season, mitigating the effects of isolated detection errors on particular dates.In contrast, accurately detecting SSD and SED requires precise identification of the very first and last days of a snow cover event, a task that is inherently more challenging due to its reliance on the accuracy of daily snow cover detection.Given that substantial accuracy increase in snow cover identification is required for improving SSD and SED detections, combining remote sensing data with model simulations (Y.Yang et al., 2022) or reanalysis data (Orsolini et al., 2019) may provide alternative approaches for improving accuracy of SSD and SED.Furthermore, this finding holds significant implications for hydrological research, emphasizing the need for caution when utilizing remote sensing SSD and SED data as observational references in snow hydrology analyses.

Implications for the Ensemble Use of Multiple Cloud-Gap-Filled Products
Two types of strategies were here tested including the "voting ensemble" and "sensor preference" (Section 3.4).It may be surprising that scheme S4, which applied the "sensor preference" strategy, achieved lower accuracy in snow cover identification than scheme S0, which only utilized Hao-M (Figure 8a).The result indicates that using AVHRR clear-day data to fill gaps when MODIS data are cloud-covered does not improve the accuracy compared to employing spatiotemporal interpolation methods for cloudy-day conditions.The principal cause of this phenomenon lies in the inherent sensor differences between MODIS and AVHRR.Even in the case of MCAS (i.e., MODIS is cloudy-day while AVHRR is clear-day), the MODIS-based Hao-M outperformed the AVHRR-based product in terms of snow cover identification accuracy (Figure 7c).This held true across all comparisons with the three types of AVHRR-based products (Figure 7c and Figure S3 in Supporting Information S1).Therefore, while AVHRR may have a data quality advantage under MCAS conditions, it cannot make up for its sensor deficiencies.Thus, a simple temporal combination of the MODIS-based and AVHRR-based products may not be effective in improving performance.
Regarding the strategy of "voting ensemble", only scheme S1 showed slightly higher CK than scheme S0 (P > 0.05) while schemes S2 and S3 even showed significantly (P < 0.05) lower CK values than scheme S0.This implied that while ensemble of multiple products can improve the OA to a certain extent, the inclusion of loweraccuracy products may diminish the overall performance, potentially making it even inferior to using a single high-accuracy product.Given that using an ensemble of multiple reanalysis data sets or model simulations is a common approach for investigating regional snow changes (Mudryk et al., 2020), combining multiple cloud-gapfilled snow cover products may be an important potential approach in the future, as the number of cloud-gap-filled products increases.As such, future studies need to carefully validate the reliability of an ensemble prediction based on multiple cloud-gap-filled snow cover products and exclude inappropriate candidates that may degrade overall performance due to their inefficient snow cover retrieval algorithm or sensor deficiency.
Moreover, it is noteworthy that scheme S1 that combined the three best products showed significantly (P < 0.05) higher accuracy in detecting SCD than scheme S0, though the differences between CK values of schemes S1 and S0 were not significant.To find out the major cause of such improvement in SCD detection of scheme S1, comparisons under different land cover types were further conducted.We found that the largest difference lies in the type of high-vegetation that the SCD R of scheme S1 is 0.11 higher than that of scheme S0 (Figure 11a).A further examination on the CK values of different land cover types indicated that the SCD improvement is mainly due to the significantly (P < 0.001) higher CK values of scheme S1 under the land cover type of high-vegetation (Figure 11b).These findings imply that an ensemble use of multiple cloud-gap-filled products can effectively improve the accuracy in the land cover type of high vegetation cover, in which cases the best single product had large uncertainty.

Water Resources Research
10.1029/2023WR036274 ZHANG ET AL.

Limitation and Uncertainty
One limitation of this study could be that not all the efficient snow cover retrieval algorithms were considered.For instance, only four types of snow cover retrieval algorithms were employed for the MODIS-based group.Among these, Huang-M and Tang-M used the standard MODIS snow cover products but with different fixed NDSI thresholds, Pan-M utilized the multiple endmember spectral mixture analysis (MESMA), and Hao-M primarily relied on land-cover-variant NDSI thresholds (Table 1).It is worth noting, however, that these selected products are representative of the major and relatively advanced algorithms in use.For example, though MODSCAG was not included in our comparison, it has been found that MESMA outperforms MODSCAG in the TP (F.Pan, Jiang, et al., 2021).Future analysis may benefit from inclusion of products with more diverse range of snow cover retrieval algorithms for a comprehensive comparison.
In this study, the validation of cloud-gap-filled products was conducted by comparing station observations with the corresponding product values within the pixel encompassing the station's location, a method commonly employed in previous research (Hao et al., 2022;Li et al., 2022;Zhang, Yao, et al., 2019;Zhang, Zhang, et al., 2019).Thus, this study could encounter the issue of scale mismatch between station observations and the product's pixel, further complicated by the distinct spatial resolutions of MODIS-based (500 m) and AVHRRbased (5000 m) products.To partly mitigate the impact of this issue on our results, additional analysis was conducted to upscale the Hao-M data from 500 to 5000 m to compare its accuracy against the AVHRR-based Li-A product.The result is presented in supplementary Figure S4 in Supporting Information S1.It shows that the original and upscaled Hao-M products exhibit very similar accuracies in snow cover identification, SSD, and SED, with their accuracy differences being statistically insignificant.Notably, even after upscaling to 5000 m, Hao-M still significantly outperforms Li-A, which is the top-performing AVHRR-based product, in both snow cover identification and snow phenology detection.We also examined the relationship between accuracy of snow cover identification and snow phenology for the upscaled Hao-M product, with the result presented in supplementary Figure S5 in Supporting Information S1.It shows that there were some stations with high CK values but low CWRs, which contributed to the relatively low R 2 values for both SSD and SED.This finding is consistent with that using the original Hao-M product (Figure 10b).These results suggest that the problem of different product resolutions could have limited effects on this study.
Previous research has suggested that the accuracy of snow cover products may be influenced by factors such as elevation, land use, and snow depth (Wu et al., 2021;J. Yang et al., 2015).In this study, 10 elevation zones were defined for both regional scales of China and the TP, and evaluation results were aggregated across elevations.Significant differences in CK values were found across these zones (Figure 9).Most products in the TP had their highest and lowest CK values in the elevation zones of 4000-4500 m and <2500 m, respectively.This finding was consistent with the fact that the former zone has the most snow cover (with a multi-year average SCD of 33.3 days) and the latter zone has the least snow cover (with a multi-year average SCD of 4.4 days) (Figure 1).A longer snow cover can help alleviate the impact of time mismatches between satellite overpasses and station observations (H.Zhang et al., 2020), while low SCDs can increase the impact of data noise.For instance, the low CKs to the south of the Yangtze River in Figure 3 may be due to the short snow season and ephemeral snow cover in this region.Moreover, the complex topography has been shown to significantly affect the accuracy of remote sensing snow cover data, primarily through scaling effects (Zhang et al., 2021).To assess the impact of terrain complexity, we analyzed the elevation standard deviation (ESD) from the Global Multi-resolution Terrain Elevation Data (GMTED2010) (Danielson & Gesch, 2011), with a spatial resolution of 15 arc-seconds (approximately 460 m), comparable to the resolution of MODIS-based products.The average ESD was calculated for stations within each elevation zone to assess the overall terrain complexity.Results are shown in supplementary Figure S6 in Supporting Information S1.We found that ESD distribution across elevation zones corresponds closely with CK values.Specifically, in China, the 2500-3000 m and 3500-4000 m zones, which have the highest ESD values, also have the lowest CK values for the Hao-M product (Figure 9b).This can be attributed to the difficulties in correctly identifying snow cover distribution around steep terrains with high ESDs, particularly in areas such as the outlying edge of the Sichuan Basin and the southeastern edge of the TP.These regions with complex topography pose challenges for accurate snow cover mapping, leading to relatively low CKs, as shown in Figures 3 and 4. Similarly, in the TP, elevation zones <2500 m, 2500-3000 m, and 3500-4000 m with higher ESD values have relatively low CK values for Hao-M (Figure 9d).Despite these variations, Hao-M had the highest CK value in all the 10 elevation zones among both the three products in China and the eight products in the TP.This is consistent with our finding that Hao-M is the best product for snow cover identification in both China and the TP (Section 4.1).
Regarding the effects of land cover types, the lowest CK values were observed at high-vegetation stations (Figure 9a), which is consistent with previous studies (Che et al., 2016;Luo et al., 2022;Wu et al., 2021).The superiority of Hao-M was demonstrated in all the four kinds of land cover types among both the three products in China and the eight products in the TP, as discussed in Section 5.1.However, the limited number of highvegetation stations is a limitation of this study, especially for the case of the TP (Figure 9e).More temporally continuous snow depth observations need to be collected in future to better assess the performances of different cloud-gap-filled products for the land cover type of high-vegetation.
To evaluate the impact of snow depth, a sensitivity test on snow depth thresholds was conducted.In addition to the snow depth threshold of 1 cm (Section 2.2), those of 2 and 3 cm were further applied to convert daily snow depth observations to daily snow cover status.For all the eight products, the CK values vary greatly among different conditions of snow depth thresholds .For example, both Pan-M and Huang-M evaluated in the TP showed much higher accuracy in snow cover identification when using 3 cm as snow depth threshold, implying potential risks of the scale problem related with the representativeness of station observations (H.Zhang et al., 2021).Thus, the performances of Pan-M and Huang-M may be underestimated in this study and observations from denser stations are needed in future to better assess these products in the TP.All the three products in China have clearly higher CK values when using 1 cm or 2 cm as snow depth thresholds based on significantly more stations than the TP.This suggested that the evaluation using thresholds of 1 or 2 cm could be more reliable and representative than that using 3 cm, which is also consistent with some other evaluation studies (Wu et al., 2021).Since a higher snow depth threshold could significantly decrease the number of valid samples for calculation of snow phenology, the threshold of 1 cm was here employed.In addition, Hao-M presented significantly (P < 0.05) higher CK values than the other seven products under all the three snow depth conditions (Figures 4a and 11c-11d), implying limited effects from snow depth thresholds.
High-resolution satellite images, such as those from Landsat, were not used as validation data primarily due to two reasons.Firstly, they cannot serve as reliable validation data under cloudy-day conditions, as they are also significantly affected by cloud cover due to the well-known spectral similarity between snow and clouds (Stillinger et al., 2019).Secondly, they cannot be employed to calculate validation data for snow phenology parameters due to their much longer revisiting time.However, we acknowledge that future studies may address the scaling problem in the analysis under clear-day conditions by utilizing high-resolution satellite images.

Conclusion
In this study, we conducted a comprehensive analysis of eight state-of-the-art daily cloud-gap-filled snow cover products in China and its high-mountain sub-region, encompassing their performance in both snow cover Water Resources Research 10.1029/2023WR036274 ZHANG ET AL.
identification and snow phenology detection.We introduced a novel assessment metric for SSD and SED, and explored various approaches to combining multiple cloud-free products.
Our study highlighted the importance of the snow cover retrieval algorithm in determining the accuracy of snow cover identification.Hao-M consistently outperforms other products mainly due to its optimized snow cover retrieval algorithm tailored to specific land cover types, rather than its cloud-gap-filling method.For AVHRRbased products, while the cloud-gap-filling method currently dominates the accuracy differences, the need for more efficient snow cover retrieval algorithms in the future is evident.
We further examined the relationship between the accuracy if snow cover identification and snow phenology detection.While improved snow cover identification accuracy typically resulted in higher accuracy in snow phenology detection, the relationship was not always linear.Owing to the newly designed assessing metric, the accuracy of SSD and SED detection was found much lower than that of SCD detection, implying more challenges on improving their accuracy.
Our investigation into ensemble strategies, including the "voting ensemble" and "sensor preference," highlighted the challenges and opportunities of combining multiple cloud-gap-filled products.Integrating products with relatively high accuracy substantially enhances snow cover identification and SCD detection, especially in cases where Hao-M exhibits significant uncertainty.However, we also demonstrate that the ensemble use of multiple products does not consistently enhance accuracy, suggesting the need for detailed evaluation of ensemble predictions in specific contexts.
The findings presented in this study not only advance our understanding of the major factors contributing to accuracy differences among current cloud-gap-filled snow cover products, but also provide important implications for optimal use of multiple products and future challenges in snow phenology detection.With the growing demand for accurate snow cover data, our study contributes to ongoing efforts to improve the reliability and accuracy of remote sensing-based snow cover products, which are critical for scientific research and practical applications in fields such as mountain hydrology, climate change, and environmental monitoring.

Figure 1 .
Figure 1.Overview of study area and distribution of stations with daily snow depth observations."Removed" means the stations filtered out due to extremely low snow cover days.XJ, Xinjiang; TP, Tibetan Plateau; ILP, Inner Mongolia and Loess Plateau; NPM, northeastern plain and mountain; EP, eastern plain; YGP, Yunnan and Guizhou Plateau.

Figure 2 .
Figure 2. The flowchart of this study.

Figure 3 .
Figure 3.Comparison of the three products that provide coverage across the entire China mainland in terms of accuracy in identifying snow cover.In panel (a), the results of the multi-comparison are also depicted on top of the bars: products sharing at least one common letter are considered to have insignificant (P ≥ 0.05) differences, while products with distinct letters indicate significant (P < 0.05) differences; The error bar on each column represents the spatial standard deviation calculated from the evaluation metric values obtained from individual stations.

Figure 4 .
Figure 4. Comparison of all the eight cloud-gap-filled products in terms of accuracy in identifying snow cover in the TP.In panel (a), the results of the multi-comparison are also depicted on top of the bars: products sharing at least one common letter are considered to have insignificant (P ≥ 0.05) differences, while products with distinct letters indicate significant (P < 0.05) differences.The error bar on each column represents the spatial standard deviation calculated from the evaluation metric values obtained from individual stations.

Figure 5 .
Figure 5.Comparison of the cloud-gap-filled products in terms of accuracy in detecting SCD in China (a) and (b) and its high-mountain subregion (i.e., TP) (c) and (d).The results of the multi-comparison are also depicted on top of the bars: products sharing at least one common letter are considered to have insignificant (P ≥ 0.05) differences, while products with distinct letters indicate significant (P < 0.05) differences; The error bar on each column represents the spatial standard deviation calculated from the evaluation metric values obtained from individual stations.

Figure 6 .
Figure 6.Comparison of the cloud-gap-filled products in terms of accuracy in detecting SSD and SED in China (a) and (b) and its high-mountain subregion (i.e., TP) (c) and (d).The results of the multi-comparison are also depicted on top of the bars: products sharing at least one common letter are considered to have insignificant (P ≥ 0.05) differences, while products with distinct letters indicate significant (P < 0.05) differences; The error bar on each column represents a spatial standard deviation calculated from the evaluation metric values obtained from individual stations.Due to the limited number of stations in YGP that meet the calculation criteria (only 3), the results of the multiple comparisons for YGP area are not presented in panels (a) and (b).

Figure 7 .
Figure 7.Comparison of multiple cloud-gap-filled products in terms of accuracy in identifying snow cover under clear and cloudy conditions for MODIS (a), AVHRR (b) and their mixed cases (c).In panel (c), "MCAC" means MODIS cloudy, AVHRR cloudy; "MCAS" means MODIS cloudy, AVHRR sunny; "MSAC" means MODIS sunny, AVHRR cloudy; "MSAS" means MODIS sunny, AVHRR sunny.

Figure 8 .
Figure 8. Evaluation results of various schemes combining multiple cloud-gap-filled products in snow cover identification (a), SCD detection (b) and SSD and SED detection (c).The results of the multi-comparison are also depicted on top of the bars: products sharing at least one common letter are considered to have insignificant (P ≥ 0.05) differences, while products with distinct letters indicate significant (P < 0.05) differences; The error bar on each column represents the spatial standard deviation calculated from the evaluation metric values obtained from individual stations.

Figure 9 .
Figure 9.Comparison between different cloud-gap-filled products under different conditions of land cover type panels (a) and (c) and elevation groups panels (b) and (d) for both China and its subregion of TP.Panel (e) illustrates the quantitative distribution of stations among different land cover types and elevation groups.

Figure 10 .
Figure 10.Relationship between accuracy of snow cover identification and snow phenology detection.Panel (a) illustrates the product rankings based on various evaluation metrics; Panels (b) and (c) demonstrate the degree of linearity between the accuracy of snow cover identification and snow phenology detection for Hao-M and Li-A, respectively.

Figure 11 .
Figure 11.Comparison between schemes S0 and S1 across different land cover types in terms of SCD detection (a) and snow cover identification (b), and an evaluation of the eight products in TP using 2 cm (c) and 3 cm (d) as snow depth threshold.

Table 1 A
Summary of the Eight Daily Snow Cover Products Used in This StudyShort name

Table 2
The Definitions of Metrics Used for Evaluating the Accuracy of Snow Cover Identification

Table 3
Snow Cover Identification Accuracy of Cloud-Gap-Filled Products in China and Its Subregion TP This study is supported by National Natural Science Foundation of China (U2243217), a Grant from the Key Laboratory of Water Cycle and Related Land Surface Processes, IGSNRR, CAS (Grant WL2023005), and Shandong Provincial Natural Science Foundation (Grant ZR2021MD029).The station-based evaluation metrics used in this study have been deposited in the Zenodo repository (https://zenodo.org/)with the DOI number 10.5281/zenodo.8330270.All the eight types of cloud-free snow cover products used in this study are open-access and freely available from their respective data repositories, as detailed in the Open Research section.
ZHANG ET AL.