Impact of the Spatio‐Temporal Mismatch Between Satellite and In Situ Measurements on Validations of Surface Solar Radiation

Satellite and in situ sensors do not observe exactly the same measurand. This introduces a mismatch between both types of measurements in the spatial or temporal domain. The mismatch differences can be the dominant component in their comparison, so they have to be removed for an adequate validation of satellite products. With this aim, we propose a methodology to characterize the mismatch between satellite and in situ measurements of surface solar radiation, evaluating the impact of the mismatch on validations. The Surface Solar Radiation Data Set - Heliosat (SARAH-2) and the Baseline Surface Radiation Network are used to characterize the spatial and temporal mismatch, respectively. The mismatch differences in both domains are driven by cloud variability. At least 5 years are needed to characterize the mismatch, which is not constant throughout the year due to seasonal and diurnal cloud cover patterns. Increasing the mismatch can artificially improve the validation metrics under some circumstances, but the mismatch must always be minimized for a correct product assessment. Finally, we test two types of up-scaling methods based on SARAH-2 in the validation of degree-scale products. The fully data-driven correction removes all the mismatch effects (systematic and random) but fully propagates SARAH-2 uncertainty to the corrections. The model-based correction only removes the systematic mismatch difference, but it can correct measurements not covered by the high-resolution data set and depends less on SARAH-2 uncertainty.


Introduction
The Guide to the Expression of Uncertainty in Measurement (GUM) (JCGM, 2008) states that the first step before making or using a measurement is to have a complete description of the measurand, that is, the quantity being measured. This is particularly relevant in Earth Observation because biogeophysical variables are measured with instruments of different characteristics (spectral, temporal, spatial) that can be either at ground stations or on moving platforms (e.g., satellites, balloons, aircraft), introducing a mismatch between the measurands observed by each sensor.
This mismatch appears in multiple domains, but the horizontal one, commonly referred to as spatial, is the most problematic in satellite product validation. The most widespread approach is the direct comparison of satellite and in situ values. Some validations discard sites where unexpectedly large differences are obtained due to a high spatial mismatch (Amillo et al., 2014; Antiila et al., 2016; Pfeifroth et al., 2019; Riihelä et al., 2018), while others just avoid regions with high spatial variability (Mortimer et al., 2020; Urraca et al., 2018). The spatial mismatch is also mitigated by interpolating satellite values to the site location (e.g., bi-linear interpolation), but these methods assume specific distributions of the spatial variability that may not correspond to the real one. An independent analysis of the variability can only be made with additional measurements around the station. The straightforward approach is to make additional in situ measurements around the stations (Barnett et al., 1998; Z. Li, 2005; Journée et al., 2012; Hakuba et al., 2013; Huang et al., 2016; Madhavan et al., 2017), even by using mobile platforms such as cars (Zhu et al., 2020) or aircraft (Rossini et al., 2022), but these methods are limited by their complex logistics. A more feasible method is to use high-resolution gridded values either from models or satellite products. One popular option in the land domain is to derive variograms from high-resolution optical products (Cescatti et al., 2012; Román et al., 2009, 2010; Urraca et al., 2022; Z. Wang et al., 2014). They provide a statistical analysis of the variability that can be used to screen the most spatially representative sites. However, variograms provide neither a mismatch estimate nor a correction factor, and only a few specific images are analyzed because of the need for visual interpretation.
A natural evolution of variograms is the use of high-resolution products to directly calculate the mismatch between the spatial extent covered by the station and the satellite pixel area (Hakuba et al., 2014; Peng et al., 2015; Schwarz et al., 2017, 2018; Song et al., 2020; Urraca & Gobron, 2023). This method not only allows us to fully characterize the mismatch including its temporal variability, but also to correct it as done in the Copernicus Ground-Based Observations for Validation (GBOV) up-scaled Land Products (LP) (Copernicus Global Land Service, 2022). The main challenge to its implementation is the availability of high-resolution products of sufficient quality and spatial resolution. The resolution must be high enough to characterize the spatial extent covered by the in situ sensor. Besides, the availability of high-resolution measurements is typically limited to the most recent years. Moreover, their uncertainty should be low enough to derive a valid correction (up-scaling) factor.
This study focuses on assessing the spatio-temporal mismatch in validations of shortwave downwelling irradiance (SWD). This topic has been addressed by both the satellite (Huang et al., 2016; Schwarz et al., 2017, 2018) and the solar energy (Polo et al., 2016, 2020) communities. We have extended these works by providing a general framework that fully characterizes the mismatch considering the limitations due to the availability and uncertainty of high-resolution measurements. The methodology can be applied to different biogeophysical parameters and different mismatch domains (e.g., spatial, temporal, spectral). Besides, the study provides a detailed characterization of the mismatch in SWD validations, evaluating its sensitivity to annual and daily cycles and quantifying the impact of the mismatch on the validations of satellite products. Finally, as a real-world example, a kilometre-scale SWD product is used to correct the spatial mismatch of degree-scale climate data records. Among the potential applications, the methodology could be used to calculate spatial representativeness metrics that help users screen the best stations for their validation exercises.

In Situ Measurements: BSRN
The Baseline Surface Radiation Network (BSRN) (Driemel et al., 2018) has provided high temporal resolution surface broadband radiation measurements globally since 1992. The global (GHI), direct normal (DIR), and diffuse (DIF) components of shortwave radiation are provided as 1-min averages of samples collected at 1 Hz. Standard deviation, maximum and minimum are provided with the same resolution for each component. All instruments are ventilated to reduce the thermal offset affecting pyranometer measurements.
We selected the 23 BSRN stations covered by SARAH-2 with at least one complete year of measurements from 2011 to 2020 (Table S1, Figure S1 in Supporting Information S1). BSRN 1-min measurements were quality controlled following the BSRN quality control guidelines (Long & Dutton, 2010) and the M7 procedure of Roesch et al. (2011) to minimize the influence of missing values when aggregating data. Negative or missing night observations (SZA > 93°) were set to zero, with SZA being the solar zenith angle. Then, the BSRN physically possible limits (global, direct and diffuse), the global-over-sum consistency test, and the diffuse ratio test were applied to 1-min measurements. All measurements not passing those tests were discarded. Tamanrasset (TAM) and Sede Boqer (SBO) were fully discarded due to the high percentage of values not passing the global-over-sum test. Finally, SWD (global component) was quantified as the sum of the quality-controlled direct and diffuse components (DIR ⋅ cos(SZA) + DIF), using the quality-controlled global measurements only for gap-filling. This quantity is considered more accurate for validation studies, as direct and diffuse instruments are not affected by the cosine error, which may increase the uncertainty of the unshadowed pyranometer measuring GHI. An analysis of the uncertainty of shortwave BSRN measurements can be found in Vuilleumier et al. (2014).
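The final step of this processing, the computation of SWD from its components, can be illustrated with a minimal sketch. The function name `swd_from_components` and its arguments are hypothetical, the inputs are assumed to be already quality controlled (rejected values set to NaN), and the order of operations is simplified with respect to the procedure described above.

```python
import numpy as np

def swd_from_components(ghi, dir_normal, dif, sza_deg):
    """SWD as DIR*cos(SZA) + DIF, gap-filled with GHI (all in W/m2).

    Inputs are 1-min series of equal length; NaN marks a value that is
    missing or that did not pass the quality control tests.
    """
    ghi = np.asarray(ghi, dtype=float)
    dir_normal = np.asarray(dir_normal, dtype=float)
    dif = np.asarray(dif, dtype=float)
    sza = np.asarray(sza_deg, dtype=float)

    # Direct component projected on the horizontal plane plus diffuse.
    swd = dir_normal * np.cos(np.deg2rad(sza)) + dif
    # The global measurement is used only where the sum is unavailable.
    swd = np.where(np.isnan(swd), ghi, swd)
    # Negative or missing night observations (SZA > 93 deg) are set to zero.
    night = sza > 93.0
    swd = np.where(night & (np.isnan(swd) | (swd < 0)), 0.0, swd)
    return swd
```

For example, a sample with DIR = 600 W/m², DIF = 100 W/m² and SZA = 60° yields 400 W/m², whereas a sample with a missing direct component falls back to the global measurement.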

SARAH 2.1
The Surface Solar Radiation Data Set - Heliosat (SARAH) is a Climate Data Record (CDR) from the Satellite Application Facility on Climate Monitoring (CM SAF) that provides global and direct shortwave downwelling irradiance at 0.05° × 0.05° from 1983 to the present (Pfeifroth et al., 2017). SARAH-2 exploits the images from Meteosat geostationary satellites centered at 0° longitude, which cover the region (disk) from 65°W to 65°E and from 65°S to 65°N. SARAH uses the HELIOSAT algorithm (Hammer et al., 2003) to estimate the effective cloud albedo from the visible channels of the MVIRI (Meteosat First Generation, 1983-2005) and SEVIRI (Meteosat Second Generation, 2005-present) instruments. The clear-sky index is estimated from the effective cloud albedo with a semi-empirical relationship. All-sky irradiance is obtained by combining the clear-sky index with the clear-sky irradiance calculated with the SPECMAGIC model (Mueller et al., 2009, 2012), a look-up table model based on libRadtran simulations (Mayer & Kylling, 2005). SARAH 2.1 is available from 1983 to 2017 and can be extended to near-real-time with the interim climate data record (ICDR) based on SARAH-2 methods. The SARAH-2 validation report (Pfeifroth et al., 2017) describes the accuracy of the data set and its main sources of uncertainty.

NASA-GEWEX SRB 4
NASA/GEWEX Surface Radiation Budget (SRB) (Cox et al., 2017) provides long-term observations of shortwave and longwave radiation fluxes globally at 1° × 1°. The shortwave algorithm (GSW) is based on a modified version of Pinker and Laszlo (1992), taking as inputs (a) cloud parameters from the International Satellite Cloud Climatology Project (ISCCP) (Rossow & Schiffer, 1991); (b) atmosphere moisture profiles from the Goddard Earth Observing System reanalysis v4.0.3 (GEOS-4.0.3); (c) atmospheric column ozone from different satellite missions (Zhang et al., 2013); and (d) surface spectral albedos for five surface types from Matthews (1985). NASA/GEWEX SRB v4 is available from 1983 to 2017 as 3-hourly instantaneous values, daily means, and monthly means. A detailed validation of the product can be found in Zhang et al. (2013).

General Methodology
The total difference (δ_tot) between satellite (x_sat) and in situ (x_ins) measurements can be expressed as the sum of δ, the true difference between satellite and in situ measurements, and δ_mis, the additional difference due to the mismatch between both measurements. The GAIA-CLIM project (Fasso et al., 2017) decomposes δ_mis into a smoothing difference, δ_mis,smoothing (perfectly aligned measurements of different resolutions), and a sampling difference, δ_mis,sampling (measurements of the same resolution not perfectly aligned). Note that δ_mis,sampling is not the same as the sampling error due to incomplete sampling of the signal.
In satellite product validation, δ_mis can appear in different domains (i) such as the horizontal, vertical, temporal, spectral or geometrical one. In each domain, the mismatch difference can be estimated empirically using high-resolution measurements (x_hr(i)) with a resolution equal to or higher than that of the measurements (both satellite and in situ) in that domain. If these conditions are met, δ_mis,i can be estimated as the difference between x_hr(i) integrated over the extent covered by the satellite (ext_sat,i) and over the extent covered by the in situ (ext_ins,i) measurements (Equation 3). Note that x_hr refers to a measurement with a sufficiently high resolution to analyze the mismatch, which may not be what is considered as high resolution by the satellite community.
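The relations described in the two preceding paragraphs can be written compactly as follows. This is a reconstruction from the definitions in the text rather than a verbatim copy of the published equations: the satellite-minus-in-situ sign convention and the normalization of each integral by the size of its extent (so that both terms are averages) are assumptions.

```latex
\begin{align}
\delta_{tot} &= x_{sat} - x_{ins} = \delta + \delta_{mis} \tag{1} \\
\delta_{mis} &= \delta_{mis,smoothing} + \delta_{mis,sampling} \tag{2} \\
\delta_{mis,i} &= \frac{1}{|ext_{sat,i}|} \int_{ext_{sat,i}} x_{hr}(i)\, di
               \;-\; \frac{1}{|ext_{ins,i}|} \int_{ext_{ins,i}} x_{hr}(i)\, di \tag{3}
\end{align}
```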
Based on the previous concepts, Figure 1 proposes a data-driven methodology to estimate and correct δ_mis in each domain. The first step is to determine the minimum resolution res_min needed to evaluate the mismatch. For this, ext_ins needs to be calculated, that is, the maximum extent to which x_ins is representative (an interval for temporal and spectral domains, a region for the spatial domain). This can be done by analyzing the variability of x with an external high-resolution product or model. Note that res_min may not be a constant value, for example, a higher resolution may be needed at some sites or in some seasons. Second, a high-resolution product (x_hr) with a resolution (res) equal to or higher than res_min needs to be found. Third, the quality of x_hr must be good enough. δ_mis includes the uncertainty of x_hr (u_hr), so u_hr has to be smaller than the measurement uncertainty to provide a reliable estimate of δ_mis, and subsequently a potential correction factor. Finally, x_hr has to be available for all the pairs of measurements evaluated.

Mismatch Estimation
The mismatch can be characterized by analyzing multiple (N) estimates of δ_mis to generate a Probability Distribution Function (PDF_mis). The two most important metrics from the PDF are the mean (BIAS_mis), which is the bias added to the comparison, and the standard deviation (SD_mis), which is the standard uncertainty added by the mismatch.
Both metrics are calculated for the domain being analyzed (i) from estimates repeated over a second domain ( j), the one in which the repeated measurements of the mismatch are made to generate the PDF, for example, repeated estimates of the spatial mismatch over time (i = spatial, j = temporal). Other metrics that can be derived are the Mean Absolute Difference (MAD_mis) or the Root Mean Square Difference (RMSD_mis) added by the mismatch.
The sensitivity of PDF_mis to changes in the magnitude of the measurand over the different domains also needs to be analyzed. For instance, the spatial mismatch is spectrally independent but changes from site to site and with time (annual trends, seasonal and daily cycle).
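The metrics defined in this section can be computed from a series of mismatch estimates as in the sketch below. The function name `mismatch_metrics` is hypothetical, and the use of the sample standard deviation (N - 1 in the denominator) is an assumption, since the text only refers to "the standard deviation".

```python
import numpy as np

def mismatch_metrics(delta_mis):
    """Summary metrics of N mismatch estimates (NaN values are ignored)."""
    d = np.asarray(delta_mis, dtype=float)
    d = d[~np.isnan(d)]
    return {
        "n": int(d.size),
        "bias": float(d.mean()),                  # BIAS_mis: mean of the PDF
        "sd": float(d.std(ddof=1)),               # SD_mis: assumed sample SD
        "mad": float(np.abs(d).mean()),           # MAD_mis
        "rmsd": float(np.sqrt((d ** 2).mean())),  # RMSD_mis
    }
```

Applying the function to subsets of the estimates (e.g., by month or by hour of the day) gives a simple way to inspect the sensitivity of the metrics to the seasonal and daily cycles.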

Mismatch Correction
Three scenarios appear based on the availability of x_hr (Figure 1).
• Type A correction. If all the conditions are met, an x_hr of sufficient quality is available to estimate and correct δ_mis for all the pairs of measurements compared. In situ measurements are typically corrected or "up-scaled" to the satellite extent. Different correction methods are available (Copernicus Global Land Service, 2022; Loew et al., 2017), from a simple ratio based solely on x_hr to more complex transfer functions that include some degree of modeling. This approach can fully remove the mismatch in the considered domain. Only the uncertainty of the correction remains.
• Type B correction. If x_hr has sufficient resolution and quality but is not available for all the measurements compared, the PDF_mis can be used to fit a model that extrapolates δ_mis for any pair of measurements. Again, models of different complexity can be fit (Dhata et al., 2022; Fernández-Peruchena et al., 2020; Polo et al., 2020). These models generally remove the systematic difference added by the mismatch, while random effects will remain.
• No correction. If x_hr does not have sufficient resolution or quality, δ_mis cannot be properly estimated. Some alternatives could still be implemented to at least screen measurements (e.g., sites, periods of the year) where the mismatch is small enough to make a direct comparison: (i) using the highest resolution product available, assuming that the mismatch estimates may not be valid for all the situations, or (ii) analyzing the mismatch of a correlated variable.
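The choice between the three scenarios can be summarized as a small decision rule. The sketch below is only an illustration of the logic in Figure 1 as described in the text: the function name `select_correction`, its arguments, and the convention of expressing resolution as a pixel size or interval width (smaller means finer) are all assumptions.

```python
def select_correction(res_hr, res_min, u_hr, u_meas, coverage):
    """Return "A", "B" or "none" for a candidate high-resolution product.

    res_hr, res_min: pixel size or interval width (smaller = finer).
    u_hr, u_meas: uncertainty of x_hr and of the measurements compared.
    coverage: fraction (0 to 1) of measurement pairs for which x_hr exists.
    """
    # Resolution must be equal to or finer than the minimum required,
    # and x_hr must be less uncertain than the measurements.
    if res_hr > res_min or u_hr >= u_meas or coverage <= 0.0:
        return "none"
    # x_hr available for every pair: direct (data-driven) correction.
    if coverage >= 1.0:
        return "A"
    # x_hr available only for part of the record: model-based correction.
    return "B"
```

For instance, `select_correction(0.05, 0.05, 3.0, 5.0, 0.6)` returns `"B"` because the product meets the resolution and quality conditions but covers only part of the measurement pairs.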

Estimation of the Mismatch Between Surface Solar Radiation Measurements
The mismatch between satellite and in situ measurements of surface shortwave downwelling irradiance (SWD) appears in different domains.
• Horizontal (spatial) mismatch. Satellite products are representative of the pixel area whereas in situ sensors make point measurements. This introduces both smoothing (different resolution) and sampling (pixel not aligned with the station) differences. Their magnitude is driven by the cloud cover variability around the station, which in turn depends on factors such as orography (Huang et al., 2016; Schwarz et al., 2018). Sampling differences also depend on solar radiation changes with latitude, which introduces a systematic mismatch when the latitude of the pixel center differs significantly from the station latitude. Kilometer-scale products provide the highest spatial resolution, but even this resolution might be insufficient to represent the cloud cover variability around some stations (Huang et al., 2019). Kilometer-scale data is provided by MCD18A1 6.1 (D. Wang et al., 2020) (1 km, 3-hourly) from Terra-Aqua/MODIS, the Global Land Surface Satellite (GLASS) Downward Shortwave Radiation (DSR) (Liang et al., 2021) (5 km, daily) derived from Terra-Aqua/MODIS with the direct estimation method, the Breathing Earth System Simulator (BESS) (Ryu et al., 2018) (5 km, daily) also derived from Terra-Aqua using machine learning, or the downward solar radiation product (Hao et al., 2020) (0.1° × 0.1°, hourly) derived from the DSCOVR/EPIC sensor. We refer to R. Li et al. (2021) for a comprehensive assessment of these data sets. Products from geostationary sensors provide similar resolution: SARAH (Pfeifroth et al., 2017) (0.05° × 0.05°, 30-min) from Meteosat satellites, the National Solar Radiation Database (NSRDB) (Sengupta et al., 2018) (4 km, 10-min) from GOES satellites, or GeoNEX (R. Li et al., 2023) (1 km, hourly) derived from both GOES and Himawari satellites. Compared to polar-orbiting products, geostationary-based products have regional coverage but a better sampling of the diurnal cycle, providing sub-daily SWD values based on actual satellite observations.
• Temporal mismatch. Both satellite and in situ measurements are instantaneous snapshots, but in situ data is released as average values over different intervals depending on the network (from 1 min to 1 hr). This can introduce both sampling (in situ interval not aligned with satellite overpass) and smoothing (in situ interval too wide compared to the satellite snapshot) differences. Their magnitude is again driven by cloud variability. Sampling differences for intra-daily measurements are also affected by the diurnal solar cycle. The temporal mismatch can be fully removed by averaging high-resolution (1-min) in situ measurements in a small interval around the satellite overpass (Pfeifroth et al., 2019). However, a temporal mismatch is introduced when using data already aggregated at a different timestamp or resolution (e.g., Yang and Bright (2020)).
• Geometrical mismatch. Pyranometers have a 180° field-of-view while satellites observe the top-of-atmosphere reflected radiance in a specific direction (hemispherical-conical reflectance). The differences introduced could be particularly large at high spatiotemporal resolutions, where the 3D radiative effects of clouds have a significant weight but are not considered by current retrieval algorithms (Wyser et al., 2005). Moreover, satellite observing angles change between satellites, pixels, and scanning time. For instance, geostationary satellites scan the whole disk from a fixed position, increasing the viewing angle toward the edges of the disk. This introduces the so-called parallax effect, or apparent shift in object position, which is typically addressed by retrieval algorithms with empirical corrections.
• Spectral mismatch. Satellite and in situ SWD measurements do not cover the same spectral range. For instance, SARAH-2 and CAMS-RAD are both based on Kato et al. (1999) bands from 240 to 4,606 nm, whereas high-end thermopile pyranometers are sensitive from 285 to 2,800 nm. This introduces a smoothing difference, but its magnitude is negligible compared to the differences between products due to the low amount of solar irradiance received on the edges of the solar spectrum.
This study uses high-resolution data to quantify the spatial and temporal mismatch. The geometrical mismatch could be relevant for high-resolution instantaneous measurements, but it is excluded because it cannot be addressed with a data-driven methodology. Radiative transfer models (RTM) should be used instead (Huang et al., 2016).
The spatial mismatch is estimated using SARAH-2 as the high-resolution measurements (x_hr), assuming that SARAH-2 0.05° × 0.05° values are representative of the area covered by in situ measurements. This assumption may not hold at some stations with high SWD variability, but sub-kilometer scale SWD products are not currently available (Huang et al., 2019). Although MODIS-based products such as MCD18A1 provide a slightly better resolution (∼1 km in a sinusoidal grid), we chose a geostationary-based product, SARAH-2, because one of the goals of the study was to quantify the intra-daily variability of the mismatch effect. The potential lack of spatial representativeness of SARAH at some sites was investigated by analyzing the sub-pixel cloud cover variability with a MODIS-based global cloud frequency (CF) climatology (2000-2014, 1 × 1 km) (Wilson & Jetz, 2016).
The temporal mismatch was calculated using 1-min BSRN measurements as x_hr, assuming that a 1-min average is representative of the instantaneous snapshot made by the satellite. The study was performed at all BSRN stations inside the region covered by SARAH-2, and the mismatch was evaluated at three temporal resolutions: instantaneous measurements, daily means, and monthly means.

Data Processing
BSRN provides 1-min measurements. The instantaneous SWD around the satellite overpass (half-hourly frequency) was calculated if at least 20% of 1-min BSRN measurements were available. The interval width depends on the type of experiment (Section 3.2). Daily values were calculated by averaging all instantaneous values (half-hourly frequency), only when the 48 half-hourly values were available. Monthly values were calculated with the M7 procedure from Roesch et al. (2011) adapted to half-hourly values (to have a consistent procedure between satellite and in situ data). First, the monthly mean of the diurnal cycle (half-hourly steps) was calculated if at least 20% of the measurements were available. Then, monthly values were calculated if all the half-hourly values were available. Only years when all the M7 monthly estimates were available were used to minimize the influence of missing data on the metrics.
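The availability rules described above can be sketched as follows. The function names are hypothetical, NaN is assumed to mark missing values, and the 20% threshold of the monthly step is assumed to apply separately to each half-hourly slot of the diurnal cycle.

```python
import numpy as np

def instantaneous_mean(one_min_values, min_fraction=0.2):
    """Mean of the 1-min values in an interval, if enough are available."""
    v = np.asarray(one_min_values, dtype=float)
    valid = ~np.isnan(v)
    if v.size == 0 or valid.sum() < min_fraction * v.size:
        return np.nan
    return float(v[valid].mean())

def daily_mean(half_hourly_values):
    """Daily mean, only when all 48 half-hourly values are available."""
    v = np.asarray(half_hourly_values, dtype=float)
    if v.size != 48 or np.isnan(v).any():
        return np.nan
    return float(v.mean())

def monthly_mean_m7(half_hourly_by_day, min_fraction=0.2):
    """Monthly mean from the mean diurnal cycle (array of shape n_days x 48)."""
    v = np.asarray(half_hourly_by_day, dtype=float)
    valid = ~np.isnan(v)
    counts = valid.sum(axis=0)
    sums = np.where(valid, v, 0.0).sum(axis=0)
    # Mean diurnal cycle: one value per half-hourly slot, if enough days exist.
    cycle = np.full(v.shape[1], np.nan)
    enough = (counts >= min_fraction * v.shape[0]) & (counts > 0)
    cycle[enough] = sums[enough] / counts[enough]
    # The monthly value requires every slot of the diurnal cycle.
    if np.isnan(cycle).any():
        return np.nan
    return float(cycle.mean())
```

Averaging the diurnal cycle first, instead of averaging daily means, prevents days with missing slots from being discarded entirely, which is the motivation for using the M7 procedure.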
SARAH-2 provides instantaneous SWD measurements every 30 min. Daily and monthly aggregated products are also available, but for consistency, they were derived from the 30-min instantaneous values with the same procedure described for in situ values. Many instantaneous SARAH-2 values are missing for 85° < SZA < 90°. These values were gap-filled using the clearness index from the closest valid retrieval to minimize their influence on daily and monthly means.

Mismatch Metrics
The spatial mismatch was calculated by applying Equation 3 to the spatial domain, using SARAH-2 0.05° × 0.05° values as x_hr.
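A minimal sketch of this calculation on a regular grid is given below. The function `spatial_mismatch` and its arguments are hypothetical, the sign convention (coarse area minus station pixel) is an assumption, and the unweighted block mean ignores the change of pixel area with latitude.

```python
import numpy as np

def spatial_mismatch(grid, row, col, half_width, offset=(0, 0)):
    """Mismatch between a coarse block and the pixel co-located with a station.

    grid: 2D array of high-resolution SWD values (e.g., 0.05 deg pixels).
    row, col: indices of the pixel co-located with the station.
    half_width: number of pixels on each side of the block center.
    offset: (rows, cols) shift of the block center with respect to the
        station; (0, 0) gives smoothing only, other values add sampling.
    """
    grid = np.asarray(grid, dtype=float)
    r0, c0 = row + offset[0], col + offset[1]
    r_lo, r_hi = r0 - half_width, r0 + half_width + 1
    c_lo, c_hi = c0 - half_width, c0 + half_width + 1
    if r_lo < 0 or c_lo < 0 or r_hi > grid.shape[0] or c_hi > grid.shape[1]:
        raise ValueError("coarse block extends beyond the grid")
    coarse = np.nanmean(grid[r_lo:r_hi, c_lo:c_hi])
    return float(coarse - grid[row, col])
```

With 0.05° pixels, `half_width=2` corresponds to a 0.25° × 0.25° block centered at the station. Repeating the calculation for every time step gives the series of estimates from which the PDF of the mismatch is built.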
Similarly, the temporal mismatch was calculated by applying Equation 3 to the temporal domain, using 1-min BSRN values as x_hr. The two terms compared are SWD_BSRN^±1min(t), the BSRN measurement perfectly aligned with the satellite overpass (a ±1 min interval was used to minimize missing values, in total 3 min), and SWD_BSRN^interval(t), the mean of 1-min BSRN measurements within the interval evaluated, which may or may not be perfectly aligned with the satellite overpass. Three intervals with different smoothing were used: ±4 min, ±9 min, and ±14 min. In each case, sampling differences were analyzed by evaluating all the possible intervals not centered at the satellite overpass, with 1-min steps.
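The temporal calculation can be sketched in the same way. The function `temporal_mismatch` and its arguments are hypothetical, and the sign convention (reference minus interval mean, that is, the satellite-like extent minus the in situ extent) is an assumption.

```python
import numpy as np

def temporal_mismatch(one_min, overpass_idx, half_width, shift=0):
    """Mismatch between the +-1 min reference and a wider in situ interval.

    one_min: 1D array of 1-min SWD values (NaN for missing).
    overpass_idx: index of the minute of the satellite overpass.
    half_width: half width of the interval in minutes (4, 9 or 14).
    shift: displacement of the interval center in minutes; 0 gives
        smoothing only, other values add sampling differences.
    """
    x = np.asarray(one_min, dtype=float)

    def window_mean(center, hw):
        lo, hi = center - hw, center + hw + 1
        if lo < 0 or hi > x.size:
            raise ValueError("interval extends beyond the series")
        return np.nanmean(x[lo:hi])

    reference = window_mean(overpass_idx, 1)  # +-1 min, 3 values in total
    interval = window_mean(overpass_idx + shift, half_width)
    return float(reference - interval)
```

Looping `shift` over all 1-min steps for which the interval is not centered at the overpass reproduces the sampling analysis described above.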
In both domains, PDF_mis was generated with multiple estimates of the spatial/temporal mismatch over the temporal domain ( j = t), from 2011 to 2020. BIAS_mis, SD_mis and MAD_mis were derived from the PDF. The sensitivity of these metrics to temporal changes of SWD (inter-annual, seasonal cycle, diurnal cycle) was also evaluated.

Impact of the Spatiotemporal Mismatch on Satellite Product Validation
The propagation of the previous mismatch estimates to the validation metrics is not straightforward. The mismatch estimates include the uncertainty of the high-resolution measurements used for their calculation, so the "true mismatch" is smaller. Two analyses are made to quantify the importance of the mismatch relative to the differences between satellite and in situ measurements. First, a sensitivity analysis (Subsection 3.3.1) evaluates how the validation metrics between SARAH-2 and BSRN change when artificially increasing the spatiotemporal mismatch between both measurements. Second, two coarse-resolution climate data records are validated using raw and up-scaled BSRN measurements.
Journal of Geophysical Research: Atmospheres, 10.1029/2024JD041007, Urraca et al.

Sensitivity Analysis
This section evaluates how the validation metrics between SARAH-2 and BSRN (BIAS_SARAH-BSRN, MAD_SARAH-BSRN, RMSD_SARAH-BSRN) change when artificially increasing their spatial and temporal mismatch. The spatial mismatch is added by averaging 0.05° × 0.05° SARAH-2 measurements to 0.25°, 0.5°, 0.75° and 1° (smoothing), and by using a standard climate grid with origin at (180°W, 90°N) instead of pixels perfectly centered at the station (sampling). The temporal mismatch is introduced by averaging 1-min BSRN measurements to ±4, ±9, and ±14 min intervals (smoothing), and by centering the smoothed intervals at :00 UTC instead of at the satellite overpass (sampling). In both cases, the reference case (no mismatch) is the comparison of the 0.05° × 0.05° SARAH-2 pixel co-located with the station against a ±1 min BSRN interval centered around the satellite overpass. The sensitivity analysis was made both for global and direct horizontal irradiance (DHI).

Validation of Coarse-Resolution Products
The temporal mismatch can be eliminated if 1-min in situ measurements are available, so only the spatial mismatch needs to be corrected. The coarse-resolution products validated were NASA/GEWEX SRB and CERES-SYN 4A daily and monthly SWD measurements from 2011 to 2017. Both have a spatial resolution of 1° × 1°. The validation of hourly measurements was discarded because the uncertainty of SARAH-2 instantaneous measurements is too high to correct the spatial mismatch.
The three scenarios described in Figure 1 were evaluated: no correction, Type A correction, and Type B correction. Multiple transfer functions exist for making both types of corrections, but evaluating these transfer functions is outside the scope of the study. The goal is to highlight the differences between corrections when x_hr is available for all pairs of measurements (Type A) and those when x_hr is available only for a short period (Type B). Therefore, for each type of correction, only one standard transfer function is implemented.
Type A correction is made as the ratio between the value observed by the in situ sensor (approximated with SWD_SARAH^0.05°×0.05°) and the value observed by the degree-scale product (approximated as the mean of SARAH pixels inside the degree-scale pixel, SWD_SARAH^1°×1°).
Type B correction is made by fitting a linear function between SWD_SARAH^1°×1° and SWD_SARAH^0.05°×0.05°. One function is fit for each month of the year based on the results obtained in the mismatch characterization. The correction is then applied using slope(m) and intercept(m), the linear parameters for each month of the year.
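Both corrections can be illustrated with the sketch below. The function names are hypothetical and two choices are assumptions, since the equations are not reproduced here: in Type A the in situ value is multiplied by the ratio of the degree-scale mean to the co-located pixel, and in Type B the monthly linear functions predict the degree-scale value from the co-located pixel and are then applied to the in situ value.

```python
import numpy as np

def type_a_upscale(swd_bsrn, sarah_point, sarah_coarse):
    """Up-scale in situ values with the ratio coarse mean / co-located pixel."""
    swd_bsrn = np.asarray(swd_bsrn, dtype=float)
    sarah_point = np.asarray(sarah_point, dtype=float)
    sarah_coarse = np.asarray(sarah_coarse, dtype=float)
    # The ratio is left as NaN where the co-located pixel is zero or missing.
    ratio = np.divide(sarah_coarse, sarah_point,
                      out=np.full(sarah_point.shape, np.nan),
                      where=sarah_point > 0)
    return swd_bsrn * ratio

def fit_type_b(sarah_point, sarah_coarse, months):
    """Fit coarse = slope * point + intercept for each calendar month."""
    sarah_point = np.asarray(sarah_point, dtype=float)
    sarah_coarse = np.asarray(sarah_coarse, dtype=float)
    months = np.asarray(months)
    params = {}
    for m in np.unique(months):
        sel = (months == m) & ~np.isnan(sarah_point) & ~np.isnan(sarah_coarse)
        if sel.sum() >= 2:
            slope, intercept = np.polyfit(sarah_point[sel], sarah_coarse[sel], 1)
            params[int(m)] = (float(slope), float(intercept))
    return params

def type_b_upscale(swd_bsrn, months, params):
    """Apply the monthly linear functions to the in situ values."""
    swd_bsrn = np.asarray(swd_bsrn, dtype=float)
    months = np.asarray(months)
    out = np.full(swd_bsrn.shape, np.nan)
    for m, (slope, intercept) in params.items():
        sel = months == m
        out[sel] = slope * swd_bsrn[sel] + intercept
    return out
```

Type A needs a SARAH value for every in situ value, whereas the Type B parameters can be fitted on the period covered by SARAH and applied to any other period, which is the practical difference between the two scenarios.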
Besides the standard validation metrics (BIAS, MAD, RMSD), the E_n ratio was also used to evaluate the impact of the corrections on the uncertainty budget. The ratio compares the difference between the product tested and BSRN with their combined uncertainty, built from u_tst and u_BSRN, the standard uncertainties of the product tested and BSRN, respectively, and expanded by the coverage factor k. The E_n ratio is used to verify the coherence of the uncertainties with the differences observed. For k = 2, 95% of the E_n values should be below 1. u_BSRN was obtained by linearly interpolating the standard uncertainty reported by Vuilleumier et al. (2014) for small (50 W/m²) and large (1,000 W/m²) signals of corrected pyranometers: 7 and 8.7 W/m², respectively. u_tst is not provided by any of the degree-scale products, so we assume a constant uncertainty equal to the GCOS "threshold" requirement: 10 W/m² (k = 2) (GCOS, 2022). In this way, instead of verifying the coherence of the uncertainties, we evaluate how many deviations are coherent with a product uncertainty within the GCOS threshold.
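The E_n ratio can be sketched with its usual metrological definition, E_n = |x_tst - x_BSRN| / (k · sqrt(u_tst² + u_BSRN²)). The function names below are hypothetical, and two choices are assumptions: the GCOS threshold of 10 W/m² (k = 2) is converted to a standard uncertainty of 5 W/m², and u_BSRN is held constant outside the 50 to 1,000 W/m² range.

```python
import numpy as np

def u_bsrn(swd):
    """Standard uncertainty of BSRN (W/m2), interpolated between
    7 W/m2 at 50 W/m2 and 8.7 W/m2 at 1000 W/m2.

    np.interp holds the end values outside that range (assumption).
    """
    return np.interp(swd, [50.0, 1000.0], [7.0, 8.7])

def en_ratio(x_tst, x_bsrn, u_tst=5.0, k=2.0):
    """E_n ratio between the product tested and BSRN.

    u_tst: standard uncertainty of the product; 5 W/m2 is assumed here,
        i.e., the 10 W/m2 GCOS threshold divided by its coverage factor.
    """
    x_tst = np.asarray(x_tst, dtype=float)
    x_bsrn = np.asarray(x_bsrn, dtype=float)
    combined = np.sqrt(u_tst ** 2 + u_bsrn(x_bsrn) ** 2)
    return np.abs(x_tst - x_bsrn) / (k * combined)

def compliant_fraction(x_tst, x_bsrn, **kwargs):
    """Fraction of pairs with E_n <= 1 (about 0.95 is expected for k = 2)."""
    en = en_ratio(x_tst, x_bsrn, **kwargs)
    return float(np.mean(en[~np.isnan(en)] <= 1.0))
```

Comparing `compliant_fraction` for raw and up-scaled BSRN values gives the number used to judge whether a correction makes the differences more coherent with the assumed uncertainties.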
Note that in both scenarios (corrected and uncorrected), some uncertainty components are being neglected. In the uncorrected case, the uncertainty added by the mismatch (u_mis) is not included as it cannot be decoupled from the uncertainty of the product used to evaluate the mismatch. In the corrected cases, the uncertainty of the corrections (u_cor) is neglected because SARAH-2 uncertainties are not available. This comparison can be seen as an exercise to evaluate the efficiency of the corrections, as in principle, there should be more compliant cases in the corrected scenarios because u_cor should be smaller than u_mis.

Spatial Mismatch
The spatial mismatch is evaluated with SARAH-2 observations, assuming that the SARAH-2 0.05° × 0.05° pixel co-located with the station is representative of BSRN in situ measurements (Figure 2). MAD_mis,sp increases roughly linearly with the level of smoothing. The sampling differences aggravate MAD_mis,sp, almost doubling it at some stations. Both smoothing and sampling effects are mitigated when aggregating the data temporally, but a significant spatial mismatch remains in the monthly averages of some stations. The monthly MAD_mis,sp is very similar to BIAS_mis,sp because, at the monthly level, most of the mismatch effects are systematic and driven by the climatological distribution of clouds around each site. BIAS_mis,sp is almost non-existent at stations with good spatial representativeness (TOR to CAB). On the contrary, stations with worse spatial representativeness (GOB to IZA) have a significant BIAS_mis,sp even in areas perfectly centered at the station (only smoothing). The sampling differences (pixels not centered around the station) can either aggravate the bias or offset it, proving the importance of taking into account the specific product grid used when characterizing the spatial representativeness of a station.
The spatial mismatch is driven by cloud cover variability.A correlation of r = 0.85 exists between the spatial mismatch and the standard deviation of MODIS 1 × 1 km cloud frequency around the station (Figure S2 in Supporting Information S1).Stations with larger mismatch are located at sites with a large gradient of cloud cover such as islands (IZA, RUN, ENA), coastal areas (PAR, CAM, FLO) or mountain regions (SON, CNR, PAY).On the contrary, a small spatial mismatch appears at sites with very spatially homogeneous cloud cover, which are typically flat inland areas (e.g., LIN, CAB, TOR).
The sensitivity of the spatial mismatch to temporal changes of SWD is analyzed in Figure 3 (sensitivity to inter-annual variations) and Figure 4 (sensitivity to annual and diurnal cycles). The inter-annual changes only introduce random oscillations in the spatial mismatch driven by the natural inter-annual variability of annual cloudiness. Figure 3b shows that the standard deviation of the mismatch (SD_mis,sp) stabilizes when more than 5 years are used for its calculation. Thus, 5 years should be the minimum period to characterize the spatial mismatch of SWD measurements.
The intra-annual and intra-daily changes of the absolute mismatch are driven by the annual and daily solar cycles, so absolute mismatch metrics (in W/m²) should change linearly with the level of surface incoming irradiance. However, non-linear changes are observed at sites where cloudiness is not constant throughout the year or the day (Figure 4). At the annual level, cyclic patterns of different characteristics and magnitude are observed at DAA, BRB, PTR, CNR, SON, and IZA. At the daily level, the spatial mismatch in the afternoon is generally larger than in the morning at many stations due to a progressive increase of cloudiness during the day. Some stations also have a mismatch peak at sunrise that could be related to the temperature inversion break or other local effects (TOR, LER, BUD, SMS, RUN). Characterizing these non-linear patterns is essential to provide representative estimates of the mismatch and adequate corrections. If the mismatch changes linearly with SWD, a single correction model could be used; otherwise, higher-order polynomials or other input parameters (e.g., month, hour of the day) are needed. Ultimately, one mismatch estimate per month (e.g., BRB), or for different periods of the day (morning, afternoon), could be needed.

Temporal Mismatch
The temporal mismatch is evaluated with 1-min BSRN measurements using the ±1 min interval around the satellite overpass as the reference (Figure 5). Compared to the spatial mismatch, the temporal mismatch mostly introduces an additional MAD for instantaneous observations, due to both sampling and smoothing differences. MAD mis,t due to smoothing differences increases with the interval width and progressively disappears when aggregating observations temporally. Sampling differences (intervals not aligned with the satellite overpass) also have a significant contribution at the instantaneous level, driven by the dependence of SWD on the daily solar cycle, but their effect disappears in daily and monthly means because morning and afternoon sampling differences cancel out. No effect at all is observed in the bias, due to the random nature of smoothing differences and the cancellation of morning and afternoon sampling differences.
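The distinction between smoothing (wider interval, still centered on the overpass) and sampling (interval shifted from the overpass) can be illustrated with a short sketch. The function names, the index-based time axis and the `offset` parameter are assumptions for the example; they are not taken from the processing chain used for Figure 5.

```python
from statistics import mean


def window_mean(series, center, half_width):
    """Mean of the 1-min values in [center - half_width, center + half_width].

    series is a list of 1-min SWD values indexed by minute (assumed layout).
    The window is clipped at the ends of the series.
    """
    lo = max(0, center - half_width)
    hi = min(len(series), center + half_width + 1)
    return mean(series[lo:hi])


def temporal_mismatch_mad(series, overpasses, half_width, offset=0, ref_half_width=1):
    """MAD between a wider (optionally shifted) interval and the reference.

    offset = 0 isolates smoothing differences; offset != 0 adds a sampling
    difference. The reference is the +/-1 min interval around each overpass.
    """
    diffs = [
        window_mean(series, t + offset, half_width) - window_mean(series, t, ref_half_width)
        for t in overpasses
    ]
    return mean([abs(d) for d in diffs])
```

For a signal that varies linearly in time, a centered window returns the same value as the reference, so only the shifted window produces a non-zero MAD; real SWD series under broken clouds will show both contributions.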
The temporal mismatch is also driven by the temporal cloud cover variability (r = 0.6), which is measured with the daily standard deviation of the hourly clearness index (Figure S2 in Supporting Information S1). Stations with the smallest temporal mismatch are those with predominantly clear-sky conditions (IZA, GOB, DAA, CAR), whereas stations with the largest mismatch are in regions with rapidly changing cloudiness such as the tropics (PAR, PTR, BRB, ENA or RUN). Therefore, stations with poor temporal representativeness are generally the same as those with poor spatial representativeness, because the spatial and temporal variability of clouds are inevitably correlated (fast-moving clouds introduce variability in both domains) (Figure S3 in Supporting Information S1). The exception is Izaña (IZA), which has one of the best temporal representativeness values because it is located at 2373 m, above the subtropical temperature inversion layer, so it generally has clear-sky conditions (García et al., 2016). However, the surrounding area (lower part of the island below 2000 m) is generally covered by clouds, making IZA the station with the worst spatial representativeness. Due to the correlation between spatial and temporal mismatch, the sensitivity of the temporal mismatch to temporal changes in SWD is very similar to that discussed for the spatial domain (Figures S4 and S5 in Supporting Information S1).
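A possible implementation of the variability index mentioned above is sketched here. It assumes the usual definition of the clearness index (surface irradiance divided by the top-of-atmosphere horizontal irradiance); the `min_toa` cut-off used to discard low-sun hours is a hypothetical parameter, not a value from this study.

```python
from statistics import pstdev


def daily_kt_sd(swd_hourly, toa_hourly, min_toa=50.0):
    """Daily standard deviation of the hourly clearness index.

    swd_hourly: hourly surface irradiance of one day (W/m2).
    toa_hourly: hourly top-of-atmosphere horizontal irradiance (W/m2).
    Hours with toa <= min_toa are skipped to avoid unstable ratios
    (assumed threshold). Returns None if fewer than two hours remain.
    """
    kt = [s / t for s, t in zip(swd_hourly, toa_hourly) if t > min_toa]
    if len(kt) < 2:
        return None
    return pstdev(kt)
```

A day with a constant clearness index returns 0, whereas days with alternating clear and cloudy hours return larger values.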

Sensitivity Analysis
The impact of the mismatch is evaluated by aggregating SARAH-2 to coarser pixel grids (spatial mismatch) and by averaging BSRN measurements to wider temporal intervals (temporal mismatch). Figure 6 shows all the stations together, whereas Figures S6 and S7 in Supporting Information S1 plot each station separately. The BIAS SARAH BSRN is highly affected by the spatial mismatch. ΔBIAS SARAH BSRN is above 50% at DAA, CAB, LIN, and GOB (due to their low bias) and at CAR, ENA, PTR, CAM, PAY and RUN (due to the large ΔBIAS SARAH BSRN). The mismatch bias can either aggravate or offset the product bias. At some stations, the mismatch completely changes the validation results by fully offsetting the product bias (LIN or ENA), or even by changing its sign (CAR, CAM, CNR or RUN). ΔBIAS SARAH BSRN is greater for DHI than for SWD, likely due to the higher sensitivity of the direct component to the spatial variability of clouds around the station. The BIAS SARAH BSRN is not affected by the temporal mismatch because, as mentioned above, temporal mismatch differences cancel out when averaging the data.
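The spatial aggregation step can be sketched with plain block averaging around the station pixel. This is only an illustration: the grid layout (list of rows), the function names and the square block centered on the station are assumptions, and no area weighting is applied.

```python
from statistics import mean


def block_mean(grid, row, col, half_size):
    """Mean of the (2 * half_size + 1)^2 pixels centered on (row, col)."""
    if (row - half_size < 0 or col - half_size < 0
            or row + half_size >= len(grid) or col + half_size >= len(grid[0])):
        raise ValueError("block does not fit inside the grid")
    values = [
        grid[r][c]
        for r in range(row - half_size, row + half_size + 1)
        for c in range(col - half_size, col + half_size + 1)
    ]
    return mean(values)


def spatial_mismatch_stats(grids, row, col, half_size):
    """Bias and MAD of (area mean - co-located pixel) over a list of grids.

    grids: one 2-D grid per time step; (row, col) is the station pixel.
    """
    diffs = [block_mean(g, row, col, half_size) - g[row][col] for g in grids]
    bias = mean(diffs)
    mad = mean([abs(d) for d in diffs])
    return bias, mad
```

Here `half_size = 1` averages a 3 × 3 block; larger values emulate coarser target grids, with the co-located pixel always used as the reference.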
The mismatch impact on MAD SARAH BSRN and RMSD SARAH BSRN is more complex because random mismatch effects do not propagate as straightforwardly as systematic effects. Moreover, the mismatch MAD and RMSD estimates also include the uncertainty of the high-resolution observations used to calculate them. In the spatial domain, the relative magnitude of ΔMAD SARAH BSRN increases with the temporal aggregation, with average ΔMAD SARAH BSRN (95% CI) of [0%-17%], [0%-23%] and [0%-36%] for instantaneous, daily and monthly values, respectively, and maxima of up to 30%, 51% and 55%. In the temporal domain, ΔMAD SARAH BSRN only changes significantly for instantaneous SWD, with average MAD reductions around [3%-18%] and up to 30%. We would expect MAD and RMSD to increase when adding a mismatch. This occurs when adding a sampling difference, but not when smoothing the observations. In the spatial domain, MAD SARAH BSRN at 0.25° × 0.25° is smaller than that at 0.05° × 0.05° at all the stations, which means that the average of the nine pixels surrounding the station is more similar to the BSRN point measurements than the exact pixel co-located with the station. This pattern is more accentuated in the temporal domain. MAD SARAH BSRN of instantaneous values always decreases when increasing the in situ interval width. This is again counter-intuitive, as we would expect a better agreement when using a small interval centered around the satellite overpass. Similar patterns are observed for both global and direct horizontal irradiance.
We analyze this further by plotting all stations together (Figure 6). The smallest MAD SARAH BSRN is observed by averaging 1-min BSRN measurements to ±14 min intervals (width = 1 + 14 ⋅ 2 = 29 min) and 0.05° SARAH-2 values to 0.25°. The greatest MAD reduction is obtained when increasing the temporal interval from ±1 min to min, with a median ΔMAD SARAH BSRN of around 20% for instantaneous values. MAD SARAH BSRN also decreases for daily measurements by up to 5%, while the MAD changes are negligible at the monthly level. The MAD reduction could be because wider temporal intervals are likely to represent the cloud cover variability inside the satellite pixel better than 1-min snapshots. Clouds observed during the satellite overpass inside the 0.05° × 0.05° pixel will pass, or have already passed, over the station during a wider temporal interval. In the end, the spatial mismatch is corrected by introducing a temporal mismatch in the comparison. Along these lines, Deneke et al. (2009) and Huang et al. (2016) suggested that using 30-40 min averaging intervals could mitigate the spatial mismatch effects.
The MAD and RMSD reduction is smaller but still significant when adding spatial mismatch, with a median reduction of 7% and 3% for instantaneous and daily measurements, respectively, when averaging SARAH-2 to 0.25°. This is even more counter-intuitive, as a smaller spatial mismatch, and thus a smaller MAD, is expected with higher spatial resolution. As suggested by Huang et al. (2019), this could be explained by 3D radiative effects of clouds (Wyser et al., 2005) not considered by current retrieval algorithms. These effects have a greater impact at high spatiotemporal resolution, due to a higher cloud cover variability, explaining the improvement when aggregating the data both spatially and temporally. Huang et al. (2019) also suggested averaging satellite measurements to 0.25° to correct the spatial mismatch. However, not all the errors originating from 3D effects of clouds can be attributed to spatial mismatch. While this could be true for local shadows over the pyranometer, cloud reflections from adjacent pixels or clouds in the surface-to-satellite path are not spatial mismatch differences but retrieval errors. Moreover, as seen in the previous section, spatial averaging can introduce a bias that may aggravate or offset the product bias. Therefore, systematically aggregating satellite products to 0.25° would artificially improve their validation metrics because different random errors cancel out. Both effects, spatial mismatch and errors due to 3D cloud effects, must be addressed separately.
The reader should never interpret a smaller MAD or RMSD as a better validation, or vice versa. The goal of a validation exercise is to compare apples with apples, that is, to minimize the mismatch between the measurements compared, regardless of whether the validation metrics improve or worsen. We would expect a smaller MAD or RMSD when minimizing the mismatch, but this section shows that this is not always the case. The mismatch bias can offset the product bias, while smoothing (averaging) the observations can reduce the impact of random errors (e.g., 3D cloud effects). This reinforces the importance of minimizing the mismatch between the measurements compared, as otherwise the mismatch can artificially improve the validation, leading to an incorrect assessment of the product requirements.
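A toy example may help to see how a mismatch bias can hide a product bias. All values below are hypothetical and chosen only for illustration: the product overestimates the pixel mean by 8 W/m², and the station is assumed to be 8 W/m² sunnier than the pixel it sits in.

```python
from statistics import mean


def bias(x_sat, x_ref):
    """Mean difference between satellite and reference values."""
    return mean([s - r for s, r in zip(x_sat, x_ref)])


# Hypothetical values in W/m2, chosen only for illustration.
area_truth = [200.0, 300.0, 400.0]      # true mean over the satellite pixel
x_sat = [v + 8.0 for v in area_truth]   # product overestimates by 8 W/m2
x_ins = [v + 8.0 for v in area_truth]   # station is sunnier than its surroundings
mismatch = [-8.0, -8.0, -8.0]           # area mean minus point value
x_ins_cor = [p + d for p, d in zip(x_ins, mismatch)]  # up-scaled in situ values

bias_raw = bias(x_sat, x_ins)       # 0: the mismatch hides the product bias
bias_cor = bias(x_sat, x_ins_cor)   # 8: removing the mismatch reveals it
```

In this constructed case the raw comparison looks perfect although the product is biased, and the metrics worsen once the mismatch is removed, which is the situation the paragraph above warns about.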

Validation of Degree-Scale Products
This section quantifies the change of the validation metrics of degree-scale products, NASA/GEWEX SRB (Figure 7) and CERES-SYN 4A (Figure 8), when correcting the spatial mismatch with a kilometer-scale product (SARAH-2). RMSD results are shown in Figures S9 and S10 in Supporting Information S1. Up-scaling reduces the MAD and RMSD at most stations. The daily MAD of NASA/GEWEX SRB is reduced in 13 (type A correction) and 15 (type B correction) out of the 18 stations, while the monthly MAD decreases in 10 (type A correction) and 12 (type B correction) stations. In CERES-SYN 4A, the daily MAD decreases in 14 (type A correction) and 18 (type B correction) out of the 18 stations, while the monthly MAD decreases in 15 (type A correction) and 16 (type B correction) stations.
As expected, the reduction is greater at stations with lower spatial representativeness such as CAM, CNR, ENA, SON or IZA. Moreover, type B corrections reduce the MAD at more stations, but type A corrections produce greater reductions. Type A corrections are derived with x hr values co-located with all the pairs of measurements compared, which allows removing both random and systematic effects of the spatial mismatch. However, the type A correction fully transfers the uncertainty of SARAH-2 into the up-scaled values, explaining why the MAD increases at stations where SARAH-2 errors are large (e.g., IZA, SON or BRB). In contrast, type B corrections are made by fitting a linear model (Equation 11) to predict the mismatch difference outside the period covered by x hr. Thus, these corrections mostly remove the systematic effects of the mismatch but are less dependent on SARAH-2 uncertainty, being able to correct the mismatch at stations where SARAH-2 uncertainty is high.
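The difference between the two correction types can be sketched as follows. This is a simplified illustration under explicit assumptions: the mismatch difference is taken as the high-resolution area mean minus the high-resolution value at the station pixel, and the type B model is assumed to be a straight line in the in situ value, which may differ from the actual form of Equation 11. All function names are hypothetical.

```python
def type_a_correction(x_ins, x_hr_area, x_hr_point):
    """Data-driven up-scaling: add the co-located mismatch difference
    (area mean minus station pixel of the high-resolution product)."""
    return [xi + (ha - hp) for xi, ha, hp in zip(x_ins, x_hr_area, x_hr_point)]


def fit_linear(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def type_b_correction(x_ins, intercept, slope):
    """Model-based up-scaling: add the mismatch predicted by the fitted
    line (assumed model form), usable where x_hr is not available."""
    return [xi + intercept + slope * xi for xi in x_ins]
```

In this sketch the model would be fitted on the period covered by x hr, for example `fit_linear(x_ins, deltas)` with `deltas` computed as in `type_a_correction`, and then applied to any in situ value. Type A carries every error in x hr into the corrected value, whereas type B only carries the errors that survive the fit.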
The increase in MAD observed at some stations could be due to different factors. First, corrections include the uncertainty of SARAH-2, so they are more uncertain, and thus more likely to worsen the agreement, at stations where SARAH-2 uncertainty is high (Figure 7b). Second, the MAD also increases at stations where the mismatch bias was offsetting the product bias (e.g., PAY, PTR or CAR) because up-scaling unmasks the true bias of the product. This effect is more evident in the monthly MAD than in the daily one because monthly metrics are more affected by systematic effects. The results also show that the number of stations where the monthly MAD decreases is greater in CERES-SYN 4A than in NASA/GEWEX SRB, with CERES-SYN 4A also having generally lower BIAS, MAD, and RMSD than NASA/GEWEX SRB.
The changes are also evident in terms of conformity testing. Figures 7 and 8 show the percentage of measurements compliant with a product uncertainty of 10 W/m² (GCOS "threshold" limit). Up-scaling in situ measurements has a greater impact on monthly SWD. The GCOS threshold is the same for all spatiotemporal resolutions, so this is likely because NASA/GEWEX SRB is further from meeting the threshold at the daily level, and thus less sensitive to mismatch corrections. The percentage of compliant monthly measurements increases by around 30% to 40% at stations with low spatial representativeness such as CAM, CNR, ENA, SON or IZA. Again, the compliance rate can decrease at stations where the mismatch bias was offsetting the product bias (PAY, PTR, CAR). The magnitude of these changes highlights the importance of removing the mismatch when evaluating product requirements.
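A simplified version of the compliance rate can be written in a few lines. The sketch assumes that a pair is compliant when the absolute difference does not exceed the threshold; the conformity test used for Figures 7 and 8 may additionally account for the uncertainty of the reference, which is not modeled here.

```python
def compliance_rate(x_sat, x_ref, threshold=10.0):
    """Percentage of pairs with |x_sat - x_ref| <= threshold (W/m2).

    threshold defaults to the 10 W/m2 GCOS "threshold" limit quoted in
    the text; the simple absolute-difference test is an assumption.
    """
    within = sum(1 for s, r in zip(x_sat, x_ref) if abs(s - r) <= threshold)
    return 100.0 * within / len(x_sat)
```

Calling the function once with raw and once with up-scaled in situ values gives the change in compliance attributable to the mismatch correction.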
Note again that the goal of up-scaling is not to reduce the validation metrics but to minimize the mismatch. Although we would expect a reduction of the validation metrics after up-scaling in situ measurements, this is not always the case. Corrections can unmask a hidden bias, and validation metrics can worsen if the uncertainty of the corrections is not low enough.

Limitations and Future Work
1. Minimum spatial resolution (res min) needed to evaluate the spatial mismatch. Ideally, a high-resolution product (around 100 m) would be needed to evaluate the SWD variability around the station. This would allow us to determine the area beyond which the station is no longer representative. Unfortunately, SWD products with this resolution do not currently exist. As proposed by Huang et al. (2016), the use of RTMs could be explored as an alternative to study the SWD variability around a station.
2. Insufficient resolution of SARAH-2 at some sites. Following the previous point, SARAH-2 was selected as the geostationary-based product with the highest resolution available. However, the analysis of sub-pixel variability (Figures S11-S32 in Supporting Information S1) and the comparison against BSRN data (Figure 7a) revealed that SARAH-2 uncertainty and spatial resolution might not be adequate at IZA, SON and RUN. Using SARAH for calculating the mismatch at those stations or for deriving up-scaling methodologies should be avoided. Note, however, that the SARAH spatial resolution was sufficient for most of the stations. IZA, SON and RUN are particular sites that were not included in BSRN for satellite product validation.
3. Lack of uncertainty estimates. Uncertainty estimates are needed to assess the suitability of x hr for estimating the mismatch PDF, as this uncertainty cannot be decoupled from the mismatch estimates, and to derive reliable corrections or up-scaled values. Figure 1 described the uncertainty budget of each scenario but, unfortunately, uncertainty values are missing in both SWD satellite and in situ measurements. The best we could do was to evaluate the suitability of SARAH-2 for correcting the mismatch based on MAD SARAH BSRN, which does not provide an uncertainty estimate but could be used to screen stations where SARAH-2 quality was not sufficient for deriving corrections. Stations with high MAD SARAH BSRN were those in which SARAH-2 based corrections led to an increase of validation metrics, suggesting that SARAH-2 uncertainty may be too high for up-scaling at those sites. Both in situ and satellite communities should work on deriving the uncertainty budgets of their measurements (Goryl et al., 2023).
The final question is who will implement this. There are three main candidates.
• Users. The methodology above is time-consuming and requires searching for and retrieving additional high-resolution measurements that may not be easily accessible, and sometimes do not exist. Therefore, most users do not evaluate or correct the mismatch when comparing satellite and in situ measurements. This is also the case when satellite data are used as point measurements or when point measurements are assimilated into gridded data sets.
• Satellite community. Satellite validations must remove the mismatch differences to analyze the real uncertainty of their satellite estimates and verify whether they are meeting the product requirements. However, the assessments done by satellite producers are tailored to the specific characteristics of their products, so a general assessment of the mismatch will not come from the satellite community.
• In situ community. In situ networks designed for validating satellite products appear as the best candidates to address this issue. Networks such as GBOV already provide up-scaled in situ measurements around the station that can be aggregated to the target satellite products. In situ networks could also include a quantitative description of the mismatch by providing (a) indexes of the spatial representativeness of the station at different spatial scales (e.g., BIAS mis and SD mis), (b) analyses of the temporal variability of these indexes (e.g., Figures 3 and 4), and (c) maps showing the variability around the station (Figures S8-S29 in Supporting Information S1). For instance, based on the lessons learned from the sensitivity analyses of this paper, spatial representativeness indexes of all BSRN stations could be calculated by using a polar-orbiting product such as MCD18. This information would help non-technical users in screening the best stations for their needs. The methodology would also be very valuable in identifying the best sites to install new stations.

Conclusions
We presented a data-driven methodology to characterize the mismatch between satellite and in situ measurements of shortwave downwelling irradiance (SWD). The mismatch differences in the spatial and temporal domains are both strongly driven by the cloud cover variability, so stations with a large spatial mismatch also have a large temporal mismatch. At least 5 years are needed to capture the random inter-annual variations of the mismatch. The mismatch differences are not constant throughout the year at more than 50% of the stations due to seasonal and diurnal cycles of cloud cover variability (e.g., more clouds in the afternoon than in the morning). A temporally constant spatial mismatch cannot be assumed at most of the sites, as is currently done by many studies.
The study also evaluated the impact of the spatio-temporal mismatch on the validation metrics with (a) a sensitivity analysis and (b) a real-world validation. The sensitivity analysis shows that the spatial mismatch significantly changes the validation bias, offsetting the product bias or even changing its sign at some sites. However, the smallest MAD and RMSD between satellite and in situ data are obtained when adding some mismatch between both measurements: temporally averaging in situ measurements to ±14 min intervals and spatially averaging satellite measurements to 0.25°. The aggregation of in situ measurements to wider temporal intervals could be seen as a measure to correct the spatial mismatch, but both aggregations also correct random errors due to 3D radiative effects of clouds that are not attributable to the spatial mismatch (e.g., cloud reflection from adjacent pixels, clouds on the surface-to-satellite path). A validation exercise must always minimize the mismatch between in situ and satellite measurements, regardless of whether the validation metrics improve or worsen.
The real-world example validates NASA/GEWEX SRB and CERES-SYN 4A against raw and up-scaled (based on SARAH-2) BSRN data. The type A correction can be used only when SARAH-2 is available for all the pairs of measurements. This correction can remove both the systematic and random effects of the mismatch, but it is highly sensitive to SARAH-2 quality. Thus, it gives the smallest validation metrics at stations where SARAH-2 quality is good, but it increases the disagreement between NASA/GEWEX SRB and BSRN at stations where SARAH-2 uncertainty is high. In contrast, the type B correction is a model that allows extrapolation of the mismatch difference to pairs of measurements for which SARAH-2 is not available. The analysis of the intra-annual variability of the mismatch is essential to parameterize this model. Compared to type A, the type B correction is only able to remove the systematic mismatch, but it is less dependent on SARAH-2 quality and can be applied to any pair of measurements. In both cases, corrections can also worsen the validation metrics if the mismatch bias was offsetting the product bias. These results highlight the importance of evaluating (a) the uncertainty of the high-resolution data used for corrections, and (b) the uncertainty propagation through the transfer functions. Uncertainties of both satellite and in situ measurements are essential for this but, unfortunately, are rarely available.
x ins In situ measurement.
x sat Satellite measurement.
x hr High-resolution measurement used to characterize the mismatch differences.
δ tot Total difference between satellite and in situ measurements.
δ mis,i Additional difference due to the mismatch between two measurements in the domain i.
ext ins,i Extent covered by the in situ measurement in the domain i.
ext sat,i Extent covered by the satellite measurement in the domain i.
res min Resolution needed to characterize the mismatch in the domain i.
PDF mis Probability Distribution Function of the mismatch differences.
BIAS mis,i Bias introduced by the mismatch on the domain i.
SD mis,i Standard deviation of the differences introduced by the mismatch on the domain i.
MAD mis,i Mean Absolute Difference introduced by the mismatch on the domain i.
RMSD mis,i Root Mean Squared Difference introduced by the mismatch on the domain i.
x ins,cor In situ measurement corrected to the extent covered by the satellite measurement.
u tot Total standard uncertainty in the comparison between satellite and in situ measurements.
u ins Standard uncertainty of the in situ measurements.
u sat Standard uncertainty of the satellite measurements.
u hr Standard uncertainty of the high-resolution measurements.
u mis Additional standard uncertainty due to the mismatch between satellite and in situ measurements.
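
For reference, the four difference metrics listed above can be written out for a set of N differences δ_k. These are the standard textbook definitions; the exact normalization used in the paper (N or N − 1 in the standard deviation) is not stated in this section, so the 1/N form below is an assumption.

```latex
\begin{align}
\mathrm{BIAS} &= \frac{1}{N}\sum_{k=1}^{N}\delta_k, &
\mathrm{SD} &= \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(\delta_k-\mathrm{BIAS}\right)^2},\\
\mathrm{MAD} &= \frac{1}{N}\sum_{k=1}^{N}\left|\delta_k\right|, &
\mathrm{RMSD} &= \sqrt{\frac{1}{N}\sum_{k=1}^{N}\delta_k^{2}},
\end{align}
% With the 1/N normalization the systematic and random parts separate exactly:
\begin{equation}
\mathrm{RMSD}^2 = \mathrm{BIAS}^2 + \mathrm{SD}^2 .
\end{equation}
```

The last identity shows why a bias adds directly to the comparison while random differences enter only through their spread.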

Figure 1. Overall framework to estimate and correct the mismatch between satellite and in situ measurements based on the availability of high-resolution measurements (x hr).
were analyzed using SARAH-2 to correct the spatial mismatch between BSRN and NASA/GEWEX measurements. (a) No correction (x ref = SWD BSRN), (b) Type B correction based on SARAH-2 (x ref = SWD BSRN,corB), and (c) Type A correction based on SARAH-2 (x ref = SWD BSRN,corA).

Figure 2. Characterization of the spatial mismatch with SARAH-2 measurements: (a) Mean Absolute Difference (MAD) and (b) bias added by the spatial mismatch. The SARAH-2 0.05° × 0.05° pixel co-located with the station is used as a reference. Points show larger areas centered at the station (smoothing only). Error bars show areas not perfectly aligned with the station (smoothing + sampling differences, all possible alignments). Stations are sorted according to an increasing spatial mismatch. The bias does not change with the temporal resolution.

Figure 3. Sensitivity of the standard deviation of the spatial mismatch (SD mis,sp) to inter-annual changes in shortwave downwelling radiation (SWD): (a) temporal evolution of SD mis,sp and (b) change in SD mis,sp when increasing the number of years used to calculate the spatial mismatch. Both plots are based on monthly SWD measurements and evaluate the mismatch in an area of 0.25° × 0.25° centered at the station (smoothing only).

Figure 4. Sensitivity of the standard deviation of the spatial mismatch (SD mis,sp) to (a) seasonal and (b) daily cycles of shortwave downwelling radiation (SWD). The seasonal and daily cycles are evaluated with daily and hourly SWD measurements, respectively. Both plots show the mismatch in an area of 0.25° × 0.25° centered at the station (smoothing only).

Figure 5. Characterization of the temporal mismatch with 1-min BSRN measurements: (a) Mean Absolute Difference and (b) bias added by the temporal mismatch. The ±1 min interval centered at the satellite overpass is used as a reference. Points show wider intervals centered at the satellite overpass (smoothing only). Error bars show intervals not perfectly aligned with the satellite overpass (smoothing + sampling differences, all possible alignments). Stations are sorted according to an increasing temporal mismatch. The bias does not change with the temporal resolution.

Figure 6. Change in the Mean Absolute Difference (MAD) and bias between SARAH-2 and BSRN measurements when adding (a) a temporal mismatch (averaging in situ temporally, smoothing only) and (b) a spatial mismatch (averaging SARAH-2 spatially, smoothing only). Sonnblick (SON), Izaña (IZA) and Reunion Island (RUN) are excluded due to the high uncertainty of SARAH-2 at those sites. SWD = global horizontal irradiance, DHI = direct horizontal irradiance.

Figure 7. (a) Quality of SARAH-2 measurements for correcting the spatial mismatch: Mean Absolute Difference between SARAH-2 and BSRN (MAD SARAH2 BSRN). (b) Validation of NASA/GEWEX SRB using raw and up-scaled BSRN measurements. Stations are sorted from left to right according to decreasing spatial representativeness. Sonnblick (SON) and Izaña (IZA) are plotted separately due to the large magnitude of the metrics.

Figure 8. Validation of CERES-SYN 4A using raw and up-scaled BSRN measurements. Stations are sorted from left to right according to decreasing spatial representativeness. Sonnblick (SON) and Izaña (IZA) are plotted separately due to the large magnitude of the metrics.