Utility of thermal remote sensing for evaluation of a high‐resolution weather model in a city

Progress in high‐resolution numerical weather prediction (NWP) for urban areas will require new modelling approaches and extensive evaluation. Here, we exploit land surface temperature (LST) data from Landsat‐8 to assess 100 m resolution NWP for London (UK) on four cloud‐free days. The LST observations are directional radiometric temperatures with non‐negligible uncertainties. We consider the challenges of informative comparison between the Landsat LST and the NWP scheme's internal characterisation of the complete surface temperature. The LST spatial coverage allows large‐scale observation–model differences to be explored. In one case, obvious spatial artifacts in the NWP surface temperature are observed relative to the Landsat LST. These are found to be related to the NWP's initial method of downscaling of soil moisture using soil properties. Updated model runs have higher spatial correlation between model and Landsat LST. In cases where meteorological conditions favour the formation of horizontal convective rolls, warmer air temperatures associated with updraughts in the mixed layer extend inappropriately to the urban surface. This manifests as warm stripes in the model surface temperature that are not present in the Landsat LST. NWP–Landsat LST differences are larger in more built‐up areas on days nearer summer solstice. This is largely attributed to urban thermal anisotropy, as Landsat preferentially views warmer urban surfaces, whereas the model LST represents all surfaces. We evaluate two approaches to quantify this sampling effect, but further work is needed to fully constrain it and facilitate more informative model evaluation.


INTRODUCTION
High-resolution weather and climate modelling is of interest for day-to-day operations of cities and planning for future city conditions (Baklanov et al., 2018).Developing operational products for urban applications and services requires the development and evaluation of next-generation numerical weather prediction (NWP) models, with grid-cell resolutions of the order of 100 m being explored (e.g., Boutle et al., 2016;Lean et al., 2019).These scales pose new challenges as greater heterogeneity and complexity of urban form and properties are resolved, but they offer the potential to provide neighbourhood-scale information in support of a wide range of integrated urban services (World Meteorological Organization [WMO], 2019).
To deliver the higher resolutions required in urban areas, sub-kilometre-scale models are being developed (Joe et al., 2018) to nest within kilometre-scale models.
The verification of models in urban areas remains a challenge given the lack of routine observations located appropriately (Grimmond & Ward, 2021).Any WMO observation sites (WMO, 2018a) in urban areas are likely to be located within the urban canopy layer, rather than the inertial sublayer or constant flux layer (Tang et al., 2021).Standard WMO site observations, such as those located in urban parks, represent grassed areas rather than the mix of buildings and vegetation that occur in different neighbourhoods.Citizen science weather stations-for example, Netatmo (Chapman et al., 2017;Fenner et al., 2021) and WOW (Kirk et al., 2021)-and WMO (2018b) recommended urban sites may better represent the mix of land covers in their source areas (Coney et al., 2022;Cornes et al., 2019;Muller et al., 2015;Vesala et al., 2008).If urban canopy layer observations are used for model evaluation, appropriate downscaling of variables from the inertial sublayer to within the urban canopy layer is required (e.g., Blunn et al., 2022;Tang et al., 2021;Theeuwes et al., 2019;Wang, 2014).
Eddy covariance (e.g., Hertwig et al., 2020;Masson et al., 2002) and large-aperture scintillometry (Saunders et al., 2024) sensors allow measurements of fluxes in the inertial sublayer, but with restricted horizontal spatial coverage (Grimmond & Ward, 2021;Ward et al., 2014) relative to a city.Ground-based remote-sensing techniques, such as automatic lidars and ceilometers and Doppler wind lidars, allow vertical profiles to be evaluated, but they remain limited in their horizontal coverage.Dense deployment of such sensors is generally limited to campaigns lasting months to years-for example, in the urbisphere in Berlin (Fenner et al., 2022) or in PANAME in Paris (Kotthaus et al., 2023).
Infrared imagers on satellites provide routine widearea coverage during clear-sky conditions.There is a trade-off between return period and spatial resolution.A single Landsat satellite (with thermal resolution ∼100 m) returns every 16 days at best (Malakar et al., 2018;Parastatidis et al., 2017), but more frequent ∼1 km resolution observations have been used in urban areas (e.g., order 1 km; Leroyer et al., 2011).
Our objective here is to assess the use of highresolution (100 m scale) thermal remote sensing from Landsat in evaluation of hectometre-scale NWP in urban areas.In the study we use the Met Office research model with a 100 m resolution (Lean et al., 2019), hereafter UM100 (Section 2).The Met Office routinely runs the so-called London Model, with an order 300 m resolution, nested within the UK variable-resolution (UKV) Unified Model (UM; Boutle et al., 2016).Here, UM100 runs are nested within a 300 m model (similar to the London Model) to explore and to help develop next-generation modelling on hectometre scales.The UM100 requires land cover plan-area fraction from which to determine the urban form and model parameters (e.g., albedo, emissivity).Given the crucial role of land cover information for both modelling and remote sensing, we also use a reference dataset to characterise the study area (Section 2.2).Consideration is given to the retrieval of land surface temperature (LST) from Landsat-8 (Section 3.2) and the implications of thermal anisotropy in urban areas (Section 3.3).An initial analysis of the model-observation differences on one day reveals some model improvements that could be addressed (Section 4.1), enabling revised UM100 runs for further, more detailed evaluation (Sections 4. 1-4.5).This establishes the basic utility of thermal remote sensing in the evaluation of surface energy balance behaviour of hectometre-scale urban NWP models.Priorities for further work on methodologies for model-satellite comparisons are given in the concluding comments (Section 5).

UM100 model runs
The UK Met Office's non-hydrostatic UM has a semi-implicit semi-Lagrangian numerical scheme for the deep-atmosphere dynamics (Davies et al., 2005;Wood et al., 2014).In this study, UM version 12.0 is run in research mode with one-way nesting to the highest resolution of 100 m (Lean et al., 2019), hereafter UM100 (Table 1).Four clear-sky days (Supporting Information Table S1) are analysed using 30 hr model runs, with the first 6 hr from 1800 UTC treated as spin-up (Table 1).

TA B L E 1
Model grid resolutions and nesting.Figure 1 shows the areal extent of the three inner model domains.

Number of grid points
Vertical levels Note: Model start times are on the day before the case studies, allowing 6 hr spin-up.

F I G U R E 1 Model domains
extent for the UKV (outer map extent), UM300 (red), and the UM100 (blue) with building plus paved plan area fraction (colour bar) for the United Kingdom (Fuller et al., 1994) and Europe (Hollmann et al., 2013).Table 1 provides more details.
The Global Atmosphere and Land configuration 6.1 (Walters et al., 2017) is used for the outer global model (midlatitudes resolution 17 km; Table 1) to provide the boundary conditions for the UKV model (1.5 km resolution over the United Kingdom) (Tang et al., 2021).The UKV in turn provides the boundary conditions for the 300 m resolution model (UM300) (Boutle et al., 2016), from which the UM100 obtains its boundary conditions (Figure 1).The regional models use Regional Atmosphere and Land 3 (RAL3), which is the regional dynamics and physics scheme science configuration.Relative to the RAL2 baseline (Bush et al., 2022), the most significant evolution in RAL3 is the unification of the tropical and midlatitude configurations, with associated cloud and microphysics parametrisation updates (Bush et al., 2024;Field et al., 2023;Van Weverberg et al., 2021a, 2021b).
The UM100 80 × 80 km 2 domain has a 10-20 km buffer around Greater London (Section 2.2) to give the turbulent flow time to adjust to the higher resolution model grid (i.e., spin-up; Blunn, 2021; Lean et al., 2019).The subgrid turbulence is parametrised using a "scale aware" blending scheme (Boutle et al., 2014).The local three-dimensional Smagorinsky-Lilly turbulence scheme (Halliwell, 2007;Lilly, 1962;Smagorinsky, 1963) and the non-local one-dimensional planetary boundary layer (BL) scheme (Lock et al., 2000) are weighted, such that the local scheme is preferred with decreasing horizontal grid length.
The Joint UK Land Environment Simulator (JULES) version 6.1 land surface model (Best et al., 2011;Clark et al., 2011) urban option MORUSES (Bohnenstengel et al., 2011;Porson et al., 2010aPorson et al., , 2010b) ) is used.MORUSES has improved timing and amplitude of the sensible heat flux in central London (Bohnenstengel & Hendry, 2016;Hertwig et al., 2020;Warren et al., 2018), with respect to the Best (2005) urban option used by Lean et al. (2019).Anthropogenic heat flux is assumed to vary monthly but with no diurnal or daily variations (Hertwig et al., 2020).The values are derived from monthly mean energy consumption for 1995-2003(DUKES, 2003)).The flux is assumed to vary with built (or impervious) fraction F I G U R E 2 Greater London (Figure 1 shows location in the United Kingdom) land cover fractions at 100 m resolution from two sources (Table 2), with higher fractions having darker shading.The mean fraction  is given in each case.Note UM100 bare soil fractions for London Heathrow Airport (LHR, circled) compared with VL2010 paved.between 16.7 and 26.2 W⋅m −2 (for 100% impervious grid cell) for different months of the year.

Study area
Greater London (Figure 2), covering 1,569 km 2 , experiences a humid warm temperate climate (Cfb) (Kottek et al., 2006).To characterise the UM100 model domain (Figure 1), each grid cell is assigned the relevant land cover fraction for the different JULES tile types (Table 2).
Operationally, the Met Office uses the so-called Institute of Terrestrial Ecology (ITE; now Centre for Ecology and Hydrology) surface classes derived from Landsat-5 Thematic Mapper data (Fuller et al., 1994).For MORUSES (Section 2.1), the "urban" ITE class is split into canyon and roof using empirical relations (Bohnenstengel et al., 2011), and other JULES tiles (vegetated, bare soil).A number of issues arise from use of the ITE dataset: it is outdated (Bunce et al., 1990) relative to the continually changing urban landscape; it has coarse resolution (25 m) relative to the sizes of buildings, roads, and trees; and the Bohnenstengel et al. (2011) relations are applied at a very different resolution (100 m) from their development (1 km).
To explore this further, the UM100 surface cover is compared with a more recent, higher resolution dataset, hereafter referred to as VL2010 (Lindberg & Grimmond, 2011a, 2011b;Lindberg et al., 2018).Based on 1 m observations aggregated to 4 m, VL2010 has five land cover classes.Here, all trees and shrubs are collapsed into one class (Table 2).To compare the datasets (UM100, VL2010) the projections are first made consistent.This involves reprojecting the ITE data onto the rotated pole grid of the UM and making the necessary datum correction.
Comparison of the land cover datasets (Figure 2) indicates the UM100 canyon fraction is overestimated and the roof fraction is underestimated, particularly outside the city centre.The UM100 trees-plus-shrubs fraction is underestimated for the Greater London area.Bare soil TA B L E 2 Surface cover fractions are used to weight the JULES calculated variables for each tile within a UM100 grid cell.The surface cover is compared with a reference dataset (VL2010)- Lindberg and Grimmond (2011a) and Lindberg et al. (2018) provide details-which combines classes as indicated.

UM100 surface temperature
The UM100 surface temperatures are derived from the JULES surface energy balance fluxes (Best et al., 2011) for each tile (Table 2) with the canyon and roof tiles using the MORUSES option (Best et al., 2011;Porson et al., 2010a).Tile emissivities  s,t have temporally and spatially fixed values except for the canyon tile, which varies with canyon geometry using fixed material values for wall and road.
The grid-cell surface emissivity  s is a function of the tile surface fractions f t : The surface emissivity in the UM100 is compared with the 100 m resolution Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Emissivity Dataset (ASTER GEDv3; Hulley et al., 2015) used in the Landsat LST retrieval (Section 3.2) on the UM grid (Supporting Information Figure S2).The mean UM100 emissivity (0.964) is lower than ASTER GED (mean of 0.972).The Landsat LST sensitivity to this difference is ∼0.4-0.6K for the four days studied.Unrealistically low emissivity values (∼0.90) for LHR come from the high bare-soil fractions assigned (Figure 2).
The grid-cell value surface temperature T S is a function of the tile surface fractions and the tile surface temperatures T S,t : (2) The hourly model temperatures are linearly interpolated in time to the Landsat overpass time (∼1100 UTC, Supporting Information Table S1) to give "UM100 LST".The Landsat and UM100 LST are compared within the VL2010 domain (Figure 2) with a common mask.The UM100 minus Landsat LST difference is designated ΔLST.In Figures 5 and 6 (Section 4.1), maps of ΔLST are presented with the mean ΔLST subtracted (zero-mean differences) to highlight spatial patterns of variability.
The London overpass time is around 1100 UTC for the dates relevant to this study (Supporting Information Table S1).Landsat-8 carries two sensors: the Operational Land Imager with nine spectral bands in the visible to shortwave infrared wavelengths; and a thermal infrared sensor (TIRS) with two spectral bands in the atmospheric window (10.6-11.2μm and 11.5-12.5 μm) (USGS, 2021a).The Landsat Collection 2 Level-1 data products (USGS, 2021b) including the operational land imager data (nominal resolution 30 m) are used to assess cloud cover and to estimate surface albedo.To identify candidate clear-sky days, the Landsat total cloud cover fraction cloud mask derived from the quality assurance band (USGS, 2021b) is initially used, followed by assessment of the morning cloud optical depth estimated from hourly UKV profiles of liquid and ice water mass fractions.Four study dates are selected (Supporting Information Table S1) because of their near-zero (<4%) total cloud cover in the Landsat image and very low total cloud optical depths (<0.05) from 0900 UTC until the observation time across the UM100 domain.
The Landsat data are transformed from the WGS84 coordinate reference system to the British National Grid, then reprojected and interpolated to the rotated pole UM100 grid.Supporting Information Figure S3 shows a false-colour visible image, cloud mask, and band 10 (TIRS) brightness temperature on one of the study dates (July 15, 2018) after transforming the Landsat data to the UM100 grid.

LST retrieval and assessment
Here, we apply a single-channel radiative-transfer-based algorithm to retrieve LST from Landsat-8 TIRS data (Table 3, Figure 3).Band 10 (10.6-11.2μm) is used as it has the lower calibration uncertainty of the two thermal bands (USGS, 2019a).Its nominal resolution is 100 m (USGS, 2021a), and a "real" resolution of around 115 m has been estimated (USGS, 2019b).The radiative transfer model (RTM) RTTOV (Saunders et al., 2020) is used to calculate the downwelling and upwelling radiance and the atmospheric transmittance for calculating the blackbody radiance from the land surface B  (T LST ) using t s , (3) where L TOA is the top-of-atmosphere radiance observed by Landsat, B  (T i ) is the blackbody radiance from level i, t i is the level i to space transmittance, t s is the surface to space transmittance, and  is the surface emissivity obtained from ASTER GEDv3 (Hulley et al., 2015).The mean of ASTER band 13 (10.25-10.95μm) and band 14 (10.95-11.65 μm) emissivity is used as the Landsat band 10 emissivity.Atmospheric profiles of temperature, water vapour, carbon dioxide, and ozone from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5; Hersbach et al., 2020) interpolated to the Landsat observation times and locations are provided as input to RTTOV (Table 3).The land surface blackbody radiance is converted to LST using a look-up table with 0.01 K precision, used to obtain the LST by interpolating between the radiance values closest to B  (T LST ).
To verify the retrieved LSTs and gauge their uncertainty, we compare the retrieved LST with two Landsat LST products based on different algorithms (Table 3) that also use ASTER GEDv3 emissivities.MODTRAN4 (Berk  3).FORTH uses over 4,000 simulations with different atmospheric profile inputs (Jimenez-Munoz et al., 2009, 2014), allowing single-equation operational global-scale LST.JPL (Cook et al., 2014) uses Goddard Earth Observing System Model Version 5, Forward Processing for Instrument Teams (GEOS-5 FP-IT) reanalysis data (Lucchesi, 2013) and L2 top-of-atmosphere reflectance data.
The mean difference  is removed to highlight spatial patterns of variability of the three retrievals (Figure 4).The size of the mean LST differences (−1.6 to +0.4 K) and their standard deviations (, 0.2-0.6K) are of the order expected given the radiometric uncertainty from Landsat-8 (Gerace & Montanaro, 2017).Differences are indicative of the uncertainty arising from the different retrieval methods, with the median uncertainty in LST from the surface emissivity uncertainty being ∼0.3 K. Other uncertainty arises from different atmospheric profile data used (Table 3).Visible discontinuities in the JPL-retrieved LST differences occur due to changes in the JPL retrieval input profiles at the grid-cell boundaries at 51.5 • N (Figure 4c) and 0 • longitude (Figure 4f).The total uncertainty (∼1 K) is an order of magnitude less than the observed spatial and temporal variability.The Pearson correlation coefficient calculated between the three datasets is above 0.97 on both days (Supporting Information Table S2), again indicating uncertainty in the observed LST is much smaller than the spatial variability.
From comparison of the three Landsat LST retrievals, we conclude that model-observation differences (Section 4.1) greater than the observed ∼1 K differences between retrievals (Figure 4) can be attributed to the NWP model (e.g., physics, ancillaries) and sampling effects (thermal anisotropy, Section 3.3), rather than observation errors.

Comparison of Landsat with UM100 LST
The viewing zenith angle of Landsat varies from 0 • (nadir) to ±8.6 • .This near-nadir view results in near-horizontal facets being relatively heavily weighted in the sensor field of view.As vertical facets (i.e., walls) make up a large fraction of the total surface area in cities, the "complete" surface temperature should account for them in both observations and models (Voogt, 2000;Wang et al., 2022).However, neither satellite views nor urban land surface models do this, and they differ in their facet weightings and definitions.In late morning under clear skies, near-horizontal buildings' facets (e.g., roofs) can be significantly warmer than walls that have been shaded within urban canyons (Morrison et al., 2021).This qualitative understanding of the sampling or definitional difference due to thermal anisotropy predicts that Landsat will be warmer than the model LST on sunny days in more urbanised areas.
There are two logical approaches to minimise this definitional difference: extract a nadir-view LST from the model or obtain a complete surface temperature T c from Landsat.The former approach is challenging, as the canyon tile combines paved surfaces (e.g., roads) and walls requiring the horizontal (paved + roofs) and vertical (walls) facet temperatures to be separated before combining with appropriate vegetated surface values.Here, we consider the latter approach, estimating T c from the Landsat LST T r based on geometric and meteorological information (Yang et al., 2020) (Section 4.3).
We also compare LST retrieved from the moderate resolution imaging spectroradiometer (MODIS) on one day (July 15, 2018) also to the UM100 (Section 4.4).The MOD11A1 Version 6.1 Level-3 1 km resolution gridded LST product (Wan et al., 2021) is obtained using a generalised split-window LST algorithm (Wan & Dozier, 1996).As with the Landsat data, the MODIS LST data are transformed from their native (sinusoidal) projection to the British National Grid, then transformed to the UM100 grid.On this study day, MODIS observed London with an oblique view of ∼52 • at 1200 UTC, resulting in a greater weighting of vertical facets and therefore an LST more representative of T c (Jiang et al., 2018).However, the MODIS spatial resolution (1 km) is much coarser than Landsat (∼100 m).

RESULTS AND DISCUSSION
Comparison of the UM100 and Landsat LST identified issues related to surface ancillary data (treatment of soil moisture, leaf area index, and spatial mismatch of surface mapping due to incorrect co-ordinate reference systems).
Here, we present UM100 runs illustrating the soil moisture issue only.All runs shown here used RAL3 (Section 2.1) after leaf area index and spatial mismatch issues were addressed, having been highlighted in earlier (RAL2) runs and comparisons.In the "initial" runs shown here, soil moisture issues remain, and in the "final" runs the soil moisture treatment is updated.

Spatial assessment of UM100 LST
Initial model run results for July 15, 2018, show an overall model cold bias with mean difference in LST of 6.4 K (Figure 5).There is a clear lack of spatial agreement between the observed and modelled LST: the high UM100 LSTs above 315 K in the southeast of the domain do not have a corresponding analogue in the observations, and the standard deviation of the differences is high, at 3.4 K.
The high UM100 LSTs are associated with a "blocky" LST pattern that also extends more subtly south and northeast of Greater London (Supporting Information Figure S4d) and correlates with low model soil moisture (Supporting Information Figure S4a).Throughout the UM100 domain, low and high soil moisture patterns correlate with high and low LST respectively (Supporting Information Figure S4).As these LST patterns are not found in Landsat, we conclude that the soil moisture patterns are unrealistic.
In the initial model run, UM100 soil moisture (Supporting Information Figure S4a) is calculated by downscaling the global model soil moisture stress using 100 m resolution soil ancillaries (critical point and wilting point).The 100 m critical point and wilting point fields contain sharp gradients, with regions of low critical point and wilting point correlating with regions of low UM100 soil moisture.To ameliorate the issue, final model runs use the global critical point and wilting point linearly interpolated to the 100 m grid, resulting in a smoother soil moisture field without regions containing low values (Supporting Information Figure S4b).
The final model runs (Figure 6) show that the southeast hotspot in July 2018 is largely removed (cf. Figure 5).This results in a spatial distribution of UM100 LST closer to the observations, with reduced standard deviations of the differences from 3.4 K to 3.0 K.However, the mean difference increases from −6.4 K to −6.8 K.This can be attributed to an average soil moisture increase in the new model runs (Supporting Information Figure S4c).The October ( = −2.6K) and November ( = −0.1 K) cases have a relatively small overall model-Landsat bias, whereas a strong cold bias is present in the April ( = −6.3K) and July ( = −6.8K) cases.The exact cause of this seasonal variation is not explored here, but it is partly associated with April and July being closer to the Northern Hemisphere summer solstice, when there is more short-wave radiation.In such circumstances, larger energy fluxes and temperature gradients develop between the land surface, atmosphere, and soil, with the potential for greater errors.The standard deviations of the differences are also much smaller in the October (0.9 K) and November (0.9 K) cases compared with the April (2.0 K) and July (3.0 K) cases.Although each day has different spatial patterns in the differences, larger positive values tend to occur in the more F I G U R E 7 Variation of median (line) and interquartile range (shading) ΔLST (equal to UM100 minus Landsat) with built (paved + building) fraction on four days (key).Built fractions are from the VL2010 dataset (Table 2), with grid cells with a water fraction greater than 0.25 removed from this analysis.vegetated areas, particularly on the outskirts of Greater London compared with the more densely built-up interior (Section 4.3).The diagonal stripes visible (Figure 6) in the UM100 LST but not Landsat in April 2020 are discussed in Section 4.5.

Comparison to air temperature
As a general assessment of the final runs, we use 1.5 m air temperature observations (Met Office, 2012) from five sites (Heathrow Airport, Kew Gardens, St James Park, Kenley Airfield, Northolt) across Greater London.We find a model air temperature cold bias of −0.6 to −4.1 K across the four case study days.This is consistent in terms of sign with the model-Landsat LST cold bias; however, across dates and sites there is considerable scatter in these air-surface bias results (Supporting Information Figure S5).This reflects the different processes controlling the LST (cf.air temperature) bias, which likely includes critical impacts from model-Landsat LST definitional differences (Section 3.3).The relative model-Landsat LST cold bias is explored further in Sections 4.3 and 4.4.

Built fraction
To better understand the controls on the spatial distribution of model-observation differences (ΔLST), differences are analysed against the built-area (paved + building) fractions from VL2010 (Figure 2).The dependencies are shown for the four days in Figure 7.The UM100 LST becomes increasingly cold relative to Landsat with greater built fraction (lower vegetated fraction) on all four days.This trend is much stronger on the two days (July 15, 2018 andApril 22, 2020) with the larger mean ΔLST.This indicates that there is a larger temperature contrast between built and highly vegetated areas, such as parks, in the Landsat imagery relative to UM100 on these two days.This tendency for larger LST differences with increasing built fraction may be attributed to the sensor field of view or sampling effect of Landsat described in Section 3.3.To mitigate this effect, we estimate the complete surface temperature T c from the Landsat LST T r , adding to T r the adjustment from Yang et al. (2020).
The Yang et al. (2020) adjustment comprises a model based on three-dimensional simulations of urban surface facet temperatures for different surface geometries and properties, meteorological conditions, and solar angles.In their study area (Hong Kong), daytime differences between T c and T r reach up to 10 K, varying with geometric characteristics such as plan area fraction of buildings  P (Supporting Information Figure S1) and wall facet area to building footprint area fraction F, and meteorological conditions (viz., solar irradiance, wind speed).Yang et al. (2020) report greater differences between T c and T r in more built-up areas (i.e., larger both  P and F) and with higher solar irradiance.
We calculate the Yang et al. ( 2020) adjustment (T c − T r ) for London.The magnitude and sign of the adjustment varies between days (Figure 8a), because of differences in mean LST, solar irradiance, and solar geometry.The adjustment becomes cooler at higher built fractions.After applying this correction, the differences between ΔLST on each day (Figure 8b) are greatly reduced and the overall trend with built fraction for July 15, 2018, and April 22, 2020, reduces and changes sign.However, a model cold bias of 3-7 K (median) for each day remains.
Our use of the Yang et al. (2020) empirical method to estimate T c from the Landsat LST T r entails several caveats: impacts of vegetation on temperature are not accounted for in this model; their parameter ranges do not cover all our cases (e.g., low winter sun angles at London latitudes); the material types and canopy geometries considered in Hong Kong are likely to be different to London, F I G U R E 8 As Figure 7, but using the complete temperature T c derived using the Yang et al. (2020) empirical adjustment added to the Landsat LST T r .(a) The values of the adjustment.(b) UM100 LST (T UM100 ) minus T c (shown with the same vertical scale as Figure 7).which will affect the relation between T c and T r .More work is needed to obtain locally applicable parameters for the London area.

Dominant land cover
To explore the role of model land cover type we analyse ΔLST (T UM100 − T r ) for the grid cells with one "dominant" surface type with a fraction >0.5 (Figure 9).On July 15, 2018, and April 22, 2020, there are consistently larger ΔLST (Figure 7) and larger median ΔLST for all four VL2010 built and vegetated classes (Figure 9).There are small differences up to 0.4 K on all four days between the median ΔLST for the two built classes (vertical red and gold lines, Figure 9).This implies that the model-observation bias is relatively insensitive to whether a surface is predominantly roof or paved.The differences in median ΔLST are larger between the two vegetated classes between days (0.4-3.0 K), with the July 2018 case having a median ΔLST of −4.1 K for trees and shrubs but −7.1 to −7.7 K for grass, roof, and paved classes (Figure 9b).The median ΔLST is smallest for trees and shrubs on all days, except when the differences are very small (<0.2 K) for all classes on November 17, 2017 (Figure 9d).The areas with trees and shrubs as the dominant class also tend to have the lowest Landsat and model LST (Figure 6).However, the definitional difference between the Landsat LST and the model analogue complicates the interpretation of these results.If we take ΔLST = T UM100 − T c (Supporting Information Figure S6), the differences between the median ΔLST for the two built classes increase to 3.5-3.9K.The median ΔLST becomes much smaller for roofs on all days except November 17, 2017 (Supporting Information Figure S6c), with much smaller changes for the other three classes.This is because the areas where roofs dominate occur mainly in central London (Figure 9a), where larger differences between T c and T r are expected linked to their larger  P and F values.

Assessment of UM100 LST with MODIS
Comparison of the MODIS LST with the updated model run on July 15, 2018 (Figure 10) at 1200 UTC has both a much smaller mean difference (−0.7 K cf.−6.8 K for UM100-Landsat) and smaller standard deviation (2.6 K cf.3.0 K).This smaller mean difference is largely explained by MODIS's oblique view angle of ∼52 • and the relation between the satellite and solar azimuth angles.Simulations of thermal anisotropy in central London on August 27, 2017 (Morrison et al., 2023), relative to a nadir view, show the highest LST is obtained when the satellite and solar azimuth angles align, whereas the lowest LST is obtained when these angles are 180 • apart, as the satellite will view more shaded facets.This effect increases as the satellite view angle increases away from nadir.On July 15, 2018 (Figure 10), the MODIS view azimuth angle of −67 • is 116 • from the solar azimuth angle  s of +177 • .As this sun-sensor azimuth angle difference exceeds 90 • , when combined with its oblique view angle, MODIS preferentially views shaded (cf.sunlit) facets, resulting in a significantly lower LST than Landsat's near-nadir (<5 • ) view.Jiang et al. (2018) found, based on a single satellite view angle, the closest match to the complete urban surface F I G U R E 9 Analysis of ΔLST (UM100 minus Landsat) for UM100 grid cells with one dominant land cover type (>0.5, based on temperature is obtained with a satellite azimuth angle of  s ± 90 • and zenith angle of 45-60 • .This suggests that the MODIS LST is more representative of the complete surface temperature than Landsat in this case.Given the MODIS-solar azimuth difference, a small cold bias relative to the complete surface temperature is expected instead of the warm bias for Landsat. Despite the similar mean LST, there are clear differences in the spatial distribution between MODIS and UM100.The spatial distribution of the differences appears similar to the corresponding UM100-Landsat differences (Figure 6f).The relative model warm bias to the east of the domain is still present, and there is a noticeable ∼5 K warm bias in central London.

Horizontal convective rolls
On April 22, 2020, diagonal stripes are evident in the UM100 but not the Landsat LST (Figure 6), suggesting that the model is unrealistic in this regard.The stripes are caused by horizontal convective rolls (HCRs) being resolved in the UM100 (Figure 11).HCRs are counter-rotating horizontal turbulent vortices within the convective BL (AMS, 2022;Etling & Brown, 1993;Khanna & Brasseur, 1998;LeMone, 1973;Weckwerth et al., 1999;Young et al., 2002).HCRs span the depth of the BL (z i , defined here by the height a moist parcel rising adiabatically becomes negatively buoyant), which is ∼600 m for the UM100 (Figure 11b) north-south cross-section in the centre of the domain (Figure 11a).The UM100 HCRs have narrow (∼0.5z i , warm, positive vertical velocity) updraughts and broad (∼1.5zi , cool, negative vertical velocity) downdraughts (Figure 11b,e) parallel to the eastnortheast (246 • ) mean BL flow.When the updraughts reach the capping inversion, they spread perpendicular to the mean flow (Figure 11c,d), before descending.The updraughts extend throughout the BL down to the surface, causing the high LST stripes.We argue the UM100 having HCRs is correct, but the LST stripes should not be present in the UM100.The latter is supported by the Landsat LST.
There is evidence HCRs are more likely to occur in urban than rural areas due to their high surface roughness (Miao & Chen, 2008).With increasing BL wind shear, convective updraughts have a greater tendency to align with the flow, and for −z i /L < 15-20, HCRs form (Salesky et al., 2017).The Obukhov length L (Garratt, 1992), is a measure of atmospheric stability.The variables are the friction velocity u * , air temperature T, density of the air , specific heat capacity of air at constant pressure c p , and the sensible heat flux Q H ; in addition,  is the von Kármán constant (0.4) and g is the acceleration due to gravity.These variables are determined at the surface interface with the first level of the model, which is assumed to be at a height equivalent to the roughness length + displacement length + 2 m.For the UM100 we calculate z i /L to be −4.2 in the region plotted in Figure 11.This value is consistent with there being well-developed HCRs in the UM100.The UM100 LST stripes are ∼300 m wide and have ∼2 • C temperature anomalies (Figure 11a), so it is expected that they would be detected by the ∼115 m resolution Landsat LST.We therefore conclude that the LST anomalies associated with HCRs in the UM100 are likely unphysical (despite the correct presence of HCRs).Surface layer turbulence is smaller scale than HCR turbulence in the mixed layer, and therefore becomes increasingly less well resolved towards the surface in the UM100 (Figure 11e).High air temperatures associated with the HCR updraughts are extended too far towards the surface by the one-dimensional BL scheme (Boutle et al., 2014;Lock et al., 2000) (Figure 11b), warming the surface.It is likely that this BL turbulence "grey zone" (Honnert et al., 2020;Wyngaard, 2004) issue would be observed in other O(100 m) horizontal grid length NWP models that use one-dimensional BL schemes to parametrise near-surface turbulence.
It is possible that there are <100 m LST anomalies that are not resolved by the UM100 or Landsat.Airborne and tower-based thermal infrared sensors have observed ∼2 and 0.5-1 • C grassland LST anomalies, associated with surface layer streak turbulence (Derksen, 1974;Garai et al., 2013;Schols et al., 1985).For a convective BL with conditions suitable for HCRs (−z i /L = 11), over a grassy field, Garai et al. (2013) estimated LST patterns associated with surface layer streaks to have a streamwise length of 149 m and spanwise width of 69 m.They found that near-surface air temperature is highly correlated with LST and demonstrated that LST patterns move at the wind speed in the upper surface layer.Order 10 m resolution is required for satellite remote sensing to detect LST patterns associated with surface layer streaks.Whether such patterns occur in urban areas is a research question, since urban surface roughness modifies the near-surface turbulence structure (cf.smoother surfaces of the aforementioned studies) (Blunn et al., 2022), and thermal inertia is higher for urban than for vegetated surfaces, meaning that LSTs might respond too slowly to the transient surface-layer turbulence temperature anomalies.

CONCLUSIONS
With the next generation of NWP models moving to higher resolutions (O(100 m)), new methods to evaluate these models are being explored.In this study we investigate the use of high-resolution thermal imagery from Landsat-8 to assess model performance across Greater London.We conclude that currently there are important benefits to using this data: • Extensive spatial coverage allows large-scale evaluation of observation-model differences not possible using small area extent observations.
• The high spatial resolution (∼115 m) means that individual neighbourhoods and land cover variability are well resolved, enabling their representation in hectometric grid length NWP to be evaluated at commensurate scale.
• Spatial patterns visible in the UM100 model data that are not apparent in the Landsat imagery can be identified, allowing model configuration updates and areas for future model development to be identified.These include, in particular: i. LST patterns related to the prescription of soil properties and their use in downscaling soil moisture are found on one study day (July 15, 2018).As the NWP model resolution is much higher than before, the consequences not previously considered become apparent.Modifying model configurations, allowed the spatial pattern to be more realistic, but with a larger mean difference.ii.Erroneous ∼300 m wide stripes identified in the UM100 on one day studied (April 22, 2020) are not evident in the Landsat imagery.
We note certain limitations of using the satellite LST data: • Single-view satellites sample the surface temperature from a particular angle.For near-nadir field of view observations, near-horizontal facets (e.g., roofs, roads) are heavily weighted while vertical facets are not observed.Ideally, urban canopy modelling would represent the "complete surface" temperature but of the wide variety of approaches taken (e.g., Grimmond et al., 2009Grimmond et al., , 2011;;Lipson et al., 2023), few model all the facets fully because of the computational time being impractical for NWP.Other object resolving models do exist, which can address the surface more fully, but these are computationally orders of magnitude more demanding and/or have limitations of the processes resolved (e.g., radiative transfer only).Hence, there are two definitional differences to be understood: the sampling of the real-world measurement's field of view and the NWP model simplification of the urban surface.Notably, when a model takes a tile approach the real-world interactions between facets complicates the comparison further as most model grid cells have multiple tile types rather than individual facets (e.g., roofs).While this does not prevent important insights from this comparison, going forward it is critical to reduce definitional uncertainties in model-observation comparison, while improving model skill, to provide a stronger constraint on target behaviour for the model in cities.
• In urban model evaluation, other variables such as surface air temperature and humidity are typically of interest; however, these variables cannot be retrieved from satellite data.Hence, the greatest model benefits may result from combining satellite LST data with other observation sources (e.g., surface energy balance observations).
• Cloud-free conditions are required for satellite LST retrieval, limiting the availability of satellite based LST data for model evaluation.
To address the first of these limitations, and thereby to maximise the value of incorporating urban satellite LST in NWP, both data assimilation developments and model parametrisations are required.Where a model has the necessary internal variables, the preferred approach in data assimilation is to modify the model variables given the constraining observation using an observation operator, which quantifies the observations expected given a model state (e.g., Warren et al., 2018).In the absence of an observation operator, the alternative approach is to attempt to infer from the observed temperature and geometric information the modelled LST.Two preliminary studies provide evidence that this may be helpful.Morrison et al. (2023) demonstrate that, with detailed analysis, the difference between the satellite LST and other representations of temperature can be assessed using three-dimensional radiative transfer modelling, detailed high-resolution surface infrared temperature observations, and very detailed digital models.This approach could provide the basis for an observation operator but is computationally expensive.The Yang et al. (2020) methodology for adjusting the satellite temperature for its geometry is much simpler to deploy, but the empirical parameters currently only exist for one low-latitude city with its specific urban morphology.We have demonstrated the importance of accounting for satellite view angles when using satellite-derived LST data for urban model evaluation.An area needing further research is the development of a fast observation operator that maps between different definitions of temperature given satellite-solar geometry and statistics of the urban landscape.
NWP requires both the model parameters (e.g., albedo, emissivity, land cover, morphology) and the model physics to be correct for successful model performance.This study has provided clear suggestions where model improvement will be beneficial.These relate not just to the model physics but also to the model parameters used.Improved model parameters that are representative of current surface cover, derived using data that resolve the heterogeneous mix of urban surface features appropriately, are essential for assessing the model physics.Improvement of many model parameters (e.g., surface cover data) can also be achieved using other remote sensing data products (e.g., ESA World-Cover) at resolutions appropriate for next-generation NWP and climate modelling.

TA B L E 3
Abbreviations: ERA5, European Centre for Medium-Range Weather Forecasts Reanalysis v5; FORTH, Foundation for Research and Technology Hellas; GEOS-5 FP-IT, Goddard Earth Observing System Model Version 5, Forward Processing for Instrument Teams; JPL, Jet Propulsion Laboratory; NASA, National Aeronautics and Space Administration; NCEP-NCAR, National Centers for Environmental Prediction-National Center for Atmospheric Research; RIT, Rochester Institute of Technology; RTM, radiative transfer model.

F
I G U R E 3 Summary of Landsat land surface temperature (T LST ) retrieval algorithm with inputs/outputs (blue parallelograms) and processes (orange rectangles) indicated, where L ↑ and L ↓ respectively refer to the upwelling and downwelling long-wave radiance and t is atmospheric transmittance.Abbreviations: ASTER, Advanced Spaceborne Thermal Emission and Reflection Radiometer; ERA5, European Centre for Medium-Range Weather Forecasts Reanalysis v5.et al., 1999) is used in both the Foundation for Research and Technology Hellas (FORTH) and Jet Propulsion Laboratory (JPL) algorithms but with different inputs (Table

F
I G U R E 4 Land surface temperature (LST) products (Table3) for (a-c)July 15, 2018, and (d-f) October 10, 2018, retrieved using (a, d)    this study and the differences (b, e) Foundation for Research and Technology Hellas (FORTH) minus this study and (c, f) Jet Propulsion Laboratory (JPL) minus this study.Panels (b), (c), (e), and (f) are zero meaned; that is, the mean difference  is subtracted before plotting.The mean difference  and standard deviation  values are given in each case.Statistics are calculated after applying a common spatial mask to each dataset.

F
I G U R E 5 Land surface temperature (LST) on July 15, 2018, from (a) Landsat and (b) UM100 (initial run), and (c) (UM100 − Landsat) zero-mean (mean subtracted) difference, with mean difference  and standard deviation  calculated for the common area in each dataset.

F
Land surface temperature (LST) on four days from (a, d, g, j) Landsat and (b, e, h, k) UM100 (final runs), and (c, f, i, l) their zero-mean (mean subtracted) difference, with mean difference  and standard deviation  calculated for the common area in each dataset.

F
I G U R E 10 Comparison for July 15, 2018, at 1200 UTC of land surface temperature (LST) from (a) MODIS and (b) UM100, and (c) their zero-mean difference (UM100 minus MODIS), with the mean difference  and standard deviation  calculated for the common area of the two datasets.

F
I G U R E 11 UM100 model output for April 22, 2020, at the Landsat observation time (1058 UTC): (a) model domain surface temperature with ∼11 km line showing where the following vertical cross-sections are located; (b) air temperature (dashed line indicates boundary layer height z i ); (c) U, (d) V, and (e) W wind components.Horizontal ticks in (b)-(e) are spaced 1 km apart.