Global Variability of Simulated and Observed Vegetation Growing Season

Vegetation phenology and its variability have substantial influence on land‐atmosphere interaction, and changes in growing season length are additional indicators of climate change impacts on ecosystems. For these reasons, global land surface models are routinely evaluated in order to assess their ability to reproduce the observed phenological variability. In this work, we present a new approach that integrates a wider spectrum of growing season modes, in order to better describe the observed variability in vegetation growing season onset and offset, as well as assess the ability of state‐of‐the‐art land surface models to capture this variability at the global scale. The method is applied to the Community Land Model version 4.5 (CLM4.5) simulations and LAI3g satellite observation. The comparison between data and model outputs shows that CLM4.5 is capable of reproducing the growing season features in the Northern Hemisphere midlatitude and high latitude, but also displays its limitations in areas where water availability acts as the main driver of vegetation phenological activity. Besides, the new approach allows evaluating land surface models in capturing multigrowing‐season phenology. In this regard, CLM4.5 proves its ability in reproducing the two‐growing‐season cycles in the Horn of Africa. In general, the new methodology expands the area of analysis from northern midlatitude and high latitude to the global continental areas and allows to assess the vegetation response to the ongoing climate change in a larger variety of ecosystems, ranging from semiarid regions to rain forests, passing through temperate deciduous and boreal evergreen forests.


Introduction
Many studies have shown a large influence of plant phenology and its variability on land-atmosphere interactions (e.g., Buermann et al., 2001;Chen et al., 2016;Collalti et al., 2016;Dahlin et al., 2015;Keenan et al., 2014;Lawrence et al., 2011;Richardson et al., 2013) and identified changes in vegetation growing season as appreciable indicators of climate change (e.g., Collalti et al., 2018;Jeong et al., 2011). For example, Richardson et al. (2013) show that many vegetation-climate system feedbacks triggered by changes in local microclimate, surface roughness, or surface albedo, might be driven by modifications of vegetation phenology. Keenan et al. (2014) find a negative feedback among temperature, phenology, and carbon generated by a prolongation of the phenology active season. Besides, an increasing knowledge on land biogeochemical processes has recently led to the development of a new generation of land surface models, able to represent explicitly carbon and nitrogen land cycles, plant phenology, and surface processes at global scale (Goll et al., 2012;Lawrence et al., 2018;Oleson et al., 2013;Zaehle & Friend, 2010). These land surface models simulate the starting and ending times of vegetative season as a function of day length, soil humidity, temperature, and precipitation (e.g., Oleson et al., 2013). Since the phenological variability is crucial for climate studies, these global surface models need to be routinely evaluated in order to assess their ability in reproducing the observed land conditions (e.g., Murray-Tortarolo et al., 2013;Randerson et al., 2009).
Vegetation phenology can be assessed by taking into account many plant features and indexes, especially leaves color, described by means of many indexes such as the normalized difference vegetation index (NDVI), the enhanced vegetation index, the green chromatic coordinates, the soil-adjusted vegetation index (SAVI), or the soil-adjusted and atmospherically resistant vegetation index (e.g., Churkina et al., 2005;Jeong et al., 2011;Huete, 1988;Huete et al., 1994;Keenan et al., 2014;Sonnentag et al., 2012;Xu et al., 2013;Zhang et al., 2003), and canopy density, described by the Leaf Area Index (LAI, e.g., Kang et al., 2003;Murray-Tortarolo et al., 2013). Moreover, the benchmark of vegetation phenology is often performed in middle and high-latitude regions in the Northern Hemisphere (NH), areas covered by large and uniform forests, which are usually well reproduced in state-of-the-art land surface models (e.g., Jeong et al., 2011;Xu et al., 2013;Murray-Tortarolo et al., 2013). However, the skill of land surface models in the representation of global biogeochemical cycles, especially in areas characterized by complex vegetation structure, need to be carefully proven.
The present study describes a new approach, which integrates and expands previous methods aimed at investigating phenology variability at the global scale. This method, then, can be used for benchmarking land surface models phenology. Differently from Zhang et al. (2003), the approach presented here applies specifically on LAI, and it is based on the method recently discussed by Murray-Tortarolo et al. (2013, hereinafter MT13), which proved to be robust in the middle and high latitudes of the NH, but has never been applied before to the Southern Hemisphere (SH). The choice of LAI is motivated by the fact that it can be directly compared with land surface models and linked to its strong influence on carbon, energy, and water balance calculations (MT13). Differently from MT13, the new approach aims at evaluating a set of four known phenology behaviors: (1) evergreen phenology, which is characterized by limited canopy density seasonal variation (e.g., Wu et al., 2014); (2) single growing season, featured by a nearly normal distribution shape with a summer-peaking growing season (as identified and analyzed in MT13); (3) single growing season distinguished by a summer dormancy behavior (reduced summer plant activity triggered, e.g., by drought conditions as described in Volaire & Norton, 2006); (4) multigrowing-season behavior, distinguished by two distinct growing seasons, typical of areas characterized by two separate rainy seasons such as Eastern Africa (e.g., Martiny et al., 2006) and the Horn of Africa (e.g., Zhang et al., 2005), or croplands (e.g., Zhang et al., 2003).
The new approach introduced in this work, named "Four Growing Season Types" (hereinafter 4GST) is applied to a state-of-the-art land surface model, that is the Community Land Model version 4.5 (CLM4.5, Oleson et al., 2013), and validated against the LAI3g satellite observation . Despite that a new version of the Community Land Model (CLM5.0, Lawrence et al., 2018) has been recently released, the use of CLM4.5 allows testing the 4GST method versus previous phenology analysis (i.e., Dahlin et al., 2015;Chen et al., 2016). In particular, Dahlin et al. (2015) is a reference for drought deciduous phenology, while Chen et al. (2016) represent a benchmark for spring leaf-out in seasonal-deciduous phenology. Differently from these previous works, 4GST enlarges the evaluation of CLM4.5 phenology seasonal features to the global scale and assesses its behavior in regions characterized by evergreen behavior and multigrowing seasons.
The paper is structured as follows: methods, satellite observation, and land surface model are presented in section 2. In section 3, the main differences between satellite observation and land surface model are highlighted. The main implications of the results presented in section 3 are discussed in section 4, while the main conclusions are presented in section 5.

Satellite Observations
The LAI3g is a satellite-based observational data set, obtained applying a neural network algorithm to the third generation Global Inventory Modeling and Mapping Studies Normalized Difference Vegetation Index (NDVI3g) and to the Terra Moderate Resolution Imaging Spectroradiometer LAI. The obtained LAI3g data set has a 15-day temporal frequency and a 1/12 • spatial resolution. Further details can be found in Zhu et al. (2013).
In this study, we use the LAI3g values over the 30-year period 1982-2011; for consistency with model output, we regridded the LAI3g values over a 1.25 • × 0.9375 • regular grid using a first-order conservative remapping scheme (Jones, 1999). The choice of this scheme derives from the need to keep unchanged the information deriving from model subgrid structure.

The Community Land Model
The Community Land Model (CLM) is the terrestrial component of the Community Earth System Model (CESM, http://www.cesm.ucar.edu/models/cesm1.2/), and of the CMCC coupled model version 2 (Cherchi et al., 2019). Here, we use the version 4.5 of CLM (CLM4.5, Oleson et al., 2013) in its biogeochemical configuration (i.e., BGC compset, Koven et al., 2013) with prescribed plant functional types (PFTs) distribution, evolving in time following the Land Use Harmonization version 2 (LUH2, http://luh.umd.edu/) and accounting for a simplified representation of crops (considered as C3 grass), differently from the model described in Drewniak et al. (2013). CLM divides the global vegetated areas into 15 PFTs, each characterized by specific optical, morphological, and photosynthetic parameters. The CLM4.5 configuration used in this work resolves explicitly carbon-nitrogen biogeochemical cycles Oleson et al., 2013) as well as plant phenology. Specifically, CLM4.5 presents three phenology parameterizations, which apply to three distinct plant phenological types: evergreen, seasonal-deciduous, and stress-deciduous (Oleson et al., 2013).
The evergreen phenological type is applied to plants identified by a continuous leaf fall and fine roots turnover distributed along the year, called background litterfall. This mechanism is driven by a PFT-specific leaf longevity parameter (Oleson et al., 2013).
The seasonal-deciduous type is based on the off-line model Biome-BGC version 4.1.2 (Thornton et al., 2002) and parameterizations by White et al. (1997). In particular, the leaf onset is triggered when the thermal sum in soil temperature, also called accumulated growing-degree-day, exceeds a specific critical threshold, which is defined from the 2-m annual average air temperature. The leaf litterfall, instead, is initiated when the daylength is below another specific critical threshold (Oleson et al., 2013). Note that also summer and winter solstices act as time-limiting factors in this configuration.
Since the litterfall trigger of the seasonal-deciduous type is only valid out of the tropical zone, a third type of phenology is used to allow deciduous tropical plants dropping their leaves: the stress-deciduous type. This category is generally applied to grass and trees that respond to both cold and drought-stress signals, and it is based on the grass phenology model proposed by White et al. (1997). In areas characterized by year-round warm conditions, the onset is triggered by soil moisture availability, while, in region characterized by a cold season, both soil temperature and soil moisture are taken into account for triggering the leaf onset. The offset, instead, can be started by (1) sustained period of dry soil, (2) sustained period of cold temperature, or (3) day length shorter than 6 hr. Given these onset and offset conditions, multiple growing season per year and leaves persisting year-round can be simulated in areas mainly described by stress-deciduous phenological type.

Experimental Setup
The CLM4.5 is run on a 1.25 • × 0.9375 • regular grid and forced by atmospheric variables (2m air temperature, precipitation, wind, surface pressure, shortwave radiation, and air humidity) coming from the CRUNCEP version 7 reanalysis data set (Viovy, 2018). Note that the daily CLM4.5 LAI outputs are aggregated in 15-day values for consistency with the satellite observation data set (see section 2.1).
CLM4.5 is run a second time on the same regular grid and forced with the same atmospheric forcing. Differently from the previous simulation, this one, named +5K, is characterized by a global 5-K soil temperature increase in the entire column and covers the 10-year period between 1996 and 2005. Since soil temperature is one of the key variables driving CLM4.5 phenology parameterization, the +5K simulation is performed to test the response of the model phenology to extreme conditions.

Growing Season Analysis
Similarly to MT13, 4GST uses LAI seasonal cycle to infer the timing of the growing season onset and offset, but it completes and expands the usability of the antecedent approach. In fact MT13 focuses on the northern extratropical regions, where the LAI seasonal amplitude is assumed to have a normal distribution over the year Piao et al., 2007). In contrast, the method presented for the first time in this work is intended to focus on the global scale and represents four main phenology types: (1) evergreen (EVG); (2) single growing season peaking in summer (SGS-S); (3) single growing season with summer dormancy (SGS-D); (4) two growing seasons (TGS). EVG, TGS, and SGS-D were not considered in MT13, and, in general, they have often been neglected in literature with focus on global models. The EVG type is characterized by limited canopy density seasonal variation (e.g., Wu et al., 2014). For this reason, as a first step, the 4GST method identifies the EVG regions based on relative changes in LAI compared to local LAI mean value (equation (1)).
A 25% relative change is used as critical threshold (EVGT, equation (1)), similarly to Myneni et al. (2007), who highlight observed LAI oscillation in tropical forests. Other values of EVGT were tested for sensitivity assessment, and the 25% returns a reasonable representation of EVG (not shown).
The regions defined as not EVG follow the steps summarized in Figure 1: (1) the LAI annual time series is extracted over every land grid cell; (2) the LAI annual time series is shifted in order to place the maximum annual value in the center of the time series; (3) the time series is divided into four 3-month periods, each overlapping the previous and next periods by 15 days; (4) a linear least squares regression scheme is applied to identify LAI growth and decline, as well as timing of transition between positive and negative LAI slopes (e.g., Zhang et al., 2003); (5) three LAI shapes are identified through LAI slopes and transition timings: (a) two-growing-season type (TGS), and (b, c) single growing season types (SGS).
We use a critical threshold (CT) equal to 20% of the seasonal LAI annual variability, as defined in MT13, to identify the onset and offset of the growing season: Equation (2) is applied to the SGS types as in MT13 and to TGS type as described in Appendix A. The choice of the 20% CT derives from a study of Kang et al. (2003), who tested a set of three possible CTs. They showed that a 30% threshold could fall within the range of "already high LAI," with the effect of overestimating the growing season onset. Instead, a lower threshold (i.e., 10%) could be influenced by variations in understory.
Note that the distinction between SGS-S and SGS-D depends on the LAI peak. In the NH, SGS-S is detected when LAI peak occurs between May and September ( Figure 1b), otherwise SGS-D is detected ( Figure 1c). The opposite may be noticed in the SH.
Finally, the use of a set of four 3-month linear least squares regressions avoids short-term fluctuations and allows only long-term transitions between positive and negative LAI slopes in the identification of growing season cycles within the LAI annual time series.
The 4GST method aims at expanding the evaluation of the growing season period in land surface models to the global scale. This methodology is expected to perform similarly to MT13 in the regions covered by their analysis (i.e., in the NH middle and high latitudes, namely above 30 • N) and to supply useful insights elsewhere. From a direct comparison, indeed, the analysis is performed on a larger area, thanks to the inclusion of additional growing season types ( Figure S1 in the supporting information).

Model Phenology Evaluation
To benchmark the simulated land surface model phenology against satellite-derived phenology, the overall spatial distribution of (1) growing season types (section 3.1) and (2) onset and offset timings (section 3.2) are compared. This comparison is not applied in areas dominated by EVG type and desert regions, such as Sahara and Arabia, where phenology timings are not computed.
To better highlight differences between the model and observations, a regional analysis is performed (section 3.4). The regional separation is based on year 2000 observed PFT distribution (Land Use Harmonization v2, http://luh.umd.edu/). The mean onset (offset) timing in each region is calculated for both CLM4.5 and LAI3g, and then compared. Consequently, the effective area covered by each PFT-dominated region can change between model and observation, due to the different distribution of EVG type.

Growing Season Type Distribution
First, the 4GST method is applied to both satellite observation and CLM4.5, to compare the global spatial distribution of the four main growing season types (i.e., The NH midlatitude and high-latitude regions are mainly dominated by the SGS-S type in both LAI3g and CLM4.5 ( Figure 2). This justifies the use of the MT13 method, which only accounts for this type, in those areas. Instead, the Tropics and the SH do not exhibit a dominant growing season type and the agreement between LAI3g and CLM4.5 is reduced compared to the NH (Figure 2). In general, LAI3g and CLM4.5 agree in ∼53% of the land grid cells, mainly in areas dominated by SGS-S type (∼62% agreement). CLM4.5 largely overstimates the EVG fraction compared to LAI3g. The low amount of EVG identified in LAI3g can be partially linked to satellite artifacts induced by interannual variability of snowmelt in boreal regions (e.g., Delbart et al., 2006), or of cloud cover in the Tropics (e.g., Myneni et al., 2002Myneni et al., , 2007. On the other hand, both in situ camera observations combined with ecosystem carbon dioxide fluxes (  and remote sensing monitoring (Guan et al., 2015) show large seasonal variance in rain forest ecosystem photosynthesis, ultimately affecting vegetation indexes such as NDVI and LAI. Wu et al. (2016) conclude that models tuned to match current observations while assuming that LAI have no seasonal cycle, risk making erroneous estimation of forest response to climate conditions.
Despite the fraction of TGS grid cells is similar in CLM4.5 and LAI3g (∼9% and ∼8% of the total land grid points, respectively) the agreement in TGS type regions is low (∼16%).
With this first comparison, Tropics and the SH are identified as the regions with the major potential differences between observed and modeled phenology. This bias may be linked to the higher vegetation complexity found in these areas, compared to the large uniformity shaped of the boreal forests north of the 45 • parallel.
In global-scale observations and simulations, each grid point is composed by a combination of many different PFTs. Consequently, areas characterized by similar biomes, such as Amazon and Congo basins (Figure 2a), can feature different growing season types due to dissimilar PFT composition. Besides, satellite data such as LAI3g encounter some problems in screening densely cloud-covered areas like in Tropics, leading to potential artificial LAI data reconstruction (see section 4.3).

Onset and Offset Variability
The climatological (average over 1982-2011) growing season onsets and offsets simulated by CLM4.5 are compared with LAI3g ( Figure 3). In LAI3g, the growing season for most NH vegetation starts in the first half of the year (around DOY 120), while the growing season offset takes place in the second half of the year (between DOY 240 and 300) (Figures 3a and 3b). Both onset and offset exhibit a latitudinal distribution: the onset is progressively delayed with increasing latitude. In fact, in the Mediterranean region the onset occurs between February and March, while in Scandinavia and Russia it only takes place in May ( Figure 3a). As expected, the growing season offset occurs earlier in the year by moving northward: similarly, the Mediterranean offset occurs between October and November, while in Scandinavia it comes about September (Figure 3c). This latitudinal behavior indicates the strong constraint imparted by the radiation on the day length. The tropical regions and the SH are characterized by a more complex pattern of onset and offset timings (Figures 3a and 3b), as already highlighted by the growing season type patterns shown in Figure 2.
In general, onset and offset simulated by CLM4.5 agree with observations (difference smaller than 15 days) only in few regions (green shading in Figures 3c and 3d), and in general, the start of growing season is better reproduced than the offset time (onset agreement: ∼35% versus offset agreement: ∼28.3%).
In the NH middle and high latitudes, the agreement between CLM4.5 and LAI3g observations is high, although a slight deferral of both growing season onset and offset (up to 1 month, in line with other methods, e.g., Anav et al., 2018) is noticeable (Figures 3c and 3d). The 15-day time resolution (see section 2.1 and Zhu et al., 2013) limits a more detailed bias assessment between modeled and observed values.
The growing season window in NH middle-and high-latitude areas is well simulated even where the model overestimates the mean LAI (Figure 4b), and the maximum LAI gets delayed compared to observations (Figure 4d). Consequently, the ability to simulate onset and offset seems to be independent from the model skill for the mean LAI and the time of maximum LAI.
Previous literature works highlighted the imperfect CLM4.5 phenology variability in drought deciduous regions of the world (Dahlin et al., 2015), and the defective timing of spring leaf-out seasonal-deciduous phenology (Chen et al., 2016). We expect that 4GST will highlight similar issues to those shown by Dahlin et al. (2015) and Chen et al. (2016). Dahlin et al. (2015) show the CLM4.5 problematic behavior in drought-deciduous areas, and they tested their proposed improvements on a specific set of regions, namely Brasilia, Western Brazil, South Chad, Eastern Zambia, South Ethiopia, and Darwin (Australia), dominated by drought-deciduous vegetation. The onset and offset reproduced by 4GST are also problematic in these areas. In the Brasilia region, for example, CLM4.5 reproduces a shorter active vegetative season compared to LAI3g, due to an earlier average offset (CLM4.5 offset: DOY ∼145, LAI3g offset: DOY ∼187) and a slightly earlier onset (CLM45 onset: ∼237, LAI3g onset: DOY ∼256, see Figure 3). This behavior derives from an artificial increase of LAI during the dry season (Figure 5b), which alters the LAI cycle and the proper detection of the growing season onset. A similar feature may be noticed in Ethiopia (Figure 5c), where an earlier onset is detected, and this behavior is in agreement with the results of Dahlin et al. (2015). Likewise, the biases in Australia (Figure 5g) are detected by 4GST as well as in Dahlin et al. (2015).
Similarly to Chen et al. (2016), onset and offset timings are rarely captured by CLM4.5 between 20 • N and 40 • N (Figure 3). In this area, the model simulates SGS-D and TGS types instead of SGS-S type shown in LAI3g (Figure 3). Spain and Texas are two examples of this issue. In Spain, CLM4.5 LAI cycle is strongly precipitation driven, and LAI decreases during the summer drought season (Figure 5d). Instead, LAI3g shows a more gradual seasonal cycle with a spring LAI peak. In Texas, despite strong simulated precipitation amounts, CLM4.5 displays a summer LAI low (Figure 5e), opposite to the observed summer peak. This behavior may be related to a sharp simulated response to precipitation and temperature forcings in those regions.
Differently from the previous studies, 4GST highlights major differences between model and observations in the NH high-latitudes offset timing. For example, in Russia (Figure 5f), CLM4.5 growing season exhibits a shift compared to observation. CLM4.5 LAI cycle is precipitation driven and shows a gradual decrease during the leaf-senescence period (Figure 5f). Instead, LAI3g exhibits a fast decrease in LAI after the summer peak. This discrepancy highlights possible issues in CLM4.5 leaf-senescence parameterization. However, it is important to keep in mind that satellite observations have data reconstruction limitations in cloud-covered (e.g., Kandasamy et al., 2013) and snow-covered (e.g., Delbart et al., 2006) areas.

Two-Growing-Season-Type Regions
Multigrowing regions have important impacts on plant phenology, but it is difficult to detect and analyze those regions at global scale (Zhang et al., 2003). 4GST, then, addresses this phenology type by focusing on the areas characterized by two distinct growing season cycles, which are typical of croplands and of semiarid areas characterized by two separate rainy seasons (e.g., Martiny et al., 2006;Zhang et al., 2003Zhang et al., , 2005. At global scale, TGS is rarely simulated correctly by the model (Figure 2). The largest amount of TGS in observations is identified in central and Eastern Africa (Martiny et al., 2006;Zhang et al., 2005). Figure 6 shows good agreement in TGS between CLM4.5 and LAI3g in the Horn of Africa, while Central Africa is shown as evergreen in CLM4.5 (Figure 2). The Horn of Africa area, instead, is characterized by two distinct precipitation peaks, which drive LAI cycle in both model and observations ( Figure S2 in the supporting information).
Here CLM4.5 correctly simulates both growing season cycles ( Figure 6). In particular, CLM4.5 exhibits a later start of the first growing season compared to observations ( Figure 6c) and an earlier offset of the second growing season (Figure 6l). The first growing season offset and second growing season onset, instead, are well captured by CLM4.5 (Figures 6f and 6i).
In general, 4GST detects a TGS type only if each of the two growing seasons is at least 3-month long (section 2.4 and Appendix A). This assumption derives from the need to avoid short-time oscillations to be accounted as growing seasons (section 2.4), and this choice leads to a reduced area of multigrowing season detection compared to previous literature works, such as Dahlin et al. (2015).   The simplified crops (dark orange bars in Figures 7b and 7c) are correctly simulated, especially in the onset timing, and to the best of our knowledge, this is the first time that such quality is enlightened, since crop-dominated areas have been rarely taken into account in this sort of assessments.

PFT-Level Growing Season Variability
In tropical broadleaf evergreen trees (BET Tropic in Figure 7), CLM4.5 onset and offset are reversed, likely due to either different PFT composition between model and observations or to satellite data difficulties in area dominated by high LAI values and with frequent cloud cover (e.g., Kandasamy et al., 2013;Maignan et al., 2011;Myneni et al., 2002). This is evident in Central Africa, where CLM4.5 shows an evergreen type in place of a TGS growing season (Figures 2 and 6). Finally, the model tends to delay the vegetation onset for the PFTs growing in the NH middle and high latitudes, as described in Chen et al. (2016). The offset is anticipated for the needleleaf evergreen trees (NET Boreal and NET Temper in Figure 7c) and delayed for deciduous boreal forests and broadleaf evergreen trees in temperate regions (NDT Boreal, BDT Boreal, and BET Temper in Figure 7c, respectively).

Long-Term Trend and Interannual Variability
The 4GST method has the potential to be applied to different time windows and capture the interannual variability of growing season onset and offset. The annual time window provides some preliminary information on the skill of CLM4.5 to represent interannual variability. In this section, the phenology interannual variability is analyzed using the PFT distinction ( Figure 7a) applied in section 3.4, as shown in Figure 8.
To evaluate the long-term trends in onset and offset, a linear least squares regression scheme is applied. The significance of these trends is evaluated at 95% level by means of a Monte Carlo method (see Appendix B). CLM4.5 simulates a significant extension of the growing season in middle to high latitudes, due to delayed offset, in agreement with LAI3g (Table 1 and, e.g., Figures 8h-8o). Both the model and observations see a long-term onset delay in part of the Tropics (Table 1 and Figure 8f). Onset and offset trends in a few PFTs are incorrectly simulated, for example, needleleaf trees in boreal and temperate regions and part of the tundra vegetation (Figures 8a, 8c, and 8k). Finally, no significant trend in broadleaf evergreen trees is found in both CLM4.5 and LAI3g (Figures 8d and 8e).
In general, onset interannual variability is better modeled than the offset one (Table 1). In particular, high correlation is obtained in boreal needleleaf forests. Arctic C3 grass is the only PFT showing significant correlation between CLM4.5 and LAI3g in both onset and offset interannual variability (Table 1).

Methodology
The +5K simulation was performed to test the 4GST method under extreme conditions. As expected, the largest changes in onset and offset timings occur in the boreal and temperate areas, where temperature acts as plant growth limiting factor ( Figure S3). In particular, the temperate and boreal needleleaf forests respond to increased soil temperature with a longer growing season characterized by both earlier onset (∼40 days, Figure S3b) and later offset (∼40 days, Figure S3c). Temperate and boreal deciduous forests, as well as boreal shrubs and Arctic C3 grass also exhibit longer growing season, only induced by earlier onset (∼30, ∼40, and ∼60 days, respectively; Figure S3b). It is important to notice that, at high latitude, daylight is a strong limiting factor, then the effect of increased temperature is partly neutralized by daylight availability.
A more complete climate sensitivity study is out of the scope of the present work and will be addressed in a future work.

CLM4.5 Limitations
The above results showed that onset and offset simulated by CLM4.5 agree with observations in few regions (green shading in Figures 3c and 3d), while most areas are shifted by 15/30 days compared to LAI3g. Given the coarse 15-day time window available for observations, this result can be considered satisfactory and in-line with previous literature (e.g., Anav et al., 2018). The largest differences are found in regions where water and soil moisture are the main phenological drivers, which are remarkable in deciduous forests (see Figure 7). Hence, deciduous type parameterization still shows some biases (see section 2.2). 4GST highlights also the problematic CLM4.5 representation of the two-growing-season type due to unrealistic response of plant phenology to soil temperature and soil moisture, and to inaccurate PFT compositions. Further work is required to upgrade the CLM4.5 phenology parameterization: a few parameters should be tuned according to regional and geographical features, in particular thresholds related to links between phenology and water availability, whose smoothing should improve the response of vegetation to precipitation. Refinement of plant response time to water and temperature stresses could enhance the representation of growing season type in CLM4.5, as well. Besides, further exploration about the impact of PFT composition on plant phenology should be carried on in order to determine areas with wrong PFT classification. In this direction, a new version of the Community Land Model, namely CLM5.0, has been recently released. This new model version presents changes in plant physiology and crop modeling compared to the CLM4.5, which could alter the results we obtained for phenology. Nevertheless, CLM5.0 shares with CLM4.5 the main phenology assumptions and algorithms but includes the changes proposed by Dahlin et al. (2015). The new parameterization constrains leaf growth to rainfall in semiarid regions, avoiding possible unrealistic leaf flush during the dry season (Figures 5b and 5c). However, the recent work by Zhang et al. (2019) have also shown that the changes proposed by Dahlin et al. (2015) influence leaf senescence in temperate grassland, leading to a delayed growing season offset. This finding highlights the complexity needed for realistic and comprehensive phenology parameterization.
The water distribution in the land surface model derives from the atmospheric precipitation pattern used to force CLM4.5. A corrected reanalysis, specifically CRUNCEP v7 (Viovy, 2018), is here used to force CLM4.5. Although CRUNCEP is a hybrid data set which combines observations from the CRU TS3.2 (Harris et al., 2014) and the National Centers for Environmental Prediction reanalysis (Kalnay et al., 1996), precipitation is still critical, and biases to direct observations are still remarkable (e.g., Bosilovich et al., 2008;Gehne et al., 2016;Tang et al., 2017). Consequently, the model misrepresentation of the observed phenology behavior can be partly ascribed to biases in the reanalysis fields used as forcing.

Satellite Observation Limitations
Here, LAI satellite observation is used to validate the CLM4.5 performance. It is important to take into account that, in satellite data sets, the reflectance saturates in regions characterized by dense canopies (e.g., Maignan et al., 2011;Myneni et al., 2002). Consequently, vegetation indexes such as LAI and NDVI can be artificially limited to a threshold value: Myneni et al. (2002), for example, suggest LAI = 7.0 m 2 /m 2 as the upper limit, although higher values are generally possible. For this reason, areas dominated by elevated LAI can be underestimated in observation.
Many vegetation indexes have been developed in order to overcome some of these optical limitations, such as the SAVI, the atmospherically resistant vegetation index (ARVI), or the global environmental monitoring index (GEMI) (e.g., Huete et al., 1994). However, these indexes are mainly related to changes in leaves color and cannot be applied in the formulation of 4GST, which operates on changes in canopy density (see section 4.4).
Other caveats in LAI satellite observation may also affect our analysis. Cloud cover gap-filling technique and LAI reconstruction algorithm (e.g., Chen et al., 2017;Kandasamy et al., 2013) can impact the observational data set used. For example, Chen et al. (2017) show that the use of a novel atmospheric correction algorithm (multi-angle implementation of atmospheric correction) could lead to a quality improvement of LAI data. Kandasamy et al. (2013) display the abilities and limits of eight methods used in filling gaps in satellite data and state that the performance of these different methods depends on their implementation. In situ data, such as the PhenoCam network (Richardson et al., 2018), may be more reliable for model validation, yet global coverage of these data is missing.

Plant Phenology Definition
Vegetation phenology can be assessed by taking into account many plant features and indexes. As described, 4GST is driven by changes in leaf biomass, but other indexes, such as NDVI, SAVI, ARVI, and GEMI, may be selected as a proxy for plant phenology (e.g., Churkina et al., 2005;Huete et al., 1994;Jeong et al., 2011;Keenan et al., 2014;Sonnentag et al., 2012;Xu et al., 2013;Zhang et al., 2003). The choice of LAI is probably more robust for the definition of growing season onset and offset, since the presence of leaves is a physical parameter. Wu et al. (2014) show that many vegetation indexes are able to capture the plant phenology stages, but each vegetation index has a different predictive strength depending on sites and plant functional types. For this reason, it can be of interest for the community comparing our approach to others that assess the phenology states throughout different vegetation indexes, and this will be a focus of a future work.

Interannual Variability
Phenology interannual variability has been shown to have an influence on land-climate interactions (e.g., Keenan et al., 2014;Richardson et al., 2013). Keenan et al. (2014) reported a negative feedback in temperature-phenology-carbon coupling triggered by the observed increase in the phenology active season. Despite this, Keenan et al. (2012) and Richardson et al. (2012) show that state-of-the-art land surface models are currently unable to reproduce the observed land interannual variability and could, then, misrepresent the future responses of phenology to climate change. Therefore, improving the land surface model ability in capturing phenology interannual variability is fundamental.
Observations show changes in growing season onset and offset between 5 and 8 days/decade (e.g., Keenan et al., 2014), and high time resolution data (at least weekly data, Ahl et al., 2006) are required to properly assess the interannual variability of growing season onset and offset. The 4GST method, then, needs to be applied to observed and simulated data at higher time resolution than the 15-day time range available for LAI3g, such as the in situ PhenoCam network (Richardson et al., 2018) or the high-frequency Moderate Resolution Imaging Spectroradiometer data (Myneni et al., 2015). This analysis will be performed in a future study.

Conclusion
This work describes a new methodology that aims at defining and identifying a wider range of vegetation growing season types and at validating the plant phenology simulated by global land surface models. The so-called 4GST method has worldwide relevance, featuring the SH vegetation, which was partly neglected by previous literature, and better representing the phenological characteristics of many regions lying in the NH middle and high latitudes. The new set of growing season modes includes (1) evergreen phenology; (2) single growing season with summer LAI peak; (3) single growing season with summer dormancy; and (4) two growing seasons.
Applied to a state-of-the-art land surface model (i.e., CLM4.5) and to a satellite observation data set (i.e., LAI3g), 4GST highlights model skills and limitations in the representation of observed vegetation phenology. CLM4.5 is able to satisfactorily reproduce the observed growing season onset and offset in the NH midlatitude and high-latitude regions. CLM4.5 generally captures the growing season onset more realistically than the offset. In addition, 4GST highlights the main limitations exhibited by the model, which are primarily found in areas where water acts as the main phenological driver, such as in areas dominated by broadleaf deciduous trees. In this regard, the novel 4GST method is coherent with the literature benchmark and with its antecedent method. Differently from previous literature works, 4GST assesses the ability of land surface models to reproduce multigrowing season phenology. CLM4.5 captures the double growing season of the Horn of Africa vegetation, while it fails in other regions characterized by TGS type. This finding highlights the need to further test the PFT distribution used in CLM4.5.
Besides the global scale applicability, this approach can be used for analyzing the phenology time evolution and interannual variability. CLM4.5 shows a rather good correlation with the observed growing season onset in temperate and boreal forests, with the offset in Arctic C3 grass. However, higher time resolution observations would be required to properly investigate the LAI interannual variability. Finally, this work shows that the model ability to simulate the growing season onset and offset is independent from its skill in the representation of mean LAI and maximum LAI timings. This emphasizes the importance of surveying phenology also in regions where land surface models exhibit low skill in the simulation of canopy density. Despite a poor representation of the mean LAI magnitude, CLM4.5 is still able to provide a realistic picture of the active vegetative season in many regions, especially in NH middle and high latitudes.

Appendix A: TGS-Type Methodology
Due to the 365 day framework, in which the LAI peak is imposed to be in the center of the time series, TGS may be detected in two ways: (a) two growing cycles are entirely distinguished within the time series ( Figure A1a); (b) in addition to a central growing cycle, a second one is split between the beginning and the end of the time series ( Figure A1b). In case (a), equation (2) is applied to both growing cycles separately. In case (b), the peaks at the beginning and the end of the time series are merged into one single peak, to obtain a second growing cycle in addition to the one shown in the center of the time series ( Figure A1b). Then, equation (2) is applied to both growing cycles separately.

Appendix B: Statistical Analysis Methodology
The Monte Carlo method is a nonparametric test. It is based on resampling the observations in order to construct a set of artificial data sets. Here, 1,000 realizations, obtained by randomly shuffling the original 30-year time series, are used. A random observed trend, then, works as null hypothesis in the Monte Carlo test. Therefore, the null hypothesis is rejected, and then the trend considered significant, when the observed linear least squares regression lies in the top 95% of the thousand-random-sampled trends distribution. The same Monte Carlo method is used for testing the significance of the Spearman's correlation coefficient.
The Spearman's correlation coefficient evaluates the linear correlation between the LAI3g onset (offset) time series and the CLM4.5 onset (offset) time series: where Cov(L,C) is the covariance between the LAI3g onset (offset) time series and the CLM4.5 onset (offset) time series; L is the LAI3g onset (offset) time series standard deviation; similarly, C is the CLM4.5 onset (offset) standard deviation.