Climate Model Code Genealogy and Its Relation to Climate Feedbacks and Sensitivity

Contemporary general circulation models (GCMs) and Earth system models (ESMs) are developed by a large number of modeling groups globally. They use a wide range of representations of physical processes, allowing for structural (code) uncertainty to be partially quantified with multi‐model ensembles (MMEs). Many models in the MMEs of the Coupled Model Intercomparison Project (CMIP) have a common development history due to sharing of code and schemes. This makes their projections statistically dependent and introduces biases in MME statistics. Previous research has focused on model output and code dependence, and model code genealogy of CMIP models has not been fully analyzed. We present a full reconstruction of CMIP3, CMIP5, and CMIP6 code genealogy of 167 atmospheric models, GCMs, and ESMs (of which 114 participated in CMIP) based on the available literature, with a focus on the atmospheric component and atmospheric physics. We identify 12 main model families. We propose family and ancestry weighting methods designed to reduce the effect of model structural dependence in MMEs. We analyze weighted effective climate sensitivity (ECS), climate feedbacks, forcing, and global mean near‐surface air temperature, and how they differ by model family. Models in the same family often have similar climate properties. We show that weighting can partially reconcile differences in ECS and cloud feedbacks between CMIP5 and CMIP6. The results can help in understanding structural dependence between CMIP models, and the proposed ancestry and family weighting methods can be used in MME assessments to ameliorate model structural sampling biases.

Because all models are imperfect representations of reality, they are affected by various uncertainties in the model output, which can be broadly categorized as data, parameter, and structural uncertainty (Remmers et al., 2020). While data and parameter uncertainty can be relatively easily quantified and sampled, structural uncertainty pertaining to model code is hard to quantify or sample, and some authors noted that structural uncertainty is insufficiently sampled in CMIP MMEs. Models participating in CMIP are dependent in a number of ways, including being essentially the same model with a different configuration, sharing parts of their codes, model components, and schemes, using the same data sets for validation, and implementing similar parametrizations. Some authors have therefore called this MME an "ensemble of opportunity" (Boé, 2018; Knutti et al., 2013; Masson & Knutti, 2011; Sanderson et al., 2015a), since the inclusion is based on the intent of a modeling group to participate rather than objective selection criteria. If model dependence is not taken into account, the calculation of means, variance, and uncertainty can be biased, and spurious correlations (such as in emergent constraints) can arise in an MME (Caldwell et al., 2014; Sanderson et al., 2021). Remmers et al. (2020) investigated whether model code genealogy can be inferred from model output [also investigated earlier by Knutti et al. (2013) and discussed below]. Using a modular modeling framework, they generated a model ensemble of hydrological models by sampling the model "hypothesis space" [as defined in Remmers et al. (2020)] and compared its genealogies based on model code and model output. They found that it was not possible to infer complete model code genealogy based on model output because the performance of the inference was low.
It is possible that the same would partially apply to much more complex models like GCMs and ESMs, and model code relationship needs to be studied in order to sample the model hypothesis space. Pennell and Reichler (2011) tried to quantify the effective number of models in an MME of 24 CMIP3 models based on model output error similarity, and found this to be about 8. Increasing the number of ensemble models did not substantially increase the effective number of models. Sanderson et al. (2015b) reached a similar conclusion, and found that the number of independent models calculated based on the model output in CMIP5 is much smaller than the total.
The simplest approach to analyzing an MME is "model democracy," where each model is given an equal weight in statistical calculations. More sophisticated approaches proposed to address model dependence include weighting or selecting models. Selecting models can be regarded as an extreme form of weighting. Commonly suggested weighting methods are based on model performance ("model meritocracy"), model output or code dependence, and diversity. The topic of climate model dependence and genealogy has been covered in many previous studies, most of which used the dependence of the model output (Bishop & Abramowitz, 2013; Haughton et al., 2015; Jun et al., 2008a, 2008b; Knutti et al., 2013; Masson & Knutti, 2011; Mendlik & Gobiet, 2016; Sanderson et al., 2015a), while a focus on code dependence has been relatively rare (Alexander & Easterbrook, 2015; Steinschneider et al., 2015). Boé (2018) distinguishes these two approaches as "a posteriori" and "a priori." Knutti et al. (2013) developed a CMIP5 model genealogy based on a hierarchical clustering of model output. They found that models from the same institute were much closer in their model output than other models, and contemplated that output similarity could be used for model weighting or selection to eliminate biases due to near duplicate models. Simpler approaches are "institutional democracy," where one model per modeling group is selected, and "component democracy," where models are selected to represent different model components. Edwards (2000a, 2000b, 2000c, 2011, 2013) described the early to modern history of climate modeling and constructed a partial "family tree" of atmospheric GCMs based on their code heritage. Another account of early climate modeling was given by Arakawa (2000). Boé (2018) summarized institute, atmospheric, oceanic, land, and sea ice components of CMIP5 models and how they relate to proximity of the model results.
However, the code dependence of all CMIP3, CMIP5, and CMIP6 models has not been analyzed. Partially, such understanding is limited by the availability of the source code. This contributes to the treatment of models as "black boxes" by the research community. Haughton et al. (2015) compared simple weighting with model performance and model output dependence weighting. They found performance weighting improved mean relative to observations (as expected) but degraded variance estimation, and dependence weighting improved both. Steinschneider et al. (2015) identified close correlations between model output of models of the same family even on a regional scale, and showed that the clustering of similar models can result in narrowing the MME variance attributable to intermodel correlations.
Reducing the size of an MME to a set of independent models is a relatively simple method of avoiding model dependence. Sanderson et al. (2015b) noted that permitting only one model per institute in an MME could lead to unfairly dismissing models which are substantially different, and overestimating independence in cases where code is shared between institutes. Weighting models by country can have some merit due to the fact that models are sometimes developed with a focus on accuracy over the region where the institute is located, and a model might be more extensively validated against data from observations in the region. For example, the New Zealand Earth System Model (NZESM) (in practice developed alongside HadGEM/UKESM) was developed to reduce Southern Ocean biases (Williams et al., 2016); the Indian Institute of Tropical Meteorology ESM (IITM ESM) has a special focus on the South Asian monsoon (Krishnan et al., 2021); the Australian Community Climate and Earth System Simulator coupled model (ACCESS-CM) has a focus on reducing uncertainties over the Australian region (Bi et al., 2013); and the Energy Exascale Earth System Model (E3SM) aims to support the U.S. energy sector decisions (Golaz et al., 2019). Weighting models by errors relative to observations (performance weighting) is complicated by the fact that there can be a decoupling between a climate model's accuracy in representing present-day and historical climate variables and its accuracy in representing the projected change (or trend) of the variables under a climate scenario (Jun et al., 2008a;Kuma et al., 2022;Zelinka, 2022). Thus, a model's performance in future climate projections cannot be fully inferred from its performance in present-day and historical climate. Performance weighting can also favor models which are better tuned to present-day, historical or paleontological observations by compensating biases. 
It is possible that model quality cannot be estimated solely from model output, because some models might represent physics more consistently with our knowledge of fundamental physics, yet give inferior output when compared to observations if they have fewer compensating biases or are tuned less to represent present-day or historical observations. Knutti (2010) provides a high-level discussion of the topic of model democracy, uncertainty, weighting, evaluation, calibration and tuning in the context of decision making. Apart from explicit model weighting or selection choices, seldom-recognized implicit choices based on values (other than widely acknowledged epistemic values such as openness, objectivity, evidence, and impartiality) influence model development, evaluation, selection, weighting, interpretation, and communication of results (Lenhard & Winsberg, 2010; Pulkkinen, Undorf, Bender, Wikman-Svahn et al., 2022; Undorf et al., 2022; Winsberg, 2012). The climate system is too complex to be captured by models perfectly. Some of the limitations stem from limited computational resources, uncertainty about how to represent processes at a coarse level through parametrizations, and a lack of observational data. Thus, model construction necessitates and is affected by decisions regarding a variety of compromises. Traditionally, a pursuit of purely knowledge-oriented science has been desired in order to avoid conclusions distorted by scientists' views, values and interests. However, some authors emphasize that purely knowledge-oriented construction of climate models is impossible because of decisions involved in the model development (Jebeile & Crucifix, 2021; Morrison, 2021; Parker, 2020; Parker & Winsberg, 2018). These decisions can be driven not only by the desire to create an unbiased objective representation of the climate system, but also by purposes, views, values, interests and limitations.
They include for example, a specific focus on modeling a certain geographical region and quantities of interest, the availability of validation data influenced by locations of observations, compromises regarding what errors are permissible, types of tuning (Schmidt et al., 2017), decisions involved in earlier versions of the same model or ancestral models resulting in inherited values, limited knowledge and time of the researchers, and limited resources. In turn, they can also perpetuate certain types of societal biases against traditionally understudied and underrepresented regions. Rarely are such decisions or values and interests which drive them explicitly acknowledged, which makes it difficult to quantify their impact on MMEs. Although less acknowledged, interests can also include reasons for pursuing certain research or development which are not driven by practical reasons but by curiosity. In a broader view, the development of climate models has aspects of iterative development, inheritance, recombination, cooperation, competition and filling of different niches. In this way, it can be considered a collective optimization process with the goal of describing the important and diverse properties of the climate system (as considered by various actors) through pluralism in the face of limited knowledge and computational resources, both of which also keep changing.
We can define the structure (code) of a model as based on a set of hypotheses about reality as well as computational realizations of such hypotheses. A desirable feature of an MME would be that models represent samples from the hypothesis space with probability equal to our degree of belief that the hypothesis is true (note that this is different from a uniform sampling of the hypothesis space, which would be both impossible and undesirable due to its size). However, this is rarely the case with existing MMEs, and it is not easily quantifiable. It is generally not desirable that the model output of individual models in an MME is the most unique, because one would still want all models to converge as closely as possible on the true representation of physical processes. Here, we define a "true representation" in limited terms as a pragmatically oriented conceptualization of the Earth system, which for example, might not include the anthroposphere as commonly externalized in CMIP models through scenarios. Models can be similar in their output because they are convergent on the best representation of reality or because of code similarity, and this limits the use of model output as a measure of model dependence. We note that some authors advocate against a value-free ideal to which models should converge (Parker, 2020;Parker & Winsberg, 2018). As a conceptual model (Figure 1), we can consider models in an MME to be samples corresponding to representations of a physical reality in a hypothesis space. Here, representation is supposed to mean code which produces output for given initial and boundary conditions, that is, without considering internal variability. While the true physical representation is unknown and impossible to simulate due to computational constraints, our collective belief that a given representation is true can be conceptualized theoretically by a probability density function (PDF). 
Ideally, models in an MME are independent samples from this PDF (Figure 1a). In actual MMEs (Figure 1b), however, models are dependent and tend to be clustered together for reasons incompatible with the PDF, such as the inclusion of several configurations or resolutions of a single model, selective sharing of code between models for reasons other than meritocracy (such as availability or political and organizational decisions), or model output availability. Therefore, if a PDF or its statistics are estimated from this MME, they will be biased compared to the actual PDF. The aim is then to compensate for this bias with appropriate model weighting, selection or more sophisticated techniques such as emergent constraints. Even if we could estimate the PDF in an unbiased way, the value with the maximum likelihood or the mean are unlikely to coincide with the true physical representation, because such a PDF only represents our belief that a given physical representation is true, which is limited by our knowledge. Note that model dependence itself does not preclude that an estimate of the PDF is unbiased. For example, in the Metropolis algorithm (Metropolis et al., 1953), an unbiased estimate of a PDF is generated by sequentially producing a chain of samples which are close to each other. After a large enough number of iterations, an unbiased estimate of the PDF can be inferred from the collection of all samples, despite close correlation between adjacent samples in the chain. Other aspects not considered in Figure 1 are that our knowledge about the climate system is shaped by various decisions such as which parts of the climate system have been considered interesting to study or observe, and individual models are also affected by such decisions during their development. As mentioned above, some models even have a particular explicitly stated purpose, such as ACCESS-CM, E3SM, IITM ESM, and NZESM. 
The consequence of this is that models are not only biased samples of the PDF due to code dependence, but also due to value and interest-based decisions. For the same reasons they can also converge or diverge. None of the model weighting methods mentioned above are without issues. Performance weighting can disregard models whose physics representation is relatively far from the most likely representation but still plausible, thus artificially narrowing the spread. Model dependence weighting based on output or code can disregard models which are close to other models but were chosen to be based on this model because of its perceived quality, thus preventing such an MME from narrowing down on the true representation of climate physics (as defined in the limited terms above). Dependence weighting based on output can mistakenly identify two models as similar when they are in fact independent, or fail to identify models with significant code dependence. Weighting based on diversity can give too much weight to outliers and too little weight on models more densely clustered around the most likely representation, thus artificially increasing the spread.
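The earlier remark that sample dependence alone does not bias a PDF estimate (the Metropolis example) can be illustrated with a minimal sketch. It is not part of the analysis in this study: the target density (a standard normal), the uniform proposal, the step size, the seed, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def metropolis(log_density, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler (illustrative; symmetric uniform proposal)."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        # A rejected proposal repeats the current state in the chain.
        chain.append(x)
    return chain


def lag1_autocorrelation(chain):
    """Correlation between adjacent samples in the chain."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var


def standard_normal_log_density(x):
    # Log of the target PDF up to an additive constant (assumed target).
    return -0.5 * x * x


chain = metropolis(standard_normal_log_density, x0=0.0, n_samples=50000)
chain_mean = sum(chain) / len(chain)
chain_var = sum((v - chain_mean) ** 2 for v in chain) / len(chain)
adjacent_corr = lag1_autocorrelation(chain)
```

In this sketch, adjacent samples are strongly correlated, yet the mean and variance of the whole chain approach those of the target PDF as the chain grows. The analogy to an MME is loose: the chain is dependent by construction in a way that preserves the target distribution, whereas code sharing between models carries no such guarantee.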
Recently, multiple models participating in CMIP6 (Eyring et al., 2016) predicted much higher effective climate sensitivity (ECS) than the assessed range of the IPCC Sixth Assessment Report (Masson-Delmotte et al., 2021). This was exacerbated by the fact that some models contributed multiple runs, making simple multi-model means potentially unreliable. Voosen (2022) cautioned that using models which predict too much warming compared to the range assessed by the AR6 can produce wrong results, and therefore model democracy should be replaced with model meritocracy. Partly due to the limitations of the simple multi-model mean, the authors of the AR6 departed from the use of multi-model means to quantify ECS and transient climate response (TCR), and instead used a multi-evidence approach similar to Sherwood et al. (2020), although a simple multi-model mean is used in other parts of the report.

Motivation and Objectives
Code dependence in CMIP models is not well explored, especially when it comes to code sharing between modeling groups. This hinders model evaluation studies, which sometimes regard the CMIP MME as an opaque set of models [e.g., Meehl et al., 2020; Schlund et al., 2020; Zelinka et al., 2020, but also many parts of AR6].
Figure 1. Conceptual model of models in a hypothesis space representing realizations of physical climate processes (model structure). The shading indicates a probability density function (PDF) quantifying our collective belief that a certain representation is true. In an ideal case (a), models are unbiased samples from this PDF, allowing us to estimate the PDF from a multi-model ensemble (MME). In reality (b), they form clusters because of structural model dependence (code sharing) as assumed and discussed in the introduction, sampling the PDF in a biased manner. They might also deviate from the PDF for a number of other reasons. Weighted sampling is necessary to estimate the PDF from such an MME. The unknown true physical representation, not coinciding with the PDF maximum or mean, is indicated by a red dot. For illustrative purposes, the hypothesis space is visualized in a 2-dimensional space. In reality, this space has a large number of dimensions and the PDF might not be symmetric. Model marker colors (shapes) in (b) indicate different hypothetical model families, within which models are structurally related. Note that the PDF represents model structure and might not correlate with model output PDF.
To gain insights into the whole MME, we map the code genealogy of all CMIP atmosphere GCMs (AGCMs), atmosphere-ocean GCMs (AOGCMs), and ESMs. Much of the information about code dependence is available in the literature as well as in CMIP model metadata and online resources of modeling groups, but has not been systematically organized across CMIP phases. When determining code relations, our focus is on the atmospheric component and atmospheric physics, because they are currently the main source of model uncertainty in climate sensitivity, dominated by cloud feedback (Forster et al., 2021; Wang et al., 2021; Zelinka et al., 2020). Steinschneider et al. (2015) also identified the atmospheric component as being a particularly important factor determining the similarity of climate projections of temperature and precipitation between models. However, other model components such as the ocean can also have an impact on the feedbacks and climate sensitivity (Gjermundsen et al., 2021). We present a model weighting algorithm based on the model code genealogy, and investigate whether it makes a difference in multi-model means of ECS, effective radiative forcing (ERF), climate feedbacks, and global mean near-surface temperature (GMST) time series. The algorithm can be used to produce weights for any given subset of CMIP models. In addition, we explore simpler weighting methods based on model family, institute, and country, and analyze whether model families differ significantly in their predictions from other model families and a simple multi-model mean.

Data
In our analysis we focus on AGCMs, AOGCMs, and ESMs in the last three phases of CMIP (3, 5, and 6). The CMIP5 and CMIP6 model output data from the control (piControl), historical, Shared Socioeconomic Pathway 2-4.5 (ssp245), Representative Concentration Pathway 4.5 (rcp45), abrupt quadrupling of CO2 (abrupt-4xCO2), and 1% yr−1 CO2 increase (1pctCO2) experiments were acquired from the public archives on the Earth System Grid (CMIP5, 2022; CMIP6, 2022). The equivalent data from CMIP3 were not analyzed here, but we include all CMIP3 models in the model code genealogy. We used historical global temperature data from the Hadley Centre/Climatic Research Unit global surface temperature dataset version 5 (HadCRUT5) (Morice et al., 2021) obtained from the Met Office Hadley Centre (2022). In order to analyze model code genealogy, we performed a broad literature survey, complemented by CMIP model metadata and information available online, particularly modeling groups' websites. In total, we traced the genealogy of 167 models, of which 114 were participating in CMIP, and the rest were related to the CMIP models and thus necessary for reconstructing the genealogy. The model genealogy information, including related references, is also available in Table S1. Along with relations between models, we identified the model institute, the country where the institute resides, and the model family (defined by the oldest ancestral model in the genealogy). Model parameters such as ECS, TCR, ERF, and climate feedbacks were sourced from Zelinka et al. (2020) and the AR6. We use ECS calculated by Zelinka (2022) as an approximation of equilibrium climate sensitivity.

Weighting Methods
We applied several statistical weighting methods to the CMIP MMEs:
1. Simple weighting. Every model run is given equal weight. By "model run" we mean a model resolution or configuration (as listed in Table S1 in the columns CMIP3/5/6 names), not multiple simulations performed with the same model but different initial conditions.
2. Family weighting. Model families, defined as a complete branch as shown in Figure 2 (discussed later in Section 4.1), were given equal weight. This weight was further subdivided equally between models within the family.
3. Institute weighting. Model institutes, as shown in Figure 2 as labels on gray areas, were given equal weight. This weight was further subdivided equally between models within the institute.
4. Country weighting. Model host countries, as shown in Figure 2 as labels on gray areas, were given equal weight. This weight was further subdivided equally between models of the same country.
5. Ancestry weighting. The oldest ancestor models (marked with a thick outline in Figure 2) were given equal weight. This weight was subdivided gradually through branches to descendant models. This method is described in detail in Appendix A.
6. Model weighting. All models are given the same weight. This is different from the simple weighting; see the note below.
Note that in all of the above, if a model supplied multiple runs of different configuration or resolution, the model weight was further subdivided equally between the runs. For clarity, in the following text references to the weighting methods and weighted means corresponding to the methods above are italicized.
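The grouped weighting schemes listed above (family, institute, country, and model weighting) share one pattern: equal weight per group, then per model within the group, then per run of the model. The following is a minimal sketch of that pattern, not the code used in this study. The record fields, the function names, and the example runs with their ECS values are hypothetical, and the ancestry weighting of Appendix A is not covered.

```python
from collections import defaultdict


def simple_weights(runs):
    """Simple weighting: every model run gets the same weight."""
    return {r["run"]: 1.0 / len(runs) for r in runs}


def grouped_weights(runs, key):
    """Equal weight per group (e.g. key="family"), split equally between the
    models in the group, then equally between the runs of each model.
    With key="model" this reduces to model weighting."""
    groups = defaultdict(lambda: defaultdict(list))
    for r in runs:
        groups[r[key]][r["model"]].append(r["run"])
    weights = {}
    for models in groups.values():
        group_weight = 1.0 / len(groups)
        for run_names in models.values():
            model_weight = group_weight / len(models)
            for name in run_names:
                weights[name] = model_weight / len(run_names)
    return weights


def weighted_mean(values, weights):
    """Weighted mean over the runs present in `values`."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total


# Hypothetical ensemble: family A has two models (one of them with two runs),
# family B has a single model with a single run.
example_runs = [
    {"run": "A1-lo", "model": "A1", "family": "A"},
    {"run": "A1-hi", "model": "A1", "family": "A"},
    {"run": "A2", "model": "A2", "family": "A"},
    {"run": "B1", "model": "B1", "family": "B"},
]
# Made-up ECS values in K, for illustration only.
example_ecs = {"A1-lo": 5.0, "A1-hi": 5.0, "A2": 4.0, "B1": 2.0}

simple_mean = weighted_mean(example_ecs, simple_weights(example_runs))  # 4.0
family_mean = weighted_mean(example_ecs, grouped_weights(example_runs, "family"))  # 3.25
```

In this made-up ensemble, the large family A dominates the simple mean, while family weighting gives the single model of family B half of the total weight and pulls the mean toward its value.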

Statistical Significance
Statistical significance in climate feedbacks, sensitivity, and forcing in Section 4.3 was calculated using a Bayesian simulation with PyMC3 (Salvatier et al., 2016). The difference between a simple mean of models within a family and a simple multi-model mean was marked as significant if the magnitude difference between the two means was larger than zero with 95% probability. The PyMC3 model is provided (see the Data Availability Statement below).

Model Code Genealogy and Model Families
The model relations were identified with a primary focus on the atmospheric component, and in particular atmospheric physics, which is a compromise because some models inherit multiple components (atmosphere, ocean, cryosphere, chemistry, etc.), or in some instances provide their own implementation of atmospheric dynamics while inheriting atmospheric physics from a parent model. Some models comprised multiple model runs in CMIP (configurations, resolutions or variations of components), and we grouped these together under a single model name. We identified 14 different model families, that is, groups of models which share the same oldest ancestor model (marked with a thick outline in Figure 2 and also listed in Table S2 of the Supporting Information S1). The models come from 38 different institutes or institute groups and 15 different countries. Institutes are based on the institute attribute of the CMIP data sets (CMIP3, 2022; CMIP5, 2022; CMIP6, 2022) for CMIP models and reference publications or online resources for other models, separated by a slash if multiple institutes were involved. Country is the country of the main institute (defined loosely as the institute credited for most of the models in the group, or where the development originated), with the exception of the European Community Earth (EC-Earth) Consortium models, for which the assumed "country" is Europe.
We recognize two kinds of model relations: a parent-child relation, when the child model is a code-derivative of the parent model with a different name (in the sense of fully or partially inheriting the code of the atmospheric component), and a relation between versions of the same model. Model counts per model family, country, and institute in each CMIP phase are listed in Table S2 of the Supporting Information S1.
We make an exception to the rule that a model family is defined by the oldest ancestral model for the ECMWF- and CCM-derived models, for which the model ECMWF is a common ancestor. We split this model family into two model families of ECMWF and CCM (beginning with CCM0B). This is a subjective choice made for our analysis in order to account for the fact that this split happened in early stages of the development in the 1980s (Edwards, 2011). Some of the identified model families are relatively small, such as CSIRO, GEOS, GFS, INM, UA MCM, and NICAM, with fewer than four models participating in CMIP, while others are much larger, for example, CCM with 28 models and ECMWF with 23 models in CMIP (here by "model" we mean the main model as in Figure 2 rather than model runs in CMIP). In terms of model runs, CCM, ECMWF, and HadAM are particularly well represented in CMIP6 with 32, 27, and 12 model runs, respectively, amounting to about 70% of the entire CMIP6 MME (Table S2 in Supporting Information S1). This means that there is a strongly uneven model representation in CMIP6.
This imbalance has become more pronounced with successive CMIP phases: in CMIP5 and CMIP3, the share of the three most represented model families in terms of model runs is smaller, at 52% and 50%, respectively. The size of model families and the diversity of models within a family are clearly influenced by the availability of model code. For example, the IFS/ARPEGE model is widely licensed to participating modeling groups in Europe, and therefore is used as a basis for a multitude of different models on the continent. The CCM-derived models have publicly available source code, which has been used extensively by many different modeling groups internationally. Other models with private code are used much more narrowly, such as CanAM, CSIRO, IPSL, or INM, which are only used by their own modeling group (and possibly a few collaborating organizations). Publicly available or widely licensed models usually have much greater participation in CMIP and an outsized impact in the MMEs.
Relations between model code can often be complex, ranging from a model component shared with an "upstream" project (such as models in the CCM family using the Community Atmosphere Model) to models taking atmospheric physics implementations from a parent model and developing their own atmospheric dynamics. Likewise, the ocean, land, sea ice, and biochemistry components are swapped for other components in some derived models. This complicates the notion of a model derivative. Because climate feedbacks in the atmosphere are currently the largest source of uncertainty in determining climate sensitivity, it is perhaps the most important model component to use as a determinant in model code genealogy. This is a subjective choice, and other choices would be possible when constructing a model code genealogy.

Climate Feedbacks and Sensitivity
Here, we evaluate how the proposed ancestry weighting and several simpler types of weighting impact the calculation of climate feedbacks and climate sensitivity in the CMIP MMEs. Zelinka et al. (2020) analyzed climate feedbacks, ECS, and ERF in CMIP5 and CMIP6. We perform the same analysis using their estimates of model quantities (Zelinka, 2022), but with different methods of weighting. Figure 3 shows results analogous to Figure 1 in Zelinka et al. (2020), but as means calculated using the different weighting methods relative to the simple multi-model mean. Following Zelinka et al. (2020), the "net [feedback] refers to the net radiative feedback computed directly from TOA fluxes, and the residual is the difference between the directly calculated net feedback and that estimated by summing kernel-derived components." The differences in feedbacks between the simple mean and the other types of weighting are up to about 150 mW m−2 K−1 in magnitude in CMIP6 and 80 mW m−2 K−1 in CMIP5. The different types of weighting often do not agree, except for the family and ancestry weighting, which give very similar results. If we focus on the weighting methods which we expect to be the most accurate in terms of accounting for model code sharing, the ancestry and family weighting, the largest difference from the simple mean is in the cloud feedbacks (total, shortwave, and longwave), with relatively large differences in ECS and ERF. This is perhaps not surprising given the very large spread in model cloud feedbacks in the CMIP MMEs. Interestingly, when we quantify the difference in feedback strength between the CMIP6 and CMIP5 MMEs (Figure 3c), we see that the ancestry weighting reduces the difference in cloud feedbacks between the two CMIP phases substantially.
The magnitude difference is reduced from 77 to −26 mW m−2 K−1 for the total cloud feedback, from 145 to −68 mW m−2 K−1 for the shortwave (SW) cloud feedback, and from −70 to 41 mW m−2 K−1 for the longwave (LW) cloud feedback. However, the net and residual feedback magnitude difference is increased from 61 to −71 mW m−2 K−1 and from 3 to −33 mW m−2 K−1, respectively. We define the root mean square difference (RMSD) between CMIP6 and CMIP5 calculated across the elementary feedbacks (Planck, water vapor (WV), lapse rate (LR), albedo, SW cloud, LW cloud) as:
$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\lambda_{i,\mathrm{CMIP6}} - \lambda_{i,\mathrm{CMIP5}}\right)^2},$$
where λ_i are means of individual feedbacks calculated from either CMIP5 (λ_{i,CMIP5}) or CMIP6 (λ_{i,CMIP6}), and N = 6 is the number of elementary feedbacks. When the RMSD is calculated from the ancestry weighted feedback means compared with simple means, it is reduced by about 40% from 67 to 41 mW m−2 K−1. Therefore, it is possible that a substantial part of the difference in feedbacks between CMIP6 and CMIP5 can be explained by a suitable choice of weighting which takes into account model code dependence. When the RMSD is calculated for family weighting (not shown in the plot), the RMSD is almost the same as for ancestry weighting at 42 mW m−2 K−1. The reduction is smaller for the model weighting (reduced to 60 mW m−2 K−1), and an increase in RMSD is seen for institute (increased to 95 mW m−2 K−1) and country (increased to 79 mW m−2 K−1) weighting. This could mean that only the ancestry, family, and to a lesser extent model weighting can explain some of the feedback difference between CMIP6 and CMIP5. The result is consistent with the expectation that the ancestry weighting is more suitable than the other types of weighting, which are less strongly related to the model code genealogy.
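The RMSD comparison above can be sketched in a few lines. This is a hedged illustration, not the analysis code: it assumes the usual root-mean-square definition (mean of squared differences over the six elementary feedbacks, then square root), and the function names and feedback labels are hypothetical. Only the RMSD values 67 and 41 mW m−2 K−1 are taken from the text.

```python
import math

# Elementary feedbacks named in the text; the string labels are illustrative.
FEEDBACKS = ("Planck", "WV", "LR", "albedo", "SW cloud", "LW cloud")


def rmsd(means_cmip6, means_cmip5, names=FEEDBACKS):
    """Root mean square difference between two sets of feedback means.

    Assumes the standard definition: sqrt of the mean squared difference.
    Both arguments map feedback name -> ensemble mean (same units)."""
    squared = [(means_cmip6[n] - means_cmip5[n]) ** 2 for n in names]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(before, after):
    """Fractional reduction of a quantity, e.g. of the RMSD under weighting."""
    return (before - after) / before


# RMSD values quoted in the text (mW m-2 K-1): simple mean vs. ancestry weighting.
reduction = relative_reduction(67.0, 41.0)  # about 0.39, i.e. "about 40%"
```

The function `rmsd` would be applied once to the simple feedback means and once to the weighted means of the two CMIP phases; comparing the two results gives the reduction discussed above.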
For ECS and ERF, the differences between weighting methods are also substantial, up to about 0.3 K for ECS and 80 mW m⁻² for ERF 2x in magnitude (Figures 3a and 3b). In comparison, the difference in simple mean between CMIP6 and CMIP5 is 0.47 K in ECS and 114 mW m⁻² in ERF 2x, and the standard deviation is 0.73 and 1.06 K in ECS (CMIP5 and CMIP6, respectively) and 390 and 490 mW m⁻² in ERF 2x (CMIP5 and CMIP6, respectively). The difference in ensemble mean ECS between CMIP6 and CMIP5 becomes much smaller with ancestry weighting, falling from 0.47 K (simple mean) to 0.20 K (ancestry weighting), but the difference in ERF 2x is increased from 114 to 226 mW m⁻². Thus, it is possible that a weighting method which accounts for model code dependency can explain some of the difference in ECS between CMIP5 and CMIP6 as resulting from an over-representation of models with high ECS in the CMIP6 ensemble. Figure 4 shows model ECS and the statistical weights of models under the ancestry weighting. It can be seen that in CMIP6, the model weight is the highest for the lowest ECS range and progressively lower with increasing ECS (except for the highest ECS range). This is because the higher ECS ranges are generally populated by models from the large model families HadAM, CCM, and to a lesser extent IPSL and ECMWF, while models with lower ECS come from more diverse families. Because of how the ancestry weighting algorithm works, models in larger families generally have lower per-model weight. In CMIP5, model weights are more even across the ECS range than in CMIP6. The higher simple mean of ECS in CMIP6 is also partly the result of the ECS range above 5 K being populated by models, whereas in CMIP5 there are no models in this range. Thus, the higher simple mean ECS in CMIP6 can be attributed mostly to the HadAM and CCM model families, and their effect is reduced under the ancestry weighting by the smaller per-model weight given to models in large model families. 
Figure 4 also shows the weights multiplied by the number of models in each ECS range (dashed lines). While the two most extreme ECS ranges in CMIP6 (below 2 K and above 5.5 K) have relatively large per-model weights, the number of models in these ranges is small (two), and they have little overall effect on the ancestry-weighted ECS mean.
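The relation between per-model weight, the number of models in an ECS range, and the resulting weighted mean can be sketched in Python as follows. The ECS values, weights, and the 0.5 K bin width are hypothetical placeholders and do not correspond to any CMIP model or to the binning used in Figure 4; only the 0.47 K and 0.20 K differences are taken from the text.

```python
from collections import defaultdict

# Hypothetical (ECS in K, normalized weight) pairs; placeholders, not model data.
models = [(1.8, 0.20), (2.6, 0.15), (2.9, 0.15), (3.1, 0.15),
          (4.7, 0.08), (4.8, 0.08), (5.3, 0.07), (5.6, 0.12)]

BIN_WIDTH = 0.5  # assumed width of each ECS range in K


def ecs_bin(ecs):
    """Lower edge (K) of the ECS range containing the given value."""
    return BIN_WIDTH * int(ecs // BIN_WIDTH)


# Collect the weights of all models falling into each ECS range.
by_bin = defaultdict(list)
for ecs, weight in models:
    by_bin[ecs_bin(ecs)].append(weight)

for lower in sorted(by_bin):
    ws = by_bin[lower]
    per_model = sum(ws) / len(ws)   # mean per-model weight in the range
    total = per_model * len(ws)     # weight multiplied by number of models
    print(f"{lower:.1f}-{lower + BIN_WIDTH:.1f} K: "
          f"n={len(ws)}, per-model={per_model:.3f}, total={total:.3f}")

simple_mean = sum(e for e, _ in models) / len(models)
weighted_mean = sum(e * w for e, w in models) / sum(w for _, w in models)
print(f"simple mean = {simple_mean:.2f} K, weighted mean = {weighted_mean:.2f} K")

# Relative reduction of the CMIP6-CMIP5 ECS difference quoted in the text.
ecs_reduction = (0.47 - 0.20) / 0.47
print(f"ECS difference reduced by {ecs_reduction:.0%}")
```

In this toy example the sparsely populated extreme ranges carry a large per-model weight but a modest total weight, which is the effect described for the two most extreme ECS ranges in CMIP6.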

Climate Feedbacks and Sensitivity by Model Family
We analyzed climate feedbacks and sensitivity by model family (Figure 5). Because model family weighting showed results similar to ancestry weighting (Section 4.2), it should be a good proxy for ancestry weighting, while allowing us to separate the values into (potentially clustered) groups. Some model families tend to have similar values of climate feedbacks. This is most apparent in the cloud feedbacks, where differences between models are generally large. The HadAM family of models tend to be closely clustered in all climate feedbacks, despite the comparatively large size of the model family (6 models in the CMIP6 plot). Their total cloud and SW cloud feedbacks are consistently larger than the mean and their LW cloud feedback is consistently smaller than the mean (in this section we refer to the simple mean as "mean"). The ECMWF family of models (14 models in the CMIP6 plot) have consistently below-mean SW cloud feedback, mostly below-mean total cloud feedback, and almost consistently above-mean LW cloud feedback. The CCM family is the largest (17 models in the CMIP6 plot) and also the most varied, showing a large spread between its models in CMIP6, but a small spread in CMIP5. Despite this, its models have some characteristic properties: in CMIP6, mostly above-mean total and SW cloud feedback and below-mean LW cloud feedback; in CMIP5, mostly below-mean total cloud feedback, but also above-mean lapse rate and surface albedo feedback, and below-mean water vapor feedback. In CMIP6, the UCLA GCM family of models (5 models in the CMIP6 plot) have consistently below-mean total and SW cloud feedback, and mostly above-mean LW cloud feedback.
In terms of ECS, the CCM and ECMWF families of models show a large and relatively even spread around the multi-model mean. In this case, the ancestry or family weighting is unlikely to make a significant difference in terms of the influence of the family on the overall MME mean. In CMIP6, the HadAM and IPSL families of models are all more sensitive than the mean, and the UCLA GCM family of models are all less sensitive than the mean. ECS of the HadAM family is significantly above-mean, and ECS of the UCLA GCM family is significantly below-mean (at 95% confidence).
In summary, some relatively large families of models show consistent properties when it comes to climate feedbacks and ECS, while others show a large spread. This suggests that models in some families have substantial interdependence which translates into clustering of climate feedbacks and ECS. The CCM and ECMWF families are quite diverse, but despite this they show common characteristics in some climate feedbacks.

Global Mean Near-Surface Temperature Time Series
To analyze the impact of the ancestry and model family weighting methods on MME statistics, we examine the case of GMST in the historical, SSP2-4.5, abrupt-4 × CO2, and 1pctCO2 CMIP6 experiments and the historical, RCP4.5, abrupt-4 × CO2, and 1pctCO2 CMIP5 experiments. Figures 6 and 7 show GMST time series in the CMIP6 and CMIP5 experiments (respectively), grouped by model family, as well as family and ancestry weighted time series. Included are all models which provided the necessary data. While some model families have many members in this analysis, such as CCM (7-22 members, depending on the experiment and CMIP phase), ECMWF (3-16 members), HadAM (2-6 members), and UCLA GCM (1-5 members), other families have fewer than 4 members, and therefore it is harder (or impossible) to assess model spread in the smaller families. The family and ancestry weighted GMST time series tend to nearly overlap in all cases, which points to a high degree of outcome similarity between the two types of weighting, as also noted in the preceding sections. Interestingly, the family and ancestry weighted mean is warmer than the simple multi-model mean in the CMIP6 historical experiment (in the CMIP5 historical experiment it is slightly colder by the end of the simulation) and also more consistent with observations, whereas in the 1pctCO2 and abrupt-4 × CO2 experiments it is colder than the simple mean (in both CMIP6 and CMIP5). When CMIP6 is compared with CMIP5, model families tend to exhibit a similar cold or warm propensity, such as INM, GFDL, and UCLA GCM being relatively cold in the non-historical experiments, and CanAM, HadAM, and IPSL being relatively warm. This suggests that model families tend to maintain their climate sensitivity inclination across model generations.
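A weighted multi-model mean time series of the kind compared here can be sketched in Python as below. The model names, GMST anomaly values, and weights are hypothetical placeholders; the weights simply mimic a situation in which two models from one family share the weight that a single model from another family receives alone.

```python
def weighted_mean_series(series_by_model, weights):
    """Weighted multi-model mean at each time step.

    series_by_model maps a model name to a list of values (equal lengths);
    weights maps a model name to a non-negative weight.
    """
    names = list(series_by_model)
    total = sum(weights[m] for m in names)
    n_steps = len(series_by_model[names[0]])
    return [
        sum(weights[m] * series_by_model[m][t] for m in names) / total
        for t in range(n_steps)
    ]


# Hypothetical GMST anomalies (K) at three time steps; not model output.
gmst = {
    "model-a1": [0.0, 0.2, 0.4],  # family A
    "model-a2": [0.0, 0.2, 0.4],  # family A (closely related to model-a1)
    "model-b1": [0.0, 0.1, 0.2],  # family B
}

simple = weighted_mean_series(gmst, {m: 1.0 for m in gmst})
family = weighted_mean_series(
    gmst, {"model-a1": 0.25, "model-a2": 0.25, "model-b1": 0.5}
)
print("simple mean:         ", [round(x, 3) for x in simple])
print("family-weighted mean:", [round(x, 3) for x in family])
```

Because the two family A models are near duplicates, the simple mean is pulled toward their warmer trajectory, while the weighted mean treats the two families equally.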

Discussion and Conclusions
We mapped the code genealogy of 167 models in and related to CMIP3, CMIP5, and CMIP6 with a focus on the atmospheric component and the atmospheric physics. We showed that all models can be grouped into 14 model families based on code inheritance, although large amounts of code may have been replaced in some models, and therefore they are only weakly related to other models in the same family. In addition, we mapped the institute and country of origin of the models. Some model families, such as CCM, ECMWF, and HadAM, are particularly large. The CCM-derived models were extensively forked internationally, most likely due to the open availability of the code. The IFS/ARPEGE (licensed) code was the basis for many European models. The HadGEM code was shared internationally within a consortium. Together, these three large model families dominate CMIP6, accounting for 70% of all model runs, an increase from about 50% represented by the three largest model families in CMIP3 and CMIP5. Based on the code genealogy, we developed an ancestry weighting method, the aim of which was to weight code-related models more fairly than a simple multi-model mean, thus mitigating structural model dependence effects in MMEs. We showed that when applied to CMIP5 and CMIP6, the ancestry and family weighting produced substantial differences in the climate feedbacks, sensitivity, and forcing, especially the cloud feedbacks (total, shortwave, and longwave), ECS, and ERF 2x, relative to the difference in simple mean between CMIP6 and CMIP5 and relative to the standard deviation of the quantities in CMIP5 and CMIP6. The ancestry and family weighting methods produce very similar results. The ancestry and family weighting seem to be able to explain some of the difference between CMIP6 and CMIP5 (about 40% RMSD reduction in climate feedbacks, and about 60% RMSD reduction in ECS under the ancestry weighting). 
This suggests that increased contributions from many code-related models in CMIP6 compared to CMIP5 were able to substantially affect the simple multi-model mean. Applying these methods to analyze climate feedbacks, sensitivity, and forcing by model family revealed that models in some families gave narrowly similar results (HadAM and UCLA GCM), and others in some cases had a relatively wide spread but consistently above- or below-mean values (ECMWF and CCM). This suggests that code similarity in some cases translates to similarities in climate properties, but in other cases there is a large spread despite model similarity. Lastly, we analyzed GMST time series in four CMIP6 and CMIP5 experiments, and showed that models in some larger families (HadAM, and in some cases ECMWF) have similar GMST. The family and ancestry weighting showed very similar results: more warming than the simple mean (and closer to observations) in the CMIP6 historical experiment and less warming in the CMIP6 1pctCO2 and abrupt-4 × CO2 experiments. This suggests that these methods can partially balance the effect of the over-representation of model families with multiple similar models, like HadAM. Model families tend to exhibit tendencies toward greater or lower warming than the MME mean in response to increased CO2 across the CMIP generations.
A limitation of our method of weighting based on model families or model code genealogy is that we have not quantified model similarity in other ways than through inheritance. We did not make an attempt to quantify model code independence from their parent models, because there is not enough publicly available information on the source code. Even if the source code were available, an objective quantification of code independence would require a sophisticated new method of code analysis. Some models have code bases which are more independent from their parent models than others. As a result, some model families might have members which are almost code-independent from the rest of the family. For example, it is possible that models which are related in the genealogy diverged enough from their ancestral models that it would be warranted to classify them as a separate family. This means that some models can be unjustly underweighted because they are grouped together with models to which they do not bear much resemblance or which were developed with a different purpose in mind (discussed below). Overcoming this limitation would be a relatively difficult task. While it might be possible to investigate individual schemes and components in models to partially quantify the statistical distances between related models, it would be difficult to do so objectively. Such information is also unlikely to be available for all the CMIP participating models. Another possibility would be to analyze the code of models to quantify their similarity. A method of accurately quantifying similarity would necessitate analyzing large code bases, distinguishing scientific calculations from technical code, accounting for the fact that small changes in code can produce large differences in model results, and accounting for model runtime configuration. Emerging methods of code analysis based on deep artificial neural networks (DANNs) have the potential to be used for this task. 
DANN-based tools such as OpenAI Codex (Chen et al., 2021;OpenAI, 2023), GitHub Copilot (GitHub, 2023), and DeepMind AlphaCode (DeepMind, 2023) have been developed to translate natural text to computer code. This approach has the potential to be adapted to quantifying code similarity. However, regardless of the availability of such methods, access to the model code would be necessary. This is a substantial hurdle given that most model code is closed-source. Apart from this, the source code of older models (dating back several decades) might not be readily available even to the current modeling groups, or even preserved at all. In summary, users of our model code genealogy should be mindful that the proposed weighting methods are only a "first-order" approximation of model similarity, and they should make an educated choice when selecting models for an analysis or deciding which models to include in a model family for the purpose of weighting.
Structural dependence between code-related models is sometimes reduced by diverging purposes of models. We did not make an attempt to quantify this because of limitations similar to those mentioned above. The purpose of a model, such as a geographical, process, or quantity focus, is only rarely explicitly stated and it would be difficult to objectively quantify this divergence. In such a case, the family and ancestry weighting can give too little weight to those models in the same family or branch of the code genealogy which are substantially different from the rest of the models due to their purpose. One way in which models are divergent within the same family or branch is their complexity in terms of being an AGCM, AOGCM, or ESM (Figure 2). It can be expected that ESMs are substantially different from a related AOGCM due to the inclusion of the carbon cycle, vegetation, atmospheric chemistry, biochemistry, and other processes. Similarly, AGCMs, even though rarely participating in CMIP as standalone models, are expected to differ substantially from related AOGCMs because they do not contain a prognostic ocean component. One way of accounting for this would be to analyze AOGCMs and ESMs separately. For example, Meehl et al. (2020) note that emissions feedbacks included in the ESM GFDL-ESM4 (Dunne et al., 2020) reduce ECS compared to its parent AOGCM GFDL-CM4 (Held et al., 2019); GFDL-ESM4 has ECS 2.6 K and GFDL-CM4 has ECS 3.9 K. In summary, the focus solely on model code inheritance as presented here does not account for this context, introducing limitations to our weighting methods.
To put our results into a broader perspective, we do not argue against the use of simple multi-model means, or model output and performance weighting methods in general, but see the presented weighting methods as complementary to the established methods. Simple means will likely continue to represent a useful default option (as used, e.g., in parts of AR6), but other weighting methods may be increasingly important due to model duplication in MMEs. It is possible that weighting methods based on model structure can capture these interdependencies better than methods based on model output. We suggest the family weighting, or a similar technique based on selecting a number of "independent" model branches from the model code genealogy, as a useful and easily implemented method of weighting for MME studies, especially if there is an expectation that model duplication is affecting the results.
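As a minimal sketch of the family weighting suggested above, the following Python snippet gives every model family the same total weight and, as an assumption made here for illustration, splits that weight equally among the members of the family. The function name `family_weights`, the model names, and the ECS values are hypothetical; only the family names are taken from the text.

```python
from collections import Counter


def family_weights(family_of):
    """Per-model weights with equal total weight for every family.

    family_of maps a model name to its family name. Within a family,
    the weight is split equally (an assumption of this sketch).
    """
    sizes = Counter(family_of.values())
    n_families = len(sizes)
    return {m: 1.0 / (n_families * sizes[f]) for m, f in family_of.items()}


# Hypothetical models and ECS values (K); placeholders, not CMIP data.
family_of = {
    "model-a1": "HadAM", "model-a2": "HadAM", "model-a3": "HadAM",
    "model-b1": "CCM",
    "model-c1": "INM",
}
ecs = {"model-a1": 5.0, "model-a2": 5.2, "model-a3": 5.4,
       "model-b1": 3.0, "model-c1": 2.0}

weights = family_weights(family_of)
simple_mean = sum(ecs.values()) / len(ecs)
family_mean = sum(weights[m] * ecs[m] for m in ecs)

print(f"simple mean ECS = {simple_mean:.2f} K")
print(f"family-weighted mean ECS = {family_mean:.2f} K")
```

In this toy ensemble the three related high-ECS models dominate the simple mean, while under family weighting they jointly count as much as each of the single-member families.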
The presented model code genealogy (Figure 2) can be further extended as more models become available in future CMIP phases. We provide the Scalable Vector Graphics source of this figure so that it can be extended in the future, and all related code and data are referenced in the Data Availability Statement below and available under an open source license.
Our results can facilitate MME assessments, which depend on the knowledge of model code relations. They provide a complementary approach to the model output dependence methods presented in previous studies. We have shown that, as expected, code-related models tend to have related climate characteristics, which may help to explain some of the difference between CMIP5 and CMIP6. Certain model families stand out in terms of ECS or climate feedbacks, which can help in understanding model differences. This is especially important given that the model spread in ECS and some climate feedbacks has increased in CMIP6 relative to CMIP5. A useful method of accounting for dependencies among models is weighting model families equally, which has the benefit of being simpler to achieve than ancestry weighting. This can be readily employed in MME assessments if a fairer model weighting is desired.

Data Availability Statement
Our data processing and visualization code, as well as the associated data, are available publicly on GitHub (Kuma, 2022a) and Zenodo (Kuma, 2022b). The version used in our analysis is 1.  (2023b)]. CMIP5 and CMIP6 model output is publicly available on the Earth System Grid Federation websites (CMIP5, 2022;CMIP6, 2022). The input data for model ECS and climate feedbacks are available publicly (Zelinka, 2022). The HadCRUT5 data are available publicly (Met Office Hadley Centre, 2022). Our code was developed in Python version 3.9.2 (Python Software Foundation, 2023) on Devuan GNU/Linux version 4 (Devuan project authors, 2023). The following Python packages were used directly in our code: ds-format version 3.5.1, matplotlib version 3.7.1 (Hunter, 2007), numpy version 1.22.1 (Harris et al., 2020), pandas version 1.4.3 (McKinney, 2010), pst version 2.0.0, pymc3 version 3.11.5 (Patil et al., 2010), and scipy version 1.7.3, obtained from the Python Package Index (Python community, 2023). Figure 2 was made in Inkscape version 1.0.2 (Inkscape project authors, 2023). All of the listed software is available publicly under open source licenses.