ADDRESSING POLYMORPHISM IN LINGUISTIC PHYLOGENETICS

Understanding how languages change is important not only for the reconstruction of protolanguages and for estimating diversiﬁcation dates (i.e. the dates when languages split), butalsofortheinferenceofevolutionarytrees(orphylogeneticnetworks)oflanguagefamilies. Weproposeaparametricmodeloflanguagechangethataddresseslexicalpolymorphism(two ormorewordsforagivenbasicmeaning)basedonwhatisknownabouthowlanguages change.Underourmodel,changesofstateinlexicalcharactersoccur only duetosemanticshift orborrowing,leadingto(potentiallybrief)periodsinwhichpolymorphismispresent.Acrossa widerangeofmodelconditions,weﬁndthatasimpleandnaturalmodiﬁcationtothe maximumparsimony(MP)criterion(whichseeksthetreewiththefewestnumberofchanges) toallowittohandlepolymorphiccharactershasthebestaccuracy,substantiallyimprovingon well-knownBayesianmethodsbasedonappearancesanddisappearancesofwords.Wealso provideanewanalysisofIndo – European that takes polymorphism into account, ﬁnding support for a previous tree (Nakhleh et al., 2006) and a new tree that diﬀers from the previous tree in the relationship between Italo-Celtic and Tocharian. R ´ ESUM ´ E


INTRODUCTION
Linguistic phylogenetics is the study of how a group of languages evolve from a common ancestral language.Linguistic phylogenies (i.e.evolutionary trees and/or phylogenetic networks) are estimated for different groups of languages using a variety of methods and based on a number of different types of linguistic features (Dunn et al. 2005;Nichols & Warnow 2008;Calude & Verkerk 2016;Goldstein 2020Goldstein , 2022b)).These phylogenies are then used in subsequent analyses, such as reconstruction of proto-languages and estimates of dates for diversification of languages, which can then be used to answer anthropological questions, such as the geographical and temporal origins of Indo-European (e.g.Forster & Renfrew 2006, Gray et al. 2009, Chang et al. 2015, Haak et al. 2015, Olander 2022, Heggarty et al. 2023) and the diversification of the Romance family (Goldstein 2023).Thus, linguistic phylogenies have ramifications within linguistics as well as for disciplines outside of linguistics, and the estimation of these phylogenies is an important component of these analyses (Goldstein 2022a).
However, phylogenies for language families, and even for Indo-European (perhaps the most well-studied language family), are controversial, and phylogenies can differ as a function of choice of method (e.g.parsimony, distance-based methods, or Bayesian methods based on parametric models), choice of data (e.g.whether the data are restricted to lexical characters only or also include other types of data), how the linguistic data are represented mathematically as character data for input to the method, etc. (Ringe et al. 2002;Rexová et al. 2003;Nakhleh et al. 2005b;Goldstein 2020).Thus, developing methods with improved accuracy is important, and has the potential to advance discovery in many disciplines.
Over the years, models of linguistic character evolution have been proposed and then used for phylogenetic tree estimation.One of the main issues that has arisen is POLYMORPHISM, which occurs when a character exhibits multiple states within a language.For example, in English big and large simultaneously fill the semantic slot for 'large'.Thus, the lexical character for the semantic slot 'large' would necessarily exhibit two cognate classes (thus two states) for English.Our evaluation of polymorphism of lexical characters in Indo-European (Appendix D) suggests that 67% of these characters exhibit polymorphism, indicating that understanding polymorphism, modelling how it arises, and then using these models in inferring evolutionary trees is of considerable importance.Since one mechanism for polymorphism is lateral transfer (i.e.borrowing), stochastic models that incorporate lateral transfer can be used to generate polymorphic characters.However, most polymorphism is likely to arise by other mechanisms, and notably by semantic shift, as we explore in this study.
It is important to note several points about polymorphism.Firstly, polymorphism is not synonymous with any other term in linguistics or phylogenetics (also referred to as 'cladistics' in linguistics).While it is true that polymorphism in lexical characters is caused by synonymy (as exemplified above), inflectional characters can also be polymorphic.Secondly, although perfect synonyms are rare in natural human languages, this is of limited relevance in basic wordlist comparisons.For instance, while it is true that English little and small cannot both be used in exactly the same range of circumstances, in describing the relative size of an inanimate object-the basic meaning of the wordsthey are completely interchangeable, causing polymorphism.Thirdly, because such cases are not rare, any attempt to reduce all polymorphic characters in a dataset by simple suppression of polymorphic states is necessarily unrealistic, and the results obtained could be unrealistic as well.This is especially the case in dealing with languages of the distant past attested in a limited range of texts composed by relatively few authors.Finally, total absence of a word for a basic concept is comparatively rare, and it is always possible to assign a unique state in such a case.frequency of lexical polymorphism in language families more generally, the potential impact of this study is much broader.Our study also suggests opportunities for linguists in modelling language change using what linguists have established about language change, rather than adapting models from biology, and proposes a new line of research that focuses on modelling linguistic polymorphism.

BACKGROUND
This section describes basic terminology related to phylogeny estimation and linguistic character evolution, and also surveys the prior literature for phylogeny estimation from linguistic characters, and proposed models for polymorphism in languages.

Phylogenetic trees and networks
A LINGUISTIC PHYLOGENY is a rooted tree (typically assumed to be fully resolved, or binary) with each leaf labelled by a different language.This graphical model is based on the Stammbaum model of language evolution, wherein a split in a lineage occurs when speakers of a language move apart, producing two distinct languages that evolve independently.
Variations of this simplified model of language evolution have been proposed to more faithfully reflect linguistic evolution.For example, a PHYLOGENETIC NETWORK may be proposed where in addition to the tree edges, there is a set of contact edges (typically bi-directional) that reflect exchange between languages at the endpoints of the contact edge (Nakhleh et al. 2005a).In such a model, the underlying phylogenetic tree may still be discernible.However, the more contact edges there are, the closer the language family resembles a dialect continuumwhich is what happens when the languages are very closely related (e.g.dialects of the same language).When the group of languages is a dialect continuum, the concept of an underlying tree becomes less meaningful, and the model may no longer reasonably be considered a phylogeny.

Linguistic characters
The inference of these phylogenies (whether trees or phylogenetic networks) depends on the use of characters (i.e.properties that can attain many states).Characters are typically classified as phonological, morphological, or lexical.For example, lexical characters are often based on semantic slots, and the words used for that semantic slot are then used to define the states for the character.Here, two words are considered equivalent if they are cognates, so that the total number of states for a given semantic slot is the number of cognate classes exhibited by the set of languages.Lexical characters are thus frequently MULTI-STATE, since there can be more than two cognate classes in a given group of languages.Morphological characters are also frequently multi-state, but phonological characters (which are defined for the presence/absence of a sound change) are binary.
A character is POLYMORPHIC if it can exhibit multiple states simultaneously in a language; this is most common with lexical characters, where multiple words often exist for the same semantic slot.Characters that have only a single state in any language are called MONOMORPHIC.

Character evolution
Characters evolve down a tree or phylogenetic network under a model of evolution that describes how each character state changes into other character states.A common assumption is that the characters evolve independently of the other characters, and that they draw substitution rates from a distribution.Under a RATES-ACROSS-SITES model, if a character draws a rate c and another character has rate c 0 ¼ 2c, then the interpretation is that the second character evolves twice as fast as the first character on every edge in the tree.However, characters may not evolve under a rates-across-sites assumption if the rate of evolution depends on both the edge and the character; for example, there could be two characters where the first is faster than the second on one edge, but perhaps slower than the second on another edge.This kind of rate variation, which depends on both the edge and the character, is called HETEROTACHY (Philippe & Lopez 2001).These concepts are explained in greater length in Barbanc ¸on et al. (2013), and also in Warnow (1997).
Another concept that appears frequently is the LEXICAL CLOCK, which means that the rate of evolution is proportional to time.When evolution obeys the lexical clock, the distance from the root to each leaf is the same for all leaves; this means the tree is ULTRAMETRIC.Note that ultrametric trees can occur even without a lexical clock hypothesis.The UPGMA (Sokal & Michener 1958) method for tree estimation is designed for datasets when the lexical clock holds.

Homoplasy-free characters and perfect phylogenies
In some approaches, it is assumed that when a character changes state, the new state has not been seen before on the tree.This is called homoplasy-free evolution, as homoplasy is the phenomenon of back mutation or parallel evolution.If a character is homoplasy-free on a particular tree, it is said to be COMPATIBLE; and if all characters are compatible, the tree is called a PERFECT PHYLOGENY.

Inputs to phylogeny estimation methods
The input to a phylogeny estimation method is a matrix where columns correspond to the languages and rows correspond to the characters, and the entries are the numeric states for the characters.Many characters are naturally binary (i.e. the presence/absence of a sound change) but others can have three or more states.This is true for morphological characters, and especially true for lexical characters, since the number of cognate classes exhibited for a single semantic slot (basic meaning) can be very large across a given language family.An example of a character coding for the semantic slot 'male person' is given in Table 1 for five ancient Indo-European languages.
In this character, three cognate classes are represented (denoted by states 1, 2, and 3), and in two languages (OCS and OHG) two of these classes appear.Hence, these two languages are polymorphic for this character.Some methods, including the popular MP method (e.g.Ringe et al. (2002)), operate by using the original multi-state character matrix.However, it has become increasingly common to use a BINARY-ENCODED version of the input data, which allows the subsequent use of statistical phylogeny estimation methods.Given the common usage of this type of approach, we describe it in some detail in the following section.

Binary-encoding of multi-state characters
The BINARY-ENCODING approach was popularised in Gray & Atkinson (2003), and is now used in many studies.Note however that the representation of the multi-state character data in a binary matrix can be followed by any statistical estimation method that assumes binary characters.
The binary encoding of a single multi-state character operates as follows.For each state that the character exhibits in the given dataset, a new character is created, and the new character is simply coded as present or absent for each language in the input.The most common use of this approach is for lexical characters, i.e. semantic slots, so that there is a separate character for each hsemantic slot, cognate classi pair.The state for a given language is coded as a 1 if the language exhibits that character's cognate class in its semantic slot and a 0 otherwise.Table 2 illustrates a binary coding for the character presented in Table 1: Clearly, the binary coding of this semantic slot produces three characters, one for each cognate class.Note that because 1 indicates the appearance of a member of a specific cognate set in a specific meaning, 0 can either mean that the cognate set does not exist in the language or that the cognate set exists but not for the particular meaning.

Phylogeny estimation methods
There are many methods that have been developed for phylogeny estimation for languages, including distance-based methods, parsimony-and compatibility-based methods, and Bayesian estimation methods; see Nakhleh et al. (2002Nakhleh et al. ( , 2005b)), Nichols & Warnow (2008), Goldstein (2020) for more information about these methods for use in linguistic phylogenetics.
Given either the original or the modified input matrix, the following methods are among the most commonly used to estimate trees: • MP, which takes the input multi-state characters and tries to find a tree that has the smallest total number of state changes.• MAXIMUM COMPATIBILITY, which takes the input multi-state characters and tries to find the tree with the smallest number of characters that are not homoplasy-free (i.e. they require either back mutation or parallel evolution on the tree).• DISTANCE-BASED ESTIMATION, which computes a distance matrix between the languages based on the input characters, and then computes a tree on that distance matrix.Different  Felsenstein (1992).Subsequent approaches estimated trees from the character data under more complex models, including the SD model (Nicholls & Gray 2008) and its extension to allow for borrowing (Kelly & Nicholls 2017).The SD model primarily differs from GA2003 in that it imposes the Dollo condition: a gain (absence to presence) is permitted to occur only once for a character, but losses (presence to absence) may occur many times.

Stochastic models of polymorphism
While there has been a fair amount of research in linguistic phylogenetics, there is surprisingly little work investigating how to best model polymorphism, nor to estimate trees in the presence of polymorphism.However, Nicholls & Gray (2008) describe a stochastic model of character evolution in which each language can be seen as a collection of words, and where the words for the languages evolve down a tree under a birth-death model; each word is born at most once but can die independently on different time points (this is known as the Dollo principle).By treating the set of words as all belonging to a single semantic slot, this amounts to a stochastic model of lexical change that explicitly allows for polymorphism.In addition, although not explicitly modelled as such, the binary character models in Kelly & Nicholls (2017) and Gray & Atkinson (2003) can be used to define stochastic models of polymorphism, as we now describe.Recall that these models replace a multi-state character (which is defined for a given semantic slot) by a set of binary characters (for the presence/absence of a given cognate class).Hence, when considering the origin of the different binary characters (each derived from a multi-state character), they can be used to imply a model of polymorphic character evolution.In these models, the evolutionary processes of the different cognate classes for a given semantic slot are independent, so that birth and death of any such cognate (word) does not affect the other words for the same semantic slot.This allows a language to accumulate any number of words for the same semantic slot, thus producing polymorphism within that language for the semantic slot.Moreover, if a node in the tree (including the leaves) exhibits 0 (absence) for all the associated binary characters associated with the given semantic slot, this indicates that the language has no cognates for that basic meaning.
In other words, under these models, although the evolutionary process is described for only the binary characters, by associating each binary character with its original multi-state character, it becomes possible to interpret this as a stochastic model for polymorphism.However, because the evolutionary process assumes i.i.d.(identically and independently distributed) character evolution across the binary characters (which are each associated with a single cognate class), there is no selective pressure that impacts the number of states that are exhibited for a given semantic slot.Moreover, there is also non-zero probability of having no cognate classes for a given semantic slot in a language.As we argue below, these two aspects of these stochastic models are problematic for modelling real-world linguistic character evolution (see Section 3.1).

Previous attempts to estimate trees in the presence of polymorphism
The approach to polymorphism described in Bonet et al. (1999) is an extension of the PERFECT PHYLOGENY model (i.e.homoplasy-free evolution) to allow for polymorphism.In brief, the number of states any given linguistic character can have at an internal node is bounded by a value that is specific to that character: monomorphic characters would have the bound be 1, but polymorphic characters would have higher bounds.The idea of the model is that each language may be able to tolerate a certain amount of polymorphism, but not an unbounded amount.Given the bounds and an input matrix of multi-state characters, a tree is then sought for which all characters evolve without homoplasy, but within the allowed polymorphism bounds.This is a non-parametric approach to estimating trees in the presence of polymorphism.
Since there is no underlying generative model implied by the approach of Bonet et al. (1999), likelihood-based approaches are not possible.In addition, this non-parametric approach does not explicitly model how polymorphism arises.
We earlier described how an input multi-state character matrix can be replaced by a binary character matrix (Table 2), thus allowing statistical methods that require binary characters to be used to estimate trees.Here we show how this technique can be used to estimate trees when the input matrix is polymorphic.
Consider the case where a language is polymorphic for a semantic slot (i.e. it has two words that are not cognate), so that it has two or more states for that semantic slot.The binary encoding would represent such a polymorphism in the same basic way: splitting the multi-state characters into as many binary characters as there are states, it would then allow the language exhibiting polymorphism to have state 1 for the individual binary characters corresponding to the states it exhibits.Thus, the binary encoding technique naturally enables the same set of phylogeny estimation methods to be used to estimate trees, even in the presence of polymorphism.We also note that the method described in Nicholls & Gray (2008) could be used to construct trees from these binary-encoded characters.

OUR PROPOSED MODEL FOR POLYMORPHISM
We have developed a new model for linguistic character evolution, motivated by the observation of lexical polymorphism.We begin with a discussion of lexical polymorphism and how it arises, and then define the model more formally.A comparison between this model and prior models is provided in Appendix B.

Lexical polymorphism
Lexical polymorphism is primarily exhibited by the co-existence of synonym words in a particular semantic slot (e.g.big and large both filling 'large').The main cause of polymorphism where a language has two or more words for the same semantic slot is SEMANTIC SHIFT: the process by which one word shifts in meaning to fill a different semantic slot.The result is that more than one word co-exists in the same language for the same meaning.1When multiple words exist in the same meaning, there can be pressure for one of them to become dominant, which often results in the other words becoming archaic in that meaning.
Consider the example in Figure 1, which depicts semantic shift between the characters 'human being' and 'male person' in French and English: Figure 1a shows that in Classical Latin homō meant 'human being' and vir meant 'man'.In late western Latin homō began to compete with vir, which was lost before any of the Romance languages is attested; French borrowed the adjective humain from written Latin in the 12th century, but it began to be used as a noun only relatively recently (and only in the plural).Figure 1b shows a similar development in English: there was competition between mann and wer only during the 13th century; the French loanword human first appears as a noun around 1500, but has become the preferred word only recently.
In addition to semantic shift, borrowing can also produce polymorphism: when two languages come into contact, some lexical entries may be transmitted between them.This results in new words occurring alongside existing words within the same meaning.Figure 1b depicts this in English: following the Norman Conquest of the British Isles, English borrowed the word human from French.It slowly replaced Old English mann as the primary word for 'human being', resulting in only sparse uses of man referring to humans in general in Modern English.
These examples also illustrate the phenomenon that polymorphism is often short-lived: when multiple words begin to co-exist within one semantic slot, there is pressure for one to become dominant or for some to drop out.The reverse is also true: a single word should not have too many meanings.For example, the word 'man' could presumably mean 'male person' and 'human', or 'male person' and 'husband', but not all three.These examples show that lexical polymorphism is bounded, and any model aiming to reflect polymorphism should consider how to incorporate these constraints.

Model formulation
The model we present builds on Warnow et al. (2006) by permitting borrowing and homoplasy in a similar fashion, but also includes polymorphism.
We assume that there is an underlying linguistic phylogenetic tree or PHYLOGENETIC NETWORK (rooted, binary tree with contact edges) that defines the relationships between the languages.Each character is represented by an independent continuous-time birth-death process, and starts with a single cognate class at the root of the network.As time proceeds, each character will gain and lose cognate classes according to its stochastic process.We assume that the gain or loss of a cognate class depends only on the number of cognate classes existing within that character at a particular time.
In technical terms, the birth-death process for character c is defined by birth rates , where i is the number of cognate classes at a given time.The rate matrix, which defines how the number of cognate classes changes over an infinitesimal time period (from i to j classes), is given by for i, j > 0. This is the standard formulation of a birth-death process (Crawford & Suchard 2012;Crawford et al. 2018).
In order to prevent unbounded growth in the size of a character (see Section 3.1 for the linguistic motivation), we set λ . This is also known as an immigration-death process.Essentially, the birth rate is constant, and the death rate increases linearly with the number of cognate classes for the character.
We enforce the constraint that μ When transmission occurs, a single state is chosen at random from the set of cognate classes from the source language and added into the target's language set of states.The character's birth-death process continues as normal following transmission, using the updated set of states.
Figure 2 depicts an example of the evolutionary process for a particular character on two lineages: On the left lineage, the new state 2 is spawned prior to the contact edge; the right lineage shows that state 3 exists up to the contact edge.At the contact time, transmission occurs from left to right, and state 1 is transferred.Following this event, no change is observed in the left lineage; in the right lineage, state 3 dies out.As a result of the process, three cognate classes are observed for this character in the left language, while only one is observed in the right.

Linguistic considerations of the model
Recall from Section 3.1 that polymorphism primarily arises due to semantic shift and borrowing.It is clear that borrowing is incorporated directly into the model, but it may not be as clear how semantic shift is handled.
Consider the example of semantic shift from Figure 1b in English.We describe how we can code the characters 'human being' and 'male person' in various stages of English.Our model for polymorphism would interpret the change in these two characters as having an intermediate step, where both the old and new state co-exist for some amount of time.The new state could either have arisen due to borrowing (as in the case of human, borrowed from French) or through a BIRTH (as in the case of man).This is demonstrated in Table 3.
However, because each character is treated independently, the coding of states across characters need not be consistent.In this case, for example, the cognate class man exists in both characters, but is coded differently in each of them.Therefore, semantic shift is not explicitly modelled by our approach.Nonetheless, the birth of the cognate class 'man' in the character 'male person' can still be interpreted as semantic shift: either the new state is a true innovation (never having existed previously in any character) or it is a reinterpretation of another character (as in this case).We argue that the distinction is unimportant and does not need to be reflected explicitly in the model: as long as semantic slots are kept distinct, the phenomenon of semantic shift is implicit in the stochastic process.We consider how the proposed model prevents unbounded polymorphism (see Section 3.1).Since a character's death rate grows linearly with the number of states (i.e. , where i is the number of states currently exhibited by the character), the model discourages an unbounded number of cognate classes within a particular meaning.However, our model does not discourage a word from occurring in a large number of meanings.Such a constraint would break the independence of character evolution and considerably complicate the model; we leave such investigations for future work.
Finally, we emphasise that all language change in this model is due to a (however transient) period of polymorphism.In other words, character state substitutions are not allowed, as is the usual case in prior work.In particular, for a language to exhibit state 1 at time t 1 and state 2 at time t 2 , a period of polymorphism would have had to happen between t 1 and t 2 , in which both states were present.Note that we can effectively recover a substitution model by setting the death rate μ to ∞; this ensures that a death happens immediately after a birth, leaving no intermediate period of polymorphism.

Design of study
To evaluate the accuracy of different phylogeny estimation methods, we perform a simulation study; this is the standard approach in biological phylogenetics, and has also been used in linguistic phylogenetics in other studies.In a simulation study, a model phylogenetic tree or phylogenetic network for a set of languages is created (with the languages at the leaves), and then characters are evolved down the network under a given stochastic model of evolution.The sequences of character states for each leaf are then given as input to a phylogeny estimation method, which returns a tree on the sequences.Because the model tree or network that generated the sequences is known, the phylogeny estimation method can be evaluated for accuracy on the simulated datasets.We use this basic methodology in our study, as we now describe.
We implement a simulator for this model of evolution (available in open source in Github (Canby et al. 2024), see Appendix C for additional details).We use the new simulator to simulate datasets with polymorphic characters down phylogenetic trees and networks (with 1-3 borrowing edges), using numeric parameters for the model networks based on the 2024 Ringe & Taylor dataset, as described in the Github site (Canby et al. 2024) and below.
For each dataset we generated, we ran several phylogeny estimation methods, including distance-based methods, several variants of MP, the GA2003 Bayesian method as described in Gray & Atkinson (2003), and estimation under the SD model.For each estimated tree, we compare the estimated tree to the true genetic tree (known since we are simulating evolution) and report the topological error.
Since limitations of space preclude detailed discussion of the simulation and the Indo-European data on which the simulation parameters were selected, readers seeking additional details should consult Appendices C and D, as well as the Github site (Canby et al. 2024).Note that the simulation produces a matrix of character states with the statistical properties of 'human being' 0 man 0 man, 1 human 1 human 'male person' 0 wer 0 wer, 1 man 1 man the matrix for the unscreened 2024 Ringe & Taylor dataset, but notably the simulation does not have any missing data (i.e.all languages have at least one state for all characters).That should be kept in mind for a proper understanding of this section's discussion.

Simulated data
Following the simulation protocol of Barbanc ¸on et al. ( 2013), we simulate 32 random trees with 30 languages each, using a Yule model with constant speciation and extinction rates.For the network experiments, we successively add borrowing edges to each of these trees, producing datasets with 1, 2, or 3 edges.For each of the five levels of polymorphism (low, moderate, moderately high, high, and very high), we simulate 4 replicates down each tree (or network), using the model for polymorphism described in Section 3.2.This results in a total of 128 simulated datasets for each of these 20 configurations, based on 5 levels of polymorphism and 4 levels of borrowing (0, 1, 2, or 3 contact edges).These datasets are available in the Github site for this paper (Canby et al. 2024).
Each dataset contains 20 morphological and 300 lexical characters, each of which has moderate levels of heterotachy and deviation from lexical clock.Recall that in addition to exploring method accuracy using simulated data, we also aim to reconstruct a phylogeny for the 2024 Ringe & Taylor dataset; therefore, we set the numeric parameters in our simulation study to reproduce the features of this dataset, and then vary some of the numeric parameters.In particular, we use the moderate setting of the homoplasy parameter used in the simulation study reported in Barbanc ¸on et al. (2013); this simulation was performed under the stochastic model of Warnow et al. (2006)  To understand the impact of this choice for the homoplasy parameter, note that the UNSCREENED version of the 2002 Ringe & Taylor dataset included characters where homoplasy was suspected, whereas the screened version (used in empirical studies, such as Nakhleh et al. (2005a)) removes these characters.Thus, prior to screening, the morphological characters have higher levels of homoplasy than the lexical characters, but after screening the results are reversed.As established in Barbanc ¸on et al. (2013), this screening improves MP methods more than it helps methods that are based on binary encodings.Hence, using parameters selected for the unscreened Indo-European dataset avoids biasing the study in favour of MP.However, the parameters of the simulation can of course be adjusted to approximate conditions of other datasets.
We set the five polymorphism levels, which vary from low to very high in terms of the percentage of characters allowed to be polymorphic.These levels are as follows: the low level has 16% of the characters allowed to be polymorphic, moderate has 33%, moderately high has 50%, high has 67%, and very high has 80%.The observed percentage of characters exhibiting polymorphism is slightly lower (by at most 3%) than the allowed level (see Table 4).Our analysis of the polymorphism level in the Indo-European dataset (see Appendix D) indicates that about 67% of the 333 lexical characters in the dataset exhibit polymorphism, and nearly all of this is due to semantic shift rather than borrowing.Thus, the Indo-European dataset falls at the high level for lexical polymorphism.
In summary, we set the parameters for the different types of linguistic characters as follows.The morphological characters are entirely non-polymorphic and are split such that 25% of them are homoplastic.The lexical characters are split evenly into slow, medium, and fast evolving characters.Each of these groups is then further split such that 13% are homoplastic.
Note therefore that the morphological characters exhibit a higher level of homoplasy than the lexical characters; as noted, this is based on properties of unscreened morphological characters, as discussed above.

Phylogenetic estimation methods
We evaluate the accuracy of phylogeny estimation methods in the presence of polymorphic data.Importantly, all methods we evaluate are unaware of the evolution model presented in Section 3.2.We compare variants of MP, two distance-based methods, the Gray & Atkinson approach from 2003 (GA2003), and the SD model as implemented in TraitLab (Nicholls et al. 2013).Of these, only GA2003 and SD perform estimation under an explicit parametric model of evolution; we emphasise that the models assumed by these methods are unrelated to the model of Section 3.2.We provide an overview of these methods here; additional information about the methods is provided in Appendix A. The commands and software version numbers are provided in the Github site (Canby et al. 2024).
For the distance-based analyses, we use two standard techniques: UPGMA (Sokal & Michener 1958; unweighted pair group method with arithmetic mean) and NJ (Saitou & Nei 1987).Since we have polymorphic data at the leaves, we can use notions of set similarity to define distances between two leaves (see Appendix A for details).We use PAUP* (Swofford 2002) to compute trees for UPGMA and NJ on the distance matrices we produce.
Unlike distance-based methods, in their classic formulation MP approaches are not naturally suited to handling polymorphic characters; hence, we have to modify the polymorphic input.Common practices are to ignore the polymorphic characters altogether, or to somehow replace the polymorphic states with a monomorphic state.
We experiment with four different variations of handling polymorphic characters, where the first variant removes all characters that are polymorphic, and the remaining three replace each single polymorphic character by a single monomorphic character.For example, one variant replaces each polymorphism in a given language for a given character by the state that appears the most often among the languages; thus, if a language L exhibits three states a, b, and c for a character that appear 3, 1, and 2 times, respectively, among the languages, then it replaces those three states by state a.The other MP variants use other rules for modifying the polymorphic characters, and are described in Appendix A. We use PAUP* (Swofford 2002) for the MP analyses.
We also run four methods that are based on binary-encoded versions of the multi-state characters (i.e. a binary matrix with columns for languages and rows for each state of each multi-state character, and entries indicating presence-absence of the states).The methods we run on this dataset are (1) standard (unweighted) MP, (2) Dollo MP, (3) GA2003, and (4) the SD model.For all MP analyses on binary characters, we again used PAUP*.In the following paragraphs, we briefly describe the models under which GA2003 and SD perform inference, as well as practical information on how we ran them.See Appendix A for more information on the remaining methods.
The model of Gray & Atkinson (2003) (GA2003) was one of the original models for linguistic character evolution.The model specifies a tree topology and branch lengths, as well as transition probabilities for 0 to 1 (from absence to presence) and 1 to 0 (from presence to absence); thus, repeated appearance and disappearance of a trait is allowed.The SD model (SD) is similar, with the primary difference being the imposition of the Dollo condition: the requirement that each trait may be gained only once (i.e.only a single 0 to 1 transition is permitted for a trait), but may be lost several times (i.e.many 1 to 0 transitions are allowed).
Bayesian methods have been developed to perform tree estimation under these models, which are based on MCMC (Markov Chain Monte Carlo).For GA2003, we use MrBayes (Huelsenbeck & Ronquist 2001) with 100,000 generations, where trees are sampled every 500 iterations.After the burn-in (100 samples), we are left with 100 trees from the posterior distribution.This is consistent with the settings used in Barbanc ¸on et al. (2013).For the SD analyses, we use TraitLab (Nicholls et al. 2013) with 200,000 generations and trees sampled every 50 iterations.This produces 4000 trees, of which we throw out half for the burn-in, leaving 2000 trees sampled from the posterior.

Evaluation
For those analyses that produce single trees that are fully resolved, we evaluate topological error using the ROBINSON-FOULDS (RF) error rate (Robinson & Foulds 1981), which can be defined as follows.Every edge in a tree defines a bipartition on the leaf set, and that the internal edges of the tree define bipartitions that are non-trivial (i.e.split the leaf-set into two sets, each containing at least two leaves).We compare the estimated tree to the true tree, and record the total number of non-trivial bipartitions that are present in one but not both trees.This number is the RF distance.We then divide that value by 2nÀ6 (the total number of internal edges between the two n-leaf unrooted binary trees), to obtain a value that is between 0 and 1, which is the RF ERROR RATE.
Not all inference methods return a single binary tree.MP and distance-based methods may find multiple optimal trees, and GA2003 and SD naturally return several trees as they provide a sample from the posterior distribution.For MP and distance-based methods, we compute the RF error rate for each returned tree separately and average the results per input dataset.
The GA2003 and SD analyses return a sample of the posterior distribution following the MCMC analysis.To evaluate the results, we can either report the mean error of the sampled trees (as we do for MP and distance-based methods) or take a point estimate from the distribution.For the point estimates, we experiment with several options: a majority consensus tree (i.e. the unrooted tree that has those bipartitions that appear in more than half of the sampled trees), a MAP tree (i.e. the maximum a posteriori tree, which is the tree topology that occurs most often in the sample), or the MCC (maximum clade credibility) tree.This is the tree that maximises the product of its clade scores, which are based on the proportion of times each clade occurs in the sample.Further discussion of methods to take point estimates from a posterior distribution can be found at O'Reilly & Donoghue (2018).
These point estimates may be incompletely resolved (i.e.some nodes may have degree greater than 3).Thus, when reporting results on consensus trees, we show the the missing branch (i.e.false negative, or FN) rate, which is the number of bipartitions that are in the true tree but not the estimated tree, divided by nÀ3, the total number of edges in an unrooted binary tree on n leaves.
We use custom scripts to compute the RF and FN error rates, making use of the PHANGORN (Schliep 2011) and TREEDIST (Smith 2020) libraries in R.

Experiments
We design a sequence of experiments for this study.In Experiment 1, we determine the best way to run each of the methods we explore, using just one model condition (the high polymorphism level without any borrowing).Using the results from Experiment 1, we evaluate these best ways of running each method under all five polymorphism conditions, with Experiment 2 examining performance without borrowing and Experiment 3 considering performance with 1-3 contact edges.Additional experiments, shown in Appendix E, examine these methods in greater depth (e.g. the difference between the four MP variants and how parsimony and Dollo Parsimony perform on binary-encodings of the multi-state data).

Results from the simulation study
Results for experiments 1-3 are presented in turn; we also show the full set of results for each method and model condition in Appendix E.

Experiment 1: Setting the best way to run each method
The goal of this experiment is to determine the best way to run each method, the results of which are then used in subsequent experiments.We make these judgements based on error rates for estimated trees on data simulated under the high polymorphism condition with no borrowing.The results for this experiment are summarised here, but see Appendix E for the full set of results.
We consider four different ways of running multi-state MP (described in Section 4.1), which either remove polymorphic characters altogether (MP 1) or replace each set of polymorphic states with a single monomorphic state (MP 2-4).We also consider standard MP and Dollo MP on the binary-encoded dataset.We find that MP 4, which replaces the polymorphic set of states in a language with the member state that occurs the most often across the languages for that character, consistently performs the best.
For UPGMA and NJ, we evaluate three different ways of defining the distance matrix (described in Section 4.1).As shown in Figure A2 in Appendix E, the choice of distance matrix formula has little impact on the final results, but the Jaccard distance has a slight advantage in general.Henceforth, we report results for both NJ and UPGMA using the Jaccard distance.
Gray & Atkinson (GA2003) and SD use MCMC to produce a sample of trees from the posterior distribution.Therefore, we explore four options for how to evaluate the accuracy of these methods (described in Section 4.1).Our results (Figure A3 in Appendix E) show a clear ordering among these approaches for both GA2003 and SD: average error is worst, followed by majority consensus.The MAP (maximum a posteriori) tree and the MCC (maximum clade credibility) tree are very close, but MAP is slightly better for SD and MCC is slightly better for GA2003.Going forward, we report results using MAP for Stochastic-Dollo and MCC for GA2003.2

Experiment 2: Performance when no borrowing occurs
In this section, we investigate which method is best when estimating trees under different polymorphism levels but when borrowing does not occur.
For all levels of polymorphism, distance-based methods perform the worst (with UPGMA worse than NJ, as expected), followed by the two Bayesian methods that use binary encodings (i.e.SD and GA2003, with GA2003 consistently better than SD), and then MP; however, under the high and very high polymorphism levels, MP and GA2003 are essentially tied (Figure 3).It is also worth noting that two methods based on binary encodings SD and GA2003 both markedly improve in accuracy with increases in polymorphism, while MP becomes slightly less accurate and the distance-based methods do not change.
Figure A4 in Appendix E shows that MP analyses and Dollo parsimony of the binary-encoded characters are very inaccurate: Dollo parsimony on the binary-encoded characters is close to SD but less accurate, with MP on binary-encoded data having error rates close to the distance-based methods.Thus, of all the methods that use binary encodings, GA2003 is the most accurate, but even this method is generally not as accurate as the variants we explore for MP on the input multi-state characters.

Experiment 3: Impact of borrowing on genetic tree estimation
In this section, we investigate how well the two best methods, GA2003 and MP (MP), estimate the underlying genetic tree in the presence of borrowing, using phylogenetic networks with up to three contact edges.
Figure 4 compares MP to GA2003 across conditions with up to three contact edges, and varying the polymorphism level.In general, MP is more accurate than GA2003, with very large differences under low to moderately high polymorphism, but with reduction in the advantage as polymorphism increases.However, under the two highest polymorphism levels they can be very close.Note that borrowing also impacts the relative performance, so that MP tends to maintain its advantage over GA2003 for all polymorphism levels when there are three contact edges.
Figure 3. Experiment 2: Comparison of methods across five different polymorphism conditions.We present average RF error rates for UPGMA, NJ, SD, GA2003, and MP methods, across five different polymorphism levels.Results for each polymorphism are averaged across 128 datasets, each with 30 languages; error bars show standard error.These results show that MP has the best accuracy of all methods across all model conditions, but GA2003 matches it under the two highest polymorphism levels

Discussion of the experimental study
We summarise the trends reported in the experimental study: • The relative performance between methods is fairly consistent, with MP best, GA2003 in second place, SD in third place, and the distance-based methods in last place.Moreover, there is a clear gap between MP and GA2003 except for the two highest levels of polymorphism, where they can be very close.However, GA2003 exhibits a noticeable reduction in accuracy as the number of contact edges increases, while MP is only slightly impacted by contact edges.• All tested methods that use a binary encoding of the data (MP used with binary-encoded data, SD, and GA2003) have poor accuracy compared to MP under the low to moderately high levels of polymorphism, but improve as the polymorphism levels increase.MP methods used with multi-state data rather than binary-encodings become only slightly worse with increases in polymorphism, and the distance-based methods do not change.• The level of borrowing also impacts methods differently, with MP of multi-state data having a larger advantage over GA2003 as borrowing increases.
The trend that the distance-based methods have the worst accuracy of all tested methods, followed by GA2003 and SD, and followed by the MP analyses of the multi-state characters is consistent with general trends in a prior study by Barbanc ¸on et al. (2013) comparing phylogenetic estimation methods (although under a different model that did not include polymorphism).Yet, of all the estimation methods we try, only the distance methods truly account for polymorphism on multi-state characters, since our MP approaches for multi-state characters replace every polymorphic character by a monomorphic character, and the other methods (GA2003, SD, Dollo MP, and MP on the binary encoding) are based on binary-encoded versions of the input multi-state characters, and so replace every multi-state character to a set of binary characters.
The basic observation that distance-based methods have poor accuracy compared to many other types of phylogenetic estimation methods is generally observed in many studies in  -d).We report the mean RF topological error (across 128 replicates) in the estimation of the underlying genetic tree using either MP or GA2003.Results are shown for all five polymorphism levels.MP generally produces better accuracy than GA2003 across the different conditions, but polymorphism level and borrowing both impact the gap between the methods; the one case where GA2003 is more accurate than MP is under the very high polymorphism level without any borrowing.
Error bars indicate standard error phylogenetics, and is unsurprising (Nakhleh et al. 2002), and the likely explanation is that there is some information loss in using distances.
To understand the remaining trends, we consider first why methods that are based on binary-encoded versions of datasets only have reasonable accuracy when polymorphism is at least high.Since this set includes GA2003 and SD, two well-established Bayesian methods, understanding this is important for informing future practice.However, this set also also includes MP variants based on binary-encodings of the characters, which (as we show in Appendix E Figure A4) are even less accurate than SD.
Although we do not have any definitive answer why methods that are based on binary character encodings do poorly except with high polymorphism levels, we note that the number of characters available to methods that use binary-encodings increases as the polymorphism level increases.Recalling that there are 320 initial characters (300 lexical and 20 other), at the low polymorphism level there are 773 binary characters, but this rises to 2374 binary characters at the moderately high polymorphism level, to 3172 at the high polymorphism level, and finally to 3727 at the very high polymorphism level (Table A1).Thus, as polymorphism increases, the methods that are based on binary-encodings have a larger number of binary characters to use in estimating phylogenetic trees.It seems likely, therefore, that whatever limitations methods have that are based on binary-encodings of multi-state data, these limitations are somewhat ameliorated by the large volume of data that becomes available when polymorphism levels are at least high, making the best of these methods potentially close to a simple MP analysis.
We now consider why methods, including GA2003, that are based on binary-encodings of data do poorly when polymorphism is at most moderately high.Considering the case where there is no polymorphism as an example, we note that prior studies have also shown that using binary-encoded versions of multi-state characters reduces accuracy.For example, Barbanc ¸on et al. ( 2013) compared GA2003 and MP using simulated datasets under a substitution-based model of character evolution that included homoplasy, heterotachy, and borrowing (Warnow et al. 2006); in that study, they consistently found GA2003 to be less accurate than MP under all tested model conditions, which is essentially the same as we find here.Similarly, the finding that MP on binary-encoded datasets is poor relative to MP on the original multi-state data is also not new, and was observed in Rexová et al. (2003) for analyses of Indo-European.Our study adds to this body of evidence that using binary-encodings of multi-state characters reduces accuracy, by showing that Dollo parsimony and unweighted parsimony both applied to binary-encoded data also exhibit poor accuracy on simulated datasets, compared to all MP variants we study on the multi-state data.
It is therefore interesting to ask why using binary-encoded data reduces accuracy, compared to analyses of the original multi-state data.Here, we summarise concerns raised in prior studies, including Nichols & Warnow (2008).Some of these risks have to do with using likelihood-based methods based on parametric models, which assume independent evolution of the characterssomething that is clearly violated when a single multi-state character is replaced by a set of binary characters.Other risks are relevant to parsimony analyses, and have to do with loss of topological information from each multi-state character when replaced by a set of binary characters: the multi-state characters can evolve without homoplasy and hence be compatible on the tree but the binary character encodings may become incompatible (Nichols & Warnow 2008;Warnow 2017).Thus, our observations that show that all binaryencoding analyses are less accurate than MP on the original multi-state characters, whether they are based on parsimony or likelihood, are consistent with the potential risks reported in these prior studies.
The concerns raised in these prior studies can be summarised as expressing MODEL MISSPECIFICATION between the assumed generative evolutionary model and the evolutionary process that actually generated the observed linguistic characters.Moreover, while no model is perfect and some can be helpful, model misspecification can lead to methods based on parametric models having higher error than non-parametric methods (see discussion in Holmes (2003) specifically about the choice between parametric and non-parametric methods in phylogenetics).This may be why the variants of MP we study, which are not parametric, have better accuracy than the Bayesian methods based on either the GA2003 or the SD model, since these are methods that are based on fully-parametric models that may have very substantial model misspecification.In sum, this study builds on a significant body of prior work showing clearly that analysis of binary-encoded versions of multi-state characters is problematic.

PHYLOGENETIC TREES FOR INDO-EUROPEAN DATASET
Given the generally superior accuracy and reliability of MP 4 on linguistically plausible simulated data compared to other methods, we now turn to the estimation of the Indo-European tree in the presence of polymorphism.To do this, we use the 2024 Ringe & Taylor data which has 370 characters (described in Appendix D, see https://tandy.cs.illinois.edu/histling.html).
Due to undeveloped methodology for handling polymorphism, earlier studies by Ringe and colleagues (notably Nakhleh et al. (2005a), but also Ringe et al. (2002), Erdem et al. (2006), Nakhleh et al. (2005b)) used various techniques to remove polymorphic characters from the dataset, including replacing some polymorphic characters by a set of monomorphic characters (the SPLIT CODING technique, which can only be used for certain types of polymorphism, see Ringe et al. (2002) for discussion and exemplification) or by leaving them out of the study altogether.For our analysis in this paper, we use the full dataset with all polymorphic characters.
The placement of Albanian is well-known to be difficult (e.g.Nakhleh et al. (2005a) finds that it can be placed equally well on several edges); hence, we remove it from the dataset when searching for trees.We then run MP 4, with certain phonological and morphological characters upweighted heavily, reflecting a belief that any plausible Indo-European tree should be compatible on those characters.Most notably, these characters enforce that the satem branches (Baltic, Slavic, Indic, and Iranian) form a clade; no trees that violate this condition are studied.This analysis leads us to two primary trees of consideration, shown in Figure 5.
The two trees differ only in their placement of Italo-Celtic; it breaks off after Tocharian in Tree 1, while it is a sibling of Tocharian in Tree 2. Interestingly, Tree 1 was found to be the most plausible tree in Nakhleh et al. (2005a) while Tree 2 was not even considered as a candidate.We note that without the inclusion of Albanian, Tree 1 has slightly better parsimony and compatibility scores than Tree 2. In the following paragraphs, we explore the characters that support the varying placements of Albanian and of Italo-Celtic.We say that a character SUPPORTS an edge if and only if the collapse of the edge would increase parsimony score.

The placement of Albanian
Albanian is notoriously hard to place on the Indo-European tree.Nakhleh et al. (2005a) finds that Albanian can place equally well on several edges in each of the trees they studied.We, on the other hand, only find two placements of Albanian that optimise our metrics, which are indicated with ① and ② in Figure 5. Optimising parsimony would place Albanian as a sibling of Tocharian, indicated with ①.The characters 'he' and 'they' support this placement, and 20 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 0, 2024 both of these are irreducibly polymorphic and hence were left out of previous studies.The non-polymorphic character 'tooth' also supports this placement in Tree 2, but not Tree 1.On the other hand, optimising compatibility would lead us to place Albanian as a sibling of the satem languages, shown by ②.The primary evidence for this is the character 'weave'; importantly, this is polymorphic in Albanian.While the evidence is weak in general for the placement of Albanian, it is clear that a method that handles polymorphism is necessary to resolve the matter.

The placement of Italo-Celtic
Perhaps, the most interesting discovery of our analysis is the placement of Italo-Celtic.Tree 1 places Italo-Celtic as breaking off immediately after Tocharian, while Tree 2 considers it a sibling of Tocharian.Without the inclusion of Albanian, the placement of Italo-Celtic in Tree 1 is slightly preferable by both parsimony and compatibility scores.This placement is supported by 'die', 'drink', 'many', and 'one'.Of these, 'die' and 'many' are irreducibly polymorphic (and so omitted from previous studies); however, their polymorphism is restricted to the Germanic branch.The only character that favours Italo-Celtic as a sibling of Tocharian (Tree 2) is 'here', which is not polymorphic.However, once Albanian is included as a sibling of Tocharian (i.e.along edge ①), Tree 1 and Tree 2 have the same parsimony score (though the compatibility score is still slightly worse for Tree 2).The characters that support the placement of Italo-Celtic in Tree 1 with Albanian included are 'die', 'many', and 'one'; interestingly, 'drink' no longer supports this position, as it did when Albanian was omitted.On the other hand, the number of characters that support Italo-Celtic as a sibling of Tocharian (Tree 2) rises from 1 to 3 now that Albanian has been included.These characters are 'here', 'tie', and 'we'.Importantly, 'tie' is highly polymorphic and was not included in previous studies.
We are not, of course, claiming that we have found the 'best' tree for Indo-European.Our point is that the inclusion of polymorphic characters amounts to not discarding available Figure 5. Indo-European trees.The two best-scoring trees from running MP 4 on the polymorphic Indo-European dataset without Albanian.Edge lengths have no meaning; only the topology is of interest.The placement of Italo-Celtic and Tocharian differs in the two trees and is depicted with thick edges.Tree 1 is topologically identical to the best Indo-European tree found in Nakhleh et al. (2005a), but Tree 2 has not been proposed in any prior publication, to the best of our knowledge.Two edges are labelled in each tree indicating the optimal positions for Albanian: placing it on ①, so that it is the nearest sister of Tocharian, optimises parsimony while placing it on ②, so that it is the nearest sister of the satem core, optimises the compatibility evidence, with the result that we find a wider range of hypotheses worth considering.That is the most important takeaway from this paper.

CONCLUSION
We have presented a parametric model for character change that addresses polymorphism.Importantly, under our model, all character change is the result of a (potentially brief) period of polymorphism, rather than a direct substitution of one state by another.Under our model, polymorphism arises in lexical characters primarily by semantic shift, but can also arise by borrowing through contact between different linguistic communities.Our parametric model allows languages to have more than one state for a given character (thus, more than one cognate class for a given semantic slot), but includes selective pressure to reduce polymorphism within any language for any character.Furthermore, it also ensures that every language exhibit at least one state for each character (equivalently, at least one word for each semantic slot).Thus, our model is the most linguistically realistic of the models that have been developed that address polymorphism.
An important finding in this study is that we have observed clear trends in terms of the relative accuracy of different phylogeny estimation methods.Specifically, our simulation study shows that a natural modification to MP (our MP 4 algorithm) provides improved accuracy over the remaining methods, including the well-established Bayesian approach developed by Gray & Atkinson (2003).While MP 4 and GA2003 can be close in accuracy in some boundary conditions (specifically, when there is no borrowing between languages but very high polymorphism), in general MP 4 has better accuracy, and the improvement in accuracy can be large.Given the high frequency of loan words in English (Durkin 2014) and also in other language families (e.g.see Haspelmath & Tadmor (2009), Bloomfield (1962) 21-3, and Rubin (2010)  , it would seem that borrowing is perhaps the norm in language families, and hence we should expect MP 4 to be more reliable than GA2003.
Our study also shows the impact of including polymorphic characters in a phylogenetic analysis, since this inclusion (followed by the MP 4 resolution) of the 2024 Ringe & Taylor dataset produces a new plausible tree for Indo-European not considered in any earlier analysis, all of which had removed polymorphic characters.Given the high frequency of lexical polymorphism attested in the Indo-European dataset, and found also in other language families, this study shows the benefit of proper handling and analysis of polymorphism.
The main point of this study is not to propose the use of the MP 4 method, which is based on MP, for linguistic phylogeny estimation, nor to point out failings of other methods.Instead, our point is that including polymorphic multi-state characters in a linguistic phylogenetic analysis may lead to changes in phylogenetic estimations, and these may be more accurate than prior analyses.Indeed, incorporating and addressing variation in a language community will by necessity spur the linguistics research community to better characterise how polymorphism occurs, so that these estimations become founded on increasingly rigorous and realistic stochastic models of character evolution, and will most likely also require the development of improved methods for phylogenetic estimation that can include polymorphic characters.
This study suggests many directions for future work.Among them, we ask: can we estimate the underlying genetic tree using techniques besides MP?In particular, since we have posed a parametric model, is a likelihood-based technique that would estimate the various numerical parameters along with the underlying genetic tree feasible based on this model?Another direction for future work is to perform simulations under conditions that are representative of the case where many (or even most) of the languages are modern rather than ancestral; such conditions may have higher levels of borrowing, resulting in higher levels of polymorphism and possibly also of homoplasy.In addition to the Bayesian methods we explore in this study (i.e.GA2003 and SD), there are other parametric stochastic models that have been developed and Bayesian methods for estimating phylogenies under these models (see Appendix C), which could be explored in future work.Among the newer developments are techniques for correcting for ascertainment bias, which is when data collection techniques either remove characters (or languages) exhibiting certain features, or focus on a subset of the available data (see discussion in Goldstein (2023) and Chang et al. (2015)); in our analyses of both simulated datasets and the 2024 Ringe & Taylor dataset, we do not remove any characters, but future work could explore the impact of character removal followed by correction for ascertainment bias (see Appendix C for discussion).The focus of this study is on estimating the topology of the underlying genetic tree, rather than estimating divergence dates, and so future work could look at the impact of method choice and inclusion of polymorphism when dating is the primary concern.Finally, another direction for future work is to seek to estimate the phylogenetic network itself, rather than just the underlying genetic tree.The approach used in Nakhleh et al. (2005a) might be a reasonable first attempt, but it assumed monomorphic characters, and did not take advantage of any likelihood-based techniques, and so can most certainly be improved.
Understanding and modelling polymorphism and how it arises, and using these models for language phylogeny, has ramifications outside Indo-European.Moreover, the recovery of cladistic trees is important for the comparative reconstruction of any language family for a fundamental reason: the true cladistic tree determines which taxa are the daughters of each internal node of the tree and therefore determines what can be reconstructed for that node (Hoenigswald 1960;Hoenigswald & Wiener 1987;Olander 2022).Since language change typically involves polymorphism of various kinds, polymorphism should be relevant to the study of change in progress.So far as we know, this question has not been explored systematically.The model we have proposed is a natural first step towards formalising this process, but further development of the model is needed.For the distance methods UPGMA and NJ, we experimented with three different distance measures defined on sets.Suppose one language has a set of states A and another language has a set of states B for a particular character.Then, the distance d A, B ð Þ can be defined as:

Correspondence
For each pair of languages, these distances are computed for each character and added up to get a total pairwise distance.We then make a distance matrix and provide this to PAUP* (Swofford 2002), which we use to perform UPGMA or NJ.

A.2. MP METHODS
In their classic formulations, MP approaches are not suited to handling polymorphic characters; hence, we have to modify the polymorphic input.We experiment with four different variations of handling polymorphic characters: • MP 1: Remove all polymorphic characters from the dataset.In effect, this ignores much of the lexical data.• MP 2: Replace all polymorphic states with?, the missing state symbol.This essentially ignores the polymorphic data, resulting in no information being provided by a character for a language that exhibits multiple states.• MP 3: If a particular polymorphism is shared between two (or more) languages for a given character, replace each of them with a new unique state.For example, if multiple languages exhibit states 1 and 2 for a given character, then we code each of these characters identically with a new state that is not seen in any of the other languages.• MP 4: Replace each polymorphism with the state that occurs most often across all languages for that character.For example, if a language exhibits states 1 and 2, and state 1 is more common across languages for this character, we replace the state with 1. Ties are broken arbitrarily.
MP 1 and MP 4 are probably the most natural for historical linguists.Like the distance methods, we run PAUP* (Swofford 2002) for the MP analyses.This proposed model builds on the parametric model in Warnow et al. (2006), as both allow for homoplasy, do not assume a rates-across-sites model, and allow for borrowing.However, here we model character change through (possibly brief) periods of polymorphism while they modelled character state change using substitutions of one state by another.Moreover, in Warnow et al. (2006), borrowing results in replacement of the existing word by the borrowed word, whereas in our model borrowing increases the number of words in the recipient language.Thus, the major difference between our model and Warnow et al. (2006) is that we allow for polymorphism and use it as the mechanism for character state change.

B.2. COMPARISON TO NICHOLLS & GRAY 2008 MODEL
There are four main differences between our model and that of Nicholls & Gray (2008) that are important to be considered: 1.In Nicholls and Gray (2008) model, there is strictly positive probability of a language having no cognates at all for a given semantic slot; it is not possible in our model for any language to have zero words for some semantic slot.This is a major difference between models.2. The Nicholls & Gray (2008) model does not allow for borrowing, but ours does.3. The model in Nicholls & Gray (2008) is for a single character (i.e.semantic slot), so that extending to multiple characters (semantic slots) inherently assumes that all characters evolve identically and independently (the i.i.d.assumption) down the same network with the same numeric parameters.In our model, the characters evolve independently but not identically down a common network; thus, each character has its own (character-specific) numeric parameters.Moreover, we do not enforce a rates-across-sites assumption (see Evans & Warnow (2004) for a critique of this assumption); thus, our model allows for heterotachy.4. In Nicholls and Gray (2008) model, the loss of a given word can happen multiple times, but each word can be born only once in the network (i.e. the DOLLO property).In contrast, for our model, while multiple losses are allowed, each word can be born more than once; this difference means that our model allows potentially greater realism and complexity.However, if desired, the user can restrict birth to only occur once by not allowing for any homoplastic states in the model.Since loss occurs following periods in which the character is polymorphic, this produces the Dollo effect (birth can happen only once, but loss can happen multiple times), thus reproducing this property of Nicholls & Gray (2008).
Thus, our model can be seen as an extension of Nicholls & Gray (2008) to improve linguistic realism.

B.3. DISCUSSION OF OTHER LIKELIHOOD-BASED MODELS
Given the wide popularity of likelihood models in linguistic phylogenetics in recent years, a few notes about common modelling techniques are appropriate.Most applications of these models are for estimating diversification dates (and, by extension, the homeland of the Indo-Europeans); since we only work on inferring tree topologies, we will not focus on developments primarily relevant to dating in this section.
Three common likelihood frameworks are (1) restriction site models (Felsenstein 1992), (2) binary covarion models (Tuffley & Steel 1998), and (3) SD models (Nicholls & Gray 2008).Detailed discussions of these models can be found in Chang et al. (2015).All of these models assume an underlying phylogenetic tree and two states for each character, representing the presence and absence.Characters evolve down the tree, transitioning between the presence and absence according to the specifics of the model's transition matrix.
In our study, we examined inference under a restriction site model (the GA2003 analyses) as well as under a SD model.Here we describe models that we did not explore and methods for estimating phylogenies under these models, and directions for future work.
The BINARY COVARION model is similar to the restriction site model, in that it can alternate between the presence and absence, but has the additional property that characters can also switch between fast-and slow-evolving.In the PSEUDO-DOLLO model (Bouckaert & Robbeets 2017): a trait may be gained or lost many times, but once a loss occurs a gain may not occur again; the motivation for this model was to fit linguistic datasets when there are many borrowing events or coding errors.
Methods for estimating phylogenetic trees under these models are in packages such as MrBayes (Huelsenbeck & Ronquist 2001), BEAST (Bouckaert et al. 2014) and RevBayes (H öhna et al. 2016).Some linguistic datasets may have languages that are missing states for some of the characters in the dataset, or may have characters that are invariant (exhibit only one state across the languages).In some cases, phylogenetic analyses in these conditions may remove these characters (as discussed in Goldstein ( 2023)), which is a type of ascertainment bias.Therefore, methods for correcting the ascertainment bias have also been developed (e.g., see Bouckaert et al. (2014)).
We note that the simulations described in Section 4.1 did not produce characters that are invariant nor characters with missing data.The issues regarding ascertainment bias therefore did not arise in our simulation study.As for the 2024 Ringe & Taylor Indo-European dataset, it had only two characters that were invariant and a small amount of missing data (i.e.languages missing states for some characters).We did not remove any characters in the analysis of the Indo-European dataset, and so did not create any ascertainment bias in that aspect of our study.

APPENDIX C: Simulator
This section gives a brief overview of the simulator, with full details provided in the Github site (Canby et al. 2024).
In order to generate character data (i.e.sequences of character states) at the leaves of the tree or network, the user must provide the binary tree or network on which the characters will evolve, which may be a tree or phylogenetic network estimated for a language family.We provide the option to create a random network from the provided tree: the user simply needs to provide the number of contact edges desired.
Then, the user may specify a collection of CHARACTER CLASSES.A character class is a specification of the evolutionary properties of a set of characters.Each class may permit varying degrees of homoplasy, borrowing, deviation from lexical clock, heterotachy, and rate of evolution (in addition to model-specific attributes).For example, a researcher may choose to create different character classes for phonological, morphological, and lexical characters.The user also specifies the number of characters to be evolved down the tree or network within each class.
Finally, a researcher can specify the evolution model used to generate sequences.We have currently implemented two models: the polymorphism model proposed in this paper, and the model of Warnow et al. (2006).To define a new model, one must simply implement three functions.First, the user must specify how the root state should be chosen, which could be as simple as selecting state 0.Then, one must specify how to determine a child state given a parent state on a particular edge (typically, this would involve sampling from some distribution).Finally, the user must define how to resolve a transmission on a reticulate edge: for example, this could be replacing an existing state with the borrowed state or adding the borrowed state into the set of existing states.
Once the character classes and evolution model have been specified, the simulator can generate the linguistic character data.Importantly, the code that actually performs the evolution is decoupled from the model definition, so the researcher does not need to modify the evolution code for most use cases.The simulated data may be exported as a CSV file, which could then be provided to an inference method.For each dataset, the earliest well-attested languages are used to define the character states.For example, the dataset includes character states for Old English, rather than Modern English.The advantage of using these ancestral languages rather than only modern languages is that character state changes tend to result in loss of shared cognates, as character state changes usually result in a state not yet seen in the language family.A significant example of this loss is Albanian, which is the most recently attested of the languages in the Ringe & Taylor dataset, and which has lost much of its morphology compared to other languages in the dataset.The loss of morphology is the main reason why placing Albanian in the Indo-

28
TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 0, 2024 European tree is so difficult (see discussion in Nakhleh et al. (2005a)).The disadvantage of using ancient languages compared to modern languages is the difficulty, in some cases, of finding words used for a given semantic slot in ancient languages; hence the 2002 and 2024 Ringe & Taylor datasets both have some missing data (i.e., languages that are missing words for some of the lexical characters).All phonological characters are monomorphic, and only 1 morphological character is polymorphic (M7, the genitive singular of o-stem nouns, is polymorphic in Latin, which exhibits both the *-osyo and *endings).An analysis of the 333 lexical characters shows that 223 (or 67%) of them are polymorphic in at least one language.
Figure A1 shows some empirical statistics about the lexical characters.In Figure A1, we see an average of 16.4 unique cognate classes per character, spread across 24 languages.Note that 15 characters have more than 24 unique cognate classes, meaning by necessity they must be polymorphic (by the pigeonhole principle, since there are only 24 languages).We also calculate the load of each polymorphic character; this is the maximum number of cognate classes that a single language exhibits.Of the polymorphic characters, 83% of them have a load of 2, 16% have a load of 3, and only 1% have a load of 4. In Figure A1b, we report how many languages exhibit polymorphism for each of the polymorphic characters.On average, 3.1 languages exhibit more than 1 cognate class in a polymorphic character, and the maximum across the lexical characters is 14.
Based on a detailed analysis of all the lexical characters that exhibit polymorphism (see Github site), we estimate that as many as 25% of these polymorphisms arose by borrowing, but it is much more likely that no more than 15%, and possibly as few as 2%, arose only by borrowing.Hence, we estimate that anywhere from 54% to 66% of the Indo-European lexical characters exhibit polymorphism resulting from semantic shift.This analysis places Indo-European either in the moderately high or the high polymorphism category (as those values are defined by simulations without borrowing).encoded data; when one of these methods returns multiple optimal trees, we report the average error of these trees.We do not run SD for the cases with borrowing (1-3 contact edges), due to the longer runtime of this method Table A1.Number of monomorphic characters in the simulated datasets.We show the number of monomorphic characters (i.e.characters that do not exhibit polymorphism) of the input characters and also in the binary encoding of these characters (300 lexical and 20 other).
The number of parsimony-informative characters (ones that exhibit at least two states, each with at least two languages) is also reported.Numbers are averaged across 128 datasets (each with 30 languages), simulated without borrowing.MP 1, which throws out polymorphic characters, has access to fewer and fewer characters as the polymorphism level increases, while methods that are based on binary encodings have access to more characters as the polymorphism level increases.These trends thus explain why MP 1 reduces in accuracy with rises in polymorphism and potentially why the methods that use binary-encodings improve in accuracy with rises in polymorphism

Figure 1 .
Figure 1.Semantic shift between the characters 'human being' and 'male person' in French and English.(a) Shift in meaning in French for homō.(b) Shift in meaning in English for mann

Figure 2 .
Figure 2. Example of evolutionary process with polymorphism and borrowing.Two edges that overlap in time are depicted after having undergone some amount of evolution following their least common ancestor (LCA).Edges are annotated with state changes that occurred along them and was based on the unscreened (i.e., full) 2002 Ringe & Taylor dataset.Given that the 2002 and 2024 Ringe & Taylor datasets differ only slightly (see Github site (Canby et al. 2024)), this allows us to use our simulator to produce datasets that reflect similar levels of homoplasy of the 2024 Ringe & Taylor dataset (see Appendix D as well as the discussion of Barbanc ¸on et al. (2013), Nakhleh et al. (2005b)).

Figure 4 .
Figure 4. Experiment 3: Comparison of Gray & Atkinson (GA2003) and MP for genetic tree estimation in the presence of borrowing.In this figure, the characters evolve down trees (a) or down phylogenetic tree-based networks with 1-3 contact edges (b-d).We report the mean RF topological error (across 128 replicates) in the estimation of the underlying genetic tree using either MP or GA2003.Results are shown for all five polymorphism levels.MP generally produces better accuracy than GA2003 across the different conditions, but polymorphism level and borrowing both impact the gap between the methods; the one case where GA2003 is more accurate than MP is under the very high polymorphism level without any borrowing.Error bars indicate standard error APPENDIX B: Comparison to prior models B.1.COMPARISON TO WARNOW ET AL. (2006) MODEL APPENDIX D: Properties of the 2024 Ringe & Taylor Indo-European Dataset We use the 2024 Ringe & Taylor Indo-European dataset, which is a corrected version of the 2002 Ringe & Taylor dataset (Ringe et al. 2002).Both the 2002 and 2024 Ringe & Taylor datasets have character data for 24 Indo-European languages.Each dataset has a total of 370 characters: 22 phonological, 15 morphological, and 333 lexical.The two datasets differ only in the character state assignments for a few characters (see Github site for details (Canby et al. 2024)).

Figure A1 .
Figure A1.Indo-European lexical character statistics.The Indo-European dataset has 24 languages, and we show statistics regarding the 333 lexical characters.(a) Number of cognate classes per lexical character; the dotted line shows a mean of 16.4.(b) Number of languages that are polymorphic (i.e.exhibit more than one state) per polymorphic character; the dotted line shows a mean of 3.1

Figure A4 .
Figure A4.Overall results for the simulation study Average RF error rates for all methods at each level of borrowing and polymorphism, averaged over 128 replicates (error bars show standard error).MP-Dol (i.e.Dollo Parsimony) and MP-Bin are both run on the binary encoded data; when one of these methods returns multiple optimal trees, we report the average error of these trees.We do not run SD for the cases with borrowing (1-3 contact edges), due to the longer runtime of this method

Table 2
(Sokal & Michener 1958g trees on distances can be used.The Hamming distance (number of characters in which the two languages differ) is a common technique for computing pairwise distances, and neighbour joining (NJ; Saitou & Nei 1987) and UPGMA(Sokal & Michener 1958; unweighted pair group method with arithmetic mean) are two of the more commonly used methods for computing trees from distance matrices.
. Binary character coding of the semantic slot 'male person' in Greek, Latin, Avestan, OCS, and OHG

Table 3 .
State change via intermediate period of polymorphism (without substitutions)