Trait evolution and historical biogeography shape assemblages of annual killifish

Reconstructions of evolutionary and historical biogeographic processes can improve our understanding of how species ssemblages developed and permit inference of ecological drivers affecting coexistence. We explore this approach in Austrolebias, a genus of annual fishes possessing a wide range of body sizes. Regional assemblages composed of different species with similar size distributions are found in four areas of eastern South America. Using phylogenetic trees, species distribution models and size data we show how trait evolution and historical biogeography have affected the composition of species assemblages. We extend age-range correlations to improve estimates of local historical biogeography. We find that size variation principally arose in a single area and infer that ecological interactions drove size divergence. This large-size lineage spread to two other areas. One of these assemblages was likely shaped by adaptation to a new environment, but this was not associated with additional size divergence. We found only weak evidence that environmental filtering has been important in the construction of the remaining assemblage with the smallest range of sizes. The repeated assemblage structures were the result of different evolutionary and historical processes. Our approach sheds light on how species assemblages were built when typical clustering approaches may fall short.


INTRODUCTION 48
Coexistence of related species is shaped by the effects of different ecological, evolutionary 49 and historical processes. These include speciation (Nosil 2012;Warren et al. 2014;50 Mittelbach and Schemske 2015) and extinction, local ecological processes such as 51 competition (Hardin 1960 co-occurrence and added matrices in Figure 1 showing which species pairs co-occur in our 282 data complemented with reports in Loureiro et al. (2015). 283 284 We characterised species ranges in two ways. First, we used the four areas of endemism that 285 distinguish the assemblages in our analysis (Costa 2010). These are: the western region of the 286 Paraguay River basin (Western Paraguay or W), the lower La Plata River basin and the 287 middle-lower Uruguay River basin (La Plata or L), the Negro River drainage of the Uruguay 288 River basin and the upper/middle parts of the Jacuí, Santa Maria, Jahuarão and Quaraí river 289 drainages (Negro River or N) and the Patos-Merin lagoon system including the southern 290 coastal plains (Patos Lagoon or P) (Fig. 1). A fifth area of endemism is found the middle 291 section of the Iguaçu River basin. It is home to a single species, A. carvalhoi, for which we 292 do not have data, so it was not included. 293 294 Second, we built species distribution models (SDMs) using the above data and MaxEnt 295 (version 3.3.3k; Phillips et al. 2006; Phillips and Dudík 2008) to more accurately determine 296 13 species ranges. We used the current global climate data layers of 19 bioclimatic variables and 297 altitude taken from WorldClim (www.worldclim.org) at a spatial resolution of 30 arcseconds 298 (approximately 1 km 2 ), 10 variables characterising soil composition and two variables with 299 river basin characteristics (Table S4). As most locations where Austrolebias have been 300 sampled are at roadsides, we sampled background points from a raster file of roads at the 301 same resolution (CIESIN 2013). We ran a PCA on the unstandardised environmental 302 variables and used the principal component scores (PC) as covariates in MaxEnt. This 303 produced ranges with fewer spurious predicted occurrences at a large distance from the 304 known capture sites than PC from standardised environmental variables. Range sizes were 305 calculated by applying a threshold to the logistic output of MaxEnt, at which the sum of the 306 sensitivity (true positive rate) and specificity (true negative rate) is highest. When a cell 307 possessed a habitat suitability score above this threshold, the species was counted as present 308 in that cell. 309 310

Environmental niches 311
We characterised the environmental niche of each species with statistics of the environmental 312 variables listed above. We followed an approach similar to the outlying mean index (OMI) 313 analysis proposed by Dolédec et al. (2000). First, environmental variables were all 314 standardised to zero mean and standard deviation equal to one. Then their values at the 315 known locations per species were averaged. We weighed locations by presence/absence as 316 local abundances are unknown and the different ecological roles in the genus might affect 317 abundances independent of environmental preferences (e.g. large piscivore species are less 318 numerous in ponds). On the resulting values a PCA was carried out which did not account for 319 phylogenetic relatedness among species as we did not want to impose an evolutionary model 320 a priori. 321

Body size data 323
We represented species size by a measure for asymptotic length. Growth is indeterminate in 324 fish and age information is often not available with size measurements. Body size data were 325 obtained by taking the largest known field measurements of adult male standard length (SL) 326 for each species from the literature and our own field records (full dataset in Table S5). without founder-event speciation (indicated by +j). We ran separate sets of analyses where 338 the maximum number of areas composing ancestral ranges was restricted to two (maximum 339 observed) or four (maximum possible). We performed model selection based on Akaike 340 Information Criterion corrected for sample size (AICc), to identify which models best fit the 341 data. We ran analyses where ranges were restricted to include only adjacent areas. For 342 example P could only share a range with N, while N could be part of a range with both L and 343 P (Fig. 1). Finally, Costa (2010)  Inferring historical biogeography using areas of endemism and ARE can oversimplify spatial 362 patterning of the actual species ranges and their overlaps. We used the age-range correlation 363 (ARC) approach of Barraclough and Vogler (2000) which can handle detailed range data. 364 Simulations have shown that it has some power to discriminate geographic modes of 365 speciation (Barraclough and Vogler 2000). ARCs are regression models of range overlap on 366 phylogenetic node age. Descendant ranges are pooled to calculate range overlap per node and 367 overlap is expected to increase (decrease) with node age for nodes that originated by 368 allopatric (sympatric) speciation. A major fault in the current inference of the probabilities of 369 different modes is that these are inferred from the estimated intercept of the regression across 370 pooled data. However, there are many combinations of geographical speciation modes that 371 can produce the same intercept. We used mixture regressions in the 'flexmix' R package 372 (Grün and Leisch 2008) to analyse age-range overlap data. Nodes of a phylogenetic tree can 373 be assigned to different clusters of events representing different geographical speciation 374 modes and with the intercept and node age slope estimated per cluster. Mixtures fitted 375 consisted of 1-3 component regression models. 376 377 Range overlaps were calculated as fractional overlap of the smallest range size following 378 Barraclough and Vogler (2000). For one species, A. paucisquama, we were unable to infer a 379 reasonable SDM so we could not include this species in our ARC analyses. After an 380 exploration of logit and other models, we analysed the data as mixtures of gaussian random 381 variables with linear regressions. These produced fewer spurious results and an absence of 382 strong effects of parameterisation details. We constrained the residual variances of all 383 mixture components to be equal. We otherwise obtained regressions with very small error 384 variances, which were not observed in process simulations (Barraclough and Vogler 2000). 385 We inferred range overlap intercepts by bootstrapping the preferred mixture model, which bellottii and A. apaii were treated as the same species because A. apaii is a junior synonym of 408 A. bellottii (García et al. 2012). In some cases A. cinereus individuals were more closely 409 related to A. vazferreirai than to their conspecifics. These species are also morphologically 410 similar (Costa 2006) and A. cinereus is known from only a single population. We decided to 411 merge A. vazferreirai and A. cinereus for the comparative analyses in this paper including the 412 *BEAST analysis. Our mtDNA topology differed extensively from our nrDNA tree (Fig. 3 The best-fitting ARE models were DIVA+j for all trees. The parameters and likelihoods of 427 each model are summarised in Table S6. These models revealed that jump-dispersal of 428 lineages between areas was common ( Fig. 4, Fig. S1). Accordingly, biogeographic stochastic 429 mapping found frequencies in the range of 47-60% of events (Fig. S2). Likelihood ratio tests 430 revealed that models with the jump-dispersal parameter conferred significantly higher 431 likelihood to the data than those without (Table S7)  simulation) and Negro River (2.7) while the other areas had averages below one. Vicariance 440 events were rare, probably due to the rarity of ancestral ranges with more than one area. 441 Within-area subset speciation is not included in the DIVA model. Neither adjacency matrices 442 nor changing the maximum number of areas significantly improved the likelihood (< 2 443 difference in AICc) when compared to the simplest set of models (Table S6). (AUC) values for the training data were above 0.95. Among mixture models fitted to range 491 overlaps, models were preferred with node age effects (Table S8 contains ICL values of fitted 492 models). For the nDNA tree (Fig. 6), a mixture with three components representing different 493 modes of speciation was preferred. The intercepts of these three components are -0.02 (s.e. 494 0.03), 0.24 (s.e. 0.04) and 0.59 (s.e. 0.09). The first value corresponds to allopatric speciation, 495 21 the other two to varying degrees of non-allopatric speciation. For the coalescent tree, a 496 mixture with three components had lowest ICL, which was only slightly smaller than the ICL 497 of a single-component mixture. We therefore infer for the coalescent and mtDNA tree ( When we compare the regional and more local analyses (Fig. 4, 6A), 17 of 23 nodes showed 502 agreement between analyses. There are three events where the regional analysis inferred non-503 allopatric speciation, while ARC did not. For three allopatric speciation events at the regional 504 scale, ARC inferred non-allopatry. This combination is an impossible geography, showing 505 limitations of historical biogeographic approaches. Most importantly, the selection regime 506 shift for size occurred at a node where non-allopatric speciation is inferred. We can conclude 507 that ecological interactions (C) drove this size divergence. The trees presented in this study are the most comprehensive inferences in Austrolebias to 514 date. Our nDNA tree is the first multi-locus tree based on nuclear DNA for this genus and 515 provides novel insight into species relationships. We find major differences between nDNA 516 and mtDNA trees. Paraguay, but the less extreme found by SURFACE (Fig. 5) was not recovered in l1OU 592 analyses (Fig. S5). While the less extreme shift seems plausible, SURFACE is known to 593 overestimate the number of shifts (Ho and Ané 2014). Observed shifts for environmental 594 niche were not associated with selection regime shifts for size. 595 596 Improved prediction of local historical biogeography 597 To improve the prediction of historical biogeography we have extended a method developed 598 to determine the primary geographical mode of speciation in a clade (Barraclough and Vogler 599 2000). We used mixture regression models to cluster data into groups of nodes fitted by a 600 shared regression. This allowed us to estimate the intercepts of the regression lines fitted to 601 each cluster, which approximate the range overlap at cladogenesis per group. We used the 602 assignment of nodes to mixture components to assign speciation modes to each divergence 603 event, without actually reconstructing ancestral ranges and without pre-defined discrete 604 biogeographic areas. 605

606
The ARC approach has been criticised in the past (Fitzpatrick and Turelli 2006) and these 607 criticisms still apply to our extended methodology. The power of the original method is low 608 and the discriminatory power between alternative modes of speciation disappears at older 609 nodes in a phylogenetic tree (Barraclough and Vogler 2000). A heuristic node weighting 610 method has been proposed (Fitzpatrick and Turelli 2006)  While patterns in Austrolebias revealed an appreciable fraction of non-allopatric speciation 619 events, this did not reflect the pattern in Nothobranchius where speciation is thought to be 620 exclusively allopatric and triggered by geological events (Dorn et al. 2014; Bartáková et al. 621 2015). Our ARC results corroborated our ARE analyses (Fig. 6) for most of the speciation 622 events in the tree. Results suggest that by using ARC we could improve upon estimates of 623 geographic proximity of incipient species during divergence. For example, the divergence of 624 A. duraznensis and A. affinis was estimated as within-area speciation using ARE (Fig. 4) but 625 in allopatry using ARC (Fig. 6). Additional simulations would be useful to understand factors 626 affecting error rates of the approach, as we found three cases where ARE and ARC predicted 627 incompatible biogeographic events. 628 629

Repeated patterns are the result of different processes 630
In Figure S8 we provide a visual summary of our reconstructions per assemblage. 631 We observed a single selection regime shift in size, indicating that there was a single source 632 of major size variation in the genus. Historical biogeographic approaches estimated that the 633 speciation event that preceded this shift was within-area speciation at the regional level and 634 non-allopatric speciation at the local level, taking place in the Patos Lagoon area. We 635 therefore predict that scenario (C) took place -size divergence was generated through Our best fitting ARE model (Fig. 4) reveals two major events that were key in distributing 645 size variation across assemblages. First, a jump-dispersal event from Patos Lagoon to La 646 Plata led to A. elongatus, followed by an event from La Plata to Western Paraguay which led 647 to A. monstrosus. Therefore Patos Lagoon represents one extreme of size redistribution by 648 dispersal (Fig. 2) because major size variation originated within, and Western Paraguay and 649 La Plata another since major size variation originated in other areas. Conspicuously absent is 650 the Negro River assemblage, which harbours no size variation generated by a selection 651 regime shift. Austrolebias vazferreirai is the largest species in this assemblage and is thought 652 to be a generalist (Costa 2006 However, the two niche regime shifts in Western Paraguay (Fig. 5) make us predict that 687 coexistence there is structured more by environment than in Patos. Species with more 688 extreme environmental niches such as A. monstrosus and A. vandenbergi will occupy 689 different ponds. We predict that body size will be less important for structuring pond 690 assemblages in the Negro River area. Inherently, there is then more scope for weak 691 environmental filtering effects or neutral coexistence. Confirmation of our predictions would 692 require careful assessment of co-occurrence across these areas.