Historical demography and climate driven distributional changes in a widespread Neotropical freshwater species with high economic importance

(cid:45)ocantins-Araguaia ((cid:45)o-Ar) basin, as well as individuals (cid:59)rom three (cid:99)sh (cid:59)arms. We in(cid:59)erred a potentially smaller distribution in the glacial period, with a possible re(cid:59)uge in central Am. Our genetic data agrees with this result, suggesting a higher level o(cid:59) genetic diversity in the Am basin, compared to that observed in (cid:45)o-Ar. Our deep learning model comparison indicated that the (cid:45)o-Ar basin was colonized by the population (cid:59)rom the Am basin. Considering a global warming scenario in the near (cid:59)uture, A. gigas could reach an even larger range, especially i(cid:59) anthropogenic related dispersal occurs, potentially invading new areas and impacting their communities.


Introduction
Te Neotropical region exhibits one o the greatest biodiversity levels worldwide (Antonelli et al. 2018a, Rull 2018).Within this region, Amazonia is identi ed as the major source o this outstanding species richness (Antonelli et al. 2018b, Fine andLohmann 2018).Te Amazon River (Am) basin contains the most speciose sh auna in the world (Reis et al. 2016), a likely result o its complex geomorphologic and climatic history (Figueiredo et al. 2009, 2010, Hoorn et al. 2017, van Soelen et al. 2017).Tis intricate history is also re ected by puzzling biogeographic patterns (Ribas et al. 2012, Dagosta andPinna 2017).Te ocantins-Araguaia basin ( o-Ar), although not a real tributary o the Amazon basin, as it ows directly to the Atlantic (Carvalho and Albert 2011), is considered part o the reshwater Amazon ecoregion (Albert andReis 2011, Dagosta andPinna 2017).A high level o endemic species is observed in o-Ar basin (higher than other basins in the Brazilian shield and comparable to the levels observed in lowland Amazonia), probably due to historical connections to the Am basin (Rossetti and Valeriano 2007, Albert and Reis 2011, Dagosta and Pinna 2017).Assessing genetic diversity in taxa occurring in regions with such puzzling geomorphologic history may provide insights on the evolutive processes acting on the biodiversity o these basins.
Among the sh orders with representatives in both Am and o-Ar basins, the Osteoglossi ormes is one o the rst three sister lineages to all other modern teleosts (Greenwood et al. 1966, Arratia 1999, Near et al. 2012, Hilton and Lavoué 2018) and living orms are restricted to reshwater (Myers 1949).Osteoglossi ormes species are naturally distributed or introduced to all continents, with the exception o Antarctica (Adite et al. 2005, Nelson et al. 2016, Hilton and Lavoué 2018).Currently, this order comprises six amilies, namely Pantodontidae, Notopteridae, Gymnarchidae, Mormyridae, Osteoglossidae and Arapaimidae.However, authors recently have considered Arapaimidae as sub amily Arapaiminae, within the Osteoglossidae amily (Nelson et al. 2016, Cavin 2017).Te Arapaiminae sub amily is represented by only two extant genera, the monotypic A rican Heterotis Rüppell, 1829, and the South American Arapaima Müller, 1843 (Nelson et al. 2016).Te genus Arapaima is considered monotypic by many authors, with A. gigas as the only valid species (but see Castello et al. 2013, Stewart 2013a, b).Stewart (2013a, b) care ully analyzed existing types o this species and advocated that they might represent easily diagnosable species.As a result, the author ormally described the new species A. leptosoma (Stewart 2013b) and validated a previously described species (Stewart 2013a).Tere ore, some authors accept as much as ve species or Arapaima (Castello et al. 2013).However, recent publications that analyzed the genetic diversity using molecular markers, considered Arapaima as a monotypic genus (Hrbek et al. 2005, 2007, Farias et al. 2019, orati et al. 2019).Here, taking into account the morphological classi cation per ormed in the museum a ter the voucher deposit, we decided to consider the specimens in all our analyses as monotypic (A.gigas), but discussing our results considering the potential or cryptic species to occur.
Arapaima gigas is widely distributed across a large portion o the o-Ar and Am basins in Brazil, Peru, Colombia, Ecuador and Guyana (Reis et al. 2003, Castello 2008), and represents one o the largest reshwater sh species, with some individuals reaching up to 200 kg o body mass and up to three meters in length (Stone 2007, Bezerra et al. 2013, Nelson et al. 2016).Te species presents rapid growth, typically reaching ~60-80 cm in the rst year o li e (Arantes et al. 2010) and is considered one o the sh species with highest aquaculture potential (Ono 2007), because o its high nutrition value and low at (< 5%) content (dos Santos Fogaça et al. 2011).
Previous studies on the population genetic diversity o A. gigas with molecular markers generally pointed to the absence o genetic structuring in the Am and lower o-Ar basins, and a more pronounced structuring in the upper o-Ar.Tese previous analyses, however, ocused on ew mtDNA sequences (Hrbek et al. 2005), nuclear markers associated with repetitive regions (Hrbek et al. 2007, Vitorino et al. 2015, 2017, Farias et al. 2019), and randomly distributed genotypes along the genome ( orati et al. 2019).Here, we sampled specimens in the eld and deposited vouchers or both wild populations and sh arms, obtaining SNP markers with DAr Seq (Kilian et al. 2012).Di erently rom most technologies that obtain random genomic sequences, such as CRoPS (Complexity Reduction o Polymorphic Sequences), GBS (Genotype by Sequencing) and RAD (Restriction Site Associated DNA) (van Orsouw et al. 2007, Baird et al. 2008, Elshire et al. 2011), DAr (Diversity Arrays echnology) is a genome-complexity reduction method that enriches or hypomethylated regions o the genome allowing active genomic regions to be recovered (Jaccoud et al. 2001, Kilian et al. 2012).Tere ore, it is possible to detect markers that may be under the e ect o selective pressures.Besides, when coupled with next-generation sequencing (NGS) technologies (DAr Seq), thousands o SNPs can be generated in a relatively short amount o time, making them power ul tools or genomic investigation (Kilian et al. 2012).
By using such datasets containing hundreds to thousands o loci (Garrick et al. 2015), coupled with demographic model selection strategies (Carstens et al. 2013a), it is possible to compare complex scenarios and make more accurate estimates o demographic parameters rom more realistic models (Knowles 2009, Tomé andCarstens 2016).Recently, deep learning methods were incorporated in population genetics (Sheehan andSong 2016, Schrider andKern 2018 and re erences therein) and applied or demographic model comparison (Flagel et al. 2019, Villanea andSchraiber 2019).Tese procedures have the advantage o making use o the in ormation present in the large and multivariate datasets obtained with NGS, without the need o reducing this in ormation with summary statistics (Flagel et al. 2019).
Here, we used species distribution models (SDMs) to assess the potential range o A. gigas during past and uture climate change.We also obtained genotypic data or thousands o loci with DAr Seq procedure, that were used to estimate the genetic diversity in di erent sampling sites and to estimate population structure along A. gigas distribution.Te SDMs and genetic diversity results were used to generate demographic models or A. gigas, that were compared against each other with a deep learning approach and explicitly test whether: 1) specimens rom the two basins can be considered a single genetic population; 2) populations rom both basins expanded their range to reach the current distribution; 3) the population located in Am was colonized rom the o-Ar basin with a ounding event ollowed by expansion; or 4) the opposite colonization pathway occurred, with o-Ar being colonized rom Am.

Individuals examined and DNA extraction
We collected A. gigas individuals rom seven localities in Am ( our sampling sites) and o-Ar (three sampling sites) river basins.Besides, we analyzed samples rom three di erent sh arms (Fig. 1 and able 1).We sampled individuals using traps, and a ter capture, the animals were transported to the research station.Te Brazilian environmental agencies ICMBIO/SISBIO (license no.48290-1) and SISGEN (A96FF09) authorized the collections and voucher individuals were identi ed and deposited ( able 1) in the sh collections o the Museu de Zoologia da Univ.de São Paulo (MZUSP).We collected liver ragments o all individuals and stored them in 100% ethanol or DNA extraction, ollowing Sambrook and Russell (2001).Te procedures ollowed ethical and anesthesia procedures, in accordance with the Ethics Committee on Animal Experimentation o the Univ.Federal de São Carlos (process number CEUA 9506260315).
Model calibration was carried out with present climatic conditions and a 30 arc-seconds resolution, while projections or the present, last glacial maximum (LGM, 21 kya), last interglacial maximum (LIG, 120 kya) and uture (2070) were per ormed with a 2.5′ arc-minutes resolution.A total o 5000 pseudoabsence points was simulated, using the SRE strategy with a quantile threshold o 0.005.We adopted a proportional weighted mean ensemble method with ve simulations or each algorithm, keeping only simulations with SS higher than 0.7.

DNA extraction and DArTseq genotyping
Te gDNAs o all sampled individuals were analyzed under the DAr seq technology (Kilian et al. 2012) by the Diversity Arrays echnology Company (Canberra, Australia).A combination o PstI and SphI enzymes was used to construct the libraries using methods described by Kilian et al. (2012) and sequenced on the Illumina Hiseq2500 next-generation sequencer.Raw data generated by sequencing were ltered, processed and converted to high-quality genotypes by the acility, using their proprietary DAr so t14 v1.0 so tware.Te ollowing lters were used to obtain the SNP markers: 1) overall call rate over 95%; 2) polymorphic in ormation content (PIC) between 0.3 and 0.5; 3) Q-value (that measures the alse discovery rate) above 2.5 and 4) minimum allele requency o 0.5%.Linkage Disequilibrium and deviations rom Hardy-Weinberg Equilibrium were estimated with dartR (Gruber et al. 2018).Genotypes were coded as an SNP matrix with loci in the rows and individuals in the columns.For each genotype, data were stored as 0 or state homozygotes, 1 or heterozygotes and 2 or alternate state homozygotes (Dryad doi: 10.5061/dryad.4qr6q7j).

Outlier loci detection
Signatures o directional selection were assessed or all sampling sites with the Bayesian method implemented in BayeScan (Foll and Gaggiotti 2008) with a Prior Odds o 100 and a False Discovery Rate o 0.01.Gene ontology (GO) and annotation o detected candidate loci were then evaluated in Blast2Go (Conesa et al. 2005).o achieve this, anking regions o outlier SNPs were blasted against the National Center or Biotechnology In ormation (NCBI) nonredundant nucleotide database and annotated with an e-value threshold o 10 −3 and 10 −6 , respectively.

Genetic diversity and isolation by distance
Summary statistics or genetic diversity were calculated using GENODIVE (Meirmans and van ienderen 2004) or each sampled locality, as estimates o expected heterozygosity (H E ), observed heterozygosity (H O ) and inbreeding coecient (G IS ).Allelic richness (A R ) was calculated to correct or heterogeneous sample sizes, by using the rare action method in diveRsity (Keenan et al. 2013).Pairwise F S (Weir and Cockerham 1984) was also calculated among sampling sites, with signi cance evaluated using 10 000 permutations a ter a Bon erroni correction with α = 0.05.
Isolation by distance (IBD) was tested only or natural-occurrence populations and or each basin separately.Te straight-line distance was used between populations SFA and JAV in o-Ar, as they get connected in the wet season.
For the Am basin and comparisons considering PEI in o-Ar, stream distances were used.We per ormed a Mantel test (Mantel 1967) and canonical redundancy analysis (RDA), a method combining PCA and multiple regressions, which decomposes the genetic variance based on allele requencies (Orsini et al. 2012).For RDA, stream and straight-line distances were trans ormed into coordinates with R-command cmdscale.Te obtained spatial coordinates were then converted to third-degree orthogonal polynomials, with a modied version o the scripts rom Meirmans (2015).We also obtained the spatial component o the total genetic variation by multiplying the percentage o constrained variation by the overall value o F S , as suggested by Meirmans (2015).

Population structure
Population genetic structure or all collected sampling sites was investigated with the non-spatial method astS RUC-URE ver.1.0 (Raj et al. 2014).Tis method is a variation o the popular Bayesian clustering method S RUC URE (Pritchard et al. 2000), optimized or large genotype datasets.Data preparation and analysis were per ormed with the aid o the 'lizards-are-awesome' pipeline (Melville et al. 2017).Population genetic structure o the wild occurring populations was also assessed with the spatially explicit strategy implemented in GENELAND (Guillot et al. 2011), under the correlated requencies prior with 500 000 iterations and a thinning o 200.Runs in astS RUC URE were repeated or a range o K (number o populations) rom 1 to 11 and in GENELAND rom 1 to 8. Results rom both analyses were processed with the online tool CLUMPAK (Kopelman et al. 2015), which simpli es the use o DIS RUC (Rosenberg 2004) and CLUMPP (Jakobsson and Rosenberg 2007) to summarize and plot the results, respectively.

Demographic model selection and parameter estimation
We simulated genetic data similar to our wild-occurrence dataset in ms (Hudson 2002), considering our possible scenarios or the demographic history o A. gigas: 1) one panmictic population harboring all samples collected rom the two basins studied here; 2) a reduction in the ancestral population, ollowed by expansion on both basins; 3) colonization o the Am basin rom the o-Ar, simulated as a ounding event ollowed by exponential population expansion; and 4) o-Ar basin being colonized rom Am, also simulated as a ounder event ollowed by expansion (Fig. 3).Such scenarios were conceived based on the results o the population structure estimates (Fig. 1B) and the SDM projections (Fig. 2).o per orm our coalescent simulations, we adopted a uni orm prior or generation time, spanning rom 4 to 5 yr (Hrbek et al. 2005).A mutation rate (μ) o 1.25 × 10 −9 mutations per site per year was used, calculated rom the splitting times and amount o genome di erences rom A. gigas to Scleropages ormosus (Vialle et al. 2018), the closest species with a sequenced genome.We per ormed data simulations (20 000 or each model) with scripts modi ed rom Perez et al. (2016), using empirical sample sizes.Values o θ were calculated or each simulation using μ and the e ective population size (N e ) sampled rom a uni orm distribution rom 100 to 500 000 individuals (Hrbek et al. 2005 suggest ca 150 000 emales killed by year in the transition o the 19th to 20th centuries, based on harvest estimates).Divergence time or Am and o-Ar basin (τ 2 ) was sampled rom a uni orm distribution rom 200 thousand yr ago (kya) to 2 million yr ago (Mya).Tis period o time includes the age estimate o ocantins river achieving its modern course (Plio-Pleistocene boundary, Rossetti and Valeriano 2007; 1.8 million yr ago (Mya), Silva-Santos et al. 2018).For divergence time o population COD rom other Am localities, a uni orm distribution between 0 and 200 kya was used.Founder e ect magnitude during colonization (θrF-A; used in the models 2 and 3), was computed as the ratio between the θ values during the colonization and the value or the ancient population (drawn rom a uni orm distribution ranging rom 0.001 to 0.1).Te intensity o population expansion a ter colonization (θrC-A) was estimated as the ratio between the θ value in the current and in the ancient population (sampled rom 0.1 to 1).
Competing demographic scenarios were compared using a recent approach described in Flagel et al. (2019).Tis strategy is based on converting the SNP matrices into images and extracting in ormation via convolutional neural networks (CNN; or a review on the main architectures and applications o CNN see Christin et al. 2019).We loaded our simulated data rom ms into python as NumPy arrays containing individuals in the lines and loci as columns.Te genotypes were coded as 0 (black) or the re erence state and 1 (white) as the alternate state.Ten, or each simulation, we clustered individuals by genetic distance and transposed the matrix to keep each individual in a column and markers as lines (Supplementary material Appendix 1 Fig.A1).Te resulting NumPy arrays containing all simulations were then shufed and 10 000 random simulations were separated to be used as a validation set, while the remaining 50 000 were used as training data.Te training data was submitted to a CNN based on the architecture suggested by Flagel et al. (2019) with slight modi cations (Supplementary material Appendix 1 Fig.A1).Brie y, it consists in three (the rst layer containing 250 and the other two 125 neurons) one-dimensional convolutional layers with a kernel size o 2. Tese convolutional layers apply lters (kernels) by sliding them along (convolving) the input image and generate new layers by calculating the scalar product o the kernel and the image being convolved.By doing such operations, they extract eatures o the input image, such as edges and ormats.Our convolutional layers were interleaved with average-pooling layers, that recover the average value o the area covered by the kernel, reducing the dimensionality by extracting dominant eatures o the data.Ten, two ully connected layers with 125 neurons connect all the previous neurons ( attened in a single dimensional layer).Finally, these layers compute the probability or each model using a sigmoid unction with a nal output layer with our neurons, corresponding to the our scenarios used to simulate the data.Te CNN was run with a mini-batch size o 250.Recti ed linear unit activation unctions, that usually learn image patterns aster, were used with the convolutional layers together with a dropout (to avoid over tting) o 25% and 50% o neurons a ter pooling and densely connected layers, respectively.We evaluated the learning per ormance with a loss unction o categorical cross-entropy and updated the network weights during training with Adam optimization (Kingma and Ba 2015).
A ter training the neural network, we per ormed a crossvalidation power analysis o our CNN approach using 2000 simulations per model to evaluate the per ormance o our method to identi y correctly the simulated scenario (Supplementary material Appendix 1 Fig.A2).A ter that, our empirical dataset was submitted to the trained CNN to select the most likely demographic scenario or A. gigas (Fig. 3).A ter selecting the pre erred scenario, we conducted 100 000 simulations or parameter estimation, using the same CNN architecture described above.Accuracy or estimates o each parameter was evaluated with root mean square error (RMSE), a measure o the standard deviation o the residuals, and Spearman's ρ.All scripts used in model comparison and parameter estimation are available in github (<https:// github.com/manoloperez/CNN_DemographyArapaima>).

Paleogeographic modeling
A ter evaluating variable correlations and explanatory capacity, only seven bioclimatic variables (bio 1 -annual mean temperature, 2 -mean diurnal temperature range, 3 -isothermality, 4 -temperature seasonality, 14 -precipitation o driest month, 15 -precipitation seasonality and 16 -precipitation o wettest quarter) were maintained.Projections or present showed the most continuous range.Although largely congruent with A. gigas occurrence points, the potential current distribution was more widespread and predicted areas that are outside oodplains and within rapids systems, where the species is not expected to occur (Fig. 2A).Te uture projection was the most widespread o the modelled periods (Fig. 2B), with suitable areas outside the contemporary species distribution, with a highly suitable area in central Brazil and another in western Amazonia.A similar result or uture conditions was observed by Oberdor et al. (2015), that projected increasing distributional ranges or A. gigas in uture periods.Te LGM projection exhibited the most restricted range, with a stable area in the central Amazon and another in the northern coastline o South America.Te projection or the LIG (Fig. 2C) presented more stable areas than the LGM (Fig. 2D), especially in central and western Amazonia, in the o-Ar basin, and in a ew areas along the northern South American coast.

DArTseq genotyping and genetic diversity
Sequencing o DAr Seq markers resulted in an average o 2 559 000 reads per sample.A total o 2364 high-quality ltered SNPs, with an average read depth a ter ltering o 43.1, were obtained in the 70 samples genotyped, with 3.14% missing data.A minor allele requency o 16% in average was observed, 4% o loci comparisons showed signi cant LD and 7% o the loci showed signi cant HWE deviation a ter Bon erroni correction (alpha = 0.01).We detected 18 loci as candidate outliers in BayeScan.Only our o them returned blast hits and presented GO terms related to DNA-binding transcription actor activity (GO: 0000113), heparan sulate sul otrans erase activity (GO: 0034483), IMP amily protein binding (GO: 0098769) and LEM domain binding (GO: 0097726).We maintained candidate SNPs or all urther analyses, as their removal rendered similar results (data not shown).
Diversity levels were higher or Am Basin populations when compared with o-Ar or all diversity indexes calculated, with H E and H O values at least one order o magnitude higher ( able 2, Fig. 4).Samples rom ABV and SJBV sh arms showed diversity levels similar to Am basin localities, while the diversity levels o the individuals rom CRI sh arm were similar to the o-Ar basin ( able 2, Fig. 4).Te inbreeding estimator (G IS ) presented negative values, that indicates outbreeding, in most localities, except or MAM and FBO (Am basin), with 0.052 and 0.130, respectively, JAV in the o-Ar basin (0.004), and ABV sh arm (0.113).Pairwise F S ranged rom 0.035 (between MAM and FBO) to 0.771 (between CAS and JAV).In general, higher values were observed in pairwise comparisons o Am basin and o-Ar localities (Supplementary material Appendix 1 Fig.A3).A ter applying a Bon erroni correction, all F S values were signi cant (Supplementary material Appendix 1 Fig.A3).Fish arming samples showed diverse patterns, with CRI showing higher F S values when compared with Am basin localities, ABV more dissimilar to o-Ar localities, and SJBV showing moderate values o F S with all other localities (Supplementary material Appendix 1 Fig.A3).

Isolation by distance
Comparisons o genetic and geographic distance with Mantel test suggested a small non-signi cant correlation between these two variables in Am basin (r = 0.3057; p = 0.2917) and high non-signi cant relationship in o-Ar basin (r = 0.8771; p = 0.1250).Redundancy analysis pointed to a signi cant correlation in Am basin (RDA = 0.5532; p = 0.0417), that resulted in a value that indicates absence o IBD when multiplied by F S (RDA*F S = 0.0457), according to Meirmans (2015).RDA or o-Ar basin localities was not able to select any variables, suggesting an absence o correlation between geographic and genetic distances.

Population structure
Results rom the chooseK command in astS RUC URE suggested a maximum marginal likelihood with K = 2, and K = 4 as the model complexity required to explain the data.Tere ore, we decided to show clustering results or 2-4 groups.All results grouped all o-Ar localities, along with CRI sh arm.Sampling sites rom Am basin were also grouped when K = 2 was used, alongside with samples rom ABV sh arm.When K = 3 and K = 4 were used, COD was allocated alone in a new group.All samples rom SJBV sh arm showed admixed ancestry, with part o their genomes assigned with samples rom both basins, but with most o their genome belonging to o-Ar basin (Fig. 1).Geneland results suggested three as the optimum number o genetic clusters.Te obtained result was largely congruent with K = 3 in astS RUC URE, grouping all o-Ar samples in one group, COD alone in a second group and the remaining localities rom Am basin in a third cluster (Fig. 1).

Demographic model selection
Based on the congruence o the results rom the two clustering analyses per ormed (Fig. 1), we decided to use three groups in the simulations o the demographic scenarios (Fig. 3).A ter 20 epochs, our CNN showed an accuracy o 0.9007 and 0.8815 in the training and in the validation set, respectively.Our cross-validation procedure showed a high proportion o simulations correctly predicted to their generating model, and the scenarios or both basin expansion (model 2) and Am colonization (model 3) were the most di cult to dierentiate with our approach, as they were con ounded with each other (80.9 and 83.3% o correct predictions or model 2 and model 3, respectively).Te panmictic scenario was the most easily diagnosable, as it showed very high tions o correct predictions (99.9%) (Supplementary material Appendix 1 Fig.A2).When the empirical data was submitted to the trained CNN, the most likely scenario was o-Ar basin colonization (Fig. 3), with a posterior probability (PP) o 0.99992, while Am basin colonization and both basin expansion showed the lowest probabilities (PP = 0.00000 or both).

Discussion
Our SDM projection or the present conditions recovered a potential distribution larger than the current natural occurrence o A. gigas.Tis result can be related to the use o only bioclimatic variables in our SDM approach, without incorporating the presence o oodplains or water alls in the models.Tough use ul, incorporating such in ormation would preclude projections o A. gigas distribution in past and uture periods.Similar strategies are being used by other authors when analyzing reshwater shes (Bagley et al. 2013, Oberdor et al. 2015, McMahan et al. 2017).Te paleogeographic reconstructions obtained here indicated that during the last glacial period, A. gigas distribution was constrained under severe climatic conditions, with suitable climatic habitats scattered and restricted to re ugial areas (Fig. 2C-D, 4).Tese population size uctuations may have resulted in an accentuated genetic dri t caused by bottlenecks (Wright 1931, Lande 1988), especially in o-Ar according to our demographic model results.However, care should be taken when correlating these two results, as the time period recovered or colonization o o-Ar is older than our SDM projections.Tere ore, these approaches can be viewed as complementary, giving insights about di erent evolutionary periods in A. gigas history.Altogether, these eatures possibly played a major role in shaping the modern genetic diversity observed in A. gigas, implying that the pattern o di erentiation observed between distinct populations is most probably a ected by hydrological and historical climate eatures.
Another important result or the species conservation is that the sh arm ABV is located within the area o the o-Ar basin, but presented genotypes associated to the Am population (Fig. 1B).Tis is o special concern, as the distribution models or the uture (Fig. 2B) indicate an invasive potential or areas outside the current species distribution.Many o those predicted areas outside the current distribution are in central Brazil, where climate change would increase temperatures and precipitation levels, promoting a reduction o savanna (Moncrie et al. 2016).Tough presenting a suitable climate, these areas present higher elevations, an absence o suitable lentic habitats or spawning, ast water currents and rapids, that would likely inhibit A. gigas invasion.However, some areas would still be potentially invaded, especially as the release o individuals rom sh arms in nearby rivers could acilitate that e ect.In act, according to in ormation rom local armers and sherman during our sample expeditions, the locality PEI is probably the result o introductions rom other o-Ar natural localities, as the species do not use to occur in the region be ore the 50s.Because the collected individuals were occurring in the wild and our pairwise F S results pointed to a unique genetic variation or this locality (Supplementary material Appendix 1 Fig.A3), we decided to analyze it together with the other wild occurrences.Such invasions can greatly impact the invaded communities, by spreading diseases and a ecting the ecosystems, as A. gigas is one o the largest reshwater species and a generalist carnivore.Moreover, admixture o local populations with alien alleles can cause outbreeding depression (Weeks et al. 2011), potentially yielding in ertile o spring.As the ood dynamics, associated with the reproductive biology o Arapaima, in the two basins are markedly di erent (Albert and Reis 2011), contamination o natural populations with introduced individuals rom a di erent basin can have negative results in their tness.Likewise, the reduced environmental suitability recovered in central Amazon (Fig. 2B) is also o conservation concern.Tis pattern can be related to a potential change in the Amazon phytophysiognomy towards a savannahlike habitat, as result o a drier climate (Nobre et al. 2016, Lovejoy andNobre 2019).

Genetic diversity among populations
In agreement with our distribution models or the past, we detected lower genetic diversity levels or localities rom the o-Ar basin compared to the Am basin ones, even when correcting or small sample sizes with A R ( able 2, Fig. 4).Lower genetic diversity in o-Ar localities, especially in the upper portion o the basin was also observed in other A. gigas studies (Vitorino et al. 2015, 2017, Farias et al. 2019, orati et al. 2019).In addition to the di erences in the genetic diversity levels, we recovered structure separating populations rom both basins and high signi cant F S values between them (higher than 0.7 or 9 comparisons; Supplementary material Table 3. Parameter estimates obtained rom the pre erred demographic model.N e -e ective population size; τ 1 -divergence time (in years) between COD and the other Am localities; τ 2 -divergence time (in years) between the two basins; θrF-A -population size reduction ratio during colonization; θrC-A -ratio between the current and the ancient population.Appendix 1 Fig.A3), a similar pattern to that recovered with microsatellite markers (Farias et al. 2019), also suggesting substantial di erences in the genetic distribution.Previous structure analyses o A. gigas populations rom Am and lower o-Ar basins pointed to the absence o genetic structure (Hrbek et al. 2005(Hrbek et al. , 2007) ) or resulted in highly separate genetic clusters in the two basins (Araripe et al. 2013, Farias et al. 2019).However, a recent study analyzing the genomic polymorphism through ddRAD sequencing o A. gigas including samples rom Am, upper and lower o-Ar basins ound a high genetic structure between the two basins, with the lower o-Ar showing mixed ancestry ( orati et al. 2019).We also ound some genetic substructure within the Am basin, with COD being assigned to a separate group rom the remaining localities.Tese results are in consonance with a hypothesis o more than one species be present in the genus Arapaima (Castello et al. 2013).However, we decided not to address ormal taxonomic suggestions here, as we believe that an integrative species delimitation approach (Carstens et al. 2013b) would be necessary, coupling genomic data with other sources o in ormation (e.g.morphological and cytogenetic), besides including specimens rom all previously described morphotypes (Stewart 2013a, b).Te detected genetic diversity and population structure patterns can be related to several aspects o Arapaima biology.Arapaima gigas is considered a sedentary species (i.e.low migratory activity) with a pre erence or low-oxygenated lentic environments and having specialized parental care (Hrbek et al. 2005).Te hydrological dynamics o the regions where A. gigas is currently ound shows long ooding periods and ow cycles allowing the migration o these shes to neighbor lakes within the same basin, a process known as lateral migration (Castello 2008, Farias et al. 2019).Tis migratory pattern can be responsible or the observed pattern o genetic groups including most or all samples within each analyzed basin, coupled with high genetic di erentiation among populations both in intra and inter-basin comparisons (Supplementary material Appendix 1 Fig.A3).Moreover, along with several other sh species, this species has shown a decline in genetic diversity due to the loss o natural habitats and commercial over-exploitation (Allan et al. 2005, Castello et al. 2011).In act, its obligate air-breathing behavior and the lentic environments where these shes inhabit make it an easy target or shing.

Demographic history
Our results indicate a scenario in which the ancient Am basin population colonized the o-Ar basin (Fig. 3).Te parameter estimation step suggested that the colonization o o-Ar took place during the Pleistocene (τ 2 ; median = 922.06kya).Tis estimate is older than the time periods used in our SDM projections, and caution is necessary to associate these results.Tese are complementary results, and the obtained estimate or o-Ar colonization is in agreement to the suggested Plio-Pleistocene age or the de nitive splitting o the Am and o-Ar basins based on river sediments (Rossetti and Valeriano 2007).Although the ages estimated or the separation o these two basins rom dated phylogenetic trees based on mtDNA in Inia (Hrbek et al. 2014) and in mtDNA and two nuclear markers in Salminus (Machado et al. 2018) were older than our estimates, the con dence intervals were also placed on the Plio-Pleistocene boundary.Te estimated e ective population size was also highly concordant with a previous estimation o this parameter using cpDNA markers (Hrbek et al. 2005).Te remaining estimated parameter showed a lower accuracy when simulated data were evaluated by RMSE and Spearman's ρ, and they should be considered with care.Te estimated values suggested a very recent separation o the COD population (τ 1 ; median = 89.49kya; interval = 84.20-96.68kya), a strong bottleneck during the oundation o the o-Ar basin (θrF-A; median = 0.0571; interval = 0.054-0.059),and a current population size or o-Ar that is approximately hal o the size estimated or the Am basin (θrC-A; median = 0.5266; interval = 0.496-0.567).Among those estimates, the result or current population size o o-Ar was unexpectedly high.Tis is probably related to limitations o our method to estimate this parameter, as there is much more suitable habitat or the species in the Am basin (Fig. 2A).Also, the time estimate or the separation o the COD population was placed in the Pleistocene glaciations, which could be related to these events.Te model comparison and parameter estimation approach adopted here, based on CNN, allows taking in ormation directly rom the SNP matrices, without the use o summary statistics.Besides, contrary to other model testing approaches based on a rejection step that discards most o the simulations and retain only a small part that is more similar to the empirical data (e.g.ABC; Csillery et al. 2012), CNN use in ormation rom the whole set o simulations to learn how to distinguish among concurrent scenarios.Tese eatures resulted in a high capacity to distinguish among the simulated colonization models in our dataset (Supplementary material Appendix 1 Fig.A2).
Te most likely demographic scenario recovered was the colonization o o-Ar basin rom an ancient Am basin population, in accordance to the genetic diversity observed.Our analysis showed a more restricted distribution or A. gigas in the past, especially during glacial periods.Present distribution is more widespread and continuous, while uture predictions indicate a distribution range shi t towards south and a more ragmented distribution (Fig. 2B), potentially involving extinction o populations in central Amazon, as a result o global warming.Such scenario can result in invasive potential o new areas, increased by the presence o sh arms containing specimens located outside the natural occurrence area.Tis result is o special concern as A. gigas is one o the largest reshwater species and a generalist predator, characteristics that might cause a high impact in the invaded communities.

Figure 1 .
Figure 1.(A) Map o northern South America indicating the sampling sites o Arapaima gigas analyzed in this study rom ocantins-Araguaia (yellow) and Amazon (green) river basins, coded according to able 1. Fish arming sampling sites are represented as orange triangles, while natural samplings are shown as red circles.(B) astS RUC URE results or K rom 2 to 4 and GENELAND analysis or K = 3.Each sampling site is represented as a vertical bar showing the proportion o their genome belonging to each o the K groups.Black lines separate individuals o di erent sampled localities.

Figure 2 .
Figure 2. Climatic distribution modeling in A. gigas, constructed based on current distribution points.Suitable climatic areas are shown according to a gradient or present (A), uture (2070 -B), last interglacial maximum (LIG, 120 kya -C) and last glacial maximum (LGM, 21 kya -D).

Figure 3 .
Figure 3. Graphic representation o the simulated scenarios tested in A. gigas.Model 1 considers the sampled distribution as a simple panmictic population at all times; model 2 simulates a reduction ollowed by expansion on both basins; model 3 simulates a colonization event in the Amazon basin; and model 4 a colonization event in the ocantins-Araguaia basin.Symbols represent the estimated parameters o theta in the current populations (θ) in all models, as well as divergence times between basins (τ 2 ) and o the COD population (τ 1 ), population sizes during ounder event (θrF-A) and ratio o population sizes during and a ter the ounder event (θrC-A).PP -posterior probability.

Figure 4 .
Figure 4. Map o northern South America showing the predicted suitable areas (environmental suitability higher than 200) or Present (light grey), LIG (grey) and LGM (dark grey).Circles represent the value o genetic diversity indexes (A) number o alleles -A; (B) allelic richness -A R ; (C) expected heterozygosity -H E and (D) inbreeding coe cient -G IS (with circle colors representing positive values in blue and negative in red).Circle sizes are proportional to the value o the genetic parameter, and the highest and lowest values are shown.

Table 1 .
Sample in ormation rom Arapaima specimens used to develop SNPs with DArTseq and to per orm genetic analyses.

Table 2 .
Genetic diversity levels in Arapaima sample sites.A -average number o alleles; A R -allelic richness; H O -observed heterozygosity; H E -expected heterozygosity; G IS -inbreeding coe fcient.