Genomic analyses identify multiple Asian origins and deeply diverged mitochondrial clades in inbred brown rats (Rattus norvegicus)

Abstract Over 500 strains of inbred brown rats (Rattus norvegicus) have been developed for use as a biomedical model organism. Most of these inbred lines were derived from the colony established at the Wistar Institute in 1906 or its descendants following worldwide distribution to research and breeding centers. The geographic source of the animals that founded the Wistar colony has been lost to history; thus, we compared 25 inbred rat strains to 326 wild rats from a global diversity dataset at 32 k SNPs, and 47 mitochondrial genomes to identify the source populations. We analyzed nuclear genomic data using principal component analyses and co‐ancestry heat maps, and mitogenomes using phylogenetic trees and networks. In the nuclear genome, inbred rats clustered together indicating a single geographic origin for the strains studied and showed admixed ancestral variation with wild rats in eastern Asia and western North America. The Sprague Dawley derived, Wistar derived, and Brown Norway strains each had mitogenomes from different clades which diverged between 13 and 139 kya. Thus, we posit that rats originally collected for captive breeding had high mitochondrial diversity that became fixed through genetic drift and/or artificial selection. Our results show that these important medical models share common genomic ancestry from a few source populations, and opportunities exist to create new strains with diverse genomic backgrounds to provide novel insight into the genomic basis of disease phenotypes.

of the famed Wistar colony. In 1909, Helen Dean King began inbreeding the Wistar colony, and by 1920, there were two colonies ("inbred" and "outbred") of Wistar rats. Many subsequent strains were derived from the outbred Wistar commercial stock (e.g., Lewis, Buffalo, Wistar Kyoto). Another set of strains were created by mating the outbred colony with other rats of unknown commercial or wild stock; for example, Long-Evans was produced by breeding a male caught in Berkeley, USA, with a Wistar female, and Sprague Dawley was produced by breeding a hooded male with a white Douredoure female, a line assumed to contain Wistar ancestry (Lindsey & Baker, 2006). Finally, King produced the Brown Norway strain from wild rats caught in Philadelphia, USA (Lindsey & Baker, 2006). One complicating factor for understanding strain development at the Wistar Institute was the introduction of cottonseed meal into the rat diet in 1918 that resulted in death or low fertility in the Wistar colonies (Lindsey & Baker, 2006). To meet commercial demand for rats, the Institute purchased other commercial stock, yet this stock brought diseases that also resulted in increased mortality of the colony. Thus, not only did the Wistar colonies experience a bottleneck, but individuals of unknown origin were introduced in the early 1920s and these strains may have been Wistar rats from another facility. Not all rat strains have Wistar ancestry; for example, Maynie Rose Curtis produced several inbred lines including Fisher 344, Marshall 520, and August 7,322 from stocks she received from breeders, where the breeder's name became the name of the line.
She also produced the Avon and Copenhagen lines that were named for cities in Connecticut, USA, and Denmark, respectively (Lindsey & Baker, 2006). Finally, the Fawn Hooded is an outbred stock originally produced by Norman Maier at the University of Michigan by crossing a German brown and a Lashley albino, the latter of which was from the laboratory of Karl Lashley of Harvard University (Hedrich, 2006). Contemporary Fawn Hooded strains may also be crossed with Long-Evans depending on the breeding facility.
Despite the diverse number of strains available, the Wistar colonies had an outsized role in creating the diversity of inbred strains today. Approximately half of the greater than 500 strains have known Wistar ancestry (Aitman et al., 2008). Inbred lines are developed by brother-sister matings for more than 20 generations, often selecting which siblings to mate following screens for physiological or behavioral traits of interest (e.g., body size, hypertension, tameness). Once a strain is developed, animals are shipped to research institutes and/or medical supply companies where inbreeding continues; thus, genetic drift may occur within the same strain maintained at different facilities. A recent analysis of 29 inbred (sub)strains used whole genome sequencing (WGS) to identify selective sweeps at genes associated with the physiological traits selected in each line (Atanur et al., 2013).
The population genetic relationships between inbred strains, and between inbred stains and wild relatives (Song et al., 2014) have been investigated previously albeit with limited sampling of either the genome or the diversity of wild individuals. Early work to deduce relationships among strains used microsatellites, RAPDs, and isozymes (Canzian, 1997;Thomas, Chen, Jensen-Seaman, Tonellato, & Twigger, 2003), while more recent work has investigated single nucleotide polymorphisms and variants (SNPs and SNVs, respectively; Atanur et al., 2013;Hermsen et al., 2015). Topological differences between phylogenetic trees and networks were observed across these different studies due to the (sub)strains genotyped, marker type, and analysis method. We do not propose to untangle the network of rat strains (STAR Consortium 2008), but instead place the strains within the geographic context of worldwide wild rat diversity. The history of the rat colony at the Wistar Institute suggests multiple putative origins of rats including countries in Western Europe where rat studies began in the mid-1800s, Chicago, USA, and/or Philadelphia, USA. Strains such as the Copenhagen were known to be sourced from Denmark; thus, we hypothesize that inbred lines will share evolutionary similarity with multiple geographic locations.
We selected ten individuals from our global collection of R. norvegicus samples for WGS, four from Western Europe (including two from continental Europe: England and France, and two from New York City, USA, to represent the expansion range), two within South-East Asia (Philippines and Cambodia), Northern Europe (Sweden and Netherlands), and one sample each from the Aleutian Islands and Western North America (Table S1). Samples (4 ng RNase-treated genomic DNA) were sequenced at the New York Genome Center on an Illumina HiSeq 2500 generating paired-end reads. Average sequencing depth ranged from 24 to 38× per genome (NCBI SRA: PRJNA344413).

| Nuclear genome analyses
We mapped reads for each individual within the inbred and Chinese rat genomes to the Rnor_6.0 reference genome (Gibbs et al., 2004) using Bowtie v2.2.6 (Langmead & Salzberg, 2012) with default parameters. We extracted the 32,127 SNPs that were called in the ddRAD-Seq dataset using a position list with SAMTOOLS v1.2 mpileup function (Li et al., 2009). Using these data, we estimated genetic diversity (expected heterozygosity: H E , and mean number of alleles: A) in ARLEQUIN v3.5 (Excoffier & Lischer, 2010). We ran a principal component analysis (PCA) where we projected the inbred samples into the PC space from the global diversity dataset using EIGENSOFT v5.0.2 Price et al., 2006). We also investigated population structure using FINESTRUCTURE v2.0.7 (Lawson, Hellenthal, Myers, & Falush, 2012) on the 20 autosomal chromosomes (31,489 SNPs). We phased and imputed each chromosome using fast-PHASE (Scheet & Stephens, 2006). In FINESTRUCTURE, we ran the unlinked model with 25% of the data used for initial EM estimation, 750,000 iterations of the MCMC with 50% used as burn-in and 1,000 samples retained, 20,000 tree comparisons, and 500,000 steps for the tree maximization. We viewed MCMC trace files to confirm stability of all parameters.

| Mitochondrial genome analyses
For samples with WGS data, we exported reads aligned to the mitochondrial genome using SAMTOOLS, then mapped those reads to a reference mitogenome (NCBI accession AY172581 which is a Brown Norway strain, BN/NHsdMcwi) in GENEIOUS v5.4 (http://www. geneious.com; Kearse et al., 2012) using default settings. We exported the consensus sequence from each assembly.
We analyzed the mitogenomes both as a network and phylogenetic tree. We aligned all 47 brown rat mitogenomes using MUSCLE (Edgar, 2004) within GENEIOUS, then built a NeighborNet network in SPLITSTREE v4.13.1 (Huson & Bryant, 2006). To understand divergence time between the brown rat clades, we downsampled each clade identified in the network to a single individual (n = 12). As phylogenetic software views polymorphisms as fixed substitutions between sequences, we downsampled to limit this influence on the estimation of the substitution rate, where an overestimate results in older divergence times. Haplotype selection may influence this rate, as individual haplotypes within a clade contain differing numbers of polymorphisms, thus selecting highly polymorphic haplotypes can overestimate divergence time. We selected mitogenome outgroups from R. rattus (NC_012374), R. tanezumi (EU273712), R. exulans (EU273711), and Mus musculus (NC_005089; Bayona-Bafaluy et al., 2003;Robins et al., 2008), then aligned the genomes as above.
Using the program BEAUTI, we set up a BEAST v1.8.0 (Drummond & Rambaut, 2007) input file with the following parameters: no partitioning of the data, a lognormal relaxed substitution model (Drummond, Ho, Phillips, & Rambaut, 2006), and a constant coalescent tree model (Kingman, 1982). We placed a fossil calibration (normal distribution, mean 11.8 Mya, std 0.5 Mya) on the root of the tree splitting Mus and Rattus; the settings were chosen so that 90% of the prior distribution was between 11 and 12.5 Mya (Benton & Donoghue, 2007;Robins et al., 2008). Within the CIPRES Science Gateway v3.3 (Miller, Pfeiffer, & Schwartz, 2010), we ran two independent iterations of BEAST for 10 8 Markov chain Monte Carlo steps sampling every 10 4 steps. For comparison, we ran a separate iteration where the input file contained the priors yet no sequence data. We observed that the independent runs converged and that the runs with data were better supported than the prior alone, using TRACER v1.6 (Rambaut & Drummond, 2009). We combined the independent runs following removal of 25% of MCMC steps as burn-in using LOGCOMBINER v1.8, then visualized the tree with the highest median log credibility score using TREEANNOTATOR v1.8 and report node age and the 95% highest probability density (HPD). One branch of the consensus tree had a posterior probability of 0.58 (see below); thus, we ran DENSITREE v2.2.5 (Bouckaert & Heled, 2014) to observe alternative topologies.
To place our results within the context of previous work on brown rat mitochondrial diversity, we aligned 1,140 bp of cytochrome-B (cytB) previously analyzed by Song et al. (2014), and extracted the same region from the wild and inbred mitogenomes. We screened for duplicate haplotypes using COLLAPSE v1.2 (Posada, 2004). We aligned data in GENEIOUS then built a NeighborNet network in SPLITSTREE.
We named clades in this network when samples were concordant with our mitogenome results.

| Nuclear genome analyses
Inbred rats had moderate genetic diversity measured as H E and A (Table S2) when compared to sampling sites around the world.
However, when individual lines were analyzed, inbred rats had the lowest genetic diversity of any population analyzed where H E ranged from 0.005 to 0.039 and A 0.66-1.062. These results were consistent with expectations under inbreeding, where all lines taken together contained similar diversity to wild rats, but any individual strain had very low genetic diversity as strains were selected for different traits.
When inbred brown rats were projected into the PC space from a global diversity dataset, they clustered between samples from San Diego (i.e., Western North America), and eastern China and eastern Russia (Figure 1) on the third PC axis which distinguishes diversity in Asian samples. The first PC axis represents divergence between Asian and non-Asian samples; inbred strains vary along this axis with Brown Norway showing the closest affinity to wild Western Europe rats (Figure 1). The results from FINESTRUCTURE were similar; first, the 25 strains formed a single cluster. When compared to the global diversity, inbred strains shared the most co-ancestry with wild rats from the Western North America evolutionary cluster; co-ancestry was F I G U R E 1 Principal component analyses of (a) the global diversity dataset (n = 326) of 32k SNPs and the inbred samples (n = 29; black) projected into the PC space for the first and third axes, (b) the inbred samples labeled (see Table S1) from the same projection. Sample colors indicate genomic clustering, including China (dark brown), South-East Asia (light brown), eastern Russia (pink), Aleutian Archipelago (orange), Western North America (yellow), Northern Europe (purple), Western Europe and global expansion (light blue), and Haida Gwaii, Canada (dark blue)

(a) (b)
F I G U R E 2 Co-ancestry heat map of Rattus norvegicus (global diversity dataset n = 326; inbred n = 29) using 32k SNPs from the nuclear genome analyzed in FINESTRUCTURE, where yellow and black, respectively, denote lower and higher co-ancestry moderately high with rats from eastern China and Russia (Figure 2 Figure 2).
We observed 12 evolutionary clusters within the 29 inbred rats ( Figure 1 and Figure S1). This method delineates population-level substructure within the global dataset (Puckett et al., 2016)

| Mitochondrial genome analyses
We identified 11 clades within the 47 mitogenomes sequenced ( Figure 3). The cytB network had similar patterns between the clades but with two additional clades not identified using the mitogenomes; additionally, clades 9 and 10 lacked sufficient resolution for differentiation ( Figure S2). The inbred samples were distributed between clades 10, 14, and 15, a result that confirms earlier work (Schlick et al., 2006). The Brown Norway strain clustered with samples from Sweden and the USA, and was denoted as clade 10 by Puckett et al. (2016  We estimated divergence time (Figure 4 and Figure S3) and mutation rate (0.023 substitutions per site per Ma; HPD 0.020-0.027) across the Rattus mitogenome tree. As expected, estimated divergence times between mice and rats, and within Rattus were similar to previously published results ( Figure S3; Robins et al., 2008); thus, we focused on the timing of divergence within R. norvegicus. Notably, the Bayesian posterior probability for a sister relationship between clade 3 (node B) and other Asian samples was 0.58 (Table 1) where the DENSITREE analysis presents two alternative topologies for the placement of clade 3 ( Figure S3). The R. norvegicus crown was 139 kya (HPD F I G U R E 3 Network of Rattus norvegicus mitogenomes denoting either the geographic location of strain of wild and inbred rats, respectively (see Table S1). The name of each clade is listed in bold 105-181 kya; Figure 4, Table 1). Divergence times of other clades were primarily before the last glacial maximum (LGM; 18-22 kya), except for divergence of a sample from Tokyo, Japan, and the inbred strains in clade 14 (node G) where divergence was estimated following the last glacial maximum (Figure 4, Table 1).

| DISCUSSION
Within the 25 inbred rat strains that we investigated, the nuclear genomes formed a single genomic cluster of admixed Asian ancestry ( Figure 2); thus, neither of our hypotheses were supported. We first hypothesized that inbred rats would cluster with Western Europe genotypes due to an assumption that colonies were founded by wild rats closest to the researchers in Europe and the USA that ini-  Figure S3 for tree with outgroups). See Table 1 for posterior support, divergence times, and 95% HPD for each node T A B L E 1 Divergence times with 95% highest posterior density (HPD) estimates and Bayesian posterior probabilities for each node in the Rattus norvegicus phylogenetic tree shown in Figure 4 Node  Both analyses contain haplotypes from clades covering the deepest split (Node A in Figure 4); thus, this was not the source of the different estimates. Third, our estimate of a substitution rate of 0.023 per site per Mya had little variation across the branches of the full tree and was lower than the 0.098 substitutions per third codon per Mya previously estimated for R. norvegicus (Nabholz, Glemin, & Galtier, 2008), yet our inclusion of all nucleotides in this estimate explains this difference. Fourth, we note an incongruence with the SNP haplotype network used to select samples for full mitogenome sequencing, including that the haplotype network underestimated divergence of the samples from Harbin, China, originally grouping them into two clades (clades 1 and 9) where the mitogenome network identified five clades with old divergence (Puckett et al., 2016).

Divergence (kya) HPD (kya) Posterior
Our results indicate that only a small portion of global genomic diversity has been captured within inbred rats, and current strains are most closely associated with the Western North America and China evolutionary clusters. Thus, there is substantial genomic variability in wild rats not accounted for in current medical models, although we acknowledge that we studied a subset of highly used strains in North American and European research laboratories; thus, there may be strains representing additional diversity.
This finding parallels the skew in human genomewide association studies (GWAS), where linkage disequilibrium, private SNVs, allele frequencies, and genomic architecture differ between ancestral backgrounds, thus limiting the transferability of the highly studied Northern and Western European (CEU) population to other ancestral backgrounds and admixed populations (Bustamante, De La Vega, & Burchard, 2011;Need & Goldstein, 2009). Thus, the generation of new inbred rat lines from one or more backgrounds (e.g., South-East Asia or Western Europe) may expand both the phenotypic diversity and our understanding of the genomic basis of disease (Chow, 2015). Developing and maintaining inbred lines is costly in both time and money, although such efforts may be rewarded by developing a broader understanding of the genomic architecture underlying traits with biomedical applications.