A comprehensive biomedical variant catalogue based on whole genome sequences of 582 dogs and eight wolves

Summary The domestic dog serves as an excellent model to investigate the genetic basis of disease. More than 400 heritable traits analogous to human diseases have been described in dogs. To further canine medical genetics research, we established the Dog Biomedical Variant Database Consortium (DBVDC) and present a comprehensive list of functionally annotated genome variants that were identified with whole genome sequencing of 582 dogs from 126 breeds and eight wolves. The genomes used in the study have a minimum coverage of 10 × and an average coverage of ~24×. In total, we identified 23 133 692 single-nucleotide variants (SNVs) and 10 048 038 short indels, including 93% undescribed variants. On average, each individual dog genome carried ~ 4.1 million single-nucleotide and ~1.4 million short-indel variants with respect to the reference genome assembly. About 2% of the variants were located in coding regions of annotated genes and loci. Variant effect classification showed 247 141 SNVs and 99 562 short indels having moderate or high impact on 11 267 protein-coding genes. On average, each genome contained heterozygous loss-of-function variants in 30 potentially embryonic lethal genes and 97 genes associated with developmental disorders. More than 50 inherited disorders and traits have been unravelled using the DBVDC variant catalogue, enabling genetic testing for breeding and diagnostics. This resource of annotated variants and their corresponding genotype frequencies constitutes a highly useful tool for the identification of potential variants causative for rare inherited disorders in dogs.


Introduction
The domestic dog (Canis lupus familiaris or Cauls familiaris) was the first domesticated animal species. In the last few hundred years, more than 400 genetically isolated breeds of dogs have been created through human selection and breeding in closed populations. Dogs are the species with the greatest intra-species phenotypic diversity among vertebrates. Body size as well as limb and skull proportions differ noticeably among breeds, with Chihuahuas at one end of the spectrum measuring 20 cm in height and 2 kg in weight compared with Great Danes on the other end exceeding 76 cm in height and 90 kg weight. In addition to morphology, behavioural variation exists across breeds, with many breeds being highly specialized for single tasks such as herding, hunting, retrieving, guarding, detecting scent or providing companionship. The genetic bases for phenotypic variation among breeds have been reported, including body size (Sutter et al. 2007;Boyko et al. 2010;Vaysse et al. 2011;Rimbault & Ostrander 2012;Hayward et al. 2016), skull shape (Schoenebeck et al. 2012), short legs (Parker et al. 2009;Brown et al. 2017), hair morphology (Drögemüller et al. 2008;Cadieu et al. 2009; and aggression or fear (Zapata et al. 2016;Sarviaho et al. 2019). These studies provided important insights for understanding the genetic regulation of analogous traits in humans.
Many genetic diseases are analogous to those in humans, for example susceptibility to certain types of cancer and a large number of Mendelian diseases. The list of known diseases in dogs is greater than that in any other domestic animal species (Ostrander et al. 2017). The Online Mendelian Inheritance in Animal (OMIA) database lists 426 potential dog models for human diseases (https://omia.org; accessed 14 May 2019). Canine heritable diseases not only show the clinical and pathological features of their human counter-parts but also harbour underlying causative genetic variants in the same genes as in humans (Hytönen & Lohi 2016). In addition, the identification of disease-causing variants in previously uncharacterized genes provides novel candidates for rare human diseases. Examples include rare diseases like retinitis pigmentosa, osteogenesis imperfecta, congenital ichthyosis, nasal parakeratosis, footpad hyperkeratosis, neurodegenerative vacuolar storage disease, myoclonic epilepsy and lethal acrodermatitis (Zangerl et al. 2006;Drögemüller et al. 2009Drögemüller et al. , 2014aMerveille et al. 2011;Grall et al. 2012;Jagannathan et al. 2013;Kyöstilä et al. 2015;Hytönen et al. 2016;Wielaender et al. 2017;Bauer et al. 2018a).
Given the breeding history of dogs, with extreme genetic bottlenecks during breed formation and the subsequent maintenance of closed populations starting from a small number of founding animals, it is not surprising that individual breeds also exhibit breed-specific patterns of disease susceptibility. In the last 15 years, geneticists have mapped hundreds of Mendelian trait loci using smaller sample sizes than are required in human disease association studies. Instances include MITF, which is involved in white spotting, found with nine cases and 10 controls (Karlsson et al. 2007) and the epilepsy gene, LGI2, that was mapped with 11 cases and 11 controls (Seppälä et al. 2011). For simple Mendelian traits, only 15 000 informative single-nucleotide variants (SNVs) spanning the genome were reported as being adequate for successful genome-wide association studies, whereas for the same study in humans, 300 000 SNVs might be required (Lindblad-Toh et al. 2005). This is due mainly to long-ranging intra-breed linkage LD, which extends to roughly 1 Mb in dogs compared with a few kb in humans (Sutter et al. 2004;Lindblad-Toh et al. 2005). Because of the long-ranging LD, the identification of association signals is easier in dogs compared with humans. However, fine mapping and the identification of the underlying causal variant(s) for a given association signal are often quite challenging in dogs. Within dog breeds, typically many variants close to the causal variant are in strong or even perfect LD and hence it may be difficult to discriminate among them.
With whole-genome sequencing (WGS), it has become feasible to at least comprehensively access the existing genome variation. Identification of causal variants for heritable traits in a WGS approach involves the mapping of case and control sequence reads to a reference genome sequence (Bourneuf et al. 2017). The currently used CanFam3.1 reference genome sequence is derived from a female Boxer with contig and scaffold N50 sizes of 180 kb and 45 Mb respectively (Lindblad-Toh et al. 2005). Any given dog genome typically has several million variants compared with the reference genome. Therefore, a meticulous approach to reliable variant calling and hierarchical filtering strategies is required to reduce the millions of variants per sample to a manageable list of candidate genes and causative variants. Many heritable traits in dogs have been successfully solved using such a WGS approach for both diseases and morphological features; examples include a de novo variant in ASPRV1 for ichthyosis and variants in GJA9 for polyneuropathy, RBP4 in congenital eye disease and DVL2 in dogs with screw tails Becker et al. 2017;Kaukonen et al. 2018;Mansour et al. 2018). More recently, Plassais et al. (2019) identified candidate causative variants for several phenotypes.
The cost of a mammalian WGS including data analysis and data storage is still above US $1000. Hence, WGS experiments in individual research laboratories directed at identifying the causal variants for Mendelian traits are carried out with small numbers of cases and controls. The efficiency of such WGS approaches can be drastically improved by including variants from dogs sequenced for other projects from around the world. Comprehensive variant databases containing accurately called and genotyped variants from hundreds or thousands of dog genomes enable a straightforward filtering option based on allele and/or genotype frequencies within or across selected cohorts. This is a powerful approach to identifying and excluding common variants that are unlikely to cause rare inherited diseases. A comprehensive set of variants together with their allele and genotype frequencies can thus greatly help to distinguish functionally relevant from neutral variants. The use of such filtering strategies has become standard practice in human medical genetics, and for example, the gnomAD database (https://gnomad.broadinstitute.org/) has become indispensable for the identification of candidate pathogenic variants in human genetics (Lek et al. 2016). A similar approach is used within the cattle community and the 1000 Bull Genome project (Hayes & Daetwyler 2019).
Currently, the canine variations reported in the National Center for Biotechnology Information (NCBI) genetic variation database hosted at the European Variation Archive contain ~2.9 million SNVs. We submitted a comprehensive list of variants from 238 purpose-bred dogs, containing 18 639 483 SNVs and 9 293 851 short indels, to the European Variation Archive in 2017 (PRJEB24066), but this dataset was not processed until May 2019. Comprehensive variation data from 132 canids and 90 village dogs were reported previously (Marsden et al. 2016;Taylor et al. 2016). The iDog database released a variation list from 127 individual dogs and grey wolves (Bai et al. 2015;Tang et al. 2019). Lastly, Plassais et al. (2019) reported the analysis of 722 canine whole-genome sequences to discover variants associated with 16 different phenotypes, including body weight variation observed across modern dog breeds but absent in wild canids.
We report here a comprehensive list of high-quality variants from 590 genome sequences comprising 582 dogs from 126 different breeds and eight wolves (Table S1). We included 178 sequences from public databases that had been produced by other groups, whereas the remaining 412 whole-genome sequences were produced and submitted to public databases by members of the Dog Biomedical Variant Database Consortium (DBVDC) from more than 20 institutions around the world. The DBVDC was established in 2013 to aggregate WGS data from a variety of projects examining dog biomedical and morphological phenotypes. The DBVDC proceeds in incremental 'runs', with a run taking place approximately every 3-6 months.

Datasets
Whole genome sequences of 582 dogs and eight wolves available at the European Nucleotide Archive and the Short Read Archive were used for the analysis. See Table S1 for details of the project and sample accessions. The CanFam3.1 reference genome was downloaded from Ensembl (http://www.ensembl.org/Canis_familiaris/Info/Index) and was used for alignment of reads.

Samples, read filtering and alignment
A total of 590 individual genomes were sequenced with Illumina sequencing technology NextSeq, HiSeq or Nova-Seq. The 590 individuals consisted of 467 males and 123 females. The samples were selected based on their availability in public databases and with a minimum coverage of 10×. Raw sequencing reads were filtered for adaptors and trimmed based on the quality score using FASTP (Chen et al. 2018). Quality filtered reads in FASTQ format were then aligned with the reference genome assembly CanFam3.1 using BWA (version 0.7.13; Li & Durbin 2010). GATK CALLABLELOCI was used to collect statistics on callable, uncallable, poorly mapped and other parts of the genome. The following arguments were used --minBaseQuality 20, --minDepth 4 and --minMappingQuality 10.

Variant calling and filtering
Aligned reads stored in SAM format were co-ordinate sorted and converted to bam format using SAMTOOLS (version 0.1.18; Li et al. 2009). Duplicates were marked with PICARD tools (http://broadinstitute.github.io/picard). Best practices established for the GENOME ANALYSIS TOOLKIT (GATK version 3.8; DePristo et al. 2011) were used for calling SNVs and short indels. Base quality recalibration was performed using BaseRecalibrator from GATK, where the recalibration report was formed using the default setting for covariates and the NCBI dbSNP database (Build 151) as the database for known sites. For each sample, the HaplotypeCaller program was used to call variants from the recalibrated bam file. The output, a GVCF file (with g.vcf extension), contained raw, unfiltered SNV and indel calls for all sites in the genome, variant or invariant. The GenotypeVCF tool was then used for performing joint genotyping on the per-sample GVCF files generated by HaplotypeCaller and produced a single VCF for the cohort. The single cohort VCF file was passed to the VariantFiltration tool for filtering based on several metrics provided by GenotypeVCF tool. Hard filtering was done using the following criteria, as suggested by GATK documentation: QUAL > 30.0, QD > 2.0, MQ > 40.0, FS < 60.0, HaplotypeScore < 13.0, MQRankSum > −12.5 and ReadPosRankSum > −8.0.

Annotation of variants
The filtered VCF file was annotated using SNPEFF (version 4.3; Cingolani et al. 2012) and NCBI annotation release 105 of the CanFam3.1 build. Analysis of 'protein-changing' variants included variants annotated by SNPEFF with the following sequence ontology terms: missense_variant, start_lost, stop_gained, stop_lost, stop_retained_variant, splice_acceptor_variant, splice_donor_variant, conservative_inframe_deletion, conservative_inframe_insertion, disruptive_inframe_deletion, disruptive_inframe_insertion, exon_loss_variant, frameshift_variant, and gene_fusion. SNPEFF classifies missense variants, conservative in-frame insertions and conservative in-frame deletions as having a 'moderate impact'. 'High-impact' variants include all other protein-changing variants. SNPEFF predicts loss-of-function (LoF) variants that include stop-gains (nonsense), splice site-disrupting sNVs, frameshift indels in a coding sequence or larger deletions that remove coding exons. SNPEFF also filters putative LoF variants identified in the last 5% of the coding region. VCFTOOLS was used for studying the statistics of the variants (Danecek et al. 2011).

Phylogenetic tree
The phylogenetic tree was constructed using SNPHYLO (Lee et al. 2014). We used only exonic biallelic variants and used an LD threshold of 0.7 to remove redundancy owing to LD, leaving a total of 288 452 variants that were used for the tree construction.

Enrichment analysis
The genes carrying tolerant LoF variants were used for enrichment analysis using the Database for Annotation, Visualization and Integrated Discovery (DAVID) functional annotation tool (Huang et al. 2009). The DAVID functional annotation clustering uses an algorithm to explore relationships among the annotation terms from various annotation sources and then presents a score for an enriched group of terms. In the current version of DAVID, the annotation tool includes more than 40 annotation categories including GO terms, KEGG Pathways and UniProt.

Identification of potential embryonic lethal alleles
To identify and prioritize LoF variants in putative embryonic lethal (EL) genes, we used data from recent publications on genome-wide screens for genes that cause embryonic lethality in mouse (Dickinson et al. 2016), cattle (Agerholm et al. 2001(Agerholm et al. , 2006Charlier et al. 2012;Fritz et al. 2013Fritz et al. , 2018Sonstegard et al. 2013;Cooper et al. 2014;Daetwyler et al. 2014;Venhoranta et al. 2014;Adams et al. 2016;Schütz et al. 2016;Schwarzenbacher et al. 2016;Michot et al. 2017), pig (Derks et al. 2019) and human (Shamseldin et al. 2015;Lord et al. 2019). The non-redundant curated list of EL genes from each of the studies was used to obtain orthologous gene ids from the NCBI dog genome CanFam3.1 annotation release 105 (Table S2). Biallelic variants in these orthologous genes were considered to cause potential EL alleles, if annotated by SNPEFF as high impact and not present in the homozygous variant state in any of the individuals. Additionally, we restricted our list to variants with an average sequencing depth of more than 5 using the GATK variant filter 'select DP>'. DP values in the VCF file represent the number of reads passing quality control used to calculate the genotype at a specific site in a specific sample, with higher values for DP generally leading to more accurate genotype calls.

Identification of potential developmental disorder alleles
We also used the developmental disorder (DD) genes included in the DDG2P panel (https:// decipher.sanger.ac.uk/ddd; accessed 25 March 2019) (Deciphering Developmental Disorders Study 2015) . This included 1846 unique genes. Mapping them onto CanFam3.1 NCBI Annotation release 105 found 1809 orthologous genes. Variants in these 1809 genes were filtered in the same manner as explained above (Table S2).

Results and discussion
We analysed short-read whole genome sequencing data from 582 dogs and 8 wolves. The read coverage for the 590 genomes ranged between 10× and 66× with a mean coverage of 24× (Table S1). The exclusive usage of genomes with relatively high coverage improved the accuracy of variant detection and the assignment of correct genotype calls in each analysed dog genome. All WGS data available in the DBVDC collection are processed through the DBVDC pipeline, which detects sequence variants in the form of SNVs and short indels from alignments of the sequences. Each animal with a whole genome sequence is genotyped for all variants detected. The aim of the consortium is to make summary data and variant (VCF) flies available to the wider scientific community.
Diploid calls were confidently made for an average 94.0% of the autosomai bases in the reference genome, with a range of 80-97% across the 590 genomes. These numbers are similar to those reported in 132 individuals from five canid species . After filtering, a total of >33 million variants were identified consisting of 23 133 692 highconfidence SNVs and 10 048 038 indels (Table 1). Only 7% of these variants were contained in the NCBI dbSNP database (Build 151). The average SNV transition-to-transversion ratio was 2.08, and the SNV heterozygote-to-homozygote ratios were in the range of 0.5-3.3 (average 1.2). We identified 6 297 746 distinct short insertions (range 1-704 bp) and 3 750 292 distinct short deietions (range 1-329 bp), with an average of 1 42 7 335 indel variants per individuai genome. The estimated heterozygote-to-homozygote ratios were 0.4-2.5 for short indels (average 1.02).
We used NCBI Annotation Release 105 to functionally annotate the variants (Appendix S1) and identified SNVs in 29 413 genes and indels in 29 440 genes out of a total of 29 831. As expected, only 2% of the variants were located within protein-coding regions/exonic sequences. The number of homozygous and heterozygous variants varied greatiy between individual genomes. We speculate that this may reflect in part the different levels of inbreeding (Table 1).

Phylogenetic tree
A total of 288 452 biallelic exonic variants were used to construct a phylogenetic tree of the 590 genomes analysed. This confirmed the expected breed clustering (Fig. S1) as reported in other studies (Plassais et al. 2019). Subtle differences include a single Dalmatian which was closer to the English Pointer in our tree, whereas Dalmatians were closer to Curly Coated Retrievers and Dachshunds in Plassais et al. (2019). Also, Greyhounds in our tree were closer to the Terriers, whereas in Plassais et al. (2019), they were closer to Borzoi and Bouvier des Flandres.

Loss-of-function-tolerant genes
The entire DBVCD dataset comprising 590 genomes contained 32 240 loss-of-function (LoF) variants (5614 SNVs and 26 626 indels). LoFs represent a more stringently filtered subset of high-impact variants, based on previously suggested parameters (MacArthur et al. 2012). Only 181 of these LoF variants were found in dbSNP (Build 151). The LoF variants were found in 40 847 RefSeq protein-coding transcripts (including XM xenoRefs) belonging to 8109 genes.
In order to look for LoF-tolerant genes, i.e. genes that are not essential for survival and reproduction, we subjected the putative LoF variants to a series of filtering criteria. The exclusion criteria were as follows: (i) LoF variants not occurring in homozygous variant state in at least one of the individuals, (ii) LoF variants for which all protein-coding transcripts of the gene were not affected, (iii) LoF variants that overlapped any repetitive sequences, (iv) variants affecting non-canonical splice sites, (v) indel variants affecting splice sites and (vi) known OMIA gene variants published in OMIA. After performing these filtering steps, we obtained 1897 genes harbouring LoF variants (Table S2) from a total of 13 603 genes annotated to contain LoF variants. Enrichment analysis (Appendix S1) revealed that the olfactory receptor genes are the only large class of functionally related genes that are tolerant of LoF variants.

Potential embryonic lethal and developmental disorder variants
We used our dataset to identify and prioritize variants that could potentially give rise to recessive alleles causing embryonic lethality (EL) in dogs. We obtained 528 dog orthologues to genes with published evidence for EL alleles in other species. After filtering for variants as detailed in Appendix S1, we identified 247 genes where a single or a small number of dogs were heterozygous whereas none of the 590 genomes were homozygous for the mutant allele (Table S2). Across the 590 genomes, each genome carried on average 30 potential EL alleles (range 6-181). This is slightly higher than what has been reported for 624 cattle (Charlier et al. 2016).
Loss-of-function variants were also analysed for 2211 dog orthologues to DD genes. We found potentially deleterious variants in 894 DD genes (Table S2). Across the 590 genomes, each genome carried on average 97 potential DD alleles (range 21-583).
The numbers of identified potential EL and DD alleles are probably overestimated, as the current NCBI Annotation Release 105 of the CanFam3.1 reference genome is still imperfect, resulting in erroneous SNPEFF functional effect predictions. Future improvements in the reference genome assembly and its functional annotation will further increase the accuracy of the variant catalogue. We also acknowledge that our catalogue lacks most of the structural variation, which may be assessed by the use of third-generation long-read sequencing technologies.

Relevance for biomedical research
Work based on the DBVDC variant catalogue has already resulted in the identification of more than 50 causative variants for various inherited traits in dogs (Tables 2 & S2). These included many canine homologues of known human hereditary disorders, but also identified several novel genes not previously known to be associated with human diseases, thus providing new candidate genes for those homologous human diseases.

Conclusion
The variant analysis of 590 canine genomes identified ~33 million functionally annotated variants. We made an effort to include only genome sequences with high coverage and applied stringent filtering criteria to ensure the high quality of the variant and genotype calls. This dataset should help to identify causative variants for monogenic disorders more efficiently. The addition of more genomes will eventually also aid in the identification of causal variants for complex traits.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. Anim Genet. Author manuscript; available in PMC 2019 December 01.