Draft genome of the river water buffalo

Abstract Water buffalo (Bubalus bubalis), a large‐sized member of the Bovidae family, is considered as an important livestock species throughout Southeast Asia. In order to better understand the molecular basis of buffalo improvement and breeding, we sequenced and assembled the genome (2n=50) of a river buffalo species Bubalus bubalis from Bangladesh. Its genome size is 2.77 Gb, with a contig N50 of 25 kb and the scaffold N50 of 6.9 Mbp. Based on the assembled genome, we annotated 24,613 genes for future functional genomics studies. Phylogenetic tree analysis of cattle and water buffalo lineages showed that they diverged about 5.8–9.8 million years ago. Our findings provide an insight into the water buffalo genome which will contribute in further research on buffalo such as molecular breeding, understanding complex traits, conservation, and biodiversity.

Swamp type is widely used for draught, but river type is mostly used for milk and meat. It is known that river type buffalo is the second largest dairy species in the world. In addition to milk, a significant amount of Asian meat and hide production comes from buffalo (Pasha & Hayat, 2012). Buffalo meat is healthier as it is relatively lean with a fat content of less than 2% (Borghese, & Manzi, 2005).
Limited research has been conducted globally in order to explore the genetic diversity, and molecular genetic basis in buffalo compared to other farm animals (Gonçalves, Silva, Barbosa, & Schneider, 2004). Molecular markers can be a powerful tool for livestock improvement through breeding strategies. Based on the cattle genome, Madhu et al. assembled the buffalo genome with 17-19× Illumina reads and only with a median contig length of 2.3 kb (Tantia et al., 2011). As poorly assembled results, they only identified some SNPs and indels in the buffalo genome (Tantia et al., 2011). Recently, Glanzmann et al. (2016) reported the African buffalo genome (Syncerus caffer, 2n=52) which was assembled as 2.68 Gb with a contig N50 of 43 kbp and Scaffold N50 of 2.3 Mbp.
Genome analysis has found 19,765 genes and 97.60% of them could be successfully annotated. Moreover, they identified some expanded predicted genes which are coupled to a large variety of GO terms, including G-coupled protein and olfactory receptors. Williams et al. (2017) assembled the water buffalo genome and obtained 2.83 Gb size with the scaffold N50 of 1.4 Mb and contig N50 of 21.4 kb. Based on the RNA-seq data, they identified 21,711 protein coding genes. Although they presented the water buffalo genome, the completeness of the genome such as the scaffold N50 can still be improved. In this study, we present sequencing and de novo genome assembly of Bangladeshi buffalo which is a river type water buffalo. The results obtained in this study could be used for the breed development of water buffalo through molecular breeding not only in productive traits to ensure higher milk and meat yield but also disease resistance and environmental adaptivity in the changing global climate.

| Genomic DNA and sequencing
The river water buffalo genomic DNA was extracted from an eightyear-old plain land reverian type male water buffalo's blood from Bangladesh. A series of short-insert (170, 500, and 800 bp) DNA libraries were constructed according to the Illumina manufacturer's protocol. As for the long-insert mate-paired libraries (>1 kb), approximately 20-50 μg genomic DNA was fragmented, biotin labeled, circularized, broken, and enriched using biotin/streptavidin, to generate the libraries. Illumina HiSeq2000 paired-end sequencing was performed with PE101 for short-insert libraries, and PE50 for longinsert libraries.

| Data filtering and error correction
The raw reads 350.80 Gbp generated from the Solexa-Pipeline should be filtered. The filtering criteria were as follows: (a) filtered reads with more than 2% of Ns or containing polyA structure; (b) filtered reads where 40% of the bases had a low Phred quality values (<8) for short-insert libraries and 60% bases for large-insert libraries; (c) filtered reads with more than 10 bp aligned to the adapter sequence (allowing <4 bp mismatch); (d) filtered paired reads with overlapped sequence length >10 bp (allowing 10% mismatch); (e) filtered PCR-duplicated reads. The filtered reads is 255.95 Gb.
We built a K-mer (K = 17) frequency table, set the cutoff at 8 for dividing low-frequency K-mers and high-frequency ones, and performed the error correction step to eliminate sequencing errors.
Low-frequency K-mers were corrected in the error bases or trimmed at the ends to form high-frequency K-mers.

| Genome assembly
Based on SOAPdenovo V2.01 (Luo et al., 2012), the error-corrected short-insert library reads were split into K-mers (K = 41) to first construct a de Bruijn graph, which was simplified by removing tips, merging bubbles, and solving repeats to get 235,999 contigs.
Secondly, we realigned all reads gradually to the contig sequences to determine the shared paired-end relationships, weigh the rate of consistent and conflicting paired ends, and construct 33,840 scaffolds. Also, we utilized SSPACE V1.1 (Boetzer, Henkel, Jansen, Butler, & Pirovano, 2010) to improve scaffold lengths. Finally, we extracted the short-insert size reads with one end mapped to the contig and the other end located in the inner-scaffold gap region, then we performed a local assembly to fill the gaps.

| CpG island identification
We identified CpG islands in the water buffalo and cattle genomes by using the search algorithm developed by Takai and Jones (Takai & Jones, 2002). Regions of DNA of greater than 500 bp with a G + C equal to or greater than 55% and observed CpG/expected CpG of 0.65 were more likely to be associated with the 5' regions of genes and this definition excludes most Alu-repetitive elements.

| Identification of synteny and segmental duplication
With parameters T = 2, C = 2, H = 2000, Y = 3,400, L = 6,000, and K = 2,200, we used LASTZ (Harris, 2007) to detect synteny blocks between the water buffalo and other mammals. Before the genome alignment, we downloaded the masked repeat genome of cattle from ensemble and used the following method to mask the buffalo genome, after that LASTZ was used for alignment. Based on the selfalignment implement, we used whole-genome assembly comparison (WGAC) to identify segmental duplications by LASTZ. We defined segmental duplications to be two sequences with a length larger than 1 kb and have a higher identity than 90% and lower than 99.5%. The resulting alignments that extended to >1 kb length and had >90% sequence identity were deemed recent segmental duplications.

| Annotation of repeats and non-coding RNA
We identified repeat sequences by combining the homologous strategy and de novo method. Based on the homologous strategy, known transposable elements (TEs) were identified against the Repbase (Jurka et al., 2005) 16.10 TE library using RepeatMasker (Version 3.3.0) (Smit et al.) and RepeatProteinMask at the DNA and protein level. In parallel, we annotated the tandem repeats using Tandem Repeats Finder (Version 4.04) (Benson, 1999). To generate a de novo repeat library, we implemented the LTR_finder (Version 1.05) (Xu & Wang, 2007), Repeatmodeler (Version 1.0.5) (Smit & Hubley), and a RepeatMasker analysis. We detected four types of non-coding RNAs by searching the whole-genome sequence. The transfer RNAs were predicted by tRNAscan-SE-1.23 (Lowe & Eddy, 1997). The ribosomal RNAs were found by aligning with the human rRNA sequences.
The snRNAs and miRNAs were identified by aligning with BLASTN (Version 2.2.23) (Altschul et al., 1997)
Eventually, a consensus gene set was produced using EVM (Haas et al., 2008) to integrate the source evidence generated from both approaches above.

| Function annotation
Based on the databases TrEMBL (Bairoch & Apweiler, 2000) and SwissProt (Bairoch & Apweiler, 2000), we assigned gene functions in accordance with the best match of the alignments. The domains and motifs of water buffalo genes were acknowledged by InterProScan (Quevillon et al., 2005) against six protein databases, including ProDom, Pfam, PRINTS, PANTHER, PROSITE, and SMART.
Meanwhile, we obtained Gene Ontology (GO) (Ashburner et al., 2000) IDs for each gene from the corresponding InterPro entries. At last, we aligned all genes against KEGG proteins and pathway.

| Gene family clusters
We used the Treefam (Li et al., 2006) methodology to define a gene family as a group of genes that descended from a single gene in the last common ancestor of a considered species. As per the following steps: (a) Blastp was applied to all protein sequences (water buffalo, Bactrian Camel, cattle, horse, yak, American bison) against a database containing a protein dataset of all species with the e-value of 1 × e −7 and conjoined fragmental alignments for each gene pair by Solar (Li et al., 2010). We assigned a connection (edge) between the two nodes (genes) if more than one-third of the region aligned to both genes. An H-score that ranged from 0 to 100 was used to weigh the similarity (edge). (b) Extraction of gene families (clustering by H-cluster_sg). We used the average distance for the hierarchical clustering algorithm, requiring the minimum edge weight (H-score) to be larger than 5, and the minimum edge density (total number of edges/theoretical number of edges) to be larger than one-third.

| Gene family expansion and contraction
We used CAFÉ (De Bie, Cristianini, Demuth, & Hahn, 2006) to analyze changes in gene family size under a random birth and death model, over the phylogenetic tree with divergence times. A global gene birth and death rate parameter λ across all branches for all gene families was estimated using the maximum likelihood method. The conditional p-value was calculated, and families with a p-value <0.05 were determined to suffer significant family changes.

| Divergence time
The CDS sequences of the single-copy orthologous genes were used for estimating divergence times based on the phylogenetic tree. The PAML (Yang, 2007) MCMCTREE (Yang & Rannala, 2006) performs Bayesian estimation of species divergence times using soft fossil constraints under various molecular clock models. The Markov chain Monte Carlo (MCMC) process of PAML MCMCTREE was run to sample 1,000,000 times, with sample frequency set to 50, after a burn-in of 5,000,000 iterations. Parameters of "finetune" were set as "0.004, 0.016, 0.01, 0.10, 0.58." Other parameters were set as default. We did two independent runs to confirm that the different runs produced very similar results.

| Positive selection analysis
As mentioned above, we obtained single-copy orthologous genes from six species. Then, we calculated dN/dS ratios for these single-copy orthologous genes using codeml in the PAML (Yang, 2007) package to estimate positive selection. Then, we used PRANK (Löytynoja & Goldman, 2008) and Gblocks (Castresana, 2000) software to estimate Ka, Ks, and Ka/Ks ratios on each branch.

| RE SULT AND D ISCUSS I ON
Libraries were constructed with insert sizes ranging from 170 bp to 20 kbp, from which 350.80Gbp of paired-end sequencing data were generated using the Illumina Hiseq2000 platform. After filtering out low quality, adapter-contaminated, PCR-duplicated, and small-insert reads, we obtained 255.95 Gbp of clean data, covering the water buffalo genome with an approximately 87-fold depth and 2,777-fold physical depth ( Table 1). The water buffalo genome size was estimated to be 2.95 Gbp. After the error-containing library data with low-frequency K-mers of short insert size (<1 kbp) had been corrected, contigs and scaffolds were constructed using the data with SOAPdenovo software, further super scaffolds were built by SSPACE, and the inner-scaffold gaps were filled by GapCloser. A total of 2.77 Gbp of assembled sequences were obtained, with a contig N50 of 25 kbp and scaffold N50 of 6.96 Mbp. These assembled results were comparable to those of the previously obtained African buffalo genome and water buffalo genome (Table 2). In order to evaluate the genome completeness, two different methods were used. The first one was based on EST/mRNA sequences downloaded from the NCBI and aligned to our genome by BLAT (Kent, 2002), where approximately 98.15% of the data could be well aligned, demonstrating a well-assembled genome. The second method involved benchmarking against universal single-copy orthologs (BUSCO 2.0), where our assembly covered 94.3% of the core genes, with 3,870 genes being completed. This also implied the high quality of our assembly. We identified more CpG islands in the water buffalo genome (39,578) than in the cattle genome (12,120) (Han, Su, Li, & Zhao, 2008).
This difference was mainly because these two species have different recombination rate and chromosome size (Jobse et al., 1995 coding genes (Figure 1) On the basis of a combination of ab initio gene finders, a homology-based method, and an RNA-Seq method, we predicted 24,613 water buffalo genes (Table 6). BUSCO (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) was carried out to evaluate the gene prediction quality, and results showed that 97.1% of orthologs could be found in our annotation (Table 7) It is possible that these expanded genes are related to environmental adaptation and specific molecular genetics mechanisms. To identify genes that might be candidates for the water buffalo's adaption to its environment, we identified 382 genes that contained positive selection sites in buffalo. These genes were mostly annotated to signal transduction pathway, metabolic pathway, and immune system functional pathway. LGM) at about 20,000 years ago and rised to another peak almost simultaneous with the Penultimate Glaciation (PG) at about 200,000 years ago (vertical gray shadow on graph). The graph's horizontal axis shows the measurement of time by pairwise sequence divergence, and the vertical axis shows the measurement of the effective population size by the scaled mutation rate. The light pink lines correspond to PSMC inferences on 100 rounds of bootstrapped sequences and the red line stands for the estimate from the data It is interesting to infer the demography of a diploid species up to hundreds of generations ago using its whole-genome sequence data (e.g., by using pairwise sequential Markovian coalescence; PSMC) (Li & Durbin, 2011). The reads used for assembling the buffalo genome were mapped onto the assembled genome. A total of 5,704,306 heterozygous loci were identified as putative heterozygotes in the genome with a heterozygosity rate of 0.2% as obtained by BWA ) and SAMtools ). We employed the PSMC method to explore the changes in effective population size (Ne) of the ancestral population of the buffalo to accommodate the Quaternary climatic change (Kelley et al., 2014). Assuming that the inference of the mutation rate for buffalo is correct, the analysis suggests that the buffalo population expansions occurred before the advent of penultimate glaciation (PG) at about 200,000 years ago and the population size declines after the retreat of PG (Figure 4). When climate became favorable, the buffalo population size had reached a maximum size coinciding with the largest glacial maximum (LGM) at about 20,000 years ago (Zheng, Xu, & Shen, 2002). Subsequently, the buffalo population plummeted in the late period of the LGM, which may have led to the grassland degeneration and forest vegetation restoration (Mei et al., 2016).

| CON CLUS ION
In this study, we have provided a draft genome and evolutionary analysis of the water river buffalo from Bangladesh. This study has shed light on the genomic synteny, phylogenetic position and split time among the Artiodactyla order. The integrated water buffalo genome map shows a brief overview of the evolutionary characteristics we have elaborated upon above. In addition, we have presented a usable water buffalo genome which has important practical purposes for economic application to water buffalo-derived products.
Moreover, this will be useful for generating a water buffalo reference genome for data mining, in order to promote the genetic engineering, molecular research, and breeding of buffalo.

ACK N OWLED G M ENTS
We express our sincere thanks to Mr. Jiang Sanqiao (Winall Seed Co.

CO N FLI C T O F I NTE R E S T S
The authors declare that they have no competing interests.

DATA ACCE SS I B I LIT Y
This whole-genome shotgun project has been deposited at DDBJ/ ENA/GenBank under the accession NPZD00000000. The version described in this paper is version NPZD01000000. Raw DNA sequencing reads have been submitted to the NCBI Sequence Read Archive database (SRA488780).