Identification of male heterogametic sex‐determining regions on the Atlantic herring Clupea harengus genome

Abstract The sex determination system of Atlantic herring Clupea harengus L., a commercially important fish, was investigated. Low coverage whole‐genome sequencing of 48 females and 55 males and a genome‐wide association study revealed two regions on chromosomes 8 and 21 associated with sex. The genotyping data of the single nucleotide polymorphisms associated with sex showed that 99.4% of the available female genotypes were homozygous, whereas 68.6% of the available male genotypes were heterozygous. This is close to the theoretical expectation of homo/heterozygous distribution at low sequencing coverage when the males are factually heterozygous. This suggested a male heterogametic sex determination system in C. harengus, consistent with other species within the Clupeiformes group. There were 76 protein coding genes on the sex regions but none of these genes were previously reported master sex regulation genes, or obviously related to sex determination. However, many of these genes are expressed in testis or ovary in other species, but the exact genes controlling sex determination in C. harengus could not be identified.


| INTRODUCTION
The evolution of sexual reproduction has resulted in several sex determination systems, with gonochorous organisms (the stable separation of sexes in different individuals), stable hermaphrodites and organisms that change sex dependent on age, environmental and/or social cues (Devlin & Nagahama, 2002;Shen & Wang, 2014). Each of the different systems has evolved independently several times through evolutionary history, and even within each system there might exist several mechanisms for determining the sex of an organism . The best-known system is the XY sex chromosomes found in mammals, where the females have two X chromosomes and the males have an X and a Y chromosome. Thus, the XY system is a male heterogametic system. The sex-determining region Y (SRY) gene is located on the Y chromosome and signals to the body to develop into a male rather than a female, which is the default (Kashimada & Koopman, 2010). The ZW system is a similar system where the females are heterogametic.
This system is found in birds and some amphibians (Bull, 1983;Yoshimoto & Ito, 2011). These two systems do not represent the complexity of sex determination systems in the animal kingdom. Systems with only one sex chromosome also exist, for example the X0 system where males have only one sex chromosome and the Z0 system where females have only one Clinton, 1998). Furthermore, sex determination systems can be more complex, with multiple chromosomes or genes affecting the sex Roberts et al., 2016). There are even systems where age and size (Allsop & West, 2003), societal factors (Buston, 2003;Fricke, 1979) or environmental factors such as temperature (Pieau, 1996) play crucial roles in sex determination. In some organisms, both genetic and environmental factors are involved in determining the sex, for example in the Nile tilapia Oreochromis niloticus (Linnaeus 1758) (Baroiller et al., 2009) and Atlantic silverside Menidia menidia (Linnaeus 1766) (Lagomarsino & Conover, 1993).
Teleost fish display a variety of sex determination systems (Brykov, 2014;Devlin & Nagahama, 2002) and the plasticity of teleost genomes makes it possible for new systems to evolve relatively quickly. This makes teleost fish good candidates for studying the evolution of sex determination. Although there are sex determination systems in fish that are influenced by nongenetic factors (see the references above), genetic sex determination seems to be more common. The male heterogametic system (hereinafter 'the XY system') has been established in some fish species, for example bighead carp Hypophthalmichthys nobilis (Richardson 1845) and silver carp Hypophthalmichthys molitrix (Valenciennes 1844) (Liu et al., 2018), as has the female heterogametic system (hereinafter 'the ZW system') in half-smooth tongue sole Cynoglossus semilaevis Günther 1873 (Chen et al., 2014). The cichlid fishes of Lake Malawi have families with the XY system and others with the ZW system, but notably the species Metriaclima pyrsonotus (Stauffer, Bowers, Kellogg & McKaye 1997) has both these systems, showing strong epistatic interactions between them (Ser et al., 2010). Several polygenic systems are also found in fish, such as the European sea bass Dicentrarchus labrax (Linnaeus 1758) (Palaiokostas et al., 2015) and the cichlid fish Astatotilapia burtoni (Günther 1894) (Roberts et al., 2016). There are even individuals from the same species that have different sex determination systems, for example some zebrafish Danio rerio (Hamilton 1822) laboratory strains have lost the sex-determining region that is present in wild-type D. rerio, and therefore have evolved a new polygenic system that is still not fully understood (Wilson et al., 2014).
In some organisms (e.g., mammals and birds), sex chromosomes have evolved that contain master sex regulation (MSR) genes that control the sex of the organism, as with the previously mentioned SRY. Most species of fish do not have specific heteromorphic chromosomes that control sex, but have regions on autosomes that are associated with sex determination. These sex regions sometimes contain MSR genes or candidate MSR genes, such as the Y chromosomespecific anti-Müllerian hormone (amhy) gene in Patagonian pejerrey Odontesthes hatcheri (Eigenmann 1909) (Hattori et al., 2012) or the sexually dimorphic gene on the Y chromosome (sdy) in rainbow trout Oncorhynchus mykiss (Walbaum 1792) (Yano et al., 2012). However, sometimes no obvious causal genes are found on these regions that have been associated with sex, such as in the mandarin fish Siniperca chuatsi (Basilewsky 1855) (Sun et al., 2017).
In the Clupeidae family, few species have been studied regarding their sex determination systems. In the Tree of Sex Consortium database , only six Clupeiformes species are mentioned; four of these are a part of the Clupeidae family. Two are hemaphrodites [the toli shad Tenualosa toli (Valenciennes 1847) and the longtail shad T. macrura (Bleeker 1852)], whereas the Argentine menhaden Brevoortia pectinata (Jenyns 1842) and Brazilian menhaden B. aurea (Spix & Agassiz 1829) are both gonochoristic. B. pectinata is homomorphic and B. aurea is male heterogametic with X 1 X 2 Y sex chromosomes (Brum, 1992). In addition, the Gulf menhaden Brevoortia patronus Goode 1878, the yellowfin menhaden Brevoortia smithi Hildebrand 1941 and the Atlantic menhaden Brevoortia tyrannus (Latrobe 1802) are gonochoristic and homomorphic (Doucette Jr & Fitzsimons, 1988), but their sex determination systems are not known. The sex determination system of the commercially important Atlantic herring Clupea harengus L. has not yet been described.
Increasing the knowledge of sex determination at this branch of the tree of life would further elucidate the evolution of sex determination in teleost fish. We therefore undertook this study to find regions on the C. harengus genome that are associated with sex determination.

| Ethical statement
The C. harengus samples were received from stock assessment cruises and commercial catches in the north-east Atlantic. No fish were caught or handled while alive for the purpose of this project. All fish were dead when they were selected for the study. Thus, the research did not involve animal experimentation or harm, and required no ethical permits.

| Samples and DNA extraction
Kidney samples were taken from 103 adult Atlantic herring, originated from four stocks, with ages ranging from 3 to 12 years with an average of 6.1 years. The sex was determined by visual inspection of the gonads by experienced staff at the Faroe Marine Research Institute, revealing 48 females and 55 males. DNA was extracted from the kidney tissue of these fish using an AS1000 Maxwell 16 instrument (Promega, Madison, WI, U.S.A.) and the Maxwell 16 Tissue DNA Purification Kit (Promega). DNA concentrations were measured using a Qubit 3.0 fluorometer (ThermoFisher Scientific,Waltham, MA, U.S.A.).

| Sequencing
The isolated DNA from each individual was fragmented to roughly 300 bp using a Covaris M220 focused-ultrasonicator (Covaris, Chicago, IL, U.S.A.), and the libraries were prepared using the KAPA LTP Library Preparation Kit Illumina Platforms (KAPABiosystems, Wilmington, MA, U.S.A.). Approximately 1 μg of input DNA was used for each library and a final concentration of 1 μM of the 6 bp adapters (Pentabase, Odense, Denmark). After the ligation step, double-sided size selection for fragments between 250 and 450 bp was performed, following the manufacturer's instructions.
The libraries were amplified for two to four cycles, depending on postligation concentration, and no further size selection was performed.

| Data processing and variant calling
Trimmomatic v0.36 was used to remove adapter sequences and trim low-quality bases with an average quality score lower than 20 (sliding window of four bases) from the paired-end data (Bolger et al., 2014).
AfterQC v0.4.0 was used to remove the polyG reads (Sun et al., 2017) and FastQC v0.11.5 was used to assess the quality of all the sequencing data before and after adapter removal and low-quality base trimming (Andrews, 2010). The sequencing reads are available in the European Nucleotide Archive repository, with accession numbers from ERS4329014 to ERS4329116.
However, data with coverage 1 were used when comparing the experimental genotype data with a theoretical model of how to infer reliable diploid genotypes from sequence data (see below).
Sex-specific insertions and deletions (indels) on the regions associated with sex were also called using FreeBayes and the aligned sequencing data from males and females separately. Indels with more than 30% of the genotypes missing (due to low coverage) and indels where homozygous reference allele genotypes were present were filtered out.

| Association analysis
A genome-wide association study (GWAS) was performed using Plink v1.07 (Purcell et al., 2007) to test whether any of the SNPs identified were associated with sex. For the GWAS, phenotypic sex was used as cases (females) and controls (males), and the Cochran-Mantel-Haenszel test was used to account for the population stratification (−mh option in Plink).
The Bonferroni correction for multiple testing assumes that each test is independent. This assumption is not always true for a GWAS because of linked SNPs, thus the Bonferroni correction can be considered too conservative. In human genetics, the genome-wide significance P value threshold of P = 5 × 10 −8 is standard for common-variant GWAS (Fadista et al., 2016). In this case, the null hypothesis is rejected if P < 5 × 10 −8 or -log 10 (P) > 7.3. However, no studies have been published that show this value to be true for herring. Therefore, we chose to use the conservative Bonferroni correction (0.05/number of tests, which was 7,614,270) P = 0.66 × 10 −8 and reject the null hypothesis of no association for -log 10 (P) > 8.2. The R packages qqman (Turner, 2014) and ABHgenotypeR (Furuta et al., 2017) were used for visualization of the results.

| Statistical analysis
We compared the experimental genotype data with a theoretical model of how to infer reliable diploid genotypes from the sequence data. In this model, the probability of having x identical reads given homozygous genotypes is P(x|hom) = 1, while the probability of having x identical reads given heterozygous genotypes is P(x|het) = 1/2 x− 1 (Chenuil, 2012). For the number of identical reads (x) ranging from 1 to 21, we plotted the predicted proportions of homozygous genotypes together with the experimental data. The error bars of the experimental data correspond to the 95% confidence interval of the exact binomial test in R. We made no comparison for the number of identical reads higher than 21 because of a low number of samples with such high read coverage.

| Search for causal genes
Possible orthologs for the genes in the significant regions were found Furthermore, the sequences of 17 known sex determination or differentiation genes in fish were blasted against the C. harengus genome to investigate if any of these genes were present but not predicted for the C. harengus genome. The FASTA sequences were obtained from public repositories and blasted against the C. harengus genome using BLAST+ (Camacho et al., 2009) (Table 3). This suggested a male heterogametic or XY sex determination system for C. harengus.
The erroneous call at low sequence coverage of homozygotes from factual heterozygotes is as expected, and was theoretically investigated in a previous study (Chenuil, 2012). Thus, the true rate of heterozygotes in our data was higher than our result of 68.6%, but this could not be detected due to low sequencing coverage, resulting in male genotypes possibly being wrongly called as homozygous.  proportions of homozygotes are all larger than 0.939 and 16 of them are larger than 0.990. The median is 0.996 (interquartile range = 0.008).
Nine are exactly equal to 1 as expected for the females, while the binomial test rejects the null hypothesis for females in the remaining 12 of the 21 coverages (Supporting Information Table S1 and Figure 3). For the eight highest coverages of 14-21 only one is rejected. For males, the numerical discrepancies from the theoretical model are much larger and the binomial test rejects the null hypothesis for males in 17 of the 21 coverages, while we find an exact agreement for four of the five highest coverages of 17-21 (Supporting Information Table S3 and Figure 3). However, the overall trend for males is very different from the females and much more similar to the theoretical model, since the male homozygous proportion decreases towards zero for increasing coverage.
The experimental results for males indicate that perhaps not all the sex SNPs, but the majority must be heterozygous for males to develop. Nevertheless, these results support the suggestion of an XY sex determination system for C. harengus.
We see four possible explanations for the deviations from the theoretical expected proportions in Figure 3: a. The physiological sex has been wrongly registered. We think this explanation is unlikely because Figure 2  We investigated these genes for possible involvement in sex determination. None of these genes have previously been shown to be MSR genes in other organisms. However, to investigate further, possible orthologs for these genes were found and their reported functions investigated. None of the 76 genes were obvious candidates for being MSR genes, but 11 showed some potential linkage with sex determination or sex-related functions. These are listed in Table 5, together with their possible link to sex determination. The orthologs of 8 of these 11 genes had noteworthy expression patterns or were X-linked ( To examine whether the sex SNPs could have functional consequences, their locations were investigated in more detail. Of these 529 SNPs, 151 were located in intergenic regions and 105 were located in promoter regions ( Table 6). The SNPs in promoter regions could potentially affect the expression of genes. The remaining 273 SNPs were located in protein-coding genes; however, the majority (167) were located in introns (Table 6). Among the 57 SNPs located in coding regions, 30 caused amino acid substitutions (Table 6).
Among the 76 genes in the sex regions, six had sex SNPs that caused nonconservative amino acid substitutions in exons. These substitutions were in tpcn1, iqcd, loc105890446, claudin-4-like (loc105890498), mettl27 and bmpr1bb. Table 7 lists the nonsynonymous SNPs and their corresponding amino acid substitutions. It is possible that these SNPs could have an effect on these genes, but to answer this question functional analyses need to be carried out.
Most of the genotypes are homozygous for the alternative allele, but this could also be affected by the low sequencing coverage, as mentioned before.
None of the known sex determination or differentiation genes in fish were found on or close to the sex regions identified in this study.
This could suggest that C. harengus has an unknown sex determination mechanism.

| DISCUSSION
We identified two regions on two chromosomes on the C. harengus genome that were associated with sex. The data strongly indicated that females are homozygous, whereas the males are heterozygous for the SNPs in these sex-associated regions. This is consistent with an XY sex determination system. There are 76 protein-coding genes in these associated regions but no obvious MSR genes. However, some of these genes could potentially affect sex determination or development because they are associated with sex organs or sex functions in other species, as briefly referred to in the Results section (Table 5). Neither the investigation of the amino acid substitutions caused by SNPs nor that of indels pointed to a single sex determination gene in C. harengus.

| Low sequencing coverage
The SNPs were identified by low coverage whole-genome sequencing (on average 3 to 4× over the whole genome). This potentially resulted in some caveats regarding the genotypes. First, it is more likely to have missed genotypic data for some of the SNPs in some of the individuals, simply because the area has not been sequenced. Second, sequencing errors are more likely to be implemented as variations and could result in falsely called low-frequency alleles. This is not a problem in the present situation because we are dealing with highfrequency alleles. The third caveat is more serious: if, by chance, only one of the alleles from a heterozygous individual is sequenced, the genotype would always be called homozygous. With an average coverage of 3×, the probability of sequencing only one of the two alleles is on average 12.5% (and 12.5% for the other allele). Thus, statistically we would achieve a 75% detection rate in a group consisting of 100% heterozygotes (Chenuil, 2012 making it more probable that they are miscalled (Table 4). Note: There were 55 males and 48 females but not all individuals had data for all variations because of the low sequencing coverage. Only insertions and deletions present in all individuals (with data) of the same sex were included. A, homozygous alternative allele; H, heterozygous.
Furthermore, the observed male proportions of homozygotes versus coverage followed the same trend as the theoretically expected proportions if all genotypes were truly heterozygous (Figure 3). These results indicate that all the sex SNPs could potentially be heterozygotes in males.
One way to verify this would be to repeat the experiment with higher coverage. Meynert et al. (2014) demonstrated experimentally that 9-13× coverage was required to correctly call 95% of heterozygous genotypes. Chenuil (2012) showed that with a coverage of 5× (where all reads show the same allele) and a heterozygous rate of 0.5, the homozygous genotype would be correct 95% of the time.
Therefore, a read depth of more than 5 would be appropriate to increase the sensitivity of correct genotypes to above 95% for both homo-and heterozygotes. The sex SNPs identified in this study could also be genotyped in genotyping experiments, rather than using sequencing.
In our study, the number of individuals sequenced partly makes up for the weakness caused by low coverage and shows that 99.4% of the female genotypes are homozygous, while at least 68.6% of the males are heterozygous. When genotyping an ideal male heterogametic sex-determining system with sex-linked SNP markers, we would expect females to be homozygous and males to be heterozygous at these markers. The very high proportion of homozygous females (99.4%) strongly supports this hypothesis, whereas the measured proportion of heterozygous males is limited by the much lower heterozygous sensitivity of the method at low coverage.

| Sex regions on the C. harengus genome
The association between sex and chromosome 8: 21,063,268,779 was stronger (i.e., lower P values) than for the region on chromosome 21, and it is also larger and contains more SNPs associated with sex ( Table 2). As sex chromosomes evolve, they tend to become less stable and accumulate genes that are sex-specific/beneficial, and eventually recombination between the homologous chromosomes stops and they become heteromorphic over time (Charlesworth et al., 2005). However, not all species develop heteromorphic sex chromosomes (Wright et al., 2016), for example the tiger pufferfish . This suggests that sex determination in herring could be complex, maybe polygenic (Moore & Roberts, 2013). Further studies are needed to investigate this possibility.
There are several possible explanations for the observation of two sex-related regions located at different chromosomes: a. It might simply be a statistical coincidence. If so, the shorter region on chromosome 21 is most likely the one that is erroneously pointed out. However, as a rather conservative statistical threshold has been used, we think this is not a likely option.
b. There might be an error in the chromosome assembly. A piece of chromosome 8 might have been assembled into chromosome 21, for example, because of similar repeated sequences at the two chromosomes. We have not been able to detect such repeated sequences. Still, assembly errors are rather common, even in highquality assemblies, so we will not exclude this possibility.
c. If we assume that that it is biologically correct that there are two separate sex regions, and they segregate normally, we should have

| Potential genes involved in sex determination
As mentioned in the Results section, the genes on the sex regions are not known MSR genes or known to be part of the sex determination pathway, so a specific gene could not be identified as the most probable MSR gene. Therefore, the potential effect of sex SNPs was investigated further.
First, 105 sex SNPs were present in promoter regions and could alter the expression of these genes, thereby affecting the sex determi- and some of the SNPs found in this study to be associated with sex could affect a noncoding element that has not been identified yet.
In addition, there might be sex-specific sequences present in C. harengus. For example, if the reference genome assembly was from a female individual and C. harengus has a male-specific sequence that controls sex determination, then this study would not be able to identify this. It is likely that this is the case, and that the sex-specific sequence is at or close to the sex regions identified here, and the heterozygous male sex SNPs are linked with this sex-specific sequence.
Targeted sequencing of these regions could reveal if this is true.

| Evolution of sex determination within the Clupeiformes order
Teleost fishes have highly diverse sex determination systems . The XY sex determination system for C. harengus, suggested in this study, fits well with the other Clupeiformes mentioned in the Introduction.
A study by Pennell et al. (2018) indicated that in fish, transitions from gonochorism to hermaphroditism occur at higher rates than the reverse, and transitions from female to male heterogamety occur at higher rates than the reverse. They also found similar transition rates between homomorphic and heteromorphic sex chromosomes in both fish and amphibians. This could suggest that the common ancestor for Clupeidae and Engraulidae had a Z0 or ZW sex determination system, which is still present in Coilia nasus (Xu et al., 2014). The common ancestor for Clupeidae then lost the Z chromosome and adapted to a XY system, which has been found in Brevoortia and now also in Clupea. These sex chromosomes are early in their evolution and still homomorphic, as seen in Brevoortia spp. and C. harengus. A single known exception is B. aurea, a species that has heteromorphic sex chromosomes with two X and one Y chromosome. As Tenulosa split from Brevoortia and Clupea, they evolved to be hermaphrodites. Of course, this series of events is a speculative hypothesis at present.

| Conclusion and future work
We identified regions on the C. harengus genome that were associated with sex. The genotypes of the SNPs associated with sex indicated an XY sex determination system for C. harengus, which is consistent with other Clupeiformes species. Nonetheless, we could not identify the exact genes for sex determination. None of the known sex determination genes in fish were found on or close to the sex regions, indicating that C. harengus could have a previously unregistered, unknown sex determination mechanism. New experiments where these sex regions are sequenced at a higher coverage for both males and females should be conducted to reproduce and more effectively delineate the sex determination regions. This would also better characterize the potential existence of homozygous SNPs in a small proportion of the males in these regions and could identify possible sex-specific sequences.

ACKNOWLEDGEMENTS
We wish to thank the staff aboard the local fishing boats 'Gunnvør',