SMN1 copy‐number and sequence variant analysis from next‐generation sequencing data

Abstract Spinal muscular atrophy (SMA) is a severe neuromuscular autosomal recessive disorder affecting 1/10,000 live births. Most SMA patients present homozygous deletion of SMN1, while the vast majority of SMA carriers present only a single SMN1 copy. The sequence similarity between SMN1 and SMN2, and the complexity of the SMN locus makes the estimation of the SMN1 copy‐number by next‐generation sequencing (NGS) very difficult. Here, we present SMAca, the first python tool to detect SMA carriers and estimate the absolute SMN1 copy‐number using NGS data. Moreover, SMAca takes advantage of the knowledge of certain variants specific to SMN1 duplication to also identify silent carriers. This tool has been validated with a cohort of 326 samples from the Navarra 1000 Genomes Project (NAGEN1000). SMAca was developed with a focus on execution speed and easy installation. This combination makes it especially suitable to be integrated into production NGS pipelines. Source code and documentation are available at https://www.github.com/babelomics/SMAca.

patients show a complete absence of at least exon 7 of SMN1. Most of the remaining patients have a single copy of the SMN1 gene which is inactive due to point mutations or deletion of exons 1-6. The disease severity is determined mainly by a copy gene, SMN2. The more SMN2 copies present, the milder the phenotype usually is. Both genes, located on chromosome 5q13.2, can be distinguished by only five nucleotides (Monani et al., 1999).
Most individuals have two copies of each SMN1 and SMN2, however, due to the complex genomic structure, gene conversion and rearrangements occur quite frequently in SMN locus leading to copy-number variations (MacDonald et al., 2014). The majority of SMA patients have an SMN1 deletion or gene conversion of SMN1 into SMN2, which results in a homozygous loss of SMN1 exon 7 or exons 7 and 8. Establishing the SMN2 copy number is of importance for SMA patients due to the inverse correlation between disease severity and SMN2 copy number. SMA carriers are mostly asymptomatic and can be identified by the presence of only a single SMN1 exon 7 copy. About 5% of SMA carriers have two SMN1 copies in cis in the same chromosome and 0 copies on the other (2 + 0) known as "silent carriers" . To detect silent carriers, we select samples with two polymorphisms (g.27134T>G [NC_000005.9:g.70247901T>G, rs143838139] and g.27706_27707de-lAT [NC_000005.9:g.70248473_70248474del, rs200800214]) associated with duplication events in SMN1 (Luo et al., 2014). Depending on the number of SMN2 copies, the expected should be close to 0.75 (2:1) or 0.5 (2:0) and, in both cases, the scaled coverage proportion of SMN1 should be close to ½ in each position. Finally, a variant, referred to as SMN1/2Δ7-8, contains one or two extra copies of SMN exons 1-6 of SMN1 or SMN2. This variant is often present in individuals with no, or only one, SMN2. Although it is frequently found (23%) in Spanish carriers and noncarriers, its clinical significance is not yet completely understood . The frequency of SMA carriers in the population is around 1.7%-2.1% (Larson et al., 2015;Su et al., 2011), presenting most of them only a single SMN1 exon 7 copy.
At present, the gold standard genetic test for SMA diagnosis is multiplex ligation-dependent probe amplification (MLPA) of SMN1 and SMN2 although it cannot identify silent carriers (2 + 0) nor subtle mutations in SMN1 (false-negative rate of approximately 5%). However, next-generation sequencing (NGS) technology is rapidly becoming a costeffective approach for clinical testing (Boycott et al., 2019). Despite the difficulties that the accurate determination of two genes almost identical inherent to short-read technologies, some strategies to process NGS data have already been proposed to detect SMA carriers (Feng et al., 2017;Larson et al., 2015). Although at the moment of writing this manuscript there were not freely available tools to assess the mutational status of SMA from primary massive sequencing data, a tool was reported in a recent study published when this manuscript was under review (X. Chen et al., 2020).
Here, we present a python tool, SMAca that can detect SMA carriers and concomitantly estimate the absolute SMN1 copy-number from NGS data. Moreover, SMAca can exploit the knowledge variants specific to SMN1 duplication to identify the silent carriers, following the recommendations for SMA carrier testing by the American College of Medical Genetics and Genomics (Prior et al., 2011).
Additionally, a set of 1109 alignment CRAM files, comprising a diverse set of individuals from multiple populations, were downloaded from the 1000 genomes data portal (https://www. internationalgenome.org/data-portal/sample; 1000 Genomes Project Consortium, 2015; Clarke et al., 2017). The corresponding SMN1 and SMN2 validated copy number statuses were obtained from Vijzelaar et al. (2019). The whole set of CRAM files were analyzed with SMAca in a single batch. Samples with one experimentally validated SMN1 copy predicted to be SMA carriers were marked as true positives.

| SMN1 copy-number estimation
The availability of a batch of samples allows a more accurate estimation of the SMN1 copy number. SMAca first calculates the raw proportion of SMN1 reads over the total number of reads covering SMN1 and SMN2 at three specific gene positions (denoted as a, b, and c) for each sample (Table 1).
These positions correspond to single nucleotide differences between SMN1 and SMN2. Raw values are then scaled with respect to 20 control genes (Table 2) previously described to have consistent average coverage relative to SMN1 and SNM2 (Larson et al., 2015).
Additionally, two genetic variants that have been associated with duplication events in SMN1 are also screened and reported (Luo et al., 2014).
In particular, the relative coverage of SMN1 and SMN2 with respect to each control gene is calculated: and c i2 are the average coverage for the whole genes SMN1 and SMN2, and H ki is the average coverage for the control gene k in the ith sample. Then, the scale factor θ , N is the total number of samples and K the total number of control genes, is calculated for each sample. Finally, the raw proportion of SMN1 reads are scaled: 1 and D ij 2 are the raw coverage for SMN1 and SMN2 at position j in the ith sample.

| SMA carrier categorization
Results were classified following some simple rules. Samples with a scaled coverage proportion of SMN1 (π ij ) less than ⅓ in positions a, b, or c were marked as likely carriers. The scale factor θ i (i.e., proportional to the total SMN1 and SMN2 copy number) and the raw proportion of To detect silent carriers, we select samples with two polymorphisms (g.27134T>G and g.27706_27707delAT) associated with duplication events in SMN1 (Luo et al., 2014). Depending on the number of SMN2 copies, the expected θ i should be close to 0.75 (2:1) or 0.5 (2:0) and, in both cases, the scaled coverage proportion of SMN1 should be close to ½ in each position.
Therefore, SMA carriers can be detected either in genome or exome sequences, and even small panels as long as SMN locus and control genes are covered. However, for silent carriers, which require the analysis of some intronic positions, the use of genomic sequences is recommended.

| Experimental validation
To test the reliability of SMAca predictions in a real scenario, we leveraged our participation in the Navarra 1000 Genomes Project NAGEN1000 to screen a dataset of 326 genomes. Among them, seven samples (2.15%) were identified as putative SMA carriers and successfully validated by MLPA (Table S1a-d). Interestingly, the percentage of the predicted and further confirmed SMA carriers in our dataset fits perfectly to the expected carrier frequency (2.10%) previously described in the bibliography (Su et al., 2011). Moreover, the genotype estimation for the SMA carrier samples agreed with the experimental validation as shown in Table 3 (except for case no. 7 where the genotype could not be estimated). Interestingly, case no. 1 corresponds to an SMA carrier with an extra copy of SMN exons 1-6 (SMN1/2Δ7-8).
Additionally, we have performed a large-scale validation of 1109 genomes from a diverse set of individuals from multiple populations.
Results show an overall high accuracy (0.998) and an F1 score (Jackson et al., 1989) of 0.938 (see Table S1 with the validation test, genes   ACAD9  FASTKD2  ITGA6  NTRK1  SIL1   ATR  FOXN1  IVD  PTEN  SLC22A5   CYP11B1  HEXB  LMNA  RAB3GAP1  SLC35D1   EDNRB  IQCB1  LRPPRC  RAPSN  STIM1 Note: List of genes used to calculate the scale factor (θ i ) and the scaled proportion of SMN1 reads (π ij ). Note: PI_x: scaled proportion of SMN1 reads in position x; cov xp: raw coverage of gene x at position p; scale factor: θ i ; CN estimation: absolute copy number estimation SMN1:SMN2; MLPA genotype: genotype inferred from MLPA analysis.

T A B L E 3 MLPA results
Abbreviation: MLPA, multiplex ligation-dependent probe amplification.
*MLPA analysis showed deletion of exons 7-8 on both genes but three copies of exons 1-6 (impossible to distinguish whether they come from SMN1 or SMN2). Table S2 with the full SMAca output, and Table S3 with the list of validated samples).

| Performance
With the idea of facilitating the introduction of SMAca in production NGS pipelines, it has been optimized for running in different computer environments. Special stress has been made in the parallelization for exploiting multiple cores/processors when available. Figure 1 shows the runtimes with an increasing number of processors. The estimation of SMA mutational and copy number status for 326 genomes from Navarra 1000 Genomes Project NAGEN1000 takes almost 1 h and a half in one core but can be reduced to only 3 min in 24 cores (see Figure 1). When a similar number of samples is analyzed with the recently published tool (X. Chen et al., 2020) in the same conditions, the runtime exceeds 1.5 days.

| CONCLUSIONS
Here, we present SMAca, the first freely available python tool to detect SMA carriers and estimate the absolute SMN1 copy-number from NGS data. As a conceptual novelty, SMAca includes the analysis of two polymorphisms that have been linked to silent carriers (Luo et al., 2014) and are recommended for SMA carrier testing by the American College of Medical Genetics and Genomics (Prior et al., 2011). This tool was developed with a focus on execution speed and easy installation. Thus, SMAca is available through Bioconda (Grüning et al., 2018) to facilitate distribution and foster reproducibility (Baker, 2016). Also, SMAca is robust against technical biases, including read length, sequencing platform, capture method, or the aligner used. Moreover, given the way in which SMAca carries out the data normalization, using the depth of coverage of SMN1 + SMN2, the results must be consistent across different populations given that the total SMN copy number tends to be constant (Vijzelaar et al., 2019). This combination makes of SMAca an especially attractive tool to be integrated into production NGS pipelines.

ACKNOWLEDGMENTS
This study is supported by grants SAF2017-88908-R from the Spanish Ministry of Economy and Competitiveness and "Plataforma de RecursosBiomoleculares y Bioinformáticos" PT17/0009/0006 from the ISCIII, both cofunded with European Regional Development

CONFLICT OF INTERESTS
The authors declare that there are no conflict of interests.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from Navarra 1000 Genomes project (NAGEN1000). Restrictions apply to the availability of these data, which were used under license for this study. Another dataset of validation composed of 1109 alignment CRAM files is available at https://www.internationalgenome.org/ data-portal/sample. SMAca source code and documentation are available at https://www.github.com/babelomics/SMAca.