Spectrum of pathogenic variants and founder effects in amelogenesis imperfecta associated with MMP20

Amelogenesis imperfecta (AI) describes a heterogeneous group of developmental enamel defects that typically have Mendelian inheritance. Exome sequencing of 10 families with recessive hypomaturation AI revealed four novel variants and one known variant in the matrix metallopeptidase 20 (MMP20) gene that were predicted to be pathogenic. MMP20 encodes a protease that cleaves the developing extracellular enamel matrix and is necessary for normal enamel crystal growth during amelogenesis. New homozygous missense changes were shared between four families of Pakistani heritage (c.625G>C; p.(Glu209Gln)) and two of Omani origin (c.710C>A; p.(Ser237Tyr)). In two families of UK origin and one from Costa Rica, affected individuals were homozygous for the previously reported c.954-2A>T; p.(Ile319Phefs*19) variant. For each of these variants, microsatellite haplotypes appeared to exclude a recent founder effect, but elements of the haplotype were conserved, suggesting more distant founding ancestors. New compound heterozygous changes were identified in one family of European heritage: c.809_811+12delinsCCAG; p.(?) and c.1122A>C; p.(Gln374His). This report further elucidates the mutation spectrum of MMP20 and the probable impact on protein function, confirms a consistent hypomaturation phenotype and shows that mutations in MMP20 are a common cause of autosomal recessive AI in some communities.


KEYWORDS
amelogenesis imperfecta, enamel, founder effect, hypomaturation AI, MMP20

| INTRODUCTION
The hardness of dental enamel is remarkable and is a result of its high mineral and low protein content (C. E. Smith, 1998). MMP20 is a zinc-dependent endopeptidase that is secreted in trace amounts by ameloblasts during the secretory and transition stages of amelogenesis (Llano et al., 1997; Seymen et al., 2015). Once activated, it selectively cleaves the secreted enamel proteins amelogenin, enamelin, and ameloblastin into several products with distinct functional roles (Simmer & Hu, 2002). MMP20 is also thought to facilitate ameloblast movement during secretion, through cell-cell communication, and may influence ameloblast gene expression (Guan & Bartlett, 2013). These activities create a newly voided space within which enamel crystallites are able to grow in width and thickness, and finally, to interlock (Bartlett, 2013). If they are not removed, the enamel proteins occupy the enamel volume and restrict the growth of the enamel crystallites. This produces immature enamel that fails prematurely because it cannot resist the mechanical stress of biting and chewing.
Mutations in MMP20 and 19 other genes have been shown to cause nonsyndromic amelogenesis imperfecta (AI) with autosomal dominant, autosomal recessive, or X-linked inheritance (J. W. Kim et al., 2019; C. E. L. Smith et al., 2017 and http://dna2.leeds.ac.uk/LOVD/genes), while perhaps as many again have been implicated in syndromic AI (Dubail et al., 2018; Wright et al., 2015). Some of these genes encode proteins processed by MMP20, such as the enamel matrix proteins AMELX (MIM #300391), AMBN (MIM #610259), and ENAM (MIM #606585). Other genes implicated in AI include the second enamel matrix protease, KLK4 (MIM #603767), and cell adhesion proteins, such as LAMA3 (MIM #600805), LAMB3 (MIM #150310), and COL17A1 (MIM #113811). AI, therefore, describes a heterogeneous group of conditions characterized by inherited enamel defects in both dentitions. It can present as a hypoplastic phenotype, where a deficit in secretion results in the formation of a reduced volume of mineralized enamel, or as hypomineralized AI, whereby a failure in maturation results in enamel of full thickness that is soft or brittle and fails prematurely (Gadhia et al., 2012). Hypomineralized AI can be further subclassified as hypomaturation, caused by incomplete removal of protein from the developing enamel, or hypocalcification, caused by insufficient transport of calcium ions into the forming enamel (Smith et al., 2017), though these phenotypes often overlap. Mutations in MMP20 cause autosomal recessive hypomaturation AI. Fourteen pathogenic variants have been reported to date (Table S1), most of which are located in the catalytic domain of MMP20 and are thought to affect the stability and functionality of the protein structure.
This study describes the identification of pathogenic MMP20 variants in 10 families segregating autosomal recessive hypomaturation AI, providing additional insights into the spectrum of MMP20 mutations and associated phenotype. Four novel variants are reported, two of which are relatively common founder mutations in specific populations. Additionally, we perform molecular dynamics (MD) simulations of variants in the catalytic domain of MMP20 to examine the predicted effect that these changes have on the protein. This study increases the total reported pathogenic MMP20 variants to 17, suggests that defects in MMP20 are a more common cause of AI than was previously reported and enriches our understanding of the effects that mutations in the catalytic domain of MMP20 can have on its functionality.

| Patients
Individuals from each of the 10 families were recruited following informed consent in accordance with the principles outlined in the Declaration of Helsinki, with local ethical approval (REC 13/YH/0028).
These 10 families are part of a larger AI cohort presenting with a variety of AI phenotypes. A diagnosis of AI was made by a dentist after clinical examination, based on the physical appearance of the dentition, and was confirmed by dental X-ray. Genomic DNA was obtained via venous blood samples, using conventional extraction techniques, or from saliva using Oragene® DNA Sample Collection kits (DNA Genotek) and extracted according to the manufacturer's instructions.

| Whole-exome sequencing (WES) and analysis
Three micrograms of genomic DNA from a single individual from each family (marked with an arrow on the pedigrees shown in Figure 1) were subjected to WES and analyzed as described previously (C. E. L. Smith et al., 2019). Variants present in the dbSNP150 database of NCBI or the Genome Aggregation Database (gnomAD; http://gnomad.broadinstitute.org) with a minor allele frequency ≥1% were excluded. The potential pathogenic effect of the variants was predicted using Combined Annotation-Dependent Depletion (CADD v1.3; https://cadd.gs.washington.edu) (Rentzsch et al., 2019), the Sorting Intolerant From Tolerant algorithm (SIFT; Sim et al., 2012), Protein Variation Effect Analyzer (PROVEAN; Choi et al., 2012), and MutPred2 (http://mutpred.mutdb.org/index.html). The potential effect on splicing for each intronic variant was predicted by the Human Splicing Finder v3.1 (http://umd.be/HSF3/HSF.shtml).
FIGURE 1 Pedigrees of the 10 families recruited for this study. The genotypes determined from the microsatellite markers are presented beneath each individual examined. Families 1, 2, and 3 carry the c.954-2A>T variant; Families 4, 5, 6, and 7 carry the c.625G>C variant; Families 8 and 9 carry the c.710C>A variant; and Family 10 has the c.809_811+12delinsCCAG and c.1122A>C variants. The common haplotypes are presented in the same color and the haplotypes without a pathogenic variant are colored gray. The markers are presented in the order: 11cen, D11S940, D11S1339, MMP20, D11S4108, D11S4159, D11S4161, 11qter. The recruited family members are marked with an asterisk and the proband in each family is indicated with a black arrow.

| Polymerase chain reaction (PCR) and Sanger sequencing
Mutations were confirmed and segregation was performed for all available family members, marked with (*) on each pedigree of Figure 1. Primer sequences can be found in Table S2. Sanger sequencing was performed using the BigDye Terminator v3.1 kit (Life Technologies) according to the manufacturer's instructions and resolved on an ABI3130xl sequencer (Life Technologies). Results were analyzed using SeqScape v2.5 (Life Technologies).

| Microsatellite analysis
Primer sequences for markers were obtained from the UCSC genome browser and standard HEX-tagged primers (Sigma-Aldrich; Table S2) were used to assess the flanking haplotypes of MMP20.
The analysis was performed as described previously (Nikolopoulos et al., 2020).

| Protein structure analysis
The tertiary structure of MMP20 has been determined in atomistic detail by nuclear magnetic resonance (NMR) (PDB: 2JSD) (Arendt et al., 2007), which provides the starting structure for MD simulations. Amino acid substitutions were made in the wild type (WT) structure using the Chimera visualization tool (Pettersen et al., 2004). MD simulations were performed using AMBER18 (Case et al., 2018). Protocols used to perform the MD simulations are in the Supporting Information Methods. Rhapsody was used for in silico saturation mutagenesis analysis (Ponzoni et al., 2020) and the prediction of the changes in solvent accessibility, residue occlusion, and free energy (ΔΔG) of the mutated proteins was performed with Site-Directed Mutator (SDM) (Pandurangan et al., 2017).

| RESULTS
Ten unrelated families presenting with features consistent with autosomal recessive hypomaturation AI, in the absence of any cosegregating disease, were recruited for the study (Figures 1 and 2).
Genomic DNA from affected members of each family was subjected to exome sequencing. Detailed coverage statistics can be found in Table S3. Samples were sequenced in different batches over a period of some years, with considerable variation in coverage between batches. Variant files were filtered to select rare variants with high predicted pathogenicity, then variants in known AI-causing genes were highlighted. This revealed biallelic mutations in MMP20 (Table S4) in each family. The position of these variants in the gene is shown in Figure 3, along with representative electropherograms for each novel variant. PCR and Sanger sequencing confirmed that the variants segregated with AI in all available family members.
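The filtering step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Variant` record, the `filter_variants` function, the CADD cutoff of 15, and the placeholder genes and scores in the usage lines are all assumptions made for illustration. Only the 1% minor allele frequency threshold, the gene symbols (taken from the Introduction), and the gnomAD frequency of c.625G>C come from the text.

```python
from dataclasses import dataclass

# Gene symbols named in the Introduction; the full list of AI genes is longer.
KNOWN_AI_GENES = {"MMP20", "AMELX", "AMBN", "ENAM", "KLK4", "LAMA3", "LAMB3", "COL17A1"}

MAF_CUTOFF = 0.01    # variants with MAF >= 1% are excluded (from the Methods)
CADD_CUTOFF = 15.0   # ASSUMPTION: illustrative threshold, not stated in the paper


@dataclass
class Variant:
    gene: str
    hgvs_c: str
    max_maf: float     # highest population frequency in dbSNP/gnomAD (0.0 if absent)
    cadd_phred: float  # scaled CADD score


def filter_variants(variants):
    """Keep rare, predicted-damaging variants and list known AI genes first."""
    kept = [
        v for v in variants
        if v.max_maf < MAF_CUTOFF and v.cadd_phred >= CADD_CUTOFF
    ]
    # False sorts before True, so variants in known AI genes come first.
    return sorted(kept, key=lambda v: v.gene not in KNOWN_AI_GENES)


# Usage with hypothetical scores (GENE_X and GENE_Y are placeholders).
candidates = filter_variants([
    Variant("GENE_X", "c.1A>G", 0.20, 30.0),          # too common, dropped
    Variant("GENE_Y", "c.5C>T", 0.0005, 22.0),        # rare and damaging, kept
    Variant("MMP20", "c.625G>C", 0.000457, 25.0),     # rare, damaging, known AI gene
])
```

Sorting rather than discarding variants outside known AI genes keeps them visible in case no candidate is found in an established gene.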
In Families 1, 2, and 3, a homozygous frameshift variant (NM_004771: c.954-2A>T, NP_004762: p.(Ile319Phefs*19)) was identified in intron 6. This has been published previously as a cause of autosomal recessive hypomaturation AI and is expected to lead to retention of the sixth intron (J. W. Kim et al., 2005). To confirm this hypothesis, we attempted to perform reverse-transcriptase PCR of the MMP20 transcript on control blood complementary DNA (cDNA).
However, no amplification was achieved, suggesting that the level of MMP20 expression in blood is below the threshold for detection.
A novel homozygous missense mutation, c.625G>C, p.(Glu209Gln), was identified in exon 4 as the cause of disease in Families 4, 5, 6, and 7. This is the known variant rs199788797, which has not previously been associated with a disease phenotype. In the gnomAD database, this variant has a frequency of 0.000457 in the South Asian population but is absent from all other reported populations. E209 is fully conserved in the mammalian clade, as shown in Figure S1, using the sequences listed in Table S5. Additionally, the mutation is predicted to be damaging (Table S6).
In Families 8 and 9, a novel homozygous missense mutation, c.710C>A, p.(Ser237Tyr), was identified in exon 5. This variant is absent from databases, is evolutionarily conserved in all the mammalian species analyzed (Figure S1) and is predicted to be damaging by all algorithms used (Table S6).
The affected individual in Family 10 was found to be a compound heterozygote for a novel missense mutation c.1122A>C, p.(Gln374His) in exon 8 and a novel deletion-insertion (delins) variant: c.809_811+12delinsCCAG, p.(?), spanning the splice donor site of intron 5. Both are absent from variant databases. Variant p.(Gln374His) is predicted to be damaging (Table S6), and Q374 is conserved in the mammalian clade (Figure S1). Interestingly, mouse
Family 2 is homozygous for a haplotype also seen in Family 3, but the second haplotype in Family 3 has proximal recombination. Family 1 could share the recombinant haplotype observed in Family 3 together with an unrelated haplotype, but without phase information this cannot be confirmed. Interestingly, all three families are homozygous for D11S4108 100 kb from MMP20. Families 4-7, of UK-Pakistani origin, carry the c.625G>C variant. The proband in Family 6 is homozygous for the same haplotype segregating in Family 7. However, a second haplotype with distal recombination is seen in the second affected sibling in Family 6, suggesting the affected (unsampled) father carries both haplotypes. Family 5 is homozygous for a third haplotype, again identical at the proximal end to that in Family 7 but recombinant at the distal end. In contrast, Family 4 is homozygous for a fourth haplotype identical to the Family 7 haplotype at the distal end but proximally recombinant. Again, all four families are homozygous for the immediately adjacent marker D11S4108.
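The haplotype comparison described above amounts to asking how far, on either side of MMP20, two disease-bearing haplotypes remain identical before a recombination breaks the match. The sketch below shows one way this could be expressed in Python. The marker order is taken from the Figure 1 legend, but the function `shared_segment` and all allele sizes are hypothetical and are not the genotypes observed in these families.

```python
# Marker order from the Figure 1 legend (centromere to telomere).
MARKERS = ["D11S940", "D11S1339", "MMP20", "D11S4108", "D11S4159", "D11S4161"]


def shared_segment(hap_a, hap_b, anchor="MMP20"):
    """Return the contiguous markers around `anchor` at which two haplotypes agree.

    Each haplotype is a list of alleles in MARKERS order. An empty list means
    the haplotypes already differ at the anchor itself.
    """
    i = MARKERS.index(anchor)
    if hap_a[i] != hap_b[i]:
        return []
    lo = hi = i
    # Extend toward the centromere (proximal) while alleles match.
    while lo > 0 and hap_a[lo - 1] == hap_b[lo - 1]:
        lo -= 1
    # Extend toward the telomere (distal) while alleles match.
    while hi < len(MARKERS) - 1 and hap_a[hi + 1] == hap_b[hi + 1]:
        hi += 1
    return MARKERS[lo:hi + 1]


# HYPOTHETICAL allele sizes (bp), for illustration only.
reference = [251, 148, "c.625G>C", 190, 172, 205]
distal_recombinant = [251, 148, "c.625G>C", 190, 176, 209]
proximal_recombinant = [247, 152, "c.625G>C", 190, 172, 205]

print(shared_segment(reference, distal_recombinant))
print(shared_segment(reference, proximal_recombinant))
```

In this toy example both recombinant haplotypes still share D11S4108 with the reference, which mirrors the pattern of a conserved marker immediately adjacent to the gene.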
The catalytic center of MMP20 is a 160-residue domain containing the zinc-dependent peptidase active site. It contains one catalytic and one structural zinc ion; the structures modeled here additionally contain two calcium ions, which are also structural. We  Table S7, and the repeats for the entire 900-ns simulation are shown separately in Figure S3. The calculated RMSD values are continuously adjusted and recalculated during the 900 ns of the simulation. As such, the final 300 ns, during which the plots have stabilized, are selected as the representative RMSD values for each simulation, and so the average values presented in Table S7 are used for the comparison among the variants. An increase in RMSD relative to the WT, implying decreased protein stability, was observed for all variants, apart from p.(Thr130Ile). We also analyzed the changes in three key interatomic interactions between the WT and the variants, atomic fluctuations (Figure S4a), hydrogen bonding interactions (Table S8), and salt bridges (Table S9), to provide insight into why these particular variants cause functionally deleterious changes in protein structure. The most significant structural distortions were observed in the simulations performed in the absence of structural zinc and calcium ions (see Figure S4b,c).
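The averaging over the final 300 ns described above can be illustrated with a short, dependency-free Python sketch. It is not the analysis code used with AMBER18: the function names are hypothetical, and the `rmsd` helper assumes that each frame has already been superimposed on the reference structure.

```python
import math


def rmsd(coords, ref):
    """Root-mean-square deviation between two equally sized lists of (x, y, z).

    ASSUMPTION: the frame has already been aligned to the reference.
    """
    if len(coords) != len(ref):
        raise ValueError("coordinate sets differ in length")
    squared = sum(
        (a - b) ** 2
        for point, ref_point in zip(coords, ref)
        for a, b in zip(point, ref_point)
    )
    return math.sqrt(squared / len(ref))


def mean_rmsd_final_window(times_ns, rmsd_values, total_ns=900.0, window_ns=300.0):
    """Average RMSD over frames whose time lies in the last `window_ns` of the run."""
    start = total_ns - window_ns
    window = [r for t, r in zip(times_ns, rmsd_values) if t >= start]
    if not window:
        raise ValueError("no frames fall inside the final window")
    return sum(window) / len(window)


# Toy trajectory: RMSD sampled at five time points (values are made up).
times = [0.0, 300.0, 600.0, 750.0, 900.0]
values = [0.0, 1.0, 2.0, 3.0, 4.0]
print(mean_rmsd_final_window(times, values))  # averages the frames at 600, 750, 900 ns
```

Restricting the average to the stabilized part of the trajectory avoids the early equilibration frames dominating the comparison between WT and variant structures.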
In silico saturation mutagenesis of the MMP20 active site performed by Rhapsody shows that there are regions of the protein that are significantly more likely to cause a pathogenic effect when the residues located there are altered (Figure S5). These regions largely correlate with the sites of the known and novel variants and have an increased PolyPhen-2 score (Figure S5). The Rhapsody analysis was limited to the catalytic domain of MMP20 because it relies on the availability of a tertiary structure. The results of the SDM analysis are presented in Table S10, showing the changes of free energy, residue occlusion, and solvent accessibility for the WT and each mutant, respectively.

| DISCUSSION
Exome sequencing in a cohort of nonsyndromic AI families revealed biallelic MMP20 variants in 10 autosomal recessive families. MMP20 is a protease that plays an essential role during the secretory and early transition stages of amelogenesis. Defects in MMP20 cause AI in both humans and mice, due to a failure to process, degrade, and remove proteins from the extracellular matrix scaffold upon which the developing dental enamel is formed. The hypomaturation phenotype observed in these families fits with this hypothesis, with the dental enamel being of normal volume but characterized by a loss of translucency, discoloration, and premature enamel loss. Posteruptive changes probably determine the pattern of premature enamel failure. The phenotypes observed in the families presented are consistent with those described in previous reports of AI due to MMP20 variants.
The variant identified in Families 1-3 (c.954-2A>T) has been reported previously (Gasse et al., 2017; J. W. Kim et al., 2005; Prasad et al., 2015; Wright et al., 2011), but it has not been possible to confirm the effect on splicing experimentally since MMP20 is not expressed in blood. The prediction that the mutation most likely results in the loss of exon 7 from the MMP20 transcript (J. W. Kim et al., 2005), therefore, remains speculative. If confirmed, this would break the reading frame and lead to an abnormal transcript, which would be expected to be subject to nonsense-mediated decay.
MMP20 cDNA cannot be obtained from blood because expression is undetectable; however, it could possibly be obtained from tissues such as tonsil or appendix, in which expression has been detected and which are routinely removed during surgery. Altered splicing in these tissues would, however, only suggest that splicing is also altered in ameloblasts. CRISPR-Cas9 editing of this gene in ameloblast cell lines might instead be a better model to determine whether splicing is affected.
The missense variant identified in Families 4-7, p.(Glu209Gln), is novel and replaces a negatively charged glutamic acid residue with neutral glutamine. E209 is a coordinating ligand for calcium ion binding and is, therefore, critical to the function of MMP20 (Andreini et al., 2004; Arendt et al., 2007; Yamakoshi et al., 2013). Replacement with glutamine decreases protein stability relative to WT in the MD simulations (see Table S7). Inspection of the WT atomistic structure shows that E209 has a pair of carboxylate oxygen atoms oriented toward the nearby Ca2+ ion. For E209, the average distance is <4 Å in both MD repeats (Table S9). However, the mutant Q209 has only one oxygen; in one of the two duplicate MD trajectories, the interaction between this amide oxygen and Ca2+ remains strong (an interatomic distance of 2.5 Å) but in the other it is completely disrupted (interatomic distance of 6.9 Å), allowing it to form non-native hydrogen bonds with other residues (see Table S8).
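The distance comparison above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the coordinates are invented so that the mean distances match the 2.5 Å and 6.9 Å values quoted in the text, and treating 4 Å as a contact cutoff is an assumption based on the "<4 Å" figure reported for the WT.

```python
import math

# ASSUMPTION: 4 Å is used here as an illustrative contact cutoff.
CONTACT_CUTOFF = 4.0


def mean_distance(traj_a, traj_b):
    """Mean distance (Å) between two atoms across matching trajectory frames."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(a, b) for a, b in zip(traj_a, traj_b)) / len(traj_a)


def in_contact(traj_a, traj_b, cutoff=CONTACT_CUTOFF):
    """True when the mean interatomic distance is below the cutoff."""
    return mean_distance(traj_a, traj_b) < cutoff


# Invented coordinates: the Ca2+ ion sits at the origin in every frame.
calcium = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
oxygen_repeat_1 = [(2.5, 0.0, 0.0), (0.0, 2.5, 0.0)]   # stays close to the ion
oxygen_repeat_2 = [(6.9, 0.0, 0.0), (0.0, 0.0, 6.9)]   # has moved away

print(in_contact(oxygen_repeat_1, calcium))
print(in_contact(oxygen_repeat_2, calcium))
```

Averaging over frames rather than inspecting a single snapshot gives a more robust indication of whether the interaction persists through the trajectory.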
The novel missense variant identified in Families 8 and 9 replaces a small serine residue with a larger aromatic tyrosine, p.(Ser237Tyr) in a highly conserved part of the catalytic domain of MMP20. Correspondingly, the SDM analysis identifies an increase in the folding free energy of the protein relative to the WT (Table S10), implying that the variant is less stable. Moreover, in the MD simulations, the larger and more hydrophobic Y237 results in an increased RMSD (2.35 Å) relative to the WT (1.55 Å). In these trajectories, the main backbone hydrogen bond formed by S237 (with M244) is also formed by Y237.
However, the number of side-chain hydrogen bonds is reduced in Y237 compared with S237 (see Table S8) because the corresponding tyrosine oxygen protrudes too far into the solvent to participate in interatomic interactions within the protein.
In Family 10, the delins variant disrupts the intron 5 splice donor site and is therefore again likely to lead to a transcript that is subject to nonsense-mediated decay. The novel c.
Microsatellites were used to determine whether the variants shared in each of the three family groups were founder alleles, implying these families are related rather than that these sites are mutation hotspots. Family 1 is from Costa Rica, the population of which is largely of mixed European and indigenous American descent. Families 2 and 3 are Caucasian European families from the United Kingdom. The c.954-2A>T variant has been reported previously in at least six families from France (Gasse et al., 2017; Prasad et al., 2015) and one from the United States (J. W. Kim et al., 2005). Ethnicity is given as Caucasian by Prasad and co-workers but is not given in the remaining reports. Our haplotype analysis shows that this variant is present on three different chromosomal backgrounds, but elements of the haplotype, in particular the genotype of D11S4108 100 kb from MMP20, appear conserved between families. This suggests they may be related but through a distant common ancestor. The picture is similar for Families 4, 5, 6, and 7, UK families of Pakistani origin. In contrast, Families 8 and 9 from Oman share an identical haplotype across the region tested. These findings, therefore, suggest that MMP20
The only pathogenic missense variant modeled that did not show a significantly larger RMSD than the WT was p.(Thr130Ile). While minor changes in hydrogen bonding interactions were detected (Table S8), no substantial changes in overall protein structure or perturbations to interatomic interactions with the ligand or metal ions were observed. Consequently, our MD simulations do not provide a structural mechanism for why this variant should be pathogenic.
However, p.(Thr130Ile) is located at the edge of the catalytic domain, close to regions of the protein that are absent from the available structure and which, therefore, do not feature in these simulations.
The 14 MMP20 variants previously reported to cause AI are shown in Figure 3, together with those newly described herein, indicating their location in the gene and protein. Previously reported pathogenic variants consisted of seven missense variants, two premature stop codons, two frameshifts, and two putative splice variants. Here, we report a further three missense variants and one splice variant, consistent with and extending the spectrum of mutations in MMP20 causing autosomal recessive AI. The three splice variants have not been verified beyond in silico prediction and therefore remain likely but unproven. These 17 variants, and in particular, the missense variants, cluster primarily in or near the zinc-dependent peptidase domain (Figure 3). Our analyses suggest that pathogenic mutations in this domain alter protein stability, while others (Y. J. Kim et al., 2017) have shown that variants in the catalytic domain can lead to a reduction or complete loss of enzymatic function. The combination of recessive inheritance, missense variants that reduce or abolish function, and the lack of significant difference between phenotypes associated with missense and nonsense variants all point to a loss-of-function mechanism, where lack of functional MMP20 gives rise to a consistent hypomaturation AI phenotype.
In summary, we have identified 10 families segregating AI due to four novel and one known mutations in MMP20, and reviewed previously reported mutations, raising the total number of AI-causing MMP20 variants to 18. This expands the spectrum of

CONFLICT OF INTERESTS
The authors declare that there are no conflicts of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in ClinVar at https://www.ncbi.nlm.nih.gov/clinvar/, accession numbers: SCV001338799-SCV001338802 and in the AI Leiden Open Variation Database (LOVD) at http://dna2.leeds.ac.uk/LOVD/ with reference numbers: 0000000313-0000000317.