MiSynPat: An integrated knowledge base linking clinical, genetic, and structural data for disease‐causing mutations in human mitochondrial aminoacyl‐tRNA synthetases

Abstract
Numerous mutations in each of the mitochondrial aminoacyl‐tRNA synthetases (aaRSs) have been implicated in human diseases. The mutations are autosomal and recessive and lead mainly to neurological disorders, although with pleiotropic effects. The processes and interactions that drive the etiology of the disorders associated with mitochondrial aaRSs (mt‐aaRSs) are far from understood. The complexity of the clinical, genetic, and structural data requires concerted, interdisciplinary efforts to understand the molecular biology of these disorders. Toward this goal, we designed MiSynPat, a comprehensive knowledge base together with an ergonomic Web server designed to organize and access all pertinent information (sequences, multiple sequence alignments, structures, disease descriptions, mutation characteristics, original literature) on the disease‐linked human mt‐aaRSs. With MiSynPat, a user can also evaluate the impact of a possible mutation on sequence‐conservation‐structure in order to foster the links between basic and clinical researchers and to facilitate future diagnosis. The proposed integrated view, coupled with research on disease‐related mt‐aaRSs, will help to reveal new functions for these enzymes and to open new vistas in the molecular biology of the cell. The purpose of MiSynPat, freely available at http://misynpat.org, is to constitute a reference and a converging resource for scientists and clinicians.


INTRODUCTION
Aminoacyl-tRNA synthetases (aaRSs) contribute critically to protein biosynthesis by catalyzing the specific ligation of amino acids onto their cognate tRNA(s). In human mitochondria, the translation machinery is devoted to the synthesis of 13 proteins, all subunits of the respiratory chain complexes. Mitochondrial (mt) aaRSs are thus implicated in cellular energy (ATP) production. Except for GlyRS and LysRS that are encoded by single genes, human mt-aaRSs are encoded in the nucleus by a set of genes distinct from those coding for the cytosolic aaRSs (Bonnefond et al., 2005). Mt-aaRSs are all synthesized in the cytosol, addressed to and imported into the mitochondria, thanks to the presence of an N-terminal pre-sequence (MTS, for "mitochondrial targeting sequence"), which is presumably cleaved upon entry to mitochondria (Carapito et al., 2017).
The first correlation between a mutation affecting a mt-aaRS and a human disease dates back to 2007, when mutations within the DARS2 gene, coding for mt-AspRS, were associated with a leukoencephalopathy (LBSL) (Scheper et al., 2007). This first description attracted the attention of the medical community, and many other disease-causing mutations have been discovered since. Today, all 19 mt-aaRS-encoding genes have been reported to be affected (reviewed in e.g. Diodato, Ghezzi, & Tiranti, 2014; Konovalova & Tyynismaa, 2013; Oprescu, Griffin, Beg, & Antonellis, 2017; Schwenzer, Zoll, Florentz, & Sissler, 2014; Suzuki, Nagao, & Suzuki, 2011). Except for GlyRS and LysRS, which present dominant mutations, all mutations lead to autosomal recessive disorders, with patients being either homozygotes or compound heterozygotes. Despite being ubiquitously expressed and probably having a common role in a single cellular process, that is, mt translation, mt-aaRSs are impacted in various ways. Their mutations cause pleiotropic effects with an unexpected variety of phenotypic expressions, including mainly neurological disorders but also non-neurological symptoms.
Human Mutation. 2017;38:1316-1324. wileyonlinelibrary.com/journal/humu. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Besides the fact that new mutations are continuously discovered, neither the cause of the selective vulnerability nor the molecular mechanisms leading to the diseases are well understood.
It is therefore timely to extract and collect existing and emerging data related to mutations affecting human mt-aaRSs to allow the community of clinicians and fundamental researchers to analyze them comprehensively and systematically within a dedicated and continuously updated computing infrastructure. To achieve this, we devel-

Updating system
The quality and lifetime of a knowledge-based Web server strongly depend on the balance between maximum automation for its update and minimum human dependency. To incorporate new bibliographic information into the database, the chosen strategy is based on a double sieve mechanism. Firstly, publications concerning human mt-aaRS diseases are automatically retrieved from PubMed through the NCBI Web service using a specialized filter. Secondly, the articles selected by the first sieve are presented through a Web form to an expert who validates or rejects them and enters the relevant information regarding mutations/diseases into the database. Missense, nonsense, insertion, deletion, and splicing defects present in the mt-aaRS exons or introns are all registered and those impacting the protein sequence are further visualized. The bibliography search and validation process is performed daily and has been tested over a period of 18 months. After training and adjustment of the double sieve system, two-thirds of the newly presented publications were relevant.
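The double sieve described above can be illustrated with a minimal Python sketch. It is not the MiSynPat implementation: the query term, the class name ExpertQueue, and the PMIDs in the usage note are hypothetical. Only the NCBI E-utilities esearch endpoint and its db, term, reldate, datetype, and retmode parameters are real. The first function builds the automated PubMed query (first sieve); the class records the expert's accept/reject decisions (second sieve) and reports the fraction of presented publications that were judged relevant.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode

# Real NCBI E-utilities search endpoint. Any query term passed in is an
# assumption, not the specialized filter used by MiSynPat.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_first_sieve_url(term: str, days_back: int = 1) -> str:
    """First sieve: PubMed query limited to records entered in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,   # look-back window in days
        "datetype": "edat",     # filter on the Entrez entry date
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


@dataclass
class ExpertQueue:
    """Second sieve: an expert validates or rejects each candidate PMID."""
    pending: list = field(default_factory=list)
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def submit(self, pmids):
        """Queue candidates from the first sieve, skipping PMIDs already seen."""
        seen = set(self.pending) | set(self.accepted) | set(self.rejected)
        for pmid in pmids:
            if pmid not in seen:
                self.pending.append(pmid)
                seen.add(pmid)

    def review(self, pmid, relevant: bool):
        """Record the expert decision for one pending PMID."""
        self.pending.remove(pmid)
        (self.accepted if relevant else self.rejected).append(pmid)

    def relevance_rate(self) -> float:
        """Fraction of reviewed candidates that the expert accepted."""
        reviewed = len(self.accepted) + len(self.rejected)
        return len(self.accepted) / reviewed if reviewed else 0.0
```

For example, after `queue.submit(["111", "222", "333"])` and three reviews of which two are accepted, `queue.relevance_rate()` returns two-thirds, the proportion reported above for the trained system.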

Web interface
The home page at http://misynpat.org is organized into sections. The "header" section has a text search widget and gives access to the home page, to a synthetic description of the scientific background on mt-aaRSs, and to the contact and help pages. The "access" section has five main entry points: All systems, Mutations Overview, Diseases, Mutations Statistics, and Bibliography. The "mutation" section gives direct access to the mutation modeling tool, and the "bibliography" section provides an automatically updated plot of the cumulative number of bibliographic entries reporting cases of human mt-aaRS disorders.

FIGURE 2
Screen capture of the "All Systems" synoptic page. The page provides general overviews of the 19 interactive insets corresponding to the 19 human mt-aaRSs. Each inset contains the ModMutGimmick with the mutations localized by lollipops, the 3D structure thumbnail, the name and length of the human mt-aaRS, the date of the most recent update and the current number of bibliographic entries (bib. entries). The ModMutGimmick is an interactive true-scale graphical representation of the mt-aaRS modular organization with colored boxes (see Supp. Table S3) indicating idiosyncratic and common functional regions and motifs. The reported mutations are schematized as follows: (i) recessive disease-related missense mutations are indicated by green and orange lollipops for homozygous and compound heterozygous mutations, respectively, (ii) dominant mutations are indicated by cyan lollipops, and (iii) nonsense mutations are indicated by black lollipops.

content of the considered mt-aaRS, the length of the human protein sequence and the current number of bibliographic entries per system (Fig. 2). Of note, no inset for mt-GlnRS is provided since there is no mt-GlnRS gene and it has been demonstrated that Gln-tRNA(Gln) is produced via an alternate indirect pathway (Echevarría et al., 2014).
It compiles the number of missense and nonsense mutations, the conservation status of the residues affected by the mutation and the occurrence count of each amino-acid substitution.
(v) The "Bibliography" page shows the list of related literature ordered chronologically and gives access to the publications through the PubMed Website.
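The kind of tally compiled by the statistics page can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name mutation_statistics is hypothetical, variants are given in one-letter notation, and a stop codon is written X as in E425X. Conservation status is left out because it requires the alignments.

```python
from collections import Counter


def mutation_statistics(variants):
    """Count variant types and amino-acid substitutions.

    `variants` holds one-letter codes such as "R179H". A new residue
    written "X" is treated as a nonsense mutation (assumption of this sketch).
    """
    by_type = Counter()
    substitutions = Counter()
    for variant in variants:
        reference, replacement = variant[0], variant[-1]
        if replacement == "X":
            by_type["nonsense"] += 1
        else:
            by_type["missense"] += 1
            substitutions[f"{reference}>{replacement}"] += 1
    return by_type, substitutions
```

Called on the illustrative list `["R179H", "E425X", "R179H"]`, the function counts two missense variants, one nonsense variant, and two occurrences of the R>H substitution.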

Single mt-aaRS window
Disease-related protein sequence variations are indicated within MiSynPat tabs following the HGVS international nomenclature standards (e.g., p.Arg179His, which indicates that the arginine at position 179 is replaced by a histidine) and using the one-letter code (e.g., R179H) in order to be easily recognizable within the multiple sequence alignments. Help is provided in MiSynPat through tooltips when applicable, and a dedicated Help Page is accessible throughout the Website.
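The correspondence between the two notations can be made concrete with a short Python sketch. It is a hypothetical helper, not part of MiSynPat, and it handles only simple substitutions of the form p.Xxx123Yyy. Writing the stop codon (Ter) as X in the one-letter form is an assumption taken from the E425X example used elsewhere in this article.

```python
import re

# Standard three-letter to one-letter amino-acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "X",  # stop codon, written X in this sketch
}

# Matches simple protein-level substitutions only, e.g. p.Arg179His.
_SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def hgvs_to_one_letter(variant: str) -> str:
    """Convert a simple HGVS protein substitution to one-letter notation."""
    match = _SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"not a simple HGVS protein substitution: {variant!r}")
    reference, position, replacement = match.groups()
    try:
        return f"{THREE_TO_ONE[reference]}{position}{THREE_TO_ONE[replacement]}"
    except KeyError as exc:
        raise ValueError(f"unknown residue code in {variant!r}") from exc
```

With this helper, `hgvs_to_one_letter("p.Arg179His")` gives `"R179H"`. Insertions, deletions, and frameshifts would need additional patterns.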

SURVEY OF DATA ANALYSIS
The automated follow-up of disease-related mt-aaRS publications revealed that, just 10 years after the first description of patients with mutations in the DARS2 gene (Scheper et al., 2007),

Database architecture
FIGURE 3
Screen capture of the DARS2 Integrative Analysis tab. From top to bottom, the page provides: UniProt/NCBI and MSeqDR links, the interactive ModMutGimmick with a black line indicating the compound heterozygous status of the patient, who has the R179H mutation associated with the E425X mutation, the linear sequence of the human mt-AspRS, the JSmol 3D structure viewer window (either a crystallographic structure or a 3D model) with its control panel, the 3D Mutation toolbox to select and render the 3D model of a known disease-related mutation or a user-defined mutation. Upon selection of the R179H mutation in the dropdown menu, the conservation status, the relative solvent accessibility, and all known alleles with the bibliographic references are displayed. The mutated structure can be calculated and rendered by clicking on the "Generate the mutated structure" button. Finally, the same information can be obtained for a user-defined mutation in the "Evaluate your own mutation" section.

The database (see Supp. Fig. S1) consists of four main tables called Synthetase, Disease, Mutation, and Bibliography. Two junction tables, ln_biblio_synthetase and ln_disease_synthetase, link each Synthetase data to its corresponding Bibliography and Disease data.
The Synthetase
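The table layout named above can be sketched with SQLite in Python. Only the six table names come from the text. Every column name, the foreign key from Mutation to Synthetase, and the demo rows are assumptions made for illustration, and the actual schema is the one shown in Supp. Fig. S1.

```python
import sqlite3

# Table names follow the text; all columns are hypothetical.
SCHEMA = """
CREATE TABLE Synthetase   (id INTEGER PRIMARY KEY, gene TEXT UNIQUE NOT NULL);
CREATE TABLE Disease      (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Bibliography (id INTEGER PRIMARY KEY, reference TEXT NOT NULL);
CREATE TABLE Mutation (
    id             INTEGER PRIMARY KEY,
    synthetase_id  INTEGER NOT NULL REFERENCES Synthetase(id),
    protein_change TEXT NOT NULL
);
CREATE TABLE ln_biblio_synthetase (
    synthetase_id INTEGER NOT NULL REFERENCES Synthetase(id),
    biblio_id     INTEGER NOT NULL REFERENCES Bibliography(id),
    PRIMARY KEY (synthetase_id, biblio_id)
);
CREATE TABLE ln_disease_synthetase (
    synthetase_id INTEGER NOT NULL REFERENCES Synthetase(id),
    disease_id    INTEGER NOT NULL REFERENCES Disease(id),
    PRIMARY KEY (synthetase_id, disease_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one illustrative DARS2 record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Synthetase (id, gene) VALUES (1, 'DARS2')")
    conn.execute("INSERT INTO Disease (id, name) VALUES (1, 'LBSL')")
    conn.execute(
        "INSERT INTO Bibliography (id, reference) VALUES (1, 'Scheper et al., 2007')"
    )
    conn.execute(
        "INSERT INTO Mutation (synthetase_id, protein_change) VALUES (1, 'p.Arg179His')"
    )
    conn.execute("INSERT INTO ln_biblio_synthetase VALUES (1, 1)")
    conn.execute("INSERT INTO ln_disease_synthetase VALUES (1, 1)")
    return conn


def diseases_for(conn: sqlite3.Connection, gene: str) -> list:
    """Follow the junction table from a synthetase gene to its diseases."""
    rows = conn.execute(
        "SELECT d.name FROM Disease AS d "
        "JOIN ln_disease_synthetase AS l ON l.disease_id = d.id "
        "JOIN Synthetase AS s ON s.id = l.synthetase_id "
        "WHERE s.gene = ? ORDER BY d.name",
        (gene,),
    ).fetchall()
    return [row[0] for row in rows]
```

The junction tables allow many-to-many links, so one synthetase can be tied to several diseases and publications, and one publication to several synthetases.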

FIGURE 4
View of the DARS2 Alignment tab. Below the ModMutGimmick, the multiple sequence alignment of the mt-AspRS sequences from bacterial, archaeal, and eukaryotic origin is visible through a sliding window. The human mt-AspRS, highlighted in red, appears twice, at the top of the alignment and within its closest relatives. The Color Feature dropdown menu displays: (i) the conservation mode (black, gray, and light gray boxes correspond to 100% strictly conserved residues, more than 80% strictly conserved residues, and more than 60% physico-chemically conserved residues, respectively); (ii) functional modules as shown in ModMutGimmick; (iii) PFAM domains; (iv) PhyloBlock; and (v) secondary structures. The absolute position ruler is scaled on the ModMutGimmick, allowing a rapid focus on the clicked position. The "Toggle IDs" button switches between the default MiSynPat sequence name nomenclature and the UniProt/NCBI accession number. Clicking on one residue (indicated in red) of the human mt-AspRS sequence from the alignment will automatically highlight this residue in all options of the Integrative Analysis tab.

Sequence alignment, features, and conservation
For each aminoacylation system, a manually curated 3D structure-guided multiple sequence alignment has been built using sequences from 92 complete genome organisms representative of bacterial, archaeal and eukaryotic phylogenetic diversity (see Supp. Table S1).
In addition, 19 manually curated multiple sequence alignments of vertebrate mt-aaRS have been built using sequences from 58 high-quality UniProt reference proteomes representative of the Vertebrate diversity (see Supp. Table S2).
Each sequence is identified by an in-house defined MiSynPat sequence nomenclature composed of: a one-letter code of the aaRS system followed by a four-letter code indicating the Bacterial (bact), Archaeal (arch), or Eukaryotic cellular location (cyto for cytosolic, mito for mitochondrial, chlo for chloroplastic), then an underscore sign followed by two four-letter codes for the genus and species separated by a dot (e.g., Dmito_Homo.sapi stands for the human DARS2).
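The naming scheme can be illustrated with a small Python parser. This is a hypothetical sketch rather than MiSynPat code: the names SequenceName and parse_sequence_name are invented here, and requiring exactly four letters for the genus and species codes is an assumption drawn from the description above.

```python
import re
from typing import NamedTuple

# Location codes listed in the text and their meaning.
LOCATIONS = {
    "bact": "bacterial",
    "arch": "archaeal",
    "cyto": "eukaryotic, cytosolic",
    "mito": "eukaryotic, mitochondrial",
    "chlo": "eukaryotic, chloroplastic",
}

# One-letter system code, location code, underscore, then Genus.species codes.
_NAME = re.compile(r"^([A-Z])(bact|arch|cyto|mito|chlo)_([A-Za-z]{4})\.([a-z]{4})$")


class SequenceName(NamedTuple):
    system: str    # one-letter amino-acid code of the aaRS system
    location: str  # key of LOCATIONS
    genus: str     # four-letter genus code
    species: str   # four-letter species code


def parse_sequence_name(name: str) -> SequenceName:
    """Split a MiSynPat-style sequence name into its four components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    return SequenceName(*match.groups())
```

For the example given above, `parse_sequence_name("Dmito_Homo.sapi")` yields system `D`, location `mito`, genus `Homo`, and species `sapi`, and `LOCATIONS["mito"]` gives the readable label.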
Each alignment has been treated by the MACSIMS program (Thompson et al., 2006) in order to collect and propagate associated functional/structural features (PFAM-A domains, PhyloBlock, secondary structures). We defined three conservation levels: residues with 100% identity over a column, residues with >80% identity over a column and columns with >60% of similar residues using the six sim-
Using the PDB (Berman et al., 2000) RCSB Web services, an automatic weekly search retrieves all aaRS structures and all human mt-aaRS structures. If a new structure of a human mt-aaRS is published, the calculated model is replaced. If a non-human related structure is released in the PDB, a new PHYRE2 model is calculated. When applicable, the LSQMAN program (Kleywegt & Jones, 1994) is used to build the dimeric mt-aaRS by superposing the computed monomer over the two chains of the closest dimer. The model of a given mutation is computed with a MODELLER script using default parameters (Fiser & Sali, 2003). The protein residue Relative Solvent Accessibility (RSA) is computed with the NAccess program (Hubbard & Thornton, 1993), using a default probe radius of 1.4 Å and a standard Ala-XXX-Ala tripeptide as the full accessibility reference for amino acid XXX. The PolyPhen (Adzhubei et al., 2010) and SIFT (Sim et al., 2012) predictions for a given mutation are computed on the fly within the Integrative Analysis tab.
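The RSA definition above amounts to a simple ratio, sketched here in Python. The function names, the 20% exposure cutoff, and the numbers in the usage note are illustrative assumptions. They are not NAccess reference values, and this sketch does not call NAccess itself.

```python
def relative_solvent_accessibility(residue_asa: float, reference_asa: float) -> float:
    """Return RSA in percent.

    residue_asa:   accessible surface area of the residue in the structure (in Å^2)
    reference_asa: accessible surface area of the same residue type in an
                   Ala-X-Ala tripeptide (in Å^2), supplied by the caller
    """
    if reference_asa <= 0:
        raise ValueError("reference_asa must be positive")
    if residue_asa < 0:
        raise ValueError("residue_asa cannot be negative")
    return 100.0 * residue_asa / reference_asa


def classify_exposure(rsa: float, buried_below: float = 20.0) -> str:
    """Label a residue as buried or exposed using a hypothetical cutoff."""
    return "buried" if rsa < buried_below else "exposed"
```

With made-up areas of 50 Å^2 in the structure and 200 Å^2 in the tripeptide, the RSA is 25%, which the hypothetical cutoff labels as exposed. Values above 100% are possible when a residue is more exposed in the protein than in the reference tripeptide.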

CONCLUSIONS
The main aim of MiSynPat is to help to bridge the gap between clinicians and basic researchers working on human mt-aaRS-related disorders. The diseases linked to mutations in the nuclear genes coding for mt-aaRSs are now described at an increasing frequency and are proving to be complex. The complexity is not only due to the variety of phenotypic expressions, with tissue-specific phenotypic imprints, but also due to the absence, for the moment, of clear mechanistic explanations for most of the systems. In MiSynPat, mutations discovered in the past ten years are now enriched by weekly reports. MiSynPat, by allying sequence data, allelic composition in patients, sequence conservation, structural information, and bibliography, offers a homogeneous resource for in-depth analyses. The conservation/structural standpoint perfectly complements existing databases, whose main goal is to facilitate genomic investigation (e.g., MSeqDR; Shen et al., 2016). In addition, the embedded infrastructure and the automated mining and retrieval of literature data with minimal human intervention guarantee durability, regularity and up-to-date maintenance of the site.
The unique configuration of MiSynPat explains why being restricted to a family of enzymes is a true added value and allows users a more comprehensive view of the field. Indeed, aaRSs, as a family of ancillary/housekeeping enzymes, have been studied structurally and functionally for decades. We now have a broad understanding of these enzymes (e.g., Bullwinkle & Ibba, 2014;Giegé & Springer, 2016;Havrylenko & Mirande, 2015;Schimmel, Giegé, Moras, & Yokoyama, 1993). For instance, crystallographic structures are available for homologous proteins that are representative of all aaRSs.
This allowed us not only to provide structural alignments (based on structural data), but also to generate 3D models for any of the human mt-aaRSs, and thus for any of the mutated mt-aaRSs. In-depth investigations and knowledge also clearly identified substrate interacting interfaces and uncovered, at least for cytosolic aaRSs, a large number of non-canonical functions, beyond translation (e.g., Guo & Schimmel, 2013;Guo, Yang, & Schimmel, 2010;Kim, You, & Hwang, 2011;Ray, Arif, & Fox, 2007). It is only recently that a similar example has been identified for the rat mt-TrpRS (Wang et al., 2016), opening the doors for alternate roles of mt-aaRSs.
Currently, the global view of disease-associated mutations impacting mt-aaRSs excludes a simple mechanistic explanation, and instead suggests that analysis of multiple mutations at once, rather than a one-at-a-time approach, will lead to a better understanding of the associated diseases. MiSynPat thus offers a way to categorize the mutations in the affected mt-aaRSs, and to distinguish those that may interrupt functions affecting protein synthesis from those that may disrupt alternate function(s) (at, e.g., non-conserved position(s) and/or enzyme surfaces that do not interact with tRNA), which appeared in mammals and are not directly involved in protein synthesis. As such, the body of knowledge provided by MiSynPat will form the core of a future expert system further bridging the gap between clinical data and mechanistic interpretations.