How to use the MEROPS database and website to help understand peptidase specificity

Abstract The MEROPS website (https://www.ebi.ac.uk/merops) and database were established in 1996 to present the classification and nomenclature of proteolytic enzymes. This was expanded to include a classification of protein inhibitors of proteolytic enzymes in 2004. Each peptidase or inhibitor is assigned a distinct identifier, based on its biochemical and biological properties, and homologous sequences are assembled into a family. Families in which the proteins share similar tertiary structures are assembled into a clan. The MEROPS classification is thus a hierarchy with at least three levels (protein‐species, family, and clan) showing the evolutionary relationship. Several other data collections have been assembled, which are accessed from all levels in the hierarchy. These include sequence homologs, selective bibliographies, substrate cleavage sites, peptidase–inhibitor interactions, alignments, and phylogenetic trees. The substrate cleavage collection has been assembled from the literature and includes physiological, pathological, and nonphysiological cleavages in proteins, peptides, and synthetic substrates. In this article, we make recommendations about how best to analyze these data and show analyses to indicate peptidase binding site preferences and exclusions. We also identify peptidases where co‐operative binding occurs between adjacent binding sites.


| OVERVIEW
The MEROPS website (www.ebi.ac.uk/merops) began in 1996 as a vehicle in which to present the classification of proteolytic enzymes into evolutionarily related clans and families. A proteolytic enzyme cleaves the peptide bond between two amino acids in a peptide or protein, and like any enzyme that degrades a biological polymer, had been difficult to classify by specificity alone. The specificity of a proteolytic enzyme can be complex, and many enzymes can show similar if not identical specificity but act in different environments or under different environmental conditions. The classification by Rawlings and Barrett 1 was based entirely on sequence and structural relationships. Proteolytic enzymes with similar sequences were assembled into a family, and enzymes from different families were assembled into a clan if the structures were known (or thought to be) related. When no tertiary structure is known, it is usually not possible to assign a family to a clan. In some cases, if the order of the catalytic residues is the same in the sequence, then a family where the structure is not known can be provisionally assigned to an existing clan.
Many proteolytic enzymes are multidomain proteins with the proteolytic activity restricted to one structural domain. Only the sequence and structure of this single domain are taken into consideration when assigning a proteolytic enzyme into a family and clan.
The vast majority of proteolytic enzymes are peptidases (also known as proteases or proteinases) which cleave peptide bonds by hydrolysis (and form subclass 3.4 in the NC-IUBMB Enzyme Nomenclature 2 ). Peptidases vary in the nature of the nucleophile in the hydrolytic reaction, which can be the hydroxyl of a serine ("serine peptidase"), the hydroxyl of a threonine ("threonine peptidase"), the thiol of a cysteine ("cysteine peptidase"), water bound to aspartic acid residues ("aspartic peptidases"), water bound to glutamic acid residues ("glutamic peptidases"), or water bound to a metal ion ("metallopeptidases"). Within a family, almost all peptidases will be restricted to one catalytic type, and each family is given an identifier consisting of a letter to indicate the catalytic type (S, T, C, A, G, or M) followed by a number. A clan identifier consists of a letter to indicate the catalytic type followed by a second letter assigned sequentially. However, within a clan, catalytic type can vary, so for a clan of mixed catalytic type the identifier begins with the letter P. For a family where the catalytic type is unknown, the identifier begins with the letter U. Asparagine lyases, which cleave themselves by rearrangement of an Asn to form a succinimide, are the only nonhydrolytic proteolytic enzymes, and for these the clan and family names begin with the letter N. 3 Some families have been divided into subfamilies where there is strong evidence of a deep division within the family (e.g., when a sequence relationship was discovered that enabled two former families to merge, with each former family being retained as a subfamily within the new family). We wish to stress that in the MEROPS classification system, a subfamily is a major division within a family, but it is optional. Of the 275 families of proteolytic enzymes, only 41 are divided into subfamilies. Most of these have only two subfamilies; the largest numbers are in families A2 (where there are six), C3 (eight), M28 (six), and S1 (six).
Similarly, some clans have been divided into subclans, where there is evidence that the catalytic mechanism differs between the families. Only five of the 47 clans are divided into subclans: Clan MA is divided into the Glu-zincins, Asp-zincins, and Met-zincins; Clans PA, PB, PC, and PD, all of which are of mixed catalytic type, are divided into individual subclans of either cysteine, serine, or threonine peptidases.
A third level in the hierarchy was introduced in 1998, which we have termed a "protein-species," which represents the same enzyme from different organisms. 4 Each protein-species is given a unique MEROPS identifier consisting of the family name (padded with a zero to be at least three characters) followed by a dot and a sequential number. With only a few exceptions for model organisms with completely sequenced genomes, a MEROPS identifier is only established for a biochemically characterized protein. In any family, there will be homologs that cannot be assigned to a MEROPS identifier. It is only at the level of protein-species that specificity is taken into account. Unfortunately, some other databases that also classify protein sequences have used the term "subfamily" to represent the protein-species level, which has caused confusion.
Some peptidases function in a complex of proteins. Where this complex contains more than one peptidase, a special MEROPS identifier is used. If all peptidases in the complex are homologous, then the special identifier consists of an "X," followed by the family identifier, a dot, and a sequential number. For example, the eukaryotic 20S proteasome is XT01.001. For a protein containing more than one peptidase unit, a similar special identifier is used, for example, metallocarboxypeptidase D is XM14.001. Where the complex contains peptidases from different families, the letter "P" followed by a number is used instead of the family name, for example, the tricorn peptidase complex is XP01.001. Each characterized, individual peptidase (or peptidase unit) from a complex is also given a normal MEROPS identifier. Special identifiers are not created for viral polyproteins because an individual enzyme is functional only on separation from the polyprotein.
There are proteins with no hydrolytic activity but which are related to peptidases. This means that within any family of proteins, function is not necessarily conserved. Often loss of activity is associated with loss or replacement of catalytic residues, enabling an uncharacterized protein to be identified as a nonpeptidase homolog from its sequence. Within some families of peptidases, some homologs have different enzymatic activities, for example, family S9 includes lipases and esterases. For a characterized nonpeptidase homolog, a special MEROPS identifier is established in which the first number after the dot is a nine.
There are many biochemically characterized peptidases for which a sequence is not known, or only a short fragment is known which cannot be assigned to a family. Special identifiers are created for these, using a letter to represent the catalytic type followed by a "9," and a letter to indicate the kind of peptidase activity: A for an aminopeptidase, B for a dipeptidase, C for a dipeptidyl-peptidase, D for a peptidyl-dipeptidase, E for a carboxypeptidase, and G for an endopeptidase. For example, membrane Pro-Xaa carboxypeptidase is M9E.004.
Thus in the MEROPS classification there is a multilevel hierarchy, from sequence to protein-species, to subfamily (if any), family, subclan (if any) and clan.
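The identifier forms described above can be made concrete with a small parser. The following Python sketch is illustrative only: the function name `parse_merops_id` and the returned field names are assumptions, not part of MEROPS, and only the identifier forms mentioned in this section are handled.

```python
import re

# Catalytic-type letters as described in the text.
CATALYTIC_TYPE = {
    "A": "aspartic", "C": "cysteine", "G": "glutamic", "M": "metallo",
    "S": "serine", "T": "threonine", "N": "asparagine lyase",
    "P": "mixed", "U": "unknown", "I": "inhibitor",
}
# Activity letters used when a peptidase cannot be assigned to a family.
ACTIVITY = {
    "A": "aminopeptidase", "B": "dipeptidase", "C": "dipeptidyl-peptidase",
    "D": "peptidyl-dipeptidase", "E": "carboxypeptidase", "G": "endopeptidase",
}

_COMPLEX = re.compile(r"X([A-Z])(\d{2,3})\.(\d{3})")      # e.g. XT01.001
_COMPOUND = re.compile(r"LI(\d{2,3})\.(\d{3})")           # e.g. LI01.001
_UNASSIGNED = re.compile(r"([A-Z])9([ABCDEG])\.(\d{3})")  # e.g. M9E.004
_PLAIN = re.compile(r"([A-Z])(\d{2,3})\.(\d{3})")         # e.g. S01.151


def parse_merops_id(identifier):
    """Split a MEROPS identifier into its parts (hypothetical helper)."""
    m = _COMPLEX.fullmatch(identifier)
    if m:
        letter, number, serial = m.groups()
        # Assumption: "XP" always marks a complex of peptidases from
        # different families, so no single family can be reported.
        family = None if letter == "P" else f"{letter}{int(number)}"
        return {"kind": "complex", "family": family, "serial": serial}
    m = _COMPOUND.fullmatch(identifier)
    if m:
        number, serial = m.groups()
        return {"kind": "compound inhibitor", "family": f"I{int(number)}",
                "serial": serial}
    m = _UNASSIGNED.fullmatch(identifier)
    if m:
        letter, activity, serial = m.groups()
        return {"kind": "unassigned", "family": None, "serial": serial,
                "catalytic_type": CATALYTIC_TYPE.get(letter),
                "activity": ACTIVITY[activity]}
    m = _PLAIN.fullmatch(identifier)
    if m:
        letter, number, serial = m.groups()
        # A serial number beginning with 9 marks a nonpeptidase homolog.
        kind = "nonpeptidase homolog" if serial.startswith("9") else "protein-species"
        return {"kind": kind, "family": f"{letter}{int(number)}", "serial": serial,
                "catalytic_type": CATALYTIC_TYPE.get(letter)}
    raise ValueError(f"unrecognized identifier form: {identifier!r}")
```

For example, `parse_merops_id("S01.151")` reports family S1 at the protein-species level, whereas `parse_merops_id("M9E.004")` reports a metallo-type carboxypeptidase with no family. Identifier forms not described in this section raise `ValueError` rather than being guessed.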
Once the MEROPS hierarchy was established, it was possible to expand the website to include other items that could be linked to a level in the hierarchy. The first of these was an extensive bibliography which could be linked at protein-species, family, and clan levels. The bibliography is updated fortnightly. Sequence alignments and phylogenetic trees were added at the family level 5; these are usually regenerated for each release. A collection of known cleavage sites in proteins was established from the scientific literature which could be linked at the protein-species level. 6,7 Peptidase-inhibitor interactions were included at the protein-species level. 8 Both of these collections are updated periodically. A summary page is presented on the website for each holotype, family, and clan. An example peptidase summary page is shown in Figure 1. Buttons across the top of the screen provide access to supplementary pages, which for a peptidase summary page include substrate cleavages, inhibitor interactions, and a bibliography. Tables on the summary page show alternative names, the full MEROPS classification, and details of activity. Substrate specificity is shown as a logo and as a matrix showing the amino acids occupying binding pockets P4-P4'. Both are calculated dynamically from all the known substrate cleavages for this enzyme. In the matrix, different shades of green highlighting are used to indicate preference, with the brighter green showing the most restricted specificity. Finally, there is a list of important inhibitors, with links to the relevant inhibitor summary pages.
FIGURE 1 An example peptidase summary page from the MEROPS website. The peptidase summary page for glutamyl endopeptidase I is shown.
The methodology used to classify proteolytic enzymes was extended to include peptidase inhibitors that are proteins. 9 An identifier for a family of inhibitors begins with the letter I. In most cases, an inhibitor family has a unique tertiary structure, so in most cases a clan contains only one family. Because there are 34 clans of inhibitors, identifiers starting with more than one letter were required, and the identifier for a clan of inhibitors begins either with an I or a J. For compound inhibitors, similar special MEROPS identifiers were also created, but beginning with the letter "L"; for example, LI01.001 for ovomucoid.
Small molecule inhibitors (SMIs) were added to the MEROPS website in 2008. 8 These are not classified and are simply listed alphabetically.
All the features available on the MEROPS website were detailed in a recent publication. 10 Few new features have been added, but the data underlying these features are kept up to date. The most recent release of the MEROPS database (release 12.2) was made in June 2020. In this report, we discuss the classification and the substrate cleavage collection, in particular how to use the data to predict the specificity of proteolytic enzymes, and present some analyses of those data.

| Type examples
Our classification of proteolytic enzymes and inhibitors has been influenced by organism taxonomy. When an organism is described for the first time, the identification of a type specimen is the first step. The equivalent for a protein would be the source organism and sequence from the publication where it was first characterized biochemically. This sequence is known as the holotype. In the MEROPS classification, there is a holotype for each protein-species. One of these holotypes will be selected as the type example for a family (or subfamily): this is usually the first protein in the family (or subfamily) to be biochemically characterized. For a clan, there is a type structure, which is usually the first solved tertiary structure for a protein from any of the families included in the clan. Only that part of the sequence which includes the catalytic machinery (the active site residues and the primary substrate-binding sites but not a secondary binding site such as an exosite) is considered for a proteolytic enzyme, and only that part of the sequence which includes the reactive site is considered for an inhibitor. We term these parts of the sequence the peptidase unit and the inhibitor unit, respectively. When active or reactive site residues are not known, the peptidase or inhibitor unit is estimated, omitting parts of the sequence known to have other functional domains and targeting signals as identified by Pfam. 11 A new holotype and MEROPS identifier are set up when a protein is characterized and either has a different specificity from any other protein in the family, or the specificity is similar, but the protein acts in a different cellular location, has a different architecture, or the sequence does not cluster in a phylogenetic tree with that of the existing holotype(s) that have similar specificity.
The same MEROPS identifier is assigned to uncharacterized homologs of the holotype that are considered to represent the same protein from different species. Our criteria for doing this were set out previously 12 and include more than 50% sequence identity within the peptidase/inhibitor unit, preservation of the order and number of domains, similar targeting signals and transmembrane regions, and clustering of the homolog with the holotype on a phylogenetic tree.
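The criteria listed above amount to a conjunction of tests, which can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the MEROPS pipeline: the names `percent_identity`, `HomologEvidence`, and `same_protein_species` are hypothetical, and identity is assumed here to be computed over alignment columns in which neither sequence has a gap (the text does not specify the denominator).

```python
from dataclasses import dataclass


def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: gaps are written as "-" and gapped columns are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


@dataclass
class HomologEvidence:
    """Evidence comparing a homolog with a holotype (hypothetical container)."""
    identity_percent: float          # within the peptidase/inhibitor unit
    same_domain_architecture: bool   # order and number of domains preserved
    similar_targeting_and_tm: bool   # targeting signals, transmembrane regions
    clusters_with_holotype: bool     # on a phylogenetic tree


def same_protein_species(evidence):
    """Return True only if every criterion named in the text is met."""
    return (
        evidence.identity_percent > 50.0
        and evidence.same_domain_architecture
        and evidence.similar_targeting_and_tm
        and evidence.clusters_with_holotype
    )
```

The booleans would have to come from separate domain, signal, and tree analyses; the sketch only shows how the individual findings combine.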

| Assembly of families
Homologs are assigned to a family from a BlastP 13 or HMMER 14 search using the peptidase/inhibitor unit sequence from an existing member of the family (usually that of the holotype). Homologs are collected from either the nonredundant Protein sequence library at NCBI 15 or UniProt. 16 All homologs returned from the search with an E-value less than 0.001 are assigned to the family. There are instances when a search with a sequence from one family will return homologs already assigned to another family; if all the active site residues are aligned, then the families will be merged (with each former family forming a subfamily in the new family). If not all active site residues align, then the families are not merged, and this frequently happens among families within clans SC and MA.
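The E-value cutoff can be illustrated with a short filter. The sketch below assumes that the search results are available in the standard 12-column BLAST+ tabular format (output format 6), in which the subject identifier is the second column and the E-value the eleventh; the function name `family_members` is hypothetical, and HMMER output, which has a different layout, is not handled.

```python
def family_members(blast_tabular_lines, max_evalue=0.001):
    """Collect subject identifiers whose E-value is below the cutoff.

    Assumes BLAST+ tabular output (format 6) with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    members = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue < max_evalue:
            members.add(subject_id)
    return members
```

A hit retained by this filter would still need the active site residues checked before families are merged, as described above.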
Protein sequence alignments are made with Muscle 17 and phylogenetic trees using QuickTree. 18

| Assembly of clans
Co-ordinates for a peptidase (or inhibitor) structure are downloaded from the Protein DataBank (PDB) 19 and submitted to the Dali server. 20 Any structures returned showing similarity at 6 SD units or more are considered to be homologous, and all of those that are peptidases (or peptidase inhibitors) will be included in the same clan, provided the similarity covers the peptidase (or inhibitor) units.

| Peptidase specificity analysis
A nonredundant set of cleavages for each peptidase was made by replacing all nonstandard amino acids and synthetic blocking groups with the letter "X" and then selecting distinct P4-P4' sequences. For further analysis, only those peptidases with 20 or more nonredundant cleavages were considered. The frequency of each amino acid for each peptidase binding site could then be counted, and binding sites occupied by only one or two amino acids could be identified. By converting each amino acid to one of six types, a binding site occupied by one type or not occupied by one type could also be identified. The types were acidic (Asp, Glu), basic (Arg, Lys, His), aliphatic (Ile, Leu, Val), aromatic (Phe, Tyr, Trp), small (Ala, Gly, Ser), and other (Asn, Cys, Gln, Met, Pro, Thr). Binding sites where one or two amino acids (or one amino acid type) were never found could also be identified, but to prevent the rarity of amino acids such as tryptophan from affecting the results, the minimum number of cleavages per peptidase was increased to 30. To account for instances where the substrate cleavage data collected from the literature might be in error, substrate-binding sites occupied by one amino acid or one amino acid type in more than 90% of the cleavages were calculated for each peptidase.
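The steps above can be sketched in Python. This is a minimal illustration, not the code used for the analyses reported here: it assumes that each cleavage is supplied as eight one-letter tokens for P4 to P4', with "-" for an unoccupied site and any other token treated as nonstandard, and all function names are hypothetical.

```python
from collections import Counter

POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]

# The six amino acid types defined in the text, keyed by one-letter code.
AA_TYPE = {}
for type_name, residues in {
    "acidic": "DE", "basic": "RKH", "aliphatic": "ILV",
    "aromatic": "FYW", "small": "AGS", "other": "NCQMPT",
}.items():
    for residue in residues:
        AA_TYPE[residue] = type_name


def nonredundant(cleavages):
    """Replace nonstandard tokens with "X" and keep distinct P4-P4' windows."""
    seen, windows = set(), []
    for site in cleavages:
        window = "".join(r if r in AA_TYPE or r == "-" else "X" for r in site)
        if window not in seen:
            seen.add(window)
            windows.append(window)
    return windows


def site_counts(windows):
    """Count the residues observed at each of the eight positions."""
    return [Counter(w[i] for w in windows) for i in range(len(POSITIONS))]


def dominant_residues(windows, threshold=0.9, min_cleavages=20):
    """Positions where one amino acid occupies more than `threshold` of cleavages."""
    if len(windows) < min_cleavages:
        return {}
    result = {}
    for position, counts in zip(POSITIONS, site_counts(windows)):
        residue, n = counts.most_common(1)[0]
        if residue in AA_TYPE and n / len(windows) > threshold:
            result[position] = residue
    return result


def dominant_types(windows, threshold=0.9, min_cleavages=20):
    """Positions where one amino acid type occupies more than `threshold`."""
    if len(windows) < min_cleavages:
        return {}
    result = {}
    for position, counts in zip(POSITIONS, site_counts(windows)):
        type_counts = Counter()
        for residue, n in counts.items():
            if residue in AA_TYPE:
                type_counts[AA_TYPE[residue]] += n
        if type_counts:
            aa_type, n = type_counts.most_common(1)[0]
            if n / len(windows) > threshold:
                result[position] = aa_type
    return result
```

Exclusions (residues or types never observed at a site) could be derived from the same `site_counts` output by listing the standard amino acids with a count of zero, using the higher minimum of 30 cleavages described above.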
Examples of co-operative binding, where the binding of an amino acid in one site affects what can bind in a neighboring site, were also searched for. For example, Phe only binds in P1 if Gly binds in P2. Because the frequency of occurrence of an amino acid will affect how often it is found with another amino acid in the P4-P4' sequence, there had to be 10 or more instances where two amino acids occurred together, and to exclude the preferences previously observed, each amino acid had to occur in less than 90% of the cleavages for the peptidase in question. Table 1 shows the number of MEROPS identifiers, families and clans for proteolytic enzymes and inhibitors.
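One way to express the search for co-operative binding is sketched below. The exact procedure is not spelled out in the text, so this Python fragment is an interpretation: a residue at one site is reported when every occurrence is accompanied by the same residue in an adjacent site, the pair occurs together at least 10 times, and each residue occupies its site in less than 90% of the cleavages. The function name `cooperative_pairs` and the input format (eight-character P4-P4' strings, with "-" and "X" ignored) are assumptions.

```python
from collections import Counter, defaultdict

POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]
IGNORED = {"-", "X"}  # unoccupied sites and nonstandard residues


def cooperative_pairs(windows, min_together=10, max_fraction=0.9):
    """Find residues that occur at a site only with one residue in a neighboring site.

    Returns tuples (site, residue, neighbor_site, neighbor_residue, count).
    """
    total = len(windows)
    if total == 0:
        return []
    found = []
    for i in range(len(POSITIONS)):
        site_counts = Counter(w[i] for w in windows)
        for j in (i - 1, i + 1):
            if j in (-1, len(POSITIONS)):
                continue  # no neighboring site beyond P4 or P4'
            neighbor_counts = Counter(w[j] for w in windows)
            partners = defaultdict(Counter)
            for w in windows:
                partners[w[i]][w[j]] += 1
            for residue, partner_counts in partners.items():
                if residue in IGNORED or len(partner_counts) != 1:
                    continue
                partner, together = next(iter(partner_counts.items()))
                if partner in IGNORED or together < min_together:
                    continue
                # Skip residues already explained by a strong site preference.
                if site_counts[residue] / total >= max_fraction:
                    continue
                if neighbor_counts[partner] / total >= max_fraction:
                    continue
                found.append((POSITIONS[i], residue, POSITIONS[j], partner, together))
    return found
```

With the example from the text, a data set in which Phe is found in P1 only when Gly is in P2 would yield the tuple `("P1", "F", "P2", "G", n)`, where `n` is the number of cleavages containing the pair. A stricter analysis would also test whether the association exceeds what the background frequencies predict.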

| RESULTS
Numbers of other features included on the website are shown in Table 2.
Of the 4,684 MEROPS identifiers, 1,424 are assigned to holotypes that are uncharacterized peptidase homologs from model organisms. The remaining 3,260 holotypes are known to be active as peptidases, but cleavage sites in substrates are only known for 1,342 different peptidases (27.6% of all holotypes, but 41.2% of all holotypes known to be active peptidases). In addition to the number of cleavages shown above, a further 4,173 cannot be assigned to a single peptidase. For example, some 3,694 eukaryotic protein precursors are known to have the initiating methionine (Met1) removed by either methionyl aminopeptidase 1 (M24.001) or methionyl aminopeptidase 2 (M24.002), but because it is not known which, the cleavages are assigned to an "M24 homolog." The number of cleavages that can be mapped to UniProt entries is There are 6,300 cleavages in synthetic substrates. The peptidase with most known cleavages is trypsin 1 (S01.151) with 22,528; this is because trypsin 1 is widely used in proteomics studies to degrade whole proteomes prior to mass spectrometry. The peptidase with most physiological or pathological cleavages is matrix metallopeptidase-3 (M10.005) with 2,452. Table 3 shows the frequency, as a percentage, with which amino acids occur in all nonredundant cleavage sites. Amino acids that occur in more than 10% of cleavages are highlighted in yellow. These are Leu in P2, Arg and Lys in P1, and Ala and Leu in P1'. Amino acids that occur in less than 1% of cleavages are highlighted in orange. These are Trp in any position, and Cys and Ile in P1. Trp and Cys are the least frequent amino acids, and Cys may occur in a disulfide bond, which probably prevents cleavage. The low frequency of Ile in P1 is unexpected.
The number of holotypes for which 20 or more nonredundant cleavages are known is only 217 (4.6%), and it is these for which substrate preferences are analyzed below.
Note to Table 3: Standard amino acids are included, plus "-" to indicate an unoccupied site and "X" to indicate a nonstandard amino acid or other moiety. An amino acid that occupies a substrate-binding site in 10% or more cleavages is highlighted in yellow. An amino acid that occupies a substrate-binding site in less than 1% of cleavages is highlighted in orange.
Peptidases where only one or two amino acids, or one amino acid type, are accepted in P4-P4' are shown in Figure S2. This figure also shows binding sites where one or two amino acids (or a single amino acid group) are unacceptable. Preferences for 168 peptidases are shown, representing 77% of peptidases with 20 or more known substrate cleavages. For the remaining 49 peptidases, the specificity is cryptic and cannot be explained in such simple terms. Peptidases absent from Figure S2 include cathepsin L (C01.032; 2,862 cleavages), matrix metallopeptidase-2 (M10.003; 2,558), matrix metallopeptidase-3 (M10.005; 2,425), meprin alpha subunit (M12.002; 771), meprin beta subunit (M12.004; 925), granzyme B (Homo sapiens-type, S01.010; 1,636), and the 20S constitutive proteasome peptidase complex (eukaryote, XT01.001; 675). Granzyme B apparently cleaves after residues other than Asp and Glu. 22,23 There is a common misconception that only the P1 binding pocket of a peptidase is important, perhaps because peptidases from the well-studied families C14, S1 and S8 all show such limited specificity. Figure S2 clearly shows a preference in any of the binding pockets P4-P4' for some peptidases. The P1 pocket, however, shows most instances of limited preference (in 61 peptidases), whereas fewest peptidases show preference in P4 (seven peptidases) or P4' (six). However, when looking at unacceptable binding, the P1 pocket shows least (30), whereas the other binding pockets show unacceptable residues in between 34 and 48 peptidases.
All peptidases have to accept at least one residue in P1 and P1'. For a dipeptidase, there are no other binding pockets, for an aminopeptidase P4-P2 do not exist, and for a carboxypeptidase P2'-P4' do not exist. A peptidase that releases a dipeptide from the N-terminus of a peptide (a "dipeptidyl-peptidase") does not have P4 or P3, and a peptidase that releases a dipeptide from the C-terminus of a peptide (a "peptidyl-dipeptidase") does not have P3' and P4'. In Figure S2, there are apparently 15 aminopeptidases, nine carboxypeptidases, four dipeptidyl-peptidases, one peptidyl-dipeptidase, and two peptidases that act like peptidyl-tripeptidases.
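The relationship between unoccupied binding pockets and type of activity can be written as a simple rule. The Python sketch below is illustrative: the function name `apparent_activity` is hypothetical, and the input is assumed to be a list of eight-character P4-P4' strings in which "-" marks an unoccupied site.

```python
def apparent_activity(windows):
    """Infer the apparent type of activity from which sites are ever occupied.

    Counts how many of P4-P1 and how many of P1'-P4' are occupied in at
    least one cleavage, then applies the rules given in the text.
    """
    occupied = [any(w[i] != "-" for w in windows) for i in range(8)]
    nonprimed = sum(occupied[:4])  # P4, P3, P2, P1
    primed = sum(occupied[4:])     # P1', P2', P3', P4'
    if nonprimed == 1 and primed == 1:
        return "dipeptidase"
    if nonprimed == 1:
        return "aminopeptidase"
    if primed == 1:
        return "carboxypeptidase"
    if nonprimed == 2:
        return "dipeptidyl-peptidase"
    if primed == 2:
        return "peptidyl-dipeptidase"
    return "endopeptidase or undetermined"
```

The result describes only the substrates that have been tested. Short synthetic substrates, or methods that do not identify residues beyond P1', can make an endopeptidase look like an exopeptidase.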
Clearly, there are some anomalies in Figure S2. The DmpA peptidase from Ochrobactrum anthropi (MEROPS identifier P01.001) has been shown to be an aminopeptidase, but it is also thought to be self-processing (and then acts as an endopeptidase), 24 hence it does not show the characteristics of an aminopeptidase in Figure S2. Carboxypeptidase Q (M28.014) appears to be a dipeptidase in Figure S2, which reflects its former characterization as a lysosomal dipeptidase. 25 SplE peptidase (S01.312) from Staphylococcus aureus was characterized by use of a cellular library of peptide substrates (CLiPS), in which residues beyond P1' were not identified, 26 hence it appears in Figure S2 as if it were a carboxypeptidase. Sedolisin-B (S53.002) and kumamolisin-B (S53.005) were characterized using synthetic chromogenic substrates that covered P4-P3' only, 27 giving the false impression of a peptidyl-tripeptidase (an activity not known to exist).
The commonest preference shown in Figure S2 is that of either Arg or Lys (or "basic") in P1, which occurs in 36 peptidases (but 20 of these are peptidases from family S1 and nine from S8). A preference for Arg or Lys occurs in P1' for CPG70 carboxypeptidase from Porphyromonas gingivalis (M14.023). Preferences for Ala, Asp, Gly, Met, and Val each occur in substrate sites from three peptidases. On the other hand, there is no preference for His or Trp in any of the binding sites shown in Figure S2, and Cys and Ile are each observed on only one occasion.
The residue most frequently excluded from a substrate-binding site is Trp, which is not found in at least 45 of the sites shown in Figure S2, plus many others in combination with another amino acid (Cys and Trp are excluded from 15 binding sites, and Met or Trp from six). In addition, Cys is absent from at least 49 binding sites. Trp and Cys are the amino acids that occur with the lowest frequency. However, evidence that these are real exclusions and not just the result of a low level of occurrence is that the number of binding sites in which Trp is not found is the same if in the analysis the number of cleavages per peptidase is increased to 50 or reduced to 10. It is also noteworthy that Ile is excluded from more sites than Pro, His, or Met, and that acidic residues are excluded from 11 sites.
The binding of a substrate to an enzyme is often described as a "lock and key" hypothesis, but this is misleading, because it implies a rigidity that is not present in either the substrate or the enzyme. Peptidases are no exception, and there is a degree of plasticity in the active site. For example, if a large amino acid binds in one pocket, then this may prevent anything other than a small amino acid binding in the adjacent pocket. This is known as co-operative binding, and 37 examples from 23 proteolytic enzymes have been identified and are shown in Table 4.
No co-operative binding labeled P4' is shown, because the next residue would be P5' and only residues occupying P4-P4' are in the MEROPS collection. The site showing co-operative binding in the most proteolytic enzymes is P2 (16 enzymes). In some cases, this reflects the perceived specificity at P1, rather than co-operative binding. For example, caspase-1 is thought to cleave only aspartyl bonds, yet other cleavages have been observed in a proteomics study 28 : of the nonredundant cleavages for caspase-1, only 83% are of aspartyl bonds.
Amino acids most frequently involved in co-operative binding are Arg, Ala, Lys, and Val. Trp is not observed at all, and Cys, Met, Thr, and Tyr occur just once.

| DISCUSSION
All the tables from the MEROPS MySQL database can be downloaded freely from the MEROPS FTP site (ftp://ftp.ebi.ac.uk/pub/databases/merops/current_release/database_files), including the tables that form the MEROPS substrate cleavage collection (Substrate_search and cleavage). Tables are available in SQL format and as comma-delimited text files.
Users of the MEROPS substrate cleavage collection who wish to analyze the data are reminded that the data set contains redundant entries, and it is recommended that a nonredundant subset is generated from it. The collection includes cleavages of synthetic substrates, as well as physiological and nonphysiological cleavages in peptides and proteins, and further subsets should be made as required, for example, to analyze only cleavages in physiological substrates.
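Generating such subsets from a comma-delimited file might look like the following Python sketch. The column names used here (`code`, `cleavage_type`, and the eight `Site_...` columns) and the value `physiological` are assumptions made for illustration and should be checked against the header of the downloaded table; the example rows are invented, and the function name `nonredundant_rows` is hypothetical.

```python
import csv
import io

# Hypothetical column names; verify against the downloaded file's header.
SITE_COLUMNS = [
    "Site_P4", "Site_P3", "Site_P2", "Site_P1",
    "Site_P1prime", "Site_P2prime", "Site_P3prime", "Site_P4prime",
]


def nonredundant_rows(handle, cleavage_type=None):
    """Keep one row per distinct (peptidase, P4-P4' sequence) combination.

    If `cleavage_type` is given, only rows of that type are considered.
    """
    seen, rows = set(), []
    for row in csv.DictReader(handle):
        if cleavage_type is not None and row["cleavage_type"] != cleavage_type:
            continue
        key = (row["code"],) + tuple(row[column] for column in SITE_COLUMNS)
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


# Invented example data, used only to show the call pattern.
EXAMPLE_CSV = "\n".join([
    "code,cleavage_type," + ",".join(SITE_COLUMNS),
    "S01.151,physiological,Ala,Gly,Leu,Arg,Ser,Ala,Gly,Val",
    "S01.151,physiological,Ala,Gly,Leu,Arg,Ser,Ala,Gly,Val",
    "S01.151,synthetic,-,-,Gly,Arg,X,-,-,-",
])

all_rows = nonredundant_rows(io.StringIO(EXAMPLE_CSV))
physiological_rows = nonredundant_rows(io.StringIO(EXAMPLE_CSV), "physiological")
```

For a real download, the `io.StringIO` object would be replaced by an open file handle. Filtering before deduplication means that a sequence recorded both as a physiological and as a synthetic cleavage is retained in each subset.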
There has been a tendency when analyzing peptidase preference to concentrate only on the substrate-binding sites adjacent to the cleavage site, namely P1 and P1', and although for many peptidases specificity is directed to P1, this is only the case with peptidases from a minority of families. It is as important to identify which amino acids do not bind to a subsite as it is to identify those that do, and we have shown particular amino acids are more likely to be excluded from subsites other than P1. The substrate-binding site of a proteolytic enzyme has some degree of plasticity, which we have shown from the identification of adjacent subsites where co-operative binding may take place.
Much of the data in the MEROPS substrate cleavage collection has come from high-throughput proteomics studies. In such studies, hundreds of cleavages are identified, the vast majority of which conform to the known specificity of the peptidase being studied. However, cleavages after amino acids thought not to correspond to the specificity of the peptidase have also been observed. Peptides are usually generated from an entire proteome by two proteolytic cleavages, one by a peptidase with known, limited specificity (such as trypsin) and the other by the peptidase under study (the "test peptidase"). Because the sequences of all the proteins in the proteome are known, and cleavages by the peptidase with limited specificity can be calculated, cleavages by the test peptidase can also be calculated from the mass of each peptide. Performing the same digestion only by the peptidase with limited specificity provides the control. Two cleavages are required to generate short peptides, because this reduces the number of candidate source proteins and the computing time required. 29 As can be seen from Figure 1, there are many cleavages identified for glutamyl endopeptidase I which do not correspond to its "known" specificity: preference for Glu (92% of cleavages) or Asp (4%) in P1. Given that glutamyl endopeptidase I is regularly used as a peptidase with known limited specificity in these proteomics experiments, cleavages other than at glutamyl and aspartyl bonds could potentially lead to erroneous assumptions about which peptides are generated by the test peptidase when a proteome is digested. It is possible that under proteomics conditions, nonpreferential cleavages occur, resulting in these unusual cleavages, but it is also possible that these unusual cleavages are either a result of another, contaminating peptidase, or that the peptide has been mapped to an incorrect protein, perhaps because a splice variant has been misidentified in the proteome.