Analysis of the tryptic search space in UniProt databases

In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index, reference sequence, Ensembl, and UniRef100 (where UniRef is UniProt reference clusters) organism-specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease-associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was also compared against the available peptide-level identifications in the main MS-based proteomics repositories. The peptides observed in these repositories are an important source of information for protein databases, as they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism-specific sequence data sets available from UniProt, together with the pros and cons for each, in terms of search space for MS-based bottom-up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases to enable scientists to select those most appropriate for their purposes.

or nucleotides) to match peptide sequences to experimental spectra and then to infer the proteins to which those peptides belong [1]. The serine protease trypsin is the most used cleaving agent in these workflows.
The Universal Protein Resource (UniProt, www.uniprot.org) [2] is among the most used protein sequence and functional annotation providers. Among the UniProt databases (DBs) are the UniProt knowledgebase (UniProtKB), which acts as the central hub for the collection of functional information on proteins, and the UniProt reference clusters (UniRef) [3], which merge closely related sequences based on sequence identity. UniProtKB consists of two sections: UniProtKB/Swiss-Prot, which is manually annotated and reviewed, and UniProtKB/TrEMBL, which is automatically annotated and is unreviewed. In the UniProtKB/Swiss-Prot section, protein isoform and variant information is also provided.
The aim of the analysis reported in this study is to provide a clear view of the tryptic search space of UniProt and other protein data sets to enable scientists to select those most appropriate for their purposes. This is all the more pertinent now since proteomics papers in the public domain are still being produced (as well as evaluated and accepted for publication) using the International Protein Index (IPI) [4] as the DB, well after it was discontinued (in September 2011), hence omitting any new or updated protein sequences. Comparing tryptic search spaces from different data sets can assist in pinpointing differences between them and help users to understand the reasons behind those differences. Therefore, in addition to the UniProtKB data sets, we also considered other popular resources, such as the Ensembl [5] and the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) [6] data sets, for this analysis.
We also included information coming from MS proteomics repositories, which provide a global view of large sets of processed mass spectral data. Specifically, we investigated the three most prominent: the PRIDE [7], PeptideAtlas [8], and the Global Proteome Machine database (GPMDB) [9].
In this article, we focus on issues related to sequence collections. We then present the results of a comparative analysis of the tryptic search space in these various resources, together with suggestions for their use, including, for instance, the advice not to exclude a priori either UniProtKB/Swiss-Prot isoforms or UniProtKB/TrEMBL sequences from UniProt collections.

Protein sequence collection
The species analyzed were Homo sapiens and Mus musculus, for which UniProtKB complete proteome sets [2] were obtained from ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/. The other UniProtKB sequence collections were all obtained using the public UniProt web interface. A summary of the sequence data sets is reported in Table 1 and Supporting Information Table 1, together with the nomenclature adopted for each specific protein data set and the relationships among data sets. Additional species were also analyzed but are not discussed in any detail in the main text. The corresponding information is reported in the Supporting Information Notes (see, for example, Supporting Information Table 2).
The organism-specific UniRef100 files were created using the customized in-house CD-HIT algorithm, which is part of the UniRef pipeline [3]. Files containing sequences with variants were generated using the "varsplic.pl" script [10,11] (Supporting Information Notes). A modified variant expansion was also devised in order to limit the expansion to the human variants marked as disease-related in the UniProt human polymorphisms and disease mutations file (www.uniprot.org/docs/humsavar, the "humsavar" file), a file concerning all human variants annotated in UniProtKB/Swiss-Prot, including information on disease association and disease name.
Ensembl [5] version 68 data sets were retrieved from ftp.ensembl.org/pub/current_fasta/. RefSeq [6] version 55 data sets were retrieved from ftp.ncbi.nih.gov/refseq/ or the NCBI taxonomy DB. IPI data sets were obtained from ftp.ebi.ac.uk/pub/databases/IPI/current/ and were the last versions produced in September 2011. The retrieval from these resources was done at the same time as the studied UniProt release.

MS-based proteomics repositories collection
PRIDE-identified peptides were downloaded from the PRIDE BioMart (http://www.ebi.ac.uk/pride/prideMart.do) and were filtered to retain only the peptides from human and mouse that were identified in at least five PRIDE experiments to compensate for the data heterogeneity present in PRIDE (following the same approach used in [12]). Numbers given for PRIDE human and mouse content (as those reported in Tables 2, 3 and 4) are based on these filtered results. The unfiltered PRIDE numbers (also considering peptides that were identified in fewer than five experiments) for human and mouse can be found in Supporting Information Fig. 1 and Supporting Information Table 3.
PeptideAtlas-identified peptides and GPMDB proteotypic peptides were obtained from www.peptideatlas.org/builds/ and ftp.thegpm.org/projects/xhunter/libs/eukaryotes/peptide/, respectively. Data from these three repositories were used as originally provided. The retrieval from the three resources was done at the same time as the studied UniProt release (see Table 1). The numbers reported in the tables as valuable evidence from the MS proteomics repositories refer to the presence of one specific peptide in at least one of the repositories. Therefore, unless noted otherwise, they do not refer to the concurrent presence of those specific peptides in the three repositories.
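The two notions of repository evidence used here (presence in at least one repository versus concurrent presence in all three) and the PRIDE experiment-count filter can be illustrated with a minimal Python sketch. The function names and the toy input format, a list of (peptide, experiment identifier) pairs, are assumptions made for illustration and do not reflect the actual PRIDE BioMart export.

```python
from collections import Counter

def filter_by_experiments(observations, min_experiments=5):
    """Keep peptides seen in at least `min_experiments` distinct experiments.

    `observations` is an iterable of (peptide, experiment_id) pairs
    (hypothetical format, chosen for this sketch).
    """
    # Deduplicate pairs so that repeated identifications within the same
    # experiment are counted only once.
    counts = Counter(peptide for peptide, _ in set(observations))
    return {peptide for peptide, n in counts.items() if n >= min_experiments}

def evidence_any(peptide, repositories):
    """True if the peptide is present in at least one repository."""
    return any(peptide in repo for repo in repositories)

def evidence_all(peptide, repositories):
    """True if the peptide is concurrently present in every repository."""
    return all(peptide in repo for repo in repositories)

# Toy data: one peptide seen in five experiments, one seen repeatedly in one.
observations = [("TAYIAK", f"exp{i}") for i in range(5)] + [("QISFVK", "exp1")] * 7
pride = filter_by_experiments(observations)
peptide_atlas = {"TAYIAK", "QISFVK"}
gpmdb = {"QISFVK"}
repositories = [pride, peptide_atlas, gpmdb]
```

In this toy case, TAYIAK and QISFVK both have evidence in at least one repository, but neither is concurrently present in all three.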

DB pairwise comparisons
After the initial in silico tryptic digestion of the protein data sets, only peptides with six or more amino acids (AAs) were considered for the analysis, as shorter peptides are rarely detected in MS bottom-up proteomics pipelines and lack sequence-specific information [13]. Tryptic peptide pairwise comparisons were performed in a similar way to what has been previously described [12]: full tryptic cleavage, no missed cleavages, and no initiator methionine cleavage. Details on nontryptic and missed cleavage containing peptides are given in the Supporting Information Notes.

Table legend: Each pairwise comparison is delimited by wider spacing after each "Com." occurrence, and the two data sets (DB) being compared are indicated next to the numbers of peptides unique to each of them. "Peptides" indicate the number of tryptic peptides for each of the three compartments of the comparisons (I, II, and III in Supporting Information Fig. 2). "Com." indicates the number of tryptic peptides shared by both data sets in each pairwise comparison. Corresponding percentages of peptides that are found in MS proteomics repositories are reported in brackets for each of the three compartments of the comparisons. Mouse comparisons are highlighted with a light gray background.
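The digestion settings described in this section (full tryptic cleavage, no missed cleavages, no initiator methionine removal, minimum length of six residues) can be sketched in Python as follows. This is an illustrative sketch, not the code used in the study. In particular, suppressing cleavage before proline is an assumption about the cleavage rule, exposed here as the hypothetical `proline_rule` parameter.

```python
def tryptic_digest(sequence, min_length=6, proline_rule=True):
    """Fully digest `sequence` with trypsin, allowing no missed cleavages.

    Cleavage occurs C-terminal to K or R. When `proline_rule` is True
    (an assumption of this sketch), cleavage is suppressed if the next
    residue is P. The initiator methionine is left in place.
    Only peptides with at least `min_length` residues are returned.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last:
            if proline_rule and sequence[i + 1] == "P":
                continue  # no cleavage before proline
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]

# Toy sequence: the short fragments MK and QR fall below the length cutoff.
example = tryptic_digest("MKTAYIAKQRQISFVKSHFSRPLEER")
```

Setting `proline_rule=False` mimics a trypsin/P-like rule, under which the R-P bond in the toy sequence is also cleaved and the two resulting five-residue fragments are discarded by the length filter.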
These comparisons split all the tryptic peptides coming from the two data sets being compared (generally DB1 and DB2) into three categories (denoted as I, II, and III in Supporting Information Fig. 2): unique to DB1, shared by both data sets, and unique to DB2. These three lists for each pairwise comparison were used as input to query the MS proteomics repositories.
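The three-way split is a plain set operation on the peptide sequences, as the following Python sketch shows. The function names are hypothetical, and the labels I, II, and III are assigned in the order given in the text (unique to DB1, shared, unique to DB2), which is assumed to match Supporting Information Fig. 2.

```python
def compare_peptide_sets(db1_peptides, db2_peptides):
    """Split the peptides of two data sets into three compartments."""
    db1, db2 = set(db1_peptides), set(db2_peptides)
    return {
        "I": db1 - db2,    # unique to DB1
        "II": db1 & db2,   # shared by both data sets ("Com.")
        "III": db2 - db1,  # unique to DB2
    }

def evidence_percentage(peptides, repository_peptides):
    """Percentage of `peptides` with an exact match in the repository set."""
    if not peptides:
        return 0.0
    hits = sum(1 for peptide in peptides if peptide in repository_peptides)
    return 100.0 * hits / len(peptides)

# Toy peptide lists standing in for two digested data sets.
compartments = compare_peptide_sets(
    ["TAYIAK", "QISFVK", "SHFSRPLEER"],
    ["QISFVK", "LVEELGTAAK"],
)
repository = {"QISFVK", "SHFSRPLEER"}
percentages = {name: evidence_percentage(peps, repository)
               for name, peps in compartments.items()}
```

Each compartment is then queried against the repository peptides independently, which is how the bracketed percentages in the comparison tables can be read.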
Upon in silico digestion, the sequences of the tryptic peptides corresponding to each entry in the protein data set were recorded, together with their DB accession number and monoisotopic mass [14] (see Supporting Information Notes for details concerning ambiguous and nonstandard residues). By comparing the DB accession numbers corresponding to the three groups of peptides coming from the comparison of two protein data sets (indicated as I, II and III in Supporting Information Fig. 2), it is also possible to identify the accession numbers from DB1 that do not have a sequence representative in DB2, thus changing the focus from peptide sequence to DB accession numbers. As can be seen in Supporting Information Fig. 2, only filtered peptides coming from the in silico digestion of the protein data sets can be matched to the peptides coming from repositories: the criterion is an exact (100%) sequence match over the entire length of each peptide.
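A minimal Python sketch of this bookkeeping is given below: each peptide is stored with its monoisotopic mass and the accession numbers that produce it. The residue masses are the standard monoisotopic values for the 20 common amino acids. Rejecting any other residue (for example X, B, Z, or U) is a simplification of this sketch and is not the treatment described in the Supporting Information Notes. All function names are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one water per peptide (free N- and C-termini)

def monoisotopic_mass(peptide):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS
    except KeyError as err:
        # Simplification: ambiguous and nonstandard residues are rejected.
        raise ValueError(f"unsupported residue {err.args[0]!r}") from None

def index_by_peptide(digested):
    """Map each peptide to its mass and the accessions that produce it.

    `digested` maps an accession number to its list of tryptic peptides.
    """
    index = {}
    for accession, peptides in digested.items():
        for peptide in peptides:
            entry = index.setdefault(
                peptide,
                {"mass": monoisotopic_mass(peptide), "accessions": set()},
            )
            entry["accessions"].add(accession)
    return index

def accessions_for(peptides, index):
    """Accession numbers contributing at least one of the given peptides."""
    found = set()
    for peptide in peptides:
        found |= index[peptide]["accessions"]
    return found

index = index_by_peptide({"ACC1": ["TAYIAK", "QISFVK"], "ACC2": ["QISFVK"]})
```

Calling `accessions_for` on the peptides of compartment I returns the DB1 entries that contribute peptides absent from DB2, which is one way to move from the peptide view to the accession view.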

Results and discussion
In this study, the tryptic search space of the different UniProt sets was compared (Table 2). In addition, UniProt complete proteomes data sets were also compared with IPI, RefSeq, and Ensembl (Tables 3 and 4).
By using UniProtKB sequence data sets containing canonical plus isoform sequences, or only canonical sequences, it was possible to verify the amount of extra information provided by isoforms and to associate it with the current evidence available in MS proteomics repositories. The same reasoning was applied for variant-expanded data sets and for sequence clustering. Human and mouse were the main focus of the study for these expanded data sets since there is not enough information for the other organisms in UniProtKB for protein isoforms and variant content (Supporting Information Table 4).
In the context of protein inference [15], a unique peptide is a peptide that can be unambiguously assigned to a single protein sequence or a group of proteins coming from the same gene (although this second possibility was not explored here). Hence, peptide uniqueness is dependent on the collection of protein sequences considered. These topics, together with exact sequence redundancy (which makes two proteins indistinguishable by MS approaches), underline the importance of having a complete and clear view on the information provided by different data sets.
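Peptide unicity within a data set can be computed by counting, for each peptide, the number of distinct sequences that produce it. The Python sketch below uses hypothetical names and toy accessions. It also shows how two entries with exactly the same sequence leave neither with a unique peptide.

```python
from collections import defaultdict

def peptide_unicity(digested):
    """Return the unique peptides and the percentage of unique peptides.

    `digested` maps an accession number to its list of tryptic peptides.
    A peptide is unique when exactly one sequence produces it; repeats
    within the same sequence do not affect unicity.
    """
    owners = defaultdict(set)
    for accession, peptides in digested.items():
        for peptide in peptides:
            owners[peptide].add(accession)
    unique = {p for p, accessions in owners.items() if len(accessions) == 1}
    percentage = 100.0 * len(unique) / len(owners) if owners else 0.0
    return unique, percentage

# ACC2 and ACC3 are exact duplicates, so neither has a unique peptide.
toy = {
    "ACC1": ["TAYIAK", "QISFVK", "TAYIAK"],
    "ACC2": ["QISFVK", "LVEELGTAAK"],
    "ACC3": ["QISFVK", "LVEELGTAAK"],
}
unique_peptides, unicity = peptide_unicity(toy)
```

The result depends entirely on which sequences are in the collection: removing ACC3 from the toy data set would make LVEELGTAAK unique to ACC2.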

Sequence redundancy removal
Sequence redundancy does not help in the identification of a protein, since it adds peptide-protein mapping ambiguity on top of the MS-inherent identification ambiguities [16]. Nevertheless, when it is not limited to exact entire entries, sequence redundancy removal from protein data sets eliminates parts of sequences that can produce peptides upon cleavage, which also hinders identifications. In order to explore the effect of the removal of sequence redundancy, we show in detail the effect of sequence clustering on the UniProtKB sequences. In the tables, data were reported for the UniRef100 clustering of the human and mouse CPI, CPID (only human), CPIV, UPI, UPID (only human), and UPIV data sets (Table 1 for details and abbreviations). Sequence clustering of the human UniProt UPI data set removed 41 458 (28%) sequences (corresponding to the UPI vs. UPIR data sets in Supporting Information Table 1). In terms of tryptic peptides, UPIR had 16 243 fewer peptides than UPI (Table 5). Accordingly, peptide unicity was around 33 and 37% for UPI and UPIR, respectively (Table 5). A total of 15 125 of the 16 243 peptides were uniquely produced from single UPI sequences, 5412 from the N-terminal ends, and 7099 from the C-terminal ends. Table 2 (UPI/UPIR comparison) shows that the evidence in MS proteomics repositories for the 16 243 lost peptides is low. Similar trends were found for the mouse UPI/UPIR comparison and all the other UniProt human and mouse data sets where redundancy was removed: for example, the comparisons CPI/CPIR, CPID/CPIDR (only human), CPIV/CPIVR, UPID/UPIDR (only human), and UPIV/UPIVR (Tables 2 and 5; Supporting Information Tables 1 and 5).
In the comparisons between human and mouse data, for example, the UniProtKB protein sets (both with and without sequence redundancy) and the corresponding data sets from other providers (RefSeq, Ensembl, and IPI; Supporting Information Table 6), the number of peptides unique to the non-UniProtKB data sets always increased after redundancy removal, together with a corresponding increase of the evidence in MS proteomics repositories (which is in general low for these peptides). This indicates a loss of sequence information in UniProtKB upon sequence redundancy removal.
The loss of peptides resulting from protein cleavage during redundancy removal by clustering is not limited to UniRef100. The reason is exemplified for trypsin cleavage in Supporting Information Fig. 3, where sequence A is merged with sequence B during clustering. This process leads to the loss of the peptide indicated in gray. If sequence A consisted of two distinct sequences (divided at the gap), these two sequences would still be merged with sequence B and the two peptides lost would be the gray ones located at the extremities (one nontryptic and the other tryptic in this example). These losses can occur in any part of the sequence, not only in the central portion as schematized in the figure. Even though search engines have the option to specify the cleaving details (e.g., the specificity) of the proteolytic agent, a question remains whether these types of lost peptides have been properly addressed in the reprocessing efforts performed by MS proteomics repositories (as shown in the repository evidence in Table 2 while comparing UniRef100 data sets with the corresponding nonclustered ones). DB comparisons help to quickly fish out these lost peptides. This information, together with peptide unicity, would provide the list of peptides on which to focus during reprocessing efforts.
There is very little supporting evidence in the MS repositories for these peptides, and the reason could simply be that not many searches have been done to track them down. Therefore, it seems advisable to search spectral data sets to check for strong matches against these peptides before deciding to remove them from protein sequence collections.
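The simplest case of this loss, a fragment entry that is merged into a longer representative sequence, can be reproduced with a toy Python example. The sequences below are invented and the sketch does not reproduce Supporting Information Fig. 3 or the UniRef clustering pipeline. It only shows why the peptides at the ends of the merged sequence disappear from the search space. Suppressing cleavage before proline is an assumption of the sketch.

```python
def tryptic_peptides(sequence, min_length=6):
    """Full tryptic digestion (cleave after K/R, not before P; assumed rule)."""
    peptides, start = [], 0
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return {p for p in peptides if len(p) >= min_length}

def peptides_lost_on_merge(fragment, representative):
    """Peptides of `fragment` that vanish when only `representative` is kept."""
    if fragment not in representative:
        raise ValueError("fragment is not contained in the representative")
    return tryptic_peptides(fragment) - tryptic_peptides(representative)

# Invented sequences: `fragment` is an exact subsequence of `representative`.
representative = "MSTNPKPQRLVEELGTAAKSWDGHYFMNQEAIKTTPLVGGSDR"
fragment = "ELGTAAKSWDGHYFMNQEAIKTTPLVGG"
lost = peptides_lost_on_merge(fragment, representative)
```

Here the internal peptide SWDGHYFMNQEAIK survives the merge, whereas the two terminal peptides of the fragment (ELGTAAK and TTPLVGG) are lost because the representative yields the longer peptides LVEELGTAAK and TTPLVGGSDR instead.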

Variant expansion
Next, natural variants for human and mouse were added into the DB comparisons. Human variation information was taken into account both in its entirety in UniProtKB and as a subset containing only the disease-related variants as explained in Materials and methods. This was done also to reduce sequence redundancy with respect to the expansions produced with all variations. In UniProt release 2012_10, there were 1871 UniProtKB/Swiss-Prot human entries (15.1% of the entries in humsavar) that were directly linked to disease. These entries carried 22 743 distinct feature IDs (34.2% of the total feature IDs in humsavar). Of a total of 67 102 variant entries in humsavar, 338 (0.5%) were associated with I/L variations that are difficult to target with standard proteomics MS approaches, and 50 of those 338 were associated with disease.
Disease-related variant expansion for human created 25 531 additional tryptic peptides (SPI vs. SPID data sets in Table 5) from the 48 116 additional sequences (SPI vs. SPID data sets in Supporting Information Table 1). Of those 48 116, 23 743 (49%) were created in the canonical sequences (1871 distinct ones), whereas 24 373 (51%) were created in the isoforms (851 distinct corresponding canonical sequences). The evidence coming from MS proteomics repositories can be observed in Table 2 (SPI/SPID comparison). The corresponding numbers for the normal expansion (not limited to disease-related) are reported below.
Regarding the coincidences that might occur between UniProtKB/TrEMBL human tryptic peptides and the additional UniProtKB/Swiss-Prot tryptic peptides generated by variant expansion, it is noteworthy that 4243 UniProtKB/Swiss-Prot peptides generated from the variant expansion had the same sequence as an identical number of UniProtKB/TrEMBL tryptic peptides. Considering that UniProtKB/Swiss-Prot variant expansion produces 61 830 additional tryptic peptides (SPI vs. SPIV data sets in Table 5), this corresponded to 6.8% of the additional peptides coinciding. This percentage went down to the range of 0.5-0.7% if the total amount of peptides produced from human UniProt data sets (SP, SPI, SPIV, TR, UPI, and UPIV) were considered.

Table legend: For each organism and each data set, the total number of tryptic peptides is reported together with the percentage of unique peptides in brackets.

These 61 830 additional tryptic peptides (SPI vs. SPIV data sets in Table 5) came from 126 852 additional sequences (SPI vs. SPIV data sets in Supporting Information Table 1). Of those, 66 103 were from UniProtKB/Swiss-Prot canonical sequences (12 437 distinct ones), whereas 60 749 were from UniProtKB/Swiss-Prot isoforms (5321 distinct corresponding canonical sequences). The evidence coming from MS proteomics repositories can be observed in Table 2 (SPI/SPIV comparison).
From the UniProtKB/Swiss-Prot perspective, in addition to the 61 830 human additional peptides created by the variant expansion, there were 770 additional peptides that by chance coincided with peptides from the SPI DB. Therefore, the level of peptide coincidence upon variant expansion within UniProtKB/Swiss-Prot was also negligible. These peptides could be found by comparing the SPIV DB with an equivalent data set, where all the additional variant-containing sequences had been substituted by their corresponding canonical or isoform ones.
The observed effects in mouse data were different. Due to the substantially lower amount of mouse variation data available (Supporting Information Table 4), the effect of UniRef100 clustering on the CPIV data set resulted in a number of entries in the corresponding CPIVR data set (Supporting Information Table 1), which was lower than the number of entries in CPIR. As shown before, this trend is the opposite of that observed for human.
With respect to the sequence redundancy introduced by variant expansion, it might not dramatically affect protein grouping in the process of inferring proteins. For instance, in passing from human UPI to UPIV, there is an 86% increase in the number of sequences (from 148 042 to 274 894; see Supporting Information Table 1), which corresponds to a 25% decrease in the number of data set unique peptides. This might indicate that the additional sequence redundancy introduced should be mainly found among those entries that are being expanded with variation data.
It is noteworthy here that the disease-related variant expansion human DB UPID had considerably fewer entries than the UPIV database (Supporting Information Table 1) and that the sequence unicity of UPID became very similar to the UPI one (Table 5). In terms of MS proteomics repository content, only very few variant-containing peptides were found for UPIV and consequently, the same applies for UPID.
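The idea behind variant expansion can be sketched in Python as follows. The sketch assumes that each single-residue variant yields one additional sequence carrying only that substitution; it is not a description of what the "varsplic.pl" script does internally. The `Variant` record, its `disease_related` flag, and the function names are hypothetical.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    position: int          # 1-based residue position
    original: str          # residue expected in the reference sequence
    replacement: str       # residue introduced by the variant
    disease_related: bool  # hypothetical flag, e.g. derived from humsavar

def apply_variant(sequence, variant):
    """Return a copy of `sequence` carrying a single substitution."""
    idx = variant.position - 1
    if sequence[idx] != variant.original:
        raise ValueError(f"expected {variant.original} at {variant.position}")
    return sequence[:idx] + variant.replacement + sequence[idx + 1:]

def expand_variants(sequence, variants, disease_only=False):
    """One additional sequence per variant (assumption of this sketch)."""
    selected = [v for v in variants if v.disease_related or not disease_only]
    return [apply_variant(sequence, v) for v in selected]

reference = "MKTAYIAK"
variants = [Variant(5, "Y", "F", True), Variant(6, "I", "L", False)]
all_expanded = expand_variants(reference, variants)
disease_expanded = expand_variants(reference, variants, disease_only=True)
```

Restricting the expansion with `disease_only=True` produces fewer additional sequences, analogous to the disease-related data sets compared with the full variant expansion. The second toy variant is an I/L substitution: the resulting peptide has the same residue mass as the reference peptide, which is why such variants are hard to target with standard MS approaches.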
A possible limitation of the content of MS proteomics repositories is that "if you don't search for it, you'll never find it." If the data sets used for the searches (by submitters to PRIDE or during the reprocessing for GPMDB and PeptideAtlas) do not contain variation information, it is not possible to find evidence for variant-containing peptides in the repositories.

UniProtKB versus IPI: MS proteomics repositories content
Since IPI has been extensively used by the proteomics community, we report here the comparison between UniProtKB and IPI. Details of the comparisons between IPI, Ensembl, and RefSeq against the UniProtKB complete proteomes can be found in the Supporting Information Notes.
In terms of content of MS proteomics repositories, not many peptides were missing from UniProtKB when compared to the corresponding IPI data sets. From Table 4 it can be seen that when comparing IPI to UPI, 3.9% human and 2.6% mouse peptides do not have an equivalent sequence in UniProtKB. In addition, only a very small proportion of these peptides had MS repository evidence, namely 240 (0.7%) and 285 (1.5%) peptides, respectively (see panels A and B in Supporting Information Fig. 1). The evidence was even less for the peptides that are concurrently found in the three repositories (the central intersections of the Venn diagrams in Supporting Information Fig. 1). Panels C and D in Supporting Information Fig. 1 show that, when comparing them to, respectively, panels A and B, the filtering strategy applied to PRIDE (peptides present in at least five different PRIDE experiments) did not significantly affect the results in terms of the peptides concurrently found in the three repositories with respect to the changes in the number of peptides exclusive to PRIDE.
When compared with SPI, the number of peptides unique to IPI increased more than threefold for human (1547 peptides with evidence) and slightly less than ninefold for mouse (9823 with evidence). So, it can be concluded that the highest contribution in terms of IPI coverage comes from the UniProtKB/TrEMBL data sets.

Peptide unicity for the human UniProtKB UPI data set
Among the UniProtKB human protein sets, the unicity of a tryptic peptide within the UPI data set is the most conservative way to evaluate it. Evaluation of the unicity in the variant-expanded data sets, such as UPIV and UPID, would result in an excessive penalization caused by the variant-expanded entries. Table 5 and Supporting Information Table 5 show that peptide unicity ranged from 19.8 (human SPIV) to 97.6% (mouse SP) for UniProtKB, and from 33.4 (Ensembl human) to 75.7% (RefSeq mouse) for the other DBs.
In the Supporting Information Notes, details on the human peptides containing the ambiguous residues X, B, and Z and their effect on peptide unicity are reported, together with the related information from MS proteomics repositories.
Excluding the X-, B-, and Z-containing unique peptides, 81 444 UniProtKB human sequences (55.0% of a total of 148 042) were found with at least one unique tryptic peptide. Of those, 19 756 were from UniProtKB/Swiss-Prot (24.3%; 11 153 canonical and 8603 isoforms) and 61 688 from UniProtKB/TrEMBL (75.7%). Of the 81 444 entries, 68 224 (83.8%, 6715 from UniProtKB/Swiss-Prot and 61 509 from UniProtKB/TrEMBL) had a protein existence (PE) value other than 1 ("Evidence at protein level"; for details about PE see www.uniprot.org/manual/protein_existence). The highest number of unique peptides per sequence was 246 for the UniProtKB/Swiss-Prot entry Q14204 (4646 AAs), which ranked in position 126 among all the human UniProtKB entries (i.e., the human UPI data set), sorted by decreasing sequence length. A unique tryptic peptide that was repeated many times inside the same sequence was LTMMGTR, which was found 27 times in the sequence Q6ZWG8.
After removal of the X-, B-, and Z-containing peptides, the unique peptides left for the human UPI data set were 252 124, of which 6% (15 234) were isoform-specific unique peptides. This highlights the importance of including isoforms in sequence collections. In addition, 55 994 additional unique peptides (thus bringing the total to 308 118) from 4447 UniProtKB/Swiss-Prot entries (2290 of which are not among the 81 444 mentioned above) were found when a peptide was still considered unique if it was found among different isoforms (canonical sequence included) of the same UniProtKB/Swiss-Prot entry. These "entry-specific" unique peptides were not considered further in this numerical analysis, but are key to evaluate gene-level peptide unicity.
In order to use data only from the peptides that are present in the MS proteomics repositories, we first examined the peptide lengths they contain: the GPMDB human peptides had a length up to 51 AAs, PeptideAtlas up to 66, and PRIDE up to 84 (only seven peptides are longer than 66 AAs and they all contain tryptic missed cleavages). Finally, PRIDE (filtered for five experiments as explained in Materials and methods) contained peptides up to 66 AAs. So, we decided to explore unique peptides of length up to 66 AAs.
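The distinction between sequence-level and entry-specific unicity can be expressed as a small classification over the accessions that share a peptide. The Python sketch below relies on the UniProtKB convention that isoform identifiers extend the entry accession with a hyphen and a number. The accessions in the toy data are made up and the function names are hypothetical.

```python
def base_accession(accession):
    """Strip the isoform suffix, e.g. 'P11111-2' -> 'P11111'."""
    return accession.split("-")[0]

def classify_peptide(accessions):
    """Classify a peptide from the set of sequences that produce it."""
    if len(accessions) == 1:
        return "sequence-unique"
    if len({base_accession(acc) for acc in accessions}) == 1:
        # Shared only among isoforms (canonical included) of one entry.
        return "entry-specific"
    return "shared"

# Toy mapping of peptides to the (made-up) accessions producing them.
owners = {
    "TAYIAK": {"P11111-2"},
    "QISFVK": {"P11111", "P11111-2"},
    "LVEELGTAAK": {"P11111", "Q22222"},
}
classes = {peptide: classify_peptide(accs) for peptide, accs in owners.items()}
```

Under this scheme, a peptide found only in one isoform counts as isoform-specific, while a peptide shared by several isoforms of the same entry is unique at the entry level but not at the sequence level.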
The number of tryptic unique peptides (ambiguous sequences excluded, as before) from the UPI data set with a length up to 66 AAs was 248 675. Among these, 30 peptides contained "U" residues (selenocysteine) and only one of these had experimental evidence in PRIDE. In total, 20 848 of these peptides had experimental evidence in at least one of three MS proteomics repositories.
In total, 4197 (1.7%) of the 248 675 peptides concurrently have evidence in all the MS proteomics repositories. They come from 1302 UniProtKB entries (1246 UniProtKB/Swiss-Prot canonical sequences, 10 UniProtKB/Swiss-Prot isoforms, and 46 UniProtKB/TrEMBL sequences). The number of unique peptides per entry ranged from 1 to 131. The PE values for the UniProtKB/Swiss-Prot entries ranged from 1 to 5 (57 entries with PE other than 1) and from 1 to 4 for UniProtKB/TrEMBL entries (41 entries with PE other than 1).
In conclusion, even in this conservative situation (human, which is the best annotated species; in silico digestion of the protein data set with only one cleaving agent without missed cleavages; excluding from the digested protein data set entry-specific unique peptides and X-, B-, and Z-containing ambiguous peptides; evaluation of tryptic peptide unicity within the UniProt UPI data set; PRIDE content filtered to five experiments; repository content up to 66 AAs in length and finally concurrent evidence from the three MS proteomics repositories) there is room for enhancement of the PE-value assignment in UniProt. For instance, one UniProtKB/Swiss-Prot entry (O75558) has a PE-value of 2, having nine unique tryptic peptides found in the three MS proteomics repositories. Two other UniProtKB/Swiss-Prot entries (O60361 and Q9H853) have a PE-value of 5 with one unique tryptic peptide found in the three MS proteomics repositories. Finally, twelve UniProtKB/TrEMBL entries (A2NJV5, A8MUW5, D3DTH7, E7EVA3, E9PAU2, E9PGZ2, H0Y4K8, H0Y7A7, H0Y8X4, Q0ZCH6, Q5NV62, and Q5NV86) have a PE-value of 4, having one unique tryptic peptide found in the three MS proteomics repositories.

UniProtKB versus Ensembl, IPI, or RefSeq
When comparing UniProtKB, either complete proteomes (Table 3) or other data sets (Table 4 and Supporting Information Table 6) to other DBs, some general considerations can be made. The amount of extra sequence information provided by the human and mouse isoforms included in UniProtKB/Swiss-Prot and UniProtKB/TrEMBL was evident by looking closely at the results from the pairwise comparisons performed against CPI (or SPI) data sets with those performed against CP (or SP) data sets, or in the comparisons performed against SPI data sets with those performed against the UPI ones. For instance, in the case of human, the biggest effect of UniProtKB/Swiss-Prot isoforms was a 43% decrease in the number of peptides unique to RefSeq when compared to the UniProtKB/Swiss-Prot canonical sequences alone. In the case of mouse, the largest effect of UniProtKB/TrEMBL entries was a 98% reduction in the number of peptides unique to Ensembl with respect to UniProtKB/Swiss-Prot alone.
In addition, the loss of information upon human sequence clustering can be as large as an approximately fourfold increase in peptides unique to Ensembl, when looking at the results from the comparisons UPI/Ensembl versus UPIR/Ensembl. To summarize, the CPI data sets matched well with the Ensembl data sets (coming directly from the corresponding genomes), but matched increasingly less well with IPI and RefSeq. In addition, the UPI data sets carried more sequence information than the corresponding CPI data sets.

Concluding remarks
From the analyses performed in this study, the following main conclusions can be drawn: (i) If a maximal sequence coverage (also compared to the previously generated IPI data sets) is sought, then the whole UniProtKB content for the corresponding organisms should be used (UniProtKB UPI data sets). (ii) If a maximal correspondence with the current underlying Ensembl genome is sought, then UniProtKB complete proteome sets should be used. Indeed, these sequence collections are the proposed substitutes of IPI [12]. Table 3 shows that the added value of the UniProtKB complete proteomes (CPI data sets) is due to the well-established pipelines between UniProt and Ensembl [2]. In fact, the number of peptides unique to Ensembl shown in Table 3 is the lowest one, when compared to the peptides unique to RefSeq and IPI, meaning the highest concordance between CPI data sets and Ensembl. Furthermore, UPI data sets (whole UniProt organism-specific content) contain additional peptide-level sequence information compared to the CPI data sets for human and mouse (Table 4; see Supporting Information Notes "UniProt complete proteomes and other sequences"). (iii) If variation data need to be considered in a given study, then a variant-expanded data set is the proper choice. In this case, an additional choice consists in focusing on a subset of variation considering only the ones directly linked to disease, obtaining a data set focusing on detrimental variations with a lower sequence redundancy in the variant-expanded sequence data set. Human variation data have increased as UniProt has developed a pipeline to import high-quality 1000 Genomes [17] and COSMIC [18] nonsynonymous single AA variants from Ensembl variation [19]. (iv) If sequence redundancy is critical in the analysis, then an organism-specific UniRef100 clustered data set is the proper choice, bearing in mind that some tryptic peptides are lost during the sequence clustering.
Furthermore, organism-specific files similar to the UniRef100 sequence collections are used in the initial steps of the construction of the peptide spectral libraries from the National Institute of Standards and Technology (peptide.nist.gov).
In order to address the peptide loss, it would be useful to exploit global mass spectra reprocessing from MS proteomics repositories to check for evidence on the peptides removed during the clustering process. Another way to address sequence redundancy, while not affecting the peptide content, is to remove only those entries that have exactly the same length and sequence. (v) UniProtKB/Swiss-Prot isoform sequences and the appropriate UniProtKB/TrEMBL sequences are included in the UniProtKB complete proteome sets. The analysis of peptides present in isoforms has shown that it is highly advisable not to discard isoforms. (vi) The analysis of the UniProtKB/TrEMBL peptide content of the complete proteome sets has shown that it is important to include UniProtKB/TrEMBL sequences in the searches to provide a broader sequence coverage. By not doing so, a large amount of valuable sequence information will be lost (Supporting Information Fig. 4). The lack of manual annotation in UniProtKB/TrEMBL entries is not a valid reason to discard these sequences a priori. This is particularly important for MS matching where, in many cases, missing a sequence in the protein sequence data set might cause missing matches for good spectra. (vii) The mismatch between the number of tryptic peptides from sequence data sets and the fewer peptides that have evidence in the MS proteomics repositories could be reduced by considering both peptides with missed cleavage sites and peptides obtained with other cleaving rules (trypsin/P and other cleaving agents). Even so, the gap would be addressed in a much more efficient way by regularly performing a global spectral reprocessing against appropriate and updated sequence collections. We strongly recommend the reprocessing of large collections of MS spectra using sequence collections from UniProtKB that contain as much sequence information as possible.
(viii) Integrating the tryptic search space (or a search space from any other cleaving agent) from protein data sets with the peptide-level identifications from the main MS proteomics repositories and the unicity evaluation is a way of adding annotations to the corresponding proteins. One clear example is the use of this information to annotate protein sequences in UniProtKB to assign the appropriate PE-value and to enrich the discovery of sequences of interest and focus curation efforts. Isoform- or variant-specific unique peptides can also be identified for annotation purposes. (ix) These types of analyses can thus help in deciding what to import into UniProtKB from other data sets with the added value of MS evidence and can detect potential proteotypic/quantotypic peptide candidates to be used in targeted proteomics workflows, such as SRM, including isoforms and sequence variants.
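The alternative form of redundancy removal mentioned among the conclusions, in which only entries with exactly the same length and sequence are collapsed, can be sketched in Python as follows. Keeping the first accession encountered as the representative is an arbitrary choice made for this sketch, and the function name and input format are hypothetical.

```python
def remove_exact_duplicates(entries):
    """Collapse entries whose sequences are exactly identical.

    `entries` is an iterable of (accession, sequence) pairs. Returns a list
    of (representative_accession, sequence, merged_accessions) tuples, in
    order of first appearance. Fragments and subsequences are NOT merged,
    so the set of peptides produced by digestion is unchanged.
    """
    by_sequence = {}
    for accession, sequence in entries:
        by_sequence.setdefault(sequence, []).append(accession)
    return [(accs[0], seq, accs[1:]) for seq, accs in by_sequence.items()]

# Toy entries: ACC3 duplicates ACC1 exactly; ACC2 is only a fragment of ACC1.
entries = [
    ("ACC1", "MSTNPKPQRLVEELGTAAK"),
    ("ACC2", "ELGTAAK"),
    ("ACC3", "MSTNPKPQRLVEELGTAAK"),
]
nonredundant = remove_exact_duplicates(entries)
```

In the toy data, the exact duplicate is collapsed while the fragment entry is retained, so its terminal peptide stays in the search space. Recording the merged accessions keeps the link back to the removed entries.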
In the near future, UniProt will integrate all the valuable experimental information coming from the MS-driven proteomics data publicly available in MS proteomics repositories.