From chemoproteomic‐detected amino acids to genomic coordinates: insights into precise multi‐omic data integration

Abstract The integration of proteomic, transcriptomic, and genetic variant annotation data will improve our understanding of genotype–phenotype associations. Due, in part, to challenges associated with accurate inter‐database mapping, such multi‐omic studies have not extended to chemoproteomics, a method that measures the intrinsic reactivity and potential “druggability” of nucleophilic amino acid side chains. Here, we evaluated mapping approaches to match chemoproteomic‐detected cysteine and lysine residues with their genetic coordinates. Our analysis revealed that database update cycles and reliance on stable identifiers can lead to pervasive misidentification of labeled residues. Enabled by this examination of mapping strategies, we then integrated our chemoproteomics data with computational methods for predicting genetic variant pathogenicity, which revealed that codons of highly reactive cysteines are enriched for genetic variants that are predicted to be more deleterious and allowed us to identify and functionally characterize a new damaging residue in the cysteine protease caspase‐8. Our study provides a roadmap for more precise inter‐database mapping and points to untapped opportunities to improve the predictive power of pathogenicity scores and to advance prioritization of putative druggable sites.


Contents:
Appendix Table S1. Definitions of key terms
Appendix Figure S1. Data losses that result from re-mapping chemoproteomic datasets to new releases of Ensembl and UniProtKB.
Appendix Figure S2. UniProtKB Human Proteome ID counts in cross-referenced databases.
Appendix Figure S3. Comparison of single and multi-isoform UniProtKB protein mapping to identical Ensembl protein sequences, using Ensembl xref files.
Appendix Figure S4. Comparison of single and multi-isoform UniProtKB protein mapping to identical Ensembl protein sequences, using versioned ID xref files.
Appendix Figure S5. Sequence similarity between UniProtKB protein sequences and protein sequences associated with Ensembl stable IDs across releases.
Appendix Figure S6. Comparison of GRCh37 and GRCh38 CADD models for loss of cysteine and loss of lysine.
Appendix Figure S7. Correlation of pathogenicity scores for all possible non-synonymous SNVs at codons of detected or undetected cysteine and lysine residues.
Appendix Figure S8. CADD38 PHRED scores for all possible missense variants at CpD cysteine and lysine codons, stratified by Grantham score.
Appendix Figure S9
(Meyer et al, 2016; Huang et al, 2008)
2. Residue-residue mapping, a one-to-one correspondence between amino acids in proteins from different databases. (David & Yip, 2008; Martin, 2005; Dana et al, 2019)
3. Residue-codon mapping, a one-to-three correspondence between an amino acid and nucleotide coordinates (codon) in a reference genome.
Appendix Figure S3. Comparison of single and multi-isoform UniProtKB protein cross-references to Ensembl proteins, using the Ensembl xref files. Using five Ensembl xref files (Materials and Methods, Method A) containing only stable ID cross-references to UniProtKB IDs, protein sequences were compared for A) 1,466 single-isoform UniProtKB IDs and B) 2,487 multi-isoform UniProtKB IDs contained in our CpDAA-containing protein dataset.
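The residue-codon mapping defined above (one amino acid to three nucleotide coordinates) can be illustrated with a minimal sketch. It is not the procedure used in this study: the function name `residue_to_codon`, the exon-interval input format, and the toy coordinates are all assumptions made for illustration, and the sketch ignores complications such as incomplete coding sequences.

```python
from typing import List, Tuple


def residue_to_codon(residue_index: int,
                     cds_exons: List[Tuple[int, int]],
                     strand: int = 1) -> List[int]:
    """Return the three genomic positions encoding a 1-based residue index.

    cds_exons holds inclusive (start, end) genomic intervals covering only
    the coding sequence (hypothetical input format). strand is 1 for the
    forward strand and -1 for the reverse strand.
    """
    if residue_index < 1:
        raise ValueError("residue_index is 1-based")
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")

    # List coding positions in transcript order; on the reverse strand the
    # transcript runs from high to low genomic coordinates.
    exons = sorted(cds_exons, reverse=(strand == -1))
    positions: List[int] = []
    for start, end in exons:
        block = range(start, end + 1)
        positions.extend(reversed(block) if strand == -1 else block)

    # Residue i is encoded by coding bases 3*(i-1) .. 3*(i-1)+2 (0-based).
    offset = 3 * (residue_index - 1)
    codon = positions[offset:offset + 3]
    if len(codon) != 3:
        raise ValueError("residue index lies outside the coding sequence")
    return codon


# Toy coordinates: the codon for residue 2 spans an exon boundary.
toy_exons = [(100, 104), (200, 210)]
print(residue_to_codon(2, toy_exons))        # [103, 104, 200]
print(residue_to_codon(1, toy_exons, -1))    # [210, 209, 208]
```

A codon that spans an exon boundary, as in the first example, maps to non-contiguous genomic positions, which is one reason the three coordinates are kept explicitly rather than as a single start position.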
Appendix Figure S4. Comparison of single and multi-isoform UniProtKB protein cross-references to Ensembl proteins, using the UniProtKB mapping file. The UniProtKB mapping file (Materials and Methods, Method B) provided canonical protein isoform ID cross-references to Ensembl stable protein IDs. UniProtKB canonical proteins from the 2018_06 release were compared with Ensembl proteins from five releases. Sequence identity comparisons were performed for A) 1,466 single-isoform UniProtKB IDs and B) 2,487 multi-isoform UniProtKB IDs contained in our CpDAA-containing protein dataset.
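The kind of comparison summarized in these captions can be sketched as follows: for one UniProtKB canonical sequence, check whether the sequence attached to the same Ensembl stable ID is identical in each release, and whether a detected residue still sits at the same position. This is a minimal illustration under stated assumptions, not the study's pipeline; the function names, the release labels, and the toy sequences are hypothetical.

```python
from typing import Dict


def identical_by_release(uniprot_seq: str,
                         ensembl_by_release: Dict[str, str]) -> Dict[str, bool]:
    """Flag, per release, whether the Ensembl sequence equals the UniProtKB one."""
    return {release: seq == uniprot_seq
            for release, seq in ensembl_by_release.items()}


def residue_conserved(uniprot_seq: str, ensembl_seq: str,
                      position: int, expected: str) -> bool:
    """Check that a 1-based position carries the same residue in both sequences."""
    idx = position - 1
    if idx < 0 or idx >= len(uniprot_seq) or uniprot_seq[idx] != expected:
        raise ValueError("position/residue does not match the UniProtKB sequence")
    return idx < len(ensembl_seq) and ensembl_seq[idx] == expected


# Toy data: the same stable ID points to a shorter sequence in "release_B".
canonical = "MKTCAYLK"
releases = {"release_A": "MKTCAYLK", "release_B": "MKTAYLK"}

print(identical_by_release(canonical, releases))
# {'release_A': True, 'release_B': False}
print(residue_conserved(canonical, releases["release_B"], 4, "C"))  # False
```

In the toy example the stable ID is unchanged between releases, yet the cysteine at position 4 is no longer at that index in "release_B", which is the kind of position shift that an identifier-only lookup would not reveal.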
Appendix Figure S5. Sequence similarity between UniProtKB protein sequences and protein sequences associated with Ensembl stable IDs across releases. Heatmaps show A) normalized Hamming distance and B) normalized Levenshtein distance for sequence alignments of the protein sequences associated with the top 74 stable Ensembl gene, transcript, and protein IDs with an identical cross-referenced Ensembl protein sequence in one release, but non-identical sequences in additional releases. Scores range from 0 to 1, with 0 indicating identity with the canonical sequence in the 2018 UniProtKB CCDS release. Source data are provided in Source Data for S5 Table, which includes the 49 UniProtKB IDs that had no canonical sequence equivalent in all five Ensembl releases analyzed and CpDAA index differences for most detected cysteine or lysine positions.
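For readers unfamiliar with the two scores named in this caption, the sketch below shows one common way to compute them. The caption does not state the normalization, so dividing by the length of the longer sequence is an assumption here, as is requiring equal-length (already aligned) inputs for the Hamming distance. The function names and toy sequences are illustrative only.

```python
def normalized_hamming(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_hamming("MKCA", "MKSA"))      # 0.25 (one substitution in four)
print(normalized_levenshtein("MKCA", "MKA"))   # 0.25 (one deletion, length 4)
print(normalized_levenshtein("MKCA", "MKCA"))  # 0.0 (identical sequences)
```

Under these assumptions both scores fall between 0 and 1, with 0 for identical sequences, matching the range described in the caption. The Levenshtein score also tolerates insertions and deletions, so it stays small when a single indel shifts every downstream residue index.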