Catsnap: a user‐friendly algorithm for determining the conservation of protein variants reveals extensive parallelisms in the evolution of alternative splicing

Summary Understanding the evolutionary conservation of complex eukaryotic transcriptomes significantly illuminates the physiological relevance of alternative splicing (AS). Examining the evolutionary depth of a given AS event with ordinary homology searches is generally challenging and time-consuming. Here, we present Catsnap, an algorithmic pipeline for assessing the conservation of putative protein isoforms generated by AS. It employs a machine learning approach following a database search with the provided pair of protein sequences. We used the Catsnap algorithm to analyze the conservation of emerging experimentally characterized alternative proteins from plants and animals, and found that most of them are conserved among other species. Catsnap can detect conserved functional protein isoforms regardless of the AS type by which they are generated. Notably, we found that while the primary amino acid sequence is maintained, the type of AS determining the inclusion or exclusion of protein regions varies across plant phylogenetic lineages in these proteins. We also document that this phenomenon is less common among animals. In sum, our algorithm highlights the presence of unexpectedly frequent hotspots where protein isoforms recurrently arise to carry out physiologically relevant functions. The user web interface is available at https://catsnap.cesnet.cz/.


Fig. S1 The outline of the Catsnap ML features.
Table S1 Animal species included in the reduced web-mode database of alternative isoforms.
Table S2 Conserved Arabidopsis thaliana AS events used as an initial source for the training set for the ML algorithm.

Table S3 AGI codes and accession numbers of validated plant alternative proteins.
Table S4 The full list of analyzed isoform pairs from animals, in the order corresponding to the graph presented in Fig. 3b.
Fig. S2 The snapshots of the Catsnap graphical output interface. (a) In the 'Sequence' mode, the candidate orthologous pairs can be downloaded in the FASTA format by choosing 'Download Sequences' from the menu. The .zip archive contains two data sets: all orthologous pairs of isoforms and the reduced list with only the most similar isoform per species. The isoforms (the reduced data set) can also be aligned directly in the web browser using the 'Muscle Alignment' icon. The resulting multiple sequence alignment supports basic editing (sequences can be deleted, renamed, or moved), can be reset with 'Restore All Sequences', or saved with the 'Save Alignment' icon. For additional instructions, refer to the 'Help' button. (b) The snapshot of the 'Structure' mode shows exon-intron schemes of the identified isoforms. They are drawn on the basis of the alignment from the 'Sequence' mode. Exons are indicated as rectangles, introns as lines. White regions inside exons correspond to gaps in the protein alignment (which may affect the true proportionality of the resulting schemes); where multiple exons are merged, they are joined by a solid frame. The 'Long Introns' and 'Compressed Introns' buttons switch between representations of the diagrams, allowing convenient viewing of the AS events, particularly in organisms with long introns (typically animals).

Fig. S11 Alternative splicing of Glu4 in various animals.

Fig. S13 Alternative splicing of CD40 in various animals.

Fig. S1 The outline of the Catsnap ML features. (a) An example AS event illustrating the features implemented by Catsnap. (b) A scheme of the feature emphasizing mutual exclusivity of AS regions between query and hit sequences. (c) and (d) A scheme of the feature weighing amino acid similarity of the AS region (c) and amino acid dissimilarity of the AS regions (d). (e) A diagram describing the location identifier used for the analysis of complex AS events.
Fig. S3 Schematic relationships of the main plant (a) and animal (b) phylogenetic groups. The group names used in the columns 'Evolutionary depth' in the main figures are indicated in pink.
Fig. S4 Alternative splicing of TTL from representative plant species. (a) An exon-intron scheme of the TTL 1- and TTL 2-splice variants from Arabidopsis thaliana. (b) Multiple species show AltA in the corresponding intron. (c) Amino acid sequence alignment of the region undergoing AS from selected species. Depending on the organism, AltA introduces either the peroxisome targeting signal (PTS) or a single glutamate residue (E).
Fig. S5 Alternative splicing of RCA in various plants. (a) A scheme of RCAα and RCAβ splice variants from Arabidopsis thaliana. (b) Conservation of the truncated RCAβ isoforms produced by various AS types, depending on the species, as returned by Catsnap. (c) Amino acid sequence alignment of the C-terminal parts of the RCAα and RCAβ proteins.
Fig. S6 Alternative splicing of JAZ10 in various plants. (a) A scheme of the JAZ10 gene producing canonical JAZ10.1, and alternative JAZ10.3 (encoded by two transcripts) and JAZ10.4 isoforms. AS affects the Jas motif at the C-terminus (boxed in green), required for protein-protein interactions crucial for jasmonate signaling. The Jas motif of JAZ10.3 is partially truncated due to a premature stop codon introduced by either AltD or IR. In JAZ10.4, the whole Jas motif is replaced by a frameshifted sequence, introduced by AltD in the third intron. (b) Conserved truncation of the Jas motif attributable to the JAZ10.3 isoform in the last intron results from various AS types in different species. (c) Alignment of selected C-terminal parts of the JAZ10.1 and JAZ10.3 amino acid sequences from various species. Arrowhead indicates the position of the respective exon junction.
Fig. S7 Alternative splicing and alternative transcription start sites of SGR5 in various plants. (a) A scheme of the SGR5α and SGR5β transcript variants from Arabidopsis thaliana. (b) The N-terminal truncation of SGR5β in different species originates from various AS types and alternative transcription start sites. (c) Amino acid sequence alignment of the N-terminal part of SGR5 from the species listed in (b).

Fig. S8 Alternative splicing of CPK28 in various plants. (a) A scheme of a triple IR event in the Arabidopsis thaliana CPK28. (b) Conserved isoforms lacking the encoded EF-hand motifs result from various AS types. (c) Amino acid alignment of selected CPK28 and CPK28-RI isoforms. Three of the four EF-hand domains responsible for the Ca2+-dependent activation, located near the C-terminus of the CPK28 kinase, are marked at the top of the alignment.
Fig. S9 Alternative splicing of PTB2 in various plants. (a) A scheme of the exon skipping event in the Arabidopsis thaliana PTB2. (b) The transcripts that include the premature termination codon (PTC) are conserved in eudicots and monocots and can involve various AS types. (c) A sequence alignment of the nominal proteins whose transcripts are subjected to NMD. The light green rectangle in (a) corresponds to the area of the alignment in (c). The asterisk marks the PTC with respect to the frame of the reference isoform. The nominal open reading frame in alternative isoforms is determined as the longest one in the sequence context (horizontal hatching), as defined by RefSeq.
Fig. S10 Alternative splicing of TFIIIA in various plants. (a) An exon-intron scheme of the exon skipping (ES) and the exon including (EI) transcripts of the TFIIIA gene in Arabidopsis thaliana. (b) The NMD-triggering PTC originates from various AS types in different species. (c) A sequence alignment of the predicted amino acid sequences, shortened by the PTC. The light green rectangle in (a) corresponds to the area of the alignment in (c). The asterisk marks the PTC. The open reading frame is determined as the longest one in the sequence context (horizontal hatching), as defined by RefSeq.
Fig. S12 Alternative splicing of Kif2a in various animals. (a) A scheme of the mouse (Mus musculus) canonical Kif2a.1 and alternative splice variant Kif2a.3, produced by AltD. (b) Evolutionary cladogram depicting AS of Kif2a in various animal phylogenetic groups. Protein isoforms corresponding to mouse Kif2A.3 are processed by AltD in tetrapods and basal teleosts; however, evolutionarily derived teleosts display ExS in this gene. (c) Exon-intron diagrams of representative Kif2a transcripts from tetrapods (Mus musculus), basal teleosts (Scleropages formosus), and euteleosts (Oncorhynchus mykiss). (d) Amino acid sequence alignment of the region of Kif2A processed by AS in the species presented in (c).

Fig. S14 Alternative splicing and alternative transcription start sites of various animal NOSTRIN genes. (a) A scheme of NOSTRINα and NOSTRINβ from human (Homo sapiens). (b) The N-terminal truncation in NOSTRIN is controlled by various AS types, including their combinations and AltTSS. (c) Amino acid sequence alignment of the N-terminal parts of NOSTRIN from the genes outlined in (b).

Table S1 Animal species included in the reduced web-mode database of alternative isoforms.