Using HHsearch to tackle proteins of unknown function: A pilot study with PH domains

Advances in membrane cell biology are hampered by the relatively high proportion of proteins with no known function. Such proteins are largely or entirely devoid of structurally significant domain annotations. Structural bioinformaticians have developed profile‐profile tools such as HHsearch (online version called HHpred), which can detect remote homologies that are missed by tools used to annotate databases. Here we have applied HHsearch to study a single structural fold in a single model organism as proof of principle. In the entire clan of protein domains sharing the pleckstrin homology domain fold in yeast, systematic application of HHsearch accurately identified known PH‐like domains. It also predicted 16 new domains in 13 yeast proteins many of which are implicated in intracellular traffic. One of these was Vps13p, where we confirmed the functional importance of the predicted PH‐like domain. Even though such predictions require considerable work to be corroborated, they are useful first steps. HHsearch should be applied more widely, particularly across entire proteomes of model organisms, to significantly improve database annotations.


| INTRODUCTION
An important way to determine the function of an unknown protein is to identify homology to a protein with known function. However, a significant minority (~20%) of protein sequences shows no obvious homology at the level of primary structure (sequence alone) to any other protein. 1 Tools such as Basic Local Alignment Search Tool (BLAST) begin to perform poorly when sequence identities within a single domain drop below 30%. 2 To overcome this problem, sequence-sequence searches have been supplemented with profilesequence searches such as PSI-BLAST, 3 which increase sensitivity by searching with a profile that contains statistical information from a multiple sequence alignment (MSA) created from the initial query. Even more sensitive tools carry out profile-profile searches, which extract family-wide information for both query and target. [4][5][6] Such tools involve the construction of profiles for all members of target sequence libraries, which requires considerable computational effort. Therefore, profile-profile searches tend to be restricted to a small range of targets, most often the Protein Databank of solved structures (PDB). Profiles can implicitly encode aspects of secondary structure, 7 but some tools go further by explicitly including secondary structure, [8][9][10] either relying on solved structures or predicting secondary structural elements. 11,12 There has been a steep decline in the discovery of completely new folds, 6 which suggests that many regions of unknown function will turn out to be distant structural homologues of known domains. It may be possible to detect these distant structural homologies by applying the most sensitive tools to all protein regions of unknown function. HHsearch is a powerful profile-profile alignment tool 13 that has not been applied widely or systematically.
Pleckstrin homology (PH) domains and their structural homologues, PH-like domains, are one of the most common folds in eukaryotes. 14 Although PH-like domains have no unified function, they have typical docking sites for other proteins or inositide lipids. 15 They are often involved in targeting peripheral membrane proteins involved in traffic and signaling. 16 Hence, prediction of new PH-like domains may identify residues that are functionally important for intracellular traffic. Since the initial discovery of classical PH domains, newly solved PH-like structures have unexpectedly been found, both early on, 17 and at least 10 times since. [18][19][20][21][22][23][24][25][26][27] Each of these discoveries added a new family of PH-like domains to a growing PH-like clan. 28 PH-like domains are therefore a good example of a structural fold with high sequence divergence.
The model organism S. cerevisiae is one of the most highly studied, especially in systems biology. Yet for 10%-13% of yeast proteins there is no functionally or structurally relevant domain information (TL, unpublished observation). Progress in finding functions for yeast proteins with no recognizable domain is slower than expected. 1 As an example that can be worked through in depth, we used HHsearch to study the PH-like domain clan (sometimes called superfamily) in yeast. After bench-marking HHsearch against PH-like domains with solved structures, we searched for PH-like domains in yeast.
HHsearch was more accurate than tools used to annotate databases, and it detected some new domains easily. We found even greater sensitivity by seeding searches with yeast protein sequences, although this required high volume automated offline searches. Overall, we discovered 16 new yeast PH-like domains. One of these is at the C-terminus of Vps13p, a conserved membrane contact site protein implicated in sorting along the endocytic pathway. The presence of this one PH-like domain was verified by demonstrating both that it has a specific targeting activity and that it is required for one aspect  Figure 1 and Movies S1-S3, Supporting Information), but no PSI-BLAST search seeded with a classical PH domain makes a significant alignment (ie, there is no hit) with GRAM domains and vice versa.
Each PH-like family is presumed to have diverged from a single ancestral protein. In addition there is some evidence that the whole clan diverged from a prokaryotic origin. 26 PH-like families (Table S1) have been defined by a variety of profile-sequence tools (Box 2), but given that there is a lag between discoveries and upgrading databases, we repeated these definitions.
Defining PH-like domains starts with mining the literature, the PDB for solved structures, and sequence databases. For a standard definition of the PH-like clan we used PSI-BLAST to study PDB, which contains >200 different PH-like domains. We described the minimal set of 39 PH-like sequences that identified the complete set of PH-like domains in PDB as strong hits in PSI-BLAST searches. This minimal set indicates the presence of 39 families in the PH-like clan ( Figure 2A and Table S1). This classification is more complete than definitions of the PH-like clan in databases (Box 2). The largest family consists of the classical PH domains (~50% of all sequences). Three families have~10% of the domains each: FERM-C, PTB and RanBD.
Non-significant hits indicated that some families share weak homology ( Figure 2A). However, the main feature was the presence of 20 small families (≤3 members) that were completely isolated from all others ( Figure 2A) Defining a PH-like domain PH-like domains are the fourth most common fold in the human proteome. 14 Their 8 structural elements (βββββββα) span 100-120 aa, making an orthogonal, slightly splayed beta sandwich (4 + 3) capped by the helix (Figure 1, Movies S1 and S2  Figure 3A). This indicates that the prob[SS] metric is a conservative estimate of accuracy. Alignments to non-PH-like domains differed from true positives by being shorter ( Figure 3B). In addition, non-PH-like domains had weaker secondary structural alignment (data not shown). These alignments were either to a single PH-like structural element, or to longer portions of the PH-like structure ( Figure S1). We conclude from this benchmarking that a threshold prob[SS] ≥85% will ensure close to 100% specificity ( Figure 3A).    Figure 2C). In addition, there were 5 hits with Vid27p produced top hits that were PH-like domains, and other tools also identified these regions as PH-like (Table S2B). Therefore, our PDB-to-yeast searches identified 4 new PH-like domains.
The fifth region identified was at the N-terminus of the ARF-GEF Age1p. When this was seeded into HHsearch or other tools there was no match to PH domains or any other strong hit (Table S2B).
Therefore, it is a false positive hit. We next investigated how such a strong false positive arose (asterisk in Figure 3B). We found an explanation in the way the MSA for Age1p was created. The MSA was made for full length Age1, which aligns with >100 other ARF-GEFs.  with views of both sides of the beta sandwich. Models colored by secondary structure: red = sheet, blue = helix; strong colors = classical PH, weak colors = GRAM. For more detail in A-C, see Movies 1-3, respectively. Domains in pleckstrin were initially identified as a family of homologues~100-120 aa, 29 and several structures were solved soon after. 15 GRAM domains were identified as a family of sequences~70 aa long, 30 with the first solved structure identified as beta sheets 1-5 of a complete PH-like domain. 18 that only the 4 new regions in Caf120p, Lam1p, Sip3p and Vid27p are PH-like domains ( Figure 4A).

| Yeast-to-PDB: a step change in sensitivity
A key finding in this work is that initiating a tool such as PSI-BLAST with yeast sequences strongly matched them to defined PH-like domains, even though the same hits were not seen in the other direction with PSI-BLAST or any other profile sequence tool. This indicates that profiles centred on yeast sequences may contain critical information that is missed by profiles seeded on non-yeast sequences. Since  Table S1. Weak homologies, shown by dashed lines, were deduced from the presence of non-significant hits (P > .005) that occurred with no false positive hits above them in hit lists. They lead to weak family grouping indicated by dashed outlines. Classical PH domains (~50% of all domains) are made up to 3 overlapping groups, and they have some overlap with some phosphotyrosine binding (PTB) domains. A high degree of overlap is seen within the 2 PTB groups and in the 2 Ran-binding domain (RanBD) groups (EVH1, WH1). Twenty independent families contain 1, 2 or 3 domains with no links to larger families. B, The same domains and families were analyzed by HHsearch. Domains that produce hits to each other with prob[SS] ≥ 85% are grouped together. The 16 of 20 unlinked small families are now included in larger groups. PTB are fused with FERM-C domains; also GRAM domains are fused with 9 other families. Some domains produce hits to both classical PH domains and either PTB/FERM-C domains or GRAM domains, leading to partial fusion of these 3 groups. Dashed lines indicate incomplete overlap (prob[SH] = 50-80% varying color grey to black). C, The families arranged according to the HHsearch analysis from (B) were populated with 73 yeast PH-like domains described in the literature (details in Table S2A). Empty families are shown in faint outline.

BOX 2
Profile-sequence tools currently used to define domains All profile-sequence tools are seeded with one or multiple sequences to find new hits, these being sequences that align more closely than the E-value threshold. The profile is a statistical representation of the different residues across the multiple sequence alignment (MSA) made from seed plus hits. The profile is then used to initiate another round of searching, and because it contains more sequence information than the previous round it may detect more hits. The whole process iterates through multiple rounds until no more sequences are added.
Profile-sequence tools differ in choice of seeds, in particular how broad a net to cast. For example, the tool SuperFamily only uses seeds with solved structure, while Pfam starts with any recognizable domain. 34,35 Tools also differ in the curation of the profiles obtained, with Pfam using human intervention to group structurally related families into clans. 36  xfam.org/clan/CL0266), but new discoveries are not disseminated to databases instantly. 40 Upgrades lag behind the literature and curation is not perfect. For example, Pfam describes a domain in unconventional myosins as "Myosin_TH1," but lacks any link to the PH-like fold. 32,41,42 In addition, some domains that do not meet the PH-like definition because homology is too weak are nevertheless accepted. The basis for this is typically that the remainder of the protein sequence is highly homologous to a protein that does contain a PH-like domain, and it is thought that the 2 proteins are full-length homolo-  43,44 Yeast is one of the most highly annotated organisms. The

Saccharomyces Genome Database (SGD) is annotated by
InterProScan, which is a resource from the European Bioinformatics Institute (Hinxton, UK). InterProScan incorporates all of Pfam, SuperFamily, Gene3D and many other profile-sequence tools. 45,46 For PH-like domains in yeast, InterPro provides useful data, but its output is not close to 100% sensitive, only including 61 of 73 domains (Table S2). 47,48 Also, it is not entirely specific, as it includes 3 false positives (Table S2).
HHsearch is freely accessible profile-profile search software that is run either offline, or on a webserver called HHpred.
The program is powerful, 13 fast and flexible, with many settings that can be varied by users.

| Support for the predicted PH-like domain in Vps13p
One of the most typical functions of PH-like domains is organelle targeting, binding proteins or lipids. In a previous survey of 33 yeast classical PH domains, only 9 had this function, 64 but it is also applicable to other PH-like families. 65 To determine if the newly described PH-like domains target membranes, we expressed GFP-PH fusions for domains in 8 proteins implicated in membrane traffic/function: Bud2p (both domains), Gyp7p, Lam1p, Pkh2p, Tph3p (both domains), and Vps13p ( Figure 4A). We also expressed the false positive in Age1p. None of these showed any specific cellular localisation (data not shown). We next constructed dimers (in the form GFP-PH-PH) for 6 of these PH-like domains (all except Tph3p) and Age1p, since dimers can reveal weak membrane targeting. 66,67 All constructs remained diffusely cytosolic except for the Vps13 PH-dimer, which faintly localized to rings at the bud neck of some cells ( Figure 5A).
This suggests that the region we identified in Vps13p is involved in intracellular targeting, and may be functionally important.
Full-length Vps13p (3144 aa) has intracellular localisations to multiple membrane contact sites, including vacuole and mitochondrial patches (vCLAMP), endosome-mitochondrial contacts and nucleus vacuole junctions (NVJ), [68][69][70] but as yet no specific targeting domains have been tested. We determined the role of the predicted PH-like domain by constructing a mutant version predicted to inactivate it. A desirable strategy would be to mutate putative ligand binding sites in the predicted variable loops sited at β1-β2, β3-β4 and β6-β7. 64 However, Vps13p has no conserved residues in these loops ( Figure 5B). Therefore, we constructed 2 other mutants: (a) Vps13ΔPH, lacking the entire PH domain (deletion of 3029-3144); (b) a double point mutation of conserved hydrophobic residues in the predicted alpha helix ("LIAA" = L3125A I3129A, Figure 5B). We predicted that these hydrophobic residues would stabilize the PH domain core, similar to the Wxxx motif ( = hydrophobic residue) in classical PH domains, so the LIAA mutant may partially unfold. 71 Cells expressed the PH(LIAA) dimer construct at a lower level than wild-type dimer ( Figure 5C). This is consistent with reduced protein stability caused by partial unfolding.
We next studied the effect of the two Vps13 mutations on intracellular distribution. Vps13-EGFP showed a complex intracellular distribution: in log phase it was largely diffuse, with the majority of cells containing puncta, some of which were close to the vacuole ( Figure 6A); by comparison in early stationary phase Vps13-EGFP targeted the NVJ ( Figure 6B). The two Vps13p mutants showed a marginal reduction in punctate targeting ( Figure 6C/E), but considerable loss of NVJ targeting, which was partial for Vps13LIAA-EGFP ( Figure 6D) and undetectable for Vps13ΔPH-EGFP ( Figure 6F). Thus, the extreme C-terminus of Vps13p appears to play a role in targeting, particularly to the NVJ in early stationary phase.
We next looked for a functional role of the proposed PH-like domain in Vps13p. We tested if plasmid-borne Vps13-EGFP, Vps13LIAA-EGFP and Vps13ΔPH-EGFP rescue sorting of carboxypeptidase-Y (CPY), a vacuolar enzyme that is subject to a strong vacuolar protein sorting (vps) defect (and hence is secreted) in Δvps13 cells. 73 Wild-type Vps13 tagged at the C-terminus with EGFP fully corrected the defect as shown previously. 70 In contrast, Vps13ΔPH-EGFP and Vps13LIAA in Δvps13 produced significantly stronger signals than wild-type Vps13-EGFP, stronger for Vps13ΔPH-EGFP than Vps13LIAA, although both were significantly weaker than empty plasmid ( Figure 6G). Thus, both constructs in which the predicted PH-like domain was mutated elicited partial rescue. This shows that the C-terminus of Vps13p is functionally important, and this 5. Make more links between non-significant hits and PDB entries by indirect searches in related organisms. First, use unknown-to-related proteome searches to identify homologous regions in the related proteome. Use these hits to carry out related proteome-to-PDB searches.

Confirm all new matches with domain only searches.
A more complex alternative for the confirmatory steps (A2 and B6) is to create HMMs for the query and target regions with 8 iterations, and align them pairwise with HHalign. 63 function requires the predicted PH-like domain to be intact. Related PH-like domains are widely predicted at the extreme C-terminus of VPS13, for example in all 4 human homologs and in some plant homologs (data not shown), but this is not the most highly conserved part of the C-terminus. 70 Without the prediction by HHsearch, the PH-like domain might be overlooked for an adjacent glycine-rich domain that is far more conserved ( Figure 6H). 72

B
GFP-Vps13PH LIAA (x2) GFP-Vps13PH (x2) A C FIGURE 5 Intracellular targeting by the PH-like domain of Vps13p. A, GFP-Vps13-PH-PH (dimer) weakly targets the bud neck, seen as linear targeting across the neck of small-to-medium buds (filled arrowheads), and dots either side of occasional larger buds (hollow arrowheads). The minor nuclear enrichment is nonspecific, being seen with all other PH monomers and dimers (data not shown). B, Vps13p 3028-3144 as query (Q, top 3 lines) aligned with a target (T) hit from the solved structure 3hsa_A, a bacterial protein (Shewanella amazonensis, bottom 5 lines). The fourth line indicates which residues align, where: "|" is a very good alignment, "+" is good, "." is neutral, "-" is bad, and "=" is a clash. For both Q & T, the secondary structure prediction (ss prediction) is in 3 states, E for sheet (blue), H for helix (red) and C for unstructured loop (black), with prediction confidence shown for target. The target also has "ss_dssp" showing its solved structure. The box above shows statistics on the hit, including prob[SS] and COLs. L3125 and I3129 (highlighted in yellow) are partially conserved residues (lower case in consensus) that align with 3hsa_A ( "+" and "|," respectively). Alignment made by HHalign. C, GFPtagged dimeric Vps13-PH(LIAA) (L3125A and I3129A) accumulates in cells to a much lesser extent than wild-type. Scale bars 5 μm. To allow secondary structure to be weighted higher than the standard 11%, we edited line 853 of "hhhitlist.C":  HHalign: Domain sequences from both PDB and yeast proteins with 3-5 flanking residues where available were submitted to HHblits, and the resulting "representative alignments," deleting the secondary structure prediction and confidence entries, were compared in HHalign run with standard settings.
Yeast HMM library (Ychunk200): Total yeast protein sequence was obtained from downloads.yeast-genome.org/ sequence/ S288C_reference/orf_protein/. The problem highlighted by Age1p ( Figure S2) was addressed by dividing proteins into regions of 200 residues (overlapping by 100), producing~26 000 "chunks." MSAs and HMMs were created for all chunks (8 iterations, ssw = 11%) to create a "Ychunk200" library that is available as an FTP download on request.
Yeast-to-PDB searches: All HMMs in Ychunk200 were submitted as queries for HHsearch using the PDB library (June 2013) as target. Top hits to PDB entries that contain PH-like domains were then filtered for the presence of multiple domains, and if these were found alignments were curated by hand to include only those hits where the alignment includes the PH-like domain.
Residue conservation across the C-terminus of Vps13p: ConSurf was used with standard settings for the final 1095 aa of Vps13p. 75 Results were normalized so that the 50th centile is 0 and the inter- Microscopy: Cells grown at 30 C to mid log phase, or 16 h thereafter for early stationary phase, were immobilized between slide and coverslip, and visualized on an AOBS SP2 confocal microscope (Leica, Wetzlar, Germany) at room temperature, using identical settings for images comparing different constructs.
Vacuolar Protein Sorting of CPY: Haploid BY4741 yeast lacking VPS13 were transformed with full length and mutated Vps13-EGFP plasmids, as well as a plasmid expressing GFP-GFP. The parent wildtype strain, also carrying GFP-GFP, was included as positive control.
Ten-fold dilutions of cells were spotted on selective medium (10 4 , 10 3 , 10 2 per spot) and grown for 24 h before being overlaid with a nitrocellulose membrane for 16 h. This was washed extensively, and then processed to detect CPY, 73 using monoclonal anti-CPY (10A5B5, Thermofisher, Paisley, UK).

ACKNOWLEDGMENTS
We would like to thank Johannes Söding for assistance in using

CONFLICT OF INTEREST
The authors declare that they have no conflicts of interest.

AUTHOR CONTRIBUTIONS
DRF adapted the program hhhitlist.C, constructed Ychunk200, and carried out all offline HHsearches. SEM assayed CPY sorting, and carried out cloning and imaging. KC created and analyzed match lists, and created display items. PA, RE-T and JI created and analyzed match lists. TPL conceived the study, created match lists, and created display items. All authors contributed to drafting of the manuscript, and read and approved the final manuscript.

Glossary
Clan: a group of domain families that share the same fold but cannot be linked by standard sequence alignment tools (eg, PSI-BLAST) alone. Sometimes called superfamilies. E-value: each alignment between query and target is assigned an evalue, which is the number of hits as good as the one obtained that would be expected to occur randomly given the size of the database being searched. A threshold is chosen, (here 0.01) so that alignments more statistically unlikely than this are considered significant.
HHalign: pairwise application of HHsearch for direct comparison of 2 MSAs.
HHblits: Builds profiles from a single query via multiple rounds of searching, differing from PSI-BLAST in several ways that increase speed and sensitivity.
HHsearch: profile-profile tool that explicitly weights a proportion of an alignment to aligned (predicted) secondary structure. MSAs can be built by either PSI-BLAST or HHblits.
Hit: target in database that aligns with an e-value more statistically significant (ie, lower) than the chosen threshold.
HMM: Hidden Markov Model: a way to represent MSAs, coding features including penalizing insertions and deletions in a profile-specific way; more computationally complex than a simple profile.
MSA: multiple sequence alignment: sequences with multiple small regions of local homology are aligned across a large region by inserting gaps. MSAs can be dominated by large numbers of highly related sequences (eg, mammalian orthologues). This problem can be addressed by filtering MSAs for redundancy, reducing non-diversity.
Pfam: a tool that identifies domains broadly, even without known structure/function. Pfam domains are constructed as HMMs. They are curated, reliable, and when structurally related families are found, they are grouped into clans on the basis of structural homology confirmed by HHsearch. 35 Profile: statistical representation of the residues across an MSA, sometimes represented as a protein "logo." Simple profiles apply inflexible, standard rules, for example for insertions/deletions. PSI-BLAST: Position-specific iterated basic local alignment search tool. Builds profiles from a single query via multiple rounds of BLAST.
Query: (or seed) sequence from which searches are initiated.
SuperFamily: a narrow specificity tool for identifying domains similar to known structures, similar to SCOP.
Target: group of sequences curated into a database among which homologs are being sought. Database size varies from relatively small (all solved structures, predicted proteins in individual genomes) to very large (all proteins in all genomes). Size can be reduced by excluding redundant sequences sharing more than a pre-determined level of sequence identity (20%-90%).