The PAS fold

A redefinition of the PAS domain based upon structural prediction

Authors


M. Hefti, Key Drug Prototyping BV, Wassenaarseweg 72, 2333 AL Leiden, the Netherlands.
Fax: + 31 71 5276355, Tel.: + 31 71 5276354,
E-mail: marco@keydp.com

Abstract

In the postgenomic era it is essential that protein sequences are annotated correctly in order to help in the assignment of their putative functions. Over 1300 proteins in current protein sequence databases are predicted to contain a PAS domain based upon amino acid sequence alignments. One of the problems with the current annotation of the PAS domain is that this domain exhibits limited similarity at the amino acid sequence level. It is therefore essential, when using proteins with low sequence similarity, to apply profile hidden Markov model searches for the PAS domain-containing proteins, as is done for the PFAM database. From recent 3D X-ray and NMR structures, however, PAS domains appear to have a conserved 3D fold, as shown here by structural alignment of the six representative 3D-structures from the PDB database. Large-scale modelling of the PAS sequences from the PFAM database against the 3D-structures of these six structural prototypes was performed. All 3D models generated (> 5700) were evaluated using prosaii. We conclude from our large-scale modelling studies that the PAS and PAC motifs (which are separately defined in the PFAM database) are directly linked and that these two motifs form the PAS fold. The existing subdivision in PAS and PAC motifs, as used by the PFAM and SMART databases, appears to be caused by major differences in sequences in the region connecting these two motifs. This region, as has been shown by Gardner and coworkers for human PAS kinase (Amezcua, C.A., Harper, S.M., Rutter, J. & Gardner, K.H. (2002) Structure 10, 1349–1361, [1]), is very flexible and adopts different conformations depending on the bound ligand. Some PAS sequences present in the PFAM database did not produce a good structural model, even after realignment using a structure-based alignment method, suggesting that these representatives are unlikely to have a fold resembling any of the structural prototypes of the PAS domain superfamily.

Abbreviations

HMM, hidden Markov model; PYP, photoactive yellow protein

In 1997, Zhulin et al. ([2]), and Ponting and Aravind ([3]) observed that conserved motifs representative of PAS domains were ubiquitous in archaea, bacteria and eucarya, and that many PAS containing proteins were involved in the sensing of oxygen, redox or light. PAS domains were first found in eukaryotes, and were named after homology to the Drosophila period protein (PER), the aryl hydrocarbon receptor nuclear translocator protein (ARNT) and the Drosophila single-minded protein (SIM). These domains are sometimes referred to as LOV domains; light, oxygen or voltage domains [4–8]. Unlike many other sensory domains, PAS domains are located in the cytoplasm [9] and are found in serine/threonine kinases [3], histidine kinases [10], photoreceptors and chemoreceptors for taxis and tropism [11], cyclic nucleotide phosphodiesterases [12], circadian clock proteins [13,14], voltage-activated ion channels [15], as well as regulators of responses to hypoxia [16] and embryological development of the central nervous system [17]. Many PAS domains bind cofactors or ligands, which are required for the detection of sensory input signals.

The first 3D structure determined of a PAS domain-containing protein was the structure of the Ectothiorhodospira halophila blue-light photoreceptor PYP (photoactive yellow protein [18,19]). Pellequer and coworkers suggested that PYP is a prototype for the 3D-fold of the PAS domain superfamily [20]. PYP undergoes a self-contained light cycle. Light-induced trans-to-cis isomerization of the 4-hydroxycinnamic acid chromophore and coupled protein rearrangements produce a new set of active-site hydrogen bonds. Resulting changes in shape, hydrogen bonding and electrostatic potential at the protein surface form a likely basis for signal transduction [19]. In recent years, more PAS-like protein structures have been determined. These include the 3D structure of the heme-binding domain of the rhizobial oxygen sensor FixL, from Bradyrhizobium japonicum [21] and from Rhizobium meliloti [22]. FixL is an oxygen-sensing histidine protein kinase, forming part of a two-component system that regulates symbiotic nitrogen fixation in root nodules of host plants [22]. The PAS domain in FixL is a heme-based oxygen sensor that controls the activity of the associated histidine protein kinase domain. FixL is regulated by the binding of oxygen and other strong-field ligands. The heme domain permits kinase activity in the absence of bound ligand, but when the appropriate exogenous ligand is bound, this domain turns off kinase activity [21]. The structural resemblance of the FixL heme domain to PYP indicates the existence of a PAS structural motif, although both proteins are functionally different. In addition to the PYP and FixL protein structures, the N-terminal domain of the human ether-a-go-go-related potassium channel, HERG (first 3D model of a eukaryotic PAS domain [23]), the FMN-containing phototropin module of the chimeric fern Adiantum photoreceptor [6], and the NMR structure of the N-terminal PAS domain of human PAS kinase [1] have also been determined.

Recently, two further structures of PAS-like domains have been solved: the periplasmic ligand-binding domain of the sensor kinase, CitA [24], and the sensory domain of the two-component fumarate sensor, DcuS [25]. These proteins have not been used in our large-scale modelling work, but structural alignment of our six template structures and the two new structures (CitA and DcuS) using VAST indicates that the β-sheets of all eight 3D-structures superimpose very well, whereas of the α-helices only helix D superimposes well (Fig. 1). Helix F appears to be part of the flexible loop which links the PAS domain and the PAC motif. It should be noted that CitA and DcuS have three to four helices on the N-terminal side of the PAS fold, compensating for the absence of helices C and E in the latter two proteins.

Figure 1.

 Structural alignment of the six representative PAS structures. (A) An overlay of the structural alignment of the six representative PAS structures selected is presented. The PFAM PAS-annotated regions are coloured in blue, the PAC motif regions in orange/red. Structures and part of structures currently not assigned as either PAS or PAC are coloured in grey. (B) The 20 lowest-energy solution structures of the human PAS kinase. (C) A schematic representation of the human PAS kinase (according to [1]) is given. The flexible region between Fα and Gβ is clearly visible in B. This loop is located between the PAS domain and PAC motif. (D) Shows the structural alignment of the six structures selected. The PAS domains are indicated with blue bars, the PAC motifs with orange bars. The boxes on which the structural alignment is based are indicated in black. Helical and sheet region residues are coloured in red and green, respectively.

In order to understand the different mechanisms by which PAS domains mediate signal transduction, detailed information about their sequences and structures is needed. The PFAM Protein Families Database (version 7.8) [26] contains 958 PAS domains, present in 607 different proteins. According to PFAM, a PAC motif is found at the C-terminus of a subset (51%) of the PAS domains. PAS domains are defined differently by different authors. The definition used by Zhulin and coworkers [2] comprises a large sequence dataset, including S1 and S2 boxes. These sensory boxes were initially detected in bacterial sensors, and these conserved regions are present in PAS domains in all kingdoms of life. The S1 and S2 boxes are separated by a sequence of variable length.

Ponting and Aravind [3], on the other hand, split this PAS sequence into two separate regions: the PAS domain and the PAC motif. These two regions roughly correspond to the S1 and S2 boxes [2], with varying lengths between the PAS domain and PAC motif. The SMART [27] and PFAM databases use the definition provided by Ponting and Aravind, thereby giving rise to an annotation system based upon two domains, PAS and PAC. Although the PAC motif is proposed to contribute to the PAS domain structure [3], many PAS sequences in the SMART and PFAM databases are not linked to a PAC motif, which raises the question of possible differences within the PAS domain superfamily. The PFAM annotation system is based upon multiple sequence alignments and profile hidden Markov models (HMM). Although HMM searches are more sensitive in detecting sequence similarities than, for example, BLAST, HMM-based profiles are still dependent on sequence homology. Problems with HMM-based searches may arise when proteins have virtually identical 3D-structures but limited sequence similarity. As many protein sequences are emerging from the databases, annotation of these sequences should preferably be accurate. The availability of the 3D-structures of several PAS domain-containing proteins provides the opportunity to use 3D-information in addition to sequence comparison. By modelling PAS sequences annotated in the PFAM database onto known PAS structures, we have redefined this intriguing family of sensory proteins. Our analysis gives rise to a single structural module, the PAS fold, combining the existing PAS and PAC annotations into one new structurally annotated fold.

Experimental procedures

Description of the modelling templates

Seven crystal structures [18,19,28–31] and one NMR structure [32] are known for the photoactive yellow protein (PYP) and PYP mutants from E. halophila in the Protein Data Bank (PDB) [33]. The structure with accession number 3PYP was chosen as the template structure as it has the highest resolution (0.85 Å) [29]. The oxygen sensor FixL has been crystallised from two different organisms. Of the two R. meliloti FixL structures deposited in the PDB, we selected 1EW0 [22] because it has the more recent release date and the resolution of the two structures is identical. The five different PDB files of B. japonicum FixL [21,34] have similar 3D folds; they differ only with respect to the bound ligand. 1DRM [21] was selected, being an apo-protein with the highest resolution (2.4 Å). The FMN-binding domain (1G28) [6] of the fern photoreceptor protein from Adiantum capillus-veneris has a resolution of 2.7 Å, and the N-terminal domain of the human ether-a-go-go-related potassium channel, HERG (1BYW) [23], has a resolution of 2.6 Å. The last structure used for modelling is the average NMR structure of the human PAS kinase N-terminal PAS domain (1LL8) [1]. These six representatives are listed in Table 1.

Table 1. The six representative structures selected, their Protein Data Bank accession number and their PFAM-annotated domains.

PDB name | Name | Accession number (a) | PFAM PAS | PFAM PAC
3PYP | PYP | P16113 | PAS |
1EW0 | FixL | P10955 | PAS |
1DRM | FixL | P23222 | PAS | PAC
1G28 | PHY3 | NA | (b) | PAC (b)
1BYW | HERG | NA | (b) | PAC (b)
1LL8 | PAS kinase | NA | PAS (b) | (b)

(a) Some proteins are not annotated in the SWISS-PROT protein sequence database or its supplement TrEMBL [50]. Therefore, they are not annotated in the PFAM database.
(b) However, PFAM has the possibility to BLAST a sequence against their HMM search profile.

Structural alignment of the representative PAS structures

The six representative PAS domain structures were aligned structurally using the homology module of insight ii (MSI/Biosys, San Diego, CA, 1997; version 2000), running on a Silicon Graphics O2 workstation. The six proteins were compared automatically by calculating the root mean square difference between their alpha carbon distance matrices. Peptide segments were classified as being conserved when they had similar local conformations and similar orientations with respect to the rest of the protein. In regions of structural conservation among the proteins, the amino acid sequences were aligned, and atom coordinates were assigned based upon these alignments.
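The comparison criterion named above, the root mean square difference between alpha carbon distance matrices, can be illustrated with a minimal sketch. This is not the insight ii implementation: the function names and the toy coordinates are hypothetical, and the sketch assumes that both proteins have already been reduced to equal-length lists of equivalent Cα positions.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates [(x, y, z), ...]."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def distance_matrix_rmsd(coords_a, coords_b):
    """RMS difference between the C-alpha distance matrices of two fragments.

    Assumption: both lists contain the same number of structurally
    equivalent residues, in the same order.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    if len(coords_a) < 2:
        raise ValueError("at least two residues are required")
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    n = len(coords_a)
    # Each residue pair is counted once (i < j).
    squared = [(da[i][j] - db[i][j]) ** 2 for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy fragments (coordinates in Angstrom).
fragment_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
# Same geometry as fragment_a, but rotated and translated.
fragment_b = [(1.0, 1.0, 1.0), (1.0, 4.8, 1.0), (1.0, 8.6, 1.0)]
# Slightly stretched geometry.
fragment_c = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]

same_shape = distance_matrix_rmsd(fragment_a, fragment_b)
stretched = distance_matrix_rmsd(fragment_a, fragment_c)
```

Because only internal distances are compared, the measure does not change when one structure is rotated or translated, so no prior superposition is needed; in the toy data, `same_shape` is zero up to rounding while `stretched` is not.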

Alignment strategy

All PFAM-annotated PAS sequences, including those from proteins containing multiple PAS domains, were compiled into a list of 958 PAS sequences. The PFAM alignment of the PAS domains was used as an initial alignment. All amino acid residues extending from the N-terminal end of the PAS domain were deleted manually, and all sequences were extended C-terminally of the PFAM PAS domain in order to incorporate the PAC motif. If a sequence had a PFAM-annotated PAC motif C-terminal to the PAS domain, the corresponding alignment was used. If no PAC motif was present, the sequence was elongated to a length similar to the other sequences based upon the genomic information available in public databases. This was the best option available, as an HMM search in PFAM did not result in the assignment of a PAC motif at the C-terminal end of many PAS domains, most likely due to the limited sequence homology to the PFAM HMM-defined PAC motif. In this way, an alignment of 958 protein sequences was created, with an average length of 105 amino acid residues per sequence. Each of the sequences was modelled against all six template structures representative of the PAS fold.

The PAS- and PAC-annotated sequences of four organisms were studied in greater detail. All PAS-annotated sequences from Arabidopsis thaliana, Escherichia coli, Azotobacter vinelandii and Caenorhabditis elegans were realigned using the Align-2D command within modeller version 6.2 (Table 2). This enables the alignment of a sequence with a structure in comparative modelling, as amino acid sequence gaps are placed in a better structural context, and could improve the alignments provided by PFAM [35].

Table 2. All sequences of the model organisms annotated in the PFAM PAS domain alignment. The presence of any adjacent PFAM PAC-annotated domain is listed. For each sequence, the template sequence with the best E-value (expected value) is given, as well as the z-score of the best model before and after realignment using Align-2D. Some sequences are annotated as having a PFAM-B region (B_66903 or B_39648 or B_19516). PFAM-B contains a large number of small families that do not overlap with PFAM-A. Although of lower quality, PFAM-B families can be useful when no PFAM-A families are found.

Name | Accession number | Residues | PFAM PAC | PROSA z-score (best model) | Template | z-Score after Align-2D (best model) | Template

Arabidopsis thaliana
Phytochrome A | P14712 | 632–737 | NA | −6.04 | 3PYP | −6.19 | 1DRM
Phytochrome A | P14712 | 765–872 | NA | −2.02 | 3PYP | −3.17 | 1DRM
Phytochrome B | P14713 | 676–772 | NA | −5.72 | 1G28 | −6.04 | 3PYP
Phytochrome B | P14713 | 800–904 | NA | −2.49 | 1DRM | −4.09 | 3PYP
Phytochrome C | P14714 | 618–723 | NA | −5.96 | 3PYP | −5.32 | 3PYP
Phytochrome C | P14714 | 751–859 | NA | −2.20 | 3PYP | −4.16 | 3PYP
Phytochrome D | P42497 | 670–776 | NA | −5.94 | 1EW0 | −5.29 | 3PYP
Phytochrome D | P42497 | 804–908 | NA | −2.58 | 1G28 | −3.57 | 3PYP
Phytochrome E | P42498 | 609–718 | NA | −3.96 | 3PYP | −4.36 | 1DRM
Phytochrome E | P42498 | 746–851 | NA | −1.28 | 3PYP | −4.57 | 3PYP
Nonphototropic hypocotyl protein 1 | O48963 | 201–300 | PAC | −4.22 | 1G28 | −6.10 | 1G28
Nonphototropic hypocotyl protein 1 | O48963 | 476–578 | PAC | −5.03 | 1G28 | −7.77 | 1G28
Putative Ser/Thr kinase | O64511 | 38–141 | PAC | −5.75 | 1BYW | −6.51 | 1G28
Putative Ser/Thr kinase | O64511 | 260–364 | PAC (a) | −4.08 | 1BYW | −6.23 | 1G28
Nonphototropic hypocotyl protein 2 | O81204 | 137–236 | PAC | −4.29 | 1G28 | −6.08 | 1G28
Nonphototropic hypocotyl protein 2 | O81204 | 390–492 | PAC | −3.62 | 1DRM | −7.40 | 1G28
Putative Ser/Thr kinase | O82754 | 102–198 | PAC | −4.79 | 1EW0 | −6.84 | 1EW0
Putative protein kinase | Q9C547 | 76–172 | PAC | −4.53 | 1EW0 | −6.94 | 1EW0
Putative protein kinase | Q9C833 | 76–172 | PAC | −5.42 | 1EW0 | −6.25 | 3PYP
Putative protein kinase | Q9C902 | 115–211 | PAC | −5.71 | 1EW0 | −6.32 | 1BYW
Putative protein kinase | Q9C903 | 76–172 | PAC | −5.42 | 1EW0 | −6.25 | 3PYP
Hypothetical 82.2 kDa protein | Q9C9V5 | 113–209 | PAC | −5.34 | 1EW0 | −7.08 | 3PYP
Protein kinase | Q9FGZ6 | 112–208 | PAC | −4.35 | 1DRM | −7.49 | 1DRM

Escherichia coli
Hypothetical transcriptional regulator ygeV | Q46802 | 171–276 | NA | −4.20 | 1BYW | −2.86 | 3PYP
Sensor protein atoS | Q06067 | 273–379 | NA | −2.95 | 1G28 | −3.50 | 1EW0
Sensor protein dcuS | P39272 | 233–339 | B_19516 | −4.33 | 1BYW | −1.72 | 1G28
Hypothetical protein yegE | P38097 | 313–420 | PAC | −4.14 | 1BYW | −6.73 | 1EW0
Hypothetical protein yegE | P38097 | 566–671 | PAC | −5.95 | 1EW0 | −6.84 | 1BYW
Hypothetical protein yciR | P77334 | 121–227 | NA | −4.67 | 1DRM | −3.25 | 1EW0
Sensor kinase dpiB | P77510 | 233–341 | B_39296 | −3.78 | 1EW0 | −4.00 | 1DRM
TraJ protein | P05837 | 52–158 | B_39648 | −4.21 | 1BYW | −3.17 | 1EW0
TraJ protein | P13949 | 32–138 | B_39648 | −4.55 | 1BYW | −3.58 | 3PYP
Phosphate regulon sensor phoR | P08400 | 107–209 | NA | −3.91 | 1LL8 | −2.71 | 1EW0
Aerobic respiration control sensor arcB | P22763 | 164–270 | NA | −3.39 | 1EW0 | −2.38 | 3PYP
Hypothetical protein yddU | P76129 | 24–129 | PAC | −7.58 | 1EW0 | −7.69 | 1EW0
Hypothetical protein yddU | P76129 | 146–254 | PAC | −4.13 | 3PYP | −5.73 | 1BYW
Glycerol metabolism operon regulator | P76016 | 214–318 | NA | −3.03 | 1EW0 | −2.85 | 1DRM

Caenorhabditis elegans
Aryl hydrocarbon receptor nuclear translocator ortholog 1 | O44711 | 128–235 | NA | −4.87 | 1G28 | −4.35 | 3PYP
Aryl hydrocarbon receptor nuclear translocator ortholog 1 | O44711 | 288–394 | B_66903 | −4.13 | 3PYP | −4.83 | 1EW0
Aryl hydrocarbon receptor ortholog 1 | O44712 | 139–245 | NA | −6.19 | 1BYW | −4.47 | 1EW0
Aryl hydrocarbon receptor ortholog 1 | O44712 | 284–391 | NA | −2.83 | 1LL8 | −3.09 | 1G28
F38A6.3B protein | Q9TVM0 | 200–306 | NA | −6.43 | 1EW0 | −4.70 | 1LL8
F38A6.3B protein | Q9TVM0 | 349–445 | PAC (a) | −4.10 | 3PYP | −3.88 | 3PYP
C25A1.11 protein | O02219 | 128–235 | NA | −4.87 | 1G28 | −4.35 | 3PYP
C25A1.11 protein | O02219 | 290–396 | B_66903 | −4.13 | 3PYP | −4.83 | 1EW0
F38A6.3 A protein | O45486 | 200–306 | NA | −6.43 | 1EW0 | −4.70 | 1LL8
F38A6.3 A protein | O45486 | 339–445 | NA | −5.26 | 3PYP | −3.88 | 3PYP
Putative transcription factor C15C8.2 | Q18018 | 163–271 | NA | −4.86 | 1G28 | −3.46 | 1EW0
Putative transcription factor C15C8.2 | Q18018 | 304–410 | PAC (a) | −3.52 | 3PYP | −1.87 | 3PYP
Single-minded homolog T01D3.2 | P90953 | 95–201 | NA | −3.70 | 1EW0 | −4.79 | 1DRM

Azotobacter vinelandii
Nitrogen fixation regulator NifL | P30663 | 36–144 | PAC | −2.96 | 1G28 | −5.69 | 1G28
Nitrogen fixation regulator NifL | P30663 | 162–268 | NA | −3.86 | 1EW0 | −4.34 | 1DRM

(a) PFAM has the possibility to BLAST a sequence against their HMM search profile. The indicated sequences are then annotated as PAC motif.

There are eight PFAM PAC-annotated sequences (Table 3) in these four organisms which lack a PAS domain N-terminal to the PAC motif. These sequences were elongated N-terminally to incorporate any potential PAS sequences. The PAC alignment as present in the PFAM database was not altered, and the N-terminal region was aligned manually. These sequences were also realigned using a structure-based alignment method (Align-2D). The sequences and the modelling results are listed in Table 3.

Table 3. Sequences that have a PFAM PAC annotation, but not a PFAM PAS annotation, were extended N-terminally to incorporate any available PAS domain. The N-terminal regions of these sequences were aligned manually, and the sequences were subsequently modelled against the six template structures. Realignment of the A. thaliana, E. coli, and C. elegans sequences with Align-2D sometimes resulted in better models.

Name | Accession number | Residues | PFAM PAS | PROSA z-score best model; after manual alignment | Template | PROSA z-score best model; after Align-2D | Template

Arabidopsis thaliana
Adagio 2 | trQ9C5S6 | 42–142 | B_462 | −5.36 | 3PYP | −6.30 | 1BYW
Hypothetical 69.1 kDa protein | trQ9C9W9 | 58–166 | B_462 | −5.44 | 1G28 | −4.54 | 1G28
Clock-associated PAS protein ztl | trQ9LDF6 | 53–157 | B_462 | −4.96 | 1G28 | −6.01 | 1G28
Fkf1 (adagio 3) | trQ9M648 | 58–166 | B_462 | −5.44 | 1G28 | −4.54 | 1G28

Escherichia coli
Hypothetical protein yegE | P38097 | | B_45327 | −3.82 | 1BYW | −4.30 | 3PYP
Aerotaxis receptor | P50466 | | NA | −5.72 | 1DRM | −6.65 | 1BYW

Caenorhabditis elegans
Hypothetical protein F16B3.1 | O44164 | | B_462 | −6.45 | 1BYW | −6.79 | 1BYW
EAG K+ channel EGL2 | Q9XYX7 | | B_462 | −6.45 | 1BYW | −6.79 | 1BYW

Homology modelling

Models of all 958 PAS-containing sequences were generated using modeller version 6.2 [35–37] running on a dual-processor Xeon 1.7 GHz Pentium computer with 1 GB RAM, under Red Hat Linux release 7.3. The average calculation time for one model was about 90 s, resulting in six days of computer calculations. To optimize CPU usage, not more than three modeller jobs were running at the same time. For the resulting 6 × 958 protein models, the PROSA z-score was calculated using prosaii version 3.0 [38]. The z-score is derived from a knowledge-based energy potential, using force fields based on the Boltzmann principle, and represents a quality index for structural models. A more negative z-score indicates a better structural model. Because the PROSA z-score depends on the length of the amino acid sequence, the z-score can be normalized using the natural logarithm of the sequence length [39]; the resulting Q-score can be used to discriminate between good and bad 3D protein models. In our study, the sequence length of all modelled sequences was virtually equal and therefore we used the z-score directly.
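The arithmetic behind these figures can be checked with a short sketch. The model count and run time follow directly from the numbers given above. The Q-score function is an assumption: the text states only that the z-score is normalized using the natural logarithm of the sequence length, and the sketch takes this to mean division by ln(N). All function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400.0


def total_models(n_sequences, n_templates):
    """Number of models when every sequence is modelled on every template."""
    return n_sequences * n_templates


def compute_days(n_models, seconds_per_model):
    """Summed calculation time in days, ignoring jobs that run in parallel."""
    return n_models * seconds_per_model / SECONDS_PER_DAY


def q_score(z_score, seq_length):
    """Length-normalised score. ASSUMPTION: Q = z / ln(N); the exact
    form of the normalisation is not spelled out in the text."""
    return z_score / math.log(seq_length)


n_models = total_models(958, 6)        # 958 sequences on six templates
days = compute_days(n_models, 90)      # about 90 s per model
q_at_cutoff = q_score(-3.57, 105)      # a z-score of -3.57 at 105 residues
```

With 958 sequences and six templates this gives 5748 models and just under six days of summed calculation time, in line with the "> 5700" models and "six days" quoted in the text. Since all sequences are about 105 residues long, dividing by ln(105) would rescale every z-score by nearly the same factor, which is why ranking by the raw z-score is reasonable here.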

modeller is an implementation of an automated approach to comparative structure modelling by satisfaction of spatial restraints. As input, it requires an alignment file and a PDB file of the template structure. As output, it generates a PDB file of the model. Default settings were used, and the molecular dynamics refinement level was set to two. The Align-2D command in modeller aligns a block of sequences with a block of structures, using a variable gap opening penalty. This gap penalty can favour gaps in exposed regions, and avoid gaps within secondary structure elements. The Align-2D command can be used to try to improve the existing alignment, but does not always result in a better quality of the 3D model generated.

Results

Alignment of existing structures

Six structures were chosen (Table 1) as representatives of the 21 PAS domain structures in the PDB database for comparative analysis. The other 17 structures (mutants or structures containing a different cofactor) have very similar 3D structures to the six representatives or have only recently been released (CitA and DcuS). Of these six structures, all N- and C-terminal amino acid residues that did not align after superimposition (Fig. 1A) were removed from the corresponding alignment file manually (Fig. 1D). The alignment obtained incorporates the two previously identified regions, the PFAM PAS and PAC motifs (the areas on which our structural alignment is based are indicated with a black bar below the sequence alignment in Fig. 1D). In this way, the sequences were trimmed back to a sequence length in which the common fold observed was equivalent for all six proteins. The root mean square deviation for this alignment is 1.25 Å, indicating high structural similarity. As some structures are more closely related than others, Table 4 shows the pairwise root mean square deviations for all six structures.

Table 4. Backbone root mean square deviation values (in Ångström) of the structural alignment of the six representative structures present in the Protein Data Bank.

     | 3PYP | 1EW0 | 1DRM | 1G28 | 1BYW | 1LL8
3PYP | –    | 1.0  | 0.9  | 1.4  | 1.3  | 1.5
1EW0 | 1.0  | –    | 0.7  | 1.2  | 1.5  | 1.3
1DRM | 0.9  | 0.7  | –    | 1.2  | 1.5  | 1.3
1G28 | 1.4  | 1.2  | 1.2  | –    | 1.0  | 1.7
1BYW | 1.3  | 1.5  | 1.5  | 1.0  | –    | 1.5
1LL8 | 1.5  | 1.3  | 1.3  | 1.7  | 1.5  | –
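As a reading aid for Table 4, the following sketch stores the fifteen pairwise values once each and derives a few summary numbers. The variable and function names are hypothetical. The unweighted mean computed here is only a plausibility check: the text does not state how the overall value of 1.25 Å was obtained.

```python
# Pairwise backbone RMSD values (in Angstrom) transcribed from Table 4;
# each unordered pair of templates is stored once.
PAIRWISE_RMSD = {
    ("3PYP", "1EW0"): 1.0, ("3PYP", "1DRM"): 0.9, ("3PYP", "1G28"): 1.4,
    ("3PYP", "1BYW"): 1.3, ("3PYP", "1LL8"): 1.5,
    ("1EW0", "1DRM"): 0.7, ("1EW0", "1G28"): 1.2, ("1EW0", "1BYW"): 1.5,
    ("1EW0", "1LL8"): 1.3,
    ("1DRM", "1G28"): 1.2, ("1DRM", "1BYW"): 1.5, ("1DRM", "1LL8"): 1.3,
    ("1G28", "1BYW"): 1.0, ("1G28", "1LL8"): 1.7,
    ("1BYW", "1LL8"): 1.5,
}


def pair_rmsd(a, b):
    """Look up the RMSD for a pair of templates in either order."""
    if (a, b) in PAIRWISE_RMSD:
        return PAIRWISE_RMSD[(a, b)]
    return PAIRWISE_RMSD[(b, a)]


mean_rmsd = sum(PAIRWISE_RMSD.values()) / len(PAIRWISE_RMSD)
closest_pair = min(PAIRWISE_RMSD, key=PAIRWISE_RMSD.get)
most_divergent_pair = max(PAIRWISE_RMSD, key=PAIRWISE_RMSD.get)
```

The unweighted mean of the pairwise values is about 1.27 Å, close to the overall 1.25 Å reported for the alignment. The two FixL structures (1EW0 and 1DRM) form the closest pair, and 1G28 and the NMR structure 1LL8 the most divergent one.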

The 20 lowest-energy NMR solution structures of the human PAS kinase are shown in Fig. 1B. The majority of the human PAS kinase structure was solved with high precision, but portions of the Fα helix and the subsequent FG loop were poorly defined in this structural ensemble [1]. The Fα helix and the FG loop correspond to that region of the PAS fold that is part of the region which tethers the PAS domain and PAC motif. A schematic representation of the human PAS kinase is depicted in Fig. 1C. The recently published NMR structure of the E. coli histidine protein kinase DcuS [25] has major differences in the region linking the PAS domain and the PAC motif, supporting our hypothesis that this region is important in the structure-function relationship of proteins with a PAS-fold. The other PAS domain containing structures resemble a similar fold, in which the area corresponding to the Fα helix and the subsequent FG loop of human PAS kinase is believed to form specific interactions in the hydrophobic core or with bound cofactors. The FixL structures have elevated temperature factors in the FG loop region, indicating increased flexibility [21,40]. The FG loop might be the key flexible region necessary for signal transduction [1].

According to the PFAM Protein Families Database [26], not all six template structures contain both a PAS (PF00989) and a PAC motif (PF00785) (Table 1). (In Fig. 1D, the PAS-annotated domains are coloured with blue bars, and the PAC-annotated domains with orange bars.) It is obvious from the structural overlay in Fig. 1A, that all six proteins share a common domain with a characteristic five-stranded, β-pleated, α-helical structure. In comparing the structural and sequence alignments, it is clear that the subdivision of the domain into PAS and PAC motifs is arbitrary, as their existence would imply that the conserved five-stranded β-sheet is split into two sections. Based upon this observation, and also on our large scale modelling results (see below), we propose to use the name PAS fold [9,20] for the complete β-pleated α-helical structure that defines PAS domains and C-terminal PAC motifs in terms of structure rather than sequence.

Large-scale modelling

The first, and most critical, step in protein homology modelling is the appropriate alignment of template and experimental sequences. The alignment of the six representative 3D-structures (Fig. 1A,D) makes it possible to use all six structures as templates for large-scale homology modelling. Note that not all six structures contain a PAS as well as a PAC motif, according to the PFAM database (Fig. 1D and Table 1). Each of the 958 PAS domains was modelled against each of the six template structures presented in Fig. 1. ProsaII z-scores were sorted by template structure, resulting in both good and bad models. With an average sequence length of 105 amino acid residues, all models with a z-score higher than −3.57 (that is, closer to zero) were considered to be poor models [39], and were rejected. This value of −3.57 was validated using the pG server (http://www.salilab.org/). Thus, 30% of the sequences used did not produce a good quality model. Of the resulting 672 best models, 188 were constructed using 1EW0 as template, and 177 were constructed using 1DRM. Only 2.2% of the best models used 1LL8 as a template. A diagram of these results is depicted in Fig. 2. Notably, 1EW0 and 1DRM were the best template structures, each in about 27% of the cases. This might indicate that most PAS domain proteins adopt a fold similar to that of FixL. A list of all PAS sequences modelled, as well as their best template structure, will be distributed on our website in the near future.
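The selection procedure described above (take the most negative z-score over the six templates for each sequence, reject it if it lies above −3.57, then count how often each template gave the best model) can be sketched as follows. The data layout, the function names and the toy scores are hypothetical; only the cutoff and the reported counts come from the text.

```python
from collections import Counter

Z_CUTOFF = -3.57  # models with a z-score above this value are rejected


def best_models(z_scores):
    """Map each accepted sequence to its best (template, z-score) pair.

    z_scores: {sequence_id: {template_id: z_score}}
    The best model of a sequence is the one with the most negative z-score.
    """
    accepted = {}
    for seq_id, per_template in z_scores.items():
        template, z = min(per_template.items(), key=lambda item: item[1])
        if z <= Z_CUTOFF:
            accepted[seq_id] = (template, z)
    return accepted


def template_usage(accepted):
    """Count how often each template produced the best accepted model."""
    return Counter(template for template, _ in accepted.values())


# Hypothetical toy scores for three sequences (not taken from the study).
toy_scores = {
    "seqA": {"1EW0": -5.1, "3PYP": -4.0},
    "seqB": {"1EW0": -2.0, "1LL8": -3.0},   # best model still too poor
    "seqC": {"1DRM": -3.57, "1G28": -1.0},  # exactly at the cutoff: kept
}
toy_accepted = best_models(toy_scores)
toy_usage = template_usage(toy_accepted)

# Fractions implied by the counts reported in the text.
accepted_fraction = 672 / 958
fixl_fraction = (188 + 177) / 672
```

Applied to the reported counts, 672 of 958 sequences is about 70% (hence the 30% rejected), and the two FixL templates together account for (188 + 177)/672, about 54% of the best models.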

Figure 2.

Models sorted by template structure. The distribution of the percentage best model, for each of the 672 best models, is presented in the left panel. Of the six template structures used, 54% of the sequences give the best model with the FixL (1DRM and 1EW0) structures as template, while only a small percentage of the best models is created by using 1LL8 as a template. The subsequent panels show the distribution of the percentage best model for all PFAM PAS-annotated A. thaliana, C. elegans, and E. coli sequences. On average, for these three model organisms, 32% of the sequences give the best model with the 1EW0 as template, while only 3% of the best models is created by using 1LL8 as template. Note that for the latter three, only a limited number of sequences is modelled.

Arabidopsis, Escherichia, Caenorhabditis and Azotobacter – a case study

Some of the PAS domains have been analysed in detail. We chose four representative organisms from the animal, bacterial and plant kingdoms, A. thaliana, E. coli, A. vinelandii and C. elegans, to analyse their complement of PAS domains. These species have been studied extensively and many details of their gene expression and function are known.

The existing PFAM PAC annotation of sequences from these organisms is listed in Table 2. However, some sequences with a PAC motif are not annotated as having a PAS domain (Table 3). The full-length sequences of these proteins were aligned manually, and subsequently trimmed back to the region which we denote as representing the PAS fold. Alignment of this region from the A. thaliana sequences listed in Table 2 and Table 3, based upon the structural alignment (Fig. 1D) of the six representative PAS proteins, is depicted in Fig. 3. We conclude from this alignment that all PAS-annotated A. thaliana proteins also contain a PAC motif, and conversely that all PAC-annotated A. thaliana proteins contain a PAS domain. Therefore, in the case of A. thaliana, the PAS and PAC motifs are inseparable, indicating that the annotation of these proteins as containing only PAS or PAC motifs is questionable. A similar realignment was performed with the other three organisms, resulting in the same conclusion: PAS and PAC motifs do not occur independently of each other, but are parts of the same functional fold, separated by a linker region which is flexible in length. As all sequences of the four organisms studied showed inseparable PAC and PAS regions, the coexistence of PAS and PAC motifs might also apply to most other PAS and PAC protein sequences present in the PFAM database.

Figure 3.

Alignment of all A. thaliana sequences that are either annotated as a PFAM PAS domain or as a PFAM PAC motif. Regions of sequences that have an amino acid sequence similarity > 35%, are depicted in black shading. In the left column, the SWISS-PROT or TrEMBL accession numbers are listed, in the adjacent column the first and the last amino acid residue numbers. The PAS and PAC-annotated regions are indicated above the sequences.

The sequences of these proteins were also realigned using the Align-2D command [35], in order to try to improve the manual alignment. Modelling based upon these alignments sometimes resulted in lower (more negative) z-scores, and thus better models, as listed in Table 2. Indeed, some of the poorly scoring models had a better z-score after realignment, resulting in more reliable models. This was especially the case for the A. thaliana phytochromes. The PFAM PAC motif-annotated sequences that do not have a PFAM PAS annotation also gave reasonable z-scores after realignment (Table 3).
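The effect of realignment on the A. thaliana phytochromes can be made concrete with the z-scores listed in Table 2. The sketch below uses those ten value pairs; applying the −3.57 cutoff from the large-scale modelling to these individual models is an assumption made here for illustration, and the variable names are hypothetical.

```python
Z_CUTOFF = -3.57  # cutoff used in the large-scale modelling

# (accession, residues) -> (z-score before, z-score after Align-2D),
# transcribed from the A. thaliana phytochrome rows of Table 2.
PHYTOCHROME_Z = {
    ("P14712", "632-737"): (-6.04, -6.19),
    ("P14712", "765-872"): (-2.02, -3.17),
    ("P14713", "676-772"): (-5.72, -6.04),
    ("P14713", "800-904"): (-2.49, -4.09),
    ("P14714", "618-723"): (-5.96, -5.32),
    ("P14714", "751-859"): (-2.20, -4.16),
    ("P42497", "670-776"): (-5.94, -5.29),
    ("P42497", "804-908"): (-2.58, -3.57),
    ("P42498", "609-718"): (-3.96, -4.36),
    ("P42498", "746-851"): (-1.28, -4.57),
}

# A model improves when its z-score becomes more negative.
improved = [key for key, (before, after) in PHYTOCHROME_Z.items() if after < before]

# Models at or below the cutoff count as acceptable.
good_before = sum(before <= Z_CUTOFF for before, _ in PHYTOCHROME_Z.values())
good_after = sum(after <= Z_CUTOFF for _, after in PHYTOCHROME_Z.values())
```

Eight of the ten phytochrome models become more negative after realignment, and the number of models at or below the cutoff rises from five to nine; only the second PAS region of phytochrome A (residues 765–872) remains above it.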

It is interesting to consider whether the best template for modelling a particular PAS domain is related to the cofactor which it contains. Unfortunately, there are insufficient PAS domains characterized at the biochemical level to make any definitive correlation. The NifL PAS fold (amino acid residues 36–144) from A. vinelandii binds FAD as cofactor [41]. The best template was 1G28 (Table 2), a FMN binding PAS fold protein. The second PAS fold in this protein (amino acid residues 162–268) gives the best model when using the heme containing FixL X-ray structure 1DRM (Table 2). There is some indication that this domain indeed binds heme (V. Colombo, R. Little and R. Dixon, unpublished results).

PAC-annotated sequences

Eight protein sequences from A. thaliana, E. coli, and C. elegans do not contain a PAS domain but only a PAC motif according to PFAM. All eight sequences yielded reliable models, judged by their ProsaII z-scores (Table 3). For example, the E. coli aerotaxis receptor (P50466) is described as containing a PAS domain by Ponting and coworkers [2,3], although it is not annotated as such in the PFAM database. This protein has FAD as cofactor [42].

The two C. elegans sequences listed in Table 3 were derived from different strains, and differ only in one amino acid residue. This mutation is not in the PAS fold region, and therefore both protein sequences gave identical results. The 3D models were very reliable over the complete PAS fold sequence length. More examples of sequences that are (almost) identical are present in the PFAM PAS database (for instance the C. elegans sequences O02219 and O44711).

Discussion

In the PFAM database there are amino acid sequences of almost 1000 PAS domains representative of all kingdoms of life. However, structural analysis of PAS domains in the PDB database clearly demonstrates that the PAS and PAC motifs split the five-stranded β-sheet into two sections. The PAS and PAC motifs are connected through a loop region, which was recently suggested to be important for the intrinsic function of PAS domain-containing proteins. It is evident from the large-scale modelling studies presented here that the PAS and PAC motifs are inseparable and together give rise to a structural fold. In order to avoid confusion in protein annotation, it is important to define the sequence requirements for a given protein fold. We propose to define the complete β-pleated α-helical structure observed in the prototype structures of the PYP, FixL, human PAS kinase, HERG, and PHY3 proteins as the PAS fold. For comparison of proteins it is necessary to abandon the commonly used annotations S1/S2 [2], PAS-A/PAS-B [43,44], LOV domain [8,45], and PAS domain/PAC motif [3], which are now in use to specify sequence similarities. Unfortunately, in recent years the meaning of the term ‘PAS domain’ has evolved. We favour the use of the term ‘PAS fold’ for referring to proteins sharing the PAS structural element, although the commonly used sequence-based annotations provide the researcher with a powerful tool to detect different regions within the PAS fold.

For the large-scale homology studies, the existing PFAM PAS domain alignment was extended C-terminally by 50 amino acids in order to include the neighbouring PAC motif. Because we base our modelling conclusions on the ProsaII z-score, we calculated the z-scores for the six structures of the PAS domain proteins present in the PDB database.
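The C-terminal extension described above can be illustrated with a minimal sketch. It assumes 1-based inclusive domain coordinates; the function name `extend_c_terminal` and the example coordinates are hypothetical and are not taken from the original workflow.

```python
def extend_c_terminal(start, end, seq_len, extension=50):
    """Extend a domain region towards the C-terminus.

    Coordinates are assumed to be 1-based and inclusive. The new end is
    clipped to the sequence length, so a domain close to the C-terminus
    is extended by fewer than `extension` residues.
    """
    if not (1 <= start <= end <= seq_len):
        raise ValueError("expected 1 <= start <= end <= seq_len")
    if extension < 0:
        raise ValueError("extension must be non-negative")
    return start, min(end + extension, seq_len)


# Hypothetical PAS annotation at residues 120-185 of a 400-residue protein.
print(extend_c_terminal(120, 185, 400))  # (120, 235)

# Same annotation in a 200-residue protein: clipped at the C-terminus.
print(extend_c_terminal(120, 185, 200))  # (120, 200)
```

Clipping rather than raising an error keeps short sequences in the data set, at the cost of a shorter modelled region for those entries.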

Furthermore, we have modelled the sequences of all six template structures against each other. The resulting models were all of good quality, based upon their z-scores (ranging from −3.82 to −7.85). 1LL8 is the only structure based upon NMR studies, and only 2.2% of the best models used 1LL8 as template structure. The z-scores of the models built on the NMR structure as template are significantly less negative (ranging from −2.25 to −4.31) than those of the models built on the X-ray structure templates, and it is possible that NMR structures are less suitable for fold recognition.
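Selecting the best model per sequence and tallying template usage, as in the 2.2% figure above, might be sketched as follows. The sketch assumes that the more negative z-score marks the better model; the function names, the placeholder template identifiers `TEMPLATE_A` and `TEMPLATE_B`, and all scores are hypothetical illustrations, not data from this study.

```python
from collections import Counter


def best_models(scores):
    """Pick the best model per sequence.

    `scores` is an iterable of (sequence_id, template_id, z_score) tuples.
    The model with the most negative z-score is kept for each sequence
    (assumption: more negative means more native-like).
    """
    best = {}
    for seq_id, template, z in scores:
        if seq_id not in best or z < best[seq_id][1]:
            best[seq_id] = (template, z)
    return best


def template_share(best):
    """Return the percentage of best models contributed by each template."""
    counts = Counter(template for template, _ in best.values())
    total = sum(counts.values())
    return {template: 100.0 * n / total for template, n in counts.items()}


# Hypothetical z-scores for two sequences modelled on several templates.
scores = [
    ("seqA", "1LL8", -3.0),
    ("seqA", "TEMPLATE_A", -6.5),
    ("seqB", "1LL8", -4.1),
    ("seqB", "TEMPLATE_B", -3.9),
]
best = best_models(scores)
print(best["seqA"])          # ('TEMPLATE_A', -6.5)
print(template_share(best))  # {'TEMPLATE_A': 50.0, '1LL8': 50.0}
```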

Our studies show that sequence comparison is a useful tool, but in isolation is no longer sufficient to annotate newly discovered protein sequences as having a PAS domain. The modelling studies also give considerable insight into this intriguing family of sensory proteins, as 30% of the PAS domains annotated in the PFAM database are unlikely to share the ‘PAS fold’ as defined in this article. After realignment of PAS-annotated protein sequences from four model organisms, some 3D models improved in quality, while others did not. Structure-based realignment (using Align-2D) can help to improve sequence alignments, but is not always successful. For the four organisms studied extensively, the drop-out percentage for bad models decreased significantly, from 21% to 12% (Fig. 2). To date, 3D structures of eight different PAS proteins have been elucidated. When more structures of PAS fold-containing proteins become available, it will be possible to divide these proteins into several subclasses, depending upon template structure or cofactor.
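The drop-out percentage quoted above is the share of models classified as bad. A minimal sketch, assuming a model counts as bad when its z-score is less negative than a chosen cutoff; the cutoff value and the z-scores shown are hypothetical and do not reproduce the criterion or the data used in this study.

```python
def dropout_percentage(z_scores, cutoff):
    """Percentage of models whose z-score is worse than `cutoff`.

    Assumption: a z-score greater (less negative) than the cutoff marks a
    bad model. The cutoff is supplied by the caller.
    """
    if not z_scores:
        raise ValueError("z_scores must not be empty")
    bad = sum(1 for z in z_scores if z > cutoff)
    return 100.0 * bad / len(z_scores)


# Hypothetical z-scores for the same five sequences before and after
# structure-based realignment; one model improves past the cutoff.
before = [-6.1, -4.8, -1.2, -5.5, -0.9]
after = [-6.1, -4.8, -3.1, -5.5, -0.9]

print(dropout_percentage(before, cutoff=-2.0))  # 40.0
print(dropout_percentage(after, cutoff=-2.0))   # 20.0
```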

The PAS fold represents an important sensory domain present in all kingdoms of life [2], and in the PFAM database some proteins appear to have more than one PAS domain. It is therefore possible that such proteins utilise cofactors in multiple PAS domains to integrate different environmental signals. There are of course precedents: enzymes that contain two flavin cofactors [46,47], or both flavin and heme [48,49], though they do not contain a PAS fold.

All models of sequences from the four organisms used in the case study, which had a PFAM PAS domain annotation, had reliable z-scores, even if, according to PFAM, no PAC motif was present. We extended the region C-terminal to the PAS domain to include any PAC motif present, whether annotated or not. Remarkably, all models of sequences with only a PFAM PAC motif annotation had good z-scores as well. This stresses the importance of better annotation of the PAS fold, based upon structural information rather than sequence information. Annotation of protein sequences by domain analysis tools such as PFAM and SMART is based upon sequence homology and HMM profiles. These facilities are of great benefit in the recognition of domain homologues and for assigning potential function to proteins. However, when proteins have only limited sequence similarity (as is the case for the PFAM PAC motifs), annotation of these motifs is difficult even when using HMMs. We show here that large-scale homology modelling can be a very useful complement to HMM-based sequence annotation for defining structural folds. With the rapid increase in structures present in the PDB database, annotation of sequences based upon structural homology is likely to become more important.
