Random, de novo, and conserved proteins: How structure and disorder predictors perform differently

Understanding the emergence and structural characteristics of de novo and random proteins is crucial for unraveling protein evolution and designing novel enzymes. However, experimental determination of their structures remains challenging. Recent advancements in protein structure prediction, particularly with AlphaFold2 (AF2), have expanded our knowledge of protein structures, but their applicability to de novo and random proteins is unclear. In this study, we investigate the structural predictions and confidence scores of AF2 and protein language model‐based predictor ESMFold for de novo and conserved proteins from Drosophila and a dataset of comparable random proteins. We find that the structural predictions for de novo and random proteins differ significantly from conserved proteins. Interestingly, a positive correlation between disorder and confidence scores (pLDDT) is observed for de novo and random proteins, in contrast to the negative correlation observed for conserved proteins. Furthermore, the performance of structure predictors for de novo and random proteins is hampered by the lack of sequence identity. We also observe fluctuating median predicted disorder among different sequence length quartiles for random proteins, suggesting an influence of sequence length on disorder predictions. In conclusion, while structure predictors provide initial insights into the structural composition of de novo and random proteins, their accuracy and applicability to such proteins remain limited. Experimental determination of their structures is necessary for a comprehensive understanding. The positive correlation between disorder and pLDDT could imply a potential for conditional folding and transient binding interactions of de novo and random proteins.

sequence space from evolutionarily conserved proteins and might instead resemble unevolved random proteins. 2][13][14][15][16][17] De novo proteins, particularly those of more recent origin, have been found to differ from evolutionarily conserved proteins and instead resemble random proteins, as they have not yet been subjected to natural selection. 8,9,13 Libraries of random sequences constrained only by a close-to-natural distribution of amino acids have been shown to form secondary structural elements. 10,18 Earlier, single proteins selected from random libraries had already been shown to exhibit chemical activities. 19,20 [28][29] Advanced machine-learning-based structure predictors, most prominently AlphaFold2 (AF2), 30 might, at least in theory, circumvent such experimental hurdles and enable structural analysis of random and de novo proteins in silico. This could help to detect novel folds, explore sequence and structure space, and provide novel templates for protein engineering. Despite its unprecedented advancement, AF2 comes with certain caveats for random and de novo proteins. AF2 relies on co-evolutionary data obtained from multiple sequence alignments (MSAs). These MSAs are, by definition, shallow for both de novo and random proteins. 30 Additionally, the aforementioned short length of de novo proteins and the high disorder of both random and de novo proteins pose additional obstacles to structure prediction by AF2. 29 The presumably abundant disordered regions are flexible in space, while predictions based on the co-evolution of residues require amino acids to be in positions in a sequence that correlate to a fixed position in structure. 31 As an alignment-free alternative, protein language model (pLM) based predictors have been considered to overcome the hurdles of predicting structures of proteins lacking homology, as in the case of de novo and random proteins. 
26,32 Such pLMs learn to recognize sequence architectures in proteins and their relation to structures without the need for an MSA. This process is reminiscent of learning grammar and building whole words and sentences from just the pattern of appearance of letters. 32,33 A significant benefit of pLM-based predictions lies in their lower computational cost and faster processing speed compared to AF2. However, to ensure reliable structure predictions using pLM-based programs, it is essential that the sequences in the training sets are sufficiently close in sequence space to de novo and random proteins. 29 ESMFold, which combines the final structure module of AF2 and Evolutionary Scale Modeling (ESM-2), performs high-speed predictions and is therefore most applicable for predictions of large datasets of proteins with limited homology. 34,356][37] Being based on the structure module of AF2, all modern structure predictors provide a per-residue confidence score established with AF2: pLDDT (predicted local distance difference test). 30,38 To calculate the pLDDT score, structure predictors use a neural network trained on predicted structures scored with per-residue lDDT-Cα against ground truth structures. Only high-resolution structures (0.1-3.0 Å) were used for this training, and no NMR structures. 30 Structure predictors provide the pLDDT from the features of the prediction itself, such as the predicted distances and angles between residues. The pLDDT score indicates the agreement between the prediction and the consensus structure obtained from multiple models of the training set rather than any particular solved structure. 30 [46][47] Accordingly, pLDDT has been used for large-scale studies of structural conformations. 
39,42,43,48,49 Since de novo and random proteins are considered to be less compact, while containing secondary elements, and more disordered, 8,10,18 not only structure predictors but also appropriate disorder predictors are essential for computational analyses of their conformations. 29 The most widely used disorder predictor is Iupred, 50 which is based on energy estimations for each amino acid independently and within the context of its neighboring residues. It is important to note here that such energy estimations are not based on data from disordered proteins but instead on contacts between residues in experimentally resolved globular protein structures. Consequently, Iupred provides, for each residue, a probability of being disorder-promoting. 50 According to the Critical Assessment of protein Intrinsic Disorder prediction, 51 the deep-learning-based flDPnn outperforms Iupred and other disorder predictors in computing time and accuracy. flDPnn has also been considered to be more appropriate for random and de novo proteins since it is not based on evolutionary data. 29,52 While other studies have focused on comparing different protein structure predictors for single de novo proteins and IDPs 29 or single orphan proteins in comparison to selected random proteins, 52 we here focus on the correlations between different predictions, their confidence, and biophysical properties on a larger scale.
For this purpose, we used a dataset of de novo proteins from Drosophila, 21 length-matched conserved proteins from the Drosophila proteome, and random proteins generated in silico matching the length and amino acid distribution of the de novo protein dataset. We conducted protein structure predictions with AF2 and ESMFold 53 and annotated secondary elements using DSSP. 54 Disorder predictions were performed with both flDPnn 55 and Iupred3 (long). 50 While disorder correlates negatively with pLDDT for conserved proteins and IDPs, as generally assumed, we found positive correlations of disorder and α-helices with pLDDT for random and de novo proteins. In contrast, pLDDT correlates negatively with predictions of β-sheets in random and de novo protein sequences. Additionally, we quantify how MSA depth and pLDDT for de novo and random proteins vary with length, as proposed in Monzon et al. 56 for singletons. Shorter de novo and random proteins show higher pLDDT per residue with lower MSA depth. Analyzing disorder and pLDDT per amino acid type, we discovered that de novo and conserved proteins exhibit a similar distribution. Surprisingly, the distribution of disorder per amino acid type predicted by flDPnn is uniform in random proteins. Our findings contradict the general notion of a negative correlation between disorder and pLDDT in the case of random and de novo proteins.

| Dataset curation
A total of 6716 orphan protein sequences from Drosophila and their evolutionary age were obtained from Heames et al. 21 Duplicated entries with the same FlyBase ID and sequence were removed from the initial dataset. Only sequences whose mechanisms of emergence were identified as "denovo" or "denovo-intron" in Heames et al. 21 were selected for analysis. This resulted in 2510 de novo protein sequences. Out of these 2510 proteins, 1481 were annotated as "denovo," while 1029 were described as "denovo-intron." Based on their date of emergence, the de novo proteins were divided into young (5 mya), intermediate (5-30 mya), and old (>30 mya) proteins.
The three groups comprised 2205, 110, and 195 proteins, respectively. Random sequences were generated to match the amino acid distribution and sequence length of the de novo set with the build_oligos.py script provided by Heames et al. 10 In brief, one random sequence was generated for each protein in the de novo dataset, matching the length of the respective sequence. The sampling of the amino acids was based on the amino acid frequency found in the de novo dataset, with the constraint that the first amino acid of the sequence was set to methionine. With this procedure, 2507 unique sequences were generated.
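The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the build_oligos.py script itself: the function names are hypothetical, and whether the original script includes the leading methionine when estimating amino acid frequencies is not stated in the text.

```python
import random
from collections import Counter


def amino_acid_frequencies(sequences):
    """Relative frequency of each amino acid over a set of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def random_sequence(length, frequencies, rng):
    """One random sequence of the given length, starting with methionine."""
    if length < 1:
        raise ValueError("length must be at least 1")
    residues = sorted(frequencies)
    weights = [frequencies[aa] for aa in residues]
    # The first residue is fixed to M; the rest are drawn by frequency.
    body = rng.choices(residues, weights=weights, k=length - 1)
    return "M" + "".join(body)


def random_library(de_novo_sequences, seed=0):
    """One length-matched random sequence per de novo sequence."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    freqs = amino_acid_frequencies(de_novo_sequences)
    return [random_sequence(len(s), freqs, rng) for s in de_novo_sequences]
```

Removing duplicates from the returned list afterwards would correspond to keeping only unique sequences.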
A set of conserved proteins with the same sequence length distribution as the de novo proteins was randomly selected from the combined proteome of 12 Drosophila species. To this end, the combined proteome was filtered to match the sequence length of a respective de novo protein. Duplicates and de novo proteins were removed, resulting in 2235 unique conserved proteins. Intrinsically disordered proteins (IDPs) were selected from the DisProt database 57 (Release 12/2022). Proteins were included if a predicted structure was available in the AlphaFold Protein Structure Database. 58 In total, 2205 unique IDPs were selected. Datasets are available on Zenodo: https://doi.org/10.5281/zenodo.7976051.

| Structure predictions
Structural predictions were performed using AlphaFold v2.1.1 on the High-Performance Computing Cluster PALMA II (University of Muenster). Predictions with the highest mean pLDDT were selected. AF2 predictions of DisProt proteins and conserved Drosophila proteins were downloaded from the AlphaFold Protein Structure Database 58 for initial analysis. ESMFold 53 predictions were performed using Google Colab with a Fujitsu M520 running overnight on an RM5F, except for DisProt proteins: ESMFold on Google Colab.

| Prediction and annotation of structural features
Predicted structures of AF2 and ESMFold were used for annotation of secondary elements with DSSP (v3.0). 54 The respective secondary element proportion was calculated by dividing the number of residues with a DSSP annotation of "H" for α-helices, or "E" for β-sheets, by the total number of residues of the sequence. Analogously, the secondary structure of the three datasets was predicted with SPIDER3-Single, 59 and residues annotated with "H" and "E" were used for the calculation of the secondary element proportions. For the prediction of disordered residues, Iupred3 (long) 50 and flDPnn (default settings) 55 were used. Residues were considered disordered if their score was ≥ 0.5. Disorder fraction was calculated by dividing the number of disordered residues by the total sequence length.
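The two proportions defined above reduce to simple counts. The following sketch assumes that the DSSP annotation is available as one string with one character per residue and that the disorder predictor output is a list of per-residue scores; both representations and the function names are assumptions made for illustration.

```python
def secondary_fraction(dssp_string, code):
    """Fraction of residues annotated with `code` ("H" or "E")."""
    if not dssp_string:
        raise ValueError("empty annotation")
    return dssp_string.count(code) / len(dssp_string)


def disorder_fraction(scores, threshold=0.5):
    """Fraction of residues whose disorder score is >= threshold."""
    if not scores:
        raise ValueError("empty score list")
    disordered = sum(1 for s in scores if s >= threshold)
    return disordered / len(scores)
```

For an annotation such as "HHHH--EE", the helix fraction is 4/8 and the sheet fraction is 2/8.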

| MSA depth calculation
The MSA depth was calculated from the feature input file generated by AF2. For each residue position of the input sequences, the number of non-gap residues in the alignment was counted.
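A per-position depth count of this kind can be sketched as below. The sketch assumes the alignment has already been decoded from the feature file into equal-length strings with "-" as the gap character; that input format and the function name are assumptions, not a description of the feature file layout.

```python
def msa_depth_per_position(aligned_rows, gap="-"):
    """Number of non-gap residues in each alignment column."""
    if not aligned_rows:
        return []
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all alignment rows must have the same length")
    return [
        sum(1 for row in aligned_rows if row[i] != gap)
        for i in range(width)
    ]
```

Whether the query sequence itself is counted as one row of the alignment changes every depth value by one, so that choice should be kept consistent across datasets.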

| Predicted structural differences between random, de novo, and conserved proteins
Due to the notion that structure predictions of random and de novo proteins are less confident than for conserved proteins, 29,51 we performed AF2 predictions on three sets: 2510 de novo proteins from the Drosophila clade identified in Heames et al., 21 random sequences matched in amino acid frequency and length (total 2507), and length-matched conserved proteins from the combined Drosophila proteome (total 2235). As expected, we detected a significantly lower pLDDT for random and de novo proteins in comparison to conserved proteins (Figure 1A). Disorder predictions using flDPnn for those three sets of proteins also confirmed that random and de novo proteins are predicted to be more disordered than conserved proteins and follow the assumed inverse relationship of low pLDDT and high disorder (Figure 1B). However, de novo proteins were predicted to be more disordered than random sequences, despite having a higher mean pLDDT (Figure 1A, B). When annotating secondary elements in AF2 predictions, we found that both random and de novo proteins contain secondary elements, especially α-helices, although to a lesser extent than conserved proteins (Figure 1C, D).
After all, these predictions of disorder and annotations of secondary elements are in accordance with experimental observations on random and de novo proteins. 10,18 When comparing the secondary elements predicted by AF2 to secondary structure predictions by SPIDER3, 67 it is apparent that for all sets of proteins, the prediction of β-sheets differs significantly between the two programs. Additionally, only for random proteins, the predictions of α-helices differ significantly between AF2 and SPIDER3 (Figure S1A).
For random and de novo proteins, pLDDT correlated negatively with β-sheet predictions but positively with α-helices and intrinsic disorder predicted by flDPnn (Figure 2A). As mentioned before, it is assumed that pLDDT correlates positively with secondary elements but negatively with disorder. For conserved proteins, we could confirm this general notion of a positive correlation of pLDDT with secondary elements and a negative correlation with disorder (Figure 2A). We repeated our predictions and analysis with ESMFold and Iupred3 (long) and found that the correlations persisted (Figures 2B and S2A, D). Investigating correlations between residues annotated as β-sheets by AF2 and ESMFold, we found a significant difference in the case of random and de novo proteins, but not for conserved proteins (Figure S2E). Therefore, while the correlation between β-sheets and pLDDT is negative for both AF2 and ESMFold, the two programs annotated different fractions of residues as β-sheets (Figure S2E). Correlations of α-helices are near perfect between AF2 and ESMFold for all three sets of proteins (Figure S2E). Structural alignments between predictions of the same protein sequence by AF2 and ESMFold have low similarity in most cases, and annotations of β-sheets differ (Figure S2B, C). As an additional control, we performed predictions and calculated correlations for confirmed disordered proteins from the DisProt database and confirmed the reported negative correlation between disorder and pLDDT (Figure 2A). These results indicate that the general assumption of a negative correlation between pLDDT and disorder does not apply to random and de novo proteins. In the case of those sets of proteins, pLDDT correlates positively with disorder and α-helices but negatively with β-sheets, both for AF2 and ESMFold.
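For readers who want to reproduce this kind of rank correlation between per-protein values (for example mean pLDDT against disorder fraction), a dependency-free Spearman coefficient can be sketched as below. It is an illustrative implementation with hypothetical function names, computing Pearson's coefficient on average ranks; it is not the analysis code used in the study and returns no p-value.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A positive value for disorder fraction against mean pLDDT corresponds to the pattern reported here for random and de novo proteins, and a negative value to the pattern reported for conserved proteins and IDPs.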

| Influence of MSA depth and sequence length on pLDDT and disorder
Since AF2 is based on co-evolutionary information of residues extracted from the MSA, a deeper MSA should result in a higher pLDDT. 30,55 When AF2 searches for sequence homology to the query, one would thus expect a deeper MSA per residue for longer sequences, since these tend to align to more sequences than shorter sequences. Contrary to this, it has been noted that shorter sequences can display a higher pLDDT, despite having a lower MSA depth. 56 Therefore, we examined how MSA depth, pLDDT, and disorder relate to sequence length quartiles at the level of each residue (Figure S3). By definition, random and de novo proteins produce a shallow MSA, which can also be seen in the MSAs generated by AF2 (Figure 3A, B). As expected, the MSA depth for random sequences is lower than for de novo proteins. The median pLDDT for random sequences decreases with increasing sequence length, while the median predicted disorder fluctuates between length quartiles, being highest for the shortest sequences and lowest for the second length quartile (Figure 3A). For de novo proteins, we observe that, as expected, MSA depth increases with sequence length (Figure 3B).
However, with increasing sequence length, the pLDDT decreases, despite the deeper MSA. As mentioned before, this decrease in pLDDT with increasing sequence length is also observable for random sequences. On the other hand, with increasing sequence length, de novo proteins display a steady decrease in predicted disorder, unlike the varying disorder in random sequences of different lengths. For conserved proteins, the MSA depth increases with sequence length, and accordingly the pLDDT increases as well. Conversely, the predicted disorder decreases with increasing sequence length, which reflects the negative correlation between pLDDT and disorder, independent of sequence length or MSA depth.
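The quartile-wise comparison used in this section can be sketched as follows. The sketch assumes that each value (a per-residue pLDDT, disorder score, or MSA depth) is paired with the length of the sequence it comes from, and that quartile boundaries are taken from the default method of `statistics.quantiles`; the exact quartile definition used in the study is not stated, and the function names are hypothetical.

```python
import statistics


def length_quartile(length, cut_points):
    """Quartile index (1-4) of a sequence length given three cut points."""
    for index, cut in enumerate(cut_points, start=1):
        if length <= cut:
            return index
    return 4


def median_by_length_quartile(lengths, values):
    """Median of `values` within each sequence length quartile."""
    if len(lengths) != len(values):
        raise ValueError("lengths and values must be paired")
    cut_points = statistics.quantiles(lengths, n=4)
    groups = {1: [], 2: [], 3: [], 4: []}
    for length, value in zip(lengths, values):
        groups[length_quartile(length, cut_points)].append(value)
    # Skip empty quartiles, which can occur with many identical lengths.
    return {q: statistics.median(v) for q, v in groups.items() if v}
```

Comparing the returned medians across quartiles for pLDDT, disorder, and MSA depth gives the kind of summary shown in Figure 3.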

| For random sequences, pLDDT and disorder do not change with amino acid type
Analyzing disorder values and pLDDT for each amino acid type, we observe a uniform distribution for random sequences. In particular, the disorder values of random sequences are not influenced by the type of amino acid (Figure 4). In contrast, for de novo and conserved proteins, both disorder and pLDDT vary depending on the type of amino acid, and their distributions are similar (Figure 4). As expected, more hydrophobic amino acids (C, F, I, L, V, W, and Y) display lower disorder values. Even though the difference in the amino acid frequency between the de novo, random, and conserved datasets is statistically significant (Kruskal-Wallis test: p < .05), the effect size of these differences is small (η² < 0.1), indicating that they are negligible for the analysis (Figure S4).
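Grouping per-residue values by amino acid type, as done for this comparison, can be sketched as below. The inputs (sequences paired with equally long lists of per-residue scores such as pLDDT or disorder) and the function name are assumptions for illustration; the study reports distributions per amino acid type, whereas this sketch only returns the mean.

```python
from collections import defaultdict


def mean_score_per_amino_acid(sequences, per_residue_scores):
    """Mean per-residue score for each amino acid type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sequence, scores in zip(sequences, per_residue_scores):
        if len(sequence) != len(scores):
            raise ValueError("one score per residue is required")
        for residue, score in zip(sequence, scores):
            totals[residue] += score
            counts[residue] += 1
    return {aa: totals[aa] / counts[aa] for aa in totals}
```

A flat profile across the 20 amino acid types would correspond to the uniform behaviour described for random sequences.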

| Older de novo proteins are longer, have lower pLDDT, and decreased disorder
When de novo proteins are categorized into their respective age groups (emerged 5 mya, 5-30 mya, >30 mya) 21 (Figure 5D), both pLDDT and disorder decrease from younger toward older de novo proteins (Figure 5A). Between younger and intermediate de novo proteins, the disparity in pLDDT is lower than between intermediate and older de novo proteins. The oldest group of de novo proteins also contains most of the longer sequences (Figure 5B, C) and thereby follows the trend already seen for increasing sequence length, which leads to a decrease in both pLDDT and disorder (Figure 3B).

| DISCUSSION
De novo emerged proteins and proteins from random sequences could provide novel structures and new insights into protein structure evolution. Novel protein folds with chemical activities could contribute to a broader range of scaffolds to design new enzymes. Furthermore, proteins without ancestry, such as de novo and random proteins, could give us a glimpse into how the first proteins emerged.
However, completely experimentally determined structures of de novo emerged proteins have not been realized yet. For proteins selected from random libraries, the structures of two nucleoside binding proteins have been experimentally solved and revealed α/β-folds, 68,69 one of the most frequently observed folds in nature. 70 The recent advancement of protein structure prediction has dramatically increased our knowledge of protein structures and will soon cover structure predictions of every known protein. 58,71 These data have already been harnessed to annotate and cluster protein structure families, identify new folds, and find putative de novo proteins. 72,73 Additionally, the pLDDT metric provided by modern structure predictors has been leveraged to evaluate structural heterogeneity and disorder. 44,46,48 We here conclude that such analyses are limited for de novo and random proteins, and that their results from structure predictions differ highly from those of conserved proteins. We find that fractions of β-sheets and pLDDT correlate negatively for de novo and random proteins, instead of a negative correlation of predicted disorder and pLDDT as noted for conserved proteins (Figure 2). 39,48,49 In light of the notion that de novo and random proteins are predicted to be more disordered, we repeated the analysis with experimentally confirmed disordered proteins, which displayed the same correlations as conserved proteins. Therefore, we can rule out that the positive correlation of pLDDT and disorder for de novo and random proteins is due to genuine, confident, ribbon-like disorder predictions by AF2.
Other studies on conserved disordered proteins have found that some ribbon-like predictions by AF2, indicative of disorder, indeed exhibit a relatively high pLDDT and imply the potential of such disordered regions to undergo conditional folding and involvement in binding interactions. 41,74,75Such conditional folding and binding interactions could be a first step in the trajectory of de novo proteins to be integrated into the cellular network and become adaptive.
In line with our analysis, a recent study by Peng and Zhao 76 further corroborates that many high-pLDDT predictions for Drosophila de novo proteins exhibit instability during molecular dynamics simulations under realistic conditions. This finding lends additional support to the notion that several regions characterized by high pLDDT are also predicted to be disordered. Also, the problems of AF2 with random and de novo sequences have been mainly attributed to a lack of sequence identity of such proteins to others. We show here that ESMFold, based on a pLM, has the same issues as AF2, possibly due to a lack of training on sequences similar to random and de novo proteins. Structures of random and de novo sequences will only be predicted with accuracy by pLM-based programs if sequences covered during training are close enough in sequence space to random and de novo proteins. 29 As noted by Monzon et al. 56 for proteins from AntiFam, 77 we also found for de novo and random proteins that shorter sequences result in a higher pLDDT per residue, independently of MSA depth. With decreasing pLDDT over a longer sequence length, we also see a decrease in disorder for de novo proteins, contradicting the inverse relationship of those two parameters. In the case of smaller proteins, the effective MSA, reflecting rare but relevant sequence identity, might lead to higher pLDDT. Also, structure predictors might be able to find the local energy minima of such small proteins while not initially being trained on biophysical data. 26,30 The generation of MSAs for random proteins by AF2 can be attributed to the utilization of JackHMMer and HHblits. 
30 These tools are adept at identifying short sequence motifs, which may occur in both random and conserved proteins, while not necessarily implying homology. For random sequences, we see that predicted disorder varies between sequence length quartiles. It is known that the accuracy of flDPnn varies depending on sequence length, and it is most accurate for longer IDPs. 51,75 This length-dependent accuracy problem seems to become more pronounced for random proteins (Figure 3). We have shown here that large-scale analysis of predicted structures has many pitfalls for random and de novo proteins, making them a particular case for such studies. Also, the high pLDDT values in predicted disordered regions of random and de novo proteins could indicate their potential for conditional folding, binding, and thereby integration into the cellular network. Seeing higher disorder for younger de novo proteins could imply a general evolutionary trajectory starting from disordered toward more globular structures. The combination of a dedicated disorder predictor and pLDDT scoring from structure predictors unveils high-confidence predictions pertaining to globular structures, which exhibit potential instability and significant disorder characteristics.
Consequently, when dealing with random and de novo proteins, the use of disorder and structure predictors combined with molecular dynamics simulation 76 becomes imperative for a comprehensive assessment of their structural properties. It is essential to note that our research exclusively focused on de novo proteins derived from Drosophila. To establish the broader applicability of our findings for de novo proteins in general, further investigations involving de novo proteins from diverse clades are warranted. However, since research on de novo proteins has found most of their features to be universal, 78 future studies should be able to replicate our results.
Whether structure predictors are reliable on de novo and random proteins can only be sufficiently answered with experimentally determined structures at hand.

3.2 | For random and de novo proteins, pLDDT correlates negatively with β-sheets and positively with α-helices and disorder
Intrigued by the observation that AF2 predictions of de novo proteins have higher pLDDT values than those of random sequences despite being predicted to be more disordered, we calculated the Spearman correlations of pLDDT and secondary elements (DSSP) and disorder (flDPnn) for our sets of proteins. We find that for random and de novo

FIGURE 1 Distribution of pLDDT, disorder, and DSSP annotated secondary elements for random, de novo, and conserved proteins. (A) Distribution of mean pLDDT from AF2 predictions for random, de novo, and conserved proteins. Mean pLDDT is higher for conserved proteins than for random and de novo. (B) Distribution of disorder from flDPnn. Fraction disorder of conserved proteins is predicted to be lower. (C) Distribution of α-helices in AF2 predictions. Conserved proteins are more abundant in α-helices. (D) Distribution of β-sheets in AF2 predictions. Significant differences are indicated by **** (p-value < .0001). The box plots visualize quartile values of the data with white dots representing the median.

FIGURE 3 Correlations between sequence length, MSA depth, pLDDT, and disorder for random, de novo, and conserved proteins. (A) MSA depth, pLDDT, and disorder per sequence length quartile for random sequences. Fourth length quartile sequences show the lowest pLDDT, while the first length quartile has the highest predicted disorder. (B) MSA depth, pLDDT, and disorder per sequence length quartile for de novo proteins. Fourth length quartile sequences show the deepest MSA but the lowest pLDDT and lowest disorder. (C) MSA depth, pLDDT, and disorder per sequence length quartile for conserved proteins. pLDDT and MSA depth increase with longer sequence length, while disorder decreases with length. Groups annotated with different letters show statistically significant differences (p-value < .05). The box plots visualize quartile values of the data with white dots representing the median.

FIGURE 4 pLDDT (AF2) and disorder (flDPnn) per amino acid type in random, de novo, and conserved proteins. (A) pLDDT of amino acid types in random, de novo, and conserved proteins. Differences between pLDDT per amino acid type are similar for random and de novo proteins but more distinct for conserved proteins. (B) Disorder predicted by flDPnn for each amino acid type in random, de novo, and conserved proteins. While for random sequences the distribution is quite uniform, for de novo and conserved proteins hydrophobic amino acids exhibit lower disorder values.

FIGURE 5 Correlations between sequence length, MSA depth, pLDDT, and disorder for different age groups of de novo proteins. (A) MSA depth, per-residue pLDDT, and per-residue disorder for each age group of de novo proteins. Older de novo proteins exhibit a deeper MSA, lower mean pLDDT, and a decrease in disorder. (B) Mean sequence length of different age groups of de novo proteins. Older de novo proteins are longer than younger ones. (C) Distribution of age groups per sequence length quartile. The majority of older de novo proteins fit into the fourth quartile together with the lowest number of young sequences. (D) Phylogenetic tree of the Drosophila clade. Dashed boxes indicate age groups (emerged 5 mya, 5-30 mya, >30 mya). Groups annotated with different letters show statistically significant differences (p-value < .05). Asterisks display significant differences between groups (*: p-value < .05; ****: p-value < .0001). The box plots visualize quartile values of the data with white dots representing the median.
Categorizing de novo proteins into three age groups, we find that older de novo proteins have a deeper MSA than intermediate and younger ones. The deeper MSA can be explained by orthologs in several species for older de novo proteins, which is the basis of their age grouping. The oldest age group also displays the lowest pLDDT and lowest disorder. Once more, longer sequences have a lower pLDDT, and pLDDT is not negatively correlated with disorder for de novo proteins.

Structure predictions can give a first glimpse at the possible structural composition of de novo and random proteins, but it is difficult to decide which predictions are accurate. Whereas high pLDDT predictions of such proteins likely give results comparable to structural approximations,

While random sequences are not part of any program's training set, one would assume to see differences based on the distinct biophysical properties of each amino acid, especially since the set of random proteins matches the amino acid distribution of the de novo proteins. For de novo and conserved proteins, amino acid types resulting in higher pLDDT also give lower values of predicted disorder. The distribution of de novo and conserved proteins is similar, indicating that on the amino acid type level, de novo proteins are closer to conserved ones than to random proteins for AF2 and flDPnn. This could be due to a training set that is closer in sequence space to de novo proteins than to the set of random proteins generated in this study. 26,29

AUTHOR CONTRIBUTIONS
Lasse Middendorf: Investigation; conceptualization; writing - review and editing; visualization; validation; methodology; software; formal analysis; data curation. Lars A. Eicholt: Conceptualization; writing - original draft; visualization; validation; writing - review and editing; supervision; data curation; resources; project administration; methodology; investigation; software; formal analysis.