• Bioinformatics;
  • Escherichia coli;
  • Prediction;
  • Protein expression;
  • Protein solubility;
  • Wheat germ cell-free

Recombinant protein technology is essential for conducting protein science and using proteins as materials in pharmaceutical or industrial applications. Although obtaining soluble proteins is still a major experimental obstacle, knowledge about protein expression/solubility under standard conditions may increase the efficiency and reduce the cost of proteomics studies.

In this study, we present a computational approach to estimate the probability of protein expression and solubility for two different protein expression systems: in vivo Escherichia coli and wheat germ cell-free, from only the sequence information. It implements two kinds of methods: a sequence/predicted structural property-based method that uses both the sequence and predicted structural features, and a sequence pattern-based method that utilizes the occurrence frequencies of sequence patterns. In the benchmark test, the proposed methods obtained F-scores of around 70%, and outperformed publicly available servers. Applying the proposed methods to genomic data revealed that proteins associated with translation or transcription have a strong tendency to be expressed as soluble proteins by the in vivo E. coli expression system. The sequence pattern-based method also has the potential to indicate a candidate region for modification, to increase protein solubility. All methods are available for free at the ESPRESSO server (