Prediction of protein solubility in Escherichia coli using logistic regression

Authors

  • Armando A. Diaz,

    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813
  • Emanuele Tomba,

    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813
  • Reese Lennarson,

    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813
  • Rex Richard,

    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813
  • Miguel J. Bagajewicz,

    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813
  • Roger G. Harrison

    Corresponding author
    1. School of Chemical, Biological and Materials Engineering, University of Oklahoma, 100 E. Boyd St., Room T-335, Norman, Oklahoma 73019; telephone: 405-325-4367; fax: 405-325-5813

Abstract

In this article we present a new and more accurate model for the prediction of the solubility of proteins overexpressed in the bacterium Escherichia coli. The model uses the statistical technique of logistic regression. To build this model, 32 parameters that could potentially correlate well with solubility were used. In addition, the protein database was expanded compared to those used previously. We tested several different implementations of logistic regression with varied results. The best implementation, which is the one we report, exhibits excellent overall prediction accuracies: 94% for the model and 87% by cross-validation. For comparison, we also tested discriminant analysis using the same parameters, and we obtained a less accurate prediction (69% cross-validation accuracy for the stepwise forward plus interactions model). Biotechnol. Bioeng. 2010; 105: 374–383. © 2009 Wiley Periodicals, Inc.
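
The distinction the abstract draws between model accuracy (measured on the data used for fitting) and cross-validation accuracy (measured on held-out data) can be illustrated with a minimal sketch. This is not the authors' implementation: the two synthetic features stand in for the 32 parameters, the gradient-descent fitting routine and the choice of 5 folds are assumptions made for illustration, and every function name here is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    # Class 1 ("soluble" in this illustration) when the probability is >= 0.5.
    p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return 1 if p >= 0.5 else 0


def accuracy(w, X, y):
    return sum(predict(w, xi) == yi for xi, yi in zip(X, y)) / len(y)


def cross_val_accuracy(X, y, k=5):
    """Mean held-out accuracy over k folds (k=5 is an assumed choice)."""
    idx = list(range(len(y)))
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held = set(fold)
        X_train = [X[i] for i in idx if i not in held]
        y_train = [y[i] for i in idx if i not in held]
        w = fit_logistic(X_train, y_train)
        scores.append(accuracy(w, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k


def make_synthetic(n=200, seed=0):
    # Synthetic two-feature data; NOT the protein database from the article.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


X, y = make_synthetic()
w = fit_logistic(X, y)
model_acc = accuracy(w, X, y)       # accuracy on the fitting data
cv_acc = cross_val_accuracy(X, y)   # accuracy on held-out folds
print(f"model accuracy: {model_acc:.2f}, cross-validation accuracy: {cv_acc:.2f}")
```

In this sketch the cross-validation figure is computed only from samples the fitted model never saw. That is the general reason a cross-validation accuracy is treated as the more conservative estimate of predictive performance, and it is consistent with the gap between the 94% and 87% figures reported above.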
