The Development and Opportunities of Predictive Biotechnology

Recent advances in the bioeconomy allow a holistic view of existing and new process chains and enable novel production routines that are continuously advanced by academia and industry. All this progress benefits from a growing number of prediction tools that have found their way into the field. For example, automated genome annotations, tools for building model structures of proteins, and structural protein prediction methods such as AlphaFold2™ or RoseTTAFold have gained popularity in recent years. Recently, it has become apparent that more and more AI-based tools are being developed and used for biocatalysis and biotechnology. This is an excellent opportunity for academia and industry to further accelerate advancements in the field. Biotechnology, as a rapidly growing interdisciplinary field, stands to benefit greatly from these developments.

[9][10][11][12][13][14][15] In pursuing biotechnology processes, the scarcity of needed biocatalysts and reactions in nature has long posed a challenge. Hence, for many decades the next logical step was to carry out bioprospecting and to screen strains or environmental samples for a certain activity. In the lucky case of a hit, the work had only just started and involved a labor-intensive process that included enzyme identification, gene expression optimization, protein stabilization, and application towards process implementation. This approach often yielded a lot of data, but not necessarily an applicable and cost-effective process. The time and human resources required for this process were also often critical factors, and nowadays, smarter solutions are being employed to increase revenue. This holds true for academia and industry alike. The recent progress in protein engineering, supported by databases and computational tools, empowers us to streamline biocatalyst development. Here, three potential approaches are mentioned as examples.
Martin Schürmann studied Biology at Ruhr University Bochum (Germany), specializing in microbiology and biochemistry. He obtained a doctorate of natural sciences from Research Center Jülich and Heinrich-Heine-University Düsseldorf (Germany) in the field of biotechnology. In 2022 he joined DSM Research in Geleen (The Netherlands) as a Marie Skłodowska-Curie Fellow. He held various scientist positions before leaving DSM in 2017 as Principal Scientist Biocatalysis, a position he has since held at InnoSyn BV (Geleen, The Netherlands). Since May 2024 he has also been CSO of SynSilico BV in Geleen.

Dirk Tischler studied applied natural science at TU Bergakademie Freiberg (Germany) and completed his doctoral studies in 2012. He continued as a group leader in the field of industrial biotechnology with emphasis on the prediction of biocatalysts and pathways from (meta)genomes of soil bacteria, especially actinobacteria. In 2018 he became W1 Professor and was tenured in 2019 as University Professor for Microbial Biotechnology at the Ruhr University Bochum. He focuses on the identification and application of novel biocatalysts, mostly related to redox biochemistry.

Verena Resch studied Biochemistry and Molecular Biology at the University of Graz in Austria and completed her doctoral studies in 2011. From 2012 to 2014, she worked as a postdoc at TU Delft in The Netherlands, followed by a position as a senior postdoc at the University of Graz. In 2016, she founded Luminous Lab, driven by her passion for scientific design, and worked as a freelance scientific illustrator. Since 2021, she has been working as a visual communicator at Innophore in Graz.

Bernd Nebel studied chemistry at the University of Graz and completed his doctoral studies in 2010. After his postdoctoral studies at the Manchester Institute of Biotechnology in Prof. Nicholas J. Turner's group, he moved to the University of Stuttgart. His main research focus at the Institute of Technical Biochemistry was applied biotechnology and bioanalytics. He also ran his own business for analytical training and method development. In 2021, he joined Innophore (Graz) and held the position of senior scientist and project manager.

Bettina M. Nestl received her doctoral degree in Organic and Bioorganic Chemistry at the University of Graz. Following her doctoral studies, she moved to the University of Manchester to complete a postdoc with Nicholas J. Turner. After her postdoc, she started working with Bernhard Hauer as a research group leader at the University of Stuttgart. In 2020, she habilitated at the University of Stuttgart. In the same year, she joined the TechBio company Innophore in Graz, where she is currently a senior scientist and COO.

The following terms were used to filter: "enzyme", "artificial intelligence", "machine learning", "AlphaFold", "Robetta". Data was gathered from PubMed (https://pubmed.ncbi.nlm.nih.gov/) using the following search prompts: [enzyme AND "artificial intelligence"], [enzyme AND "machine learning"] and ["AlphaFold" OR "RobeTTa"].

[39][40][41] This complex molecule can be deconstructed into simpler building blocks on paper, and if needed, this process can be repeated multiple times. Then, the library of building blocks obtained can be matched with databases comprising biocatalytic information to propose potential enzyme-catalyzed reactions and thus generate (non-natural) pathways from simple to complex molecules, which then need to be brought to life. This approach can also be combined with organic retrosynthesis for reactions where no enzyme is available. Furthermore, machine learning tools can support or guide the database mining and combinatorial work to construct artificial pathways on paper. [42,43] The diverse tools from synthetic biology can then be used to construct novel biofactories combining the desired enzymes either in a cellular context or even in vitro.
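The repeated deconstruction and database matching described above can be illustrated with a minimal sketch. It assumes a hypothetical lookup table (`REACTIONS`) that maps each product to the enzyme-catalyzed steps able to form it; all molecule and enzyme names are placeholders and do not come from a curated biocatalysis database. Real retrosynthesis tools work on chemical structures and reaction rules rather than on names.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table: product -> list of (enzyme, precursors).
# Every entry is an illustrative placeholder, not curated biocatalytic data.
REACTIONS: Dict[str, List[Tuple[str, Tuple[str, ...]]]] = {
    "target": [("enzyme_A", ("intermediate_1", "block_x"))],
    "intermediate_1": [
        ("enzyme_B", ("block_y",)),
        ("enzyme_C", ("block_x", "block_z")),
    ],
}

# Simple starting materials at which the deconstruction stops.
BUILDING_BLOCKS = {"block_x", "block_y", "block_z"}


def enumerate_routes(molecule: str, max_depth: int = 5) -> List[List[str]]:
    """Return every route from building blocks to `molecule`.

    Each route is a list of steps "precursors -> product (enzyme)",
    ordered from the simplest precursors to the target.
    """
    if molecule in BUILDING_BLOCKS:
        return [[]]  # nothing to synthesize: one empty route
    if max_depth == 0 or molecule not in REACTIONS:
        return []  # dead end: no known enzymatic step
    routes: List[List[str]] = []
    for enzyme, precursors in REACTIONS[molecule]:
        # Combine the routes of all precursors of this step.
        partial: List[List[str]] = [[]]
        for precursor in precursors:
            sub_routes = enumerate_routes(precursor, max_depth - 1)
            partial = [p + s for p in partial for s in sub_routes]
        step = f"{' + '.join(precursors)} -> {molecule} ({enzyme})"
        routes.extend(p + [step] for p in partial)
    return routes
```

Calling `enumerate_routes("target")` on this toy table yields two candidate pathways, one via `enzyme_B` and one via `enzyme_C`. A molecule without an entry in the table returns no route, which corresponds to the case where organic retrosynthesis would have to fill the gap.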
II) Another approach is to start with known enzymes and reactions that are similar to the desired ones but not yet functional in this manner. By using database mining, high-throughput structural modeling, and protein fitness evaluation, a library of potential biocatalysts is generated. In this approach, docking the desired substrate or the predicted transition state can provide insights into potential candidates for screening the desired activity. It is worth noting that the approach can also be started from the perspective of the substrate-binding site. In this case, structural information of the transition state or at least the substrate can be used to screen all available protein structures (determined or modeled) for potential binding motifs. [17] This library then needs to be refined, but screening or testing of candidate enzymes can be limited to the top 5-10 enzymes, which could be tested under much more process-relevant conditions than hundreds of enzyme candidates or variants in a micro-titer plate.

In both cases, one might observe only low to marginal activity for the envisioned reaction, and subsequently, the enzyme should be engineered towards the needed properties.

III) In contrast to the two approaches mentioned above, the starting point can be completely de novo. Both the reaction and the enzyme are designed using computational tools. In this case, we begin with the overall reaction by identifying various substrates and potential catalytic trajectories towards the desired product. It is important to uncover the reaction coordinate and to identify a transition state that allows the reaction to proceed. A mechanistic hypothesis is then generated and tested. Based on this hypothesis, potential interaction partners (such as amino acids, metal ions, and cofactors) are identified to introduce weak molecular interactions needed to stabilize the transition state and facilitate catalysis. This theoretical active site of an enzyme is called a theozyme.
[15,44] It serves as a starting point to generate the actual enzyme. The theozyme is used for comparison with known structural information in order to incorporate the designed active site into a protein backbone, allowing for the production of the enzyme. A completely de novo generated active site in a stable backbone can be obtained and further engineered using classical methods towards a desired process window. [15]

Table 1. Selected tools and applications for predictive biotechnology. This provides only a limited view on the tools available but demonstrates the huge versatility and progression in the field.

| Tool | Application | Reference |
| --- | --- | --- |
| AlphaFold 2.0 | Machine learning-based model supports the prediction of protein structures | [4] |
| AlphaFold DB | Provides access to over 200 million predicted protein structures | [4,16] |
| Catalophore™ | Protein discovery and development using machine-readable 3D point clouds | [17] |
| CLEAN | The deep-learning approach allows for functional annotation of enzymes (EC number prediction) | [18] |
| ColabFold | Accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold | [19] |
| CryoDRGN | Neural networks allow the reconstruction of cryo-EM structures | [20] |
| DeepTracer | Supports de novo structure modelling from cryo-EM data | [21] |
| DeepFRI | Structure-based protein function prediction | [22] |
| DeepScreen and DeepScreening | Drug-target interaction prediction based on neural networks and deep learning approaches | [23,24] |
| EnzymeMiner | Enzyme selection based on criteria such as predicted soluble expression, protein stability, and database deposition | [25] |
| EnzRank | Rank-orders existing enzymes regarding suitability for directed evolution or de novo design towards a desired specific substrate activity | [26] |
| ESMFold | Machine learning-based model supports the prediction of protein structures | [27] |
| FireProt 2.0 | Stable proteins are predicted by means of computational design | [28] |
| Foldseek | Fast protein structure search by aligning a query structure against a database of 3D structures | [29] |
| MILDE | The fitness landscape of a protein evolution can be screened by an in silico machine learning protocol | [30] |
| ModFOLD9 | Global and local quality scoring of protein structure models is supported by means of deep learning methods | [31] |
| PERISCOPE-Opt | A machine learning tool supporting the optimization of E. coli fermentation processes | [32] |
| ProteinMPNN | Structure-based tool to generate amino acid sequences that are predicted to fold into a given 3D structure | [33] |
| RoseTTAFold All-Atom | 3D structure prediction and de novo design of proteins, nucleic acids, small molecules and metals | [34] |
| Star | A machine learning tool supports the directed evolution of proteins | [35] |
| ZYMSCAN | Tool for prediction of biocatalysts for a target reaction | [36] |

[47] It should also be borne in mind that the choice of enzyme depends on the specific application. The biocatalyst alone is not per se the key to a process. Many other factors, such as alternative synthetic routes, substrate availability, enzyme production and application mode, stability, process performance, and scale-up, among others, need to be taken into account. Biotechnology is a diverse field with many considerations beyond just selecting the right enzyme.
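The comparison of a theozyme with known structural information, as outlined for the de novo approach above, can be sketched as a purely geometric search. The example below is a simplified illustration and not the method of any specific design tool: it assumes that the theozyme is reduced to one point per catalytic group and that a scaffold is given as one coordinate per residue number (both hypothetical inputs). It then looks for residue positions whose pairwise distances reproduce those of the theozyme within a tolerance.

```python
import math
from itertools import permutations
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def match_theozyme(
    theozyme: List[Point],
    scaffold: Dict[int, Point],
    tol: float = 0.5,
) -> List[Tuple[int, ...]]:
    """Find scaffold positions that can host the theozyme geometry.

    Returns tuples of residue numbers; position i of a tuple hosts
    catalytic group i. A tuple is a hit when every pairwise distance
    deviates by at most `tol` (in angstrom) from the theozyme.
    Distances alone cannot tell mirror images apart, so a real
    workflow would need an additional chirality check.
    """
    n = len(theozyme)
    # Reference distances between all pairs of catalytic groups.
    target = {
        (i, j): math.dist(theozyme[i], theozyme[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    hits = []
    # Brute-force search over ordered residue selections; fine for a
    # sketch, far too slow for whole structure databases.
    for combo in permutations(sorted(scaffold), n):
        if all(
            abs(math.dist(scaffold[combo[i]], scaffold[combo[j]]) - d) <= tol
            for (i, j), d in target.items()
        ):
            hits.append(combo)
    return hits
```

Production tools additionally consider side-chain rotamers, backbone compatibility, and steric clashes with the transition state, all of which this sketch ignores.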
[50][51] These advancements have been made possible by innovations in bioinformatics, with artificial intelligence (AI) playing a significant role. However, a major obstacle in leveraging AI for biotechnology is the need for properly annotated, standardized, and analyzed datasets. It is widely accepted that AI can only be effectively utilized with access to high-quality data.
The goal is to leverage predictive biotechnology for the recognition, production, structure/function prediction, and ab initio design of novel proteins. Designing natural proteins at the atomistic level has proven to be challenging, as early enzyme designs often exhibit low activity and require extensive iterative experimental optimization. The wet-lab-produced proteins often differ significantly from the ones modeled based on homology or predicted ab initio. This includes substantial and unexpected deviations in the conformation of the protein backbone or the arrangement of amino acid side chains in the active site. As a result, protein engineers must run numerous iterative rounds of mutagenesis and screening. Hence, the large protein mutation space continues to be challenging. [8][55][56][57] Furthermore, advancements in structure prediction have addressed limitations related to small molecule cofactors, allowing for more comprehensive predictions.
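To give a sense of why the mutation space mentioned above is challenging, its size can be estimated with simple combinatorics: a protein of length L has C(L, k) · 19^k variants carrying exactly k amino acid substitutions. The short sketch below evaluates this formula; the protein length of 300 residues used in the example is an assumed, illustrative value.

```python
from math import comb


def variants_with_k_substitutions(length: int, k: int, alphabet: int = 20) -> int:
    """Number of variants with exactly k substitutions.

    Choose k of `length` positions, then one of the (alphabet - 1)
    alternative residues at each chosen position.
    """
    return comb(length, k) * (alphabet - 1) ** k


def full_sequence_space(length: int, alphabet: int = 20) -> int:
    """Number of all possible sequences of the given length."""
    return alphabet ** length


if __name__ == "__main__":
    for k in (1, 2, 3):
        # Assumed example: a 300-residue enzyme.
        print(k, variants_with_k_substitutions(300, k))
```

For the assumed 300-residue enzyme, this gives 5,700 single variants and 16,190,850 double variants, which already exceeds what is typically screened in micro-titer plates and illustrates why in silico prioritization is attractive.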
A significant limitation of protein and especially enzyme structure prediction has been the dependence of various enzyme classes highly relevant for industrially applied biocatalysis, such as alcohol dehydrogenases/ketoreductases or transaminases, on non-peptidic small molecule cofactors such as nicotinamide or pyridoxal 5'-phosphate. These cofactors could not be predicted together with the protein structure, but are highly relevant for substrate binding and catalysis, as emphasized by Korbeld and Fürst. [58] This issue has been addressed by RoseTTAFold All-Atom, an extended version of RoseTTAFold, which now allows structure predictions for nucleic acids, small molecules, and metals. [34]

[61] Although gene expression is currently limited to Escherichia coli (E. coli) and yeast, this is expected to change in the future, offering a streamlined approach to designing enzymatic processes. [59] It also offers an opportunity to save time and resources by designing enzymatic processes and bringing them to life with minimal experimental validations. A concept article by Ao and colleagues (2024) provides an insightful overview of the potential of data-driven protein engineering. [8]

Increasingly sophisticated cloud technology is opening new avenues for biotechnology. Since its beginnings in the technology sector over a decade ago, it has expanded to various industries that value the redundancy, speed, and scalability it offers at manageable investment costs. Leading life sciences companies are realizing the potential of the cloud to deliver business benefits beyond the cost efficiencies that cloud migrations typically target. These benefits include performing analytics, standardizing processes, digitizing and virtualizing results and data sets, and storing them in the cloud, enabling their simultaneous global use. [62] This can successfully shorten product development times, as was the case for biopharmaceuticals and vaccines in the COVID-19 pandemic.
A prominent example is the generation of an enormous amount of data. DeepMind and the EMBL European Bioinformatics Institute (EMBL-EBI) recently collaborated to develop the AlphaFold DB. [4,16,63] This database contains over 214 million entries and covers UniProt, the standard repository for protein sequences and annotations. These freely accessible structures of varying quality offer opportunities for innovative, predictive biotechnological projects. Previously, large amounts of data were archived in inaccessible or poorly structured databases, hindering their usability. Thus, these databases remain closed and unavailable for further exploration and advancement.
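Because the predicted structures are of varying quality, it is useful to check the per-residue confidence before building a project on a given model. AlphaFold models store the pLDDT confidence score in the B-factor column of the PDB file, so it can be read with a few lines of code. The sketch below is a minimal example under stated assumptions: the download URL pattern and the model version number reflect the AlphaFold DB file naming at the time of writing and may change, and the threshold of 70 is the commonly used cut-off for a confident prediction.

```python
from typing import Dict


def alphafold_model_url(accession: str, version: int = 4) -> str:
    """Build the download URL of an AlphaFold DB model (PDB format).

    Assumption: the file naming scheme and version number may change;
    check the AlphaFold DB website for the current values.
    """
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{accession}-F1-model_v{version}.pdb"
    )


def per_residue_plddt(pdb_text: str) -> Dict[int, float]:
    """Read the pLDDT of each residue from the CA atom records.

    In the fixed-column PDB format, the atom name occupies columns
    13-16, the residue number columns 23-26, and the B-factor
    (here: pLDDT) columns 61-66.
    """
    scores: Dict[int, float] = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confident_fraction(scores: Dict[int, float], threshold: float = 70.0) -> float:
    """Fraction of residues with pLDDT at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores.values()) / len(scores)
```

A model in which only a small fraction of residues is predicted with confidence, especially around the active site, would be a poor starting point for docking or design and may call for an experimental structure instead.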
[65] This endeavor holds immense potential for reinterpreting and integrating stored data into new value chains. To make this possible, collaborative efforts are required to establish central databases, standardize data, develop universally linkable user interfaces, and clarify IP rights and usage costs. Additionally, efficient methods for working with this data need to be established. In particular, predictive methods are crucial in describing the behavior and properties of biotechnological systems based on primary data, utilizing mathematical, statistical, or (bio-)informatics algorithms.
Initial attempts have already demonstrated that the development of such a trainable, self-learning technology enables the description of various systems in a theoretical manner, eliminating the need for time-consuming and costly experimental validation. Ultimately, the most promising system (algorithm) can be utilized to assess its suitability for addressing the posed question from the perspective of deep learning.
It is widely recognized that the evaluation of statements in many areas of predictive biotechnology is not yet sufficiently meaningful. Achieving reliable function prediction requires multiple iterations to gain correctness and accuracy. There is also uncertainty about whether we have developed the necessary methods and tools to enable predictive biotechnology. Many believe that, in addition to AI and machine learning, automated quantum mechanical computation will lead to a decisive breakthrough. However, the race for the first viable quantum computer has just begun. Researchers express caution while also acknowledging the enormous potential of this concept. It remains to be seen to what extent it can be realized and applied. For now, quantum computers will not replace today's computer technologies, and it remains to be determined whether this technology will revolutionize predictive biotechnology.
To overcome the challenges mentioned earlier and advance predictive biotechnology, targeted support from research and funding policies, widespread education, and general acceptance are needed to integrate bioprocesses industrially based on evaluated predictive statements. However, predictive methods are already effectively used in various subfields of biotechnology:

* Functional annotation (from gene to enzyme to biocatalysis)
* Prediction of enzyme structures and properties, e.g., membrane proteins
* Prediction of cellular processes (predictive microbiology) for optimization of production strains and biomass formation
* Modeling of bioprocesses, e.g., using digital twins
* Modeling in the field of downstream processing

The potential of predictive tools in biotechnology is vast. Figure 2 shows a schematic view of methods and routes to future applications. It is important to note that tools can be used for various targets and applications and thus a plethora of combinatorial approaches are possible.
The expansion of the genetic code allows for new applications. By reprogramming and expanding the genetic code, we can access a wider variety of biomolecules such as nucleic acids, proteins, and enzymes. This will enable the development of new imaging and screening procedures for medical applications, as well as the control of bioprocesses. Introducing new chemistry to known systems, such as enzymes or pathways, will increase the portfolio of catalytic options. It also opens up possibilities for novel therapeutic and personalized medical applications. Designer RNA and proteins for medical applications are already established but have potential for further development. For example, deep-learning accelerated cell-free production of bioactive peptides shows promise in fighting diseases. [66] Computational guidance can shorten development times, increase efficiency, and minimize risks. Innovative approaches in phage therapy may revolutionize the prediction of application windows and minimize the screening approach for personalized medicine. Designing proteins in proper production hosts can shorten development time frames and provide access to unexplored reactions and products for various industries such as food, feed, and pharmaceuticals. Minimal genomes are already state-of-the-art, [67,68] and novel de novo chromosome synthesis [69,70] may introduce complex metabolic networks, including artificial enzymes and reactions, to create defined living smart-machines for value-added production. This approach may address metabolic burdens of natural hosts and by-product issues, simplify product isolation, and enable biotechnological production in challenging environments, such as extraterrestrial missions or subsurface habitats.
The potential applications of predictive biotechnology are vast and promising. It is no longer a matter of "if" but rather of overcoming various obstacles to ensure a safe and efficient implementation. As (bio-)technological pioneers, it is our responsibility to create the necessary scientific, political, legal, and industrial frameworks to fully utilize the benefits of predictive biotechnology, ensuring that both ecological and economic advantages are realized.