Automated robotic platforms in design and development of formulations

Funding information: Engineering and Physical Sciences Research Council, Grant/Award Number: EP/R009902/1; Pharma Innovation Platform Singapore; Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise; National Research Foundation

Abstract: Product design for formulations is an active and challenging area of research. The new challenges of a fast-paced market, products of increasing complexity, and the practical translation of sustainability paradigms require a re-examination of the existing theoretical frameworks to include the advantages deriving from the new reality of digitalization of business and research. In this work, we review the existing approaches, clearly stating the role of automation and machine-learning-guided optimization in the broader framework. Moving from this, we review the state of the art of automated hardware and software for formulated product design, and identify the open challenges for future research. Perspectives are given on the emerging fields of automated discovery, scale-up, and multistage optimization, and a unitary picture of the existing connections is provided, in the general context of a completely digital R&D workflow.


| INTRODUCTION
Formulated products consist of a blend of ingredients, processed to achieve a set of desired performance and appearance characteristics. 1 The aim of formulated product design is to find a product that exhibits a behavior corresponding to desired, customer-defined functional properties. 1 Formulations are ubiquitous in daily life, ranging from medicines to cosmetic creams and gels, from detergent powders and liquids to processed foods, paints, adhesives, lubricants, pesticide granules, and many more. Because of the significance of formulations markets, developments in formulation technologies are attracting the attention of both academia and industry. 2 In 2009, Chemical Product Engineering was introduced as the third paradigm within the field of chemical engineering. 3 The design of formulated products involves identification of target product attributes, determination of product form, selection of ingredients, development of processing steps, as well as economic and environmental analyses. 4 As a result of this conceptual and empirical complexity, research has focused on the identification of a theoretical framework for formulated product design, taking into consideration all of these interlinked areas. 5 Within this general framework, it is clear that access to: (i) large amounts of reliable and repeatable data, and (ii) better models, would be key elements for faster and more efficient formulated product development. The former challenge can be addressed by adopting robotic automated high-throughput experimentation, whereas the latter can be met by the adoption of data-efficient statistical machine learning (ML) models.
The automation of chemical experiments and advances in ML algorithms to guide automated experiments have recently emerged as a new paradigm for chemical R&D, 6,7 with examples ranging from robotic experimental platforms for nanomaterial discovery 8,9 to design of experiments (DoE) for high-throughput experimentation.

(Liwei Cao and Danilo Russo contributed equally to this study.)
The experiment-based trial-and-error approach is the preferred and most common one for the design of formulated products. By performing experiments at all steps during the development of a formulation, a product with the desired properties can be developed. One previously reported explicative example is the development of an inkjet formulation. 17 In this case, with only a few key ingredients and a set of typical solvents and dispersants, it was possible for an experienced researcher to develop the optimal blend on a lab scale, and eventually use the gathered experimental data to generate a model for future use. 18 However, this approach suffers from two main drawbacks: (i) it requires a large amount of resources and is highly time-demanding; (ii) it is critically dependent on the level of expertise of the experimentalist and on past knowledge, both formal and tacit, as identified by Chandrasegaran et al. 19 In particular, tacit knowledge, consisting of subjective insights, intuition, and heuristic qualitative rules, is not easily transferable and is usually lost when experts leave product development. Therefore, this approach is beneficial only if the number and type of ingredients and process conditions are limited a priori and skilled experts are involved in the process.
On the other hand, computational methods, that is, physical model-based design of formulations, have been proposed in order to reduce the experimental cost and to speed up the R&D process. In the last few years, various attempts have been made to establish systematic methodologies. Computer-aided methods have been proposed for solvent design, 20 mixture design, 21 general molecular design, 22 and so forth. A review of computer-aided molecular design (CAMD) methods for product and process design was published by Gani, 4 while Ng et al reviewed significant developments, current challenges, and future opportunities in the field of chemical product design using CAMD tools. 23 A key concept in CAMD is to utilize different chemical property models for the possible chemical species in the pool, formulate a mixed-integer linear/nonlinear programming (MILP/MINLP) optimization model, and then solve it with numerical optimization techniques. 24 The suitability of the candidate structures for a particular task or process can be evaluated with respect to a chosen criterion (for instance, the solubility of the target compound), while considering physical and chemical constraints, as well as process constraints of varying complexity. From the solution of the optimization model, the optimal product is obtained. This approach was first applied to the design of single molecular species with considerable success, with applications ranging from the design of refrigerants 25 to surfactants. 26 The CAMD method was then extended to the design of mixtures and composite chemical products, identified as computer-aided blend design (CAMbD), 27 also reported as computer-aided mixture design (CAMxD). 28,29 Typically, almost all CAMD/CAMbD methods use group contribution-based property prediction methods 30,31 to evaluate the generated compounds with respect to the specified set of desirable target properties.
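The MILP selection concept described above can be sketched in a minimal form. All ingredient costs and property values below are hypothetical, and the formulation (choose at most two solvents, keep the blend property in a target window, minimize blend cost) is a toy instance rather than any published CAMD model:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# hypothetical candidate solvents: unit cost and a linear-mixing property
cost = np.array([1.0, 2.5, 1.8, 3.0])
prop = np.array([14.0, 18.5, 16.0, 21.0])
n = len(cost)

# decision variables: x (mass fractions, continuous) then y (selection, binary)
c = np.concatenate([cost, np.zeros(n)])          # minimize blend cost
integrality = np.concatenate([np.zeros(n), np.ones(n)])

constraints = [
    # fractions sum to one
    LinearConstraint(np.concatenate([np.ones(n), np.zeros(n)]), 1, 1),
    # x_i <= y_i: an ingredient is used only if selected
    LinearConstraint(np.hstack([np.eye(n), -np.eye(n)]), -np.inf, 0),
    # at most two ingredients in the blend
    LinearConstraint(np.concatenate([np.zeros(n), np.ones(n)]), 0, 2),
    # blend property must fall in the target window [16, 17]
    LinearConstraint(np.concatenate([prop, np.zeros(n)]), 16.0, 17.0),
]
bounds = Bounds(np.zeros(2 * n), np.ones(2 * n))

res = milp(c=c, constraints=constraints, integrality=integrality, bounds=bounds)
fractions = res.x[:n]   # optimal blend composition
```

For this toy data the optimum blends solvents 0 and 3 (fractions 5/7 and 2/7), hitting the lower edge of the property window at minimum cost; the same skeleton extends to group-contribution property constraints of arbitrary size.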
UNIFAC 32,33 (universal quasichemical functional-group activity coefficients) and SAFT-γ 34 (statistical associating fluid theory) have been demonstrated to be accurate and useful in calculating solubility, phase equilibria, partition coefficients, and various other properties. However, one significant issue is that they rely on binary interaction parameters for every pair of groups in solution, which are often not available in thermodynamic property databases. 35 To address the shortcomings of the group contribution-based methods, topological indices have been introduced, which are descriptors of the chemical structure used to predict the physical properties of a molecule. 36 The obtained relationships are called quantitative structure-property/activity relationships (QSPR/QSAR). 37 These methods can take into account molecular information, such as the types of atoms and bonds, the total number of atoms, and the bonding between atoms, to predict physical properties; therefore, they play an important role in the design of large and complex molecules, such as pharmaceutical drugs, as they can capture the differences between conformations, isomers, or molecular structures. QSPR in surfactant studies and formulated product design has been reviewed in detail by Hu et al 38 and McLeese et al. 39 An alternative is to use quantum chemistry calculations for thermodynamic estimation. COSMO-RS (conductor-like screening model for real solvents) and COSMO-SAC (segment activity coefficient) are two of the most popular post-processing methods for the COSMO solvation model, where the estimation of thermodynamics relies only on the composition-independent charge density distribution, also known as the sigma profile, and the molecular volume. Detailed reviews of these methods can be found in References 22, 29, and 40. Furthermore, a systematic review of available computer-aided methods and associated software tools for formulated product design can be found in Reference 36.
Briefly, the model-based approaches are able to efficiently find feasible candidates within the application range of the available models. However, since the function-materials-structure-processing relations have not been developed for complex formulations, including those determined by nano- or microscale structure, some target properties are hard to predict with computational tools only. 41 With the increase in computational power, data-driven methods, such as ML-based models, provide an alternative way of establishing the required process-property models, in particular when sufficient knowledge is not available. 42 Compared to knowledge-based models, data-driven surrogate models do not require prior knowledge; therefore, ML-based models are finding increasing use in extracting structure-property relationships, particularly in the case of complex chemical formulations and materials. 36 The recent advances in molecular and material design using ML methods are summarized by Butler et al. 43 Identifying suitable molecular descriptors for chemicals is still an open challenge for ML models, and progress here may lead to further accuracy in chemical product property prediction. 44 As a result, a third, integrated experiment-modeling approach was proposed, which consists of combining the computer-aided model-based techniques with heuristic-based experimental testing and improvement of the formulation design. The integrated approach usually consists of three stages: the problem definition stage, the model-based design stage, and the experiment-based verification stage. 45 In the problem definition stage, the targets are translated into a set of thermo-physical properties and into a list of categories of ingredients, which are to be included in the formulation via a knowledge base. In the model-based design stage, structured databases, dedicated algorithms, and physical property model libraries are employed to design a candidate base-case formulation.
Finally, in the experiment-based verification stage, the properties and performance of the proposed formulation are tested experimentally.
Through this systematic sequence of actions, the formulation is developed.
By limiting the candidate formulations to be tested, and by verifying the design in the last stage, the integrated approach saves time and resources (compared to the experiment-based trial-and-error approach) and increases the accuracy of the results (compared to the physical model-based approach).
In this framework for formulation design, the integration of robotic experiments and statistical ML models would be a further step in the improvement of the integrated approaches. This approach would combine the time and resource efficiency of robotic platforms with the fact that the predictions of statistical models are based only on data, with no need for extensive first-principles physical knowledge.
The approaches reviewed so far are related to the core product design, and they are based on the assumption that the target properties and the final market destination have already been identified and analyzed. That is, the core design approaches are part of a broader theoretical framework, taking into account different co-existing levels within a complex decision-making hierarchy. Such frameworks have been proposed in several papers and were first reviewed by Gani. In Figure 1, we illustrate the integration of the methodologies described in this article into the pre-existing theoretical framework reported by Zhang et al. 36 Briefly, the market needs define the product and its desired properties, which can be translated into quantifiable property functions. Once these are identified, the next step in the general product design is to analyze the existing knowledge in terms of preliminary information, tacit knowledge in the form of operators' expertise, and formal knowledge, derived from first principles and the available models, to define the objective functions to optimize. It is important to stress that commercial formulated products are often complex mixtures for which no predictive models are available, and the complex interactions between different ingredients and process variables are not easily translatable into predictions of the final properties, even by experienced formulators. The most common situation is the availability of a small preliminary data set, which can be used to define reasonable constraints in the input variable space. The advantages and drawbacks of both configurations are widely discussed in the existing literature 53,54 and are beyond the scope of the present article.

| Robotic platforms
However, it must be highlighted that, while the former continuous flow devices seem extremely promising for investigating reaction conditions in an efficient and resource-undemanding way, the latter possess great advantages for mixing, the processing of emulsions, the handling of solids, and the investigation of thermodynamics-related properties, 55 for example, stability. For most formulated products, the process determining the final structure, thermodynamic state, and properties consists of a combination of rigorously controlled mixing of ingredients at a certain temperature, and stepwise addition of different ingredients at different stages of the process. In this direction, pioneering work in the development of batch modular systems, which could be adapted to the specific requirements of this type of product, has been described by Cronin and coworkers. 54,59,60 To date, the developed hardware has been used only to investigate chemical reactions, with the sole exception of the study of physical interactions determined by thermodynamics, which manifest themselves in the complex dynamic behavior of oil droplets in a continuous water phase. 55,61 Being developed for different purposes, these platforms are only able to produce one sample at a time, with interstage automated cleaning of the reactionware/containers. An attempt to overcome this limitation can be found in recent studies, 62 where an automated rotating wheel, coupled with a 3D-printed dispensing element and automated syringe pumps, can allocate batches of 24 vials per run. The potential of using 3D printing technologies to build inexpensive hardware was also highlighted. 63 Systems for carrying out reactions in parallel under different conditions have also been reported. Dispensing of ingredients is followed by, or is simultaneous with, processing of the mixtures. In most academic papers, mixing seems to be efficiently automated using magnetic stirrers activated by software-controlled magnets. 54,63 However, in all the presented solutions, temperature control is not ensured and, in some cases, mixing appears to be far from ideal, stressing the need for standardized solutions.

| Analytics
Automated analytical tools are of crucial importance for the fast and efficient adoption of robotic platforms for formulation development, and they represent the ongoing bottleneck to the wide adoption of such systems in academia and industry. Once again, most of the analytics automated in the literature has been used for reaction development. Thorough reviews can be found in Mateos et al 48 and Houben et al. 13 The most commonly adopted techniques for reactive systems are UPLC and HPLC, 10,49,53,101-114 MS, 37,40,46,50,52,55,57,106,115 IR, 53,59,113,115,116 Raman, 53 and UV spectrophotometry. 114,117,118 However, the properties relevant to formulated products, as outlined in the theoretical framework of Bernardo et al, 41 are different. For formulated liquid products, the main general desired properties can be identified as: stability, aspect (color and turbidity), viscosity, surface tension, pH, conductivity, zeta potential, and, in the case of emulsions, droplet size distribution. Therefore, more complex analytical sensors need to be identified and integrated in the robotic platforms to acquire data about different properties at the same time.
Other important properties are functional performances, for example, the UV protection of sun creams, or other sensory properties such as odor, stickiness, and so forth. A first step in the automation of sensory property measurement is represented by the new robotic tactile systems by SynTouch. 119 One key property of several commercial formulated products, ranging from detergents to personal care products, is their external appearance, which can be quantified using discrete and continuous variables. The former can be defined as "stability", which is related to the capability of the system not to show phase separation, whereas the latter can be quantified considering the absorbance spectra in the visible range and the turbidity value, measured in nephelometric turbidity units (NTU). Phase separation can be evaluated through automated image processing of pictures from automated cameras. Automated cameras and image processing coupled to robotic platforms have already been proposed in other contexts. 55
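As a minimal sketch of the kind of automated image analysis mentioned above (the intensity threshold and the synthetic vial image are illustrative assumptions, not values from the literature), a horizontal phase boundary can be flagged by the largest jump in the row-averaged intensity profile of a grayscale image:

```python
import numpy as np

def detect_phase_separation(gray, threshold=25.0):
    """Detect a horizontal phase boundary in a grayscale vial image.

    Averages intensity across each row and flags the largest jump
    between adjacent rows; a jump above `threshold` suggests the
    sample has separated into two phases.
    """
    profile = gray.mean(axis=1)        # mean intensity per row
    jumps = np.abs(np.diff(profile))   # row-to-row intensity change
    k = int(np.argmax(jumps))
    return bool(jumps[k] > threshold), k   # (separated?, boundary row)

# synthetic example: bright "oil" layer on top of a dark "water" layer
img = np.vstack([np.full((50, 40), 200.0), np.full((50, 40), 80.0)])
separated, row = detect_phase_separation(img)
```

A real pipeline would add cropping to the vial region, illumination correction, and calibration of the threshold against labeled stable/unstable samples.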

| Algorithms
Robotic platforms can iteratively provide data points to DoE algorithms, which suggest new conditions in order to optimize the input variables with respect to one or more target functions.
There are two main groups of DoE algorithms: static and adaptive. 124 Static sampling techniques, also known as one-shot sampling, generate all the sample points at once. 125,126 Depending on the understanding of the system and the available computational power, they can be further classified into system-free and system-aided designs. The key criterion for system-free DoE is its space-filling ability. Factorial and fractional factorial designs are the classic system-free DoE methods, which aim to fill the space uniformly. To add randomness to the filling procedure, Monte Carlo sampling (MCS) was proposed, which uses pseudorandom numbers to generate sample points for space-filling. The maximin design maximizes the smallest distance between any two points; similarly, the minimax design minimizes the maximum distance from any point of the design space to its nearest sampled point. 134
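As a brief sketch of one-shot, space-filling design generation (the factor names and ranges are invented for illustration), a Latin hypercube design can be generated with SciPy and assessed by its maximin distance, the criterion mentioned above:

```python
import numpy as np
from scipy.stats import qmc
from scipy.spatial.distance import pdist

n_points, n_dims, seed = 16, 2, 7

# one-shot designs over the unit square
lhs = qmc.LatinHypercube(d=n_dims, seed=seed).random(n_points)
mc = np.random.default_rng(seed).random((n_points, n_dims))  # plain Monte Carlo

def maximin(design):
    """Smallest pairwise distance -- larger means more space-filling."""
    return pdist(design).min()

# scale to hypothetical factor ranges, e.g. surfactant 0-5 wt%, temperature 20-60 C
design = qmc.scale(lhs, l_bounds=[0.0, 20.0], u_bounds=[5.0, 60.0])
```

The Latin hypercube guarantees exactly one point per stratum in each dimension, which plain Monte Carlo sampling does not; comparing `maximin(lhs)` with `maximin(mc)` over repeated seeds typically favors the stratified design.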
Although the system-free DoE techniques are easier to implement and require less computational power, researchers realized the vital importance of incorporating system information while generating experimental designs. To generate system-specific designs, model-based designs have been proposed in different ways, such as maximum entropy sampling and mean squared error (MSE)-based designs.
Lindley 138 proposed a measure to quantify information based on Shannon's entropy. 139 This entropy criterion was first employed by Shewry and Wynn to construct system-based designs. 140,141 The MSE-based design was first employed by Sacks and Schiller, 142 as the prediction accuracy of a surrogate model can be improved by minimizing its integrated MSE. 143 Adaptive sampling, also known as sequential sampling, has attracted attention from both the research and industrial communities. It can overcome the under/oversampling and poor system approximations resulting from static sampling methods. 144 It has also been shown within numerical analysis that adaptive sampling methods yield superior surrogate approximations at lower computational expense compared to static techniques. 144 Researchers have reported adaptive sampling techniques for different surrogate models, such as support vector machines, 145,146 artificial neural networks, 147 and others. 148-150 These are adaptive sampling techniques in which sampling points are placed systematically, yet still stochastically.
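A minimal sketch of such an adaptive scheme, assuming a Gaussian process surrogate and a synthetic response standing in for a real experiment, places each new point where the surrogate is most uncertain:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    # hypothetical "experiment": any expensive measurement would go here
    return np.sin(3 * x) + 0.5 * x

X = np.array([[0.1], [0.5], [0.9]])        # small initial one-shot design
y = f(X).ravel()
cand = np.linspace(0, 1, 201).reshape(-1, 1)  # candidate conditions

for _ in range(5):                          # adaptive (sequential) loop
    gp = GaussianProcessRegressor(kernel=RBF(0.2), normalize_y=True).fit(X, y)
    _, std = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(std)]           # most uncertain candidate
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))
```

Replacing the maximum-variance rule with an entropy or integrated-MSE criterion recovers the model-based designs discussed above; in a closed-loop platform, `f` is the robotic experiment itself.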
In contrast, methods that formulate an optimization problem to place new samples have also been reported. Cozad et al proposed the automated learning of algebraic models for optimization, a surrogate modeling tool in which a derivative-free optimization problem is solved to maximize the surrogate model's prediction error in order to place the next sampling point. 151

Combining several targets into a single objective, that is, scalarization, is a possibility, as shown, for example, in the multiobjective active learner 120 methods. However, this is not ideal, since it requires prior knowledge, introduces bias, and often is not straightforward. 153 Successful implementation of multitarget optimization has so far been achieved for continuous variables using the Thompson sampling efficient multiobjective algorithm (TS-EMO). 10,154

iii. The sustainability challenge imposes targets for rapid development of new formulations or substitution of some ingredients with others, as environmental legal requirements and consumers' ethics become more and more stringent. 155 As a result, algorithms need to be fast and models cheap to evaluate, also when exploring a high-dimensional combinatorial space. In addition, both discrete and continuous variables and target performances need to be efficiently taken into account at the same time.
iv. The main drawback of black-box surrogate models is that they generally do not provide any information about the physics underpinning a product's functional performance. In this sense, the use of data collected from closed-loop optimization procedures for the generation of physical knowledge is crucial to gaining a better understanding of the processes, in order to rapidly adapt and transfer the results to similar systems. Very preliminary results in this direction can be identified in the physical interpretation of model hyperparameters, 10 the manual interpretation of Pareto fronts by human experts, 88,89 and, more recently, the automated capture of chemical intuition transferred between similar systems, 156 and the automated generation of physical laws from data. 118

v. As in the case of hardware, there will be a general need for user-friendly open-source software interfaces, to enable experimentalists to apply the developed techniques regardless of their specific field of expertise and to democratize the use of such tools.
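Where scalarization of multiple targets is undesirable, non-dominated (Pareto-optimal) candidates can instead be identified directly from the measured objective values. A minimal sketch, assuming all objectives are to be minimized, is:

```python
import numpy as np

def pareto_mask(Y):
    """Boolean mask of the non-dominated rows of Y (all columns minimized)."""
    n = Y.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # row j dominates row i if j is <= i in every objective and < in at least one
        dominated = np.all(Y <= Y[i], axis=1) & np.any(Y < Y[i], axis=1)
        mask[i] = not dominated.any()
    return mask

# hypothetical measurements: columns could be, e.g., cost and turbidity
Y = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
front = Y[pareto_mask(Y)]   # the first two rows form the Pareto front
```

Algorithms such as TS-EMO build on this notion, proposing experiments expected to expand the hypervolume dominated by the current front rather than optimizing a fixed weighted sum.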

| Product discovery and prediction of scale-up
Although laboratory automation has already demonstrated a remarkable increase in experimental throughput, the discovery of new phenomena and/or products is still challenging. Automation alone is insufficient, as the relative rate of discovery does not change with the increase in experimental throughput. An appealing alternative is to implement the process of curious, knowledge-based inquiry inherent to human scientific research within a reliable and high-throughput robotic system. 157 Active searching and pooling strategies have been proposed and applied in the automated discovery of new chemical reactions.
A detailed review of screening approaches in chemical reaction discovery and development can be found in Collins et al, 158 Coley et al, 159 and Henson et al. 52 A pioneering work in the study of multicomponent chemical formulation discovery is the investigation of the self-propelled droplet system. Grizou et al describe an experimental method complemented by a curiosity algorithm, which enables the observation of a greater variety of droplet behaviors than a random parameter search under the same experimental budget. 160 This approach may lead to new discoveries with potential applications in formulation chemistry.
One of the main open challenges in product design by using automated approaches is the translation of the acquired knowledge to full predictive scalability. At present, in the field of chemistry and chemical engineering, most of the data collected in lab-scale robotic platforms is used to build statistical reaction models, 161 not taking into account scale-dependent or process-dependent interactions. At a process scale, however, mass, heat, and momentum transport almost always become the most relevant controlling mechanisms, and very little information about them can be inferred from small-scale black-box optimizations. Specifically, in the field of formulated products, this can critically determine the thermodynamic state of the final product, that is, for example, its stability, shelf life, physical properties, and so forth.
One promising way to overcome these limitations is to use robotic experiments to build generalizable physical knowledge and to learn physical models that can be integrated at a later stage to predict the behavior of a chemical system at scale. However, self-optimizing loops do not usually provide useful information for developing physical models, 48 and model-based design of experiments (MBDoE) is usually applied to closed-loop systems for the generation of more informative data. 114 A number of studies, representing the first steps toward overcoming these limitations, have been published, addressing the interpretability of surrogate models and the automated generation of fundamental knowledge in the form of physical models. This approach has been successfully applied mainly to the identification of predictive kinetic models in the field of reaction engineering and chemical reaction development. 118 Moreover, the current state of the art has focused on the problems of model selection from a library of pre-derived models 162-165 and parameter estimation. 163,166,167 Also, several pioneering works 55,61,168 have shown advances in both the hardware and the software for the exploration of physics-controlled phenomena in microfluidics, for example, multiphase droplet generation. A future challenge will be to use algorithms to find predictive correlations between the behaviors observed at the lab scale and the final properties of the processed products at the production scale. The underlying hypothesis, still to be proven or rejected, is that a direct correlation can be found between the final properties of a processed sample, for example, the stability of an oil-in-water emulsion, and the observable behavior of the two phases in a robotic platform, like the one proposed by Henson and Points. 55 Finally, an interesting emerging approach to infer physical knowledge from the increased availability of good quality data is the adoption of non-parametric, form-free algorithmic search.
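One instance of such form-free search is symbolic regression. As a deliberately simplified sketch (a tiny hand-picked operator library and synthetic data, far from the global MINLP formulations in the literature), candidate sparse combinations of basis functions can be screened exhaustively by least squares:

```python
import itertools
import numpy as np

# toy operator library: (name, callable) pairs chosen for illustration only
library = [
    ("x", lambda x: x),
    ("x^2", lambda x: x**2),
    ("sin(x)", np.sin),
    ("exp(x)", np.exp),
]

def symbolic_fit(x, y, max_terms=2):
    """Try all sparse linear combinations of library terms; return the
    best (term names, coefficients, squared error)."""
    best = (None, None, np.inf)
    for k in range(1, max_terms + 1):
        for combo in itertools.combinations(library, k):
            A = np.column_stack([func(x) for _, func in combo])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            err = np.sum((A @ coef - y) ** 2)
            if err < best[2]:
                best = ([name for name, _ in combo], coef, err)
    return best

# synthetic data generated from the "hidden law" y = 2 sin(x) + 0.5 x^2
x = np.linspace(0, 3, 50)
y = 2 * np.sin(x) + 0.5 * x**2
names, coef, err = symbolic_fit(x, y)   # recovers sin(x) and x^2 terms
```

Exhaustive enumeration only works for toy libraries; genetic programming and the MINLP formulations cited in this section scale the same idea to realistic operator sets with guarantees on sparsity or global optimality.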
In particular, the ML method of symbolic regression seems particularly promising, allowing the discovery of analytical equations describing a set of data by combining a set of basic operators. Recent developments consisted of formulating symbolic regression as a mixed-integer nonlinear program to guarantee solution to global optimality. The methodology has been successfully applied to simulated data to re-discover the equations governing different physical systems. 169

An illustrative multistage example is the design of cleaning products such as shampoos, washing liquids, and so forth. In stage 1, key ingredients such as surfactants, polymers, and so forth are mixed under certain conditions, achieving a stable product within a specified viscosity range. In stage 2, the mixture is diluted 10 times, and the foaming and cleaning ability are measured.
The factors applied in stage 1 influence stage 2. There are three different types of optimal designs for multistage experiments: (i) multistage completely randomized, (ii) multistage split-plot, and (iii) multistage strip-plot designs. Matthew et al 170 introduced a two-stage split-plot design for the formulation of a pharmaceutical product, the modeling of the similarity factor from dissolution testing, and the prediction of points that maximize the probability of passing the specification. However, this work was limited to a computational study without experimental validation. Also, only a single objective was considered in the case study. Due to the complex nature of formulated products, future work could therefore focus on extending and adapting the existing methodology to explore outside the current experimental region. For example, Pareto optimality can be used for multiobjective optimization, Bayesian reliabilities can be applied to identify compromise directions for the exploration of design spaces with multiple outputs, and so forth.
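Multistage problems of this kind can also be viewed as sequential decision processes. As a minimal numerical sketch (the costs, probabilities, and demands below are invented purely for illustration), a two-stage problem can be solved by discretizing the first-stage decision and adding the expected optimal second-stage cost, the "future cost" of each candidate decision:

```python
import numpy as np

# toy two-stage problem: choose a batch size x1 now; once the uncertain
# demand d is revealed, buy any shortfall x2 at a premium unit cost
scenarios = [(0.5, 80.0), (0.5, 120.0)]   # (probability, demand) -- assumed
c1, c2 = 1.0, 1.5                          # first- and second-stage unit costs

def future_cost(x1):
    """Expected optimal second-stage cost given first-stage decision x1."""
    return sum(p * c2 * max(d - x1, 0.0) for p, d in scenarios)

grid = np.linspace(0.0, 150.0, 1501)       # discretized first-stage decisions
total = c1 * grid + np.array([future_cost(x) for x in grid])
x1_opt = grid[np.argmin(total)]            # hedged optimum: produce 80 units
```

The same structure (immediate cost plus a future-cost function built over a discretized decision space) is what dynamic programming formalizes, and it extends to more stages by nesting the recursion.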
In other applications, such as healthcare, 171 traffic management, 172 smart grid management, 173,174 and others, 175 the multistage design problem is often treated as a multistage stochastic optimization problem (taking two-stage, linear models as an example):

$$\min_{x_1} \; c_1^{\top} x_1 + \mathbb{E}_{\xi}\left[ Q(x_1, \xi) \right] \quad \text{s.t.} \quad A_1 x_1 = b_1, \; x_1 \geq 0 \tag{1}$$

It can be further interpreted as a two-stage decision process, exemplified in Scheme 1: in the first stage, a trial feasible value $\hat{x}_1$ is given for $x_1$; then the optimal solution of the second-stage function can be found from Equation (2):

$$Q(\hat{x}_1, \xi) = \min_{x_2} \; c_2^{\top} x_2 \quad \text{s.t.} \quad A_2 x_2 = b_2 - B \hat{x}_1, \; x_2 \geq 0 \tag{2}$$

Dynamic programming (DP) algorithms can then be used to solve the sequential decision process described above. The DP approach has many attractive features, such as extendibility to multistage problems, accommodation of discrete values and nonlinearities, and so on. In the DP approach, the first-stage problem is defined as:

$$\min_{x_1} \; c_1^{\top} x_1 + \alpha_1(x_1) \quad \text{s.t.} \quad A_1 x_1 = b_1, \; x_1 \geq 0 \tag{3}$$

where $c_1^{\top} x_1$ is the immediate cost, and $\alpha_1(x_1)$, representing the future cost of decision $x_1$, that is, the consequences of this decision for the second-stage problem, is defined as:

$$\alpha_1(x_1) = \mathbb{E}_{\xi}\left[ \min_{x_2} \; c_2^{\top} x_2 \; : \; A_2 x_2 = b_2 - B x_1, \; x_2 \geq 0 \right] \tag{4}$$

The DP algorithm constructs the future cost function $\alpha_1(x_1)$ by discretizing the first-stage decision space.

A fully digital R&D workflow will, in addition, require:

• standardization of data representation and data exchange formats,
• creation of data repositories,
• standardization of knowledge representation and exchange formats, and
• standardization of equipment and software interfaces.
All of these aspects are currently under active discussion and development. Here we shall point to only a few of the most significant developments that are relevant to the field of formulations.
The core of the fully digital workflow consists of the mechanisms for the exchange of data and of knowledge models. There exist a number of formats for representing molecules, such as SMILES, 178 InChI, 179 and MDL molfiles, 180 as well as formats for crystallographic and spectroscopic data.
None of these formats allows a complete representation of a molecule or a material, and several representations may be required to cover all the necessary information. For this, a data exchange repository must exist. There are commercial (e.g., Reaxys, 181 CAS, 182 etc.) and public (e.g., ChemSpider 183 ) chemical databases; however, they do not yet support the fully digital workflow. A very recent proposal to establish the new repository Chemotion offers a potential platform for a fully digital workflow of chemical R&D. 184 Laboratory information management systems (LIMS) are another powerful tool for handling automatically collected data; originally conceived for automated data tracking and exchange, they have been further integrated with data mining, analysis, and translation tools, allowing for a much wider range of applications. 185 One example is the large LIMS for formulated products developed by Syngenta, operated at their unique formulation robot at the Jealott's Hill R&D site.
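As a toy illustration of how machine-readable molecular representations such as SMILES can be consumed programmatically (this is a deliberately naive token counter, not a SMILES parser; real workflows would use a cheminformatics toolkit), heavy atoms in simple SMILES strings can be counted with a pattern over element symbols:

```python
import re

def heavy_atoms(smiles):
    """Naive heavy-atom count for simple SMILES strings.

    Handles the common organic-subset symbols (two-letter Br/Cl first,
    then single letters including aromatic lowercase forms); ignores
    bonds, ring-closure digits, and branches. Bracket atoms, charges,
    and isotopes are NOT handled -- this is an illustration only.
    """
    token = re.compile(r"Br|Cl|[BCNOPSFIbcnops]")
    return len(token.findall(smiles))

# heavy_atoms("CCO") counts 3 atoms (ethanol: C, C, O)
# heavy_atoms("c1ccccc1") counts 6 aromatic carbons (benzene)
```

The point is that a line notation turns a molecule into data that pipelines, repositories, and ML featurizers can operate on uniformly, which is precisely what the exchange formats above standardize.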
To navigate from data to research hypotheses, data must be transformed into knowledge, and knowledge structures must be accessible to algorithms and to scientists. This requires the establishment of ontologies (relationship models) and mechanisms of access to different knowledge domains, for example, via semantic web technology. 186 This infrastructure then makes it possible to exploit the power of algorithms and of automation, linking laboratories located anywhere in the world with computational resources, algorithm developers, material and process scientists, and end-user product specialists. An example of such an infrastructure, a virtual world of chemical processes filled with data, knowledge models, and AI agents, was recently demonstrated by Kraft et al. 187 The outcome is a connected world of machines, algorithms, data, knowledge, and scientists. The benefits to the scientific community and to society are clear: faster sharing of knowledge, much wider access to talent, much better utilization of resources, both material and computational, and the ability to pose much more challenging problems.

| CONCLUSIONS
In this review, we highlight the role of the recent developments in the fields of robotic automated platforms and optimization for formulated product design, identifying the challenges for future research efforts.

Acknowledgments: Projects "Development of multi-step processes in pharma" and "Data2Knowledge" (Alexei A. Lapkin).