Aspects of Admixture Research: On the Use of Machine Learning in Superplasticizer Chemistry

Abstract
The use of superplasticizers in concrete, especially polycarboxylate ethers (PCE), has made it easy to achieve low water-to-cement ratios and thereby either higher strength or lower cement contents. In recent years, significant progress has been made in understanding the structure-activity relationship of the interaction of PCE and cement. For example, scaling laws have been derived for the size of adsorbed PCE, the magnitude of the steric interaction force, the retardation of cement hydration by PCEs and, more recently, competitive adsorption. While this is extremely useful, the picture is not yet complete.
In this contribution, we highlight some recent work in the field of data analysis of PCE. Inspired by a very early machine-learning study of concrete formulations, we extracted structural PCE data together with rheology data from the literature. We compare PCE performance across studies and attempt to uncover underlying structure-activity relationships using machine learning models. It turns out that the quality and quantity of the data set are not yet sufficient to establish reliable models.

Introduction
Concrete admixtures are enablers for modern concrete mix designs. Superplasticizers are arguably the most critical admixture class as they allow the formulation of mix designs with very low w/c. They thereby make it possible to reach strength objectives at very low binder contents. This, by itself, can save both costs and CO2 emissions.
We are interested in the structure-activity relationships of superplasticizers for cementitious materials. Many groups have studied this aspect before. There are two different approaches to this objective: a) empirical optimization of the basic structure motif and b) the derivation of physico-chemical models for the interaction of PCE with cementitious materials. At the forefront of efforts that fall into the second category, the group of Flatt has published many seminal contributions. These include a derivation of the steric interaction forces between cement particles with adsorbed PCE polymers [1], the competitive adsorption of PCE and sulfate [2], the competitive adsorption of polymers [3], and the retardation of the silicate hydration by PCE [4]. The publications that fall into the first category, i.e., empirical studies on different PCE structures and their influence on the rheology and reactivity of cementitious materials, significantly outnumber the physico-chemical studies. There have been reviews summarizing the state of the art [5]. However, there have been few attempts to quantitatively compare different results and derive a statistical model for the interaction between PCE and cement.
Superplasticizers like polycarboxylate ethers can be described using different structural descriptors. In polymer science, it is commonplace to report a polymer's weight-average and number-average molecular mass (Mw and Mn). Both masses are easily accessible via gel permeation chromatography. The ratio Mw/Mn is the polydispersity index, which characterizes the width of the mass distribution. In the case of copolymers, the monomer ratio is another simple and valuable parameter. If the molar mass and the topology of the polymer are known, other parameters, such as the radius of gyration, become accessible without additional measurements by using simple scaling laws.
Generally, polycarboxylate ethers can be described as copolymers of at least two monomers: a monomer that bears a chain of poly(ethylene glycol) (PEG), also referred to as the side-chain monomer, and a second monomer that contains a functional group which can dissociate in aqueous solutions at high pH into an anionic moiety. In the case of polycarboxylate ethers, the name indicates the chemical nature of the charged monomer (a carboxylate) and the side-chain monomer (a polyether that contains mostly PEG). A PCE structure can be described using only three simple parameters: n, N, and P. These parameters have the advantage that they are firmly rooted in polymer physics and that several essential quantities can be calculated directly from them based on existing scaling laws. However, it has to be noted that these values only represent the average of the entire polymer sample and do not characterize the underlying distributions.
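As a minimal illustration of how such descriptors can be handled in practice, the following Python sketch stores n, N, and P for one polymer and derives a few secondary quantities. It assumes the usual comb-copolymer reading of the descriptors, which is not spelled out in this text: n repeating segments, each with N backbone monomers and one side chain of P monomers. The class name, the default monomer masses, and the example values are illustrative assumptions and not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombCopolymer:
    """Average structure of a comb copolymer (hypothetical helper class).

    Assumed meaning of the descriptors:
      n: number of repeating segments, each carrying one side chain
      N: backbone monomers per segment
      P: monomers (e.g. ethylene oxide units) per side chain
    """
    n: float
    N: float
    P: float

    @property
    def p_over_n(self) -> float:
        """Ratio P/N, the abscissa of the structure-activity plots."""
        return self.P / self.N

    @property
    def backbone_length(self) -> float:
        """Total number of backbone monomers per molecule."""
        return self.n * self.N

    @property
    def charged_per_side_chain(self) -> float:
        """Charged monomers per side chain, assuming one grafted monomer per segment."""
        return self.N - 1

    def molar_mass(self, m_backbone: float = 86.0, m_side: float = 44.0) -> float:
        """Rough molar mass in g/mol.

        Defaults are assumptions: about 86 g/mol (methacrylic acid) per
        backbone monomer and about 44 g/mol (ethylene oxide) per side-chain
        monomer. End groups and counter-ions are ignored.
        """
        return self.n * (self.N * m_backbone + self.P * m_side)


def polydispersity_index(mw: float, mn: float) -> float:
    """Polydispersity index Mw/Mn from the two GPC mass averages."""
    return mw / mn


example = CombCopolymer(n=10, N=4, P=23)  # illustrative values only
print(example.p_over_n, example.backbone_length, example.molar_mass())
```

Because the descriptors are sample averages, any quantity derived this way is also an average and says nothing about the width of the underlying distributions.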
Scaling laws for the behavior of PCE in cementitious systems have been derived using these three structural descriptors (n, N, P). However, while these scaling laws have been successfully applied to compare a set of structurally very similar PCE in similar cementitious systems, they have not been used to compare different PCE structures across different studies.
As some of the most important performance parameters of PCE in cementitious materials (e.g., slump flow, setting time, or mechanical strength) can be described with simple numerical values, it is tempting to use machine learning to derive statistical relationships between PCE structure parameters on the one hand, and performance parameters in cementitious systems on the other hand. There have been few reports on this topic [6; 7]. This scarcity has different possible explanations. The most important reason is probably the difficulty in obtaining the structural descriptors. As most research papers on the use of superplasticizers in concrete are based on industrially available superplasticizers, the chemistry and structural descriptors are generally unknown to the researchers. Consequently, most studies on the use of superplasticizers in concrete do not contain a complete quantitative description of the polymer structure. Reliable conclusions on structure-activity relationships are, therefore, not possible.
Machine learning has been successfully applied in concrete [8; 9]. In this context and in simple terms, a machine learning algorithm is helpful in finding quantitative relations between experimental parameters such as structural polymer descriptors and polymer performance data in cement. Again, it is important to stress that such quantitative relations have already been derived for certain combinations based on physical principles and analytical expressions. Machine learning is not based on considerations of the underlying physical principles; it simply attempts to find statistical relationships.
Finding relationships is interesting for at least two reasons. First, the available relationships do not cover all aspects of the interaction of cement and superplasticizers. For example, an expression for the relation between slump retention and PCE structure is missing. Second, if a statistical relation is found for parameter combinations for which an expression is already available, the scope of the existing expression can be tested if the data set is large.
Here we further extend our previous contribution in this domain and critically comment on the feasibility of such an effort.

Materials and Methods

PCE data
The data was assembled in a systematic literature review on PCE structures and their rheological properties. The criteria were: a) the Gay-Raphael structure parameters of the PCE must be either reported or derivable from published synthesis data, and b) cement paste slump flow must be published together with the structural information on the PCE. A prior version of this data and the methodology have already been published [10]. Here, we present an extended version of the data set, including a new interpretation and a random forest model of a subset of the data.

Data and Machine Learning
The data was collected in LibreOffice spreadsheets. These were joined and manipulated in R using the tidyverse suite of packages [11]. The simple random forest regression model consisted of 3000 trees and was implemented using ranger [12].

PCE structures and distribution of yield stresses
The data set contains a total of 1137 rheology experiments including 210 different PCE structures. The distribution of the PCE structures is shown in Figure 2. There are a few patterns to note here. First, most PCE structures belong to the flexible backbone worm class. Second, in terms of numbers, side chain monomers are far more plentiful than backbone monomers (this can be seen from the distribution along the x-axis, which contains log P / log N). Third, PCE with very long side chains are rare in the data set, while both mid-length and short PEG macromonomers are represented. Fourth, polymers with longer side chains (i.e., larger P values) also have larger P/N ratios.
While the data in the different papers are conceptually very similar (i.e., PCE synthesis or structure parameter elucidation together with mini-slump tests of cement pastes), there are several important variations between different studies: the chemistry of the monomers (methacrylate, acrylate, IPEG, VPEG, …), the presence of additional monomers such as a third sulfonate monomer or an acrylic ester monomer, the synthesis protocol (mainly radical copolymerization but not only), cement chemistry and fineness, timing of the addition of the admixture (with the mixing water or delayed), the water to cement ratio, the dosage of the PCE, and the geometry of the cone used for the mini slump test. As a rule, these parameters are only very rarely completely available. Additionally, some parameters that most likely influence the results are not generally reported: the temperature of the lab (or the fresh paste) during testing, the geometry of the mixing equipment, the mixing energy and other parameters. Some studies follow testing protocols according to existing standards, meaning that some of the abovementioned parameters are controlled.
A common scale is needed to make the slump flow data from different reports comparable. Therefore, the slump flow values were transformed into yield stress data using the formula first described by Roussel. Variations of this formula have been published. However, for the sake of simplicity, we use the formula in its original form.

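$$\tau = \frac{225\,\rho\,g\,V^{2}}{128\,\pi^{2}\,R^{5}} \qquad (1)$$

Here, τ is the yield stress, ρ the density of the paste, g the gravitational acceleration, V the volume of the tested sample, and R the radius of the spread.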
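The conversion from spread to yield stress can be sketched in a few lines of Python. The sketch assumes the relation τ = 225 ρ g V² / (128 π² R⁵) in SI units with R taken as half the measured spread diameter. The function name and the example inputs (a cone volume of 38 mL and a paste density of 2000 kg/m³) are illustrative assumptions, not values taken from the data set, and the code is not the R workflow used for this study.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def yield_stress_from_spread(diameter_m: float, volume_m3: float,
                             density_kg_m3: float) -> float:
    """Yield stress in Pa from a mini-slump spread diameter.

    Assumes tau = 225 * rho * g * V^2 / (128 * pi^2 * R^5)
    with R = diameter / 2 and all quantities in SI units.
    """
    if diameter_m <= 0:
        raise ValueError("spread diameter must be positive")
    radius = diameter_m / 2.0
    numerator = 225.0 * density_kg_m3 * G * volume_m3 ** 2
    denominator = 128.0 * math.pi ** 2 * radius ** 5
    return numerator / denominator


# Illustrative inputs (assumptions): 38 mL cone, paste density 2000 kg/m^3.
CONE_VOLUME = 3.8e-5   # m^3
PASTE_DENSITY = 2000.0  # kg/m^3

for spread_mm in (100, 150, 200):
    tau = yield_stress_from_spread(spread_mm / 1000.0, CONE_VOLUME, PASTE_DENSITY)
    print(f"spread {spread_mm} mm -> yield stress {tau:.2f} Pa")
```

Because the radius enters with the fifth power, doubling the spread lowers the calculated yield stress by a factor of 32. Small errors in the measured spread therefore translate into large relative errors in the yield stress.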
Figure 3 Histogram of cement paste yield stresses in the data set.
The distribution of the slump flow data is strongly skewed towards low yield stresses (see Figure 3). This can be explained by considering that most studies do not report the slump flow data for different dosages. Instead, the reported slump flow value corresponds to the value defined in the study design. Usually, the PCE dosage is varied until the desired slump flow value is achieved (unfortunately, values obtained while adjusting the dosage are often not included in the papers). Furthermore, most studies do not report the temporal development of the slump flow and focus only on the initial flowability of the cement paste. Typically, the yield stress increases over time, decreasing the corresponding slump flow.

Relation between polymer structure and yield stress
The question we asked ourselves was: can something be learned from the literature data concerning the structure-activity relationship of the described PCE? Given the number of publications on the rheological parameters of PCE in cementitious systems, patterns are expected to emerge if the data set is carefully analyzed. As parts of this data have already been described and discussed elsewhere, we briefly repeat some key findings and then focus on new aspects.
The data set was assembled in a semi-systematic literature survey. It contains mini-slump flow data of different PCE structures in cement pastes. In general, the data from individual papers differs in the following parameters (the list is not comprehensive): cement fineness, cement chemistry, w/c ratio, polymer chemistry (i.e., polymer mass, monomer composition etc.), polymer dosage, mixing protocol and timing of the addition of the superplasticizer.
Here, we focus on a subset of the data set which is more homogeneous. For the subset of data obtained at w/c values of 0.29 and 0.30 and direct addition of the PCE, the yield stress corresponding to the slump flow determined directly after mixing does not show a pronounced pattern (Figure 4; the data range is restricted to yield stresses below 50 Pa and dosages below 0.30 %). Certain studies report dosage curves of PCE in cement pastes, i.e., the dependence of the slump flow on the polymer concentration. Grey lines in the figure connect identical polymers at different dosages. As expected, the yield stress decreases with increasing polymer dosage. Other studies follow a different testing paradigm. The blue points distributed horizontally at approximately 3 Pa correspond to individual polymers whose dosage was adjusted to obtain a pre-defined slump flow value (corresponding to approx. 3 Pa in this case). For these studies, only the final dosage of the polymer is reported. A third paradigm is recognized in the vertically stacked points at a polymer dosage of 0.15 %. Here, the polymer dosage is fixed for different PCE, and the resulting yield stress is measured.
Reporting the slump flow at different dosages is the most valuable approach. Both other paradigms can be considered single-point subsets of this approach. In terms of dosage efficiency, polymers in the lower left corner are more dosage-efficient than polymers in the upper right corner. The data in Figure 4 allows individual dosage-efficient polymers to be identified but does not contain structural information on the polymers. The yield stress must be available as a function of structural PCE descriptors to examine structure-activity relations.
It is known that the ratio P/N is related to different PCE properties in cement (the exponents of P and N can vary depending on the property) [13]. In the context of this study, it is sufficient to mention that low P/N ratios correspond to (very) highly charged PCE. It is known that such PCE strongly retard C3S hydration [4]. High P/N ratios correspond to PCE with long side chains and low charge density. The available scaling laws for steric stabilization of cement particles with adsorbed PCE predict that steric stabilization increases with increasing P/N values (the scaling relations are available elsewhere [1]). On the other hand, increasing P/N ratios lead to lower and lower charge densities of the PCE. High P/N ratios ultimately lead to low affinities of the polymer for the cementitious surfaces, resulting in non-adsorbing and, therefore, non-plasticizing dispersants. In summary, based on available scaling laws, the dependence of the yield stress on P/N is expected to have a convex shape. The minimum of such a curve would correspond to the optimum PCE structure, at which a balance between steric stabilization (which increases with P/N) and surface affinity (which decreases with P/N) is achieved.
The data found for the subset shows no obvious convex pattern (Figure 5). Three patterns can be identified here. First, a vertically stacked series of points is found at approx. P/N = 11. These polymers were tested in a Portland cement classified as 42.5 at different dosages. Again, the yield stress decreases with increasing dosage (color-coded here). Second, the horizontal series of points at approx. 3 Pa is the same series found in Figure 4. However, the order along the x-axis is now determined by their P/N value. If the color of the horizontally aligned points is carefully considered, the darkest colors, which represent the lowest polymer dosage, are connected with intermediate P/N values. This implies a convex pattern in the dosage, which is color-coded here. Third, the few data points not belonging to the vertical or horizontal series do not exhibit a clear pattern. If only the polymers dosed at 0.15 % of the data in Figure 5 are considered, the subset shown in Figure 6 emerges.
Because the polymer dosage is constant and only the w/c of 0.29 is considered, a convex shape or a monotonically decreasing curve should emerge in the considered P/N range. This is not what is found: the data does not seem to follow any pattern. In addition to the cement class, the weight average of the molar mass of the polymer (Mw) is shown as four categories represented by different point shapes. The molecular mass does not seem to be the origin of the variance found in the data. It is worth digging a little deeper to find out why the compiled PCE literature data does not follow a clear pattern. We begin by discussing in more depth what a functional relationship between yield stress and P/N might look like.

Expected structure-activity relationship
By fitting the slump flows of cement pastes at a w/c of 0.30 containing three different PCE at increasing dosages, Marchon and Flatt derived an empirical relation linking the required PCE dosage c to the slump flow diameter D of the paste and the ratio P/N of the polymer [13]. D0, γ1, and γ2 are empirical fitting constants that were determined by Marchon.

$$c = \frac{D - D_0}{0.76 \cdot \gamma_1}\left(\frac{N}{P}\right)^{\gamma_2} \qquad (2)$$
Marchon provides three different value pairs for γ1 and γ2; we use 237 and 0.17 here. By rearranging this equation and using equation 1, it is possible to calculate the yield stress as a function of P/N for a given polymer concentration (1 mg/g in the case of Figure 7). This yields a monotonically decreasing power-law relation between the yield stress and P/N. It has to be noted that this relation does not consider the decreasing affinity of the polymers for the cement surfaces with increasing P/N. It is also instructive to study the yield stress for different combinations of P and N. A corresponding diagram is shown in Figure 8 for a fixed concentration of 1 mg/g and a w/c of 0.3. As expected, the lowest yield stress is found in the lower right corner, corresponding to the largest P/N ratio. The largest yield stress is found in the upper left corner for the lowest P/N ratio. Again, it has to be stressed that this relation considers neither retardation nor adsorption affinity under competitive adsorption conditions.
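The rearrangement described here can be sketched in Python under explicitly stated assumptions. The dosage relation is read as c = (D − D0) / (0.76 γ1) · (N/P)^γ2, which rearranges to D = D0 + 0.76 γ1 c (P/N)^γ2, and the spread is converted to a yield stress with τ = 225 ρ g V² / (128 π² R⁵) and R = D/2. The value of D0, the length unit of D (taken as millimetres), the cone volume, and the paste density are not given in this text. The numbers used for them below are placeholders, so the absolute yield stresses carry no meaning and only the trend is of interest.

```python
import math


def spread_from_dosage(c, p_over_n, d0, gamma1=237.0, gamma2=0.17):
    """Spread D for dosage c, from the assumed rearranged dosage relation.

    Assumed form: c = (D - D0) / (0.76 * gamma1) * (N / P) ** gamma2
    """
    return d0 + 0.76 * gamma1 * c * p_over_n ** gamma2


def dosage_from_spread(d, p_over_n, d0, gamma1=237.0, gamma2=0.17):
    """Dosage c needed to reach spread D (same assumed relation)."""
    return (d - d0) / (0.76 * gamma1) * (1.0 / p_over_n) ** gamma2


def yield_stress(diameter_m, volume_m3, density_kg_m3, g=9.81):
    """Yield stress in Pa from the spread diameter in metres."""
    radius = diameter_m / 2.0
    return (225.0 * density_kg_m3 * g * volume_m3 ** 2
            / (128.0 * math.pi ** 2 * radius ** 5))


def yield_stress_vs_p_over_n(ratios, c=1.0, d0_mm=60.0,
                             volume_m3=3.8e-5, density_kg_m3=2000.0):
    """Yield stress for each P/N ratio at a fixed dosage c.

    d0_mm, volume_m3 and density_kg_m3 are placeholder assumptions.
    """
    curve = []
    for ratio in ratios:
        d_mm = spread_from_dosage(c, ratio, d0_mm)
        curve.append((ratio, yield_stress(d_mm / 1000.0, volume_m3, density_kg_m3)))
    return curve


for ratio, tau in yield_stress_vs_p_over_n([1, 2, 5, 10, 20]):
    print(f"P/N = {ratio:>2}: yield stress = {tau:.3f} Pa")
```

With these assumptions, the calculated yield stress falls monotonically as P/N increases, because the relation contains no term for the loss of surface affinity at high P/N.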

Attempted Machine Learning
We now return to the question of possible reasons for the deviation of the experimental data from the expected pattern.
First, we note that larger data sets are difficult to study for humans without using computers. As a first step, plotting different parameter combinations to get a first impression of the distributions is helpful. This has been partly presented above. In addition to exploratory data analysis, machine learning algorithms can find patterns in multivariate data. Here, we briefly present a preliminary attempt to model the data set using a simple random forest model.
The entire data set is small and contains many possibly relevant variables. Therefore, as a first step, we define a reasonably homogeneous subset that contains 68 slump flow values of 47 distinct PCE structures at w/c values of 0.30 and 0.29. The PCE is always added with the mixing water (direct addition), and only the slump flow value directly after mixing is considered. The data set is split into a training set (80 % of the data) and a test set.
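A split of this kind can be sketched in plain Python. The text does not state how the split was drawn, so the random shuffle, the seed, and the function name are assumptions; the integers stand in for the 68 slump flow records.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (train, test).

    The seed only makes the example reproducible; it is an assumption,
    not a setting reported for this study.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(68))  # stand-ins for the 68 slump flow records
train, test = train_test_split(rows)
print(len(train), len(test))  # 54 14
```

With 68 records, an 80 % split leaves only 14 records for testing. Since the 68 values belong to 47 distinct polymers, a purely random split can also place the same polymer at different dosages in both sets; splitting by polymer would be a stricter alternative.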
Before we turned to a random forest regressor, we evaluated multivariate linear regression to predict the yield stress using a maximum of 5 variables: the polymer structure descriptors P, N, and n, the w/c value (which only has two levels, 0.29 and 0.30), and the polymer dosage. In some instances, we included quadratic or interaction terms. The quality of the considered linear models was poor and is not reported here. Afterward, we started to evaluate random forest models containing the 5 variables outlined above. The random forest models were also poor in quality. For the model shown here, the obtained mean absolute error (MAE) is 5.05 Pa, and the root mean square error (RMSE) is 6.15 Pa. The predicted values are shown in Figure 9. The model fails to accurately predict the yield stress, especially at the larger yield stresses.
Although the model is of insufficient quality, we nonetheless used it to predict the yield stress of different PCE structures under the same conditions as for the equation of Marchon, i.e., for a dosage of 0.1 %bwoc (1 mg polymer per gram cement) and a w/c of 0.30. The predicted yield stresses are shown in Figure 10. Discussing this too deeply does not make much sense due to the poor model quality. However, it is still worth noting that the model reproduces the same pattern found for the equation of Marchon (Figure 8), i.e., the lowest yield stresses are found in the lower right corner, and the highest yield stresses in the upper left corner. Another point worth highlighting is the non-monotonic shape of the yield stress curve in the vertical direction. Here a concave form is found for higher values of P (P > 25), i.e., the yield stress is predicted to have a local maximum if N decreases from 10 to 2 for a fixed P value. The concave shape here and the convex shape shown in Figure 7 are in conflict.
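For reference, the two error metrics quoted for the model can be computed as in the following Python sketch. The function names and the example numbers are made up for illustration and are not taken from the data set or from the model.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average of the absolute differences between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_square_error(observed, predicted):
    """Square root of the average squared difference."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Made-up yield stresses in Pa, for illustration only.
observed = [2.0, 5.0, 12.0, 30.0]
predicted = [3.0, 4.0, 15.0, 22.0]

print(mean_absolute_error(observed, predicted))     # 3.25
print(round(root_mean_square_error(observed, predicted), 2))  # 4.33
```

The RMSE weights large deviations more strongly than the MAE. An RMSE clearly above the MAE, as reported for the model, is therefore consistent with a few poorly predicted points, for example at high yield stresses.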

Discussion
To make the most of the existing literature on the interaction of PCE with cement, we extracted structural descriptors of the polymers and the slump flow data and compiled a data set. This can be valuable for a couple of different reasons. First, it is helpful to have an overview of the existing structural space of PCE-type dispersants. Second, existing and future quantitative structure-activity relationships should ultimately be able to explain the variance described in the literature.
The current form of the compiled data set has several important issues. It is not large; it frequently lacks chemical information about the cement quality; and the PCE chemistry varies and includes different charged monomers like acrylic acid, methacrylic acid, and maleic acid as well as several different macromonomers. Furthermore, the polymerization process ranges from free radical copolymerization to controlled radical copolymerization and grafting approaches. Sometimes, the process is not described, and only the structural descriptors are given.
Finally, another important issue is the interaction of PCE with the ettringite formation if the PCE is added to the mixing water. In most experiments in the data set, the PCE is added to the mixing water. The model of Marchon and Flatt, however, is derived based on data acquired by adding the PCE after the completion of the ettringite formation.
The less-than-optimal data quality and quantity and the possible additional interaction of the PCE with the ettringite formation in the case of direct addition most likely explain why the data set is so difficult to model using simple machine learning algorithms.

Conclusion
The assembled data set helps compare new PCE structures with the existing structural space. Additionally, future interaction models will most likely be able to partly elucidate some of the still missing links between the PCE structure, cement chemistry, mixing protocol, and rheology. The drive towards binder compositions that are more sustainable than ordinary Portland cement will possibly also increase the complexity of the binder chemistry. Superplasticizers will certainly be a part of future binder compositions. It is, therefore, critically important to further increase the degree of understanding of the interaction of superplasticizers with cement.

Figure 1 Simplified structural scheme of a comb copolymer structure.

Figure 2 Gay-Raphael plot of academically published PCE structures.

Figure 4 Dependence of the yield stress on the polymer dosage at two very similar low w/c values.

Figure 5 Dependence of the yield stress on P/N for different dosages and cement classes.

Figure 6 Dependence of the yield stress on P/N at a w/c of 0.29. Different reported molecular weight categories are shown as symbols.

Figure 7 Theoretical dependence of the yield stress on P/N adapted from Marchon for a polymer dosage of 1 mg/g cement and a w/c value of 0.3.

Figure 8 Theoretical yield stress as a function of P and N. The color encodes the yield stress, which is scaled logarithmically. Based on the scaling relations in Marchon 2019.

Figure 9 A random forest model is trained and used to predict the yield stress of the test data. Gray points: training data set; red points: test data set.

Figure 10 RF model-based prediction of the yield stress as a function of P and N.