Transfer learning of recurrent neural network‐based plasticity models

Mechanics‐specific recurrent neural network (RNN) models are known for their ability to describe the complex three‐dimensional stress–strain response of elasto‐plastic solids for arbitrary loading paths. To apply RNN models to real materials, it is crucial to identify a strategy that allows for their training from small datasets that could be obtained from robot‐assisted experiments. It is demonstrated that regular training with datasets comprising random walks (RWs) in strain space yields a significantly higher generalization ability than training with the same number of smooth loading paths. Moreover, it is found that transfer learning, that is, initializing the weights and biases with the parameters from an already trained material, improves the convergence rates and reduces the required number of stress–strain sequences for training. When leveraging the experience gained for multiple materials through ensemble transfer learning, even more substantial improvements are obtained. For example, the same model accuracy and generalization ability is obtained from training with 400 smooth stress–strain sequences after ensemble transfer as from training with 10,000 RW sequences after regular training.


INTRODUCTION
Mathematical models describing the three-dimensional stress-strain response of solids are typically formulated in terms of algebraic equations (e.g., elastic stress-strain relation, yield surface definition), differential equations (e.g., flow rule), and differential inequalities (e.g., Kuhn-Tucker conditions). The mathematical nature of the constitutive equations is similar for both physics-based 1 and phenomenological models. 2 Computational models then solve the set of constitutive equations in an approximate manner through specialized algorithms such as return-mapping schemes. 3 In essence, the constitutive equations may be seen as mathematical constraints on the functions that describe the mapping from a given strain history (input) to a stress history (output).
Machine learning is currently pursued as an alternative to using constitutive equations to define the functions that describe the mapping from strain to stress histories. The idea is to identify neural network functions from data, that is, from numerous examples that describe the stress-strain relationship. In the context of one-dimensional plasticity, du Bos et al 4 proposed a fully-connected neural network model that uses a discrete strain history as input vector and provides a stress history as output vector. Zhang and Mohr 5 and Jang et al 6 used fully-connected neural networks to model J2 plasticity with isotropic hardening. Given the incremental nature of finite element computations, recurrent neural networks (RNNs) appear to be particularly well suited for data-driven constitutive modeling. They have built-in memory variables that can account for history effects when predicting the stress for a given strain increment input. Well-established RNN architectures like long short-term memory networks (LSTMs) and gated recurrent units (GRUs) have been introduced in various aspects of constitutive modeling. In the realm of small deformation, Chen 7 used an LSTM-based model for viscoelastic material behavior at infinitesimal strains. Shah and Rimoli 8 predict the dynamic response of large arbitrary heterogeneous structures by using an LSTM-based surrogate model for a unit cell. Zhang et al 9 propose a GRU-based ensemble learning framework to model the structural response of truss and shell structures subjected to structural uncertainty, for example, of geometrical dimensions like truss length due to manufacturing errors. Frankel et al 10 modeled the elastic properties and the onset of plastic flow of synthetic oligocrystals. They combined strain history and voxelated polycrystalline realizations with the help of a joint architecture made from a convolutional neural network (CNN) and an LSTM. The large deformation, fully three-dimensional response of elasto-plastic materials was
successfully reproduced via GRUs. 11,12 The latter also demonstrated that RNNs are capable of handling anisotropic plastic flow and complex hardening behavior, such as the anisotropic Yld2000 plasticity model with HAH hardening. Abueidda et al 13 confirmed these findings and at the same time proposed the use of temporal convolutional networks, which offer shorter training times since they do not have a recursive structure like RNNs. Qu et al 14 leveraged prior physical knowledge of granular media to build a GRU model combining measured quantities (principal strains) and model parameters (stiffness tensor) to predict the effective stress-strain response. Tancogne-Dejean et al 15 demonstrated that GRUs can replicate the homogenized response of lithium-ion battery cells. On a similar note, Hu et al 20 investigate RNNs to learn the microstructure evolution in latent space. An efficient representation of the microstructure via an autocorrelation-based principal component analysis (PCA) method enables an accelerated phase-field response. An extension to established machine learning models is the incorporation of physical constraints to limit the space of admissible solutions. Haghighat et al 21 show an application of these Physics-Informed Neural Networks (PINNs) that embed the momentum balance and constitutive relations in the loss function during training for linear elasticity and von Mises elasto-plasticity. It is reported that obeying physical constraints also leads to improved robustness in training. 22 Zhang et al 23 highlight that physical constraints in the loss function allow potential system nonlinearities to be captured accurately. Liu et al 24 introduced deep material networks that preserve thermodynamic consistency. The effective material response is computed based on a collection of mechanistic building blocks with analytical homogenization solutions. Deep material networks inherit thermodynamic consistency and stress-strain monotonicity from their individual building blocks. 25 He and Chen 26 take a combined approach to thermodynamic consistency by predicting the Helmholtz free energy. This implicitly enforces the first law of thermodynamics, since stress, dissipation rate, and entropy are derived from the predicted free energy. However, the second law of thermodynamics only acts as part of the loss function and is not strictly enforced. New types of RNNs have also been designed for mechanics applications for which traditional architectures are not suited. Bonatti and Mohr 27 developed minimal state cells (MSCs) that decouple the size of the memory state from the total number of parameters of the RNN cell. To increase the robustness of the neural network response and ensure self-consistency with respect to the input discretization, Bonatti and Mohr 28 proposed linearized minimal state cells (LMSCs). This type of RNN cell enforces self-consistency by mathematical construction and thereby enables the use of RNN-based constitutive models for structural boundary value problems in explicit finite element analysis.
In contrast to replacing the entire constitutive model with data-driven approaches, hybrid models aim to replace specific parts of phenomenological models to improve the overall model accuracy. Al-Haik et al 29 and Jordan et al 30 relied on fully-connected neural networks to estimate the creep stress of polymeric composites and the temperature- and rate-dependent stress-strain response of polypropylene, respectively. Li et al 31 created an enhanced version of the Johnson-Cook hardening model to capture the strain rate and temperature dependency of high strength steels. The same model has also been applied to aluminum 7075. 32 Additionally, Li et al 33 extended the neural network based model to account for dynamic strain aging. Besides increasing the model's predictive capabilities, they also enforced positive strain rate sensitivity through counter-example training. Settgast et al 34 presented a hybrid model for rate-independent plasticity that relies on three different neural network functions describing the yield surface, the angle of dilatational flow, and the stiffness degradation.
In the current work, we explore the potential of transfer learning to reduce the dataset size required to train RNN models. 37 For example, transfer learning addresses challenges like data scarcity in the target domain, distribution and feature space adaptation between source and target domain, as well as the lack of labeled data or presence of incomplete data in the target domain. Pan and Yang 38 published a survey on transfer learning featuring a comprehensive summary of existing methods that can be applied to classification, regression, and clustering. Transfer learning has a particularly rich history in natural language processing. 39,40 Raina et al 41 introduce techniques to leverage large amounts of unlabeled data to improve the performance on a given classification task via transfer learning. Other approaches include mappings of data from different domains to the same high-dimensional feature space 42 or ensemble methods that combine individual classifiers that can be applied to various sequence labeling tasks. 43 Furthermore, deep learning architectures have been specifically designed for use in conjunction with transfer learning. For example, Bidirectional Encoder Representations from Transformers (BERT) models are pre-trained on unlabeled text before these are used for diverse tasks from classification to answering questions. 44 There exists a wealth of open source libraries of pre-trained BERT models that are readily available. There are also open-source models of almost all state-of-the-art architectures for image-related tasks like object detection or segmentation that have been pre-trained on large datasets like ImageNet 45 or COCO.
46 Yosinski et al 47 investigated the transferability of features in CNNs. They reported two distinct issues that can hamper transferability, namely (1) feature specificity, and (2) optimization difficulties that arise when combining pre-trained and randomly initialized weights during the subsequent fine-tuning. It is worth noting that feature specificity (i.e., certain layers extract information that is specific to individual learning tasks) is difficult to avoid in practice since higher layers tend to adapt to the target task. Raghu et al 48 performed evaluations of transfer learning for medical imaging. Their results indicate that common, well-established models used for natural images are highly overparametrized for medical tasks. As a consequence, transfer learning did not offer any significant benefit over training smaller models from random initializations.
In contrast to computer vision and natural language processing, there are only a few comprehensive investigations into transfer learning for time series data. Fawaz et al 49 investigated the potential of transfer learning for different time series datasets. They emphasize once more the potential benefits of transfer learning when training from small datasets, but they also acknowledge the risk that transfer learning might yield worse model performance compared to regular training. Weiss et al 50 report that so-called "negative transfer learning" is likely to occur when the two datasets are highly dissimilar.
In this paper, we will focus on challenges regarding parameter transfer. After outlining different numerical experiments in Section 2 and discussing the datasets in Section 3, we present the RNN modeling approach along with regular and transfer learning strategies in Section 4. The results are then presented in Section 5 along with a discussion of the main conclusions.

PROBLEM STATEMENT
We are concerned with the training of RNN models that describe the elasto-plastic response of solids. The central question is whether we can benefit from prior knowledge obtained from training for various materials when identifying the model parameters for a new material. This question is primarily motivated by the desire to reduce the amount of stress-strain data needed from physical or virtual experiments to train RNN models.
For this, we tackle the following three sub-problems:
• Generalization ability after training on smooth and random walk (RW) paths: Here, we train our RNN model using datasets of a different nature: the first is stress-strain data that describes the material response for abrupt random changes of the direction of loading in the six-dimensional strain space; the second is data for smooth changes of the direction of loading in strain space. The discussion of the differences in the trained models' ability to generalize will provide guidance for the experimental procedures that need to be developed to generate data for the identification of RNN models.
• Single transfer learning-from a first material to a new material: Instead of randomly initializing the model parameters when starting the training for a new material, the knowledge of the parameters for a first material is taken into account.
Here, the central question is which parameters of the model obtained for the first material could/should be frozen to facilitate the training for the new material. We also explore whether the size of the training dataset for the new material can be reduced if the model had already been successfully trained for a first material.
• Ensemble transfer learning-from N materials to a new material: Instead of randomly initializing the model parameters when starting the training for a new material, the knowledge of the parameters for several other materials is taken into account. In addition to tackling similar questions as for the single transfer learning problem, we also employ Bayesian optimization to initialize our model based on the results from training multiple materials.

Plasticity models used to generate data
We use the isotropic von Mises and the anisotropic Hill'48 elasto-plasticity models with associated flow rule to generate the training data through virtual experiments. For the von Mises models, the strain hardening is defined through a linear combination of a power 51 and an exponential law, 52

k = α k_Swift + (1 − α) k_Voce, with k_Swift = A (ε0 + ε̄p)^n and k_Voce = k0 + Q (1 − exp(−β ε̄p)),

where ε̄p denotes the equivalent plastic strain. To generate 25 fictitious materials, the Swift parameters {A, ε0, n}, the Voce parameters {k0, Q, β} and the weighting factor α ∈ [0, 1] as well as the Young's modulus E and the Poisson's ratio ν are sampled based on Sobol sequences. The material parameter ranges and sampled parameters are shown in Table 1. The sampled hardening behaviors are depicted in Figure 1.

TABLE 1: Overview of model parameters for all von Mises materials and lower and upper bounds used during material parameter sampling.
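The mixed Swift-Voce flow stress above can be evaluated directly. The following sketch implements the linear combination of the power law and the exponential law; the parameter values in the usage example are illustrative and are not taken from Table 1.

```python
import numpy as np

def swift_voce_hardening(eps_p, A, eps0, n, k0, Q, beta, alpha):
    """Mixed Swift-Voce flow stress (sketch).

    Swift (power law):      k_S = A * (eps0 + eps_p)**n
    Voce (exponential law): k_V = k0 + Q * (1 - exp(-beta * eps_p))
    Mixture:                k   = alpha * k_S + (1 - alpha) * k_V
    """
    k_swift = A * (eps0 + eps_p) ** n
    k_voce = k0 + Q * (1.0 - np.exp(-beta * eps_p))
    return alpha * k_swift + (1.0 - alpha) * k_voce

# Evaluate one fictitious material over a range of equivalent plastic strains
# (parameter values are illustrative only)
eps_p = np.linspace(0.0, 0.5, 6)
k = swift_voce_hardening(eps_p, A=700.0, eps0=0.02, n=0.2,
                         k0=300.0, Q=200.0, beta=10.0, alpha=0.5)
```

For α = 1 the law reduces to pure Swift hardening, for α = 0 to pure Voce hardening; sampling α ∈ [0, 1] interpolates between the two.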
For the anisotropic Hill'48 materials, we assumed linear hardening, defined by the initial yield stress σ0 and the linear hardening parameter. The selected material parameters for the anisotropic material are summarized in Table 2.

Random walk paths
The RW paths are first designed as continuous piecewise-linear paths in strain space with superposed noise. Subsequently, they are transformed into discrete sequences of strain increments. Time is primarily introduced as a curve parameter for labeling purposes.

FIGURE 1: Stress-strain response for the sampled mixed Swift-Voce hardening laws.

TABLE 2: Summary of the selected model parameters for the Hill plasticity model.

The sampling procedure is illustrated in Figure 2. Each path is created based on the following sequence:
1. The point ε(0) = 0 is chosen as the starting point.
2. Determination of a large number of points ε^(i) in the six-dimensional strain space through rejection sampling: uniformly drawn points are rejected if the tensor norm exceeds 0.15.
3. The straight segments s_i connecting consecutively drawn points ε^(i−1) and ε^(i) (with i = 0, 1, 2, …) then define the continuous main path.
4. To obtain the discrete counterpart, we define intermediate points on each straight segment; for this, we uniformly subdivide each segment into sub-segments with a maximum length of √3∕2 ⋅ 10^−2. The number of intermediate points on a segment s_i is then given by ⌊‖ε^(i) − ε^(i−1)‖ / (√3∕2 ⋅ 10^−2)⌋, while the corresponding strain values for the intermediate points are determined through linear interpolation between the segment start point ε^(i−1) and its end point ε^(i).
5. We additively introduce noise at the intermediate points on the discretized segment. The magnitude of the noise is the product of two components and a direction. The first component is sampled log-uniformly from the interval [0.002, 0.01] and the second component is uniformly drawn from the interval [0, 1]. The direction of the noise is sampled in the same way as ε^(i) while limiting the tensor norm to 1. The resulting magnitude of the noise components follows a left-skewed distribution in log-space with a mean at 1.5 ⋅ 10^−3. The pronounced left tail reaches to small noise levels with the 1% quantile at around 5.4 ⋅ 10^−6, and a short right tail with a maximum at 1.06 ⋅ 10^−2 (the theoretical limit is √3∕2 ⋅ 10^−2, which in practice is almost never reached).
6. To control the strain increment size, we evenly increase the resolution of each sub-segment once more by a factor of 40.
7. Steps 4-6 are repeated until 3201 points are obtained. Due to the randomly drawn points, the total number of segment end points to be drawn is also random, but always smaller than 3200 and on average 5-6 (average equivalent distance on the main path of 0.095 with an average strain increment norm of around 1.7 ⋅ 10^−4).

FIGURE 2: Sampling procedure of the random walk datasets: the points along the main path are sampled within a strain tensor norm of 0.15; additional noise is introduced at intermediate points; each sub-segment is upsampled to control the strain increment size.
The trace of the generated strain path is checked at all instances. To avoid excessive hydrostatic stresses, it is corrected whenever the volume change exceeds 3%.
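A simplified sketch of the sampling procedure above is given below. It reproduces the main steps (rejection sampling of end points, linear interpolation, multiplicative noise magnitude, upsampling by 40), but omits the volume-change correction; all function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point(max_norm):
    """Rejection-sample a 6-component strain vector with norm <= max_norm."""
    while True:
        p = rng.uniform(-max_norm, max_norm, size=6)
        if np.linalg.norm(p) <= max_norm:
            return p

def random_walk_path(n_points=3201, max_norm=0.15,
                     seg_len=np.sqrt(3.0) / 2.0 * 1e-2, upsample=40):
    """Simplified RW path sketch: piecewise-linear main path plus noise."""
    path = [np.zeros(6)]                          # step 1: start at the origin
    while len(path) < n_points:
        start, end = path[-1], sample_point(max_norm)            # steps 2-3
        n_sub = max(int(np.linalg.norm(end - start) / seg_len), 1)  # step 4
        for i in range(1, n_sub + 1):
            pt = start + (end - start) * i / n_sub   # linear interpolation
            # step 5: noise magnitude = log-uniform * uniform, random direction
            mag = np.exp(rng.uniform(np.log(0.002), np.log(0.01))) * rng.uniform()
            pt = pt + mag * sample_point(1.0)
            # step 6: upsample each sub-segment by a factor of 40
            prev = path[-1]
            for j in range(1, upsample + 1):
                path.append(prev + (pt - prev) * j / upsample)
    return np.array(path[:n_points])              # step 7: truncate to 3201

path = random_walk_path()
d_eps = np.diff(path, axis=0)   # 3200 discrete strain increments
```

The discrete strain increments `d_eps` are what is fed to the RNN; their norms are controlled by the upsampling factor.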
Using the above procedure, we generated 11,000 discrete strain paths, each comprised of 3200 increments. Selected paths are plotted in Figure 3A for visualization purposes. The different colored curves correspond to the individual strain tensor components. Figure 4A,B summarize the properties across the entire dataset. The deviatoric strain increment norms are within [4.9 …] for 99% of a given RW dataset. Note that the vast majority of strain increments lies below the elastic limit of the material, indicated by the dashed vertical line in Figure 4B. Additionally, Figure 4C,D show histograms of the accumulated total and plastic strain along all paths. Most strain points are in the vicinity of a sphere of radius 0.07 (peak in histogram), while none exceeds a distance of 0.15 from the origin. The arc lengths, that is, the accumulated total strain along the 3200-increment-long paths, are 0.76 on average, with the longest paths reaching 1.7.

Smooth paths
The paths in the smooth path (SM) dataset are based on sine functions. A path example is shown in Figure 3D. Both the frequency f_sample and the phase shift φ_sample are drawn uniformly at random, where T = 3,200 is the total path length. Additionally, we apply a constant offset to each strain component to guarantee ε(0) = 0 for all paths. The discontinuities of the first derivative of the normal strain components come from the fact that we control the volume change during the virtual experiments in order to limit the maximum observed hydrostatic pressure. This is also evident in the strain increments depicted in Figure 3E. For every smooth strain path, we sample the frequency and phase shift for each component independently. During sampling, we choose the strain increments such that the resulting distribution of the norm of the increment in the deviatoric strain tensor lies within the distribution for the strain increments of the RW paths (see Figure 4B). Other properties and characteristics result from an interplay of path length and frequency. For comparison with the RW dataset, Figure 4C,D show histograms of the accumulated total and plastic strains along each path, and Figure 4B the strain increment distributions of the RW and SM datasets.
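A minimal sketch of such an SM path generator is shown below. The amplitude and the frequency range are illustrative assumptions (the paper's sampling intervals are not legible in this text); only the structure — independent sine per component, random frequency and phase, constant offset so that ε(0) = 0 — follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 3200  # total path length (number of increments)

def smooth_path(amp=0.05, f_range=(0.5, 3.0)):
    """Smooth (SM) strain path sketch: each of the six strain components
    follows a sine with independently sampled frequency and phase shift,
    offset so that eps(0) = 0. amp and f_range are assumed values."""
    t = np.arange(T + 1) / T
    eps = np.empty((T + 1, 6))
    for j in range(6):
        f = rng.uniform(*f_range)             # frequency (assumed range)
        phi = rng.uniform(0.0, 2.0 * np.pi)   # phase shift
        s = amp * np.sin(2.0 * np.pi * f * t + phi)
        eps[:, j] = s - s[0]                  # constant offset: eps(0) = 0
    return eps

eps = smooth_path()
```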

Stress sequences
To complete the training datasets for a given material, we evaluate the corresponding (known) constitutive law for the sampled sequences of strain increments. The outcome consists of pairs of strain and stress tensor sequences, each comprised of 3200 points. In practice, this evaluation of the constitutive law is performed by means of single element simulations with the finite element software Abaqus/Standard (while noting that it could also have been performed without invoking any FE software). The Cauchy stress tensor is decomposed into its spherical and deviatoric part. The hydrostatic stress is normalized by the reference stress p0 = 5,000 MPa, while the deviatoric stress components are normalized by σ′0 = 300 MPa. The same normalization is used for all materials. Figure 3C shows the stress histories for material 1 for the strain paths depicted in Figure 3A. It demonstrates that a single loading path includes multiple loading and unloading steps.
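The decomposition and normalization of the stress output can be sketched as follows; the function name is ours, but the reference stresses (p0 = 5,000 MPa, σ′0 = 300 MPa) are those stated above.

```python
import numpy as np

P0 = 5000.0       # reference hydrostatic stress [MPa]
SIGMA_DEV0 = 300.0  # reference deviatoric stress [MPa]

def normalize_stress(sigma):
    """Split a batch of Cauchy stress tensors (shape (..., 3, 3)) into a
    normalized hydrostatic part and normalized deviatoric components,
    using the same reference stresses for all materials (sketch)."""
    p = np.trace(sigma, axis1=-2, axis2=-1) / 3.0    # hydrostatic stress
    dev = sigma - p[..., None, None] * np.eye(3)     # deviatoric part
    return p / P0, dev / SIGMA_DEV0

sigma = np.diag([400.0, 100.0, 100.0])[None]  # one uniaxial-like stress state
p_n, dev_n = normalize_stress(sigma)
```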

Datasets
For each material, we create multiple datasets. For both the RWs and the SMs, we each create a training set comprised of 10,000 stress-strain sequences and a validation set comprised of 1000 sequences, denoted by RW-10,000 and RW-1000 or SM-10,000 and SM-1000, respectively. To facilitate the training of the RNN model, each dataset is available for three distinct discretizations: 3200 increments, 640 increments, and 128 increments. The shorter 640 and 128 increment sequences are extracted from the 3200 increment sequences by summing up every 5 or 25 consecutive strain increments into a single, larger increment. The corresponding stress paths consist of the stress values at every 5th or 25th point along the stress paths; all intermediate stress values are neglected. Note that the sequences are only shorter in terms of number of entries, while the total arc length of the strain path is approximately the same, irrespective of the number of increments comprised in a sequence. At the same time, the strain increment distribution is shifted to higher values and the loading history is also changed compared to the original path at high resolution. These shorter sequences are only used during the initial training stages, in which the RNN model parameters are roughly adjusted to mimic the material response. All final models are trained on the full sequences with 3200 short increments. In order to investigate the effect of the training dataset size, we also create several subsets based on the original training data. For each training subset, we randomly pick N_paths^(i) samples from the pool of N_paths = 10,000 paths (here, N_paths^(i) ∈ {200, 400, 1000}). Note that we draw each subset independently, thus one loading path can occur in multiple subsets. For all model evaluations, we keep the original validation sets of size 1000.
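The construction of the coarser 640- and 128-increment discretizations can be sketched as follows (function and variable names are ours): consecutive strain increments are summed, and only every 5th or 25th stress value is kept.

```python
import numpy as np

def coarsen(d_eps, sigma, factor):
    """Build a coarser discretization: sum every `factor` consecutive
    strain increments into a single larger increment and keep only
    every `factor`-th stress value (sketch of the dataset construction).

    d_eps : (N_inc, 6) strain increments; sigma : (N_inc, 6) stresses.
    """
    n = d_eps.shape[0] // factor
    d_eps_c = d_eps[:n * factor].reshape(n, factor, -1).sum(axis=1)
    sigma_c = sigma[factor - 1::factor][:n]  # stress at every factor-th point
    return d_eps_c, sigma_c

rng = np.random.default_rng(2)
d_eps = rng.normal(size=(3200, 6)) * 1e-4
sigma = np.cumsum(d_eps, axis=0)  # stand-in stress history for illustration
d640, s640 = coarsen(d_eps, sigma, 5)
d128, s128 = coarsen(d_eps, sigma, 25)
```

Note that the total accumulated strain is preserved exactly by the summation, while the increment norms grow with the coarsening factor, as stated in the text.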

Recurrent neural network model
We make use of LMSCs, a recurrent neural model that has been specifically developed for constitutive modeling purposes. 28 Different from standard models such as GRUs or LSTMs, LMSCs allow the number of state variables to be chosen independently from the number of parameters required for mathematical flexibility. For a chosen number of state variables n_s (which is expected to be related to the internal mechanisms governing the macroscopic response of a material), the fitting flexibility of the LMSC can be freely adjusted by adding more or fewer trainable parameters to the cell. Another important feature that distinguishes LMSCs from other network architectures is that its formulation is self-consistent by design. Self-consistency in this context refers to the property that the output of the LMSC converges with decreasing increment sizes.
The model inputs at a given time t are the strain increments Δε(t). The input is first processed through a deep network of fully connected quadratic layers with hyperbolic tangent activation, with i = 1, …, D, the weight matrices W_a,i and W_b,i and the biases b_a,i and b_b,i. In order to update the state vector, we first define gate vectors from the processed input and then update the material state with the help of the self-consistent update rule; the exponential function, the hyperbolic tangent, and the multiplication "*" are applied elementwise. The final output (stress components) is computed via a linear transform. Note that in the case of zero strain increments, ‖Δε(t)‖ = 0, we have the important property ξ(t) = ξ(t−1) (no change in state) that is not guaranteed by many standard RNN architectures (e.g., LSTMs and GRUs).
In sum, the model architecture is defined through three hyperparameters:
• the size n_s of the state space, that is, the length of the state vector ξ;
• the depth D of the deep network;
• the width W of the deep network layers.
The total number of weights and biases to be identified through training is a function of these three hyperparameters n_s, D, and W.

Loss metrics
We make use of different loss metrics. The standard mean square error (MSE) reads

ℒ_MSE = 1 / (N_paths N_inc N_σ) Σ_{k=1}^{N_paths} Σ_{t=1}^{N_inc} Σ_{j=1}^{N_σ} ( σ(t)_{j,k} − σ̄(t)_{j,k} )²   (13)

with σ(t)_{j,k} the prediction for the jth component of the output stress vector at time step t for the kth path in the dataset and σ̄(t)_{j,k} its corresponding target value. N_paths, N_inc, and N_σ correspond to the number of loading paths comprised in the dataset, the total number of time increments along each path, and the number of components of the RNN output vector, respectively.
To provide a measure that is more robust against individual outliers than the standard MSE, we also define a loss based on the sequence medians, ℒ_median. Note that we only take the median over time, but not over the batch and stress component axes. Using the median as evaluation metric requires that the network's estimates for 50% of all datapoints are "good" in order to yield meaningful values. We thus limit the use of ℒ_median to model validation and testing only (but not actual training), where the above assumption is expected to hold true.
And finally, in some instances, we also use a normalized version of the median absolute error along each path, ℒ_median,abs. For a given dataset, we compute the standard deviation σ_std,j of each component j of the output stress vector. We then make use of the standard deviation to normalize the median absolute errors (per path).
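The loss metrics above can be sketched as follows. The MSE follows Equation (13) directly; for the normalized median absolute error, the final aggregation over paths and components is our assumption (the text only fixes that the median is taken over time and that per-component standard deviations are used for normalization).

```python
import numpy as np

def mse_loss(pred, target):
    """Standard MSE over paths, time steps, and stress components
    (Equation (13)); pred, target: (N_paths, N_inc, N_sigma)."""
    return np.mean((pred - target) ** 2)

def median_abs_loss(pred, target, eps=1e-12):
    """Normalized median absolute error per path (sketch): the median of
    the absolute error is taken over time only, normalized by the
    per-component standard deviation of the target stresses, then
    averaged over paths and components (aggregation assumed)."""
    med = np.median(np.abs(pred - target), axis=1)   # (N_paths, N_sigma)
    std = target.std(axis=(0, 1)) + eps              # (N_sigma,)
    return np.mean(med / std)

rng = np.random.default_rng(3)
target = rng.normal(size=(4, 100, 6))
pred = target + 0.1   # constant offset as a toy prediction error
```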

Regular training
For regular training, the network parameters are initialized randomly. The weights of the internal layers W_a,i and W_b,i are chosen at random according to He et al. 53 The weights used in the update rule of the state vector are drawn from uniform distributions with variance scaled according to Glorot and Bengio 54 and He et al, 53 respectively. Following Bonatti and Mohr, 28 the two bias terms of the update rule are initialized with values of three and zero, respectively. After initialization, the networks are trained for 1000 epochs according to the following schedule: all models are trained with the Adam optimization algorithm 55 using a batch size of 64. We use the mean squared error ℒ_MSE as loss function. The learning rate η_lr is updated via a power law based on the current MSE for the training set. The learning rate is initialized as η_lr,init = 5 ⋅ 10^−3 and the scaling coefficient is set to μ_lr = 0.03. According to this rule, the dynamic learning rate update takes place once the MSE drops below 2.8 × 10^−2.
These hyperparameters were identified in a preliminary study via a regular grid search, varying the initial learning rate η_lr,init, the scaling coefficient μ_lr and the exponent of the learning rate schedule, the batch size, and the network size, determined by the width W, the depth D, and the size of the state space n_s. Based on the generated datasets, the state space size is set to n_s = 7. A network size of W = 25 and D = 4 has proven to fit the datasets accurately and robustly for the materials outlined in Section 3.1. This architecture has a total number of 5006 parameters. Larger networks do not yield a significant performance improvement, while smaller networks show a drop in accuracy.
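A possible form of the power-law schedule is sketched below. The exponent p = 0.5 is our assumption, inferred from the stated values: with μ_lr = 0.03 and η_lr,init = 5 ⋅ 10^−3, the dynamic update μ_lr ⋅ MSE^p falls below the initial learning rate precisely when the MSE drops below roughly 2.8 × 10^−2.

```python
def learning_rate(mse, lr_init=5e-3, mu_lr=0.03, p=0.5):
    """Power-law learning-rate schedule (sketch). The exponent p = 0.5 is
    an inferred assumption: with these values, mu_lr * mse**p drops below
    lr_init at mse ~ 2.8e-2, matching the threshold stated in the text."""
    return min(lr_init, mu_lr * mse ** p)
```

Above the threshold the learning rate is clipped at its initial value; below it, the rate decays smoothly with the training MSE.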

Transfer learning

Single transfer learning
During single transfer learning, the RNN parameters for the new, unknown material B are initialized with the weights and biases of an already trained neural network for material A. Since the neural networks share the same architecture, this is possible without any further adjustments. A viable and sensible alternative is to restore only the parameters of the internal layers and randomly initialize those of the state vector update. In the ideal scenario, a properly trained network should carry some notion and interpretation of mechanical state variables, like the accumulated plastic strain, that are stored in the state vector. We attribute a key role in interpreting, storing, and updating these mechanical state variables to the parameters of the update rule (Equations (8) and (9)). Hence, this variant would ideally restore the general notion of plasticity within the internal layers (Equation (7)) and fine-tune the exact evolution of the state variables for the specific material (via the update rules).
In contrast to regular training, transfer learning is always performed on long sequences with a length of 3200 increments, since we do not start from a random parameter initialization but from a pre-trained network.
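The two initialization variants described above can be sketched as follows. The parameter dictionary and the key prefixes ("internal_" for the deep layers, "update_" for the state-update rule) are illustrative assumptions; only the logic — restore everything, or restore the internal layers and re-initialize the update rule — follows the text.

```python
import numpy as np

def init_from_pretrained(params_a, restore_update_rule=True, rng=None):
    """Initialize the parameters of a new material B from a trained
    material A (sketch). params_a is a dict of named weight/bias arrays.

    If restore_update_rule is False, only the internal layers are
    restored and the update-rule parameters are re-initialized at
    random, keeping the general notion of plasticity while re-learning
    the state-variable evolution for material B."""
    rng = rng or np.random.default_rng()
    params_b = {}
    for name, w in params_a.items():
        if name.startswith("update_") and not restore_update_rule:
            params_b[name] = rng.uniform(-0.1, 0.1, size=w.shape)  # re-init
        else:
            params_b[name] = w.copy()  # restore from material A
    return params_b

# Toy parameter set standing in for a trained network of material A
params_a = {"internal_W_a_1": np.ones((25, 25)), "update_W_g": np.ones((7, 25))}
params_b = init_from_pretrained(params_a, restore_update_rule=False)
```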

Ensemble transfer learning
The conceptual difference between single and ensemble transfer learning is illustrated in Figure 5B. During ensemble transfer learning, the starting point is an ensemble of N parameter sets that have each been obtained through training for N different materials. Each weight matrix and bias vector W is initialized as the linear combination of the respective parameters of all previously learnt N materials in the current ensemble, spanning N trained neural networks:

W = Σ_{i=1}^{N} λ^(i) W^(i).
Here, W^(i) corresponds to the weight or bias term of material i. For simplicity, the scalar coefficients λ^(i) of the linear combination are the same for all network parameters. Before choosing the coefficients λ^(i), we evaluate the performance of the RNN model on the dataset for the new material with the parameters for material i. Knowing the losses ℒ^(i)_MSE and ℒ^(i)_median,abs, we then consider five distinct strategies to assign the values λ^(i). Note that for all strategies the coefficients λ^(i) are normalized by a constant factor so that the sum Σ_i λ^(i) is equal to one:
(i) Determination through inverse error weighting based on ℒ^(i)_MSE.
(ii) Determination through inverse error weighting based on ℒ^(i)_median,abs.
(iii) While strategy (i) is expected to prioritize the "best" individual material, strategy (ii) is expected to yield a more balanced and equal weighting of all materials. Strategy (iii) also relies on the median error ℒ_median,abs but raises it to the power of four. Typically, this will yield a weighting in between the two extremes (i) and (ii).
(iv) Strategy (iv) follows a greedy policy and will always pick the "best" individual material X* according to ℒ_median,abs and neglect all other materials in the ensemble.
(v) Additionally, we investigate an optimized initialization that determines the coefficients λ^(i) using Bayesian optimization to find a near-optimal linear combination of parameters (more general information about Bayesian optimization is given in the Appendix). For this, the objective function is described through a Gaussian process with a squared exponential kernel with the pre-set parameters k = 0.05 and h = 0.5. We rely on the expected improvement (EI) criterion 56 that serves as acquisition function to suggest new coefficients λ^(i). Since we model the objective function based on a Gaussian process, the EI can be written in closed form 57 :

EI(x) = (t − μ(x)) Φ( (t − μ(x)) ∕ σ(x) ) + σ(x) φ( (t − μ(x)) ∕ σ(x) ).

The improvement is modeled over the target t, which we set to be the minimum over all previous observations. μ(x) and σ(x) denote the mean and standard deviation of the predictive distribution, respectively, Φ(x) is the standard normal CDF, and φ(x) is the standard normal PDF.
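Strategies (i)-(iv) reduce to simple weighting rules over the per-material losses; a sketch is given below (function names and the strategy labels are ours). Strategy (v) would replace `ensemble_coefficients` by a Bayesian-optimization loop over the λ^(i).

```python
import numpy as np

def ensemble_coefficients(losses, strategy):
    """Coefficients lambda^(i) for the linear combination of N trained
    parameter sets (strategies (i)-(iv); sketch). `losses` holds the
    error of the new material's data under each trained model; all
    coefficients are normalized to sum to one."""
    losses = np.asarray(losses, dtype=float)
    if strategy == "inverse_error":            # strategies (i) and (ii)
        lam = 1.0 / losses
    elif strategy == "inverse_error_pow4":     # strategy (iii)
        lam = (1.0 / losses) ** 4
    elif strategy == "greedy":                 # strategy (iv)
        lam = np.where(losses == losses.min(), 1.0, 0.0)
    else:
        raise ValueError(strategy)
    return lam / lam.sum()

def combine(param_sets, lam):
    """W = sum_i lambda^(i) * W^(i), applied to every named parameter."""
    return {k: sum(l * ps[k] for l, ps in zip(lam, param_sets))
            for k in param_sets[0]}

lam = ensemble_coefficients([0.1, 0.2, 0.4], "inverse_error_pow4")
```

Raising the inverse error to the power of four concentrates the weight on the best-performing materials without discarding the rest entirely, sitting between plain inverse weighting and the greedy pick.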

Effect of dataset on model's generalization ability (after regular training)
We train LMSC models separately on the RW-10,000 and SM-10,000 datasets of material #1 (compare Table 1) and perform a validation against the respective other dataset. We refer to an LMSC model trained on the RW dataset as "RW model," and to one trained on the SM dataset as "SM model." For each dataset, we randomly initialize 10 different LMSC models that are trained separately. After ranking them by their MSE on the respective validation dataset (RW-1000 and SM-1000), we pick the best of the 10 trained models. Afterwards, we test the model's performance on other datasets. Figure 6A depicts the loss evolution of the RW model on the corresponding datasets during training. Figure 6B shows an RW model evaluated on a randomly chosen RW and SM path (for clarity, we only show one of the Cauchy stress tensor components). The model prediction agrees very well with the target values. Overall, the RW model demonstrates very good performance on the entire SM dataset. Note that in Figure 6A the validation loss on SM-1000 lies below the loss on RW-1000. For a quantitative comparison, Table 3 lists the quantiles of the squared errors for different combinations of training and validation data. Each row contains the quantiles for a specific combination of datasets; for example, the model in row (a) was trained on SM-10,000 and evaluated on SM-1000, while row (b) was trained on RW-10,000 but also evaluated on SM-1000. Training on RW-10,000 corresponds to rows (b) and (c). Comparing the results from the first two rows clearly shows that the performance of the RW model on the SM data is on par with the original SM model that was natively trained on SM-10,000. This demonstrates the RW model's good generalization ability.
In contrast, the SM model performs poorly when it is tested on RW data. This can be seen in Figure 7A and from the comparison of rows (c) and (d) in Table 3. For most quantiles, the SM model (row d) falls short by almost two orders of magnitude compared with the RW model (row c). For further illustration, Figure 7B shows the prediction of the SM model for a randomly chosen Cauchy stress component along an RW testing path. The agreement is poor for many parts of the path, with the SM model over- or underestimating the stress level. We note that the sizes of the strain increments of the RW path (Figure 7B, bottom) lie within the 1% and 99% quantiles of the SM training data (denoted by the dashed gray lines). The poor performance can thus not be directly attributed to differences in the increment lengths of the RW and SM datasets. It is rather seen as a low generalization ability of the SM model. It is speculated that the superior generalization ability of the RW model is due to the more effective sampling of the model space when using RW paths.
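As an illustration of how RW training paths can be constructed — a sketch only, with a fixed assumed increment norm in place of the paper's own increment-size distribution:

```python
import numpy as np

def random_walk_strain_path(n_steps, inc_norm=1e-3, dim=6, seed=0):
    """Random walk in strain space: every step takes a uniformly random
    direction in R^dim with a prescribed increment norm (assumed value)."""
    rng = np.random.default_rng(seed)
    steps = rng.standard_normal((n_steps, dim))
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)  # unit directions
    return np.cumsum(inc_norm * steps, axis=0)             # strain history
```

Because consecutive increments are uncorrelated, such a walk visits many more combinations of strain increment and internal state than a smooth, monotonic path of the same length, which is the sampling argument made above.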

Single transfer learning: From material A to material B
We initialize the parameters of network B with the trained parameters of material A. Material A corresponds to material #2 (Table 1), and we train models for material #1 as in the previous Section 5.1. With "network B," we refer to an LMSC model trained for material B. The first question we would like to address is: How does the ratio of frozen and

FIGURE 7
Results from regular training on the SM-10,000 dataset: (A) loss evolution on the training as well as SM-1000 and RW-1000 validation datasets; (B) examples from the validation datasets: deviatoric stress component σ_11 and corresponding strain increment norms along loading paths (dashed gray lines indicate the 1% and 99% quantiles of SM-10,000).
unfrozen network parameters impact the final model performance after transfer learning? In other words, which parameters (and which layers) of the LMSC can be kept constant at the values from material A (frozen weights) while updating the remainder to the new material B (unfrozen weights)? Figure 8A depicts the validation loss after several transfers from material A to B for different ratios of retrained network parameters (here, A and B correspond to materials #2 and #1 in Table 1, respectively). The ordinate depicts the evolution of the MSE on the validation set RW-1000 of material B, while the number of parameter updates is shown on the abscissa. The horizontal, black dashed line indicates the reference performance of a neural network for material B based on regular training (random network initialization) using RW-10,000. It corresponds to an MSE of approximately 6.27 ⋅ 10^−6, and a successful transfer should achieve similar or better performance than this baseline.
Judging by the relatively low level of fluctuations among the 10 runs for each training set, the specific choice of the 400 training paths seems to have a negligible impact on the final results. The overall trend is that the more parameters we unfreeze and fine-tune on the new material B, the lower the resulting error on the RW validation set. At low percentages of fine-tuned parameters, the neural network fails to reach the desired performance level and plateaus above the dashed baseline in Figure 8A. When unfreezing 30%-50% of the weights, we see some overfitting of the training data (though at high error levels). When unfreezing more than 70% of the parameters, the neural networks consistently achieve the same performance as for regular training. Note that retraining the entire network (100% unfrozen) shows the fastest convergence rate.
While partial unfreezing seems to bring no additional value, it is remarkable to observe that with the help of single transfer learning (i.e., initialization with the trained weights for material A), we are able to attain the same accuracy with transfer learning (with 400 paths and 30,000 parameter updates only) as through regular training (with 10,000 paths and 156,000 parameter updates).
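The freeze/unfreeze experiment can be sketched as follows, with a plain-NumPy stand-in for the LMSC and purely illustrative parameter-group names:

```python
import numpy as np

def transfer_update(params, grads, unfrozen, lr=1e-3):
    """One gradient step that only modifies the unfrozen parameter groups;
    frozen groups keep the values inherited from material A."""
    return {name: (val - lr * grads[name]) if name in unfrozen else val
            for name, val in params.items()}

# Initialize network B with the trained parameters of material A ...
params_A = {"l1": np.ones(4), "core": np.ones(4), "W_out": np.ones(4)}
params_B = {k: v.copy() for k, v in params_A.items()}
# ... then fine-tune only a chosen subset of the parameter groups.
grads = {k: np.full(4, 0.1) for k in params_B}
params_B = transfer_update(params_B, grads, unfrozen={"l1", "W_out"})
```

Varying the `unfrozen` set between the empty set and all groups reproduces the 0%-100% sweep of the experiment above.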
Figure 9 further highlights the immediate advantages of transfer learning over regular training. For clarity, we only compare results of regular training with transfer learning trials in which we retrain the entire network (100% unfrozen). To gain insight into the influence of the training dataset size, we performed regular training and transfer learning using RW training sets comprising 200, 400, and 1000 loading paths, respectively. For each of the six settings, there are 10 different curves which correspond to repeatedly drawn sets (from the RW-10,000 data for material B). For a given dataset, the regular training is repeated for 10 different random initializations and only the best is retained. All transfer learning trials reach the baseline performance. Regular training converges significantly more slowly with respect to the number of parameter updates. Furthermore, regular training on a reduced dataset was not able to meet the baseline performance. Besides the faster convergence, transfer learning is also more stable, as demonstrated by the lower noise in the obtained performance curves. Table 4 summarizes the validation losses of the different training scenarios for material #1, including regular training and single transfer learning.
Limitations of single transfer learning become apparent when training on SM datasets. In Section 5.1, we observed that models trained on SM paths do not generalize well to random paths. This also holds true for transfer learning. Figure 10A shows the evolution of the MSE during transfer learning from material A to material B for 25 different SM-400 datasets (each comprising 400 sequences randomly drawn from SM-10,000). The graph on the left depicts the loss on the training dataset, while the graph on the right features the loss on the SM-1000 validation set. In addition, Figure 10B shows the MSE on the RW-1000 validation dataset. While the metrics in Figure 10A are steadily decreasing, there is a turning point at approximately 5000 parameter updates after which the performance on RW-1000 (Figure 10B) deteriorates with continuing updates. At the same time, no metric evaluated on the SM paths (compare Figure 10A) indicates this performance drop on the RW data. Figure 10C,D shows the comparison of SM-1000 and RW-1000 based on the 90% quantile of the median relative error along each path. This alternative loss measure suffers from the same effect, as we observe a large discrepancy between the performance on SM-1000 and RW-1000. Therefore, it appears difficult to design a stopping criterion that achieves the best model performance while performing the transfer on SM paths.
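A patience-based stopping rule on a held-out metric, sketched below, is the kind of criterion that would be needed here; the difficulty noted above is that the available SM metrics do not reflect the degradation on the RW data, so such a rule has nothing reliable to monitor. (The rule itself is a generic sketch, not a criterion prescribed by the paper.)

```python
def early_stop_index(val_losses, patience=3):
    """Index of the best checkpoint under a simple patience rule: stop once
    the validation loss has not improved for `patience` consecutive
    evaluations, and return the index of the best loss seen so far."""
    best, best_i, waited = float("inf"), 0, 0
    for i, v in enumerate(val_losses):
        if v < best:
            best, best_i, waited = v, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_i
```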

5.3 Ensemble transfer learning: From N materials to material N + 1

Iterative development of a material ensemble
Material ensembles are created by sequentially adding new materials to the ensemble, one at a time. For a new material, each parameter is initialized as the linear combination of the respective parameters of all previously learnt N materials in the current ensemble. Figure 11 summarizes the MSE after network initialization with the different strategies (i)-(v) for material ensembles of up to 24 materials. The materials are added in ascending order following Table 1. Each subfigure shows the evolution of the MSE after initialization of the neural network over the size of the material ensemble. To compare the errors between different datasets (corresponding to different materials), we normalize the squared errors of the output stress vector σ(t) component-wise by the variance of the target outputs of the respective dataset.
• The dark gray dots depict the MSE for a random initialization for the new materials that are added sequentially to the ensemble.
• The light gray squares correspond to the initial MSE when only using the pre-trained parameters of a single known material to initialize the neural network of the new material.
• The black solid line and circles indicate the MSE based on the final initialized parameters based on the chosen strategy.
Note that strategy (iv) picks according to the scaled version of L_median,abs, whereas we plot the results in terms of L_MSE. Hence, at some ensemble sizes, strategy (iv) will not select the lowest point (e.g., at ensemble sizes 7 and 20 in Figure 11D). For all four strategies, we observe a continuous decline in the initialization error as the number of known materials increases. In the beginning, when the ensemble comprises only a few materials, it is typically better to greedily pick (strategy (iv)) the best individual material for initializing the weights for a new material. To illustrate this, Figure 11F shows a comparison of strategies (i) to (iv). As more materials are added to the ensemble, we soon find better initializations based on a linear combination of all available materials that outperform the greedy pick. In other words, this is the first indication that we can benefit from leveraging multiple materials at the same time during transfer learning.
If the weighting is chosen too conservatively, as with strategy (ii), we observe a significant performance loss compared with (i) or (iii). The greedy approach (iv) ensures by definition that the initialization for the new material N + 1 is based on the best-suited individual material in the current ensemble. However, strategy (iv) will never utilize the full potential of the material ensemble.
The irregular fluctuations in the evolution of the MSE with increasing ensemble size are correlated with the material properties of the new material N + 1 in relation to the existing materials in the ensemble. Figure 12A provides a detailed view of Figure 11B following strategy (ii) up to an ensemble of nine materials. The first increase in MSE is observed when adding the fourth material to the ensemble. Figure 12B shows the hardening behavior in terms of the equivalent plastic strain and the equivalent stress for each material in the ensemble (gray) and for the new material (orange). The new material exhibits strong strain hardening that saturates below 10% of accumulated plastic strain, which is qualitatively different from the already known materials, which show a more pronounced Swift-type hardening. For the following two materials (Figure 12C, blue and red), the MSE after initialization decreases, as they are easier to explain with the hardening behavior already present in the ensemble. Note that, as a consequence, the weighted initialization for each of these two materials also outperforms the best single-material initialization (all light gray squares lie above the black circles at ensemble sizes 4 and 5 in Figure 12A). There is another sharp increase in initial MSE for materials #8 and #9, which is attributed to their hardening behavior as well as their elevated level of strength; for material #9, the stress-strain curve lies above that of any known material in the ensemble (materials #8 and #9 correspond to the green and magenta hardening curves in Figure 12D, respectively). Based on these observations, an increase in the MSE during the iterative development of the ensemble can also be interpreted as encountering new, previously unseen material behavior that, once added to the ensemble, can be leveraged to explain the behavior of upcoming new materials.
Next, we employ strategy (v) to search for the optimal linear combination of the neural network parameters through Bayesian optimization. Figure 11E shows the MSE after performing the optimized initialization of the neural network for material ensemble sizes ranging from 0 to 24. The comparison of the optimized initialization with all previous strategies (Figure 11F) reveals a significant increase in performance. Once the ensemble comprises only three materials, we are able to find weighted initializations that offer a substantial improvement over the initialization based on individual materials for almost all new materials.
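The component-wise normalization of the squared errors by the target variance, used throughout this comparison to make errors comparable across materials, can be sketched as:

```python
import numpy as np

def normalized_mse(pred, target):
    """Mean squared error with each stress component normalized by the
    variance of the target outputs of the dataset, so that materials with
    different stress levels can be compared on one scale."""
    var = target.var(axis=0)                  # per-component variance
    return float(np.mean((pred - target) ** 2 / var))
```

By construction, uniformly rescaling a dataset (e.g., a stronger material with proportionally larger stresses and errors) leaves this measure unchanged.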

Ensemble transfer learning with frozen parameters
The following results are based on a material ensemble of size 14 with Bayesian optimization based initialization (strategy (v)). For the training of the neural network, we exclusively use SM-400 datasets. We only fine-tune the first layer l_1, the update-rule weights and biases, and the output layer W_out, which together correspond to about 22% of all network parameters (see Figure 8). At the same time, we distinguish two cases:
1. The networks are initialized from the ensemble through optimization using the same SM-400 dataset as during the fine-tuning stage.
2. The networks are initialized from the ensemble through optimization using the full RW-10,000 dataset, and then trained using SM-400 paths.
Figure 13A shows the evolution of the SM-1000 validation loss for neural networks that were initialized and trained on 10 different (randomly chosen) SM-400 datasets. For each dataset, we repeat the training five times and only report the best result. We observe deviations in the convergence behavior for the different datasets. Note that the initial weights themselves depend on the dataset; hence, the evolution of the MSE as well as the performance after convergence differ for every model. Figure 13B depicts the MSE on the RW-1000 validation set for each of these 10 models. The models' performances vary, but the general qualitative shape is similar for most of the models. After 18,000 updates, the error for about half of the models is still decreasing, whereas it is increasing for the other half.
Figure 14 shows the prediction of a stress tensor component for a randomly chosen path from the RW validation dataset. It is obtained using the model trained on SM dataset #7, which yielded the worst performance on the SM and RW validation sets. Figure 14A shows the network prediction right after random initialization. As the parameters are chosen randomly, the model prediction does not yield any meaningful values at this stage. Figure 14B then depicts the network prediction after initialization based on the material ensemble, which already yields a good approximation. Finally, Figure 14C features the network response after training, which results in a further increase in accuracy.
The results for the second case (RW-10,000 initialization followed by SM-400 training) are presented in Figure 15. Compared to the first case (Figure 13A), transfer learning now becomes less sensitive to the training datasets (Figure 15A). However, we observe a growing spread of the SM-1000 validation loss for different SM-400 training datasets after about 7500 updates (Figure 15A). When looking at the performance on the RW-1000 validation dataset (Figure 15B), we see even more clearly that the models' generalization ability degrades after about 7500 parameter updates. Comparing the final model performances on the RW-1000 validation datasets across Figures 13B and 15B, it is apparent that
FIGURE 14
Results from ensemble transfer learning: qualitative prediction for a path from the RW-1000 validation dataset after: (A) random initialization; (B) initialization based on the material ensemble using 400 SM paths; (C) fine-tuning selected layers using 400 SM paths.

FIGURE 15
Results from ensemble transfer learning: initialization based on 10,000 RW paths and fine-tuning 22% of the network parameters based on SM-400 datasets: (A) mean square error (MSE) on SM-1000 validation dataset; (B) MSE on RW-1000 validation dataset.
the initialization based on SM-400 datasets also leads to larger variations on the validation data than initialization with RW-10,000.
Table 5 gives a quantitative summary of the results shown in Figures 13-15. The right columns depict the quantiles of the squared error on the RW-1000 validation dataset after transfer learning with SM-400 datasets. The table lists results for experiments in which all network parameters as well as only the selected 22% have been fine-tuned. For comparison, the table also includes the results for the single transfer learning approach that transfers the network parameters directly from a known, previously trained material to a new, unknown material. Additionally, Table 5 includes the performance of an LMSC model obtained through regular training on a RW-10,000 dataset as well as results from regular training on SM-400 datasets. In general, the results shown in Table 5 reconfirm that ensemble transfer learning outperforms single transfer learning. The data in Table 5 also shows that both transfer learning approaches, single and ensemble, outperform regular training with random parameter initialization. Compared with the baseline model trained on a RW-10,000 dataset, none of the transfer learning results for the SM-400 datasets attains the same performance, though the results for ensemble transfer learning (including those with 78% of parameters frozen) are remarkably close. In summary, ensemble transfer outperforms single transfer learning, which in turn outperforms regular training. At the same time, ensemble transfer enables us to freeze the majority of the network parameters and only fine-tune small portions of the model. The initialization during ensemble transfer has an impact on the convergence behavior but seems to have minor influence on the attainable accuracy level.
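The quantile-based comparisons in Tables 3-5 boil down to a small helper like the following (the quantile levels shown are illustrative):

```python
import numpy as np

def error_quantiles(sq_errors, qs=(0.25, 0.5, 0.75, 0.9, 0.99)):
    """Quantiles of the (normalized) squared error over all increments of a
    validation set, as used for tabulated model comparisons. Quantiles are
    more informative than the mean alone because the error distribution
    along stress-strain paths is heavy-tailed."""
    return {q: float(np.quantile(sq_errors, q)) for q in qs}
```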

Transfer outside the material family
In this last set of numerical experiments, we apply transfer learning to leverage the prior knowledge gained for von Mises materials when identifying the RNN model parameters for an anisotropic Hill'48 material. More specifically, we use our ensemble of size 14 to initialize the network through strategy (v) (Bayesian optimization) for different training datasets.
Figure 16 shows the results on the RW-1000 validation dataset of the Hill material. It compares the evolution of the MSE for regular training and the described transfer learning approach. Analogous to previous investigations, the color coding corresponds to the training dataset size of 400 or 1000 used during transfer learning, and we compare the performance on 10 different, randomly sampled datasets, respectively. Similar to the comparison in Figure 9, we clearly observe that transfer learning outperforms regular training, as fewer parameter updates are needed to reach small error values.

Discussion
Randomly initialized RNN models trained on RW datasets show a significantly higher generalization ability than models trained on SM paths. While good generalization is key to successfully deploying an RNN model in structural simulations, practical limitations in generating training data through physical experiments push our interest toward small datasets, ideally with SM paths. Single transfer learning proves to be a good first candidate for training on small datasets. Initializing the model parameters with the values of a previously trained model for another material results in a significant speed-up of the model training. More importantly, the size of the training dataset may be reduced (when using transfer instead of regular training) without sacrificing the model's generalization ability. Our results for single transfer learning also reveal that it is best to retrain all model parameters. This suggests that there is no recognizable specific element of the trained RNN models (e.g., the central internal layers) that may be considered the heart of the model encoding a general notion of plasticity (which would feature the same parameters for a large variety of materials). While Bonatti and Mohr27 demonstrated the interpretability of the model's state space, it appears to be difficult to separate the model architecture into general-plasticity and material-specific parts. We observed during transfer learning that fine-tuning the first internal layers (which provide a first interpretation of the state variables) is more effective than fine-tuning the last internal layers. However, these local findings do not directly scale to the entire model, as the best performance is still achieved by retraining all parameters.
The effectiveness of transfer learning is further improved when learning from a material ensemble. The optimized initialization (summarized in Figure 11E) clearly shows that the material ensemble offers benefits over the initial knowledge of a single material. Our results also reveal that the benefits of a material ensemble are not strictly limited to one material family. We successfully used an ensemble of isotropic von Mises materials to transfer to an anisotropic Hill'48 material. The benefit of ensemble transfer learning over regular training in this mixed-family setting (Figure 16) is similar to that of transferring between materials of the same family (Figure 9). This suggests that the material ensemble also allows transferring outside the original material family. However, we would like to emphasize that this is not the intended use case. Instead, it is envisioned that the material ensemble already comprises multiple material families. The goal is then that any new material can be initialized by interpolation within the material ensemble; ideally, by using the material ensemble, one will never be forced to train outside the model space the ensemble was built on.
The comparison of Figures 13 and 15 shows that the network initialization based on SM paths is on par with the initialization based on RW paths. This supports the argument from Section 5.1 that it is not the smoothness property that qualifies a dataset for training an RNN model that generalizes well, but rather the ability to efficiently sample and cover the strain and stress space as well as the space of the state variables at the same time. During ensemble transfer learning, the initialization based on a richer dataset (here RW) provides more stability than initializations from smaller datasets. Once a stable initialization is found, little, or in the extreme case no, fine-tuning of the network parameters on the SM paths is required in order to achieve good performance. This enables us to retain as much of the model's generalization ability as possible by changing the model parameters as little as possible.
The results in Figure 15 also show little deviation among the different, randomly sampled datasets. Hence, a good initialization might also render the subsequent fine-tuning stage less susceptible to the exact composition of the dataset. This may have significant implications for the design of mechanical tests and test specimens when training neural networks from experimental data. Therefore, one goal of future research is to find such stable initializations based on small datasets. Furthermore, the performance during transfer learning proved better when fine-tuning only a small fraction of the model parameters than when fine-tuning the entire model (compare Table 5, rows (c) and (d)). Here, we specifically selected and fine-tuned the first layer l_1, the update-rule weights and biases, and the output layer W_out. According to our intuition, these network components correspond to the internal layers that first interpret the new strain increment based on the current state variables and ultimately decide how to update the state variables for future increments.
The fact that these selectively fine-tuned models outperform models that have been entirely retrained is a very promising observation, since the layers that are not fine-tuned seem to already encode a general notion of plasticity that is shared among multiple materials. In view of our results for single transfer learning, it is speculated that this general notion of plasticity needs to be trained from multiple materials, while a single-material learning experience appears to be insufficient. Ensemble learning is thus seen as an important step toward making trained models more explainable and potentially accessible to mechanical interpretation.

CONCLUSIONS
Previous studies have demonstrated that mechanics-specific RNN models are able to describe the three-dimensional stress-strain response of elasto-plastic solids for arbitrary loading paths. After reconfirming this finding for isotropic von Mises and anisotropic Hill'48 materials, we explore the potential of transfer learning to train RNN material models from small datasets that could be obtained from robot-assisted experiments on real materials.58 The main findings are:
• Regular training (i.e., training with random parameter initialization) based on a dataset comprising 10,000 RW paths results in a model with high generalization ability; using a dataset comprising 10,000 SM paths instead significantly reduces the model's generalization ability. We postulate that it is not the smoothness property of the individual paths itself but rather the efficient coverage and sampling of the model space, including strain increments as well as relevant mechanical state variables, that qualifies a dataset for successfully training an RNN model.
• When training an RNN model for a new material, transfer learning (i.e., initializing the weights and biases with the parameters from an already trained material) improves the convergence rates. At the same time, transfer learning significantly reduces the required size of the training dataset. Transfer learning clearly outperforms regular training in terms of accuracy as well as the required number of parameter updates.
• Ensemble transfer learning leverages the experience gained from multiple materials when initializing the parameters for a new material. When initializing the parameters through Bayesian optimization, ensemble transfer learning outperforms both single transfer learning and regular training.
• Ensemble transfer learning also increases the robustness of the training, that is, a similar training performance is observed irrespective of the specific subset of stress-strain sequences chosen. We also observe an increase in stability during training, especially in the low-data regime, which we verified with multiple randomly sampled datasets.
• With the help of ensemble transfer learning, a good model accuracy and generalization ability is obtained even if we are only training on small datasets comprising SM paths (e.g., 400 SM instead of 10,000 RW paths).
Given the observed benefits of ensemble transfer learning, it may even become feasible in the future to replace the training of RNN models for new materials through an optimization guided model parameter choice based on prior experience with other materials.

APPENDIX BAYESIAN OPTIMIZATION
Bayesian optimization is employed in the present work because we are dealing with an objective function that is particularly expensive to evaluate. Bayesian optimization techniques are among the most efficient approaches for reducing the number of required function evaluations during optimization.56,59,60 An additional advantage of Bayesian optimization is that it does not require knowledge of the derivatives of the objective function. The reader is referred to Brochu et al.61 for a comprehensive review of and tutorial on Bayesian optimization. The approach borrows its name from Bayes' theorem, as we try to estimate the posterior probability of a model ℳ given the observed data 𝒟; this probability is proportional to the likelihood of 𝒟 given ℳ multiplied by the prior probability of ℳ:

P( | 𝒟 ) ∝ P(𝒟 | )P()
(A1) Since we do not know the closed-form expression for our objective function (e.g., Equation 13), we estimate it via a surrogate model.Using previous observations  of the actual objective function, we create a surrogate model .In the context of Bayesian optimization practice, Gaussian processes (GPs) are commonly used as surrogate model.Our goal is to maximize the posterior probability (which allows us to pick the "best" model).The trade-off between exploration of new areas and exploitation of known promising regions is balanced with the help of an acquisition function.Each step, the acquisition function is used to determine the next location to sample the objective function.Several heuristics can be found in literature: upper confidence bound algorithm, 62,63 probability of improvement, 64 expected improvement, 56 Thompson sampling. 65After evaluating the objective function at the new point, the surrogate model  is updated including the new observation.In summary, one iteration of the Bayesian optimization includes the following steps:
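This iteration can be condensed into a minimal, self-contained sketch (NumPy only): a zero-mean GP with a squared exponential kernel and the closed-form EI for minimization. The kernel parameters and the toy one-dimensional objective are illustrative, not the paper's actual setup over the coefficients α^(i).

```python
import numpy as np
from math import erf, sqrt, pi, exp

def _pdf(z):  # standard normal PDF
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def _cdf(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sq_exp_kernel(a, b, k=0.05, h=0.5):
    """Squared exponential kernel (k, h as assumed variance/length scale)."""
    return k * np.exp(-0.5 * ((a[:, None] - b[None, :]) / h) ** 2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    K = sq_exp_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    Ks = sq_exp_kernel(x_obs, x_new)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_obs
    var = np.clip(sq_exp_kernel(x_new, x_new).diagonal()
                  - np.einsum("ij,ij->j", Ks, sol), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, t):
    """Closed-form EI for minimization; t = best (lowest) observation."""
    z = (t - mu) / sigma
    return (t - mu) * np.vectorize(_cdf)(z) + sigma * np.vectorize(_pdf)(z)

def bayes_opt_minimize(f, x_init, candidates, n_iter=5):
    """Iterate: fit GP, maximize EI over candidates, evaluate f, update."""
    x_obs = np.array(x_init, dtype=float)
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x_obs, y_obs, candidates)
        ei = expected_improvement(mu, sigma, y_obs.min())
        x_next = candidates[int(np.argmax(ei))]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    return x_obs, y_obs
```

In the paper's setting, `f` would be the validation loss of the network initialized with a candidate linear combination of ensemble parameters, which is exactly the expensive-to-evaluate objective that motivates the GP surrogate.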

FIGURE 4
Dataset distributions: (A) strain norms of the random walk (RW) and smooth path (SM) datasets; (B) deviatoric strain increments for RW and SM (the dashed gray vertical line indicates the initial elastic limit); (C) accumulated total strain along loading paths for RW and SM; (D) accumulated plastic strain along loading paths for RW and SM.

The initial (undeformed and stress-free) state of the material corresponds to vanishing initial conditions. As illustrated by the schematic in Figure 5A, an input vector is created through the concatenation of the current state variables and the normalized strain increment.
FIGURE 5
(A) Schematic drawing of the linearized minimal state cells (LMSC); (B) classification of single and ensemble transfer: single transfer takes place between two individual materials; ensemble transfer considers a collection of materials to learn a new unknown material.

FIGURE 6
Results from regular training on the RW-10,000 dataset: (A) loss evolution on the training as well as RW-1000 and SM-1000 validation datasets; (B) examples from the validation datasets: deviatoric stress component σ_11 and corresponding strain increment norms along loading paths (dashed gray lines indicate the 1% and 99% quantiles of SM-10,000).
TABLE 3
Quantiles of (normalized) squared error for the RW and SM datasets for regular training of material #1.
All curves in Figure 8A are based on training sets comprising 400 RW paths (with 3200 increments) that have been randomly picked from the standard RW-10,000 training dataset for material B. The color coding indicates the ratio of unfrozen network weights: the brighter the color, the more network parameters have been updated during training. The schematics on the right of Figure 8A highlight the layers of the LMSC model whose parameters have been unfrozen; their colors match the color bar of the curves. For each investigated ratio, there are 10 different curves which correspond to 10 different training datasets (i.e., sets of 400 sequences out of 10,000). We perform the same transfer five times for each dataset and show only the best run.
FIGURE 8
Results from single transfer learning: influence of the ratio of retrained weights during transfer learning with 400 RW training paths on the performance of the neural network (evaluated based on the RW-1000 validation dataset); schematics at the bottom mark the retrained layers of the linearized minimal state cells.

FIGURE 9 Results from single transfer learning: comparison of regular and single transfer learning for different training dataset sizes; the black dashed line shows the reference performance for regular training (random network initialization) on RW-10,000.

FIGURE 10 Limitations of single transfer learning using smooth paths (SM): (A) evolution of the mean square error (MSE) on the training and validation datasets during transfer learning with SM for 25 randomly sampled training datasets of size 400; (B) evolution of the MSE on the RW-1000 validation dataset for all 25 models; (C), (D) 90% quantile of the median relative error along each stress path for the validation datasets SM-1000 and RW-1000 for all 25 models.

FIGURE 11 Iterative development and expansion of the material ensemble. Mean square error on the validation dataset after network initialization: (A)-(D) four fixed strategies (i)-(iv); (E) optimized parameter initialization (v) using Bayesian optimization; (F) comparison of the initialization strategies.
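One plausible form of such an initialization strategy is a convex combination of the parameter vectors of the already-trained ensemble members, with combination weights either fixed (e.g., a uniform average) or tuned, for instance by Bayesian optimization of the initial validation error. The sketch below is a hypothetical illustration of that idea, not the authors' code:

```python
import numpy as np

def ensemble_init(member_params, weights):
    """Initialize a new parameter vector as a convex combination of ensemble members."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize to a convex combination
    return np.tensordot(weights, member_params, axes=1)

# Three already-trained materials with (toy) 10-dimensional parameter vectors.
members = np.stack([np.full(10, v) for v in (1.0, 2.0, 3.0)])

theta0 = ensemble_init(members, weights=[1, 1, 1])   # uniform average over the ensemble
```

A Bayesian optimizer would then search over the `weights` vector to minimize the validation loss of the initialized network before fine-tuning starts.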

FIGURE 12 Relationship between the initial error for a new, unknown material and the material properties: (A) initial mean square error for new materials at increasing size of the material ensemble; (B)-(D) hardening behavior in terms of equivalent plastic strain and equivalent stress (gray: materials from the material ensemble; colored: new, unknown material at different material ensemble sizes).

FIGURE 13 Results from ensemble transfer learning: initialization and fine-tuning of 22% of the network parameters based on SM-400 datasets: (A) mean square error (MSE) on the SM-1000 validation dataset; (B) MSE on the RW-1000 validation dataset.

FIGURE 16 Transfer from an ensemble of von Mises materials to a material that follows the Hill yield criterion: comparison of regular training and ensemble transfer learning for different dataset sizes based on the mean squared error on the RW-1000 validation dataset of the Hill material (see parameters in Table ).

How to cite this article: Heidenreich JN, Bonatti C, Mohr D. Transfer learning of recurrent neural network-based plasticity models. Int J Numer Methods Eng. 2024;125(1):e7357. doi: 10.1002/nme.7357

TABLE 4 Overview of validation losses of material #1 (A) for regular training and single transfer learning.

Quantiles of squared error on the RW-10,000 validation dataset a [in 10^-3].
Median of the quantiles of squared error for transfer learning based on SM-400 datasets.

a The reported metrics are based on a fixed number of 18,000 parameter updates; that is, the reported values might not always correspond to the best achievable performance on the RW-10,000 validation dataset.
b 22% corresponds to fine-tuning the first layer l1, the update-rule parameters W and b, and the output layer W_out.
c Training was stopped after 60,000 parameter updates; at this point, excessive training times are considered a prohibitive factor (as compared to transfer runs), and training is thus rendered unsuccessful.