On sparse regression, Lp-regularization and automated model discovery

. Sparse regression and feature extraction are the cornerstones of knowledge discovery from massive data. Their goal is to discover interpretable and predictive models that provide simple relationships among scientific variables. While the statistical tools for model discovery are well established in the context of linear regression, their generalization to nonlinear regression in material modeling is highly problem-specific and insufficiently understood. Here we explore the potential of neural networks for automatic model discovery and induce sparsity by a hybrid approach that combines two strategies: regularization and physical constraints. We integrate the concept of Lp regularization for subset selection with constitutive neural networks that leverage our domain knowledge in kinematics and thermodynamics. We train our networks with both, synthetic and real data, and perform several thousand discovery runs to infer common guidelines and trends: L2 regularization or ridge regression is unsuitable for model discovery; L1 regularization or lasso promotes sparsity, but induces strong bias that may aggressively change the results; only L0 regularization allows us to transparently fine-tune the trade-off between interpretability and predictability, simplicity and accuracy, and bias and variance. With these insights, we demonstrate that Lp regularized constitutive neural networks can simultaneously discover both, interpretable models and physically meaningful parameters. We anticipate that our findings will generalize to alternative discovery techniques such as sparse and symbolic regression, and to other domains such as biology, chemistry, or medicine. Our ability to automatically discover material models from data could have tremendous applications in generative material design and open new opportunities to manipulate matter, alter properties of existing materials, and discover new materials with user-defined properties.


Motivation
The ability to discover meaningful constitutive models from data would forever change how we understand, model, and design new materials and structures.Massive advancements in data science are now bringing us closer than ever towards this goal [2,9].Throughout the past three years, numerous research groups have begun to harness the potential of neural networks and fit constitutive models to experimental data [4,20,23,29,32,34,37,45,50,61,62,69], an approach that is now widely known as constitutive neural networks [41].While initial studies have used neural networks exclusively as black box regression operators [22], recent approaches are increasingly recognizing their potential to discover not only the model parameters, but also the model itself [42].The paradigm of automated model discovery was first formalized in the context of nonlinear dynamical systems more than a decade ago to discover Lagrangians and Hamiltonians for oscillators [7], pendula, biological processes [55], or turbulent fluid flows [8].It is now rapidly gaining popularity in the context of constitutive modeling [52], and several promising techniques have emerged to decipher constitutive relations between stresses and strains [50], and even integrate them automatically into a finite element analysis [51,58,71].These not only include constitutive neural networks [41], but also sparse regression [14], genetic programming in the form of symbolic regression [1], and variational system identification [68].
The holy grail in automated model discovery is to identify generalizable and truly interpretable models with physically meaningful parameters [9,15,43].Ideally, we want to discover a concise, yet simple and interpretable model with only a few relevant terms that best explains experimental data, while remaining robust to outliers and noise.In terms of statistical learning, this translates model discovery into a subset selection or feature extraction task.Subset selection and shrinkage methods are by no means new; in fact, they have been extensively studied for many decades [16,28,54,64].In the context of linear regression, these methods have become standard textbook knowledge [27].In the context of nonlinear regression, when analytical solutions are rare, subset selection is much more nuanced, general recommendations are difficult, and feature extraction becomes highly problem-specific [35].To be clear, this limitation is not exclusively inherent to automated model discovery with constitutive neural networks-it applies to distilling scientific knowledge from data in general [55].This includes alternative model discovery approaches like sparse regression [8,14], or symbolic regression [1,38,55].The key question to the success of discovering new knowledge from data is: How do we robustly discover the best interpretable model with a small subset of relevant terms?And, probably equally importantly: What is the trade-off between interpretability and prediction accuracy?To frame these questions more broadly, let us first revisit the notions of regression and neural networks in the context of constitutive modeling: Regression.Regression is a statistical method to examine the relationship between a dependent variable, in constitutive modeling in solid mechanics the stress σ, and one or more independent variables, in this case the strain ε, using a model that depends on a set of model parameters θ.Here regression has two main objectives: characterizing the form and strength of the relationship between stress and strain to enable predictions, and providing insights into how stress and strain are correlated [27].Popular types of regression are logistic regression, assuming for example a binary relationship; linear regression [56], assuming a relationship that is linear in the model parameters θ, or nonlinear regression, assuming a relationship that is nonlinear in the model parameters θ, as we do throughout this manuscript.Regression is the cornerstone of statistical learning [35].It provides tools to decipher relationships within data, but its application to constitutive modeling requires attention to physical constraints including objectivity, symmetry, incompressibility, polyconvexity, or thermodynamic consistency [3,19,26,40,63,66].As a natural consequence, we cannot just use any set of functions to build our constitutive model: While polynomial functions between stresses and strains associated with a linear regression would be ideal from an optimization point of view, these models may violate thermodynamic constraints, which favor exponential or power functions associated with a nonlinear regression.
Linear regression.Linear regression [56] seeks to model the relationship between a dependent variable, in our case the stress σ, and a set of one or more independent variables, in our case the set of strains ε i at different load levels i, using a function that depends linearly on the model parameter θ = { E }, in this case the elastic modulus or Young's modulus.The regression estimates this parameter by minimizing the difference between the predicted stress, σ i = E ε i , for given strains ε i and stiffness E, and the experimentally measured stresses σi , divided by the number of data points n data .A common measure for this difference is the mean squared error based on the L 2 -norm [27], for which the minimization problem becomes .. estimate model parameter E . ( When phrased as least square's problems, provided linear regression problems have a convex objective function with a unique global minimum.For our example (1), we can find it by evaluating the vanishing derivative, ∂L/∂E = ∑ Here, the minimization problem is not only linear in the model parameter E, but also in the dependent variable ε, and we obtain an explicit solution for the elastic modulus, E = ∑ n data i ε i σi / ∑ n data i ε i ε i .For linear regression with multiple model parameters θ, for which the minimization problem is a linear function in the dependent variables ε, we obtain a coupled system of equations, ∂L/∂θ .= 0, with an explicit solution for the parameter vector θ.For linear regression with one or multiple model parameters θ, for which the minimization problem is a nonlinear function in the dependent variables ε, we obtain a similar set of one or more equations ∂L/∂θ .= 0, which may require an iterative solution for the parameter vector θ.Importantly, any regression that is linear in the model parameters θ-independent of whether it is linear, polynomial, or generally nonlinear in the independent variables ε i -is considered a linear regression problem that, when phrased as a least square's problem with appropriate data, results in a convex quadratic function in the model parameters θ with a unique global minimum.[57] seeks to model the relationship between a dependent variable, in our case the Piola stress P, and a set of one or more independent variables, in our case the set of deformation gradients F i at different load levels i, using a function that depends nonlinearly on a set of model parameters θ [5].The regression estimates these parameters by minimizing the difference between the predicted stress, P(F i , θ), for given deformation gradients F i and model parameters θ, and the experimentally measured stresses Pi , divided by the number of data points n data .Similar to linear regression, we can measure for this difference as the mean squared error based on the L 2 -norm [27], || (

Nonlinear regression. Nonlinear regression
for which the minimization problem becomes In general, nonlinear regression problems have a non-convex objective function with multiple local minima.Solving non-convex optimization problems requires iterative algorithms that are at risk of converging to a local minimum instead of the global minimum, and their solution is often highly sensitive to the initial conditions that we select for the parameter vector.Depending on the nature of the problem, the solution we find may involve a large and dense parameter vector θ, and overfitting may occur when the number of parameters is larger than the number of data points, n para > n data .
Notably, even for many data points, we may face overfitting when the data are noisy or not rich enough to sufficiently activate all the parameters.For example, with tension and compression tests alone, we cannot estimate model parameters for shear.

Sparse regression.
Sparse regression is a special type of regression that seeks to prevent overfitting by inducing sparsity in the parameter vector θ by setting a large number of parameters to zero [64].Sparse regression is particularly useful in high-dimensional settings, since it generates models with a small subset of non-zero parameters [27], which tends to make the model more interpretable [35].Historically, the need for sparse regression emerged prominently with the advent of high-dimensional datasets for which the number of parameters can easily exceed the number of independent observations [12].A prominent example is SINDy, an algorithm for sparse identification in nonlinear dynamics that promotes sparsity through sequential thresholded least-squares by iterating between a partial least-squares fit and a thresholding step to sequentially drop the least relevant terms of a model [8].Importantly, while these sparsification algorithms converge well in linear regression associated with convex objective functions [70], their convergence is no longer guaranteed in nonlinear regression with non-convex objective functions.The advantages of sparse regression are improved interpretability by reducing the parameter set to only a few non-zero terms; feature selection by identifying the most relevant terms; and reduced risk of overfitting by promoting simpler models.These advantages come at a price: The disadvantages of sparse regression are selection bias by enforcing sparsity of the parameter estimates; additional hyperparameters that need to be tuned and require additional attention; and risk of misspecification by excluding relevant parameters if sparsity is enforced too aggressively.In conclusion, sparse regression offers a powerful toolset for high-dimensional modeling, but introduces a trade-off between interpretability and prediction accuracy.

Neural networks.
Neural networks are a class of models and algorithms that can approximate a wide range of functions [33].Their versatility not only makes them a powerful tool for classification, reinforcement learning, and generative tasks, but also for regression problems [24], in our context, for regression in constitutive modeling [22].Neural networks consist of input, hidden, and output layers with several nodes in each layer.Their parameters are the network weights θ = {w i,j } , where i = 1, ..., n lay is the number of hidden layers and j = 1, ..., n nod is the number of nodes per layer.
During training, neural networks effectively perform a regression as they learn their parameters by minimizing a loss function L that penalizes the error between model and data.Similar to the classical nonlinear regression in equation ( 2), we can characterize this error as the mean squared error, the L 2 -norm of the difference between the stress predicted by the model P(F i ) and the experimentally measured stress Pi , divided by the number of data points n data to train the model, However, in contrast to traditional regression tools that have a fixed functional form, neural networks can easily adapt their shape which allows them to model complex functions [23], either linear [59] or nonlinear [43] in the model parameters θ.The advantages of neural networks are their universal approximation that allows them to approximate any continuous function for a sufficiently large number of weights [33], and their inherent flexibility that allows them to model high-dimensional nonlinear relationships like the constitutive behavior we seek to model here.Their disadvantages are their computational complexity, especially for densely connected architectures with multiple hidden layers; risk of overfitting sparse or noisy data; and lack of interpretability that generally worsens with the number of layers and makes plain neural networks unsuitable for model discovery tasks.

Sparse neural networks.
Sparse neural networks use a special type of network architecture for which a large number of weights are zero.This reduces the number of active connections between the nodes of consecutive layers.Sparsity can be induced during training by using special algorithms, or after training by pruning [25].The concepts of sparse neural networks and weight or node pruninginspired by brain development and synaptic pruning-have gained increasing attention with the rise of deep learning and the need for computational efficiency [39].The advantages of sparse neural networks are their computational efficiency and faster inference times; reduced risk of overfitting by pro-moting smaller model sizes; and their potential for regularization and improved generalization.The disadvantages are increased training complexity to induce sparsity; and risk of decreased performance through overly aggressive sparsification or pruning.
The objective of this review is to explore neural networks for which sparsity is induced by a hybrid approach of two combined strategies: regularization and physical constraints.Towards this goal, first, we review popular subset selection and shrinkage methods to induce sparsity within the general framework of L p regularization in Section 2.Then, we discuss how to leverage physical constraints to induce sparsity within the framework of constitutive neural networks in Section 3. Finally, we propose a hybrid approach of L p regularized constitutive neural networks for automated model discovery and discuss the interpretability and prediction accuracy of their discovered models by means of a library of illustrative examples in Section 4. We conclude by comparing the different approaches and providing guidelines and recommendations in Section 5.

L p regularization
The concept of L p regularization or bridge regression was first introduced three decades ago in the context of chemometrics with the goal to shrink the parameter space in chemical data analysis [16].
The method has re-gained attention as a powerful tool to promote sparsity in system identification [8], and, most recently, in discovering constitutive models from data [14].L p regularization is a generalized regularization technique that uses the L p norm of the parameter vector θ, the sum of the p-th power of the norm of its i = 1, ..., n para coefficients w i , raised to the inverse power, Bridge regression constrains the regression (2) by constraining this L p norm to be smaller than a non-negative constant ϵ ≥ 0, It proves convenient to reformulate this constrained regression problem as a penalized regression problem, by penalizing the regression (2) with a penalty term that consists of the L p norm || θ || p p , multiplied by a non-negative penalty parameter α ≥ 0, The constrained and penalized regression problems (4) and ( 5) are equivalent, which implies that for a given ϵ ≥ 0, there exists an α ≥ 0 such that the two problems share the same solution θ [18].The flexible power p not only allows us to recover classical regularization techniques like L 2 or L 1 regularization as special cases, but also to interpolate smoothly between these different methods [16].
The advantages of L p regularization are its inherent flexibility by allowing for a continuum of popular regularization techniques for varying powers p; and its potential to effectively promote sparsity.Its disadvantages are the added complexity associated with the selection of hyperparameters, specifically the penalty parameter α and the power p; and the potential computational challenges associated with specific choices for p. Figure 1 illustrates the contours of the regularization term, for varying powers p as p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8], evaluated for two parameters, w 1 and w 2 .The top row illustrates the effect of powers smaller than or equal to one, p ≤ 1; the bottom row illustrates the effect powers larger than one, p > 1.In the following, we highlight the most popular special cases of L p regularization, their history, and their advantages and disadvantages.
L 2 regularization or ridge regression.L 2 regularization, commonly known to as ridge regression, was introduced more than half a century ago to address multicollinearity in regression analysis [28], , evaluated for two parameters, w 1 and w 2 .For p ≤ 1, top row, with the special case of L 1 regularization or lasso represented through the pyramid, L p regularization promotes sparsity by setting some weights exactly to zero, but is no longer strictly convex and can have multiple local minima.For p > 1, bottom row, with the special case of L 2 regularization or ridge regression represented through the ellipsoid, L p regularization promotes stability by reducing outliers, while the regularization term remains convex.
and has gained attention for its ability to stabilize parameter estimates, especially when the parameters are closely correlated [27].It uses the L 2 norm of a vector, the Euclidian norm, the sum of the vector components squared, Notably, the L 2 norm does not weigh all entries of the vector equally.Instead, it squares the vector entries which makes it highly sensitive to outliers as it penalizes the squared magnitude of the individual parameters w i .Ridge regression supplements the regression (2) with a penalty term that consists of the L 2 norm multiplied by a penalty parameter α, Its advantages are stability in multicollinearity by offering stable parameter estimates even in the presence of highly correlated predictors; managing outliers and preventing overfitting by quadratically penalizing extreme coefficients; and computational efficiency even for large datasets.Its disadvantages are introducing bias, which may result in underestimating certain coefficients and effects; and its inability to induce sparsity, which makes it unsuitable for our current focus of subset selection.Figure 1 shows that, for the special case of L 2 regularization, the regularization term adopts a convex ellipsoidal shape that promotes stability by reducing outliers.
L 1 regularization or lasso.L 1 regularization was initially introduced as a method to analyze seismograms in geophysics almost four decades ago [54], and has become widely known under the name of lasso, short for Least Absolute Shrinkage and Selection Operator [64].Lasso has become popular for producing interpretable models [20], while exhibiting the same stability properties as ridge regression.It uses the L 1 norm of a vector, the sum of the absolute values of its components, Because of its similarities with a distance between city bocks, the L 1 norm is often referred to as the Manhattan distance or taxicab norm.Notably, the L 1 norm weighs all entries of the vector equally and is therefore less sensitive to outliers than the L 2 norm.Lasso supplements the regression (2) with a penalty term that consists of the L 1 norm multiplied by a penalty parameter α, Its advantages are enabling feature selection and inducing sparsity by reducing some weights exactly to zero which effectively reduces model complexity and improves interpretability; mitigating overfitting by constraining the magnitude of the weights, which is especially important when data are limited or high-dimensional; and providing predictive insights by identifying the most relevant weights [27].Its disadvantages are introducing bias, which may result in underestimating certain coefficients and effects; and focussing on selective effects while discarding others, especially in nuanced multi-effect situations when the weights are closely correlated.Figure 1 shows that, for the special case of L 1 regularization, the regularization term adopts a non-strictly-convex pyramid shape that promotes sparsity by reducing some weights exactly to zero.
L 1/2 regularization or elastic net.L 1/2 regularization, also known as elastic net, is a hybrid approach that seeks to combine the benefits of both L 1 and L 2 regularization [72].The elastic net supplements the regression (2) with two penalty terms in terms of the L 2 and L 1 norms multiplied by two independent penalty parameters α 2 and α 1 , For α 1 = 0 and α 2 = 0, it recovers the classical ridge regression and lasso as special cases.For α 1 > 0 and α 2 > 0, L 1/2 regularization shares many features with L p regularization with 1 < p < 2 and generates contours similar, but not identical to Figure 1, bottom left.However, in contrast to L p regularization with 1 < p < 2 [16], L 1/2 regularization not only promotes stability, but also induces sparsity while remaining convex [27].Its disadvantage is its added computational complexity.Since L 1/2 regularization is not a special case of the L p regularization family, we will not consider it further throughout this study.
L 0.5 regularization.L p regularization with powers 0 < p < 1 has become a popular tool in subset selection since it promotes sparsity more aggressively than simple L 1 regularization.While L p norms are traditionally defined for powers larger than one, p ≥ 1, the concept of applying powers smaller than one, 0 < p < 1, was introduced more than three decades ago in sparse regression of large systems [16].Notably, for powers smaller than one, the penalty term becomes non-convex and is no longer a norm in the classical sense.For the special case of p = 0.5, the penalty term uses the sum of the square roots of the vector components, , and we can easily see that this construct no longer satisfies the triangle inequality.L 0.5 regularization supplements the regression (2) with a penalty term that consists of this term, multiplied by a penalty parameter α, The advantages of L 0.5 regularization, or more generally of L p regularization with powers smaller than one, are enhanced sparsity potentially leading to more parsimonious models; and subset selection especially in high-dimensional datasets [17].Its disadvantages are its computational complexity induced by its non-convex nature; and its multiple local minima making subset selection complex and non-unique.Figure 1 shows that, for the special cases of L 0.75 , L 0.5 , and L 0.25 regularization, the regularization term adopts an increasingly non-convex shape and promotes sparsity by more aggressively setting weights equal to zero.
L 0 regularization or subset selection.L 0 regularization, is a form of subset selection that imposes a penalty on the number of non-zero parameters in a regression model.The origins of selecting a subset of relevant parameters date back to early efforts in regression modeling with the objective to discover parsimonious models with enhanced interpretation and prediction.However, formalizing this idea as an L 0 penalty method and connecting it rigorously to regularization has emerged more prominently with the advent of high-dimensional datasets [16].The L 0 norm is commonly referred to as sparse norm and is not a norm in a strict mathematical sense.It refers to the pseudo-norm, , where I(•) is the indicator function that is one if the condition inside the parenthesis is true and zero otherwise.As such, the L 0 norm counts the number of non-zero entries in a vector, which implies that this approach directly penalizes model complexity in terms of predictor inclusion.Notably, the L 0 norm is an explicit, discrete measure of sparsity.It is robust to outliers since it only counts the number of non-zero elements in the parameter vector and does not express preference for smaller or larger entries.L 0 regularization supplements the regression (2) with a penalty term that consists of the L 0 norm, multiplied by a penalty parameter α, Its advantages are its conceptual simplicity by providing a direct mechanism for subset selection that directly penalizes non-zero parameters; reduced overfitting by promoting fewer non-zero parameters, in particular when data are limited; and enhanced model interpretability by focusing only on the relevant terms.Its disadvantages are its computational complexity that results from turning continuous model selection into an NP hard discrete combinatorial problem with 2 n possible parameter combinations, making it computationally intractable for problems with large parameter sets; its non-convexity induced by the L 0 penalty term that leads to optimization challenges related to several local minima; and increased instability by discovering models for which slight changes in the data can result in an entirely different parameter set.In the contour plot of Figure 1, L 0 regularization would correspond to two discrete planes along the two parameter axes.
Predictability and interpretability.L p regularization is an intricate balance between predictability and interpretability: For powers larger than one, p > 1, L p regularization can improve predictability, increase robustness, prevent overfitting, and enhance generalization to new data by penalizing outliers and reducing extreme coefficients.For powers equal to or smaller than one, p ≤ 1, L p regularization, can improve interpretability, promote simpler models, and identify the most influential predictors by encouraging sparsity and forcing some coefficients exactly to zero.Taken together, L p regularization is a trade-off between interpretability and predictability, between simplicity and accuracy, and between bias and variance.Two hyperparameters, the power p and the regularization strength α, allow us to fine-tuning of this balance.Throughout this manuscript, we will provide a library of systematic examples that illustrate the sensitivity of L p regularization with respect to these two hyperparameters.

Neural networks
In this study, we adopt the concept of neural networks to perform regression in constitutive modeling with the objective to improve both predictability and interpretabilty.To demonstrate that our strategy generalizes to different types of neural networks, we compare two constitutive neural networks that have recently become popular in the context of automated model discovery [43,59].Both networks are sparse neural networks by design, where sparsity is inspired by the underlying physics of hyperelasticity.In their input layer, they use characteristic features of the deformation gradient F to a priori satisfy the kinematic constraint of material objectivity, and acknowledge a characteristic isotropic and incompressible material behavior by satisfying the constraints of material symmetry and incompressibility [41].In their output layer, they learn a free energy function ψ from which they derive the nominal stress P to a priori satisfy the dissipation inequality and, with it, thermodynamic consistency [45].In their hidden layers, both networks use a special set of custom-designed activation functions to a priori satisfy physical constraints [4] and a particular network architecture to satisfy polyconvexity [37,61].
Invariant based and principal stretch based neural networks.We explore two types of neural networks that use different types of activation functions to represent two different classes of constitutive models: invariant and principal stretch based [30,48].Both networks are generalizations of popular constitutive models that include widely used hyperelastic models as special cases [6,46,53,65].They are interpretable by design and their weights translate into physically meaningful parameters with physical units and a physical interpretation [42].Yet, there is a major difference between both models: The invariant model uses different functional forms for each activation function and results in a nonlinear regression problem, whereas the principal stretch model uses the same functional form with different but fixed exponents and results in a linear regression problem.This has critical implications on the convexity of the objective function, and with it, on the nature of the solution.
Data.We train our networks on both synthetic and real data from tension, compression, and shear tests.For the synthetic data, we generate stretch stress pairs for fixed parameters through forward simulation.Specifically, we calculate stresses over a wide range of tensile stretches, compressive stretches, and shear strains in ten equidistant increments, resulting in three data sets with eleven stretch-stress pairs each.For the real data, we extract stretch stress pairs from our previously published human brain experiments on 5 mm gray matter tissue cubes [10,11].Specifically, we use the reported stresses averaged over n = 15 tensile, n = 17 compressive, and n = 35 shear experiments, in sixteen equidistant increments, resulting in three data sets with seventeen stretch-stress pairs each [43].All data are available on our GitHub repository, https://github.com/LivingMatterLab/CANN.

Invariant based neural network
Invariant based constitutive neural networks take the deformation gradient F as input and predict the free energy function ψ as output from which we calculate the stress P = ∂ψ/∂F.From the deformation gradient, they extract a set of invariants, in our example I 1 and I 2 , and feed them into its two hidden layers [43,44].The first layer generates the powers of the invariants, in our example the first and second (•) and (•) 2 , and the second layer subjects these powers to specific functions, in our example to the identity and exponential (•) and exp(•).The free energy function ψ is a sum of the resulting eight terms.Figure 2 illustrates the invariant based constitutive neural network with the eight functional building blocks highlighted in color, where the hot red colors relate to the first invariant and the cold blue colors to the second.
During training, the network autonomously discovers the best model, out of 2 8 = 256 possible combinations of terms, and simultaneously learns its model parameters θ = { w i,j }.It minimizes the loss function (3), the difference between the stress predicted by the model P(F i , θ) and the experimentally measured stress Pi , divided by the number of data points used for training n data , To ensure thermodynamic consistency, the network does not learn the stress directly [22], but rather derives it from the free energy function [41].For our example in Figure 2, free energy function takes the following explicit representation, with the following derivatives with respect to the invariants I 1 and I 2 , Using the second law of thermodynamics, we can derive the Piola stress, P = ∂ψ/∂F, as thermodynamically conjugate to the deformation gradient F [30], where the term p F −t ensures perfect incompressibility in terms of the pressure p that we determine from the boundary conditions.For the network free energy (12), the Piola stress is Notably, the Piola stress of the invariant based network ( 15) is a nonlinear function in the network weights w i,j , which translates the loss function (11) into a nonlinear regression problem, with possibly multiple local minima.For this particular format, one of the first two weights of each row becomes redundant, and we can reduce the set of network parameters to twelve, θ = [ (w 1,1 w 2,1 ), w 1,2 , w 2,2 , (w 1,3 w 2,3 ), w 1,4 , w 2,4 (w 1,5 w 2,5 ), w 1,6 , w 2,6 , (w 1,7 w 2,7 ), w 1,8 , w 2,8 ].We train our invariant based network with tension, compression, and shear data and rewrite the loss function (11) in terms of two contributions that minimize the error between the normal and shear stresses predicted by the model, P 11 (λ i ) and P 12 (γ i ), and the data, P11,i and P12,i , where n 11 and n 12 denote the different stretch and shear levels λ and γ, To explore the effect of scaling of the three individual stress terms, alternatively, we weigh all three experiments equally, and also train the network by minimizing the error between the normalized tensile, compressive, and shear stresses predicted by the model P ten (λ i ), P com (λ i ), and P shr (γ i ), and the data Pten,i , Pcom,i , and Pshr,i , normalized by the maximum recorded tensile, compressive, and shear stresses, Pmax ten , Pmin com , and Pmax shr , where n ten , n com , and n shr denote the different stretch and shear levels λ and γ, Below, we briefly derive the explicit analytical expressions for the Piola stresses P 11 (λ) in uniaxial tension and compression and P 12 (γ) in simple shear, such that the tensile stress is P ten = P 11 for λ > 1, the compressive stress is P com = P 11 for λ < 1, and the shear stress is P shr = P 12 for all γ.Uniaxial tension and compression.For the special case of uniaxial tension and compression in terms of the stretch λ, with λ 1 = λ and λ 2 = λ −1/2 and λ 3 = λ −1/2 , the invariants take the following form, Using equation ( 14) and the zero normal stress condition, P 22 = P 33 = 0, we obtain the following expression for the uniaxial stress stretch relation, which translates into the following explicit expression between our network stress P 11 and the uniaxial stretch λ, Simple shear.For the special case of simple shear, in terms of the shear γ, with ] and λ 3 = 1, the invariants take the following form, Using equation ( 14), we obtain the following shear stress stretch relation, which translates into the following explicit expression between our network shear stress P 12 and the shear strain γ, Figure 3 illustrates the contours of the loss function L(θ; λ, γ) for all possible two-term models of the invariant based network in Figure 2. By combining any two terms of the model and setting all other weights equal to zero, we can generate 28 possible models.For these 28 combinations of two terms, we evaluate two versions of the loss function, non-normalized from equation ( 16) and normalized from equation ( 17), using the invariant based definitions of the normal stress (20) and shear stress (23).
First, we generate synthetic data, P11,i and P12,i , for tensile stretches of λ = [1.0,..., 2.0], compressive stretches of λ = [1.0,..., 0.5], and shear strains of γ = [0.0,..., 0.5], in ten equidistant increments each, assuming an exact solution with fixed weights of the first layer, 25, and a pair of non-zero weights of the second layer, w 2,i = 1 and w 2,j = 1, while fixing the remaining six weights of the second layer equal to zero.This results in 28 training data sets of eleven stretch-stress pairs each, for tension, compression, and shear.Second, we vary the two non-zero network weights in the ranges w i = [0, .., 2] and w j = [0, .., 2].For each pair of weights, we evaluate the normal and shear stresses P 11 (λ i ) and P 12 (γ i ) using equations ( 20) and ( 23), and extract the tensile, compressive, and shear stresses, P ten i (λ), P com i (λ), and P shr i (γ).Third, we evaluate the non-normalized loss function (16) as the mean squared error between the model stresses P 11 (λ i ) and P 12 (γ i ) and the synthetically generated data stresses P11,i and P12,i , and plot its contours for each pair of weights w i and w j in the lower triangle.Next, we evaluate the normalized loss function (17) as the normalized mean squared error between the model stresses P ten (λ i ), P com (λ i ), and P shr (γ i ), and the synthetically generated data stresses Pten,i , Pcom,i and Pshr,i , and plot its contours for each pair of weights w i and w j in the upper triangle.
Each loss function takes a minimum of L(θ; λ, γ) = 0 for the exact solution, w i = 1 and w j = 1, indicated through the white circles.From this minimum, both versions of the loss function, nonnormalized and normalized, increase with both, decreasing and increasing weights w i and w j , and remain convex within the entire domain, for all 28 two-term models.Yet, the contours of the loss function vary significantly for different pairs of weights w i and w j , indicating its sensitivity with respect to the individual terms: Some pairs of weights generate loss functions of ellipsoidal shape, e.g., the { w 2 , w 3 }, {  [46,53].Overall, this simple example only illustrates the 28 two-term models out of a  16), the upper triangle illustrates the normalized loss function (17).All loss functions are convex, with contours varying from ellipsoids to valleys with long ridges, highlighting the collinearity of some w i and w j pairs.total set of all 256 possible models, only considers a limited stretch and shear range λ and γ, and only screens a narrow window of parameter ranges w i .Even within these limitations, the contours of loss functions are rather difficult to interpret, making it difficult to comprehend the full potential of the entire network, even though it only consists of eight distinct terms.

Principal stretch based neural network
Principal stretch based constitutive neural networks take the deformation gradient F as input and predict the free energy function ψ as output from which we calculate the stress P = ∂ψ/∂F.From the deformation gradient, they extract the principal stretches, λ 1 , λ 2 , and λ 3 , and feed them into the hidden layer [59,60].The hidden layer applies eight different exponents (λ n 1 + λ n 2 + λ n 3 − 3) to these stretches.The free energy function ψ is a sum of the resulting eight terms.Figure 4  principal stretch based constitutive neural network with the eight functional building blocks highlighted in color, where the dark red and green terms are identical to the dark red and green terms of the invariant based network in Figure 2, while the other six terms are different.

illustrates the
During training, the network autonomously discovers the best model, out of 2 8 = 256 possible combinations of terms, and simultaneously learns its model parameters θ = { w i }.It minimizes the loss function (3), the difference between the stress predicted by the model P(F i , θ) and the experimentally measured stress Pi , divided by the number of data points used for training n data , The free energy of the principal stretch based model takes the following explicit representation [48,67], , where the individual weights w i = µ i /α i correspond to the shear moduli µ i divided by the exponents α i .For the eight-term model in Figure 4, we fix these exponents to α = [ +2, +4, +6, +8, −2, −4, −6 − 8 ], such that the free energy becomes a sum of the following n term = 8 terms, and its derivatives with respect to the principal stretches λ i takes the following form, Using the second law of thermodynamics, we can derive the Piola stress, P = ∂ψ/∂F, as thermodynamically conjugate to the deformation gradient F [48], where N i and n i = F • N i are the eigenvectors in the undeformed and deformed configurations, and the term p F −t ensures perfect incompressibilty in terms of the pressure p that we determine from the boundary conditions.For the network free energy (25), the Piola stress is parameterized in terms of eight network weights, θ = [ w 1 , w 2 , w 3 , w 4 , w 5 , w 6 , w 7 , w 8 ].Notably, the Piola stress of the principal stretch based network (28) with fixed exponents is a linear function in the network weights w i , which translates the loss function (24) into a linear regression problem, with a single unique global minimum [49].We train our principal stretch based network with tension, compression, and shear data and rewrite the loss function (24) in terms of two contributions that minimize the error between the tensile and compressive stresses predicted by the model P 11 (λ i ) and the data P11,i , and between the shear stresses predicted by the model P 12 (γ i ) and the data P12,i , where we include data from n 11 different stretch levels λ and n 12 different shear levels γ, For comparison, similar to the invariant based network in the Section 3.1, we also train the network by minimizing the error between the normalized tensile, compressive, and shear stresses predicted by the model P ten (λ i ), P com (λ i ), and P shr (γ i ), and the data Pten,i , Pcom,i , and Pshr,i , normalized by the maximum recorded tensile, compressive, and shear stresses, Pmax ten , Pmin com , and Pmax shr , where n ten , n com , and n shr denote the different stretch and shear levels λ and γ, Below, we briefly derive the explicit analytical expressions for the Piola stresses P 11 (λ) in uniaxial tension and compression and P 12 (γ) in simple shear, such that the tensile stress is P ten = P 11 for λ > 1, the compressive stress is P com = P 11 for λ < 1, and the shear stress is P shr = P 12 for all γ.Uniaxial tension and compression.For the special case of uniaxial tension and compression in terms of the stretch λ, the principal stretches are Using equation ( 27) and the zero normal stress condition, P 22 = P 33 = 0, we obtain the following expression for the uniaxial stress stretch relation, which translates into the following explicit expression between our network stress P 11 and the uniaxial stretch λ, Simple shear.For the special case of simple shear, in terms of the shear γ, we obtain the principal stretches, Using equation (27), we obtain the following expression for the shear stress stretch relation, which translates into the following explicit expression between our network shear stress P 12 and shear strain γ, Figure 5 illustrates the contours of the loss function L(θ; λ, γ) for all possible two-term models of the principal stretch based network in Figure 4.By combining any two terms of the model and setting all other second-layer weights equal to zero, we can generate 28 possible models.For these 28 combinations of two terms, we evaluate two versions of the loss function, non-normalized from equation ( 29) and normalized from equation ( 30), following the method described in Section 3.1, but now using the principal stretch based definitions of the normal stress (33) and shear stress (36).Similar to Section 3.1, we plot the non-normalized loss function in the lower triangle and the normalized loss function in the upper triangle.Again, by design, all loss functions take a minimum of L(θ; λ, γ) = 0 for the exact solution, w i = 1 and w j = 1, indicated through the white circles.From this minimum, both versions of the loss function, non-normalized and normalized, increase with both decreasing and increasing weights w i and w j , and remain convex within the entire domain, for all 28 two-term models.Notably, in contrast to the loss function contours of the invariant based model in Figure 3, the contours of the principal stretch based model in Figure 5 display less variation for different pairs of weights w i and w j : Only a few pairs of weights generate loss functions of ellipsoidal shape, e.g., the { w 3 , w 6 }, { w 4 , w 7 } pairs in the non-normalized lower triangle, or the { w 2 , w 6 }, { w 3 , w 7 }, { w 4 , w 8 } pairs in the normalized upper and non-normalized lower triangles, suggesting that, in the studied stretch and shear range, only a few pairs of terms are non-collinear and would represent a solid base for a potential constitutive model.Most pairs of weights generate loss functions with long ridges parallel to the parameter axes, suggesting that many terms are almost collinear and not well suited as a functional base for a constitutive model.In contrast to the invariant based network, normalization does not seem to fix this issue, both the upper and lower triangle display this collinearity.Notably, the { w 1 , w 5 } model in the first row and fifth column and the { w 5 , w 1 } model in the fifth row and first column combine the positive and negative second powers of the principal stretches, , and represent the popular Mooney Rivlin model [46,53], which is identical to the { w 1 , w 5 } and { w 5 , w 1 } models of the invariant based model in Figure 3. Overall, while these contours are difficult to interpret, we can compare them directly to Figure 3 and realize that, within the studied stretch and shear range λ and γ, and parameter window w i , the invariant based network seems to represent a much broader spectrum of functions than the principal stretch based network for which the functional base seems to be generally more narrow and almost collinear.We also note that the loss function is highly sensitive to normalization: For both networks, the normalized loss functions (17) and (30) tend to generate more convex shapes than the non-normalized loss functions ( 16) and (29), which is why we will focus on the maximum-stress normalized loss functions (17) and (30) in all following examples.29), the upper triangle illustrates the normalized loss function (30).All loss functions are convex, with contours varying from a few ellipsoids to many valleys with long ridges, highlighting the collinearity of many w i and w j pairs.

L p Regularized Neural Networks
We now integrate the concepts of L p regularization from Section 2 and constitutive neural network modeling from Section 3 and explore the resulting regression in view of predictability and interpretability.Specifically, we supplement the loss function of the constitutive neural network with a penalty term of L p type, The loss function minimizes the error between the model stress that we derive from the free energy of the neural network, P(F i ; θ) = ∂ψ/∂F, and the experimentally measured stress Pi divided by the number of data points n data , penalized by the L p norm, || θ || of the parameter vector θ = { w i } made up of the network weights w i , multiplied by the penalty parameter α ≥ 0. Specifically, we use tension, compression, and shear data and specify the stress error as the normalized difference between the tensile, compressive, and shear stresses predicted by the neural network, P ten (λ i ), P com (λ i ), and P shr (γ i ), and the data, Pten,i , Pcom,i , and Pshr,i , at n ten , n com , and n shr stretch and shear levels λ and γ, In the following, we systematically explore the sensitivity of the loss function (38) with respect to the two hyperparameters of the L p regularization, the power p and the penalty parameter α.For illustrative purposes, we first focus on a simplified two-term model, the Mooney Rivlin model that is shared between both neural networks, before we explore both L p regularized complete eight-term networks.
L p regularized Mooney Rivlin model.The Mooney Rivlin model [46,53] is a two-term constitutive model that is located right at the intersection of the invariant based neural network in Figure 2 and the principal stretch based network in Figure 4. Notably, it is the only model, for which both networks coincide.It uses the dark red term, [ , and the green term, [ , of both neural networks, and weighs them by the network weights, w 1,1 w 2,1 = w 1 and w 1,5 w 2,5 = w 5 , while all other network weights are identical to zero, This implies that the activation of any other weight will make the invariant and principal stretch based networks drift away from one another.The Mooney Rivlin model in equation (39) includes the one-term dark red Neo Hooke model [65] with and the one-term green Blatz Ko model [6] with For the Mooney Rivlin model, the L p regularized loss function from equation (38) specifies to with the Mooney Rivlin stresses in tension, P ten = P 11 for λ > 1, and compression, P com = P 11 for λ < 1, from equations ( 20) and (33), and in shear, P shr = P 12 for all γ. from equations ( 23) and (36),  6 illustrates the contours of the L p regularized loss function L(w 1 , w 5 ; λ, γ) for the two-term evaluated for the two parameters, w 1 and w 5 , for synthetic data from tension, compression, and shear tests.For p ≤ 1, top row, with the special case of L 1 regularization or lasso in the fourth column, L p regularization promotes sparsity by training w 1 = 0 exactly to zero, but the loss function is no longer strictly convex and has multiple local minima.For p > 1, bottom row, with the special case of L 2 regularization or ridge regression in the second column, L p regularization promotes stability, retains both non-weights, w 1 > 0 and w 5 > 0, and maintains a convex loss function with a single global minimum.
Mooney Rivlin model, with varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8], evaluated for the two parameters w 1 and w 5 , using synthetic data from tension, compression, and shear tests.The loss function consists of the neural network loss, ∑  1.For all eight graphs in Figure 6, we evaluate the loss function (40) using the Mooney Rivlin stresses (41).First, we generate synthetic data, Pten , Pcom , Pshr for tensile stretches of λ = [1.0,..., 2.0], compressive stretches of λ = [1.0,..., 0.5], and shear strains of γ = [0.0,..., 0.5], in ten equidistant increments each, assuming an exact solution with w 1,1 w 2,1 = w 1 = 1 and w 1,5 w 2,5 = w 5 = 1, while fixing the remaining weights equal to zero.This results in the training data sets of eleven stretch-stress pairs for tension, compression, and shear.Second, we vary the two Mooney Rivlin network weights in the ranges w 1 = [0, .., 2] and w 5 = [0, .., 2], and evaluate the tensile, compressive, and shear model stresses, P ten (λ i ), P com (λ i ), P shr (γ i ) using equations (41).Third, we evaluate the loss function (40) as the normalized mean squared error between the model stresses P ten (λ i ), P com (λ i ), P shr (γ i ) and the synthetically generated data stresses Pten , Pcom , Pshr , supplemented by the L p regularization α p [ | w 1 | p + | w 5 | p ] for the eight different powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8].As these powers p increase by two orders of magnitude, fixing the second hyperparameter α to one and the same value for all eight examples would increasingly emphasize the L p regularization over minimizing the actual network loss, and generate increasingly biased results.Instead, for each power p, we select the penalty parameter α such that the maximum value of the loss function within the screened parameter window, in the dark red upper right corner, at w 1 = 2 and w 5 = 2, consists of equal contributions by the network term and the regularization term.This results in eight different penalty parameters, α = [ 4.69, 3.94, 3.31, 2, 79, 1.97, 1, 39, 0.35, 0.02 ].For each set of hyperparameters {p, α}, we increase the penalty parameter in four increments, indicated through the four hyperplanes in each graph.Similar to the non-regularized loss functions of the invariant and principal stretch based networks in Figures 3 and 5, we highlight the minimum of the last of these four loss functions L(w i , w j ; λ, γ) through a white sphere.Importantly, in contrast to the non-regularized loss functions in Figures 3 and 5, the regularized loss function in Figure 6 no longer has a minimum of L(θ; λ, γ) = 0 at w 1 = 1 and w 5 = 1.Instead, the minimum of the loss function and its location in the {w 1 , w 5 }-space are now functions for the two hyperparameters p and α.For the eight powers and penalty parameters we used in this example, the minima of the loss function at the location of the white sphere become min(L) = [ 5.83, 5.56, 5.24, 4.82, 3.25, 2.22, 0.57, 0.04 ], and their varying locations in the {w 1 , w 5 }-space are indicated through the white spheres in Figure 6.
Figure 6 reveals several interesting features of the L p regularized Mooney Rivlin model: Most notably, the regularized loss function is highly sensitive to the power p and varies significantly for p below and above one, as we conclude from the different shapes in the first and second rows.For p ≤ 1, in the top row, with the special case of L 1 regularization or lasso in the fourth column, L p regularization promotes sparsity by training one of the weights exactly to zero, in this case w 5 = 0, while the other weight remains positive, w 1 > 0. Importantly, for p < 1, the loss function is no longer convex and has two local minima, one at w 1 = 0 and one at w 5 = 0. Notably, for a too small power, e.g., for p < 0.25, we observe a drastic regularization with sharp-contoured gradients towards the parameter planes, and the model loses robustness.For p > 1, in the bottom row, with the special case of L 2 regularization or ridge regression in the second column, L p regularization promotes stability and retains both non-zero weights, w 1 > 0 and w 5 > 0. The loss function remains convex with a single global minimum.Increasing the penalty parameter α amplifies these effects and moves the regularized minimum further away from the non-regularized minimum.Taken together, while a regularization across a continuous spectrum of powers p provides a lot of flexibility, the discovered weights w 1 and w 5 are highly sensitive to the selection of the two hyperparameters p and α: While the power p acts as a switch between sparsity and robustness, the penalty parameter α induces a trade-off between regularization and bias., evaluated for the two parameters, w 1 and w 5 , using synthetic data from tension, compression, and shear tests.For all 64 contour plots, we evaluate the normalized loss function (40) following the method of Figure 6, but now by varying both hyperparameters, p and α.Without regularization, left column, with α = 0, all eight contour plots are identical to the non-regularized Mooney Rivlin loss function in the first row and fifth column of Figures 3 and 5. Its minimum is identical to the exact solution, w 1 = 1 and w 5 = 1, represented through the white circles.With infinite regularization, right column, with α = ∞, all eight contour plots are a two-dimensional projection of the L p regularization contours in Figure 1.For p ≤ 1, in the four top rows, with increasing α, from left to right, the loss function gradually loses strict convexity, the minimum first moves towards w 1 ≥ 1 and w 5 = 0, and then towards w 1 → 0 and w 5 = 0.For p > 1, in the four bottom rows, with increasing α, from left to right, the loss function always remains convex, both weights always remain active, w 1 ≥ 0 and w 5 ≥ 0, and move closer together as the minimum gradually moves towards zero, w 1 → 0 and w 5 → 0.
Figure 7 confirms our observations from Figure 6 and provides additional insights into the L p regularized Mooney Rivlin model: The regularized loss function is highly sensitive to both hyperparameters, p and α.Decreasing the power to or below one, p ≤ 1, increases interpretability by promoting sparsity as a subset of weights become exactly zero; smaller powers p and larger penalty parameters α promote sparsity more drastically and generate increasingly less convex loss functions.Increasing the power above one, p > 1, increases predictability by promoting robustness as the loss function becomes increasingly convex; larger powers p and larger penalty parameters α promote robustness more drastically and generate increasingly more convex loss functions.These observations confirm the general notion that L p regularization is an intricate balance between predictability and interpretability and between regularization and bias that requires a careful selection of the appropriate values for the hyperparameters p and α. and penalty parameters α = [0, 0.25, 0.50, 0.75, 1, 2, 4, ∞], evaluated for the two parameters, w 1 and w 5 , for synthetic data from tension, compression, and shear tests.Without regularization, left column, with α = 0, the minimum of the loss function is identical to the exact solution w 1 = 1 and w 5 = 1, represented through the white circle.With infinite regularization, right column, with α = ∞, the loss function is identical to the L p regularization term and the contours are identical to Figure 1.For p < 1, with increasing α, the loss function gradually loses convexity, the minimum first moves towards w 1 ≥ 1 and w 5 = 0, and then towards w 1 → 0 and w 5 = 0.For p > 1, with increasing α, the loss function always remains convex, both weights always remain active, w 1 ≥ 0 and w 5 ≥ 0, as the minimum moves gradually towards w 1 → 0 and w 5 → 0.
Figure 8 illustrates the contours of the L p regularized loss function L(w 1 , w 5 ; λ, γ) for the two-term Mooney Rivlin model, with varying powers p, evaluated for the two parameters w 1 and w 5 , but now using real data from tension, compression, and shear tests of human brain [10].For all eight graphs, we evaluate the loss function (40) as the normalized mean squared error between the model P ten (λ i ), P com (λ i ), P shr (γ i ), and the data Pten , Pcom , Pshr , for tensile stretches of λ = [1.0,..., 1.1], compressive stretches of λ = [0.9,..., 1.0], and shear strains of γ = [0.0,..., 0.2], in 16 equidistant increments each [43], for varying Mooney Rivlin network weights in the ranges w 1 = [0, .., 1] and w 5 = [0, .. We select a penalty parameter α max = 0.6585, such that the maximum value of the loss function within the screened parameter window, in the dark red upper right corner, at w 1 = 1 and w 5 = 1, consists of equal contributions by the network term and the regularization term.For each set of hyperpa- rameters {p, α}, we increase the penalty parameter in four increments, α = [0.00,0.25, 0.50, 1.00]α max , indicated through the four hyperplanes in each graph, and highlight the minimum of the last of these four loss functions L(w i , w j ; λ, γ) through a white sphere.Similar to Figure 6 based on synthetic data, the minimum of the loss function and its location in the {w 1 , w 5 }-space are functions for the two hyperparameters p and α.For the plain non-regularized loss function, the minimum of the loss function is 0.0713 and its weights are w 1 = 0.00 and w 5 = 0.84.For the eight powers, the minima of the loss function at the location of the white sphere are min(L) = [ 0.67, 0.63, 0.56, 0.50, 0.34, 0.24, 0.11, 0.08 ], and their varying locations in the {w 1 , w 5 }-space are indicated through the white spheres in Figure 8.
Figure 8 reveals several interesting differences between the loss function for synthetic data in Figure 6 and for real data, in this case from human brain experiments, in Figure 8.Most importantly, for the synthetic data, we assumed an exact minimum at w 1 = 1 and w 5 = 1, where the loss function is exactly zero for the non-L p -regularized model, and takes the value of the regularization term α || θ || p p otherwise.For the real data, we no longer know a priori where the exact minimum is and it is no longer exactly zero, since the Mooney Rivlin model is not exact for the real data.From screening the parameter plane, find the minimum loss at w 1 = 0.00 and w 5 = 0.84.Strikingly, this suggests that the one-parameter Blatz Ko model [6] with ψ = w 5 [ I 2 − 3 ] and P = w 5 ∂I 2 /∂P − p P −t is better suited to describe the experimental data than the two-parameter Mooney Rivlin [46,53] model.However, we can clearly see the negative effect of over-regularization with too large penalty parameters α: For p < 1, in the top row, the minimum of the loss function remains on the w 1 = 0.00 axis, but the Blatz Ko parameter is drastically reduced from its non-regularized value of w 5 = 0.84 to w 5 = 0.59, w 5 = 0.49, and even w 5 = 0.For p ≥ 1, in the bottom row, the minimum of the loss function even moves away from the w 1 = 0.00 axis, and both parameters become activated at a similar magnitude between w 1 = [0.26,..., 0.36] and w 5 = [0.27,..., 0.47].Taken together, the discovered weights w 1 and w 5 are highly sensitive to over-regularization for extreme ranges of the hyperparameters p and α: Extreme penalty parameters induce increased bias as the loss function increasingly focuses on minimizing the penalty term rather than the regularization problem itself.

Lp regularized invariant based neural network
Similar to the previous example, we explore the effects of L p regularization with respect to the two hyperparameters p and α, but now for the full eight-term invariant based network [43], instead of the two-term Mooney Rivlin model [46,53], and for training on real instead of synthetic data.We use tension, compression, and shear data from human brain tests [10], over a tensile range of λ = [1.0,..., 1.1], a compressive range of λ = [0.9,..., 1.0], and a shear range of γ = [0.0,..., 0.2], sampled in 16 equidistant increments each, averaged over anywhere between n = 15 and n = 35 specimen [43].We train the invariant based neural network from Figure 2 in Section 3.1 and minimize the loss function from equation (38) with the stress definitions ( 20) and ( 23) with three different powers, p = [0.5, 1.0, 2.0], and four different penalty parameters, α = [0.000,0.001, 0.010, 0.100].We use the Adam optimizer, a robust adaptive algorithm for stochastic gradient-based first-order optimization [36].
Figure 9 summarizes our four discovered models in terms of the nominal stress as a function of stretch or shear strain, with the penalty parameter α increasing from left to right.The circles represent the experimental data [10].The color-coded regions represent the stress contributions of the eight model terms according to Figure 2. The coefficients of determination R 2 quantify the goodness of fit.Overall, the L p regularized invariant based network trains solidly and provides a good fit of the data.Without regularization, in the left column, the network discovers four non-zero terms, all in terms of the second invariant, [ I 2 − 3 ], indicated through cold green-to-blue colors, with stiffness-like parameters w 5 = 0.129 kPa, w 2,6 = 0.358 kPa, w 7 = 3.840 kPa, and w 2,8 = 1.406 kPa and exponential weights, w 1,6 = 1.152 and w 1,8 = 2.891, and its stress takes the following form, As the penalty parameter increases, from left to right, the number of non-zero terms decreases.With a penalty parameter α = 0.010, in the third column, the network discovers three non-zero terms, all in terms of the second invariant, [ I 2 − 3 ], indicated through cold green-to-light-blue colors, with stiffness-like parameters w 5 = 0.231 kPa, w 2,6 = 1.443 kPa, and w 7 = 7.364 kPa, and the exponential weight, w 1,6 = 5.102, and its stress takes the following form, P = [w 5 + w 2,6 w 1,6 exp( For the largest penalty parameter, in the right column, the network discovers a single non-zero term, the turquoise linear exponential term of the second invariant, with the stiffness-like parameter w 2,6 = 0.2462 kPa and the exponential weight w 1,6 = 2.9937, and its stress takes the following form, 9 provides great visual insights into the performance of L 1 regularization with varying penalty parameters, it only represents a snapshot of model discovery in the eight-dimensional parameter space of the network.Subset selection and model discovery are not only sensitive to the initialization of the parameter vector θ = { w i }, but also to the stochastic nature of the Adam optimizer.This implies that different runs may produce different results.This raises the question how reproducible and robust the results in Figure 9 are for varying initial conditions and training runs.
Figure 10 summarizes the discovered weights for the invariant based network with L p regularization for varying powers p = [ 0.5, 1.0, 2.0] and penalty parameters α = [ 0.000, 0.001, 0.010, 0.100 ].For all twelve combinations of the two hyperparameters, we perform a total of n = 100 training runs each, with varying initial conditions for the network weights w i = { w 1 , ..., w 8 }, such that each of the  four models in Figure 9 is the result of one of the L 1 regularized training runs in the middle row.The colored boxes in Figure 10 indicate the relevance of the eight model terms, with their means and standard deviations.Interestingly, the L 0.5 and L 0.1 regularizations in the first and second rows perform qualitatively similarly: They both start with four dominant terms, all in terms of the second invariant.Except for a small number of outliers, they both converge to two dominant one-term models, Figure 11 summarizes the convergence of the L p regularized invariant based network in terms of the goodness of fit and number of terms, for varying powers p = [ 0.5, 1.0, 2.0 ] and penalty parameters α = [ 0.000, 0.001, 0.005, 0.010, 0.015, 0.020, 0.040, 0.060, 0.080, 0.100 ].Red dots indicate the coefficient of determination R 2 , blue dots indicate the number of terms, with means and standard deviations from n = 10 realizations.A known shortcoming of the L p regularization is that it introduces bias and moves the solution away from the minimum of the network loss towards the minimum of the regularization loss.This is particularly critical for our network in which all weights have a different meaning and potentially also a different magnitude.To quantify the effects of this potential limitation, Figure 11 , in terms of the L 0 -normalized weights w i / w i,L 0 .Here, w i,L 0 are the weights of the one-term models from the diagonal in Table 1.
Figure 11 confirms that regularization is a trade-off between error and complexity, or similarly, between the goodness of fit and the number of terms.While the L 0.5 and L 1 regularizations behave qualitatively similarly and promote sparsity by reducing the number of non-zero terms to one, the L 2 promotes robustness by maintaining a large subset of six non-zero terms.The L 1 regularization is less aggressive than the L 0.5 regularization and requires larger penalty parameters α to achieve a similar sparseness, which could induce a larger bias, away from minimum of the network loss towards the minimum of the regularization loss.Normalizing the L p penalty term by using the L 0 -normalized weights w i / w i,L 0 instead of the non-normalized weights w i accelerates the positive effects of regularization, especially in the small-penalty-parameter regime, and could provide a viable solution to reduce regularization-induced bias.Ultimately, in the large-penalty-parameter regime, the non-normalized and normalized regularizations converge towards a similar goodness of fit and number of terms.
Taken together, our results confirm the general notion that L p regularization increases interpretability for powers equal to or below one, p ≤ 1, by promoting sparsity as a subset of weights train exactly to zero; and increases predictability for powers larger than one, p > 1, by promoting robustness as a unique subset of weights emerges as dominant.Larger penalty parameters α amplify these trends at the price of an increased bias, which we can reduce, at least in part, by normalizing the network weights in the the penalty term.L 0 regularized invariant based neural network.For comparison, we explore the effects of L 0 regularization using the same tension, compression, and shear data from human brain tests as in the previous example [10].We train the invariant based neural network from Figure 2 in Section 3.1 and minimize the loss function from equation (38) with the stress definitions ( 20) and ( 23), but now use a penalty term, α || θ || 0 , with the L 0 norm || θ || 0 = ∑ n para i=1 I(w i ̸ = 0), to penalize the total number of non-zero terms in the model.In essence, L 0 regularization turns network training into a discrete combinatorial problem with 2 8 − 1 = 255 possible models, 8 with a single term, 28 with two, 56 with three, 70 with four, 56 with five, 28 with six, 8 with seven, and 1 with all eight terms.For illustrative purposes, we focus on the eight one-term and 28 two-term models that follow by explicitly setting the other seven and six terms of the network to zero.  1 summarizes the weights and remaining losses of the one-and two-term models of the L 0 regularized invariant based neural network.The diagonal summarizes the discovered one-term models, the off-diagonal the two-term models.The L 0 regularization penalizes the one-term models by α  and the two-term models by 2α.The boldface cells highlight the four best-in-class models of each category.Figures 12 and 13 illustrate the stress-stretch and stress-shear plots of these four one-and two-term models.Interestingly, all best-in-class models are models in terms of the second invari-ant I 2 indicated through the cold green-to-blue colors.None of the eight best models includes the first invariant I 1 indicated through the warm red-to-yellow colors.This finding contradicts the common practice of using primarily the first invariant, for example, in popular and widely used neo Hooke model.Strikingly, the classical Hooke model [65] represented through the dark red term in both networks with ψ = w 1 [ I 1 − 3 ], a stiffness-like parameter w 1 = 0.7964 kPa, a shear modulus µ = 2 w 1 = 1.5928 kPa, and a remaining loss of 0.0918 + α has the largest remaining loss and performs the worst of all one-term models.Similarly, the Demiray model [13]   ] ∂I 2 /∂F − p F −t .For this simple example, the remaining loss of the best one-term model is 0.0594 + α and the remaining loss of the best two-term model is 0.0328 + 2α.This implies that, for penalty parameters α < 0.0266, the L 0 regularization would favor the two-term model, while for penalty parameters α ≥ 0.0266, the L 0 regularization would favor the one-term model.These simple considerations highlight the importance of the penalty parameter α, which explicitly acts as a discrete switch between the number of terms we want to include in our model.Taken together, this example illustrates the discrete nature of the L 0 regularization as a discrete combinatorial problem that becomes increasingly expensive as the number of model terms increases.Our results emphasize the sensitivity of the L 0 regularization with respect to the penalty parameter α and highlight the trade-off between bias and variance: Increasing the penalty parameter increases bias, reduces variance, and decreases model complexity as the total number of non-zero terms decreases towards one.

Lp regularized principal stretch based neural network
Similar to the previous example, we explore the effects of L p regularization with respect to the two hyperparameters p and α, but now for the full principal stretch based network with all eight terms [43], for training on tension, compression, and shear data from human brain tests [10].We train the principal based neural network from Figure 4 in Section 3.2 and minimize the loss function from equation (38) with the stress definitions (33) and (36) with three different powers, p = [0.5, 1.0, 2.0], and four different penalty parameters, α = [0.000,0.001, 0.010, 0.100] using the Adam optimizer [36].
Figure 14 summarizes our four discovered models in terms of the nominal stress as a function of stretch or shear strain, with the penalty parameter α increasing from left to right.The circles represent with a stiffness-like parameter w 8 = 0.0534 kPa and a stress P = −8 15 summarizes the discovered weights for the principal stretch based network with L p regularization for varying powers p = [ 0.5, 1.0, 2.0] and penalty parameters α = [ 0.000, 0.001, 0.010, 0.100 ].For all twelve combinations of the two hyperparameters, we perform a total of n = 100 training runs each, with varying initial conditions for the network weights w i = { w 1 , ..., w 8 }, such that each of the four models in Figure 14 is the result of one of the L 1 regularized training runs in the middle row.The colored boxes in Figure 15 indicate the relevance of the eight model terms, with their means and standard deviations.In contrast to the invariant based network in Section 4.1, the principal stretch based network consistently discovers similar terms across all three regularizations, with a clear preference for the dark blue [ λ −8 i − 3 ] term.The fact that all networks robustly discover similar terms is a result of the convex nature of the underlying linear regression problem associated with the principal stretch based network and indicates the existence of a single unique global minimum.However, the L 0.5 and L 1 regularized networks gradually drop more non-zero terms as the penalty parameter increases, while the L 2 regularized network maintains all eight terms.As we had already anticipated from comparing Figures 3 and 5, the functional base of the principal stretch based network is more collinear than the base of the invariant based network, which result in a more gradual shift of the active weights, from w 1 towards w 8 , as the penalty parameter increases.The fact that all three regularizations converge to the boundary of our domain, the dark blue [ λ −8 i − 3 ] term with the minimum exponent of minus eight, suggests that the true best fit might lay outside the current parameter range, with even smaller exponents.This agrees well with previous studies [10,59] that have discovered one-term Ogden models with exponents of [ λ −18 Taken together, this example illustrates that model discovery with L p regularization generalizes well to different network types, irrespective of whether the terms are invariant or principal stretch based.Comparing both network types reveals that the method is sensitive to the nonlinear vs linear nature of the underlying regression problem: The nonlinear invariant based network alternates between different dominant terms associated with multiple local minima, while the linear principal stretch based network consistently discovers similar terms associated with a single unique global minimum.

Conclusion and recommendations
L p regularization is a powerful technology to finetune the training process of a neural network.In automated model discovery, it provides the critical missing piece of the puzzle that enables a controlled down-selection of the discovered terms and focus on the most important features of the model while putting less emphasis on minor effects.By promoting sparsity of the parameter vector, L p regularization inherently improves interpretability and provides valuable insights into the underlying nature of the data.Importantly, L p regularization introduces two hyperparameters: the power p by which it penalizes the individual model parameters, and the penalty parameter α by which it scales the relative importance of the regularization loss in comparison to the neural network loss.Both parameters enable a precise control of model discovery from data and it is crucial to understand their mathematical subtleties, computational implications, and physical effects.Here we reviewed the mathematics and computation of the most common representatives of the L p family, and demonstrated their features in terms of two classes of constitutive neural networks, invariant and principal stretch based, trained with both, synthetic and real data.Training with synthetic data proved to be robust and stable, and generally provides excellent metrics for quality control since we know the exact solution.However, it remains a toy problem that fails to reveal the true usefulness in practical real-world applications.Training with real data was algorithmically robust, but challenging, since we know nothing about the exact solution.Our study uses neural networks as a tool for linear and nonlinear regression.We acknowledge that our results can be interpreted and well understood without resorting neural networks and generalize naturally to other regression techniques including symbolic regression, genetic programming, or system identification.
For conciseness, we have limited the scope of the present review: First, we only considered small networks with no more than eight terms, but point out that the automated model discovery generalizes well to isotropic networks with 12 terms [43], transversely isotropic networks with 16 terms [44], two-fiber family networks with 16 terms [50], and orthotropic networks with 32 terms.Second, we trained on all available data and did not investigate splitting the data into train and test sets, which we have done in our previous work [43,59].Third, we did not explicitly study the effects of controlled noise, but point out that Figures 8 to 15 are all based on real experimental data with real natural noise.Fourth, we did not further explore hybrid top down approaches like SINDy [8], since the nonlinear nature of our optimization problem does not guarantee that we easily find the global minimum [70], from which we could initiate a sequential thresholded least squares down-selection; Finally, we have not yet investigated the effects of L p regularization on uncertainty quantification, something we are currently exploring in a separate Bayesian approach.
We would like to share the most important lessons we have learnt throughout this study: Normalize first!We cannot overstate the importance of normalizing.Clearly, while normalizing is less of an issue in linear regression, it is critical in nonlinear regression.This holds for both the training data, illustrated in Figures 3 and 5, and the weights, illustrated in Figure 11.The loss function typically contains several terms of different magnitude that compete during minimization.It proves important to normalize by the number of data sets in each category, the magnitude of the tensile, compressive, and shear stresses, and the magnitude of the weights to balance the impact of the individual contributions.L 0 regularization is the most honest member of the Lp family.L 0 regularization or subset selection is honest, transparent, and unbiased.Its penalty parameter α acts as a direct switch to select the desired number of terms.It is the only member of the L p family that explicitly controls the balance between the number of terms and the goodness of fit, illustrated in Figure 11.While L 0 regularization across the entire network translates into an expensive NP hard discrete combinatorial problem of the order of 2 n , we recommend to begin any discovery by running an L 0 regularization for all possible one-and two-term models to determine the best-in-class models of each category and identify the dominant terms, similar to Figures 12 and 13 and Table 1.Importantly, in nonlinear regression, the best-in-class n-term model may actually not be a subset of the best-in-class (n + 1)-term model, and successively removing terms like in iterative pruning [25] or sequential thresholding least squares [8] might not be a viable solution.Instead, running L 0 regularization for all possible one-and two-term models provides a quick first insight into the nature and hierarchy of the best-in-class models [51].From this initial first glimpse, we can proceed by successively adding terms.In addition, from the bestin-class one-term models, we can use the discovered weights w i,L 0 to initialize the weights for higher order runs and to normalize the weights in the regularization term, α || θ || p p = ∑ n para i | w i /w i,L 0 | p .L 1 regularization is powerful for subset selection, but needs large penalty parameters to be effective.L 1 regularization or lasso promotes sparsity by reducing a large subset of parameters exactly to zero.Notably, for all the examples in our study, this down-selection required quite large penalty parameters α-often on the order of one-to work effectively.This not only affects the magnitude of the discovered parameters, but often also the discovered model itself.For example, for α = 0.1, the n = 100 independent realizations of the L 1 regularization in Figure 10 alternate between the green and turquoise one-term models, while the unbiased plain L 0 regularization in Figure 12 and Table 1 ranks these two models clearly behind the blue and dark blue one-term models.To identify regularization-induced bias, we recommend to always compare the results of the L 1 regularization against the best-in-class low-term models of the L 0 regularization.This comparison is simple and inexpensive, and provides valuable insights into the magnitude of selection bias and the aggressiveness of the L 1 regularization.L 0.5 regularization promotes sparsity for small penalty parameters, but suffers from multiple local minima.L 0.5 regularization addresses the shortcomings of the classical L 1 regularization by down-selecting more aggressively, requiring smaller penalty parameters, and introducing less bias.While L 0.5 regularization works well in practice, it is computationally challenging.Its non-convexity introduces multiple local minima, indicated through the first rows in Figures 6, 7 and 8, and through the green and turquoise one-term models in Figure 10, and the blue and dark blue one-term models in Figure 15.To avoid getting stuck in a local minimum, we highly recommend exploring different initialization strategies for the network weights.Specifically, we were able to robustly identify multiple local minima by initializing the weights with the L 0 regularized weights w i,L 0 .Alternatively, we could gradually ramp up the effect of regularization by starting with a penalty parameter α = 0 and smoothly increase it to a desired strength, essentially by moving from left to right in Figure 7.For quality control, we recommend comparing the remaining loss of each converged run against the remaining L 0 regularized baseline loss as reported in Table 1.
L 2 regularization promotes stability, but is not suited for subset selection.L 2 regularization, by design, is not suited to reduce a subset of terms exactly to zero.Instead, it maintains all terms as indicated in the bottom rows of Figures 10 and 15, each for n = 100 runs.From Figures 6, 7, and 8 we conclude that, for increasing penalty parameters α, L 2 regularization reduces outliers by first bringing the weights closer together and then collectively reducing them toward zero.Clearly, L 2 regularization improves convexity, which makes model discovery more robust and more stable.However, it not only fails to down-select the number of terms, but also strongly biases the solution away from the minimum of the pure network loss towards the minimum of the regularization loss.We do not recommend using L 2 regularization, or any other member of the L p regularization family with powers larger than one, p > 1, to increase sparsity and improve interpretability.• does not promote sparsity [28] Table 2 provides a side-by-side comparison of the L p regularizations we explored throughout this study along with their advantages, disadvantages, and references.
Densifying instead of sparsifying.The closure problem is a common challenge in both fluid and solid mechanics.It refers to the difficulty of fully specifying the constitutive equations that relate stresses and strains and characterize the material behavior.In fluid mechanics, the closure problem is closely related to turbulence modeling, where it approximates intricate interactions between different scales, and can be well represented through polynomials [8].In solid mechanics, the closure problem characterizes complex material behaviors at the microscopic scale, and is traditionally often represented through a combination of polynomials [14], exponentials [13,31], logarithms [21], and powers [48,67].In the context of model discovery, assuming perfect data, polynomial models translate into a convex linear optimization problem with a single unique global minimum, while exponential, logarithmic, or power models translate into a non-convex nonlinear optimization problem with possibly multiple local minima.For convex discovery problems with a unique global minimum, inducing sparsity has been well established through a top down approach in which we first calculate a dense parameter vector at the global minimum, and then sparsify the parameter vector by sequentially thresholding and removing the least relevant terms [8,14,68,70].For non-convex discovery problems with multiple local minima, this approach is infeasible since different initial conditions may result in different solutions with non-unique parameter vectors [49].Instead of trying to sparsify a dense parameter vector, we recommend to gradually densify the parameter vector from scratch.This bottom up approach iteratively solves the discrete combinatorial problem and densifies the parameter vector by sequentially adding the most relevant terms [47].Importantly, instead of solving the NP hard discrete combinatorial problem associated with a complete L 0 regularization that screens all possible combinations of terms at O(2 n ), we recommend to gradually add terms, starting with the best-in-class one-term model at O(n), adding a second term at O(n), and repeating addition until the incremental improvement of the overall loss function meets a user-defined convergence criterion.At most, this algorithm involves O(n 2 ) evaluations of the loss function to land on a fully populated dense parameter vector.Importantly, for non-convex model discovery problems, this algorithm-while cost effective and wellrationalized-is not guaranteed to converge to the global minimum.Instead of successively adding up to n terms, for practical purposes, it is often sufficient to limit the number of desirable terms to one, two, three or four, and identify the best-in-class model of each class, which requires a discrete comparison of ( 8!/(n!(8 − n)!) ) discrete models, in our case 8, 28, 56, or 70 [51].Out of all possible discovery algorithms, this is the most honest, unbiased, and transparent approach.
Taken together, our study suggests that L p regularized constitutive neural networks are a powerful technology for automated model discovery that allows us to identify interpretable constitutive models from data.We anticipate that our results generalize to L p regularization for model discovery with other techniques such as symbolic regression or system identification, and, more broadly, to model discovery in other fields such as biology, chemistry, or medicine.The ability to discover new knowledge from data could have tremendous applications in generative material design where it could shape the path to manipulate matter, alter properties of existing materials, and discover new materials with targeted properties.

Figure 1 :
Figure 1: Lp regularization.Contours of regularization term, L p = α ∑ n para i=1 || θ || p p with || θ || p p = | w i | p , for varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8], evaluated for two parameters, w 1 and w 2 .For p ≤ 1, top row, with the special case of L 1 regularization or lasso represented through the pyramid, L p regularization promotes sparsity by setting some weights exactly to zero, but is no longer strictly convex and can have multiple local minima.For p > 1, bottom row, with the special case of L 2 regularization or ridge regression represented through the ellipsoid, L p regularization promotes stability by reducing outliers, while the regularization term remains convex.

Figure 2 :
Figure 2: Invariant based neural network for automated model discovery.The network takes the deformation gradient F as input and outputs the free energy function ψ from which we calculate the stress P = ∂ψ/∂F.The network is invariant based, it first calculates the invariants I 1 and I 2 , and feeds them into its two hidden layers.The first layer generates the first and second powers (•) and (•) 2 of the invariants and the second layer applies the identity and exponential function (•) and exp(•) to these powers.The free energy function ψ is a function of the eight color-coded terms.During training, the network discovers the best model, of 2 8 = 256 possible combinations of terms, to explain the experimental data P.

Figure 3 :
Figure 3: Loss functions of invariant based neural network.Contours of the loss function L(θ; λ, γ) for all 28 possible two-term models of the invariant based constitutive neural network in Figure 2. The loss function is evaluated across tensile stretches λ = [1.0,..., 2.0], compressive stretches λ = [1.0,..., 0.5], and shear strains γ = [0.0,..., 0.5], with network weights in the ranges w i = [0, ..., 2] and w j = [0, ..., 2].The minimum of the loss function indicates the exact solution w i = 1 and w j = 1, represented through the white circle.The lower triangle illustrates the non-normalized loss function (16), the upper triangle illustrates the normalized loss function(17).All loss functions are convex, with contours varying from ellipsoids to valleys with long ridges, highlighting the collinearity of some w i and w j pairs.

Figure 4 :
Figure 4: Principal stretch based neural network for automated model discovery.The network takes the deformation gradient F as input and outputs the free energy function ψ from which we calculate the stress P = ∂ψ/∂F.The network is principal stretch based, it first calculates the principal stretches λ 1 and λ 2 and λ 3 , and feeds them into its hidden layer.The hidden layer applies eight different powers (λ n 1 + λ n 2 + λ n 3 − 3) to these principal stretches.The free energy function ψ is a function of the eight color-coded terms.During training, the network discovers the best model, of 2 8 = 256 possible combinations of terms, to explain the experimental data P.

Figure 5 :
Figure 5: Loss functions of principal stretch based neural network.Contours of the loss function L(θ; λ, γ) for all 28 possible two-term models of the principal stretch based constitutive neural network in Figure 4.The loss function is evaluated across tensile stretches λ = [1.0,..., 2.0], compressive stretches λ = [1.0,..., 0.5], and shear strains γ = [0.0,..., 0.5], with network weights in the ranges w i = [0, ..., 2] and w j = [0, ..., 2].The minimum of the loss function indicates the exact solution w i = 1 and w j = 1, represented through the white circle.The lower triangle illustrates the non-normalized loss function (29), the upper triangle illustrates the normalized loss function(30).All loss functions are convex, with contours varying from a few ellipsoids to many valleys with long ridges, highlighting the collinearity of many w i and w j pairs.

Figure 6 :
Figure 6: Loss functions of Lp regularized Mooney Rivlin model for synthetic data.Contours of the L p regularized loss function, L(w 1 , w 5 ; λ, γ), for the two-term Mooney Rivlin model with varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8],evaluated for the two parameters, w 1 and w 5 , for synthetic data from tension, compression, and shear tests.For p ≤ 1, top row, with the special case of L 1 regularization or lasso in the fourth column, L p regularization promotes sparsity by training w 1 = 0 exactly to zero, but the loss function is no longer strictly convex and has multiple local minima.For p > 1, bottom row, with the special case of L 2 regularization or ridge regression in the second column, L p regularization promotes stability, retains both non-weights, w 1 > 0 and w 5 > 0, and maintains a convex loss function with a single global minimum.

Figure 7
Figure7illustrates the contours of the L p regularized loss function L(w 1 , w 5 ; λ, γ) for the two-term Mooney Rivlin model, with varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8], and penalty parameters, α = [0, 0.25, 0.50, 0.75, 1, 2, 4, ∞], evaluated for the two parameters, w 1 and w 5 , using synthetic data from tension, compression, and shear tests.For all 64 contour plots, we evaluate the normalized loss function(40) following the method of Figure6, but now by varying both hyperparameters, p and α.Without regularization, left column, with α = 0, all eight contour plots are identical to the non-regularized Mooney Rivlin loss function in the first row and fifth column of Figures3 and 5. Its minimum is identical to the exact solution, w 1 = 1 and w 5 = 1, represented through the white circles.With infinite regularization, right column, with α = ∞, all eight contour plots are a two-dimensional projection of the L p regularization contours in Figure1.For p ≤ 1, in the four top rows, with increasing α, from left to right, the loss function gradually loses strict convexity, the minimum first moves towards w 1 ≥ 1 and w 5 = 0, and then towards w 1 → 0 and w 5 = 0.For p > 1, in the four bottom rows, with increasing α, from left to right, the loss function always remains convex, both weights always remain active, w 1 ≥ 0 and w 5 ≥ 0, and move closer together as the minimum gradually moves towards zero, w 1 → 0 and w 5 → 0.

Figure 7 :
Figure 7: Loss functions of Lp regularized Mooney Rivlin model for synthetic data.Contours of the L p regularized loss function, L(w 1 , w 5 ; λ, γ), for the two-term Mooney Rivlin model with varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8],and penalty parameters α = [0, 0.25, 0.50, 0.75, 1, 2, 4, ∞], evaluated for the two parameters, w 1 and w 5 , for synthetic data from tension, compression, and shear tests.Without regularization, left column, with α = 0, the minimum of the loss function is identical to the exact solution w 1 = 1 and w 5 = 1, represented through the white circle.With infinite regularization, right column, with α = ∞, the loss function is identical to the L p regularization term and the contours are identical to Figure1.For p < 1, with increasing α, the loss function gradually loses convexity, the minimum first moves towards w 1 ≥ 1 and w 5 = 0, and then towards w 1 → 0 and w 5 = 0.For p > 1, with increasing α, the loss function always remains convex, both weights always remain active, w 1 ≥ 0 and w 5 ≥ 0, as the minimum moves gradually towards w 1 → 0 and w 5 → 0.

Figure 8 :
Figure 8: Loss functions of Lp regularized Mooney Rivlin model for real data.Contours of the L p regularized loss function, L(w 1 , w 5 ; λ, γ), for the two-term Mooney Rivlin model with varying powers, p = [0.25,0.5, 0.75, 1, 1.5, 2, 4, 8], evaluated for the two parameters, w 1 and w 5 , for real data from tension, compression, and shear tests of human brain.For p ≤ 1, top row, with the special case of L 1 regularization or lasso in the fourth column, L p regularization promotes sparsity by training w 1 = 0 exactly to zero, but the loss function is no longer strictly convex and has multiple local minima.For p > 1, bottom row, with the special case of L 2 regularization or ridge regression in the second column, L p regularization promotes stability, retains both non-weights, w 1 > 0 and w 5 > 0, and maintains a convex loss function with a single global minimum.

Figure 9 :
Figure 9: Discovered models of L1 regularized invariant based network.Nominal stress as a function of stretch or shear strain for the invariant based neural network with L 1 regularization for varying penalty parameters α = [ 0.000, 0.001, 0.010, 0.100 ], trained with human gray matter tension, compression, and shear data.Circles represent the experimental data.Color-coded regions represent the discovered model terms.Coefficients of determination R 2 indicate the goodness of fit.

Figure 10 :
Figure 10: Discovered models of Lp regularized invariant based network.Distribution of discovered weights for the invariant based neural network with L p regularization for varying powers p = [ 0.5, 1.0, 2.0] and penalty parameters α = [ 0.000, 0.001, 0.010, 0.100 ].Colored boxes indicate the relevance of the eight model terms, with means and standard deviations from n = 100 realizations with varying initializations of the network weights.

Figure 11 :
Figure 11: Convergence of Lp regularized invariant based network.Goodness of fit and number of terms for the invariant based neural network with L p regularization for varying powers p = [ 0.5, 1.0, 2.0 ] and penalty parameters α = [ 0.000, 0.001, 0.005, 0.010, 0.015, 0.020, 0.040, 0.060, 0.080, 0.100 ].Top row uses a non-normalized regularization in terms of the weights w i , bottom row uses a normalized regularization in terms of the L 0 -normalized weights w i / w i,L 0 .Red dots indicate the coefficient of determination R 2 , blue dots indicate the number of terms, with means and standard deviations from n = 10 realizations.
Figure 11  summarizes the convergence of the L p regularized invariant based network in terms of the goodness of fit and number of terms, for varying powers p = [ 0.5, 1.0, 2.0 ] and penalty parameters α = [ 0.000, 0.001, 0.005, 0.010, 0.015, 0.020, 0.040, 0.060, 0.080, 0.100 ].Red dots indicate the coefficient of determination R 2 , blue dots indicate the number of terms, with means and standard deviations from n = 10 realizations.A known shortcoming of the L p regularization is that it introduces bias and moves the solution away from the minimum of the network loss towards the minimum of the regularization loss.This is particularly critical for our network in which all weights have a different meaning and potentially also a different magnitude.To quantify the effects of this potential limitation, Figure11compares the non-normalized regularization, || θ || p p = ∑ n para i=1 | w i | p , in terms of the weights w i that we have used throughout this study against a normalized regularization, || θ || p

Figure 12 :
Figure 12: Discovered best-in-class one-term models of L 0 regularized invariant based network.Nominal stress as a function of stretch or shear strain for the invariant based constitutive neural network with L 0 regularization, trained with human gray matter tension, compression, and shear data.Circles represent the experimental data.Color-coded regions represent the discovered model terms.Coefficients of determination R 2 indicate the goodness of fit for each individual test; remaining loss L indicates the quality of the overall fit.

Figure 13 :
Figure 13: Discovered best-in-class two-term models of L 0 regularized invariant based network.Nominal stress as a function of stretch or shear strain for the invariant based constitutive neural network with L 0 regularization, trained with human gray matter tension, compression, and shear data.Circles represent the experimental data.Color-coded regions represent the discovered model terms.Coefficients of determination R 2 indicate the goodness of fit for each individual test; remaining loss L indicates the quality of the overall fit.

Figure 14 :
Figure 14: Discovered models of L1 regularized principal stretch based network.Nominal stress as a function of stretch or shear strain for the principal stretch based neural network with L 1 regularization for varying penalty parameters α = [ 0, 0.1, 0.001, 0.0001 ], trained with human gray matter tension, compression, and shear data.Circles represent the experimental data.Color-coded regions represent the stress contributions of the eight model terms.Coefficients of determination R 2 indicate the goodness of fit.

Figure 15 :
Figure 15: Discovered models of Lp regularized principal stretch based network.Distribution of discovered weights for the principal stretch based neural network with L p regularization for varying powers p = [ 0.5, 1.0, 2.0] and penalty parameters α = [ 0.000, 0.001, 0.010, 0.100 ].Colored boxes indicate the relevance of the eight model terms, with means and standard deviations from n = 100 realizations with varying initializations of the network weights.
w 2 , w 7 }, { w 3 , w 6 }, { w 6 , w 7 } pairs in the normalized upper triangle, suggesting that in the studied stretch and shear range, these terms are non-collinear, and would represent a rich base for a potential constitutive model.Other pairs of weights generate loss functions with long ridges parallel to the parameter axes, e.g., the { w 2 , w 7 }, { w 2 , w 8 }, { w 6 , w 7 }, { w 6 , w 8 } pairs in the non-normalized lower triangle, suggesting that these terms are almost collinear and not well suited as an independent base for a constitutive model.On average, the normalized pairs in the upper triangle seem to generate more convex loss functions than the non-normalized pairs in the lower triangle, suggesting that normalization helps to generate more convex loss functions, a richer functional base, and a more robust solution overall.Notably, the { w 1 , w 5 } model in the first row and fifth column and the { w 5 , w 1 } model in the fifth row and first column combine the linear terms in the first and second invariants, [ I 1 − 3 ] and [ I 2 − 3 ], and represent the popular Mooney Rivlin model for rubberlike materials [w 1 + w 5 /λ] andP 12 = 2 γ [w 1 + w 5 ] .(41)Notably, the uniaxial stress P 11 and shear stress P 12 of the Mooney Rivlin model (41) are linear functions in the network weights w 1 and w 5 , which translates the neural network loss, ∑ i=1 ||[P ten (λ i ) − Pten,i ]/P max ten || 2 /n ten + ∑ n com i=1 ||[P com (λ i ) − Pcom,i ]/P min com || 2 /n com + ∑ n shr i=1 ||[P shr (λ i ) − Pshr,i ]/P max shr || 2 /n shr , of the loss function (40) into a linear regression problem, with a single unique global minimum.Figure P min com || 2 /n com + ∑ max shr || 2 /n shr , illustrated in the first row and fifth column of Figures 3 and Figures 5; supplemented by the L p regularization, α p [ | w 1 | p + | w 5 | p ], illustrated in Figure i=1 ||[P shr (λ i ) − Pshr,i ]/P