Estimation of Li-ion degradation test sample sizes required to understand cell-to-cell variability

Ageing of lithium-ion batteries results in irreversible reduction in performance. Intrinsic variability between cells, caused by manufacturing differences, occurs throughout life and increases with age. Researchers need to know the minimum number of cells they should test to give an accurate representation of population variability, since testing many cells is expensive. In this paper, empirical capacity versus time ageing models were fitted to various degradation datasets for commercially available cells assuming the model parameters could be drawn from a larger population distribution. Using a hierarchical Bayesian approach, we estimated the number of cells required to be tested. Depending on the complexity, ageing models with 1, 2 or 3 parameters respectively required data from at least 9, 11 or 13 cells for a consistent fit. This implies researchers will need to test at least these numbers of cells at each test point in their experiment to capture manufacturing variability.


Introduction
Lithium-ion batteries have grown in importance over the past decade and are now the key technology underlying applications from electric vehicles to grid energy storage [1]. High specific energy, low internal resistance and long lifetime have already led Li-ion cells to dominate the market for consumer electronics applications. A crucial issue that strongly impacts overall system performance is the intrinsic variability in capacity, resistance and degradation rate between cells, caused by small variations in manufacturing processes [2,3,4]. Quantifying the typical variability in mature commercially available Li-ion cells is crucial for understanding the trade-offs involved in designing battery packs and estimating pack performance and lifetime on the basis of individual cell performance.
Battery state of health (SOH) is usually defined as the capacity or impedance/resistance of a cell [5] under standard test conditions, and it constrains the useful and safe operation of batteries. State of health changes with time and usage, and is influenced both by external factors (such as voltage, time and temperature) and by internal manufacturing and materials variations [6]. Since it has a direct bearing on the economic value and operation of batteries, the estimation of current and future Li-ion SOH is a popular area of research [6].
Variability in cell capacity and resistance can create differing loads within a Li-ion battery pack, and depending on its extent, variability will inevitably impact performance, cost and safety [7,8,9]. There are a variety of sources for this variability; for instance, thermal inhomogeneities influence SOH and increase cell-to-cell variability during usage [5,10,11]. However, manufacturing variability, sometimes described as tolerance, is a significant contributor to cell-to-cell variability [2,3,12]. New, nominally identical cells from the same batch exhibit a spread in capacity before they have been cycled [4,13,14]. Simulations and experiments have demonstrated that this intrinsic manufacturing variability is a contributing factor to differing ageing rates between cells [2,3,13,15,16].
There are many possible sources of manufacturing variability [17], such as variability in electrode thickness and density [18], fraction of active material, liquid-to-solid ratio and coating gap [19]. In this work we assume that all intrinsic variability may be expressed through a single lumped population distribution, for each parameter and dataset. A further contributor to cell-to-cell variability results from variance in experimental conditions, for example location in a given testing chamber [20].
Ageing reduces the performance of a given cell or pack and increases variability between cells [13,14,10,21,22,23]. To compound this, no significant correlation has been found experimentally between initial health and ageing rates on the one hand and ageing rates in later life on the other [2,13,24,23], although features derived from early cycle life can be used to accurately predict lifetime [25,26]. Some authors have found that initial battery health is sometimes bi-modal [27,28]. It has also been found that Weibull distributions, common in failure analysis, may be used to quantify battery end of life statistics [2,21,29,30].
In summary, cell-to-cell variability is a significant factor influencing the performance and value of batteries. As a result, modelling of battery ageing data is complicated by the question of how many cells should be tested at each experimental condition so as to adequately capture the intrinsic variability. Battery testing involves cost, in time and number of test channels, and therefore optimising the amount of information obtained from a test is a key consideration. The literature seems to have largely ignored the issue of how many cells should be tested to capture intrinsic variability, and for practical reasons, most ageing studies use only a small number of cells (e.g. 1-3) at each test condition. In this paper we address this question directly by fitting models to ageing data. There are many options for modelling battery capacity through life, from empirical curve fits [31] through to physical models [32] and machine learning approaches [6,33]. Since our aim was to investigate intrinsic rather than extrinsic variability, we chose as our modelling approach to use simple empirical curve fits of health versus time. We examined the consistency of the resulting model parameters as we added data drawn from increasing numbers of cells within each dataset, using five different battery ageing datasets. An estimate of required sample size was drawn for each parameter in every model for every available dataset.

Data
To study cell-to-cell variability, ideally we need data from a very large number of cells, perhaps thousands. The costs of such large scale testing would be prohibitive, requiring many battery test channels for multiple years, and no such datasets are openly available. As a compromise, however, some ageing datasets are available with of the order of 10-100 cells cycled identically (or very similarly). We selected five datasets for analysis, favouring those in which as many cells as possible had been tested. Two of these are open source, and three are from our own experiments. Each individual dataset used identical commercially available Li-ion cells, albeit having different manufacturers, chemistries and cell sizes from dataset to dataset. All datasets used 18650 cylindrical cells, although the methods discussed below can equally be applied to other form factors. Some of the datasets featured identical experimental conditions, i.e. each cell was tested in exactly the same way, whereas others varied the testing conditions slightly beyond the expected uncontrollable experimental variability. The datasets are as follows:
1. Baumhöfer-2014 [3] consists of 48 Sanyo/Panasonic UR18650E NMC/graphite 1.85 Ah cells in a cycle ageing test, each under the same operating conditions. Data available at [34].
4. Severson-2019 [25]. The cells in this dataset are from three different experimental batches, so will have been subjected to higher variance in testing conditions than the other datasets.
5. Attia-2020 replicates Severson-2019, but with a fixed charging window of 10 minutes. There are 45 cells in this set. Data available via [37].
Three of these datasets, namely Baumhöfer-2014, Severson-2019 and Attia-2020, exhibit an onset of rapid degradation in later life, sometimes called the 'knee-point' [3,26]. The other two datasets show only linear degradation over usage. The data prior to the knee-point in the Attia-2020 and Severson-2019 sets was separately extracted to produce two additional linear ageing datasets.

Models
The models and the corresponding datasets that they were fitted to are shown in Table 1. Here, Linear-1 and Linear-2 refer to the two linear models, having one and two parameters respectively. Alternatively, LinExp is a combined linear and exponential model that was used to capture the knee-point and later life health decay, where this was evident in the data.
The three models (Linear-1, Linear-2 and LinExp) express capacity $B$ as a function of time $t$; all other parameters ($c_1$, $c_2$, $c_3$, $B_0$, $t_f$, $\tau$) were fitted to the data. Linear-1 and Linear-2 differ only by the addition of the initial capacity $B_0$ as a fitted parameter in the latter. The cell capacities were normalised according to which model was in use. For Linear-1 and LinExp, the capacities were normalised relative to the initial capacity of each cell. Linear-2 used capacity curves normalised relative to the nominal capacity. In the LinExp model, the initial linear capacity decrease is followed by a faster exponential decrease with onset time $t_f$ and time constant $\tau$, as shown in Fig. 1.
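The model expressions themselves are not reproduced in this text. A minimal sketch of forms consistent with the description above is given here; the sign convention (fade rates taken as positive) and the exact shape of the exponential term are assumptions, not the authors' stated equations.

```python
import numpy as np

def linear_1(t, c1):
    """Assumed Linear-1 form: B = 1 - c1 * t (capacity normalised to initial capacity)."""
    return 1.0 - c1 * np.asarray(t, dtype=float)

def linear_2(t, b0, c2):
    """Assumed Linear-2 form: B = B0 - c2 * t (capacity normalised to nominal capacity)."""
    return b0 - c2 * np.asarray(t, dtype=float)

def lin_exp(t, c3, t_f, tau):
    """Assumed LinExp form: B = 1 - c3 * t - exp((t - t_f) / tau).

    The exponential term is negligible while t is well below t_f and grows
    quickly afterwards, which gives the knee-point behaviour.
    """
    t = np.asarray(t, dtype=float)
    return 1.0 - c3 * t - np.exp((t - t_f) / tau)

if __name__ == "__main__":
    t = np.linspace(0.0, 1000.0, 6)  # hypothetical time axis, arbitrary units
    print(linear_1(t, 1e-4))
    print(linear_2(t, 0.98, 1e-4))
    print(lin_exp(t, 1e-4, 900.0, 40.0))
```

With this parameterisation $t_f$ marks the time at which the exponential term reaches 1, so it acts as an onset time only in the loose sense used in the text.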

Methodology
To quantify cell-to-cell variability, we used an approach called multi-level Bayes (MLB), also known as hierarchical Bayes, where the parameters of an individual cell model are assumed to be drawn from a population distribution, as depicted in Fig. 2(a). In this framework, the first level of inference is on the parameters of an individual battery cell model, and the second level of inference is on the parameters of the underlying population distribution [38,39,40]. Given some data sub-sampled from the datasets described above, this approach provides an estimate of the individual ($\theta_k$) and the population ($\mu_g$, $\Sigma_g$) parameter values as well as their associated uncertainties, as depicted in Fig. 2(a) and (b). One can therefore explore the trade-off between the number of cells whose data are used for fitting the models and the stability and variance (or standard deviation) of the resulting population parameter estimates. As additional data from more cells is included in the estimation, the variances of the population mean and population variance estimates decrease (i.e. we become more certain of the population model). As illustrated in Fig. 3, we considered the population estimates to be stable when the standard deviation of the population standard deviation estimate began to steadily decrease as a function of sub-sample size ($\sim 1/N$). We set the condition of an acceptable variability as being within a threshold, $\alpha$, of the stable decreasing region. The value of $\alpha$ was set at 10%, as shown by the grey shaded region in Fig. 3.
The following conventions are used throughout the remainder of this work. Fig. 2(c) shows the definitions of population, sample and sub-sample used. The 'population' means the very large (but unavailable) group of all possible similar batteries produced in the same manufacturing batch, from which a subset were tested in a lab. (Therefore, we expect the population statistics to be different for each dataset that was introduced in the previous section.) A 'sample' refers to all the available full data in a specific dataset. Therefore, a sample is drawn from a population. Whenever a smaller subset was drawn from a full test dataset, it is referred to here as a 'sub-sample'. Summary sample statistics are denoted with the Latin alphabet, while population estimates are written using the Greek alphabet. For example, mean and variance are ($m$, $s^2$) and ($\mu$, $\sigma^2$) respectively. The letter $k$ is used to denote value(s)

Multi-Level Bayes
The parameters of an individual cell degradation model are assumed to be drawn from a population distribution that is unique for each dataset, with unknown population mean and variance. Let $B_k$ denote the capacity of cell $k$ over a number of measurements (i.e. $B_k$ is a vector) and $\theta_k$ denote the model parameters determining the time evolution of capacity. For example, for the LinExp model, $\theta_k = (c_k, t_{f,k}, \tau_k)^T$. Assuming additive and Gaussian measurement noise, $B_k = f(t; \theta_k) + \epsilon_k$, where $\epsilon_k \sim \mathcal{N}(0, \sigma_{n,k}^2)$ and $f$ is one of the three models introduced above. The parameters of an individual cell model $k$ are themselves drawn from a population distribution, assumed here to be Gaussian, $\theta_k \sim \mathcal{N}(\mu_g, \sigma_g^2)$, where $\mu_g$ is the group (i.e. population) mean and $\sigma_g^2$ the group variance. These population parameters are vectors of one, two, or three elements depending on whether the generative model is Linear-1, Linear-2, or LinExp, respectively.
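The two-level generative model described above can be illustrated by simulating it forwards. The sketch below is only an illustration under stated assumptions: it uses a one-parameter linear fade of the form B = 1 - c t (an assumed form), a single shared noise level rather than one per cell, and made-up parameter values. The function name `simulate_cells` is hypothetical.

```python
import numpy as np

def simulate_cells(rng, n_cells, mu_g, sigma_g, sigma_n, t):
    """Forward-simulate capacity curves from a two-level Gaussian model.

    Level 2: each cell's fade rate is drawn from the population N(mu_g, sigma_g^2).
    Level 1: each capacity measurement is the model value plus N(0, sigma_n^2) noise.
    """
    t = np.asarray(t, dtype=float)
    c = rng.normal(mu_g, sigma_g, size=n_cells)          # per-cell parameter theta_k
    clean = 1.0 - np.outer(c, t)                         # assumed linear model f(t; theta_k)
    noise = rng.normal(0.0, sigma_n, size=clean.shape)   # additive measurement noise
    return c, clean + noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1000.0, 50)   # hypothetical time axis
    c, B = simulate_cells(rng, n_cells=10, mu_g=1e-4, sigma_g=1e-5, sigma_n=1e-3, t=t)
    print(c.shape, B.shape)
```

Synthetic data of this kind is useful mainly for checking that an inference procedure recovers known population parameters before applying it to measured data.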
Completing the specification of this generative model in the Bayesian framework requires prior distributions to be assumed. We used wide Gaussian priors (zero mean, $10^4$ variance) on the population means $\mu_g$, and uniform distributions on the population variances. The noise parameters $\sigma_{n,k}$ were assumed to follow a Jeffreys prior, $P(\sigma_{n,k}^2) \sim 1/\sigma_{n,k}^2$, and were integrated out analytically. To infer the posterior distributions of the population parameters $\mu_g$ and $\sigma_g$, we used a two-step process. In the first step (first level inference), we fitted individual cell parameters $\theta_k$ using Markov chain Monte Carlo (MCMC) sampling to obtain samples from $P(B_k|\theta_k) = \int P(B_k|\theta_k, \sigma_{n,k}) P(\sigma_{n,k}) \, d\sigma_{n,k}$. We then approximated the distributions for each cell with a Gaussian using their summary statistics (means $\mu_k$ and variances $\sigma_k^2$), which were then used in a second step (second level inference). This allowed the full posterior distribution of the population parameters to be written as follows: $P(\mu_g, \sigma_g|\{B_k\}) \propto \int \cdots \int \prod_{k=1}^{K} P(B_k|\theta_k) P(\theta_k|\mu_g, \sigma_g) \, P(\mu_g, \sigma_g) \, d\theta_1 \ldots d\theta_K$. The above multivariate integral can be evaluated analytically, owing to the Gaussian approximation to the first level inference. Intuitively, the resulting expression shows that the posterior population mean is a Gaussian centred around the weighted average of the individual cell parameters, where the weights combine the first level variances (uncertainty on the fitting of each cell) and the group variance (mixed effect model). After this, we again used MCMC to draw samples from this posterior distribution to calculate its summary statistics.
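The closed-form result of the integral is not reproduced in this text. For a single scalar parameter, and assuming each first level result is summarised as a Gaussian in $\theta_k$ with mean $\mu_k$ and variance $\sigma_k^2$, the standard Gaussian convolution identity gives the following. This is a sketch of the textbook result under those assumptions, not necessarily the exact expression used by the authors.

```latex
% Each cell's integral is a convolution of two Gaussians:
\int \mathcal{N}(\theta_k \mid \mu_k, \sigma_k^2)\,
     \mathcal{N}(\theta_k \mid \mu_g, \sigma_g^2)\, d\theta_k
  = \mathcal{N}(\mu_k \mid \mu_g,\ \sigma_k^2 + \sigma_g^2)

% so the population posterior factorises over cells:
P(\mu_g, \sigma_g \mid \{B_k\}) \;\propto\;
  P(\mu_g, \sigma_g) \prod_{k=1}^{K}
  \mathcal{N}(\mu_k \mid \mu_g,\ \sigma_k^2 + \sigma_g^2)

% With an effectively flat prior on mu_g, conditioning on sigma_g gives
\mu_g \mid \sigma_g, \{B_k\} \;\sim\;
  \mathcal{N}\!\left(
    \frac{\sum_k w_k \mu_k}{\sum_k w_k},\
    \frac{1}{\sum_k w_k}
  \right),
\qquad
w_k = \frac{1}{\sigma_k^2 + \sigma_g^2}
```

The weights $w_k$ make the description in the text concrete: a cell that was fitted with large uncertainty ($\sigma_k^2$ large) contributes less to the population mean, and when the group variance dominates, all cells are weighted almost equally.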
The MLB approach was used to fit the parameters of all the model/dataset combinations shown in Table 1. That fitting was performed at all sub-sample sizes from a minimum of 3 cells up to 3 fewer than the full number of cells in each dataset, with 1,000 repeats performed at each sub-sample size using random selection with replacement. A population distribution was deemed to have a stable fit if the standard deviation of the estimate of $\sigma_g$ settled to follow a function $\log y = ax + b$, where $x$ is sub-sample size, $y$ is the standard deviation of $\sigma_g$, and $a$, $b$ are arbitrary slope and offset parameters. As a comparison, the equivalent result was also plotted when using a simple sub-sample distribution (SSD), obtained by taking the mean, $m_g$, and variance, $s_g^2$, of a given sub-sample.
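The repeated sub-sampling for the SSD comparison can be sketched as follows. Several details here are assumptions rather than the authors' stated procedure: the unbiased (N-1) variance estimator is used, each sub-sample is drawn without duplicate cells, and "with replacement" is read as meaning that repeats are independent, so the same cell can appear in many repeats. The function name `ssd_spread` is hypothetical.

```python
import numpy as np

def ssd_spread(theta, sizes, n_repeats=1000, seed=0):
    """For each sub-sample size, return the spread of the sub-sample standard deviation.

    theta: 1-D array of fitted parameter values, one per cell in the full sample.
    sizes: iterable of sub-sample sizes (each no larger than len(theta)).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    spread = {}
    for n in sizes:
        s = np.empty(n_repeats)
        for r in range(n_repeats):
            sub = rng.choice(theta, size=n, replace=False)  # one random sub-sample
            s[r] = sub.std(ddof=1)                          # sub-sample std, s_g
        spread[n] = s.std(ddof=1)                           # variability of s_g across repeats
    return spread

if __name__ == "__main__":
    # Hypothetical fitted fade rates for a 48-cell sample.
    theta = np.random.default_rng(1).normal(1e-4, 1e-5, size=48)
    result = ssd_spread(theta, sizes=range(3, 46))
    for n in (3, 10, 45):
        print(n, result[n])
```

Because every sub-sample is drawn from the same finite sample, the spread necessarily shrinks towards zero as the sub-sample size approaches the sample size, which is worth keeping in mind when reading such curves.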

Results
As a reminder, the objective is to quantify the number of battery cells that are required for a stable fit of a population model, when cells are selected at random from a population. In particular, we wish to infer both the parameters of the capacity fade model for each cell, and the parameters of the underlying population, including their uncertainties. We now examine both aspects in succession across the various datasets and models. In most cases, the parameter values estimated by MLB for the population level distributions match those of the sample distributions. Fig. 4 shows two examples of well matched population and sample distributions, namely from Severson-2019-$t_f$ and Baumhöfer-2014-$c_3$. The parameter distribution in the sample is approximately Gaussian in both cases, matching an important assumption in the MLB derivation.
On the other hand, Dechent-2020 with a Linear-2 model demonstrates how the population level parameters can adjust to accommodate non-Gaussian distributions. In the case of Dechent-2020-$B_0$, this manifests as a wide distribution over $B_0$ and $\mu_g$ offset from the sample. The rest of the data was fit more precisely, resulting in a more representative distribution over the gradient parameter $c_2$. There was very little correlation between parameter distributions for any dataset/model combinations. The only significant case was for the Attia-2020-LinExp model, where Pearson's rank coefficients of 0.94 were found between all of $c_1$, $t_f$ and $\tau$. Severson-2019 had correlation coefficients of ∼0.6, suggestive of a weak relationship between the parameters of the LinExp model.
The MLB results were subject to significant variability when fitting a single instance at each sub-sample size. Fig. 5 demonstrates a typical set of parameter estimates where it was hard to interpret the values at small sub-sample sizes. The mean estimates appeared to be well fit at small sub-sample sizes, but the population distributions only settled when the majority of the sample was used.
The summary results from 1,000 repeats, with replacement, were much smoother. The estimated standard deviation of $\sigma_g$ for the Linear-1 models rapidly dropped with increasing numbers of cells in a sub-sample for all datasets, as shown in Fig. 6. The SSD approach produced a lower variance at all sub-sample sizes, but appears insensitive to small sub-samples.
The results for Linear-2 and LinExp were very similar, as shown in Figs. 7 and 8, although there were distinctly less stable fits for Dechent-2020-$B_0$ and Attia-2020-$\tau$. All three models shared a reduced standard deviation of $\sigma_g$ when using SSD.
The linear relationship between sub-sample size and the log of the standard deviations was deemed to represent a consistent fit. It was subsequently used to determine when an 'effective' sub-sample size had been reached. A model was considered well fit when the standard deviation of $\sigma_g$ was within $\alpha$ = 10% of this stable section, found using a linear extrapolation (as plotted in the figures). Fig. 9 shows the relationship between the number of cells required to achieve 'stable' population estimates and the number of model parameters. The number is shown for all model, dataset and parameter combinations. The mean required sub-sample sizes for a consistent fit were 9, 11 and 13 for the 1, 2 and 3 parameter models respectively.
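One way to apply the threshold criterion is sketched below. How the stable section was identified is not specified in the text, so this sketch assumes that the last `n_tail` points define it; the function name `effective_size` and the `n_tail` parameter are hypothetical.

```python
import numpy as np

def effective_size(sizes, y, n_tail, alpha=0.10):
    """Smallest sub-sample size from which y stays within alpha of the log-linear trend.

    sizes:  increasing sub-sample sizes (x).
    y:      standard deviation of the sigma_g estimate at each size.
    n_tail: number of trailing points assumed to lie in the stable section.
    """
    sizes = np.asarray(sizes, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fit log y = a x + b on the assumed stable section only.
    a, b = np.polyfit(sizes[-n_tail:], np.log(y[-n_tail:]), 1)
    trend = np.exp(a * sizes + b)                 # extrapolate back to small sizes
    within = np.abs(y - trend) <= alpha * trend   # inside the alpha band?
    for i in range(len(sizes)):
        if within[i:].all():                      # stays inside from here onwards
            return int(sizes[i])
    return None

if __name__ == "__main__":
    # Synthetic curve: exactly log-linear from 10 cells, inflated below that.
    x = np.arange(3, 41)
    y = np.exp(-0.05 * x) * (1.0 + 0.2 * np.clip(10 - x, 0, None))
    print(effective_size(x, y, n_tail=20))
```

Requiring the curve to stay inside the band for all larger sizes, rather than merely entering it once, avoids reporting a size at which a noisy curve happens to cross the trend line.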

Discussion
The number of cells required to fit the various models presented here and capture a stable estimate of the population variability is of order 10. For the simplest model, Linear-1, the number was 9 cells, and for the most complex LinExp model, the number increases to 13. The results understandably suggest that increased model complexity leads to an increase in the number of cells required to be tested to achieve a stable estimate of the population variability. The multi-level Bayesian approach produced consistent parameter estimates from sub-samples. Given that cell-to-cell variability is an important phenomenon impacting battery performance, the estimated distributions are an invaluable tool to use in empirical modelling. Simple sample distribution techniques are limited to estimates of spread within the domain of the sample and hence showed less sensitivity to sub-sample size here when using random selection. The number of cells required to estimate population variability was fairly consistent across the datasets investigated here and was a stronger function of the model complexity than of the dataset. However, future work could test the robustness of this conclusion across a wider range of datasets.
The standard deviation of the $\sigma_g$ estimates reduced as sub-sample sizes were increased. In most cases, the SSD and MLB results also approached the same values as sub-sample sizes increased because the two techniques will return similar results at high sub-sample sizes. At low sub-sample sizes SSD was limited to the variability of the sub-sample, whereas MLB was less certain, resulting in higher values for both $\sigma_g$ and its standard deviation. In this case, SSD appears to have been artificially confident as an estimate of the population distribution.
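The wider uncertainty of a second level estimate at small sub-sample sizes can be illustrated with a small numerical sketch. It swaps the MCMC sampling described in the paper for a brute-force grid evaluation, handles a single scalar parameter, and assumes flat priors on the population mean and variance over the grid. The function name `sigma_g_posterior` and all numerical values are hypothetical.

```python
import numpy as np

def sigma_g_posterior(mu_k, var_k, var_grid, mu_grid):
    """Posterior mean and std of sigma_g on a grid, for one scalar parameter.

    mu_k, var_k: per-cell first level means and variances.
    var_grid:    uniformly spaced candidate population variances (flat prior).
    mu_grid:     uniformly spaced candidate population means (flat prior).
    """
    mu_k = np.asarray(mu_k, dtype=float)
    var_k = np.asarray(var_k, dtype=float)
    # Each cell contributes N(mu_k | mu_g, var_k + var_g) to the likelihood.
    tot = var_k[None, None, :] + var_grid[:, None, None]       # shape (S, 1, K)
    resid = mu_k[None, None, :] - mu_grid[None, :, None]       # shape (1, M, K)
    loglik = -0.5 * np.sum(np.log(2.0 * np.pi * tot) + resid**2 / tot, axis=2)
    post = np.exp(loglik - loglik.max())                       # avoid underflow
    post /= post.sum()
    p_var = post.sum(axis=1)                                   # marginalise over mu_g
    sigma = np.sqrt(var_grid)
    mean = np.sum(sigma * p_var)
    std = np.sqrt(np.sum((sigma - mean) ** 2 * p_var))
    return mean, std

if __name__ == "__main__":
    var_grid = np.linspace(1e-6, 0.25, 500)
    mu_grid = np.linspace(0.8, 1.2, 101)
    for n_cells in (5, 40):
        mu_k = 1.0 + 0.1 * np.linspace(-1.5, 1.5, n_cells)     # made-up cell estimates
        var_k = np.full(n_cells, 1e-4)
        print(n_cells, sigma_g_posterior(mu_k, var_k, var_grid, mu_grid))
```

A grid is only practical for one or two population parameters; for the multi-parameter models in this work, sampling remains the more natural choice.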
The chosen threshold condition for a well fit $\sigma_g$ parameter resulted in consistent results. The same consistency was also found when using other threshold values of $\alpha$. The hypothesis that the required sub-sample size increases with model complexity appears to be supported. However, it would be useful to explore this in more depth using larger datasets.
In the derivation of the MLB approach, we assumed there to be no correlations between parameters in the prior probability distributions. That assumption was found to be questionable in two cases here. Future work should explore the impact of this on population modelling.
The results for the Dechent-2020 dataset with the Linear-2 model demonstrated the robustness of the MLB approach by fitting a very similar gradient to the Linear-1 version, despite an apparently uncertain value of $B_0$. The estimate of $\mu_g$ for $B_0$ was 99.7%, i.e. the resultant population model was very similar to that with Linear-1. Our current approach assumes a Gaussian distribution at the population level. In the case of a bi-modal (or multi-modal) population distribution, we expect that the MLB method would respond by estimating a wide standard deviation, but this has not been tested. Extending the method proposed in this paper to other population distributions would be an interesting subject for future study.
Various ageing mechanisms are likely responsible for the degradation datasets considered in this paper. In the case of the Severson dataset, it is likely that degradation was largely caused by lithium plating [25], while on the Dechent-2020 dataset, covering layer formation and jellyroll deformation are key to degradation [41].
Small differences were found between the datasets. For the Linear-2 model, the Attia-2020 and Severson-2019 datasets show a higher cell number threshold required for population estimation. The reasons for the trends we see are varied. One possible cause is increased variation of cycling conditions within the dataset. In order to capture higher usage variation in addition to intrinsic variability, the number of cells will likely need to be increased. Future research could look into quantifying both usage variability and manufacturing variability at the same time.
The fact that more complex models required more cells to be tested at each test point is challenging for battery lifetime experiments, since it could increase greatly the number of test channels and cells required in long term ageing experiments. We did not extrapolate to higher numbers of parameters or to other models, but it is reasonable to assume that the issues explored here will be present in other, more complex cases.
One challenge with the technique used here is that it relied on limited size samples from the population. Future work could explore whether larger sample sizes lead to similar results as found here.
Finally, further work is required to investigate the impacts of cell-to-cell variability using more complex physics-based models of battery ageing [32], which can have 5-10 or more parameters in addition to the 20+ parameters of the required underlying electrochemical model. Openly available ageing datasets are at present too small to enable meaningful calculation of population parameter variability for such complex models using the methods we outline, since the number of parameters is significantly more than 1-3. One approach in a future study could be to use synthetic data to study more complex models with more parameters and test parameter identifiability [42].

Conclusions
Simple empirical battery capacity fade models were fitted to a variety of ageing datasets to quantify the number of cells required to estimate the variability of the underlying population. The number of cells required to give a stable population variance estimate was found to vary according to the number of parameters in a given model. Respectively, 9, 11 and 13 cells are estimated to be required for models with 1, 2 and 3 parameters. Both sample statistics and population estimates were shown to stabilise with under 20 cells in most cases, but this relied on the parameters within the sample following a Gaussian distribution; otherwise 20 cells were required.
For capacity curve fitting, perhaps the biggest challenge going forward is the selection of appropriate ageing model order and structure. This should be done not just by looking at which functions fit the capacity profiles best, but also at which functions produce the most reliable parameter distributions when looking at a dataset as a whole.
There was insufficient data here to test these results and conclusions as a function of variability caused by differences in usage, but this would be an interesting future exploration topic. Also, model selection across larger datasets is a challenging problem. For example, some of the battery capacity fade trajectories in this study fitted well to a linear degradation stage followed by an exponential decay starting from some knee point. However, some of the resultant sample distributions, such as Dechent-2020-$B_0$, cannot be confidently used to calculate basic summary statistics.
Understanding and quantifying battery cell-to-cell manufacturing variability is an open research topic, and this work represents an initial step. These results form a useful order of magnitude guide, for those undertaking long term battery ageing experiments, of what is needed to capture manufacturing variability.