A general expression for the statistical error in a diffusion coefficient obtained from a solid‐state molecular‐dynamics simulation

Analysis of the mean squared displacement of species k , rk2 , as a function of simulation time t constitutes a powerful method for extracting, from a molecular‐dynamics (MD) simulation, the tracer diffusion coefficient, Dk* . The statistical error in Dk* is seldom considered, and when it is done, the error is generally underestimated. In this study, we examined the statistics of rk2t curves generated by solid‐state diffusion by means of kinetic Monte Carlo sampling. Our results indicate that the statistical error in Dk* depends, in a strongly interrelated way, on the simulation time, the cell size, and the number of relevant point defects in the simulation cell. Reducing our results to one key quantity—the number of k particles that have jumped at least once—we derive a closed‐form expression for the relative uncertainty in Dk* . We confirm the accuracy of our expression through comparisons with self‐generated MD diffusion data. With the expression, we formulate a set of simple rules that encourage the efficient use of computational resources for MD simulations.


| INTRODUCTION
By aiding the interpretation of experimental data, by providing atomic-scale insights, and by predicting data for new systems, molecular-dynamics (MD) simulations are rapidly becoming a key tool in the study of ion transport in solid-state materials.  Such simulations yield most easily the tracer diffusivity of species k as a quantitative measure of the ion-transport rate (rather than an ion mobility 26 or an Onsager transport coefficient 27 ). Specifically, the tracer diffusion coefficient, D Ã k , is determined from the time evolution of the species' mean squared displacement r 2 k through the relation where d is the number of spatial dimensions in which diffusion takes place.
This approach presupposes, first, that the MD simulation is run for a sufficiently long time, such that the ballistic regime has already been traversed and the diffusive regime is reached; and second, that gradients in concentration, electric potential, and temperature are absent from the simulation cell.
The derivative d r 2 k =dt is usually evaluated as the slope obtained from linear regression of the r 2 k t ð Þ data. From a series of simulations at various temperatures, the activation enthalpy of tracer diffusion, ΔH D Ã , is obtained from an Arrhenius plot ( Given that diffusion is a stochastic process, D Ã k data obtained from MD simulations will inherently show some statistical scatter. The best Dennis Kemp and Alexander Bonkowski contributed equally to this work. way to estimate the magnitude of this scatter, strictly speaking, is to perform several simulation runs with different initial conditions, 28,29 but this is rarely done. In many instances, an estimate of the error in D Ã k is omitted. On occasion, an error is given, but its origin is not always specified. An obvious approach is to use the standard error of the linear regression slope in the r 2 k t ð Þ plot as a measure of the error. The least-squares criterion used in standard linear regression routines, however, refers to a set of independent data points, whereas each data point in a plot of r 2 k versus t is not independent of the other data points (i.e., r 2 k t ð Þ data are autocorrelated). Consequently, the regression error will, on principle, severely underestimate the error in D Ã k . In particular, in the case of MD simulations of small simulation cells with short simulated time periods, the regression error may mask a statistically questionable result by incorrectly implying that D Ã k has been obtained with high accuracy.
Two studies in which the error in D Ã k was examined in some detail 30,31 found empirically that the relative error u rel D Ã k À Á could be described approximately by A= ffiffiffiffiffiffiffi t sim p À Á þ B, where A and B are fitting constants. To calculate a relative error, the true value of D Ã k is of course required as a reference; but in both cases, 30,31 true values of D Ã k were not known a priori, and hence, the reference values were taken from the longest MD simulation that could be carried out. A "long" simulation run does not guarantee, however, that the true value of D Ã k is obtained, especially for those simulations employing a small number of mobile particles k or temperatures too low relative to (ΔH D Ã =k B ). A different approach is therefore required, one that does not rely on "long" MD simulations, one that is less empirical and less approximate, and one that yields an expression that takes into account a variety of key factors, not simply the simulation run time.
In this study, we are concerned with u rel D Ã k À Á , the relative error in a tracer diffusion coefficient, obtained from r 2 k t ð Þ, the time evolution of the mean squared displacement of diffusing particles k. We focus on diffusion in a solid by a vacancy mechanism, but the results are also applicable to diffusion by an interstitialcy mechanism. We took the approach of obtaining r 2 k t ð Þ curves from kinetic Monte Carlo (kMC) simulations. The conceptual benefit of this approach is that we avoid relying on data obtained from a "long" dependences. Subsequently, we use this expression in Section 5 to predict whether an MD simulation will provide data with an uncertainty below a given level. We first test the predictions on MD data that we generate ourselves, and then, we turn a critical eye to literature data. In both Sections 4 and 5, we discuss how the expression can be used to distribute computational resources so that D Ã k T ð Þ and thus ΔH D Ã can be obtained with low uncertainties.

| METHODS
For the kMC and MD simulations, we considered oxygen-vacancy diffusion in SrTiO 3 , primarily because it is a system that has been extensively studied, so that there is excellent agreement from a variety of methods (chemical diffusion, 32 nuclear spin relaxation, 33 tracer diffusion, 34 anelastic relaxation, 35 and classical MD simulations 36 ) as to the absolute magnitude of the vacancy diffusivity and its activation enthalpy (of 0.62 eV). Furthermore, this value of the migration barrier is also obtained from static supercell simulations based on densityfunctional-theory (DFT) calculations with sufficiently large cells containing an even number of perovskite formula units in all three spatial dimensions. 37 A second reason for taking oxygen-vacancy diffusion in SrTiO 3 as our model system is that the tracer correlation factor is known for this migration mechanism in this structure (f Ã ≈ 0:69), 38 allowing D Ã O to be predicted on the basis of the inputs to the kMC simulations (attempt frequency, activation energy, jump distance, and vacancy concentration).

| kMC simulations
A rejection-free kinetic Monte Carlo (kMC) algorithm (n-fold way 39 ) was implemented in Python, with the help of the NUMPY 40 and SCIPY 41 frameworks. No interactions between the jumping vacancies or between vacancies and possible foreign atoms were taken into account. Beside our model system, the cubic perovskite SrTiO 3 , we also carried out kMC simulations on supercells of the threedimensional simple cubic (sc), face-centered cubic (fcc), and bodycentered cubic (bcc) lattices, and of the two-dimensional square lattice. In the example of SrTiO 3 , only the oxygen sublattice of the perovskite structure was included in the simulations.
To achieve comparability of the kMC and MD simulation data for SrTiO 3 , we adopted the kinetic parameters ΔH mig ¼ 0:62eV and ν 0 0 ¼ 4:6 Á 10 12 s À1 of SrTiO 3 . 36 On this basis, the average jump rate Γ v was taken to be In the kMC model of SrTiO 3 , we adopted the lattice constants a 0 from the MD simulation runs (see Table S2 in Data S1), according approximately to the expression 42 a 0 =Å ¼ 3:90 þ 6:64 Á 10 -5 K -1 Á T: As mentioned before, kMC simulations bear the inherent advantage that, in ideal dilute solutions, the expectation value of the vacancy diffusion coefficient, D v , may be calculated directly from the input parameters, thus allowing for a straightforward assessment of the simulation outcome's accuracy. Specifically, for this prediction, the jump rate Γ v , the coordination number Z within the sublattice, the jump distance s v , and the dimensionality d of the lattice are necessary: E.g., for the oxygen sublattice of the perovskite structure, d ¼ 3,  Table S1 in Data S1). For supercell sizes of at least 3 Â 3 Â 3, these ideal values were found to be a reasonable approximation (see Figure S1 in Data S1).

| MD simulations
The MD simulations were carried out with the LAMMPS code. 45,46 The long-range coulombic interactions were calculated with a particle-particle/particle-mesh solver 47  3 | FUNDAMENTAL PROPERTIES OF r 2 k t ð Þ CURVES Before we consider the effects of various simulation parameters on the tracer diffusion coefficient, D Ã k , we first examine some properties of r 2 k t ð Þ curves. In particular, we demonstrate that the commonly used method of calculating r 2 k by time averaging provides no statistical advantage compared with the approach of calculating r 2 k with reference to the initial state at time t 0 . In addition, we show that the extraction of D Ã k by least-squares linear regression is no more reliable than obtaining D Ã k from the difference quotient of two data points.
Variant B is the ensemble average of a time average. r 2 k is then a function of a time lag Δt, rather than of the time t itself: This variant is commonly used to analyze data from ab-initio MD, 22,31,[51][52][53][54] with the aim of reducing the considerable statistical scatter that arises from small particle numbers (cell sizes) and run times. By virtue of providing smooth r 2 k t ð Þ curves, this variant is very useful for the identification of the ballistic regime 31  For these reasons, we maintain that, for the analysis of the diffusive regime of a r 2 k curve, Variant A is the preferable way to calculate the mean squared displacement. In the following Sections, A r 2 k will be employed as the generic way of calculating r 2 k ; it will therefore not be indicated explicitly.

| Extracting the diffusion coefficient: Linear regression versus difference quotient
In view of Equation (1), the use of linear regression to obtain a tracer diffusion coefficient from a r 2 k t ð Þ curve seems a natural choice. But as noted in the Introduction, linear regression assumes a set of independent data points, and this independence is not given within a r 2 k t ð Þ curve. In order to demonstrate this issue, we compute D Ã k using linear regression and, for comparison, using only the very first and the very last point of a r 2 k t ð Þ curve (within, of course, the diffusive regime). That is, we approximate the time derivative of r 2 k t ð Þ by a difference quotient: The results are shown in Figure 2. We see that there is no substantial statistical advantage in calculating D Ã k by means of linear regression over calculating it by a simple difference quotient. We also t ð Þ was calculated without and with time averaging (variants A and B, shown as dotted and solid lines, respectively). The three exemplary curves were chosen from a set of 100 kMC runs as those cases exhibiting the minimal, maximal, and average observed diffusion coefficient. The simulations were carried out on a 3 Â 3 Â 3 SrTiO 3 supercell at T ¼ 2000K; the simulation times were chosen so as to reflect typical time scales of AIMD simulations (A and B) and of classical MD simulations (C and D). see that the regression error, unsurprisingly, fails to capture the statistical scatter of the simulation outcome. Hence, the vast majority of the r 2 k t ð Þ data (i.e., all data points except the first and the last) do not contribute any information that is relevant for the assessment of the diffusion coefficient. Clearly, the apparent wealth of data in a In the vast majority of cases, a value of D Ã k is determined from a single MD run and the statistical uncertainty is not evident. As demonstrated in Section 3, the regression error provides no reliable estimate of the true uncertainty. Although the error in a single MD run can be estimated by rather sophisticated methods, for example, bootstrap resampling, 61 there are two compelling advantages in having a closedform expression for the relative error in D Ã k . First, calculating the error directly from r 2 k t ð Þ curves is simpler and quicker. Second, directed partitioning of computing time between runs at different temperatures is possible: as we will show, increasing the simulation time may, in some cases, have no benefit, while in others, it may greatly enhance the accuracy. Our approach is, therefore, to derive a closed-form expression based on an extensive set of kMC results.

| Assessing the errors on activation enthalpy and pre-exponential factor
Calculating r 2 k , and subsequently D Ã k , as a function of temperature allows the activation enthalpy of tracer diffusion, ΔH D Ã , to be obtained from an Arrhenius plot. In general, the quality of r 2 k data in terms of linearity and smoothness (i.e., no curvature and no discontinuities) decreases with decreasing temperature. Eventually, at some low temperature, not even one single tracer particle jump will occur during the entire course of the simulation. This general decrease in data quality presents a problem if the temperature range for which MD simulations yield high-quality data is substantially different to the temperature range of experimental interest (i.e., the range for which experimental data are available or for which predicted data are desired). Values of D Ã k for the lower temperatures must then be estimated by extrapolating the simulated data to the lower temperatures.
This approach is physically reasonable, provided that there is no change in diffusion mechanism and no change in point-defect behavior (such as a change from intrinsic to extrinsic defect regimes) between the higher and the lower temperature ranges. Nonetheless, the extrapolated diffusivities may be subject to large uncertainties when the extrapolation is carried out across a large temperature difference. If the extrapolated values cannot be complemented with reliable estimates of these uncertainties, unreliable conclusions may be reached, even if the extrapolation approach is, in principle, valid.
As a first step, we demonstrate the importance of sufficiently large cells and sufficiently long simulation times for the accuracy of It is stressed that the quantitative results of Figure 3 are specific to oxygen diffusion in SrTiO 3 and the temperature range considered. A higher set of temperatures, or a material with a lower activation barrier, will produce different quantitative results, on account of the higher jump rates, but the same qualitative results. The main purpose of Figure 3 is thus a qualitative illustration.
The results of Figure 3 also indicate that larger cells with a higher number of vacancies yield more accurate results than the smaller cells.
This trend is especially strong for the longer simulations: as an example, the portion of ΔH D Ã values within AE2% accuracy increases from 23% to 88% upon an increase from a 3 Â 3 Â 3 to a 20Â 20 Â 20 supercell size.
While the results for t sim ¼ 1ns indicate the advantage of increasing the cell size, this trend is not as clear for the shorter simulations.
In that case, there appears to be a sudden enhancement at N cell ¼ 10 ( Figure 3E), which suggests that the improvement might be due entirely to the increase in the number of vacancies, rather than the increase in cell size. In the following, we examine the extent to which larger cell sizes or higher numbers of vacancies contribute to the enhanced accuracy of the simulation outcome. Our overarching aim is to generalize the results so that they apply quantitatively to all systems, not just to oxygen diffusion in SrTiO 3 . Two further points need to be stressed. First, the results of Figure 4 indicate that, if a simulation with a given t sim falls in the plateau regime, a further increase in simulation time will not reduce the relative error in D Ã O . It is then advisable to carry out several independent runs and to average over the values in order to obtain a lower relative error (see Section S4 in Data S1). The key issue, though, is how does one determine whether a single simulation falls in the pla-   Figure 3, that the cell-size dependence is much clearer for the simulations with t sim ¼ 1ns than for those with t sim ¼ 100ps.

| A systematic examination of the relative error
Analogously to the dependence of the relative error on t sim (Figure 4), the dependence on N O may be described rather well with an expression of the form G= ffiffiffiffiffiffi ffi N O p þ H, with fitting parameters G and H.
In Figure 5 we also show that, with increasing t sim , the dependence of the relative error on N O approaches an inverse-squareroot law: This dependence takes exactly the same form as that of the relative uncertainty of r 2 in the continuum limit of an ensemble of N ordinary random walkers (RW) in d spatial dimensions (mathematical derivation see Section S2 in Data S1): We will make use of this expression later. Lastly, the behavior for different numbers of vacancies, N v , is examined in Figure 6. When a larger number of vacancies is present in the simulation cell, the relative errors are generally lower, but this beneficial effect becomes weaker with increasing simulation time, and it appears to vanish in the plateau with respect to t sim . Accordingly, in fits of the expression  quantity is required that is related to these parameters, and consideration of our data suggests that the key quantity that determines the error is the number of tracer particles that have jumped at least once.
Specifically, analysis of the kMC simulation data (e.g., Figure 4) reveals that, at relatively short t sim , most tracer particles have not jumped yet, while, in the plateau at relatively long t sim , most tracer particles have jumped at least once (for a detailed analysis see Section S3 in Data S1). Instead of tracking individual particle jumps, the question of

| Estimator for the relative error of the diffusion coefficient
In the previous Section, we identified r 2 k =s 2 k , the effective number of jumps per k tracer particle, as a simple indicator of whether a simulation run falls in either of the plateaus of the relative error, with respect to t sim or N k . We also established that, at long simulation times, the trend of the relative error approaches the closed-form expression for an ensemble of ordinary random walkers (Equations 9 and 10). At shorter simulation times, on the other hand, the relative error is substantially larger than predicted by Equation (9); this is readily understood by considering that, in this case, a large portion of the tracer particles will not have jumped at all. Actually, these particles will not, on principle, have been able to jump, because they were never approached by a vacancy-hence, they cannot be conceived as random walkers, and they are irrelevant with regard to the error in D Ã k . As stated in the previous Section, we will, therefore, focus on the number of particles that have jumped at least once.
To derive an estimator for this number, we consider the effective number of jumps as a total count, rather than an average per particle.
The (total) effective number of jumps, N eff jumps , is given by Note that the effective number of jumps will generally differ from the true number of jumps by the tracer correlation factor f Ã .
We denote by  the set of tracer particles that have jumped at least once, and by N  k the number of these particles. Over the course of a simulation, N  k increases and, clearly, the number of particles that have not yet jumped decreases. As a consequence, for a given jump, the probability that the jumping particle jumps for the first time decreases, until it reaches 0 (once all particles have jumped). As a simple approximation, we model this differentially as from which we obtain, by integration, In order to test the accuracy of Equation (13), we extracted the true values of N  k from a set of kMC simulations (see Figure S2 and Video in Data S1), and found that Equation (13) is accurate to within the natural scatter of the kMC runs.
Bearing in mind that, in the long-time limit, the relative error in should correspond to Equation (10), we make the assumption that this same expression applies generally to the tracer particles in subset , with N being substituted by N  k . Combining in this way Equations (10) and (13), we obtain our closed-form expression for the relative error in D Ã k : Equation (14) quantitatively incorporates all the characteristics that were identified previously in Section 4.2. For sufficiently short This essentially amounts to an inverse-square-root dependence of the relative error on the simulation time. For sufficiently long simulation times ( r 2 k =s 2 k ) 1), in contrast, Equation (14) yields, as required (cf. Equation 9), the limiting case of the inverse-square-root dependence on the number of mobile particles: In other words, this limiting case tells us that, even for infinite simulation time, the relative error can only reach a finite limit that is prescribed by the number of tracer particles in the cell.
Having obtained Equation (14), we subjected it to scrutiny by analyzing the relative error in D Ã k for various crystal structures. Beside our model perovskite SrTiO 3 , kMC simulations were carried out on supercells of the 3-dimensional simple cubic (sc), face-centered cubic (fcc), and body-centered cubic (bcc) lattices, and of the 2-dimensional square lattice. In all cases, we find, as shown in Figure 7A, good agreement between the observed relative errors and the predictions of Equation (14). Clearly, u rel D Ã k À Á cannot be estimated from either N eff jumps or N k alone, but both quantities impose lower bounds upon u rel (Equations (15) and (16)), as indicated by the dotted/dashed lines in Figure 7B,C. In that regard, Equation (16) is particularly useful because it yields a lower bound without any prior knowledge of the ion transport kinetics. For example, for a three-dimensional lattice with 100 tracer particles, Equation (16) indicates that the relative uncertainty in D Ã k from a single MD run will at best be ≈ 8% (but typically larger, for finite t sim ). To achieve 1% precision with a single MD run, the simulation cell must contain at least 2 3 Á 10 4 ≈ 6667 particles, provided that the plateau with respect to t sim is reached, i.e., r 2 k =s 2 k ) 1. In view of current computational capabilities, the requirement of simulating a cell with $ 10 4 particles implies that, in AIMD, this degree of precision in D Ã k cannot realistically be obtained from a single simulation, but only by calculating an average over multiple independent runs (e.g., given a cell with 250 tracer particles, at least 25 independent runs would be necessary; see Section S4 in Data S1).
We close this Section with two important points. Since the kMC data that we produced (shown in Figure 7) were restricted to the range N eff jumps > 2, our expression is restricted to this range of N eff jumps . However, the use of our expression below N eff jumps ≈ 3 is strongly discouraged because, at this point, the relative error surpasses 50%. The problem is that, taking the absolute error (σ) to estimate a 2σ interval, within which the majority ( ≈ 95%) of possible outcomes fall, yields for u rel > 50% values of D Ã k in the 2σ interval that are negative. Consequently a lower bound to the order of magnitude of D Ã k cannot be estimated. Thus, N eff jumps ≈ 3 represents for us a natural lower boundary.

| APPLICATION OF THE CLOSED-FORM EXPRESSION TO MD RESULTS
In the first part of this Section, we examine how well Equation (14) performs in calculating the error in a tracer diffusion coefficient obtained from an MD simulation. To this end, we produce and analyze is due only to vibrational displacements. Clearly, in this case, t sim is too short. The estimate of N eff jumps ≈ 1 also indicates that the simulation run should be extended at least by a factor of 10 (to t sim ¼ 250ps) to obtain an effective number of jumps well above the critical limit of N eff jumps ¼ 3, as demonstrated in Figure 8B: for this much longer run, we found N eff jumps ≈ 15, and thus, it is reasonable to extract a value of D Ã O from the r 2 O t ð Þ curve. In cases where it is not computationally feasible to extend t sim sufficiently (i.e., such that N eff jumps ≥ 3), the results should be discarded. Note also that, even with t sim ¼ 250ps, the simulations at T ¼ 1200K are still subject to a considerable scatter ( ≈ 20%) in D Ã O . As was pointed out before, the simulation time that is necessary to obtain reliable jump statistics depends on the temperature. This becomes evident by considering that the 25ps runs at T ¼ 2000K ( Figure 8C) display a relative error close to that of the 250ps runs at T ¼ 1200K ( Figure 8B). Extending the MD runs at T ¼ 2000K to t sim ¼ 250ps, the agreement of the r 2 O t ð Þ curves is visibly better than previously, as shown in Figure 8D. This is reflected in the lower relative error in D Ã O (observed and predicted) of ca. 10%. A further increase of t sim , however, will not result in a further improvement. be close to its plateau value. Indeed, Equation (16) predicts a lower bound of ≈ 9% for the relative error. As a consequence, it would be advantageous to calculate an average from several independent runs (see Section S4 in Data S1), rather than to extend the simulation time: for example, the average of the three independent results shown in Figure 8D is expected to be accurate to within ≈ 6%, and, with that, more accurate than a single run with t sim ! ∞.
Another way to decrease u rel is to increase the cell size and thus the number N O of oxygen tracer particles. The next case, Figure 8E, refers to a simulation of a 5 Â 5 Â 5 supercell run for 250ps, that is, at the limit of AIMD simulations. Compared to the runs with the same t sim in the smaller supercell, the results are more accurate, with u rel ≈ 5%. Again, consideration of Equation (16) indicates that extending t sim could at best lead to a further decrease in u rel to ≈ 4%.
If a relative error below 1% is required for each simulated value of D Ã k -and this may be the case if a value of D Ã k with an error less than one order of magnitude is to be predicted through extrapolation over several hundred Kelvin-the cell size and simulation time must be sufficiently large. From Equation (16), we discern that sufficiently large means that the supercell must contain at least 2 3 Á 10 4 tracer particles. These criteria are met in Figure 8F, with a 20 Â 20 Â 20 supercell and t sim ¼ 750ps; the number of vacancies could be increased to 24, yielding excellent jump statistics without additional computational cost, while remaining below the limit of 1% defect concentration, above which undesirable defect-defect interactions are assumed to become relevant. In supercells of usual size in AIMD simulations, that is, a few tens to hundreds of particles, any number of vacancies above 1 will correspond to a concentration above this limit.

| Ab-initio MD: Literature examples
As a first example, we consider an AIMD study 53 of our model system (oxygen diffusion by a vacancy mechanism in SrTiO 3 ): from the relevant simulation parameters (N O ¼ 189, t sim ¼ 10ps) and a lattice parameter of ≈ 4Å, we estimate the effective number of jumps, N eff jumps from the given diffusion coefficients D Ã O , by combining Equations (1) and (11): Only one of the data points corresponds to a value of N eff jumps above the critical limit of 3; most notably, the data point at the lowest temperature (T ¼ 1000K) corresponds to N eff jumps < 1, possibly indicating a non-diffusive origin of r 2 O . Hence, there is no evidence that the low activation enthalpy (of 0:30eV) reported by the authors is more than a simulation artifact (caused by short simulation times and a small supercell). A more general conclusion is that, although AIMD contains more chemical information on a system, it may yield a physically less accurate result than force-field MD.
As a second example, we consider an AIMD study 62  To assess the reliability of this extrapolation, we calculated the relative errors of the reported values of D Ã Li with Equation (14), from the reported r 2 Li values at t ¼ t sim . It was verified that all data points fulfilled the condition N eff jumps ≥ 3. Hence, 2σ-intervals could be estimated with the relative error obtained from Equation (14). On this basis, linear fits with minimum and maximum slopes were constructed, as shown in Figure 9.
An extrapolation of our minimum/maximum slope estimates to 300K yields, in all three cases, a range of values that spans over more than two orders of magnitude. Given this huge uncertainty, it cannot be judged whether a simulation at 300K would indeed produce a higher conductivity of Li 3 OCl 0:5 Br 0:5 than of Li 3 OCl or the opposite.
The seemingly precise matching of the reported conductivity ratio with the experimental value (2:1) is, therefore, likely to be a coincidence.

| CONCLUDING REMARKS
If one has performed an MD simulation of solid-state diffusion, confirmed that the cell has equilibrated, and observed that r 2 k , the mean squared displacement of species k, has increased with time, it may seem that the difficult part has been completed. All that is left is to calculate the tracer diffusion coefficient of k. Our study reveals that, even then, much remains to be done.
In the first part of this study, we examined the behavior of r 2 k t ð Þ curves, and discovered two highly counter-intuitive features: first, calculating time-averaged mean squared displacement curves does not increase the precision of the extracted diffusion coefficient, even though it suggests, by virtue of producing curves of a very smooth appearance, a higher degree of precision. Second, obtaining the diffusion coefficient by means of least-squares linear regression from the r 2 k t ð Þ curve yields results that are no more precise than those obtained by a simple difference quotient. While in both cases, the considered methods were found to be equivalent with regard to the uncertainty in the diffusion coefficient, there may be other reasons to prefer one approach over the other. Since time averaging may mask statistically and physically questionable results, we contend that the simple ensemble average (Equation 6) should always be inspected.
Also, we argue in favor of using a difference quotient for the calculation of a diffusion coefficient, rather than linear regression, in order to avoid mistaking the standard error of the linear regression slope for a measure of the uncertainty in D Ã k . In the second part of this study, we focused specifically on the error associated with the tracer diffusion coefficient, D Ã k , obtained from the analysis of r 2 k t ð Þ. A simple closed-form expression is proposed that allows the relative error in D Ã k to be estimated from the final r 2 k value of the simulation run. This expression, furthermore, is instructive in that it indicates the plateaus of the statistical error in D Ã k : the smaller the cell, the sooner a plateau of the statistical error will be reached for increasing simulation times. For very small simulation times, on the other hand, the beneficial effect of increasing the cell size vanishes.
Finally, we highlight three points that will help researchers to obtain statistically meaningful tracer diffusion coefficients from MD simulations: 1. The relative error in a tracer diffusion coefficient, D Ã k , calculated from an MD run, may easily be estimated with Equation (14) from the final value of r 2 k .
2. The relative error u rel D Ã k À Á may be reduced by extending the simulation time, until it approaches a plateau once the effective number of jumps per particle exceeds unity ( r 2 k =s 2 k > 1). In that regime, it is advisable to increase the cell size or to average over a number of independent runs, rather than to extend the simulation run time further.
3. The number of tracer particles in the cell, N k , prescribes the lower bound of the relative error (Equation 16), for t sim ! ∞.
For a single run to generate values of D Ã k within 10% precision, it follows that at least 67 mobile particles are necessary. For results within 1% precision, the number of mobile particles must exceed 6666.