The kth nearest neighbor method for estimation of entropy changes from molecular ensembles

All processes involving molecular systems entail a balance between associated enthalpic and entropic changes. Molecular dynamics simulations of the end-points of a process provide the enthalpy in a straightforward way as an ensemble average. Obtaining absolute entropies is still an open problem, and most commonly pathway methods are used to obtain free energy changes and thereafter entropy changes. The kth nearest neighbor (kNN) method was first proposed as a general method for entropy estimation in the mathematical community about 20 years ago. It has since been applied to compute conformational, positional–orientational, and hydration entropies of molecules. Programs to compute entropies from molecular ensembles, for example, from molecular dynamics (MD) trajectories, based on the kNN method, are currently available. The kNN method has a distinct advantage over traditional methods: it can address high-dimensional spaces that are impossible to treat with, for example, histogram-based methods without loss of resolution or drastic approximations. Application of the method requires understanding: the kth nearest neighbor method for entropy estimation; the variables relevant to biomolecular and, in general, molecular processes; the metrics associated with such variables; the practical implementation of the method, including requirements and limitations intrinsic to the method; and the applications to conformational, position/orientation, and solvation entropy. Coupling the method with general approximations for the multivariable entropy based on mutual information, it is possible to address high-dimensional problems like those involving the conformation of proteins and nucleic acids, the binding of molecules, and hydration.

The original articles detailing the theory were not meant for an audience of physicists and chemists; thus, we review here the methods, highlighting the most interesting results in the context of entropy estimation from molecular simulations. Although the method may be applied in general to any kind of molecule, the discussion will be focused on molecules whose conformation is mainly described by correlated torsion angles and whose environment is mostly aqueous.
Here we do not review general methods for estimation of entropy from molecular dynamics simulations, which are covered by other excellent reviews, 8,11 but rather address the kth nearest neighbor method, its theory, and implementation in the context of molecular entropy estimation.

| A brief summary on entropy in classical statistical mechanics 43,44

We consider a system consisting of N atoms, in a volume V_3D, at temperature T. The three-dimensional ordinary volume of the system is indicated explicitly by the subscript 3D to avoid confusion with the volume in the multidimensional coordinate space which will be introduced later. The state of the system is described by 3N Cartesian atomic coordinates and 3N momentum coordinates. Its energy is given by the sum of a potential energy function $U(\vec{x})$ and the kinetic energy $\sum_i \vec{p}_i^{\,2}/2m_i$. The classical partition function for this system is:

$$Q = \frac{1}{h^{3N}} \int e^{-\left(U(\vec{x}\,) + \sum_i \vec{p}_i^{\,2}/2m_i\right)/k_B T}\, d\vec{x}\, d\vec{p} \quad (1)$$

where k_B and h are Boltzmann's and Planck's constants, respectively. The introduction of h^{3N} in the denominator, which clearly comes from quantum statistical mechanics, serves two purposes: making Q adimensional and making it consistent with the partition function derived in quantum statistical mechanics for simple systems. 43 As is apparent below, the choice of normalization factor, here h^{3N} but often implicitly 1 × (unit of momentum × unit of length)^{3N}, sets the partition function of the zero free energy reference state and is inconsequential for free energy differences. For molecular processes in condensed phase, for which changes of volume are typically rather limited, we can consider the Helmholtz free energy A as an appropriate thermodynamic potential:

$$A = -k_B T \log Q \quad (2)$$

The momentum integral can be calculated and factored out, leaving us with the equation:

$$A = -k_B T \log \int e^{-U(\vec{x}\,)/k_B T}\, d\vec{x} \;-\; k_B T \log \prod_i \left(2\pi m_i k_B T\right)^{3/2} \;+\; k_B T \log h^{3N} \quad (3)$$

The second right-hand term comes from the momentum integral and is typically safely neglected when comparing states of the same system (i.e., two different conformations, or free vs. bound state) at the same temperature, leading to its cancellation. The third term is related to the choice of normalization factor of the partition function. The first term is a configurational term and leads to the differences in free energy when we consider different macrostates of the same system at the same temperature. Different macrostates are defined by restricting the integral to different regions of the configurational space. For example, for a molecular complex the space could be partitioned into "free" and "bound" states based on one or more intermolecular distances. Entropy is obtained by the derivative:

$$S = -\left(\frac{\partial A}{\partial T}\right)_{V_{3D},\,N} \quad (4)$$

Explicit derivation leads to the following expression for the entropy:

$$S = S_0 + S_{mom} - k_B \int \rho(\vec{x}\,) \log \rho(\vec{x}\,)\, d\vec{x} \quad (5)$$

where S_0 is related to the choice of normalization factor for the partition function, S_mom is the entropy associated with the momentum integral, and the last term, S_conf, is the entropy associated with the configurational freedom of the molecule, which is expressed in terms of the probability density:

$$\rho(\vec{x}\,) = \frac{e^{-U(\vec{x}\,)/k_B T}}{\int e^{-U(\vec{x}\,')/k_B T}\, d\vec{x}\,'} \quad (6)$$

Note that after partitioning the entropy into three terms, the argument of the logarithm in each term is a dimensional quantity. When each term is considered in isolation, the choice of the units of measure implies the choice of a reference state for that term. This subject, and the change of entropy upon change of variables, are thoroughly discussed by Hnizdo and Gilson. 44 From the above Equation (5) it is seen that S_conf is the average of $-k_B \log \rho$ over the thermodynamic ensemble, that is:

$$S_{conf} = -k_B \int \rho(\vec{x}\,) \log \rho(\vec{x}\,)\, d\vec{x} \quad (7)$$

which is the thermodynamic ensemble average of $-k_B \log \rho$, that is:

$$S_{conf} = \left\langle -k_B \log \rho \right\rangle \quad (8)$$

The fundamental Equation (7) was introduced by Boltzmann in terms of the number of what he called complexions, instead of probability densities (see the introduction to the translation of the original article by Sharp and Matschinsky 45 ). The present form involving the probability density function was later introduced by Gibbs. 46 The definition of entropy in terms of probabilities later laid the foundation of information theory. 47 The term "entropy" in that context was suggested by von Neumann, who had defined the entropy for quantum mechanical systems in a similar way. 48 Equation (8) is explicitly used in the kth nearest neighbor method.
It is worth noting that all thermodynamic quantities expressed as ensemble averages may be estimated from the configurations sampled by equilibrium molecular dynamics simulations according to the probability density of Equation (6). Suppose n samples $\vec{x}_1, \ldots, \vec{x}_n$ are drawn from the probability density and that $k_i$ of them are found inside a volume V centered at sample $\vec{x}_i$. Assuming a constant pdf within the volume, the probability of any sample to be found inside V centered at $\vec{x}_i$ would be estimated by:

$$\hat{p}_i = \frac{k_i}{n} \quad (9)$$

| The kth nearest neighbor method for entropy estimation
which is a discrete estimator (dependent on the point $\vec{x}_i$ and on the chosen volume V) of a continuous probability, a function of position and of the chosen volume V.
The probability density, assumed to be uniform inside V, is consequently estimated by:

$$\hat{\rho}_i = \frac{k_i}{n V} \quad (10)$$

With an estimate of the probability density at each sample, the entropy can thus be naively estimated by:

$$\hat{S} = -\frac{k_B}{n} \sum_{i=1}^{n} \log \hat{\rho}_i = -\frac{k_B}{n} \sum_{i=1}^{n} \log\left(\frac{k_i}{n V}\right) \quad (11)$$

with k_B the Boltzmann constant. Since V is set equal for each sample $\vec{x}_i$ and the probability distribution function is not uniform, the estimates of the probability density function at each sample will result from different numbers of samples $k_i$. Moreover, to have samples also in low-probability regions, V would have to be chosen unnecessarily large for denser regions, resulting in a loss of resolution, that is, in an overestimate of the entropy.
In the kth nearest neighbor method, instead of the volume, the number of samples k inside each volume is chosen equal for all samples, and the volume (which is therefore dependent on the sample) is adapted so as to include just k samples. Although the volume could be chosen of any shape, a reasonable choice (in the absence of a detailed analysis of the distribution of samples) is to take a ball whose radius is the distance to the kth nearest neighbor of sample $\vec{x}_i$, in such a way that exactly k samples are found inside the volume. Therefore the volume of the ball centered at each sample depends on the sample i and on the value of k chosen (identical for all samples). This fact is made explicit in the following by the subscripts on V: V_{i,k} (Figure 1). The problem with the naive approach is that, in general, the logarithm of an unbiased estimator of the pdf is not an unbiased estimator of the logarithm of the pdf.
FIGURE 1 Illustration of the kth nearest neighbor method in two dimensions (d = 2). In the example, the sample i and its distance (r_{i,k}) to the kth (here, the fourth) nearest neighbor are highlighted. The example involves a two-dimensional Euclidean space where the volume of a ball with radius r_{i,k} is V_{i,k} = π r_{i,k}², that is, the area of the disk with radius r_{i,k}. $\hat{\rho}_i$ is the estimated probability density at sample i. The balls up to the first (black), second (red), third (green), and fourth (blue) neighbors are shown for sample i and for another sample.
Kozachenko and Leonenko 41 in their seminal paper (unfortunately available only in Russian) elaborated the naive entropy estimation provided by Equation (11) and showed that, for k = 1 and save for the k_B factor added in the present context, the estimator:

$$\hat{S} = \frac{k_B}{n} \sum_{i=1}^{n} \log\left(n V_{i,1}\right) + k_B \gamma \quad (12)$$

where γ is the Euler-Mascheroni constant (0.5772…), is an unbiased estimator of the entropy as n → ∞. They further studied the mean square error of the estimate, without reporting the complete, cumbersome demonstration. Demchuk et al. 42 developed and extended the idea originally proposed by Kozachenko and Leonenko, 41 and worked out the theory providing useful formulae; similarly, Goria et al. developed essentially the same theory 49 and Berrett et al. considered a weighted version of the approach. 50 Starting from the work of Kozachenko and Leonenko, Grassberger et al. 51 proposed an implementation of the kNN method, focusing on mutual information. When dealing with different (single and product) spaces and dimensions, the choice of a fixed value of k for all spaces corresponds to largely different resolutions. Their algorithms address specifically this problem, using the maximum norm instead of the Euclidean norm and choosing a different k for each of the single and product spaces.
We summarize hereafter the theory, following closely Demchuk et al., 42 simplifying and generalizing to non-Euclidean metrics. As above, we start by considering the probability density estimator:

$$\hat{\rho}_i = \frac{k}{n V_{i,k}} \quad (13)$$

We further consider the entropy estimator, based on nearest neighbors according to the naive reasoning above:

$$\hat{S} = -\frac{k_B}{n} \sum_{i=1}^{n} \log \hat{\rho}_i = \frac{k_B}{n} \sum_{i=1}^{n} \log\left(\frac{n V_{i,k}}{k}\right) \quad (14)$$

and we work out the probability density distribution of the quantity:

$$T_i = -\log \hat{\rho}_i = \log\left(\frac{n V_{i,k}}{k}\right) \quad (15)$$

The volume V_{i,k} is the volume of a ball of radius r_{i,k}, where r_{i,k} is the distance of sample i to its kth nearest neighbor.
Let us also distinguish the random variables $\vec{X}_1, \vec{X}_2, \ldots$, representing the vectors of variables of samples 1, 2, …, respectively, from their actual samples $\vec{x}_1, \vec{x}_2, \ldots$, and similarly R_{i,k} from its actual sample r_{i,k}. The probability that T_i > t is equal to the probability:

$$P\left(T_i > t\right) = P\left(\text{fewer than } k \text{ samples lie within the ball of radius } r \text{ centered at } \vec{X}_i\right) \quad (16)$$

where, based on the definition of T_i, r is the radius of a ball of volume:

$$V_t = \frac{k\, e^{t}}{n} \quad (17)$$

Note that all T_i are identically distributed, because all samples are drawn from an identical distribution, so that the probability in Equation (16) is exactly the same for any other sample, say j. Let us consider therefore one specific sample, say the first, replace $\vec{x}_1$ with the generic position $\vec{x}$, and consider T_1.
The probability p that sample j (with j ≠ 1) is in the volume V_t centered at sample $\vec{X}_1 = \vec{x}$ is the integral of the probability density function over the ball of volume V_t centered at $\vec{x}$:

$$p = \int_{V_t} \rho(\vec{x}\,')\, d\vec{x}\,' \quad (18)$$

so that the probability that fewer than k of the remaining n − 1 samples fall inside V_t is binomial:

$$P\left(T_1 > t\right) = \sum_{j=0}^{k-1} \binom{n-1}{j}\, p^{\,j} \left(1-p\right)^{n-1-j} \quad (19)$$

When n → ∞ the volume V_t → 0 for any finite value of t, and we can approximate:

$$p \approx \rho(\vec{x}\,)\, V_t \quad (20)$$

In the same limit the binomial probability distribution tends to the Poisson distribution, with:

$$\lambda = n p = k \rho(\vec{x}\,)\, e^{t} \quad (21)$$

$$P\left(T_1 > t\right) \approx \sum_{j=0}^{k-1} \frac{\lambda^{j}}{j!}\, e^{-\lambda} \quad (22)$$

The probability density function $h_{T_1,k,n}(t)$ of the variable T_1 is obtained by taking the derivative with respect to t of Equation (22), applying the chain rule because λ depends on p, which in turn depends on V_t, which in turn depends on t. After some straightforward steps we obtain:

$$h_{T_1,k,n}(t) = \frac{\lambda^{k}\, e^{-\lambda}}{(k-1)!} \quad (23)$$

The expectation value of T_1 is thus given by the integral:

$$E\left[T_1\right] = \int_{-\infty}^{\infty} t\, h_{T_1,k,n}(t)\, dt \quad (24)$$

By the change of variable $z = k \rho(\vec{x}\,) e^{t}$ the integral may be written as:

$$E\left[T_1\right] = \int_{0}^{\infty} \left[\log z - \log\left(k \rho(\vec{x}\,)\right)\right] \frac{z^{k-1} e^{-z}}{(k-1)!}\, dz \quad (25)$$

Introducing L_{k−1}, defined iteratively by:

$$L_{i} = L_{i-1} + \frac{1}{i}, \qquad L_0 = 0 \quad (26)$$

the first integral is recognized as the digamma function ψ(k), that is, the derivative of the logarithm of the gamma function:

$$\int_{0}^{\infty} \log z\, \frac{z^{k-1} e^{-z}}{(k-1)!}\, dz = \psi(k) = L_{k-1} - \gamma \quad (27)$$

where γ is, as above, the Euler-Mascheroni constant, so that:

$$E\left[T_1\right] = -\log \rho(\vec{x}\,) - \log k + L_{k-1} - \gamma \quad (28)$$
The above equations show that the variable:

$$T_1 + \log k - L_{k-1} + \gamma = \log\left(n V_{1,k}\right) - L_{k-1} + \gamma \quad (29)$$

is an unbiased estimator of $-\log \rho(\vec{x}\,)$, and thus the average of this estimator over the samples is an unbiased estimator of the entropy of the distribution:

$$\hat{S} = \frac{k_B}{n} \sum_{i=1}^{n} \left[\log\left(n V_{i,k}\right) - L_{k-1} + \gamma\right] \quad (30)$$

Note the similarity of Equation (30) with Equation (14): the only difference lies in the term log k of the naive estimator, which is replaced in the exact estimator by L_{k−1} − γ. The difference becomes negligible for large values of k.
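As an illustration, the unbiased estimator of Equation (30) can be implemented directly in a few lines. The sketch below is a minimal brute-force version for one-dimensional samples (function and variable names are ours; the entropy is returned in nats, i.e., in units of k_B), where the ball of radius r is the interval of volume 2r:

```python
import math
import random

def knn_entropy_1d(samples, k):
    """kNN entropy estimate (Equation (30)), in units of k_B, for 1D samples.
    In one dimension the ball of radius r is the interval of volume 2r."""
    n = len(samples)
    gamma = 0.5772156649015329
    # L_{k-1} - gamma equals the digamma function psi(k) for integer k
    psi_k = -gamma + sum(1.0 / j for j in range(1, k))
    total = 0.0
    for i, x in enumerate(samples):
        dists = sorted(abs(x - y) for j, y in enumerate(samples) if j != i)
        r_ik = dists[k - 1]          # distance to the kth nearest neighbor
        total += math.log(n * 2.0 * r_ik) - psi_k
    return total / n

random.seed(1)
xs = [random.random() for _ in range(1000)]   # uniform pdf on [0, 1]
S = knn_entropy_1d(xs, k=5)                   # exact entropy of this pdf is 0
```

For samples drawn from the uniform pdf on [0, 1], whose exact differential entropy is zero, the estimate should be close to zero up to statistical error and resolution bias.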
By a similar, but definitely more complex, treatment one can compute the variance of the unbiased estimator (Equation (30)), which is done in the article by Demchuk et al. 42 The variance of the sum in Equation (30) is the sum of different contributions, entailing the variance of the pdf itself, the variance associated with the kth nearest neighbor method, and the covariances between pairs of estimates. Both Kozachenko and Leonenko 41 and Demchuk et al. 42 demonstrate that in the limit n → ∞ the latter covariance term is zero, that is, samples are expected to be uncorrelated, and thus the variance of the unbiased entropy estimator may be approximated by the sum of the variances of the T_i estimators, resulting in:

$$\operatorname{Var}\left(\hat{S}\right) \approx \frac{k_B^2}{n} \left[\operatorname{Var}\left(\log \rho\right) + \psi_1(k)\right] \quad (31)$$

where ψ_1 is the trigamma function, $\psi_1(k) = \pi^2/6 - \sum_{i=1}^{k-1} 1/i^2$, which decreases as k increases. Demchuk et al. report all the details of the above approximation. 42 We remark that the approximations made involve three key aspects:

• n must be large, to be able to use the Poisson distribution and to guarantee all limiting behaviors;
• in particular, n must be such that, within the ball including the k − 1 nearest neighbors, the pdf is approximately constant. This means that all the pdf features at lengths shorter than the average radius of the balls built about all samples will be lost (and the entropy will consequently be overestimated). The average radius of the balls is the effective resolution of the kNN entropy estimator;
• in particular, n must be large to guarantee that the covariance between any pair of single-sample estimators is zero.
The above considerations may be compared with Equation (31), which shows that increasing the number of neighbors inside the ball, that is, increasing k, has the effect of reducing the variance of the entropy estimate. For small k, depending on the number of samples n, the part of the variance due to sampling may be large.
The practical application of the method must find a compromise between increasing the resolution (large n and small k) and reducing the variance of the entropy estimate (large k).
It is worth also remarking that the treatment above applies to samples in both Euclidean and non-Euclidean spaces.The only requirements are that it is possible to define a metric and that the volume of a ball can be computed as a function of its radius.
For this reason it is essential to describe the metric and ball volume for spaces, like those in which rotations and position-orientations are described, which are not Euclidean.

| Independence of samples
Before addressing metrics and volumes relevant for molecular entropy calculation, we remark that one of the key assumptions of the kNN method is that important correlations are properly sampled and that samples are independent of each other.For molecular dynamics simulations this translates into important requirements on the length of simulation and/or frequency of sampling.
If samples (e.g., MD snapshots) are taken very frequently, snapshots adjacent in sampling time will describe conformations close to each other, which will result in short distances and therefore high densities in conformational space. Unless the same conformational region is sampled several times after memory of the starting configuration has been lost, the closest samples, that is, those on which entropy estimation is based in the kNN method, will not be independent of each other. They will thus be closer than expected for random sampling, and the entropy will therefore be largely underestimated, just as a result of inappropriate sampling.
One way to detect the non-independence of the samples due to too frequent sampling is to check the correlation between conformational and time distances. If such correlation is found, it means memory of the preceding configurations has not been lost and the samples are not independent. The estimated entropy will therefore be lower than the true entropy. As an extreme case, if two identical configurations are present in the ensemble, for example, after a rejected Monte Carlo move, their distance will be zero and the estimated entropy will diverge to minus infinity. Such cases must be treated ad hoc, but in general judicious choices for sampling must be made, as discussed in Section 3.
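Such a check can be sketched as follows: draw random pairs of snapshots and compute the correlation between their separation in sampling time and their conformational distance; a clearly positive correlation signals that memory has not been lost. The sketch below (plain Python, one-dimensional "conformations", helper names ours) contrasts a random walk, which retains memory, with independent samples:

```python
import math
import random

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

def time_distance_correlation(traj, npairs=2000):
    """Correlation between |i - j| (time distance) and |x_i - x_j|
    (conformational distance) over random snapshot pairs."""
    t, d = [], []
    for _ in range(npairs):
        i, j = random.randrange(len(traj)), random.randrange(len(traj))
        if i == j:
            continue
        t.append(abs(i - j))
        d.append(abs(traj[i] - traj[j]))
    return pearson(t, d)

random.seed(2)
# correlated sampling: a random walk retains memory of its past
walk = [0.0]
for _ in range(999):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))
# independent sampling: memory is lost between snapshots
indep = [random.gauss(0.0, 1.0) for _ in range(1000)]
c_walk = time_distance_correlation(walk)    # clearly positive
c_indep = time_distance_correlation(indep)  # close to zero
```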

| Metrics relevant for biomolecular processes
The kNN method is based on distances between configurational samples, so it is necessary to choose a molecular representation amenable to define a metric.Moreover, the large number of variables calls for a choice that reduces as far as possible correlations among variables, because it will be very difficult to treat, a posteriori, the mutual information among a large number of variables.
Atomic coordinates are not well suited for this task, because of the direct correlations within groups of atoms in constrained reciprocal positions. In a sense, normal mode analysis, which has been used to describe motions and associated entropies at different time and length scales in proteins, addresses correlations among coordinates. However, the analysis is based on a linear model, which is justified for small vibrations about mean positions but not for the complex motions of macromolecules, in particular for entropy estimation, although many developments have been proposed to extend the method to anharmonic motions. 34 The bond, angle, torsion (BAT) representation has been widely used instead, because obvious correlations are controlled by torsion angles. Bonds and angles are defined as "hard" degrees of freedom, compared with torsions, defined as "soft" degrees of freedom, because the potentials restraining covalent geometry are stronger than those depending on dihedral angles.
The effect of interactions on bonds and angles is typically very small, so these degrees of freedom are often ignored. The entropy changes associated with bonds and angles are mostly about 10%-20% of those for torsional degrees of freedom. 52,53 Motions associated with "hard" degrees of freedom are dominated by the (typically harmonic) force field potentials, and are therefore referred to as vibrations. Note that in the literature the term "vibration" has also been used to include fluctuations within minima wells of dihedral angles.

| Change of variables
Hnizdo and Gilson have thoroughly addressed the issue of entropy calculation under a change of variables. 44 Here we focus on how the issue is related to the kNN method. When changing variables {x} → {y}, volumes change as dx → J dy, where J is the determinant of the Jacobian of the transformation. Densities change accordingly, $\rho_x(\vec{x}\,) = \rho_y(\vec{y}\,)/J$, so that the configurational entropy may be written as:

$$S_{conf} = -k_B \int \rho_x(\vec{x}\,) \log \rho_x(\vec{x}\,)\, d\vec{x} = -k_B \int \rho_y(\vec{y}\,) \log \frac{\rho_y(\vec{y}\,)}{J}\, d\vec{y} = \left\langle -k_B \log \frac{\rho_y}{J} \right\rangle$$

The latter ensemble average is estimated according to the theory described above, with the volume computed in the new coordinates as $\int J\, d\vec{y}$. The change of variables is correctly taken into account by the change of the measure density in the variables' spaces, recovering the configurational ("spatial") entropy that would be calculated in the starting coordinates.

| Cartesian coordinates change to external and BAT coordinates
We consider BAT representation, with external coordinates to describe overall translation and rotation.It is important to study the change of variables and the determinant of the Jacobian of the transformation because the latter enters the integrals (in particular the integral that defines the volume of a ball in the transformed coordinates).
A detailed analysis of the conversion from Cartesian coordinates to BAT coordinates for a linear chain was reported by Go and Scheraga, 54 who showed that the determinant of the Jacobian of the transformation is independent of the torsional degrees of freedom. In that work the coordinates of the first atom defined the overall positional coordinates, and two imaginary atoms were considered, allowing the definition of BAT coordinates for all atoms in the chain. The first bend angle and the first two torsion angles, defined using the two imaginary atoms, served as overall orientational coordinates.
The coordinates typically used for the overall position and orientation of a molecule in processing molecular trajectories are the center of geometry of a set of selected atoms and three angular coordinates to define the orientation about the latter center of geometry, after optimal superposition on the same set of selected atoms on a reference structure.
These considerations do not apply in a straightforward way to molecules which have more than one set of fairly rigid parts mobile with respect to each other. In the latter case it would be difficult to find a single position and rotation identifying the state of the molecule. 57 In the following we restrict ourselves to the single rigid core situation and adhere to the most common practice of identifying the atoms in such a rigid core and using the latter to define the overall position and orientation of the system.
The approach of Go and Scheraga must be modified to accommodate this different referencing of position and orientation.
Let us remark that the disadvantage of considering the first atoms in the linear chain to define position and orientation is that if that part of the molecule is very flexible compared with the rest of the molecule, orientational and positional coordinates will be strongly correlated to internal degrees of freedom, which makes entropy estimation more difficult.
The change of variables may be performed in four steps, as shown in Figure 2 and detailed in Section SI.1. At the end of the change from atomic Cartesian coordinates to external and BAT coordinates, the molecule is described by:

FIGURE 2 Steps for the change from atomic Cartesian coordinates to external and internal BAT coordinates. The structure in the top right is the reference one, with the atoms selected for optimal superposition shown as yellow spheres. (1) The center of geometry of the selected atoms, with respect to the same atoms in the reference structure, is found; (2) the structure is translated to the origin; (3) the optimal rotation to superimpose the selected atoms is found; and (4) Cartesian coordinates are converted into BAT coordinates.
• three coordinates ($\vec{x}_{CG}$) for the center of geometry of the set of atoms chosen for superposition on a reference system;
• three angular coordinates (φ, ψ, θ) to describe the rotation for optimal superposition on a reference system: φ and ψ identify the axis of rotation and θ is the rotation angle;
• n_b variables describing bond lengths;
• n_a variables describing bend angles;
• n_t variables describing torsion angles.

Assuming rigid bonds and bend angles, and nearly rigid superposed atoms, we are left with three degrees of freedom for the overall position ($\vec{x}_{CG}$), three degrees of freedom for the overall orientation (φ, ψ, θ), and n_t degrees of freedom for independent torsion angles named χ_1, χ_2, …, χ_{n_t}.
Under the same assumptions the determinant of the Jacobian of the transformation:

$$\left(x_1, y_1, z_1, \ldots, x_N, y_N, z_N\right) \rightarrow \left(\vec{x}_{CG}, \phi, \psi, \theta, b_1, \ldots, b_{n_b}, \theta_1, \ldots, \theta_{n_a}, \chi_1, \ldots, \chi_{n_t}\right)$$

may be written, up to constant factors, as:

$$J = \sin\psi\, \left(1 - \cos\theta\right)\, J_{const}$$

where J_const is the part of the determinant of the Jacobian depending only on bond lengths and bending angles, which are by assumption nearly constant and will not be considered further.

| Distances in external coordinates and BAT space
The kth nearest neighbor method is based on distances (defined according to a metric), therefore it is necessary to define the distance between any two samples of the system studied, in the space where it is represented.The space involves spatial and angular coordinates and we must define a distance that takes into account both kinds of coordinates.
A general metric that can be applied in multivariable spaces where a metric is defined for each variable or group of variables is the so-called Euclidean product metric.
It is easy to show that if d_1 is a metric on a space X and d_2 is a metric on a space Y, then $d = \sqrt{d_1^2 + d_2^2}$ is a metric on the product space X × Y. 58 Using the Euclidean product metric we can define distances in complex spaces with different kinds of coordinates; in particular, it is convenient to break down large multivariable spaces like those of biomolecules into subspaces with obvious correlations among variables, for example, for rotations, translations, position-orientations, and rotations about contiguous torsion angles. In these subspaces we can make use of the Euclidean product metric, and treat correlations among different variables or groups of variables with specific tools (vide infra). We review in the following the subspaces relevant for molecules and define, for each subspace, the distance between samples a and b.

Translation
The distance between the centers of geometry, which identify the translational state of the molecule, is the straightforward Euclidean distance in three dimensions, and the volume of a ball of radius d in three dimensions is the volume of the sphere:

$$V(d) = \frac{4}{3}\pi d^3$$

Rotation

There are many possible definitions of a distance in rotational space. 59,60 Based on Euler's theorem, any rotation can be represented by a rotation of angle θ about an axis $\vec{v}$. If the rotational state of a rigid body is described in the axis-angle representation, a natural definition of distance in rotational space is given by the angle needed to superpose the two samples of the same rigid body, 59-61 that is, the angle θ of the superposing rotation, independent therefore of the axis of rotation.
The rotation matrix may be expressed in terms of $\vec{v}$ and θ as:

$$R = I + \sin\theta\, \operatorname{skew}(\vec{v}\,) + \left(1 - \cos\theta\right) \operatorname{skew}(\vec{v}\,)^2$$

where $\operatorname{skew}(\vec{v}\,)$ is the antisymmetric matrix whose element ij is $-\sum_k \varepsilon_{ijk} v_k$, with $\varepsilon_{ijk}$ the Levi-Civita tensor. When expressed in terms of the rotation matrices R_a and R_b corresponding to the rotational states a and b with respect to a common reference system, the distance is thus:

$$d_{ab} = \arccos\left(\frac{\operatorname{Tr}\left(R_a^{T} R_b\right) - 1}{2}\right)$$

where Tr indicates the trace operator. Note that the rotation angle satisfies 0 ≤ d_ab ≤ π.
The volume of a ball with radius d in rotational space is the integral, over the possible axis orientations (4π) and over dθ from 0 to d, of the determinant of the Jacobian written above, save for constant terms:

$$V(d) = 4\pi \int_0^d \left(1 - \cos\theta\right) d\theta = 4\pi \left(d - \sin d\right)$$
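Both the rotational distance and the ball volume above can be checked numerically. In the sketch below (helper names ours), two rotations about the z-axis by angles a and b are at rotational distance |a − b|, and the ball volume evaluated at d = π gives 4π², the value of 4π(d − sin d) over the whole range of rotation angles:

```python
import math

def rot_z(t):
    """3x3 rotation matrix about the z-axis by angle t."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_distance(Ra, Rb):
    """Angle of the rotation superposing Ra onto Rb:
    d = arccos((Tr(Ra^T Rb) - 1) / 2)."""
    tr = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    # clamp to guard against round-off outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0)))

def rotation_ball_volume(d):
    """Volume of a ball of radius d in rotation space: 4*pi*(d - sin d)."""
    return 4.0 * math.pi * (d - math.sin(d))

d = rotation_distance(rot_z(0.3), rot_z(1.0))   # expected: 0.7
V = rotation_ball_volume(math.pi)               # expected: 4*pi**2
```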

Position-orientation
The six degrees of freedom for translations and rotations can be considered together, and the Euclidean product metric can be used to define a distance in the position-orientation space, as first proposed heuristically by Huggins 62 and used later by Liedl et al. 63 and Heinz and Grubmueller. 64 The approach was justified theoretically 65 by computing the volume of a ball in the six-dimensional position-orientation space endowed with a non-Euclidean metric and providing analytical approximations to the volume. In general it is useful to multiply the angular distance by a length, say f, to have common units for position and orientation distances and to weight the two distances in such a way that they contribute similarly to the Euclidean product metric distance in the product space. 65 Similar considerations and numerical tests have been recently presented by Heinz and Grubmueller. 64 The distance in position-orientation space is defined as:

$$d_{ab} = \sqrt{d_{t,ab}^2 + f^2 d_{r,ab}^2}$$

where d_{t,ab} and d_{r,ab} are the translational and rotational distances defined above. The volume of a ball of radius d in position-orientation space is approximated, for small distances, by a series expansion 65:

$$V(d) = \sum_{i \ge 0} a_i\, d^{\,6+2i}$$

whose coefficients a_i are given explicitly in Reference 65. The Euclidean product metric approach has been extended to treat the positional-orientational space of two molecules. 66 If we use a different scaling factor for each of the two molecules, that is, f_1 and f_2, the volume of a ball of radius d in the resulting 12-dimensional space is similarly approximated by a series expansion, whose explicit form up to the first terms is given in Reference 66.

Torsion angles
The distance between two torsions is defined taking into account the periodicity of the rotation. If the periodicity is 2π/n, the distance between two samples of the same torsion angle is:

$$d_{ab} = \min\left(\left|\chi_a - \chi_b\right|,\ \frac{2\pi}{n} - \left|\chi_a - \chi_b\right|\right)$$

and for m torsions:

$$d_{ab} = \sqrt{\sum_{i=1}^{m} d_{ab,i}^2}$$

where d_{ab,i} is the distance between the two samples for the ith torsion angle.
The volume of a ball of radius d in an m-dimensional Euclidean space is:

$$V(d) = \frac{\pi^{m/2}}{\Gamma\left(\frac{m}{2} + 1\right)}\, d^{\,m}$$

where Γ() is the gamma function.
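The torsional distance and the Euclidean ball volume can be sketched as follows (helper names ours; the periodic distance is written with a modulo so that inputs need not be pre-reduced):

```python
import math

def torsion_distance(a, b, period=2.0 * math.pi):
    """Distance between two samples of one torsion angle,
    accounting for periodicity."""
    delta = abs(a - b) % period
    return min(delta, period - delta)

def torsion_space_distance(chi_a, chi_b, period=2.0 * math.pi):
    """Euclidean product metric over m torsion angles."""
    return math.sqrt(sum(torsion_distance(a, b, period) ** 2
                         for a, b in zip(chi_a, chi_b)))

def euclidean_ball_volume(d, m):
    """Volume of a ball of radius d in m-dimensional Euclidean space."""
    return math.pi ** (m / 2.0) * d ** m / math.gamma(m / 2.0 + 1.0)

d1 = torsion_distance(6.1, 0.1)     # wraps around: 2*pi - 6.0, not 6.0
V2 = euclidean_ball_volume(1.0, 2)  # area of the unit disk, pi
V3 = euclidean_ball_volume(1.0, 3)  # volume of the unit sphere, 4*pi/3
```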

| Implementation of the kth nearest neighbor method
We first discuss how the method is straightforwardly implemented. Provided that n samples are available, the distance d between any two samples is computed, and the volume of a ball in the sample space with radius d, corresponding to each distance, is computed. The method is then implemented through the following steps:

1. the matrix (d_ij) of the distances between all pairs of samples (i and j) is computed;
2. each row i (corresponding to all distances from sample i) is sorted;
3. for each sorted distance (d_ik, where k now indicates the kth shortest distance, discarding the distance from each sample to itself) the volume (V_{i,k}) of a ball with that distance as radius is computed;
4. the logarithm of the volume is substituted in Equation (30), resulting in the mean over each column (corresponding to the first, the second, …, the kth nearest neighbor).
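The four steps above can be sketched as a single routine (plain Python, names ours). It is dimension-agnostic: the caller supplies the distance function and the ball-volume function appropriate to the space, and one estimate in units of k_B is returned for each value of k:

```python
import math
import random

def knn_entropy_per_k(samples, dist, ball_volume, kmax):
    """Entropy estimates of Equation (30), in units of k_B, for k = 1..kmax."""
    n = len(samples)
    gamma = 0.5772156649015329
    # step 1: matrix of all pairwise distances
    dmat = [[dist(a, b) for b in samples] for a in samples]
    # step 2: sort each row, discarding the zero self-distance
    rows = [sorted(row)[1:] for row in dmat]
    estimates = []
    L = 0.0                                   # L_{k-1}, built iteratively
    for k in range(1, kmax + 1):
        # steps 3-4: log ball volumes of the kth column, averaged over samples
        mean_log = sum(math.log(n * ball_volume(row[k - 1])) for row in rows) / n
        estimates.append(mean_log - L + gamma)
        L += 1.0 / k                          # L_k for the next iteration
    return estimates

# one-dimensional example: the ball of radius r has volume 2r
random.seed(3)
xs = [random.gauss(0.0, 1.0) for _ in range(400)]
est = knn_entropy_per_k(xs, lambda a, b: abs(a - b), lambda r: 2.0 * r, 10)
# exact entropy of a standard Gaussian: 0.5*log(2*pi*e), about 1.419 nats
```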
Each value of k provides an estimate of the entropy.A straightforward implementation presents two main problems.First a choice must be made as to which value of k is used for entropy estimation.A compromise must be sought between low variance (i.e., large values of k) and fine resolution (small values of k).
Second, the implementation of the method may be very time-consuming, because in principle all pairwise distances must be computed, and the distances for each sample must be sorted, so that for n samples the computation is O(n² log n). A large number of samples is necessary to increase resolution; on the other hand, the mean operation implied by the sum over i = 1, …, n in Equation (30) could be performed using a much smaller number of samples than available. In one of the next sections we examine data structures and algorithms which greatly reduce the expected computational time.
To illustrate resolution and variance considerations we examine a simple 1D example.The probability density function (pdf) is the sum of four Gaussian functions with weights 4, 2, 1, 1, means 10, 25, 50, 54 and standard deviations 3, 5, 1, 1, respectively (Figure 3).This pdf has minor features which are resolved but close.We expect that a low number of samples will result in limited resolution and in overestimation of the entropy.Two sets of 30 and 1000 samples were drawn from the distribution and the entropy was estimated using Equation (30).
To illustrate the effect of resolution on entropy estimation, we graph the estimated entropy versus the average distance for each kth nearest neighbor (Figure 4). A clear increase of the entropy estimates with the mean neighbor distance is observed. It is reasonable therefore to choose the first nearest neighbors (the number is arbitrary, as long as the resolution is comparable to or finer than the length scales of the distribution), fit the data, weighted by the inverse of the variance, to a line, and use the intercept on the entropy axis as the estimate of the entropy.
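The weighted fit and extrapolation can be sketched as follows (helper name ours; in practice the weights would be the inverse variances from Equation (31)). The intercept at zero mean neighbor distance is taken as the entropy estimate:

```python
def weighted_line_intercept(x, y, w):
    """Weighted least-squares fit y = a + b*x; returns the intercept a."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx

# entropy estimates rising linearly with the mean kth-neighbor distance:
# the intercept recovers the zero-resolution limit
r_mean = [0.1, 0.2, 0.3, 0.4, 0.5]
s_est = [1.00 + 2.0 * r for r in r_mean]
weights = [1.0 / (k + 1) for k in range(5)]     # e.g., inverse variances
S0 = weighted_line_intercept(r_mean, s_est, weights)  # 1.0 up to round-off
```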
The effect of resolution is apparent in the first two rows of Figure 4. In the last row all n neighbors are considered for a limited number m of samples, in such a way that the time required for the computation is O(mn log n), with a reduction in computations by a factor m/n. n may also be reduced, at the price of losing resolution.
Another important technical point concerns the implementation of the method for multivariable samples spanning largely different ranges, for example, one or more orders of magnitude, possibly with different units of measure. When this happens, in a straightforward implementation, nearest neighbor distances will be dominated by the variables spanning the largest ranges, and the effect of the other variables will be negligible unless the resolution is very fine, that is, the number of samples is very large. The consequence is that the sampled space appears to have fewer dimensions than it truly has. This is illustrated by the following example: variable x is drawn from a uniform pdf ranging between 0 and 1, then y is drawn from a Gaussian distribution with standard deviation equal to 10 times x (Figure 5).
It is advisable in such cases to scale the variables so that they all have similar variances and thus contribute equally to the distance. The entropy change due to scaling must be corrected for afterwards, which may not be trivial if the metric is not Euclidean.
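For a Euclidean metric the correction is analytic: scaling a coordinate by a changes the differential entropy by log|a|. A minimal sketch on the example just described (uniform x, Gaussian y with standard deviation 10x), again assuming the Kozachenko–Leonenko estimator:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

rng = np.random.default_rng(1)

def knn_entropy(X, k=4):
    """Kozachenko-Leonenko estimator in d dimensions, Euclidean metric."""
    n, d = X.shape
    r = cKDTree(X).query(X, k=k + 1)[0][:, k]
    # log volume of the d-dimensional unit ball
    log_cd = (d / 2) * np.log(np.pi) - np.log(gamma(d / 2 + 1))
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(r))

# x uniform on [0, 1]; y Gaussian with sigma = 10 * x (the example of Figure 5)
n = 1000
x = rng.uniform(0.0, 1.0, n)
y = 10.0 * x * rng.standard_normal(n)
X_raw = np.column_stack([x, y])

# scale each column to unit variance before the neighbor search ...
s = X_raw.std(axis=0)
H_scaled = knn_entropy(X_raw / s)
# ... then undo the scaling analytically: H(aX) = H(X) + log|a| per coordinate
H_corrected = H_scaled + np.log(s).sum()
```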
In the example (Figure 5) the analysis of unscaled data results in an average overestimation of the true entropy by 0.4 k_B, because the unscaled distribution is essentially one-dimensional as far as distances are concerned at low resolution, and the mutual information between the two variables is completely overlooked. The two effects produce in this case a small effect on the entropy estimate. As can be seen, estimates improve when scaling is applied to the first variable, greatly reducing the average error in entropy.

FIGURE 4 The kth nearest neighbor entropy estimates plotted versus the kth nearest neighbor mean distance for the probability density function of Figure 3. In the left column data for all nearest neighbors are shown; in the right column data for the first 20 neighbors and the fitting line are shown. The entropy estimate obtained extrapolating the fitting line to zero mean nearest neighbor distance is shown as a filled sphere. The horizontal thin line is the exact value of the entropy for the probability density function. In the top row only 30 samples are analyzed. In the middle row 1000 samples are analyzed. In the bottom row 1000 samples are used, but the logarithm of the volume entering Equation (30) is averaged only over 100 samples.

| Data structures and algorithms for kNN searches
One of the major drawbacks of the kNN method is that, as implied by Equation (30), a naive implementation requires the computation of all Θ(n²) distances between samples. As noted by Vaidya, 67 the problem of finding the kth nearest neighbors of each point in a set of n samples has a lower bound of Ω(n log n) in the algebraic decision tree model of computation. To avoid the quadratic run-time complexity, in a recent review Brown et al. 68 consider appropriate data structures which may reduce computational times for nearest neighbor searches, both for periodic and nonperiodic variables.
In particular, K-D and Vantage Point (VP) tree structures, which partition the space into regions, are well suited for this purpose because, for each sample, its neighbors are found by visiting only some (but, in the worst case, all) of the regions. Adopting such data structures implies that algorithms for computing the samples' neighborhoods consist of two phases: first the tree is built from the set of samples, and then the tree is used to perform n searches, one for each of the n samples (Section SI.2).
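The two phases can be illustrated with SciPy's generic k-d tree (not the specific implementations benchmarked in the cited works):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
samples = rng.standard_normal((10_000, 3))

# Phase 1: build the tree once from all samples, O(n log n)
tree = cKDTree(samples)

# Phase 2: one k-nearest-neighbor query per sample, O(log n) each on average
k = 4
dist, idx = tree.query(samples, k=k + 1)   # self is returned at distance 0
kth_dist = dist[:, k]                      # distance to the kth true neighbor
```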
Borelli et al. 69 analyzed the performance of K-D Tree, VP Tree, and naive algorithms in both Euclidean and non-Euclidean spaces in realistic biophysical scenarios, demonstrating the efficiency of the K-D Tree method in Euclidean spaces and the good performance of the VP Tree method in non-Euclidean spaces, with time savings of almost an order of magnitude with respect to the naive method, even in high-dimensional spaces.
K-D Tree construction and nearest neighbor search algorithms have also been implemented on GPUs with significant acceleration of kNN computations, in particular in scenarios involving high dimensions 70,71 (Section SI.2). It is also worth mentioning that the algorithms proposed by Hu et al. 71 can easily be adapted to work with VP Trees.

The kNN method has been applied considering different degrees of freedom relevant for different problems in molecular simulations. The metric and the kind of mutual information among degrees of freedom are specific to each application. In the following we summarize contexts where the kNN method has been used in molecular sciences and provide examples from the literature.

| Conformational entropy
Conformations of molecules may be described using Cartesian or internal (for example, BAT) coordinates. The space of Cartesian coordinates is rather problematic for computing configurational entropy because of the extensive correlations among variables. Nevertheless, one of the most widely used approaches for entropy estimation, owing to its computational efficiency, is the quasi-harmonic approach 27,32 and its modifications, 12 which use Cartesian coordinates. The main limitations of the quasi-harmonic approach are the non-Gaussian distribution of the amplitudes of linearly independent motions and the neglect of non-linear relationships among linearly independent motions, calling in turn for the treatment of a large dimensional space. 34,72 Ways to circumvent these limitations have been devised and reviewed. 12 In this context Knapp, Grubmueller et al. 73-76 used the kNN method to estimate corrections to the approach, considering both the non-Gaussian behavior of principal component motions and the non-linear dependence of pairs of principal components, via the estimation of mutual information. The final entropy estimate was obtained by summing corrected single-variable entropies and pairwise mutual informations. The approach of Numata et al. 76 was also used by Mukherjee et al. to calculate changes of solute entropy upon binding. 77 The results confirmed that the upper bound on the entropy 31,78 provided by the quasi-harmonic analysis greatly overestimates the entropy, and that accounting for pairwise mutual information and, to a lesser extent, the anharmonicity of the distributions substantially lowers this upper bound. 76 Recently, Marx et al. have used internal coordinates (bond lengths and azimuthal and polar angles) to describe the conformation of a fluxional molecule (protonated acetylene) and used the kNN method to assess its entropy.
79,80 The authors in particular assessed the two-, three-, and four-variable correlations using interaction information, expressed in terms of n-variable entropies, similar to the MIE expansion. 81

Hnizdo et al. first reported the application of the kNN method for estimating the configurational entropy 82 of molecules, using only torsional degrees of freedom. Their work also highlights the need to deal efficiently with the entropy of several variables, necessary to estimate the entropy of the full-variable distribution. In particular they use the kNN method to estimate pairwise mutual informations and use them to define a matrix of association coefficients, used in turn to cluster sets of variables. The entropies are then computed for each set and summed to provide an upper bound on the configurational entropy. To deal with the large dimensionality of the problem the kNN method has been used in combination with two main approaches based on mutual information: (i) a truncated mutual information expansion (MIE), which has shown its potential in molecular applications where the important correlations mostly involve torsion pairs 81,83,84 ; (ii) the maximum information spanning tree (MIST) method proposed by Tidor et al., 85,86 also involving torsion pairs. 84 The MIST approach considers pairwise mutual informations and builds a tree corresponding to a chain of conditional entropies, which provides the lowest upper bound on the entropy obtainable using mutual informations involving at most two variables. The approach has also been used grouping variables together and considering the mutual information between pairs of groups of variables. 53

The approach has been used to assess the effect of correlation on the conformational entropy of a host-guest system, 84 to find the entropy of unfolded protein chains, 87 the change of entropy upon folding 87 and binding, 88-91 the change of entropy upon chemical modifications 92 and mutations, 93 and to assess the relationship of the conformational entropy of proteins with the NMR-measured HN order parameter, 87 which has been proposed as an experimental "proxy" for conformational entropy. 94,95 In all these works the periodic Euclidean metric was used, but other metrics could be used as well, as detailed by Hnizdo et al. 96

The kNN method for conformational entropy is currently implemented in a few software programs. PDB2ENTROPY 53 uses trajectories in standard PDB format to compute torsion angles, according to user definitions. A file containing standard torsion angle definitions for proteins and nucleic acids is provided, and it can easily be modified to define other torsion angles. Two schemes may be adopted: (i) each residue is treated independently of the others, that is, only correlations within each residue are considered; (ii) torsion angles within each residue are grouped in fixed (small) size sets. The entropy is computed for each set, and the mutual information is computed for pairs of sets which have at least one atom within a chosen cutoff. Finally the MIST approach is used to provide a (possibly strict) upper bound on the entropy. If higher order correlations may be neglected, the bound is close to the true entropy. For easier interpretation, mutual informations are divided between the sets and entropies are also listed by residue.
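The periodic Euclidean metric used in these works can be handled with a toroidal k-d tree, as in the following sketch. The bimodal torsion pair below is synthetic and purely illustrative (it is not data from the cited works), and the estimator is again assumed to be of Kozachenko–Leonenko form:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

rng = np.random.default_rng(3)

def knn_entropy_torsions(angles, k=4):
    """kNN entropy for torsion angles (radians) with the periodic
    Euclidean metric, implemented via a toroidal k-d tree."""
    n, d = angles.shape
    a = np.mod(angles, 2 * np.pi)          # boxsize requires data in [0, 2*pi)
    tree = cKDTree(a, boxsize=2 * np.pi)   # periodic (wrap-around) distances
    r = tree.query(a, k=k + 1)[0][:, k]
    log_cd = (d / 2) * np.log(np.pi) - np.log(gamma(d / 2 + 1))
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(r))

# two correlated torsions with a rotamer-like bimodal distribution
n = 5000
state = rng.choice(2, n)
phi = np.where(state, -1.2, 1.0) + 0.3 * rng.standard_normal(n)
psi = np.where(state, 2.0, -0.5) + 0.3 * rng.standard_normal(n)
S = knn_entropy_torsions(np.column_stack([phi, psi]))
S_rel = S - 2 * np.log(2 * np.pi)   # entropy relative to the uniform distribution
```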
In general, for practical purposes, simulations should be equilibrated, which means that the conformational space of correlated variables should be thoroughly explored during the simulation. In practice, simulations of hundreds of nanoseconds are typically enough to sample torsional angles and pairs of correlated torsional angles in such a way that each conformational region is explored several times. If this is the case, the sampling frequency should be such that the number of samples is large enough to achieve adequate resolution (for example, 5000-10,000 samples for pairs of torsion angles) and, on the other hand, such that the nearest neighbors of each sample are not correlated with it in time. Typically a spacing of 10-20 ps is an appropriate choice. The time required, for example, on a 16-core 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50 GHz to analyze a system entailing 866 torsional angles and 8745 torsional angle pairs for MIST analysis is 120, 1290, and 15,770 s, using 1000, 3000, and 10,000 samples, respectively. The running time scales, in a straightforward implementation, as n² log(n), where n is the number of samples.

As an example of application we consider here the amino acid phenylalanine, which is described in the BAT representation by four torsional angles (ϕ, ψ, χ1, and χ2; bond lengths and bond angles are neglected here, Figure 6). A sample of the conformations explored in the unfolded state is provided by a set of 5000 conformations obtained from a non-redundant set of 3600 crystallographic protein structures in the culled PDB dataset.
97 For such an ensemble, the entropies associated with each torsional angle, with respect to a uniform distribution, are: −1.1, −0.8, −1.2, and −0.6 k_B for ϕ, ψ, χ1, and χ2, respectively. The maximum information spanning tree includes the mutual informations between ϕ and ψ (−0.6 k_B), ψ and χ1 (−0.5 k_B), and χ1 and χ2 (−0.5 k_B). The total entropy is estimated as −5.3 k_B. As an ensemble representative of a specific phenylalanine in a folded state we considered 2000 snapshots of Phe70 from a 200 ns molecular dynamics simulation of β2-microglobulin. As expected, and apparent in Figure 6, the entropies associated with each torsional angle in the folded state, with respect to a uniform distribution, are lower: −1.9, −2.0, −2.4 and −1.2 k_B for ϕ, ψ, χ1, and χ2, respectively. The maximum information spanning tree includes the mutual informations between ϕ and ψ (−0.1 k_B), ϕ and χ1 (−0.1 k_B), and χ1 and χ2 (−0.1 k_B). The mutual information of rather restricted torsion angles is smaller than for the wider conformational space explored in the unfolded state. The total entropy is estimated as −7.8 k_B. The reduction in conformational freedom of β2-microglobulin Phe70 upon folding thus results in a 2.5 k_B entropy loss. The analysis reported here can be extended to all torsional angles and to the mutual informations between close torsional angles, to obtain the whole protein conformational entropy.
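The bookkeeping behind these two estimates can be made explicit: the MIST upper bound is the sum of the single-torsion entropies minus the mutual informations on the spanning tree (the mutual informations are nonnegative; the text quotes them as negative contributions to the entropy):

```python
def mist_entropy(singles, tree_mi):
    """MIST estimate: sum of single-variable entropies minus the
    mutual informations on the maximum information spanning tree."""
    return sum(singles.values()) - sum(tree_mi.values())

# unfolded-state numbers quoted above (k_B, relative to uniform)
S_unfolded = mist_entropy(
    {"phi": -1.1, "psi": -0.8, "chi1": -1.2, "chi2": -0.6},
    {("phi", "psi"): 0.6, ("psi", "chi1"): 0.5, ("chi1", "chi2"): 0.5},
)
# folded-state numbers for Phe70 of beta2-microglobulin
S_folded = mist_entropy(
    {"phi": -1.9, "psi": -2.0, "chi1": -2.4, "chi2": -1.2},
    {("phi", "psi"): 0.1, ("phi", "chi1"): 0.1, ("chi1", "chi2"): 0.1},
)
delta = S_unfolded - S_folded   # entropy loss upon folding, 2.5 k_B
```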

| Positional-orientational entropy
Another outstanding field of application of the kNN method is the calculation of positional-orientational entropy changes upon binding. Following McCammon et al., 4 the term positional-orientational entropy is preferred over translational-rotational entropy to avoid confusion with the entropy associated with linear and angular momenta. Notwithstanding the caveat about the decomposition of entropy into independent contributions, if the positional-orientational state description of the molecule is properly chosen, it may be reasonably decoupled from the internal degrees of freedom, as discussed above. The thermodynamics of binding has been described in excellent reviews. 4,7,8,10,98 The calculation of positional-orientational entropy is relevant for binding, where two molecules associate, but also, as discussed in the next subsection, for solvation, where the positions and orientations of water molecules are changed by the presence of the solute and by the other waters.
The kNN method for positional-orientational entropy, using the theory reported above, has been implemented in the program PDB2TRENT. 53 PDB2TRENT combines orientational and positional distances to define distances in the orientation/position space of the molecules and to compute entropies based on the kNN method, which is particularly useful to estimate the entropy changes upon binding of molecules.
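One way to combine the two distance components is sketched below, with a quaternion geodesic for the orientational part; the actual metric and weighting used in PDB2TRENT may differ, and the weight `w` is an illustrative parameter:

```python
import numpy as np

def quat_distance(q1, q2):
    """Geodesic angle between two orientations given as unit quaternions,
    accounting for the q / -q double cover of rotation space."""
    return 2.0 * np.arccos(np.clip(abs(np.dot(q1, q2)), -1.0, 1.0))

def combined_distance(x1, q1, x2, q2, w=1.0):
    """A plausible product metric: Euclidean in position plus a weighted
    quaternion geodesic term for orientation."""
    dpos = np.linalg.norm(np.asarray(x1) - np.asarray(x2))
    return np.sqrt(dpos**2 + (w * quat_distance(q1, q2))**2)
```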
Applications have been reported for the entropy of protein association with hydrophobic surfaces, 88 with proteins, 65 protein-ligand binding, 87 and peptide liquid-liquid phase separation. 99,100 As a practical consideration, thorough sampling of the six-dimensional position/orientation space would require a very large number of samples. On the other hand, the relative position of a ligand with respect to its target is typically restricted to a space of ≈1 Å³ and ≈1 degree³. In practice 10,000 samples are typically sufficient to achieve the necessary resolution.
As an example, we consider the association of two proteins, in the present case two transthyretin monomers. The reference state is the ideal freely rotating, 1 M concentration state of both proteins. The associated state has been simulated for 250 ns and 25,000 snapshots have been collected to provide a representative ensemble. The positional-orientational entropy of the associated proteins is computed considering the first of the two monomers in the ideal freely rotating, 1 M concentration state, and considering the configurational restriction of the second monomer with respect to the first one. To do this, the complex is first superimposed using the backbone atoms in the secondary structure regions of the first monomer; then the positional-orientational entropy of the second monomer, now restricted in position and orientation, is computed, again using the backbone atoms in secondary structure regions (Figure 7). The whole calculation (superposition and entropy estimation) takes about 130 s on a 16-core 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50 GHz. The results show that the positional entropy is −6.0 k_B and the orientational entropy is −9.7 k_B. The mutual information between the orientational and positional degrees of freedom lowers the entropy by an additional 1.8 k_B, for a total positional-orientational entropy of −17.5 k_B.
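The superposition step can be sketched with the Kabsch algorithm (an SVD-based least-squares fit); the selection of the backbone atoms of secondary structure elements is assumed to be done elsewhere:

```python
import numpy as np

def kabsch_superpose(mobile, ref):
    """Least-squares superposition (Kabsch algorithm): returns `mobile`
    rotated and translated onto `ref`. Both are (N, 3) arrays of the
    atoms used for fitting, e.g. backbone atoms of secondary structure."""
    cm, cr = mobile.mean(axis=0), ref.mean(axis=0)
    P, Q = mobile - cm, ref - cr            # centered coordinates
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against improper rotations
    R = (U * np.array([1.0, 1.0, d])) @ Vt  # optimal rotation (row-vector form)
    return (mobile - cm) @ R + cr
```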

| Solvation entropy
Hydration is perhaps one of the most important fields where the kNN method is currently applied. The problem involves the positional-orientational entropy of each water molecule and its correlations. The large dimensionality of the space precludes the use of histograms to estimate probabilities. An additional complication arises from the indistinguishability of solvent molecules, which calls for specific treatment of molecular dynamics trajectories, where molecules and their symmetric parts bear a specific label.
Berne et al. were the first to recognize the distinctive advantages of the kNN method 101 over more traditional methods for estimating hydration entropies. In their approach the excess entropy of water due to the pair correlation function was treated using a single radial variable and five orientational variables, following Lazaridis and Karplus. 102 A number of approximations on the multivariable distribution, including recasting the oxygen-oxygen distance distribution into a three-value distribution and the generalized Kirkwood superposition approximation, made the problem tractable, and the results were in good agreement with more costly and accurate pathway methods.
Starting from inhomogeneous solvation theory (IST), 103,104 Gilson et al. used a grid approach to estimate the entropy of waters around a solute, neglecting water-water correlations. 105 The orientational part of the single-water entropy was estimated using the kNN method. A further step was using the kNN method for both the orientational and positional entropies, and also for the full positional-orientational entropy, 106,107 in the software GIST, using the Euclidean product metric, 62 and later in the software SSTMap. 108 Liedl et al. made a GPU implementation of the GIST approach (GIGIST) available. 63,109 The same authors have recently been extending the GIST approach to salt-water solutions and to other, rigid, solvents. 110,111 The method is implemented in the software packages GIST, 106,107 SSTMap, 108 PME-GIST, 112 and GIGIST, 63,109 which are based on single-molecule entropies computed on grid cells. A different approach is implemented in the software Per|Mut, 64,113 which uses permutation of water labels to increase single-water sampling and the mutual information expansion 83 to deal with the large dimensionality.
It must be noted that it is the kNN method that enables the treatment of the high-dimensional position-orientation space of single waters.
The entropy arising from water-water correlations is typically neglected in these approaches. Based on the observation that for many solutes the solute-water and water-water interaction energy contributions to the solvation energy are in a ratio of −1/2, Huggins 62 proposed to set the entropy change in water-water position-orientation equal to −1/2 of the single water molecule entropy. Apparently such a heuristic approximation gives very good results for the hydration entropy of small molecules. Very recently Kurtzman et al. reported a detailed study of the solvation entropy of 32 small molecules computed with a novel version of GIST, PME-GIST, 112 and showed that higher order water entropy terms can be estimated as −0.4 times the first order water entropy. 112 This value is suggested by linear fitting of the computed entropies versus the reference values provided by thermodynamic integration, and is fully justified by the authors on theoretical grounds. PME-GIST, which uses the kNN method, is implemented in the software CPPTRAJ. 114 The accuracy of their approach on test molecules, which sample almost continuously a 20 kcal/mol solvation free energy range, is 0.4 kcal/mol (mean unsigned error).
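Both heuristics amount to a simple rescaling of the first-order (single-water) term. A sketch, where the function name and the example input are ours, while the factors come from the cited works:

```python
def total_water_entropy(first_order_S, factor=-0.4):
    """Heuristic total water entropy from the first-order term:
    S_total ~ (1 + factor) * S_1, with factor = -0.4 (PME-GIST linear fit)
    or factor = -0.5 (Huggins' interaction-energy-ratio argument)."""
    return (1.0 + factor) * first_order_S

# e.g. a hypothetical first-order entropy of -10 k_B
S_pme = total_water_entropy(-10.0)             # -0.4 factor
S_huggins = total_water_entropy(-10.0, -0.5)   # -0.5 factor
```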
It is worth noting here that the comparison of kNN-based solvation entropies with those obtained, for example, by pathway methods provides a test which is difficult to perform for conformational and positional-orientational entropies, except in the gas phase. In solution, indeed, pathway methods provide free energies which also include solvation, making the comparison not straightforward.
FIGURE 7 Ensemble of 50 configurations of a transthyretin dimer shown in cartoon representation. The backbone atoms of the secondary structure elements of the leftmost monomer (shown in white transparent material) have been used to superimpose the whole complex. The positional-orientational entropy is computed for the rightmost monomer (in red) using the backbone atoms of its secondary structure elements.
The positional part of the entropy arising from water-water correlations was addressed using the kNN method, 115 and the full two-molecule entropy was estimated by Huggins 116 by considering the 12-dimensional space of the positions and orientations of two waters, restricting the analysis to pairs within a cutoff of 4 Å.
Most of the approaches mentioned above face the problem of indistinguishability of solvent molecules by using water densities on small voxels and/or permutation of labels 117 as proposed by Grubmueller et al. 118,119 Recently the same authors have used permutation of labels to compute hydration entropies 64,113 using the kNN method coupled with a mutual information expansion up to third order, first for orientational entropies (using the quaternion metric on the single-orientation space and the Euclidean product metric for pairs and triplets of orientations) and later for positional-orientational entropies, with excellent results.
The specific combination of the kNN method and optimal relabeling of water molecules for hydration thermodynamics has been reviewed by us recently. 120 The approximations involved in the approach imply very small errors in the estimation of single-molecule entropies. The exact treatment of the full positional-orientational space of two molecules 66 enables in principle the computation of entropy terms due to water-water correlations.
For practical purposes, due to the fast reorientation of bulk water, a sampling frequency of 1 ps may be used, and simulations, considering also the more hindered surface waters, may be in the range of tens of nanoseconds. The running time of a thorough analysis depends on many factors, such as the extension of the solvent region considered, the preprocessing of the trajectory, the spacing of the grid if a grid is used, and so on. Considerations similar to those for the computation of positional-orientational entropy, however, also apply here.

| CONCLUSIONS
The kNN method has distinctive advantages over other methods for estimating the entropy of biomolecular processes:

• it does not depend on a specific model for the probability density function of the variables;
• it allows for the decomposition of entropy in terms of single variables and groups of variables describing configurations;
• it adapts the neighborhood of each sample to the estimated probability density at that sample, providing fine resolution in dense regions of configurational space, that is, in the regions contributing most to the entropy;
• not requiring predefined intervals for estimating the density of the distribution, it is suitable for large dimensional spaces.
The disadvantages are:

• the large computational load, for which, however, straightforward parallelization is possible; moreover, specific data structures and algorithms may be used, largely reducing the expected computational time;
• samples should be independent of each other, which means that samples must be spaced in time, which calls for long simulations.
The recent availability of software performing kNN entropy estimation makes this method a useful tool for the analysis of the complex probability density functions of complex molecules in solution.

FUNDING INFORMATION
This work was partly supported by a grant by the European Union - Next Generation EU (University of Udine CUP: G25F21003390007).


FIGURE 3 Example of a probability density function with features on different length scales.

FIGURE 5 In the left column, data and analysis for the raw data; in the right column, corresponding data and analyses for data scaled so that the variances are the same in both dimensions. The entropy displayed in the right column is corrected for scaling. In the top row the 100 samples are shown; in the middle row the computed entropy versus mean neighbor distance for all values of k; in the bottom row the line fit, with points weighted by the inverse of the variance, for the first 20 neighbors. The estimated entropy is indicated by a filled symbol. The true entropy is indicated by a thin horizontal line.
FIGURE 6 Entropy of a phenylalanine residue in an unfolded and a folded protein. Top left panel: phenylalanine residue torsion angles; top right panel: phenylalanine 70 in the folded β2-microglobulin protein; bottom left panel: 50 coil conformations from a non-redundant data set, representative of the unfolded state; bottom right panel: 50 conformations from a 200 ns molecular dynamics simulation of the folded protein.