Unraveling the Non‐Homogeneous Dispersion Processes in Ocean and Coastal Circulations Using a Clustering Approach

Dispersion processes in environmental flows have been traditionally studied under the strong assumption of homogeneous, isotropic and stationary turbulence. To overcome this limitation, we propose a new approach that combines autocorrelation analysis of simulated Lagrangian trajectories together with unsupervised clustering. To test the approach, we consider several dynamic scenarios around a coastal gulf, subject to different forcing, in order to compare our method with other approaches. Lagrangian trajectories forced by the varying coastal circulation exhibited different behaviors, looping and non‐looping paths, and produced a variety of Lagrangian autocorrelation functions. Our approach proves to be able to reveal spatio‐temporal variations in ocean dispersion processes without any a priori knowledge of the character of the trajectories. Clusters based on the autocorrelation functions are associated to different inhomogeneous dispersion processes. Finally, we propose a new stochastic model capable of predicting the different forms of autocorrelations.


Introduction
The ability of ocean and coastal circulations to transport heat and mass has always been considered of paramount importance due to the impact on atmospheric climate (Holmes et al., 2019) and the distribution of nutrients essential for sustaining ocean life (Lu & Gan, 2023).Mass and heat transport are strongly influenced by the local dispersion processes, which in turn are controlled by ocean and coastal circulations.Material dispersion by environmental flows have been traditionally studied following Taylor's theory (Taylor, 1921).He was inspired by observing that turbulence on geophysical scales spreads heat and substances similarly to what happens in molecular diffusion.Following this idea, he discovered the connection between the time evolution of particle dispersion and the velocity autocorrelations of the underlying flow.This also implies the existence, for large enough times, of an asymptotic regime with an associated turbulent diffusion coefficient (Eddy Diffusivity).
Although these fascinating results had been proven to be effective in certain real contexts (LaCasce, 2008), the assumption of homogeneous, isotropic and stationary turbulence is often not correct in geophysics.Therefore, several methods have been proposed to overcome this limitation, usually detecting homogeneous subsets where Taylor's hypotheses are restored.Examples of methodologies, aimed to obtain spatial distributions of the diffusivities, can be found in Davis (1991), later adopted by Jakobsen et al. (2003).Moreover, coherent structures are known to affect dispersal statistics.Indeed, ocean eddies can trap particles for periods of the order of their typical life-time (Provenzale, 1999), showing loops in Lagrangian trajectories (loopers).In Richardson (1993), the importance of coherent eddies in setting the oceanic kinetic energy budget was demonstrated by classifying float trajectories with two or more consecutive loops in the same direction.Afterward several studies aimed to make this classification more systematic.
In particular, Veneziani et al. (2004) studied the loopers distribution in North Atlantic mesoscale turbulence by thresholding a scalar quantity, called spin, to automatically separate trajectories.They were able to identify geographical areas dominated either by looper or non-loopers, hence quantifying dispersal on statistically homogeneous regions.This approach was inspired by thinking Lagrangian velocities as realizations of a linear stochastic process with a deterministic term that corresponds to a uniform circular motion.A similar strategy was developed in Rupolo (2007), where trajectories were classified introducing a new dimensionless parameter, defined as the ratio between the acceleration and velocity time scales, that was inspired by a model in which deterministic circular motions are substituted by adding accelerations in the stochastic process.This strategy allowed a worldwide evaluation of dispersal properties in the main oceanic current systems.Furthermore, Griffa et al. (2008) focused the attention on dynamics in the upper ocean following the same approach of Veneziani et al. (2004) to analyze data from drifter measurements.The same data set was exploited by Lumpkin (2016) to propose a more robust criterion for loopers detection.Instead of considering chunks of trajectories of a fixed duration (as done in previous works), they split each trajectory into intervals where spin persisted keeping the same sign, then judging as loopers those intervals whose duration was at least twice the period of the circular motion associated to the corresponding spin value.Other works addressed loopers detection from purely geometric perspectives.Dong et al. (2011) presented a methodology to identify trajectories which cross themselves in at least two points, looping in the same direction and thresholding on the mean period of rotation.Methods proposed in Lilly and Olhede (2009) and Lilly et al. (2011) were designed to separate from background flow the motion due to waves and vortices.A valuable feature of their approach was to allow a description of trajectories that went beyond the simplifying assumption of near-circular loops.The effect of eddies on tracer dispersion has been also largely studied in the Eulerian framework.Particularly in Haigh et al. (2020Haigh et al. ( , 2021aHaigh et al. ( , 2021b) ) a quasi-geostrophic ocean motion was simulated to systematically investigate the effect of small-scale structures on Eddy Diffusivity fields of tensor form.Finally, recent advances in Deep Learning and Artificial Intelligence have been introduced also in this field.For example, Xu et al. (2019) borrowed state-of-art techniques from Computer Vision to detect surface eddies from SSH data outperforming more traditional algorithms.Moreover, a clustering approach has been introduced by I. M. Koszalka and LaCasce (2010) on Lagrangian drifter data.In the latter study, the binning technique used by Davis (1991) was replaced by the use of a clustering algorithm of the drifter-derived velocities.
In our study we go back to Taylor's statement that, for a homogeneous environment, the dispersal process is driven by the velocity correlation.We apply unsupervised Machine Learning to automatically identify clusters of trajectories with similar velocity autocorrelations.While being based on a solid theoretical result this method is fully data-driven, hence overcoming limitations related with any a priori modeling assumption about the underlying processes and avoiding technical issues like tuning thresholds.The method is compared with previous strategies on changing dynamic scenarios around a coastal gulf (in terms of both circulation and wind).Our approach can automatically depict spatio-temporal variations in ocean dispersion, recovering also the already known phenomenology, particularly loopers.Finally, to interpret our results we propose a stochastic model that describes well all different profiles of autocorrelation identified by the algorithm.Discussing the meaning of model parameters also gives the opportunity for a critical review of quantities which have been traditionally used to characterize dispersion.

The Study Area and the Circulation Model
For the present analysis, we selected as study area the Gulf of Oristano (Italy), located on the west coast of Sardinia Island (approximately between 8.48 E and 8.68 E and 39.78 N and 39.98 N,Italy).It is a shallow water semi-enclosed bay of about 150 km 2 (Figure S1 in Supporting Information S1) with a 9 km wide opening into the open sea (Western Mediterranean Sea).We refer to the Supporting Information S1 for further details.
In this study, we adopted SHYFEM to reproduce the surface water circulation inside and outside the Gulf during a whole meteorological year including the main atmospheric and oceanographic forcing.The same model setup adopted in Farina et al. (2018) was used to reproduce the main surface transport path in the area.Numerical simulations were performed to reproduce the surface water circulation during the years 2017 and 2018.The model outputs, consisting of the horizontal components of the current velocity computed at 2.5 m depth and at hourly frequency, were resampled on a 100 m regular mesh.Only the data obtained from the second-year model run were used for the analysis in order to avoid any disturbance generated by the initial conditions of the model.

Clustering Technique and Its Application to Lagrangian Dispersion
Clustering is a branch of unsupervised learning that aims at finding groups (clusters) of nearby points in a given data set.A cluster is then meaningful if distances between its points are significantly smaller than distances between its points and those belonging to other clusters.Specifically, K-means divides the data space into K clusters producing for each a corresponding prototype μ k (or in other words the centroid of the cluster) and a label indicating to which cluster each sample data belongs.Mathematically this problem can be stated as follows: given a data set of N points {x 1 , …, x N } in a D-dimensional Euclidean space with norm ‖…‖, solve the optimization problem of finding arg min ( μ k ,r nk ) J, with where r nk is 1 if the nth sample is assigned to the kth cluster, identified by the prototype μ k , and 0 otherwise.In other words, we seek for a pair (μ k , r nk ) that minimizes the sum of the squares of the distances of each point to its associated prototype.It can be seen easily that for a fixed set of prototypes the following definition of r nk is optimal: Moreover, for a fixed partition r nk of a given data set, a minimum of the cost function J is given by which means that each optimal prototype is in fact the centroid of all points in the corresponding cluster.Equations 2 and 3 represent a solution of the K-means problem, for more details on the algorithm see for example, Bishop and Nasrabadi (2006).
A key point of our method, and the main difference compared to previous studies (I.Koszalka et al., 2011; I. M. Koszalka & LaCasce, 2010), is to apply clustering to Lagrangian velocity autocorrelations of the ith Lagrangian velocity component u L i , defined following Taylor (1921) as: where the brackets 〈⋅〉 in ρ L ii (τ) indicate an average over the entire duration of each trajectory.Moreover, the integral Lagrangian time scale along the ith direction can be then evaluated as the time integral of Equation 4.

Geophysical Research Letters
10.1029/2023GL107900 Four test cases were performed varying the main parameters of the Lagrangian simulations, see Table S1 in Supporting Information S1.Hereinafter, we only show the results for the Test 1, and we refer to the SI document for comparison among the test cases.
Specifically the components of a sample vector x n are values of either R uu (τ) or R vv (τ) of a given Lagrangian trajectory, at each time lag τ considered.Two separate clusterings have been performed for R uu and R vv .The clustering algorithm was preliminary tested to find the optimal number of clusters, see SI document for the details.

Stochastic Model for Surface Tracers
Velocity autocorrelation profiles returned by our machine learning approach are physically interpretable.However, some of them cannot be described through models which were used in previous homogeneous and stationary contexts (Lumpkin, 2016;Veneziani et al., 2004).
To make a step toward a physical representation of this heterogeneous behavior, we propose a stochastic model for the velocity of our tracers, extending the model proposed in Veneziani et al. (2004Veneziani et al. ( , 2005)).Here the two components of the Lagrangian velocity are described as a bivariate linear stochastic process with the form: where the equations for U and V in the system (6) are the same velocities defined in the model presented in Veneziani et al. (2004), which was already able to produce looping trajectories, for example, Here we add two more independent one-dimensional processes U and V, with exponential decaying autocorrelation functions to each velocity component.Assuming all decaying time scales equal to T and sorting all constants, the whole process generates trajectories associated with velocity autocorrelations of the form: It is worth observing that Ω is the constant angular frequency associated to a circular contribution to particle motion, resulting by the coupling terms of first and second equation in system (6).This parameter has been usually called spin and estimated as: where (u′, v′) are the Lagrangian residual velocities, E c the kinetic energy, brackets denote average along a trajectory and Δt is the numerical time step.The spin or its associated rotational period P = 2π/Ω has been used in several studies to characterize the looping frequency (or the looping period) of the drifters' trajectories (Griffa et al., 2008;Lumpkin, 2016;Veneziani et al., 2004).In our analysis we mostly use Ω to indicate its instantaneous value (i.e., without applying the bracket average), and its average will be explicitly indicated.The constant C in Equation 7 plays the role of a coupling constant, tuning the relative importance of the two additional processes with respect to the initial two-dimensional model.

Results and Discussion
As a first result our method retrieves, in a fully data-driven way, typical behaviors that were already identified in previous works.In particular, profiles 1 and 2 for R uu (Figures 1a1 and 1a2) and profiles 1, 2, and 3 for R vv (Figures 1b1-1b3) are immediately associated with the well-known looping trajectories.Exponentially and nearly exponentially decaying autocorrelation functions can be also recognized, for example, in cluster 7 (Figures 1a7  and 1b7).Other clusters may be related either to peculiar regimes or to transitional cases, as discussed in the following paragraphs.
The joint occurrence of R uu and R vv can be also evaluated, see the SI document for the details on the computation.The joint occurrence matrix, built on the entire 72 simulated initial conditions across the year (Figure 1c), shows the highest frequencies in diagonal and nearly diagonal entries, indicating that similar profiles tend to occur simultaneously in both components.The strong diagonality observed indicates that the dispersion process tends to be isotropic, while preserving a strong non-homogeneity in space and time.
Furthermore, the assignment of each simulated trajectory to a cluster provides spatial patterns of the transport process across the studied domain (Figures 1d and 1e).Such patterns consist of evolving geographical patches that can be reviewed as homogeneous in terms of dispersal behavior.Note that the functions shown in the panels (a1-a7), (b1-b7) and (c) are the output obtained by clustering on the whole set of simulations of 120hr duration, whereas panel (d) and (e) are the results of a specific simulation with a duration of 120hr.In the Supporting Information S1 we provided the movie showing the time evolution of the clusters for the entire simulated year based either on R uu or R vv , and the corresponding joint occurrence matrix (clustermovie.mp4(Movie S1)).It can be observed how the dispersion process is characterized by relatively strong inhomogeneities in space and time, which might be ascribed to the shoreline complexity and the variability of the circulation forcing.
A remarkable capability of our approach is that it allows disentangling of complex scenarios that may occur under different dynamical conditions.Panel (a) in Figure 2 shows the abundance of both R uu and R vv clusters for each non-overlapping time sequence (see Supporting Information S1 for details).The complexity of our case study immediately emerges, with transitions from looping to non-looping dominated scenarios and vice versa.Spatial patterns can be used to identify which geographical areas are subject to specific dispersion regimes and to associate such regimes to physical processes that play a role in determining the local circulation: in particular, wind has been shown to be the main driven factor in this area (Cucco et al., 2016).Figure 2 shows different scenarios generated during four selected time sequences in which wind behaves differently in terms of either direction or intensity (Figures 2b1,2c1,2d1,and 2e1).In previous works, the identification of loopers was based on thresholding absolute values of the spin parameter Ω (Griffa et al., 2008;Lumpkin, 2016).Here, we find that spatial patterns of high spin (Figures 2b2,2c2,2d2,and 2e2) are in agreement with those of simultaneously oscillating R uu and R vv profiles.However, defining loopers in terms of Ω, which gave good results in homogeneous cases, might be more sensitive to the choice of a threshold in our study area.Indeed, here the kinetic energy (that appears in the definition of Ω in Equation 8) might be not uniformly distributed, being sometimes subject to strong gradients (Figures 2b3,2c3,2d3,and 2e3).A good property of our approach based on profiles of autocorrelation function is to be more robust against this possible issue.The relation between spin and kinetic energy for each cluster is further supported by the scatter plots reported in the Supporting Information S1 (Figure S7 and S8).Moreover, any method based only on spin does not allow to identify further regimes like those depicted in Figures 2e4 and 2e5 where profiles which exhibit negative lobes in the autocorrelation (Figures 1a5  and 1b5) are dominant in the gulf.Negative lobes are often associated to non-Brownian diffusive regimes, both sub-diffusive and super-diffusive regimes (Berloff et al., 2002;De Leo et al., 2022).In particular, sub-diffusion is associated to a pronounced negative lobe, whereas a negative lobe followed by a positive lobe is linked to superdiffusive processes.Examples of these anomalous dispersion regimes have been observed in the transport due to oceanic gyres (Berloff et al., 2002;Veneziani et al., 2004) and in coastal regions as well (De Leo et al., 2022;Enrile et al., 2019).The general result that arises from our analysis is that strategies successfully used to characterize dispersion in homogeneous cases are not sufficient when applied to more complex scenarios.Our proposed method aims at tackling this lack of generalization in a fully data-driven manner.
If we now consider the functional form for the autocorrelations expressed by Equation 7, we can compute the distributions of the three parameters (T, P and C) by best-fitting each R uu and R vv with Equation 7, see Supporting Information S1 for the details of the nonlinear fitting.
The T and P time scales also appear in the form previously used for velocity autocorrelations (Lumpkin, 2016;Veneziani et al., 2004): However, this formulation cannot describe profiles such as those in panel (a3) of Figure 1.On the contrary, the formulation suggested by Rupolo (2007) is able to describe autocorrelation functions with a single pronounced negative lobe depending on the relative values of the velocity decorrelation time compared to the acceleration decorrelation time.However, the latter failed to describe strongly looping autocorrelation functions.
The new parameter C must be greater than or equal to zero and establishes the relative importance of looping and non-looping components in the stochastic process.Hence, cases described by Equation 9 values of 0 < C ≪ 1 are mixed cases dominated by loopers, and the weight of the exponentially decaying part grows as long as C increases.It can be also observed that this latter limit case of C ≫ 1 can be obtained by the particular condition C = P = 0, although in practice this combination never emerges from our best fits (Figures 3e  and 3h).
Combined with the data-driven approach proposed here, the estimation of parameters describing emergent profiles provides links to physical quantities traditionally used in dispersal studies and allows their change to be tracked across clusters.We have already mentioned the spin Ω as a successful quantity for loopers detection.In the ideal case when looping particles, deprived of their mean motion, undergo a uniform circular motion with radius r and period P, the latter turns out to be exactly P = 2π/Ω.Estimates of P x and P y reveal a well-defined value around 20 hr for strong loopers (Figures 3e and 3h), which gives about 0.9 ⋅ 10 4 s 1 for the magnitude of Ω and roughly corresponds to the most frequent value for cluster 1 in R uu (Figure 3c).However, while under the ideal assumption of uniform circular motion the spin does not depend on time, in our case evaluating Ω locally in time yields very different values, even close to zero where the trajectory has low curvature.This produces distributions of Ω which do not show typical emptying close to zero that was reported in previous works instead (Lumpkin, 2016).On the contrary, other clusters show narrower spin distributions placed around zero, as expected.Spin analysis also reveals skewness in clusters that exhibit looping behavior, suggesting that in our case study, rotating particles tend to prefer a direction of rotation.This fact could be associated with some coherent process happening in the area like inertial oscillations which have a period that is compatible with observed values of P x and P y .Indeed, for looping trajectories these two parameters show very concentrated distributions of values around 20 hr that could be associated to typical periods of inertial gravity waves.
Similar to the definition of Taylor (1921)'s integral time scale, by simple integration of Equation 7, we can derive the formula: that relates T, P, and C with T L as defined by Taylor.Under the assumption of homogeneous turbulence, this time scale also represents the time separation that has to be significantly overcome to have uncorrelated Lagrangian velocities, thus indicating a transition from a ballistic to a diffusive regime in particle dispersion.However, it is well known that if Taylor's assumptions are not fulfilled, this association with the establishment of a diffusive regime could be not reached.This is also the case of looping particles, where lobes in the autocorrelation of opposite sign largely cancel their contribution to the integral, yielding very low values (e.g., substituting the values of T = 40 hr, P = 20 hr and C = 0 in Equation 10gives T L ∼ 15 min).For this reason, in presence of oscillations, the Lagrangian time scale sometimes has been alternatively defined as the first zero of the autocorrelation (Lumpkin, 2016;Pope, 2000), which for C = 0 is P/4.Although this latter definition can be more effective for certain applications, it does not apply to positive valued autocorrelation functions and it is sometimes replaced considering the first lag time at which it drops below a given threshold (Azevedo et al., 2017).
What emerges clearly from our analysis is the role of three distinct time scales when it comes to distinguish different dispersion regimes.Indeed, the proposed model for the Lagrangian velocity autocorrelation couples together T, P, and T L , the last one computed from R uu and R vv .Figure 3 panels (j) and (k) show the time distribution of the total absolute dispersion a 2 (t) (Enrile et al., 2019;LaCasce, 2008) averaged over each cluster computed using R uu .From inspecting the time trends of a 2 (t), it is clear how Taylor's definition of T L marks the end of the t 2 scaling.The time scale at which the asymptotic diffusive regime starts is, instead, better represented by T. For the non-looping trajectories, the Taylor integral scale T L correctly tends to coincide with the estimated T.
The signature of the looping character still persists in the a 2 (t) curves, see cluster 1 in panel (j).A similar behavior was observed by Enrile et al. (2019) where the absolute dispersion was computed starting from High-Frequency radar measurements.In the above case, the tidal forcing was responsible for the sustainment of looping trajectories and the corresponding absolute dispersion showed oscillations that tend to be dumped around an almost linear trend on a long time-scale, as in the present case.Moreover, anomalous dispersion regime (sub-or superdiffusive) can be observed for the case where the autocorrelation functions showed pronounced negative and positive lobes (Figures 1a4-1a6).Finally, it is worth noting that similar results for the absolute dispersion were obtained averaging over the clusters based on R vv , see Figure S9 in Supporting Information S1.

Conclusions
The purpose of the present analysis was twofold: first, we aimed to apply a technique based on unsupervised clustering to the velocity autocorrelation functions to distinguish regions with similar dispersion properties; second, we proposed a new stochastic model that is able to represent the variability of the autocorrelation functions, extending the work of Veneziani et al. (2004) and Rupolo (2007).We tested the proposed approach using a 1-year long simulation of the coastal circulation of a Gulf in the Western Mediterranean Sea.However, the proposed methodology is generally applicable also in open ocean contexts.Lagrangian velocity autocorrelations were computed using the output of the circulation model.The application of the K-Means algorithm proved to generate clusters that could be clearly interpreted in terms of dispersion processes.The main output is a temporal and spatial mapping of coastal sub domains with similar dispersion properties.The proposed approach showed to be able to efficiently distinguish among looping and non-looping particle trajectories, without any a-priori assumption on the looping character of the trajectories as required by most of the available methodologies.Moreover, the spatial and temporal analysis of the cluster patterns and their joint occurrence matrix showed how the dispersion, although non-homogeneous, seemed to be fairly isotropic.A further important result was to suggest a new analytic expression for the autocorrelation functions, where the exponential decay and the periodic character were included in the same theoretical framework.Three time scales emerged clearly in order to describe the dispersion processes.The Taylor's integral time scale proved to remain valid, however, in the presence of looping trajectories, it must be coupled to the looping period and a longer time scale that describes the long-time exponential decay.These three time scales have a clear signature in the time evolution of the total absolution dispersion.

Figure 1 .
Figure 1.(a1-7, b1-7): representative autocorrelation profiles of the 7 clusters for R uu and R vv , respectively.The error bars indicate one standard deviation with respect to the mean profile.(c): occurrence matrix (relative frequency of appearance for each pair of R uu and R vv clusters).(d, e) Geographical pattern of the clusters based on R uu and R vv , respectively.The results used for panels (d, e) correspond to a single simulation, sequence 26.

Figure 2 .
Figure 2. (a) Percentage of occurrence of the different clusters shown for 24 out of 72 simulations.Solid square indicates the clusters based on R uu and the solid triangles show the clusters of R vv .The red rectangles indicate the four sequences shown in the other panels, namely sequence 4 (b1-b5), 16 (c1-c5), 26 (d1-d5), and 70 (e1-e5).The four panels, from left to right, for every sequence (row) show the wind rose associated with the 120 hr simulation, the spatial distribution of the sequence averaged spin, the spatial distribution of the sequence averaged kinetic energy, the spatial distribution of the clusters for R uu and R vv .

Figure 3 .
Figure 3. (a, b) Show the pdf of the integral Lagrangian time scale in x and y respectively.(c): pdf of the spin values associated with each cluster.(d-f) Show the pdf distribution for T, P and C of the proposed formula of Equation 7 for R uu .(g-i) Pdfs of T, P and C for R vv .(j, k) Distribution of the total absolute dispersion for each R uu cluster.