Graph-based mutually exciting point processes for modelling event times in docked bike-sharing systems

This paper introduces graph-based mutually exciting processes (GB-MEP) to model event times in network point processes, focusing on an application to docked bike-sharing systems. GB-MEP incorporates known relationships between nodes in a graph within the intensity function of a node-based multivariate Hawkes process. This approach reduces the number of parameters to a quantity proportional to the number of nodes in the network, resulting in significant advantages for computational scalability when compared to traditional methods. The model is applied on event data observed on the Santander Cycles network in central London, demonstrating that exploiting network-wide information related to geographical location of the stations is beneficial to improve the performance of node-based models for applications in bike-sharing systems. The proposed GB-MEP framework is more generally applicable to any network point process where a distance function between nodes is available, demonstrating wider applicability.


Introduction
In recent years, a number of convenient and sustainable alternatives for short-distance travel have emerged in world cities, with the objective of reducing traffic congestion, and promoting healthier and more sustainable urban environments.Bike-sharing systems (BSSs) represent one such solution, which have become increasingly popular especially after the end of restrictions to movement due to the COVID-19 pandemic (Heydari et al., 2021).Two types of BSSs are usually offered in cities: traditional docked systems, which require users to pick up and return bikes at designated stations, and dockless systems, which utilise GPS technology, allowing users to locate and unlock bikes using smartphone applications.From a statistical perspective, the literature about BSSs is mostly focused on two problems: rebalancing, consisting in the redistribution of bikes across stations and locations according to the predicted demand (see, for example, the survey of Vallez et al., 2021), and demand forecasting (for example, Rennie et al., 2023).In general, modelling docked bike-sharing systems is still an active research problem, extensively in the literature (see, for example, Choi et al., 2022;Deng et al., 2023;Busatto and Stingo, 2023;Paul et al., 2023).
Motivated by the application to docked bike-sharing systems, this work proposes a statistical networkinformed model for event times, generically called graph-based mutually exciting process (GB-MEP), which could be used to predict demand across stations.The proposed methodology is tested on publicly available data from the Santanter Cycles bike-sharing network in central London.In this work, journeys observed on bikesharing systems are interpreted as a realisation of a network point process, with events having a corresponding duration.In general, demand for bicycles in docked BSSs has often been modelled as a multivariate point process in the literature (see, for example, Gervini and Khanal, 2018;Okawa et al., 2019;Foschi, 2023), but network information, expressing the interaction between different stations, is often not explicitly incorporated within the existing models.This work proposes a parsimonious representation for the temporal point processes observed at each docking station, directly utilising the information provided by the distance between stations in the network for imposing restrictions on the model parameters.
In particular, this work considers a particular class of point processes, called mutually exciting processes, with a focus on Hawkes processes (see, for example, Hawkes, 1971;Laub et al., 2022).Hawkes processes have been extensively used in the literature for modelling a number of different phenomena, such as earthquakes (Ogata, 1988), crime (Mohler et al., 2011), social media (Zhao et al., 2015;Rizoiu et al., 2017), emails (Fox et al., 2016;Miscouridou et al., 2018;Sanna Passino andHeard, 2023), andcomputer networks (Price-Williams andHeard, 2020;Leigh Shlomovich and Patel, 2022;Sanna Passino and Heard, 2023).Hawkes processes appear to be a particularly suitable model for a bike-sharing system, since they capture periods of increased demand in a datadriven way, avoiding specific parametric assumptions on the process, such as periodic components or weather conditions.This property has been proved particularly powerful in cyber-security, where self-exciting models largely outperform alternatives that model periodic components parametrically (Price-Williams and Heard, 2020).
The remainder of this article is structured as follows: Section 2 gives background on point processes on graphs and Hawkes processes, describing related literature and challenges of existing approaches when applied to network point processes.Next, Section 3 introduces graph-based mutually exciting processes, the main contribution of this work, discussing the method, parameter estimation, and model evaluation procedures.The proposed methodology is then applied on publicly available data from the Santander Cycles bike-sharing system in central London (Section 4), followed by a conclusion and discussion in Section 5.

Background
Consider a network G = (V, E) where V = {1, . . ., M }, M ∈ N, is the node set, and E ⊆ V × V is the edge set, such that (x, y) ∈ E if nodes x and y are connected.Within the context of bike-sharing systems, nodes represent docking stations, and an edge between two stations can be drawn if at least one trip between the stations is observed within an observation period.It is further assumed that event data with a duration are observed on the network, corresponding to a sequence of quadruples (x k , y k , t k , t ′ k ), k = 1, 2, . . ., N, N ∈ N, where (x k , y k ) ∈ E is an edge, t k ∈ R + is the start time of the event, and t ′ k > t k is the end time of the event.This corresponds to observing events with duration δ k = t ′ k − t k .In practice, events are ordered by end time, In bike-sharing systems, the quadruple (x k , y k , t k , t ′ k ) denotes a bike journey between stations x k and y k , started at time t k and ended at time t ′ k .From the observed sequence of quadruples, two left-continuous counting processes N i : R + → N 0 and N ′ i : R + → N 0 could be associated with each node: where 1 • (•) denotes the indicator function: 1 S (x) = 1 if x ∈ S, and 0 otherwise.In the BSS application, the process N i (t) counts the number of journeys started from station i before time t, whereas the process N ′ i (t) counts the number of journeys ended at station i before time t.For simplicity, event times for each node are often denoted with the notation t i,k , i ∈ V, k ∈ N, representing the start time of the k-th event started from node i, and t ′ i,k , corresponding to the ending time of the k-th event ended on node i.This notation can be used to obtain a simpler representation for N i (t) and N ′ i (t): In general, a counting process could be characterised by its conditional intensity function (CIF).The CIF λ i (t) for N i (t) is defined as follows: where H(t) is the history of the process on the entire network up to time t.
The counting processes {N i (t)} i∈V and {N ′ i (t)} i∈V could be considered as realisations of multivariate temporal point processes.For this class of processes, a popular model in the literature is the multivariate Hawkes process (Hawkes, 1971) with scaled exponential kernel (see, for example, Laub et al., 2022), which models the CIF λ i (t) on each node as a function of all start times of events observed in the entire network before time t: where λ i ∈ R+ is a baseline intensity, α ij ∈ R + is the jump in the CIF of node i caused by an event on node j, and β ij ∈ R + , with α ij < β ij , is the corresponding decay rate.The end times of events could also be considered in the intensity (2.2), obtaining the following self-and-mutually exciting Hawkes process CIF (see, for example, Laub et al., 2022): where The models in Equations (2.2) and (2.3) have a number of issues when applied to temporal point processes on networks: first, the number of parameters is O(M 2 ) across the entire network, implying that it grows quadratically with the number of nodes, which is unfeasible for estimation in practice, especially when the network becomes large; second, the network structure is not explicitly incorporated within the intensity function in Equation (2.2); third, information from additional processes, such as end times of events, is not included within the CIF.
In the application to bike-sharing systems, these issue appear to be particularly relevant: the number of docking station could be very large (approximately 800 in the Santander Cycles BSS in central London), making estimation of the parameters α ij and β ij a high-dimensional problem.Using the example of M = 800, the total number of parameters required for model in Equation (2.2) for the entire network would be M + 2M 2 = 1,280,800.This number is more than the average number of bike journeys per month in the busiest year in the Santander Cycles BSS (over 950,000 hires per month in 2022, see Transport for London, 2022).Additionally, another drawback of the multivariate Hawkes process models in Equation (2.2) and (2.3) is that information about known, preexisting network structure is not incorporated within the CIF: for modelling network point processes in real-world applications, information about the nodes in the BSS could be very important.For example, in a bike-sharing system, stations which are geographically close are expected to have similar characteristics compared to stations which are further apart, suggesting that geographical information about nodes should be taken into account within the model.
It must be remarked that the model in Equation (2.2) has been often used successfully in the literature for modelling multivariate point processes, including spatio-temporal processes (see, for example, the survey of Reinhart, 2018).In particular, Hawkes processes are in general suitable to discover existing network structure and causal relationships between nodes (see, for example, Linderman and Adams, 2014;Etesami et al., 2016;Xu et al., 2016;Eichler et al., 2017;Yuan et al., 2019), but known characteristics about the network and nodes are not explicitly incorporated within the model.Using the approach of Linderman and Adams (2014), the network adjacency matrix, if known, could be used to set the coefficients α ij to zero where (i, j) ̸ ∈ E, but in general the use of node covariates and known network information has been overlooked in the literature.Also, it must be remarked that Hawkes process models on networks are usually applied at two different levels of resolution: node-based (see, for example, Fox et al., 2016;Price-Williams and Heard, 2020), or edge-based (see, for example Blundell et al., 2012;Perry and Wolfe, 2013;Miscouridou et al., 2018;Huang et al., 2022;Sanna Passino and Heard, 2023).
This work mostly concerns with node-based models, proposing an extension of the standard multivariate Hawkes process is proposed, exploiting existing network structure between the nodes to reduce the number of parameters to estimate to O(M ), providing advantages for scalability to large networks.Additionally, information from end times is added to the CIF, providing a more complete modelling framework for event data on networks with an associated event duration.

Graph-based mutually exciting processes
The main objective of this work is to propose a graph-based mutually exciting model (GB-MEP) for the nodespecific conditional intensity functions λ i (t) for the temporal point processes of start times of connection events on a graph, applied to journeys observed on bike-sharing systems.The GB-MEP intensity function is assumed to take the following generic form: where λ i ∈ R + is a baseline rate, ω i , ω ′ i : R + × R + → R + are spatio-temporal excitation kernel functions, and γ : V × V → R + is a distance function between nodes.The distance function γ quantifies how close in space two nodes are.For example, in bike-sharing systems, the distance function could be taken to be the haversine metric, which provides an "as-the-crow-flies" measure of distance between the geographic location of two nodes.Given the latitude and longitude (φ x , ℓ x ) and (φ y , ℓ y ) of the bike stations x and y (in radians), the haversine distance between the stations is: where ρ is the Earth radius.Note that, for more general point processes on networks where information about the node locations is not present, the distance γ(x, y) between two nodes x, y ∈ V could depend on graph-related statistics, such as the length of the shortest path between the nodes, or the number of common neighbours.In this case γ takes the interpretation of a dissimilarity metric.
In this work, it is further assumed that the spatial and temporal components of the excitation kernels ω i are separable, with the temporal element of the kernel taking a scaled exponential form.This results in the following excitation kernel: where α i , β i ∈ R + , α i < β i , and γ * corresponds to a realised value of the distance function γ.The function κ i : R + → (0, 1] rescales the node-specific temporal mutual excitation given by α i exp(−β i t), using the distance γ between the nodes.In the application to bike-sharing systems, the excitation effect of an event occurring on a node with small distance from station i is likely to have a larger effect on the intensity λ i (t) than an event occurring on a node located further away.Therefore, κ i could be set to a monotonically decreasing function such as κ(0) = 1 and lim γ * →∞ κ i (γ * ) = 0.A possible choice, used in this work, is the truncated exponential function , where θ i ∈ R + is a decay parameter and ε ∈ R + is a fixed constant defining the radius of neighbouring nodes considered to have non-zero effect on the CIF λ i (t).Note that setting ε → 0 would result in κ i (γ * ) = 1 {0} (γ * ), which would only give weight to events involving station i, removing the spatial mutual excitation component, and making the process exclusively self-exciting.A similar spatio-temporal excitation kernel could also be postulated for ω ′ i (γ * , t): Figure 1: Example of CIFs for a GB-MEP with V = {1, 2, 3} and three realised events expressed as quadruples (x, y, t, t ′ ), x, y ∈ V, t, t ′ ∈ R + , t < t ′ .Events and their symbols: (1, 2, 1.25, 2.75) corresponds to •; (1, 3, 4, 4.5) to ▲, and (2, 2, 2.35, 8) to ■. Parameters: λ 1 = 0.2, λ 2 = 0.3, λ 3 = 0.15, α 1 = 0.8, Overall, this gives the following extended form for the proposed GB-MEP CIF in Equation (3.1) for applications to BSSs: An example of the intensity functions arising under a GB-MEP process is given in Figure 1.

Parameter estimation
The model parameters for each node-based process could be learned via maximum likelihood estimation, by maximising the node-specific log-likelihood function separately for each station.The F. Sanna Passino, Y. Che, and C. Cardoso Correia Perello log-likelihood function for a temporal point process takes the following form (see, for example, Daley and Vere-Jones, 2002): where Λ i : R + → R + is the compensator, defined as: For the CIF in Equation (3.4), the compensator takes the following form: Note that, because of the double summations in the CIF in Equation (3.4), the computational complexity for calculating the term ]}, proportional to the sum of the number of events occurring on each node, multiplied by the total number of events started from the station under consideration.In practice, this makes model fitting particularly challenging, and almost unfeasible for large datasets.This issue is largely alleviated by the scaled exponential excitation kernel, which admits a computationally cheaper form of the log-likelihood, written in recursive form using the following expression for the CIF in Equation (3.4): In the above expression, A ij (k) and A ′ ij (k) are defined as follows: The initial values A ij (1) and A ′ ij (1) are simply: The recursive form of the log-likelihood stems from the following proposition.
Proposition 1.The expressions for A ij (k) and A ′ ij (k) in Equation (3.6) can be rewritten recursively as follows: Proof.The result is proved for A ij (k) here; the proof for A ′ ij (k) is analogous.The summation for A ij (k) in Equation (3.6) could be broken up at N j (t i,k−1 ) as follows: (3.8) Pre-multiplying the first element of Equation (3.8) by exp{−β i (t i,k−1 ) − β i (t i,k−1 )} gives: By the definition in Equation (3.6), , which gives the result in Equation (3.7).

It follows that, writing
, Proposition 1 gives a recursive form for the log-likelihood in Equation (3.5): Under this recursive representation,

Model evaluation
The quality of the model fit could be assessed by evaluating the distribution of p-values calculated on a training and a test set observation periods, using the same methodology discussed in Sanna Passino and Heard (2023).By the time rescaling theorem (see, for example, Brown et al., 2002), the transformed event times Λ i (t i,k ), k = 1, 2, . . .are event times of a homogeneous Poisson process with unit rate, under the null hypothesis of correct specification of the conditional intensity and its parameters.Therefore, under the same null hypothesis, the upper tail p-values calculated from the observed events should follow a uniform distribution in (0, 1).Since the interarrival times of a unit rate Poisson process are exponentially distributed with unit rate, the corresponding upper tail p-values for the observed events are: Using, the recursive expressions in Proposition 3.1 and the counting processes in Equation (2.1), the expression in Equation (3.9) reduces to: F. Sanna Passino, Y. Che, and C. Cardoso Correia Perello Graph-based mutually exciting point processes for modelling event times in docked BSSs Considering an observation period [0, T ], the model parameters could first be estimated from a testing period [0, T * ], T * < T , using the procedure described in Section 3.1.The p-values can then be calculated directly by plugging in the estimates into Equation (3.9).The estimates could also be used to evaluate the performance of the model in a testing period (T * , T ], by calculating the p-values using the same methodology.In both cases, under a correct model specification and parameter estimation, the distributions of the p-value on the training and test set events should be approximately uniform in (0, 1).The quality of the fit to the uniform distribution could be evaluated using the Kolmorogov-Smirnov statistic, denoted here KS i .For the sequence of node-specific p-values p i,k , k = 1, . . ., N i (T ) on the training set, KS i is defined as follows:

Application to the Santander Cycles bike-sharing system
The proposed models were tested on data from the Santander Cycles bike-sharing network in central London.
Transport for London publicly released data about journeys on the Santander Cycles network, publicly available at https://cycling.data.tfl.gov.uk/(powered by TfL).Each record includes information about the start and end times, IDs of source and destination stations, journey duration, a bike ID number, and an ID number for the rental.For the analysis in this section, 32 weeks of data were considered, starting from 2 nd March 2022 until 9 th October 2022.A total of 4,098,756 journeys recorded between 2 nd March 2022 and 21 st June 2022 (approximately 16 weeks) were used to construct the training set, whereas the remaining 3,995,591 journeys occurred between 22 nd June 2022 and 9 th October 2022 (approximately 16 weeks) formed the test set.A total of M = 787 unique active docking stations were observed in the data under consideration.Distances between docking stations were calculated using the haversine distance formula in Equation (3.2), with the value ρ = 6365.079km(corresponding to the Earth radius at the latitude of London, with no altitude).A more precise estimate of the cycling distance between two nodes could be obtained via calls to the TfL Journey Planner API, or any other tool providing calculations of cycling distances between locations.The main practical limitation for this approach is that the calculation of the full distance matrix would require M 2 API requests, which normally largely exceeds the number of permitted requests under a basic plan.
The proposed GB-MEP model in Equation 2 has been compared to a number of alternatives: • A standard Poisson process model for each station (Poisson): • A mutually exciting model for each station (MEP): • A self-exciting Hawkes process model for each station (SEP): • A self-and-mutually exciting process for each station (SMEP):  • A spatial mutually exciting process for each station (SpMEP): Note that all models above can be written as special cases of GB-MEP for specific choices of the functions κ and κ ′ , cf.Equation (3.3).The value of ε in SpMEP and GB-MEP has been set to ε = 0.5.For some stations, M ε i < 3 for ε = 0.5; in these cases, ε was set to the minimum value such that M ε i ≥ 3.All model parameters for GB-MEP and its competitors were estimated via maximum likelihood, using numerical optimisation via L-BFGS-B (see, for example, Zhu et al., 1997).Note that the Poisson model has a closed form solution: λi = N i (T * )/T * for a training period [0, T * ).The parameters of SEP and MEP were initialised to (λ i , α i , β i ) = (e −4 , e −4 , 2e −4 ), and the corresponding results were used to initialise SMEP.In turn, the results obtained from SMEP were used to initialise SpMEP and GB-MEP.
The results are reported in Table 1, Figure 2 and Figure 3. Table 1 shows the Kolmogorov-Smirnov score for the distribution of p-values calculated across all M stations, calculated against the uniform distribution.Figure 2 reports uniform quantile-quantile (Q-Q) plots for all stations individually, and for all stations combined.Figure 3 contains the boxplots of KS scores for all six models considered in this section.The results show that GB-MEP has the best performance across all models, both on the training and test set, demonstrating that taking into account the geographical location of nodes within the network is important to improve the goodness-of-fit of statistical models to real-world network point process data.GB-MEP is the only model which achieves a median KS score below 0.05 across stations, both in the training and test sets cf. Figure 3).Additionally, Table 1 shows remarkable agreement between the results on the training and test sets, which suggests that the models are not overfitting the data.
For most stations, the two best performing models are SMEP and GB-MEP, as observed in Table 1, Figure 2 and Figure 3. GB-MEP has a better performance than SMEP on approximately 73.88% of stations in the test set.Figure 4 analyses more closely the differences between the results of GB-MEP and SMEP, showing all docking stations in London, coloured by the difference in KS scores on the test set, and a histogram of differences between KS scores for SMEP and GB-MEP on the test set.The effect of the end times on the intensity of start times of journeys may seem counterintuitive, but it is consistent with some characteristics of the Santander Cycles BSS: for example, if a station is empty, a user returning a bike to that station may imply that there is a higher chance of a journey starting immediately from that station, with a different user picking up the newly available bike.Also, until September 2022, users were charged £2 for unlimited journeys with Santander Cycles in the subsequent 24 hours.For journeys longer than 30 minutes, users are additionally charged.Therefore, many users typically returned bikes before the 30 minutes mark, and they would pick up another bike immediately for a new journey, meaning that returning a bike to a station could increase the likelihood of a new journey starting from the same station.

Conclusion
This article discussed a general strategy to incorporate measures of distance between nodes on a graph within the context of multivariate Hawkes processes on graphs, proposing GB-MEP, a graph-based mutually exciting point process.Scalable inference procedures were discussed, using a recursive form of the likelihood function under a scaled exponential form of the excitation kernel.The model was fitted on a large dataset of journeys within the Santander Cycles bike-sharing system, with over 4 million events in the training set, and close to M = 800 nodes.The model performance was evaluated on close to 4 million events comprising the test set, demonstrating superior performance of GB-MEP over alternative modelling approaches.
A possible limitation of the proposed framework for the application to bike-sharing systems is that a bound on the capacity of each station is not included within the model: each station has a limited number of docks, which could be used to correct the CIF when a station is at capacity.Also, additional distance functions could be considered for the excitation component.This work is framed within the context of docked bike-sharing systems, but it could be more generally applied to mutually exciting processes on any type of network where a distance metric between nodes is available.If an adjacency matrix is observed, and not a distance matrix, measures of similarity between nodes could be used for γ within the GB-MEP framework (for example, the length of the shortest path between nodes).
A possible extension of the version of GB-MEP presented in this work would be to consider different excitation kernels, such as power laws (see, for example, Laub et al., 2022).Furthermore, removing the separability assumption on the excitation kernel could also prove interesting, as it would allow to capture more complex patterns arising from the interaction between the spatial and temporal components, which can lead to a better model fit.Unfortunately, both alternatives would cause an increased computation cost.Finally, adding a probabilistic mixture model component in the bike-sharing application could further improve the model fit, reflecting the fact that bikes could be used for different purposes, such as leisure or commuting.

Figure 2 :
Figure 2: Uniform Q-Q plots of p-values, calculated separately for each station on the training and test sets for six models.

FFigure 3 :
Figure 3: Boxplots of KS scores across all station, calculated on the training and test sets for six models.

Figure 4 :
Figure 4: Station locations and histogram of differences in KS scores between SMEP and GB-MEP on the test set.

Table 1 :
Kolmogorov-Smirnov scores for the distribution of p-values for all stations on training and test sets for six models.