Spatio-temporal mixed membership models for criminal activity

We suggest a probabilistic approach to study crime data in London and highlight the benefits of defining a statistical joint crime distribution model which provides insights into urban criminal activity. This is achieved by developing a hierarchical mixture model for observations, crime occurrences over a geographical study area, that are grouped according to multiple time stamps and crime categories. The


VIRTANEN and GIROLAMI

In summary, we present a multi-view extension of mixed membership modelling that combines data from multiple crime categories (or different views of crime), effectively increasing the amount of data available to infer better crime maps and allowing statistical dependencies between any subset of crime categories to be inferred. We anticipate that our model, which extends the family of successful applications of mixed membership modelling, will spark new interest in joint crime distribution modelling.
The paper is structured as follows. Section 2 explains the data. The proposed model is introduced in Section 3, and Section 4 shows results. Section 5 concludes the paper.

| DATA
We collect data from the publicly available UK Police website, https://data.police.uk/, for the Metropolitan police force for London, that exhibits high crime rates and complex urban structure encoded in a wide range of spatial covariates, making it an interesting and relevant study area. We prefer to collect data from a single police force to avoid potential data duplication and management issues. We use data for a subset of London boroughs focusing primarily on Westminster, which has ample and heterogeneous criminal activity, as well as neighbouring or close-by boroughs, such as, Camden, Islington, Kensington and Chelsea, Hammersmith and Fulham, and Brent for a time interval from 2011 December to 2019 June.
The crime data are provided in a table, where each row represents a crime occurrence including a date (in year-month format), location coordinates and type of crime. For privacy concerns, actual crime occurrences are discretised based on offence categorisation as well as spatial and temporal partitions by the data provider. We focus on five crime categories, including, burglary, criminal damage and arson, violence and sexual offences, robbery and vehicle crime. The spatial partition contains a set of non-overlapping areas such that each area contains a centre point and an actual offence location is assigned to the area that has the smallest distance between the location and centre points, following the idea of Voronoi tessellation. The temporal partition is provided on a monthly basis. Accordingly, each area may contain more than one crime occurrence at some time point.
The number of areas provided is 8804, and the average smallest distance between any two centre points is 63 m, ranging between 1 and 718 m. To reduce computational load, following the idea of Voronoi tessellation and finite element methods (Lindgren et al., 2011), we construct a mesh (illustrated in Figure 1) that provides a triangulation of the bounded area based on the set of centre points. We use the INLA R package, https://www.r-inla.org/, to compute the mesh. We specify the boundary as the union of the selected boroughs, with a minimum distance of 100 m between any two mesh points, resulting in 4519 mesh points; on average, the smallest distance between any two mesh points is 126 m. This level of accuracy permits fine-grained and detailed analysis. The areas have a relatively high number of neighbours, on average 5.7, ranging between 2 and 10, capturing the geometry well, as illustrated in Figure 1. To summarise, the data for each mesh point and time stamp consist of the crime occurrences that lie closer to that mesh point than to any other mesh point.
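The nearest-point assignment described above can be illustrated with a minimal sketch. It assumes planar coordinates in metres and uses a brute-force search; the function names and the toy coordinates are hypothetical and are not taken from the police data or from the INLA mesh.

```python
import math
from collections import Counter


def nearest_index(point, centres):
    """Return the index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def count_by_mesh_point(locations, centres):
    """Count occurrences per mesh point using nearest-centre (Voronoi) assignment."""
    counts = Counter(nearest_index(p, centres) for p in locations)
    return [counts.get(i, 0) for i in range(len(centres))]


# Hypothetical coordinates in metres (illustration only).
mesh_points = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
crime_locations = [(10.0, 5.0), (90.0, 10.0), (5.0, 80.0), (60.0, 10.0)]
counts = count_by_mesh_point(crime_locations, mesh_points)
```

For thousands of mesh points a spatial index would be preferable to the brute-force search; the sketch only shows the assignment rule.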
Let $w^{(m)}_{t,n} \in \{1, 2, \ldots, D\}$, for $m=1,\ldots,M$, $t=1,\ldots,T$ and $n=1,\ldots,N^{(m)}_t$, denote the location indices of areas for crime occurrences, taking values over a spatial partition of size $D$ of a finite region. Here, $M$ denotes the number of crime categories, $T$ denotes the number of time stamps and $N^{(m)}_t$ denotes the total number of occurrences for the $m$th category at time stamp $t$. Each location index is associated with the coordinates of the mesh point of the area. Altogether, the data contain $T=91$ time stamps, $M=5$ crime categories, $D=4519$ mesh points and $\sum_{t,m} N^{(m)}_t = 6.5 \times 10^5$ crime occurrences.

Figure 2 shows average empirical monthly crime distributions based on the occurrences for each category. Based on the figure, we see that crime is sparse and clusters into hot and cold spots; the colour bar is omitted from each such illustration. Table 1 shows, for each category, the number of occurrences, the filling factor (proportion of areas with non-zero occurrences over time), the sparsity level (proportion of areas with non-zero occurrences over time stamps and areas) and the average occurrences over areas and time stamps, including their range of values. The table also verifies that crime is clustered but widely dispersed over the study region.

The targets or victims of crime vary depending on the category. Businesses and dwellings may be subject to robbery, burglary or property damage. On the other hand, persons may be subject to robbery, violence and sexual offences. Based on borough-level crime occurrence aggregates¹, we estimate that 39% of burglaries and only 6% of robberies target businesses. Micro-scale areas may attract offences of mixed targets and categories. Based on the data, the average co-occurrence rate of pairs of different categories over areas and time stamps is 0.31, ranging from 0.14 between vehicle crime and robbery to 0.58 between robbery and violence, demonstrating substantial co-occurrence.
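As an illustration of the two sparsity summaries reported in Table 1, the sketch below computes them from a small count array. The reading of the definitions (areas that are ever active, and non-zero area-by-time cells) is an assumption based on the wording above, and the counts are hypothetical.

```python
def filling_factor(counts):
    """Proportion of areas with at least one occurrence at any time stamp."""
    n_areas = len(counts[0])
    active = sum(1 for d in range(n_areas) if any(row[d] > 0 for row in counts))
    return active / n_areas


def nonzero_cell_fraction(counts):
    """Proportion of (time stamp, area) cells with at least one occurrence."""
    cells = [c for row in counts for c in row]
    return sum(1 for c in cells if c > 0) / len(cells)


# Hypothetical counts for one category: T = 3 time stamps (rows), D = 4 areas (columns).
counts = [
    [2, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 3, 0],
]
ff = filling_factor(counts)         # areas 0 and 2 are active at some time
sl = nonzero_cell_fraction(counts)  # 4 of the 12 cells are non-zero
```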
We collect spatial covariates using Census 2011 and the Transport for London (TfL) accessibility index, available from the London Data Store, https://data.london.gov.uk. We also collect points-of-interest (POI) data (EDINA Digimap Ordnance Survey Service, 2018). The accessibility index provides a fine-grained representation of how accessible each area is by public transport, taking into account natural barriers and constraints affecting accessibility. We note that this covariate may substitute for the use of actual road networks, which are more complicated to work with. The Census provides a wide range of socio-economic covariates, including more general covariates such as area, household and population size, at lower super output area (LSOA) or output area (OA) geographical resolution; these geographical partitions are also provided in the London Data Store. Our study area comprises 773 LSOAs and 4314 OAs. Sections 2 and 3 of the online supplement of this paper show a subset of socio-economic and geographical covariates respectively. Altogether, the covariates consist of the accessibility index, 107 socio-economic variables and 681 POI variables (this number includes groupings and categories of POIs, as provided by the system).
We note that not all covariates are included in modelling because of increased computational load. Also some of the POI-based covariates are heavily collinear and hence it suffices to include only a subset of collinear covariates without (significant) loss of information. However, all of the covariates are used for model inspection, that is, for interpreting the components. We use area-weighted averaging to compute crime level at each LSOA for model inspection because most of the socio-economic covariates are available at the LSOA level and these covariates are naturally normalised in terms of social cohesion and population size. Similarly, we also average the LSOA/OA-based socio-economic covariates to the mesh areas for computing the model by constructing a covariate matrix X of size D×C, where C denotes the number of covariates. For the geographical covariates, we follow the idea of Voronoi tessellation and compute the number of relevant POIs at the mesh level directly and note that model inspection is performed also at the mesh level for the geographical covariates. The covariate matrix X contains all the 107 socio-economic covariates, accessibility index and 4 geographical covariates, that combine POIs into retail, eating and drinking, transport and legal and financial groups, resulting in C=112 covariates.
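The area-weighted averaging mentioned above can be sketched as follows, assuming the overlap areas between source zones (LSOAs or OAs) and a target area have already been computed by a GIS tool. The function name and numbers are hypothetical.

```python
def area_weighted_average(values, overlap_areas):
    """Average source-zone values, weighted by their overlap area with one target area."""
    total = sum(overlap_areas)
    if total == 0:
        raise ValueError("target area does not overlap any source zone")
    return sum(v * a for v, a in zip(values, overlap_areas)) / total


# Hypothetical example: one mesh area overlapping three output areas.
covariate_values = [0.20, 0.50, 0.80]  # e.g. a proportion recorded per output area
overlap_m2 = [1000.0, 3000.0, 0.0]     # overlap of each output area with the mesh area
x_mesh = area_weighted_average(covariate_values, overlap_m2)
```

Repeating this for every mesh area and covariate would fill one entry of the covariate matrix at a time.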

| MODEL
A set of observed time-stamped and location-marked crime occurrences underlies analysis of temporal and spatial patterns of urban crime. The occurrences are often categorised by police/law to indicate the type of crime. Given a particular urban area, experts, such as, members of police or criminologists, may know which type of crime the area is associated with and to what extent at a particular time point. A city may contain a large set of such areas that may partially overlap with each other depending on the degree to which crime categories they are associated with and when. Experts may often expect that the areas and their associated categories change in time and that certain urban structure of the areas, encoded in spatial covariates, provides further information of the causes and nature of crime.
In this work, we present a model that is able to leverage such expert knowledge in a data-driven manner based on the crime occurrences. Following the notation set out in Section 2, the occurrences, which take discrete values over $D$ areas, for category $m$ at time stamp $t$ follow a multinomial distribution whose expectation parameter $p^{(m)}_t$ is given by a decomposition into $K$ components,

$$w^{(m)}_{t,n} \sim \mathrm{Multinomial}\big(p^{(m)}_t\big), \qquad p^{(m)}_t = \sum_{k=1}^{K} \eta_k\, \theta^{(m)}_{t,k}, \tag{1}$$

where the components comprise spatial crime distributions (maps) $\eta_k$ and map proportions $\theta^{(m)}_{t,k}$, for $k=1,\ldots,K$. The crime maps are distributions over the $D$ areas, whereas the map proportions are distributions over the $K$ maps, such that all elements are positive and sum to one. The crime maps are shared across crime categories but categories contribute differently via the proportions. Elements of $\eta_k$ indicate which areas are relevant for the $k$th map. On the other hand, elements of $\theta^{(m)}_t$ indicate which maps are relevant for the $m$th category at the $t$th time stamp. In other words, a map may contain one or more areas. Also, one or more crime maps may be responsible for the occurrences $w^{(m)}_t$. The decomposition is essential to uncovering latent processes that explain criminal activity, as shown in the results (Section 4).
We reformulate (1) as a mixture model by introducing map assignment variables $z^{(m)}_{t,n} \in \{1, 2, \ldots, K\}$ for each occurrence,

$$z^{(m)}_{t,n} \sim \mathrm{Categorical}\big(\theta^{(m)}_t\big), \qquad w^{(m)}_{t,n} \mid z^{(m)}_{t,n} \sim \mathrm{Categorical}\big(\eta_{z^{(m)}_{t,n}}\big). \tag{2}$$

The assignment variable $z^{(m)}_{t,n}$ is generated from a categorical distribution and the corresponding occurrence is generated from another categorical distribution conditioning on the assignment variable. This formulation highlights that the elements of the maps and proportions represent probabilities. The generative process is simple and intuitive and can be used to explain and simulate occurrences. Analytically marginalising out the crime map assignments, we recover (1). The model formulation (1) emphasises the decomposition, whereas Equation (2) emphasises the mixture and leads to simplified posterior inference, as detailed in the following.

Applying spatial and temporal smoothing is key to capturing dependencies between close-by occurrences in both space and time. To this end, we use Gaussian Markov random fields (GMRF; Rue & Held, 2005) to construct $\eta_k$, for $k=1,\ldots,K$, explaining crime maps using spatial covariates, unstructured random effects and spatially structured random effects that capture deviation from the covariate or unstructured contribution. GMRFs are flexible, scalable and intuitive to specify. The construction of the crime maps is based on the so-called BYM model (Besag et al., 1991) and its extensions (Leroux et al., 2000; Riebler et al., 2016). To construct the maps, we introduce auxiliary variables that follow a variant of the BYM model, and we further introduce non-trivial constraints to guarantee that the variables take values in a reasonable range and that the posterior distributions for the corresponding hyperparameters are identifiable.
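The mixture formulation can be simulated directly, which also illustrates that marginalising out the assignments recovers the decomposition. The following is a minimal sketch for a single category and time stamp; the names `eta` (maps) and `theta` (proportions) and all numerical values are hypothetical choices made for the illustration.

```python
import random


def sample_occurrences(eta, theta, n, rng):
    """Draw n occurrences: z ~ Categorical(theta), then w | z ~ Categorical(eta[z])."""
    n_maps, n_areas = len(eta), len(eta[0])
    draws = []
    for _ in range(n):
        z = rng.choices(range(n_maps), weights=theta)[0]
        w = rng.choices(range(n_areas), weights=eta[z])[0]
        draws.append((z, w))
    return draws


def marginal(eta, theta):
    """Marginalise out the assignment: p[d] = sum_k theta[k] * eta[k][d]."""
    n_areas = len(eta[0])
    return [sum(t * row[d] for t, row in zip(theta, eta)) for d in range(n_areas)]


# Hypothetical toy values: K = 2 maps over D = 3 areas.
eta = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
theta = [0.25, 0.75]
p = marginal(eta, theta)
draws = sample_occurrences(eta, theta, 20000, random.Random(0))
freq = [sum(1 for _, w in draws if w == d) / len(draws) for d in range(3)]
```

With enough draws, the empirical area frequencies `freq` should be close to the marginal distribution `p`.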

For the spatial partitions we assume two areas are neighbours if they are contiguous; Figure 1 illustrates the neighbourhood information. We introduce auxiliary Gaussian variables

$$\phi_k \sim \mathrm{Normal}\big(\mu + X\beta_k,\; Q^{-1}\big),$$

for $k=1,\ldots,K$, and assume $\eta_k \propto \exp(\phi_k)$, using element-wise exponentiation. We let the spatial covariates $X$ inform generation of the maps. Here, $\beta_k$ denote covariate regression coefficients that capture component-specific effects, $\mu$ is a mean parameter shared across the components, capturing the overall crime rate for each area, and $Q$ is a precision (inverse covariance) matrix that captures both spatially structured and unstructured random effects via parameters $\lambda$ and $\kappa$ respectively. In more detail, the off-diagonal elements of $Q$ take value $-\lambda$ for any two neighbouring areas and are zero otherwise, and each diagonal element equals the number of neighbours of the corresponding area times $\lambda$ plus an additive constant $\kappa$. For numerical stability, we constrain $\phi_{k,1} \approx 0$ by subtracting the contribution of $X_{1,:}\beta_k$ from the component-specific mean, setting $\mu_1 = 0$ and the corresponding precision term to an arbitrarily large value, without loss of generality. The constraint is needed to ensure that the values remain in a computationally feasible range. We note that explicit conditioning on $\phi_{k,1} = 0$ provides an alternative approach, but it is computationally more demanding because the sparsity structure of the corresponding precision matrix would be lost.

For an increasing number of covariates $C$, under certain assumptions, the semi-parametric model becomes closer to a non-parametric approach based on Gaussian processes (GPs). The key is to enforce sparsity for the regression coefficients to prevent over-fitting. For this reason, we follow the so-called automatic relevance determination (ARD) approach (MacKay, 1995; Neal, 2012) and assume $\beta_{k,c} \sim \mathrm{Normal}(0, \alpha_{k,c}^{-1})$ and $\alpha_{k,c} \sim \mathrm{Gamma}(a, b)$, for $k=1,\ldots,K$ and $c=1,\ldots,C$. For small values of $a$ and $b$ the distribution is sparsity-promoting. However, we allow the data to inform a suitable amount of sparsity by inferring the parameters $a$ and $b$. Finally, we assume …

We take non-stationarity into account via time-varying proportions. For the temporal proportions $\theta^{(m)}_t$, for $t=1,\ldots,T$, we also employ a flexible GMRF-based construction that is able to capture complex temporal dependencies. We introduce temporally dependent auxiliary variables, for $k=1,\ldots,K$, together with category-specific parameters, for $m=1,\ldots,M$, that affect the smoothness of temporal variation. The corresponding $T \times T$ precision matrix for the auxiliary variables of category $m$ and component $k$ takes a simple tridiagonal form.
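The verbal description of the spatial precision matrix (value −λ for neighbouring pairs, number of neighbours times λ plus κ on the diagonal) and the exponentiate-and-normalise step for the maps can be made concrete with a small sketch. It uses a dense matrix and a hypothetical four-area chain for readability; a real implementation would use sparse storage, and the names `spatial_precision`, `map_from_auxiliary` and `phi` are chosen here only for illustration.

```python
import math


def spatial_precision(neighbours, lam, kappa):
    """Dense Q: -lam for neighbouring pairs, n_i * lam + kappa on the diagonal."""
    n_areas = len(neighbours)
    Q = [[0.0] * n_areas for _ in range(n_areas)]
    for i, nbrs in enumerate(neighbours):
        Q[i][i] = len(nbrs) * lam + kappa
        for j in nbrs:
            Q[i][j] = -lam
    return Q


def map_from_auxiliary(phi):
    """Element-wise exponentiation followed by normalisation to sum to one."""
    shift = max(phi)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(v - shift) for v in phi]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical neighbourhood structure: four areas arranged in a chain.
neighbours = [[1], [0, 2], [1, 3], [2]]
Q = spatial_precision(neighbours, lam=2.0, kappa=0.5)
eta_k = map_from_auxiliary([0.0, 1.0, -1.0, 0.5])
```

Each row of `Q` sums to κ, which makes visible how κ keeps the precision matrix positive definite while λ controls spatial smoothing.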
We assign weakly informative priors for the hyperparameters such that the model has a single parameter K to validate. For the hyperparameters λ,κ and m , for m=1,…,M, we employ a two-parameter variant of generalised inverse Gaussian distribution, where 0 ∈ ℝ and 0 > 0. The distribution provides a convenient prior to constrain variables away from zero and infinity simultaneously; this is especially useful for the m . We assume 0 = 0 = 10 − 3 . For a, b and , we assume Gamma(ε,ε) distribution, for = 10 − 7 , and set 0 = 0 and 0 = 10. For we assume a normal distribution with mean zero and very large (infinite) variance. Section 1 of the online supplement of this paper summarises posterior inference.

| Related work
For $M=1$ and $K=1$ the model reduces to a stationary model, $p^{(1)}_t = \eta_1$, for $t=1,\ldots,T$, and is related to (generative) crime mapping. Here, the benefit of the probabilistic approach is that the spatial map parameters affecting smoothness may be inferred from the observed data instead of relying on expensive cross-validation or manual tuning. Kernel density estimation provides a non-parametric alternative for crime mapping (Chainey et al., 2008); here the goal is to estimate a kernel function that depends on locations and time stamps for computing a crime distribution for some crime category at a particular time. The absence of a proper probabilistic formulation complicates kernel parameter estimation as well as model selection and comparison. Flaxman et al. (2015) propose to employ Gaussian processes (GPs), which may be interpreted as a probabilistic variant of kernel methods. Flaxman et al. (2018) adopt the GP setting, presenting a computationally more scalable linear approximation.
Based on the data distribution (1) and assuming different likelihood functions, the model may be interpreted as a form of low-rank or factor model (Buntine, 2002; Heller et al., 2008). We note that inference for the marginalised form is more demanding because $\eta$ and $\theta$ are coupled; for our model the updates are conditionally independent given the assignments. We also note that the model focuses directly on capturing the crime densities and does not explicitly take into account the number of occurrences, unlike alternative formulations based on a Poisson likelihood function for count data. Importantly, the Poisson-based models would be unable to fully leverage occurrence-specific information.
Alternatively, based on the mixture model formulation (2), our model is related to topic modelling (Blei et al., 2003). Blei and Lafferty (2006b) present a dynamic extension, noting that the model of Blei et al. (2003) is unable to take into account temporal dependencies. Blei and Lafferty (2006a) assume a general (unconstrained) covariance matrix that could in theory be used to capture spatial dependencies, as considered in this work. However, this formulation would not be suitable for our application because it does not impose spatial smoothness and the computational load would be very high. Our model is inspired by Teh et al. (2005); we present a spatio-temporal extension necessary for the application, as verified quantitatively in the experiments (Section 4). For these models the data often consist of words appearing in text documents. Analogously, the crime occurrences would correspond to words, and documents group words or crime occurrences according to time stamps. Our motivation for this class of models is based on their ability to capture semantics from word co-occurrences; so-called topics group together semantically similar words corresponding to certain themes. In our application, the topics would correspond to the crime maps. Finally, we note that the dynamic topic model assumes sequentially grouped documents and is not as such suitable for our application because the groups would contain only one document. Topic modelling has been applied to textual crime category descriptions (Kuang et al., 2017).
An alternative approach to factor or topic modelling would impose dependencies directly on the expectation parameters $p^{(m)}_t$, working with a Poisson likelihood (Diggle et al., 2013; Lindgren et al., 2011; Møller et al., 1998; Taddy, 2010). This approach may be computationally very demanding, may have difficulty capturing associations between crime categories (Liu & Zhu, 2017; Quick et al., 2018), and may be less intuitive to interpret, because the model provides no dimension reduction or interpretable components. Here, $p^{(m)}_t$ may include a linear mapping based on fixed spatial covariates and (un)structured random effects, potentially using GMRFs. We note that for our model the covariates are at a higher level in the hierarchy, resulting in a more weakly supervised formulation, where the spatial component plays a key role and it is not meaningful to look at the covariate effects separately from the (un)structured contributions. In the regression setting, covariates are at a lower level in the hierarchy, increasing the amount of supervision; here, the spatial effect is mainly interpreted as noise that we are not particularly interested in. As such, extra care is needed for interpreting the covariate effects, possibly requiring feature selection to cope with multicollinearity. Also, these models may oversimplify the covariate contribution because they often capture global instead of local effects.

| Model selection and comparison
The model contains one free parameter, the number of maps, $K$, that needs to be validated. Larger $K$ implies a more complex model. In theory, the model is able to prune out irrelevant components, but we prefer to train the model for a wide range of $K \in \{2,4,\ldots,20\}$ for computational reasons. Our aim is to choose $K$ such that the corresponding model explains the observations well without overfitting. For model selection and comparison, we adopt the Watanabe-Akaike information criterion (WAIC; Gelman et al., 2013) that approximates model evidence, the probability of the data given the model; we employ a formulation where higher values for WAIC are better, computing the data log likelihood and subtracting the model complexity term. In more detail,

$$\mathrm{WAIC} = \sum_{m,t,n} \log\Big(\frac{1}{S}\sum_{s=1}^{S} p\big(w^{(m)}_{t,n} \mid \Omega^{(s)}\big)\Big) - \sum_{m,t,n} V\Big[\log p\big(w^{(m)}_{t,n} \mid \Omega^{(s)}\big)\Big],$$

where $\Omega^{(s)}$ denotes posterior samples², for $s=1,\ldots,S$, and $V[\cdot]$ denotes sample variance over the posterior samples. The model posterior distribution contains multiple modes and it is essential to explore local modes with high posterior probability density. To this end, we use 10 chains with well-dispersed random initialisations. Because of multi-modality it is not appropriate to combine samples across multiple chains. To assess significance of the WAIC values, we also show box plots for WAIC values computed for a set of 10 replicates using partial re-sampling without replacement. We run the sampler for $2 \times 10^5$ iterations, which we find to suffice for convergence using standard diagnostics, collecting the last quarter of posterior samples to compute the WAIC score.
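A WAIC score of the kind described above (log predictive density minus a variance-based complexity term, higher is better) can be computed from a matrix of pointwise log-likelihoods. The sketch below follows the general recipe of Gelman et al. (2013); it is not the authors' implementation, the pointwise layout of `log_lik` is an assumption, and the numbers are hypothetical.

```python
import math
from statistics import variance


def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(x) over a list of values."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def waic(log_lik):
    """WAIC on the 'higher is better' scale.

    log_lik[s][i] is the log-likelihood of observation i under posterior
    sample s; at least two posterior samples are required.
    """
    n_samples, n_obs = len(log_lik), len(log_lik[0])
    lppd, penalty = 0.0, 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_samples)]
        lppd += log_mean_exp(column)   # log pointwise predictive density
        penalty += variance(column)    # sample variance of the log-likelihood
    return lppd - penalty


# Hypothetical log-likelihoods: S = 3 posterior samples, 2 observations.
log_lik = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
score = waic(log_lik)
```

When every posterior sample gives the same log-likelihood, the penalty vanishes and the score equals the plain log-likelihood.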
² When computing the likelihood we analytically marginalise out the assignments to increase robustness.

Figure 3 shows model selection results based on WAIC (the higher the value, the better the performance); the optimal number of components is $K=10$. We compare our approach, Model 1, to two related alternatives. First, we assess the utility of the covariates by comparing against a model that does not include covariates (referred to as Model 2). Second, we compare against a standard model variant that discards spatial and temporal smoothing, assuming independent Dirichlet priors for the crime maps and proportions, referred to as Model 3 (Teh et al., 2005). Based on the figure, we see that the covariates contain useful descriptive power and that smoothing is essential. Our model has the best performance for $4 \le K \le 10$; the difference to Model 2 is statistically significant for this range of $K$ (Wilcoxon one-sided test, $p < 10^{-3}$).
We also compare our multi-view model quantitatively to single-view models for each crime category. Each single-view model is a special case of our model for $M=1$, using data for a single category. For the single-view models, we followed a similar process to choose $K$, although limiting to $K \in \{1,2,\ldots,6\}$. Figure 4 shows that the WAIC score is best for our joint/multi-view model for all categories for $4 \le K \le 16$ (the results are statistically significant for $K \in \{4,6\}$; one-sided Wilcoxon, $p < 10^{-3}$). Performance for the single-view models is optimal for $K \in \{1,2,3\}$ and decreases for $K \ge 4$ for each category. For the joint model we use only the data related to each crime category when computing the score for that category. The dotted horizontal lines indicate the best single-view model and our model for $K=10$. The results verify that joint modelling is able to infer better category-specific models. For criminal damage and arson, and for burglary, the best single-view performance is for $K=1$; the model reduces to a simpler statistical crime mapping approach that is unable to take dynamics or decomposition into account. For these categories, there are not enough data to support more complex models; for example, for burglary the single-view model is unable to separate commercial and residential burglaries into separate components. High dimensionality and sparsity of the data pose significant problems and it is very difficult to capture descriptive models without leveraging other related crime occurrences. For the remaining crime categories, performance for single-view models is best for $K>1$, showing that mixed membership modelling performs better than standard crime mapping. For our model, performance varies smoothly as a function of $K$ and is roughly maximal for $K=10$ for all categories; adding more components does not improve performance for any category.

| Model interpretation
We compute posterior averages for the maps and temporal patterns using the same set of posterior samples used to compute the WAIC scores for our model, for the number of components ($K=10$) with the maximal WAIC score. We are aware of the label switching problem inherent in mixture modelling, but we found this not to be a problem for our application by tracking cross-component similarities between consecutive posterior samples for the maps.
Concentration of probability mass in the posterior maps $\eta_k$, for $k=1,\ldots,K$, reveals areas that have high crime rates (i.e. crime hotspots). The maps may be naturally visualised to uncover the hotspots with high values. The corresponding posterior proportions $\theta^{(m)}_{t,k}$, for $t=1,\ldots,T$ and $m=1,\ldots,M$, indicate temporal patterns, uncovering which of the $K$ crime maps are relevant (or active) for each crime category and revealing when and how much the maps contribute. Some of the shared maps may be active for a subset of two or more categories, while category-specific maps are active only for a single category. Visual inspection of the $\theta^{(m)}_{t,k}$, for $t=1,\ldots,T$, across multiple categories $m=1,\ldots,M$ summarises the temporal patterns and relevance for each category, for the $k$th component.
To summarise the category-component activations, we average the proportions over time stamps, $A_{m,k} = \frac{1}{T}\sum_{t=1}^{T} \theta^{(m)}_{t,k}$, for each category and component. Each row of the matrix $A$ sums to one. Table 2 illustrates the activations, matrix $A$, showing subsets of components that are active for each crime category. Based on the activations, rows of $A$, we see that the first three (of 10) components are clearly active for burglary (1-3), the next three components are active for both criminal damage and violent crime (4-6), the next two for robbery (7-8) and the last two for vehicle crime (9-10). We label the components accordingly and also use bold type in Table 2 to illustrate the labelling. We note that, even though the number of crime categories is five, a relatively large number of components is needed to correctly decompose variation into components that are active for any subset of two or more crime categories and components that are mostly active for only a single category, explaining category-specific (marginal) variation. By inspecting the columns of $A$, we see which of the categories are active for each component.
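The time-averaging that produces the activation matrix can be sketched in a few lines. The nested-list layout `theta[m][t][k]` and the toy proportions are hypothetical; only the averaging rule and the row-sum property come from the text.

```python
def activation_matrix(theta):
    """Average proportions over time: A[m][k] = mean over t of theta[m][t][k]."""
    A = []
    for per_time in theta:  # one list of T proportion vectors per category
        n_times, n_maps = len(per_time), len(per_time[0])
        A.append([sum(row[k] for row in per_time) / n_times for k in range(n_maps)])
    return A


# Hypothetical proportions: M = 2 categories, T = 2 time stamps, K = 3 maps.
theta = [
    [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    [[0.0, 0.5, 0.5], [0.2, 0.2, 0.6]],
]
A = activation_matrix(theta)
```

Because each proportion vector sums to one, each row of `A` also sums to one, matching the property stated above.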
To associate relevant and significant covariates with the crime maps, we compute pairwise correlations in the log-domain between the (posterior average) maps and a large set of (773) covariates. Here, the maps may be directly interpreted as expected crime counts per area. Figures 1 and 2 of the online supplement of this paper show correlations for socio-economic and a subset of place-based variables respectively. We carry out permutation testing to assess statistical significance of the pairwise correlations (all of the correlations are statistically significant, $p < 10^{-3}$).
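A permutation test for a log-domain correlation can be sketched as below. The choice of Pearson correlation, the two-sided test, taking logarithms of both the map and the covariate, and the synthetic data are all assumptions made for the illustration; the text does not specify these details.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm, rng):
    """Two-sided permutation p-value for the correlation between x and y."""
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic data (not from the study): a positive covariate over 30 areas
# and a map that depends on it, normalised to sum to one.
rng = random.Random(1)
covariate = [rng.uniform(1.0, 10.0) for _ in range(30)]
raw = [c ** 2 * math.exp(rng.gauss(0.0, 0.1)) for c in covariate]
map_values = [r / sum(raw) for r in raw]

log_map = [math.log(v) for v in map_values]
log_cov = [math.log(c) for c in covariate]
r = pearson(log_map, log_cov)
p_value = permutation_pvalue(log_map, log_cov, 999, rng)
```

With 999 permutations the smallest attainable p-value is 1/1000, so reporting $p < 10^{-3}$ would require more permutations than used in this sketch.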
In the following, for each component, we show temporal patterns (proportions) over all crime categories, $\theta^{(m)}_{t,k}$, for $m=1,\ldots,M$ and $t=1,\ldots,T$, and the associated spatial maps $\eta_k$, for $k=1,\ldots,K$. To save space we abbreviate the crime categories: robbery (R), burglary (B), violence and sexual offences (VS), criminal damage and arson (C) and vehicle crime (VC). For clarity, we order the proportions across categories in decreasing order of the activations.
Burglary is decomposed into three main components (Figure 5) capturing commercial burglary and residential burglary for houses and flats, respectively, as detailed in the following. Figure 5 (top) illustrates a component that captures commercial (non-residential) burglary. The component is active for robbery and criminal damage but inactive for vehicle and violent crime. The spatial map shows crime concentrated in central commercial and affluent areas and high streets, including Covent Garden, Soho, Oxford Street and the Brunswick shopping centre. The geographical covariates verify that the map is highly associated with commercial activity in central areas with good transport accessibility, that is, proximity to transport hubs such as train stations and main roads. Here, the opportunity for crime in these less residential and well accessible areas is very high, increasing criminal attractiveness. The temporal pattern shows mostly constant activity with small monthly variation and no evident temporal trends. Commercial burglary accounts for 41% of all burglaries; this number is close to the empirical estimate (39%). Accordingly, the empirical distribution of burglary in Figure 2 strongly resembles map 1, meaning that the hotspot of commercial burglary dominates inferences based on the empirical distribution, which fail to discover hotspots related to house or flat burglary. However, in the following, we show that our model is also able to focus on the long tails of the empirical distributions via the decomposition, providing new meaningful insights.

… fluctuations. Robbery is active only for a certain time frame. The covariates show associations with high education, ethnicity, low poverty/unemployment/deprivation, couples without children, single-person households, private renting and number of flats.
These associations suggest that the component is related to so-called 'young professionals' in the age range of 30-44, with high education and occupation in information and communication, professional, scientific and technical activities, and finance, insurance and real estate activities, who live in flats. The place-based covariates show minor associations, indicating that these areas are residential in nature.

Figure 5 (bottom) shows the third component, which captures residential house burglary in semi-affluent, less central (non-commercial) areas. The temporal patterns show a clear trend: the component is more active in winter time when the amount of daylight is low, suggesting that offenders may target low-density residential areas (i.e. houses) when residents are away from home at work during day-time with less daylight. Also, the component is active for vehicle crime and, to a lesser extent, other crime categories, including violence, as opposed to the two other burglary components. The covariates show that the component is associated with the number of houses (equivalently, low number of flats), private transport (poor transport accessibility), (married) couples with dependent children, house ownership (with mortgage or loan), ethnicity, benefit claiming and provision of unpaid care. Population-wise, the component is associated with age levels 65+, 45-64 and 0-15, suggesting that households consist of older couples with children (i.e. families). Occupation covers a wide range of work types including mining, quarrying and construction, manufacturing, wholesale and retail trade, administrative and support service activities, transport and storage, human health and social work. Education levels (2, 1 and apprenticeship) also support these occupations. A high level of private transport suggests, for instance, that (i) work places may be situated further away from home, (ii) children need to be taken to school or day-care and (iii) public transport is poor.
Furthermore, the component is associated with lower household income and house price, indicating opportunities with lower risk. The map captures criminally attractive areas which are accessible, especially at specific times of the year. The POI covariates show evident suburban residential features; positive covariate effects include community centres, playgrounds, schools and care homes, with no effects for eating and drinking facilities, banks or 'high streets', for instance.

Figure 6 captures three components mainly associated with violence and sexual offences, and criminal damage and arson, and to a smaller extent robbery, with distinct temporal and spatial patterns. We note that non-violent crimes such as burglary and vehicle crime are mostly inactive for these components, promoting a distinction between violent and non-violent areas. Figure 6 (top) captures mostly violent criminal activity in a small number of crime hotspots in central areas, for example, in Covent Garden, Camden and North Kensington. Figure 6 (middle) shows a component with a rapidly increasing shift for both violent crime and criminal damage for a larger number of hotspots after around December 2014, showing that crime becomes more concentrated in these hotspots and explaining the declining trends for the other two components (top and bottom). Figure 6 (bottom) shows a component that is mostly active for criminal damage but is also active for robbery for a large number of isolated small hotspots (in deprived, less central areas). Again, co-occurring robbery shows temporal fluctuations with peaks in winter, whereas violence peaks in summer. Maps 4 and 5 share two close-by hotspots (in Covent Garden and Camden), whereas map 5 has a larger number of hotspots in general. For map 6 the hotspots are more distinct.
The covariates show that these components are jointly strongly associated with indicators related to deprivation and poverty, including ethnicity, lone parent (and divorced) households (and a low number of couple households), ill health, unemployment, low or no education, social renting, benefit claiming, crowding, a low level of residence ownership (low number of houses), and a high proportion of students and of young people between 16 and 29. Occupations concentrate in accommodation and food services and in transport and storage. POI-related covariates show retail and commercial activity (similarly to commercial burglary, showing association with local high streets and ambient population) but also show positive effects for more specific facilities, such as bus stops, playgrounds, community and sport centres, schools, police stations, social care activities, libraries and hospitals, with smaller effects for attractions. Comparatively, the covariate effect size is stronger for component 6 (which focuses on criminal damage) than for components 4-5, which show similar effect sizes to each other. The specific facilities provide targets for criminal damage, whereas bus stops promote street-level crime. When combined with a large youth concentration, the components (especially 6) may capture vandalism (gang graffiti, for instance) and intra-youth violence and robberies involving knives and targets with an increasing number of smartphones. According to crime pattern theory, young offenders would operate, for instance, near schools and related (public) transport routes from and to home. Our list of specific POIs associates strongly with the routes and places of youths. The hotspots for components 4-6 are temporally stable; tensions between local criminal gangs, which aim to criminalise suitable young individuals, may explain this property to some extent. We note that the hotspots around Covent Garden for maps 4 and 5 are close to, but do not overlap with, that of commercial burglary (component 1).
On a larger scale, we also note that the socio-economic covariate effects differ significantly between components 4-6 and the burglary components 1-3, further evidencing the difference between the nearby hotspots. Although burglary is only slightly active for components 5-6, given the high concentration (small total area of hotspots), the association becomes relatively significant, explaining residential burglary in deprived areas targeting young, unemployed and single household occupiers. Overall, criminal damage is active for almost all components, thus bearing a strong association with crime in general, and may be interpreted as a measure of physical disorder. Contrasting with the empirical distributions (Figure 2), we see that map 4 resembles the empirical distribution for violence, which fails to uncover the shift in the hotspots covered in map 5. Similarly, the empirical distribution for criminal damage overly emphasises the hotspot in the city centre; our map 6 is able to focus better on the set of smaller hotspots.
Figure 7 shows two components related to robbery in central, highly accessible areas, focusing mostly on single large hotspots and capturing 80% of total robberies. These hotspots are active for violence but not for burglary or vehicle crime. The trends show a fluctuation between the components, indicating shifts between the nearby (but different) hotspots. Component 7 is also active for criminal damage. POI-based covariates for these maps focus on commercial and retail activity, that is, areas where targets with cash or other valuables are plentiful. In particular, alcohol-related premises, nightclubs, theatres and concert halls, attractions, currency conversion, banks, adult venues, gambling, underground entrances and stations, cinemas, jewellery accessories, sports facilities, casinos, arcades, subways, bus stops, chemists, petrol stations, cheque cashing and department stores are associated with these components.
The covariates link strongly to ambient population both during daytime (tourist attractions) and nighttime (adult venues and nightclubs) and also provide plenty of targets for business robbery. For instance, underground stations represent places with a constant flow of people, providing plenty of targets for personal robbery. On the other hand, theatres and concert halls attract large numbers of people but less frequently. Socio-economic covariates show that component 7 is similar to components 4-6, promoting young population, unemployment, ethnicity, crowding, deprivation and poverty. On the other hand, component 8 is distinguished from component 7 by private renting, a lower level of deprivation, better transport accessibility, higher education (and occupation), less poverty, more single-person households and less unemployment; the component shares some aspects with components 2 and 1 for flat and commercial burglary, respectively. The absence of criminal damage for component 8, possibly due to the presence of a 'guardian' against property damage, suggests a difference in the level of physical disorder between the hotspots. The related violent crime may link to common assaults driven by alcohol consumption. Similarly to criminal damage, robbery is overall active for almost all components to varying degrees, suggesting another link to the concept of disorder to explain crime in general. Intricate temporal proportions show that hotspots for robbery are highly dynamic; they shift focus or emerge and then disappear. We note that it is difficult to separate business and personal robberies because business robbery is infrequent and may often occur in similar locations to personal robbery. The empirical distribution (Figure 2) for robbery provides trivial findings, whereas our model is able to provide finer dynamic details by exploring the long tails properly.
Figure 8 shows two components that capture mostly vehicle crime (75% of total) in non-violent hotspots. Both components are also active for criminal damage. In addition, component 10 is also active for burglary, and for robbery during a particular time interval. The components capture a change point around 2015; the first is mainly active after 2015, whereas the second is mainly active before. Figure 8 (top) shows a component that captures a few hotspots in wealthy (commercial and central) areas, such as Knightsbridge and Chelsea. The component exhibits, similarly to house burglary, an evident winter trend after 2015, suggesting that little daylight may attract criminal activity. Figure 8 (bottom) shows a component that is spatially more dispersed (with larger hotspots) in affluent residential and workplace areas, such as Mayfair, Marylebone and Fulham near the river. The place-based covariates for both components 9-10 associate with central commercial and retail centres ('high streets') and resemble those of the commercial burglary and robbery components, 1 and 7-8, respectively. Component 9 associates additionally with car parks. Overall, however, the associations are lower in magnitude, possibly because of the large spatial coverage. Moreover, there are no effects for underground entrances and stations, playgrounds and department stores, and relatively smaller effects for bus stops, logically distinguishing these components from public transport. The socio-economic covariates for both components 9-10 evidence the affluence of these areas: no deprivation, terraced houses, high education, couple households and high employment, house price and household income. They have covariate effects similar to those for commercial and flat burglary, components 1 and 2, respectively, showing vehicle activity due to journey origins and destinations, that is, residences, workplaces and high streets, for example. Despite these similarities in the effects, the spatial maps show distinct hotspots.
Given that (residential) burglary is more active for component 10, we suggest that these hotspots explain more vehicle crime near residences (terraced houses require street parking, increasing accessibility). On the other hand, the proximity of the hotspots of map 9 to commercial centres and car parks (providing plenty of targets) suggests vehicle crime related to commerce or commuting traffic. Finally, we suggest that the low amount of burglary in the hotspots of map 10 may be due to high risk in these very affluent areas; vehicles are more accessible targets in this case. Similarly to component 10, the flat and house burglary components are also active for vehicle crime, possibly because of vehicle crime on streets close to residences. The empirical distribution (Figure 2) for vehicle crime is more dispersed in the absence of a few dominating hotspots. Our model is able to decompose the variation into several maps in a meaningful manner, providing useful insights.

| DISCUSSION
We develop and apply improvements to the ubiquitous crime mapping approach that is widely used both in academic research and in practice, which makes our work relevant for both research and society. Our model builds on the hotspot property of crime, near-repeat pattern theory and regression modelling (environmental criminology). The main novelty of the method and analysis is the decomposition of criminal activity into a weighted linear mixture of components. We anticipate that our findings based on London would be transferable to other major cities, noting that the developed methodology as such is directly applicable to other cities or study areas.
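To make the notion of a weighted linear mixture concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that each component is a probability distribution over areas (a map) and that a given crime category and time stamp has non-negative component weights summing to one; the function name `mix_maps`, the toy maps and the weights are hypothetical.

```python
def mix_maps(weights, maps):
    """Return the mixture sum_k weights[k] * maps[k], evaluated per area.

    weights: component proportions for one (category, time stamp) group.
    maps:    one probability distribution over areas per component.
    """
    if len(weights) != len(maps):
        raise ValueError("need exactly one weight per component map")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to one")
    n_areas = len(maps[0])
    return [
        sum(w * component[i] for w, component in zip(weights, maps))
        for i in range(n_areas)
    ]


# Two hypothetical component maps over four areas (each sums to one).
maps = [
    [0.7, 0.2, 0.1, 0.0],  # component concentrated in the first area
    [0.0, 0.1, 0.3, 0.6],  # component concentrated in the last area
]
# Hypothetical proportions for one crime category at one time stamp.
weights = [0.25, 0.75]
mixed = mix_maps(weights, maps)
```

Because the weights and each map sum to one, the mixed map is again a probability distribution over areas; changing the weights over time shifts mass between the hotspots of the two components.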
For the analysis of more than one crime category, a common approach is to first analyse or model each category separately and then combine the analyses to capture inter-category associations (Bowers, 2014; Haberman, 2017; Hipp, 2007; Weisburd et al., 1992). As opposed to such two-step analysis, we jointly model crime occurrences over several categories at micro-scale, uncovering associations between different types of crime. We show that the joint approach performs better than the dominant practice of analysing each crime category separately. Category-specific models are severely biased by the long tail (hotspot) property, focusing only on the few most relevant hotspots at the cost of neglecting the long tails. In contrast, the joint model combines crime occurrences over different types of crime and is able to focus better on the tails. Weisburd et al. (1992) find associations within each of the following groups: (i) property damage, auto theft, burglary, personal robbery and violent crime, (ii) commercial burglary and business robbery, (iii) personal robbery, auto theft and business robbery and (iv) violent crime, business robbery and residential burglary. Overall, we verify most of these findings, noting that in our application we do not have direct access to more fine-grained crime categories that would directly separate, for example, different types of burglary from each other. Instead, we infer the finer sub-categories based on the data. In other words, the model is able to account for the information loss due to crime categorisation.
Our model uses an extensive set of both socio-economic and geographical spatial covariates to individually inform the crime distribution (map) of each component, taking locality naturally into account. We verify that the covariates contain useful information and that a model incorporating the covariates performs better. We infer shared and specific effects of covariates and components, providing further insights into dependencies between different types of crime. We note that there is no single covariate that explains crime distributions; rather, sets of covariates are needed, which shifts the focus to interactions.
For model inspection, we suggest computing a large number of univariate pairwise correlations between maps and covariates to detect (potentially collinear and interacting) covariates that are positively or negatively associated with each component. Following regression modelling, we could alternatively inspect the actual covariate weights, but this would omit the contribution of the Gaussian field and mean and would, further, limit the number of covariates. We show how a large set of associated covariates provides meaningful insights into the nature of the relationship that would be impossible to uncover based on a small set of covariates. Furthermore, sparsity-promoting approaches may select only one of many collinear covariates, biasing analyses that are based on inspecting the regression weights.
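The inspection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: it assumes the Pearson correlation as the univariate measure, and the function names, the covariate names and all values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_covariates(component_map, covariates):
    """Correlate one component map with every covariate, most positive first."""
    scores = {name: pearson(component_map, values)
              for name, values in covariates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical map values and covariate values over four areas.
component_map = [0.1, 0.2, 0.3, 0.4]
covariates = {
    "bus_stops": [1, 2, 3, 4],
    "house_price": [8, 6, 4, 2],
    "car_parks": [1, 3, 1, 3],
}
ranked = rank_covariates(component_map, covariates)
```

Each covariate is scored independently, so collinear covariates all appear in the ranking with similar scores rather than competing for a single regression weight.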
Our model captures inherent spatio-temporal crime patterns via the decomposition. The model is directly relevant for crime hotspot prediction at micro places (Chainey et al., 2008; Mohler et al., 2011) and for guiding police resource allocation, targeting the questions of what type of crime will occur and where and when it will occur (Perry, 2013). Our crime maps indicate a set of different hotspots, each with its specific dynamics (or temporal trends), associations with any subset of crime categories and covariate effects. Such flexibility may be very useful for accurate forecasting and proactive resource allocation for specific types of crime. Our model is able to go beyond the most evident hotspots and explore the long tails of crime distributions. Future work is needed to evaluate the model for the task of crime hotspot prediction for both short- and long-term time frames and for different study areas. In this work, we focus on analysing and interpreting the findings for criminal activity in London and note that a full-scale prediction evaluation is out of the scope of this work.
In this work, we have adopted Bayesian model selection to choose the number of components in the decomposition. In particular, the adopted WAIC provides an approximation to model evidence and importantly is closely related to model predictive performance. Cross-validation would be a more computationally intense alternative that discards some of the data for inference. However, this may lead to a model selection bias; smaller data may support a simpler model. Furthermore, temporal dependence poses further bias; data for sequential time stamps may not satisfy the iid assumption. Future work is needed to evaluate the model for the task of crime forecasting; here model complexity may be chosen to optimise performance with respect to crime-specific evaluation measures (Adepeju et al., 2016;Chainey et al., 2008).
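For illustration, a widely used form of WAIC can be computed from pointwise log-likelihood values evaluated at posterior draws. The sketch below is an assumption about the variant, not a description of the computation used in this work: it takes the deviance-scale definition, WAIC = -2 (lppd - p_WAIC), with p_WAIC the sum over observations of the posterior sample variance of the log-likelihood. The function name `waic` and the input layout `log_lik[s][i]` are hypothetical.

```python
import math


def waic(log_lik):
    """Deviance-scale WAIC from log_lik[s][i] (draw s, observation i).

    Requires at least two posterior draws. Lower values are preferred.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior mean likelihood, via log-sum-exp for stability.
        peak = max(column)
        lppd += peak + math.log(
            sum(math.exp(v - peak) for v in column) / n_draws
        )
        # Posterior sample variance of the log-likelihood.
        mean = sum(column) / n_draws
        p_waic += sum((v - mean) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)


# Hypothetical log-likelihoods: two posterior draws, two observations.
stable = waic([[-1.0, -2.0], [-1.0, -2.0]])
variable = waic([[-1.0, -2.0], [-1.5, -2.0]])
```

Under this convention, one would fit the model for several candidate numbers of components and retain the one with the lowest WAIC.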
Crime is widely acknowledged to be highly clustered both spatially and temporally. Our results support this finding; the spatial maps are highly clustered and the temporal patterns show frequent and significant fluctuations. Johnson and Bowers (2004) and Bernasco et al. (2015) propose that crime follows so-called repeat or near-repeat victimisation pattern theories; frequently targeted areas attract more crime, especially within short time intervals after recent criminal activity. This theory may partly explain the fluctuations or sudden bursts in the temporal patterns as well as the evident clustering property of the maps. The theory may further lead to power law crime distributions and thus relate to more abstract and widely applicable theories of least effort and preferential attachment or the rich-get-richer property, for instance. Our model provides a data-driven Bayesian alternative to self-exciting process-based models that build on this theory (Liu & Brown, 2003; Møller & Rasmussen, 2005; Mohler, 2013, 2014; Mohler et al., 2011; Taddy, 2010), without the need to specify particular decay functions. Instead, the dynamics of the components of our model are inferred based on the data using a flexible model formulation.
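To illustrate what specifying a decay function involves in the self-exciting alternative, the sketch below evaluates a purely temporal conditional intensity with an exponential kernel. It is a generic textbook form given for contrast, not taken from any of the cited works and not part of our model; the function name and the parameters `mu` (background rate), `alpha` (expected number of triggered events per event) and `beta` (decay rate) are assumptions.

```python
import math


def self_exciting_intensity(t, event_times, mu, alpha, beta):
    """Intensity mu + sum over past events of alpha * beta * exp(-beta * (t - t_i))."""
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t  # only events strictly before t contribute
    )
    return mu + excitation


# Hypothetical event times (e.g. in months) with a short burst at 1.0 and 1.5.
events = [1.0, 1.5, 4.0]
before_burst = self_exciting_intensity(0.5, events, mu=0.2, alpha=0.5, beta=1.0)
after_burst = self_exciting_intensity(2.0, events, mu=0.2, alpha=0.5, beta=1.0)
```

Here the exponential kernel and its rate `beta` fix in advance how quickly the elevated risk after an event fades, which is the modelling choice that the component dynamics inferred from the data avoid.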
areas such as Marylebone, Mayfair and Kensington and Chelsea are missing from our crime maps, suggesting that these areas deter burglars, potentially because of high failure rate and risk despite high reward. Townsley et al. (2016) make similar findings based on offender reports. Different occupations are related to socio-economic variability. Sampson et al. (1997) and Johnson and Summers (2015) find such socio-economic variability to be relevant for explaining crime (although not burglary). Our findings show that this hypothesis is also relevant for burglary. For example, we show a positive effect for occupations in transport and storage, accommodation and food service, health and social work and education activities, as well as for a high proportion of students; however, the effect is negative for finance, insurance and real estate activities. Bernasco (2014), Sampson et al. (1997) and Andresen (2010) find that lone parents and single and young households are positively associated with crime. Our findings agree with these works, further showing that they are also relevant for burglary. Hipp and Yates (2011) show that poverty increases the risk of crime, especially when combined with the presence of nearby alcohol-providing premises (Stucky & Ottensmann, 2009; Wheeler, 2019). Our results partly support these findings, noting that the concentration of pubs, bars and inns correlates with the presence of restaurants, cafes and fast-food and takeaway outlets, which are hence also associated with crime (Groff & Lockwood, 2014).
We note that the mesh-based approach for partitioning the study area has some advantages over grid or LSOA/OA-based alternatives. For the grid-based approach, the number of grid cells grows quickly as the cell area decreases and the number of neighbours is low, wasting computational resources. Furthermore, gridding may suffer from boundary effects. On the other hand, even though LSOA/OA-based areas capture by design socio-economically similar areas of roughly the same population and household sizes, the geometry of these areas does not take into account actual crime data. The mesh-based approach concentrates the mesh points in areas with high crime counts, permitting fine detail where necessary to improve accuracy while explaining low crime areas in a computationally scalable manner. In addition, we can specify complex boundaries (based on natural barriers or holes, for example) for the study area, meaning that all the inferences are strictly based on the study region, as opposed to the grid-based approach, which becomes more accurate only when the grid sizes approach zero, posing significant computational issues. Tompson et al. (2015) study the same data with varying levels of geographical resolution separately for each crime category, raising issues for data analysis due to the data anonymisation process. In this work, our findings based on category-specific models at micro-level agree with these concerns. However, we show that joint modelling of crime data over several categories at micro-level produces meaningful findings and performs better than the category-specific approach.
The dynamics of urban structure present challenges for data quality control. The changes should affect the design of centre points. For example, the recent development of the King's Cross area is largely underrepresented, requiring an update of the database. The dynamics also affect socio-economic and geographical covariates.
We work with point-of-interest data at micro-level based on POI centre points. As shown in this work, the approach is successful when combined with smoothing for the spatial maps, but it may not be appropriate for large-sized POIs. For instance, stadium or department store centre points may be a poor representation. This connects to a more general question of which geographical partition (or proximity) is useful for certain place-based covariates.
While the spatial resolution of the data studied can provide very fine details, the temporal resolution is coarser. In the future, it would be interesting to apply the developed framework to more fine-grained temporal resolution consisting, for example, of weekly or daily crime occurrences. We expect such data would support more complex temporal patterns (and correspondingly different spatial maps) and higher model complexity, that is, a larger number of effective components.