Global reach, regional strength: Spatial patterns of a big science facility

The European Synchrotron Radiation Facility (ESRF), a leading facility in synchrotron science, plays a crucial role in supporting both the local and the international scientific community by providing advanced instrumentation for their research. However, our understanding of the actual reach of the facility and its spatial dynamics remains limited. Thus, a methodology is proposed where author affiliation links are processed, analyzed, and visualized. A case study that focuses on the ESRF is implemented, where the author affiliation links of 17,870 publications over the period 2011–2021 are processed, revealing 76,850 addresses, of which 11,120 are unique locations. The results of the case study bring to light robust patterns of increased internationalization over time, accompanied by regional agglomeration and the emergence of potential research hotspots. The methodology, which addresses some of the current challenges in Spatial Scientometrics, and the results are likely to be of interest to researchers in that field. Managers, funders, and policy‐makers can utilize this method or similar approaches to enrich impact analyses of large‐scale science facilities, which are vital for ensuring their sustained support. The code for the methodology, as well as the interactive visualizations, is freely available on GitHub for further exploration and replication.


| Background and aim
Synchrotron radiation facilities (SRFs) offer scientists advanced instrumentation, knowledge, and resources to conduct high-quality experiments for their research, where beam (or instrument) time and expertise are offered free-of-charge, provided that the subsequent research is open to the public. These instruments, hosted in the SRFs, are highly specialized and heterogeneous scientific instruments that provide an important resource for scientists around the globe. The users of these instruments encompass diverse scientific disciplines and practices. These include both internal scientists employed at the facility and external scientists conducting experiments for their own research purposes (Silva et al., 2019; Söderström, 2023; Söderström et al., 2022).
Publicly funded big science projects are accountable to funding agencies and government bodies, often through measures of impact or success. Traditionally, impact has been assessed based on output measures such as publications or patents. However, considering additional aspects could enrich the evaluation of impact. One aspect can be tied to a primary objective of these facilities, which is to aid the local and the international scientific community by providing crucial instrumentation for their research (Börner et al., 2021; Cramer, 2017; ESRFI, 2018; Hallonsten, 2016, 2020; Kurczynski & Milojevic, 2020). As such, analyzing the geographical reach of such projects, which can be accomplished through the examination of the spatial properties of science using author affiliation links, can improve our understanding of impact (Bornmann & de Moya Angeon, 2019; Bornmann & de Moya-Anegón, 2019; Calafiore et al., 2021; Csomós, 2018, 2020; Frenken, 2020; Frenken & Hoekman, 2014; Mandeville et al., 2022; Tiago et al., 2017). However, quantitatively analyzing this phenomenon, particularly in the context of technology or scientific instrumentation, presents several challenges. These studies often require accurate spatial data, encompassing both geographical precision and correct author affiliation links, especially when these results are used to drive changes in science policy and governance (Donner et al., 2020; Frenken et al., 2009; Gao, 2020; Maddi & Baudoin, 2022; Taşkın & Al, 2014). If done correctly, mapping and science visualizations can unveil regions and institutions actively contributing to a particular field or scientific instrument, serving as a powerful tool to present short- and long-term results (Börner et al., 2021).
The aim of this study is to present a novel methodology to explore the spatial dynamics of the European Synchrotron Radiation Facility. The methodology addresses some of the limitations highlighted in previous studies. It systematically captures, disentangles, geocodes, and processes author affiliation links, which can then be used to calculate and visualize the spatial distribution of author affiliations in four different ways: unweighted occurrences, weighted occurrences, affiliation multiplier, and fractional publications. This provides a highly granular and accurate geographical representation of author affiliations and allows the exploration of agglomeration in geographical space. This can help managers, funders, and policy makers to make evidence-based decisions to, for example, strengthen ties with other organizations and find geographical areas that contain strong ties to the facility, finding emerging areas for new collaborations and/or interurban areas of interest. The methodology is exemplified with a case study that investigates the spatial dynamics of the European Synchrotron Radiation Facility (ESRF), a leading SRF. The code used for this study can be accessed at the project repository on Github 1 , where readers are encouraged to explore the dynamic maps and/or to replicate the workflow with their own data.

| Literature review
The analysis of the spatial aspects of the science system, or spatial scientometrics, includes major topics like the analysis of spatial distribution, spatial biases, geography of citation impact, mobility, and the development of open-source tools and methods (Gao, 2020). Studies have explored geographical regions of excellence with density maps (Bornmann & Waltman, 2011), analyzed geographical properties at the institution level (Bornmann & de Moya Angeon, 2019) and the city level (Bornmann & de Moya-Anegón, 2019; Calafiore et al., 2021; Csomós, 2018), and discussed the appropriate level of aggregation for these types of studies (Csomós, 2020). Other research that explores these dynamics outside of formal science includes the analysis of the spatial distribution of citizen science (Mandeville et al., 2022; Tiago et al., 2017). In related big science research, Börner et al. (2021) map six big science projects, where one aspect the authors visualize is the location of institutional affiliations with the use of the NOMINATIM geocoder. The results show that for ATLAS, one of the detectors of the Large Hadron Collider, the number of collaborations increases over the 30-year period of analysis and engages scientists from around the globe. Frenken (2020) suggests the need for expanding and refining theoretically informed research in bibliometric and scientometric studies by incorporating different notions of proximity. However, our understanding of the geographical characteristics of science remains incomplete. Significant challenges persist in spatial scientometrics, particularly when dealing with larger datasets. Several limitations in the existing literature contribute to this gap in knowledge. First, there is a discrepancy between the research location and the address stated by the author, making it challenging to establish accurate author-address links. Additionally, the presence of multiple affiliations per author further complicates the issue.
Another concern is determining the appropriate spatial unit of analysis and the level of data aggregation (e.g., country, region, or city). Furthermore, measuring physical distance alone, such as in kilometers, may not fully capture the actual travel burden, warranting exploration of alternative measures such as travel times or costs (see Frenken et al., 2009; Frenken & Hoekman, 2014; Gao, 2020, for more detail). Accurate author affiliation links pose another significant challenge. Web of Science has captured author addresses as provided by the journals only since around 2008, including multiple affiliations per author when applicable (Maddi & Baudoin, 2022). Furthermore, there are instances where authors have represented the same address differently (Taşkın & Al, 2014).
The article continues as follows. Section 2 presents the material and methods, with detailed explanations of the main process of the methodology, as well as a discussion of advantages and shortcomings of the methodology. Sections 2.2.2 and 3 present the results and a discussion. Section 4 concludes.

| Case study and data
The chosen case study to test the methodology is the European Synchrotron Radiation Facility (ESRF). The ESRF is one of the biggest, longest running and most prominent synchrotron radiation facilities in the world (Cramer, 2017;Hallonsten, 2016), where scientists use the scientific instrumentation hosted in these facilities. This instrumentation is highly specialized, but also generic in the sense that it can be used by scientists and researchers from a wide variety of disciplines (Söderström et al., 2022).
Unless the scientific instrument has remote capabilities enabled, which depends not only on technological but also on disciplinary limitations, users have to travel physically to the beamline in the facility for the experimentation. Even with remote capabilities enabled, there is some physicality involved, since scientists are required to mail in their samples for analysis (Söderström, 2023). The need to use these instruments for experimentation and research in a specific geographical space puts scientists in a unique position in which part of their ordinary work might be done elsewhere, sometimes hundreds or thousands of kilometers away. This makes the ESRF and other facilities like it an important case study to analyze the geospatial properties of a scientific practice that requires physicalized, specialized, and state-of-the-art instrumentation. The case study also analyzes the data subdivided by instrument.
The original dataset was collected for the European Synchrotron Radiation Facility (ESRF) via their library system (EPN-Campus: Joint ILL-ESRF Library, n.d.) for the period 1994-2020. Around 33,600 publications were extracted from the ESRF library dataset, including journal articles, doctoral theses, and technical reports, together with the beamline (or instrument) name and the digital object identifier (DOI). In an effort to gather structured data for the addresses field, the data is matched with data from the Web of Science, where author names (AU), publication year (PY), and addresses (C1) are collected (Web of Science Core Collection Help, n.d.). After matching, the dataset was reduced to 32,837 observations. The matching with WoS makes use of the extended address data it has captured since around 2008, namely the matching of author names to affiliation addresses. Although this format has been available since 2008, not all address data was captured with it immediately, and it took about 2 years for the format to become the norm in WoS data processing. Due to these constraints, data from 2011 to 2021 is selected for analysis. The number of observations for that period is 17,986 publications, including journal publications (J), conference proceedings (C) and (S), with 97% being journal articles. Of these, 17,919 observations contained affiliation address data, and 17,870 publications remain after the geocoding process because of some identification errors during the process. The geocoding is done with the GoogleV3 API, as it was found to be the most accurate at identifying addresses as-is.
The main dataset contains the following variable names and labels, from the ESRF Library: These variables serve as the basis for constructing the complete database with the following methodology.

| Method
The methodology is divided into four main processes: address disaggregation, geocoding, the construction of spatial statistics, and mapping. Each section is detailed below, with a discussion on some advantages and limitations of the methodology.

| Disaggregation
Since 2008, the Address field as captured in WoS contains author and address data for each publication. When only one address is present, it takes the form "[Authors] Address", which allows disaggregating institutional affiliations. With multiple addresses on the same publication or article, this form is repeated once for each unique address in the field. For example, for two unique addresses, the form is "[Authors 1] Address 1; [Authors 2] Address 2", where unique addresses are separated by a semicolon and their respective authors appear within the brackets preceding the address. Authors with more than one affiliation will have their names repeated in as many addresses as they registered in their publication to the journal, which may have different requirements for documenting affiliation.
A string-searching algorithm that employs regular expressions is implemented as a step to find the author names in brackets and the affiliation addresses outside of brackets in the specified column of the given row in the dataset. It also creates a list called multiplier to store the number of authors for each unique address and counts the number of unique addresses. The multiplier serves to identify concentration of authors by address. The digital object identifier (DOI) is also captured to re-join the data afterwards.
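The disaggregation step described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the project repository: the function name `disaggregate`, the pattern `GROUP`, and the sample string are hypothetical, and the sketch assumes that author names inside the brackets are separated by semicolons.

```python
import re

# One "[authors] address" group: the bracket content, then everything up to
# the next opening bracket (or the end of the string).
GROUP = re.compile(r"\[([^\]]+)\]\s*([^\[]+)")


def disaggregate(address_field):
    """Split a bracketed address field into one record per unique address."""
    records = []
    for authors_raw, address_raw in GROUP.findall(address_field):
        # Assumption: authors inside the brackets are separated by ";".
        authors = [a.strip() for a in authors_raw.split(";") if a.strip()]
        # Drop the semicolon that separates this address from the next group.
        address = address_raw.strip().rstrip(";").strip()
        records.append(
            {
                "authors": authors,
                "address": address,
                "multiplier": len(authors),  # number of authors at this address
            }
        )
    return records


# Hypothetical sample string, not taken from the dataset.
sample = (
    "[Doe, J.; Roe, A.] Univ A, Dept Phys, City A, Country A; "
    "[Roe, A.] Inst B, City B, Country B"
)
rows = disaggregate(sample)
for row in rows:
    print(row["multiplier"], row["address"])
```

Address fields without bracketed author names would produce no records in this sketch, so they would need separate handling.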
The following variables are created from this step:

| Geocoding
With the addresses field disaggregated by publication, geographical coordinates are obtained through a simple data pipeline that queries each unique address by publication with the Google V3 geocoder. The main algorithm iterates over the rows of addresses. In each row, it tries to find the coordinates for each address using the geocoder. 2 Processing the data through the two main modules results in coordinates for every unique address by publication, including information on how many authors are affiliated with each address. Table 1 shows an example of the data obtained from the complete workflow for a random sample of the corpus, with the addresses for two articles. 3 From the original address field, the authors and addresses are disaggregated. The author multiplier reflects the number of authors by address, retaining information on the concentration of author affiliations. Finally, the geolocation API retrieves the coordinates for the addresses. In this example, the API correctly identifies the address of the CNRS in the first example and both addresses in the second example. However, a search on the first address of the second example, the Faculty of Chemistry and Mineralogy, shows the main building some 2 km to the east. 4 The following variables are created in this step: 1. Coordinates: Longitude/latitude point coordinates for each unique address in the Addresses field.
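The geocoding loop can be illustrated with a small sketch that takes the geocoder as a parameter. The names `geocode_addresses`, `FakeLocation`, and `fake_geocode` are hypothetical, and the stand-in geocoder returns made-up coordinates so that the example runs offline. If the geopy package is used, a bound method such as `GoogleV3(api_key=...).geocode` could be passed in its place, since it returns an object with `latitude` and `longitude` attributes or `None`. That the original pipeline is built on geopy is an assumption here.

```python
from collections import namedtuple


def geocode_addresses(addresses, geocode):
    """Return {address: (lat, lon) or None}, querying each unique address once."""
    cache = {}
    for address in addresses:
        if address in cache:
            continue  # already resolved, avoid a second request
        try:
            location = geocode(address)
        except Exception:
            location = None  # treat service errors as "not found"
        cache[address] = (
            (location.latitude, location.longitude) if location else None
        )
    return cache


# Stand-in geocoder for illustration only; the coordinates are made up.
FakeLocation = namedtuple("FakeLocation", "latitude longitude")


def fake_geocode(address):
    known = {"Inst B, City B, Country B": FakeLocation(10.0, 20.0)}
    return known.get(address)


coords = geocode_addresses(
    ["Inst B, City B, Country B", "Unknown Place", "Inst B, City B, Country B"],
    fake_geocode,
)
print(coords)
```

Keeping unresolved addresses as `None` rather than dropping them makes it easier to sample and check failures by hand, in line with the caution expressed in the endnotes.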

| Spatial statistics and mapping
Collecting the coordinates for all affiliations by publication allows the exploration of the spatial dynamics of scientific publications for the given corpus, generating statistics for each location, captured as geographical coordinates. With the locations of author affiliations as sets of coordinates, a distance matrix can be calculated for each publication, with the form $D = (d_{ij})$, $i, j = 1, \dots, n$, where $n$ is the number of addresses in the publication and $d_{ij}$ denotes some measurement of distance between addresses $i$ and $j$. The distance matrix can then be used to analyze, for instance, the internationalization of scientific publications by measuring the average value of the distance matrices by publication. Furthermore, it is also possible to measure the distance of each address in the publication to a fixed point in space, for instance the facility where the instruments are hosted. Both measurements are included in the case study.
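As an illustration of the two distance measures, the sketch below computes a Haversine distance matrix for one publication, the mean distance between its addresses, and the mean distance to a fixed anchor point. It uses only the standard library rather than the scikit-learn function cited in the endnotes. The function names, the Earth radius constant, and the sample coordinates are assumptions, as is the choice to average over distinct pairs of addresses, since the text does not specify how the matrix is averaged.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_matrix(points):
    """Square matrix of pairwise distances between the addresses of a publication."""
    return [[haversine_km(p, q) for q in points] for p in points]


def mean_pairwise_distance(points):
    """Average distance over distinct address pairs (assumed averaging rule)."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(
        haversine_km(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def mean_distance_to(points, anchor):
    """Average distance from each address to a fixed point such as the facility."""
    return sum(haversine_km(p, anchor) for p in points) / len(points)


# Hypothetical coordinates on the equator, chosen so the results are easy to check.
publication = [(0.0, 0.0), (0.0, 90.0), (0.0, 180.0)]
print(distance_matrix(publication))
print(mean_pairwise_distance(publication))
print(mean_distance_to(publication, (0.0, 0.0)))
```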
With the collection of coordinates during the geocoding process, addresses can be plotted into a map to visualize the spatial properties of the data. Table 2 shows the continuation of the process aggregating the data based on coordinates.
The multiplier obtained during the data processing allows further analysis of author concentration, rather than just address concentration (reflected by the occurrences), while the fractional publications reflect the relative weight of the location to the publication. Together, the three measurements give further information on the relative geographical concentration of publications.
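One possible reading of the three measurements can be sketched as follows. Occurrences count how often a coordinate set appears, the multiplier sums the authors affiliated with it, and fractional publications give each publication a total weight of one split equally across its addresses. The equal split is an assumption, since the exact weighting is not spelled out in the text, and the function name, field names, and sample rows (including the DOIs) are hypothetical.

```python
from collections import defaultdict


def aggregate_by_location(rows):
    """Aggregate address rows by coordinates.

    Each row is a dict with 'doi', 'coords' as a (lat, lon) tuple, and
    'multiplier' (number of authors affiliated with that address).
    """
    # Number of address rows per publication, used for the fractional weight.
    addresses_per_publication = defaultdict(int)
    for row in rows:
        addresses_per_publication[row["doi"]] += 1

    stats = defaultdict(
        lambda: {"occurrences": 0, "multiplier": 0, "fractional": 0.0}
    )
    for row in rows:
        entry = stats[row["coords"]]
        entry["occurrences"] += 1
        entry["multiplier"] += row["multiplier"]
        # Assumption: a publication's unit weight is split equally by address.
        entry["fractional"] += 1.0 / addresses_per_publication[row["doi"]]
    return dict(stats)


# Hypothetical rows: two publications, two locations.
sample_rows = [
    {"doi": "10.0000/a", "coords": (1.0, 1.0), "multiplier": 3},
    {"doi": "10.0000/a", "coords": (2.0, 2.0), "multiplier": 1},
    {"doi": "10.0000/b", "coords": (1.0, 1.0), "multiplier": 2},
]
location_stats = aggregate_by_location(sample_rows)
for coords, values in location_stats.items():
    print(coords, values)
```

Under this reading, the fractional values sum to the number of publications, which gives a simple consistency check on the aggregated table.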
The following variables are created in this step: 1. Distance to facility: Haversine 5 distance between all the coordinates of the captured addresses field and the facility, km.

TABLE 1 Disaggregation and geocoding
Address field | Step 1: Disaggregation | Step 2

| Address field to coordinate comparison
Relying on the address field as plain text reveals problems in how addresses are captured. Consider the example in Table 3, which lists the 10 most frequent addresses after disaggregation and how many times each appears in the dataset. Nine of the top 10 addresses are, in fact, the same address for the European Synchrotron Radiation Facility, the facility of the case study. They show up as unique addresses because submitting authors write the address in different ways. The two main differences seem to be whether the text refers to the ESRF address or to its postal address. Only one address refers to a different place, the Paul Scherrer Institute in Switzerland. The total number of addresses identified via text is over 46,000. However, based on the table, it is safe to say that the actual number of unique addresses is far lower. Table 4 shows the top 10 unique coordinates identified via the address field after geocoding, out of a total of 11,120 unique coordinates in the dataset, again with the number of times they appear. Additional columns indicate how many addresses were captured under each set of coordinates, the first address referring to the set of coordinates, and a reverse lookup of the coordinates using the geolocator. The first coordinate set appears 6,753 times in the dataset and captures 1,319 addresses that mostly refer to the ESRF in Grenoble. The reverse lookup of the coordinates locates it exactly at the ESRF address. The second set of coordinates is placed in the center of Grenoble, with some 394 addresses returning the same coordinate set. The example address, referring to the ILL, is incorrect by about 4 km. 6 There are four more sets of coordinates that refer to places within the city of Grenoble. 7 Other coordinates in the top 10 include Villigen, Switzerland, home of the Swiss Light Source; Hamburg, which hosts DESY; Villeurbanne in Lyon, France, home to Lyon University; and a set of addresses referring to the Diamond Light Source in Chilton, UK.
Table 5 shows some descriptive statistics, after the matching with Web of Science and before spatial data processing, giving an overview of the dataset. It includes the number of publications, the number of authors, and the number of addresses.
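The comparison between addresses as text and addresses as coordinates can be illustrated with a short sketch that counts how many distinct spellings collapse onto each coordinate set. The function name, the record layout, and the sample records (including the spellings and coordinates) are hypothetical and only mimic the kind of variation discussed above.

```python
from collections import Counter, defaultdict


def compare_text_and_coordinates(records):
    """Summarise a list of (address_text, coords) pairs.

    Returns the counts per text address and, per coordinate set, a tuple of
    (coords, occurrences, number of distinct text variants).
    """
    text_counts = Counter(text for text, _ in records)
    coord_counts = Counter(coords for _, coords in records)

    variants = defaultdict(set)
    for text, coords in records:
        variants[coords].add(text)

    summary = [
        (coords, count, len(variants[coords]))
        for coords, count in coord_counts.most_common()
    ]
    return text_counts, summary


# Hypothetical records: two spellings resolve to the same made-up coordinates.
sample_records = [
    ("Facility X, Main Road 1, City A", (45.0, 5.0)),
    ("Facility X, BP 100, City A", (45.0, 5.0)),
    ("Facility X, Main Road 1, City A", (45.0, 5.0)),
    ("Institute Y, City B", (47.0, 8.0)),
]
text_counts, summary = compare_text_and_coordinates(sample_records)
print(len(text_counts), "unique text addresses")
print(len(summary), "unique coordinate sets")
print(summary)
```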
The number of publications in the database appears to decline over the period, from 1,830 publications in 2011 to 1,530 in 2020 and 1,119 in 2021. The slowdown could reflect the shutdown period of the ESRF from 2018 to 2020 for upgrades to the facility, the COVID-19 pandemic, or both. Despite this slowdown, the average number of authors per publication and the average number of addresses per publication appear to be on the rise.
The addresses by publication are transformed into coordinates with the geocoding method discussed in section 2.2, which gives further insight into their geospatial properties. Figure 2 shows two measurements of distance resulting from this process: the mean distance between publication addresses and the mean distance between the publication addresses and the ESRF. Both measures of distance show an increase over the period of analysis. The mean distance between addresses in the publication grows from around 1,845 km to around 2,200 km. The average distance of the publication to the ESRF also shows an increase, from around 1,633 km to around 2,045 km over the period. In other words, the graph shows that scientists using the ESRF for their publications are not only increasing in distance relative to the facility, but also relative to each other. Table 6 shows the top 10 locations for the entirety of the sample of 17,870 observations, which relates to 11,120 unique addresses. Since some addresses appear many times, the total number of observed locations is 76,850. Table 6 reports the number of occurrences or counts, the measures for multiplier and fractional publications, as well as the address corresponding to the coordinates. A large majority of the counts, multiplier, and fractional publications are located at the ESRF. The following two sets of coordinates are also located within the city of Grenoble. Other places in the top 10 that are not within the city of Grenoble include the Paul Scherrer Institut in Switzerland; an area northwest of Lyon likely referring to INSA Lyon; DESY, a synchrotron facility in Hamburg; and the synchrotron facility Diamond, south of Oxford.

| Mapping
The following are static representations of the dynamic maps and will be the subject of the results. However, readers are encouraged to also explore the dynamic maps, in greyscale and color, available on the code page, since it is possible to zero in on desired regions, cities, and intra-city areas, as well as explore the time component based on publication date. Figure 3 visualizes the full set of observations, showing the unweighted spatial distribution for the world in (a) a scatterplot map and (b) a density map. Figure 3 shows the reach of the facility, with presence in many areas around the world: most if not all of Europe; a large portion of the northeastern and western United States; an important presence in South America, mostly in Brazil, with a cluster between Uruguay and Argentina; South Africa and the Middle East; and Asia and Australia. However, the large difference between the total number of addresses (76,850) and the number of unique addresses (11,120) implies that most of the addresses are concentrated around a few geographical areas. As most of the results are concentrated within Europe, this will be the focus of the visualizations, although the web-based graphs can be used to explore the full database and period. Figure 4 shows this spatial concentration of the addresses as calculated by the different measures detailed earlier, by occurrences at (a) the beginning (2011) and (b) the end of the period, and similarly for (c, d) the affiliation multiplier and (e, f) fractional publications. All figures retain the unweighted spatial distribution shown in Figure 1a for reference. Figure 4 shows slightly different agglomeration patterns depending on the variable. Overall, it is possible to see three large clusters where most of the addresses are concentrated. These are the city of Grenoble and its immediate area, the Paris area, and to a lesser degree the area around London and Oxford.
However, a further cluster, faint in (a, b) but most evident when mapping the multiplier in maps (c, d), highlights the area around Zurich, Frankfurt, and Brussels more prominently, suggesting a high concentration of authors with affiliations in those areas. This northward clustering of the distribution is subtle but more apparent towards the end of the period. Figure 5 shows the two distance measures for the top 10 beamlines in terms of representation in the data. Collectively, they represent 5,109 observations out of the 16,689 observations that have a beamline name in the data. Figure 5 shows relatively similar results for the two distance measurements, (a) to the facility and (b) between addresses, which are also similar to the values in Figure 2. Beamline BM14 shows a high average distance to the facility throughout, rising from around 4,500 km to around 6,000 km over the period. The beamline is one of the two Dutch-Belgian beamlines BM14 and BM26. Table 7 shows the top 10 addresses by coordinates by occurrences or counts, with the measures for multiplier and fractional publications. Seven out of the top 10 locations are within India, two in Israel, and one in Scotland. The values by author multiplier and fractional publications are somewhat similar, with a couple of variations, showing similar patterns to the rest of the sample. Figure 6 shows the spatial patterns for the BM14 beamline in the same format as Figure 4, by occurrences at (a) the beginning (2011) and (b) the end of the period, for (c, d) the affiliation multiplier, and for (e, f) fractional publications. All figures retain the unweighted spatial distribution for BM14. Figure 6 shows that although there are several locations around Western Europe, the highest concentrations of occurrences are in northern and southern India, with some concentration around northern Europe at (a) the beginning of the period and towards Western Europe at (b) the end of the period.
The (c, d) maps of the multiplier and (e, f) maps of the fractional publications show a similar pattern, although the concentration towards Western Europe is more prominent at (c, e), the beginning of the period. Overall, it suggests a strong presence of affiliations in Europe and India, showing potential collaborators.
FIGURE 1 Top 10 most frequent unique coordinates. Source: Author's own elaboration with data from the ESRF-ILL Joint Library and Web of Science
Although not the focus of the case study, Figure 7 shows that it is also possible to dive deeper into the geospatial patterns of collaborations, with (a) showing the network of collaborators for 2011 and (b) for 2021, achieved by joining every affiliation that has the same DOI within the map data. This type of analysis could highlight the collaborative patterns between the institutions and their regions, and could provide a rich analysis in future studies.
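Joining every affiliation that shares a DOI can be sketched as building weighted edges between coordinate pairs. The function name, the input layout, and the sample rows (including the DOIs) are hypothetical. The choice to count each pair of locations once per publication is an assumption about how the network could be weighted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def collaboration_edges(rows):
    """Count co-occurring location pairs from (doi, coords) rows.

    Each pair of distinct locations that appear on the same publication
    becomes an edge, weighted by the number of publications sharing it.
    """
    locations_by_doi = defaultdict(set)
    for doi, coords in rows:
        locations_by_doi[doi].add(coords)

    edges = Counter()
    for locations in locations_by_doi.values():
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(locations), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical rows: two publications sharing two of their locations.
sample_links = [
    ("10.0000/a", (1.0, 1.0)),
    ("10.0000/a", (2.0, 2.0)),
    ("10.0000/a", (3.0, 3.0)),
    ("10.0000/b", (1.0, 1.0)),
    ("10.0000/b", (2.0, 2.0)),
]
edges = collaboration_edges(sample_links)
for (a, b), weight in edges.items():
    print(a, b, weight)
```

The resulting edge list can then be drawn as lines between coordinates on a map, with the weight controlling line width or opacity.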

| DISCUSSION
The address field, as provided by WoS, is a rich data source that allows for a highly granular and accurate analysis of the geographical aspects of science, including affiliation representation and agglomeration of a given corpus. The science done at SRFs and other facilities like them is not only highly specialized, but also physicalized, as the use of these instruments makes the facility somewhat of a geographical anchor for researchers. This makes the ESRF and other facilities like it an important case study to analyze the geospatial properties of scientific practices that require physicalized, specialized, and state-of-the-art instrumentation. It can aid in questions regarding who and which regions, cities, or organizations benefit most from SRFs and their instruments, which often have high costs of construction and maintenance, creating a better understanding of the spatial scientific landscape. In addition to other measurements of impact, it can provide a richer understanding of the effect these facilities have on scientific communities worldwide.
The results show a facility with a global reach, but with high affiliation agglomeration in Western Europe, seemingly in line with some of its goals as a facility (Cramer, 2017; ESRFI, 2018; Hallonsten, 2016) but showing opportunities for exploring emerging regions for future collaboration. Individual beamlines or instruments like BM14 are likely to have their own geographical profiles, which may be worth exploring further, as previous research from the perspective of instruments in the ESRF has found significant differences between beamlines or instruments in terms of their multidisciplinary level (Söderström et al., 2022) and the collaboration networks they create (Söderström, 2023). The results in this case study suggest that the reach of the individual instruments is global, likely due to the ESRF being a global leader in synchrotron radiation. However, they also show that it is necessary to look further into the individual instruments to find the different geographical patterns. The results are in line with findings in related studies. Similarly to ATLAS (Börner et al., 2021), the ESRF engages scientists around the globe, with its geographical reach also expanding to new regions, suggesting similarities with other large-scale facilities.
FIGURE 2 Distance measures. Source: Author's own elaboration with data from the ESRF-ILL Joint Library and Web of Science
The study provides a methodology for extracting author affiliation links, a crucial step for spatial scientometrics, and one where common issues arise, such as the systematic extraction of author affiliations, different spellings for the same address, the decision on the level of aggregation, as well as visualizing spatial dynamics in static maps (Bornmann & de Moya Angeon, 2019; Bornmann & de Moya-Anegón, 2019; Bornmann & Waltman, 2011). The methodology attempts to address these issues, which become even more relevant as they sometimes inform policy decisions (Donner et al., 2020). By employing these methods and tools, managers, funders, and policy makers can more effectively identify and explore existing and emerging organizational connections and/or regional areas of interest. Furthermore, the results obtained through this methodology are not artificially placed at the center of a city, region, or country. The results are then easier to interpret and can be employed to identify interurban areas, revealing the spaces between regions or cities. They could, however, also be aggregated to coarser levels, such as city, region, or country, if desired.
There are some limitations to keep in mind. Geocoding services are subject to error and the results should be manually checked before continuing with any analysis, and the accuracy of the geolocation process differs between services. In preparation for the case study, three

| CONCLUSION AND FURTHER RESEARCH
The present study proposes a novel methodology to make use of the rich address data available in Web of Science, where author affiliation is detailed by publication. With this data, the study presented a methodology to extract highly granular affiliations. A case study was implemented, not only to showcase the methodology, but also to reveal regions or areas that actively contribute to synchrotron science with statistics and visualizations, overcoming some limitations of the field (Frenken et al., 2009; Gao, 2020). The results reinforce the status of the ESRF as a global facility. However, they highlight a concentration of affiliations in several prominent hubs primarily located in Western Europe, and areas of opportunity in emerging regions. The dynamic maps can be used to further explore the different regions and find institutions for further outreach or collaboration. An opportunity for further research with this level of granularity is the analysis of spatial statistics, such as spatial autocorrelation analysis, which calculates the likelihood of a dependent variable being randomly distributed across a defined space, e.g., Moran's I. This approach can also be extended to examine fractional citations or other data that can be allocated across geographical space. Furthermore, a deeper exploration of the differences between scientific instruments within the ESRF could provide interesting results, as some instruments possess unique technological characteristics with the potential to influence their spatial patterns of usage. Exploring these could offer insights into the technological factors driving scientific activity within the facility, and the effects of adding functionality to scientific instruments, such as remote access. Finally, exploring the collaborative patterns on top of their geographical dynamics (Figure 7) could provide a deeper analysis of the intersection of collaboration and spatial dynamics.
The methodology is made freely available on Github, written in the popular programming language Python. It provides researchers with a ready-made tool to explore the spatial dynamics of any corpus that leverages the Web of Science address format implemented in 2008 (detailed above). It aims to aid in the advancement of spatial scientometrics by providing open-source tools for its analysis and discovery. Being open source, users can inspect the code for improvements and make suggestions accordingly.

ACKNOWLEDGMENT
The research behind this article was funded by the Swedish Research Council (Grant No. 2018-01091).

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

ENDNOTES
[Correction added on July 4, 2023, after first online publication: A link has been included as note 1 and it is reflected after the text "The code used for this study can be accessed at the project repository on Github"]
1 https://soderstromkr.github.io/geoaddress/
2 The coordinates were, and should always be, sampled to check their accuracy, as the quality of the data often depends on the input text and geocoder.
3 Sample of two articles (Neudert et al., 2017; Paredes-Nunez et al., 2018).
4 Upon various observations and testing, the errors seem to come from ambiguous results from the API. A useful analogy is the use of Google Maps. When one uses an address that is not easily identifiable by the web service, it will return a list of possible addresses. The result here is simply the top (or most likely) result.
5 "The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere. The first coordinate of each point is assumed to be the latitude, the second is the longitude, given in radians" (Sklearn.Metrics.Pairwise.Haversine_distances, n.d.).
6 A Boolean if statement test on the list of addresses found that all addresses had the city name in the address field. In fact, repeating the same test for the top 10 coordinates finds similar results: all addresses are correctly identified within city limits.
7 A random sample of 10 addresses found them within a 5 km radius of the coordinates.
8 Nominatim is an open-source geocoding service that uses OpenStreetMap data and is free to use (Nominatim, n.d.). ArcGIS (n.d.) and GoogleV3 (n.d.) are paid versions, with different terms of use.
FIGURE 7 BM14 Collaborative patterns. Source: Author's own elaboration with data from the ESRF-ILL Joint Library and Web of Science